54462 Commits

Author SHA1 Message Date
Alessio Balsini
a9490d32bc
FROMLIST: fuse: Introduce passthrough for mmap
Enabling FUSE passthrough for mmap-ed operations not only affects
performance, but has also been shown to be mandatory for the correct
functioning of FUSE passthrough.
yanwu noticed [1] that a FUSE file with passthrough enabled may suffer
data inconsistencies if the same file is also accessed with mmap. What
happens is that read/write operations are directly applied to the lower
file system (and its cache), while mmap-ed operations are affecting the
FUSE cache.

Extend the FUSE passthrough implementation to also handle memory-mapped
FUSE files, to both fix the cache inconsistencies and extend the
passthrough performance benefits to mmap-ed operations.

[1] https://lore.kernel.org/lkml/20210119110654.11817-1-wu-yan@tcl.com/

Bug: 179164095
Link: https://lore.kernel.org/lkml/20210125153057.3623715-9-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: Ifad4698b0380f6e004c487940ac6907b9a9f2964
Signed-off-by: Alessio Balsini <balsini@google.com>
(cherry picked from commit bf5cb932f0e0dd028dcebf3a6c2fcfedb4fd8265)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:56 +05:30
Alessio Balsini
658dc211a1
FROMLIST: fuse: Use daemon creds in passthrough mode
When using FUSE passthrough, read/write operations are directly
forwarded to the lower file system file through VFS, but there is no
guarantee that the process that is triggering the request has the right
permissions to access the lower file system. This would cause the
read/write access to fail.

In passthrough file systems, where the FUSE daemon is responsible for
the enforcement of the lower file system access policies, it often
happens that the process dealing with the FUSE file system doesn't have
access to the lower file system.
Since the FUSE daemon is in charge of implementing the FUSE file
operations, which for read/write operations usually amount to simply
copying memory buffers from/to the lower file system, these operations
are executed with the FUSE daemon privileges.

This patch adds a reference to the FUSE daemon credentials, referenced
at FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used
to temporarily raise the user credentials when accessing lower file
system files in passthrough.
The process accessing the FUSE file with passthrough enabled temporarily
receives the privileges of the FUSE daemon while performing read/write
operations. Similar behavior is implemented in overlayfs.
These privileges will be reverted as soon as the IO operation completes.
This feature does not provide any higher security privileges to those
processes accessing the FUSE file system with passthrough enabled. This
is because the FUSE daemon is still responsible for deciding whether to
enable the passthrough feature at file open time, and should enable it
only after appropriate access policy checks.
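
The override-then-revert pattern described above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: the kernel provides override_creds() and revert_creds() for this purpose, but every type and function below (demo_cred, demo_task, demo_passthrough_io, and so on) is hypothetical and not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for a credential set; not the kernel's struct cred. */
struct demo_cred {
    unsigned int uid;
};

/* Hypothetical per-task state holding the currently effective credentials. */
struct demo_task {
    const struct demo_cred *cred;
};

/* Install new credentials and return the previous ones for later restore. */
const struct demo_cred *demo_override_creds(struct demo_task *t,
                                            const struct demo_cred *new_cred)
{
    const struct demo_cred *old = t->cred;

    t->cred = new_cred;
    return old;
}

/* Restore the credentials previously returned by demo_override_creds(). */
void demo_revert_creds(struct demo_task *t, const struct demo_cred *old)
{
    t->cred = old;
}

/* Example I/O callback: reports the uid the operation runs with. */
long demo_io_report_uid(const struct demo_task *t)
{
    return (long)t->cred->uid;
}

/*
 * Run one I/O callback with the daemon credentials that were saved when
 * passthrough was set up, then restore the caller's own credentials
 * regardless of the I/O result.
 */
long demo_passthrough_io(struct demo_task *t,
                         const struct demo_cred *daemon_cred,
                         long (*io)(const struct demo_task *))
{
    const struct demo_cred *old = demo_override_creds(t, daemon_cred);
    long ret = io(t);

    demo_revert_creds(t, old);
    return ret;
}
```

The point of the model is that the elevated credentials are visible only for the duration of the callback; the caller's credentials are back in place once demo_passthrough_io() returns.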

Bug: 179164095
Link: https://lore.kernel.org/lkml/20210125153057.3623715-8-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: Idb4f03a2ce7c536691e5eaf8fadadfcf002e1677
Signed-off-by: Alessio Balsini <balsini@google.com>
(cherry picked from commit 5f3d78268b21d381310574af1c16882c7680ceb1)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:56 +05:30
Alessio Balsini
5a0f00bb01
FROMLIST: fuse: Handle asynchronous read and write in passthrough
Extend the passthrough feature by handling asynchronous IO both for read
and write operations.

When an AIO request is received, if the request targets a FUSE file with
the passthrough functionality enabled, a new identical AIO request is
created. The new request targets the lower file system file and gets
assigned a special FUSE passthrough AIO completion callback.
When the lower file system AIO request is completed, the FUSE
passthrough AIO completion callback is executed and propagates the
completion signal to the FUSE AIO request by triggering its completion
callback as well.
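
The completion chaining described above could be sketched in plain C like this. It is a simplified model, not the patch's code: the kernel works with struct kiocb and its ki_complete callback, while demo_req and the demo_* functions here are hypothetical names chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical request type standing in for an AIO request. */
struct demo_req {
    void (*complete)(struct demo_req *req, long res);
    long result;             /* filled in when the request completes */
    int done;                /* non-zero once completed */
    struct demo_req *upper;  /* set only on the lower (cloned) request */
};

/* Completion callback of the original FUSE-side request. */
void demo_fuse_complete(struct demo_req *req, long res)
{
    req->result = res;
    req->done = 1;
}

/*
 * Completion callback assigned to the lower file system request: record
 * the result, then propagate it by triggering the FUSE request's callback.
 */
void demo_passthrough_complete(struct demo_req *lower, long res)
{
    lower->result = res;
    lower->done = 1;
    lower->upper->complete(lower->upper, res);
}

/* Prepare the lower request that shadows a given FUSE request. */
void demo_clone_for_lower(struct demo_req *lower, struct demo_req *fuse_req)
{
    lower->complete = demo_passthrough_complete;
    lower->result = 0;
    lower->done = 0;
    lower->upper = fuse_req;
}
```

When the lower request finishes, calling lower.complete(&lower, res) marks both requests as done and leaves the same result in each.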

Bug: 179164095
Link: https://lore.kernel.org/lkml/20210125153057.3623715-7-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: I47671ef36211102da6dd3ee8b2f226d1e6cd9d5c
Signed-off-by: Alessio Balsini <balsini@google.com>
(cherry picked from commit ea2b7a36847b14dee60d1f5dbf2aa26cf101c426)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:56 +05:30
Alessio Balsini
0fe77264fe
FROMLIST: fuse: Introduce synchronous read and write for passthrough
All the read and write operations performed on fuse_files which have the
passthrough feature enabled are forwarded to the associated lower file
system file via VFS.

Sending the request directly to the lower file system avoids the
userspace round-trip that, because of possible context switches and
additional operations, might reduce the overall performance, especially
in those cases where caching doesn't help, for example in reads at
random offsets.

Verifying if a fuse_file has a lower file system file associated with it
can be done by checking the validity of its passthrough_filp pointer.
This pointer is not NULL only if passthrough has been successfully
enabled via the appropriate ioctl().
When a read/write operation is requested for a FUSE file with
passthrough enabled, a new equivalent VFS request is generated, which
instead targets the lower file system file.
The VFS layer performs additional checks that allow for safer operations
but may cause the operation to fail if the process accessing the FUSE
file system does not have access to the lower file system.
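
The dispatch rule described above can be illustrated with a small sketch. Only the passthrough_filp field name comes from the text; the surrounding types and the demo_select_path() helper are hypothetical simplifications.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct file. */
struct demo_file {
    int id;
};

/* Simplified fuse_file: only the field relevant to passthrough is modelled. */
struct demo_fuse_file {
    struct demo_file *passthrough_filp;  /* NULL unless passthrough is enabled */
};

enum demo_path {
    DEMO_PATH_DAEMON,  /* regular FUSE request, served by the daemon */
    DEMO_PATH_LOWER    /* forwarded to the lower file system file */
};

/* Pick the I/O path by checking whether a lower file is associated. */
enum demo_path demo_select_path(const struct demo_fuse_file *ff)
{
    return ff->passthrough_filp != NULL ? DEMO_PATH_LOWER : DEMO_PATH_DAEMON;
}
```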

This change only implements synchronous requests in passthrough,
returning an error in the case of asynchronous operations, yet covering
the majority of the use cases.

Bug: 179164095
Link: https://lore.kernel.org/lkml/20210125153057.3623715-6-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: Ifbe6a247fe7338f87d078fde923f0252eeaeb668
Signed-off-by: Alessio Balsini <balsini@google.com>
(cherry picked from commit ea9685a7f9cb16b30e25386386274fdd30627c3a)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:56 +05:30
Alessio Balsini
58aebf8c21
FROMLIST: fuse: Passthrough initialization and release
Implement the FUSE passthrough ioctl that associates the lower
(passthrough) file system file with the fuse_file.

The file descriptor passed to the ioctl by the FUSE daemon is used to
access the corresponding file pointer, which is copied to the fuse_file
data structure to establish the link between the FUSE and lower file
system.

To enable the passthrough mode, user space triggers the
FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl and, if the call succeeds, receives
back an identifier that will be used at open/create response time in the
fuse_open_out field to associate the FUSE file to the lower file system
file.
The value returned by the ioctl to user space can be:
- > 0: success, the identifier can be used as part of an open/create
reply.
- <= 0: an error occurred.
The value 0 represents an error to preserve backward compatibility: the
fuse_open_out field that is used to pass the passthrough_fh back to the
kernel uses the same bits that were previously used as struct padding,
and is commonly zero-initialized (e.g., in the libfuse implementation).
Removing 0 from the correct values fixes the ambiguity between the case
in which 0 corresponds to a real passthrough_fh, a missing
implementation of FUSE passthrough or a request for a normal FUSE file,
simplifying the user space implementation.
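
On the daemon side, the return value convention above might be handled along these lines. This is a sketch with hypothetical helper names; it only encodes the rule that values greater than 0 are usable identifiers and that 0 in the open reply means a normal FUSE file.

```c
/* Returns 1 when ret is a usable passthrough identifier, 0 otherwise. */
int demo_passthrough_id_usable(int ret)
{
    return ret > 0;
}

/*
 * Value to place in the passthrough_fh field of the open/create reply:
 * the identifier when the ioctl succeeded, otherwise 0, which the kernel
 * treats as a request for a normal (non-passthrough) FUSE file.
 */
unsigned int demo_open_reply_passthrough_fh(int ret)
{
    return demo_passthrough_id_usable(ret) ? (unsigned int)ret : 0u;
}
```

Because 0 is never a valid identifier, a daemon that zero-initializes its reply and never calls the ioctl gets the pre-passthrough behavior without any extra code.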

For the passthrough mode to be successfully activated, the lower file
system file must implement both read_iter and write_iter file
operations. This extra check prevents special pseudo files from being
targeted by this feature.
Passthrough comes with another limitation: no further file system
stacking is allowed for those FUSE file systems using passthrough.

Bug: 179164095
Link: https://lore.kernel.org/lkml/20210125153057.3623715-5-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: I4d8290012302fb4547bce9bb261a03cc4f66b5aa
Signed-off-by: Alessio Balsini <balsini@google.com>
(cherry picked from commit 28e86146c501a0f943fe9dc0ec0252df066a2b3d)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:55 +05:30
Alessio Balsini
1f04880bf8
FROMLIST: fuse: Definitions and ioctl for passthrough
Expose the FUSE_PASSTHROUGH interface to user space and declare all the
basic data structures and functions as the skeleton on top of which the
FUSE passthrough functionality will be built.

As part of this, introduce the new FUSE passthrough ioctl, which allows
the FUSE daemon to specify a direct connection between a FUSE file and a
lower file system file. Such ioctl requires user space to pass the file
descriptor of one of its opened files through the fuse_passthrough_out
data structure introduced in this patch. This structure includes extra
fields for possible future extensions.
Also, add the passthrough functions for the set-up and tear-down of the
data structures and locks that will be used both when fuse_conns and
fuse_files are created/deleted.

Bug: 179164095
Link: https://lore.kernel.org/lkml/20210125153057.3623715-4-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: I732532581348adadda5b5048a9346c2b0868d539
Signed-off-by: Alessio Balsini <balsini@google.com>
(cherry picked from commit d02368d67989781a3484cd8dd71e0079d0d1bda2)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:55 +05:30
Alessio Balsini
248bfcfd3d
FROMLIST: fuse: 32-bit user space ioctl compat for fuse device
With a 64-bit kernel build the FUSE device cannot handle ioctl requests
coming from 32-bit user space.
This is due to the ioctl command translation that generates different
command identifiers that thus cannot be used for direct comparisons
without proper manipulation.

Explicitly extract type and number from the ioctl command to enable
32-bit user space compatibility on 64-bit kernel builds.
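
To illustrate why comparing type and number is more robust than comparing whole command values, here is a small sketch using the standard _IOW, _IOC_TYPE and _IOC_NR macros from <linux/ioctl.h>. The DEMO_* commands and argument structures are hypothetical: they model one way in which two ABIs could encode a different argument size into the same logical command.

```c
#include <linux/ioctl.h>
#include <stdint.h>

/* Hypothetical magic number and argument layouts, for illustration only. */
#define DEMO_IOC_MAGIC 0xE5

struct demo_arg_small {
    uint32_t fd;
};

struct demo_arg_large {
    uint64_t fd;
};

/* Same type and number, but the encoded argument size differs. */
#define DEMO_CMD_SMALL _IOW(DEMO_IOC_MAGIC, 1, struct demo_arg_small)
#define DEMO_CMD_LARGE _IOW(DEMO_IOC_MAGIC, 1, struct demo_arg_large)
#define DEMO_CMD_OTHER _IOW(DEMO_IOC_MAGIC, 2, struct demo_arg_small)

/* Compare two commands by type and number only, ignoring size bits. */
int demo_same_command(unsigned int a, unsigned int b)
{
    return _IOC_TYPE(a) == _IOC_TYPE(b) && _IOC_NR(a) == _IOC_NR(b);
}
```

DEMO_CMD_SMALL and DEMO_CMD_LARGE are different integers, so a direct comparison would treat them as unrelated, while demo_same_command() recognizes them as the same operation.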

Bug: 179164095
Link: https://lore.kernel.org/lkml/20210125153057.3623715-3-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: I595517c54d551be70e83c7fcb4b62397a3615004
Signed-off-by: Alessio Balsini <balsini@google.com>
(cherry picked from commit af4048924e191bda0bb85b4bf127f22cf3c70fba)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:55 +05:30
Miklos Szeredi
f40391d56f
fuse: dir: Honor AT_STATX_DONT_SYNC
The description of this flag says "Don't sync attributes with the server".
In other words: always use the attributes cached in the kernel and don't
send network or local messages to refresh the attributes.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit ff1b89f389a8e64d0a583ce0b0308696f4ab5860)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:54 +05:30
Seth Forshee
cb2cc6e61b
fuse: Restrict allow_other to the superblock's namespace or a descendant
Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed for a mount
done with user namespace root permissions. In such cases allow_other should
not allow users outside the userns to access the mount as doing so would
give the unprivileged user the ability to manipulate processes it would
otherwise be unable to manipulate. Restrict allow_other to apply to users
in the same userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a module.
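
The "same userns or a descendant" rule amounts to walking up the parent chain of the caller's namespace. A minimal model, assuming a hypothetical demo_userns type with just a parent pointer (the kernel's struct user_namespace carries much more state):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical namespace node: only the parent link is modelled. */
struct demo_userns {
    const struct demo_userns *parent;
};

/*
 * Return true if ns is target itself or one of its descendants, by
 * walking from ns towards the root and looking for target on the way.
 */
bool demo_in_userns(const struct demo_userns *ns,
                    const struct demo_userns *target)
{
    for (; ns != NULL; ns = ns->parent) {
        if (ns == target)
            return true;
    }
    return false;
}
```

With this check, a process in the mount's namespace or in a namespace nested below it is allowed, while a process in the parent or in a sibling namespace is not.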

Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 73f03c2b4b527346778c711c2734dbff3442b139)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:54 +05:30
Eric W. Biederman
96a80d0ebb
fuse: Support fuse filesystems outside of init_user_ns
In order to support mounts from namespaces other than init_user_ns, fuse
must translate uids and gids to/from the userns of the process servicing
requests on /dev/fuse. This patch does that, with a couple of restrictions
on the namespace:

 - The userns for the fuse connection is fixed to the namespace
   from which /dev/fuse is opened.

 - The namespace must be the same as s_user_ns.

These restrictions simplify the implementation by avoiding the need to pass
around userns references and by allowing fuse to rely on the checks in
setattr_prepare for ownership changes.  Either restriction could be relaxed
in the future if needed.

For cuse the userns used is the opener of /dev/cuse.  Semantically the cuse
support does not appear safe for unprivileged users.  Practically the
permissions on /dev/cuse only make it accessible to the global root user.
If something slips through the cracks in a user namespace the only users
who will be able to use the cuse device are those users mapped into the
user namespace.

Translation in the posix acl is updated to use the user namespace of the
filesystem.  Avoiding cases which might bypass this translation is handled
in a following change.

This change is strongly based on a similar change from Seth Forshee and
Dongsu Park.

Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 8cb08329b0809453722bc12aa912be34355bcb66)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:54 +05:30
Eric W. Biederman
ae7c1521b5
fuse: Fail all requests with invalid uids or gids
Upon a cursory examination the uid and gid of a fuse request are
necessary for correct operation.  Failing a fuse request where those
values are not reliable seems a straightforward and reliable means of
ensuring that fuse requests with bad data are not sent or processed.

In most cases the vfs will avoid actions it suspects will cause
an inode write back of an inode with an invalid uid or gid.  But that does
not map precisely to what fuse is doing, so test for this and solve
this at the fuse level as well.

Performing this work in fuse_req_init_context is cheap as the code is
already performing the translation here and only needs to check the
result of the translation to see if things are not representable in
a form the fuse server can handle.
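
The check described above can be modelled as "translate, then reject anything that came back as the invalid id". The sketch below assumes a hypothetical single-range id map and uses -EOVERFLOW as the failure code; neither the map structure nor the helper names are taken from the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Value produced when an id has no mapping, analogous to (uid_t)-1. */
#define DEMO_INVALID_ID ((uint32_t)-1)

/*
 * Hypothetical single-range id map: kernel ids in
 * [kernel_first, kernel_first + count) map to server ids starting at
 * server_first.
 */
struct demo_idmap {
    uint32_t kernel_first;
    uint32_t server_first;
    uint32_t count;
};

/* Translate one id; return DEMO_INVALID_ID when it is not representable. */
uint32_t demo_map_id(const struct demo_idmap *m, uint32_t id)
{
    if (id >= m->kernel_first && id - m->kernel_first < m->count)
        return m->server_first + (id - m->kernel_first);
    return DEMO_INVALID_ID;
}

struct demo_req_ctx {
    uint32_t uid;
    uint32_t gid;
};

/*
 * Fill in the request context and fail the request if either id could
 * not be translated into a form the server can handle.
 */
int demo_req_init_context(struct demo_req_ctx *ctx,
                          const struct demo_idmap *m,
                          uint32_t uid, uint32_t gid)
{
    ctx->uid = demo_map_id(m, uid);
    ctx->gid = demo_map_id(m, gid);
    if (ctx->uid == DEMO_INVALID_ID || ctx->gid == DEMO_INVALID_ID)
        return -EOVERFLOW;
    return 0;
}
```

The extra cost is a pair of comparisons on values that the translation step already produced, which matches the "cheap" argument made in the message.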

[SzM] Don't zero the context for the nofail case, just keep using the
munging version (makes sense for debugging and doesn't hurt).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit c9582eb0ff7d2b560be60eafab29183882cdc82b)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:54 +05:30
Eric W. Biederman
f28dd40557
fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
At the point of fuse_dev_do_read the user space process that initiated the
action on the fuse filesystem may no longer exist.  The process may have
been killed or may have fired an asynchronous request and exited.

If the initial process has exited, the code "pid_vnr(find_pid_ns(in->h.pid,
fc->pid_ns))" will either return a pid of 0, or in the unlikely event that
the pid has been reallocated it can return practically any pid.  Any pid is
possible as the pid allocator allocates pid numbers in different pid
namespaces independently.

The only way to make translation in fuse_dev_do_read reliable is to call
get_pid in fuse_req_init_context, and pid_vnr followed by put_pid in
fuse_dev_do_read.  That reference counting in other contexts has been shown
to bounce cache lines between processors and in general be slow.  So that
is not desirable.

The only known user of running the fuse server in a different pid namespace
from the filesystem does not care what the pids are in the fuse messages so
removing this code should not matter.

Getting the translation to a server running outside of the pid namespace of
a container can still be achieved by playing setns games at mount time.  It
is also possible to add an option to pass a pid namespace into the fuse
filesystem at mount time.

Fixes: 5d6d3a301c4e ("fuse: allow server to run in different pid_ns")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit dbf107b2a7f36fa635b40e0b554514f599c75b33)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:54 +05:30
Szymon Lukasz
c1236a421d
fuse: Return -ECONNABORTED on /dev/fuse read after abort
Currently the userspace has no way of knowing whether the fuse
connection ended because of umount or abort via sysfs. It makes it hard
for filesystems to free the mountpoint after abort without worrying
about removing some new mount.

The patch fixes it by returning different errors when userspace reads
from /dev/fuse (-ENODEV for umount and -ECONNABORTED for abort).

Add a new capability flag FUSE_ABORT_ERROR. If set and the connection is
gone because of sysfs abort, reading from the device will return
-ECONNABORTED.
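
The resulting error selection can be summarized with a short sketch. The demo_conn structure and its field names are hypothetical; only the two error codes and the role of the FUSE_ABORT_ERROR capability come from the text above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical connection state; field names are illustrative only. */
struct demo_conn {
    bool connected;          /* false once the connection is gone */
    bool aborted_via_sysfs;  /* true if it ended through a sysfs abort */
    bool abort_err;          /* true if FUSE_ABORT_ERROR was negotiated */
};

/*
 * Status for a read on the device: 0 while connected, -ECONNABORTED for
 * a sysfs abort when the capability is set, and -ENODEV otherwise.
 */
int demo_dev_read_status(const struct demo_conn *fc)
{
    if (fc->connected)
        return 0;
    if (fc->abort_err && fc->aborted_via_sysfs)
        return -ECONNABORTED;
    return -ENODEV;
}
```

A daemon that did not set the capability keeps seeing -ENODEV in both cases, so existing user space is unaffected.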

Signed-off-by: Szymon Lukasz <noh4hss@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 3b7008b226f3de811d4ac34238e9cf670f7c9fe7)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:53 +05:30
Adam Manzanares
ee0c19576b
fs: Add aio iopriority support
This is the per-I/O equivalent of the ioprio_set system call.

When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.
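
From user space, requesting a per-I/O priority might look like the sketch below, which fills in a struct iocb from <linux/aio_abi.h> without submitting it. The helper name and the DEMO_* constants are assumptions for illustration; in particular the class shift of 13 and the best-effort class value of 2 mirror the kernel's ioprio encoding and are written out here as local definitions.

```c
#include <linux/aio_abi.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed to mirror the kernel's ioprio encoding (class in the top bits). */
#define DEMO_IOPRIO_CLASS_SHIFT 13
#define DEMO_IOPRIO_CLASS_BE 2

/*
 * Prepare a positional read request that carries its own I/O priority.
 * The request is only filled in here; submitting it is left to the caller.
 */
void demo_prep_pread_with_prio(struct iocb *cb, int fd, void *buf,
                               size_t len, long long offset,
                               int prio_class, int prio_level)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_lio_opcode = IOCB_CMD_PREAD;
    cb->aio_fildes = (uint32_t)fd;
    cb->aio_buf = (uint64_t)(uintptr_t)buf;
    cb->aio_nbytes = len;
    cb->aio_offset = offset;
    /* Tell the kernel that aio_reqprio holds a valid priority value. */
    cb->aio_flags |= IOCB_FLAG_IOPRIO;
    cb->aio_reqprio =
        (int16_t)((prio_class << DEMO_IOPRIO_CLASS_SHIFT) | prio_level);
}
```

Without IOCB_FLAG_IOPRIO in aio_flags the aio_reqprio value is not used for this purpose, so both fields have to be set together.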

This patch depends on block: add ioprio_check_cap function.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit d9a08a9e616beeccdbd0e7262b7225ffdfa49e92)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:53 +05:30
Adam Manzanares
e9bab1a88e
fs: Convert kiocb rw_hint from enum to u16
In order to avoid kiocb bloat for per command iopriority support, rw_hint
is converted from enum to a u16. Added a guard around ki_hint assignment.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit fc28724d67c90ff48b976e0687caf79993160bed)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:53 +05:30
Christoph Hellwig
27275d0df5
fs: aio: Refactor read/write iocb setup
Don't reference the kiocb structure from the common aio code, and move
any use of it into helper specific to the read/write path.  This is in
preparation for aio_poll support that wants to use the space for different
fields.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
(cherry picked from commit 54843f875f7a9f802bbb0d9895c3266b4a0b2f37)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:52 +05:30
Christoph Hellwig
976d474581
aio: Remove the extra get_file/fput pair in io_submit_one
If we release the lockdep write protection token before calling into
->write_iter and thus never access the file pointer after an -EIOCBQUEUED
return from ->write_iter or ->read_iter we don't need this extra
reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from commit 92ce4728563ad1fc42466f9bbecc1ac31d675894)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:39:52 +05:30
Adithya R
0a57010f20
Revert "proc: cmdline: Patch SafetyNet flags"
* most ROMs do this in system/core and Magisk Hide does
   it as well, while on some ROMs it breaks boot due to AVB
   being enforced

This reverts commit fb6704a8d07cf7ad9a46e4ecc4a9e94fbba8ca32.

Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:34:33 +05:30
Yaroslav Furman
c8fe0232d8
xattr: Reduce the size of on-stack allocations
4kb on stack allocations are pretty unsafe.

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Jebaitedneko <Jebaitedneko@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:33:28 +05:30
celtare21
90b59717cb
fs: f2fs: Set DEF_CP_INTERVAL to 200secs
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:31:33 +05:30
Jebaitedneko
2724d35e11
fs: pstore: Always execute ramoops_pstore_write()
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:31:31 +05:30
Tyler Nijmeh
a10d8080fb
fs: Reduce cache pressure
We can utilize our available RAM better like this.

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:31:28 +05:30
Khazhismel Kumykov
371b06c93f
fs: ext4: cond_resched in work-heavy group loops
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:31:27 +05:30
Kees Cook
68766e5802
pstore/ram: Do not use stack VLA for parity workspace
Instead of using a stack VLA for the parity workspace, preallocate a
memory region. The preallocation is done to keep from needing to perform
allocations during crash dump writing, etc. This also fixes a missed
release of librs on free.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:21:37 +05:30
Kees Cook
81348fff97
ntfs: decompress: remove VLA usage
In the quest to remove all stack VLA usage from the kernel[1], this
moves the stack buffer used during decompression to be allocated
externally.

The existing "dest_max_index" used in the VLA is bounded by cb_max_page.
cb_max_page is bounded by max_page, and max_page is bounded by nr_pages.
Since nr_pages is used for the "pages" allocation, it can similarly be
used for the "completed_pages" allocation and passed into the
decompression function.  The error paths are updated to free the new
allocation.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Link: http://lkml.kernel.org/r/20180626172909.41453-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:21:35 +05:30
Dmitry Safonov
acf3c274d3
mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio
As the kernel expects to see only one of such mappings, any further
operations on the VMA-copy may be unexpected by the kernel.  Maybe it's
being on the safe side, but there doesn't seem to be any expected use-case
for this, so restrict it now.

Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:21:30 +05:30
Peter Collingbourne
6bcfe3cb84
mm: remove unnecessary wrapper function do_mmap_pgoff()
The current split between do_mmap() and do_mmap_pgoff() was introduced in
commit 1fcfd8db7f82 ("mm, mpx: add "vm_flags_t vm_flags" arg to
do_mmap_pgoff()") to support MPX.

The wrapper function do_mmap_pgoff() always passed 0 as the value of the
vm_flags argument to do_mmap().  However, MPX support has subsequently
been removed from the kernel and there were no more direct callers of
do_mmap(); all calls were going via do_mmap_pgoff().

Simplify the code by removing do_mmap_pgoff() and changing all callers to
directly call do_mmap(), which now no longer takes a vm_flags argument.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:21:13 +05:30
Linus Torvalds
b36a01d3c1
mm: use helper functions for allocating and freeing vm_area structs
The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded everywhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.

We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.

Right now those functions are literally just wrappers around the
kmem_cache_*() calls.  This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:21:02 +05:30
Arun KS
b0e4766681
mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are converted to static inline functions.

The main motivation was that managed_page_count_lock handling was
complicating things.  It was discussed at length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
better to remove the lock and convert the variables to atomic, which also
prevents potential store-to-read tearing as a bonus.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Alex Winkowski <dereference23@outlook.com>

Change-Id: Iad12311402dcdebc7804fc3e4866b67c25eb4d00
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:20:57 +05:30
trautamaki
05df20b719
pstore: Dump ramoops even when kernel doesn't crash
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:20:48 +05:30
Sultan Alsawaf
8a9268e00b
mm: Eliminate d_path_outlen() and further speed up show_map_vma()
d_path_outlen() isn't needed because we know that d_path() always
populates the given buffer backwards starting from the last byte; with
this, we can easily calculate the length of the generated string by
using the returned pointer from d_path() and the size of the buffer
given to d_path(). This eliminates the need for d_path_outlen() and
removes the bizarre strlen() usage, which makes things simpler and
faster. We also now avoid a memmove() when d_path() completely uses up
its provided buffer.
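
The length calculation relies on the string being written at the very end of the buffer. A user-space sketch of the idea, using a hypothetical demo_fill_backwards() helper in place of d_path():

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for d_path(): copy path plus its terminating NUL
 * to the end of buf and return a pointer to the first character, or NULL
 * if it does not fit.
 */
char *demo_fill_backwards(const char *path, char *buf, size_t buflen)
{
    size_t len = strlen(path);
    char *start;

    if (len + 1 > buflen)
        return NULL;
    start = buf + buflen - (len + 1);
    memcpy(start, path, len + 1);
    return start;
}

/*
 * Length of the generated string, derived from pointer arithmetic alone:
 * the bytes between start and the end of the buffer, minus the NUL.
 */
size_t demo_len_from_ptr(const char *start, const char *buf, size_t buflen)
{
    return (size_t)(buf + buflen - start) - 1;
}
```

Since the end of the string is always the last byte of the buffer, no strlen() pass over the result is needed to learn its length.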

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:19:08 +05:30
Jens Axboe
4e5a138e49
buffer: eliminate the need to call free_more_memory() in __getblk_slow()
Since the previous commit removed any case where grow_buffers()
would return failure due to memory allocations, we can safely
remove the case where we have to call free_more_memory() in
this function.

Since this is also the last user of free_more_memory(), kill
it off completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:19:08 +05:30
Jens Axboe
efb8d821e6
buffer: grow_dev_page() should use __GFP_NOFAIL for all cases
We currently use it for find_or_create_page(), which means that it
cannot fail. Ensure we also pass in 'retry == true' to
alloc_page_buffers(), which also ensures that it cannot fail.

After this, there are no failure cases in grow_dev_page() that
occur because of a failed memory allocation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:19:07 +05:30
Jens Axboe
268070cc02
buffer: have alloc_page_buffers() use __GFP_NOFAIL
Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:19:07 +05:30
Sultan Alsawaf
27f9509f00
mm: Micro-optimize PID map reads for arm64 while retaining output format
Android and various applications in Android need to read PID map data in
order to work. Some processes can contain over 10,000 mappings, which
results in lots of time wasted on simply generating strings. This wasted
time adds up, especially in the case of Unity-based games, which utilize
the Boehm garbage collector. A game's main process typically has well
over 10,000 mappings due to the loaded textures, and the Boehm GC reads
PID maps several times a second. This results in over 100,000 map
entries being printed out per second, so micro-optimization here is
important. Before this commit, show_vma_header_prefix() would typically
take around 1000 ns to run on a Snapdragon 855; now it only takes about
50 ns to run, which is a 20x improvement.

The primary micro-optimizations here assume that there are no more than
40 bits in the virtual address space, hence the CONFIG_ARM64_VA_BITS
check. Arm64 uses a virtual address size of 39 bits, so this perfectly
covers it.

This also removes padding used to beautify PID map output to further
speed up reads and reduce the amount of bytes printed, and optimizes the
dentry path retrieval for file-backed mappings. Note, however, that the
trailing space at the end of the line for non-file-backed mappings
cannot be omitted, as it breaks some PID map parsers.

This still retains insignificant leading zeros from printed hex values
to maintain the current output format.
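
A minimal userspace sketch of the kind of shortcut described above,
under stated assumptions: put_hex_min8() and put_range() are
hypothetical names, not functions from this commit. The sketch assumes
addresses below 2^40 and reproduces the zero-padded "%08lx-%08lx "
style of the address range at the start of a maps line without going
through a printf-style formatter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write `val` as lowercase hex, zero-padded to at least 8 digits (the same
 * result as "%08lx"), into out. Returns the number of characters written;
 * no NUL is appended. Assumes val fits in 40 bits, so at most 10 digits
 * are ever produced.
 */
static size_t put_hex_min8(char *out, uint64_t val)
{
    static const char digits[] = "0123456789abcdef";
    size_t ndigits = 8;

    assert(val < (UINT64_C(1) << 40));
    if (val >> 32)
        ndigits = (val >> 36) ? 10 : 9;

    for (size_t i = 0; i < ndigits; i++) {
        unsigned int shift = (unsigned int)(ndigits - 1 - i) * 4;
        out[i] = digits[(val >> shift) & 0xf];
    }
    return ndigits;
}

/* Format "start-end " followed by a NUL; returns the length without NUL. */
static size_t put_range(char *out, uint64_t start, uint64_t end)
{
    size_t n = put_hex_min8(out, start);

    out[n++] = '-';
    n += put_hex_min8(out + n, end);
    out[n++] = ' ';
    out[n] = '\0';
    return n;
}
```

Because the 40-bit bound caps the digit count at 10, the caller can
size its buffer up front (22 bytes are enough for put_range()).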

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Jebaitedneko <Jebaitedneko@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:18:46 +05:30
Al Viro
77d312d3e3
fs: eventpoll: Clean the failure exits up a bit
commit 52c479697c9b73f628140dcdfcd39ea302d05482 upstream.

Bug: 147802478
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: If479181d881c59c6d136299ba97e2cc850aa325c
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:17:28 +05:30
Marc Zyngier
dd521b94e4
epoll: Keep a reference on files added to the check list
commit a9ed4a6560b8562b7e2e2bed9527e88001f7b682 upstream.

When adding a new fd to an epoll instance, and this new fd is itself
an epoll fd, we recursively scan the fds attached to it to detect
cycles, and add non-epoll files to a "check list" that is
subsequently parsed.

However, this check list isn't completely safe when deletions
can happen concurrently. To sidestep the issue, make sure that
a struct file placed on the check list sees its f_count increased,
ensuring that a concurrent deletion won't result in the file
disappearing from under our feet.
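
The idea of pinning a file while it sits on the check list can be
sketched in userspace as follows. This is an illustration under
assumptions, not the epoll code: struct file_stub, file_get(),
file_put(), check_list_add() and check_list_clear() are all
hypothetical names, and the atomic counter merely plays the role of
f_count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the refcount matters here. */
struct file_stub {
    atomic_long count;            /* plays the role of f_count */
    bool released;                /* set when the last reference is dropped */
    struct file_stub *next_check; /* link on the check list */
};

static void file_get(struct file_stub *f)
{
    atomic_fetch_add(&f->count, 1);
}

static void file_put(struct file_stub *f)
{
    /* atomic_fetch_sub() returns the previous value. */
    if (atomic_fetch_sub(&f->count, 1) == 1)
        f->released = true; /* last reference gone */
}

/* Pin the file before linking it, so a concurrent close cannot free it. */
static void check_list_add(struct file_stub **head, struct file_stub *f)
{
    file_get(f);
    f->next_check = *head;
    *head = f;
}

/* Once the checks are done, unlink every entry and drop its pin. */
static void check_list_clear(struct file_stub **head)
{
    while (*head) {
        struct file_stub *f = *head;

        *head = f->next_check;
        f->next_check = NULL;
        file_put(f);
    }
}
```

With the pin in place, a close that drops the caller's own reference
leaves the count above zero until check_list_clear() runs.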

Bug: 147802478
Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: Iee8a2e6770ccdf96898058a0a7d953ace080dae7
(cherry picked from commit c5dda0b69cf92399ce410cbb8cfdaf382e51dd6b)
Signed-off-by: Forenche <prahul2003@gmail.com>

commit 3dbf6600a559834da65e17c8cac075493bfd9fe7
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Wed Sep 2 11:30:48 2020 -0400

    fix regression in "epoll: Keep a reference on files added to the check list"

    [ Upstream commit 77f4689de17c0887775bb77896f4cc11a39bf848 ]

    epoll_loop_check_proc() can run into a file already committed to destruction;
    we can't grab a reference on those and don't need to add them to the set for
    reverse path check anyway.

    Bug: 147802478

    Tested-by: Marc Zyngier <maz@kernel.org>
    Fixes: a9ed4a6560b8 ("epoll: Keep a reference on files added to the check list")
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
    Change-Id: I541299a6325a6e9765add9e920cfa0203de9f4a0
    Signed-off-by: Forenche <prahul2003@gmail.com>

Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:17:27 +05:30
laoyi
b4107647bb
fs: proc: Update perms of process_reclaim node
Other userspace apps like AppCompaction would like to use this node,
so update its permissions.

Change-Id: Ied22bd6ad489bef4028cde943ac185d1354ab971
Signed-off-by: <laoyi@codeaurora.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:14:02 +05:30
John Dias
9b1cd81198
fs: Improve eventpoll logging to stop indicting timerfd
timerfd doesn't create any wakelocks; eventpoll can, and is creating the
wakelocks we see called "[timerfd]".  eventpoll creates two kinds of
wakelocks: a single top-level lock associated with the eventpoll fd
itself, and one additional lock for each fd it is polling that needs such
a lock (e.g. those using EPOLLWAKEUP).  Current code names the per-fd
locks using the undecorated names of the fds' associated files (hence
"[timerfd]"), and is naming the top-level lock after the PID of the caller
and the name of the file behind the first fd for which a per-fd lock is
created.  To make things clearer, the top-level lock is now named using
the caller PID and an "epollfd" designation, while the per-fd locks are
also named with the caller's PID (to associate them with the top-level
lock) and their respective fds' file names.

Port of fix already applied to previous 2 generations.  Note that this
set of changes does not fully solve the problem of eventpoll/timerfd
wakelock attribution to the original process, since most activity is
relayed through system_server, but it does at least ensure that different
eventpoll wakelocks - and their stats - are properly disambiguated.

Test: Ran on device and observed new wakelock naming in
/d/wakeup_sources and (file naming in) lsof output.
Bug: 116363986
Change-Id: I34bada5ddab04cf3830762c745f46bfcd1549cb8
Signed-off-by: John Dias <joaodias@google.com>
Signed-off-by: Kelly Rossmoyer <krossmo@google.com>
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:14:02 +05:30
Adhitya Mohan
f591fea7a8
fs/fuse: shortcircuit: Make it compile
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:12:14 +05:30
LibXZR
5bdd63eaa2
fs/fuse: shortcircuit: Disable logging
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:12:14 +05:30
LibXZR
0b49172223
fs: fuse: Implement fuse short circuit
* This significantly improves i/o performance under /sdcard
* From OnePlus 8T Oxygen OS 11.0.8.11.KB05AA and OnePlus 8 Oxygen OS 11.0.5.5.IN21AA and OnePlus 8 Pro Oxygen OS 11.0.5.5.IN11AA

RealJohnGalt: make proper Kconfig, add back dependencies from OnePlus
source onto our CAF tree.

Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:12:14 +05:30
Theodore Ts'o
d88efacc59
ext4: Improve smp scalability for inode generation
->s_next_generation is protected by s_next_gen_lock but its usage
pattern is very primitive.  We don't actually need sequentially
increasing new generation numbers, so let's use prandom_u32() instead.

Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:12:13 +05:30
Dave Kleikamp
715f4c034d
AIO: Don't plug the I/O queue in do_io_submit()
Asynchronous I/O latency to a solid-state disk greatly increased between the 2.6.32 and 3.0 kernels.
By removing the plug from do_io_submit(), we observed a 34% improvement in the I/O latency.
Unfortunately, at this level, we don't know if the request is to
a rotating disk or not.

Change-Id: I7101df956473ed9fd5dcff18e473dd93b688a5c1
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: linux-aio@kvack.org
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:12:04 +05:30
Park Ju Hyung
2d59edb069
kernfs: Use kmem_cache pool for struct kernfs_open_node/file
These get allocated and freed millions of times on this kernel tree.

Use a dedicated kmem_cache pool and avoid costly dynamic memory allocations.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:11:53 +05:30
Park Ju Hyung
be788d1d3d
sdcardfs: Use kmem_cache pool for struct sdcardfs_file_info
These get allocated and freed millions of times on this kernel tree.

Use a dedicated kmem_cache pool and avoid costly dynamic memory allocations.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:11:53 +05:30
Sayali Lokhande
b7d8d03984
f2fs: Avoid double lock for cp_rwsem during checkpoint
There could be a scenario where f2fs_sync_node_pages gets
called during checkpoint, which in turn tries to flush
inline data and calls iput(). This results in deadlock as
iput() tries to hold cp_rwsem, which is already held at the
beginning by checkpoint->block_operations().

Call stack :

Thread A                Thread B
f2fs_write_checkpoint()
- block_operations(sbi)
 - f2fs_lock_all(sbi);
  - down_write(&sbi->cp_rwsem);

                        - open()
                         - igrab()
                        - write() write inline data
                        - unlink()
- f2fs_sync_node_pages()
 - if (is_inline_node(page))
  - flush_inline_data()
   - ilookup()
     page = f2fs_pagecache_get_page()
     if (!page)
      goto iput_out;
     iput_out:
                        - close()
                        - iput()
       iput(inode);
       - f2fs_evict_inode()
        - f2fs_truncate_blocks()
         - f2fs_lock_op()
           - down_read(&sbi->cp_rwsem);

Change-Id: I048bbf42c0b11108e2444f4d9df5d58e7a779c3c
Fixes: 2049d4fcb057 ("f2fs: avoid multiple node page writes due to inline_data")
Signed-off-by: Sayali Lokhande <sayalil@codeaurora.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Git-commit: 34c061ad85a2f5d5e9e3b045d72f3b211db6e282
Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/
Signed-off-by: Sayali Lokhande <sayalil@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:11:16 +05:30
Adithya R
13520f3f70
fs/ext4: inode: Remove an unused variable
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:10:29 +05:30
Sultan Alsawaf
c8ce6f4d68
mm: Perform PID map reads on the little CPU cluster
PID map reads for processes with thousands of mappings can be done
extensively by certain Android apps, burning through CPU time on
higher-performance CPUs even though reading PID maps is never a
performance-critical task. We can relieve the load on the important CPUs
by moving PID map reads to little CPUs via sched_migrate_to_cpumask_*().

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:10:10 +05:30
Alexey Dobriyan
f14405e529
proc: reject "." and ".." as filenames
Various subsystems can create files and directories in /proc with names
directly controlled by userspace.

Which means "/", "." and ".." are no-no.
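
A small userspace sketch of such a name check, offered as an
illustration rather than the actual procfs code: name_is_valid() is a
hypothetical helper, and rejecting the empty string is an extra
assumption that is not stated in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `name` is acceptable as a single path component:
 * it contains no '/' and is neither "." nor "..".
 * Assumption: an empty name is rejected as well.
 */
static bool name_is_valid(const char *name)
{
    assert(name != NULL);

    if (name[0] == '\0')
        return false;
    if (strchr(name, '/') != NULL)
        return false;
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return false;
    return true;
}
```

Names that merely start with a dot, such as ".hidden" or "...", still
pass; only the two exact directory aliases are refused.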

"/" split is already taken care of, do the other 2 prohibited names.

Link: http://lkml.kernel.org/r/20180310001223.GB12443@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 175721be2e505649ee481a9cd2bf14b228d12f2b)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
2022-04-02 13:09:15 +05:30