54423 Commits

Author SHA1 Message Date
laoyi
bb32c75f6c fs: proc: Update perms of process_reclaim node
Other userspace apps like AppCOmpaction would like to use this node,
so update permission.

Change-Id: Ied22bd6ad489bef4028cde943ac185d1354ab971
Signed-off-by: <laoyi@codeaurora.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2021-06-05 14:55:41 +05:30
John Dias
f1033fb035 fs: Improve eventpoll logging to stop indicting timerfd
timerfd doesn't create any wakelocks; eventpoll can, and is creating the
wakelocks we see called "[timerfd]".  eventpoll creates two kinds of
wakelocks: a single top-level lock associated with the eventpoll fd
itself, and one additional lock for each fd it is polling that needs such
a lock (e.g. those using EPOLLWAKEUP).  Current code names the per-fd
locks using the undecorated names of the fds' associated files (hence
"[timerfd]"), and is naming the top-level lock after the PID of the caller
and the name of the file behind the first fd for which a per-fd lock is
created.  To make things clearer, the top-level lock is now named using
the caller PID and an "epollfd" designation, while the per-fd locks are
also named with the caller's PID (to associate them with the top-level
lock) and their respective fds' file names.

Port of fix already applied to previous 2 generations.  Note that this
set of changes does not fully solve the problem of eventpoll/timerfd
wakelock attribution to the original process, since most activity is
relayed through system_server, but it does at least ensure that different
eventpoll wakelocks - and their stats - are properly disambiguated.

Test: Ran on device and observed new wakelock naming in
/d/wakeup_sources and (file naming in) lsof output.
Bug: 116363986
Change-Id: I34bada5ddab04cf3830762c745f46bfcd1549cb8
Signed-off-by: John Dias <joaodias@google.com>
Signed-off-by: Kelly Rossmoyer <krossmo@google.com>
Signed-off-by: Miguel de Dios <migueldedios@google.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2021-06-05 14:55:41 +05:30
Adithya R
4ad4df234e Merge tag 'LA.UM.9.1.r1-10200-SMxxx0.0' of https://source.codeaurora.org/quic/la/kernel/msm-4.14 into staging/LA.UM.9.x
"LA.UM.9.1.r1-10200-SMxxx0.0"
2021-06-05 14:39:57 +05:30
Adhitya Mohan
8a4166ccd4 fs/fuse: shortcircuit: Make it compile 2021-05-25 19:26:37 +05:30
LibXZR
0f1df8b6c0 fs/fuse: shortcircuit: Disable logging 2021-05-25 19:26:37 +05:30
LibXZR
a650eef6fc fs: fuse: Implement fuse short circuit
* This significantly improves i/o performance under /sdcard
* From OnePlus 8T Oxygen OS 11.0.8.11.KB05AA and OnePlus 8 Oxygen OS 11.0.5.5.IN21AA and OnePlus 8 Pro Oxygen OS 11.0.5.5.IN11AA

RealJohnGalt: make proper Kconfig, add back dependencies from OnePlus
source onto our CAF tree.

Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-05-25 19:26:37 +05:30
Theodore Ts'o
b8328115cd ext4: Improve smp scalability for inode generation
->s_next_generation is protected by s_next_gen_lock but its usage
pattern is very primitive.  We don't actually need sequentially
increasing new generation numbers, so let's use prandom_u32() instead.

Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
2021-05-25 19:26:37 +05:30
Dave Kleikamp
0179baa847 AIO: Don't plug the I/O queue in do_io_submit()
Asynchronous I/O latency to a solid-state disk greatly increased between the 2.6.32 and 3.0 kernels.
By removing the plug from do_io_submit(), we observed a 34% improvement in the I/O latency.
Unfortunately, at this level, we don't know if the request is to
a rotating disk or not.

Change-Id: I7101df956473ed9fd5dcff18e473dd93b688a5c1
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: linux-aio@kvack.org
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
2021-05-25 19:26:34 +05:30
Park Ju Hyung
6da4dd3220 kernfs: Use kmem_cache pool for struct kernfs_open_node/file
These get allocated and freed millions of times on this kernel tree.

Use a dedicated kmem_cache pool and avoid costly dynamic memory allocations.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
2021-05-25 19:23:54 +05:30
Park Ju Hyung
1af6fc556c sdcardfs: Use kmem_cache pool for struct sdcardfs_file_info
These get allocated and freed millions of times on this kernel tree.

Use a dedicated kmem_cache pool and avoid costly dynamic memory allocations.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
2021-05-25 19:23:54 +05:30
Pradeep P V K
71070edfaa fuse: Set fuse request error upon fuse abort connection
There is a minor race in setting the fuse out request error
between fuse_abort_conn() and fuse_dev_do_read() as explained
below.
Thread-1                        Thread-2
========                        ========
->fuse_simple_request()        ->shutdown
  ->__fuse_request_send()
     ->queue_request()         ->fuse_abort_conn()
->fuse_dev_do_read()                ->acquire(fpq->lock)
  ->wait_for(fpq->lock)        ->set err to all req's in fpq->io
  ->release(fpq->lock)
  ->acquire(fpq->lock)
  ->add req to fpq->io
The above scenario may cause Thread-1 request to add into
fpq->io list after Thread-2 sets -ECONNABORTED err to all
its requests in fpq->io list. This leaves Thread-1 request
with unset err and this further misleads as a completed
request without an err set upon request_end().

Handle this by setting the err appropriately.

Change-Id: I7961c421939423290f6b78a619490d2cc3fbcb33
Signed-off-by: Pradeep P V K <pragalla@codeaurora.org>
2021-05-24 00:07:39 -07:00
Sayali Lokhande
2493b43cc1 f2fs: Avoid double lock for cp_rwsem during checkpoint
There could be a scenario where f2fs_sync_node_pages gets
called during checkpoint, which in turn tries to flush
inline data and calls iput(). This results in deadlock as
iput() tries to hold cp_rwsem, which is already held at the
beginning by checkpoint->block_operations().

Call stack :

Thread A		Thread B
f2fs_write_checkpoint()
- block_operations(sbi)
 - f2fs_lock_all(sbi);
  - down_write(&sbi->cp_rwsem);

                        - open()
                         - igrab()
                        - write() write inline data
                        - unlink()
- f2fs_sync_node_pages()
 - if (is_inline_node(page))
  - flush_inline_data()
   - ilookup()
     page = f2fs_pagecache_get_page()
     if (!page)
      goto iput_out;
     iput_out:
			-close()
			-iput()
       iput(inode);
       - f2fs_evict_inode()
        - f2fs_truncate_blocks()
         - f2fs_lock_op()
           - down_read(&sbi->cp_rwsem);

Change-Id: I048bbf42c0b11108e2444f4d9df5d58e7a779c3c
Fixes: 2049d4fcb057 ("f2fs: avoid multiple node page writes due to inline_data")
Signed-off-by: Sayali Lokhande <sayalil@codeaurora.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Git-commit: 34c061ad85a2f5d5e9e3b045d72f3b211db6e282
Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/
Signed-off-by: Sayali Lokhande <sayalil@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
2021-05-22 16:11:26 +05:30
Adithya R
295a88ddbd fs/ext4: inode: Remove an unused variable 2021-05-15 13:52:20 +05:30
Sultan Alsawaf
c65672b0d7 mm: Perform PID map reads on the little CPU cluster
PID map reads for processes with thousands of mappings can be done
extensively by certain Android apps, burning through CPU time on
higher-performance CPUs even though reading PID maps is never a
performance-critical task. We can relieve the load on the important CPUs
by moving PID map reads to little CPUs via sched_migrate_to_cpumask_*().

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2021-05-11 20:45:50 +05:30
Alexey Dobriyan
2777028c65 proc: reject "." and ".." as filenames
Various subsystems can create files and directories in /proc with names
directly controlled by userspace.

Which means "/", "." and ".." are no-no.

"/" split is already taken care of, do the other 2 prohibited names.

Link: http://lkml.kernel.org/r/20180310001223.GB12443@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 175721be2e505649ee481a9cd2bf14b228d12f2b)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:35 +05:30
Alexey Dobriyan
465a23a06b proc: do mmput ASAP for /proc/*/map_files
mm_struct is not needed while printing as all the data was already
extracted.

Link: http://lkml.kernel.org/r/20180309223120.GC3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 77c98ef42fa00e1d6e5631ed41c98d75fefd48e3)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:35 +05:30
Alexey Dobriyan
6cf273d41e proc: register filesystem last
As soon as register_filesystem() exits, filesystem can be mounted.  It
is better to present fully operational /proc.

Of course it doesn't matter because /proc is not modular but do it
anyway.

Drop error check, it should be handled by panicking.

Link: http://lkml.kernel.org/r/20180309222709.GA3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 6f695f71eb903c41491e75e25dcfff47ec3a6db7)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:29 +05:30
Alexey Dobriyan
76decf60fd proc: fix /proc/*/map_files lookup some more
I totally forgot that _parse_integer() accepts arbitrary amount of
leading zeroes leading to the following lookups:

		OK
	# readlink /proc/1/map_files/56427ecba000-56427eddc000
	/lib/systemd/systemd

		bogus
	# readlink /proc/1/map_files/00000000000056427ecba000-56427eddc000
	/lib/systemd/systemd
	# readlink /proc/1/map_files/56427ecba000-00000000000056427eddc000
	/lib/systemd/systemd

Link: http://lkml.kernel.org/r/20180303215130.GA23480@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit a97c1ccade29b8cf7a7ca614ece30929b045a143)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:28 +05:30
Danilo Krummrich
f4bf867a08 fs/proc/proc_sysctl.c: remove redundant link check in proc_sys_link_fill_cache()
proc_sys_link_fill_cache() does not need to check whether we're called
for a link - it's already done by scan().

Link: http://lkml.kernel.org/r/20180228013506.4915-2-danilokrummrich@dk-develop.de
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit b7c0004afe47f4d2e830d1b4a971253911bd861c)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:28 +05:30
Alexey Dobriyan
81cec1ffcc proc: use set_puts() at /proc/*/wchan
Link: http://lkml.kernel.org/r/20180217072011.GB16074@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 05cca699535589d40946e4eb5e613547b02b5f8f)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:28 +05:30
Alexey Dobriyan
c526f505f7 proc: check permissions earlier for /proc/*/wchan
get_wchan() accesses stack page before permissions are checked, let's
not play this game.

Link: http://lkml.kernel.org/r/20180217071923.GA16074@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 52dfa0c717dafa851b93cab220d168c7ebd2eaec)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:27 +05:30
Andrei Vagin
90778f2436 proc: replace seq_printf by seq_put_smth to speed up /proc/pid/status
seq_printf() works slower than seq_puts, seq_puts, etc.

== test_proc.c
int main(int argc, char **argv)
{
	int n, i, fd;
	char buf[16384];

	n = atoi(argv[1]);
	for (i = 0; i < n; i++) {
		fd = open(argv[2], O_RDONLY);
		if (fd < 0)
			return 1;
		if (read(fd, buf, sizeof(buf)) <= 0)
			return 1;
		close(fd);
	}

	return 0;
}
==

$ time ./test_proc  1000000 /proc/1/status

== Before path ==
real	0m5.171s
user	0m0.328s
sys	0m4.783s

== After patch ==
real	0m4.761s
user	0m0.334s
sys	0m4.366s

Link: http://lkml.kernel.org/r/20180212074931.7227-4-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 23e929be65a987090ef4f723112865018b339b0f)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:27 +05:30
Andrei Vagin
a3fd0359df proc: optimize single-symbol delimiters to spead up seq_put_decimal_ull
A delimiter is a string which is printed before a number.  A
syngle-symbol delimiters can be printed by set_putc() and this works
faster than printing by set_puts().

== test_proc.c

int main(int argc, char **argv)
{
	int n, i, fd;
	char buf[16384];

	n = atoi(argv[1]);
	for (i = 0; i < n; i++) {
		fd = open(argv[2], O_RDONLY);
		if (fd < 0)
			return 1;
		if (read(fd, buf, sizeof(buf)) <= 0)
			return 1;
		close(fd);
	}

	return 0;
}
==

$ time ./test_proc  1000000 /proc/1/stat

== Before patch ==
  real	0m3.820s
  user	0m0.337s
  sys	0m3.394s

== After patch ==
  real	0m3.110s
  user	0m0.324s
  sys	0m2.700s

Link: http://lkml.kernel.org/r/20180212074931.7227-3-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 089d07e8ab9cb3a49baba464378303fbc452cc4b)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:26 +05:30
Andrei Vagin
6a13533dcb proc: replace seq_printf on seq_putc to speed up /proc/pid/smaps
seq_putc() works much faster than seq_printf()

== Before patch ==
  $ time python test_smaps.py
  real    0m3.828s
  user    0m0.413s
  sys     0m3.408s

== After patch ==
  $ time python test_smaps.py
  real	0m3.405s
  user	0m0.401s
  sys	0m3.003s

== Before patch ==
-   75.51%     4.62%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 70.88% show_smap.isra.33
      + 24.82% seq_put_decimal_ull_aligned
      + 19.78% __walk_page_range
      + 12.74% seq_printf
      + 11.08% show_map_vma.isra.23
      + 1.68% seq_puts

== After patch ==
-   69.16%     5.70%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 63.46% show_smap.isra.33
      + 25.98% seq_put_decimal_ull_aligned
      + 20.90% __walk_page_range
      + 12.60% show_map_vma.isra.23
        1.56% seq_putc
      + 1.55% seq_puts

Link: http://lkml.kernel.org/r/20180212074931.7227-2-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 8857ea44552da119f9b8997adeedb6bb4e3e6853)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:25 +05:30
Andrei Vagin
5ec7cad9b9 proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps
seq_put_decimal_ull_w(m, str, val, width) prints a decimal number with a
specified minimal field width.

It is equivalent of seq_printf(m, "%s%*d", str, width, val), but it
works much faster.

== test_smaps.py
  num = 0
  with open("/proc/1/smaps") as f:
          for x in xrange(10000):
                  data = f.read()
                  f.seek(0, 0)
==

== Before patch ==
  $ time python test_smaps.py
  real    0m4.593s
  user    0m0.398s
  sys     0m4.158s

== After patch ==
  $ time python test_smaps.py
  real    0m3.828s
  user    0m0.413s
  sys     0m3.408s

$ perf -g record python test_smaps.py
== Before patch ==
-   79.01%     3.36%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 75.65% show_smap.isra.33
      + 48.85% seq_printf
      + 15.75% __walk_page_range
      + 9.70% show_map_vma.isra.23
        0.61% seq_puts

== After patch ==
-   75.51%     4.62%  python   [kernel.kallsyms]    [k] show_smap.isra.33
   - 70.88% show_smap.isra.33
      + 24.82% seq_put_decimal_ull_w
      + 19.78% __walk_page_range
      + 12.74% seq_printf
      + 11.08% show_map_vma.isra.23
      + 1.68% seq_puts

[akpm@linux-foundation.org: fix drivers/of/unittest.c build]
Link: http://lkml.kernel.org/r/20180212074931.7227-1-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 56d90368ba6ee38ca96ee45ce9c0980af4982117)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:25 +05:30
Konstantin Khlebnikov
603befd8e5 fs/proc/task_mmu.c: do not show VmExe bigger than total executable virtual memory
If start_code / end_code pointers are screwed then "VmExe" could be
bigger than total executable virtual memory and "VmLib" becomes
negative:

  VmExe:	  294320 kB
  VmLib:	18446744073709327564 kB

VmExe and VmLib documented as text segment and shared library code size.

Now their sum will be always equal to mm->exec_vm which sums size of
executable and not writable and not stack areas.

I've seen this for huge (>2Gb) statically linked binary which has whole
world inside.  For it start_code ..  end_code range also covers one of
rodata sections.  Probably this is bug in customized linker, elf loader
or both.

Anyway CONFIG_CHECKPOINT_RESTORE allows to change these pointers, thus
we cannot trust them without validation.

Link: http://lkml.kernel.org/r/150728955451.743749.11276392315459539583.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit cfcad2ca397928130217acc011604d530d320049)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:24 +05:30
Kirill A. Shutemov
0dc85f98cd mm: consolidate page table accounting
Currently, we account page tables separately for each page table level,
but that's redundant -- we only make use of total memory allocated to
page tables for oom_badness calculation.  We also provide the
information to userspace, but it has dubious value there too.

This patch switches page table accounting to single counter.

mm->pgtables_bytes is now used to account all page table levels.  We use
bytes, because page table size for different levels of page table tree
may be different.

The change has user-visible effect: we don't have VmPMD and VmPUD
reported in /proc/[pid]/status.  Not sure if anybody uses them.  (As
alternative, we can always report 0 kB for them.)

OOM-killer report is also slightly changed: we now report pgtables_bytes
instead of nr_ptes, nr_pmd, nr_puds.

Apart from reducing number of counters per-mm, the benefit is that we
now calculate oom_badness() more correctly for machines which have
different size of page tables depending on level or where page tables
are less than a page in size.

The only downside can be debuggability because we do not know which page
table level could leak.  But I do not remember many bugs that would be
caught by separate counters so I wouldn't lose sleep over this.

[akpm@linux-foundation.org: fix mm/huge_memory.c]
Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
[kirill.shutemov@linux.intel.com: fix build]
  Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 82bf5da35830928c04d1cb35ab89d26d4a14da71)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:24 +05:30
Kirill A. Shutemov
4ca2eb2e71 mm: introduce wrappers to access mm->nr_ptes
Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
and nr_pud.

The patch also makes nr_ptes accounting dependent onto CONFIG_MMU.  Page
table accounting doesn't make sense if you don't have page tables.

It's preparation for consolidation of page-table counters in mm_struct.

Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 7a81a0cfb6310b5453d851fece5fb573049930bd)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:23 +05:30
Kirill A. Shutemov
f6d7bac1dc mm: account pud page tables
On a machine with 5-level paging support a process can allocate
significant amount of memory and stay unnoticed by oom-killer and memory
cgroup.  The trick is to allocate a lot of PUD page tables.  We don't
account PUD page tables, only PMD and PTE.

We already addressed the same issue for PMD page tables, see commit
dc6c9a35b66b ("mm: account pmd page tables to the process").
Introduction of 5-level paging brings the same issue for PUD page
tables.

The patch expands accounting to PUD level.

[kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
  Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
[heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
  Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit e786257a2ff2dfbaeaad7aae13fff7c7528da54e)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:23 +05:30
Alexey Dobriyan
397d74d883 proc: account "struct pde_opener"
The allocation is persistent in fact as any fool can open a file in
/proc and sit on it.

Link: http://lkml.kernel.org/r/20180214082409.GC17157@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit a7a0047939ded2ba4e578e510047c4dd7349c326)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:22 +05:30
Alexey Dobriyan
c477fdcf56 proc: move "struct pde_opener" to kmem cache
"struct pde_opener" is fixed size and we can have more granular approach
to debugging.

For those who don't know, per cache SLUB poisoning and red zoning don't
work if there is at least one object allocated which is hopeless in case
of kmalloc-64 but not in case of standalone cache.  Although systemd
opens 2 files from the get go, so it is hopeless after all.

Link: http://lkml.kernel.org/r/20180214082306.GB17157@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 579f1c6190bb2d5b1a31c8563a60c809510a88a8)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:22 +05:30
Alexey Dobriyan
92a448b67e fs/proc: use __ro_after_init
/proc/self inode numbers, value of proc_inode_cache and st_nlink of
/proc/$TGID are fixed constants.

Link: http://lkml.kernel.org/r/20180103184707.GA31849@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit f81586a6d74dd559ca21d65ee80b62eeea19e23d)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:21 +05:30
Alexey Dobriyan
8855c18227 proc: faster open/close of files without ->release hook
The whole point of code in fs/proc/inode.c is to make sure ->release
hook is called either at close() or at rmmod time.

All if it is unnecessary if there is no ->release hook.

Save allocation+list manipulations under spinlock in that case.

Link: http://lkml.kernel.org/r/20180214063033.GA15579@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 7055d8fbe00803a0542947ac9ca41ea554cb5d02)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:21 +05:30
Alexey Dobriyan
fd35b002ac proc: move /proc/sysvipc creation to where it belongs
Move the proc_mkdir() call within the sysvipc subsystem such that we
avoid polluting proc_root_init() with petty cpp.

[dave@stgolabs.net: contributed changelog]
Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 650fedae68888a3b53b7b052147fb5e69af53afb)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:20 +05:30
Alexey Dobriyan
ee658e9c60 proc: do less stuff under ->pde_unload_lock
Commit ca469f35a8e9ef ("deal with races between remove_proc_entry() and
proc_reg_release()") moved too much stuff under ->pde_unload_lock making
a problem described at series "[PATCH v5] procfs: Improve Scaling in
proc" worse.

While RCU is being figured out, move kfree() out of ->pde_unload_lock.

On my potato, difference is only 0.5% speedup with concurrent
open+read+close of /proc/cmdline, but the effect should be more
noticeable on more capable machines.

$ perf stat -r 16 -- ./proc-j 16

 Performance counter stats for './proc-j 16' (16 runs):

     130569.502377      task-clock (msec)         #   15.872 CPUs utilized            ( +-  0.05% )
            19,169      context-switches          #    0.147 K/sec                    ( +-  0.18% )
                15      cpu-migrations            #    0.000 K/sec                    ( +-  3.27% )
               437      page-faults               #    0.003 K/sec                    ( +-  1.25% )
   300,172,097,675      cycles                    #    2.299 GHz                      ( +-  0.05% )
    96,793,267,308      instructions              #    0.32  insn per cycle           ( +-  0.04% )
    22,798,342,298      branches                  #  174.607 M/sec                    ( +-  0.04% )
       111,764,687      branch-misses             #    0.49% of all branches          ( +-  0.47% )

       8.226574400 seconds time elapsed                                          ( +-  0.05% )
       ^^^^^^^^^^^

$ perf stat -r 16 -- ./proc-j 16

 Performance counter stats for './proc-j 16' (16 runs):

     129866.777392      task-clock (msec)         #   15.869 CPUs utilized            ( +-  0.04% )
            19,154      context-switches          #    0.147 K/sec                    ( +-  0.66% )
                14      cpu-migrations            #    0.000 K/sec                    ( +-  1.73% )
               431      page-faults               #    0.003 K/sec                    ( +-  1.09% )
   298,556,520,546      cycles                    #    2.299 GHz                      ( +-  0.04% )
    96,525,366,833      instructions              #    0.32  insn per cycle           ( +-  0.04% )
    22,730,194,043      branches                  #  175.027 M/sec                    ( +-  0.04% )
       111,506,074      branch-misses             #    0.49% of all branches          ( +-  0.18% )

       8.183629778 seconds time elapsed                                          ( +-  0.04% )
       ^^^^^^^^^^^

Link: http://lkml.kernel.org/r/20180213132911.GA24298@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit c67e4ec3a63b3c3935de1dabd249e859c49bdd9d)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:20 +05:30
Mateusz Guzik
2d7289323f proc: get rid of task lock/unlock pair to read umask for the "status" file
get_task_umask locks/unlocks the task on its own.  The only caller does
the same thing immediately after.

Utilize the fact the task has to be locked anyway and just do it once.
Since there are no other users and the code is short, fold it in.

Link: http://lkml.kernel.org/r/1517995608-23683-1-git-send-email-mguzik@redhat.com
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 7172a169e15bd4dd022214b349d0d304ce296475)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:19 +05:30
Andrei Vagin
a5c7b9ffa5 procfs: optimize seq_pad() to speed up /proc/pid/maps
seq_printf() is slow and it can be replaced by memset() in this case.

== test.py
  num = 0
  with open("/proc/1/maps") as f:
          while num < 10000 :
                  data = f.read()
                  f.seek(0, 0)
                  num = num + 1
==

== Before patch ==
  $  time python test.py
  real	0m0.986s
  user	0m0.279s
  sys	0m0.707s

== After patch ==
  $ time python test.py
  real	0m0.932s
  user	0m0.261s
  sys	0m0.669s

$ perf record -g python test.py
== Before patch ==
-   47.35%     3.38%  python   [kernel.kallsyms] [k] show_map_vma.isra.23
   - 43.97% show_map_vma.isra.23
      + 20.84% seq_path
      - 15.73% show_vma_header_prefix
      + 6.96% seq_pad
   + 2.94% __GI___libc_read

== After patch ==
-   44.01%     0.34%  python   [kernel.kallsyms] [k] show_pid_map
   - 43.67% show_pid_map
      - 42.91% show_map_vma.isra.23
         + 21.55% seq_path
         - 15.68% show_vma_header_prefix
         + 2.08% seq_pad
        0.55% seq_putc

Link: http://lkml.kernel.org/r/20180112185812.7710-2-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit e29c242fe3af7ff38fe3a116cfcaddb6d81af2bf)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:19 +05:30
Andrei Vagin
ab9961756e procfs: add seq_put_hex_ll to speed up /proc/pid/maps
seq_put_hex_ll() prints a number in hexadecimal notation and works
faster than seq_printf().

== test.py
  num = 0
  with open("/proc/1/maps") as f:
          while num < 10000 :
                  data = f.read()
                  f.seek(0, 0)
                 num = num + 1
==

== Before patch ==
  $  time python test.py

  real	0m1.561s
  user	0m0.257s
  sys	0m1.302s

== After patch ==
  $ time python test.py

  real	0m0.986s
  user	0m0.279s
  sys	0m0.707s

$ perf -g record python test.py:

== Before patch ==
-   67.42%     2.82%  python   [kernel.kallsyms] [k] show_map_vma.isra.22
   - 64.60% show_map_vma.isra.22
      - 44.98% seq_printf
         - seq_vprintf
            - vsnprintf
               + 14.85% number
               + 12.22% format_decode
                 5.56% memcpy_erms
      + 15.06% seq_path
      + 4.42% seq_pad
   + 2.45% __GI___libc_read

== After patch ==
-   47.35%     3.38%  python   [kernel.kallsyms] [k] show_map_vma.isra.23
   - 43.97% show_map_vma.isra.23
      + 20.84% seq_path
      - 15.73% show_vma_header_prefix
           10.55% seq_put_hex_ll
         + 2.65% seq_put_decimal_ull
           0.95% seq_putc
      + 6.96% seq_pad
   + 2.94% __GI___libc_read

[avagin@openvz.org: use unsigned int instead of int where it is suitable]
  Link: http://lkml.kernel.org/r/20180214025619.4005-1-avagin@openvz.org
[avagin@openvz.org: v2]
  Link: http://lkml.kernel.org/r/20180117082050.25406-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/20180112185812.7710-1-avagin@openvz.org
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Panchajanya1999 <rsk52959@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit 962ec0e343fc30e08568fcac42cc4ef099d30e14)
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:57:18 +05:30
Adithya R
9a0ad9edbe Revert "mm: Micro-optimize PID map reads for arm64 while retaining output format"
* causes some games to ban users weirdly

This reverts commit c0eaff615fd0380759424e54dcada03f5ec2512e.

Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-04-20 14:45:59 +05:30
Sultan Alsawaf
c0eaff615f mm: Micro-optimize PID map reads for arm64 while retaining output format
Android and various applications in Android need to read PID map data in
order to work. Some processes can contain over 10,000 mappings, which
results in lots of time wasted on simply generating strings. This wasted
time adds up, especially in the case of Unity-based games, which utilize
the Boehm garbage collector. A game's main process typically has well
over 10,000 mappings due to the loaded textures, and the Boehm GC reads
PID maps several times a second. This results in over 100,000 map
entries being printed out per second, so micro-optimization here is
important. Before this commit, show_vma_header_prefix() would typically
take around 1000 ns to run on a Snapdragon 855; now it only takes about
50 ns to run, which is a 20x improvement.

The primary micro-optimizations here assume that there are no more than
40 bits in the virtual address space, hence the CONFIG_ARM64_VA_BITS
check. Arm64 uses a virtual address size of 39 bits, so this perfectly
covers it.

This also removes padding used to beautify PID map output to further
speed up reads and reduce the amount of bytes printed, and optimizes the
dentry path retrieval for file-backed mappings. Note, however, that the
trailing space at the end of the line for non-file-backed mappings
cannot be omitted, as it breaks some PID map parsers.

This still retains insignificant leading zeros from printed hex values
to maintain the current output format.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2021-04-09 19:32:56 +05:30
Adithya R
10a895011e Revert "mm: Micro-optimize PID map reads for arm64"
This reverts commit 9d2d6e9568b96acf0d858801707c64a2bcb058c7.
2021-04-09 19:31:16 +05:30
Park Ju Hyung
8840d7f9a9 ext4: Remove additional tracings added by CAF
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-03-29 15:36:05 +05:30
Park Ju Hyung
6ac1ece269 f2fs: Remove additional tracings added by CAF
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
2021-03-29 15:36:05 +05:30
Park Ju Hyung
ddffb9da80 time: Move frequently used functions to headers and inline
Those function are frequently used in various places and declaring them inline
can reduce overheads.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
2021-03-29 03:47:52 +05:30
NeilBrown
ad0b0e6fad VFS: Use synchronize_rcu_expedited() in namespace_unlock()
The synchronize_rcu() in namespace_unlock() is called every time
a filesystem is unmounted.  If a great many filesystems are mounted,
this can cause a noticable slow-down in, for example, system shutdown.

The sequence:
  mkdir -p /tmp/Mtest/{0..5000}
  time for i in /tmp/Mtest/*; do mount -t tmpfs tmpfs $i ; done
  time umount /tmp/Mtest/*

on a 4-cpu VM can report 8 seconds to mount the tmpfs filesystems, and
100 seconds to unmount them.

Boot the same VM with 1 CPU and it takes 18 seconds to mount the
tmpfs filesystems, but only 36 to unmount.

If we change the synchronize_rcu() to synchronize_rcu_expedited()
the umount time on a 4-cpu VM drop to 0.6 seconds

I think this 200-fold speed up is worth the slightly high system
impact of using synchronize_rcu_expedited().

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> (from general rcu perspective)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
2021-03-29 03:43:58 +05:30
Suren Baghdasaryan
267f03a540 mm: Don't loop through tasks in __set_oom_adj when not necessary
[ Upstream commit 67197a4f28d28d0b073ab0427b03cb2ee5382578 ]

Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm.  This is done for any task with more that one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.

Android updates oom_score_adj whenever a tasks changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important.  Such operation can happen frequently.  We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression.  Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.

Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes.  To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex.  Its scope is changed to global.

The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
the case of vfork().  To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare.  Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.

With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.

[surenb@google.com: v3]
  Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray <timmurray@google.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: Adrian Reber <areber@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
Debugged-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-29 03:37:26 +05:30
Alexey Dobriyan
974e2cf043 fs: proc: Faster /proc/cmdline
Use seq_puts() and skip format string processing.

Link: http://lkml.kernel.org/r/20180309222948.GB3843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
2021-03-28 04:34:17 +05:30
Sultan Alsawaf
fb6704a8d0 proc: cmdline: Patch SafetyNet flags
These can be checked by `cat /proc/cmdline`

Change-Id: I1438f14121f13ee09ae5c7577c7001544416e3c6
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
2021-03-28 04:33:52 +05:30
Vinayak Menon
cb1aedb06e mm: process_reclaim: Consider compound pages
Avoid passing tail pages to isolate_lru_page. In the case
process reclaim, since only lru pages are considered, this
is just to avoid a warning from isolate_lru_page.

Change-Id: I1f54dcec15f8c2d5ba16738657e79d2793d36c77
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
(cherry-picked from commit 2a580c4466231c19452284c9a4cca22486ae8967)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
2021-03-27 22:11:38 +05:30
Jeff Liu
406cbc0149 binfmt_elf: Use get_random_int() to fix entropy depleting
Changes:
--------
v4->v3:
- s/random_stack_user()/get_atrandom_bytes()/
- Move this function to ahead of its use to avoid the predeclaration.

v3->v2:
- Tweak code comments of random_stack_user().
- Remove redundant bits mask and shift upon the random variable.

v2->v1:
- Fix random copy to check up buffer length that are not 4-byte multiples.

v3 can be found at:
http://www.spinics.net/lists/linux-fsdevel/msg59597.html
v2 can be found at:
http://www.spinics.net/lists/linux-fsdevel/msg59418.html
v1 can be found at:
http://www.spinics.net/lists/linux-fsdevel/msg59128.html

Thanks,
-Jeff

Entropy is quickly depleted under normal operations like ls(1), cat(1),
etc...  between 2.6.30 to current mainline, for instance:

$ cat /proc/sys/kernel/random/entropy_avail
3428
$ cat /proc/sys/kernel/random/entropy_avail
2911
$cat /proc/sys/kernel/random/entropy_avail
2620

We observed this problem has been occurring since 2.6.30 with
fs/binfmt_elf.c: create_elf_tables()->get_random_bytes(), introduced by
f06295b44c296c8f ("ELF: implement AT_RANDOM for glibc PRNG seeding").

/*
 * Generate 16 random bytes for userspace PRNG seeding.
 */
get_random_bytes(k_rand_bytes, sizeof(k_rand_bytes));

The patch introduces a wrapper around get_random_int() which has lower
overhead than calling get_random_bytes() directly.

With this patch applied:
$ cat /proc/sys/kernel/random/entropy_avail
2731
$ cat /proc/sys/kernel/random/entropy_avail
2802
$ cat /proc/sys/kernel/random/entropy_avail
2878

Analyzed by John Sobecki.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <aedilger@gmail.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnn@arndb.de>
Cc: John Sobecki <john.sobecki@oracle.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Alex Naidis <alex.naidis@linux.com>
2021-03-27 22:11:36 +05:30