exit_mmap() is responsible for freeing the vast majority of an mm's
memory; in order to unblock Simple LMK faster, report an mm as freed as
soon as exit_mmap() finishes.
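A minimal sketch of the idea (simple_lmk_mm_freed() is the hook name used
by the Simple LMK patchset; treat the placement as illustrative):

  /* mm/mmap.c: notify Simple LMK once exit_mmap() has torn the mm down */
  void exit_mmap(struct mm_struct *mm)
  {
  	/* ... existing teardown frees nearly all of the mm's memory ... */
  	simple_lmk_mm_freed(mm);	/* report this mm as freed now */
  }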
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The OOM killer sets the TIF_MEMDIE thread flag for its victims to alert
other kernel code that the current process was killed due to memory
pressure, and needs to finish whatever it's doing quickly. In the page
allocator this allows victim processes to quickly allocate memory using
emergency reserves. This is especially important when memory pressure is
high; if all processes are taking a while to allocate memory, then our
victim processes will face the same problem and can potentially get
stuck in the page allocator for a while rather than die expeditiously.
To ensure that victim processes die quickly, set TIF_MEMDIE for the
entire victim thread group.
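A sketch of marking the whole thread group with the stock helpers (the
exact spot in the kill path is illustrative):

  struct task_struct *t;

  /* mark every thread in the victim's group, not just the group leader */
  rcu_read_lock();
  for_each_thread(victim, t)
  	set_tsk_thread_flag(t, TIF_MEMDIE);
  rcu_read_unlock();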
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This is a complete low memory killer solution for Android that is small
and simple. Processes are killed according to the priorities that
Android gives them, so that the least important processes are always
killed first. Processes are killed until memory deficits are satisfied,
as observed from kswapd struggling to free up pages. Simple LMK stops
killing processes when kswapd finally goes back to sleep.
The only tunables are the desired amount of memory to be freed per
reclaim event and desired frequency of reclaim events. Simple LMK tries
to free at least the desired amount of memory per reclaim and waits
until all of its victims' memory is freed before proceeding to kill more
processes.
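A heavily simplified sketch of the kill loop described above (names,
ordering and locking are illustrative, not the actual Simple LMK code):

  struct task_struct *tsk;
  unsigned long pages_found = 0;

  rcu_read_lock();
  /* assume iteration visits the least important processes first */
  for_each_process(tsk) {
  	if (tsk->flags & PF_KTHREAD || !tsk->mm)
  		continue;	/* kthreads and exited tasks can't help */
  	pages_found += get_mm_rss(tsk->mm);	/* memory expected back */
  	send_sig(SIGKILL, tsk, 0);
  	if (pages_found >= pages_needed)	/* the per-reclaim tunable */
  		break;
  }
  rcu_read_unlock();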
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
There are cases where an RCU callback needs to be bounced to a sleepable
context. This is currently done by the RCU callback queueing a work
item, which can be cumbersome to write and confusing to read.
This patch introduces rcu_work, a workqueue work variant which gets
executed after a RCU grace period, and converts the open coded
bouncing in fs/aio and kernel/cgroup.
v3: Dropped queue_rcu_work_on(). Documented rcu grace period behavior
after queue_rcu_work().
v2: Use rcu_barrier() instead of synchronize_rcu() to wait for
completion of previously queued rcu callback as per Paul.
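Usage then looks roughly like this (a sketch against the API introduced
here; the object and function names are made up):

  struct my_obj {
  	struct rcu_work rwork;
  	/* ... */
  };

  static void my_obj_free_workfn(struct work_struct *work)
  {
  	struct my_obj *obj =
  		container_of(to_rcu_work(work), struct my_obj, rwork);

  	kfree(obj);	/* process context: sleeping is fine here */
  }

  /* instead of open coding call_rcu() + queue_work() bouncing: */
  INIT_RCU_WORK(&obj->rwork, my_obj_free_workfn);
  queue_rcu_work(system_wq, &obj->rwork);	/* runs after a grace period */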
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit b787e7545d50e2bc01ae791abd3b184dfe42dc42)
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
cgroup already uses floating point for percent[ile] numbers and there
are several controllers which want to take them as input. Add a
generic parse helper to handle inputs.
Update the interface convention documentation about the use of
percentage numbers. While at it, also clarify the default time unit.
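The helper is cgroup_parse_float(); a controller might use it like this
(a sketch, with illustrative variable names):

  s64 v;

  /* parse e.g. "38.50" into fixed point with two decimal places,
   * so on success v == 3850 */
  if (cgroup_parse_float(buf, 2, &v))
  	return -EINVAL;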
Bug: 120440300
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit a5e112e6424adb77d953eac20e6936b952fd6b32)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ic1fcf21d7955eb8edd2e8e91517bca6aef41694f
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
(cherry picked from commit f71e4f444e76b4e495175753382ac343a7778bb7)
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The variable 'cgrp' has been set but unused since commit 76f969e8948d
("cgroup: cgroup v2 freezer").
Remove it to avoid a [-Wunused-but-set-variable] warning.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from 533307dc20a9e84a0687d4ca24aeb669516c0243)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Change-Id: I6221a975c04f06249a4f8d693852776ae08a8d8e
(cherry picked from commit 16824e5cf17313cbc2f67d4547446ddaa4f05258)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
ptrace_stop() does preempt_enable_no_resched() to avoid the preemption,
but after that cgroup_enter_frozen() does spin_lock/unlock and this adds
another preemption point.
Reported-and-tested-by: Bruce Ashfield <bruce.ashfield@gmail.com>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: Ic53e0f2d6624b0bb90817b0c57060fb7db971348
(cherry picked from commit 937c6b27c73e02cd4114f95f5c37ba2c29fadba1)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit e1d012822232da6109493cefd904b827bbd2adf2)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
If a new child cgroup is created in a frozen cgroup hierarchy
(one or more of its ancestor cgroups is frozen), the CGRP_FREEZE cgroup
flag should be set. Otherwise, if a process is attached to the
child cgroup, it won't become frozen.
The problem can be reproduced with the test_cgfreezer_mkdir test.
This is the output before this patch:
~/test_freezer
ok 1 test_cgfreezer_simple
ok 2 test_cgfreezer_tree
ok 3 test_cgfreezer_forkbomb
Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
not ok 4 test_cgfreezer_mkdir
ok 5 test_cgfreezer_rmdir
ok 6 test_cgfreezer_migrate
ok 7 test_cgfreezer_ptrace
ok 8 test_cgfreezer_stopped
ok 9 test_cgfreezer_ptraced
ok 10 test_cgfreezer_vfork
And with this patch:
~/test_freezer
ok 1 test_cgfreezer_simple
ok 2 test_cgfreezer_tree
ok 3 test_cgfreezer_forkbomb
ok 4 test_cgfreezer_mkdir
ok 5 test_cgfreezer_rmdir
ok 6 test_cgfreezer_migrate
ok 7 test_cgfreezer_ptrace
ok 8 test_cgfreezer_stopped
ok 9 test_cgfreezer_ptraced
ok 10 test_cgfreezer_vfork
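The idea of the fix, sketched (the actual patch propagates the parent's
effective freeze state in cgroup_create(); details are illustrative):

  /* kernel/cgroup/cgroup.c: cgroup_create(), sketch */
  if (test_bit(CGRP_FREEZE, &parent->flags))
  	set_bit(CGRP_FREEZE, &cgrp->flags);	/* child must freeze too */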
Reported-by: Mark Crossen <mcrossen@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I6ba7b8dec5600e78bb7448f03fd97a9b43838fa0
(cherry picked from commit 97a61369830ab085df5aed0ff9256f35b07d425a)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit c17a8b6b2fd57447790df4eab8ccee7d1ba80648)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Alex Xu reported a regression in strace, caused by the introduction of
the cgroup v2 freezer. The regression can be reproduced by stracing
the following simple program:
#include <unistd.h>

int main() {
	write(1, "a", 1);
	return 0;
}
An attempt to run strace ./a.out leads to the infinite loop:
[ pre-main omitted ]
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
write(1, "a", 1) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[ repeats forever ]
The problem occurs because the traced task leaves ptrace_stop()
(and the signal handling loop) with the frozen bit set. So let's
call cgroup_leave_frozen(true) unconditionally after sleeping
in ptrace_stop().
With this patch applied, strace works as expected:
[ pre-main omitted ]
write(1, "a", 1) = 1
exit_group(0) = ?
+++ exited with 0 +++
Reported-by: Alex Xu <alex_y_xu@yahoo.ca>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: If644b15ead36ce13f0c2c3dd57eebe3658e3edf7
(cherry picked from commit 05b289263772b0698589abc47771264a685cd365)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit fa924fde25c4b8ac3a95314b169359a3f193d57e)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen task to another cgroup.
This distinguishes the cgroup v2 freezer from the cgroup v1 freezer,
which mostly tried to imitate the system-wide freezer. While
uninterruptible sleep is fine when all tasks are going to be frozen
(the hibernation case), it's not an acceptable state for only a subset
of the system.
The cgroup v2 freezer does not support freezing kthreads.
If a non-root cgroup contains a kthread, the cgroup can still be
frozen, but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH does not work because non-fatal signal delivery
is blocked in the frozen state.
There are some interface differences between the cgroup v1 and cgroup
v2 freezers too, which are required to conform to the cgroup v2
interface design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
Change-Id: I3404119678cbcd7410aa56e9334055cee79d02fa
(cherry picked from commit 76f969e8948d82e78e1bc4beb6b9465908e74873)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit 666d8913b8f1adef750ae86d9acb74c9cb84c4ef)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The helper is identical to the existing cgroup_task_count()
except it doesn't take the css_set_lock by itself, assuming
that the caller does.
Also, move cgroup_task_count() implementation into
kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Change-Id: Iaa9085d2375d395a051543d2555389213c2892d6
(cherry picked from commit aade7f9efba098859681f8e88d81a5b44ad09b12)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit b3ae5f18b2caa9068135e6720a9b4a5dfda1e142)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The callers of cgroup_migrate_prepare_dst() correctly call
cgroup_migrate_finish() for success and failure cases both. No need to
call it in cgroup_migrate_prepare_dst() in failure case.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I785d7ab70a42b1b79aea9852bb14ba5abefcaa9b
(cherry picked from commit d6e486ee0ef2f99a4069d9186e53dac61b28cb3c)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit 4f7bb9d8fc662202bbb835d728e2aa1a55015d16)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
* make the reference from superblock to cgroup_root counting -
do cgroup_put() in cgroup_kill_sb() whether we'd done
percpu_ref_kill() or not; matching grab is done when we allocate
a new root. That gives the same refcounting rules for all callers
of cgroup_do_mount() - a reference to cgroup_root has been grabbed
by caller and it either is transferred to new superblock or dropped.
* have cgroup_kill_sb() treat an already killed refcount as "just
don't bother killing it, then".
* after successful cgroup_do_mount() have cgroup1_mount() recheck
if we'd raced with mount/umount from somebody else and cgroup_root
got killed. In that case we drop the superblock and bugger off
with -ERESTARTSYS, same as if we'd found it in the list already
dying.
* don't bother with delayed initialization of refcount - it's
unreliable and not needed. No need to prevent attempts to bump
the refcount if we find cgroup_root of another mount in progress -
sget will reuse an existing superblock just fine and if the
other sb manages to die before we get there, we'll catch
that immediately after cgroup_do_mount().
* don't bother with kernfs_pin_sb() - no need for doing that
either.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change-Id: I8e088dfc516b76c42d9d4b34db7f49f0cebc5414
(cherry picked from commit 35ac1184244f1329783e1d897f74926d8bb1103a)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit 492eddfab04e41872f549aa92e883275da9d6a4c)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
It can be useful to inhibit all cgroup1 hierarchies especially during
transition and for debugging. cgroup_no_v1 can block hierarchies with
controllers, but that leaves out the named hierarchies. Expand it to
cover the named hierarchies so that "cgroup_no_v1=all,named" disables
all cgroup1 hierarchies.
Suggested-by: Marcin Pawlowski <mpawlowski@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: Ibd093dd9b70d15402a21db3c1ef56005ebc7f99e
(cherry picked from commit 3fc9c12d27b4ded4f1f761a800558dab2e6bbac5)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit eecbbe1429334e1a6884121831e2dad2a17c6248)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
WARN_ON() already contains an unlikely(), so it's not necessary to
wrap it in another one.
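In other words (the condition is illustrative):

  /* before */
  if (unlikely(WARN_ON(!cgrp)))	/* outer unlikely() is redundant */
  	return;
  /* after */
  if (WARN_ON(!cgrp))
  	return;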
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I092c0aae2a06b13d3fc9ecfbb24ab3e8d10235f6
(cherry picked from commit 4d9ebbe2b061a9c25e12ba8539ba172533132eb6)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit 8559142778afbd714e6afd18a25da823d6d88403)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The "cgroup." core interface files bypass the usual interface removal
path and get removed recursively along with the cgroup itself. While
this works now, the subtle discrepancy gets in the way of implementing
common mechanisms.
This patch updates cgroup core interface file handling so that it's
consistent with controller interface files. When added, the css is
marked CSS_VISIBLE and they're explicitly removed before the cgroup is
destroyed.
This doesn't cause user-visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I4091581388cb1514171d6de8fdac5f0fe6ae1695
(cherry picked from commit 5faaf05f2976fd9ec0ecd582bcfb3a41cde4c65e)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit 1c95325d9b62e843a2bfca3a8a3b5d7b11deebb1)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Make cgroup.threads file delegatable.
The behavior of cgroup.threads should follow the behavior of cgroup.procs.
Signed-off-by: Roman Gushchin <guro@fb.com>
Discovered-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I82d23cd511122e5a75b23b26e03ccc9e43b171e5
(cherry picked from commit 4f58424da3deead2605e39a9df65f5f06107a3cb)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit 2823470a35ea6af77b8c3da6ef7d218720a3e426)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
e7fd37ba1217 ("cgroup: avoid copying strings longer than the buffers")
converted possibly unsafe strncpy() usages in cgroup to strscpy().
However, although the callsites are completely fine with truncated
copies, because strscpy() is marked __must_check, it led to the
following warnings.
kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
^
To avoid the warnings, 50034ed49645 ("cgroup: use strlcpy() instead of
strscpy() to avoid spurious warning") switched them to strlcpy().
strlcpy() is worse than strscpy() because it unconditionally runs
strlen() on the source string, and the only reason we switched to
strlcpy() here was that it was lacking __must_check, which doesn't
reflect any material difference between the two functions. It's just
that someone added __must_check to strscpy() and not to strlcpy().
These basic string copy operations are used in variety of ways, and
one of not-so-uncommon use cases is safely handling truncated copies,
where the caller naturally doesn't care about the return value. The
__must_check doesn't match the actual use cases and forces users to
opt for inferior variants which lack __must_check by happenstance or
spread ugly (void) casts.
Remove __must_check from strscpy() and restore strscpy() usages in
cgroup.
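For reference, the behavioral difference between the two (a sketch;
return-value semantics per the kernel string API):

  char buf[CGROUP_FILE_NAME_MAX];

  /* strscpy() stops at the destination size and returns -E2BIG on
   * truncation -- which the cgroup callers simply don't care about */
  strscpy(buf, cft->name, sizeof(buf));

  /* strlcpy() returns strlen(source), so it always walks the whole
   * source string even when the destination is much smaller */
  strlcpy(buf, cft->name, sizeof(buf));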
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
(cherry picked from commit 08a77676f9c5fc69a681ccd2cd8140e65dcb26c7)
[backport the cgroup portions that weren't applied with the earlier
patch
779128d80cb01e6434936e13754fc25a1cc30929 'string: drop __must_check from
strscpy() and restore strscpy() usages in cgroup']
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Change-Id: Iaa636d39d15c44be47fc6b6ba202ecb7ff73c5e7
(cherry picked from commit 0b9e380c2246e105f27816258f030bebdabb598c)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
As long as cft->name is guaranteed to be NUL-terminated, using strlcpy() would
work just as well and avoid that warning, so the change below could be folded
into that commit.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I8215beea12d94fda6a7834f8be6f8e0891285d0e
(cherry picked from commit 50034ed49645463a16327cad05694e201e6b4126)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit f6a58f922163cba0a99c1588b094f127982e87af)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
cgroup root names and file names have maximum length limits; we
should avoid copying names longer than those limits into them.
tj: minor update to $SUBJ.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: Iff4f30be79184f19d9f3ec253bbab9c4ad91f36c
(cherry picked from commit e7fd37ba12170cc414be8b639dfc2c5f7172fac2)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit d4440591421eeda5d5dae5d36883513b7a28f35a)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The active development of cgroup v2 sometimes leads to the creation
of interfaces which are not turned on by default (to provide
backward compatibility). It's handy to know from userspace which
cgroup v2 features are supported without deriving that from the
kernel version. So, let's export the list of such features
using the /sys/kernel/cgroup/features pseudo-file.
The list is hardcoded and has to be extended when new functionality
is added. Each feature is printed on a new line.
Example:
$ cat /sys/kernel/cgroup/features
nsdelegate
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I2baf0b7bcc27491568772defc43a06d0a5ed46bf
(cherry picked from commit 5f2e673405b742be64e7c3604ed4ed3ac14f35ce)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit 0007e6331622a80da8c02653b6319bdc08518910)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Delegatable cgroup v2 control files may require special handling
(e.g. chowning), and the exact list of such files varies between
kernel versions (and is likely to be extended in the future).
To guarantee the correctness of this list and simplify the life
of userspace (systemd, first of all), let's export the list
via the /sys/kernel/cgroup/delegate pseudo-file.
The format is simple: each control file name is printed on a new line.
Example:
$ cat /sys/kernel/cgroup/delegate
cgroup.procs
cgroup.subtree_control
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I9d3143ecbae9d7579d2b1e6ccf381190ef5d3255
(cherry picked from commit 01ee6cfb1483fe57c9cbd8e73817dfbf9bacffd3)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit ac2c0193a454c36919c3b4b952aa9f0e6362ba69)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
These allocations are just 1 KB in size; using kmalloc is not
worth it for them. This should speed up printing of the kernel log
when uptime gets very long.
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Salllz <sal235222727@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Since Infinity Kernel is released as a standalone package, other ROMs
can lack the mm tuning done in the rootdir. For this reason, move
those tweaks directly into the kernel.
Note that dirty_writeback_centisecs and dirty_background_ratio have to
be restricted from writing because Android < 12 modifies these values
badly on boot.
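Restricting the sysctls from being written can be done by dropping the
write bits from their ctl_table mode; a sketch against kernel/sysctl.c,
assuming the stock handler name:

  {
  	.procname	= "dirty_writeback_centisecs",
  	.data		= &dirty_writeback_interval,
  	.maxlen		= sizeof(dirty_writeback_interval),
  	.mode		= 0444,	/* read-only: Android < 12 init would
  				 * otherwise clobber the value on boot */
  	.proc_handler	= dirty_writeback_centisecs_handler,
  },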
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed for a mount
done with user namespace root permissions. In such cases allow_other should
not allow users outside the userns to access the mount as doing so would
give the unprivileged user the ability to manipulate processes it would
otherwise be unable to manipulate. Restrict allow_other to apply to users
in the same userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a module.
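The core of the check, sketched (in fuse_allow_current_process(); the
surrounding logic is elided):

  /* fs/fuse/dir.c: fuse_allow_current_process(), sketch */
  if (fc->allow_other)
  	return current_in_userns(fc->user_ns);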
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Dongsu Park <dongsu@kinvolk.io>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 73f03c2b4b527346778c711c2734dbff3442b139)
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: azrim <mirzaspc@gmail.com>
RCU callbacks are not time-critical and constitute kernel housekeeping.
Offload the no-callback kthreads onto little CPUs to clear load off of
the more important, higher-performance CPUs.
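A sketch of the affinity change (cpu_lp_mask stands in for whatever
cpumask describes the little cluster on a given SoC):

  /* pin each no-CBs (rcuo) kthread to the little cluster */
  set_cpus_allowed_ptr(t, cpu_lp_mask);	/* t: the rcuo kthread */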
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Commit "rcu: Create RCU-specific workqueues with rescuers" switched RCU
to using local workqueses and removed power efficiency flag from them.
This caused a performance regression that can be observed in Geekbench 5
after enabling CONFIG_WQ_POWER_EFFICIENT_DEFAULT: score went down from
760/2500 to 620/2300 (single/multi core respectively).
Add WQ_POWER_EFFICIENT flag to avoid this regression.
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: UtsavisGreat <utsavbalar1231@gmail.com>
Signed-off-by: billaids <jimmy.nelle@hsw-stud.de>
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
commit 614ddad17f22a22e035e2ea37a04815f50362017 upstream.
Currently, rcu_advance_cbs_nowake() checks that a grace period is in
progress, however, that grace period could end just after the check.
This commit rechecks that a grace period is still in progress while
holding the rcu_node structure's lock. The grace period cannot end while
the current CPU's rcu_node structure's ->lock is held, thus avoiding
false positives from the WARN_ON_ONCE().
As Daniel Vacek noted, it is not necessary for the rcu_node structure
to have a CPU that has not yet passed through its quiescent state.
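The fix, sketched as the lock-then-recheck pattern (not the literal
diff):

  raw_spin_lock_rcu_node(rnp);
  /* recheck under ->lock: the grace period seen by the lockless check
   * may have ended since, but it cannot end while this lock is held */
  if (!rcu_gp_in_progress()) {
  	raw_spin_unlock_rcu_node(rnp);
  	return;
  }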
Tested-by: Guillaume Morin <guillaume@morinfr.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: azrim <mirzaspc@gmail.com>
Expedited RCU grace periods invoke sync_rcu_exp_select_node_cpus(), which
takes two passes over the leaf rcu_node structure's CPUs. The first
pass gathers up the current CPU and CPUs that are in dynticks idle mode.
The workqueue will report a quiescent state on their behalf later.
The second pass sends IPIs to the rest of the CPUs, but excludes the
current CPU, incorrectly assuming it has been included in the first
pass's list of CPUs.
Unfortunately the current CPU may have changed between the first and
second pass, due to the fact that the various rcu_node structures'
->lock fields have been dropped, thus momentarily enabling preemption.
This means that if the second pass's CPU was not on the first pass's
list, it will be ignored completely. There will be no IPI sent to
it, and there will be no reporting of quiescent states on its behalf.
Unfortunately, the expedited grace period will nevertheless be waiting
for that CPU to report a quiescent state, but with that CPU having no
reason to believe that such a report is needed.
The result will be an expedited grace period stall.
Fix this by no longer excluding the current CPU from consideration during
the second pass.
Fixes: b9ad4d6ed18e ("rcu: Avoid self-IPI in sync_rcu_exp_select_node_cpus()")
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: azrim <mirzaspc@gmail.com>
This commit marks accesses to the rcu_state.n_force_qs. These data
races are hard to make happen, but syzkaller was equal to the task.
Reported-by: syzbot+e08a83a1940ec3846cd5@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: azrim <mirzaspc@gmail.com>
The sync_sched_exp_online_cleanup() checks to see if RCU needs
an expedited quiescent state from the incoming CPU, sending it
an IPI if so. Before sending the IPI, it checks whether an expedited
quiescent state has already been requested for the incoming CPU, by
checking rcu_data.cpu_no_qs.b.exp for the current CPU, on which
sync_sched_exp_online_cleanup() is running. This works for the
case where the incoming CPU is the same as self. However, for the case
where the incoming CPU is different from self, the expedited request
won't get marked, which can potentially delay reporting of the
expedited quiescent state for the incoming CPU.
Fixes: e015a3411220 ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: azrim <mirzaspc@gmail.com>
The soft watchdog timer function checks if a virtual machine
was suspended and hence what looks like a lockup in fact
is a false positive.
This is what kvm_check_and_clear_guest_paused() does: it
tests guest PVCLOCK_GUEST_STOPPED (which is set by the host)
and if it's set then we need to touch all watchdogs and bail
out.
Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED
check works fine.
There is, however, one more watchdog that runs from IRQ and thus
races with the watchdog timer fn, yet is not aware of
PVCLOCK_GUEST_STOPPED: the RCU stall detector.
apic_timer_interrupt()
 smp_apic_timer_interrupt()
  hrtimer_interrupt()
   __hrtimer_run_queues()
    tick_sched_timer()
     tick_sched_handle()
      update_process_times()
       rcu_sched_clock_irq()
This triggers RCU stalls on our devices during VM resume.
If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU
before watchdog_timer_fn()->kvm_check_and_clear_guest_paused()
then there is nothing on this VCPU that touches watchdogs and
RCU reads stale gp stall timestamp and new jiffies value, which
makes it think that RCU has stalled.
Make RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
don't report RCU stalls when we resume the VM.
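The gist of the change, sketched (placement in the stall-check path is
illustrative):

  /* before reporting an RCU stall, sketch */
  if (kvm_check_and_clear_guest_paused())
  	return;	/* the host paused the VM; not a real stall */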
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: azrim <mirzaspc@gmail.com>
An srcu_struct structure that is initialized before rcu_init_geometry()
will have its srcu_node hierarchy based on CONFIG_NR_CPUS. Once
rcu_init_geometry() is called, this hierarchy is compressed as needed
for the actual maximum number of CPUs for this system.
Later on, that srcu_struct structure is confused, sometimes referring
to its initial CONFIG_NR_CPUS-based hierarchy, and sometimes instead
to the new num_possible_cpus() hierarchy. For example, each of its
->mynode fields continues to reference the original leaf rcu_node
structures, some of which might no longer exist. On the other hand,
srcu_for_each_node_breadth_first() traverses to the new node hierarchy.
There are at least two bad possible outcomes to this:
1) a) A callback enqueued early on an srcu_data structure (call it
*sdp) is recorded pending on sdp->mynode->srcu_data_have_cbs in
srcu_funnel_gp_start() with sdp->mynode pointing to a deep leaf
(say 3 levels).
b) The grace period ends after rcu_init_geometry() shrinks the
nodes level to a single one. srcu_gp_end() walks through the new
srcu_node hierarchy without ever reaching the old leaves so the
callback is never executed.
This is easily reproduced on an 8 CPUs machine with CONFIG_NR_CPUS >= 32
and "rcupdate.rcu_self_test=1". The srcu_barrier() after early tests
verification never completes and the boot hangs:
[ 5413.141029] INFO: task swapper/0:1 blocked for more than 4915 seconds.
[ 5413.147564] Not tainted 5.12.0-rc4+ #28
[ 5413.151927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5413.159753] task:swapper/0 state:D stack: 0 pid: 1 ppid: 0 flags:0x00004000
[ 5413.168099] Call Trace:
[ 5413.170555] __schedule+0x36c/0x930
[ 5413.174057] ? wait_for_completion+0x88/0x110
[ 5413.178423] schedule+0x46/0xf0
[ 5413.181575] schedule_timeout+0x284/0x380
[ 5413.185591] ? wait_for_completion+0x88/0x110
[ 5413.189957] ? mark_held_locks+0x61/0x80
[ 5413.193882] ? mark_held_locks+0x61/0x80
[ 5413.197809] ? _raw_spin_unlock_irq+0x24/0x50
[ 5413.202173] ? wait_for_completion+0x88/0x110
[ 5413.206535] wait_for_completion+0xb4/0x110
[ 5413.210724] ? srcu_torture_stats_print+0x110/0x110
[ 5413.215610] srcu_barrier+0x187/0x200
[ 5413.219277] ? rcu_tasks_verify_self_tests+0x50/0x50
[ 5413.224244] ? rdinit_setup+0x2b/0x2b
[ 5413.227907] rcu_verify_early_boot_tests+0x2d/0x40
[ 5413.232700] do_one_initcall+0x63/0x310
[ 5413.236541] ? rdinit_setup+0x2b/0x2b
[ 5413.240207] ? rcu_read_lock_sched_held+0x52/0x80
[ 5413.244912] kernel_init_freeable+0x253/0x28f
[ 5413.249273] ? rest_init+0x250/0x250
[ 5413.252846] kernel_init+0xa/0x110
[ 5413.256257] ret_from_fork+0x22/0x30
2) An srcu_struct structure that is initialized before rcu_init_geometry()
and used afterward will always have stale sdp->mynode references,
resulting in callbacks to be missed in srcu_gp_end(), just like in
the previous scenario.
This commit therefore causes init_srcu_struct_nodes to initialize the
geometry, if needed. This ensures that the srcu_node hierarchy is
properly built and distributed from the get-go.
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: azrim <mirzaspc@gmail.com>
Currently, rcu_spawn_core_kthreads() is invoked via an early_initcall(),
which works, except that rcu_spawn_gp_kthread() is also invoked via an
early_initcall() and rcu_spawn_core_kthreads() relies on adjustments to
kthread_prio that are carried out by rcu_spawn_gp_kthread(). There is
no guarantee of ordering among early_initcall() handlers, and thus no
guarantee that kthread_prio will be properly checked and range-limited
at the time that rcu_spawn_core_kthreads() needs it.
In most cases, this bug is harmless. After all, the only reason that
rcu_spawn_gp_kthread() adjusts the value of kthread_prio is if the user
specified a nonsensical value for this boot parameter, which experience
indicates is rare.
Nevertheless, a bug is a bug. This commit therefore causes the
rcu_spawn_core_kthreads() function to be invoked directly from
rcu_spawn_gp_kthread() after any needed adjustments to kthread_prio have
been carried out.
Fixes: 48d07c04b4cc ("rcu: Enable elimination of Tree-RCU softirq processing")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: azrim <mirzaspc@gmail.com>
Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
kthread (rcuog) to be serviced.
Usually a local wake up happening while running the idle task is handled
in one of the need_resched() checks carefully placed within the idle
loop that can break to the scheduler.
Unfortunately the call to rcu_idle_enter() is already beyond the last
generic need_resched() check and we may halt the CPU with a resched
request unhandled, leaving the task hanging.
Fix this by splitting the rcuog wakeup handling out of rcu_idle_enter()
and placing it before the last generic need_resched() check in the idle
loop. It is then assumed that no call to call_rcu() will be performed
after that in the idle loop until the CPU is put in low power mode.
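The resulting idle-loop ordering, sketched (do_idle() heavily
simplified; the new call is rcu_nocb_flush_deferred_wakeup()):

  /* kernel/sched/idle.c: do_idle(), simplified sketch */
  while (!need_resched()) {
  	local_irq_disable();
  	/* flush any deferred rcuog wakeup while a resulting
  	 * need_resched() can still be seen before halting the CPU */
  	rcu_nocb_flush_deferred_wakeup();
  	/* ... architecture idle entry follows ... */
  }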
Fixes: 96d3fd0d315a (rcu: Break call_rcu() deadlock involving scheduler and perf)
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-3-frederic@kernel.org
Signed-off-by: azrim <mirzaspc@gmail.com>
Deferred wakeup of rcuog kthreads upon RCU idle mode entry is going to
be handled differently depending on whether it is initiated by idle,
user or guest. Prepare by pulling that control up to rcu_eqs_enter()'s
callers.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210131230548.32970-2-frederic@kernel.org
Signed-off-by: azrim <mirzaspc@gmail.com>
Joel Fernandes found that the synchronize_rcu_tasks() was taking a
significant amount of time. He demonstrated it with the following test:
# cd /sys/kernel/tracing
# while [ 1 ]; do x=1; done &
# echo '__schedule_bug:traceon' > set_ftrace_filter
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m1.064s
user 0m0.000s
sys 0m0.004s
Where it takes a little over a second to perform the synchronize,
because there's a loop that waits 1 second at a time for tasks to get
through their quiescent points when there's a task that must be waited
for.
After discussion we came up with a simple way to wait for holdouts:
increase the wait time on each iteration of the loop, up to a maximum
of one full second.
With the new patch we have:
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m0.131s
user 0m0.000s
sys 0m0.004s
Which drops it down to 13% of what the original wait time was.
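The wait loop change, sketched from the description above (the loop
condition is illustrative; the first wait is HZ/10, each following wait
grows toward a full second):

  int fract = 10;	/* first wait is HZ/10 */

  while (holdouts_remain()) {	/* illustrative condition */
  	schedule_timeout_interruptible(HZ / fract);
  	if (fract > 1)
  		fract--;	/* back off, capped at one second */
  }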
Link: http://lkml.kernel.org/r/20180523063815.198302-2-joel@joelfernandes.org
Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
Instrumenting print_time() is insufficient for adjusting the timestamps
of logged messages. Add the sleep time offset to the underlying message
timestamps as well to address this issue.
Test: dmesg shows true boot time after several suspend/resume cycles
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Fiqri Ardyansyah <fiqri15072019@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: Jebaitedneko <Jebaitedneko@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
cpu_clock() uses monotonic time, which does not advance while the system
is suspended, making it difficult to interpret kmsg timestamps.
Add the sleep time offset to cpu_clock() in order to make kmsg timestamps
reflect the actual boot time.
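An equivalent way to express the effect, sketched (reading a boot-time
clock, which already includes suspend time, instead of patching
cpu_clock() itself):

  /* suspend-unaware scheduler clock: skews across suspend */
  u64 ts = local_clock();

  /* boot-time clock: keeps counting while suspended (NMI-safe) */
  u64 ts_boot = ktime_get_boot_fast_ns();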
Signed-off-by: Sultanxda <sultanxda@gmail.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Fiqri Ardyansyah <fiqri15072019@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: Jebaitedneko <Jebaitedneko@gmail.com>
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>