Simple LMK's reclaim thread needs to run as quickly as possible to
reduce memory allocation latency when memory pressure is high. Run it
on fast, big cluster CPUs.
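A minimal sketch of that pinning, assuming a hypothetical
start_reclaim_thread() wrapper and big-core IDs 4-7 (the real mask
would come from the platform's CPU topology):

#include <linux/cpumask.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int simple_lmk_reclaim_thread(void *data);

static void start_reclaim_thread(void)
{
	struct cpumask big_cpus;
	struct task_struct *thread;

	/* Assumed big-core IDs for a typical 4+4 big.LITTLE SoC */
	cpumask_clear(&big_cpus);
	cpumask_set_cpu(4, &big_cpus);
	cpumask_set_cpu(5, &big_cpus);
	cpumask_set_cpu(6, &big_cpus);
	cpumask_set_cpu(7, &big_cpus);

	thread = kthread_run(simple_lmk_reclaim_thread, NULL, "simple_lmkd");
	if (!IS_ERR(thread))
		set_cpus_allowed_ptr(thread, &big_cpus);
}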
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I6fef904710722b89b19cf119bf92779c364b2c2e
Provide userspace with a mechanism to discover features supported by
the binder driver, so it can refrain from using any unsupported ones in
the first place. Starting with "oneway_spam_detection", only new
features are listed under binderfs; all previous ones are assumed to be
supported.
Assuming an instance of binderfs has been mounted at /dev/binderfs,
binder feature files can be found under /dev/binderfs/features/.
Usage example:
$ mkdir /dev/binderfs
$ mount -t binder binder /dev/binderfs
$ cat /dev/binderfs/features/oneway_spam_detection
1
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20210715031805.1725878-1-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jebaitedneko <Jebaitedneko@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Commit e00eb41c0c7b ("ANDROID: binder: add support for RT prio
inheritance.") added the use of MAX_USER_RT_PRIO to the binder.c code,
but that commit was never sent upstream.
In commit ae18ad281e82 ("sched: Remove MAX_USER_RT_PRIO"), that define
was taken away, so to fix up the build breakage, move the binder code to
use MAX_RT_PRIO instead of the now-removed MAX_USER_RT_PRIO define.
Hopefully this is correct, who knows, it's binder RT code! :)
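For reference, the two macros were defined as equals in
include/linux/sched/prio.h before the removal, so the substitution
should be behavior-preserving:

#define MAX_USER_RT_PRIO	100
#define MAX_RT_PRIO		MAX_USER_RT_PRIO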
Bug: 34461621
Bug: 37293077
Bug: 120446518
Fixes: e00eb41c0c7b ("ANDROID: binder: add support for RT prio inheritance.")
Fixes: ae18ad281e82 ("sched: Remove MAX_USER_RT_PRIO")
Cc: Martijn Coenen <maco@google.com>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Alistair Strachan <astrachan@google.com>
Cc: Todd Kjos <tkjos@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I66b85e99697fdde462ec2d2ade5c92d5917896a3
Signed-off-by: Jebaitedneko <Jebaitedneko@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
* google/android-4.14-stable:
FROMLIST: binder: fix UAF of ref->proc caused by race condition
UPSTREAM: x86/pci: Fix the function type for check_reserved_t
Linux 4.14.290
PCI: hv: Fix interrupt mapping for multi-MSI
PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
PCI: hv: Fix multi-MSI to allow more than one MSI vector
net: usb: ax88179_178a needs FLAG_SEND_ZLP
tty: use new tty_insert_flip_string_and_push_buffer() in pty_write()
tty: extract tty_flip_buffer_commit() from tty_flip_buffer_push()
tty: drop tty_schedule_flip()
tty: the rest, stop using tty_schedule_flip()
tty: drivers/tty/, stop using tty_schedule_flip()
Bluetooth: Fix bt_skb_sendmmsg not allocating partial chunks
Bluetooth: SCO: Fix sco_send_frame returning skb->len
Bluetooth: Fix passing NULL to PTR_ERR
Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg
Bluetooth: SCO: Replace use of memcpy_from_msg with bt_skb_sendmsg
Bluetooth: Add bt_skb_sendmmsg helper
Bluetooth: Add bt_skb_sendmsg helper
ALSA: memalloc: Align buffer allocations in page size
tilcdc: tilcdc_external: fix an incorrect NULL check on list iterator
drm/tilcdc: Remove obsolete crtc_mode_valid() hack
bpf: Make sure mac_header was set before using it
mm/mempolicy: fix uninit-value in mpol_rebind_policy()
Revert "Revert "char/random: silence a lockdep splat with printk()""
be2net: Fix buffer overflow in be_get_module_eeprom
tcp: Fix a data-race around sysctl_tcp_notsent_lowat.
igmp: Fix a data-race around sysctl_igmp_max_memberships.
igmp: Fix data-races around sysctl_igmp_llm_reports.
net: stmmac: fix dma queue left shift overflow issue
i2c: cadence: Change large transfer count reset logic to be unconditional
tcp: Fix a data-race around sysctl_tcp_probe_interval.
tcp: Fix a data-race around sysctl_tcp_probe_threshold.
tcp/dccp: Fix a data-race around sysctl_tcp_fwmark_accept.
ip: Fix a data-race around sysctl_fwmark_reflect.
perf/core: Fix data race between perf_event_set_output() and perf_mmap_close()
power/reset: arm-versatile: Fix refcount leak in versatile_reboot_probe
xfrm: xfrm_policy: fix a possible double xfrm_pols_put() in xfrm_bundle_lookup()
xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE
A transaction of type BINDER_TYPE_WEAK_HANDLE can fail to increment the
reference for a node. In this case, the target proc normally releases
the failed reference upon close as expected. However, if the target is
dying in parallel, the call races with binder_deferred_release(), so
the target could have released all of its references by then, leaving
the cleanup of the new failed reference unhandled.
The transaction then ends and the target proc is released, making
ref->proc a dangling pointer. Later on, ref->node is closed and we
attempt to take spin_lock(&ref->proc->inner_lock), which leads to the
use-after-free bug reported below. Let's fix this by cleaning up the
failed reference on the spot instead of relying on the target to do so.
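A rough sketch of that cleanup inside binder_inc_ref_for_node(),
paraphrased from the upstream change (not a literal diff):

	ret = binder_inc_ref_olocked(ref, strong, target_list);
	*rdata = ref->data;
	if (ret && ref == new_ref) {
		/*
		 * Clean up the failed reference here rather than on
		 * target close; the target could already be dead and
		 * have dropped all of its references by now.
		 */
		binder_cleanup_ref_olocked(new_ref);
		ref = NULL;
	}
	binder_proc_unlock(proc);
	if (new_ref && ref != new_ref)
		/* another thread installed a ref first; free ours */
		kfree(new_ref);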
==================================================================
BUG: KASAN: use-after-free in _raw_spin_lock+0xa8/0x150
Write of size 4 at addr ffff5ca207094238 by task kworker/1:0/590
CPU: 1 PID: 590 Comm: kworker/1:0 Not tainted 5.19.0-rc8 #10
Hardware name: linux,dummy-virt (DT)
Workqueue: events binder_deferred_func
Call trace:
dump_backtrace.part.0+0x1d0/0x1e0
show_stack+0x18/0x70
dump_stack_lvl+0x68/0x84
print_report+0x2e4/0x61c
kasan_report+0xa4/0x110
kasan_check_range+0xfc/0x1a4
__kasan_check_write+0x3c/0x50
_raw_spin_lock+0xa8/0x150
binder_deferred_func+0x5e0/0x9b0
process_one_work+0x38c/0x5f0
worker_thread+0x9c/0x694
kthread+0x188/0x190
ret_from_fork+0x10/0x20
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Bug: 239630375
Link: https://lore.kernel.org/all/20220801182511.3371447-1-cmllamas@google.com/
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Change-Id: I5085dd0dc805a780a64c057e5819f82dd8f02868
(cherry picked from commit ae3fa5d16a02ba7c7b170e0e1ab56d6f0ba33964)
Binder code is very hot, so checking frequently to see if a debug
message should be printed is a waste of cycles. We're not debugging
binder, so just stub out the debug prints to compile them out entirely.
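A sketch of the stub; binder_debug() normally tests binder_debug_mask
and calls pr_info_ratelimited() on every invocation, whereas the stub
makes each call site compile to nothing:

#define binder_debug(mask, x...) do {} while (0)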
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
If it's possible for a task to have no pages, then there could be a case
where `pages_found` is zero while `nr_found` isn't, which would cause
the found tasks' locks to never be unlocked, and thus mayhem. We can
change the `pages_found` check to use `nr_found` instead in order to
naturally defend against this scenario, in case it is indeed possible.
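A minimal sketch of the defensive change (victims[] and the counters
are assumptions standing in for the driver's actual variables):

	/*
	 * Gate cleanup on nr_found, not pages_found: a found task can
	 * contribute zero pages but must still be task_unlock()'d.
	 */
	if (!nr_found)
		return 0;

	for (i = 0; i < nr_found; i++)
		task_unlock(victims[i].tsk);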
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Hard-coding adj ranges to search for victims results in a few problems.
Firstly, the hard-coded adjs must be vigilantly updated to match what
userspace uses, which makes long-term support a headache. Secondly, a
full traversal of every running process must be done for each adj range,
which can turn out to be quite expensive, especially if userspace
assigns many different adj values and we want to enumerate them all.
This leads us to the final problem, which is that processes with
different adjs within the same hard-coded adj range will be treated the
same, even though they're not: the process with a higher adj is less
important, and the process with a lower adj is more important. This
could be fixed by enumerating every possible adj, but again, that would
necessitate several scans through the active process list, which is bad
for performance, especially since latency is critical here.
Since adjs are only 16 bits, and we only care about positive adjs, that
leaves us with 15 bits of the adj that matter. This is a relatively
small number of potential adjs (32,768), which makes it possible to
allocate a static array that's indexed using the adj. Each entry in this
array is a pointer to the first task_struct in a singly-linked list of
task_structs sharing an adj. A `simple_lmk_next` member is added to
task_struct to accommodate this linked list. The victim finder now
iterates downward through the array searching for linked lists of tasks,
starting from the highest adj found, so that the lowest-priority
processes are always considered first for reclaim. This fixes all of the
problems mentioned above, and now there is only one traversal through
every running process. The array itself only takes up 256 KiB of memory
on 64-bit, which is a very small price to pay for the advantages gained.
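A condensed sketch of the scheme; simple_lmk_next is the new
task_struct member, while the other names here are assumptions:

#define ADJ_MAX 32767	/* 15 usable bits of positive adj */

/* 32,768 pointers: 256 KiB on 64-bit */
static struct task_struct *adj_buckets[ADJ_MAX + 1];

static void bucket_task(struct task_struct *tsk, short adj)
{
	/* Prepend to the singly-linked list of tasks sharing this adj */
	tsk->simple_lmk_next = adj_buckets[adj];
	adj_buckets[adj] = tsk;
}

static void scan_victims(void)
{
	int adj;

	/* Higher adj is less important, so iterate downward from the top */
	for (adj = ADJ_MAX; adj >= 0; adj--) {
		struct task_struct *tsk;

		for (tsk = adj_buckets[adj]; tsk; tsk = tsk->simple_lmk_next)
			consider_victim(tsk);	/* hypothetical */
	}
}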
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The victims array and mm_free_lock data structures can be used very
heavily in parallel on SMP, in which case they would benefit from being
cacheline-aligned. Make it so for SMP.
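A sketch of the annotation, assuming mm_free_lock is an rwlock (the
actual lock type may differ):

static struct victim_info victims[MAX_VICTIMS] __cacheline_aligned_in_smp;
static rwlock_t mm_free_lock __cacheline_aligned_in_smp =
	__RW_LOCK_UNLOCKED(mm_free_lock);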
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
When sort() isn't provided with a custom swap function, it falls back
onto its generic implementation of just swapping one byte at a time,
which is quite slow. Since we know the type of the objects being sorted,
we can provide our own swap function which simply uses the swap() macro.
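A sketch of the typed callback, assuming a struct victim_info element
type and a victim_cmp() comparator:

static void victim_swap(void *lhs, void *rhs, int size)
{
	/* Both elements are victim_info structs, so swap them whole */
	swap(*(struct victim_info *)lhs, *(struct victim_info *)rhs);
}

/* sort(victims, nr_victims, sizeof(*victims), victim_cmp, victim_swap); */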
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
When there aren't enough pages found, it means all of the victims that
were found need to be killed. The additional processing that attempts to
reduce the number of victims can be skipped in this case.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
There's no reason to pass this constant around in a parameter.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
When the mm_free_lock write lock is held, it means that reclaim is
either starting or ending, in which case there's nothing that needs to
be done in simple_lmk_mm_freed(). We can use a trylock here instead to
avoid blocking.
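A sketch of the non-blocking path, again assuming an rwlock and
eliding the matching loop:

static void simple_lmk_mm_freed(struct mm_struct *mm)
{
	int i;

	/* Reclaim holds the write lock only while starting or ending,
	 * and there is nothing to mark in that window, so don't block. */
	if (!read_trylock(&mm_free_lock))
		return;

	for (i = 0; i < nr_victims; i++) {
		/* match and mark the freed mm ... */
	}

	read_unlock(&mm_free_lock);
}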
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Simple LMK now uses VM pressure rather than the old kswapd hook. Update
the Kconfig description to reflect this.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
When PSI is enabled, lmkd in userspace will use PSI notifications to
perform low memory kills. Therefore, to ensure that Simple LMK is the
only active LMK implementation, add a !PSI dependency.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This aids in selecting an adequate timeout. If the timeout is hit often
and Simple LMK is killing too much, then the timeout should be
lengthened. If the timeout is rarely hit and Simple LMK is not killing
fast enough under pressure, then the timeout should be shortened.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Zeroing out the mm struct pointers when the timeout is hit isn't needed
because mm_free_lock prevents any readers from accessing the mm struct
pointers while clean-up occurs, and since the simple_lmk_mm_freed() loop
bound is set to zero during clean-up, there is no possibility of dying
processes ever reading stale mm struct pointers.
Therefore, it is unnecessary to clear out the mm struct pointers when
the timeout is reached. Now the only step to do when the timeout is
reached is to re-init the completion, but since reinit_completion() just
sets a struct member to zero, call reinit_completion() unconditionally
as it is faster than encapsulating it within a conditional statement.
Also take this opportunity to rename some variables and tidy up some
code indentation.
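For reference, reinit_completion() is just a single store, which is why
the unconditional call is cheap:

static inline void reinit_completion(struct completion *x)
{
	x->done = 0;
}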
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
We already check that each eligible process isn't dying, so an RCU read
lock can be used to speed things up instead of holding the tasklist
read lock.
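A sketch of the traversal under RCU, with the eligibility checks
elided:

	struct task_struct *tsk;

	rcu_read_lock();
	for_each_process(tsk) {
		if (tsk->flags & PF_KTHREAD)
			continue;
		/* skip already-dying tasks, then evaluate as a victim */
	}
	rcu_read_unlock();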
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
We are allowed to kill any process with a positive adj, so we shouldn't
exclude any processes with adjs greater than 999. This would present a
problem with quirky applications that set their own adj score, such as
stress-ng. In the case of stress-ng, it would set its adj score to 1000
and thus exempt itself from being killed by Simple LMK. This shouldn't
be allowed; any process with a positive adj, up to the highest possible
adj (32767), should be killable.
Reported-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Android 10 changed its adj assignments. Update Simple LMK to use the
new adjs, which also requires looking at each pair of adjs as a range.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Using kswapd's scan depth to trigger task kills is inconsistent and
unreliable. When memory pressure quickly spikes, the kswapd scan depth
trigger fails to kick off Simple LMK fast enough, causing severe lag.
Additionally, kswapd could stop scanning prematurely before reaching the
desired scan depth to trigger Simple LMK, which could also cause stalls.
To remedy this, use the vmpressure framework instead, since it provides
more consistent and accurate readings on memory pressure. This is not
very tunable though, so remove CONFIG_ANDROID_SIMPLE_LMK_AGGRESSION.
Triggering Simple LMK to kill when the reported memory pressure is 100
should yield good results on all setups.
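A sketch of the trigger point; the hook out of mm/vmpressure.c is
paraphrased and the entry-point name is hypothetical:

/* Called from the vmpressure path with pressure scaled to [0, 100] */
static void simple_lmk_vmpressure(unsigned long pressure)
{
	if (pressure == 100)
		simple_lmk_trigger();
}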
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Swap memory usage is important when determining what to kill, so include
it in the victim size calculation.
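A sketch of the size calculation with swap included (the helper name is
hypothetical; the counters are the standard mm ones):

static unsigned long victim_size(struct mm_struct *mm)
{
	/* resident pages plus pages swapped out on the victim's behalf */
	return get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS);
}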
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
wake_up() executes a full memory barrier when waking a process up, so
there's no need for the acquire in the wait event. Additionally,
because of this, the atomic_cmpxchg() only needs a read barrier.
The cmpxchg() in simple_lmk_mm_freed() is atomic when it doesn't need to
be, so replace it with an extra line of code.
The atomic_inc_return() in simple_lmk_mm_freed() lies within a lock, so
it doesn't need explicit memory barriers.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Just increasing the victim's priority to the maximum niceness isn't
enough to make it totally preempt everything in SCHED_FAIR, which is
important to make sure victims die quickly. Resource-wise, this isn't
very burdensome since the RT priority is just set to zero, and because
dying victims don't have much to do: they only need to finish whatever
they're doing quickly. SCHED_RR is used over SCHED_FIFO so that CPU time
between the victims is divided evenly to help them all finish at around
the same time, as fast as possible.
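A sketch of the promotion, mirroring the commit's wording that the RT
priority is left at zero (the helper name is hypothetical):

static void boost_victim(struct task_struct *tsk)
{
	static const struct sched_param param = { .sched_priority = 0 };

	/* SCHED_RR so victims preempt SCHED_FAIR yet share CPU evenly */
	sched_setscheduler_nocheck(tsk, SCHED_RR, &param);
}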
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Simple LMK tries to wait until all of the victims it kills have their
memory freed; however, sometimes victims can take a while to die, which
can block Simple LMK from killing more processes in time when needed.
After the specified timeout elapses, Simple LMK will stop waiting and
make itself available to kill more processes.
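A sketch of the bounded wait (the completion and timeout names are
assumptions; the timeout value comes from the Kconfig knob):

	/* Wait for victims to free their memory, but only so long */
	if (!wait_for_completion_timeout(&reclaim_done,
					 msecs_to_jiffies(reclaim_timeout_ms)))
		pr_info("simple_lmk: timeout waiting for victims\n");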
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Dying processes aren't going to help free memory, so ignore them.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
set_user_nice() doesn't schedule, and although set_cpus_allowed_ptr()
can schedule, it will only do so when the specified task cannot run on
the new set of allowed CPUs. Since cpu_all_mask is used,
set_cpus_allowed_ptr() will never schedule. Therefore, both the priority
elevation and cpus_allowed change can be moved to inside the task lock
to simplify and speed things up.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The OOM killer sets the TIF_MEMDIE thread flag for its victims to alert
other kernel code that the current process was killed due to memory
pressure, and needs to finish whatever it's doing quickly. In the page
allocator this allows victim processes to quickly allocate memory using
emergency reserves. This is especially important when memory pressure is
high; if all processes are taking a while to allocate memory, then our
victim processes will face the same problem and can potentially get
stuck in the page allocator for a while rather than die expeditiously.
To ensure that victim processes die quickly, set TIF_MEMDIE for the
entire victim thread group.
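A sketch of flagging the whole thread group, assuming the caller
already holds whatever locks the driver requires:

	struct task_struct *t;

	for_each_thread(victim, t)
		set_tsk_thread_flag(t, TIF_MEMDIE);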
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Makes it clear that Simple LMK tried its best but there was nothing it
could do.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Queuing up reclaim requests while a reclaim is in progress doesn't make
sense, since the additional reclaims may not be needed after the
existing reclaim completes. This would cause Simple LMK to go berserk
during periods of high memory pressure where kswapd would fire off
reclaim requests nonstop.
Make Simple LMK ignore new reclaim requests until an existing reclaim is
finished to prevent a slaughter-fest.
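A sketch of the gating (the helper names are hypothetical):

static atomic_t reclaim_busy = ATOMIC_INIT(0);

static void simple_lmk_trigger(void)
{
	/* A reclaim pass already in flight absorbs this request */
	if (atomic_cmpxchg(&reclaim_busy, 0, 1))
		return;

	do_reclaim();	/* kill victims and wait for their memory */
	atomic_set(&reclaim_busy, 0);
}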
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
After commit "simple_lmk: Make reclaim deterministic", Simple LMK's
behavior changed and thus requires some slight re-tuning to make it work
well again.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Using a parameter to pass around an unmodified pointer to a global
variable is crufty; just use the `victims` variable directly instead.
Also, compress the code in simple_lmk_init_set() a bit to make it look
cleaner.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The 20 ms delay in the reclaim thread is a hacky fudge factor that can
cause Simple LMK to behave wildly differently depending on the
circumstances of when it is invoked. When kswapd doesn't get enough CPU
time to finish up and go back to sleep within 20 ms, Simple LMK performs
superfluous reclaims.
This is suboptimal, so make Simple LMK more deterministic by eliminating
the delay and instead queuing up reclaim requests from kswapd.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
When the reclaim thread writes to victims_to_kill on one CPU, it expects
the updated value to be immediately reflected on all CPUs in order for
simple_lmk_mm_freed() to work correctly. Due to the lack of memory
barriers to guarantee multicopy atomicity, simple_lmk_mm_freed() can be
given a victim's mm without knowing the correct victims_to_kill value,
which can cause the reclaim thread to remain stuck waiting forever for
all victims to be freed. This scenario, despite being rare, has been
observed.
Fix this by using proper atomic helpers with memory barriers.
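One possible pairing, sketched with acquire/release atomics (the real
change may use stronger barriers):

	/* reclaim side: publish the victim count */
	atomic_set_release(&victims_to_kill, nr_to_kill);

	/* simple_lmk_mm_freed() side: observe it with matching ordering */
	nr = atomic_read_acquire(&victims_to_kill);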
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
cmpxchg() is only atomic with respect to the local CPU, so it cannot be
relied on with how it's used in Simple LMK. Switch to fully atomic
operations instead for full atomic guarantees.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Simple LMK's reclaim thread should never stop; there's no need to have
this check.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Previously, pages_found would be calculated using an uninitialized
variable. Fix it.
Reported-by: Julian Liu <wlootlxt123@gmail.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This is a complete low memory killer solution for Android that is small
and simple. Processes are killed according to the priorities that
Android gives them, so that the least important processes are always
killed first. Processes are killed until memory deficits are satisfied,
as observed from kswapd struggling to free up pages. Simple LMK stops
killing processes when kswapd finally goes back to sleep.
The only tunables are the desired amount of memory to be freed per
reclaim event and desired frequency of reclaim events. Simple LMK tries
to free at least the desired amount of memory per reclaim and waits
until all of its victims' memory is freed before proceeding to kill more
processes.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Excessive logging -- not present on angler -- is affecting
performance, contributing to missed audio deadlines and likely other
latency-dependent tasks.
Bug: 30375418
Change-Id: I88b9c7fa4540ad46e564f44a0e589b5215e8487d
Signed-off-by: Alex Naidis <alex.naidis@linux.com>
[nullxception: extend it to binder_alloc_debug_mask]
Signed-off-by: Nauval Rizky <enuma.alrizky@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
security_secid_to_secctx() can fail because of a GFP_ATOMIC allocation.
This needs to be retried from userspace. However, the binder driver
doesn't propagate specific enough error codes just yet (WIP b/28321379).
We'll retry in the binder driver as a temporary workaround until
userspace can do this instead.
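A sketch of the temporary retry (the bound of 5 is an arbitrary
assumption):

	char *secctx;
	u32 secctx_sz;
	int i, ret;

	/* GFP_ATOMIC failures are transient; retry before giving up */
	for (i = 0; i < 5; i++) {
		ret = security_secid_to_secctx(secid, &secctx, &secctx_sz);
		if (ret != -ENOMEM)
			break;
	}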
Bug: 174806915
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Change-Id: Ifebddeb7adf9707613512952b97ab702f0d2d592
Signed-off-by: Nauval Rizky <enuma.alrizky@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>