This tames quite a bit of the log spam and makes dmesg readable.
Uses work from Danny Lin <danny@kdrag0n.dev>.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This reverts commit 911ed9aadf134f633ec8c933acf06754a328b250.
This hurts battery and jitter without a noticeable real-world
performance improvement.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Change-Id: I4174bfae9046aae85054ada7a5ec5b25b111e827
Signed-off-by: azrim <mirzaspc@gmail.com>
As per Qualcomm: "SM8150/SM8250/SM8350/SM7250/SM7150/SM6150 - KPTI Not
required".
It can also help increase performance by a lot in some scenarios.
Signed-off-by: Kuba Wojciechowski <nullbytepl@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This reserved memory dump region is intended to be used with the memory
dump v2 driver, but we've disabled that and we don't need this memory
dumping functionality. Remove the unused region and associated driver
node to save 36 MiB of memory.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
* Libperfmgr increases the minimum frequency to 9999999 in order to boost
the CPU to the maximum frequency. This usually works because it also
increases the maximum frequency to 9999999 at init. However, if the
maximum frequency is decreased afterwards, which mi_thermald does, setting
the minimum frequency to 9999999 fails because it exceeds the maximum
frequency.
* Allow setting a minimum frequency higher than the maximum frequency (and
a maximum frequency lower than the minimum frequency) by adjusting the
minimum frequency whenever it exceeds the maximum frequency, as sketched
below.
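A minimal sketch of that adjustment, assuming a standard cpufreq policy store
hook (the function name and exact call site are illustrative, not the shipped
change):

  static int sketch_store_min_freq(struct cpufreq_policy *policy,
                                   unsigned int new_min)
  {
          /* libperfmgr may write 9999999 while mi_thermald has lowered max */
          if (new_min > policy->max)
                  new_min = policy->max; /* adjust instead of returning -EINVAL */

          policy->min = new_min;
          return 0;
  }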
Change-Id: I25b7ccde714aac14c8fdb9910857c3bd38c0aa05
Signed-off-by: azrim <mirzaspc@gmail.com>
Add a sysfs mechanism to track the idle state of the display subsystem.
This allows user space to poll on the idle state node to detect when the
display stays idle for longer than the configured time.
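A rough sketch of the mechanism, assuming a delayed work plus sysfs_notify()
(the struct, function, and node names here are hypothetical):

  struct idle_tracker {
          struct kobject *kobj;
          struct delayed_work work;
          bool idle;
  };

  static void idle_timeout_work(struct work_struct *work)
  {
          struct idle_tracker *t = container_of(to_delayed_work(work),
                                                struct idle_tracker, work);

          t->idle = true;
          /* wake up user space blocked in poll() on the idle_state node */
          sysfs_notify(t->kobj, NULL, "idle_state");
  }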
Bug: 142159002
Bug: 126304228
Change-Id: I21e3c7b0830a9695db9f65526c111ce5153d1764
Signed-off-by: Adrian Salido <salidoa@google.com>
Signed-off-by: Robb Glasser <rglasser@google.com>
(cherry picked from commit 11a2193b434cb3130743fbff89a161062883132e)
Signed-off-by: azrim <mirzaspc@gmail.com>
The msm_performance module is only used by QCOM perfd, so remove it.
Test: boot
Bug: 157242328
Signed-off-by: Wei Wang <wvw@google.com>
Change-Id: I981561829c0f26dfe21a907de16a5665c1085775
Signed-off-by: azrim <mirzaspc@gmail.com>
This functionality is unused on this platform. Disable it to prevent
incurring unnecessary overhead.
Change-Id: Ia52ab5fb9a7119ba4495879fa755c846fdde498e
Signed-off-by: Steve Muckle <smuckle@google.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This feature is undesirable and not required by Android.
Bug: 153203661
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I8adeb2ab1cac3041c812bbab7907df6bac57ac6d
Signed-off-by: azrim <mirzaspc@gmail.com>
As in previous projects, disabling sched autogroup helps
reduce jank in certain workloads.
Bug: 142549504
Test: build and boot to home
Change-Id: I5781468a2b584df93b8ee34b1af49ba6a78f340c
Signed-off-by: Kyle Lin <kylelin@google.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The Android service iorapd uses mm tracing to check which files are loaded
by an app during launch. It then compiles the traces and performs madvise
syscalls to speed up app launches. This, however, forces tracepoints to be
enabled for the whole kernel, which results in a much larger image size and
some performance penalties.
It turns out that tracing can be disabled by passing the NOTRACE flag.
To make use of this flag, pass it globally and undefine it where tracing
needs to keep working; mm tracing is kept in place for iorapd.
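The switch itself already exists upstream in include/linux/tracepoint.h:
tracepoints only materialize when NOTRACE is not defined.

  #if defined(CONFIG_TRACEPOINTS) && !defined(NOTRACE)
  #define TRACEPOINTS_ENABLED
  #endif

So defining NOTRACE globally compiles tracepoints out, and removing the
define in the directories that still need tracing (e.g. mm/ for iorapd)
keeps those tracepoints alive.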
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
Signed-off-by: alanndz <alanndz7@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
There are some chunks of code in the kernel running in process context
where it may be helpful to run the code on a specific set of CPUs, such
as when reading some CPU-intensive procfs files. This is especially
useful when the code in question must run within the context of the
current process (so kthreads cannot be used).
Add an API to make this possible, which consists of the following:
sched_migrate_to_cpumask_start():
  @old_mask: pointer to output the current task's old cpumask
  @dest: pointer to a cpumask the current task should be moved to
sched_migrate_to_cpumask_end():
  @old_mask: pointer to the old cpumask generated earlier
  @dest: pointer to the dest cpumask provided earlier
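A hypothetical caller sketch (cpu_lp_mask is assumed from the MSM kernel's
CPU topology masks; the procfs handler is illustrative):

  static int heavy_procfs_show(struct seq_file *m, void *v)
  {
          cpumask_t old_mask;

          /* move the current task to the little cluster for the heavy work */
          sched_migrate_to_cpumask_start(&old_mask, cpu_lp_mask);
          /* ... expensive formatting that must run in this task's context ... */
          sched_migrate_to_cpumask_end(&old_mask, cpu_lp_mask);
          return 0;
  }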
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Exporting the IRQ of a SPI device's master controller can help device
drivers utilize the PM QoS API to force the SPI master IRQ to be
serviced with low latency.
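A sketch of one possible consumer, assuming the MSM kernel's IRQ-affine PM QoS
request type and that the exported IRQ ends up in spi->master->irq (both are
assumptions, not confirmed by this patch):

  static void spi_client_boost_irq(struct spi_device *spi,
                                   struct pm_qos_request *req)
  {
          req->type = PM_QOS_REQ_AFFINE_IRQ;   /* MSM-specific request type */
          req->irq = spi->master->irq;         /* the IRQ exported here */
          pm_qos_add_request(req, PM_QOS_CPU_DMA_LATENCY, 100);
  }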
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Although mm structs are not often freed from finish_task_switch() during
a context switch, they can still slow things down and waste CPU time on
high priority CPUs when freed. Since unbound workqueues are now affined
to the little CPU cluster, we can offload the mm struct frees away from
the current CPU entirely if it's a high-performance CPU, and defer them
onto a little CPU. This reduces the amount of time spent in context
switches and reclaims CPU time from more-important CPUs. This is
achieved without increasing the size of the mm struct by reusing the
mmput async work, which is guaranteed to not be in use by the time
mm_count reaches zero.
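Roughly, the deferral looks like this (function names are illustrative;
cpu_perf_mask is assumed from the MSM kernel, and async_put_work is the
reused field mentioned above):

  static void mmdrop_async_fn(struct work_struct *work)
  {
          __mmdrop(container_of(work, struct mm_struct, async_put_work));
  }

  static void mmdrop_deferred(struct mm_struct *mm)
  {
          if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
                  if (cpumask_test_cpu(raw_smp_processor_id(), cpu_perf_mask)) {
                          /* big CPU: punt the free to an unbound (little) worker */
                          INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
                          queue_work(system_unbound_wq, &mm->async_put_work);
                  } else {
                          __mmdrop(mm);
                  }
          }
  }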
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Task stacks are frequently freed from finish_task_switch() during a
context switch, in addition to the occasional task struct itself. This
not only slows down context switches, but also wastes CPU time on high
priority CPUs. Since unbound workqueues are now affined to the little
CPU cluster, we can offload the task frees away from the current CPU
entirely if it's a high-performance CPU, and defer them onto a little
CPU. This reduces the amount of time spent in context switches and
reclaims CPU time from more-important CPUs.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
When exiting the camera, there's a period of intense lag caused by all
of the buffer-free workers consuming all CPUs at once for a few seconds.
This isn't very good, and freeing the buffers isn't super time critical,
so we can lower the burden of the workers by marking the per-heap
workqueues as CPU intensive, which offloads the burden of balancing the
workers onto the scheduler.
Also, mark these workqueues with WQ_MEM_RECLAIM so forward progress is
guaranteed via a rescuer thread, since these are used to free memory.
The unnecessary WQ_UNBOUND_MAX_ACTIVE is removed as well, since it's
only used for increasing the active worker count on large-CPU systems.
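In workqueue terms the per-heap allocation ends up looking roughly like this
(the queue name format and heap field are illustrative):

  heap->wq = alloc_workqueue("%s-free", WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
                             0, heap->name);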
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The ION driver suffers from massive code bloat caused by excessive
debug features, as well as poor lock usage as a result of that. Multiple
locks in ION exist to make the debug features thread-safe, which hurts
ION's actual performance when doing its job.
There are numerous code paths in ION that hold mutexes for no reason and
hold them for longer than necessary. This results in not only unwanted
lock contention, but also long delays when a mutex lock results in the
calling thread getting preempted for a while. All lock usage in ION
follows this pattern, which causes poor performance across the board.
Furthermore, a single big mutex lock is used almost everywhere, which
causes performance degradation due to unnecessary lock overhead.
Instead of having a big mutex lock, multiple fine-grained locks are now
used, improving performance.
Additionally, dup_sg_table is called very frequently, and lies within
the rendering path for the display. Speed it up by copying scatterlists
in page-sized chunks rather than iterating one at a time. Note that
sg_alloc_table zeroes out `table`, so there's no need to zero it out
using the memory allocator.
This also features a lock-less caching system for DMA attachments and
their respective sg_table copies, reducing overhead significantly for
code which frequently maps and unmaps DMA buffers and speeding up cache
maintenance since iteration through the list of buffer attachments is
now lock-free. This is safe since there is no interleaved DMA buffer
attaching or accessing for a single ION buffer.
Overall, just rewrite ION entirely to fix its deficiencies. This
optimizes ION for excellent performance and discards its debug cruft.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I0a21435be1eb409cfe140eec8da507cc35f060dd
Signed-off-by: azrim <mirzaspc@gmail.com>
The scope of this driver's lock usage is extremely wide, leading to
excessively long lock hold times. Additionally, there is lots of
excessive linked-list traversal and unnecessary dynamic memory
allocation in a critical path, causing poor performance across the
board.
Fix all of this by greatly reducing the scope of the locks used and by
significantly reducing the amount of operations performed when
msm_dma_map_sg_attrs() is called. The entire driver's code is overhauled
for better cleanliness and performance.
Note that ION must be modified to pass a known structure via the private
dma_buf pointer, so that the IOMMU driver can prevent races when
operating on the same buffer concurrently. This is the only way to
eliminate said buffer races without hurting the IOMMU driver's
performance.
Some additional members are added to the device struct as well to make
these various performance improvements possible.
This also removes the manual cache maintenance since ION already handles
it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The current delay implementation uses the yield instruction, which is a
hint that it is beneficial to schedule another thread. As this is a hint,
it may be implemented as a NOP, causing all delays to be busy loops. This
is the case for many existing CPUs.
Taking advantage of the generic timer sending periodic events to all
cores, we can use WFE during delays to reduce power consumption. This is
beneficial only for delays longer than the period of the timer event
stream.
If the timer event stream is not enabled, delays will behave as yield/busy
loops.
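The resulting delay loop looks roughly like this (close to the upstream
arch/arm64/lib/delay.c change; USECS_TO_CYCLES is the helper defined there):

  void __delay(unsigned long cycles)
  {
          cycles_t start = get_cycles();

          if (arch_timer_evtstrm_available()) {
                  const cycles_t timer_evt_period =
                          USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);

                  /* sleep in wfe while more than one event period remains */
                  while ((get_cycles() - start + timer_evt_period) < cycles)
                          wfe();
          }

          /* finish (or fall back entirely) with the classic busy loop */
          while ((get_cycles() - start) < cycles)
                  cpu_relax();
  }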
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
The arch timer configuration for a CPU might get reset after suspending
said CPU.
In order to reliably use the event stream in the kernel (e.g. for delays),
we keep track of the state where we can safely consider the event stream as
properly configured. After writing to cntkctl, we issue an ISB to ensure
that subsequent delay loops can rely on the event stream being enabled.
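Roughly, the bookkeeping is a per-CPU availability mask consulted by the
delay code (abridged sketch of the clocksource driver change):

  static cpumask_t evtstrm_available = CPU_MASK_NONE;

  bool arch_timer_evtstrm_available(void)
  {
          return cpumask_test_cpu(raw_smp_processor_id(), &evtstrm_available);
  }

  static void arch_timer_evtstrm_enable(int divider)
  {
          /* ... program CNTKCTL, then: */
          isb();  /* ensure later delay loops see the event stream enabled */
          cpumask_set_cpu(smp_processor_id(), &evtstrm_available);
  }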
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
A measurably significant amount of CPU time is spent on logging events
for debugging purposes in lpm_cpuidle_enter. Kill the useless logging to
reduce overhead.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
We want to reduce lock contention, so replace the global lock with an
atomic.
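Generic shape of the change (the driver's own symbols are not shown here): a
counter guarded by a global spinlock becomes a lock-free atomic.

  static atomic_t active_count = ATOMIC_INIT(0);

  static void mark_active(void)
  {
          /* was: spin_lock(&global_lock); count++; spin_unlock(&global_lock); */
          atomic_inc(&active_count);
  }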
bug: 127722781
Change-Id: I08ed3d55bf6bf17f31f4017c82c998fb513bad3e
Signed-off-by: Kyle Lin <kylelin@google.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
There are cases where EXT4 is a bit too conservative in sending barriers down
to the disk: cases where the transaction in progress is not the one that sent
the barrier (in other words, the fsync is for a file whose IO happened longer
ago and whose data was already sent to the disk).
For that case, a better-performing tradeoff can be made on SSD devices (which
have the ability to flush their DRAM caches in a hurry on a power-fail event):
the barrier still gets sent to the disk, but we don't need to wait for it to
complete. Any subsequent IO will block on the barrier correctly.
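A hypothetical helper that captures the idea (not the exact patch): submit
the flush bio and return without waiting for it.

  static void async_flush_end_io(struct bio *bio)
  {
          bio_put(bio);   /* nobody waits on this barrier */
  }

  static void submit_async_flush(struct block_device *bdev)
  {
          struct bio *bio = bio_alloc(GFP_NOIO, 0);

          bio_set_dev(bio, bdev);
          bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
          bio->bi_end_io = async_flush_end_io;
          submit_bio(bio);        /* later IO still queues behind the flush */
  }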
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
The two conditions are mutually exclusive, and the GCC compiler will optimise
this into an if-else-like pattern. Given that the majority of free_slowpath
is free_frozen, let's provide a hint to the compiler.
Tests (perf bench sched messaging -g 20 -l 400000, executed 10x
after reboot) are done and the summarized result:
           un-patched    patched
  max.        192.316    189.851
  min.        187.267    186.252
  avg.        189.154    188.086
  stdev.        1.37       0.99
Signed-off-by: Abel Wu <wuyun.wu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hewenliang <hewenliang4@huawei.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Link: http://lkml.kernel.org/r/20200813101812.1617-1-wuyun.wu@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
If the file already has underlying blocks/extents allocated,
then we don't need to start a journal transaction and can directly return
the underlying mapping. Currently ext4_iomap_begin() is used by
both the DAX & DIO paths. We can check whether the write request is an
overwrite & then directly return the mapping information.
This could give a significant perf boost for multi-threaded writes,
especially random overwrites.
On a PPC64 VM with a simulated pmem (DAX) device, a ~10x perf improvement
could be seen in random writes (overwrite), also because this optimizes
away the spinlock contention during jbd2 slab cache allocation
(jbd2_journal_handle). On an x86 VM, a ~2x perf improvement was observed.
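Conceptually the overwrite check is a map lookup without block allocation
(sketch; the helper name is hypothetical):

  static bool sketch_is_overwrite(struct inode *inode,
                                  struct ext4_map_blocks *map)
  {
          int want = map->m_len;
          int ret = ext4_map_blocks(NULL, inode, map, 0); /* no journal, no create */

          /* fully mapped and written: safe to skip starting a journal handle */
          return ret == want && (map->m_flags & EXT4_MAP_MAPPED);
  }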
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/88e795d8a4d5cd22165c7ebe857ba91d68d8813e.1600401668.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This deadlock is hitting Android users (Pixel 3/3a/4) with Magisk, due
to frequent umount/mount operations that trigger quota_sync, hitting
the race. See https://github.com/topjohnwu/Magisk/issues/3171 for
additional impact discussion.
In commit db6ec53b7e03, we added a semaphore to protect quota flags.
As part of this commit, we changed f2fs_quota_sync to call
f2fs_lock_op, in an attempt to prevent an AB/BA type deadlock with
quota_sem locking in block_operation. However, rwsem in Linux is not
recursive. Therefore, the following deadlock can occur:
f2fs_quota_sync
  down_read(cp_rwsem) // f2fs_lock_op
  filemap_fdatawrite
    f2fs_write_data_pages
      ...
                         block_operation
                           down_write(cp_rwsem) - marks rwsem as
                                                  "writer pending"
      down_read_trylock(cp_rwsem) - fails as there is
                                    a writer pending.
                                    Code keeps on trying,
                                    live-locking the filesystem.
We solve this by creating a new rwsem, used specifically to
synchronize this case, instead of attempting to reuse an existing
lock.
Signed-off-by: Shachar Raindel <shacharr@gmail.com>
Fixes: db6ec53b7e03 ("f2fs: add a rw_sem to cover quota flag changes")
Signed-off-by: azrim <mirzaspc@gmail.com>
In OPPO's kernel:
enlarge min_fsync_blocks to optimize performance
- yanwu@TECH.Storage.FS.oF2FS, 2019/08/12
Huawei is also doing this in their production kernel.
If this optimization is good for them and shipped
with their devices, it should be good for us.
Signed-off-by: Jesse Chan <jc@linux.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
On high fs utilization, congestion is hit quite frequently and waiting for a
whopping 20 ms is too expensive, especially on critical paths.
Reduce it to an amount that is unlikely to affect UI rendering paths.
The new times are as follows:
100 Hz => 1 jiffy (effective: 10 ms)
250 Hz => 2 jiffies (effective: 8 ms)
300 Hz => 2 jiffies (effective: 6 ms)
1000 Hz => 6 jiffies (effective: 6 ms)
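The new values are consistent with simply asking for ~6 ms at the congestion
wait sites (the call shown is illustrative; the old value corresponds to 20 ms):

  congestion_wait(BLK_RW_ASYNC, msecs_to_jiffies(6)); /* was HZ/50 = 20 ms */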
Co-authored-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Change-Id: I2978c7de07e6fa8d8261b532d5bc1325006433f9
Signed-off-by: azrim <mirzaspc@gmail.com>
We don't want the background GC work causing UI jitter should it ever
collide with periods of user activity.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
GC should run as conservatively as possible to reduce latency spikes for the
user. Setting the ioprio to the idle class allows the kernel to schedule the
GC thread's I/O so that it does not affect any other process's I/O requests.
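In ioprio terms this amounts to the following (sketch; the gc_thread field
name assumes f2fs's struct f2fs_gc_kthread):

  set_task_ioprio(sbi->gc_thread->f2fs_gc_task,
                  IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));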
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Optimize modulo operation instruction generation by
using a single MSUB instruction instead of a MUL followed by a SUB
instruction.
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
using wfe for spinlocks that can boost performance of sibling threads
by putting the current CPU into a wait state that is broken only when
the monitored variable changes or an external event happens.
OSQ has a more complicated spinning loop. Besides the lock value, it
also checks for need_resched() and vcpu_is_preempted(). The check for
need_resched() is not a problem as it is only set by the tick interrupt
handler. That will be detected by the spinning CPU right after iret.
The vcpu_is_preempted() check, however, is a problem as changes to the
preempt state of the previous node will not affect the wait state. For
ARM64, vcpu_is_preempted is not currently defined and so is a no-op.
Will has indicated that he is planning to para-virtualize wfe instead
of defining vcpu_is_preempted for PV support. So just add a comment in
arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted()
should not be defined as suggested.
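The resulting spin in osq_lock() is roughly:

  /*
   * Wait for the lock or cancellation; on arm64 smp_cond_load_relaxed()
   * sits in wfe until node->locked changes or an event/IPI arrives.
   */
  if (smp_cond_load_relaxed(&node->locked,
                            VAL || need_resched() ||
                            vcpu_is_preempted(node_cpu(node->prev))))
          return true;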
On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before patch were:
Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s
After patch, the numbers were:
Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s
So there was about a 20% performance improvement.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200113150735.21956-1-longman@redhat.com
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Apparently there exist certain workloads which rely heavily on software
checksumming, for which the generic do_csum() implementation becomes a
significant bottleneck. Therefore let's give arm64 its own optimised
version - for ease of maintenance this foregoes assembly or intrinsics,
and is thus not actually arm64-specific, but does rely heavily on C
idioms that translate well to the A64 ISA and the typical load/store
capabilities of most ARMv8 CPU cores.
The resulting increase in checksum throughput scales nicely with buffer
size, tending towards 4x for a small in-order core (Cortex-A53), and up
to 6x or more for an aggressive big core (Ampere eMAG).
Reported-by: Lingyan Huang <huanglingyan2@huawei.com>
Tested-by: Lingyan Huang <huanglingyan2@huawei.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Whilst we currently provide smp_cond_load_acquire() and
atomic_cond_read_acquire(), there are cases where the ACQUIRE semantics are
not required because of a subsequent fence or release operation once the
conditional loop has exited.
This patch adds relaxed versions of the conditional spinning primitives
to avoid unnecessary barrier overhead on architectures such as arm64.
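The generic fallback added by this patch is essentially the existing acquire
variant minus the final barrier:

  #ifndef smp_cond_load_relaxed
  #define smp_cond_load_relaxed(ptr, cond_expr) ({        \
          typeof(ptr) __PTR = (ptr);                      \
          typeof(*ptr) VAL;                               \
          for (;;) {                                      \
                  VAL = READ_ONCE(*__PTR);                \
                  if (cond_expr)                          \
                          break;                          \
                  cpu_relax();                            \
          }                                               \
          VAL;                                            \
  })
  #endif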
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
It is probably safe to assume that all Armv8-A implementations have a
multiplier whose efficiency is comparable or better than a sequence of
three or so register-dependent arithmetic instructions. Select
ARCH_HAS_FAST_MULTIPLIER to get ever-so-slightly nicer codegen in the
few dusty old corners which care.
In a contrived benchmark calling hweight64() in a loop, this does indeed
turn out to be a small win overall, with no measurable impact on
Cortex-A57 but about 5% performance improvement on Cortex-A53.
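What the option buys, abridged from lib/hweight.c: the final horizontal add
of the popcount collapses into a single multiply and shift.

  static unsigned int sw_hweight32_sketch(unsigned int w)
  {
          w -= (w >> 1) & 0x55555555;
          w  = (w & 0x33333333) + ((w >> 2) & 0x33333333);
          w  = (w + (w >> 4)) & 0x0f0f0f0f;
  #ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
          return (w * 0x01010101) >> 24;          /* one MUL + shift on Armv8-A */
  #else
          w += w >> 8;
          return (w + (w >> 16)) & 0xff;
  #endif
  }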
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>