Author SHA1 Message Date
Yaroslav Furman
e7c78ba42c
[SQUASH] power: supply: Silence massive debug logspam
power/supply: qcom: Silence charging drivers

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>

power: supply: ti: silence charging spam

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>

qpnp-qg: silence qg_get_prop_soc_decimal spam

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>

qpnp-smb5: silence another annoying logger

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>

qcom: smb5-lib: stfu

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>

smb5-lib: silence annoying loggers

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>

power: maxim: silence drivers

Leave actual errors enabled.

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>

power_supply_sysfs: silence 'failed to report 'flash_trigger' spam

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>

qpnp-smb5: silence some kernel spam

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:48 +07:00
Adithya R
728e2fe7eb
aw8624_haptic: Silence all debug logging
* this driver is extremely spammy and quite annoying, so
   silence its debug logging entirely

pr_info -> pr_debug

Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:48 +07:00
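
A minimal sketch of the pr_info -> pr_debug demotion these logspam commits apply (the function and message here are hypothetical, not the real driver code):

    #include <linux/printk.h>

    static void aw8624_set_gain_example(int gain)
    {
            /* before: logged at KERN_INFO on every haptic event, flooding dmesg */
            /* pr_info("aw8624: set gain=%d\n", gain); */

            /* after: compiled out unless DEBUG or dynamic debug enables it */
            pr_debug("aw8624: set gain=%d\n", gain);
    }

With pr_debug the messages remain recoverable via CONFIG_DYNAMIC_DEBUG without a rebuild, which is why demotion is preferred over deleting the calls.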
Forenche
19a63dcd5d
input/ts: nt36xxx: Disable debugging
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:48 +07:00
Yaroslav Furman
27812d229c
power/supply: cp_qc30: Silence logging
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:48 +07:00
Yaroslav Furman
47711fb724
techpack/audio: massive stfu
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:47 +07:00
Yaroslav Furman
4e8cb17df7
wl2866d: silence logspam
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:47 +07:00
Yaroslav Furman
fa5936590b
aw8624_haptic: silence and change the wq
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:47 +07:00
Sultan Alsawaf
cc9834cfcf
treewide: Suppress overly verbose log spam
This tames quite a bit of the log spam and makes dmesg readable.

Uses work from Danny Lin <danny@kdrag0n.dev>.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:47 +07:00
Danny Lin
c48ee556ad
Revert "ARM: dts: msm: Set rcu_expedited for sdm855"
This reverts commit 911ed9aadf134f633ec8c933acf06754a328b250.

This hurts battery and jitter without a noticeable real-world
performance improvement.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Change-Id: I4174bfae9046aae85054ada7a5ec5b25b111e827
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:47 +07:00
Kuba Wojciechowski
2d6bb1b472
ARM: dts: msm: disable kpti on sdmmagpie
As per Qualcomm: "SM8150/SM8250/SM8350/SM7250/SM7150/SM6150 - KPTI Not
required".
Disabling KPTI can also significantly improve performance in some
scenarios.

Signed-off-by: Kuba Wojciechowski <nullbytepl@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:46 +07:00
Danny Lin
d58d2e8724
ARM64/dts: sdmmagpie: Remove unused 36 MiB memdump region
This reserved memory dump region is intended to be used with the memory
dump v2 driver, but we've disabled that and we don't need this memory
dumping functionality. Remove the unused region and associated driver
node to save 36 MiB of memory.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:46 +07:00
Mimi Wu
626cfeeeb6
scsi: ufs: disable clock scaling
Disable clock scaling to avoid costly workqueue overheads.

Power test results on Blueline:
[without this change]
  Suspend: 9.75mA
  Idle: 238.26mA
  Camera Preview: 1309.99mA
  Partial Wake Lock: 13.67mA
[with this change - disable clock scaling]
  Suspend: 9.73mA (-0.21%)
  Idle: 215.87mA (-9.4%)
  Camera Preview: 1181.71mA (-9.79%)
  Partial Wake Lock: 13.85mA (+1.32%)

Bug: 78601190
Signed-off-by: Mimi Wu <mimiwu@google.com>
Change-Id: I09f07619ab3e11b05149358c1d06b0d1039decf3
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:46 +07:00
Arian
b7b95b766d
cpufreq: Ensure the minimal frequency is lower than the maximal frequency
* Libperfmgr raises the minimal frequency to 9999999 in order to boost
  the CPU to the maximal frequency. This usually works because it also
  raises the maximal frequency to 9999999 at init. However, if the
  maximal frequency is lowered afterwards, which mi_thermald does,
  setting the minimal frequency to 9999999 fails because it exceeds the
  maximal frequency.

* Allow either limit to be set regardless of the other by adjusting the
  minimal frequency down whenever it would exceed the maximal frequency
  (see the sketch after this entry).

Change-Id: I25b7ccde714aac14c8fdb9910857c3bd38c0aa05
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:46 +07:00
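
A minimal sketch of the clamping described above, assuming a cpufreq policy-verification path (not the exact patch):

    #include <linux/cpufreq.h>

    static void clamp_min_to_max(struct cpufreq_policy *new_policy)
    {
            /*
             * A write like "echo 9999999 > scaling_min_freq" no longer
             * fails when it exceeds scaling_max_freq; the minimum is
             * adjusted down instead, so libperfmgr and mi_thermald can
             * update the two limits in any order.
             */
            if (new_policy->min > new_policy->max)
                    new_policy->min = new_policy->max;
    }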
Adrian Salido
5989e9daa4
drm/msm: add idle state sysfs node
Add a sysfs mechanism to track the idle state of display subsystem.
This allows user space to poll on the idle state node to detect when
display goes idle for longer than the time set.

Bug: 142159002
Bug: 126304228
Change-Id: I21e3c7b0830a9695db9f65526c111ce5153d1764
Signed-off-by: Adrian Salido <salidoa@google.com>
Signed-off-by: Robb Glasser <rglasser@google.com>
(cherry picked from commit 11a2193b434cb3130743fbff89a161062883132e)
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:46 +07:00
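
A hedged sketch of such a pollable sysfs node (the attribute and variable names are assumptions, not the actual driver code):

    #include <linux/device.h>
    #include <linux/sysfs.h>

    static bool idle_status;

    static ssize_t idle_state_show(struct device *dev,
                                   struct device_attribute *attr, char *buf)
    {
            return scnprintf(buf, PAGE_SIZE, "%d\n", READ_ONCE(idle_status));
    }
    static DEVICE_ATTR_RO(idle_state);

    /* called when the display idle timer fires or activity resets it */
    static void notify_idle_state(struct device *dev, bool idle)
    {
            WRITE_ONCE(idle_status, idle);
            sysfs_notify(&dev->kobj, NULL, "idle_state");   /* wakes poll() */
    }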
Danny Lin
d5c02ad549
ARM64: configs: surya: Shorten PELT ramp/delay halflife to 16 ms
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:45 +07:00
Wei Wang
306bcaf596
ARM64: configs: surya: Disable CONFIG_MSM_PERFORMANCE
The msm_performance module is only used by QCOM perfd, so remove it.

Test: boot
Bug: 157242328
Signed-off-by: Wei Wang <wvw@google.com>
Change-Id: I981561829c0f26dfe21a907de16a5665c1085775
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:45 +07:00
Steve Muckle
2e8086808f
ARM64: configs: surya: Turn off CONFIG_SCHED_CORE_CTL
This functionality is unused on this platform. Disable it to prevent
incurring unnecessary overhead.

Change-Id: Ia52ab5fb9a7119ba4495879fa755c846fdde498e
Signed-off-by: Steve Muckle <smuckle@google.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:45 +07:00
Wei Wang
709fa09f37
ARM64: configs: surya: Remove unused governors and CONFIG_CPU_BOOST
Bug: 115684360
Bug: 113594604
Test: Build
Change-Id: I9141b9bac316604730f0e277ca0212e86df3a90d
Signed-off-by: Wei Wang <wvw@google.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:45 +07:00
Suren Baghdasaryan
aa2ad7d14d
ARM64: configs: surya: Remove FAIR_GROUP_SCHED
This feature is undesirable and not required by Android.

Bug: 153203661
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I8adeb2ab1cac3041c812bbab7907df6bac57ac6d
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:44 +07:00
Kyle Lin
78e4c86e5d
ARM64: configs: surya: Disable CONFIG_AUTOCGROUP
As on previous projects, disabling sched autogroup helps
reduce jank in certain workloads.

Bug: 142549504
Test: build and boot to home
Change-Id: I5781468a2b584df93b8ee34b1af49ba6a78f340c
Signed-off-by: Kyle Lin <kylelin@google.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:44 +07:00
Andrzej Perczak
8ac1ee44be
configs: vayu: Enable MINIMAL_TRACING_FOR_IORAP
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
Signed-off-by: alanndz <alanndz7@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:44 +07:00
Andrzej Perczak
2a09fecdeb
trace: Introduce minimal tracing for iorapd
The Android service iorapd uses mm tracing to check which files an app
loads during launch. It then compiles the traces into madvise syscalls
to speed up app launches. This, however, forces tracepoints to be
enabled for the whole kernel, which results in a much larger image size
and some performance penalties.

It turns out that tracing can be disabled by passing the NOTRACE flag.
To make use of this flag, pass it globally and undef it where needed so
tracing still works there; in particular, keep mm tracing in place for
iorapd (see the sketch after this entry).

Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
Signed-off-by: alanndz <alanndz7@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:44 +07:00
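
A hedged sketch of the per-file override:

    /*
     * Upstream include/linux/tracepoint.h already honors the flag:
     *
     *   #if defined(CONFIG_TRACEPOINTS) && !defined(NOTRACE)
     *   #define TRACEPOINTS_ENABLED
     *   #endif
     *
     * so building with -DNOTRACE compiles tracepoints out globally. A
     * file that must keep mm tracing for iorapd (e.g. the
     * mm_filemap_add_to_page_cache event) undefs it first:
     */
    #undef NOTRACE
    #define CREATE_TRACE_POINTS
    #include <trace/events/filemap.h>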
Sultan Alsawaf
fd30ed4aae
sched: Add API to migrate the current process to a given cpumask
There are some chunks of code in the kernel running in process context
where it may be helpful to run the code on a specific set of CPUs, such
as when reading some CPU-intensive procfs files. This is especially
useful when the code in question must run within the context of the
current process (so kthreads cannot be used).

Add an API to make this possible, which consists of the following:
sched_migrate_to_cpumask_start():
 @old_mask: pointer to output the current task's old cpumask
 @dest: pointer to a cpumask the current task should be moved to

sched_migrate_to_cpumask_end():
 @old_mask: pointer to the old cpumask generated earlier
 @dest: pointer to the dest cpumask provided earlier

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:43 +07:00
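
A usage sketch based on the API described above; cpu_lp_mask (the little-cluster mask) is an assumption for illustration:

    #include <linux/cpumask.h>
    #include <linux/sched.h>

    static void read_expensive_procfs(void)
    {
            cpumask_t old_mask;

            sched_migrate_to_cpumask_start(&old_mask, cpu_lp_mask);
            /* CPU-intensive work now runs on the little cluster, still in
             * the context of the current process (no kthread needed) */
            sched_migrate_to_cpumask_end(&old_mask, cpu_lp_mask);
    }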
UtsavBalar1231
0f1991bc8e
ARM64/dts: sdmmagpie-gpu: Use iommu_unmap_fast
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:43 +07:00
Yaroslav Furman
a16e32a89d
input/ts: nt36xxx: Switch to device initcall
Device boots and dt2w switch works.

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:43 +07:00
Sultan Alsawaf
fc7fe1470b
spi-geni-qcom: Add a function to get IRQ of device's master
Exporting the IRQ of a SPI device's master controller can help device
drivers utilize the PM QoS API to force the SPI master IRQ to be
serviced with low latency.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:43 +07:00
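
A hedged sketch of the exported helper; the function name and the driver-internal field are assumptions:

    #include <linux/spi/spi.h>

    /* inside spi-geni-qcom.c, where struct spi_geni_master is visible */
    int spi_geni_master_irq(const struct spi_device *spi)
    {
            struct spi_geni_master *mas = spi_master_get_devdata(spi->master);

            return mas->irq;
    }
    EXPORT_SYMBOL(spi_geni_master_irq);

A client driver (a SPI touchscreen, say) can then attach a PM QoS request to this IRQ, assuming the kernel's PM QoS implementation supports IRQ-affine requests, so only the CPU servicing the SPI interrupt is held at low latency.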
Sultan Alsawaf
db8c27e097
sched/core: Free dead mm structs asynchronously in finish_task_switch()
Although mm structs are not often freed from finish_task_switch() during
a context switch, they can still slow things down and waste CPU time on
high priority CPUs when freed. Since unbound workqueues are now affined
to the little CPU cluster, we can offload the mm struct frees away from
the current CPU entirely if it's a high-performance CPU, and defer them
onto a little CPU. This reduces the amount of time spent in context
switches and reclaims CPU time from more-important CPUs. This is
achieved without increasing the size of the mm struct by reusing the
mmput async work, which is guaranteed to not be in use by the time
mm_count reaches zero.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:42 +07:00
Sultan Alsawaf
38913c58c5
sched/core: Free dead tasks asynchronously in finish_task_switch()
Task stacks are frequently freed from finish_task_switch() during a
context switch, in addition to the occasional task struct itself. This
not only slows down context switches, but also wastes CPU time on high
priority CPUs. Since unbound workqueues are now affined to the little
CPU cluster, we can offload the task frees away from the current CPU
entirely if it's a high-performance CPU, and defer them onto a little
CPU. This reduces the amount of time spent in context switches and
reclaims CPU time from more-important CPUs.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:42 +07:00
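
Both finish_task_switch() commits follow the same deferral pattern. A simplified sketch (the real commits avoid the extra allocation by reusing existing struct members, as noted above):

    #include <linux/sched/task.h>
    #include <linux/slab.h>
    #include <linux/workqueue.h>

    struct dead_task_free {
            struct work_struct work;
            struct task_struct *tsk;
    };

    static void dead_task_free_fn(struct work_struct *work)
    {
            struct dead_task_free *dtf =
                    container_of(work, struct dead_task_free, work);

            put_task_struct(dtf->tsk);      /* the actual free, on a little CPU */
            kfree(dtf);
    }

    static void defer_dead_task_free(struct task_struct *tsk)
    {
            struct dead_task_free *dtf = kmalloc(sizeof(*dtf), GFP_ATOMIC);

            if (!dtf) {
                    put_task_struct(tsk);   /* fall back: free in place */
                    return;
            }
            dtf->tsk = tsk;
            INIT_WORK(&dtf->work, dead_task_free_fn);
            /* unbound workqueues are affined to the little cluster here */
            queue_work(system_unbound_wq, &dtf->work);
    }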
Sultan Alsawaf
9ffd325945
ion: Mark workqueues freeing buffers asynchronously as CPU intensive
When exiting the camera, there's a period of intense lag caused by all
of the buffer-free workers consuming all CPUs at once for a few seconds.
This isn't very good, and freeing the buffers isn't super time critical,
so we can lower the burden of the workers by marking the per-heap
workqueues as CPU intensive, which offloads the burden of balancing the
workers onto the scheduler.

Also, mark these workqueues with WQ_MEM_RECLAIM so forward progress is
guaranteed via a rescuer thread, since these are used to free memory.
The unnecessary WQ_UNBOUND_MAX_ACTIVE is removed as well, since it's
only used for increasing the active worker count on large-CPU systems.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:42 +07:00
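
A hedged sketch of per-heap queue creation with the flags described above:

    #include <linux/workqueue.h>

    static struct workqueue_struct *ion_heap_free_wq(const char *heap_name)
    {
            /*
             * WQ_CPU_INTENSIVE: exempt long-running free workers from
             * concurrency management, letting the scheduler balance them.
             * WQ_MEM_RECLAIM: a rescuer thread guarantees forward
             * progress, since this queue is used to free memory.
             * max_active 0: the default, replacing WQ_UNBOUND_MAX_ACTIVE.
             */
            return alloc_workqueue("%s-free",
                                   WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
                                   0, heap_name);
    }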
Sultan Alsawaf
47196f428b
ion: Rewrite to improve clarity and performance
The ION driver suffers from massive code bloat caused by excessive
debug features, as well as poor lock usage as a result of that. Multiple
locks in ION exist to make the debug features thread-safe, which hurts
ION's actual performance when doing its job.

There are numerous code paths in ION that hold mutexes for no reason and
hold them for longer than necessary. This results in not only unwanted
lock contention, but also long delays when a mutex lock results in the
calling thread getting preempted for a while. All lock usage in ION
follows this pattern, which causes poor performance across the board.
Furthermore, a big mutex lock is used mostly everywhere, which causes
performance degradation due to unnecessary lock overhead.

Instead of having a big mutex lock, multiple fine-grained locks are now
used, improving performance.

Additionally, dup_sg_table is called very frequently, and lies within
the rendering path for the display. Speed it up by copying scatterlists
in page-sized chunks rather than iterating one at a time. Note that
sg_alloc_table zeroes out `table`, so there's no need to zero it out
using the memory allocator.

This also features a lock-less caching system for DMA attachments and
their respective sg_table copies, reducing overhead significantly for
code which frequently maps and unmaps DMA buffers and speeding up cache
maintenance since iteration through the list of buffer attachments is
now lock-free. This is safe since there is no interleaved DMA buffer
attaching or accessing for a single ION buffer.

Overall, just rewrite ION entirely to fix its deficiencies. This
optimizes ION for excellent performance and discards its debug cruft.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: I0a21435be1eb409cfe140eec8da507cc35f060dd
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:41 +07:00
Sultan Alsawaf
754104ecf9
iommu: msm: Rewrite to improve clarity and performance
The scope of this driver's lock usage is extremely wide, leading to
excessively long lock hold times. Additionally, there is lots of
excessive linked-list traversal and unnecessary dynamic memory
allocation in a critical path, causing poor performance across the
board.

Fix all of this by greatly reducing the scope of the locks used and by
significantly reducing the amount of operations performed when
msm_dma_map_sg_attrs() is called. The entire driver's code is overhauled
for better cleanliness and performance.

Note that ION must be modified to pass a known structure via the private
dma_buf pointer, so that the IOMMU driver can prevent races when
operating on the same buffer concurrently. This is the only way to
eliminate said buffer races without hurting the IOMMU driver's
performance.

Some additional members are added to the device struct as well to make
these various performance improvements possible.

This also removes the manual cache maintenance since ION already handles
it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:41 +07:00
Julien Thierry
3b50339f91
arm64: Use WFE for long delays
The current delay implementation uses the yield instruction, which is a
hint that it is beneficial to schedule another thread. As this is a hint,
it may be implemented as a NOP, causing all delays to be busy loops. This
is the case for many existing CPUs.

Taking advantage of the generic timer sending periodic events to all
cores, we can use WFE during delays to reduce power consumption. This is
beneficial only for delays longer than the period of the timer event
stream.

If the timer event stream is not enabled, delays will behave as
yield/busy loops.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:40 +07:00
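
The resulting __delay() loop, close to the upstream patch (simplified; USECS_TO_CYCLES() is a local helper in arch/arm64/lib/delay.c):

    #include <clocksource/arm_arch_timer.h>
    #include <linux/delay.h>
    #include <linux/timex.h>

    void __delay(unsigned long cycles)
    {
            cycles_t start = get_cycles();

            if (arch_timer_evtstrm_available()) {
                    const cycles_t timer_evt_period =
                            USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);

                    /* sleep in WFE until the next periodic timer event */
                    while ((get_cycles() - start + timer_evt_period) < cycles)
                            wfe();
            }

            /* finish the remainder (or the whole delay, if no event
             * stream is available) as a busy loop */
            while ((get_cycles() - start) < cycles)
                    cpu_relax();
    }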
Julien Thierry
8012529c4a
arm_arch_timer: Expose event stream status
The arch timer configuration for a CPU might get reset after suspending
said CPU.

In order to reliably use the event stream in the kernel (e.g. for delays),
we keep track of the state where we can safely consider the event stream as
properly configured. After writing to cntkctl, we issue an ISB to ensure
that subsequent delay loops can rely on the event stream being enabled.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:40 +07:00
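
A hedged sketch of the enable path described above (names simplified):

    #include <asm/arch_timer.h>

    static void arch_timer_enable_evtstrm(u32 evt_bits)
    {
            u32 cntkctl = arch_timer_get_cntkctl();

            cntkctl |= evt_bits;            /* event-enable + divider bits */
            arch_timer_set_cntkctl(cntkctl);
            isb();  /* later delay loops may now rely on the event stream */
            /* ...then mark this CPU in the evtstrm-available state that
             * arch_timer_evtstrm_available() reports */
    }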
Danny Lin
42273bce73
cpuidle: lpm-levels: Remove debug event logging
A measurably significant amount of CPU time is spent on logging events
for debugging purposes in lpm_cpuidle_enter. Kill the useless logging to
reduce overhead.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:39 +07:00
Kyle Lin
001f5b500c
cpufreq: stats: replace the global lock with atomic
We want to reduce the lock contention so replace the global lock with
atomic.

Bug: 127722781
Change-Id: I08ed3d55bf6bf17f31f4017c82c998fb513bad3e
Signed-off-by: Kyle Lin <kylelin@google.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:39 +07:00
Arjan van de Ven
08abdcb276
ipv4/tcp: allow the memory tuning for tcp to go a little bigger than default
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:38 +07:00
Arjan van de Ven
23a4d4ad7d
fs: ext4: fsync: optimize double-fsync() a bunch
There are cases where ext4 is a bit too conservative about sending
barriers down to the disk, such as when the transaction in progress is
not the one that issued the barrier (in other words: the fsync is for a
file whose IO happened longer ago, and all of its data was already sent
to the disk).

For that case, a better-performing tradeoff can be made on SSD devices
(which have the ability to flush their DRAM caches in a hurry on a
power-fail event): the barrier still gets sent to the disk, but we
don't need to wait for it to complete. Any subsequent IO will block on
the barrier correctly (see the sketch after this entry).

Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:38 +07:00
Arjan van de Ven
7e04c8929c
kernel: do accept() in LIFO order for cache efficiency
Signed-off-by: Diab Neiroukh <lazerl0rd@thezest.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:37 +07:00
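
A hedged sketch of the LIFO wait: a head-insert variant of prepare_to_wait_exclusive() (the function name is an assumption based on the commit title):

    #include <linux/wait.h>

    void prepare_to_wait_exclusive_lifo(struct wait_queue_head *wq_head,
                                        struct wait_queue_entry *wq_entry,
                                        int state)
    {
            unsigned long flags;

            wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
            spin_lock_irqsave(&wq_head->lock, flags);
            if (list_empty(&wq_entry->entry))
                    __add_wait_queue(wq_head, wq_entry);    /* head, not tail */
            set_current_state(state);
            spin_unlock_irqrestore(&wq_head->lock, flags);
    }

Waking the most recently queued acceptor first favors the task whose stack and socket state are still cache-hot.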
Abel Wu
dc2cf4a06b
mm/slub.c: branch optimization in free slowpath
The two conditions are mutually exclusive, and the gcc compiler will
optimise this into an if-else-like pattern.  Given that the majority of
free_slowpath is free_frozen, let's provide a hint to the compiler.

Tests (perf bench sched messaging -g 20 -l 400000, executed 10x
after reboot) were run; the summarized results:

	un-patched	patched
max.	192.316		189.851
min.	187.267		186.252
avg.	189.154		188.086
stdev.	1.37		0.99

Signed-off-by: Abel Wu <wuyun.wu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hewenliang <hewenliang4@huawei.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Link: http://lkml.kernel.org/r/20200813101812.1617-1-wuyun.wu@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:37 +07:00
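
The hint itself is tiny; a self-contained sketch of the pattern (the real site is __slab_free()):

    #include <linux/compiler.h>

    static int free_slowpath_sketch(bool was_frozen)
    {
            if (likely(was_frozen))         /* the majority case above */
                    return 0;               /* free_frozen: nothing more to do */

            return 1;                       /* rare: partial-list handling */
    }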
Ritesh Harjani
62cda3812f
BACKPORT: ext4: optimize file overwrites
If the file already has its underlying blocks/extents allocated, we
don't need to start a journal txn and can directly return the
underlying mapping. Currently ext4_iomap_begin() is used by both the
DAX & DIO paths. We can check whether the write request is an
overwrite and then directly return the mapping information.

This can give a significant perf boost for multi-threaded writes,
especially random overwrites.
On a PPC64 VM with a simulated pmem (DAX) device, ~10x perf improvement
could be seen in random writes (overwrites), also because this
optimizes away the spinlock contention during jbd2 slab cache
allocation (jbd2_journal_handle). On an x86 VM, ~2x perf improvement
was observed.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/88e795d8a4d5cd22165c7ebe857ba91d68d8813e.1600401668.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:37 +07:00
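
A hedged sketch of the overwrite probe in ext4_iomap_begin() (the real patch differs in detail):

    static bool ext4_is_pure_overwrite(struct inode *inode, ext4_lblk_t lblk,
                                       unsigned int len)
    {
            struct ext4_map_blocks map = { .m_lblk = lblk, .m_len = len };
            int ret;

            /* NULL handle + no-create flags: a lookup only, no journal txn */
            ret = ext4_map_blocks(NULL, inode, &map, 0);

            /* mapped, written blocks covering the range => plain overwrite */
            return ret > 0 && (map.m_flags & EXT4_MAP_MAPPED) &&
                   map.m_len >= len;
    }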
Shachar Raindel
d870b8e400
f2fs: Fix deadlock between f2fs_quota_sync and block_operation
This deadlock is hitting Android users (Pixel 3/3a/4) with Magisk, due
to frequent umount/mount operations that trigger quota_sync, hitting
the race. See https://github.com/topjohnwu/Magisk/issues/3171 for
additional impact discussion.

In commit db6ec53b7e03, we added a semaphore to protect quota flags.
As part of this commit, we changed f2fs_quota_sync to call
f2fs_lock_op, in an attempt to prevent an AB/BA type deadlock with
quota_sem locking in block_operation.  However, rwsem in Linux is not
recursive. Therefore, the following deadlock can occur:

f2fs_quota_sync                    block_operation
---------------                    ---------------
down_read(cp_rwsem) // f2fs_lock_op
filemap_fdatawrite
f2fs_write_data_pages
...
                                   down_write(cp_rwsem) - marks rwsem as
                                                          "writer pending"
down_read_trylock(cp_rwsem) - fails as there is
                              a writer pending.
                              Code keeps on trying,
                              live-locking the filesystem.

We solve this by creating a new rwsem, used specifically to
synchronize this case, instead of attempting to reuse an existing
lock.

Signed-off-by: Shachar Raindel <shacharr@gmail.com>

Fixes: db6ec53b7e03 ("f2fs: add a rw_sem to cover quota flag changes")
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:37 +07:00
Jesse Chan
f76a504091
f2fs: Enlarge min_fsync_blocks to 20
In OPPO's kernel:
enlarge min_fsync_blocks to optimize performance
  - yanwu@TECH.Storage.FS.oF2FS, 2019/08/12

Huawei is also doing this in their production kernel.

If this optimization is good for them and shipped
with their devices, it should be good for us.

Signed-off-by: Jesse Chan <jc@linux.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:36 +07:00
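
min_fsync_blocks is an existing f2fs tunable (also exposed via sysfs); the change is plausibly just the default, sketched here against the mainline value:

    /* fs/f2fs/f2fs.h sketch: fsyncs dirtying up to this many blocks can
     * take the cheaper non-checkpoint path (exact semantics hedged) */
    #define DEF_MIN_FSYNC_BLOCKS    20      /* mainline default: 8 */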
Park Ju Hyung
f3e434ec21
f2fs: Reduce timeout for uncongestion
On high fs utilization, congestion is hit quite frequently and waiting
for a whopping 20 ms is too expensive, especially on critical paths.

Reduce it to an amount that is unlikely to affect UI rendering paths.

The new times are as follows:
  100 Hz  => 1 jiffy   (effective: 10 ms)
  250 Hz  => 2 jiffies (effective: 8 ms)
  300 Hz  => 2 jiffies (effective: 6 ms)
  1000 Hz => 6 jiffies (effective: 6 ms)

Co-authored-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Change-Id: I2978c7de07e6fa8d8261b532d5bc1325006433f9
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:36 +07:00
Danny Lin
8ab83a114f
f2fs: Demote GC thread to idle scheduler class
We don't want the background GC work causing UI jitter should it ever
collide with periods of user activity.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:36 +07:00
Park Ju Hyung
9923ec2d27
f2fs: Set ioprio of GC kthread to idle
GC should run as conservatively as possible to reduce latency spikes
for the user.

Setting ioprio to the idle class will allow the kernel to schedule the
GC thread's I/O so that it does not affect any other processes' I/O
requests.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:36 +07:00
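
Taken together with the previous commit, a hedged sketch of the demotion (the helper name is an assumption):

    #include <linux/ioprio.h>
    #include <linux/sched.h>

    static void f2fs_demote_gc_thread(struct task_struct *gc_task)
    {
            struct sched_param param = { .sched_priority = 0 };

            /* CPU: run only when nothing else wants the CPU */
            sched_setscheduler(gc_task, SCHED_IDLE, &param);

            /* I/O: idle class, so GC I/O yields to all other requests */
            set_task_ioprio(gc_task,
                            IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
    }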
Jerin Jacob
ed190c65a5
arm64: bpf: Optimize modulo operation
Optimize modulo operation instruction generation by
using single MSUB instruction vs MUL followed by SUB
instruction scheme.

Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:36 +07:00
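
The instruction-level effect, sketched for r = a % b in the arm64 JIT:

    /*
     * before:  udiv x9, xA, xB        ; x9 = a / b
     *          mul  x9, x9, xB        ; x9 = (a / b) * b
     *          sub  xA, xA, x9        ; r  = a - (a / b) * b
     *
     * after:   udiv x9, xA, xB        ; x9 = a / b
     *          msub xA, x9, xB, xA    ; r  = a - x9 * b, one fused op
     */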
Waiman Long
d8f8e9d93b
locking/osq: Use optimized spinning loop for arm64
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
using wfe for spinlock that can boost performance of sibling threads
by putting the current cpu to a wait state that is broken only when
the monitored variable changes or an external event happens.

OSQ has a more complicated spinning loop. Besides the lock value, it
also checks for need_resched() and vcpu_is_preempted(). The check for
need_resched() is not a problem as it is only set by the tick interrupt
handler. That will be detected by the spinning cpu right after iret.

The vcpu_is_preempted() check, however, is a problem as changes to the
preempt state of the previous node will not affect the wait state. For
ARM64, vcpu_is_preempted is not currently defined and so is a no-op.
Will has indicated that he is planning to para-virtualize wfe instead
of defining vcpu_is_preempted for PV support. So just add a comment in
arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted()
should not be defined as suggested.

On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before patch were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s

After patch, the numbers were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s

So there was about 20% performance improvement.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200113150735.21956-1-longman@redhat.com
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:35 +07:00
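
The resulting wait loop in osq_lock(), close to the upstream change:

    /*
     * Wait to acquire the lock or cancellation. need_resched() arrives
     * with an IPI, which wakes a monitor-based wait; vcpu_is_preempted()
     * relies on polling, hence the arm64 no-op noted above.
     */
    if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
                              vcpu_is_preempted(node_cpu(node->prev))))
            return true;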
Robin Murphy
0fe7b18bef
arm64: Implement optimised checksum routine
Apparently there exist certain workloads which rely heavily on software
checksumming, for which the generic do_csum() implementation becomes a
significant bottleneck. Therefore let's give arm64 its own optimised
version - for ease of maintenance this foregoes assembly or intrinsics,
and is thus not actually arm64-specific, but does rely heavily on C
idioms that translate well to the A64 ISA and the typical load/store
capabilities of most ARMv8 CPU cores.

The resulting increase in checksum throughput scales nicely with buffer
size, tending towards 4x for a small in-order core (Cortex-A53), and up
to 6x or more for an aggressive big core (Ampere eMAG).

Reported-by: Lingyan Huang <huanglingyan2@huawei.com>
Tested-by: Lingyan Huang <huanglingyan2@huawei.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:35 +07:00
Will Deacon
822ccebefc
locking/barriers: Introduce smp_cond_load_relaxed() and atomic_cond_read_relaxed()
Whilst we currently provide smp_cond_load_acquire() and
atomic_cond_read_acquire(), there are cases where the ACQUIRE semantics are
not required because of a subsequent fence or release operation once the
conditional loop has exited.

This patch adds relaxed versions of the conditional spinning primitives
to avoid unnecessary barrier overhead on architectures such as arm64.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: kdrag0n <dragon@khronodragon.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:35 +07:00
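
The generic fallback spins with cpu_relax(); arm64 overrides it to wait on WFE via __cmpwait. The generic form, as in include/asm-generic/barrier.h:

    #ifndef smp_cond_load_relaxed
    #define smp_cond_load_relaxed(ptr, cond_expr) ({                \
            typeof(ptr) __PTR = (ptr);                              \
            typeof(*ptr) VAL;                                       \
            for (;;) {                                              \
                    VAL = READ_ONCE(*__PTR);                        \
                    if (cond_expr)                                  \
                            break;                                  \
                    cpu_relax();                                    \
            }                                                       \
            VAL;                                                    \
    })
    #endif

    /* and the atomic_t wrapper: */
    #define atomic_cond_read_relaxed(v, c) \
            smp_cond_load_relaxed(&(v)->counter, (c))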
Robin Murphy
fe31af8684
arm64: Select ARCH_HAS_FAST_MULTIPLIER
It is probably safe to assume that all Armv8-A implementations have a
multiplier whose efficiency is comparable or better than a sequence of
three or so register-dependent arithmetic instructions. Select
ARCH_HAS_FAST_MULTIPLIER to get ever-so-slightly nicer codegen in the
few dusty old corners which care.

In a contrived benchmark calling hweight64() in a loop, this does indeed
turn out to be a small win overall, with no measurable impact on
Cortex-A57 but about 5% performance improvement on Cortex-A53.

Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:35 +07:00