812291 Commits

Sebastian Andrzej Siewior
dda747ed52 sched/rt: Don't try push tasks if there are none.
I have an RT task X at a high priority and cyclictest on each CPU with
lower priority than X's. If X is active and each CPU wakes its own
cyclictest thread, it ends in a lengthy rto_push storm.
A random CPU determines via balance_rt() that the CPU on which X is
running needs to push tasks. X has the highest priority, cyclictest is
next in line so there is nothing that can be done since the task with
the higher priority is not touched.

tell_cpu_to_push() increments rto_loop_next and schedules
rto_push_irq_work_func() on X's CPU. The other CPUs also increment the
loop counter and do the same. Once rto_push_irq_work_func() is active, it
does nothing because it has _no_ pushable tasks on its runqueue. It then
checks rto_next_cpu() and decides to queue irq_work on the local CPU
because another CPU requested a push by incrementing the counter.

I have traces where ~30 CPUs request this ~3 times each before it
finally ends. This greatly increases X's runtime while X isn't making
much progress.

Teach rto_next_cpu() to only return CPUs which also have tasks on their
runqueue which can be pushed away. This does not reduce the
tell_cpu_to_push() invocations (rto_loop_next counter increments) but
reduces the number of issued rto_push_irq_work_func() invocations when
nothing can be done. As a result, the overloaded CPU is blocked less often.

There are still cases where the "same job" is repeated several times
(for instance the current CPU needs to resched but didn't yet because
the irq-work is repeated a few times and so the old task remains on the
CPU), but the majority of requests end in tell_cpu_to_push() before an IPI
is issued.

Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20230801152648._y603AS_@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Change-Id: I51731f3bee90080170e45a548282cbd0a3ec2e85
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 22:59:33 -03:00
Sultan Alsawaf
b199b86873 qos: Only wake idle CPUs which are affected by a request change
The pm_qos idle wake-up mechanism currently wakes up *all* idle CPUs when
there's a pm_qos request change, instead of just the CPUs which are
affected by the change. This is horribly suboptimal and increases power
consumption by needlessly waking idled CPUs.

Additionally, pm_qos may kick CPUs which aren't even idle, since
wake_up_all_idle_cpus() only checks if a CPU is running the idle task,
which says nothing about whether or not the CPU is really in an idle state.

Optimize the pm_qos wake-ups by only sending IPIs to CPUs that are idle,
and by using arch_send_wakeup_ipi_mask() instead of wake_up_if_idle()
which is used under the hood in wake_up_all_idle_cpus(). Using IPI_WAKEUP
instead of IPI_RESCHEDULE, which is what wake_up_if_idle() uses behind the
scenes, has the benefit of doing zero work upon receipt of the IPI;
IPI_WAKEUP is designed purely for sending an IPI without a payload.

Determining which CPUs are idle is done efficiently with an atomic bitmask
instead of using the wake_up_if_idle() API, which checks the CPU's runqueue
in an RCU read-side critical section and under a spin lock. Not very
efficient in comparison to a simple, atomic bitwise operation. A cpumask
isn't needed for this because NR_CPUS is guaranteed to fit within a word.

CPUs are marked as idle as soon as IRQs are disabled in the idle loop,
since any IPI sent after that point will cause the CPU's idle attempt to
immediately exit (like when executing the wfi instruction). CPUs are marked
as not-idle as soon as they wake up in order to avoid sending redundant
IPIs to CPUs that are already awake.

Change-Id: I04c9e2bd9317357e16d8184a104fe603d0d2dab2
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 22:59:27 -03:00
Sebastian Andrzej Siewior
35314d4a2d mm/mglru: Move lru_gen_add_mm() out of IRQ-off region
lru_gen_add_mm() has been added within an IRQ-off region in the commit
mentioned below.  The other invocations of lru_gen_add_mm() are not within
an IRQ-off region.

The invocation within IRQ-off region is problematic on PREEMPT_RT because
the function is using a spin_lock_t which must not be used within
IRQ-disabled regions.

The other invocations of lru_gen_add_mm() occur while
task_struct::alloc_lock is acquired.  Move lru_gen_add_mm() after
interrupts are enabled and before task_unlock().

Link: https://lkml.kernel.org/r/20221026134830.711887-1-bigeasy@linutronix.de
Fixes: bd74fdaea1460 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Change-Id: I63ef837e43e727fd9223ad0e30170465b826a4bb
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 21:18:27 -03:00
Yu Zhao
24cc942015 mm/mglru: Don't sync disk for each aging cycle
wakeup_flusher_threads() was added under the assumption that if a system
runs out of clean cold pages, it might want to write back dirty pages more
aggressively so that they can become clean and be dropped.

However, doing so can breach the rate limit a system wants to impose on
writeback, resulting in early SSD wearout.

Link: https://lkml.kernel.org/r/YzSiWq9UEER5LKup@google.com
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Reported-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Change-Id: Ib4def4286264de926b11ec5247185edc3a780619
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 21:18:18 -03:00
Richard Raya
11be2baf36 zram: Lower disksize value to 2GB
Change-Id: I0641df2b3bba4b5809c21f020b47ed55c8b7d405
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 21:18:11 -03:00
Richard Raya
c6f06373e4 mm: Make swappiness value read-only
Change-Id: I86ff919ab356dbdf4ab31927232cefb702060b87
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 21:18:07 -03:00
Richard Raya
18e5e63668 defconfig: Regenerate full defconfig
Change-Id: Ia50610c884e1c41a87b27c771987db7ff51f596b
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 21:18:05 -03:00
Richard Raya
d97f51cb57 mm: Revert some hacks
This reverts commits:
- 0fc9fbd21297173aa822f97fe33a481053cb96ec [mm + sysctl: tune swappiness and make some values read only]
- 94181990a4ea1a20bb8bf443f3fbe500d05901c3 [mm: Import oplus memory management hacks]
- 97bdd381c8292d43e68ff55bd08767db17e62810 [mm: Set swappiness for CONFIG_INCREASE_MAXIMUM_SWAPPINESS=y case]
- fa8d2aa0e20da6b943157f6ab58068bd80d68920 [mm: move variable under a proper #ifdef]
- f9daeaa423b745b2c2c34a6fb5ac6b69daf746c4 [mm: merge Samsung mm hacks]
- 1a460a832c9c6550f5cbe32dca4c15cf89806b57 [mm: Make watermark_scale_factor read-only]
- 963a3bfe3352b45ea21c58d53055689e46d81eeb [mm: Tune parameters for Android]

Change-Id: I70495ca93a05384a2d7bc2498fd2d56bd9928390
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 21:17:59 -03:00
Richard Raya
591785ca22 defconfig: Enable CASS
Change-Id: I578ccd72188f8f619712c8f0fbc895188fe97e24
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 20:27:40 -03:00
Sultan Alsawaf
c752733d47 sched/cass: Don't pack tasks with uclamp boosts below minimum CPU capacity
To save energy, CASS may prefer non-idle CPUs for uclamp-boosted tasks in
order to pack them onto a single performance domain rather than spreading
them across multiple performance domains. This way, it is more likely for
only one performance domain to be boosted to a higher P-state when there is
more than one uclamp-boosted task running.

However, when a task has a uclamp boost value that is below a CPU's minimum
capacity, it is nearly the same thing as not having a uclamp boost at all.

In spite of that, CASS may still prefer non-idle CPUs for tasks with bogus
uclamp boost values. This is not only worse for latency, but also energy
efficiency since the load on the CPU is spread less evenly as a result.

Therefore, don't pack tasks with uclamp boosts below a CPU's minimum
configured capacity, since such tasks do not force the CPU to run at a
higher P-state.

Change-Id: Ide8f62162723dc0c509fa5cccf92b8124f20f4aa
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 20:27:40 -03:00
Sultan Alsawaf
aaf602d464 arch_topology: Introduce minimum frequency scale
The scheduler is unaware of the applied min_freq limit to a CPU, which is
useful information when predicting the frequency a CPU will run at for
energy efficiency purposes.

Export this information via arch_scale_min_freq_capacity() and wire it up
for arm64.

Change-Id: Icdff7628c095185280e95dd965d497e6f740c871
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 20:27:26 -03:00
Richard Raya
a2d50bb156 sched/cass: Drop IRQs util accounting
Change-Id: Ib4dc068d54b1a6c8c98fb8215dcf8002be05cef1
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 20:27:11 -03:00
Richard Raya
45b9a00b85 Revert "defconfig: Enable IRQ time accounting"
This reverts commit 0a0414f8a65655f816f5e04cb997ef8738106c05.

Change-Id: Ia14ba25e11bec4927c60502ee8c4dad40b71cf24
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 19:59:22 -03:00
Richard Raya
603c5804f5 Revert "sched: Backport IRQ utilization tracking"
This reverts commit d6e561f94c2a1c83186116d4e35b8300a41d6a22.

Change-Id: I712d1a2c14b45ab522a815c5decd60b4389633e0
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-08 19:59:22 -03:00
Richard Raya
5460c500bc qcacld-3.0: Remove unused variables
Change-Id: If1b005827b2cf9b1ceb8ce35a5c119bc46dca425
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-06 02:58:44 -03:00
Juhyung Park
238de8f978 qcacld-3.0: Fix conditional umac_stop_complete_cb() implementation
After turning the location services off (to actually turn off or on the
Wi-Fi device), users started reporting a 30-second delay when their
PCIe Wi-Fi device is brought back up.

This is due to Qualcomm's improper conditional umac_stop_complete_cb()
implementation that simply neglects `qdf_event_set(stop_evt)` when
QDF_ENABLE_TRACING is not defined.

As our local modification disables QDF_ENABLE_TRACING, this
implementation bug triggers a 30-second delayed reset.

All functions/APIs used within umac_stop_complete_cb() are available
without QDF_ENABLE_TRACING defined, so remove the conditional
implementation.

Fixes: bcd3d019d8e1 ("qcacld-3.0: Execute sme_stop and mac_stop in mc thread context")
Change-Id: I6055404c5df4e0232ea344e1c2669871e61cb9a7
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-06 02:55:17 -03:00
Richard Raya
74413c1378 build.sh: Bump Slim LLVM to 20.1.0-rc1
Change-Id: I78f8d8b8703c3fc5f54b023253c6e8c46c261f81
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-04 23:41:16 -03:00
Richard Raya
95dffb0397 Revert "defconfig: Enable CONFIG_FAIR_GROUP_SCHED"
This reverts commit 09427173765ecb63836d49e4608ee2b65eb947df.

Change-Id: Iab647d5a3fc56a6e84eaded70252a0d736a5cd88
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-03 23:27:23 -03:00
Richard Raya
eabaff3f40 Revert "sched/fair: Don't allow boosted tasks to be migrated to small cores"
This reverts commit d0661f464d00db0cce80068cb1ea3a3d462b2bf9.

Change-Id: I8a1c946ea23485d0a1aac6755a741d24f2c03ca6
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-02-03 23:27:13 -03:00
Richard Raya
32dbe1c94e defconfig: Disable SBalance
Change-Id: I7d8f5192b5540cc57d2da6f95833a5902ad34295
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-31 20:09:31 -03:00
Richard Raya
65fbbc2a2d qcacld-3.0: Only enable IRQ affinity if SBalance is not enabled
SBalance manages IRQ affinity automatically. However, if SBalance is
not enabled, we restore HIF_IRQ_AFFINITY to maintain optimized
IRQ distribution.

Change-Id: I0f99803959fc7fe080184ea4dcea7d16ab70997a
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-31 20:07:53 -03:00
Richard Raya
485a90ebf2 irq: Only block userspace IRQ affinities if SBalance is enabled
If SBalance is enabled, IRQ affinity should be managed automatically.
Prevent userspace from modifying it in this case, but allow changes
when SBalance is disabled.

Change-Id: Ibf37bf258a2358ad8b982704e8f035bd9739866b
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-31 20:07:33 -03:00
Richard Raya
0636ec7ac0 Revert "msm-4.14: Drop sched_migrate_to_cpumask"
This reverts commit 602aa3bba862bb7ff51bdf2c9303db4b057f5353.

Change-Id: I4517bdb857e7e1ab02749596dedcaa8220dc040a
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-31 20:07:33 -03:00
Richard Raya
e72721b923 Revert "msm-4.14: Drop perf-critical API"
This reverts commit 1b396d869a6da9fa864d4de8235f2d0afc7164c1.

Change-Id: I13b4629e9aefcd23da2e58ef534c1057f81059cd
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-31 20:07:32 -03:00
Richard Raya
0979e5af82 defconfig: Disable CASS
Change-Id: I09417d8d3b19dabf4fd9a3a020db56f4b0170115
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-31 20:07:31 -03:00
Richard Raya
6fac6e2200 defconfig: Bump SBalance polling interval to 10ms
Change-Id: I7359b8d5b59809712929ec2271ec71602fe669f6
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-25 13:15:49 -03:00
Richard Raya
74b14355a2 defconfig: Bump boost durations to 100ms
Change-Id: I15bc0cffa7427c5f3cbbf160cf313c26e071c223
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-25 03:57:43 -03:00
Juhyung Park
44682225b6 simple_lmk: Lower vmpressure trigger threshold
Change-Id: Ifa5949aa44c5f6ceb8001c3105f1b3cf92fbefd5
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-25 03:57:28 -03:00
Richard Raya
0096c8c85e Revert "defconfig: Disable MGLRU"
This reverts commit 4043298f9af526f1703b88c3dfbeae7a16e88425.

Change-Id: Idd5ef50ccc82197606c2a1851072d9056308fb19
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-25 03:57:27 -03:00
Richard Raya
299d88e36b Revert "defconfig: Bump SLMK timeout to 250ms"
This reverts commit 06287f3167d25f3328e90ba5bf86de28761c0c1b.

Change-Id: I0fe1113f5369c4c9bfe93418b8c68529baa51b04
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-25 03:57:26 -03:00
Richard Raya
00faff4c72 Revert "simple_lmk: Sleep after killing victims"
This reverts commit 318d4145ba4f21fe23bd96b998c0a170ab5a26b6.

Change-Id: I14279c6c33a7da292376204c6824b8178b74fd6a
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-25 03:57:25 -03:00
Davidlohr Bueso
8013fc0dd7 ipc/mqueue: Optimize msg_get()
Our msg priorities became an rbtree as of d6629859b36d ("ipc/mqueue:
improve performance of send/recv").  However, consuming a msg in
msg_get() remains logarithmic (still being better than the case before
of course).  By applying well known techniques to cache pointers we can
have the node with the highest priority in O(1), which is especially nice
for the rt cases.  Furthermore, some callers can call msg_get() in a
loop.

A new msg_tree_erase() helper is also added to encapsulate the tree
removal and node_cache game.  Passes ltp mq testcases.

Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Change-Id: I234983728fbc30aba482a6b58b2a70b1c38f3145
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:58 -03:00
Carlos Llamas
80305660a1 binder: Fix hang of unregistered readers
commit 31643d84b8c3d9c846aa0e20bc033e46c68c7e7d upstream.

With the introduction of binder_available_for_proc_work_ilocked() in
commit 1b77e9dcc3da ("ANDROID: binder: remove proc waitqueue") a binder
thread can only "wait_for_proc_work" after its thread->looper has been
marked as BINDER_LOOPER_STATE_{ENTERED|REGISTERED}.

This means an unregistered reader risks waiting indefinitely for work
since it never gets added to the proc->waiting_threads. If there are no
further references to its waitqueue either, the task will hang. The same
applies to readers using the (e)poll interface.

I couldn't find the rationale behind this restriction. So this patch
restores the previous behavior of allowing unregistered threads to
"wait_for_proc_work". Note that an error message for this scenario,
which had previously become unreachable, is now re-enabled.

Fixes: 1b77e9dcc3da ("ANDROID: binder: remove proc waitqueue")
Cc: stable@vger.kernel.org
Cc: Martijn Coenen <maco@google.com>
Cc: Arve Hjønnevåg <arve@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20240711201452.2017543-1-cmllamas@google.com
Change-Id: I72954fb5fa749c7e0694fd036ed6862cff38cdb8
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:18 -03:00
YueHaibing
2217ef92de fq_codel: Remove set but not used variables 'prev_ecn_mark' and 'prev_drop_count'
Fixes gcc '-Wunused-but-set-variable' warning:

net/sched/sch_fq_codel.c: In function fq_codel_dequeue:
net/sched/sch_fq_codel.c:288:23: warning: variable prev_ecn_mark set but not used [-Wunused-but-set-variable]
net/sched/sch_fq_codel.c:288:6: warning: variable prev_drop_count set but not used [-Wunused-but-set-variable]

They have been unused since commit 77ddaff218fc ("fq_codel: Kill
useless per-flow dropped statistic").

Change-Id: I09426b4d4b41b9302e534e41fdcab109ef55c571
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:18 -03:00
Dave Taht
f32b57998c fq_codel: Kill useless per-flow dropped statistic
It is almost impossible to get anything other than a 0 out of the
flow->dropped statistic with a tc class dump, as it resets to 0
on every round.

It also conflates ecn marks with drops.

It would have been useful had it kept a cumulative drop count, but
it doesn't. This patch doesn't change the API, it just stops
tracking a stat and state that is impossible to measure and nobody
uses.

Change-Id: Ibac1a0fd6825aa5bf862ec7cf20227de7a939ec9
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:18 -03:00
Dave Taht
5749c25962 fq_codel: Increase fq_codel count in the bulk dropper
In the field fq_codel is often used with a smaller memory or
packet limit than the default, and when the bulk dropper is hit,
the drop pattern bifurcates into one that more slowly increases
the codel drop rate and hits the bulk dropper more often than it should.

The scan through the 1024 queues happens more often than it needs to.

This patch increases the codel count in the bulk dropper, but
does not change the drop rate there, relying on the next codel round
to deliver the next packet at the original drop rate
(after that burst of loss), then escalate to a higher signaling rate.

Change-Id: I47562b843bb86abed0b502cea62368a1195eeb0f
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:18 -03:00
Jiong Wang
09c7663480 lib/reciprocal_div: Implement the improved algorithm
The newly added "reciprocal_value_adv" implements the advanced version of
the algorithm described in Figure 4.2 of the paper, except when
"divisor > (1U << 31)", whose ceil(log2(d)) result will be 32, which then
requires a u128 divide on the host. The exception case can be easily
handled before calling "reciprocal_value_adv".

The advanced version requires more complex calculation to get the
reciprocal multiplier and other control variables, but then could reduce
the required emulation operations.

It makes no sense to use this advanced version for host divide emulation;
those extra complexities for calculating the multiplier etc. could
completely outweigh our savings on emulation operations.

However, it makes sense to use it for JIT divide code generation (for
example eBPF JIT backends) for which we are willing to trade performance of
JITed code with that of host. As shown by the following pseudo code, the
required emulation operations could go down from 6 (the basic version) to 3
or 4.

To use the result of "reciprocal_value_adv", suppose we want to calculate
n/d, the C-style pseudo code will be the following, it could be easily
changed to real code generation for other JIT targets.

  struct reciprocal_value_adv rvalue;
  u8 pre_shift, exp;

  // handle exception case.
  if (d >= (1U << 31)) {
    result = n >= d;
    return;
  }
  rvalue = reciprocal_value_adv(d, 32);
  exp = rvalue.exp;
  if (rvalue.is_wide_m && !(d & 1)) {
    // floor(log2(d & (2^32 -d)))
    pre_shift = fls(d & -d) - 1;
    rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift);
  } else {
    pre_shift = 0;
  }

  // code generation starts.
  if (d == 1U << exp) {
    result = n >> exp;
  } else if (rvalue.is_wide_m) {
    // pre_shift must be zero when reached here.
    t = (n * rvalue.m) >> 32;
    result = n - t;
    result >>= 1;
    result += t;
    result >>= rvalue.sh - 1;
  } else {
    if (pre_shift)
      result = n >> pre_shift;
    result = ((u64)result * rvalue.m) >> 32;
    result >>= rvalue.sh;
  }

Change-Id: I54385f0df42aa43355d940d20d6818d2fb3197d9
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:18 -03:00
Florian La Roche
06381819ff lib/int_sqrt: Fix int_sqrt64() for very large numbers
If an input number x for int_sqrt64() has the highest bit set, then
fls64(x) is 64.  (1UL << 64) is an overflow and breaks the algorithm.

Subtracting 1 is a better guess for the initial value of m anyway and
that's what also done in int_sqrt() implicitly [*].

[*] Note how int_sqrt() uses __fls() with two underscores, which already
    returns the proper raw bit number.

    In contrast, int_sqrt64() used fls64(), and that returns bit numbers
    illogically starting at 1, because of error handling for the "no
    bits set" case. Will points out that the bug is probably due to a
    copy-and-paste error from the regular int_sqrt() case.

Change-Id: I5be5be3e03ddbe68cc8025a64698bbb49c57c3a5
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Florian La Roche <Florian.LaRoche@googlemail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
Crt Mori
f111e963e2 lib/int_sqrt: Add strongly typed 64bit int_sqrt
There is no option to perform a 64-bit integer sqrt on a 32-bit platform.
The added, strongly typed int_sqrt64() enables 64-bit calculations to
be performed on 32-bit platforms. Using the same algorithm as int_sqrt()
with strong typing provides enough precision also on 32-bit platforms,
but it sacrifices some performance. In case values are smaller than
ULONG_MAX, the standard int_sqrt() is used for the calculation to
maximize performance thanks to more native calculations.

Change-Id: I8b22ef3fc9e63ea74fb1df14115fc374170549c3
Acked-by: Joe Perches <joe@perches.com>
Signed-off-by: Crt Mori <cmo@melexis.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
Peter Zijlstra
83c2c2b980 lib/int_sqrt: Adjust comments
Our current int_sqrt() is neither rough nor any kind of approximation; it
calculates the exact value of floor(sqrt(x)).  Document this.

Link: http://lkml.kernel.org/r/20171020164645.001652117@infradead.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Anshul Garg <aksgarg1989@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Michael Davidson <md@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Change-Id: Iea660f36312f879010d16028bc21b6bb50905078
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
Andy Shevchenko
4d011c9c8c lib/sort: Move swap, cmp and cmp_r function types for wider use
The function types for the swap, cmp and cmp_r functions are already
in use by modules.

Move them to types.h so that everybody in the kernel will be able to use
generic types instead of custom ones.

This also makes the comment in bsearch() later on more sensible.

Link: http://lkml.kernel.org/r/20191007135656.37734-1-andriy.shevchenko@linux.intel.com

Change-Id: I4848ccb09bac73774e2b0071eb767d596e4f6f90
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
Rasmus Villemoes
829cc32160 lib/sort: Implement sort() variant taking context argument
Our list_sort() utility has always supported a context argument that
is passed through to the comparison routine. Now there's a use case
for the same thing in sort().

This implements sort_r by simply extending the existing sort function
in the obvious way. To avoid code duplication, we want to implement
sort() in terms of sort_r(). The naive way to do that is

static int cmp_wrapper(const void *a, const void *b, const void *ctx)
{
  int (*real_cmp)(const void*, const void*) = ctx;
  return real_cmp(a, b);
}

sort(..., cmp) { sort_r(..., cmp_wrapper, cmp) }

but this would do two indirect calls for each comparison. Instead, do
as is done for the default swap functions - that only adds a cost of a
single easily predicted branch to each comparison call.

Aside from introducing support for the context argument, this also
serves as preparation for patches that will eliminate the indirect
comparison calls in common cases.

Requested-by: Boris Brezillon <boris.brezillon@collabora.com>

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Philipp Zabel <p.zabel@pengutronix.de>
Change-Id: I3ad240253956f6ec3f41833fc9ddefa5749fbc58
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
Randy Dunlap
ea7577a782 lib/sort: Fix kernel-doc notation warnings
Fix kernel-doc notation in lib/sort.c by using correct function parameter
names.

  lib/sort.c:59: warning: Excess function parameter 'size' description in 'swap_words_32'
  lib/sort.c:83: warning: Excess function parameter 'size' description in 'swap_words_64'
  lib/sort.c:110: warning: Excess function parameter 'size' description in 'swap_bytes'

Link: http://lkml.kernel.org/r/60e25d3d-68d1-bde2-3b39-e4baa0b14907@infradead.org
Fixes: 37d0ec34d111a ("lib/sort: make swap functions more generic")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: George Spelvin <lkml@sdf.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Change-Id: I40d3917918ee9a73ac983ecaf4d62abcd924a45f
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
George Spelvin
26e9d7f193 lib/sort: Avoid indirect calls to built-in swap
Similar to what's being done in the net code, this takes advantage of
the fact that most invocations use only a few common swap functions, and
replaces indirect calls to them with (highly predictable) conditional
branches.  (The downside, of course, is that if you *do* use a custom
swap function, there are a few extra predicted branches on the code
path.)

This actually *shrinks* the x86-64 code, because it inlines the various
swap functions inside do_swap, eliding function prologues & epilogues.

x86-64 code size 767 -> 703 bytes (-64)

Link: http://lkml.kernel.org/r/d10c5d4b393a1847f32f5b26f4bbaa2857140e1e.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Change-Id: I4f4850f79f2a1596ec4d19780f329cd073c4f11c
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
George Spelvin
c93eb1b0d2 lib/sort: Use more efficient bottom-up heapsort variant
This uses fewer comparisons than the previous code (approaching half as
many for large random inputs), but produces identical results; it
actually performs the exact same series of swap operations.

Specifically, it reduces the average number of compares from
  2*n*log2(n) - 3*n + o(n)
to
    n*log2(n) + 0.37*n + o(n).

This is still 1.63*n worse than glibc qsort() which manages n*log2(n) -
1.26*n, but at least the leading coefficient is correct.

Standard heapsort, when sifting down, performs two comparisons per
level: one to find the greater child, and a second to see if the current
node should be exchanged with that child.

Bottom-up heapsort observes that it's better to postpone the second
comparison and search for the leaf where -infinity would be sent to,
then search back *up* for the current node's destination.

Since sifting down usually proceeds to the leaf level (that's where half
the nodes are), this does O(1) second comparisons rather than log2(n).
That saves a lot of (expensive since Spectre) indirect function calls.

The one time it's worse than the previous code is if there are large
numbers of duplicate keys, when the top-down algorithm is O(n) and
bottom-up is O(n log n).  For distinct keys, it's provably always
better, doing 1.5*n*log2(n) + O(n) in the worst case.

(The code is not significantly more complex.  This patch also merges the
heap-building and -extracting sift-down loops, resulting in a net code
size savings.)

x86-64 code size 885 -> 767 bytes (-118)

(I see the checkpatch complaint about "else if (n -= size)".  The
alternative is significantly uglier.)

Link: http://lkml.kernel.org/r/2de8348635a1a421a72620677898c7fd5bd4b19d.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Change-Id: I370b088649c56ae9a0d8040c30ed5e13b847cc7c
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
George Spelvin
f00f930f82 lib/sort: Make swap functions more generic
Patch series "lib/sort & lib/list_sort: faster and smaller", v2.

Because CONFIG_RETPOLINE has made indirect calls much more expensive, I
thought I'd try to reduce the number made by the library sort functions.

The first three patches apply to lib/sort.c.

Patch #1 is a simple optimization.  The built-in swap has special cases
for aligned 4- and 8-byte objects.  But those are almost never used;
most calls to sort() work on larger structures, which fall back to the
byte-at-a-time loop.  This generalizes them to aligned *multiples* of 4
and 8 bytes.  (If nothing else, it saves an awful lot of energy by not
thrashing the store buffers as much.)

Patch #2 grabs a juicy piece of low-hanging fruit.  I agree that a nice,
simple, solid heapsort is preferable to more complex algorithms (sorry,
Andrey), but it's possible to implement heapsort with far fewer
comparisons (50% asymptotically, 25-40% reduction for realistic sizes)
than the way it's been done up to now.  And with some care, the code
ends up smaller, as well.  This is the "big win" patch.

Patch #3 adds the same sort of indirect call bypass that has been added
to the net code of late.  The great majority of the callers use the
builtin swap functions, so replace the indirect call to sort_func with a
(highly predictable) series of if() statements.  Rather surprisingly,
this decreased code size, as the swap functions were inlined and their
prologue & epilogue code eliminated.
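The patch #3 idea can be sketched like this (a minimal, self-contained
illustration; the names and the single built-in swap shown here are
placeholders, not the kernel's actual helpers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*swap_func_t)(void *a, void *b, size_t size);

/* Byte-at-a-time built-in swap (the common fallback). */
static void swap_bytes(void *a, void *b, size_t size)
{
	uint8_t *pa = a, *pb = b;

	while (size--) {
		uint8_t t = *pa;
		*pa++ = *pb;
		*pb++ = t;
	}
}

/* Callers that pass the built-in swap get a direct, well-predicted
 * call; only a caller-supplied custom swap pays for the indirect
 * (retpoline-protected) call. */
static void do_swap(void *a, void *b, size_t size, swap_func_t swap)
{
	if (swap == swap_bytes)
		swap_bytes(a, b, size);		/* direct call */
	else
		swap(a, b, size);		/* rare: indirect call */
}
```

Since the compiler sees the direct calls, it can inline the built-in
swaps into do_swap() and drop their prologue/epilogue, which is why the
code size went down rather than up.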

lib/list_sort.c is a bit trickier, as merge sort is already close to
optimal, and we don't want to introduce triumphs of theory over
practicality like the Ford-Johnson merge-insertion sort.

Patch #4, without changing the algorithm, chops 32% off the code size
and removes the part[MAX_LIST_LENGTH+1] pointer array (and the
corresponding upper limit on efficiently sortable input size).

Patch #5 improves the algorithm.  The previous code is already optimal
for power-of-two (or slightly smaller) size inputs, but when the input
size is just over a power of 2, there's a very unbalanced final merge.

There are, in the literature, several algorithms which solve this, but
they all depend on the "breadth-first" merge order which was replaced by
commit 835cc0c8477f with a more cache-friendly "depth-first" order.
Some hard thinking came up with a depth-first algorithm which defers
merges as little as possible while avoiding bad merges.  This saves
0.2*n compares, averaged over all sizes.

The code size increase is minimal (64 bytes on x86-64, reducing the net
savings to 26%), but the comments expanded significantly to document the
clever algorithm.

TESTING NOTES: I have some ugly user-space benchmarking code which I
used for testing before moving this code into the kernel.  Shout if you
want a copy.

I'm running this code right now, with CONFIG_TEST_SORT and
CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since the last
round of minor edits to quell checkpatch.  I figure there will be at
least one round of comments and final testing.

This patch (of 5):

Rather than having special-case swap functions for 4- and 8-byte
objects, special-case aligned multiples of 4 or 8 bytes.  This speeds up
most users of sort() by avoiding fallback to the byte copy loop.

Despite what ca96ab859ab4 ("lib/sort: Add 64 bit swap function") claims,
very few users of sort() sort pointers (or pointer-sized objects); most
sort structures containing at least two words.  (E.g.
drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte struct
acpi_fan_fps.)
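A sketch of the generalization (the name is illustrative, not
necessarily what lib/sort.c calls it): instead of a special case for
size == 4 only, handle any object whose size is a multiple of 4, one
word at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap two objects of size n bytes, where n is a nonzero multiple of 4
 * and both pointers are 4-byte aligned. */
static void swap_words_32(void *a, void *b, size_t n)
{
	uint32_t *pa = a, *pb = b;

	do {
		uint32_t t = *pa;
		*pa++ = *pb;
		*pb++ = t;
	} while (n -= 4);
}
```

A 40-byte struct such as acpi_fan_fps now takes ten 4-byte swaps instead
of forty 1-byte ones, which is where the store-buffer savings come from.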

The functions also got renamed to reflect the fact that they support
multiple words.  In the great tradition of bikeshedding, the names were
by far the most contentious issue during review of this patch series.

x86-64 code size 872 -> 886 bytes (+14)

With feedback from Andy Shevchenko, Rasmus Villemoes and Geert
Uytterhoeven.

Link: http://lkml.kernel.org/r/f24f932df3a7fa1973c1084154f1cea596bcf341.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Change-Id: I9f21e6eb4bcacf83d40cef3637a492b19db501fd
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
John Galt
bb664c8510 msm: kgsl: Make mem workqueue freezable
Freezing during no interactivity may benefit power

Change-Id: I685806f9d39da1f3523ba70590797de926840e18
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
Sultan Alsawaf
103e03a5c4 drm/msm: Reduce latency while completing non-blocking commits
In order to complete a commit, we must first wait for the previous
commit to finish up. This is done by sleeping, during which time the CPU
can enter a deep idle state and take a while to finish processing the
commit after the wait is over. We can alleviate this by optimistically
assuming that the kthread this commit worker is running on won't migrate
to another CPU partway through. We only do this for the non-blocking
case where the commit completion is done in an asynchronous worker
because the generic DRM code already does this for atomic ioctls.

Change-Id: I55b822211b91f4a31c2bc6e65d7b31989e56aa7d
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 01:13:17 -03:00
Sultan Alsawaf
9be57f3582 qos: Execute notifier callbacks atomically
Allowing the pm_qos notifier callbacks to execute without holding
pm_qos_lock can cause the callbacks to misbehave, e.g. the cpuidle
callback could erroneously send more IPIs than necessary.

Fix this by executing the pm_qos callbacks while pm_qos_lock is held.
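The shape of the fix can be sketched in userspace (pm_qos itself is
kernel code; the struct, field, and function names here are illustrative
only): the callbacks run while the lock protecting the qos value is
still held, so each callback observes exactly the value it is being
notified about.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_NOTIFIERS 8

typedef void (*notifier_fn)(int value);

struct qos_req {
	pthread_mutex_t lock;
	int value;
	notifier_fn notifiers[MAX_NOTIFIERS];
	int n_notifiers;
};

static int last_seen = -1;

static void record_notifier(int value)
{
	last_seen = value;
}

static void qos_update(struct qos_req *q, int new_value)
{
	pthread_mutex_lock(&q->lock);
	q->value = new_value;
	/* Callbacks run under the lock: no other updater can change
	 * q->value between the write above and the notification, so a
	 * callback can never act on a stale snapshot. */
	for (int i = 0; i < q->n_notifiers; i++)
		q->notifiers[i](new_value);
	pthread_mutex_unlock(&q->lock);
}
```

If the callbacks instead ran after the unlock, a second updater could
slip in between the value write and the notification, which is the
misbehavior (e.g. spurious extra IPIs from cpuidle) described above.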

Change-Id: I0f5b0de2b022997a8f7d88755d7b60070b9a091d
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 00:51:45 -03:00
Sultan Alsawaf
874879f16f pinctrl: msm: Restore some barriers to prevent reordering of I/O writes
Although data dependencies and one-way, semi-permeable barriers provided by
spin locks satisfy most ordering needs here, it is still possible for some
I/O writes to be reordered with respect to one another in a dangerous way.
One such example is that the interrupt status bit could be cleared *after*
the interrupt is unmasked when enabling the IRQ, potentially leading to a
spurious interrupt if there's an interrupt pending from when the IRQ was
disabled.

To prevent dangerous I/O write reordering, restore the minimum amount of
barriers needed to ensure writes are ordered as intended.
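The spurious-interrupt case above looks roughly like this kernel-style
fragment (register offsets and the helper name are hypothetical, for
illustration only; it is not runnable outside the kernel):

```c
/* The status clear must reach the device before the unmask write, or a
 * stale pending bit from when the IRQ was disabled fires a spurious
 * interrupt as soon as the line is unmasked. */
static void msm_gpio_irq_enable(void __iomem *base)
{
	u32 cfg;

	writel_relaxed(1, base + INTR_STATUS);	/* clear stale status */
	wmb();					/* order: clear before unmask */
	cfg = readl_relaxed(base + INTR_CFG);
	writel_relaxed(cfg | INTR_ENABLE, base + INTR_CFG);
}
```

Without the wmb(), the two relaxed writes target different registers and
carry no data dependency, so the interconnect is free to reorder them.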

Change-Id: I4c44eaa93f39591d5c963dba2b9dcaf33831bdbe
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2025-01-17 00:51:16 -03:00