wake_affine_idle() prefers to move a task to the current CPU if the
wakeup is due to an interrupt. The expectation is that the interrupt
data is cache hot and relevant to the waking task, and that choosing the
current CPU avoids a search. However, there is no way to determine
whether there is cache-hot data on the previous CPU that may outweigh
the interrupt data. Furthermore,
round-robin delivery of interrupts can migrate tasks around a socket where
each CPU is under-utilised. This can interact badly with cpufreq which
makes decisions based on per-cpu data. It has been observed on machines
with HWP that p-states are not boosted to their maximum levels even though
the workload is latency and throughput sensitive.
This patch uses the previous CPU for the task if it's idle and
cache-affine with the current CPU, even if the current CPU is idle due
to the wakeup being related to an interrupt. This reduces migrations at
the cost of the interrupt data not being cache hot when the task wakes.
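A sketch of the resulting wake_affine_idle(), following the mainline
patch this is based on (nr_cpumask_bits means "no decision", per the
preparation patch further below):

static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
	/*
	 * The wakeup is from interrupt context (this_cpu is idle) and the
	 * CPUs share cache: prefer the idle prev_cpu, since there is no
	 * guarantee the interrupt data is hotter than prev_cpu's data
	 * and, from a cpufreq perspective, it is better to keep
	 * utilisation concentrated on one CPU.
	 */
	if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
		return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	return nr_cpumask_bits;	/* no decision */
}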
A variety of workloads were tested on various machines and no adverse
impact was noticed that was outside noise. dbench on ext4 on UMA showed
roughly a 10% reduction in the number of CPU migrations, and it is a
case where interrupts are frequent due to IO completions. In most cases,
the
difference in performance is quite small but variability is often
reduced. For example, this is the result for pgbench running on a UMA
machine with different numbers of clients.
4.15.0-rc9 4.15.0-rc9
baseline waprev-v1
Hmean 1 22096.28 ( 0.00%) 22734.86 ( 2.89%)
Hmean 4 74633.42 ( 0.00%) 75496.77 ( 1.16%)
Hmean 7 115017.50 ( 0.00%) 113030.81 ( -1.73%)
Hmean 12 126209.63 ( 0.00%) 126613.40 ( 0.32%)
Hmean 16 131886.91 ( 0.00%) 130844.35 ( -0.79%)
Stddev 1 636.38 ( 0.00%) 417.11 ( 34.46%)
Stddev 4 614.64 ( 0.00%) 583.24 ( 5.11%)
Stddev 7 542.46 ( 0.00%) 435.45 ( 19.73%)
Stddev 12 173.93 ( 0.00%) 171.50 ( 1.40%)
Stddev 16 671.42 ( 0.00%) 680.30 ( -1.32%)
CoeffVar 1 2.88 ( 0.00%) 1.83 ( 36.26%)
Note that the difference in performance is marginal but, for low
utilisation, there is less variability.
Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
Change-Id: Iba4b2fee25a8c7b4f0218f656b0db5eab9e11691
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
If waking from an idle CPU due to an interrupt then it's possible that
the waker task will be pulled to wake on the current CPU. Unfortunately,
depending on the type of interrupt and IRQ configuration, there may not
be a strong relationship between the CPU an interrupt was delivered on
and the CPU a task was running on. For example, the interrupts could all
be delivered to CPUs on one particular node due to the machine topology
or IRQ affinity configuration. Another example is an interrupt for an IO
completion which can be delivered to any CPU where there is no guarantee
the data is either cache hot or even local.
This patch was motivated by the observation that an IO workload was
being pulled cross-node on a frequent basis when IO completed. From a
wakeup latency perspective, it's still useful to know that an idle CPU is
immediately available for use, but let's only consider an automatic
migration if the CPUs share cache, to limit damage due to NUMA
migrations. Migrations
may still occur if wake_affine_weight determines it's appropriate.
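A sketch of the check this adds (at this point in the series
wake_affine_idle() still returned a boolean and carried parameters it
did not use):

static bool
wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
		 int this_cpu, int prev_cpu, int sync)
{
	/*
	 * If this_cpu is idle, it implies the wakeup is from interrupt
	 * context. Only allow the move if cache is shared, otherwise an
	 * interrupt-intensive workload could force all tasks onto one
	 * node depending on the IO topology or IRQ affinity settings.
	 */
	if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
		return true;

	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return true;

	return false;
}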
These are the throughput results for dbench running on ext4 comparing
4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO
completions can happen on any CPU.
4.15.0-rc3 4.15.0-rc3
vanilla lessmigrate
Hmean 1 854.64 ( 0.00%) 865.01 ( 1.21%)
Hmean 2 1229.60 ( 0.00%) 1274.44 ( 3.65%)
Hmean 4 1591.81 ( 0.00%) 1628.08 ( 2.28%)
Hmean 8 1845.04 ( 0.00%) 1831.80 ( -0.72%)
Hmean 16 2038.61 ( 0.00%) 2091.44 ( 2.59%)
Hmean 32 2327.19 ( 0.00%) 2430.29 ( 4.43%)
Hmean 64 2570.61 ( 0.00%) 2568.54 ( -0.08%)
Hmean 128 2481.89 ( 0.00%) 2499.28 ( 0.70%)
Stddev 1 14.31 ( 0.00%) 5.35 ( 62.65%)
Stddev 2 21.29 ( 0.00%) 11.09 ( 47.92%)
Stddev 4 7.22 ( 0.00%) 6.80 ( 5.92%)
Stddev 8 26.70 ( 0.00%) 9.41 ( 64.76%)
Stddev 16 22.40 ( 0.00%) 20.01 ( 10.70%)
Stddev 32 45.13 ( 0.00%) 44.74 ( 0.85%)
Stddev 64 93.10 ( 0.00%) 93.18 ( -0.09%)
Stddev 128 184.28 ( 0.00%) 177.85 ( 3.49%)
Note the small increase in throughput for low thread counts but also
note that the standard deviation for each sample during the test run is
lower. The throughput figures for dbench can be misleading so the benchmark
is actually modified to time the latency of the processing of one load
file with many samples taken. The difference in latency is
4.15.0-rc3 4.15.0-rc3
vanilla lessmigrate
Amean 1 21.71 ( 0.00%) 21.47 ( 1.08%)
Amean 2 30.89 ( 0.00%) 29.58 ( 4.26%)
Amean 4 47.54 ( 0.00%) 46.61 ( 1.97%)
Amean 8 82.71 ( 0.00%) 82.81 ( -0.12%)
Amean 16 149.45 ( 0.00%) 145.01 ( 2.97%)
Amean 32 265.49 ( 0.00%) 248.43 ( 6.42%)
Amean 64 463.23 ( 0.00%) 463.55 ( -0.07%)
Amean 128 933.97 ( 0.00%) 935.50 ( -0.16%)
Stddev 1 1.58 ( 0.00%) 1.54 ( 2.26%)
Stddev 2 2.84 ( 0.00%) 2.95 ( -4.15%)
Stddev 4 6.78 ( 0.00%) 6.85 ( -0.99%)
Stddev 8 16.85 ( 0.00%) 16.37 ( 2.85%)
Stddev 16 41.59 ( 0.00%) 41.04 ( 1.32%)
Stddev 32 111.05 ( 0.00%) 105.11 ( 5.35%)
Stddev 64 285.94 ( 0.00%) 288.01 ( -0.72%)
Stddev 128 803.39 ( 0.00%) 809.73 ( -0.79%)
It's a small improvement, which is not surprising given that migrations
to a different node are not that common. However, it is noticeable in
the CPU migration statistics, which are reduced by 24%.
There was a query about NAS for v1 of this patch, so here are the
results for C-class using MPI for parallelisation on the same machine:
nas-mpi
4.15.0-rc3 4.15.0-rc3
vanilla noirq
Time cg.C 24.25 ( 0.00%) 23.17 ( 4.45%)
Time ep.C 8.22 ( 0.00%) 8.29 ( -0.85%)
Time ft.C 22.67 ( 0.00%) 20.34 ( 10.28%)
Time is.C 1.42 ( 0.00%) 1.47 ( -3.52%)
Time lu.C 55.62 ( 0.00%) 54.81 ( 1.46%)
Time mg.C 7.93 ( 0.00%) 7.91 ( 0.25%)
4.15.0-rc3 4.15.0-rc3
vanilla noirq-v1r1
User 3799.96 3748.34
System 672.10 626.15
Elapsed 91.91 79.49
lu.C sees a small gain, ft.C a large gain and ep.C and is.C see small
regressions but in terms of absolute time, the difference is small and
likely within run-to-run variance. System CPU usage is slightly reduced.
schbench from Facebook was also requested. This is a bit of a mixed bag but
it's important to note that this workload should not be heavily impacted
by wakeups from interrupt context.
4.15.0-rc3 4.15.0-rc3
vanilla noirq-v1r1
Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 90.00th-qrtle-1 43.00 ( 0.00%) 44.00 ( -2.33%)
Lat 95.00th-qrtle-1 44.00 ( 0.00%) 46.00 ( -4.55%)
Lat 99.00th-qrtle-1 57.00 ( 0.00%) 58.00 ( -1.75%)
Lat 99.50th-qrtle-1 59.00 ( 0.00%) 59.00 ( 0.00%)
Lat 99.90th-qrtle-1 67.00 ( 0.00%) 78.00 ( -16.42%)
Lat 50.00th-qrtle-2 40.00 ( 0.00%) 51.00 ( -27.50%)
Lat 75.00th-qrtle-2 45.00 ( 0.00%) 56.00 ( -24.44%)
Lat 90.00th-qrtle-2 53.00 ( 0.00%) 59.00 ( -11.32%)
Lat 95.00th-qrtle-2 57.00 ( 0.00%) 61.00 ( -7.02%)
Lat 99.00th-qrtle-2 67.00 ( 0.00%) 71.00 ( -5.97%)
Lat 99.50th-qrtle-2 69.00 ( 0.00%) 74.00 ( -7.25%)
Lat 99.90th-qrtle-2 83.00 ( 0.00%) 77.00 ( 7.23%)
Lat 50.00th-qrtle-4 51.00 ( 0.00%) 51.00 ( 0.00%)
Lat 75.00th-qrtle-4 57.00 ( 0.00%) 56.00 ( 1.75%)
Lat 90.00th-qrtle-4 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 95.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%)
Lat 99.00th-qrtle-4 73.00 ( 0.00%) 72.00 ( 1.37%)
Lat 99.50th-qrtle-4 76.00 ( 0.00%) 74.00 ( 2.63%)
Lat 99.90th-qrtle-4 85.00 ( 0.00%) 78.00 ( 8.24%)
Lat 50.00th-qrtle-8 54.00 ( 0.00%) 58.00 ( -7.41%)
Lat 75.00th-qrtle-8 59.00 ( 0.00%) 62.00 ( -5.08%)
Lat 90.00th-qrtle-8 65.00 ( 0.00%) 66.00 ( -1.54%)
Lat 95.00th-qrtle-8 67.00 ( 0.00%) 70.00 ( -4.48%)
Lat 99.00th-qrtle-8 78.00 ( 0.00%) 79.00 ( -1.28%)
Lat 99.50th-qrtle-8 81.00 ( 0.00%) 80.00 ( 1.23%)
Lat 99.90th-qrtle-8 116.00 ( 0.00%) 83.00 ( 28.45%)
Lat 50.00th-qrtle-16 65.00 ( 0.00%) 64.00 ( 1.54%)
Lat 75.00th-qrtle-16 77.00 ( 0.00%) 71.00 ( 7.79%)
Lat 90.00th-qrtle-16 83.00 ( 0.00%) 82.00 ( 1.20%)
Lat 95.00th-qrtle-16 87.00 ( 0.00%) 87.00 ( 0.00%)
Lat 99.00th-qrtle-16 95.00 ( 0.00%) 96.00 ( -1.05%)
Lat 99.50th-qrtle-16 99.00 ( 0.00%) 103.00 ( -4.04%)
Lat 99.90th-qrtle-16 104.00 ( 0.00%) 122.00 ( -17.31%)
Lat 50.00th-qrtle-32 71.00 ( 0.00%) 73.00 ( -2.82%)
Lat 75.00th-qrtle-32 91.00 ( 0.00%) 92.00 ( -1.10%)
Lat 90.00th-qrtle-32 108.00 ( 0.00%) 107.00 ( 0.93%)
Lat 95.00th-qrtle-32 118.00 ( 0.00%) 115.00 ( 2.54%)
Lat 99.00th-qrtle-32 134.00 ( 0.00%) 129.00 ( 3.73%)
Lat 99.50th-qrtle-32 138.00 ( 0.00%) 133.00 ( 3.62%)
Lat 99.90th-qrtle-32 149.00 ( 0.00%) 146.00 ( 2.01%)
Lat 50.00th-qrtle-39 83.00 ( 0.00%) 81.00 ( 2.41%)
Lat 75.00th-qrtle-39 105.00 ( 0.00%) 102.00 ( 2.86%)
Lat 90.00th-qrtle-39 120.00 ( 0.00%) 119.00 ( 0.83%)
Lat 95.00th-qrtle-39 129.00 ( 0.00%) 128.00 ( 0.78%)
Lat 99.00th-qrtle-39 153.00 ( 0.00%) 149.00 ( 2.61%)
Lat 99.50th-qrtle-39 166.00 ( 0.00%) 156.00 ( 6.02%)
Lat 99.90th-qrtle-39 12304.00 ( 0.00%) 12848.00 ( -4.42%)
When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there
are small gains in many cases. Otherwise it depends on the quartile used
where it can be bad -- e.g. 75.00th-qrtle-2. However, even these results
are probably a coincidence. For this workload, much depends on what
node the threads get placed on and their relative locality, not on
wakeups from interrupt context. A larger influence on its behaviour
would be automatic NUMA balancing, where a fault incurred to measure
locality would be a much larger contributor to latency than the wakeup
path.
These are the results from an almost identical machine that happened to run
the same test. They only differ in terms of storage which is irrelevant
for this test.
4.15.0-rc3 4.15.0-rc3
vanilla noirq-v1r1
Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 90.00th-qrtle-1 44.00 ( 0.00%) 43.00 ( 2.27%)
Lat 95.00th-qrtle-1 53.00 ( 0.00%) 45.00 ( 15.09%)
Lat 99.00th-qrtle-1 59.00 ( 0.00%) 58.00 ( 1.69%)
Lat 99.50th-qrtle-1 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 99.90th-qrtle-1 86.00 ( 0.00%) 61.00 ( 29.07%)
Lat 50.00th-qrtle-2 52.00 ( 0.00%) 41.00 ( 21.15%)
Lat 75.00th-qrtle-2 57.00 ( 0.00%) 46.00 ( 19.30%)
Lat 90.00th-qrtle-2 60.00 ( 0.00%) 53.00 ( 11.67%)
Lat 95.00th-qrtle-2 62.00 ( 0.00%) 57.00 ( 8.06%)
Lat 99.00th-qrtle-2 73.00 ( 0.00%) 68.00 ( 6.85%)
Lat 99.50th-qrtle-2 74.00 ( 0.00%) 71.00 ( 4.05%)
Lat 99.90th-qrtle-2 90.00 ( 0.00%) 75.00 ( 16.67%)
Lat 50.00th-qrtle-4 57.00 ( 0.00%) 52.00 ( 8.77%)
Lat 75.00th-qrtle-4 60.00 ( 0.00%) 58.00 ( 3.33%)
Lat 90.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%)
Lat 95.00th-qrtle-4 65.00 ( 0.00%) 65.00 ( 0.00%)
Lat 99.00th-qrtle-4 76.00 ( 0.00%) 75.00 ( 1.32%)
Lat 99.50th-qrtle-4 77.00 ( 0.00%) 77.00 ( 0.00%)
Lat 99.90th-qrtle-4 87.00 ( 0.00%) 81.00 ( 6.90%)
Lat 50.00th-qrtle-8 59.00 ( 0.00%) 57.00 ( 3.39%)
Lat 75.00th-qrtle-8 63.00 ( 0.00%) 62.00 ( 1.59%)
Lat 90.00th-qrtle-8 66.00 ( 0.00%) 67.00 ( -1.52%)
Lat 95.00th-qrtle-8 68.00 ( 0.00%) 70.00 ( -2.94%)
Lat 99.00th-qrtle-8 79.00 ( 0.00%) 80.00 ( -1.27%)
Lat 99.50th-qrtle-8 80.00 ( 0.00%) 84.00 ( -5.00%)
Lat 99.90th-qrtle-8 84.00 ( 0.00%) 90.00 ( -7.14%)
Lat 50.00th-qrtle-16 65.00 ( 0.00%) 65.00 ( 0.00%)
Lat 75.00th-qrtle-16 77.00 ( 0.00%) 75.00 ( 2.60%)
Lat 90.00th-qrtle-16 84.00 ( 0.00%) 83.00 ( 1.19%)
Lat 95.00th-qrtle-16 88.00 ( 0.00%) 87.00 ( 1.14%)
Lat 99.00th-qrtle-16 97.00 ( 0.00%) 96.00 ( 1.03%)
Lat 99.50th-qrtle-16 100.00 ( 0.00%) 104.00 ( -4.00%)
Lat 99.90th-qrtle-16 110.00 ( 0.00%) 126.00 ( -14.55%)
Lat 50.00th-qrtle-32 70.00 ( 0.00%) 71.00 ( -1.43%)
Lat 75.00th-qrtle-32 92.00 ( 0.00%) 94.00 ( -2.17%)
Lat 90.00th-qrtle-32 110.00 ( 0.00%) 110.00 ( 0.00%)
Lat 95.00th-qrtle-32 121.00 ( 0.00%) 118.00 ( 2.48%)
Lat 99.00th-qrtle-32 135.00 ( 0.00%) 137.00 ( -1.48%)
Lat 99.50th-qrtle-32 140.00 ( 0.00%) 146.00 ( -4.29%)
Lat 99.90th-qrtle-32 150.00 ( 0.00%) 160.00 ( -6.67%)
Lat 50.00th-qrtle-39 80.00 ( 0.00%) 71.00 ( 11.25%)
Lat 75.00th-qrtle-39 102.00 ( 0.00%) 91.00 ( 10.78%)
Lat 90.00th-qrtle-39 118.00 ( 0.00%) 108.00 ( 8.47%)
Lat 95.00th-qrtle-39 128.00 ( 0.00%) 117.00 ( 8.59%)
Lat 99.00th-qrtle-39 149.00 ( 0.00%) 133.00 ( 10.74%)
Lat 99.50th-qrtle-39 160.00 ( 0.00%) 139.00 ( 13.12%)
Lat 99.90th-qrtle-39 13808.00 ( 0.00%) 4920.00 ( 64.37%)
Despite being nearly identical, it showed a variety of major gains so
I'm not convinced that heavy emphasis should be placed on this particular
workload in terms of evaluating this particular patch. Further evidence of
this is the fact that testing on a UMA machine showed small gains/losses
even though the patch should be a no-op on UMA.
Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.net
Change-Id: I788a3c61c044b86a3c45e46010660ff0d40f3615
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This is a preparation patch that has wake_affine*() return a CPU ID instead of
a boolean. The intent is to allow the wake_affine() helpers to be avoided
if a decision is already made. This patch has no functional change.
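Sketched, the wake_affine() caller then reads roughly as follows, with
nr_cpumask_bits standing in for "no decision":

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
		       int this_cpu, int prev_cpu, int sync)
{
	int target = nr_cpumask_bits;

	if (sched_feat(WA_IDLE))
		target = wake_affine_idle(this_cpu, prev_cpu, sync);

	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

	if (target == nr_cpumask_bits)
		return prev_cpu;

	return target;
}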
Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
Change-Id: I74fb4723c07c1dcbc401e9f88ad83834770875f4
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
wake_affine_idle() takes parameters it never uses so clean it up.
Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
Change-Id: I3093bc72dac98eac3e4a6afa131f2e806493c0ee
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
If a CPU is about to idle, prevent a frequency update. With this, the
number of schedutil governor wakeups is reduced by more than half in a
test playing bluetooth audio.
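A hedged sketch of the mechanism; the SCHED_CPUFREQ_IDLE flag name,
value and hook placement are assumptions about this port:

#define SCHED_CPUFREQ_IDLE	(1U << 3)	/* assumed flag value */

static void sugov_update_single(struct update_util_data *hook, u64 time,
				unsigned int flags)
{
	/*
	 * The CPU is about to enter idle; selecting a frequency now would
	 * only wake the sugov kthread for no benefit, so bail out early.
	 */
	if (flags & SCHED_CPUFREQ_IDLE)
		return;

	/* ... normal frequency selection continues ... */
}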
Test: sugov wake ups drop by more than half when playing music with
screen off (476 / 1092)
[yaro]: Ported to 4.14
Bug: 64689959
Change-Id: I400026557b4134c0ac77f51c79610a96eb985b4a
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
We need this to implement "sched/fair: Skip frequency updates if CPU about to idle"
This reverts commit 3a123bbbb10d54dbdde6ccbbd519c74c91ba2f52.
Change-Id: I8abe4b2e0ae0c610a1477c24fb7b356fec0de409
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Change-Id: I42206524415d198c3060063c8d4182d12551c6a3
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Based on 441d1a2 ("sysctl: Expose a few additional sched features to appease OOS when !SCHED_WALT")
This creates a set of no-op proc nodes to stand in for certain
features specific to SCHED_WALT for the purpose of appeasing
userspace and preventing runaway error spam in logcat during
interactions, e.g.:
ANDR-PERF-UTIL
Failed to read /proc/sys/kernel/sched_busy_hyst_ns
ANDR-PERF-OPTSHANDLER
Failed to read /proc/sys/kernel/sched_busy_hyst_ns
ANDR-PERF-UTIL
Failed to read /proc/sys/kernel/sched_prefer_spread
ANDR-PERF-OPTSHANDLER
Failed to read /proc/sys/kernel/sched_prefer_spread
ANDR-PERF-UTIL
Failed to read /proc/sys/kernel/sched_busy_hysteresis_enable_cpus
ANDR-PERF-OPTSHANDLER
Failed to read /proc/sys/kernel/sched_busy_hysteresis_enable_cpus
Please note that these nodes do not function properly, or at all,
and exist solely to provide CAF's performance HAL with something
to read and write meaningless numbers to.
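A hedged sketch of one such stub; the storage and handler choices here
are illustrative:

static unsigned int sched_busy_hyst_ns;	/* dummy, read/written but unused */

static struct ctl_table sched_walt_stub_table[] = {
	{
		/* No-op node: accepts reads and writes, affects nothing. */
		.procname	= "sched_busy_hyst_ns",
		.data		= &sched_busy_hyst_ns,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_douintvec,
	},
	{ }
};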
Change-Id: I17dcd6def7ce4dcbc1a4be2b51727e85a18c050e
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit 2ef5980a4daedca5be699e284e43d0de1cb87139.
Change-Id: I7d597730c3022bb59e872ef030fc7efd403f2817
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
* 'linux-4.14.y' of https://github.com/openela/kernel-lts: (278 commits)
LTS: Update to 4.14.348
docs: kernel_include.py: Cope with docutils 0.21
serial: kgdboc: Fix NMI-safety problems from keyboard reset code
btrfs: add missing mutex_unlock in btrfs_relocate_sys_chunks()
dm: limit the number of targets and parameter size area
Revert "selftests: mm: fix map_hugetlb failure on 64K page size systems"
LTS: Update to 4.14.347
rds: Fix build regression.
RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats
af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().
net: fix out-of-bounds access in ops_init
drm/vmwgfx: Fix invalid reads in fence signaled events
dyndbg: fix old BUG_ON in >control parser
tipc: fix UAF in error path
usb: gadget: f_fs: Fix a race condition when processing setup packets.
usb: gadget: composite: fix OS descriptors w_value logic
firewire: nosy: ensure user_length is taken into account when fetching packet contents
af_unix: Fix garbage collector racing against connect()
af_unix: Do not use atomic ops for unix_sk(sk)->inflight.
ipv6: fib6_rules: avoid possible NULL dereference in fib6_rule_action()
...
Change-Id: If329d39dd4e95e14045bb7c58494c197d1352d60
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
commit a90afe8d020da9298c98fddb19b7a6372e2feb45 upstream.
If the perf buffer isn't large enough, provide a hint about how large it
needs to be for whatever is running.
Link: https://lkml.kernel.org/r/20210831043723.13481-1-robbat2@gentoo.org
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
(cherry picked from commit 78b92d50fe6ab79d536f4b12c5bde15f2751414d)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
This reverts commit bcf4a115a5068f3331fafb8c176c1af0da3d8b19 which is
commit 0958b33ef5a04ed91f61cef4760ac412080c4e08 upstream.
The change has an incorrect assumption about the return value because
in the current stable trees for versions 5.15 and before, the following
commit responsible for making 0 a success value is not present:
b8cc44a4d3c1 ("tracing: Remove logic for registering multiple event triggers at a time")
The return value should be 0 on failure in the current tree, because in
the functions event_trigger_callback() and event_enable_trigger_func(),
we have:
ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);
/*
* The above returns on success the # of functions enabled,
* but if it didn't find any functions it returns zero.
* Consider no functions a failure too.
*/
if (!ret) {
ret = -ENOENT;
Cc: stable@kernel.org # 5.15, 5.10, 5.4, 4.19
Signed-off-by: Siddh Raman Pant <siddh.raman.pant@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 34925d01baf3ee62ab21c21efd9e2c44c24c004a)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit 325f3fb551f8cd672dbbfc4cf58b14f9ee3fc9e8 upstream.
When unloading a module, its state changes MODULE_STATE_LIVE ->
MODULE_STATE_GOING -> MODULE_STATE_UNFORMED, and each transition takes
some time. `is_module_text_address()` and `__module_text_address()`
work with MODULE_STATE_LIVE and MODULE_STATE_GOING.
If we use `is_module_text_address()` and `__module_text_address()`
separately, there is a chance that the first one succeeds but the next
one fails because module->state becomes MODULE_STATE_UNFORMED between
those operations.
In `check_kprobe_address_safe()`, if the second `__module_text_address()`
fails, the failure is ignored because the address is assumed to be a
kernel text address. But it may have failed simply because module->state
has changed to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will
try to modify a module text address that no longer exists
(use-after-free).
To fix this problem, do not use `is_module_text_address()` and
`__module_text_address()` separately; use only `__module_text_address()`
once and do `try_module_get(module)`, which only succeeds while the
module is MODULE_STATE_LIVE.
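A simplified sketch of the fixed check in check_kprobe_address_safe():

	preempt_disable();
	if (!core_kernel_text((unsigned long)p->addr)) {
		/* Resolve the module exactly once... */
		*probed_mod = __module_text_address((unsigned long)p->addr);
		if (!*probed_mod) {
			ret = -EINVAL;
			goto out;
		}
		/*
		 * ...and pin it: try_module_get() only succeeds while the
		 * module is MODULE_STATE_LIVE, so its text cannot turn
		 * MODULE_STATE_UNFORMED between this check and arm_kprobe().
		 */
		if (!try_module_get(*probed_mod)) {
			ret = -ENOENT;
			goto out;
		}
	}
out:
	preempt_enable();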
Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/
Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
[Fix conflict due to lack of dependency commit 223a76b268c9
("kprobes: Fix coding style issues")]
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b5808d40093403334d939e2c3c417144d12a6f33)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit 6b959ba22d34ca793ffdb15b5715457c78e38b1a upstream.
perf_output_read_group() may respond to an IPI request from another core
and invoke __perf_install_in_context(). As a result, the hwc
configuration is modified, causing inconsistency and unexpected
consequences.
Interrupts are not disabled when perf_output_read_group() reads the PMU
counter. In this case, an IPI request may be received from another core.
As a result, the PMU configuration is modified and an error occurs when
reading the PMU counter:
CPU0 CPU1
__se_sys_perf_event_open
perf_install_in_context
perf_output_read_group smp_call_function_single
for_each_sibling_event(sub, leader) { generic_exec_single
if ((sub != event) && remote_function
(sub->state == PERF_EVENT_STATE_ACTIVE)) |
<enter IPI handler: __perf_install_in_context> <----RAISE IPI-----+
__perf_install_in_context
ctx_resched
event_sched_out
armpmu_del
...
hwc->idx = -1; // event->hwc.idx is set to -1
...
<exit IPI>
sub->pmu->read(sub);
armpmu_read
armv8pmu_read_counter
armv8pmu_read_hw_counter
int idx = event->hw.idx; // idx = -1
u64 val = armv8pmu_read_evcntr(idx);
u32 counter = ARMV8_IDX_TO_COUNTER(idx); // invalid counter = 30
read_pmevcntrn(counter) // undefined instruction
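The fix (sketched from the upstream commit) is to disable interrupts
around the read so the IPI cannot run mid-loop:

static void perf_output_read(struct perf_output_handle *handle,
			     struct perf_event *event)
{
	u64 enabled = 0, running = 0, now;
	unsigned long flags;

	/*
	 * Keep IRQs off for the whole read so an IPI such as
	 * __perf_install_in_context() cannot reschedule the events and
	 * leave hwc->idx == -1 while a sibling counter is being read.
	 */
	local_irq_save(flags);

	calc_timer_values(event, &now, &enabled, &running);

	if (event->attr.read_format & PERF_FORMAT_GROUP)
		perf_output_read_group(handle, event, enabled, running);
	else
		perf_output_read_one(handle, event, enabled, running);

	local_irq_restore(flags);
}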
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220902082918.179248-1-yangjihong1@huawei.com
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a2039c87d30177f0fd349ab000e6af25a0d48de8)
[Vegard: fix conflict in context due to missing commit
ece0857258cbaf20b9828157035999f46ca060c8 ("perf/core: Add a new read
format to get a number of lost samples").]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit 9bc4ffd32ef8943f5c5a42c9637cfd04771d021b ]
psci_init_system_suspend() invokes suspend_set_ops() very early during
bootup, even before the mem_sleep_default kernel command line parameter
is set up. This leads to the mem_sleep_default=s2idle command line
option not working, as mem_sleep_current gets changed to deep via
suspend_set_ops() and never changes back to s2idle.
Set mem_sleep_current along with mem_sleep_default during kernel command
line setup as the default suspend mode.
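Sketch of the fix, mirroring the upstream patch: mem_sleep_current is
set alongside mem_sleep_default while parsing the command line:

static int __init mem_sleep_default_setup(char *str)
{
	suspend_state_t state;

	for (state = PM_SUSPEND_TO_IDLE; state <= PM_SUSPEND_MEM; state++)
		if (mem_sleep_labels[state] &&
		    !strcmp(str, mem_sleep_labels[state])) {
			mem_sleep_default = state;
			mem_sleep_current = state;	/* the fix */
			break;
		}

	return 1;
}
__setup("mem_sleep_default=", mem_sleep_default_setup);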
Fixes: faf7ec4a92c0 ("drivers: firmware: psci: add system suspend support")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Maulik Shah <quic_mkshah@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 312ead3c0e23315596560e9cc1d6ebbee1282e40)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
This reverts commit 05f9f1909535fc539e2639683f3d70f6dfc0bbc9.
Change-Id: I9d179c49a6333c7786985465edb8036636621870
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Change-Id: I1229bdaa5afcba572c24616aedcd831c93316108
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit 13cc43b12b3d4200d5bc8d900d05689b23c58aa5.
Change-Id: I990aed89a678011004e123ada5ac07d7b1a8f0f4
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit fba50debe41b34bc281947cecd03c67b79f97403.
Change-Id: I3fc44d88fc72afdc8c4e26dce5dca996ed415ff9
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit 099732c94bc367ad219a1d35ff5bcd66e9ea0e1e.
Change-Id: I9e6f70c472574770352bf9185736b5c5547091e0
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit bf2fdd35c1cab2dd1c704ad2a40b20ae4473a4fe.
Change-Id: Iea1e43e248b9f3a6b2b34c2313aef4e0eaad9381
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
The effective affinity mask causes a lot of bugs by virtue of many
set_irq_affinity handlers only setting an effective affinity mask for an
IRQ's parent but not the IRQ itself. Since this is a widespread issue that
would require manual fixing on every different SoC, just disable the
effective affinity mask altogether and use the first CPU in the
configured affinity mask.
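A sketch of the resulting target selection; the helper is illustrative,
while the real change disables the effective-mask machinery itself:

/*
 * Without an effective affinity mask, deliver the interrupt to the
 * first online CPU of the affinity mask that was actually configured.
 */
static unsigned int irq_pick_target_cpu(struct irq_data *data)
{
	return cpumask_first_and(irq_data_get_affinity_mask(data),
				 cpu_online_mask);
}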
Change-Id: Ieec06c5c392250608d954dd144e1513a5bdad56f
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Keeping the compiler in the dark about which sched_feat() toggles are
set reduces its ability to optimize scheduler code, which results in a
measurably significant performance loss since the affected code is very
hot. Turning sched_feat() definitions into macros that the compiler can
resolve results in smaller and faster code.
The sched_feat() definitions are converted to macros with this sed:
sed -i -E \
-e 's/SCHED_FEAT\((.*),(.*)\)/#define SCHED_FEAT_\1\2/' \
-e 's/true/1/' \
-e 's/false/0/' \
kernel/sched/features.h
The sched_feat() macros are named SCHED_FEAT_*. With this done, we can
now optimize try-to-wake-up code when SCHED_FEAT_TTWU_QUEUE is disabled,
which results in a CPU usage reduction of a few percent inside
try_to_wake_up().
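For example (sketched), a features.h entry before and after the
conversion, and the macro that makes the toggle a compile-time constant:

/* Before: value only known at runtime, opaque to the optimizer. */
SCHED_FEAT(TTWU_QUEUE, true)

/* After the sed pass: */
#define SCHED_FEAT_TTWU_QUEUE 1

/* sched_feat() pastes the token, so disabled branches are elided: */
#define sched_feat(x) (SCHED_FEAT_##x)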
Change-Id: Id3564300537308a2db5aedebc890be6f8932019b
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
During load balance, we try at most env->loop_max times to move a task.
But it can happen that the loop_max LRU tasks (i.e. the tail of the
cfs_tasks list) can't be moved to dst_cpu because of affinity. In this
case, loop through the list until we find at least one movable task.
The maximum number of detached tasks remains the same as before.
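Sketch of the corresponding detach_tasks() change, following the
upstream patch:

	while (!list_empty(tasks)) {
		p = list_last_entry(tasks, struct task_struct, se.group_node);

		env->loop++;
		/*
		 * We've more or less seen every task there is; call it
		 * quits, unless nothing movable has been found yet (in
		 * which case LBF_ALL_PINNED is still set).
		 */
		if (env->loop > env->loop_max &&
		    !(env->flags & LBF_ALL_PINNED))
			break;

		/* ... the affinity check clears LBF_ALL_PINNED on the first
		 * movable task; the detached-task cap is unchanged ... */
	}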
Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org
Change-Id: Icaab04df216fd58001741d6f234a3964d1569031
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
With a default value of 500us, sysctl_sched_migration_cost is
significantly higher than the cost of load_balance. Remove the
condition and rely on sd->max_newidle_lb_cost to abort
newidle_balance().
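Sketched against the mainline code (the 4.14 port differs in detail):

	/* Before: a static 500us cutoff also gated newidle balancing. */
	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
	    !this_rq->rd->overload)
		goto out;

	/* After: only the measured cost of a newidle balance matters. */
	if (!this_rq->rd->overload ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost))
		goto out;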
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-5-vincent.guittot@linaro.org
Change-Id: I5731e7967b03c1c1a542f210d128d58da98fb734
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
In newidle_balance(), the scheduler skips load balancing to the new
idle cpu when, for the 1st sd of this_rq:
this_rq->avg_idle < sd->max_newidle_lb_cost
Doing a costly call to update_blocked_averages() will not be useful and
simply adds overhead when this condition is true.
Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.
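Sketch: the same gating condition moves ahead of the expensive call:

	rcu_read_lock();
	sd = rcu_dereference_check_sched_domain(this_rq->sd);

	/*
	 * Bail out before paying for update_blocked_averages() when the
	 * 1st domain could not be balanced within avg_idle anyway.
	 */
	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
	    !this_rq->rd->overload ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
		rcu_read_unlock();
		goto out;
	}
	rcu_read_unlock();

	update_blocked_averages(this_cpu);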
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-3-vincent.guittot@linaro.org
Change-Id: I615ac6f2147b4aa044b618957d31392927c1c283
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
The time spent updating the blocked load can be significant depending
on the complexity of the cgroup hierarchy. Take this time into account
in the cost of the 1st load balance of a newly idle cpu.
Also reduce the number of calls to sched_clock_cpu() and track more of
the actual work.
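Sketch of the accounting added to newidle_balance():

	u64 t0, t1, curr_cost = 0;

	t0 = sched_clock_cpu(this_cpu);
	update_blocked_averages(this_cpu);
	t1 = sched_clock_cpu(this_cpu);

	/* Charge the blocked-load update to this newidle balance... */
	curr_cost += t1 - t0;

	/* ...so it raises the tracked cost like the balance itself does. */
	if (curr_cost > this_rq->max_idle_balance_cost)
		this_rq->max_idle_balance_cost = curr_cost;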
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-2-vincent.guittot@linaro.org
Change-Id: I0921c8b7a37ee1bb764d9de41ca4902ca2bd15c6
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
When NR_CPUS fits in a long, it's possible to use compiler built-ins to
produce much faster code when operating on cpumasks compared to just using
the generic bitops APIs.
Therefore, add optimized helpers using compiler built-ins when NR_CPUS fits
in a long. This also turns nr_cpu_ids into a compile-time constant for
further optimization potential.
Note that compared to the upstream cpumask rewrite with this feature, these
optimized helpers perfectly preserve the semantics of the helpers they
replace. And this change is much smaller than the upstream version.
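A hedged sketch of the shape of these helpers; the names mirror the
generic API they replace:

#if NR_CPUS <= BITS_PER_LONG
/* The whole mask fits in one long: operate on it directly. */
static inline bool cpumask_test_cpu(int cpu, const struct cpumask *mask)
{
	return mask->bits[0] & (1UL << cpu);
}

static inline unsigned int cpumask_weight(const struct cpumask *mask)
{
	return __builtin_popcountl(mask->bits[0]);
}

static inline unsigned int cpumask_first(const struct cpumask *mask)
{
	unsigned long m = mask->bits[0];

	return m ? __builtin_ctzl(m) : NR_CPUS;
}
#endif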
Change-Id: I1ac6058a19bd3b22a491176eef9d661cca78e521
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
[ Upstream commit 93f141324d4860a1294e6899923c01ec5411d70b ]
s2idle_wait_head is used during s2idle with interrupts disabled even on
RT. There is no "custom" wake up function so swait could be used instead
which is also lower weight compared to the wait_queue.
Make s2idle_wait_head a swait_queue_head.
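The conversion itself is small; a sketch with the wake condition
simplified:

static DECLARE_SWAIT_QUEUE_HEAD(s2idle_wait_head);

	/* Sleep side: */
	swait_event(s2idle_wait_head, s2idle_state == S2IDLE_STATE_WAKE);

	/* Wake side: */
	swake_up(&s2idle_wait_head);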
Change-Id: Iac95f519c770c147ab9fa83de9616095408bc062
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
[ Upstream commit ec7ff06b919647a2fd7d2761a26f5a1d465e819c ]
This is an updated version of this patch which was merged upstream as
commit c1a957d17086d20d52d7f9c8dffaeac2ee09d6f9
Change-Id: I066f092b7bd3c5729af78c330f1049f84f0b7ad6
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Completions have no long-lasting callbacks and therefore do not need
the complex waitqueue variant. Use simple waitqueues which reduces the
contention on the waitqueue lock.
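The core of the change, sketched; callers move to the corresponding
swait_event()/swake_up() variants:

struct completion {
	unsigned int done;
	struct swait_queue_head wait;	/* was: wait_queue_head_t wait */
};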
Change-Id: I071e1408cb43f2b432179c42b01ced8df6efd127
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit 64a6b2c346641aa6d8b08368dad2e085dd82f3d1.
Change-Id: Ifccdc2e16c44c5c67ed974ea486411b6f6bc42d7
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
There isn't a need for cpus_affine to be atomic, and reading/writing to
it outside of the global pm_qos lock is racy anyway. As such, we can
simply turn it into a primitive integer type.
Change-Id: Icd8eceb3fcf1a07f345ce0ed104b930e405900d1
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
The plist is already sorted and traversed in ascending order of PM QoS
value, so we can simply look at the lowest PM QoS values which affect
the given request's CPUs until we've looked at all of them, at which
point the traversal can be stopped early. This also lets us get rid of
the pesky qos_val array.
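A hedged sketch of the early-stop traversal; field names such as
cpus_affine and target_per_cpu are assumptions about this pm_qos
variant:

	/* cpus_left starts as the updated request's affinity mask. */
	plist_for_each_entry(req, &c->list, node) {
		for_each_cpu(cpu, &req->cpus_affine) {
			if (!cpumask_test_cpu(cpu, cpus_left))
				continue;
			/*
			 * The plist is ascending and the lowest value wins
			 * for CPU_DMA_LATENCY, so the first match is final.
			 */
			c->target_per_cpu[cpu] = req->node.prio;
			cpumask_clear_cpu(cpu, cpus_left);
		}
		/* Every affected CPU resolved: stop the traversal early. */
		if (cpumask_empty(cpus_left))
			return;
	}

	/* CPUs with no remaining request fall back to the default. */
	for_each_cpu(cpu, cpus_left)
		c->target_per_cpu[cpu] = c->default_value;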
Change-Id: I5456b976b16f3544ac9ae3289446f5a4e115e6ef
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Andrzej Perczak discovered that his CPUs would almost never enter an
idle state deeper than C0, and pinpointed the cause of the issue to be
commit "qos: Speed up pm_qos_set_value_for_cpus()". As it turns out, the
optimizations introduced in that commit contain two issues that are
responsible for this behavior: pm_qos_remove_request() fails to refresh
the affected per-CPU targets, and IRQ migrations fail to refresh their
old affinity's targets.
Removing a request fails to refresh the per-CPU targets because
`new_req->node.prio` isn't updated to the PM QoS class' default value
upon removal, and so it contains its old value from when it was active.
This causes the `changed` loop in pm_qos_set_value_for_cpus() to check
against a stale PM QoS request value and erroneously determine that the
request in question doesn't alter the current per-CPU targets.
As for IRQ migrations, only the new CPU affinity mask gets updated,
which causes the CPUs present in the old affinity mask but not the new
one to retain their targets, specifically when a migration occurs while
the associated PM QoS request is active.
To fix these issues while retaining optimal speed, update PM QoS
requests' CPU affinity inside pm_qos_set_value_for_cpus() so that the
old affinity can be known, and skip the `changed` loop when the request
in question is being removed.
Change-Id: I8b4626a253a85a19d1fafc197ba2c18a6ed73bbf
Reported-by: Andrzej Perczak <kartapolska@gmail.com>
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
A lot of unnecessary work is done in pm_qos_set_value_for_cpus(),
especially when the request being updated isn't affined to all CPUs.
We can reduce the work done here significantly by only inspecting the
CPUs which are affected by the updated request, and bailing out if the
updated request doesn't change anything.
We can make some other micro-optimizations as well knowing that this
code is only for the PM_QOS_CPU_DMA_LATENCY class.
Change-Id: I9e8e7bb09d20a97b9c5853c32ececb17f17a891d
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit b785b90d42a353dde68edc7cd1ad2166772fdc1f.
Change-Id: I9ac0f0998104f01243b0c35e3ef1ec530bedd03a
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit fae4a4fed88fad9f23590d9890c5e8641565ce01.
Change-Id: I4b95b19ac3fc0ab48e00136cd9cf7c67adbc0f1f
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Within a DynamIQ Shared Unit (DSU), task migration cost is optimized
through L2 and L3 cache sharing. When a task is migrated between CPUs
within the same DSU cluster, there is no loss of L2$ and L3$ locality.
Since all CPU cores are tightly interconnected within a DSU, set the task
migration cost to zero knowing that the DSU will facilitate L$ sharing via
snooping.
This leverages the DSU to improve power consumption and performance on
systems which contain a single DSU cluster.
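The change itself reduces to zeroing the existing tunable (sketch):

/*
 * Single-DSU system: L2/L3 contents remain shared across intra-cluster
 * migrations, so never consider a task too cache hot to migrate.
 */
const_debug unsigned int sysctl_sched_migration_cost = 0;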
Change-Id: Ifce37fc6c125097c1965d12c61b4ee351b9dbd94
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Unity-based games (such as Wild Rift) like to shoot themselves in the foot
by setting a nonsense CPU affinity, restricting the game to a narrow set of
CPU cores that it thinks are the "big" cores in a heterogeneous CPU. It
assumes that CPUs only have two performance domains (clusters), and
therefore royally mucks up games' CPU affinities on CPUs which have more
than two performance domains.
Check if a setaffinity target task is part of a Unity-based game and
silently ignore the setaffinity request so that it can't sabotage itself.
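A hedged sketch of the gate; matching on the "UnityMain" thread name is
an assumption about how such games are identified:

long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
{
	struct task_struct *p;

	/* ... task lookup and permission checks ... */

	/*
	 * Accept the call but change nothing for Unity's main thread, so
	 * the game cannot restrict itself to a wrongly-guessed cluster.
	 */
	if (!strcmp(p->comm, "UnityMain"))
		return 0;

	/* ... normal affinity update continues ... */
}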
Change-Id: I47c292bd2b9f08b6299378308ba646168b341fb5
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit 316fc7da752bbbf792ee8240c2303200e36e91eb.
Change-Id: I371bd4d88c5175ee1e4dabbf186aa44defeec711
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>