29356 Commits

LibXZR
0e27242ef3 Revert "genirq/irqdomain: Don't try to free an interrupt that has no mapping"
* This is wrong. An interrupt does not have a permanent virq; the virq of an IRQ may change
after other IRQs are freed.

* This prevents msm_msi_irq_domain from freeing all of the IRQs it allocated.

* Fixes an unrecoverable modem crash.

This reverts commit d1874e36cb3d00ba53f9e7bc3ca58d3058659cee.

Change-Id: I831083118d6a7c12d43f2fa2ef01bdd27159dac8
Signed-off-by: LibXZR <xzr467706992@163.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-24 00:01:56 -03:00
Richard Raya
30f8174d8a msm-4.14: Bump boost input timeouts
Change-Id: I55077d5fecbf231539ad94ac058226cbc39d1479
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:59:46 -03:00
Sultan Alsawaf
5626e9eb9c cgroup: Boost whenever a zygote-forked process becomes a top app
Boost to the max for 1000 ms whenever the top app changes, which
improves app launch speeds and addresses jitter when switching between
apps. A check to make sure that the top-app's parent is zygote ensures
that a user-facing app is indeed what's added to the top app task group,
since app processes are forked from zygote.

Change-Id: I49563d8baef7cefa195c919acf97343fa424c3be
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:59:42 -03:00
Dietmar Eggemann
f49f0241d2 sched/features: Disable LB_BIAS by default
LB_BIAS allows adjusting how conservatively load is balanced.

The rq->cpu_load[idx] array is used for this functionality. It contains
weighted CPU load decayed average values over different intervals
(idx = 1..4). Idx = 0 is the weighted CPU load itself.

The values are updated during scheduler_tick, before idle balance and at
nohz exit.

There are 5 different types of idx's per sched domain (sd). Each of them
is used to index into the rq->cpu_load[idx] array in a specific scenario
(busy, idle and newidle for load balancing, forkexec for wake-up
slow-path load balancing and wake for affine wakeup based on weight).
Only the sd idx's for busy and idle load balancing are set to 2,3 or 1,2
respectively. All the other sd idx's are set to 0.

Conservative load balancing is achieved for sd idx >= 1 by using the
min or max (source_load()/target_load()) of the current weighted CPU load
and rq->cpu_load[sd idx - 1] for the busiest (idlest)/local CPU load in
load balancing, or vice versa in the wake-up slow-path load balancing.
There is no conservative balancing for sd idx = 0, since only the current
weighted CPU load is used in this case.
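
For reference, the min/max selection described above lives in source_load()/target_load();
at the time (v4.18-era kernel/sched/fair.c) they looked roughly like this:

	static unsigned long source_load(int cpu, int type)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long total = weighted_cpuload(rq);

		if (type == 0 || !sched_feat(LB_BIAS))
			return total;

		return min(rq->cpu_load[type-1], total);
	}

	static unsigned long target_load(int cpu, int type)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long total = weighted_cpuload(rq);

		if (type == 0 || !sched_feat(LB_BIAS))
			return total;

		return max(rq->cpu_load[type-1], total);
	}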

It is very likely that LB_BIAS' influence on load balancing can be
neglected (see test results below). This is further supported by:

(1) Weighted CPU load today is by itself a decayed average value (PELT)
    (cfs_rq->avg->runnable_load_avg) and not the instantaneous load
    (rq->load.weight) it was when LB_BIAS was introduced.

(2) Sd imbalance_pct is used for CPU_NEWLY_IDLE and CPU_NOT_IDLE (relate
    to sd's newidle and busy idx) in find_busiest_group() when comparing
    busiest and local avg load to make load balancing even more
    conservative.

(3) The sd forkexec and newidle idx are always set to 0 so there is no
    adjustment on how conservatively load balancing is done here.

(4) Affine wakeup based on weight (wake_affine_weight()) will not be
    impacted since the sd wake idx is always set to 0.

Let's disable LB_BIAS by default for a few kernel releases to make sure
that no workload and no scheduler topology is affected. The benefit of
being able to remove the LB_BIAS dependency from source_load() and
target_load() is that the entire rq->cpu_load[idx] code could be removed
in this case.

It is really hard to say whether there is any regression without testing this with
a lot of different workloads on a lot of different platforms, especially
NUMA machines.
The following 104 LKP (Linux Kernel Performance) tests were run by the
0-Day guys, mostly on multi-socket hosts with a large number of logical
CPUs (88, 192).
The base for the test was commit b3dae109fa89 ("sched/swait: Rename to
exclusive") (tip/sched/core v4.18-rc1).
Only 2 out of the 104 tests had a significant change in one of the
metrics (fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance +7%
files_per_sec, unixbench/300s-100%-syscall-performance -11% score).
Tests which showed a change in one of the metrics are marked with a '*'
and this change is listed as well.

(a) lkp-bdw-ep3:
      88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 64G

    dd-write/10m-1HDD-cfq-btrfs-100dd-performance
    fsmark/1x-1t-1HDD-xfs-nfsv4-4M-60G-NoSync-performance
  * fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance
      7.50  ±  7%  8.00  ±  6%  fsmark.files_per_sec
    fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-fsyncBeforeClose-performance
    fsmark/1x-1t-1HDD-btrfs-4M-60G-NoSync-performance
    fsmark/1x-1t-1HDD-btrfs-4M-60G-fsyncBeforeClose-performance
    kbuild/300s-50%-vmlinux_prereq-performance
    kbuild/300s-200%-vmlinux_prereq-performance
    kbuild/300s-50%-vmlinux_prereq-performance-1HDD-ext4
    kbuild/300s-200%-vmlinux_prereq-performance-1HDD-ext4

(b) lkp-skl-4sp1:
      192 threads Intel(R) Xeon(R) Platinum 8160 768G

    dbench/100%-performance
    ebizzy/200%-100x-10s-performance
    hackbench/1600%-process-pipe-performance
    iperf/300s-cs-localhost-tcp-performance
    iperf/300s-cs-localhost-udp-performance
    perf-bench-numa-mem/2t-300M-performance
    perf-bench-sched-pipe/10000000ops-process-performance
    perf-bench-sched-pipe/10000000ops-threads-performance
    schbench/2-16-300-30000-30000-performance
    tbench/100%-cs-localhost-performance

(c) lkp-bdw-ep6:
      88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 128G

    stress-ng/100%-60s-pipe-performance
    unixbench/300s-1-whetstone-double-performance
    unixbench/300s-1-shell1-performance
    unixbench/300s-1-shell8-performance
    unixbench/300s-1-pipe-performance
  * unixbench/300s-1-context1-performance
      312  315  unixbench.score
    unixbench/300s-1-spawn-performance
    unixbench/300s-1-syscall-performance
    unixbench/300s-1-dhry2reg-performance
    unixbench/300s-1-fstime-performance
    unixbench/300s-1-fsbuffer-performance
    unixbench/300s-1-fsdisk-performance
    unixbench/300s-100%-whetstone-double-performance
    unixbench/300s-100%-shell1-performance
    unixbench/300s-100%-shell8-performance
    unixbench/300s-100%-pipe-performance
    unixbench/300s-100%-context1-performance
    unixbench/300s-100%-spawn-performance
  * unixbench/300s-100%-syscall-performance
      3571  ±  3%  -11%  3183  ±  4%  unixbench.score
    unixbench/300s-100%-dhry2reg-performance
    unixbench/300s-100%-fstime-performance
    unixbench/300s-100%-fsbuffer-performance
    unixbench/300s-100%-fsdisk-performance
    unixbench/300s-1-execl-performance
    unixbench/300s-100%-execl-performance
  * will-it-scale/brk1-performance
      365004  360387  will-it-scale.per_thread_ops
  * will-it-scale/dup1-performance
      432401  437596  will-it-scale.per_thread_ops
    will-it-scale/eventfd1-performance
    will-it-scale/futex1-performance
    will-it-scale/futex2-performance
    will-it-scale/futex3-performance
    will-it-scale/futex4-performance
    will-it-scale/getppid1-performance
    will-it-scale/lock1-performance
    will-it-scale/lseek1-performance
    will-it-scale/lseek2-performance
  * will-it-scale/malloc1-performance
      47025  45817  will-it-scale.per_thread_ops
      77499  76529  will-it-scale.per_process_ops
    will-it-scale/malloc2-performance
  * will-it-scale/mmap1-performance
      123399  120815  will-it-scale.per_thread_ops
      152219  149833  will-it-scale.per_process_ops
  * will-it-scale/mmap2-performance
      107327  104714  will-it-scale.per_thread_ops
      136405  133765  will-it-scale.per_process_ops
    will-it-scale/open1-performance
  * will-it-scale/open2-performance
      171570  168805  will-it-scale.per_thread_ops
      532644  526202  will-it-scale.per_process_ops
    will-it-scale/page_fault1-performance
    will-it-scale/page_fault2-performance
    will-it-scale/page_fault3-performance
    will-it-scale/pipe1-performance
    will-it-scale/poll1-performance
  * will-it-scale/poll2-performance
      176134  172848  will-it-scale.per_thread_ops
      281361  275053  will-it-scale.per_process_ops
    will-it-scale/posix_semaphore1-performance
    will-it-scale/pread1-performance
    will-it-scale/pread2-performance
    will-it-scale/pread3-performance
    will-it-scale/pthread_mutex1-performance
    will-it-scale/pthread_mutex2-performance
    will-it-scale/pwrite1-performance
    will-it-scale/pwrite2-performance
    will-it-scale/pwrite3-performance
  * will-it-scale/read1-performance
      1190563  1174833  will-it-scale.per_thread_ops
  * will-it-scale/read2-performance
      1105369  1080427  will-it-scale.per_thread_ops
    will-it-scale/readseek1-performance
  * will-it-scale/readseek2-performance
      261818  259040  will-it-scale.per_thread_ops
    will-it-scale/readseek3-performance
  * will-it-scale/sched_yield-performance
      2408059  2382034  will-it-scale.per_thread_ops
    will-it-scale/signal1-performance
    will-it-scale/unix1-performance
    will-it-scale/unlink1-performance
    will-it-scale/unlink2-performance
  * will-it-scale/write1-performance
      976701  961588  will-it-scale.per_thread_ops
  * will-it-scale/writeseek1-performance
      831898  822448  will-it-scale.per_thread_ops
  * will-it-scale/writeseek2-performance
      228248  225065  will-it-scale.per_thread_ops
  * will-it-scale/writeseek3-performance
      226670  224058  will-it-scale.per_thread_ops
    will-it-scale/context_switch1-performance
    aim7/performance-fork_test-2000
  * aim7/performance-brk_test-3000
      74869  76676  aim7.jobs-per-min
    aim7/performance-disk_cp-3000
    aim7/performance-disk_rd-3000
    aim7/performance-sieve-3000
    aim7/performance-page_test-3000
    aim7/performance-creat-clo-3000
    aim7/performance-mem_rtns_1-8000
    aim7/performance-disk_wrt-8000
    aim7/performance-pipe_cpy-8000
    aim7/performance-ram_copy-8000

(d) lkp-avoton3:
      8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz 16G

    netperf/ipv4-900s-200%-cs-localhost-TCP_STREAM-performance

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Li Zhijian <zhijianx.li@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180809135753.21077-1-dietmar.eggemann@arm.com
Change-Id: Ia84e33416b394990da2fd0f2d21bd499ce76a65d
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:58:50 -03:00
Kazuki H
b4f4f18aa7 irq: Don't allow IRQ affinities to be set from userspace
Change-Id: I8278aec4280103cdb092f197ded20831d9f57fd4
Signed-off-by: Kazuki H <kazukih0205@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:54:37 -03:00
Sultan Alsawaf
70f60e8bfe sbalance: Fix severe misattribution of movable IRQs to the last active CPU
Due to a horrible omission in the big IRQ list traversal, all movable IRQs
are misattributed to the last active CPU in the system since that's what
`bd` is last set to in the loop prior. This horribly breaks SBalance's
notion of balance, producing nonsensical balancing decisions and failing to
balance IRQs even when they are heavily imbalanced.

Fix the massive breakage by adding the missing line of code to set `bd` to
the CPU an IRQ actually belongs to, so that it's added to the correct CPU's
movable IRQs list.

Change-Id: Ide222d361152b1cd03c1894c995cab42980d16e7
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:55 -03:00
Sultan Alsawaf
9d49f44e04 sbalance: Don't race with CPU hotplug
When a CPU is hotplugged, cpu_active_mask is modified without any RCU
synchronization. As a result, the only synchronization for cpu_active_mask
provided by the hotplug code is the CPU hotplug lock.

Furthermore, since IRQ balance is majorly disrupted during CPU hotplug due
to mass IRQ migration off a dying CPU, SBalance just shouldn't operate
while a CPU hotplug is in progress.

Take the CPU hotplug lock in balance_irqs() to prevent races and mishaps
during CPU hotplugs.
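
A minimal sketch of the idea, assuming balance_irqs() is the driver's main polling
routine (illustrative only, not the literal SBalance source):

	#include <linux/cpu.h>

	static void balance_irqs(void)
	{
		/* Pin cpu_active_mask (and hotplug state) for the whole pass. */
		get_online_cpus();

		/* ... gather per-CPU IRQ stats and migrate IRQs here ... */

		put_online_cpus();
	}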

Change-Id: If377de7b78e3ae68a20bc95bdb84650330cfc330
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:54 -03:00
Sultan Alsawaf
8d891a9e71 sbalance: Convert various IRQ counter types to unsigned ints
These counted values are actually unsigned ints, not unsigned longs.
Convert them to unsigned ints since there's no reason for them to be longs.

Change-Id: Ia5c4a3162a072b4fa3225afdcd969db95b60c802
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:53 -03:00
Sultan Alsawaf
9fe60c2f32 sbalance: Fix systemic issues caused by flawed IRQ statistics
SBalance's statistic of new interrupts for each CPU is inherently flawed in
that it cannot track IRQ migration that occurs in between balance periods.
As a result, SBalance can observe a flawed number of new interrupts for a
CPU, which hurts its balancing decisions.

Furthermore, SBalance incorrectly assumes that IRQs are affined where
SBalance last placed them, which breaks SBalance entirely when the
assumption doesn't hold true.

As it turns out, it can be quite common to change an IRQ's affinity and
observe a successful return value despite the IRQ not actually moving. At
the very least this is observed on ARM's GICv3, and results in SBalance
never moving such an IRQ ever again because SBalance always thinks it has
zero new interrupts.

Since we can't trust irqchip drivers or hardware, gather IRQ statistics
directly in order to get the true number of new interrupts for each CPU and
the actual affinity of each IRQ based on the last CPU it fired upon.
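
A sketch of the statistics gathering, using the kernel's own per-CPU interrupt
counters (irq_delta() is a hypothetical helper, not the actual driver code):

	#include <linux/kernel_stat.h>

	/* How many times did @irq fire on @cpu since the previous poll? */
	static unsigned int irq_delta(unsigned int irq, int cpu, unsigned int *prev)
	{
		unsigned int now = kstat_irqs_cpu(irq, cpu);
		unsigned int delta = now - *prev;

		*prev = now;
		return delta;
	}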

Change-Id: Ic846adac244a0873c4502987e0904b552ab31f22
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:52 -03:00
Sultan Alsawaf
9cddb6a1a0 sbalance: Use non-atomic cpumask_clear_cpu() variant
The atomic cpumask_clear_cpu() isn't needed. Use __cpumask_clear_cpu()
instead as a micro-optimization, and for clarity.

Change-Id: I17d168814c4b96557c8a9f986c2c5be8e18be26b
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:51 -03:00
Sultan Alsawaf
a4e8c0a0a3 sbalance: Use a deferrable timer to avoid waking idle CPUs
SBalance is designed to poll to balance IRQs, but it shouldn't kick CPUs
out of idle to do so because idle CPUs clearly aren't processing
interrupts.

Open code a freezable wait that uses a deferrable timer in order to prevent
SBalance from waking up idle CPUs when there is little interrupt traffic.
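
Roughly, the open-coded wait could look like this (a sketch with hypothetical names,
assuming timer_setup_on_stack() is available on this tree):

	#include <linux/freezer.h>
	#include <linux/sched.h>
	#include <linux/timer.h>

	static struct task_struct *sbalance_task;	/* hypothetical polling thread */

	static void sbalance_timer_fn(struct timer_list *t)
	{
		wake_up_process(sbalance_task);
	}

	static void sbalance_wait(unsigned long delay_jiffies)
	{
		struct timer_list timer;

		timer_setup_on_stack(&timer, sbalance_timer_fn, TIMER_DEFERRABLE);
		set_current_state(TASK_INTERRUPTIBLE);
		mod_timer(&timer, jiffies + delay_jiffies);

		/* Deferrable timer: an idle CPU isn't woken just to expire it. */
		freezable_schedule();

		del_timer_sync(&timer);
		destroy_timer_on_stack(&timer);
	}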

Change-Id: I5f796a4590801c9a5935ca7ea8c966ca281620c7
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:51 -03:00
Sultan Alsawaf
cc5e8988ca sbalance: Allow IRQs to be moved off of excluded CPUs
Excluded CPUs are excluded from IRQ balancing with the intention that those
CPUs shouldn't really be processing interrupts, and thus shouldn't have
IRQs moved to them. However, SBalance completely ignores excluded CPUs,
which can cause them to end up with a disproportionate amount of interrupt
traffic that SBalance won't spread out. An easy example of this is when
CPU0 is an excluded CPU, since CPU0 ends up with all interrupts affined to
it by default on arm64.

Allow SBalance to move IRQs off of excluded CPUs so that they cannot slip
under the radar and pile up on an excluded CPU, like when CPU0 is excluded.

Change-Id: I392a058ea8cf7672bfea39ff9525bf6b7c52a062
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:49 -03:00
Sultan Alsawaf
fc6b2568cb kernel: Introduce SBalance IRQ balancer
This is a simple IRQ balancer that polls every X number of milliseconds and
moves IRQs from the most interrupt-heavy CPU to the least interrupt-heavy
CPUs until the heaviest CPU is no longer the heaviest. IRQs are only moved
from one source CPU to any number of destination CPUs per balance run.
Balancing is skipped if the gap between the most interrupt-heavy CPU and
the least interrupt-heavy CPU is below the configured threshold of
interrupts.

The heaviest IRQs are targeted for migration in order to reduce the number
of IRQs to migrate. If moving an IRQ would reduce overall balance, then it
won't be migrated.

The most interrupt-heavy CPU is calculated by scaling the number of new
interrupts on that CPU to the CPU's current capacity. This way, interrupt
heaviness takes into account factors such as thermal pressure and time
spent processing interrupts rather than just the sheer number of them. This
also makes SBalance aware of CPU asymmetry, where different CPUs can have
different performance capacities and be proportionally balanced.
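
The heaviness metric could be expressed roughly as follows (illustrative only;
arch_scale_cpu_capacity() reports a CPU's current capacity on this kernel):

	/*
	 * Scale the raw count of new interrupts by CPU capacity, so a small or
	 * thermally limited CPU with fewer absolute interrupts can still rank
	 * as the heaviest relative to what it can actually process.
	 */
	static unsigned long irq_heaviness(unsigned int nr_new_irqs, int cpu)
	{
		unsigned long cap = arch_scale_cpu_capacity(NULL, cpu);

		return (unsigned long)nr_new_irqs * SCHED_CAPACITY_SCALE / cap;
	}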

Change-Id: Ie40c87ca357814b9207726f67e2530fffa7dd198
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:48 -03:00
Sultan Alsawaf
327e77e79a kernel: Warn when an IRQ's affinity notifier gets overwritten
An IRQ affinity notifier getting overwritten can point to some annoying
issues which need to be resolved, like multiple pm_qos objects being
registered to the same IRQ. Print out a warning when this happens to aid
debugging.

Change-Id: I087a6ea7472fa7ba45bdb02efeae25af5c664950
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:47 -03:00
Sultan Alsawaf
cde9a6dfc7 kernel: Only set one CPU in the default IRQ affinity mask
On ARM, IRQs are executed on the first CPU inside the affinity mask, so
setting an affinity mask with more than one CPU set is deceptive and
causes issues with pm_qos. To fix this, only set the CPU0 bit inside the
affinity mask, since that's where IRQs will run by default.

This is a follow-up to "kernel: Don't allow IRQ affinity masks to have
more than one CPU".

Change-Id: Ib6ef803ab866686c30e1aa1d06f98692ee39ed6c
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:46 -03:00
Sultan Alsawaf
547eaaa17b kernel: Don't allow IRQ affinity masks to have more than one CPU
Even with an affinity mask that has multiple CPUs set, IRQs always run
on the first CPU in their affinity mask. Drivers that register an IRQ
affinity notifier (such as pm_qos) will therefore have an incorrect
assumption of where an IRQ is affined.

Fix the IRQ affinity mask deception by forcing it to only contain one
set CPU.
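
Conceptually the change reduces to something like this (a sketch, not the literal diff):

	#include <linux/cpumask.h>

	/*
	 * Trim a requested affinity mask down to its first CPU, since that is
	 * where the IRQ will actually be delivered anyway.
	 */
	static void irq_trim_affinity(struct cpumask *mask)
	{
		unsigned int first = cpumask_first(mask);

		if (first >= nr_cpu_ids)
			return;

		cpumask_clear(mask);
		cpumask_set_cpu(first, mask);
	}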

Change-Id: I212ff578f731ee78fabb8f63e49ef0b96c286521
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:45 -03:00
Richard Raya
1b396d869a msm-4.14: Drop perf-critical API
Change-Id: I17edd46742608a3ed8349a60b71716c944d4a0f4
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:44 -03:00
Richard Raya
602aa3bba8 msm-4.14: Drop sched_migrate_to_cpumask
Change-Id: I8b03f4b7f90c6486d42ef767ba0b52a9567830a2
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:43 -03:00
Samuel Pascua
d6e561f94c sched: Backport IRQ utilization tracking
Change-Id: Id432ab10f7acb00ad2d1bb36400504584629a2b6
Signed-off-by: Samuel Pascua <pascua.samuel.14@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:39 -03:00
Alexander Winkowski
90dd46c816 sched/cass: Skip reserved cpus
Change-Id: I77e5663fa00afba2211b52997e007a0f2e6364e2
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:38 -03:00
Alexander Winkowski
7e8a73c333 sched/cass: No thermal throttling for us
Change-Id: If892e9c33656b7f829d2adb3d7228ac12313dd2c
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:39 -03:00
Richard Raya
c631cde2a8 sched/cass: Fix arch_scale_cpu_capacity params
Change-Id: I8e55400ad416882a735a4dc72bcaeaaa23f11019
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:36 -03:00
Richard Raya
aa86f45366 sched/cass: Fix hard_util accounting
Change-Id: I1c8147a04003c20eb9046d490e90ba98cf376115
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:33 -03:00
Sultan Alsawaf
6b6b113c0e sched/cass: Don't fight the idle load balancer
The idle load balancer (ILB) is kicked whenever a task is misfit, meaning
that the task doesn't fit on its CPU (i.e., fits_capacity() == false).

Since CASS makes no attempt to place tasks such that they'll fit on the CPU
they're placed upon, the ILB works harder to correct this and rebalances
misfit tasks onto a CPU with sufficient capacity.

By fighting the ILB like this, CASS degrades both energy efficiency and
performance.

Play nicely with the ILB by trying to place tasks onto CPUs that fit.
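
For reference, fits_capacity() is the ~20% headroom check used by the misfit logic;
in recent kernel/sched/fair.c it is defined roughly as:

	/* A task fits on a CPU if its utilization leaves ~20% capacity headroom. */
	#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)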

Change-Id: I317a3f19b83400d4b55d35d4a51e88268d0399c1
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
3bf932216b sched/cass: Honor uclamp even when no CPUs can satisfy the requirement
When all CPUs available to a uclamp'd process are thermal throttled, it is
possible for them to be throttled below the uclamp minimum requirement. In
this case, CASS only considers uclamp when it compares relative utilization
and nowhere else; i.e., CASS essentially ignores the most important aspect
of uclamp.

Fix it so that CASS tries to honor uclamp even when no CPUs available to a
uclamp'd process are capable of fully meeting the uclamp minimum.

Change-Id: I93885cd7a94502c58a9e96eb43bb00ef01d15988
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
a421891669 sched/cass: Fix disproportionate load spreading when CPUs are throttled
When CPUs are thermal throttled, CASS tries to spread load such that their
resulting P-state is scaled relatively to their _throttled_ maximum
capacity, rather than their original capacity.

As a result, throttled CPUs are unfairly under-utilized, causing other CPUs
to receive the extra burden and thus run at a disproportionately higher
P-state relative to the throttled CPUs. This not only hurts performance,
but also greatly diminishes energy efficiency since it breaks CASS's basic
load balancing principle.

To fix this, some convoluted logic is required in order to make CASS aware
of a CPU's throttled and non-throttled capacity. The non-throttled capacity
is used for the fundamental relative utilization comparison, while the
throttled capacity is used in conjunction to ensure a throttled CPU isn't
accidentally overloaded as a result.

Change-Id: I2cdabc4aa88e724252886c15040eabf40ab9150e
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
7952b4c0e6 sched/cass: Eliminate redundant calls to smp_processor_id()
Calling smp_processor_id() can be expensive depending on how an arch
implements it, so avoid calling it more than necessary.

Use the raw variant too since this code is always guaranteed to run with
preemption disabled.

Change-Id: If96aeeb0aeb9f0c1cb2ebf9dcf31ced04ebe135c
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
3c1e39b2bf sched/cass: Only treat sync waker CPU as idle if there's one task running
For synchronized wakes, the waker's CPU should only be treated as idle if
there aren't any other running tasks on that CPU. This is because, for
synchronized wakes, it is assumed that the waker will immediately go to
sleep after waking the wakee; therefore, if there aren't any other tasks
running on the waker's CPU, it'll go idle and should be treated as such to
improve task placement.

This optimization only applies when there aren't any other tasks running on
the waker's CPU, however.

Fix it by ensuring that there's only the waker running on its CPU.
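
In code form the condition is roughly the following (hypothetical helper name):

	/*
	 * For a sync wakeup, treat the waker's CPU as idle only when the waker
	 * itself is the sole runnable task there.
	 */
	static bool cass_sync_cpu_is_idle(int this_cpu, bool sync)
	{
		return sync && cpu_rq(this_cpu)->nr_running == 1;
	}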

Change-Id: I03cfd16d423cc920c103b8734b6b8a9089a9e59c
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
6a8fed2d40 sched/cass: Fix suboptimal task placement when uclamp is used
Uclamp is designed to specify a process' CPU performance requirement scaled
as a CPU capacity value. It simply denotes the process' requirement for the
CPU's raw performance and thus P-state.

CASS currently treats uclamp as a CPU load value however, producing wildly
suboptimal CPU placement decisions for tasks which use uclamp. This hurts
performance and, even worse, massively hurts energy efficiency, with CASS
sometimes yielding power consumption that is a few times higher than EAS.

Since uclamp inherently throws a wrench into CASS's goal of keeping
relative P-states as low as possible across all CPUs, making it cooperate
with CASS requires a multipronged approach.

Make the following three changes to fix the uclamp task placement issue:
  1. Treat uclamp as a CPU performance value rather than a CPU load value.
  2. Clamp a CPU's utilization to the task's uclamp floor in order to keep
     relative P-states as low as possible across all CPUs.
  3. Consider preferring a non-idle CPU for uclamped tasks to avoid pushing
     up the P-state of more than one CPU when there are multiple concurrent
     uclamped tasks.

This fixes CASS's massive energy efficiency and performance issues when
uclamp is used.
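
Point 2 above can be sketched as follows (hypothetical helper; uclamp_eff_value()
returns the task's effective clamp):

	/*
	 * Raise the CPU's observed utilization to the task's uclamp floor so
	 * the relative-utilization comparison reflects the requested P-state.
	 */
	static unsigned long cass_util_with_uclamp(struct task_struct *p,
						   unsigned long cpu_util)
	{
		return max_t(unsigned long, cpu_util,
			     uclamp_eff_value(p, UCLAMP_MIN));
	}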

Change-Id: Ib274ceecfbbe9c2eeb1738f97029e1f4cbc68ec0
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
226fafc0b1 sched/cass: Perform runqueue selection for RT tasks too
RT tasks aren't placed on CPUs in a load-balanced manner, much less an
energy efficient one. On systems which contain many RT tasks and/or IRQ
threads, energy efficiency and throughput are diminished significantly by
the default RT runqueue selection scheme which targets minimal latency.

In practice, performance is actually improved by spreading RT tasks fairly,
despite the small latency impact. Additionally, energy efficiency is
significantly improved since the placement of all tasks benefits from
energy-efficient runqueue selection, rather than just CFS tasks.

Perform runqueue selection for RT tasks in CASS to significantly improve
energy efficiency and overall performance.

Change-Id: Ie551296e1034baa2dfc2bb7f0191ca95f5abc639
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
f6d6472722 sched/cass: Clean up local variable scope in cass_best_cpu()
Move `curr` and `idle_state` to within the loop's scope for better
readability. Also, leave a comment about `curr->cpu` to make it clear that
`curr->cpu` must be initialized within the loop in order for `best->cpu` to
be valid.

Change-Id: I1244ac06d62c172f46dbf337e7bb95758329a188
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
0f790875be sched/cass: Fix CPU selection when no candidate CPUs are idle
When no candidate CPUs are idle, CASS would keep `cidx` unchanged, and thus
`best == curr` would always be true. As a result, since the empty candidate
slot never changes, the current candidate `curr` always overwrites the best
candidate `best`. This causes the last valid CPU to always be selected by
CASS when no CPUs are idle (i.e., under heavy load).

Fix it by ensuring that the CPU loop in cass_best_cpu() flips the free
candidate index after the first candidate CPU is evaluated.

Change-Id: Id1e371c0fe6a2e6321f1c9f68a47e4a26c9a0cba
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Richard Raya
20540b834c sched/cass: Checkout again
Change-Id: Ib7993c14e7d3ffa354be744629593a8646f55efa
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
22ba4c6d3a schedutil: Set default rate limit to 2000 us
This is empirically observed to yield good performance with reduced power
consumption. With "cpufreq: schedutil: Ignore rate limit when scaling up
with FIE present", this only affects frequency reductions when FIE is
present, since there is no rate limit applied when scaling up.

Change-Id: I1bff1f007f06e67b672877107c9685b6fb83647a
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
f1c3e01e39 schedutil: Ignore rate limit when scaling up with FIE present
When schedutil disregards a frequency transition due to the transition rate
limit, there is no guaranteed deadline as to when the frequency transition
will actually occur after the rate limit expires. For instance, depending
on how long a CPU spends in a preempt/IRQs disabled context, a rate-limited
frequency transition may be delayed indefinitely, until said CPU reaches
the scheduler again. This also hurts tasks boosted via UCLAMP_MIN.

For frequency transitions _down_, this only poses a theoretical loss of
energy savings since a CPU may remain at a higher frequency than necessary
for an indefinite period beyond the rate limit expiry.

For frequency transitions _up_, however, this poses a significant hit to
performance when a CPU is stuck at an insufficient frequency for an
indefinitely long time. In latency-sensitive and bursty workloads
especially, a missed frequency transition up can result in a significant
performance loss due to a CPU operating at an insufficient frequency for
too long.

When support for the Frequency Invariant Engine (FIE) _isn't_ present, a
rate limit is always required for the scheduler to compute CPU utilization
with some semblance of accuracy: any frequency transition that occurs
before the previous transition latches would result in the scheduler not
knowing the frequency a CPU is actually operating at, thereby trashing the
computed CPU utilization.

However, when FIE support _is_ present, there's no technical requirement to
rate limit all frequency transitions to a cpufreq driver's reported
transition latency. With FIE, the scheduler's CPU utilization tracking is
unaffected by any frequency transitions that occur before the previous
frequency is latched.

Therefore, ignore the frequency transition rate limit when scaling up on
systems where FIE is present. This guarantees that transitions to a higher
frequency cannot be indefinitely delayed, since they simply cannot be
delayed at all.

Change-Id: I0dc5c6c710c10c63b7fc69970db044982de2a2d7
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
82fd18f9bd schedutil: Fix superfluous updates caused by need_freq_update
A redundant frequency update is only truly needed when there is a policy
limits change with a driver that specifies CPUFREQ_NEED_UPDATE_LIMITS.

In spite of that, drivers specifying CPUFREQ_NEED_UPDATE_LIMITS receive a
frequency update _all the time_, not just for a policy limits change,
because need_freq_update is never cleared.

Furthermore, ignore_dl_rate_limit()'s usage of need_freq_update also leads
to a redundant frequency update, regardless of whether or not the driver
specifies CPUFREQ_NEED_UPDATE_LIMITS, when the next chosen frequency is the
same as the current one.

Fix the superfluous updates by only honoring CPUFREQ_NEED_UPDATE_LIMITS
when there's a policy limits change, and clearing need_freq_update when a
requisite redundant update occurs.

This is neatly achieved by moving up the CPUFREQ_NEED_UPDATE_LIMITS test
and instead setting need_freq_update to false in sugov_update_next_freq().
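
The resulting shape of sugov_update_next_freq() mirrors the equivalent mainline code,
roughly as below (the 4.14 backport may differ in detail):

	static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
					   unsigned int next_freq)
	{
		if (sg_policy->need_freq_update)
			sg_policy->need_freq_update =
				cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
		else if (sg_policy->next_freq == next_freq)
			return false;

		sg_policy->next_freq = next_freq;
		sg_policy->last_freq_update_time = time;

		return true;
	}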

Change-Id: Iedd47851eabe5a12ed3255b84cd0468da2fbbc80
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:42 -03:00
EmanuelCN
86c781853b schedutil: Remove up/down rate limits
To make way for new changes

Change-Id: Ie28ebf8ea187c8c3da79ee896224f6eeb4f513a6
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:42 -03:00
Rafael J. Wysocki
2f1794050a schedutil: Simplify sugov_update_next_freq()
Rearrange a conditional to make it more straightforward.

Change-Id: I1c9d793cac29bc5a2fdc047ac4c01bba5044489e
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:42 -03:00
Viresh Kumar
526e0afda2 schedutil: Don't skip freq update if need_freq_update is set
The cpufreq policy's frequency limits (min/max) can get changed at any
point of time, while schedutil is trying to update the next frequency.
Though the schedutil governor has necessary locking and support in place
to make sure we don't miss any of those updates, there is a corner case
where the governor will find that the CPU is already running at the
desired frequency and so may skip an update.

For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
and is running at 1 GHz currently. Schedutil tries to update the
frequency to 1.2 GHz, during this time the policy limits get changed as
policy->min = 1.4 GHz. As schedutil (and cpufreq core) does clamp the
frequency at various instances, we will eventually set the frequency to
1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq.

Now let's say the policy limits get changed back at this time with
policy->min as 1 GHz. The next time schedutil is invoked by the
scheduler, we will reevaluate the next frequency (because
need_freq_update will get set due to the limits change event) and let's say
we want to set the frequency to 1.2 GHz again. At this point
sugov_update_next_freq() will find next_freq == current_freq and
will abort the update, while the CPU actually runs at 1.4 GHz.

Until now need_freq_update was used as a flag to indicate that the
policy's frequency limits have changed, and that we should consider the
new limits while reevaluating the next frequency.

This patch fixes the above mentioned issue by extending the purpose of
the need_freq_update flag. If this flag is set now, the schedutil
governor will not try to abort a frequency change even if next_freq ==
current_freq.

As similar behavior is required in the case of
CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
set to false if that flag is set for the driver.

We also don't need to consider the need_freq_update flag in
sugov_update_single() anymore to handle the special case of busy CPU, as
we won't abort a frequency update anymore.

Change-Id: I699f1ce2bddf3ed35e29fc8ec549fa498654965b
Reported-by: zhuguangqing <zhuguangqing@xiaomi.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rearrange code to avoid a branch ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:42 -03:00
Rafael J. Wysocki
5ef6e95dd8 schedutil: Always call driver if CPUFREQ_NEED_UPDATE_LIMITS is set
Because sugov_update_next_freq() may skip a frequency update even if
the need_freq_update flag has been set for the policy at hand, policy
limits updates may not take effect as expected.

For example, if the intel_pstate driver operates in the passive mode
with HWP enabled, it needs to update the HWP min and max limits when
the policy min and max limits change, respectively, but that may not
happen if the target frequency does not change along with the limit
at hand.  In particular, if the policy min is changed first, causing
the target frequency to be adjusted to it, and the policy max limit
is changed later to the same value, the HWP max limit will not be
updated to follow it as expected, because the target frequency is
still equal to the policy min limit and it will not change until
that limit is updated.

To address this issue, modify get_next_freq() to let the driver
callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
is set regardless of whether or not the new frequency to set is
equal to the previous one.

Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Change-Id: Icb43808d865a28e3ff4630cf3b65502fd1e3a466
Reported-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags()
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:42 -03:00
Richard Raya
a7563ebf2c Revert "sched: restrict iowait boost to tasks with prefer_idle"
This reverts commit ba350f071e4af45aedb698851123397e5d041fd2.

Change-Id: I3287279e0d8882356b6799bec3993b5b52c15a12
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:41 -03:00
EmanuelCN
6993932a60 schedutil: Implement tapered dvfs_headroom
Inspired by: LineageOS/android_kernel_google_gs201@752c5f9

Change-Id: I2426f750416cbf9a7cb6876bcd386ae4c40825ca
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-17 04:17:29 -03:00
Samuel Pascua
d9f0298279 schedutil: Use map_util_freq()
Change-Id: If9cf1b47dee3b9bd0663c88034da8edc98bd28f6
Signed-off-by: Samuel Pascua <pascua.samuel.14@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-17 04:17:21 -03:00
Qais Yousef
de8ca9bcc7 sched/uclamp: Fix rq->uclamp_max not set on first enqueue
[ Upstream commit 315c4f884800c45cb6bd8c90422fad554a8b9588 ]

Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct
uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to
match the woken up task's uclamp_max when the rq is idle.

The code was relying on rq->uclamp_max initialized to zero, so on first
enqueue

	static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
					    enum uclamp_id clamp_id)
	{
		...

		if (uc_se->value > READ_ONCE(uc_rq->value))
			WRITE_ONCE(uc_rq->value, uc_se->value);
	}

was actually resetting it. But since commit d81ae8aac85c changed the
default to 1024, this no longer works. And since rq->uclamp_flags is
also initialized to 0, neither above code path nor uclamp_idle_reset()
update the rq->uclamp_max on first wake up from idle.

This is only visible from first wake up(s) until the first dequeue to
idle after enabling the static key. And it only matters if the
uclamp_max of this task is < 1024 since only then its uclamp_max will be
effectively ignored.

Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to
ensure uclamp_idle_reset() is called which then will update the rq
uclamp_max value as expected.
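
The upstream fix is essentially a one-liner in init_uclamp_rq():

	static void init_uclamp_rq(struct rq *rq)
	{
		enum uclamp_id clamp_id;
		struct uclamp_rq *uc_rq = rq->uclamp;

		for_each_clamp_id(clamp_id) {
			uc_rq[clamp_id] = (struct uclamp_rq) {
				.value = uclamp_none(clamp_id)
			};
		}

		rq->uclamp_flags = UCLAMP_FLAG_IDLE;	/* was 0 */
	}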

Fixes: d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
2024-12-16 14:46:43 -03:00
Quentin Perret
56ed3a51c0 sched: Fix UCLAMP_FLAG_IDLE setting
The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
uclamp active task (that is, when buckets.tasks reaches 0 for all
buckets) to maintain the last uclamp.max and prevent blocked util from
suddenly becoming visible.

However, there is an asymmetry in how the flag is set and cleared which
can lead to having the flag set whilst there are active tasks on the rq.
Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
called at enqueue time, but set in uclamp_rq_dec_id() which is called
both when dequeueing a task _and_ in the update_uclamp_active() path. As
a result, when both uclamp_rq_{dec,ind}_id() are called from
update_uclamp_active(), the flag ends up being set but not cleared,
hence leaving the runqueue in a broken state.

Fix this by clearing the flag in update_uclamp_active() as well.
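
Upstream does this by adding a small helper along these lines and calling it from
the active-update path:

	static inline void uclamp_rq_reinc_id(struct rq *rq, struct task_struct *p,
					      enum uclamp_id clamp_id)
	{
		if (!p->uclamp[clamp_id].active)
			return;

		uclamp_rq_dec_id(rq, p, clamp_id);
		uclamp_rq_inc_id(rq, p, clamp_id);

		/*
		 * Make sure to clear the idle flag if we've transiently reached
		 * 0 active tasks on this rq.
		 */
		if (clamp_id == UCLAMP_MAX && (rq->uclamp_flags & UCLAMP_FLAG_IDLE))
			rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
	}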

Fixes: e496187da710 ("sched/uclamp: Enforce last task's UCLAMP_MAX")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
2024-12-16 14:46:43 -03:00
Quentin Perret
89fd7c7fa5 ANDROID: sched: Make uclamp changes depend on CAP_SYS_NICE
There is currently nothing preventing tasks from changing their per-task
clamp values in anyway that they like. The rationale is probably that
system administrators are still able to limit those clamps thanks to the
cgroup interface. However, this causes pain in a system where both
per-task and per-cgroup clamp values are expected to be under the
control of core system components (as is the case for Android).

To fix this, let's require CAP_SYS_NICE to change per-task clamp values.
There are ongoing discussions upstream about more flexible approaches
than this using the RLIMIT API -- see [1]. But the upstream discussion
has not converged yet, and this is way too late for UAPI changes in
android12-5.10 anyway, so let's apply this change which provides the
behaviour we want without actually impacting UAPIs.

[1] https://lore.kernel.org/lkml/20210623123441.592348-4-qperret@google.com/
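
The gist of the change is a capability check in the per-task uclamp validation path,
roughly as below (a sketch, not the literal Android patch):

	/* Reject per-task clamp changes from unprivileged callers. */
	static int uclamp_validate(struct task_struct *p,
				   const struct sched_attr *attr)
	{
		if (!capable(CAP_SYS_NICE))
			return -EPERM;

		/* ... existing checks on attr->sched_util_{min,max} follow ... */
		return 0;
	}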

Bug: 187186685
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I749312a77306460318ac5374cf243d00b78120dd
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
2024-12-16 14:46:43 -03:00
Xuewen Yan
1d6a30daff sched/uclamp: Ignore max aggregation if rq is idle
[ Upstream commit 3e1493f46390618ea78607cb30c58fc19e2a5035 ]

When a task wakes up on an idle rq, uclamp_rq_util_with() would max
aggregate with rq value. But since there is no task enqueued yet, the
values are stale based on the last task that was running. When the new
task actually wakes up and enqueued, then the rq uclamp values should
reflect that of the newly woken up task effective uclamp values.

This is a problem particularly for uclamp_max because it default to
1024. If a task p with uclamp_max = 512 wakes up, then max aggregation
would ignore the capping that should apply when this task is enqueued,
which is wrong.

Fix that by ignoring max aggregation if the rq is idle since in that
case the effective uclamp value of the rq will be the ones of the task
that will wake up.
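
The heart of the upstream fix in uclamp_rq_util_with() looks roughly like this:

	static __always_inline
	unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
					  struct task_struct *p)
	{
		unsigned long min_util = 0;
		unsigned long max_util = 0;

		if (p) {
			min_util = uclamp_eff_value(p, UCLAMP_MIN);
			max_util = uclamp_eff_value(p, UCLAMP_MAX);

			/*
			 * Ignore the rq's stale clamps when it is idle; the
			 * waking task will reset them on enqueue.
			 */
			if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
				goto out;
		}

		min_util = max_t(unsigned long, min_util,
				 READ_ONCE(rq->uclamp[UCLAMP_MIN].value));
		max_util = max_t(unsigned long, max_util,
				 READ_ONCE(rq->uclamp[UCLAMP_MAX].value));
	out:
		if (unlikely(min_util >= max_util))
			return min_util;

		return clamp(util, min_util, max_util);
	}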

Fixes: 9d20ad7dfc9a ("sched/uclamp: Add uclamp_util_with()")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
[qias: Changelog]
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lore.kernel.org/r/20210630141204.8197-1-xuewen.yan94@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
2024-12-16 14:46:43 -03:00
Qais Yousef
4683c606d0 sched/uclamp: Fix uclamp_tg_restrict()
[ Upstream commit 0213b7083e81f4acd69db32cb72eb4e5f220329a ]

Now that cpu.uclamp.min acts as a protection, we need to make sure that the
uclamp request of the task is within the allowed range of the cgroup,
that is it is clamp()'ed correctly by tg->uclamp[UCLAMP_MIN] and
tg->uclamp[UCLAMP_MAX].

As reported by Xuewen [1] we can have some corner cases where there's
inversion between uclamp requested by task (p) and the uclamp values of
the taskgroup it's attached to (tg). Following table demonstrates
2 corner cases:

	           |  p  |  tg  |  effective
	-----------+-----+------+-----------
	CASE 1
	-----------+-----+------+-----------
	uclamp_min | 60% | 0%   |  60%
	-----------+-----+------+-----------
	uclamp_max | 80% | 50%  |  50%
	-----------+-----+------+-----------
	CASE 2
	-----------+-----+------+-----------
	uclamp_min | 0%  | 30%  |  30%
	-----------+-----+------+-----------
	uclamp_max | 20% | 50%  |  20%
	-----------+-----+------+-----------

With this fix we get:

	           |  p  |  tg  |  effective
	-----------+-----+------+-----------
	CASE 1
	-----------+-----+------+-----------
	uclamp_min | 60% | 0%   |  50%
	-----------+-----+------+-----------
	uclamp_max | 80% | 50%  |  50%
	-----------+-----+------+-----------
	CASE 2
	-----------+-----+------+-----------
	uclamp_min | 0%  | 30%  |  30%
	-----------+-----+------+-----------
	uclamp_max | 20% | 50%  |  30%
	-----------+-----+------+-----------

Additionally uclamp_update_active_tasks() must now unconditionally
update both UCLAMP_MIN/MAX because changing the tg's UCLAMP_MAX for
instance could have an impact on the effective UCLAMP_MIN of the tasks.

	           |  p  |  tg  |  effective
	-----------+-----+------+-----------
	old
	-----------+-----+------+-----------
	uclamp_min | 60% | 0%   |  50%
	-----------+-----+------+-----------
	uclamp_max | 80% | 50%  |  50%
	-----------+-----+------+-----------
	*new*
	-----------+-----+------+-----------
	uclamp_min | 60% | 0%   | *60%*
	-----------+-----+------+-----------
	uclamp_max | 80% |*70%* | *70%*
	-----------+-----+------+-----------

[1] https://lore.kernel.org/lkml/CAB8ipk_a6VFNjiEnHRHkUMBKbA+qzPQvhtNjJ_YNzQhqV_o8Zw@mail.gmail.com/
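
With the fix, uclamp_tg_restrict() clamps the task's request into the cgroup's
window, roughly as upstream:

	static inline struct uclamp_se
	uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
	{
		/* Copy by value as we could modify it. */
		struct uclamp_se uc_req = p->uclamp_req[clamp_id];
	#ifdef CONFIG_UCLAMP_TASK_GROUP
		unsigned int tg_min, tg_max, value;

		/* Autogroup and root group tasks are restricted by system defaults. */
		if (task_group_is_autogroup(task_group(p)))
			return uc_req;
		if (task_group(p) == &root_task_group)
			return uc_req;

		tg_min = task_group(p)->uclamp[UCLAMP_MIN].value;
		tg_max = task_group(p)->uclamp[UCLAMP_MAX].value;
		value = clamp(uc_req.value, tg_min, tg_max);

		uclamp_se_set(&uc_req, value, false);
	#endif

		return uc_req;
	}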

Fixes: 0c18f2ecfcc2 ("sched/uclamp: Fix wrong implementation of cpu.uclamp.min")
Reported-by: Xuewen Yan <xuewen.yan94@gmail.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210617165155.3774110-1-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
2024-12-16 14:46:43 -03:00
Qais Yousef
6756e1b42a sched/uclamp: Fix wrong implementation of cpu.uclamp.min
[ Upstream commit 0c18f2ecfcc274a4bcc1d122f79ebd4001c3b445 ]

cpu.uclamp.min is a protection as described in cgroup-v2 Resource
Distribution Model

	Documentation/admin-guide/cgroup-v2.rst

which means we try our best to preserve the minimum performance point of
tasks in this group. See full description of cpu.uclamp.min in the
cgroup-v2.rst.

But the current implementation makes it a limit, which is not what was
intended.

For example:

	tg->cpu.uclamp.min = 20%

	p0->uclamp[UCLAMP_MIN] = 0
	p1->uclamp[UCLAMP_MIN] = 50%

	Previous Behavior (limit):

		p0->effective_uclamp = 0
		p1->effective_uclamp = 20%

	New Behavior (Protection):

		p0->effective_uclamp = 20%
		p1->effective_uclamp = 50%

Which is in line with how protections should work.

With this change the cgroup and per-task behaviors are the same, as
expected.

Additionally, we remove the confusing relationship between cgroup and
!user_defined flag.

We don't want, for example, RT tasks that are boosted to max by default to
change their boost value when they attach to a cgroup. If a cgroup wants
to limit the max performance point of tasks attached to it, then
cpu.uclamp.max must be set accordingly.

Or if they want to set different boost value based on cgroup, then
sysctl_sched_util_clamp_min_rt_default must be used to NOT boost to max
and set the right cpu.uclamp.min for each group to let the RT tasks
obtain the desired boost value when attached to that group.

As it stands the dependency on !user_defined flag adds an extra layer of
complexity that is not required now cpu.uclamp.min behaves properly as
a protection.

The propagation model of effective cpu.uclamp.min in child cgroups as
implemented by cpu_util_update_eff() is still correct. The parent
protection sets an upper limit of what the child cgroups will
effectively get.

Fixes: 3eac870a3247 (sched/uclamp: Use TG's clamps to restrict TASK's clamps)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210510145032.1934078-2-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
2024-12-16 14:46:43 -03:00
Quentin Perret
f871554be4 FROMLIST: sched: Fix out-of-bound access in uclamp
Util-clamp places tasks in different buckets based on their clamp values
for performance reasons. However, the size of buckets is currently
computed using a rounding division, which can lead to an off-by-one
error in some configurations.

For instance, with 20 buckets, the bucket size will be 1024/20=51. A
task with a clamp of 1024 will be mapped to bucket id 1024/51=20. Sadly,
correct indexes are in range [0,19], hence leading to an out of bound
memory access.

Clamp the bucket id to fix the issue.
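
The fix clamps the computed index, as in the upstream uclamp_bucket_id():

	static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
	{
		return min_t(unsigned int,
			     clamp_value / UCLAMP_BUCKET_DELTA,
			     UCLAMP_BUCKETS - 1);
	}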

Bug: 186415778
Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Suggested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210430151412.160913-1-qperret@google.com
Change-Id: Ibc28662de5554f80f97533b60e747f8a6e871c56
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
2024-12-16 14:46:43 -03:00