29367 Commits

Author SHA1 Message Date
Martin KaFai Lau
937e41cb2e bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=y
If CONFIG_REFCOUNT_FULL=y, refcount_inc() WARNs when the refcount is 0.
When creating a new btf, the initial btf->refcnt is 0, which
triggered the following:

[   34.855452] refcount_t: increment on 0; use-after-free.
[   34.856252] WARNING: CPU: 6 PID: 1857 at lib/refcount.c:153 refcount_inc+0x26/0x30
....
[   34.868809] Call Trace:
[   34.869168]  btf_new_fd+0x1af6/0x24d0
[   34.869645]  ? btf_type_seq_show+0x200/0x200
[   34.870212]  ? lock_acquire+0x3b0/0x3b0
[   34.870726]  ? security_capable+0x54/0x90
[   34.871247]  __x64_sys_bpf+0x1b2/0x310
[   34.871761]  ? __ia32_sys_bpf+0x310/0x310
[   34.872285]  ? bad_area_access_error+0x310/0x310
[   34.872894]  do_syscall_64+0x95/0x3f0

This patch uses refcount_set() instead.

Reported-by: Yonghong Song <yhs@fb.com>
Tested-by: Yonghong Song <yhs@fb.com>
Change-Id: I61918a383eb08dd8fb9320dadd1f0e616f68af1b
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Martin KaFai Lau
9aa08f2762 BACKPORT: bpf: btf: Clean up btf.h in uapi
This patch cleans up btf.h in uapi:
1) Rename "name" to "name_off" to better reflect that it is an offset
   into the string section rather than a char array.
2) Remove the unused values BTF_FLAGS_COMPR and BTF_MAGIC_SWAP

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Change-Id: I480706c7a099a26b52b050f1e51e3408eabfed20
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Martin KaFai Lau
d0eb16b6be bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd
This patch adds BPF_OBJ_GET_INFO_BY_FD support to BTF fd.
The original BTF data, which was used to create the BTF fd during
the earlier BPF_BTF_LOAD call, will be returned.

Userspace is expected to allocate a buffer, point info.info at it,
and set info.info_len to the buffer size before calling
BPF_OBJ_GET_INFO_BY_FD.

The original BTF data is copied to the userspace buffer (info.info).
Only up to the user-specified info.info_len bytes will be copied.

The kernel writes the original BTF data size back to info.info_len.
Userspace needs to check whether that is bigger than its allocated
buffer size; if it is, it should realloc the buffer to the
kernel-returned info.info_len and call BPF_OBJ_GET_INFO_BY_FD again.

Change-Id: Ibbd2966eb0e59b1ab9cbc56f92a0512cb804483a
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Martin KaFai Lau
274acc4648 BACKPORT: bpf: btf: Add BPF_BTF_LOAD command
This patch adds a BPF_BTF_LOAD command which
1) loads and verifies the BTF (implemented in earlier patches)
2) returns a BTF fd to userspace.  In the next patch, the
   BTF fd can be specified during BPF_MAP_CREATE.

It is currently limited to CAP_SYS_ADMIN.

Change-Id: Id826446740838918cc317c75d0ccb6038842e933
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Martin KaFai Lau
486c2a8be6 bpf: btf: Add pretty print capability for data with BTF type info
This patch adds pretty print capability for data with BTF type info.
The current usage is to allow pretty print for a BPF map.

The next few patches will allow a read() on a pinned map with BTF
type info for its key and value.

This patch uses the seq_printf() infra.

Change-Id: I4c459c09688af606883e504bedd3794b616da01d
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Martin KaFai Lau
4f2cd833e9 bpf: btf: Check members of struct/union
This patch checks a few things about a struct's members:

1) It has a valid size (e.g. a "const void" is invalid)
2) A member's size (+ its member's offset) does not exceed
   the containing struct's size.
3) The member's offset satisfies the alignment requirement

The above can only be done after the needs_resolve member's type
is resolved.  Hence, the above is done together in
btf_struct_resolve().

Each possible member's type (e.g. int, enum, modifier...) implements
the check_member() ops which will be called from btf_struct_resolve().

Change-Id: I24f9e39dd689125a8fcd41895982a2c92035e5fb
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Martin KaFai Lau
a20afbc3d5 bpf: btf: Validate type reference
After collecting all btf_type objects in the first pass (an earlier
patch), the second pass (this patch) can validate the reference types
(e.g. the referred-to type exists and a type does not refer to itself).

While checking the reference types, it also gathers other information
(e.g. the size of an array).  This info will be useful for checking
struct members in a later patch, and also for doing pretty printing
later.

Change-Id: I86bf4691edd7b0114b8148cd77a77d1896fb2091
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Tim Zimmermann
d63ada96c2 bpf: Update logging functions to work with BTF
* Based on 430e68d10b,
  77d2e05abd
  and a2a7d57010

Change-Id: I27e2c804726078646ca9beda31cbae2a745dfd47
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Martin KaFai Lau
b1f30ebe50 BACKPORT: bpf: btf: Introduce BPF Type Format (BTF)
This patch introduces BPF type Format (BTF).

BTF (BPF Type Format) is the metadata format that describes the data
types of a BPF program/map.  Hence, it focuses primarily on the
C programming language, which modern BPF mostly uses.  The first use
case is to provide a generic pretty print capability for a BPF map.

BTF has its roots in CTF (Compact C-Type Format).  To simplify the
handling of BTF data, BTF removes the differences between small and
big types/struct-members.  Hence, BTF consistently uses u32 instead
of supporting both "one u16" and "two u32 (+padding)" when describing
types and struct members.

It also raises the number of types (and functions) limit
from 0x7fff to 0x7fffffff.

Due to the above changes, the format is not compatible with CTF.
Hence, BTF starts with a new BTF_MAGIC and version number.

This patch does the first verification pass to the BTF.  The first
pass checks:
1. metadata size (e.g. it does not go beyond the BTF data's total size)
2. name_offset is valid
3. Each BTF_KIND (e.g. int, enum, struct....) does its
   own check of its meta-data.

Some other checks, like checking that a struct's member refers to a
valid type, can only be done in the second pass.  The second
verification pass will be implemented in the next patch.

Change-Id: Ic3a57709c16c02059438f5b1b85ccc94466f2db3
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Martin KaFai Lau
b191beed27 BACKPORT: bpf: Rename bpf_verifer_log
bpf_verifer_log =>
bpf_verifier_log

Change-Id: If356de35e8dff3c7d7733cf70f5cbfd1db615d30
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
Jakub Kicinski
3648bf0a7b BACKPORT: bpf: encapsulate verifier log state into a structure
Put the loose log_* variables into a structure.  This will make
it simpler to remove the global verifier state in following patches.

Change-Id: I8a84b6acfd50596f0d80339ea01db220070cbdc8
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
2025-01-13 14:37:38 -03:00
LibXZR
0e27242ef3 Revert "genirq/irqdomain: Don't try to free an interrupt that has no mapping"
* This revert is needed: an interrupt does not have a permanent virq. The virq of an IRQ may change
after other IRQs are freed.

* This is causing problems for msm_msi_irq_domain to free all the irqs that it allocated.

* Fix unrecoverable modem crash.

This reverts commit d1874e36cb3d00ba53f9e7bc3ca58d3058659cee.

Change-Id: I831083118d6a7c12d43f2fa2ef01bdd27159dac8
Signed-off-by: LibXZR <xzr467706992@163.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-24 00:01:56 -03:00
Richard Raya
30f8174d8a msm-4.14: Bump boosts input timeouts
Change-Id: I55077d5fecbf231539ad94ac058226cbc39d1479
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:59:46 -03:00
Sultan Alsawaf
5626e9eb9c cgroup: Boost whenever a zygote-forked process becomes a top app
Boost to the max for 1000 ms whenever the top app changes, which
improves app launch speeds and addresses jitter when switching between
apps. A check to make sure that the top-app's parent is zygote ensures
that a user-facing app is indeed what's added to the top app task group,
since app processes are forked from zygote.

Change-Id: I49563d8baef7cefa195c919acf97343fa424c3be
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:59:42 -03:00
Dietmar Eggemann
f49f0241d2 sched/features: Disable LB_BIAS by default
LB_BIAS allows the adjustment on how conservative load should be
balanced.

The rq->cpu_load[idx] array is used for this functionality. It contains
decayed-average weighted CPU load values over different intervals
(idx = 1..4); idx = 0 is the current weighted CPU load itself.

The values are updated during scheduler_tick, before idle balance and at
nohz exit.

There are 5 different types of idx's per sched domain (sd). Each of them
is used to index into the rq->cpu_load[idx] array in a specific scenario
(busy, idle and newidle for load balancing, forkexec for wake-up
slow-path load balancing and wake for affine wakeup based on weight).
Only the sd idx's for busy and idle load balancing are set to 2,3 or 1,2
respectively. All the other sd idx's are set to 0.

Conservative load balancing is achieved for sd idx's >= 1 by using the
min/max (source_load()/target_load()) of the current weighted CPU load
and rq->cpu_load[sd idx - 1] for the busiest(idlest)/local CPU load in
load balancing, or vice versa in the wake-up slow-path load balancing.
There is no conservative balancing for sd idx = 0 since only the
current weighted CPU load is used in that case.

It is very likely that LB_BIAS' influence on load balancing can be
neglected (see test results below). This is further supported by:

(1) Weighted CPU load today is by itself a decayed average value (PELT)
    (cfs_rq->avg->runnable_load_avg) and not the instantaneous load
    (rq->load.weight) it was when LB_BIAS was introduced.

(2) Sd imbalance_pct is used for CPU_NEWLY_IDLE and CPU_NOT_IDLE (relate
    to sd's newidle and busy idx) in find_busiest_group() when comparing
    busiest and local avg load to make load balancing even more
    conservative.

(3) The sd forkexec and newidle idx are always set to 0 so there is no
    adjustment on how conservatively load balancing is done here.

(4) Affine wakeup based on weight (wake_affine_weight()) will not be
    impacted since the sd wake idx is always set to 0.

Let's disable LB_BIAS by default for a few kernel releases to make sure
that no workload and no scheduler topology is affected. The benefit of
being able to remove the LB_BIAS dependency from source_load() and
target_load() is that the entire rq->cpu_load[idx] code could be removed
in this case.

It is really hard to say if there is no regression w/o testing this with
a lot of different workloads on a lot of different platforms, especially
NUMA machines.
The following 104 LKP (Linux Kernel Performance) tests were run by the
0-Day guys mostly on multi-socket hosts with a larger number of logical
cpus (88, 192).
The base for the test was commit b3dae109fa89 ("sched/swait: Rename to
exclusive") (tip/sched/core v4.18-rc1).
Only 2 out of the 104 tests had a significant change in one of the
metrics (fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance +7%
files_per_sec, unixbench/300s-100%-syscall-performance -11% score).
Tests which showed a change in one of the metrics are marked with a '*'
and this change is listed as well.

(a) lkp-bdw-ep3:
      88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 64G

    dd-write/10m-1HDD-cfq-btrfs-100dd-performance
    fsmark/1x-1t-1HDD-xfs-nfsv4-4M-60G-NoSync-performance
  * fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance
      7.50  7%  8.00  ±  6%  fsmark.files_per_sec
    fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-fsyncBeforeClose-performance
    fsmark/1x-1t-1HDD-btrfs-4M-60G-NoSync-performance
    fsmark/1x-1t-1HDD-btrfs-4M-60G-fsyncBeforeClose-performance
    kbuild/300s-50%-vmlinux_prereq-performance
    kbuild/300s-200%-vmlinux_prereq-performance
    kbuild/300s-50%-vmlinux_prereq-performance-1HDD-ext4
    kbuild/300s-200%-vmlinux_prereq-performance-1HDD-ext4

(b) lkp-skl-4sp1:
      192 threads Intel(R) Xeon(R) Platinum 8160 768G

    dbench/100%-performance
    ebizzy/200%-100x-10s-performance
    hackbench/1600%-process-pipe-performance
    iperf/300s-cs-localhost-tcp-performance
    iperf/300s-cs-localhost-udp-performance
    perf-bench-numa-mem/2t-300M-performance
    perf-bench-sched-pipe/10000000ops-process-performance
    perf-bench-sched-pipe/10000000ops-threads-performance
    schbench/2-16-300-30000-30000-performance
    tbench/100%-cs-localhost-performance

(c) lkp-bdw-ep6:
      88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 128G

    stress-ng/100%-60s-pipe-performance
    unixbench/300s-1-whetstone-double-performance
    unixbench/300s-1-shell1-performance
    unixbench/300s-1-shell8-performance
    unixbench/300s-1-pipe-performance
  * unixbench/300s-1-context1-performance
      312  315  unixbench.score
    unixbench/300s-1-spawn-performance
    unixbench/300s-1-syscall-performance
    unixbench/300s-1-dhry2reg-performance
    unixbench/300s-1-fstime-performance
    unixbench/300s-1-fsbuffer-performance
    unixbench/300s-1-fsdisk-performance
    unixbench/300s-100%-whetstone-double-performance
    unixbench/300s-100%-shell1-performance
    unixbench/300s-100%-shell8-performance
    unixbench/300s-100%-pipe-performance
    unixbench/300s-100%-context1-performance
    unixbench/300s-100%-spawn-performance
  * unixbench/300s-100%-syscall-performance
      3571  ±  3%  -11%  3183  ±  4%  unixbench.score
    unixbench/300s-100%-dhry2reg-performance
    unixbench/300s-100%-fstime-performance
    unixbench/300s-100%-fsbuffer-performance
    unixbench/300s-100%-fsdisk-performance
    unixbench/300s-1-execl-performance
    unixbench/300s-100%-execl-performance
  * will-it-scale/brk1-performance
      365004  360387  will-it-scale.per_thread_ops
  * will-it-scale/dup1-performance
      432401  437596  will-it-scale.per_thread_ops
    will-it-scale/eventfd1-performance
    will-it-scale/futex1-performance
    will-it-scale/futex2-performance
    will-it-scale/futex3-performance
    will-it-scale/futex4-performance
    will-it-scale/getppid1-performance
    will-it-scale/lock1-performance
    will-it-scale/lseek1-performance
    will-it-scale/lseek2-performance
  * will-it-scale/malloc1-performance
      47025  45817  will-it-scale.per_thread_ops
      77499  76529  will-it-scale.per_process_ops
    will-it-scale/malloc2-performance
  * will-it-scale/mmap1-performance
      123399  120815  will-it-scale.per_thread_ops
      152219  149833  will-it-scale.per_process_ops
  * will-it-scale/mmap2-performance
      107327  104714  will-it-scale.per_thread_ops
      136405  133765  will-it-scale.per_process_ops
    will-it-scale/open1-performance
  * will-it-scale/open2-performance
      171570  168805  will-it-scale.per_thread_ops
      532644  526202  will-it-scale.per_process_ops
    will-it-scale/page_fault1-performance
    will-it-scale/page_fault2-performance
    will-it-scale/page_fault3-performance
    will-it-scale/pipe1-performance
    will-it-scale/poll1-performance
  * will-it-scale/poll2-performance
      176134  172848  will-it-scale.per_thread_ops
      281361  275053  will-it-scale.per_process_ops
    will-it-scale/posix_semaphore1-performance
    will-it-scale/pread1-performance
    will-it-scale/pread2-performance
    will-it-scale/pread3-performance
    will-it-scale/pthread_mutex1-performance
    will-it-scale/pthread_mutex2-performance
    will-it-scale/pwrite1-performance
    will-it-scale/pwrite2-performance
    will-it-scale/pwrite3-performance
  * will-it-scale/read1-performance
      1190563  1174833  will-it-scale.per_thread_ops
  * will-it-scale/read2-performance
      1105369  1080427  will-it-scale.per_thread_ops
    will-it-scale/readseek1-performance
  * will-it-scale/readseek2-performance
      261818  259040  will-it-scale.per_thread_ops
    will-it-scale/readseek3-performance
  * will-it-scale/sched_yield-performance
      2408059  2382034  will-it-scale.per_thread_ops
    will-it-scale/signal1-performance
    will-it-scale/unix1-performance
    will-it-scale/unlink1-performance
    will-it-scale/unlink2-performance
  * will-it-scale/write1-performance
      976701  961588  will-it-scale.per_thread_ops
  * will-it-scale/writeseek1-performance
      831898  822448  will-it-scale.per_thread_ops
  * will-it-scale/writeseek2-performance
      228248  225065  will-it-scale.per_thread_ops
  * will-it-scale/writeseek3-performance
      226670  224058  will-it-scale.per_thread_ops
    will-it-scale/context_switch1-performance
    aim7/performance-fork_test-2000
  * aim7/performance-brk_test-3000
      74869  76676  aim7.jobs-per-min
    aim7/performance-disk_cp-3000
    aim7/performance-disk_rd-3000
    aim7/performance-sieve-3000
    aim7/performance-page_test-3000
    aim7/performance-creat-clo-3000
    aim7/performance-mem_rtns_1-8000
    aim7/performance-disk_wrt-8000
    aim7/performance-pipe_cpy-8000
    aim7/performance-ram_copy-8000

(d) lkp-avoton3:
      8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz 16G

    netperf/ipv4-900s-200%-cs-localhost-TCP_STREAM-performance

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Li Zhijian <zhijianx.li@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180809135753.21077-1-dietmar.eggemann@arm.com
Change-Id: Ia84e33416b394990da2fd0f2d21bd499ce76a65d
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:58:50 -03:00
Kazuki H
b4f4f18aa7 irq: Don't allow IRQ affinities to be set from userspace
Change-Id: I8278aec4280103cdb092f197ded20831d9f57fd4
Signed-off-by: Kazuki H <kazukih0205@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:54:37 -03:00
Sultan Alsawaf
70f60e8bfe sbalance: Fix severe misattribution of movable IRQs to the last active CPU
Due to a horrible omission in the big IRQ list traversal, all movable IRQs
are misattributed to the last active CPU in the system since that's what
`bd` is last set to in the loop prior. This horribly breaks SBalance's
notion of balance, producing nonsensical balancing decisions and failing to
balance IRQs even when they are heavily imbalanced.

Fix the massive breakage by adding the missing line of code to set `bd` to
the CPU an IRQ actually belongs to, so that it's added to the correct CPU's
movable IRQs list.

Change-Id: Ide222d361152b1cd03c1894c995cab42980d16e7
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:55 -03:00
Sultan Alsawaf
9d49f44e04 sbalance: Don't race with CPU hotplug
When a CPU is hotplugged, cpu_active_mask is modified without any RCU
synchronization. As a result, the only synchronization for cpu_active_mask
provided by the hotplug code is the CPU hotplug lock.

Furthermore, since IRQ balance is majorly disrupted during CPU hotplug due
to mass IRQ migration off a dying CPU, SBalance just shouldn't operate
while a CPU hotplug is in progress.

Take the CPU hotplug lock in balance_irqs() to prevent races and mishaps
during CPU hotplugs.

Change-Id: If377de7b78e3ae68a20bc95bdb84650330cfc330
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:54 -03:00
Sultan Alsawaf
8d891a9e71 sbalance: Convert various IRQ counter types to unsigned ints
These counted values are actually unsigned ints, not unsigned longs.
Convert them to unsigned ints since there's no reason for them to be longs.

Change-Id: Ia5c4a3162a072b4fa3225afdcd969db95b60c802
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:53 -03:00
Sultan Alsawaf
9fe60c2f32 sbalance: Fix systemic issues caused by flawed IRQ statistics
SBalance's per-CPU statistic of new interrupts is inherently flawed in
that it cannot track IRQ migration occurring between balance periods.
As a result, SBalance can observe a skewed number of new interrupts for
a CPU, which hurts its balancing decisions.

Furthermore, SBalance incorrectly assumes that IRQs are affined where
SBalance last placed them, which breaks SBalance entirely when the
assumption doesn't hold true.

As it turns out, it can be quite common to change an IRQ's affinity and
observe a successful return value despite the IRQ not actually moving. At
the very least this is observed on ARM's GICv3, and results in SBalance
never moving such an IRQ ever again because SBalance always thinks it has
zero new interrupts.

Since we can't trust irqchip drivers or hardware, gather IRQ statistics
directly in order to get the true number of new interrupts for each CPU and
the actual affinity of each IRQ based on the last CPU it fired upon.

Change-Id: Ic846adac244a0873c4502987e0904b552ab31f22
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:52 -03:00
Sultan Alsawaf
9cddb6a1a0 sbalance: Use non-atomic cpumask_clear_cpu() variant
The atomic cpumask_clear_cpu() isn't needed. Use __cpumask_clear_cpu()
instead as a micro-optimization, and for clarity.

Change-Id: I17d168814c4b96557c8a9f986c2c5be8e18be26b
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:51 -03:00
Sultan Alsawaf
a4e8c0a0a3 sbalance: Use a deferrable timer to avoid waking idle CPUs
SBalance is designed to poll to balance IRQs, but it shouldn't kick CPUs
out of idle to do so because idle CPUs clearly aren't processing
interrupts.

Open code a freezable wait that uses a deferrable timer in order to prevent
SBalance from waking up idle CPUs when there is little interrupt traffic.

Change-Id: I5f796a4590801c9a5935ca7ea8c966ca281620c7
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:51 -03:00
Sultan Alsawaf
cc5e8988ca sbalance: Allow IRQs to be moved off of excluded CPUs
Excluded CPUs are excluded from IRQ balancing with the intention that those
CPUs shouldn't really be processing interrupts, and thus shouldn't have
IRQs moved to them. However, SBalance completely ignores excluded CPUs,
which can cause them to end up with a disproportionate amount of interrupt
traffic that SBalance won't spread out. An easy example of this is when
CPU0 is an excluded CPU, since CPU0 ends up with all interrupts affined to
it by default on arm64.

Allow SBalance to move IRQs off of excluded CPUs so that they cannot slip
under the radar and pile up on an excluded CPU, like when CPU0 is excluded.

Change-Id: I392a058ea8cf7672bfea39ff9525bf6b7c52a062
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:49 -03:00
Sultan Alsawaf
fc6b2568cb kernel: Introduce SBalance IRQ balancer
This is a simple IRQ balancer that polls every X number of milliseconds and
moves IRQs from the most interrupt-heavy CPU to the least interrupt-heavy
CPUs until the heaviest CPU is no longer the heaviest. IRQs are only moved
from one source CPU to any number of destination CPUs per balance run.
Balancing is skipped if the gap between the most interrupt-heavy CPU and
the least interrupt-heavy CPU is below the configured threshold of
interrupts.

The heaviest IRQs are targeted for migration in order to reduce the number
of IRQs to migrate. If moving an IRQ would reduce overall balance, then it
won't be migrated.

The most interrupt-heavy CPU is calculated by scaling the number of new
interrupts on that CPU to the CPU's current capacity. This way, interrupt
heaviness takes into account factors such as thermal pressure and time
spent processing interrupts rather than just the sheer number of them. This
also makes SBalance aware of CPU asymmetry, where different CPUs can have
different performance capacities and be proportionally balanced.

Change-Id: Ie40c87ca357814b9207726f67e2530fffa7dd198
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:48 -03:00
Sultan Alsawaf
327e77e79a kernel: Warn when an IRQ's affinity notifier gets overwritten
An IRQ affinity notifier getting overwritten can point to some annoying
issues which need to be resolved, like multiple pm_qos objects being
registered to the same IRQ. Print out a warning when this happens to aid
debugging.

Change-Id: I087a6ea7472fa7ba45bdb02efeae25af5c664950
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:47 -03:00
Sultan Alsawaf
cde9a6dfc7 kernel: Only set one CPU in the default IRQ affinity mask
On ARM, IRQs are executed on the first CPU inside the affinity mask, so
setting an affinity mask with more than one CPU set is deceptive and
causes issues with pm_qos. To fix this, only set the CPU0 bit inside the
affinity mask, since that's where IRQs will run by default.

This is a follow-up to "kernel: Don't allow IRQ affinity masks to have
more than one CPU".

Change-Id: Ib6ef803ab866686c30e1aa1d06f98692ee39ed6c
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:46 -03:00
Sultan Alsawaf
547eaaa17b kernel: Don't allow IRQ affinity masks to have more than one CPU
Even with an affinity mask that has multiple CPUs set, IRQs always run
on the first CPU in their affinity mask. Drivers that register an IRQ
affinity notifier (such as pm_qos) will therefore have an incorrect
assumption of where an IRQ is affined.

Fix the IRQ affinity mask deception by forcing it to only contain one
set CPU.

Change-Id: I212ff578f731ee78fabb8f63e49ef0b96c286521
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:45 -03:00
Richard Raya
1b396d869a msm-4.14: Drop perf-critical API
Change-Id: I17edd46742608a3ed8349a60b71716c944d4a0f4
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:44 -03:00
Richard Raya
602aa3bba8 msm-4.14: Drop sched_migrate_to_cpumask
Change-Id: I8b03f4b7f90c6486d42ef767ba0b52a9567830a2
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:43 -03:00
Samuel Pascua
d6e561f94c sched: Backport IRQ utilization tracking
Change-Id: Id432ab10f7acb00ad2d1bb36400504584629a2b6
Signed-off-by: Samuel Pascua <pascua.samuel.14@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:39 -03:00
Alexander Winkowski
90dd46c816 sched/cass: Skip reserved cpus
Change-Id: I77e5663fa00afba2211b52997e007a0f2e6364e2
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:38 -03:00
Alexander Winkowski
7e8a73c333 sched/cass: No thermal throttling for us
Change-Id: If892e9c33656b7f829d2adb3d7228ac12313dd2c
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:39 -03:00
Richard Raya
c631cde2a8 sched/cass: Fix arch_scale_cpu_capacity params
Change-Id: I8e55400ad416882a735a4dc72bcaeaaa23f11019
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:36 -03:00
Richard Raya
aa86f45366 sched/cass: Fix hard_util accounting
Change-Id: I1c8147a04003c20eb9046d490e90ba98cf376115
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:53:33 -03:00
Sultan Alsawaf
6b6b113c0e sched/cass: Don't fight the idle load balancer
The idle load balancer (ILB) is kicked whenever a task is misfit, meaning
that the task doesn't fit on its CPU (i.e., fits_capacity() == false).

Since CASS makes no attempt to place tasks such that they'll fit on the CPU
they're placed upon, the ILB works harder to correct this and rebalances
misfit tasks onto a CPU with sufficient capacity.

By fighting the ILB like this, CASS degrades both energy efficiency and
performance.

Play nicely with the ILB by trying to place tasks onto CPUs that fit.

Change-Id: I317a3f19b83400d4b55d35d4a51e88268d0399c1
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
3bf932216b sched/cass: Honor uclamp even when no CPUs can satisfy the requirement
When all CPUs available to a uclamp'd process are thermal throttled, it is
possible for them to be throttled below the uclamp minimum requirement. In
this case, CASS only considers uclamp when it compares relative utilization
and nowhere else; i.e., CASS essentially ignores the most important aspect
of uclamp.

Fix it so that CASS tries to honor uclamp even when no CPUs available to a
uclamp'd process are capable of fully meeting the uclamp minimum.

Change-Id: I93885cd7a94502c58a9e96eb43bb00ef01d15988
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
a421891669 sched/cass: Fix disproportionate load spreading when CPUs are throttled
When CPUs are thermal throttled, CASS tries to spread load such that their
resulting P-state is scaled relatively to their _throttled_ maximum
capacity, rather than their original capacity.

As a result, throttled CPUs are unfairly under-utilized, causing other CPUs
to receive the extra burden and thus run at a disproportionately higher
P-state relative to the throttled CPUs. This not only hurts performance,
but also greatly diminishes energy efficiency since it breaks CASS's basic
load balancing principle.

To fix this, some convoluted logic is required in order to make CASS aware
of a CPU's throttled and non-throttled capacity. The non-throttled capacity
is used for the fundamental relative utilization comparison, while the
throttled capacity is used in conjunction to ensure a throttled CPU isn't
accidentally overloaded as a result.

Change-Id: I2cdabc4aa88e724252886c15040eabf40ab9150e
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
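A minimal sketch of the dual-capacity idea, under assumed names rather than the real CASS internals: CPUs are scored by utilization relative to their *original* capacity, but a CPU is rejected outright if placing the task there would exceed its *throttled* capacity.

```python
def cass_score(util, orig_cap, throttled_cap, extra_util):
    """Return a relative-utilization score against the original
    capacity, or None if adding the task would overload the CPU's
    throttled capacity."""
    new_util = util + extra_util
    if new_util > throttled_cap:
        return None          # would overload the throttled CPU
    return new_util / orig_cap

def best_cpu(cpus, extra_util):
    """cpus: list of (util, orig_cap, throttled_cap) tuples.
    Pick the CPU with the lowest relative utilization."""
    scored = [(cass_score(u, o, t, extra_util), i)
              for i, (u, o, t) in enumerate(cpus)]
    valid = [(s, i) for s, i in scored if s is not None]
    return min(valid)[1]
```

Scoring against the original capacity keeps load spread proportionally to each CPU's true capability, while the throttled-capacity check prevents piling work onto a CPU that physically cannot absorb it right now.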
Sultan Alsawaf
7952b4c0e6 sched/cass: Eliminate redundant calls to smp_processor_id()
Calling smp_processor_id() can be expensive depending on how an arch
implements it, so avoid calling it more than necessary.

Use the raw variant too since this code is always guaranteed to run with
preemption disabled.

Change-Id: If96aeeb0aeb9f0c1cb2ebf9dcf31ced04ebe135c
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
3c1e39b2bf sched/cass: Only treat sync waker CPU as idle if there's one task running
For synchronized wakes, the waker's CPU should only be treated as idle if
there aren't any other running tasks on that CPU. This is because, for
synchronized wakes, it is assumed that the waker will immediately go to
sleep after waking the wakee; therefore, if there aren't any other tasks
running on the waker's CPU, it'll go idle and should be treated as such to
improve task placement.

This optimization only applies when there aren't any other tasks running on
the waker's CPU, however.

Fix it by ensuring that there's only the waker running on its CPU.

Change-Id: I03cfd16d423cc920c103b8734b6b8a9089a9e59c
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
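The check described above reduces to one condition; this is an illustrative sketch, not the literal kernel code:

```python
def waker_cpu_effectively_idle(sync_wake, nr_running):
    """For a synchronized wake, the waker's CPU may be treated as
    idle only when the waker is the sole running task there: it is
    expected to sleep immediately after waking the wakee."""
    return sync_wake and nr_running == 1
```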
Sultan Alsawaf
6a8fed2d40 sched/cass: Fix suboptimal task placement when uclamp is used
Uclamp is designed to specify a process' CPU performance requirement scaled
as a CPU capacity value. It simply denotes the process' requirement for the
CPU's raw performance and thus P-state.

However, CASS currently treats uclamp as a CPU load value, producing
wildly suboptimal CPU placement decisions for tasks which use uclamp.
This hurts
performance and, even worse, massively hurts energy efficiency, with CASS
sometimes yielding power consumption that is a few times higher than EAS.

Since uclamp inherently throws a wrench into CASS's goal of keeping
relative P-states as low as possible across all CPUs, making it cooperate
with CASS requires a multipronged approach.

Make the following three changes to fix the uclamp task placement issue:
  1. Treat uclamp as a CPU performance value rather than a CPU load value.
  2. Clamp a CPU's utilization to the task's uclamp floor in order to keep
     relative P-states as low as possible across all CPUs.
  3. Consider preferring a non-idle CPU for uclamped tasks to avoid pushing
     up the P-state of more than one CPU when there are multiple concurrent
     uclamped tasks.

This fixes CASS's massive energy efficiency and performance issues when
uclamp is used.

Change-Id: Ib274ceecfbbe9c2eeb1738f97029e1f4cbc68ec0
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
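Change #2 above can be modeled in a few lines (a sketch with hypothetical names, not the actual CASS helper): rather than adding the uclamp value to the CPU's load, the CPU's utilization is clamped between the task's uclamp floor and ceiling, treating uclamp purely as a performance requirement.

```python
def clamped_cpu_util(cpu_util, task_uclamp_min, task_uclamp_max):
    """Treat uclamp as a performance floor/ceiling on the CPU's
    utilization rather than as extra load added on top of it."""
    return max(task_uclamp_min, min(cpu_util, task_uclamp_max))
```

A lightly loaded CPU still reports at least the task's floor (so its P-state request is honored), while a busy CPU is not inflated beyond the ceiling.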
Sultan Alsawaf
226fafc0b1 sched/cass: Perform runqueue selection for RT tasks too
RT tasks aren't placed on CPUs in a load-balanced manner, much less an
energy efficient one. On systems which contain many RT tasks and/or IRQ
threads, energy efficiency and throughput are diminished significantly by
the default RT runqueue selection scheme which targets minimal latency.

In practice, performance is actually improved by spreading RT tasks fairly,
despite the small latency impact. Additionally, energy efficiency is
significantly improved since the placement of all tasks benefits from
energy-efficient runqueue selection, rather than just CFS tasks.

Perform runqueue selection for RT tasks in CASS to significantly improve
energy efficiency and overall performance.

Change-Id: Ie551296e1034baa2dfc2bb7f0191ca95f5abc639
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
f6d6472722 sched/cass: Clean up local variable scope in cass_best_cpu()
Move `curr` and `idle_state` to within the loop's scope for better
readability. Also, leave a comment about `curr->cpu` to make it clear that
`curr->cpu` must be initialized within the loop in order for `best->cpu` to
be valid.

Change-Id: I1244ac06d62c172f46dbf337e7bb95758329a188
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
0f790875be sched/cass: Fix CPU selection when no candidate CPUs are idle
When no candidate CPUs are idle, CASS would keep `cidx` unchanged, and thus
`best == curr` would always be true. As a result, since the empty candidate
slot never changes, the current candidate `curr` always overwrites the best
candidate `best`. This causes the last valid CPU to always be selected by
CASS when no CPUs are idle (i.e., under heavy load).

Fix it by ensuring that the CPU loop in cass_best_cpu() flips the free
candidate index after the first candidate CPU is evaluated.

Change-Id: Id1e371c0fe6a2e6321f1c9f68a47e4a26c9a0cba
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
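The two-slot candidate scheme and the fix can be illustrated with a toy model (the slot/index names are assumptions about the shape of cass_best_cpu(), not a copy of it): `best` and `curr` occupy alternating slots selected by a candidate index, and the index must flip whenever the current candidate becomes the best one, so the next candidate lands in the free slot instead of overwriting the best.

```python
def cass_best_cpu(candidates):
    """candidates: list of (cpu, score); lower score wins.
    Two-slot scheme: curr is written into slots[cidx]; the bug was
    that cidx never flipped, so curr always aliased best and the
    last CPU examined won. The fix: flip cidx whenever curr is
    adopted as best, keeping best and curr in distinct slots."""
    slots = [None, None]
    cidx = 0
    best = None
    for cpu, score in candidates:
        slots[cidx] = (score, cpu)
        curr = slots[cidx]
        if best is None or curr[0] < best[0]:
            best = curr
        # the fix: move curr to the free slot so it cannot
        # silently overwrite best on the next iteration
        if best is curr:
            cidx ^= 1
    return best[1]
```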
Richard Raya
20540b834c sched/cass: Checkout again
Change-Id: Ib7993c14e7d3ffa354be744629593a8646f55efa
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
22ba4c6d3a schedutil: Set default rate limit to 2000 us
This is empirically observed to yield good performance with reduced power
consumption. With "cpufreq: schedutil: Ignore rate limit when scaling up
with FIE present", this only affects frequency reductions when FIE is
present, since there is no rate limit applied when scaling up.

Change-Id: I1bff1f007f06e67b672877107c9685b6fb83647a
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
Sultan Alsawaf
f1c3e01e39 schedutil: Ignore rate limit when scaling up with FIE present
When schedutil disregards a frequency transition due to the transition rate
limit, there is no guaranteed deadline as to when the frequency transition
will actually occur after the rate limit expires. For instance, depending
on how long a CPU spends in a preempt/IRQs disabled context, a rate-limited
frequency transition may be delayed indefinitely, until said CPU reaches
the scheduler again. This also hurts tasks boosted via UCLAMP_MIN.

For frequency transitions _down_, this only poses a theoretical loss of
energy savings since a CPU may remain at a higher frequency than necessary
for an indefinite period beyond the rate limit expiry.

For frequency transitions _up_, however, this poses a significant hit to
performance when a CPU is stuck at an insufficient frequency for an
indefinitely long time. In latency-sensitive and bursty workloads
especially, a missed frequency transition up can result in a significant
performance loss due to a CPU operating at an insufficient frequency for
too long.

When support for the Frequency Invariant Engine (FIE) _isn't_ present, a
rate limit is always required for the scheduler to compute CPU utilization
with some semblance of accuracy: any frequency transition that occurs
before the previous transition latches would result in the scheduler not
knowing the frequency a CPU is actually operating at, thereby trashing the
computed CPU utilization.

However, when FIE support _is_ present, there's no technical requirement to
rate limit all frequency transitions to a cpufreq driver's reported
transition latency. With FIE, the scheduler's CPU utilization tracking is
unaffected by any frequency transitions that occur before the previous
frequency is latched.

Therefore, ignore the frequency transition rate limit when scaling up on
systems where FIE is present. This guarantees that transitions to a higher
frequency cannot be indefinitely delayed, since they simply cannot be
delayed at all.

Change-Id: I0dc5c6c710c10c63b7fc69970db044982de2a2d7
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:43 -03:00
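The resulting policy is a single asymmetric check; this is a simplified model of the rule described above, not the schedutil source:

```python
def should_apply_rate_limit(fie_present, next_freq, cur_freq,
                            elapsed_ns, rate_limit_ns):
    """With FIE, utilization tracking survives back-to-back
    transitions, so only transitions *down* need rate limiting;
    scale-ups go through immediately."""
    if fie_present and next_freq > cur_freq:
        return False                 # never delay a scale-up
    return elapsed_ns < rate_limit_ns
```

Scale-downs (and all transitions without FIE) remain rate limited, preserving the energy-saving and accuracy properties the limit exists for.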
Sultan Alsawaf
82fd18f9bd schedutil: Fix superfluous updates caused by need_freq_update
A redundant frequency update is only truly needed when there is a policy
limits change with a driver that specifies CPUFREQ_NEED_UPDATE_LIMITS.

In spite of that, drivers specifying CPUFREQ_NEED_UPDATE_LIMITS receive a
frequency update _all the time_, not just for a policy limits change,
because need_freq_update is never cleared.

Furthermore, ignore_dl_rate_limit()'s usage of need_freq_update also leads
to a redundant frequency update, regardless of whether or not the driver
specifies CPUFREQ_NEED_UPDATE_LIMITS, when the next chosen frequency is the
same as the current one.

Fix the superfluous updates by only honoring CPUFREQ_NEED_UPDATE_LIMITS
when there's a policy limits change, and clearing need_freq_update when a
requisite redundant update occurs.

This is neatly achieved by moving up the CPUFREQ_NEED_UPDATE_LIMITS test
and instead setting need_freq_update to false in sugov_update_next_freq().

Change-Id: Iedd47851eabe5a12ed3255b84cd0468da2fbbc80
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:42 -03:00
EmanuelCN
86c781853b schedutil: Remove up/down rate limits
To make way for new changes

Change-Id: Ie28ebf8ea187c8c3da79ee896224f6eeb4f513a6
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:42 -03:00
Rafael J. Wysocki
2f1794050a schedutil: Simplify sugov_update_next_freq()
Rearrange a conditional to make it more straightforward.

Change-Id: I1c9d793cac29bc5a2fdc047ac4c01bba5044489e
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:42 -03:00
Viresh Kumar
526e0afda2 schedutil: Don't skip freq update if need_freq_update is set
The cpufreq policy's frequency limits (min/max) can get changed at any
point of time, while schedutil is trying to update the next frequency.
Though the schedutil governor has necessary locking and support in place
to make sure we don't miss any of those updates, there is a corner case
where the governor will find that the CPU is already running at the
desired frequency and so may skip an update.

For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
and is running at 1 GHz currently. Schedutil tries to update the
frequency to 1.2 GHz, during this time the policy limits get changed as
policy->min = 1.4 GHz. As schedutil (and cpufreq core) does clamp the
frequency at various instances, we will eventually set the frequency to
1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq.

Now let's say the policy limits get changed back at this time with
policy->min as 1 GHz. The next time schedutil is invoked by the
scheduler, we will reevaluate the next frequency (because
need_freq_update will get set due to the limits change event) and let's
say we want to set the frequency to 1.2 GHz again. At this point
sugov_update_next_freq() will find the next_freq == current_freq and
will abort the update, while the CPU actually runs at 1.4 GHz.

Until now need_freq_update was used as a flag to indicate that the
policy's frequency limits have changed, and that we should consider the
new limits while reevaluating the next frequency.

This patch fixes the above mentioned issue by extending the purpose of
the need_freq_update flag. If this flag is set now, the schedutil
governor will not try to abort a frequency change even if next_freq ==
current_freq.

As similar behavior is required in the case of
CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
set to false if that flag is set for the driver.

We also don't need to consider the need_freq_update flag in
sugov_update_single() anymore to handle the special case of busy CPU, as
we won't abort a frequency update anymore.

Change-Id: I699f1ce2bddf3ed35e29fc8ec549fa498654965b
Reported-by: zhuguangqing <zhuguangqing@xiaomi.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rearrange code to avoid a branch ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-12-23 00:01:42 -03:00
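The corner case and the fix can be captured in a small state-machine model (a sketch mirroring the described behavior, not the kernel's sugov code): a "same frequency" update is skipped only when no limits change is pending, because after a limits change sg_policy->next_freq may no longer match what the hardware is actually running.

```python
class SugovPolicy:
    """Toy model of the fixed update path."""
    def __init__(self, cur_freq):
        self.cur_freq = cur_freq
        self.next_freq = cur_freq
        self.need_freq_update = False  # set on a limits change

    def update_next_freq(self, next_freq):
        """Return True if a frequency update is issued."""
        if self.next_freq == next_freq and not self.need_freq_update:
            return False               # genuinely redundant
        self.need_freq_update = False
        self.next_freq = next_freq
        self.cur_freq = next_freq      # model: the update latches
        return True
```

In the commit's example, the limits change sets need_freq_update, so a request for 1.2 GHz is not aborted even though next_freq already reads 1.2 GHz while the CPU actually runs at 1.4 GHz.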