msm-4.14

mirror of https://github.com/rd-stuffs/msm-4.14.git synced 2025-02-20 11:45:48 +08:00

Author	SHA1	Message	Date
Richard Raya	0636ec7ac0	Revert "msm-4.14: Drop sched_migrate_to_cpumask" This reverts commit 602aa3bba862bb7ff51bdf2c9303db4b057f5353. Change-Id: I4517bdb857e7e1ab02749596dedcaa8220dc040a Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-31 20:07:33 -03:00
Richard Raya	e72721b923	Revert "msm-4.14: Drop perf-critical API" This reverts commit 1b396d869a6da9fa864d4de8235f2d0afc7164c1. Change-Id: I13b4629e9aefcd23da2e58ef534c1057f81059cd Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-31 20:07:32 -03:00
Richard Raya	0979e5af82	defconfig: Disable CASS Change-Id: I09417d8d3b19dabf4fd9a3a020db56f4b0170115 Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-31 20:07:31 -03:00
Richard Raya	6fac6e2200	defconfig: Bump SBalance polling interval to 10ms Some checks failed Build Kernel / build (push) Has been cancelled Details Change-Id: I7359b8d5b59809712929ec2271ec71602fe669f6 Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-25 13:15:49 -03:00
Richard Raya	74b14355a2	defconfig: Bump boost durations to 100ms Some checks are pending Build Kernel / build (push) Waiting to run Details Change-Id: I15bc0cffa7427c5f3cbbf160cf313c26e071c223 Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-25 03:57:43 -03:00
Juhyung Park	44682225b6	simple_lmk: Lower vmpressure trigger threshold Change-Id: Ifa5949aa44c5f6ceb8001c3105f1b3cf92fbefd5 Signed-off-by: Juhyung Park <qkrwngud825@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-25 03:57:28 -03:00
Richard Raya	0096c8c85e	Revert "defconfig: Disable MGLRU" This reverts commit 4043298f9af526f1703b88c3dfbeae7a16e88425. Change-Id: Idd5ef50ccc82197606c2a1851072d9056308fb19 Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-25 03:57:27 -03:00
Richard Raya	299d88e36b	Revert "defconfig: Bump SLMK timeout to 250ms" This reverts commit 06287f3167d25f3328e90ba5bf86de28761c0c1b. Change-Id: I0fe1113f5369c4c9bfe93418b8c68529baa51b04 Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-25 03:57:26 -03:00
Richard Raya	00faff4c72	Revert "simple_lmk: Sleep after killing victims" This reverts commit 318d4145ba4f21fe23bd96b998c0a170ab5a26b6. Change-Id: I14279c6c33a7da292376204c6824b8178b74fd6a Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-25 03:57:25 -03:00
Davidlohr Bueso	8013fc0dd7	ipc/mqueue: Optimize msg_get() Our msg priorities became an rbtree as of d6629859b36d ("ipc/mqueue: improve performance of send/recv"). However, consuming a msg in msg_get() remains logarithmic (still being better than the case before of course). By applying well known techniques to cache pointers we can have the node with the highest priority in O(1), which is specially nice for the rt cases. Furthermore, some callers can call msg_get() in a loop. A new msg_tree_erase() helper is also added to encapsulate the tree removal and node_cache game. Passes ltp mq testcases. Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Change-Id: I234983728fbc30aba482a6b58b2a70b1c38f3145 Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:58 -03:00
Carlos Llamas	80305660a1	binder: Fix hang of unregistered readers commit 31643d84b8c3d9c846aa0e20bc033e46c68c7e7d upstream. With the introduction of binder_available_for_proc_work_ilocked() in commit 1b77e9dcc3da ("ANDROID: binder: remove proc waitqueue") a binder thread can only "wait_for_proc_work" after its thread->looper has been marked as BINDER_LOOPER_STATE_{ENTERED\|REGISTERED}. This means an unregistered reader risks waiting indefinitely for work since it never gets added to the proc->waiting_threads. If there are no further references to its waitqueue either the task will hang. The same applies to readers using the (e)poll interface. I couldn't find the rationale behind this restriction. So this patch restores the previous behavior of allowing unregistered threads to "wait_for_proc_work". Note that an error message for this scenario, which had previously become unreachable, is now re-enabled. Fixes: 1b77e9dcc3da ("ANDROID: binder: remove proc waitqueue") Cc: stable@vger.kernel.org Cc: Martijn Coenen <maco@google.com> Cc: Arve Hjønnevåg <arve@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20240711201452.2017543-1-cmllamas@google.com Change-Id: I72954fb5fa749c7e0694fd036ed6862cff38cdb8 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:18 -03:00
YueHaibing	2217ef92de	fq_codel: Remove set but not used variables 'prev_ecn_mark' and 'prev_drop_count' Fixes gcc '-Wunused-but-set-variable' warning: net/sched/sch_fq_codel.c: In function fq_codel_dequeue: net/sched/sch_fq_codel.c:288:23: warning: variable prev_ecn_mark set but not used [-Wunused-but-set-variable] net/sched/sch_fq_codel.c:288:6: warning: variable prev_drop_count set but not used [-Wunused-but-set-variable] They are not used since commit 77ddaff218fc ("fq_codel: Kill useless per-flow dropped statistic") Change-Id: I09426b4d4b41b9302e534e41fdcab109ef55c571 Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:18 -03:00
Dave Taht	f32b57998c	fq_codel: Kill useless per-flow dropped statistic It is almost impossible to get anything other than a 0 out of flow->dropped statistic with a tc class dump, as it resets to 0 on every round. It also conflates ecn marks with drops. It would have been useful had it kept a cumulative drop count, but it doesn't. This patch doesn't change the API, it just stops tracking a stat and state that is impossible to measure and nobody uses. Change-Id: Ibac1a0fd6825aa5bf862ec7cf20227de7a939ec9 Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:18 -03:00
Dave Taht	5749c25962	fq_codel: Increase fq_codel count in the bulk dropper In the field fq_codel is often used with a smaller memory or packet limit than the default, and when the bulk dropper is hit, the drop pattern bifircates into one that more slowly increases the codel drop rate and hits the bulk dropper more than it should. The scan through the 1024 queues happens more often than it needs to. This patch increases the codel count in the bulk dropper, but does not change the drop rate there, relying on the next codel round to deliver the next packet at the original drop rate (after that burst of loss), then escalate to a higher signaling rate. Change-Id: I47562b843bb86abed0b502cea62368a1195eeb0f Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:18 -03:00
Jiong Wang	09c7663480	lib/reciprocal_div: Implement the improved algorithm The new added "reciprocal_value_adv" implements the advanced version of the algorithm described in Figure 4.2 of the paper except when "divisor > (1U << 31)" whose ceil(log2(d)) result will be 32 which then requires u128 divide on host. The exception case could be easily handled before calling "reciprocal_value_adv". The advanced version requires more complex calculation to get the reciprocal multiplier and other control variables, but then could reduce the required emulation operations. It makes no sense to use this advanced version for host divide emulation, those extra complexities for calculating multiplier etc could completely waive our saving on emulation operations. However, it makes sense to use it for JIT divide code generation (for example eBPF JIT backends) for which we are willing to trade performance of JITed code with that of host. As shown by the following pseudo code, the required emulation operations could go down from 6 (the basic version) to 3 or 4. To use the result of "reciprocal_value_adv", suppose we want to calculate n/d, the C-style pseudo code will be the following, it could be easily changed to real code generation for other JIT targets. struct reciprocal_value_adv rvalue; u8 pre_shift, exp; // handle exception case. if (d >= (1U << 31)) { result = n >= d; return; } rvalue = reciprocal_value_adv(d, 32) exp = rvalue.exp; if (rvalue.is_wide_m && !(d & 1)) { // floor(log2(d & (2^32 -d))) pre_shift = fls(d & -d) - 1; rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift); } else { pre_shift = 0; } // code generation starts. if (imm == 1U << exp) { result = n >> exp; } else if (rvalue.is_wide_m) { // pre_shift must be zero when reached here. t = (n * rvalue.m) >> 32; result = n - t; result >>= 1; result += t; result >>= rvalue.sh - 1; } else { if (pre_shift) result = n >> pre_shift; result = ((u64)result * rvalue.m) >> 32; result >>= rvalue.sh; } Change-Id: I54385f0df42aa43355d940d20d6818d2fb3197d9 Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:18 -03:00
Florian La Roche	06381819ff	lib/int_sqrt: Fix int_sqrt64() for very large numbers If an input number x for int_sqrt64() has the highest bit set, then fls64(x) is 64. (1UL << 64) is an overflow and breaks the algorithm. Subtracting 1 is a better guess for the initial value of m anyway and that's what also done in int_sqrt() implicitly []. [] Note how int_sqrt() uses __fls() with two underscores, which already returns the proper raw bit number. In contrast, int_sqrt64() used fls64(), and that returns bit numbers illogically starting at 1, because of error handling for the "no bits set" case. Will points out that he bug probably is due to a copy-and-paste error from the regular int_sqrt() case. Change-Id: I5be5be3e03ddbe68cc8025a64698bbb49c57c3a5 Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Florian La Roche <Florian.LaRoche@googlemail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
Crt Mori	f111e963e2	lib/int_sqrt: Add strongly typed 64bit int_sqrt There is no option to perform 64bit integer sqrt on 32bit platform. Added stronger typed int_sqrt64 enables the 64bit calculations to be performed on 32bit platforms. Using same algorithm as int_sqrt() with strong typing provides enough precision also on 32bit platforms, but it sacrifices some performance. In case values are smaller than ULONG_MAX the standard int_sqrt is used for calculation to maximize the performance due to more native calculations. Change-Id: I8b22ef3fc9e63ea74fb1df14115fc374170549c3 Acked-by: Joe Perches <joe@perches.com> Signed-off-by: Crt Mori <cmo@melexis.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
Peter Zijlstra	83c2c2b980	lib/int_sqrt: Adjust comments Our current int_sqrt() is not rough nor any approximation; it calculates the exact value of: floor(sqrt()). Document this. Link: http://lkml.kernel.org/r/20171020164645.001652117@infradead.org Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Anshul Garg <aksgarg1989@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Michael Davidson <md@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Change-Id: Iea660f36312f879010d16028bc21b6bb50905078 Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
Andy Shevchenko	4d011c9c8c	lib/sort: Move swap, cmp and cmp_r function types for wider use The function types for swap, cmp and cmp_r functions are already being in use by modules. Move them to types.h that everybody in kernel will be able to use generic types instead of custom ones. This adds more sense to the comment in bsearch() later on. Link: http://lkml.kernel.org/r/20191007135656.37734-1-andriy.shevchenko@linux.intel.com Change-Id: I4848ccb09bac73774e2b0071eb767d596e4f6f90 Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
Rasmus Villemoes	829cc32160	lib/sort: Implement sort() variant taking context argument Our list_sort() utility has always supported a context argument that is passed through to the comparison routine. Now there's a use case for the similar thing for sort(). This implements sort_r by simply extending the existing sort function in the obvious way. To avoid code duplication, we want to implement sort() in terms of sort_r(). The naive way to do that is static int cmp_wrapper(const void a, const void b, const void ctx) { int (real_cmp)(const void, const void) = ctx; return real_cmp(a, b); } sort(..., cmp) { sort_r(..., cmp_wrapper, cmp) } but this would do two indirect calls for each comparison. Instead, do as is done for the default swap functions - that only adds a cost of a single easily predicted branch to each comparison call. Aside from introducing support for the context argument, this also serves as preparation for patches that will eliminate the indirect comparison calls in common cases. Requested-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Philipp Zabel <p.zabel@pengutronix.de> Change-Id: I3ad240253956f6ec3f41833fc9ddefa5749fbc58 Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
Randy Dunlap	ea7577a782	lib/sort: Fix kernel-doc notation warnings Fix kernel-doc notation in lib/sort.c by using correct function parameter names. lib/sort.c:59: warning: Excess function parameter 'size' description in 'swap_words_32' lib/sort.c:83: warning: Excess function parameter 'size' description in 'swap_words_64' lib/sort.c:110: warning: Excess function parameter 'size' description in 'swap_bytes' Link: http://lkml.kernel.org/r/60e25d3d-68d1-bde2-3b39-e4baa0b14907@infradead.org Fixes: 37d0ec34d111a ("lib/sort: make swap functions more generic") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: George Spelvin <lkml@sdf.org> Cc: Andrew Morton <akpm@linux-foundation.org> Change-Id: I40d3917918ee9a73ac983ecaf4d62abcd924a45f Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
George Spelvin	26e9d7f193	lib/sort: Avoid indirect calls to built-in swap Similar to what's being done in the net code, this takes advantage of the fact that most invocations use only a few common swap functions, and replaces indirect calls to them with (highly predictable) conditional branches. (The downside, of course, is that if you do use a custom swap function, there are a few extra predicted branches on the code path.) This actually shrinks the x86-64 code, because it inlines the various swap functions inside do_swap, eliding function prologues & epilogues. x86-64 code size 767 -> 703 bytes (-64) Link: http://lkml.kernel.org/r/d10c5d4b393a1847f32f5b26f4bbaa2857140e1e.1552704200.git.lkml@sdf.org Signed-off-by: George Spelvin <lkml@sdf.org> Acked-by: Andrey Abramov <st5pub@yandex.ru> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Daniel Wagner <daniel.wagner@siemens.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Don Mullis <don.mullis@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Change-Id: I4f4850f79f2a1596ec4d19780f329cd073c4f11c Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
George Spelvin	c93eb1b0d2	lib/sort: Use more efficient bottom-up heapsort variant This uses fewer comparisons than the previous code (approaching half as many for large random inputs), but produces identical results; it actually performs the exact same series of swap operations. Specifically, it reduces the average number of compares from 2nlog2(n) - 3n + o(n) to nlog2(n) + 0.37n + o(n). This is still 1.63n worse than glibc qsort() which manages nlog2(n) - 1.26n, but at least the leading coefficient is correct. Standard heapsort, when sifting down, performs two comparisons per level: one to find the greater child, and a second to see if the current node should be exchanged with that child. Bottom-up heapsort observes that it's better to postpone the second comparison and search for the leaf where -infinity would be sent to, then search back up for the current node's destination. Since sifting down usually proceeds to the leaf level (that's where half the nodes are), this does O(1) second comparisons rather than log2(n). That saves a lot of (expensive since Spectre) indirect function calls. The one time it's worse than the previous code is if there are large numbers of duplicate keys, when the top-down algorithm is O(n) and bottom-up is O(n log n). For distinct keys, it's provably always better, doing 1.5nlog2(n) + O(n) in the worst case. (The code is not significantly more complex. This patch also merges the heap-building and -extracting sift-down loops, resulting in a net code size savings.) x86-64 code size 885 -> 767 bytes (-118) (I see the checkpatch complaint about "else if (n -= size)". The alternative is significantly uglier.) Link: http://lkml.kernel.org/r/2de8348635a1a421a72620677898c7fd5bd4b19d.1552704200.git.lkml@sdf.org Signed-off-by: George Spelvin <lkml@sdf.org> Acked-by: Andrey Abramov <st5pub@yandex.ru> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Daniel Wagner <daniel.wagner@siemens.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Don Mullis <don.mullis@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Change-Id: I370b088649c56ae9a0d8040c30ed5e13b847cc7c Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
George Spelvin	f00f930f82	lib/sort: Make swap functions more generic Patch series "lib/sort & lib/list_sort: faster and smaller", v2. Because CONFIG_RETPOLINE has made indirect calls much more expensive, I thought I'd try to reduce the number made by the library sort functions. The first three patches apply to lib/sort.c. Patch #1 is a simple optimization. The built-in swap has special cases for aligned 4- and 8-byte objects. But those are almost never used; most calls to sort() work on larger structures, which fall back to the byte-at-a-time loop. This generalizes them to aligned multiples of 4 and 8 bytes. (If nothing else, it saves an awful lot of energy by not thrashing the store buffers as much.) Patch #2 grabs a juicy piece of low-hanging fruit. I agree that nice simple solid heapsort is preferable to more complex algorithms (sorry, Andrey), but it's possible to implement heapsort with far fewer comparisons (50% asymptotically, 25-40% reduction for realistic sizes) than the way it's been done up to now. And with some care, the code ends up smaller, as well. This is the "big win" patch. Patch #3 adds the same sort of indirect call bypass that has been added to the net code of late. The great majority of the callers use the builtin swap functions, so replace the indirect call to sort_func with a (highly preditable) series of if() statements. Rather surprisingly, this decreased code size, as the swap functions were inlined and their prologue & epilogue code eliminated. lib/list_sort.c is a bit trickier, as merge sort is already close to optimal, and we don't want to introduce triumphs of theory over practicality like the Ford-Johnson merge-insertion sort. Patch #4, without changing the algorithm, chops 32% off the code size and removes the part[MAX_LIST_LENGTH+1] pointer array (and the corresponding upper limit on efficiently sortable input size). Patch #5 improves the algorithm. The previous code is already optimal for power-of-two (or slightly smaller) size inputs, but when the input size is just over a power of 2, there's a very unbalanced final merge. There are, in the literature, several algorithms which solve this, but they all depend on the "breadth-first" merge order which was replaced by commit 835cc0c8477f with a more cache-friendly "depth-first" order. Some hard thinking came up with a depth-first algorithm which defers merges as little as possible while avoiding bad merges. This saves 0.2*n compares, averaged over all sizes. The code size increase is minimal (64 bytes on x86-64, reducing the net savings to 26%), but the comments expanded significantly to document the clever algorithm. TESTING NOTES: I have some ugly user-space benchmarking code which I used for testing before moving this code into the kernel. Shout if you want a copy. I'm running this code right now, with CONFIG_TEST_SORT and CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since the last round of minor edits to quell checkpatch. I figure there will be at least one round of comments and final testing. This patch (of 5): Rather than having special-case swap functions for 4- and 8-byte objects, special-case aligned multiples of 4 or 8 bytes. This speeds up most users of sort() by avoiding fallback to the byte copy loop. Despite what ca96ab859ab4 ("lib/sort: Add 64 bit swap function") claims, very few users of sort() sort pointers (or pointer-sized objects); most sort structures containing at least two words. (E.g. drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte struct acpi_fan_fps.) The functions also got renamed to reflect the fact that they support multiple words. In the great tradition of bikeshedding, the names were by far the most contentious issue during review of this patch series. x86-64 code size 872 -> 886 bytes (+14) With feedback from Andy Shevchenko, Rasmus Villemoes and Geert Uytterhoeven. Link: http://lkml.kernel.org/r/f24f932df3a7fa1973c1084154f1cea596bcf341.1552704200.git.lkml@sdf.org Signed-off-by: George Spelvin <lkml@sdf.org> Acked-by: Andrey Abramov <st5pub@yandex.ru> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Daniel Wagner <daniel.wagner@siemens.com> Cc: Don Mullis <don.mullis@gmail.com> Cc: Dave Chinner <dchinner@redhat.com> Change-Id: I9f21e6eb4bcacf83d40cef3637a492b19db501fd Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yousef Algadri <yusufgadrie@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
John Galt	bb664c8510	msm: kgsl: Make mem workqueue freezable Freezing during no interactivity may benefit power Change-Id: I685806f9d39da1f3523ba70590797de926840e18 Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
Sultan Alsawaf	103e03a5c4	drm/msm: Reduce latency while completing non-blocking commits In order to complete a commit, we must first wait for the previous commit to finish up. This is done by sleeping, during which time the CPU can enter a deep idle state and take a while to finish processing the commit after the wait is over. We can alleviate this by optimistically assuming that the kthread this commit worker is running on won't migrate to another CPU partway through. We only do this for the non-blocking case where the commit completion is done in an asynchronous worker because the generic DRM code already does this for atomic ioctls. Change-Id: I55b822211b91f4a31c2bc6e65d7b31989e56aa7d Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 01:13:17 -03:00
Sultan Alsawaf	9be57f3582	qos: Execute notifier callbacks atomically Allowing the pm_qos notifier callbacks to execute without holding pm_qos_lock can cause the callbacks to misbehave, e.g. the cpuidle callback could erroneously send more IPIs than necessary. Fix this by executing the pm_qos callbacks while pm_qos_lock is held. Change-Id: I0f5b0de2b022997a8f7d88755d7b60070b9a091d Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 00:51:45 -03:00
Sultan Alsawaf	874879f16f	pinctrl: msm: Restore some barriers to prevent reordering of I/O writes Although data dependencies and one-way, semi-permeable barriers provided by spin locks satisfy most ordering needs here, it is still possible for some I/O writes to be reordered with respect to one another in a dangerous way. One such example is that the interrupt status bit could be cleared after the interrupt is unmasked when enabling the IRQ, potentially leading to a spurious interrupt if there's an interrupt pending from when the IRQ was disabled. To prevent dangerous I/O write reordering, restore the minimum amount of barriers needed to ensure writes are ordered as intended. Change-Id: I4c44eaa93f39591d5c963dba2b9dcaf33831bdbe Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 00:51:16 -03:00
Sultan Alsawaf	d774137330	pinctrl: msm: Remove explicit barriers from mmio ops where unneeded For the vast majority of mmio operations in this driver, explicit memory barriers aren't needed either because a data dependency between a read and write already exists, or because of the presence of the spin locks which execute a full memory barrier. Removing all the unneeded explicit barriers considerably reduces overhead for pinctrl operations, which in turn benefits things like i2c. Change-Id: I566e9189fa0d392f242395b43aefbb9f324c1900 Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 00:51:11 -03:00
Tashfin Shakeer Rhythm	a9566ccc56	msm-4.14: Make macros no-op using `((void)0)` Do not solely rely on compiler optimizations to get the workaround of having macros do nothing using an empty do-while loop. It's inefficient. Use ((void)0) to which the standard assert macro expands when NDEBUG is defined. No functional change intended. [mcdofrenchfreis]: Implement this patch to tree using the command: git grep -l "do {} while (0)" \| xargs sed -i "s/do {} while (0)/((void)0)/g" Change-Id: I9615c62c46670e31ed8d0d89d195144541baa3e6 Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com> Signed-off-by: mcdofrenchfreis <xyzevan@androidist.net> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-17 00:51:10 -03:00
Tim Zimmermann	79c865d07c	syscall: Fake uname to 4.19 also for netbpfload * This is required for U QPR2 Change-Id: I0321c64f77fccf74ff2472c3abd29e8b6b4be1ce Signed-off-by: Cyber Knight <cyberknight755@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-13 23:02:28 -03:00
Tim Zimmermann	dcb49fbdf0	syscall: Fake uname to 4.19 for bpfloader/netd * Google is attempting to kill 4.14 in `0156d6e2ba` Change-Id: Ic87a66753a7acc89b0fe5b19158eea4c58ba980f Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-13 23:02:17 -03:00
Tyler Nijmeh	86584cb63b	net: Allow BFP JIT to compile without module support Thanks to kdrag0n @ GitHub for his original commit using vmalloc instead of kmalloc (preventing a panic). Signed-off-by: Tyler Nijmeh <tylernij@gmail.com> Change-Id: I336835a0bf9abbbbad0b9a0d299b5c22eaf15abb Signed-off-by: DennySPb <dennyspb@gmail.com> Signed-off-by: Cyber Knight <cyberknight755@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-13 14:37:40 -03:00
Ard Biesheuvel	3c91c4d3a6	bpf: add __weak hook for allocating executable memory By default, BPF uses module_alloc() to allocate executable memory, but this is not necessary on all arches and potentially undesirable on some of them. So break out the module_alloc() and module_memfree() calls into __weak functions to allow them to be overridden in arch code. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> Signed-off-by: atndko <z1281552865@gmail.com> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Daniel Borkmann	3582208dd5	BACKPORT: bpf: fix unconnected udp hooks Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg") Change-Id: If2bab00efe5f37a591083fe2676e76f35f8cecc3 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Alexei Starovoitov	022212035d	BACKPORT: bpf: enforce return code for cgroup-bpf programs with addition of tnum logic the verifier got smart enough and we can enforce return codes at program load time. For now do so for cgroup-bpf program types. Change-Id: Iae3a46c3d38810e47cbf4ec23356abae03ded736 Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Andrey Ignatov	6fab2054f9	bpf: Hooks for sys_sendmsg In addition to already existing BPF hooks for sys_bind and sys_connect, the patch provides new hooks for sys_sendmsg. It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that provides access to socket itlself (properties like family, type, protocol) and user-passed `struct sockaddr ` so that BPF program can override destination IP and port for system calls such as sendto(2) or sendmsg(2) and/or assign source IP to the socket. The hooks are implemented as two new attach types: `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and UDPv6 correspondingly. UDPv4 and UDPv6 separate attach types for same reason as sys_bind and sys_connect hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. The difference with already existing hooks is sys_sendmsg are implemented only for unconnected UDP. For TCP it doesn't make sense to change user-provided `struct sockaddr ` at sendto(2)/sendmsg(2) time since socket either was already connected and has source/destination set or wasn't connected and call to sendto(2)/sendmsg(2) would lead to ENOTCONN anyway. Connected UDP is already handled by sys_connect hooks that can override source/destination at connect time and use fast-path later, i.e. these hooks don't affect UDP fast-path. Rewriting source IP is implemented differently than that in sys_connect hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to just bind socket to desired local IP address since source IP can be set on per-packet basis by using ancillary data (cmsg(3)). So no matter if socket is bound or not, source IP has to be rewritten on every call to sys_sendmsg. To do so two new fields are added to UAPI `struct bpf_sock_addr`; * `msg_src_ip4` to set source IPv4 for UDPv4; * `msg_src_ip6` to set source IPv6 for UDPv6. Change-Id: Icf5938b0b69ddfb1e99dc2abc90204f7c97f0473 Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Cyber Knight	c016771a6f	gen_headers_{arm, arm64}: Add btf.h to the list - This fixes a build error. Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Toke Høiland-Jørgensen	c47b10d70b	BACKPORT: devmap: Allow map lookups from eBPF We don't currently allow lookups into a devmap from eBPF, because the map lookup returns a pointer directly to the dev->ifindex, which shouldn't be modifiable from eBPF. However, being able to do lookups in devmaps is useful to know (e.g.) whether forwarding to a specific interface is enabled. Currently, programs work around this by keeping a shadow map of another type which indicates whether a map index is valid. Since we now have a flag to make maps read-only from the eBPF side, we can simply lift the lookup restriction if we make sure this flag is always set. Change-Id: I42b1430605c6837710fd903a0c8abf2c7dc13f16 Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Toke Høiland-Jørgensen	e2ce82046a	BACKPORT: xdp: Add devmap_hash map type for looking up devices by hashed index A common pattern when using xdp_redirect_map() is to create a device map where the lookup key is simply ifindex. Because device maps are arrays, this leaves holes in the map, and the map has to be sized to fit the largest ifindex, regardless of how many devices actually are actually needed in the map. This patch adds a second type of device map where the key is looked up using a hashmap, instead of being used as an array index. This allows maps to be densely packed, so they can be smaller. Change-Id: I6155de499a47fb45bac1a39319f0ad979032fd6d Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Tim Zimmermann	82af075896	kernel: bpf: devmap: Create __dev_map_alloc_node * Like in `fca16e5107` Change-Id: I95915b56f381c8f5851b6e7827eed6064aefe586 Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Andrey Ignatov	f6a9debf24	BACKPORT: bpf: Post-hooks for sys_bind "Post-hooks" are hooks that are called right before returning from sys_bind. At this time IP and port are already allocated and no further changes to `struct sock` can happen before returning from sys_bind but BPF program has a chance to inspect the socket and change sys_bind result. Specifically it can e.g. inspect what port was allocated and if it doesn't satisfy some policy, BPF program can force sys_bind to fail and return EPERM to user. Another example of usage is recording the IP:port pair to some map to use it in later calls to sys_connect. E.g. if some TCP server inside cgroup was bound to some IP:port_n, it can be recorded to a map. And later when some TCP client inside same cgroup is trying to connect to 127.0.0.1:port_n, BPF hook for sys_connect can override the destination and connect application to IP:port_n instead of 127.0.0.1:port_n. That helps forcing all applications inside a cgroup to use desired IP and not break those applications if they e.g. use localhost to communicate between each other. == Implementation details == Post-hooks are implemented as two new attach types `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`. Separate attach types for IPv4 and IPv6 are introduced to avoid access to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from `inet6_bind()` since those fields might not make sense in such cases. Change-Id: Ibef21eed069c37684321b2401e5bb52f689ab8e7 Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Andrey Ignatov	ae13d58598	BACKPORT: bpf: Hooks for sys_connect == The problem == See description of the problem in the initial patch of this patch set. == The solution == The patch provides much more reliable in-kernel solution for the 2nd part of the problem: making outgoing connecttion from desired IP. It adds new attach types `BPF_CGROUP_INET4_CONNECT` and `BPF_CGROUP_INET6_CONNECT` for program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both source and destination of a connection at connect(2) time. Local end of connection can be bound to desired IP using newly introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though, and doesn't support binding to port, i.e. leverages `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this: * looking for a free port is expensive and can affect performance significantly; * there is no use-case for port. As for remote end (`struct sockaddr *` passed by user), both parts of it can be overridden, remote IP and remote port. It's useful if an application inside cgroup wants to connect to another application inside same cgroup or to itself, but knows nothing about IP assigned to the cgroup. Support is added for IPv4 and IPv6, for TCP and UDP. IPv4 and IPv6 have separate attach types for same reason as sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. == Implementation notes == The patch introduces new field in `struct proto`: `pre_connect` that is a pointer to a function with same signature as `connect` but is called before it. The reason is in some cases BPF hooks should be called way before control is passed to `sk->sk_prot->connect`. Specifically `inet_dgram_connect` autobinds socket before calling `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since it'd cause double-bind. On the other hand `proto.pre_connect` provides a flexible way to add BPF hooks for connect only for necessary `proto` and call them at desired time before `connect`. Since `bpf_bind()` is allowed to bind only to IP and autobind in `inet_dgram_connect` binds only port there is no chance of double-bind. bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite of value of `bind_address_no_port` socket field. bpf_bind() sets `with_lock` to `false` when calling to __inet_bind() and __inet6_bind() since all call-sites, where bpf_bind() is called, already hold socket lock. Change-Id: I03eb513369c630b203466621d1fbdb9b29c8333c Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-13 14:37:40 -03:00
Andrey Ignatov	989ae8b398	BACKPORT: net: Introduce __inet_bind() and __inet6_bind Refactor `bind()` code to make it ready to be called from BPF helper function `bpf_bind()` (will be added soon). Implementation of `inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and `__inet6_bind()` correspondingly. These function can be used from both `sk_prot->bind` and `bpf_bind()` contexts. New functions have two additional arguments. `force_bind_address_no_port` forces binding to IP only w/o checking `inet_sock.bind_address_no_port` field. It'll allow to bind local end of a connection to desired IP in `bpf_bind()` w/o changing `bind_address_no_port` field of a socket. It's useful since `bpf_bind()` can return an error and we'd need to restore original value of `bind_address_no_port` in that case if we changed this before calling to the helper. `with_lock` specifies whether to lock socket when working with `struct sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old behavior is preserved. But it will be set to `false` for `bpf_bind()` use-case. The reason is all call-sites, where `bpf_bind()` will be called, already hold that socket lock. Change-Id: I3cd102acdb2b3c14946ef8452fd7afb763e8215f Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com> Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2025-01-13 14:37:40 -03:00
Andrey Ignatov	c0d554a69d	BACKPORT: bpf: Hooks for sys_bind == The problem == There is a use-case when all processes inside a cgroup should use one single IP address on a host that has multiple IP configured. Those processes should use the IP for both ingress and egress, for TCP and UDP traffic. So TCP/UDP servers should be bound to that IP to accept incoming connections on it, and TCP/UDP clients should make outgoing connections from that IP. It should not require changing application code since it's often not possible. Currently it's solved by intercepting glibc wrappers around syscalls such as `bind(2)` and `connect(2)`. It's done by a shared library that is preloaded for every process in a cgroup so that whenever TCP/UDP server calls `bind(2)`, the library replaces IP in sockaddr before passing arguments to syscall. When application calls `connect(2)` the library transparently binds the local end of connection to that IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty). Shared library approach is fragile though, e.g.: * some applications clear env vars (incl. `LD_PRELOAD`); * `/etc/ld.so.preload` doesn't help since some applications are linked with option `-z nodefaultlib`; * other applications don't use glibc and there is nothing to intercept. == The solution == The patch provides much more reliable in-kernel solution for the 1st part of the problem: binding TCP/UDP servers on desired IP. It does not depend on application environment and implementation details (whether glibc is used or not). It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`). The new program type is intended to be used with sockets (`struct sock`) in a cgroup and provided by user `struct sockaddr`. Pointers to both of them are parts of the context passed to programs of newly added types. The new attach types provides hooks in `bind(2)` system call for both IPv4 and IPv6 so that one can write a program to override IP addresses and ports user program tries to bind to and apply such a program for whole cgroup. == Implementation notes == [1] Separate attach types for `AF_INET` and `AF_INET6` are added intentionally to prevent reading/writing to offsets that don't make sense for corresponding socket family. E.g. if user passes `sockaddr_in` it doesn't make sense to read from / write to `user_ip6[]` context fields. [2] The write access to `struct bpf_sock_addr_kern` is implemented using special field as an additional "register". There are just two registers in `sock_addr_convert_ctx_access`: `src` with value to write and `dst` with pointer to context that can't be changed not to break later instructions. But the fields, allowed to write to, are not available directly and to access them address of corresponding pointer has to be loaded first. To get additional register the 1st not used by `src` and `dst` one is taken, its content is saved to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load address of pointer field, and finally the register's content is restored from the temporary field after writing `src` value. Change-Id: I47b4cd565cb7cd3bcf3ecf80ddf2586ee81868fb Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Alexei Starovoitov	060376e54b	BACKPORT: bpf: introduce BPF_PROG_QUERY command introduce BPF_PROG_QUERY command to retrieve a set of either attached programs to given cgroup or a set of effective programs that will execute for events within a cgroup Change-Id: I05e0ed5f6eddc30f4a18216d4541448816fd1ae5 Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> for cgroup bits Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Andrey Ignatov	eaaec381f1	BACKPORT: bpf: Check attach type at prog load time == The problem == There are use-cases when a program of some type can be attached to multiple attach points and those attach points must have different permissions to access context or to call helpers. E.g. context structure may have fields for both IPv4 and IPv6 but it doesn't make sense to read from / write to IPv6 field when attach point is somewhere in IPv4 stack. Same applies to BPF-helpers: it may make sense to call some helper from some attach point, but not from other for same prog type. == The solution == Introduce `expected_attach_type` field in in `struct bpf_attr` for `BPF_PROG_LOAD` command. If scenario described in "The problem" section is the case for some prog type, the field will be checked twice: 1) At load time prog type is checked to see if attach type for it must be known to validate program permissions correctly. Prog will be rejected with EINVAL if it's the case and `expected_attach_type` is not specified or has invalid value. 2) At attach time `attach_type` is compared with `expected_attach_type`, if prog type requires to have one, and, if they differ, attach will be rejected with EINVAL. The `expected_attach_type` is now available as part of `struct bpf_prog` in both `bpf_verifier_ops->is_valid_access()` and `bpf_verifier_ops->get_func_proto()` () and can be used to check context accesses and calls to helpers correspondingly. Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and Daniel Borkmann <daniel@iogearbox.net> here: https://marc.info/?l=linux-netdev&m=152107378717201&w=2 Change-Id: Idead9c9cb4251bf5bd843b68bcb83072d5746226 Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Jakub Kicinski	6e5732819c	bpf: offload: rename the ifindex field bpf_target_prog seems long and clunky, rename it to prog_ifindex. We don't want to call this field just ifindex, because maps may need a similar field in the future and bpf_attr members for programs and maps are unnamed. Change-Id: I5473ea6721193bcf616ac3a1056c808446af9c8d Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Jakub Kicinski	dd497af48b	BACKPORT: bpf: offload: add infrastructure for loading programs for a specific netdev The fact that we don't know which device the program is going to be used on is quite limiting in current eBPF infrastructure. We have to reverse or limit the changes which kernel makes to the loaded bytecode if we want it to be offloaded to a networking device. We also have to invent new APIs for debugging and troubleshooting support. Make it possible to load programs for a specific netdev. This helps us to bring the debug information closer to the core eBPF infrastructure (e.g. we will be able to reuse the verifer log in device JIT). It allows device JITs to perform translation on the original bytecode. __bpf_prog_get() when called to get a reference for an attachment point will now refuse to give it if program has a device assigned. Following patches will add a version of that function which passes the expected netdev in. @type argument in __bpf_prog_get() is renamed to attach_type to make it clearer that it's only set on attachment. All calls to ndo_bpf are protected by rtnl, only verifier callbacks are not. We need a wait queue to make sure netdev doesn't get destroyed while verifier is still running and calling its driver. Change-Id: Iba7b96574abc005ad3351d6db2528eb534e47561 Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:40 -03:00
Jakub Kicinski	85e1bfd333	BACKPORT: net: bpf: rename ndo_xdp to ndo_bpf ndo_xdp is a control path callback for setting up XDP in the driver. We can reuse it for other forms of communication between the eBPF stack and the drivers. Rename the callback and associated structures and definitions. Change-Id: I08c456c9afa712ce0b7a98c24b6f46545e69f3cc Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Cyber Knight <cyberknight755@gmail.com>	2025-01-13 14:37:39 -03:00

1 2 3 4 5 ...

812269 Commits