The pr_devel message in gic_raise_softirq() calls smp_processor_id(), which
forces gic_raise_softirq() to require being called with preemption disabled,
even though that isn't an actual requirement. When called with preemption
enabled, smp_processor_id() is thus used incorrectly and generates a warning
splat with the relevant kernel debug options enabled.
Get rid of the useless pr_devel message outright to fix the incorrect
smp_processor_id() usage.
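As a rough sketch (the exact message in the driver differs), the offending
pattern is simply a debug print that formats the current CPU number:

	pr_devel("%s: raising SGI on CPU%d\n", __func__, smp_processor_id());

With CONFIG_DEBUG_PREEMPT enabled, that smp_processor_id() call is what
triggers the "BUG: using smp_processor_id() in preemptible" splat when the
caller hasn't disabled preemption; dropping the message drops the only reason
preemption had to be disabled here.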
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
For the vast majority of mmio operations in this driver, explicit memory
barriers aren't needed either because a data dependency between a read
and write already exists, or because of the presence of the spin locks
which execute a full memory barrier.
Removing all the unneeded explicit barriers considerably reduces
overhead for pinctrl operations, which in turn benefits things like i2c.
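A minimal sketch of the pattern this relies on (the struct and function names
are illustrative, not the driver's):

	struct example_pinctrl {
		void __iomem *base;
		spinlock_t lock;
	};

	static void example_update_config(struct example_pinctrl *pctrl,
					  u32 offset, u32 mask, u32 bits)
	{
		unsigned long flags;
		u32 val;

		spin_lock_irqsave(&pctrl->lock, flags);
		/* read-modify-write: the write's data depends on the read */
		val = readl_relaxed(pctrl->base + offset);
		val = (val & ~mask) | bits;
		writel_relaxed(val, pctrl->base + offset);
		/* the lock/unlock pair already orders this against other CPUs */
		spin_unlock_irqrestore(&pctrl->lock, flags);
	}

The _relaxed accessors skip the heavyweight barriers implied by readl() and
writel(), which is where the overhead reduction comes from.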
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
There's no reason to hold an RCU read lock the entire time while
optimistically spinning for a rwsem. This can needlessly lengthen RCU
grace periods and slow down synchronize_rcu() when it doesn't brute
force the RCU grace period via rcupdate.rcu_expedited=1.
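A minimal sketch of the idea, using illustrative names rather than the real
rwsem internals; the RCU read lock is held only around the owner dereference
instead of around the entire spin loop:

	struct example_sem {
		struct task_struct *owner;
	};

	static bool example_owner_running(struct example_sem *sem)
	{
		struct task_struct *owner;
		bool running;

		rcu_read_lock();
		owner = READ_ONCE(sem->owner);
		running = owner && owner->on_cpu;
		rcu_read_unlock();

		return running;
	}

	static void example_spin_on_owner(struct example_sem *sem)
	{
		/* each iteration opens and closes its own short RCU section */
		while (example_owner_running(sem) && !need_resched())
			cpu_relax();
	}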
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
There's no reason to hold an RCU read lock the entire time while
optimistically spinning for a mutex lock. This can needlessly lengthen
RCU grace periods and slow down synchronize_rcu() when it doesn't brute
force the RCU grace period via rcupdate.rcu_expedited=1.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
It isn't guaranteed a CPU will idle upon calling lpm_cpuidle_enter(),
since it could abort early at the need_resched() check. In this case,
it's possible for an IPI to be sent to this "idle" CPU needlessly, thus
wasting power. For the same reason, it's also wasteful to keep a CPU
marked idle even after it's woken up.
Make the window in which CPUs are marked idle as small as possible in order
to improve power consumption.
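A sketch of the intended ordering (the idle mask and its helpers are
illustrative placeholders for however the idle state is actually tracked):

	static unsigned long example_idle_cpus;	/* one bit per CPU */

	static int example_cpuidle_enter(int cpu)
	{
		/* bail out before the CPU is ever advertised as idle */
		if (need_resched())
			return -EBUSY;

		set_bit(cpu, &example_idle_cpus);	/* marked as late as possible */
		cpu_do_idle();				/* architectural sleep */
		clear_bit(cpu, &example_idle_cpus);	/* cleared as early as possible */

		return 0;
	}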
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The pm_qos callback currently suffers from a number of pitfalls: it
sends IPIs to CPUs that may not be idle, waits for those IPIs to finish
propagating while preemption is disabled (resulting in a long busy wait
for the pm_qos_update_target() caller), and needlessly calls a no-op
function when the IPIs are processed.
Optimize the pm_qos notifier by only sending IPIs to CPUs that are
idle, and by using arch_send_wakeup_ipi_mask() instead of
smp_call_function_many(). Using IPI_WAKEUP instead of IPI_CALL_FUNC,
which is what smp_call_function_many() uses behind the scenes, has the
benefit of doing zero work upon receipt of the IPI; IPI_WAKEUP is
designed purely for sending an IPI without a payload, whereas
IPI_CALL_FUNC does unwanted extra work just to run the empty
smp_callback() function.
Determining which CPUs are idle is done efficiently with an atomic bitmask
instead of the wake_up_if_idle() API, which checks the CPU's runqueue in an
RCU read-side critical section and under a spin lock, and is therefore not
very efficient compared to a simple atomic bitwise operation. A cpumask isn't
needed for this because NR_CPUS is guaranteed to fit within a word.
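A hedged sketch of the mechanism; the variable and function names here are
illustrative, with arch_send_wakeup_ipi_mask() made usable by the follow-up
IPI_WAKEUP patch:

	static unsigned long example_idle_cpus;	/* NR_CPUS fits within one word */

	static void example_qos_notify(void)
	{
		unsigned long idle = READ_ONCE(example_idle_cpus);
		struct cpumask wake_mask;
		unsigned int cpu;

		cpumask_clear(&wake_mask);
		for_each_set_bit(cpu, &idle, nr_cpu_ids)
			cpumask_set_cpu(cpu, &wake_mask);

		/* IPI_WAKEUP carries no payload and does no work on receipt */
		if (!cpumask_empty(&wake_mask))
			arch_send_wakeup_ipi_mask(&wake_mask);
	}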
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
An empty IPI is useful for cpuidle to wake sleeping CPUs without causing
them to do unnecessary work upon receipt of the IPI. IPI_WAKEUP fills
this use-case nicely, so let it be used outside of the ACPI parking
protocol.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
None of the pm_qos functions actually run in interrupt context; if some
driver calls pm_qos_update_target in interrupt context then it's already
broken. There's no need to disable interrupts while holding pm_qos_lock,
so don't do it.
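A minimal before/after sketch of the locking change (the struct and field
names are illustrative):

	struct example_qos {
		spinlock_t lock;
		s32 target_value;
	};

	static void example_update_target(struct example_qos *q, s32 value)
	{
		spin_lock(&q->lock);	/* was: spin_lock_irqsave(&q->lock, flags) */
		q->target_value = value;
		spin_unlock(&q->lock);	/* was: spin_unlock_irqrestore(&q->lock, flags) */
	}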
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This reverts commit 1e5a5b5e00e9706cd48e3c87de1607fcaa5214d2.
This doesn't make sense for a few reasons. Firstly, upstream uses this
mutex code and it works fine on all arches; why should arm be any
different?
Secondly, once the mutex owner starts to spin on `wait_lock`,
preemption is disabled and the owner will be in an actively-running
state. The optimistic mutex spinning occurs when the lock owner is
actively running on a CPU, and while the optimistic spinning takes
place, no attempt to acquire `wait_lock` is made by the new waiter.
Therefore, it is guaranteed that new mutex waiters which optimistically
spin will not contend the `wait_lock` spin lock that the owner needs to
acquire in order to make forward progress.
Another potential source of `wait_lock` contention can come from tasks
that call mutex_trylock(), but this isn't actually problematic (and if
it were, it would affect the MUTEX_SPIN_ON_OWNER=n use-case too). This
won't introduce significant contention on `wait_lock` because the
trylock code exits before attempting to lock `wait_lock`, specifically
when the atomic mutex counter indicates that the mutex is already
locked. So in reality, the amount of `wait_lock` contention that can
come from mutex_trylock() amounts to only one task. And once it
finishes, `wait_lock` will no longer be contended and the previous
mutex owner can proceed with clean up.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This reverts commit 0db49c2550a09458db188fb7312c66783c5af104.
This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
This reverts commit a9a60c58e0fa21c41ac284282949187b13bdd756.
This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The memory allocated dynamically here is just used to store a single
instance of a struct. Allocate both possible structs on the stack
instead of allocating them dynamically to improve performance.
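Illustrative sketch of the pattern (the struct and helper are hypothetical
stand-ins for the driver's own):

	struct example_req {
		u32 arg;
	};

	static int example_submit(struct example_req *req)
	{
		return req->arg ? 0 : -EINVAL;	/* placeholder for the real work */
	}

	static int example_do_request(u32 arg)
	{
		/* was: kzalloc(sizeof(*req), GFP_KERNEL) + kfree() on every path */
		struct example_req req = { .arg = arg };

		return example_submit(&req);
	}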
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Waking the GPU upon touch wastes power when the screen is being touched
in a way that does not induce animation or any actual need for GPU usage.
Instead of preemptively waking the GPU on touch input, wake it up upon
receiving an IOCTL_KGSL_GPU_COMMAND ioctl, since that is a sign that the GPU
will soon be needed.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Since we have multiple CCIs with qcom,cam-res-mgr defined, the global
cam_res pointer gets overwritten each time a CCI probes, causing memory
to be leaked. Since it appears that the single global cam_res pointer is
intentional, let's just skip superfluous cam_res allocations to fix the
leak.
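A hedged sketch of the fix; the structure contents and probe function are
illustrative, only the idea of skipping the duplicate allocation is taken
from the change itself:

	struct example_res_mgr {
		struct device *dev;		/* contents illustrative */
	};

	static struct example_res_mgr *cam_res;	/* intentionally a single global */

	static int example_res_mgr_probe(struct platform_device *pdev)
	{
		/* an earlier CCI probe already allocated it; don't leak that one */
		if (cam_res)
			return 0;

		cam_res = kzalloc(sizeof(*cam_res), GFP_KERNEL);
		if (!cam_res)
			return -ENOMEM;

		cam_res->dev = &pdev->dev;
		return 0;
	}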
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The default frequency on Qualcomm CPUs is the lowest frequency supported
by the CPU. This hurts latency when waking from suspend, as each CPU
coming online runs at its lowest frequency until the governor can take
over later. To speed up waking from suspend, hijack the CPUHP_AP_ONLINE
hook and use it to set the highest available frequency on each CPU as
they come online. This is done behind the governor's back, but that's fine
because the governor isn't yet running for a CPU that's coming online.
This speeds up waking from suspend significantly.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: Ibb92aa78b858b00b6687340f2efe66f86b866514
Signed-off-by: azrim <mirzaspc@gmail.com>
Our kernel only runs on known systems where broken IRQs would already
have been discovered, so disable this to reduce overhead in the IRQ
handling path.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Arter made a similar commit in which the random fops routed the read hook to
the urandom_read method. However, that approach leads to a warning about
random_read being unused and still leaves the poll hook linked to
random_poll. This commit solves both of those issues at the root.
Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
kgsl_3d_init takes a long time to execute. Create a kernel thread to run it
and save kernel boot time.
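A minimal sketch of the approach (the init body is a placeholder; only
kthread_run() is real kernel API here):

	static void example_do_slow_gpu_init(void)
	{
		/* placeholder for the time-consuming kgsl_3d_init work */
	}

	static int example_3d_init_thread(void *unused)
	{
		example_do_slow_gpu_init();
		return 0;
	}

	static int __init example_3d_init(void)
	{
		struct task_struct *t;

		/* boot continues while the thread does the heavy lifting */
		t = kthread_run(example_3d_init_thread, NULL, "kgsl_3d_init");
		return PTR_ERR_OR_ZERO(t);
	}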
Change-Id: I35e7a1525204b5be4301762aa0e41c9a159784d3
Signed-off-by: ankusa <ankusa@codeaurora.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Trying to wait for fences that have already been signaled incurs a high
setup cost, since dynamic memory allocation must be used. Avoiding this
overhead when it isn't needed improves performance.
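A hedged sketch of the fast path being added; only the dma_fence_* calls are
real kernel API, the wrapper itself is illustrative:

	static long example_wait_fence(struct dma_fence *fence, long timeout)
	{
		/* already signaled: skip the allocation-heavy wait setup */
		if (dma_fence_is_signaled(fence))
			return timeout;

		/* slow path: set up the waiter bookkeeping and block */
		return dma_fence_wait_timeout(fence, true, timeout);
	}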
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
Same concept as here: fe23bc0887
Extended version that covers more cases.
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
[kdrag0n: Fixed compile error in Adreno driver when debugfs is enabled]
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
In some use cases, IRQ lists are added to user_event_list without being
initialized. Initialize the IRQ lists immediately after allocating the node
to avoid a NULL pointer access.
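Sketch of the fix (structure and field names are illustrative):

	struct example_node {
		struct list_head link;
		struct list_head irq_list;	/* must be initialized before use */
	};

	static LIST_HEAD(user_event_list);

	static struct example_node *example_add_node(void)
	{
		struct example_node *node = kzalloc(sizeof(*node), GFP_KERNEL);

		if (!node)
			return NULL;

		INIT_LIST_HEAD(&node->irq_list);	/* the previously missing init */
		list_add_tail(&node->link, &user_event_list);
		return node;
	}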
Bug: 129427630
Test: boot to Android without panel, and ADB command can work
Change-Id: I39c3b50e7c11cd6b22b7dc5e9461288608694e26
Signed-off-by: Ken Huang <kenbshuang@google.com>
(cherry picked from commit f7d5d71d72960c1fd637aecc5de36e413f37b92e)
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Writing to registers is frequent enough that there is a measurably
significant portion of CPU time spent on checking the debug mask for
whether to log. Remove the check and logging call altogether to
eliminate the overhead.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
Remote register I/O amounts to a measurably significant portion of CPU
time due to how frequently this function is used. Cache the value of
each register on-demand and use this value in future invocations to
mitigate the expensive I/O.
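A hedged sketch of on-demand register caching (names and layout are
illustrative): the first read of a register goes to the hardware and fills
the cache, later reads are served from memory, and writes update both:

	#define EXAMPLE_NUM_REGS	64

	struct example_reg_cache {
		void __iomem *base;
		u32 val[EXAMPLE_NUM_REGS];
		unsigned long valid[BITS_TO_LONGS(EXAMPLE_NUM_REGS)];
	};

	static u32 example_read(struct example_reg_cache *c, unsigned int reg)
	{
		if (!test_bit(reg, c->valid)) {
			c->val[reg] = readl_relaxed(c->base + reg * sizeof(u32));
			set_bit(reg, c->valid);
		}
		return c->val[reg];
	}

	static void example_write(struct example_reg_cache *c, unsigned int reg,
				  u32 val)
	{
		c->val[reg] = val;
		set_bit(reg, c->valid);
		writel_relaxed(val, c->base + reg * sizeof(u32));
	}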
Co-authored-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Speed up command transfers by reducing unnecessary intermediate buffer
allocation. The buffer allocation is only needed if using FIFO command
transfer, but otherwise there's no need to allocate and memcpy into
intermediate buffers.
Bug: 136715342
Change-Id: Ie540c285655ec86deb046c187f1e27538fd17d1c
Signed-off-by: Adrian Salido <salidoa@google.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: azrim <mirzaspc@gmail.com>
sdmmagpie's CPUs (semi-custom derivations of Cortex-A55 and Cortex-A76)
support ARMv8.1's efficient LSE atomic instructions, as reported in
/proc/cpuinfo and confirmed by the CPU feature detection messages in the
kernel log.
Since our CPUs support it, enable LSE atomics to speed up atomic operations:
they are implemented in hardware instead of being synthesized from several
instructions in software.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
CAF appears to have messed with some code in this kernel related to LSE
atomics and/or alternatives, causing the combined LSE + LL/SC
out-of-line calling code to be too big for its section when compiling
kernel/locking/spinlock.c. This causes gas to fail with a confusing error:
/tmp/spinlock-79343b.s: Assembler messages:
/tmp/spinlock-79343b.s:61: Error: attempt to move .org backwards
/tmp/spinlock-79343b.s:157: Error: attempt to move .org backwards
Clang's integrated assembler is more verbose and provides a more helpful
error that points to the alternatives code as being the culprit:
In file included from ../kernel/locking/spinlock.c:20:
In file included from ../include/linux/spinlock.h:88:
../arch/arm64/include/asm/spinlock.h:76:15: error: invalid .org offset '56' (at offset '60')
asm volatile(ARM64_LSE_ATOMIC_INSN(
^
../arch/arm64/include/asm/lse.h:36:2: note: expanded from macro 'ARM64_LSE_ATOMIC_INSN'
ALTERNATIVE(llsc, lse, ARM64_HAS_LSE_ATOMICS)
^
../arch/arm64/include/asm/alternative.h:281:2: note: expanded from macro 'ALTERNATIVE'
_ALTERNATIVE_CFG(oldinstr, newinstr, __VA_ARGS__, 1)
^
../arch/arm64/include/asm/alternative.h:83:2: note: expanded from macro '_ALTERNATIVE_CFG'
__ALTERNATIVE_CFG(oldinstr, newinstr, feature, IS_ENABLED(cfg), 0)
^
../arch/arm64/include/asm/alternative.h:73:16: note: expanded from macro '__ALTERNATIVE_CFG'
".popsection\n\t" \
^
<inline asm>:35:7: note: instantiated into assembly here
.org . - (664b-663b) + (662b-661b)
^
Omitting the alternatives code indeed reduces the size enough to make
everything compile successfully. We don't need the patching anyway
because we will only enable CONFIG_ARM64_LSE_ATOMICS when the target CPU
is known to support LSE atomics with 100% certainty, so kill all the
dynamic out-of-line LL/SC patching code.
This change also has the side-effect of reducing the I-cache footprint
of these critical locking and atomic paths, which can reduce cache
thrashing and increase overall performance.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
On a Kryo 485 CPU (semi-custom Cortex-A76 derivative) in a Snapdragon
855 (SM8150) SoC, switching from traditional LL/SC atomics to LSE
causes LKDTM's ATOMIC_TIMING test to regress by 2x:
LL/SC ATOMIC_TIMING: 34.14s 34.08s
LSE ATOMIC_TIMING: 70.84s 71.06s
Prefetching the target operands fixes the regression and makes LSE perform
better than LL/SC, as expected:
LSE+prfm ATOMIC_TIMING: 21.36s 21.21s
"dd if=/dev/zero of=/dev/null count=10000000" also runs faster:
LL/SC: 3.3 3.2 3.3 s
LSE: 3.1 3.2 3.2 s
LSE+p: 2.3 2.3 2.3 s
Commit 0ea366f5e1b6413a6095dce60ea49ae51e468b61 applied the same change
to LL/SC atomics, but it was never ported to LSE.
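Roughly, the change amounts to putting a prefetch-for-store in front of each
LSE instruction; a simplified non-returning add looks like this (operand
details are illustrative, the real code covers the whole family of atomics):

	static inline void example_atomic_add(int i, atomic_t *v)
	{
		asm volatile(
		"	prfm	pstl1strm, %[c]\n"	/* prefetch line for store */
		"	stadd	%w[i], %[c]\n"		/* LSE non-returning add */
		: [c] "+Q" (v->counter)
		: [i] "r" (i));
	}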
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
This adds support to arm64 for fast refcount checking, as contributed
by Kees for x86 based on the implementation by grsecurity/PaX.
The general approach is identical: the existing atomic_t helpers are
cloned for refcount_t, with the arithmetic instruction modified to set
the PSTATE flags, and one or two branch instructions added that jump to
an out of line handler if overflow, decrement to zero or increment from
zero are detected.
One complication that we have to deal with on arm64 is the fact that
it has two atomics implementations: the original LL/SC implementation
using load/store exclusive loops, and the newer LSE one that does mostly
the same in a single instruction. So we need to clone some parts of
both for the refcount handlers, but we also need to deal with the way
LSE builds fall back to LL/SC at runtime if the hardware does not
support it.
As is the case with the x86 version, the performance gain is substantial
(ThunderX2 @ 2.2 GHz, using LSE), even though the arm64 implementation
incorporates an add-from-zero check as well:
perf stat -B -- echo ATOMIC_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
116252672661 cycles # 2.207 GHz
52.689793525 seconds time elapsed
perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
127060259162 cycles # 2.207 GHz
57.243690077 seconds time elapsed
For comparison, the numbers below were captured using CONFIG_REFCOUNT_FULL,
which uses the validation routines implemented in C using cmpxchg():
perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
Performance counter stats for 'cat /dev/fd/63':
191057942484 cycles # 2.207 GHz
86.568269904 seconds time elapsed
As a bonus, this code has been found to perform significantly better on
systems with many CPUs, due to the fact that it no longer relies on the
load/compare-and-swap combo performed in a tight loop, which is what we
emit for cmpxchg() on arm64.
Cc: Will Deacon <will.deacon@arm.com>
Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>,
Cc: Kees Cook <keescook@chromium.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Cc: Jan Glauber <jglauber@cavium.com>,
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Cc: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[kdrag0n]
- Backported to k4.14 from:
https://www.spinics.net/lists/arm-kernel/msg735992.html
- Benchmarked on sm8150 using perf and LKDTM REFCOUNT_TIMING:
https://docs.google.com/spreadsheets/d/14CctCmWzQAGhOmpHrBJfXQy_HuNFTpEkMEYSUGKOZR8/edit
         | Fast checking      | Generic checking
---------+--------------------+-----------------------
Cycles   | 79235532616        | 102554062037
         | 79391767237        | 99625955749
Time     | 32.99879212 sec    | 42.5354029 sec
         | 32.97133254 sec    | 41.31902045 sec
Average:
Cycles   | 79313649927        | 101090008893
Time     | 33 sec             | 42 sec
Conflicts:
arch/arm64/kernel/traps.c
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Volodymyr Zhdanov <wight554@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Mixing kernel and user debug hooks together is highly error-prone as it
relies on all of the hooks to figure out whether the exception came from
kernel or user, and then to act accordingly.
Make our debug hook code a little more robust by maintaining separate
hook lists for user and kernel, with separate registration functions
to force callers to be explicit about the exception levels that they
care about.
Conflicts:
arch/arm64/kernel/traps.c
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Volodymyr Zhdanov <wight554@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
The implementation of flush_icache_range() includes instruction sequences
which are themselves patched at runtime, so it is not safe to call from
the patching framework.
This patch reworks the alternatives cache-flushing code so that it rolls
its own internal D-cache maintenance using DC CIVAC before invalidating
the entire I-cache after all alternatives have been applied at boot.
Modules don't cause any issues, since flush_icache_range() is safe to
call by the time they are loaded.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Rohit Khanna <rokhanna@nvidia.com>
Cc: Alexander Van Brunt <avanbrunt@nvidia.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Patching kernel instructions at runtime requires other CPUs to undergo
a context synchronisation event via an explicit ISB or an IPI in order
to ensure that the new instructions are visible. This is required even
for "hotpatch" instructions such as NOP and BL, so avoid optimising in
this case and always go via stop_machine() when performing general
patching.
ftrace isn't quite as strict, so it can continue to call the nosync
code directly.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
When invalidating the instruction cache for a kernel mapping via
flush_icache_range(), it is also necessary to flush the pipeline for
other CPUs so that instructions fetched into the pipeline before the
I-cache invalidation are discarded. For example, if module 'foo' is
unloaded and then module 'bar' is loaded into the same area of memory,
a CPU could end up executing instructions from 'foo' when branching into
'bar' if these instructions were fetched into the pipeline before 'foo'
was unloaded.
Whilst this is highly unlikely to occur in practice, particularly as
any exception acts as a context-synchronizing operation, following the
letter of the architecture requires us to execute an ISB on each CPU
in order for the new instruction stream to be visible.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
Commit 959bf2fd03b5 ("arm64: percpu: Rewrite per-cpu ops to allow use of
LSE atomics") introduced alternative code sequences for the arm64 percpu
atomics, so that the LSE instructions can be patched in at runtime if
they are supported by the CPU.
Unfortunately, when patching in the LSE sequence for a value-returning
pcpu atomic, the argument registers are the wrong way round. The
implementation of this_cpu_add_return() therefore ends up adding
uninitialised stack to the percpu variable and returning garbage.
As it turns out, there aren't very many users of the value-returning
percpu atomics in mainline and we only spotted this due to a failure in
the kprobes selftests. In this case, when attempting to single-step over
the out-of-line instruction slot, the debug monitors would not be
enabled because calling this_cpu_inc_return() on the kernel debug
monitor refcount would fail to detect the transition from 0. We would
consequently execute past the slot and take an undefined instruction
exception from the kernel, resulting in a BUG:
| kernel BUG at arch/arm64/kernel/traps.c:421!
| PREEMPT SMP
| pc : do_undefinstr+0x268/0x278
| lr : do_undefinstr+0x124/0x278
| Process swapper/0 (pid: 1, stack limit = 0x(____ptrval____))
| Call trace:
| do_undefinstr+0x268/0x278
| el1_undef+0x10/0x78
| 0xffff00000803c004
| init_kprobes+0x150/0x180
| do_one_initcall+0x74/0x178
| kernel_init_freeable+0x188/0x224
| kernel_init+0x10/0x100
| ret_from_fork+0x10/0x1c
Fix the argument order to get the value-returning pcpu atomics working
correctly when implemented using the LSE instructions.
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
Our percpu code is a bit of an inconsistent mess:
* It rolls its own xchg(), but reuses cmpxchg_local()
* It uses various different flavours of preempt_{enable,disable}()
* It returns values even for the non-returning RmW operations
* It makes no use of LSE atomics outside of the cmpxchg() ops
* There are individual macros for different sizes of access, but these
are all funneled through a switch statement rather than dispatched
directly to the relevant case
This patch rewrites the per-cpu operations to address these shortcomings.
Whilst the new code is a lot cleaner, the big advantage is that we can
use the non-returning ST- atomic instructions when we have LSE.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
This reverts commit 99836317ce2f622e1e70d29770a048c12848c765.
Revert CAF's mutant pick in favor of a fresh backport from mainline.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
The CAS instructions implicitly access only the relevant bits of the "old"
argument, so there is no need for explicit masking via type-casting as
there is in the LL/SC implementation.
Move the casting into the LL/SC code and remove it altogether for the LSE
implementation.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
We need linux/compiler.h for unreachable(), so #include it here.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
We want to avoid pulling linux/preempt.h into cmpxchg.h, since that can
introduce a circular dependency on linux/bitops.h. linux/preempt.h is
only needed by the per-cpu cmpxchg implementation, which is better off
alongside the per-cpu xchg implementation in percpu.h, so move it there
and add the missing #include.
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
Having asm/cmpxchg.h pull in linux/bug.h is problematic because this
ends up pulling in the atomic bitops which themselves may be built on
top of atomic.h and cmpxchg.h.
Instead, just include build_bug.h for the definition of BUILD_BUG.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
When the LL/SC atomics are moved out-of-line, they are annotated as
notrace and exported to modules. Ensure we pull in the relevant include
files so that these macros are defined when we need them.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
In cases where x30 is used as a temporary in the out-of-line ll/sc atomics
(e.g. atomic_fetch_add), the compiler tends to put out a full stackframe,
which included pointing the x29 at the new frame.
Since these things aren't traceable anyway, we can pass -fomit-frame-pointer
to reduce the work when spilling. Since this is incompatible with -pg, we
also remove that from the CFLAGS for this file.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
TODO: benchmark
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>