799127 Commits

Author SHA1 Message Date
Sultan Alsawaf
3142fce40a
qos: Don't disable interrupts while holding pm_qos_lock
None of the pm_qos functions actually run in interrupt context; if some
driver calls pm_qos_update_target in interrupt context then it's already
broken. There's no need to disable interrupts while holding pm_qos_lock,
so don't do it.
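
A minimal sketch of the kind of change this implies, assuming the usual
spinlock pattern around the pm_qos constraint update (not the verbatim diff):

    /* before: interrupts disabled for the whole constraint update */
    spin_lock_irqsave(&pm_qos_lock, flags);
    /* ... update the constraint list ... */
    spin_unlock_irqrestore(&pm_qos_lock, flags);

    /* after: plain spinlock, since no caller runs in interrupt context */
    spin_lock(&pm_qos_lock);
    /* ... update the constraint list ... */
    spin_unlock(&pm_qos_lock);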

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:32 +07:00
Sultan Alsawaf
eb754bba42
Revert "mutex: Add a delay into the SPIN_ON_OWNER wait loop."
This reverts commit 1e5a5b5e00e9706cd48e3c87de1607fcaa5214d2.

This doesn't make sense for a few reasons. Firstly, upstream uses this
mutex code and it works fine on all arches; why should arm be any
different?

Secondly, once the mutex owner starts to spin on `wait_lock`,
preemption is disabled and the owner will be in an actively-running
state. The optimistic mutex spinning occurs when the lock owner is
actively running on a CPU, and while the optimistic spinning takes
place, no attempt to acquire `wait_lock` is made by the new waiter.
Therefore, it is guaranteed that new mutex waiters which optimistically
spin will not contend the `wait_lock` spin lock that the owner needs to
acquire in order to make forward progress.

Another potential source of `wait_lock` contention can come from tasks
that call mutex_trylock(), but this isn't actually problematic (and if
it were, it would affect the MUTEX_SPIN_ON_OWNER=n use-case too). This
won't introduce significant contention on `wait_lock` because the
trylock code exits before attempting to lock `wait_lock`, specifically
when the atomic mutex counter indicates that the mutex is already
locked. So in reality, the amount of `wait_lock` contention that can
come from mutex_trylock() amounts to only one task. And once it
finishes, `wait_lock` will no longer be contended and the previous
mutex owner can proceed with clean up.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:31 +07:00
Sultan Alsawaf
6d0b8ae4f4
Revert "usb: gadget: mtp: Increase RX transfer length to 1M"
This reverts commit 0db49c2550a09458db188fb7312c66783c5af104.

This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.
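
A rough sketch of the arithmetic behind the contiguous-page concern, assuming
4 KiB pages:

    /* 1 MiB buffer / 4 KiB page = 256 contiguous pages, i.e. an order-8
     * request; kmalloc() must find a physically contiguous run of that size,
     * which fragments and thrashes the page allocator under memory pressure. */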

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:31 +07:00
Sultan Alsawaf
2aef0ad096
Revert "usb: gadget: f_mtp: Increase default TX buffer size"
This reverts commit a9a60c58e0fa21c41ac284282949187b13bdd756.

This results in kmalloc() abuse to find a large number of contiguous
pages, which thrashes the page allocator and hurts overall performance.
I couldn't reproduce the improved MTP throughput that this commit
claimed either, so just revert it.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:31 +07:00
Sultan Alsawaf
c6655a8449
msm: kgsl: Don't allocate memory dynamically for drawobj sync structs
The memory allocated dynamically here is just used to store a single
instance of a struct. Allocate both possible structs on the stack
instead of allocating them dynamically to improve performance.
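
A minimal sketch of the pattern described above; the struct and helper names
are placeholders, not the actual kgsl symbols:

    /* before (sketch): a short-lived heap allocation per call */
    sync = kzalloc(sizeof(*sync), GFP_KERNEL);
    if (!sync)
            return -ENOMEM;
    ret = handle_syncpoint(device, sync);
    kfree(sync);

    /* after (sketch): the same struct lives on the stack for the call */
    struct syncpoint_args sync = {};
    ret = handle_syncpoint(device, &sync);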

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:31 +07:00
Sultan Alsawaf
c56590f19f
msm: kgsl: Wake GPU upon receiving an ioctl rather than upon touch input
Waking the GPU upon touch wastes power when the screen is being touched
in a way that does not induce animation or any actual need for GPU usage.
Instead of preemptively waking the GPU on touch input, wake it up upon
receiving an IOCTL_KGSL_GPU_COMMAND ioctl, since it is a sign that the GPU
receiving an IOCTL_KGSL_GPU_COMMAND ioctl, since it is a sign that the GPU
will soon be needed.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:31 +07:00
Sultan Alsawaf
69dd40d271
msm: camera: Fix memory leak in cam_res_mgr_probe()
Since we have multiple CCIs with qcom,cam-res-mgr defined, the global
cam_res pointer gets overwritten each time a CCI probes, causing memory
to be leaked. Since it appears that the single global cam_res pointer is
intentional, let's just skip superfluous cam_res allocations to fix the
leak.
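
A sketch of the fix described above; cam_res_mgr_probe() and the global
cam_res pointer come from the camera driver, but the body shown here is
illustrative:

    static int cam_res_mgr_probe(struct platform_device *pdev)
    {
            /* a previous CCI probe already set up the global pointer */
            if (cam_res)
                    return 0;

            cam_res = kzalloc(sizeof(*cam_res), GFP_KERNEL);
            if (!cam_res)
                    return -ENOMEM;

            /* ... rest of the original probe ... */
            return 0;
    }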

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:30 +07:00
edenhuang
157ad67d60
msm: camera: Unmap secure buffers in secure usecase
Detach and unmap DMA buffers that were previously obtained from DMA attach
and map calls, respectively.
Ported from ./drivers/media/platform/msm/camera_v2/common/cam_smmu_api.c

Bug: 168589064
Signed-off-by: edenhuang <edenhuang@google.com>
Change-Id: Ib25ee5973b9f276ed99edb1805415c4c6a727249
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:30 +07:00
Sultan Alsawaf
cc7a139419
clk: qcom: clk-cpu-osm: Set each CPU clock to its max when waking up
The default frequency on Qualcomm CPUs is the lowest frequency supported
by the CPU. This hurts latency when waking from suspend, as each CPU
coming online runs at its lowest frequency until the governor can take
over later. To speed up waking from suspend, hijack the CPUHP_AP_ONLINE
hook and use it to set the highest available frequency on each CPU as
they come online. This is done behind the governor's back but it's fine
because the governor isn't running at this point in time for a CPU
that's coming online.

This speeds up waking from suspend significantly.
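
A sketch of the approach; the rate-setting helper is hypothetical, while
cpuhp_setup_state() and the CPUHP_AP_ONLINE state being hijacked are the real
hotplug API:

    static int osm_cpu_online(unsigned int cpu)
    {
            /* the governor isn't running for this CPU yet, so setting the
             * highest available frequency behind its back is safe here */
            osm_set_max_rate(cpu);    /* hypothetical helper */
            return 0;
    }

    static int __init osm_hp_init(void)
    {
            return cpuhp_setup_state(CPUHP_AP_ONLINE, "clk/osm:online",
                                     osm_cpu_online, NULL);
    }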

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Change-Id: Ibb92aa78b858b00b6687340f2efe66f86b866514
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:30 +07:00
Danny Lin
bf489f7790
ARM64/dts: sdmmagpie: Disable broken IRQ detection
Our kernel only runs on known systems where broken IRQs would already
have been discovered, so disable this to reduce overhead in the IRQ
handling path.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:17:30 +07:00
Nanda Okitavera
a99fa0345d
block: zram: Use lz4 as default zram compression
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:15:37 +07:00
celtare21
6e2e801fde
block,cfq: Disable logging if trace is not enabled
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:15:37 +07:00
celtare21
03f9f56939
block,cfq: Set cfq_back_penalty to 1
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:15:36 +07:00
celtare21
98d748d79e
block,cfq: Set cfq_quantum to 16
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:15:36 +07:00
Tyler Nijmeh
ebc063413f
drivers: thermal: Don't qualify thermal polling as high priority
Don't take priority over other workqueues.
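
Illustrative only: the polling workqueue is created without WQ_HIGHPRI so its
work items are queued at normal priority (the actual name and remaining flags
in the driver may differ):

    wq = alloc_workqueue("thermal_poll_wq", WQ_UNBOUND | WQ_FREEZABLE, 0);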

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:33 +00:00
Tyler Nijmeh
c8b5360d5d
drivers: char: mem: Reroute random fops to urandom
Arter made a similar commit, where the random fops routed the read hook to
the urandom_read method. However, that leads to a warning about random_read
being unused, and it leaves the poll hook still linked to random_poll. This
commit solves both of those issues at the root.
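
A sketch of the idea, loosely following the fops layout in
drivers/char/random.c; the exact set of handlers in the tree may differ:

    const struct file_operations random_fops = {
            .read           = urandom_read,   /* reroute reads to urandom */
            .write          = random_write,
            /* no .poll: nothing references random_poll anymore */
            .unlocked_ioctl = random_ioctl,
            .fasync         = random_fasync,
            .llseek         = noop_llseek,
    };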

Signed-off-by: Tyler Nijmeh <tylernij@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:33 +00:00
ankusa
09b3fcf699
msm: kgsl: Parallelization of kgsl_3d_init
kgsl_3d_init takes a long time to execute. Run it in a kernel thread to
save kernel boot time.
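
A sketch of the approach; the thread function and init wrapper are
illustrative, while kthread_run() is the real API:

    static int kgsl_3d_init_thread(void *data)
    {
            /* run the heavy init off the boot path */
            return kgsl_3d_init_work(data);    /* hypothetical wrapper */
    }

    kthread_run(kgsl_3d_init_thread, NULL, "kgsl_3d_init");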

Change-Id: I35e7a1525204b5be4301762aa0e41c9a159784d3
Signed-off-by: ankusa <ankusa@codeaurora.org>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:28 +00:00
Adithya R
1f3fd98715
Revert "msm: kgsl: Parallelization of kgsl_3d_init for AUTO"
This reverts commit 9aa3c0b5c856af9396e729f61cba69f78dc3630c.

Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:28 +00:00
Sultan Alsawaf
2dd3853feb
msm: kgsl: Don't try to wait for fences that have been signaled
Trying to wait for fences that have already been signaled incurs a high
setup cost, since dynamic memory allocation must be used. Avoiding this
overhead when it isn't needed improves performance.
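
A sketch of the early-out described above; dma_fence_is_signaled() is the
real helper, the surrounding code is illustrative:

    /* already signaled: skip allocating and registering a waiter */
    if (dma_fence_is_signaled(fence))
            return 0;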

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:27 +00:00
Yaroslav Furman
4afd54aefb
drivers: msm: Don't copy fence names by default
Same concept as fe23bc0887; this is an extended version that covers more
cases.

Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
[kdrag0n: Fixed compile error in Adreno driver when debugfs is enabled]
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:27 +00:00
Ken Huang
f570b453a6
drm/msm/sde: Init IRQ lists after allocated node
In some use cases, IRQ lists are added to user_event_list without being
initialized. Initialize the IRQ lists immediately after allocating the node
to avoid accessing a null pointer.
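
A sketch of the fix, with placeholder field names:

    node = kzalloc(sizeof(*node), GFP_KERNEL);
    if (!node)
            return -ENOMEM;
    /* initialize immediately so a later list_add() to user_event_list
     * never dereferences uninitialized list pointers */
    INIT_LIST_HEAD(&node->irq_list);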

Bug: 129427630
Test: boot to Android without panel, and ADB command can work
Change-Id: I39c3b50e7c11cd6b22b7dc5e9461288608694e26
Signed-off-by: Ken Huang <kenbshuang@google.com>
(cherry picked from commit f7d5d71d72960c1fd637aecc5de36e413f37b92e)
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:27 +00:00
Danny Lin
8dc628f95c
drm/msm/sde: Remove register write debug logging
Writing to registers is frequent enough that there is a measurably
significant portion of CPU time spent on checking the debug mask for
whether to log. Remove the check and logging call altogether to
eliminate the overhead.
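
A sketch of the hot path after the change; the wrapper name is illustrative:

    static inline void sde_reg_write(void __iomem *base, u32 off, u32 val)
    {
            /* no debug-mask test, no logging call: just the MMIO write */
            writel_relaxed(val, base + off);
    }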

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:26 +00:00
Danny Lin
f3745cf11f
drm/msm/sde: Cache register values when performing clock control
Remote register I/O amounts to a measurably significant portion of CPU
time due to how frequently this function is used. Cache the value of
each register on-demand and use this value in future invocations to
mitigate the expensive I/O.
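
A sketch of the caching pattern, with placeholder fields; the real code keeps
a cache entry per register:

    if (!ctx->cache_valid) {
            ctx->cached_val = readl_relaxed(ctx->base + reg_off);
            ctx->cache_valid = true;
    }
    val = ctx->cached_val;    /* served from the cache on later invocations */
    /* writes go to hardware and refresh the cached value */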

Co-authored-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:26 +00:00
Adrian Salido
cbe8927024
drm/msm: dsi-ctrl: Remove extra buffer copy
Speed up command transfers by avoiding unnecessary intermediate buffer
allocation. An intermediate buffer is only needed for FIFO command
transfers; otherwise there is no need to allocate one and memcpy into it.
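
A sketch of the resulting shape; DSI_CTRL_CMD_FIFO_STORE is assumed to be the
FIFO-transfer flag in this driver, the rest is illustrative:

    if (flags & DSI_CTRL_CMD_FIFO_STORE) {
            /* FIFO transfers still need a bounce buffer */
            buf = kzalloc(len, GFP_KERNEL);
            if (!buf)
                    return -ENOMEM;
            memcpy(buf, msg->tx_buf, len);
    } else {
            /* DMA-style transfers use the caller's buffer directly */
            buf = (u8 *)msg->tx_buf;
    }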

Bug: 136715342
Change-Id: Ie540c285655ec86deb046c187f1e27538fd17d1c
Signed-off-by: Adrian Salido <salidoa@google.com>
Signed-off-by: Adam W. Willis <return.of.octobot@gmail.com>
Signed-off-by: alk3pInjection <webmaster@raspii.tech>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:25 +00:00
celtare21
f5a0e7bd38
ARM64/dts: sdmmagpie: Set silver cluster qos-cores for msm_fastrpc
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-22 11:33:25 +00:00
Danny Lin
493152d36e
ARM64/configs: surya: Enable ARMv8.1 LSE atomics
sdmmagpie's CPUs (semi-custom derivations of Cortex-A55 and Cortex-A76)
support ARMv8.1's efficient LSE atomic instructions as per
/proc/cpuinfo:

CPU feature detection messages in printk confirm the support:

Since our CPUs support it, enable LSE atomics to speed up atomic
operations: they are implemented in hardware instead of being synthesized
from several instructions in software.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:54 +00:00
Danny Lin
9fd2c88771
arm64: lse: Omit LL/SC alternatives patching
CAF appears to have messed with some code in this kernel related to LSE
atomics and/or alternatives, causing the combined LSE + LL/SC
out-of-line calling code to be too big for its section when compiling
kernel/locking/spinlock.c. This causes gas to fail with a confusing error:

    /tmp/spinlock-79343b.s: Assembler messages:
    /tmp/spinlock-79343b.s:61: Error: attempt to move .org backwards
    /tmp/spinlock-79343b.s:157: Error: attempt to move .org backwards

Clang's integrated assembler is more verbose and provides a more helpful
error that points to the alternatives code as being the culprit:

    In file included from ../kernel/locking/spinlock.c:20:
    In file included from ../include/linux/spinlock.h:88:
    ../arch/arm64/include/asm/spinlock.h:76:15: error: invalid .org offset '56' (at offset '60')
            asm volatile(ARM64_LSE_ATOMIC_INSN(
                         ^
    ../arch/arm64/include/asm/lse.h:36:2: note: expanded from macro 'ARM64_LSE_ATOMIC_INSN'
            ALTERNATIVE(llsc, lse, ARM64_HAS_LSE_ATOMICS)
            ^
    ../arch/arm64/include/asm/alternative.h:281:2: note: expanded from macro 'ALTERNATIVE'
            _ALTERNATIVE_CFG(oldinstr, newinstr, __VA_ARGS__, 1)
            ^
    ../arch/arm64/include/asm/alternative.h:83:2: note: expanded from macro '_ALTERNATIVE_CFG'
            __ALTERNATIVE_CFG(oldinstr, newinstr, feature, IS_ENABLED(cfg), 0)
            ^
    ../arch/arm64/include/asm/alternative.h:73:16: note: expanded from macro '__ALTERNATIVE_CFG'
            ".popsection\n\t"                                               \
                          ^
    <inline asm>:35:7: note: instantiated into assembly here
            .org    . - (664b-663b) + (662b-661b)
                    ^

Omitting the alternatives code indeed reduces the size enough to make
everything compile successfully. We don't need the patching anyway
because we will only enable CONFIG_ARM64_LSE_ATOMICS when the target CPU
is known to support LSE atomics with 100% certainty, so kill all the
dynamic out-of-line LL/SC patching code.

This change also has the side-effect of reducing the I-cache footprint
of these critical locking and atomic paths, which can reduce cache
thrashing and increase overall performance.
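
A sketch of the effect on the macro quoted in the error above; the actual
patch may restructure more, but conceptually the ALTERNATIVE wrapper
disappears:

    /* before: runtime-patched between the LL/SC and LSE sequences */
    #define ARM64_LSE_ATOMIC_INSN(llsc, lse)                            \
            ALTERNATIVE(llsc, lse, ARM64_HAS_LSE_ATOMICS)

    /* after (sketch): always emit the LSE sequence, no patching */
    #define ARM64_LSE_ATOMIC_INSN(llsc, lse)        lse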

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:54 +00:00
Danny Lin
b4e1f81b12
arm64: lse: Prefetch operands to speed up atomic operations
On a Kryo 485 CPU (semi-custom Cortex-A76 derivative) in a Snapdragon
855 (SM8150) SoC, switching from traditional LL/SC atomics to LSE
causes LKDTM's ATOMIC_TIMING test to regress by 2x:

LL/SC ATOMIC_TIMING:    34.14s  34.08s
LSE ATOMIC_TIMING:      70.84s  71.06s

Prefetching the target operands fixes the regression and makes LSE
perform better than LL/SC, as expected:

LSE+prfm ATOMIC_TIMING: 21.36s  21.21s

"dd if=/dev/zero of=/dev/null count=10000000" also runs faster:
    LL/SC:  3.3 3.2 3.3 s
    LSE:    3.1 3.2 3.2 s
    LSE+p:  2.3 2.3 2.3 s

Commit 0ea366f5e1b6413a6095dce60ea49ae51e468b61 applied the same change
to LL/SC atomics, but it was never ported to LSE.
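
A sketch of the change on one LSE helper; the prfm hint mirrors what the
LL/SC code already does, but the exact asm template here is illustrative:

    static inline void atomic_add(int i, atomic_t *v)
    {
            asm volatile(
            "       prfm    pstl1strm, %[v]\n"      /* prefetch for store */
            "       stadd   %w[i], %[v]\n"
            : [v] "+Q" (v->counter)
            : [i] "r" (i));
    }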

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:54 +00:00
Ard Biesheuvel
c35aad0595
FROMLIST: arm64: kernel: Implement fast refcount checking
This adds support to arm64 for fast refcount checking, as contributed
by Kees for x86 based on the implementation by grsecurity/PaX.

The general approach is identical: the existing atomic_t helpers are
cloned for refcount_t, with the arithmetic instruction modified to set
the PSTATE flags, and one or two branch instructions added that jump to
an out of line handler if overflow, decrement to zero or increment from
zero are detected.

One complication that we have to deal with on arm64 is the fact that
it has two atomics implementations: the original LL/SC implementation
using load/store exclusive loops, and the newer LSE one that does mostly
the same in a single instruction. So we need to clone some parts of
both for the refcount handlers, but we also need to deal with the way
LSE builds fall back to LL/SC at runtime if the hardware does not
support it.

As is the case with the x86 version, the performance gain is substantial
(ThunderX2 @ 2.2 GHz, using LSE), even though the arm64 implementation
incorporates an add-from-zero check as well:

perf stat -B -- echo ATOMIC_TIMING >/sys/kernel/debug/provoke-crash/DIRECT

      116252672661      cycles                    #    2.207 GHz

      52.689793525 seconds time elapsed

perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT

      127060259162      cycles                    #    2.207 GHz

      57.243690077 seconds time elapsed

For comparison, the numbers below were captured using CONFIG_REFCOUNT_FULL,
which uses the validation routines implemented in C using cmpxchg():

perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT

 Performance counter stats for 'cat /dev/fd/63':

      191057942484      cycles                    #    2.207 GHz

      86.568269904 seconds time elapsed

As a bonus, this code has been found to perform significantly better on
systems with many CPUs, due to the fact that it no longer relies on the
load/compare-and-swap combo performed in a tight loop, which is what we
emit for cmpxchg() on arm64.

Cc: Will Deacon <will.deacon@arm.com>
Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>,
Cc: Kees Cook <keescook@chromium.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Cc: Jan Glauber <jglauber@cavium.com>,
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Cc: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

[kdrag0n]
 - Backported to k4.14 from:
     https://www.spinics.net/lists/arm-kernel/msg735992.html
 - Benchmarked on sm8150 using perf and LKDTM REFCOUNT_TIMING:
     https://docs.google.com/spreadsheets/d/14CctCmWzQAGhOmpHrBJfXQy_HuNFTpEkMEYSUGKOZR8/edit

         | Fast checking      | Generic checking
---------+--------------------+-----------------------
Cycles   | 79235532616        | 102554062037
         | 79391767237        | 99625955749
Time     | 32.99879212 sec    | 42.5354029 sec
         | 32.97133254 sec    | 41.31902045 sec

Average:
Cycles   | 79313649927        | 101090008893
Time     | 33 sec             | 42 sec

Conflicts:
	arch/arm64/kernel/traps.c

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Volodymyr Zhdanov <wight554@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:54 +00:00
Will Deacon
6dbef73225
arm64: debug: Separate debug hooks based on target exception level
Mixing kernel and user debug hooks together is highly error-prone as it
relies on all of the hooks to figure out whether the exception came from
kernel or user, and then to act accordingly.

Make our debug hook code a little more robust by maintaining separate
hook lists for user and kernel, with separate registration functions
to force callers to be explicit about the exception levels that they
care about.

Conflicts:
	arch/arm64/kernel/traps.c

Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Volodymyr Zhdanov <wight554@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:54 +00:00
Will Deacon
1933ba7335
arm64: Avoid flush_icache_range() in alternatives patching code
The implementation of flush_icache_range() includes instruction sequences
which are themselves patched at runtime, so it is not safe to call from
the patching framework.

This patch reworks the alternatives cache-flushing code so that it rolls
its own internal D-cache maintenance using DC CIVAC before invalidating
the entire I-cache after all alternatives have been applied at boot.
Modules don't cause any issues, since flush_icache_range() is safe to
call by the time they are loaded.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Rohit Khanna <rokhanna@nvidia.com>
Cc: Alexander Van Brunt <avanbrunt@nvidia.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:54 +00:00
Will Deacon
f05f9c086a
arm64: insn: Don't fallback on nosync path for general insn patching
Patching kernel instructions at runtime requires other CPUs to undergo
a context synchronisation event via an explicit ISB or an IPI in order
to ensure that the new instructions are visible. This is required even
for "hotpatch" instructions such as NOP and BL, so avoid optimising in
this case and always go via stop_machine() when performing general
patching.

ftrace isn't quite as strict, so it can continue to call the nosync
code directly.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:54 +00:00
Will Deacon
4621b4a85b
arm64: IPI each CPU after invalidating the I-cache for kernel mappings
When invalidating the instruction cache for a kernel mapping via
flush_icache_range(), it is also necessary to flush the pipeline for
other CPUs so that instructions fetched into the pipeline before the
I-cache invalidation are discarded. For example, if module 'foo' is
unloaded and then module 'bar' is loaded into the same area of memory,
a CPU could end up executing instructions from 'foo' when branching into
'bar' if these instructions were fetched into the pipeline before 'foo'
was unloaded.

Whilst this is highly unlikely to occur in practice, particularly as
any exception acts as a context-synchronizing operation, following the
letter of the architecture requires us to execute an ISB on each CPU
in order for the new instruction stream to be visible.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:54 +00:00
Will Deacon
c13613150c
arm64: percpu: Fix LSE implementation of value-returning pcpu atomics
Commit 959bf2fd03b5 ("arm64: percpu: Rewrite per-cpu ops to allow use of
LSE atomics") introduced alternative code sequences for the arm64 percpu
atomics, so that the LSE instructions can be patched in at runtime if
they are supported by the CPU.

Unfortunately, when patching in the LSE sequence for a value-returning
pcpu atomic, the argument registers are the wrong way round. The
implementation of this_cpu_add_return() therefore ends up adding
uninitialised stack to the percpu variable and returning garbage.

As it turns out, there aren't very many users of the value-returning
percpu atomics in mainline and we only spotted this due to a failure in
the kprobes selftests. In this case, when attempting to single-step over
the out-of-line instruction slot, the debug monitors would not be
enabled because calling this_cpu_inc_return() on the kernel debug
monitor refcount would fail to detect the transition from 0. We would
consequently execute past the slot and take an undefined instruction
exception from the kernel, resulting in a BUG:

 | kernel BUG at arch/arm64/kernel/traps.c:421!
 | PREEMPT SMP
 | pc : do_undefinstr+0x268/0x278
 | lr : do_undefinstr+0x124/0x278
 | Process swapper/0 (pid: 1, stack limit = 0x(____ptrval____))
 | Call trace:
 |  do_undefinstr+0x268/0x278
 |  el1_undef+0x10/0x78
 |  0xffff00000803c004
 |  init_kprobes+0x150/0x180
 |  do_one_initcall+0x74/0x178
 |  kernel_init_freeable+0x188/0x224
 |  kernel_init+0x10/0x100
 |  ret_from_fork+0x10/0x1c

Fix the argument order to get the value-returning pcpu atomics working
correctly when implemented using the LSE instructions.

Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:54 +00:00
Will Deacon
9dc172eec7
arm64: percpu: Rewrite per-cpu ops to allow use of LSE atomics
Our percpu code is a bit of an inconsistent mess:

  * It rolls its own xchg(), but reuses cmpxchg_local()
  * It uses various different flavours of preempt_{enable,disable}()
  * It returns values even for the non-returning RmW operations
  * It makes no use of LSE atomics outside of the cmpxchg() ops
  * There are individual macros for different sizes of access, but these
    are all funneled through a switch statement rather than dispatched
    directly to the relevant case

This patch rewrites the per-cpu operations to address these shortcomings.
Whilst the new code is a lot cleaner, the big advantage is that we can
use the non-returning ST- atomic instructions when we have LSE.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Danny Lin
99f3d060de
Revert "arm64: percpu: Initialize the ret variable for default case"
This reverts commit 99836317ce2f622e1e70d29770a048c12848c765.

Revert CAF's mutant pick in favor of a fresh backport from mainline.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Will Deacon
4f6dcbb9ee
arm64: Avoid masking "old" for LSE cmpxchg() implementation
The CAS instructions implicitly access only the relevant bits of the "old"
argument, so there is no need for explicit masking via type-casting as
there is in the LL/SC implementation.

Move the casting into the LL/SC code and remove it altogether for the LSE
implementation.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Will Deacon
bedcdcbc4f
arm64: cmpxchg: Include linux/compiler.h in asm/cmpxchg.h
We need linux/compiler.h for unreachable(), so #include it here.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Will Deacon
d0692e3b62
arm64: move percpu cmpxchg implementation from cmpxchg.h to percpu.h
We want to avoid pulling linux/preempt.h into cmpxchg.h, since that can
introduce a circular dependency on linux/bitops.h. linux/preempt.h is
only needed by the per-cpu cmpxchg implementation, which is better off
alongside the per-cpu xchg implementation in percpu.h, so move it there
and add the missing #include.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Will Deacon
c0b96259eb
arm64: cmpxchg: Include build_bug.h instead of bug.h for BUILD_BUG
Having asm/cmpxchg.h pull in linux/bug.h is problematic because this
ends up pulling in the atomic bitops which themselves may be built on
top of atomic.h and cmpxchg.h.

Instead, just include build_bug.h for the definition of BUILD_BUG.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Will Deacon
e7f5cdd83f
arm64: lse: Include compiler_types.h and export.h for out-of-line LL/SC
When the LL/SC atomics are moved out-of-line, they are annotated as
notrace and exported to modules. Ensure we pull in the relevant include
files so that these macros are defined when we need them.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Will Deacon
34cb0b8d88
arm64: lse: Pass -fomit-frame-pointer to out-of-line ll/sc atomics
In cases where x30 is used as a temporary in the out-of-line ll/sc atomics
(e.g. atomic_fetch_add), the compiler tends to emit a full stack frame,
which includes pointing x29 at the new frame.

Since these things aren't traceable anyway, we can pass -fomit-frame-pointer
to reduce the work when spilling. Since this is incompatible with -pg, we
also remove that from the CFLAGS for this file.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Danny Lin
aaffc31a93
configs: surya: Enable optimized inlining
TODO: benchmark

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Masahiro Yamada
7c51ad253f
compiler: Allow all arches to enable CONFIG_OPTIMIZE_INLINING
Commit 60a3cdd06394 ("x86: add optimized inlining") introduced
CONFIG_OPTIMIZE_INLINING, but it has been available only for x86.

The idea is obviously arch-agnostic.  This commit moves the config entry
from arch/x86/Kconfig.debug to lib/Kconfig.debug so that all
architectures can benefit from it.

This can make a huge difference in kernel image size especially when
CONFIG_OPTIMIZE_FOR_SIZE is enabled.

For example, I got 3.5% smaller arm64 kernel for v5.1-rc1.

  dec       file
  18983424  arch/arm64/boot/Image.before
  18321920  arch/arm64/boot/Image.after

This also slightly improves the "Kernel hacking" Kconfig menu as
e61aca5158a8 ("Merge branch 'kconfig-diet' from Dave Hansen") suggested;
this config option would be a good fit in the "compiler option" menu.

Link: http://lkml.kernel.org/r/20190423034959.13525-12-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[kdrag0n: Backported to k4.14]
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:53 +00:00
Adithya R
24dead564d
configs: surya: Actually enable RELR relocations
* tool support must be declared before this can be enabled

Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:52 +00:00
Danny Lin
8ecb879a9e
arm64: Allow PC-relative literal loads if 843419 fix is off
Such relocations are fine if the erratum 843419 fix is disabled.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:52 +00:00
Danny Lin
0277f773df
configs: surya: Enable ThinLTO optimizations
ThinLTO is actually quite fast on multi-threaded systems now and doesn't
increase build times by much on my Threadripper 3960X system.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:52 +00:00
Sami Tolvanen
cfa39fb099
ANDROID: kbuild: avoid excessively long argument lists
With LTO, modules with a large number of compilation units may end
up exceeding the for loop argument list in the shell. Reduce the
probability for this happening by including only the modules that have
exported symbols.

Bug: 150234396
Change-Id: I4a289aff47e1444aca28d1bd00b125628f39bcd5
Suggested-by: Hsiu-Chang Chen <hsiuchangchen@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
[panchajanya1999] : backported to 4.14-android
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:52 +00:00
Danny Lin
ee4ab843dd
arm64: Tweak linker flags to yield a smaller Image with LLD
Currently, there is a regression of 689 KiB in Image.gz's size when the
kernel is linked with LLD. This is reduced to 213 KiB when we use -pie
rather than -shared when invoking the linker.

Unfortunately, ld.bfd dislikes this change and regresses in size by 163
KiB with -pie as compared to using -shared. To address this problem, we
add checks so that -pie is used with LLD and -shared is used with
ld.bfd. That way, both linkers are able to perform their best.

List of Image.gz sizes:
  ld.bfd -shared: 10,066,988 bytes
  ld.bfd -pie:    10,230,316 bytes
  LLD -shared:    10,796,872 bytes
  LLD -pie:       10,280,168 bytes

Test: kernel compiles and boots with both ld.bfd and LLD
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:52 +00:00
Danny Lin
abab3bed5b
Makefile: Use O3 optimization level for Clang LTO
Signed-off-by: Adithya R <gh0strider.2k18.reborn@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-03-19 07:30:52 +00:00