SBalance's per-CPU statistic of new interrupts is inherently flawed in
that it cannot track IRQ migrations that occur in between balance periods.
As a result, SBalance can observe an inaccurate number of new interrupts
for a CPU, which hurts its balancing decisions.
Furthermore, SBalance incorrectly assumes that IRQs are affined where
SBalance last placed them, which breaks SBalance entirely when the
assumption doesn't hold true.
As it turns out, it can be quite common to change an IRQ's affinity and
observe a successful return value even though the IRQ didn't actually
move. This is observed on ARM's GICv3 at the very least, and it results
in SBalance never moving such an IRQ again because SBalance always thinks
it has zero new interrupts.
Since we can't trust irqchip drivers or hardware, gather IRQ statistics
directly in order to get the true number of new interrupts for each CPU and
the actual affinity of each IRQ based on the last CPU it fired upon.
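A minimal sketch of the idea (the prev[] bookkeeping array is
hypothetical; kstat_irqs_cpu() is the kernel's own per-CPU IRQ counter
accessor):

  #include <linux/kernel_stat.h>

  /*
   * Number of new interrupts an IRQ delivered to a CPU since the last
   * balance run, read straight from the kernel's IRQ accounting rather
   * than from anything an irqchip driver reports.
   */
  static unsigned int sbalance_new_irqs(unsigned int irq, unsigned int cpu,
                                        unsigned int *prev)
  {
          unsigned int total = kstat_irqs_cpu(irq, cpu);
          unsigned int delta = total - prev[cpu];

          prev[cpu] = total;
          return delta;
  }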
Change-Id: Ic846adac244a0873c4502987e0904b552ab31f22
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
The atomic cpumask_clear_cpu() isn't needed. Use __cpumask_clear_cpu()
instead as a micro-optimization, and for clarity.
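For reference, the two variants differ only in atomicity (a sketch, not
the actual call site):

  /* atomic: uses clear_bit(), a locked read-modify-write */
  cpumask_clear_cpu(cpu, mask);

  /* non-atomic: uses __clear_bit(); safe when nothing else can touch
   * the mask concurrently, as is the case here */
  __cpumask_clear_cpu(cpu, mask);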
Change-Id: I17d168814c4b96557c8a9f986c2c5be8e18be26b
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
SBalance is designed to poll periodically to balance IRQs, but it
shouldn't kick CPUs out of idle to do so, because idle CPUs clearly
aren't processing interrupts.
Open code a freezable wait that uses a deferrable timer in order to prevent
SBalance from waking up idle CPUs when there is little interrupt traffic.
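A sketch of what the open-coded wait could look like (names are
illustrative, not SBalance's actual code); TIMER_DEFERRABLE is what
keeps an idle CPU from being woken just to run the balancer:

  struct sbalance_timer {
          struct timer_list timer;
          struct task_struct *task;
  };

  static void sbalance_timer_fn(struct timer_list *t)
  {
          struct sbalance_timer *st = from_timer(st, t, timer);

          wake_up_process(st->task);
  }

  static void sbalance_wait(unsigned long delay_jiffies)
  {
          struct sbalance_timer st = { .task = current };

          set_current_state(TASK_INTERRUPTIBLE);

          /* Deferrable: fires only once the CPU is awake anyway */
          timer_setup_on_stack(&st.timer, sbalance_timer_fn, TIMER_DEFERRABLE);
          st.timer.expires = jiffies + delay_jiffies;
          add_timer(&st.timer);

          /* Freezable sleep so the kthread doesn't block system suspend */
          freezer_do_not_count();
          schedule();
          freezer_count();

          del_timer_sync(&st.timer);
          destroy_timer_on_stack(&st.timer);
  }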
Change-Id: I5f796a4590801c9a5935ca7ea8c966ca281620c7
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Excluded CPUs are left out of IRQ balancing with the intention that those
CPUs shouldn't really be processing interrupts, and thus shouldn't have
IRQs moved to them. However, SBalance completely ignores excluded CPUs,
which can cause them to end up with a disproportionate amount of interrupt
traffic that SBalance won't spread out. An easy example of this is when
CPU0 is an excluded CPU, since CPU0 ends up with all interrupts affined to
it by default on arm64.
Allow SBalance to move IRQs off of excluded CPUs so that they cannot slip
under the radar and pile up on an excluded CPU, like when CPU0 is excluded.
Change-Id: I392a058ea8cf7672bfea39ff9525bf6b7c52a062
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This is a simple IRQ balancer that polls every X milliseconds and moves
IRQs from the most interrupt-heavy CPU to the least interrupt-heavy CPUs
until the heaviest CPU is no longer the heaviest. Per balance run, IRQs
are only moved away from a single source CPU, though they may be spread
across any number of destination CPUs.
Balancing is skipped if the gap between the most interrupt-heavy CPU and
the least interrupt-heavy CPU is below the configured threshold of
interrupts.
The heaviest IRQs are targeted for migration in order to reduce the number
of IRQs to migrate. If moving an IRQ would reduce overall balance, then it
won't be migrated.
The most interrupt-heavy CPU is calculated by scaling the number of new
interrupts on that CPU to the CPU's current capacity. This way, interrupt
heaviness takes into account factors such as thermal pressure and time
spent processing interrupts rather than just the sheer number of them.
This also makes SBalance aware of CPU asymmetry, where CPUs with
different performance capacities are balanced proportionally.
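A rough sketch of the scaling (illustrative; the real code may fold in
thermal pressure and IRQ time differently):

  /*
   * Weigh a CPU's new interrupt count against its current capacity so
   * that a throttled or busy CPU counts as "heavier" per interrupt.
   * capacity_curr is the CPU's current capacity, out of
   * SCHED_CAPACITY_SCALE, after frequency scaling and thermal pressure.
   */
  static unsigned long sbalance_cpu_load(unsigned long new_irqs,
                                         unsigned long capacity_curr)
  {
          return new_irqs * SCHED_CAPACITY_SCALE / max(capacity_curr, 1UL);
  }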
Change-Id: Ie40c87ca357814b9207726f67e2530fffa7dd198
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
An IRQ affinity notifier getting overwritten can point to some annoying
issues which need to be resolved, like multiple pm_qos objects being
registered to the same IRQ. Print out a warning when this happens to aid
debugging.
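A sketch of such a warning in irq_set_affinity_notifier(), where the
previous notifier is swapped out as old_notify:

  /* Overwriting an existing notifier usually indicates a bug, e.g.
   * two pm_qos objects registered against the same IRQ. */
  if (old_notify && notify)
          pr_warn("%s: affinity notifier for IRQ %d overwritten\n",
                  __func__, irq);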
Change-Id: I087a6ea7472fa7ba45bdb02efeae25af5c664950
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
On ARM, IRQs are executed on the first CPU inside the affinity mask, so
setting an affinity mask with more than one CPU set is deceptive and
causes issues with pm_qos. To fix this, only set the CPU0 bit inside the
affinity mask, since that's where IRQs will run by default.
This is a follow-up to "kernel: Don't allow IRQ affinity masks to have
more than one CPU".
Change-Id: Ib6ef803ab866686c30e1aa1d06f98692ee39ed6c
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Even with an affinity mask that has multiple CPUs set, IRQs always run
on the first CPU in their affinity mask. Drivers that register an IRQ
affinity notifier (such as pm_qos) will therefore have an incorrect
assumption of where an IRQ is affined.
Fix the IRQ affinity mask deception by forcing it to only contain one
set CPU.
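A minimal sketch of the trimming (illustrative helper, not the patch's
exact code):

  /*
   * Reduce a requested affinity mask to just its first CPU, since on
   * ARM that is the only CPU the IRQ will actually fire on.
   */
  static void irq_trim_affinity(struct cpumask *mask)
  {
          unsigned int first = cpumask_first(mask);

          cpumask_clear(mask);
          cpumask_set_cpu(first, mask);
  }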
Change-Id: I212ff578f731ee78fabb8f63e49ef0b96c286521
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Change-Id: I346e0c62073e710021f8f664f79d8ebef0436cbd
Signed-off-by: Samuel Pascua <pascua.samuel.14@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Change-Id: Id432ab10f7acb00ad2d1bb36400504584629a2b6
Signed-off-by: Samuel Pascua <pascua.samuel.14@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Change-Id: I77e5663fa00afba2211b52997e007a0f2e6364e2
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Change-Id: If892e9c33656b7f829d2adb3d7228ac12313dd2c
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
The idle load balancer (ILB) is kicked whenever a task is misfit, meaning
that the task doesn't fit on its CPU (i.e., fits_capacity() == false).
Since CASS makes no attempt to place tasks such that they'll fit on the CPU
they're placed upon, the ILB works harder to correct this and rebalances
misfit tasks onto a CPU with sufficient capacity.
By fighting the ILB like this, CASS degrades both energy efficiency and
performance.
Play nicely with the ILB by trying to place tasks onto CPUs that fit.
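For reference, fits_capacity() from kernel/sched/fair.c requires roughly
20% headroom; the candidate filter below is illustrative:

  #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)

  /* Prefer CPUs where the task fits, so the ILB isn't kicked to fix a
   * misfit right after placement: */
  bool fits = fits_capacity(task_util_est(p), capacity_of(cpu));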
Change-Id: I317a3f19b83400d4b55d35d4a51e88268d0399c1
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
When all CPUs available to a uclamp'd process are thermal throttled, it is
possible for them to be throttled below the uclamp minimum requirement. In
this case, CASS only considers uclamp when it compares relative utilization
and nowhere else; i.e., CASS essentially ignores the most important aspect
of uclamp.
Fix it so that CASS tries to honor uclamp even when no CPUs available to a
uclamp'd process are capable of fully meeting the uclamp minimum.
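A hedged sketch of the fallback (not CASS's actual code): if no candidate
can meet the uclamp floor, pick the one that comes closest.

  unsigned long best_cap = 0;
  int best_cpu = -1, cpu;

  for_each_cpu(cpu, cpus) {
          unsigned long cap = capacity_of(cpu); /* throttled capacity */

          if (cap >= uclamp_min)
                  return cpu;     /* meets the floor outright */

          /* remember the CPU that comes closest to honoring uclamp */
          if (cap > best_cap) {
                  best_cap = cap;
                  best_cpu = cpu;
          }
  }
  return best_cpu;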
Change-Id: I93885cd7a94502c58a9e96eb43bb00ef01d15988
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
When CPUs are thermal throttled, CASS tries to spread load such that their
resulting P-state is scaled relatively to their _throttled_ maximum
capacity, rather than their original capacity.
As a result, throttled CPUs are unfairly under-utilized, causing other CPUs
to receive the extra burden and thus run at a disproportionately higher
P-state relative to the throttled CPUs. This not only hurts performance,
but also greatly diminishes energy efficiency since it breaks CASS's basic
load balancing principle.
To fix this, some convoluted logic is required in order to make CASS aware
of a CPU's throttled and non-throttled capacity. The non-throttled capacity
is used for the fundamental relative utilization comparison, while the
throttled capacity is used in conjunction to ensure a throttled CPU isn't
accidentally overloaded as a result.
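A sketch of the two-capacity comparison (struct and logic are
illustrative, not CASS's actual code):

  struct cass_cpu_cand {
          int cpu;
          unsigned long util;     /* CPU utilization incl. the task */
          unsigned long max_cap;  /* original, non-throttled capacity */
          unsigned long thr_cap;  /* thermally throttled capacity */
  };

  /*
   * a beats b when its utilization relative to *original* capacity is
   * lower (cross-multiplied to avoid division), unless taking the task
   * would overload a's *throttled* capacity.
   */
  static bool cass_cand_better(const struct cass_cpu_cand *a,
                               const struct cass_cpu_cand *b)
  {
          if (a->util >= a->thr_cap && b->util < b->thr_cap)
                  return false;

          return a->util * b->max_cap < b->util * a->max_cap;
  }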
Change-Id: I2cdabc4aa88e724252886c15040eabf40ab9150e
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Calling smp_processor_id() can be expensive depending on how an arch
implements it, so avoid calling it more than necessary.
Use the raw variant too since this code is always guaranteed to run with
preemption disabled.
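The pattern, in short:

  /* Cache the CPU number once. The raw variant skips the preemption
   * debug check, which is redundant here because this path always runs
   * with preemption disabled. */
  int this_cpu = raw_smp_processor_id();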
Change-Id: If96aeeb0aeb9f0c1cb2ebf9dcf31ced04ebe135c
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
For synchronized wakes, the waker's CPU should only be treated as idle if
there aren't any other running tasks on that CPU. This is because, for
synchronized wakes, it is assumed that the waker will immediately go to
sleep after waking the wakee; therefore, if there aren't any other tasks
running on the waker's CPU, it'll go idle and should be treated as such to
improve task placement.
However, this optimization is only valid when there aren't any other
tasks running on the waker's CPU.
Fix it by ensuring that the waker is the only task running on its CPU.
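A sketch of the corrected condition (rq field names are the scheduler's
own; the surrounding logic is illustrative):

  /* Treat the waker's CPU as idle only if the waker is the sole
   * running task there; it will sleep right after the wakeup. */
  if (sync && cpu_rq(this_cpu)->nr_running == 1)
          treat_waker_cpu_as_idle = true;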
Change-Id: I03cfd16d423cc920c103b8734b6b8a9089a9e59c
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Uclamp is designed to specify a process' CPU performance requirement scaled
as a CPU capacity value. It simply denotes the process' requirement for the
CPU's raw performance and thus P-state.
CASS currently treats uclamp as a CPU load value however, producing wildly
suboptimal CPU placement decisions for tasks which use uclamp. This hurts
performance and, even worse, massively hurts energy efficiency, with CASS
sometimes yielding power consumption that is a few times higher than EAS.
Since uclamp inherently throws a wrench into CASS's goal of keeping
relative P-states as low as possible across all CPUs, making it cooperate
with CASS requires a multipronged approach.
Make the following three changes to fix the uclamp task placement issue:
1. Treat uclamp as a CPU performance value rather than a CPU load value.
2. Clamp a CPU's utilization to the task's uclamp floor in order to keep
relative P-states as low as possible across all CPUs.
3. Consider preferring a non-idle CPU for uclamped tasks to avoid pushing
up the P-state of more than one CPU when there are multiple concurrent
uclamped tasks.
This fixes CASS's massive energy efficiency and performance issues when
uclamp is used.
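A sketch of points 1 and 2 (p is the waking task, cpu the candidate),
using the kernel's real uclamp accessor but illustrative surrounding
logic:

  /* uclamp_min is a performance floor, not extra load: clamp the CPU's
   * utilization up to it so the relative-P-state comparison sees the
   * frequency this task will force, and nothing more. */
  unsigned long uc_min = uclamp_eff_value(p, UCLAMP_MIN);
  unsigned long util = max(cpu_util(cpu), uc_min);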
Change-Id: Ib274ceecfbbe9c2eeb1738f97029e1f4cbc68ec0
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
RT tasks aren't placed on CPUs in a load-balanced manner, much less an
energy efficient one. On systems which contain many RT tasks and/or IRQ
threads, energy efficiency and throughput are diminished significantly by
the default RT runqueue selection scheme which targets minimal latency.
In practice, performance is actually improved by spreading RT tasks fairly,
despite the small latency impact. Additionally, energy efficiency is
significantly improved since the placement of all tasks benefits from
energy-efficient runqueue selection, rather than just CFS tasks.
Perform runqueue selection for RT tasks in CASS to significantly improve
energy efficiency and overall performance.
Change-Id: Ie551296e1034baa2dfc2bb7f0191ca95f5abc639
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Move `curr` and `idle_state` to within the loop's scope for better
readability. Also, leave a comment about `curr->cpu` to make it clear that
`curr->cpu` must be initialized within the loop in order for `best->cpu` to
be valid.
Change-Id: I1244ac06d62c172f46dbf337e7bb95758329a188
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
When no candidate CPUs are idle, CASS would keep `cidx` unchanged, and thus
`best == curr` would always be true. As a result, since the empty candidate
slot never changes, the current candidate `curr` always overwrites the best
candidate `best`. This causes the last valid CPU to always be selected by
CASS when no CPUs are idle (i.e., under heavy load).
Fix it by ensuring that the CPU loop in cass_best_cpu() flips the free
candidate index after the first candidate CPU is evaluated.
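A hedged sketch of the two-slot candidate scheme and the fix:

  struct cass_cpu_cand cands[2], *best = cands, *curr;
  int cidx = 0, cpu;

  for_each_cpu(cpu, cpus) {
          curr = &cands[cidx];
          /* ... evaluate curr for this cpu ... */

          if (cass_cand_better(curr, best))      /* hypothetical comparator */
                  best = curr;

          /* The fix: always point cidx at the slot *not* holding best,
           * so the next curr can never overwrite it. */
          cidx = (best == &cands[0]) ? 1 : 0;
  }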
Change-Id: Id1e371c0fe6a2e6321f1c9f68a47e4a26c9a0cba
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This is empirically observed to yield good performance with reduced power
consumption. With "cpufreq: schedutil: Ignore rate limit when scaling up
with FIE present", this only affects frequency reductions when FIE is
present, since there is no rate limit applied when scaling up.
Change-Id: I1bff1f007f06e67b672877107c9685b6fb83647a
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
When schedutil disregards a frequency transition due to the transition rate
limit, there is no guaranteed deadline as to when the frequency transition
will actually occur after the rate limit expires. For instance, depending
on how long a CPU spends in a preempt/IRQs disabled context, a rate-limited
frequency transition may be delayed indefinitely, until said CPU reaches
the scheduler again. This also hurts tasks boosted via UCLAMP_MIN.
For frequency transitions _down_, this only poses a theoretical loss of
energy savings since a CPU may remain at a higher frequency than necessary
for an indefinite period beyond the rate limit expiry.
For frequency transitions _up_, however, this poses a significant hit to
performance when a CPU is stuck at an insufficient frequency for an
indefinitely long time. In latency-sensitive and bursty workloads
especially, a missed frequency transition up can result in a significant
performance loss due to a CPU operating at an insufficient frequency for
too long.
When support for the Frequency Invariant Engine (FIE) _isn't_ present, a
rate limit is always required for the scheduler to compute CPU utilization
with some semblance of accuracy: any frequency transition that occurs
before the previous transition latches would result in the scheduler not
knowing the frequency a CPU is actually operating at, thereby trashing the
computed CPU utilization.
However, when FIE support _is_ present, there's no technical requirement to
rate limit all frequency transitions to a cpufreq driver's reported
transition latency. With FIE, the scheduler's CPU utilization tracking is
unaffected by any frequency transitions that occur before the previous
frequency is latched.
Therefore, ignore the frequency transition rate limit when scaling up on
systems where FIE is present. This guarantees that transitions to a higher
frequency cannot be indefinitely delayed, since they simply cannot be
delayed at all.
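One way to express the rule (a sketch; the real sugov_should_update_freq()
takes different parameters, but arch_scale_freq_invariant() is the
kernel's actual FIE predicate):

  static bool sugov_should_update_freq(struct sugov_policy *sg_policy,
                                       u64 time, unsigned int next_freq)
  {
          s64 delta_ns = time - sg_policy->last_freq_update_time;

          /* With FIE, upward transitions are never rate limited */
          if (arch_scale_freq_invariant() && next_freq > sg_policy->next_freq)
                  return true;

          return delta_ns >= sg_policy->freq_update_delay_ns;
  }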
Change-Id: I0dc5c6c710c10c63b7fc69970db044982de2a2d7
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
A redundant frequency update is only truly needed when there is a policy
limits change with a driver that specifies CPUFREQ_NEED_UPDATE_LIMITS.
In spite of that, drivers specifying CPUFREQ_NEED_UPDATE_LIMITS receive a
frequency update _all the time_, not just for a policy limits change,
because need_freq_update is never cleared.
Furthermore, ignore_dl_rate_limit()'s usage of need_freq_update also leads
to a redundant frequency update, regardless of whether or not the driver
specifies CPUFREQ_NEED_UPDATE_LIMITS, when the next chosen frequency is the
same as the current one.
Fix the superfluous updates by only honoring CPUFREQ_NEED_UPDATE_LIMITS
when there's a policy limits change, and clearing need_freq_update when a
requisite redundant update occurs.
This is neatly achieved by moving up the CPUFREQ_NEED_UPDATE_LIMITS test
and instead setting need_freq_update to false in sugov_update_next_freq().
Change-Id: Iedd47851eabe5a12ed3255b84cd0468da2fbbc80
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Rearrange a conditional to make it more straightforward.
Change-Id: I1c9d793cac29bc5a2fdc047ac4c01bba5044489e
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
The cpufreq policy's frequency limits (min/max) can get changed at any
point of time, while schedutil is trying to update the next frequency.
Though the schedutil governor has necessary locking and support in place
to make sure we don't miss any of those updates, there is a corner case
where the governor will find that the CPU is already running at the
desired frequency and so may skip an update.
For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
and is currently running at 1 GHz. Schedutil tries to update the
frequency to 1.2 GHz; during this time, the policy limits get changed to
policy->min = 1.4 GHz. As schedutil (and the cpufreq core) does clamp the
frequency at various instances, we will eventually set the frequency to
1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq.
Now let's say the policy limits get changed back at this time with
policy->min as 1 GHz. The next time schedutil is invoked by the
scheduler, we will reevaluate the next frequency (because
need_freq_update will get set due to the limits change event) and let's
say we want to set the frequency to 1.2 GHz again. At this point
sugov_update_next_freq() will find next_freq == current_freq and
will abort the update, while the CPU actually runs at 1.4 GHz.
Until now need_freq_update was used as a flag to indicate that the
policy's frequency limits have changed, and that we should consider the
new limits while reevaluating the next frequency.
This patch fixes the above mentioned issue by extending the purpose of
the need_freq_update flag. If this flag is set now, the schedutil
governor will not try to abort a frequency change even if next_freq ==
current_freq.
As similar behavior is required in the case of the
CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
set to false if that flag is set for the driver.
We also don't need to consider the need_freq_update flag in
sugov_update_single() anymore to handle the special case of busy CPU, as
we won't abort a frequency update anymore.
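After this change (with the rearrangement noted in the tag below), the
helper looks roughly like:

  static bool sugov_update_next_freq(struct sugov_policy *sg_policy,
                                     u64 time, unsigned int next_freq)
  {
          if (sg_policy->need_freq_update)
                  sg_policy->need_freq_update =
                          cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
          else if (sg_policy->next_freq == next_freq)
                  return false;

          sg_policy->next_freq = next_freq;
          sg_policy->last_freq_update_time = time;

          return true;
  }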
Change-Id: I699f1ce2bddf3ed35e29fc8ec549fa498654965b
Reported-by: zhuguangqing <zhuguangqing@xiaomi.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Rearrange code to avoid a branch ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Because sugov_update_next_freq() may skip a frequency update even if
the need_freq_update flag has been set for the policy at hand, policy
limits updates may not take effect as expected.
For example, if the intel_pstate driver operates in the passive mode
with HWP enabled, it needs to update the HWP min and max limits when
the policy min and max limits change, respectively, but that may not
happen if the target frequency does not change along with the limit
at hand. In particular, if the policy min is changed first, causing
the target frequency to be adjusted to it, and the policy max limit
is changed later to the same value, the HWP max limit will not be
updated to follow it as expected, because the target frequency is
still equal to the policy min limit and it will not change until
that limit is updated.
To address this issue, modify get_next_freq() to let the driver
callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
is set regardless of whether or not the new frequency to set is
equal to the previous one.
Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Change-Id: Icb43808d865a28e3ff4630cf3b65502fd1e3a466
Reported-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags()
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
During a cpufreq driver's registration, if the ->init() callback fails
for all the CPUs, then there is not much point in keeping the driver
around, as it will only be a source of unnecessary noise; for example,
the cpufreq core will try to suspend/resume a driver that never got
registered properly.
The removal of such a driver is avoided if the driver carries the
CPUFREQ_STICKY flag. This was added way back [1] in 2004 and perhaps no
one should ever need it now. A lot of drivers do set this flag, probably
because they just copied it from other drivers.
This was added earlier for some platforms [2] because their cpufreq
drivers were getting registered before the CPUs were registered with
subsys framework. And hence they used to fail.
The same isn't true anymore though. The current code flow in the kernel
is:
start_kernel()
 -> kernel_init()
    -> kernel_init_freeable()
       -> do_basic_setup()
          -> driver_init()
             -> cpu_dev_init()
                -> subsys_system_register() // For CPUs
          -> do_initcalls()
             -> cpufreq_register_driver()
Clearly, the CPUs will always get registered with the subsys framework
before any cpufreq driver can be probed. Remove the flag and update the
relevant drivers.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/include/linux/cpufreq.h?id=7cc9f0d9a1ab04cedc60d64fd8dcf7df224a3b4d # [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/arch/arm/mach-sa1100/cpu-sa1100.c?id=f59d3bbe35f6268d729f51be82af8325d62f20f5 # [2]
Change-Id: If2509516ec7fa7518750e23e0198614654dbf9b7
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Samuel Pascua <pascua.samuel.14@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Add a helper function to test the flags of the cpufreq driver in use
against a given flags mask.
In particular, this will be needed to test the
CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag in the schedutil
governor.
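The helper amounts to a one-line test against the registered driver's
flags:

  bool cpufreq_driver_test_flags(u16 flags)
  {
          return !!(cpufreq_driver->flags & flags);
  }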
Change-Id: I06c2ca2a224ffdfc51dbb4cee9f39f957cf2775b
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Generally, a cpufreq driver may need to update some internal upper
and lower frequency boundaries on policy max and min changes,
respectively, but currently this does not work if the target
frequency does not change along with the policy limit.
Namely, if the target frequency does not change along with the
policy min or max, the "target_freq == policy->cur" check in
__cpufreq_driver_target() prevents driver callbacks from being
invoked and they do not even have a chance to update the
corresponding internal boundary.
This particularly affects the "powersave" and "performance"
governors that always set the target frequency to one of the
policy limits and it never changes when the other limit is updated.
To allow the cpufreq drivers that need to update internal frequency
boundaries on policy limits changes to avoid this issue, introduce
a new driver flag, CPUFREQ_NEED_UPDATE_LIMITS, that (when set) will
neutralize the check mentioned above.
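With the flag in place, the check in __cpufreq_driver_target() becomes:

  if (target_freq == policy->cur &&
      !(cpufreq_driver->flags & CPUFREQ_NEED_UPDATE_LIMITS))
          return 0;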
Change-Id: Ic68823625a3a25475fc5150215f17ae17326bcdb
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Avoid calling the cpufreq driver's target() routine if the new frequency
is the same as the policy's current frequency.
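The check added to __cpufreq_driver_target():

  /* Save a driver call when the frequency wouldn't change anyway */
  if (target_freq == policy->cur)
          return 0;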
Change-Id: I84b8dff84764d31e58e2c76ced34de62459323f8
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This reverts commit ba350f071e4af45aedb698851123397e5d041fd2.
Change-Id: I3287279e0d8882356b6799bec3993b5b52c15a12
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
This supersedes the qcom governor and registers teo as the main cpuidle
governor.
Change-Id: I4221fad713c32efec9fdcf5a2d198c0af65c8cd7
Signed-off-by: Kazuki Hashimoto <kazukih@tuta.io>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Add a no_poll flag to teo_find_shallower_state() that will let the
function optionally not consider polling states.
This allows the caller to guard against the function inadvertently
resulting in TEO putting the CPU in a polling state when that
behaviour is undesirable.
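The upstream form of the helper (the k4.19 backport noted below may
differ, e.g. by using microseconds rather than nanoseconds):

  static int teo_find_shallower_state(struct cpuidle_driver *drv,
                                      struct cpuidle_device *dev, int state_idx,
                                      s64 duration_ns, bool no_poll)
  {
          int i;

          for (i = state_idx - 1; i >= 0; i--) {
                  if (dev->states_usage[i].disable ||
                      (no_poll && drv->states[i].flags & CPUIDLE_FLAG_POLLING))
                          continue;

                  state_idx = i;
                  if (drv->states[i].target_residency_ns <= duration_ns)
                          break;
          }
          return state_idx;
  }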
Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ Tashar02: Backport to k4.19 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Rename two local variables in teo_select() so that their names better
reflect their purpose.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ Tashar02: Backport to k4.19 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
There are three mistakes in the loop in teo_select() that is looking
for an alternative candidate idle state. First, it should walk all
of the idle states shallower than the current candidate one,
including all of the disabled ones, but it terminates after the first
enabled idle state. Second, it should not terminate its last step
if idle state 0 is disabled (which is related to the first issue).
Finally, it may return the current alternative candidate idle state
prematurely if the time span criterion is not met by the idle state
under consideration at the moment.
To address the issues mentioned above, make the loop in question walk
all of the idle states shallower than the current candidate idle state
all the way down to idle state 0 and rearrange the checks in it.
Fixes: 77577558f25d ("cpuidle: teo: Rework most recent idle duration values treatment")
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ Tashar02: Backport to k4.19 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Fix following coccicheck warning:
drivers/cpuidle/governors/teo.c:315:10-11: Unneeded semicolon
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ Tashar02: Backport to k4.19 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
The TEO (Timer Events Oriented) cpuidle governor uses several most
recent idle duration values for a given CPU to refine the idle state
selection in case the previous long-term trends have not been
followed recently and a new trend appears to be forming. That is
done by computing the average of the most recent idle duration
values falling below the time till the next timer event ("sleep
length"), provided that they are the majority of the most recent
idle duration values taken into account, and using it as the new
expected idle duration value.
However, idle state selection based on that value may not be optimal,
because the average does not really indicate which of the idle states
with target residencies less than or equal to it is likely to be the
best fit.
Thus, instead of computing the average, make the governor carry out
computations based on the distribution of the most recent idle
duration values among the bins corresponding to different idle
states. Namely, if the majority of the most recent idle duration
values taken into consideration are less than the current sleep
length (which means that the CPU is likely to wake up early), find
the idle state closest to the "candidate" one "matching" the sleep
length whose target residency is less than or equal to the majority
of the most recent idle duration values that have fallen below the
current sleep length (which means that it is likely to be "shallow
enough" this time).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ Tashar02: Backport to k4.19 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Two aspects of the current main idle state selection logic in the
TEO (Timer Events Oriented) cpuidle governor are quite questionable.
First of all, the "hits" and "misses" metrics used by it are only
updated for a given idle state if the time till the next timer event
("sleep length") is between the target residency of that state and
the target residency of the next one. Consequently, they are likely
to become stale if the sleep length tends to fall outside that interval,
which increases the likelihood of suboptimal idle state selection.
Second, the decision on whether or not to select the idle state
"matching" the sleep length is based on the metrics collected for
that state alone, whereas in principle the metrics collected for
the other idle states should be taken into consideration when that
decision is made. For example, if the measured idle duration is less
than the target residency of the idle state "matching" the sleep
length, then it is also less than the target residency of any deeper
idle state and that should be taken into account when considering
whether or not to select any of those states, but currently it is
not.
In order to address the above shortcomings, modify the main idle
state selection logic in the TEO governor to take the metrics
collected for all of the idle states into account when deciding
whether or not to select the one "matching" the sleep length.
Moreover, drop the "misses" metric that becomes redundant after the
above change and rename the "early_hits" metric to "intercepts" so
that its role is better reflected by its name (the idea being that
if a CPU wakes up earlier than indicated by the sleep length, then
it must be a result of a non-timer interrupt that "intercepts" the
CPU).
Also rename the states[] array in struct teo_cpu to state_bins[] to
avoid confusing it with the states[] array in struct cpuidle_driver, and
update the documentation to match the new code (and make it more
comprehensive while at it).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ Tashar02: Backport to k4.19, adapt variables and teo_middle_of_bin()
return type to microseconds as the unit of time ]
Co-authored-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Initialize local variables in teo_select() where they are declared.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ Tashar02: Backport to k4.19 ]
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Rename a local variable in teo_update() so that its purpose is better
reflected by its name and use one more local variable in the loop
over the CPU idle states in that function to make the code somewhat
easier to read.
No functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ Tashar02: Backport to k4.19 and use int data type for target_residency ]
Co-authored-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
Modify the TEO governor to take possible negative return values of
tick_nohz_get_next_hrtimer() into account by changing the data type
of some variables used by it to s64 which allows it to carry out
computations without potentially problematic data type conversions
into u64.
Also change the computations in teo_select() so that the negative
values themselves are handled in a natural way to avoid adding extra
negative value checks to that function.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ Tashar02: Backport to k4.19 and change data type from unsigned int to int to allow negative values ]
Co-authored-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>