* pm-sleep:
PM / Hibernate: Touch Soft Lockup Watchdog in rtree_next_node
PM / Hibernate: Remove the old memory-bitmap implementation
PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()
PM / Hibernate: Implement position keeping in radix tree
PM / Hibernate: Add memory_rtree_find_bit function
PM / Hibernate: Create a Radix-Tree to store memory bitmap
PM / sleep: fix kernel-doc warnings in drivers/base/power/main.c
* pm-cpuidle:
cpuidle: Remove time measurement in poll state
cpuidle: Remove manual selection of the multiple driver support
cpuidle: ladder governor - use macro instead of hardcoded value
cpuidle: big_little: Fix build error
cpuidle: menu governor - remove unused macro STDDEV_THRESH
cpuidle: fix permission for driver name sysfs node
cpuidle: move idle traces to cpuidle_enter_state()
When a memory bitmap is fully populated on a large memory
machine (several TB of RAM) it can take more than a minute
to walk through all bits. This causes the soft lockup
detector on these machine to report warnings.
Avoid this by touching the soft lockup watchdog in the
memory bitmap walking code.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The radix tree implementatio is proved to work the same as
the old implementation now. So the old implementation can be
removed to finish the switch to the radix tree for the
memory bitmaps.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The existing implementation of swsusp_free iterates over all
pfns in the system and checks every bit in the two memory
bitmaps.
This doesn't scale very well with large numbers of pfns,
especially when the bitmaps are not populated very densly.
Change the algorithm to iterate over the set bits in the
bitmaps instead to make it scale better in large memory
configurations.
Also add a memory_bm_clear_current() helper function that
clears the bit for the last position returned from the
memory bitmap.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add code to remember the last position that was requested in
the radix tree. Use it as a cache for faster linear walking
of the bitmap in the memory_bm_rtree_next_pfn() function
which is also added with this patch.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a function to find a bit in the radix tree for a given
pfn. Also add code to the memory bitmap wrapper functions to
use the radix tree together with the existing memory bitmap
implementation.
On read accesses compare the results of both bitmaps to make
sure the radix tree behaves the same way.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This patch adds the code to allocate and build the radix
tree to store the memory bitmap. The old data structure is
left in place until the radix tree implementation is
finished.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* acpi-pm:
ACPI / PM: Use ACPI_COMPANION() instead of ACPI_HANDLE()
ACPI / PM: Always enable wakeup GPEs when enabling device wakeup
ACPI / PM: Revork the handling of ACPI device wakeup notifications
PM: Create PM workqueue if runtime PM is not configured too
* acpi-sleep:
ACPI / sleep: Do not save NVS for new machines to accelerate S3
* acpi-button:
ACPI / button: Do not propagate wakeup-from-suspend events
Pull perf fixes from Thomas Gleixner:
"A bunch of fixes for perf and kprobes:
- revert a commit that caused a perf group regression
- silence dmesg spam
- fix kprobe probing errors on ia64 and ppc64
- filter kprobe faults from userspace
- lockdep fix for perf exit path
- prevent perf #GP in KVM guest
- correct perf event and filters"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
kprobes/x86: Don't try to resolve kprobe faults from userspace
perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
perf/x86/intel: Protect LBR and extra_regs against KVM lying
perf: Fix lockdep warning on process exit
perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
perf: Revert ("perf: Always destroy groups on exit")
The PM workqueue is going to be used by ACPI PM notify handlers
regardless of whether or not runtime PM is configured, so move
it out of #ifdef CONFIG_PM_RUNTIME.
Do that in three places in the ACPI device PM code.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After the introduction of freeze_ops it makes more sense to move
all of the platform suspend operations to separate functions that
each will do all of the necessary checks and choose the right
callback to execute istead of doing all that in the core code
which makes it generally harder to follow.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since the OPP layer is a kernel library which has been converted to be
directly selectable by its callers rather than user selectable and
requiring architectures to enable it explicitly the ARCH_HAS_OPP symbol
has become redundant and can be removed. Do so.
Signed-off-by: Mark Brown <broonie@linaro.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Nishanth Menon <nm@ti.com>
Acked-by: Rob Herring <robh@kernel.org>
Acked-by: Shawn Guo <shawn.guo@freescale.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The "uptime" trace clock added in:
commit 8aacf017b065a805d27467843490c976835eb4a5
tracing: Add "uptime" trace clock that uses jiffies
has wraparound problems when the system has been up more
than 1 hour 11 minutes and 34 seconds. It converts jiffies
to nanoseconds using:
(u64)jiffies_to_usecs(jiffy) * 1000ULL
but since jiffies_to_usecs() only returns a 32-bit value, it
truncates at 2^32 microseconds. An additional problem on 32-bit
systems is that the argument is "unsigned long", so fixing the
return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
system).
Avoid these problems by using jiffies_64 as our basis, and
not converting to nanoseconds (we do convert to clock_t because
user facing API must not be dependent on internal kernel
HZ values).
Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com
Cc: stable@vger.kernel.org # 3.10+
Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Simplify the sleep states sysfs interface /sys/power/state code by
redefining pm_states[] as an array of pointers to constant strings
such that only the entries corresponding to valid states are set.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull locking fixes from Thomas Gleixner:
"The locking department delivers:
- A rather large and intrusive bundle of fixes to address serious
performance regressions introduced by the new rwsem / mcs
technology. Simpler solutions have been discussed, but they would
have been ugly bandaids with more risk than doing the right thing.
- Make the rwsem spin on owner technology opt-in for architectures
and enable it only on the known to work ones.
- A few fixes to the lockdep userspace library"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER
locking/mutex: Disable optimistic spinning on some architectures
locking/rwsem: Reduce the size of struct rw_semaphore
locking/rwsem: Rename 'activity' to 'count'
locking/spinlocks/mcs: Micro-optimize osq_unlock()
locking/spinlocks/mcs: Introduce and use init macro and function for osq locks
locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead
locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node()
locking/rwsem: Allow conservative optimistic spinning when readers have lock
tools/liblockdep: Account for bitfield changes in lockdeps lock_acquire
tools/liblockdep: Remove debug print left over from development
tools/liblockdep: Fix comparison of a boolean value with a value of 2
Pull scheduler fix from Thomas Gleixner:
"Prevent a possible divide by zero in the debugging code"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix possible divide by zero in avg_atom() calculation
Pull timer fix from Thomas Gleixner:
"A single fix for a long standing issue in the alarm timer subsystem,
which was noticed recently when people finally started to use alarm
timers for serious work"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Fix bug where relative alarm timers were treated as absolute
Pull RCU fixes from Thomas Gleixner:
"Two RCU patches:
- Address a serious performance regression on open/close caused by
commit ac1bea85781e ("Make cond_resched() report RCU quiescent
states")
- Export RCU debug functions. Not a regression, but enablement to
address a serious recursion bug in the sl*b allocators in 3.17"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Reduce overhead of cond_resched() checks for RCU
rcu: Export debug_init_rcu_head() and and debug_init_rcu_head()
- Fix for a recently introduced NULL pointer dereference in the core
system suspend code occuring when platforms without ACPI attempt to
use the "freeze" sleep state from Zhang Rui.
- Fix for a recently introduced build warning in cpufreq headers from
Brian W Hart.
- Fix for a 3.13 cpufreq regression related to sysem resume that
triggers on some systems with multiple CPU clusters from Viresh Kumar.
- Fix for a 3.4 regression in request_firmware() resulting in
WARN_ON()s on some systems during system resume from Takashi Iwai.
- Revert of the ACPI video commit that changed the default value of
the video.brightness_switch_enabled command line argument to 0 as
it has been reported to break existing setups.
- ACPI device enumeration documentation update to take recent code
changes into account and make the documentation match the code again
from Darren Hart.
- Fixes for the sa1110, imx6q, kirkwood, and cpu0 cpufreq drivers
from Linus Walleij, Nicolas Del Piano, Quentin Armitage, Viresh Kumar.
- New ACPI video blacklist entry for HP ProBook 4540s from Hans de Goede.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTyHqVAAoJEILEb/54YlRxsV4P/0wiCc/VDxC614uCRcNVv0mH
4BtAvhV5Oe32M5MNYrDO4GefRqFSgz6xZ2B55ioTyNML2b7rcXJqlS4r6P6ESlog
ccwOSsEvMmmOpMRynoQJq9vGybPdhUiA8G4ABeC/Y57MfKEbS58dw9Fi/BXpII+A
RZOzHPNvzFExkw6bBxO3l66uDTeZPVZQOs5hLRSrxa1s99+E/N4aa99nE26YPR6b
U15cfuYMigfpziO3k4OtCka3sZYMO0XzVZjH62rJJBwh9G24uTzoL/lcOpLbZyaW
DGLm/3LoH46WGGIrkIiWKplzhiSQ0BeOzlK9bAX6L11UMLMO0DOtKoZl6H37bS/J
XLbSpdH22WGaF+QNWu3yPqLOpSZZxjYFQ0u8jDL00EzbIePpx4ynKbfCG2Q+EhcT
siDpXsC084rznWaYPKAyTitrhrqsWCzuI12BLO5AgJUvQ/jdyGgim8UWLp7AV1K7
Tc40MJdR7hUpJigg3+ORj88dYJvQXRYifbfTTbmQ0VEf9g5PXV5wLDvWY5Uo99SA
pYnY+jLcc13K7eDd+dm3MhSiW6H+jvzNq2WCFHqLzCG64Dj11HMHevhIRWEueekY
fF3JwqDByvj/f4PW8wwqlAnbNzhVEDzkJHHsCXj90KFCoBG3KIKU2G0ISwMbVZkJ
y9vflHb4Vh/WK9Puyffz
=kuyY
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"These are a few recent regression fixes, a revert of the ACPI video
commit I promised, a system resume fix related to request_firmware(),
an ACPI video quirk for one more Win8-oriented BIOS, an ACPI device
enumeration documentation update and a few fixes for ARM cpufreq
drivers.
Specifics:
- Fix for a recently introduced NULL pointer dereference in the core
system suspend code occuring when platforms without ACPI attempt to
use the "freeze" sleep state from Zhang Rui.
- Fix for a recently introduced build warning in cpufreq headers from
Brian W Hart.
- Fix for a 3.13 cpufreq regression related to sysem resume that
triggers on some systems with multiple CPU clusters from Viresh
Kumar.
- Fix for a 3.4 regression in request_firmware() resulting in
WARN_ON()s on some systems during system resume from Takashi Iwai.
- Revert of the ACPI video commit that changed the default value of
the video.brightness_switch_enabled command line argument to 0 as
it has been reported to break existing setups.
- ACPI device enumeration documentation update to take recent code
changes into account and make the documentation match the code
again from Darren Hart.
- Fixes for the sa1110, imx6q, kirkwood, and cpu0 cpufreq drivers
from Linus Walleij, Nicolas Del Piano, Quentin Armitage, Viresh
Kumar.
- New ACPI video blacklist entry for HP ProBook 4540s from Hans de
Goede"
* tag 'pm+acpi-3.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: make table sentinel macros unsigned to match use
cpufreq: move policy kobj to policy->cpu at resume
cpufreq: cpu0: OPPs can be populated at runtime
cpufreq: kirkwood: Reinstate cpufreq driver for ARCH_KIRKWOOD
cpufreq: imx6q: Select PM_OPP
cpufreq: sa1110: set memory type for h3600
ACPI / video: Add use_native_backlight quirk for HP ProBook 4540s
PM / sleep: fix freeze_ops NULL pointer dereferences
PM / sleep: Fix request_firmware() error at resume
Revert "ACPI / video: change acpi-video brightness_switch_enabled default to 0"
ACPI / documentation: Remove reference to acpi_platform_device_ids from enumeration.txt
On ia64 and ppc64, function pointers do not point to the
entry address of the function, but to the address of a
function descriptor (which contains the entry address and misc
data).
Since the kprobes code passes the function pointer stored
by NOKPROBE_SYMBOL() to kallsyms_lookup_size_offset() for
initalizing its blacklist, it fails and reports many errors,
such as:
Failed to find blacklist 0001013168300000
Failed to find blacklist 0001013000f0a000
[...]
To fix this bug, use arch_deref_entry_point() to get the
function entry address for kallsyms_lookup_size_offset()
instead of the raw function pointer.
Suzuki also pointed out that blacklist entries should also
be updated as well.
Reported-by: Tony Luck <tony.luck@gmail.com>
Fixed-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (for powerpc)
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: sparse@chrisli.org
Cc: Paul Mackerras <paulus@samba.org>
Cc: akataria@vmware.com
Cc: anil.s.keshavamurthy@intel.com
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: rdunlap@infradead.org
Cc: dl9pf@gmx.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-ia64@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/20140717114411.13401.2632.stgit@kbuild-fedora.novalocal
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I was cleaning out my INBOX and found two fixes from zhangwei from
a year ago that were lost in my mail. These fix an inconsistency between
trace_puts() and the way trace_printk() works. The reason this is
important to fix is because when trace_printk() doesn't have any
arguments, it turns into a trace_puts(). Not being able to enable a
stack trace against trace_printk() because it does not have any arguments
is quite confusing. Also, the fix is rather trivial and low risk.
While porting some changes to PowerPC I discovered that it still has
the function graph tracer filter bug that if you also enable stack tracing
the function graph tracer filter is ignored. I fixed that up.
Finally, Martin Lau, fixed a bug that would cause readers of the
ftrace ring buffer to block forever even though it was suppose to be
NONBLOCK.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTxfAHAAoJEKQekfcNnQGueDkH/jQ/FgNPw+GbKuNpZwWqMQds
ivGM8yb6NgmlDQx60oBOL9TduKN2LCqW3HWDoyzduSozuLUxpRTPWtVThKo0Q9Fp
FPHWewdrX7DwWkyEqeG7FAAhjqPedgEZoDOwy7BArzJYpHcpex42trDY4BhjPrlM
ESA+/M+Xm1vGCJzcNSqNHYE/mF/tHptdM1mlLZ3g9ql49MiTbQxEMbHX7lih24sv
AzNP1SlisYN5CkX4hbRGhCYkZdmdVZt/xiyBAQRWCGFx+jsdJbMXDJnevDCdU9Mv
GNtKwqsi1/GeE5nvKxZyQKsDEE98Pd8LyQ0kDqKpAfTutqNE4mpWJkCxUqTmE4Q=
=aFqO
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.16-rc5-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few more fixes for ftrace infrastructure.
I was cleaning out my INBOX and found two fixes from zhangwei from a
year ago that were lost in my mail. These fix an inconsistency
between trace_puts() and the way trace_printk() works. The reason
this is important to fix is because when trace_printk() doesn't have
any arguments, it turns into a trace_puts(). Not being able to enable
a stack trace against trace_printk() because it does not have any
arguments is quite confusing. Also, the fix is rather trivial and low
risk.
While porting some changes to PowerPC I discovered that it still has
the function graph tracer filter bug that if you also enable stack
tracing the function graph tracer filter is ignored. I fixed that up.
Finally, Martin Lau, fixed a bug that would cause readers of the
ftrace ring buffer to block forever even though it was suppose to be
NONBLOCK"
This also includes the fix from an earlier pull request:
"Oleg Nesterov fixed a memory leak that happens if a user creates a
tracing instance, sets up a filter in an event, and then removes that
instance. The filter allocates memory that is never freed when the
instance is destroyed"
* tag 'trace-fixes-v3.16-rc5-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Fix polling on trace_pipe
tracing: Add TRACE_ITER_PRINTK flag check in __trace_puts/__trace_bputs
tracing: Fix graph tracer with stack tracer on other archs
tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs
tracing: instance_rmdir() leaks ftrace_event_file->filter
Pull perf fixes from Ingo Molnar:
"Tooling fixes and an Intel PMU driver fixlet"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do not allow optimized switch for non-cloned events
perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
perf symbols: Get kernel start address by symbol name
perf tools: Fix segfault in cumulative.callchain report
Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER),
encapsulate the dependencies for rwsem optimistic spinning.
No logical changes here as it continues to depend on both
SMP and the XADD algorithm variant.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Acked-by: Jason Low <jason.low2@hp.com>
[ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com
Cc: aswin@hp.com
Cc: Chris Mason <clm@fb.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The optimistic spin code assumes regular stores and cmpxchg() play nice;
this is found to not be true for at least: parisc, sparc32, tile32,
metag-lock1, arc-!llsc and hexagon.
There is further wreckage, but this in particular seemed easy to
trigger, so blacklist this.
Opt in for known good archs.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: stable@vger.kernel.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two definitions of struct rw_semaphore, one in linux/rwsem.h
and one in linux/rwsem-spinlock.h.
For some reason they have different names for the initial field. This
makes it impossible to use C99 named initialization for
__RWSEM_INITIALIZER() -- or we have to duplicate that entire thing
along with the structure definitions.
The simpler patch is renaming the rwsem-spinlock variant to match the
regular rwsem.
This allows us to switch to C99 named initialization.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-bmrZolsbGmautmzrerog27io@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
proc_sched_show_task() does:
if (nr_switches)
do_div(avg_atom, nr_switches);
nr_switches is unsigned long and do_div truncates it to 32 bits, which
means it can test non-zero on e.g. x86-64 and be truncated to zero for
division.
Fix the problem by using div64_ul() instead.
As a side effect calculations of avg_atom for big nr_switches are now correct.
Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the unlock function of the cancellable MCS spinlock, the first
thing we do is to retrive the current CPU's osq node. However, due to
the changes made in the previous patch, in the common case where the
lock is not contended, we wouldn't need to access the current CPU's
osq node anymore.
This patch optimizes this by only retriving this CPU's osq node
after we attempt the initial cmpxchg to unlock the osq and found
that its contended.
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1405358872-3732-5-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, we initialize the osq lock by directly setting the lock's values. It
would be preferable if we use an init macro to do the initialization like we do
with other locks.
This patch introduces and uses a macro and function for initializing the osq lock.
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The cancellable MCS spinlock is currently used to queue threads that are
doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining
the lock would access and queue the local node corresponding to the CPU that
it's running on. Currently, the cancellable MCS lock is implemented by using
pointers to these nodes.
In this patch, instead of operating on pointers to the per-cpu nodes, we
store the CPU numbers in which the per-cpu nodes correspond to in atomic_t.
A similar concept is used with the qspinlock.
By operating on the CPU # of the nodes using atomic_t instead of pointers
to those nodes, this can reduce the overhead of the cancellable MCS spinlock
by 32 bits (on 64 bit systems).
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the per-cpu nodes structure for the cancellable MCS spinlock is
named "optimistic_spin_queue". However, in a follow up patch in the series
we will be introducing a new structure that serves as the new "handle" for
the lock. It would make more sense if that structure is named
"optimistic_spin_queue". Additionally, since the current use of the
"optimistic_spin_queue" structure are "nodes", it might be better if we
rename them to "node" anyway.
This preparatory patch renames all current "optimistic_spin_queue"
to "optimistic_spin_node".
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Chris Mason <clm@fb.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Link: http://lkml.kernel.org/r/1405358872-3732-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 4fc828e24cd9 ("locking/rwsem: Support optimistic spinning")
introduced a major performance regression for workloads such as
xfs_repair which mix read and write locking of the mmap_sem across
many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
than on 3.15 and using 20x more system CPU time.
Perf profiles indicate in some workloads that significant time can
be spent spinning on !owner. This is because we don't set the lock
owner when readers(s) obtain the rwsem.
In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
return false if there is no lock owner. The rationale is that if we
just entered the slowpath, yet there is no lock owner, then there is
a possibility that a reader has the lock. To be conservative, we'll
avoid spinning in these situations.
This patch reduced the total run time of the xfs_repair workload from
about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
back to close to the same performance as on 3.15.
Retesting of AIM7, which were some of the workloads used to test the
original optimistic spinning code, confirmed that we still get big
performance gains with optimistic spinning, even with this additional
regression fix. Davidlohr found that while the 'custom' workload took
a performance hit of ~-14% to throughput for >300 users with this
additional patch, the overall gain with optimistic spinning is
still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users.
Tested-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince reported that commit 15a2d4de0eab5 ("perf: Always destroy groups
on exit") causes a regression with grouped events. In particular his
read_group_attached.c test fails.
https://github.com/deater/perf_event_tests/blob/master/tests/bugs/read_group_attached.c
Because of the context switch optimization in
perf_event_context_sched_out() the 'original' event may end up in the
child process and when that exits the change in the patch in question
destroys the actual grouping.
Therefore revert that change and only destroy inherited groups.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-zedy3uktcp753q8fw8dagx7a@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even there is immediate data available. Otherwise, the following epoll and
read sequence will eventually hang forever:
1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever
~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
ring_buffer_poll_wait() returns immediately without adding poll_table,
which has poll_table->_qproc pointing to ep_poll_callback(), to its
wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
it has to call ep_poll_callback() because it is not in its wait queue.
Hence, block forever.
Other poll implementation seems to call poll_wait() unconditionally as the very
first thing to do. For example, tcp_poll() in tcp.c.
Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
Cc: stable@vger.kernel.org # 2.6.27
Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Martin Lau <kafai@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The TRACE_ITER_PRINTK check in __trace_puts/__trace_bputs is missing,
so add it, to be consistent with __trace_printk/__trace_bprintk.
Those functions are all called by the same function: trace_printk().
Link: http://lkml.kernel.org/p/51E7A7D6.8090900@huawei.com
Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Running my ftrace tests on PowerPC, it failed the test that checks
if function_graph tracer is affected by the stack tracer. It was.
Looking into this, I found that the update_function_graph_func()
must be called even if the trampoline function is not changed.
This is because archs like PowerPC do not support ftrace_ops being
passed by assembly and instead uses a helper function (what the
trampoline function points to). Since this function is not changed
even when multiple ftrace_ops are added to the code, the test that
falls out before calling update_function_graph_func() will miss that
the update must still be done.
Call update_function_graph_function() for all calls to
update_ftrace_function()
Cc: stable@vger.kernel.org # 3.3+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently trace option stacktrace is not applicable for
trace_printk with constant string argument, the reason is
in __trace_puts/__trace_bputs ftrace_trace_stack is missing.
In contrast, when using trace_printk with non constant string
argument(will call into __trace_printk/__trace_bprintk), then
trace option stacktrace is workable, this inconstant result
will confuses users a lot.
Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch fixes a NULL pointer dereference issue introduced by
commit 1f0b63866fc1 (ACPI / PM: Hold ACPI scan lock over the "freeze"
sleep state).
Fixes: 1f0b63866fc1 (ACPI / PM: Hold ACPI scan lock over the "freeze" sleep state)
Link: http://marc.info/?l=linux-pm&m=140541182017443&w=2
Reported-and-tested-by: Alexander Stein <alexander.stein@systec-electronic.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The commit [247bc037: PM / Sleep: Mitigate race between the freezer
and request_firmware()] introduced the finer state control, but it
also leads to a new bug; for example, a bug report regarding the
firmware loading of intel BT device at suspend/resume:
https://bugzilla.novell.com/show_bug.cgi?id=873790
The root cause seems to be a small window between the process resume
and the clear of usermodehelper lock. The request_firmware() function
checks the UMH lock and gives up when it's in UMH_DISABLE state. This
is for avoiding the invalid f/w loading during suspend/resume phase.
The problem is, however, that usermodehelper_enable() is called at the
end of thaw_processes(). Thus, a thawed process in between can kick
off the f/w loader code path (in this case, via btusb_setup_intel())
even before the call of usermodehelper_enable(). Then
usermodehelper_read_trylock() returns an error and request_firmware()
spews WARN_ON() in the end.
This oneliner patch fixes the issue just by setting to UMH_FREEZING
state again before restarting tasks, so that the call of
request_firmware() will be blocked until the end of this function
instead of returning an error.
Fixes: 247bc0374254 (PM / Sleep: Mitigate race between the freezer and request_firmware())
Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
Cc: 3.4+ <stable@vger.kernel.org> # 3.4+
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull cgroup fixes from Tejun Heo:
"Mostly fixes for the fallouts from the recent cgroup core changes.
The decoupled nature of cgroup dynamic hierarchy management
(hierarchies are created dynamically on mount but may or may not be
reused once unmounted depending on remaining usages) led to more
ugliness being added to kernfs.
Hopefully, this is the last of it"
* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: break kernfs active protection in cpuset_write_resmask()
cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
kernfs: introduce kernfs_pin_sb()
cgroup: fix mount failure in a corner case
cpuset,mempolicy: fix sleeping function called from invalid context
cgroup: fix broken css_has_online_children()
Pull workqueue fixes from Tejun Heo:
"Two workqueue fixes. Both are one liners. One fixes missing uevent
for workqueue files on sysfs. The other one fixes missing zeroing of
NUMA cpu masks which can lead to oopses among other things"
* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: zero cpumask of wq_numa_possible_cpumask on init
workqueue: fix dev_set_uevent_suppress() imbalance
idle_exit event is the first event after a core exits
idle state. So this should be traced before local irq
is ebabled. Likewise idle_entry is the last event before
a core enters idle state. This will ease visualising the
cpu idle state from kernel traces.
Signed-off-by: Sandeep Tripathy <sandeep.tripathy@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[rjw: Subject, rebase]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Sharvil noticed with the posix timer_settime interface, using the
CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
tried to specify a relative time timer, it would incorrectly be
treated as absolute regardless of the state of the flags argument.
This patch corrects this, properly checking the absolute/relative flag,
as well as adds further error checking that no invalid flag bits are set.
Reported-by: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Cc: stable <stable@vger.kernel.org> #3.0+
Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When hot-adding and onlining CPU, kernel panic occurs, showing following
call trace.
BUG: unable to handle kernel paging request at 0000000000001d08
IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
PGD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
[<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
[<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
[<ffffffff811926f1>] new_slab+0x91/0x300
[<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
[<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
[<ffffffff810a3c78>] ? load_balance+0x218/0x890
[<ffffffff8101a679>] ? sched_clock+0x9/0x10
[<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
[<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
[<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
[<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
[<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
[<ffffffff8105d0ec>] do_fork+0xbc/0x360
[<ffffffff8105d3b6>] kernel_thread+0x26/0x30
[<ffffffff81086652>] kthreadd+0x2c2/0x300
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
[<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask is allocated by
alloc_cpumask_var_node(). And these entries are used without initializing.
So these entries have wrong value.
When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follow:
3592 /* if cpumask is contained inside a NUMA node, we belong to that node */
3593 if (wq_numa_enabled) {
3594 for_each_node(node) {
3595 if (cpumask_subset(pool->attrs->cpumask,
3596 wq_numa_possible_cpumask[node])) {
3597 pool->node = node;
3598 break;
3599 }
3600 }
3601 }
But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
node is selected. As a result, kernel panic occurs.
By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node to initialize them. And the panic disappeared.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
Pull irq fixes from Thomas Gleixner:
"A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak
which I introduced in the last round of cleanups :("
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Fix memory leak when calling irq_free_hwirqs()
irqchip: spear_shirq: Fix interrupt offset
irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive
irqchip: armada-370-xp: Mask all interrupts during initialization.
irq_free_hwirqs() always calls irq_free_descs() with a cnt == 0
which makes it a no-op since the interrupt count to free is
decremented in itself.
Fixes: 7b6ef1262549f6afc5c881aaef80beb8fd15f908
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1404167084-8070-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>