23368 Commits

Author SHA1 Message Date
Josh Poimboeuf
ae92882e56 sched/debug: Clean up schedstat macros
The schedstat_*() macros are inconsistent: most of them take a pointer
and a field which the macro combines, whereas schedstat_set() takes the
already combined ptr->field.

The already combined ptr->field argument is actually more intuitive and
easier to use, and there's no reason to require the user to split the
variable up, so convert the macros to use the combined argument.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/54953ca25bb579f3a5946432dee409b0e05222c6.1466184592.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:46 +02:00
Josh Poimboeuf
1a3d027c5a sched/debug: Rename and move enqueue_sleeper()
enqueue_sleeper() doesn't actually enqueue, it just handles some
statistics and tracepoints.  Rename it to update_stats_enqueue_sleeper()
and call it from update_stats_enqueue().

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/fb20b7159dc4d028c406c0e8d5f8c439b741615b.1466184592.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:45 +02:00
Wanpeng Li
61c7aca695 sched/deadline: Fix the intention to re-evalute tick dependency for offline CPU
The dl task will be replenished after dl task timer fire and start a
new period. It will be enqueued and to re-evaluate its dependency on
the tick in order to restart it. However, if the CPU is hot-unplugged,
irq_work_queue will splash since the target CPU is offline.

As a result we get:

    WARNING: CPU: 2 PID: 0 at kernel/irq_work.c:69 irq_work_queue_on+0xad/0xe0
    Call Trace:
     dump_stack+0x99/0xd0
     __warn+0xd1/0xf0
     warn_slowpath_null+0x1d/0x20
     irq_work_queue_on+0xad/0xe0
     tick_nohz_full_kick_cpu+0x44/0x50
     tick_nohz_dep_set_cpu+0x74/0xb0
     enqueue_task_dl+0x226/0x480
     activate_task+0x5c/0xa0
     dl_task_timer+0x19b/0x2c0
     ? push_dl_task.part.31+0x190/0x190

This can be triggered by hot-unplugging the full dynticks CPU which dl
task is running on.

We enqueue the dl task on the offline CPU, because we need to do
replenish for start_dl_timer(). So, as Juri pointed out, we would
need to do is calling replenish_dl_entity() directly, instead of
enqueue_task_dl(). pi_se shouldn't be a problem as the task shouldn't
be boosted if it was throttled.

This patch fixes it by avoiding the whole enqueue+dequeue+enqueue story, by
first migrating (set_task_cpu()) and then doing 1 enqueue.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1472639264-3932-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:45 +02:00
seokhoon.yoon
efca03ecbe schedcore: Remove duplicated init_task's preempt_notifiers init
init_task's preempt_notifiers is initialized twice:

1) sched_init()
   -> INIT_HLIST_HEAD(&init_task.preempt_notifiers)

2) sched_init()
   -> init_idle(current,) <--- current task is init_task at this time
    -> __sched_fork(,current)
     -> INIT_HLIST_HEAD(&p->preempt_notifiers)

I think the first one is unnecessary, so remove it.

Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471339568-5790-1-git-send-email-iamyooon@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:44 +02:00
Dietmar Eggemann
2665621506 sched/fair: Fix load_above_capacity fixed point arithmetic width
Since commit:

  2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

we now have two different fixed point units for load.

load_above_capacity has to have 10 bits fixed point unit like PELT,
whereas NICE_0_LOAD has 20 bit fixed point unit on 64-bit kernels.

Fix this by scaling down NICE_0_LOAD when multiplying
load_above_capacity with it.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1470824847-5316-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:44 +02:00
Tommaso Cucinotta
d8206bb3ff sched/deadline: Split cpudl_set() into cpudl_set() and cpudl_clear()
These 2 exercise independent code paths and need different arguments.

After this change, you call:

  cpudl_clear(cp, cpu);
  cpudl_set(cp, cpu, dl);

instead of:

  cpudl_set(cp, cpu, 0 /* dl */, 0 /* is_valid */);
  cpudl_set(cp, cpu, dl, 1 /* is_valid */);

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-dl@retis.sssup.it
Link: http://lkml.kernel.org/r/1471184828-12644-4-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:43 +02:00
Tommaso Cucinotta
8e1bc301aa sched/deadline: Make CPU heap faster avoiding real swaps on heapify
This change goes from heapify() ops done by swapping with parent/child
so that the item to fix moves along, to heapify() ops done by just
pulling the parent/child chain by 1 pos, then storing the item to fix
just at the end. On a non-trivial heapify(), this performs roughly half
stores wrt swaps.

This has been measured to achieve up to 10% of speed-up for cpudl_set()
calls, with a randomly generated workload of 1K,10K,100K random heap
insertions and deletions (75% cpudl_set() calls with is_valid=1 and
25% with is_valid=0), and randomly generated cpu IDs, with up to 256
CPUs, as measured on an Intel Core2 Duo.

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-dl@retis.sssup.it
Link: http://lkml.kernel.org/r/1471184828-12644-3-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:43 +02:00
Tommaso Cucinotta
126b3b6842 sched/deadline: Refactor CPU heap code
1. heapify up factored out in new dedicated function heapify_up()
    (avoids repetition of same code)

 2. call to cpudl_change_key() replaced with heapify_up() when
    cpudl_set actually inserts a new node in the heap

 3. cpudl_change_key() replaced with heapify() that heapifies up
    or down as needed.

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-dl@retis.sssup.it
Link: http://lkml.kernel.org/r/1471184828-12644-2-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:42 +02:00
Byungchul Park
97a7142f15 sched/fair: Make update_min_vruntime() more readable
The update_min_vruntime() control flow can be simplified.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: minchan.kim@lge.com
Link: http://lkml.kernel.org/r/1436088829-25768-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:41 +02:00
Ingo Molnar
62cc20bcf2 Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:24:11 +02:00
Will Deacon
c9bbdd4830 perf/core: Don't pass PERF_EF_START to the PMU ->start callback
PERF_EF_START is a flag to indicate to the PMU ->add() callback that, as
well as claiming the PMU resources required by the event being added,
it should also start the PMU.

Passing this flag to the ->start() callback doesn't make sense, because
->start() always tries to start the PMU. Remove it.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1471257765-29662-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:19:18 +02:00
Ingo Molnar
2cc538412a Merge branch 'perf/urgent' into perf/core, to pick up fixed and resolve conflicts
Conflicts:
	kernel/events/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 12:09:59 +02:00
Balbir Singh
135e8c9250 sched/core: Fix a race between try_to_wake_up() and a woken up task
The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule()
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker, actually gets stuck doing the following in
try_to_wake_up():

	while (p->on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p->pi_lock)
   ..
   if (p->on_rq && ttwu_wakeup())
   ..
   while (p->on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 11:57:53 +02:00
Peter Zijlstra
5876314875 perf/core: Remove WARN from perf_event_read()
This effectively reverts commit:

  71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI")

... and puts in a comment explaining why we ignore the return value.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 11:55:00 +02:00
Linus Torvalds
1c3333600b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Two fixlet from the timers departement:

   - A fix for scheduler stalls in the tick idle code affecting
     NOHZ_FULL kernels

   - A trivial compile fix"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/nohz: Fix softlockup on scheduler stalls in kvm guest
  clocksource/drivers/atmel-pit: Fix compilation error
2016-09-04 08:43:45 -07:00
Lianwei Wang
01b4115906 cpu/hotplug: Handle unbalanced hotplug enable/disable
When cpu_hotplug_enable() is called unbalanced w/o a preceeding
cpu_hotplug_disable() the code emits a warning, but happily decrements the
disabled counter. This causes the next operations to malfunction.

Prevent the decrement and just emit a warning.

Signed-off-by: Lianwei Wang <lianwei.wang@gmail.com>
Cc: peterz@infradead.org
Cc: linux-pm@vger.kernel.org
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/1465541008-12476-1-git-send-email-lianwei.wang@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 20:37:17 +02:00
Sebastian Frias
f88eecfe2f genirq/generic_chip: Verify irqs_per_chip <= 32
Most (if not all) code here implicitly assumes that the maximum number of
IRQs per chip will be 32, and thus uses 'u32' or 'unsigned long' for many
tasks (for example "struct irq_data" declares its 'mask' field as 'u32',
and "struct irq_chip_generic" declares its 'installed' field as 'unsigned
long')

However, there is no check to verify that irqs_per_chip is <= 32.  Hence,
calling irq_alloc_domain_generic_chips() with a bigger value will result in
unexpected results.

Provide a wrapper with a MAYBE_BUILD_BUG_ON(nrirqs >= 32) to catch such
cases.

[ tglx: Reduced changelog to the essential information ]

Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/57B31D94.5040701@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 20:20:59 +02:00
Thomas Gleixner
cf392d10b6 cpu/hotplug: Add multi instance support
This patch adds the ability for a given state to have multiple
instances. Until now all states have a single instance and the startup /
teardown callback use global variables.
A few drivers need to perform a the same callbacks on multiple
"instances". Currently we have three drivers in tree which all have a
global list which they iterate over. With multi instance they support
don't need their private list and the functionality has been moved into
core code. Plus we hold the hotplug lock in core so no cpus comes/goes
while instances are registered and we do rollback in error case :)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-3-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 20:05:05 +02:00
Thomas Gleixner
a724632ca0 cpu/hotplug: Rework callback invocation logic
This is preparation for the following patch.
This rework here changes the arguments of cpuhp_invoke_callback(). It
passes now `state' and whether `startup' or `teardown' callback should
be invoked. The callback then is looked up by the function.

The following is a clanup of callers:
- cpuhp_issue_call() has one argument less
- struct cpuhp_cpu_state (which is used by the hotplug thread) gets also
  its callback removed. The decision if it is a single callback
  invocation moved to the `single' variable. Also a `bringup' variable
  has been added to distinguish between startup and teardown callback.
- take_cpu_down() needs to start one step earlier. We always get here
  via CPUHP_TEARDOWN_CPU callback. Before that change cpuhp_ap_states +
  CPUHP_TEARDOWN_CPU pointed to an empty entry because TEARDOWN is saved
  in bp_states for this reason. Now that we use cpuhp_get_step() to
  lookup the state we must explicitly skip it in order not to invoke it
  twice.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-2-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 20:05:05 +02:00
Alexei Starovoitov
aa6a5f3cb2 perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via overflow_handler mechanism.
When program is attached the overflow_handlers become stacked.
The program acts as a filter.
Returning zero from the program means that the normal perf_event_output handler
will not be called and sampling event won't be stored in the ring buffer.

The overflow_handler_context==NULL is an additional safety check
to make sure programs are not attached to hw breakpoints and watchdog
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.

The program refcnt is incremented in case perf_events are inhereted
when target task is forked.
Similar to kprobe and tracepoint programs there is no ioctl to
detach the program or swap already attached program. The user space
expected to close(perf_event_fd) like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.

The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced when the program
is attached while perf_event itself could have been active already.
There is no need to do similar treatment for event->prog, since it's
assigned only once before it's accessed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Alexei Starovoitov
fdc15d388d bpf: perf_event progs should only use preallocated maps
Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where nmi got triggered.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Alexei Starovoitov
0515e5999a bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)

The program visible context meta structure is
struct bpf_perf_event_data {
    struct pt_regs regs;
     __u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
  ... ctx->sample_period ...
  ... ctx->regs.ip ...
}

The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Alexei Starovoitov
ea2e7ce5d0 bpf: support 8-byte metafield access
The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Steven Rostedt (Red Hat)
7b2c862501 tracing: Add NMI tracing in hwlat detector
As NMIs can also cause latency when interrupts are disabled, the hwlat
detectory has no way to know if the latency it detects is from an NMI or an
SMI or some other hardware glitch.

As ftrace_nmi_enter/exit() funtions are no longer used (except for sh, which
isn't supported anymore), I converted those to "arch_ftrace_nmi_enter/exit"
and use ftrace_nmi_enter/exit() to check if hwlat detector is tracing or
not, and if so, it calls into the hwlat utility.

Since the hwlat detector only has a single kthread that is spinning with
interrupts disabled, it marks what CPU it is on, and if the NMI callback
happens on that CPU, it records the time spent in that NMI. This is added to
the output that is generated by the hwlat detector as:

 #3     inner/outer(us):    9/9     ts:1470836488.206734548
 #4     inner/outer(us):    0/8     ts:1470836497.140808588
 #5     inner/outer(us):    0/6     ts:1470836499.140825168 nmi-total:5 nmi-count:1
 #6     inner/outer(us):    9/9     ts:1470836501.140841748

All time is still tracked in microseconds.

The NMI information is only shown when an NMI occurred during the sample.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-02 12:47:55 -04:00
Steven Rostedt (Red Hat)
0330f7aa8e tracing: Have hwlat trace migrate across tracing_cpumask CPUs
Instead of having the hwlat detector thread stay on one CPU, have it migrate
across all the CPUs specified by tracing_cpumask. If the user modifies the
thread's CPU affinity, the migration will stop until the next instance that
the tracer is instantiated. The migration happens at the end of each window
(period).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-02 12:47:54 -04:00
Steven Rostedt (Red Hat)
e7c15cd8a1 tracing: Added hardware latency tracer
The hardware latency tracer has been in the PREEMPT_RT patch for some time.
It is used to detect possible SMIs or any other hardware interruptions that
the kernel is unaware of. Note, NMIs may also be detected, but that may be
good to note as well.

The logic is pretty simple. It simply creates a thread that spins on a
single CPU for a specified amount of time (width) within a periodic window
(window). These numbers may be adjusted by their cooresponding names in

   /sys/kernel/tracing/hwlat_detector/

The defaults are window = 1000000 us (1 second)
                 width  =  500000 us (1/2 second)

The loop consists of:

	t1 = trace_clock_local();
	t2 = trace_clock_local();

Where trace_clock_local() is a variant of sched_clock().

The difference of t2 - t1 is recorded as the "inner" timestamp and also the
timestamp  t1 - prev_t2 is recorded as the "outer" timestamp. If either of
these differences are greater than the time denoted in
/sys/kernel/tracing/tracing_thresh then it records the event.

When this tracer is started, and tracing_thresh is zero, it changes to the
default threshold of 10 us.

The hwlat tracer in the PREEMPT_RT patch was originally written by
Jon Masters. I have modified it quite a bit and turned it into a
tracer.

Based-on-code-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-02 12:47:51 -04:00
Sebastian Frias
0c228919e0 irqdomain: Mask irq type in irq_domain_xlate_onetwocell()
According to the xlate() callback definition, the 'out_type' parameter
needs to be the "linux irq type".

A mask for such bits exists, IRQ_TYPE_SENSE_MASK, which is correctly
applied in irq_domain_xlate_twocell()

So use it for irq_domain_xlate_onetwocell() as well.

Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/57A05F5D.103@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 18:06:50 +02:00
Sebastian Frias
ee26c013cd genirq/generic_chip: Add irq_unmap callback
Without this patch irq_domain_disassociate() cannot properly release the
interrupt. In fact, irq_map_generic_chip() checks a bit on 'gc->installed'
but said bit is never cleared, only set.

Commit 088f40b7b027 ("genirq: Generic chip: Add linear irq domain support")
added irq_map_generic_chip() function and also stated "This lacks a removal
function for now".

This commit provides an implementation of an unmap function that can be
called by irq_domain_disassociate().

[ tglx: Made the function static and removed the export as we have neither
  	a prototype nor a modular user. ]

Fixes: 088f40b7b027 ("genirq: Generic chip: Add linear irq domain support")
Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/579F5C5A.2070507@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 18:06:49 +02:00
Sebastian Frias
f0c450eaa3 genirq/generic_chip: Get rid of code duplication
irq_map_generic_chip() contains about the same code as
irq_get_domain_generic_chip() except for the return values.

Split out the irq_get_domain_generic_chip() implementation so it can be
reused.

[ tglx: Removed the extra churn in irq_get_domain_generic_chip() callers
  	and massaged changelog ]

Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/579F5C69.8070006@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 18:06:49 +02:00
Thomas Gleixner
48e0fba842 genirq: Remove export of irq_map_generic_chip()
No module users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 18:06:49 +02:00
Thomas Gleixner
fc590c22f9 genirq: Robustify handle_percpu_devid_irq()
The percpu_devid handler is not robust against spurious interrupts. If a
spurious interrupt happens and no action is installed then the handler
crashes with a NULL pointer dereference.

Add a sanity check for this and log the wreckage once in dmesg.

Reported-by: Majun <majun258@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: guohanjun@huawei.com
Cc: dingtianhong@huawei.com
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1609021436160.5647@nanos
2016-09-02 18:06:49 +02:00
Wanpeng Li
08d0725992 tick/nohz: Fix softlockup on scheduler stalls in kvm guest
tick_nohz_start_idle() is prevented to be called if the idle tick can't 
be stopped since commit 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle 
enter"). As a result, after suspend/resume the host machine, full dynticks 
kvm guest will softlockup:

 NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
 Call Trace:
  default_idle+0x31/0x1a0
  arch_cpu_idle+0xf/0x20
  default_idle_call+0x2a/0x50
  cpu_startup_entry+0x39b/0x4d0
  rest_init+0x138/0x140
  ? rest_init+0x5/0x140
  start_kernel+0x4c1/0x4ce
  ? set_init_arg+0x55/0x55
  ? early_idt_handler_array+0x120/0x120
  x86_64_start_reservations+0x24/0x26
  x86_64_start_kernel+0x142/0x14f

In addition, cat /proc/stat | grep cpu in guest or host:

cpu  398 16 5049 15754 5490 0 1 46 0 0
cpu0 206 5 450 0 0 0 1 14 0 0
cpu1 81 0 3937 3149 1514 0 0 9 0 0
cpu2 45 6 332 6052 2243 0 0 11 0 0
cpu3 65 2 328 6552 1732 0 0 11 0 0

The idle and iowait states are weird 0 for cpu0(housekeeping). 

The bug is present in both guest and host kernels, and they both have 
cpu0's idle and iowait states issue, however, host kernel's suspend/resume 
path etc will touch watchdog to avoid the softlockup.

- The watchdog will not be touched in tick_nohz_stop_idle path (need be 
  touched since the scheduler stall is expected) if idle_active flags are 
  not detected.
- The idle and iowait states will not be accounted when exit idle loop 
  (resched or interrupt) if idle start time and idle_active flags are 
  not set. 

This patch fixes it by reverting commit 1f3b0f8243cb934 since can't stop 
idle tick doesn't mean can't be idle.

Fixes: 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter")
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Sanjeev Yadav<sanjeev.yadav@spreadtrum.com>
Cc: Gaurav Jindal<gaurav.jindal@spreadtrum.com>
Cc: stable@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 10:25:40 +02:00
Linus Torvalds
b9677faf45 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "14 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  rapidio/tsi721: fix incorrect detection of address translation condition
  rapidio/documentation/mport_cdev: add missing parameter description
  kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
  MAINTAINERS: Vladimir has moved
  mm, mempolicy: task->mempolicy must be NULL before dropping final reference
  printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
  treewide: remove references to the now unnecessary DEFINE_PCI_DEVICE_TABLE
  drivers/scsi/wd719x.c: remove last declaration using DEFINE_PCI_DEVICE_TABLE
  mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
  lib/test_hash.c: fix warning in preprocessor symbol evaluation
  lib/test_hash.c: fix warning in two-dimensional array init
  kconfig: tinyconfig: provide whole choice blocks to avoid warnings
  kexec: fix double-free when failing to relocate the purgatory
  mm, oom: prevent premature OOM killer invocation for high order request
2016-09-01 18:23:22 -07:00
Michal Hocko
735f2770a7 kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
Commit fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
exit") has caused a subtle regression in nscd which uses
CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
shared databases, so that the clients are notified when nscd is
restarted.  Now, when nscd uses a non-persistent database, clients that
have it mapped keep thinking the database is being updated by nscd, when
in fact nscd has created a new (anonymous) one (for non-persistent
databases it uses an unlinked file as backend).

The original proposal for the CLONE_CHILD_CLEARTID change claimed
(https://lkml.org/lkml/2006/10/25/233):

: The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
: on behalf of pthread_create() library calls.  This feature is used to
: request that the kernel clear the thread-id in user space (at an address
: provided in the syscall) when the thread disassociates itself from the
: address space, which is done in mm_release().
:
: Unfortunately, when a multi-threaded process incurs a core dump (such as
: from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
: the other threads, which then proceed to clear their user-space tids
: before synchronizing in exit_mm() with the start of core dumping.  This
: misrepresents the state of process's address space at the time of the
: SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
: problems (misleading him/her to conclude that the threads had gone away
: before the fault).
:
: The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
: core dump has been initiated.

The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
seems to have a larger scope than the original patch asked for.  It
seems that limitting the scope of the check to core dumping should work
for SIGSEGV issue describe above.

[Changelog partly based on Andreas' description]
Fixes: fec1d0115240 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: William Preston <wpreston@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Andreas Schwab <schwab@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01 17:52:02 -07:00
David Rientjes
c11600e4fe mm, mempolicy: task->mempolicy must be NULL before dropping final reference
KASAN allocates memory from the page allocator as part of
kmem_cache_free(), and that can reference current->mempolicy through any
number of allocation functions.  It needs to be NULL'd out before the
final reference is dropped to prevent a use-after-free bug:

	BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
	CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
	...
	Call Trace:
		dump_stack
		kasan_object_err
		kasan_report_error
		__asan_report_load2_noabort
		alloc_pages_current	<-- use after free
		depot_save_stack
		save_stack
		kasan_slab_free
		kmem_cache_free
		__mpol_put		<-- free
		do_exit

This patch sets current->mempolicy to NULL before dropping the final
reference.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>	[4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01 17:52:01 -07:00
Sergey Senozhatsky
19feeff18b printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
__printk_nmi_flush() can be called from nmi_panic(), therefore it has to
test whether it's executed in NMI context and thus must route the
messages through deferred printk() or via direct printk().

This is to avoid potential deadlocks, as described in commit
cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic").

However there remain two places where __printk_nmi_flush() does
unconditional direct printk() calls:

 - pr_err("printk_nmi_flush: internal error ...")
 - pr_cont("\n")

Factor out print_nmi_seq_line() parts into a new printk_nmi_flush_line()
function, which takes care of in_nmi(), and use it in
__printk_nmi_flush() for printing and error-reporting.

Link: http://lkml.kernel.org/r/20160830161354.581-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01 17:52:01 -07:00
Arnd Bergmann
236dec0510 kconfig: tinyconfig: provide whole choice blocks to avoid warnings
Using "make tinyconfig" produces a couple of annoying warnings that show
up for build test machines all the time:

    .config:966:warning: override: NOHIGHMEM changes choice state
    .config:965:warning: override: SLOB changes choice state
    .config:963:warning: override: KERNEL_XZ changes choice state
    .config:962:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
    .config:933:warning: override: SLOB changes choice state
    .config:930:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state
    .config:870:warning: override: SLOB changes choice state
    .config:868:warning: override: KERNEL_XZ changes choice state
    .config:867:warning: override: CC_OPTIMIZE_FOR_SIZE changes choice state

I've made a previous attempt at fixing them and we discussed a number of
alternatives.

I tried changing the Makefile to use "merge_config.sh -n
$(fragment-list)" but couldn't get that to work properly.

This is yet another approach, based on the observation that we do want
to see a warning for conflicting 'choice' options, and that we can
simply make them non-conflicting by listing all other options as
disabled.  This is a trivial patch that we can apply independent of
plans for other changes.

Link: http://lkml.kernel.org/r/20160829214952.1334674-2-arnd@arndb.de
Link: https://storage.kernelci.org/mainline/v4.7-rc6/x86-tinyconfig/build.log
https://patchwork.kernel.org/patch/9212749/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01 17:52:01 -07:00
Thiago Jung Bauermann
070c43eea5 kexec: fix double-free when failing to relocate the purgatory
If kexec_apply_relocations fails, kexec_load_purgatory frees pi->sechdrs
and pi->purgatory_buf.  This is redundant, because in case of error
kimage_file_prepare_segments calls kimage_file_post_load_cleanup, which
will also free those buffers.

This causes two warnings like the following, one for pi->sechdrs and the
other for pi->purgatory_buf:

  kexec-bzImage64: Loading purgatory failed
  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 2119 at mm/vmalloc.c:1490 __vunmap+0xc1/0xd0
  Trying to vfree() nonexistent vm area (ffffc90000e91000)
  Modules linked in:
  CPU: 1 PID: 2119 Comm: kexec Not tainted 4.8.0-rc3+ #5
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x4d/0x65
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x4f/0x60
    ? find_vmap_area+0x19/0x70
    ? kimage_file_post_load_cleanup+0x47/0xb0
    __vunmap+0xc1/0xd0
    vfree+0x2e/0x70
    kimage_file_post_load_cleanup+0x5e/0xb0
    SyS_kexec_file_load+0x448/0x680
    ? putname+0x54/0x60
    ? do_sys_open+0x190/0x1f0
    entry_SYSCALL_64_fastpath+0x13/0x8f
  ---[ end trace 158bb74f5950ca2b ]---

Fix by setting pi->sechdrs an pi->purgatory_buf to NULL, since vfree
won't try to free a NULL pointer.

Link: http://lkml.kernel.org/r/1472083546-23683-1-git-send-email-bauerman@linux.vnet.ibm.com
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01 17:52:01 -07:00
Linus Torvalds
511a8cdb65 Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit
Pull audit fixes from Paul Moore:
 "Two small patches to fix some bugs with the audit-by-executable
  functionality we introduced back in v4.3 (both patches are marked
  for the stable folks)"

* 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit:
  audit: fix exe_file access in audit_exe_compare
  mm: introduce get_task_exe_file
2016-09-01 15:55:56 -07:00
Thomas Gleixner
0cb7bf61b1 Merge branch 'linus' into smp/hotplug
Apply upstream changes to avoid conflicts with pending patches.
2016-09-01 18:33:46 +02:00
Namhyung Kim
8861dd303c ftrace: Access ret_stack->subtime only in the function profiler
The subtime is used only for function profiler with function graph
tracer enabled.  Move the definition of subtime under
CONFIG_FUNCTION_PROFILER to reduce the memory usage.  Also move the
initialization of subtime into the graph entry callback.

Link: http://lkml.kernel.org/r/20160831025529.24018-1-namhyung@kernel.org

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-01 12:19:40 -04:00
Namhyung Kim
613dccdf68 function_graph: Handle TRACE_BPUTS in print_graph_comment
It missed to handle TRACE_BPUTS so messages recorded by trace_bputs()
will be shown with symbol info unnecessarily.

You can see it with the trace_printk sample code:

  # cd /sys/kernel/tracing/
  # echo sys_sync > set_graph_function
  # echo 1 > options/sym-offset
  # echo function_graph > current_tracer

Note that the sys_sync filter was there to prevent recording other
functions and the sym-offset option was needed since the first message
was called from a module init function so kallsyms doesn't have the
symbol and omitted in the output.

  # cd ~/build/kernel
  # insmod samples/trace_printk/trace-printk.ko

  # cd -
  # head trace

Before:

  # tracer: function_graph
  #
  # CPU  DURATION                  FUNCTION CALLS
  # |     |   |                     |   |   |   |
   1)               |  /* 0xffffffffa0002000: This is a static string that will use trace_bputs */
   1)               |  /* This is a dynamic string that will use trace_puts */
   1)               |  /* trace_printk_irq_work+0x5/0x7b [trace_printk]: (irq) This is a static string that will use trace_bputs */
   1)               |  /* (irq) This is a dynamic string that will use trace_puts */
   1)               |  /* (irq) This is a static string that will use trace_bprintk() */
   1)               |  /* (irq) This is a dynamic string that will use trace_printk */

After:

  # tracer: function_graph
  #
  # CPU  DURATION                  FUNCTION CALLS
  # |     |   |                     |   |   |   |
   1)               |  /* This is a static string that will use trace_bputs */
   1)               |  /* This is a dynamic string that will use trace_puts */
   1)               |  /* (irq) This is a static string that will use trace_bputs */
   1)               |  /* (irq) This is a dynamic string that will use trace_puts */
   1)               |  /* (irq) This is a static string that will use trace_bprintk() */
   1)               |  /* (irq) This is a dynamic string that will use trace_printk */

Link: http://lkml.kernel.org/r/20160901024354.13720-1-namhyung@kernel.org

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-01 11:19:55 -04:00
Dmitry Safonov
5ba8a4a96f tracing/uprobe: Drop isdigit() check in create_trace_uprobe
It's useless. Before:
  [tracing]# echo 'p:test /a:0x0' >> uprobe_events
  [tracing]# echo 'p:test a:0x0' >> uprobe_events
  -bash: echo: write error: No such file or directory
  [tracing]# echo 'p:test 1:0x0' >> uprobe_events
  -bash: echo: write error: Invalid argument

After:
  [tracing]# echo 'p:test 1:0x0' >> uprobe_events
  -bash: echo: write error: No such file or directory

Link: http://lkml.kernel.org/r/20160825152110.25663-3-dsafonov@virtuozzo.com

Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-01 11:18:09 -04:00
Steve Muckle
8314bc83f6 cpufreq / sched: ignore SMT when determining max cpu capacity
PELT does not consider SMT when scaling its utilization values via
arch_scale_cpu_capacity(). The value in rq->cpu_capacity_orig does
take SMT into consideration though and therefore may be smaller than
the utilization reported by PELT.

On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
util_avg scales up to 1024. This means that a 50% utilized CPU will show
up in schedutil as ~86% busy.

Fix this by using the same CPU scaling value in schedutil as that which
is used by PELT.

Signed-off-by: Steve Muckle <smuckle@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-09-01 00:32:57 +02:00
Vegard Nossum
979515c564 time: Avoid undefined behaviour in ktime_add_safe()
I ran into this:

    ================================================================================
    UBSAN: Undefined behaviour in kernel/time/hrtimer.c:310:16
    signed integer overflow:
    9223372036854775807 + 50000 cannot be represented in type 'long long int'
    CPU: 2 PID: 4798 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #91
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
     0000000000000000 ffff88010ce6fb88 ffffffff82344740 0000000041b58ab3
     ffffffff84f97a20 ffffffff82344694 ffff88010ce6fbb0 ffff88010ce6fb60
     000000000000c350 ffff88010ce6f968 dffffc0000000000 ffffffff857bc320
    Call Trace:
     [<ffffffff82344740>] dump_stack+0xac/0xfc
     [<ffffffff82344694>] ? _atomic_dec_and_lock+0xc4/0xc4
     [<ffffffff8242df78>] ubsan_epilogue+0xd/0x8a
     [<ffffffff8242e6b4>] handle_overflow+0x202/0x23d
     [<ffffffff8242e4b2>] ? val_to_string.constprop.6+0x11e/0x11e
     [<ffffffff8236df71>] ? timerqueue_add+0x151/0x410
     [<ffffffff81485c48>] ? hrtimer_start_range_ns+0x3b8/0x1380
     [<ffffffff81795631>] ? memset+0x31/0x40
     [<ffffffff8242e6fd>] __ubsan_handle_add_overflow+0xe/0x10
     [<ffffffff81488ac9>] hrtimer_nanosleep+0x5d9/0x790
     [<ffffffff814884f0>] ? hrtimer_init_sleeper+0x80/0x80
     [<ffffffff813a9ffb>] ? __might_sleep+0x5b/0x260
     [<ffffffff8148be10>] common_nsleep+0x20/0x30
     [<ffffffff814906c7>] SyS_clock_nanosleep+0x197/0x210
     [<ffffffff81490530>] ? SyS_clock_getres+0x150/0x150
     [<ffffffff823c7113>] ? __this_cpu_preempt_check+0x13/0x20
     [<ffffffff8162ef60>] ? __context_tracking_exit.part.3+0x30/0x1b0
     [<ffffffff81490530>] ? SyS_clock_getres+0x150/0x150
     [<ffffffff81007bd3>] do_syscall_64+0x1b3/0x4b0
     [<ffffffff845f85aa>] entry_SYSCALL64_slow_path+0x25/0x25
    ================================================================================

Add a new ktime_add_unsafe() helper which doesn't check for overflow, but
doesn't throw a UBSAN warning when it does overflow either.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-31 14:43:36 -07:00
Vegard Nossum
469e857f37 time: Avoid undefined behaviour in timespec64_add_safe()
I ran into this:

    ================================================================================
    UBSAN: Undefined behaviour in kernel/time/time.c:783:2
    signed integer overflow:
    5273 + 9223372036854771711 cannot be represented in type 'long int'
    CPU: 0 PID: 17363 Comm: trinity-c0 Not tainted 4.8.0-rc1+ #88
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org
    04/01/2014
     0000000000000000 ffff88011457f8f0 ffffffff82344f50 0000000041b58ab3
     ffffffff84f98080 ffffffff82344ea4 ffff88011457f918 ffff88011457f8c8
     ffff88011457f8e0 7fffffffffffefff ffff88011457f6d8 dffffc0000000000
    Call Trace:
     [<ffffffff82344f50>] dump_stack+0xac/0xfc
     [<ffffffff82344ea4>] ? _atomic_dec_and_lock+0xc4/0xc4
     [<ffffffff8242f4c8>] ubsan_epilogue+0xd/0x8a
     [<ffffffff8242fc04>] handle_overflow+0x202/0x23d
     [<ffffffff8242fa02>] ? val_to_string.constprop.6+0x11e/0x11e
     [<ffffffff823c7837>] ? debug_smp_processor_id+0x17/0x20
     [<ffffffff8131b581>] ? __sigqueue_free.part.13+0x51/0x70
     [<ffffffff8146d4e0>] ? rcu_is_watching+0x110/0x110
     [<ffffffff8242fc4d>] __ubsan_handle_add_overflow+0xe/0x10
     [<ffffffff81476ef8>] timespec64_add_safe+0x298/0x340
     [<ffffffff81476c60>] ? timespec_add_safe+0x330/0x330
     [<ffffffff812f7990>] ? wait_noreap_copyout+0x1d0/0x1d0
     [<ffffffff8184bf18>] poll_select_set_timeout+0xf8/0x170
     [<ffffffff8184be20>] ? poll_schedule_timeout+0x2b0/0x2b0
     [<ffffffff813aa9bb>] ? __might_sleep+0x5b/0x260
     [<ffffffff833c8a87>] __sys_recvmmsg+0x107/0x790
     [<ffffffff833c8980>] ? SyS_recvmsg+0x20/0x20
     [<ffffffff81486378>] ? hrtimer_start_range_ns+0x3b8/0x1380
     [<ffffffff845f8bfb>] ? _raw_spin_unlock_irqrestore+0x3b/0x60
     [<ffffffff8148bcea>] ? do_setitimer+0x39a/0x8e0
     [<ffffffff813aa9bb>] ? __might_sleep+0x5b/0x260
     [<ffffffff833c9110>] ? __sys_recvmmsg+0x790/0x790
     [<ffffffff833c91e9>] SyS_recvmmsg+0xd9/0x160
     [<ffffffff833c9110>] ? __sys_recvmmsg+0x790/0x790
     [<ffffffff823c7853>] ? __this_cpu_preempt_check+0x13/0x20
     [<ffffffff8162f680>] ? __context_tracking_exit.part.3+0x30/0x1b0
     [<ffffffff833c9110>] ? __sys_recvmmsg+0x790/0x790
     [<ffffffff81007bd3>] do_syscall_64+0x1b3/0x4b0
     [<ffffffff845f936a>] entry_SYSCALL64_slow_path+0x25/0x25
    ================================================================================

Line 783 is this:

783         set_normalized_timespec64(&res, lhs.tv_sec + rhs.tv_sec,
784                         lhs.tv_nsec + rhs.tv_nsec);

In other words, since lhs.tv_sec and rhs.tv_sec are both time64_t, this
is a signed addition which will cause undefined behaviour on overflow.

Note that this is not currently a huge concern since the kernel should be
built with -fno-strict-overflow by default, but could be a problem in the
future, a problem with older compilers, or other compilers than gcc.

The easiest way to avoid the overflow is to cast one of the arguments to
unsigned (so the addition will be done using unsigned arithmetic).

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-31 14:43:35 -07:00
Ruchi Kandoi
0bf43f15db timekeeping: Prints the amounts of time spent during suspend
In addition to keeping a histogram of suspend times, also
print out the time spent in suspend to dmesg.

This helps to keep track of suspend time while debugging using
kernel logs.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
[jstultz: Tweaked commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-31 14:43:34 -07:00
Kyle Walker
36374583f9 clocksource: Defer override invalidation unless clock is unstable
Clocksources don't get the VALID_FOR_HRES flag until they have been
checked by a watchdog. However, when using an override, the
clocksource_select logic will clear the override value if the
clocksource is not marked VALID_FOR_HRES during that inititial check.
When using the boot arguments clocksource=<foo>, this selection can
run before the watchdog, and can cause the override to be incorrectly
cleared.

To address this condition, the override_name is only invalidated for
unstable clocksources. Otherwise, the override is left intact until after
the watchdog has validated the clocksource as stable/unstable.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Kyle Walker <kwalker@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-31 14:43:33 -07:00
Pratyush Patel
b4d90e9f1e hrtimer: Spelling fixes
Fix a minor spelling error.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Pratyush Patel <pratyushpatel.1995@gmail.com>
[jstultz: Added commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-08-31 14:43:20 -07:00
Mateusz Guzik
5efc244346 audit: fix exe_file access in audit_exe_compare
Prior to the change the function would blindly deference mm, exe_file
and exe_file->f_inode, each of which could have been NULL or freed.

Use get_task_exe_file to safely obtain stable exe_file.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Cc: <stable@vger.kernel.org> # 4.3.x
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-08-31 16:16:35 -04:00