summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* kprobes: Return error if we fail to reuse kprobe instead of BUG_ON()Masami Hiramatsu2018-11-211-7/+20
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 819319fc93461c07b9cdb3064f154bd8cfd48172 ] Make reuse_unused_kprobe() to return error code if it fails to reuse unused kprobe for optprobe instead of calling BUG_ON(). Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S . Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/153666124040.21306.14150398706331307654.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* locking/lockdep: Fix debug_locks off performance problemWaiman Long2018-11-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9506a7425b094d2f1d9c877ed5a78f416669269b ] It was found that when debug_locks was turned off because of a problem found by the lockdep code, the system performance could drop quite significantly when the lock_stat code was also configured into the kernel. For instance, parallel kernel build time on a 4-socket x86-64 server nearly doubled. Further analysis into the cause of the slowdown traced back to the frequent call to debug_locks_off() from the __lock_acquired() function probably due to some inconsistent lockdep states with debug_locks off. The debug_locks_off() function did an unconditional atomic xchg to write a 0 value into debug_locks which had already been set to 0. This led to severe cacheline contention in the cacheline that held debug_locks. As debug_locks is being referenced in quite a few different places in the kernel, this greatly slow down the system performance. To prevent that trashing of debug_locks cacheline, lock_acquired() and lock_contended() now checks the state of debug_locks before proceeding. The debug_locks_off() function is also modified to check debug_locks before calling __debug_locks_off(). Signed-off-by: Waiman Long <longman@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1539913518-15598-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cpuidle: Do not access cpuidle_devices when !CONFIG_CPU_IDLECatalin Marinas2018-11-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | commit 9bd616e3dbedfc103f158197c8ad93678849b1ed upstream. The cpuidle_devices per-CPU variable is only defined when CPU_IDLE is enabled. Commit c8cc7d4de7a4 ("sched/idle: Reorganize the idle loop") removed the #ifdef CONFIG_CPU_IDLE around cpuidle_idle_call() with the compiler optimising away __this_cpu_read(cpuidle_devices). However, with CONFIG_UBSAN && !CONFIG_CPU_IDLE, this optimisation no longer happens and the kernel fails to link since cpuidle_devices is not defined. This patch introduces an accessor function for the current CPU cpuidle device (returning NULL when !CONFIG_CPU_IDLE) and uses it in cpuidle_idle_call(). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: 4.5+ <stable@vger.kernel.org> # 4.5+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/fair: Fix throttle_list starvation with low CFS quotaPhil Auld2018-11-102-3/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit baa9be4ffb55876923dc9716abc0a448e510ba30 upstream. With a very low cpu.cfs_quota_us setting, such as the minimum of 1000, distribute_cfs_runtime may not empty the throttled_list before it runs out of runtime to distribute. In that case, due to the change from c06f04c7048 to put throttled entries at the head of the list, later entries on the list will starve. Essentially, the same X processes will get pulled off the list, given CPU time and then, when expired, get put back on the head of the list where distribute_cfs_runtime will give runtime to the same set of processes leaving the rest. Fix the issue by setting a bit in struct cfs_bandwidth when distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can decide to put the throttled entry on the tail or the head of the list. The bit is set/cleared by the callers of distribute_cfs_runtime while they hold cfs_bandwidth->lock. This is easy to reproduce with a handful of CPU consumers. I use 'crash' on the live system. In some cases you can simply look at the throttled list and see the later entries are not changing: crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3 1 ffff90b56cb2d200 -976050 2 ffff90b56cb2cc00 -484925 3 ffff90b56cb2bc00 -658814 4 ffff90b56cb2ba00 -275365 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3 1 ffff90b56cb2d200 -994147 2 ffff90b56cb2cc00 -306051 3 ffff90b56cb2bc00 -961321 4 ffff90b56cb2ba00 -24490 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 Sometimes it is easier to see by finding a process getting starved and looking at the sched_info: crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, Signed-off-by: Phil Auld <pauld@redhat.com> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop") Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: generally move prog destruction to RCU deferralDaniel Borkmann2018-11-103-14/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 1aacde3d22c42281236155c1ef6d7a5aa32a826b ] Jann Horn reported following analysis that could potentially result in a very hard to trigger (if not impossible) UAF race, to quote his event timeline: - Set up a process with threads T1, T2 and T3 - Let T1 set up a socket filter F1 that invokes another filter F2 through a BPF map [tail call] - Let T1 trigger the socket filter via a unix domain socket write, don't wait for completion - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put() - Let T3 close the file descriptor for F2, dropping the reference count of F2 to 2 - At this point, T1 should have looked up F2 from the map, but not finished executing it - Let T3 remove F2 from the BPF map, dropping the reference count of F2 to 1 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping the reference count of F2 to 0 and scheduling bpf_prog_free_deferred() via schedule_work() - At this point, the BPF program could be freed - BPF execution is still running in a freed BPF program While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf event fd we're doing the syscall on doesn't disappear from underneath us for whole syscall time, it may not be the case for the bpf fd used as an argument only after we did the put. It needs to be a valid fd pointing to a BPF program at the time of the call to make the bpf_prog_get() and while T2 gets preempted, F2 must have dropped reference to 1 on the other CPU. The fput() from the close() in T3 should also add additionally delay to the reference drop via exit_task_work() when bpf_prog_release() gets called as well as scheduling bpf_prog_free_deferred(). That said, it makes nevertheless sense to move the BPF prog destruction generally after RCU grace period to guarantee that such scenario above, but also others as recently fixed in ceb56070359b ("bpf, perf: delay release of BPF prog after grace period") with regards to tail calls won't happen. Integrating bpf_prog_free_deferred() directly into the RCU callback is not allowed since the invocation might happen from either softirq or process context, so we're not permitted to block. Reviewing all bpf_prog_put() invocations from eBPF side (note, cBPF -> eBPF progs don't use this for their destruction) with call_rcu() look good to me. Since we don't know whether at the time of attaching the program, we're already part of a tail call map, we need to use RCU variant. However, due to this, there won't be severely more stress on the RCU callback queue: situations with above bpf_prog_get() and bpf_prog_put() combo in practice normally won't lead to releases, but even if they would, enough effort/ cycles have to be put into loading a BPF program into the kernel already. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/core: Don't leak event in the syscall error pathAlexander Shishkin2018-11-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 201c2f85bd0bc13b712d9c0b3d11251b182e06ae ] In the error path, event_file not being NULL is used to determine whether the event itself still needs to be free'd, so fix it up to avoid leaking. Reported-by: Leon Yu <chianglungyu@gmail.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 130056275ade ("perf: Do not double free") Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* sched/cgroup: Fix cgroup entity load tracking tear-downPeter Zijlstra2018-11-103-20/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6fe1f348b3dd1f700f9630562b7d38afd6949568 ] When a cgroup's CPU runqueue is destroyed, it should remove its remaining load accounting from its parent cgroup. The current site for doing so it unsuited because its far too late and unordered against other cgroup removal (->css_free() will be, but we're also in an RCU callback). Put it in the ->css_offline() callback, which is the start of cgroup destruction, right after the group has been made unavailable to userspace. The ->css_offline() callbacks are called in hierarchical order after the following v4.4 commit: aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* tracing: Skip more functions when doing stack tracing of eventsSteven Rostedt (Red Hat)2018-11-101-1/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit be54f69c26193de31053190761e521903b89d098 ] # echo 1 > options/stacktrace # echo 1 > events/sched/sched_switch/enable # cat trace <idle>-0 [002] d..2 1982.525169: <stack trace> => save_stack_trace => __ftrace_trace_stack => trace_buffer_unlock_commit_regs => event_trigger_unlock_commit => trace_event_buffer_commit => trace_event_raw_event_sched_switch => __schedule => schedule => schedule_preempt_disabled => cpu_startup_entry => start_secondary The above shows that we are seeing 6 functions before ever making it to the caller of the sched_switch event. # echo stacktrace > events/sched/sched_switch/trigger # cat trace <idle>-0 [002] d..3 2146.335208: <stack trace> => trace_event_buffer_commit => trace_event_raw_event_sched_switch => __schedule => schedule => schedule_preempt_disabled => cpu_startup_entry => start_secondary The stacktrace trigger isn't as bad, because it adds its own skip to the stacktracing, but still has two events extra. One issue is that if the stacktrace passes its own "regs" then there should be no addition to the skip, as the regs will not include the functions being called. This was an issue that was fixed by commit 7717c6be6999 ("tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()" as adding the skip number for kprobes made the probes not have any stack at all. But since this is only an issue when regs is being used, a skip should be added if regs is NULL. Now we have: # echo 1 > options/stacktrace # echo 1 > events/sched/sched_switch/enable # cat trace <idle>-0 [000] d..2 1297.676333: <stack trace> => __schedule => schedule => schedule_preempt_disabled => cpu_startup_entry => rest_init => start_kernel => x86_64_start_reservations => x86_64_start_kernel # echo stacktrace > events/sched/sched_switch/trigger # cat trace <idle>-0 [002] d..3 1370.759745: <stack trace> => __schedule => schedule => schedule_preempt_disabled => cpu_startup_entry => start_secondary And kprobes are not touched. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/ring_buffer: Prevent concurent ring buffer accessJiri Olsa2018-11-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit cd6fb677ce7e460c25bdd66f689734102ec7d642 ] Some of the scheduling tracepoints allow the perf_tp_event code to write to ring buffer under different cpu than the code is running on. This results in corrupted ring buffer data demonstrated in following perf commands: # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.383 [sec] [ perf record: Woken up 8 times to write data ] 0x42b890 [0]: failed to process type: -1765585640 [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ] # perf report --stdio 0x42b890 [0]: failed to process type: -1765585640 The reason for the corruption are some of the scheduling tracepoints, that have __perf_task dfined and thus allow to store data to another cpu ring buffer: sched_waking sched_wakeup sched_wakeup_new sched_stat_wait sched_stat_sleep sched_stat_iowait sched_stat_blocked The perf_tp_event function first store samples for current cpu related events defined for tracepoint: hlist_for_each_entry_rcu(event, head, hlist_entry) perf_swevent_event(event, count, &data, regs); And then iterates events of the 'task' and store the sample for any task's event that passes tracepoint checks: ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]); list_for_each_entry_rcu(event, &ctx->event_list, event_entry) { if (event->attr.type != PERF_TYPE_TRACEPOINT) continue; if (event->attr.config != entry->type) continue; perf_swevent_event(event, count, &data, regs); } Above code can race with same code running on another cpu, ending up with 2 cpus trying to store under the same ring buffer, which is specifically not allowed. This patch prevents the problem, by allowing only events with the same current cpu to receive the event. NOTE: this requires the use of (per-task-)per-cpu buffers for this feature to work; perf-record does this. Signed-off-by: Jiri Olsa <jolsa@kernel.org> [peterz: small edits to Changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrew Vagin <avagin@openvz.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events") Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* cgroup: Fix deadlock in cpu hotplug pathPrateek Sood2018-10-131-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 116d2f7496c51b2e02e8e4ecdd2bdf5fb9d5a641 upstream. Deadlock during cgroup migration from cpu hotplug path when a task T is being moved from source to destination cgroup. kworker/0:0 cpuset_hotplug_workfn() cpuset_hotplug_update_tasks() hotplug_update_tasks_legacy() remove_tasks_in_empty_cpuset() cgroup_transfer_tasks() // stuck in iterator loop cgroup_migrate() cgroup_migrate_add_task() In cgroup_migrate_add_task() it checks for PF_EXITING flag of task T. Task T will not migrate to destination cgroup. css_task_iter_start() will keep pointing to task T in loop waiting for task T cg_list node to be removed. Task T do_exit() exit_signals() // sets PF_EXITING exit_task_namespaces() switch_task_namespaces() free_nsproxy() put_mnt_ns() drop_collected_mounts() namespace_unlock() synchronize_rcu() _synchronize_rcu_expedited() schedule_work() // on cpu0 low priority worker pool wait_event() // waiting for work item to execute Task T inserted a work item in the worklist of cpu0 low priority worker pool. It is waiting for expedited grace period work item to execute. This work item will only be executed once kworker/0:0 complete execution of cpuset_hotplug_workfn(). kworker/0:0 ==> Task T ==>kworker/0:0 In case of PF_EXITING task being migrated from source to destination cgroup, migrate next available task in source cgroup. Signed-off-by: Prateek Sood <prsood@codeaurora.org> Signed-off-by: Tejun Heo <tj@kernel.org> [AmitP: Upstream commit cherry-pick failed, so I picked the backported changes from CAF/msm-4.9 tree instead: https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?id=49b74f1696417b270c89cd893ca9f37088928078] Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* module: exclude SHN_UNDEF symbols from kallsyms apiJessica Yu2018-10-101-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9f2d1e68cf4d641def734adaccfc3823d3575e6c ] Livepatch modules are special in that we preserve their entire symbol tables in order to be able to apply relocations after module load. The unwanted side effect of this is that undefined (SHN_UNDEF) symbols of livepatch modules are accessible via the kallsyms api and this can confuse symbol resolution in livepatch (klp_find_object_symbol()) and cause subtle bugs in livepatch. Have the module kallsyms api skip over SHN_UNDEF symbols. These symbols are usually not available for normal modules anyway as we cut down their symbol tables to just the core (non-undefined) symbols, so this should really just affect livepatch modules. Note that this patch doesn't affect the display of undefined symbols in /proc/kallsyms. Reported-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* alarmtimer: Prevent overflow for relative nanosleepThomas Gleixner2018-10-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5f936e19cc0ef97dbe3a56e9498922ad5ba1edef ] Air Icy reported: UBSAN: Undefined behaviour in kernel/time/alarmtimer.c:811:7 signed integer overflow: 1529859276030040771 + 9223372036854775807 cannot be represented in type 'long long int' Call Trace: alarm_timer_nsleep+0x44c/0x510 kernel/time/alarmtimer.c:811 __do_sys_clock_nanosleep kernel/time/posix-timers.c:1235 [inline] __se_sys_clock_nanosleep kernel/time/posix-timers.c:1213 [inline] __x64_sys_clock_nanosleep+0x326/0x4e0 kernel/time/posix-timers.c:1213 do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290 alarm_timer_nsleep() uses ktime_add() to add the current time and the relative expiry value. ktime_add() has no sanity checks so the addition can overflow when the relative timeout is large enough. Use ktime_add_safe() which has the necessary sanity checks in place and limits the result to the valid range. Fixes: 9a7adcf5c6de ("timers: Posix interface for alarm-timers") Reported-by: Team OWL337 <icytxw@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1807020926360.1595@nanos.tec.linutronix.de Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ring-buffer: Allow for rescheduling when removing pagesVaibhav Nagarnaik2018-09-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | commit 83f365554e47997ec68dc4eca3f5dce525cd15c3 upstream. When reducing ring buffer size, pages are removed by scheduling a work item on each CPU for the corresponding CPU ring buffer. After the pages are removed from ring buffer linked list, the pages are free()d in a tight loop. The loop does not give up CPU until all pages are removed. In a worst case behavior, when lot of pages are to be freed, it can cause system stall. After the pages are removed from the list, the free() can happen while the work is rescheduled. Call cond_resched() in the loop to prevent the system hangup. Link: http://lkml.kernel.org/r/20180907223129.71994-1-vnagarnaik@google.com Cc: stable@vger.kernel.org Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Reported-by: Jason Behmer <jbehmer@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* audit: fix use-after-free in audit_add_watchRonny Chevalier2018-09-261-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit baa2a4fdd525c8c4b0f704d20457195b29437839 ] audit_add_watch stores locally krule->watch without taking a reference on watch. Then, it calls audit_add_to_parent, and uses the watch stored locally. Unfortunately, it is possible that audit_add_to_parent updates krule->watch. When it happens, it also drops a reference of watch which could free the watch. How to reproduce (with KASAN enabled): auditctl -w /etc/passwd -F success=0 -k test_passwd auditctl -w /etc/passwd -F success=1 -k test_passwd2 The second call to auditctl triggers the use-after-free, because audit_to_parent updates krule->watch to use a previous existing watch and drops the reference to the newly created watch. To fix the issue, we grab a reference of watch and we release it at the end of the function. Signed-off-by: Ronny Chevalier <ronny.chevalier@hp.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* locking/osq_lock: Fix osq_lock queue corruptionPrateek Sood2018-09-191-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 50972fe78f24f1cd0b9d7bbf1f87d2be9e4f412e upstream. Fix ordering of link creation between node->prev and prev->next in osq_lock(). A case in which the status of optimistic spin queue is CPU6->CPU2 in which CPU6 has acquired the lock. tail v ,-. <- ,-. |6| |2| `-' -> `-' At this point if CPU0 comes in to acquire osq_lock, it will update the tail count. CPU2 CPU0 ---------------------------------- tail v ,-. <- ,-. ,-. |6| |2| |0| `-' -> `-' `-' After tail count update if CPU2 starts to unqueue itself from optimistic spin queue, it will find an updated tail count with CPU0 and update CPU2 node->next to NULL in osq_wait_next(). unqueue-A tail v ,-. <- ,-. ,-. |6| |2| |0| `-' `-' `-' unqueue-B ->tail != curr && !node->next If reordering of following stores happen then prev->next where prev being CPU2 would be updated to point to CPU0 node: tail v ,-. <- ,-. ,-. |6| |2| |0| `-' `-' -> `-' osq_wait_next() node->next <- 0 xchg(node->next, NULL) tail v ,-. <- ,-. ,-. |6| |2| |0| `-' `-' `-' unqueue-C At this point if next instruction WRITE_ONCE(next->prev, prev); in CPU2 path is committed before the update of CPU0 node->prev = prev then CPU0 node->prev will point to CPU6 node. tail v----------. v ,-. <- ,-. ,-. |6| |2| |0| `-' `-' `-' `----------^ At this point if CPU0 path's node->prev = prev is committed resulting in change of CPU0 prev back to CPU2 node. CPU2 node->next is NULL currently, tail v ,-. <- ,-. <- ,-. |6| |2| |0| `-' `-' `-' `----------^ so if CPU0 gets into unqueue path of osq_lock it will keep spinning in infinite loop as condition prev->next == node will never be true. Signed-off-by: Prateek Sood <prsood@codeaurora.org> [ Added pictures, rewrote comments. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sramana@codeaurora.org Link: http://lkml.kernel.org/r/1500040076-27626-1-git-send-email-prsood@codeaurora.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* locking/rwsem-xadd: Fix missed wakeup due to reordering of loadPrateek Sood2018-09-191-0/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9c29c31830a4eca724e137a9339137204bbb31be upstream. If a spinner is present, there is a chance that the load of rwsem_has_spinner() in rwsem_wake() can be reordered with respect to decrement of rwsem count in __up_write() leading to wakeup being missed: spinning writer up_write caller --------------- ----------------------- [S] osq_unlock() [L] osq spin_lock(wait_lock) sem->count=0xFFFFFFFF00000001 +0xFFFFFFFF00000000 count=sem->count MB sem->count=0xFFFFFFFE00000001 -0xFFFFFFFF00000001 spin_trylock(wait_lock) return rwsem_try_write_lock(count) spin_unlock(wait_lock) schedule() Reordering of atomic_long_sub_return_release() in __up_write() and rwsem_has_spinner() in rwsem_wake() can cause missing of wakeup in up_write() context. In spinning writer, sem->count and local variable count is 0XFFFFFFFE00000001. It would result in rwsem_try_write_lock() failing to acquire rwsem and spinning writer going to sleep in rwsem_down_write_failed(). The smp_rmb() will make sure that the spinner state is consulted after sem->count is updated in up_write context. Signed-off-by: Prateek Sood <prsood@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Cc: longman@redhat.com Cc: parri.andrea@gmail.com Cc: sramana@codeaurora.org Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kthread: Fix use-after-free if kthread fork failsVegard Nossum2018-09-191-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4d6501dce079c1eb6bf0b1d8f528a5e81770109e upstream. If a kthread forks (e.g. usermodehelper since commit 1da5c46fa965) but fails in copy_process() between calling dup_task_struct() and setting p->set_child_tid, then the value of p->set_child_tid will be inherited from the parent and get prematurely freed by free_kthread_struct(). kthread() - worker_thread() - process_one_work() | - call_usermodehelper_exec_work() | - kernel_thread() | - _do_fork() | - copy_process() | - dup_task_struct() | - arch_dup_task_struct() | - tsk->set_child_tid = current->set_child_tid // implied | - ... | - goto bad_fork_* | - ... | - free_task(tsk) | - free_kthread_struct(tsk) | - kfree(tsk->set_child_tid) - ... - schedule() - __schedule() - wq_worker_sleeping() - kthread_data(task)->flags // UAF The problem started showing up with commit 1da5c46fa965 since it reused ->set_child_tid for the kthread worker data. A better long-term solution might be to get rid of the ->set_child_tid abuse. The comment in set_kthread_struct() also looks slightly wrong. Debugged-by: Jamie Iles <jamie.iles@oracle.com> Fixes: 1da5c46fa965 ("kthread: Make struct kthread kmalloc'ed") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jamie Iles <jamie.iles@oracle.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* genirq: Delay incrementing interrupt count if it's disabled/pendingSudeep Holla2018-09-151-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | commit a946e8c717f9355d1abd5408ed0adc0002d1aed1 upstream. In case of a wakeup interrupt, irq_pm_check_wakeup disables the interrupt and marks it pending and suspended, disables it and notifies the pm core about the wake event. The interrupt gets handled later once the system is resumed. However the irq stats is updated twice: once when it's disabled waiting for the system to resume and later when it's handled, resulting in wrong counting of the wakeup interrupt when waking up the system. This patch updates the interrupt count so that it's updated only when the interrupt gets handled. It's already handled correctly in handle_edge_irq and handle_edge_eoi_irq. Reported-by: Manoil Claudiu <claudiu.manoil@freescale.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/1446661957-1019-1-git-send-email-sudeep.holla@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fork: don't copy inconsistent signal handler state to childJann Horn2018-09-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 06e62a46bbba20aa5286102016a04214bb446141 ] Before this change, if a multithreaded process forks while one of its threads is changing a signal handler using sigaction(), the memcpy() in copy_sighand() can race with the struct assignment in do_sigaction(). It isn't clear whether this can cause corruption of the userspace signal handler pointer, but it definitely can cause inconsistency between different fields of struct sigaction. Take the appropriate spinlock to avoid this. I have tested that this patch prevents inconsistency between sa_sigaction and sa_flags, which is possible before this patch. Link: http://lkml.kernel.org/r/20180702145108.73189-1-jannh@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* userns: move user access out of the mutexJann Horn2018-09-091-20/+19
| | | | | | | | | | | | | | | | | | | | commit 5820f140edef111a9ea2ef414ab2428b8cb805b1 upstream. The old code would hold the userns_state_mutex indefinitely if memdup_user_nul stalled due to e.g. a userfault region. Prevent that by moving the memdup_user_nul in front of the mutex_lock(). Note: This changes the error precedence of invalid buf/count/*ppos vs map already written / capabilities missing. Fixes: 22d917d80e84 ("userns: Rework the user_namespace adding uid/gid...") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Christian Brauner <christian@brauner.io> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sys: don't hold uts_sem while accessing userspace memoryJann Horn2018-09-092-66/+70
| | | | | | | | | | | | | | | | commit 42a0cc3478584d4d63f68f2f5af021ddbea771fa upstream. Holding uts_sem as a writer while accessing userspace memory allows a namespace admin to stall all processes that attempt to take uts_sem. Instead, move data through stack buffers and don't access userspace memory while uts_sem is held. Cc: stable@vger.kernel.org Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* PM / sleep: wakeup: Fix build error caused by missing SRCU supportzhangyi (F)2018-09-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3df6f61fff49632492490fb6e42646b803a9958a upstream. Commit ea0212f40c6 (power: auto select CONFIG_SRCU) made the code in drivers/base/power/wakeup.c use SRCU instead of RCU, but it forgot to select CONFIG_SRCU in Kconfig, which leads to the following build error if CONFIG_SRCU is not selected somewhere else: drivers/built-in.o: In function `wakeup_source_remove': (.text+0x3c6fc): undefined reference to `synchronize_srcu' drivers/built-in.o: In function `pm_print_active_wakeup_sources': (.text+0x3c7a8): undefined reference to `__srcu_read_lock' drivers/built-in.o: In function `pm_print_active_wakeup_sources': (.text+0x3c84c): undefined reference to `__srcu_read_unlock' drivers/built-in.o: In function `device_wakeup_arm_wake_irqs': (.text+0x3d1d8): undefined reference to `__srcu_read_lock' drivers/built-in.o: In function `device_wakeup_arm_wake_irqs': (.text+0x3d228): undefined reference to `__srcu_read_unlock' drivers/built-in.o: In function `device_wakeup_disarm_wake_irqs': (.text+0x3d24c): undefined reference to `__srcu_read_lock' drivers/built-in.o: In function `device_wakeup_disarm_wake_irqs': (.text+0x3d29c): undefined reference to `__srcu_read_unlock' drivers/built-in.o:(.data+0x4158): undefined reference to `process_srcu' Fix this error by selecting CONFIG_SRCU when PM_SLEEP is enabled. Fixes: ea0212f40c6 (power: auto select CONFIG_SRCU) Cc: 4.2+ <stable@vger.kernel.org> # 4.2+ Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> [ rjw: Minor subject/changelog fixups ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* uprobes: Use synchronize_rcu() not synchronize_sched()Steven Rostedt (VMware)2018-09-091-1/+1
| | | | | | | | | | | | | | | | | | | | commit 016f8ffc48cb01d1e7701649c728c5d2e737d295 upstream. While debugging another bug, I was looking at all the synchronize*() functions being used in kernel/trace, and noticed that trace_uprobes was using synchronize_sched(), with a comment to synchronize with {u,ret}_probe_trace_func(). When looking at those functions, the data is protected with "rcu_read_lock()" and not with "rcu_read_lock_sched()". This is using the wrong synchronize_*() function. Link: http://lkml.kernel.org/r/20180809160553.469e1e32@gandalf.local.home Cc: stable@vger.kernel.org Fixes: 70ed91c6ec7f8 ("tracing/uprobes: Support ftrace_event_file base multibuffer") Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kthread, tracing: Don't expose half-written comm when creating kthreadsSnild Dolkow2018-09-091-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3e536e222f2930534c252c1cc7ae799c725c5ff9 upstream. There is a window for racing when printing directly to task->comm, allowing other threads to see a non-terminated string. The vsnprintf function fills the buffer, counts the truncated chars, then finally writes the \0 at the end. creator other vsnprintf: fill (not terminated) count the rest trace_sched_waking(p): ... memcpy(comm, p->comm, TASK_COMM_LEN) write \0 The consequences depend on how 'other' uses the string. In our case, it was copied into the tracing system's saved cmdlines, a buffer of adjacent TASK_COMM_LEN-byte buffers (note the 'n' where 0 should be): crash-arm64> x/1024s savedcmd->saved_cmdlines | grep 'evenk' 0xffffffd5b3818640: "irq/497-pwr_evenkworker/u16:12" ...and a strcpy out of there would cause stack corruption: [224761.522292] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffff9bf9783c78 crash-arm64> kbt | grep 'comm\|trace_print_context' #6 0xffffff9bf9783c78 in trace_print_context+0x18c(+396) comm (char [16]) = "irq/497-pwr_even" crash-arm64> rd 0xffffffd4d0e17d14 8 ffffffd4d0e17d14: 2f71726900000000 5f7277702d373934 ....irq/497-pwr_ ffffffd4d0e17d24: 726f776b6e657665 3a3631752f72656b evenkworker/u16: ffffffd4d0e17d34: f9780248ff003231 cede60e0ffffff9b 12..H.x......`.. ffffffd4d0e17d44: cede60c8ffffffd4 00000fffffffffd4 .....`.......... The workaround in e09e28671 (use strlcpy in __trace_find_cmdline) was likely needed because of this same bug. Solved by vsnprintf:ing to a local buffer, then using set_task_comm(). This way, there won't be a window where comm is not terminated. Link: http://lkml.kernel.org/r/20180726071539.188015-1-snild@sony.com Cc: stable@vger.kernel.org Fixes: bc0c38d139ec7 ("ftrace: latency tracer infrastructure") Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Snild Dolkow <snild@sony.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> [backported to 3.18 / 4.4 by Snild] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing/blktrace: Fix to allow setting same valueSteven Rostedt (VMware)2018-09-091-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 757d9140072054528b13bbe291583d9823cde195 upstream. Masami Hiramatsu reported: Current trace-enable attribute in sysfs returns an error if user writes the same setting value as current one, e.g. # cat /sys/block/sda/trace/enable 0 # echo 0 > /sys/block/sda/trace/enable bash: echo: write error: Invalid argument # echo 1 > /sys/block/sda/trace/enable # echo 1 > /sys/block/sda/trace/enable bash: echo: write error: Device or resource busy But this is not a preferred behavior, it should ignore if new setting is same as current one. This fixes the problem as below. # cat /sys/block/sda/trace/enable 0 # echo 0 > /sys/block/sda/trace/enable # echo 1 > /sys/block/sda/trace/enable # echo 1 > /sys/block/sda/trace/enable Link: http://lkml.kernel.org/r/20180816103802.08678002@gandalf.local.home Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: stable@vger.kernel.org Fixes: cd649b8bb830d ("blktrace: remove sysfs_blk_trace_enable_show/store()") Reported-by: Masami Hiramatsu <mhiramat@kernel.org> Tested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Do not call start/stop() functions when tracing_on does not changeSteven Rostedt (VMware)2018-09-091-1/+3
| | | | | | | | | | | | | | | | | | | | commit f143641bfef9a4a60c57af30de26c63057e7e695 upstream. Currently, when one echo's in 1 into tracing_on, the current tracer's "start()" function is executed, even if tracing_on was already one. This can lead to strange side effects. One being that if the hwlat tracer is enabled, and someone does "echo 1 > tracing_on" into tracing_on, the hwlat tracer's start() function is called again which will recreate another kernel thread, and make it unable to remove the old one. Link: http://lkml.kernel.org/r/1533120354-22923-1-git-send-email-erica.bugden@linutronix.de Cc: stable@vger.kernel.org Fixes: 2df8f8a6a897e ("tracing: Fix regression with irqsoff tracer and tracing_on file") Reported-by: Erica Bugden <erica.bugden@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kprobes: Make list and blacklist root user read onlyMasami Hiramatsu2018-09-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f2a3ab36077222437b4826fc76111caa14562b7c upstream. Since the blacklist and list files on debugfs indicates a sensitive address information to reader, it should be restricted to the root user. Suggested-by: Thomas Richter <tmricht@linux.ibm.com> Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Howells <dhowells@redhat.com> Cc: David S . Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jon Medhurst <tixy@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tobin C . Harding <me@tobin.cc> Cc: Will Deacon <will.deacon@arm.com> Cc: acme@kernel.org Cc: akpm@linux-foundation.org Cc: brueckner@linux.vnet.ibm.com Cc: linux-arch@vger.kernel.org Cc: rostedt@goodmis.org Cc: schwidefsky@de.ibm.com Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/lkml/152491890171.9916.5183693615601334087.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/sysctl: Check user input value of sysctl_sched_time_avgEthan Zhao2018-09-051-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5ccba44ba118a5000cccc50076b0344632459779 upstream. System will hang if user set sysctl_sched_time_avg to 0: [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3 ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000 0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8 ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08 Call Trace: <IRQ> [<ffffffff810c4dd0>] ? update_group_capacity+0x110/0x200 [<ffffffff810c4fc9>] ? update_sd_lb_stats+0x109/0x600 [<ffffffff810c5507>] ? find_busiest_group+0x47/0x530 [<ffffffff810c5b84>] ? load_balance+0x194/0x900 [<ffffffff810ad5ca>] ? update_rq_clock.part.83+0x1a/0xe0 [<ffffffff810c6d42>] ? rebalance_domains+0x152/0x290 [<ffffffff810c6f5c>] ? run_rebalance_domains+0xdc/0x1d0 [<ffffffff8108a75b>] ? __do_softirq+0xfb/0x320 [<ffffffff8108ac85>] ? irq_exit+0x125/0x130 [<ffffffff810b3a17>] ? scheduler_ipi+0x97/0x160 [<ffffffff81052709>] ? smp_reschedule_interrupt+0x29/0x30 [<ffffffff8173a1be>] ? reschedule_interrupt+0x6e/0x80 <EOI> [<ffffffff815bc83c>] ? cpuidle_enter_state+0xcc/0x230 [<ffffffff815bc80c>] ? cpuidle_enter_state+0x9c/0x230 [<ffffffff815bc9d7>] ? cpuidle_enter+0x17/0x20 [<ffffffff810cd6dc>] ? cpu_startup_entry+0x38c/0x420 [<ffffffff81053373>] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function: update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hitting invalid input, and set the minimum limit of sysctl_sched_time_avg to 1 ms. Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: ethan.kernel@gmail.com Cc: keescook@chromium.org Cc: mcgrof@kernel.org Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Steve Muckle <smuckle@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Use __printf markup to silence compilerMathieu Malaterre2018-08-241-0/+5
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 26b68dd2f48fe7699a89f0cfbb9f4a650dc1c837 ] Silence warnings (triggered at W=1) by adding relevant __printf attributes. CC kernel/trace/trace.o kernel/trace/trace.c: In function ‘__trace_array_vprintk’: kernel/trace/trace.c:2979:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format] len = vscnprintf(tbuffer, TRACE_BUF_SIZE, fmt, args); ^~~ AR kernel/trace/built-in.o Link: http://lkml.kernel.org/r/20180308205843.27447-1-malat@debian.org Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* locking/lockdep: Do not record IRQ state within lockdep codeSteven Rostedt (VMware)2018-08-241-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit fcc784be837714a9173b372ff9fb9b514590dad9 ] While debugging where things were going wrong with mapping enabling/disabling interrupts with the lockdep state and actual real enabling and disabling interrupts, I had to silent the IRQ disabling/enabling in debug_check_no_locks_freed() because it was always showing up as it was called before the splat was. Use raw_local_irq_save/restore() for not only debug_check_no_locks_freed() but for all internal lockdep functions, as they hide useful information about where interrupts were used incorrectly last. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/lkml/20180404140630.3f4f4c7a@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ring_buffer: tracing: Inherit the tracing setting to next ring bufferMasami Hiramatsu2018-08-092-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 73c8d8945505acdcbae137c2e00a1232e0be709f upstream. Maintain the tracing on/off setting of the ring_buffer when switching to the trace buffer snapshot. Taking a snapshot is done by swapping the backup ring buffer (max_tr_buffer). But since the tracing on/off setting is defined by the ring buffer, when swapping it, the tracing on/off setting can also be changed. This causes a strange result like below: /sys/kernel/debug/tracing # cat tracing_on 1 /sys/kernel/debug/tracing # echo 0 > tracing_on /sys/kernel/debug/tracing # cat tracing_on 0 /sys/kernel/debug/tracing # echo 1 > snapshot /sys/kernel/debug/tracing # cat tracing_on 1 /sys/kernel/debug/tracing # echo 1 > snapshot /sys/kernel/debug/tracing # cat tracing_on 0 We don't touch tracing_on, but snapshot changes tracing_on setting each time. This is an anomaly, because user doesn't know that each "ring_buffer" stores its own tracing-enable state and the snapshot is done by swapping ring buffers. Link: http://lkml.kernel.org/r/153149929558.11274.11730609978254724394.stgit@devbox Cc: Ingo Molnar <mingo@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Hiraku Toyooka <hiraku.toyooka@cybertrust.co.jp> Cc: stable@vger.kernel.org Fixes: debdd57f5145 ("tracing: Make a snapshot feature available from userspace") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> [ Updated commit log and comment in the code ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nohz: Fix local_timer_softirq_pending()Anna-Maria Gleixner2018-08-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | commit 80d20d35af1edd632a5e7a3b9c0ab7ceff92769e upstream. local_timer_softirq_pending() checks whether the timer softirq is pending with: local_softirq_pending() & TIMER_SOFTIRQ. This is wrong because TIMER_SOFTIRQ is the softirq number and not a bitmask. So the test checks for the wrong bit. Use BIT(TIMER_SOFTIRQ) instead. Fixes: 5d62c183f9e9 ("nohz: Prevent a timer interrupt storm in tick_nohz_stop_sched_tick()") Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Cc: bigeasy@linutronix.de Cc: peterz@infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180731161358.29472-1-anna-maria@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* genirq: Make force irq threading setup more robustThomas Gleixner2018-08-091-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d1f0301b3333eef5efbfa1fe0f0edbea01863d5d upstream. The support of force threading interrupts which are set up with both a primary and a threaded handler wreckaged the setup of regular requested threaded interrupts (primary handler == NULL). The reason is that it does not check whether the primary handler is set to the default handler which wakes the handler thread. Instead it replaces the thread handler with the primary handler as it would do with force threaded interrupts which have been requested via request_irq(). So both the primary and the thread handler become the same which then triggers the warnon that the thread handler tries to wakeup a not configured secondary thread. Fortunately this only happens when the driver omits the IRQF_ONESHOT flag when requesting the threaded interrupt, which is normaly caught by the sanity checks when force irq threading is disabled. Fix it by skipping the force threading setup when a regular threaded interrupt is requested. As a consequence the interrupt request which lacks the IRQ_ONESHOT flag is rejected correctly instead of silently wreckaging it. Fixes: 2a1d3ab8986d ("genirq: Handle force threading of irqs with primary and thread handler") Reported-by: Kurt Kanzenbach <kurt.kanzenbach@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Kurt Kanzenbach <kurt.kanzenbach@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* audit: allow not equal op for audit by executableOndrej Mosnáček2018-08-062-1/+3
| | | | | | | | | | | | | | | | [ Upstream commit 23bcc480dac204c7dbdf49d96b2c918ed98223c2 ] Current implementation of auditing by executable name only implements the 'equal' operator. This patch extends it to also support the 'not equal' operator. See: https://github.com/linux-audit/audit-kernel/issues/53 Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: fix references to free_bpf_prog_info() in commentsJakub Kicinski2018-08-061-2/+2
| | | | | | | | | | | | | | [ Upstream commit ab7f5bf0928be2f148d000a6eaa6c0a36e74750e ] Comments in the verifier refer to free_bpf_prog_info() which seems to have never existed in tree. Replace it with free_used_maps(). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Quiet gcc warning about maybe unused link variableSteven Rostedt (VMware)2018-08-061-2/+4
| | | | | | | | | | | | | | | | | | | | | | | commit 2519c1bbe38d7acacc9aacba303ca6f97482ed53 upstream. Commit 57ea2a34adf4 ("tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure") added an if statement that depends on another if statement that gcc doesn't see will initialize the "link" variable and gives the warning: "warning: 'link' may be used uninitialized in this function" It is really a false positive, but to quiet the warning, and also to make sure that it never actually is used uninitialized, initialize the "link" variable to NULL and add an if (!WARN_ON_ONCE(!link)) where the compiler thinks it could be used uninitialized. Cc: stable@vger.kernel.org Fixes: 57ea2a34adf4 ("tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failureArtem Savkov2018-08-061-2/+11
| | | | | | | | | | | | | | | | | | | | | | | commit 57ea2a34adf40f3a6e88409aafcf803b8945619a upstream. If enable_trace_kprobe fails to enable the probe in enable_k(ret)probe it returns an error, but does not unset the tp flags it set previously. This results in a probe being considered enabled and failures like being unable to remove the probe through kprobe_events file since probes_open() expects every probe to be disabled. Link: http://lkml.kernel.org/r/20180725102826.8300-1-asavkov@redhat.com Link: http://lkml.kernel.org/r/20180725142038.4765-1-asavkov@redhat.com Cc: Ingo Molnar <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: 41a7dd420c57 ("tracing/kprobes: Support ftrace_event_file base multibuffer") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix possible double free in event_enable_trigger_func()Steven Rostedt (VMware)2018-08-061-1/+5
| | | | | | | | | | | | | | | | | | | | | | | commit 15cc78644d0075e76d59476a4467e7143860f660 upstream. There was a case that triggered a double free in event_trigger_callback() due to the called reg() function freeing the trigger_data and then it getting freed again by the error return by the caller. The solution there was to up the trigger_data ref count. Code inspection found that event_enable_trigger_func() has the same issue, but is not as easy to trigger (requires harder to trigger failures). It needs to be solved slightly different as it needs more to clean up when the reg() function fails. Link: http://lkml.kernel.org/r/20180725124008.7008e586@gandalf.local.home Cc: stable@vger.kernel.org Fixes: 7862ad1846e99 ("tracing: Add 'enable_event' and 'disable_event' event trigger commands") Reivewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix double free of event_trigger_dataSteven Rostedt (VMware)2018-08-061-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1863c387259b629e4ebfb255495f67cd06aa229b upstream. Running the following: # cd /sys/kernel/debug/tracing # echo 500000 > buffer_size_kb [ Or some other number that takes up most of memory ] # echo snapshot > events/sched/sched_switch/trigger Triggers the following bug: ------------[ cut here ]------------ kernel BUG at mm/slub.c:296! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI CPU: 6 PID: 6878 Comm: bash Not tainted 4.18.0-rc6-test+ #1066 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:kfree+0x16c/0x180 Code: 05 41 0f b6 72 51 5b 5d 41 5c 4c 89 d7 e9 ac b3 f8 ff 48 89 d9 48 89 da 41 b8 01 00 00 00 5b 5d 41 5c 4c 89 d6 e9 f4 f3 ff ff <0f> 0b 0f 0b 48 8b 3d d9 d8 f9 00 e9 c1 fe ff ff 0f 1f 40 00 0f 1f RSP: 0018:ffffb654436d3d88 EFLAGS: 00010246 RAX: ffff91a9d50f3d80 RBX: ffff91a9d50f3d80 RCX: ffff91a9d50f3d80 RDX: 00000000000006a4 RSI: ffff91a9de5a60e0 RDI: ffff91a9d9803500 RBP: ffffffff8d267c80 R08: 00000000000260e0 R09: ffffffff8c1a56be R10: fffff0d404543cc0 R11: 0000000000000389 R12: ffffffff8c1a56be R13: ffff91a9d9930e18 R14: ffff91a98c0c2890 R15: ffffffff8d267d00 FS: 00007f363ea64700(0000) GS:ffff91a9de580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055c1cacc8e10 CR3: 00000000d9b46003 CR4: 00000000001606e0 Call Trace: event_trigger_callback+0xee/0x1d0 event_trigger_write+0xfc/0x1a0 __vfs_write+0x33/0x190 ? handle_mm_fault+0x115/0x230 ? _cond_resched+0x16/0x40 vfs_write+0xb0/0x190 ksys_write+0x52/0xc0 do_syscall_64+0x5a/0x160 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f363e16ab50 Code: 73 01 c3 48 8b 0d 38 83 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 79 db 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 1e e3 01 00 48 89 04 24 RSP: 002b:00007fff9a4c6378 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f363e16ab50 RDX: 0000000000000009 RSI: 000055c1cacc8e10 RDI: 0000000000000001 RBP: 000055c1cacc8e10 R08: 00007f363e435740 R09: 00007f363ea64700 R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000009 R13: 0000000000000001 R14: 00007f363e4345e0 R15: 00007f363e4303c0 Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device i915 snd_pcm snd_timer i2c_i801 snd soundcore i2c_algo_bit drm_kms_helper 86_pkg_temp_thermal video kvm_intel kvm irqbypass wmi e1000e ---[ end trace d301afa879ddfa25 ]--- The cause is because the register_snapshot_trigger() call failed to allocate the snapshot buffer, and then called unregister_trigger() which freed the data that was passed to it. Then on return to the function that called register_snapshot_trigger(), as it sees it failed to register, it frees the trigger_data again and causes a double free. By calling event_trigger_init() on the trigger_data (which only ups the reference counter for it), and then event_trigger_free() afterward, the trigger_data would not get freed by the registering trigger function as it would only up and lower the ref count for it. If the register trigger function fails, then the event_trigger_free() called after it will free the trigger data normally. Link: http://lkml.kernel.org/r/20180724191331.738eb819@gandalf.local.home Cc: stable@vger.kerne.org Fixes: 93e31ffbf417 ("tracing: Add 'snapshot' event trigger command") Reported-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* seccomp: Move speculation migitation control to arch codeThomas Gleixner2018-07-251-13/+2
| | | | | | | | | | | | | | | | | | | | | | commit 8bf37d8c067bb7eb8e7c381bdadf9bd89182b6bc upstream The migitation control is simpler to implement in architecture code as it avoids the extra function call to check the mode. Aside of that having an explicit seccomp enabled mode in the architecture mitigations would require even more workarounds. Move it into architecture code and provide a weak function in the seccomp code. Remove the 'which' argument as this allows the architecture to decide which mitigations are relevant for seccomp. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com> Reviewed-by: Alexey Makhalov <amakhalov@vmware.com> Reviewed-by: Bo Gan <ganb@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* seccomp: Add filter flag to opt-out of SSB mitigationKees Cook2018-07-251-8/+11
| | | | | | | | | | | | | | | | | | commit 00a02d0c502a06d15e07b857f8ff921e3e402675 upstream If a seccomp user is not interested in Speculative Store Bypass mitigation by default, it can set the new SECCOMP_FILTER_FLAG_SPEC_ALLOW flag when adding filters. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com> Reviewed-by: Alexey Makhalov <amakhalov@vmware.com> Reviewed-by: Bo Gan <ganb@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* seccomp: Use PR_SPEC_FORCE_DISABLEThomas Gleixner2018-07-251-1/+1
| | | | | | | | | | | | | | | | commit b849a812f7eb92e96d1c8239b06581b2cfd8b275 upstream Use PR_SPEC_FORCE_DISABLE in seccomp() because seccomp does not allow to widen restrictions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com> Reviewed-by: Alexey Makhalov <amakhalov@vmware.com> Reviewed-by: Bo Gan <ganb@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* seccomp: Enable speculation flaw mitigationsKees Cook2018-07-251-0/+17
| | | | | | | | | | | | | | | | | | commit 5c3070890d06ff82eecb808d02d2ca39169533ef upstream When speculation flaw mitigations are opt-in (via prctl), using seccomp will automatically opt-in to these protections, since using seccomp indicates at least some level of sandboxing is desired. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com> Reviewed-by: Alexey Makhalov <amakhalov@vmware.com> Reviewed-by: Bo Gan <ganb@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nospec: Allow getting/setting on non-current taskKees Cook2018-07-251-4/+5
| | | | | | | | | | | | | | | | | | | | commit 7bbf1373e228840bb0295a2ca26d548ef37f448e upstream Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than current. This is needed both for /proc/$pid/status queries and for seccomp (since thread-syncing can trigger seccomp in non-current threads). Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com> Reviewed-by: Alexey Makhalov <amakhalov@vmware.com> Reviewed-by: Bo Gan <ganb@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* prctl: Add speculation control prctlsThomas Gleixner2018-07-251-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b617cfc858161140d69cc0b5cc211996b557a1c7 upstream Add two new prctls to control aspects of speculation related vulnerabilites and their mitigations to provide finer grained control over performance impacting mitigations. PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature which is selected with arg2 of prctl(2). The return value uses bit 0-2 with the following meaning: Bit Define Description 0 PR_SPEC_PRCTL Mitigation can be controlled per task by PR_SET_SPECULATION_CTRL 1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is disabled 2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is enabled If all bits are 0 the CPU is not affected by the speculation misfeature. If PR_SPEC_PRCTL is set, then the per task control of the mitigation is available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation misfeature will fail. PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which is selected by arg2 of prctl(2) per task. arg3 is used to hand in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE. The common return values are: EINVAL prctl is not implemented by the architecture or the unused prctl() arguments are not 0 ENODEV arg2 is selecting a not supported speculation misfeature PR_SET_SPECULATION_CTRL has these additional return values: ERANGE arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE ENXIO prctl control of the selected speculation misfeature is disabled The first supported controlable speculation misfeature is PR_SPEC_STORE_BYPASS. Add the define so this can be shared between architectures. Based on an initial patch from Tim Chen and mostly rewritten. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu> Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com> Reviewed-by: Alexey Makhalov <amakhalov@vmware.com> Reviewed-by: Bo Gan <ganb@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* PM / hibernate: Fix oops at snapshot_write()Tetsuo Handa2018-07-171-0/+5
| | | | | | | | | | | | | | | | commit fc14eebfc20854a38fd9f1d93a42b1783dad4d17 upstream. syzbot is reporting NULL pointer dereference at snapshot_write() [1]. This is because data->handle is zero-cleared by ioctl(SNAPSHOT_FREE). Fix this by checking data_of(data->handle) != NULL before using it. [1] https://syzkaller.appspot.com/bug?id=828a3c71bd344a6de8b6a31233d51a72099f27fd Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+ae590932da6e45d6564d@syzkaller.appspotmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix missing return symbol in function_graph outputChangbin Du2018-07-111-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1fe4293f4b8de75824935f8d8e9a99c7fc6873da upstream. The function_graph tracer does not show the interrupt return marker for the leaf entry. On leaf entries, we see an unbalanced interrupt marker (the interrupt was entered, but nevern left). Before: 1) | SyS_write() { 1) | __fdget_pos() { 1) 0.061 us | __fget_light(); 1) 0.289 us | } 1) | vfs_write() { 1) 0.049 us | rw_verify_area(); 1) + 15.424 us | __vfs_write(); 1) ==========> | 1) 6.003 us | smp_apic_timer_interrupt(); 1) 0.055 us | __fsnotify_parent(); 1) 0.073 us | fsnotify(); 1) + 23.665 us | } 1) + 24.501 us | } After: 0) | SyS_write() { 0) | __fdget_pos() { 0) 0.052 us | __fget_light(); 0) 0.328 us | } 0) | vfs_write() { 0) 0.057 us | rw_verify_area(); 0) | __vfs_write() { 0) ==========> | 0) 8.548 us | smp_apic_timer_interrupt(); 0) <========== | 0) + 36.507 us | } /* __vfs_write */ 0) 0.049 us | __fsnotify_parent(); 0) 0.066 us | fsnotify(); 0) + 50.064 us | } 0) + 50.952 us | } Link: http://lkml.kernel.org/r/1517413729-20411-1-git-send-email-changbin.du@intel.com Cc: stable@vger.kernel.org Fixes: f8b755ac8e0cc ("tracing/function-graph-tracer: Output arrows signal on hardirq call/return") Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* time: Make sure jiffies_to_msecs() preserves non-zero time periodsGeert Uytterhoeven2018-07-031-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit abcbcb80cd09cd40f2089d912764e315459b71f7 upstream. For the common cases where 1000 is a multiple of HZ, or HZ is a multiple of 1000, jiffies_to_msecs() never returns zero when passed a non-zero time period. However, if HZ > 1000 and not an integer multiple of 1000 (e.g. 1024 or 1200, as used on alpha and DECstation), jiffies_to_msecs() may return zero for small non-zero time periods. This may break code that relies on receiving back a non-zero value. jiffies_to_usecs() does not need such a fix: one jiffy can only be less than one µs if HZ > 1000000, and such large values of HZ are already rejected at build time, twice: - include/linux/jiffies.h does #error if HZ >= 12288, - kernel/time/time.c has BUILD_BUG_ON(HZ > USEC_PER_SEC). Broken since forever. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Stephen Boyd <sboyd@kernel.org> Cc: linux-alpha@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180622143357.7495-1-geert@linux-m68k.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix crash when freeing instances with event triggersSteven Rostedt (VMware)2018-06-061-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 86b389ff22bd6ad8fd3cb98e41cd271886c6d023 upstream. If a instance has an event trigger enabled when it is freed, it could cause an access of free memory. Here's the case that crashes: # cd /sys/kernel/tracing # mkdir instances/foo # echo snapshot > instances/foo/events/initcall/initcall_start/trigger # rmdir instances/foo Would produce: general protection fault: 0000 [#1] PREEMPT SMP PTI Modules linked in: tun bridge ... CPU: 5 PID: 6203 Comm: rmdir Tainted: G W 4.17.0-rc4-test+ #933 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:clear_event_triggers+0x3b/0x70 RSP: 0018:ffffc90003783de0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800c7130ba0 RBP: ffffc90003783e00 R08: ffff8801131993f8 R09: 0000000100230016 R10: ffffc90003783d80 R11: 0000000000000000 R12: ffff8800c7130ba0 R13: ffff8800c7130bd8 R14: ffff8800cc093768 R15: 00000000ffffff9c FS: 00007f6f4aa86700(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6f4a5aed60 CR3: 00000000cd552001 CR4: 00000000001606e0 Call Trace: event_trace_del_tracer+0x2a/0xc5 instance_rmdir+0x15c/0x200 tracefs_syscall_rmdir+0x52/0x90 vfs_rmdir+0xdb/0x160 do_rmdir+0x16d/0x1c0 __x64_sys_rmdir+0x17/0x20 do_syscall_64+0x55/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe This was due to the call the clears out the triggers when an instance is being deleted not removing the trigger from the link list. Cc: stable@vger.kernel.org Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kdb: make "mdr" command repeatRandy Dunlap2018-05-301-6/+21
| | | | | | | | | | | | | | | [ Upstream commit 1e0ce03bf142454f38a5fc050bf4fd698d2d36d8 ] The "mdr" command should repeat (continue) when only Enter/Return is pressed, so make it do so. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: kgdb-bugreport@lists.sourceforge.net Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>