summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* tracing: Do not stop recording comms if the trace file is being readSteven Rostedt (VMware)2021-06-301-9/+0
| | | | | | | | | | | | | | | | commit 4fdd595e4f9a1ff6d93ec702eaecae451cfc6591 upstream. A while ago, when the "trace" file was opened, tracing was stopped, and code was added to stop recording the comms to saved_cmdlines, for mapping of the pids to the task name. Code has been added that only records the comm if a trace event occurred, and there's no reason to not trace it if the trace file is opened. Cc: stable@vger.kernel.org Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Do not stop recording cmdlines when tracing is offSteven Rostedt (VMware)2021-06-301-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 85550c83da421fb12dc1816c45012e1e638d2b38 upstream. The saved_cmdlines is used to map pids to the task name, such that the output of the tracing does not just show pids, but also gives a human readable name for the task. If the name is not mapped, the output looks like this: <...>-1316 [005] ...2 132.044039: ... Instead of this: gnome-shell-1316 [005] ...2 132.044039: ... The names are updated when tracing is running, but are skipped if tracing is stopped. Unfortunately, this stops the recording of the names if the top level tracer is stopped, and not if there's other tracers active. The recording of a name only happens when a new event is written into a ring buffer, so there is no need to test if tracing is on or not. If tracing is off, then no event is written and no need to test if tracing is off or not. Remove the check, as it hides the names of tasks for events in the instance buffers. Cc: stable@vger.kernel.org Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Do no increment trace_clock_global() by oneSteven Rostedt (VMware)2021-06-301-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 89529d8b8f8daf92d9979382b8d2eb39966846ea upstream. The trace_clock_global() tries to make sure the events between CPUs is somewhat in order. A global value is used and updated by the latest read of a clock. If one CPU is ahead by a little, and is read by another CPU, a lock is taken, and if the timestamp of the other CPU is behind, it will simply use the other CPUs timestamp. The lock is also only taken with a "trylock" due to tracing, and strange recursions can happen. The lock is not taken at all in NMI context. In the case where the lock is not able to be taken, the non synced timestamp is returned. But it will not be less than the saved global timestamp. The problem arises because when the time goes "backwards" the time returned is the saved timestamp plus 1. If the lock is not taken, and the plus one to the timestamp is returned, there's a small race that can cause the time to go backwards! CPU0 CPU1 ---- ---- trace_clock_global() { ts = clock() [ 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] <interrupted by NMI> trace_clock_global() { ts = clock() [ 999 ] if (ts < global_ts) ts = global_ts + 1 [ 1001 ] trylock(clock_lock) [ fail ] return ts [ 1001] } unlock(clock_lock); return ts; [ 1000 ] } trace_clock_global() { ts = clock() [ 1000 ] if (ts < global_ts) [ false 1000 == 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] unlock(clock_lock) return ts; [ 1000 ] } The above case shows to reads of trace_clock_global() on the same CPU, but the second read returns one less than the first read. That is, time when backwards, and this is not what is allowed by trace_clock_global(). This was triggered by heavy tracing and the ring buffer checker that tests for the clock going backwards: Ring buffer clock went backwards: 20613921464 -> 20613921463 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0 Modules linked in: [..] [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463 [20613915818] PAGE TIME STAMP [20613915818] delta:0 [20613915819] delta:1 [20613916035] delta:216 [20613916465] delta:430 [20613916575] delta:110 [20613916749] delta:174 [20613917248] delta:499 [20613917333] delta:85 [20613917775] delta:442 [20613917921] delta:146 [20613918321] delta:400 [20613918568] delta:247 [20613918768] delta:200 [20613919306] delta:538 [20613919353] delta:47 [20613919980] delta:627 [20613920296] delta:316 [20613920571] delta:275 [20613920862] delta:291 [20613921152] delta:290 [20613921464] delta:312 [20613921464] delta:0 TIME EXTEND [20613921464] delta:0 This happened more than once, and always for an off by one result. It also started happening after commit aafe104aa9096 was added. Cc: stable@vger.kernel.org Fixes: aafe104aa9096 ("tracing: Restructure trace_clock_global() to never block") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Do not blindly read the ip address in ftrace_bug()Steven Rostedt (VMware)2021-06-161-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | commit 6c14133d2d3f768e0a35128faac8aa6ed4815051 upstream. It was reported that a bug on arm64 caused a bad ip address to be used for updating into a nop in ftrace_init(), but the error path (rightfully) returned -EINVAL and not -EFAULT, as the bug caused more than one error to occur. But because -EINVAL was returned, the ftrace_bug() tried to report what was at the location of the ip address, and read it directly. This caused the machine to panic, as the ip was not pointing to a valid memory address. Instead, read the ip address with copy_from_kernel_nofault() to safely access the memory, and if it faults, report that the address faulted, otherwise report what was in that location. Link: https://lore.kernel.org/lkml/20210607032329.28671-1-mark-pk.tsai@mediatek.com/ Cc: stable@vger.kernel.org Fixes: 05736a427f7e1 ("ftrace: warn on failure to disable mcount callers") Reported-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* perf: Fix data race between pin_count increment/decrementMarco Elver2021-06-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6c605f8371159432ec61cbb1488dcf7ad24ad19a upstream. KCSAN reports a data race between increment and decrement of pin_count: write to 0xffff888237c2d4e0 of 4 bytes by task 15740 on cpu 1: find_get_context kernel/events/core.c:4617 __do_sys_perf_event_open kernel/events/core.c:12097 [inline] __se_sys_perf_event_open kernel/events/core.c:11933 ... read to 0xffff888237c2d4e0 of 4 bytes by task 15743 on cpu 0: perf_unpin_context kernel/events/core.c:1525 [inline] __do_sys_perf_event_open kernel/events/core.c:12328 [inline] __se_sys_perf_event_open kernel/events/core.c:11933 ... Because neither read-modify-write here is atomic, this can lead to one of the operations being lost, resulting in an inconsistent pin_count. Fix it by adding the missing locking in the CPU-event case. Fixes: fe4b04fa31a6 ("perf: Cure task_oncpu_function_call() races") Reported-by: syzbot+142c9018f5962db69c7e@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210527104711.2671610-1-elver@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cgroup1: don't allow '\n' in renamingAlexander Kuznetsov2021-06-161-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b7e24eb1caa5f8da20d405d262dba67943aedc42 upstream. cgroup_mkdir() have restriction on newline usage in names: $ mkdir $'/sys/fs/cgroup/cpu/test\ntest2' mkdir: cannot create directory '/sys/fs/cgroup/cpu/test\ntest2': Invalid argument But in cgroup1_rename() such check is missed. This allows us to make /proc/<pid>/cgroup unparsable: $ mkdir /sys/fs/cgroup/cpu/test $ mv /sys/fs/cgroup/cpu/test $'/sys/fs/cgroup/cpu/test\ntest2' $ echo $$ > $'/sys/fs/cgroup/cpu/test\ntest2' $ cat /proc/self/cgroup 11:pids:/ 10:freezer:/ 9:hugetlb:/ 8:cpuset:/ 7:blkio:/user.slice 6:memory:/user.slice 5:net_cls,net_prio:/ 4:perf_event:/ 3:devices:/user.slice 2:cpu,cpuacct:/test test2 1:name=systemd:/ 0::/ Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru> Reported-by: Andrey Krasichkov <buglloc@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Cc: stable@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ptrace: make ptrace() fail if the tracee changed its pid unexpectedlyOleg Nesterov2021-05-261-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit dbb5afad100a828c97e012c6106566d99f041db6 ] Suppose we have 2 threads, the group-leader L and a sub-theread T, both parked in ptrace_stop(). Debugger tries to resume both threads and does ptrace(PTRACE_CONT, T); ptrace(PTRACE_CONT, L); If the sub-thread T execs in between, the 2nd PTRACE_CONT doesn not resume the old leader L, it resumes the post-exec thread T which was actually now stopped in PTHREAD_EVENT_EXEC. In this case the PTHREAD_EVENT_EXEC event is lost, and the tracer can't know that the tracee changed its pid. This patch makes ptrace() fail in this case until debugger does wait() and consumes PTHREAD_EVENT_EXEC which reports old_pid. This affects all ptrace requests except the "asynchronous" PTRACE_INTERRUPT/KILL. The patch doesn't add the new PTRACE_ option to not complicate the API, and I _hope_ this won't cause any noticeable regression: - If debugger uses PTRACE_O_TRACEEXEC and the thread did an exec and the tracer does a ptrace request without having consumed the exec event, it's 100% sure that the thread the ptracer thinks it is targeting does not exist anymore, or isn't the same as the one it thinks it is targeting. - To some degree this patch adds nothing new. In the scenario above ptrace(L) can fail with -ESRCH if it is called after the execing sub-thread wakes the leader up and before it "steals" the leader's pid. Test-case: #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/ptrace.h> #include <sys/wait.h> #include <errno.h> #include <pthread.h> #include <assert.h> void *tf(void *arg) { execve("/usr/bin/true", NULL, NULL); assert(0); return NULL; } int main(void) { int leader = fork(); if (!leader) { kill(getpid(), SIGSTOP); pthread_t th; pthread_create(&th, NULL, tf, NULL); for (;;) pause(); return 0; } waitpid(leader, NULL, WSTOPPED); ptrace(PTRACE_SEIZE, leader, 0, PTRACE_O_TRACECLONE | PTRACE_O_TRACEEXEC); waitpid(leader, NULL, 0); ptrace(PTRACE_CONT, leader, 0,0); waitpid(leader, NULL, 0); int status, thread = waitpid(-1, &status, 0); assert(thread > 0 && thread != leader); assert(status == 0x80137f); ptrace(PTRACE_CONT, thread, 0,0); /* * waitid() because waitpid(leader, &status, WNOWAIT) does not * report status. Why ???? * * Why WEXITED? because we have another kernel problem connected * to mt-exec. */ siginfo_t info; assert(waitid(P_PID, leader, &info, WSTOPPED|WEXITED|WNOWAIT) == 0); assert(info.si_pid == leader && info.si_status == 0x0405); /* OK, it sleeps in ptrace(PTRACE_EVENT_EXEC == 0x04) */ assert(ptrace(PTRACE_CONT, leader, 0,0) == -1); assert(errno == ESRCH); assert(leader == waitpid(leader, &status, WNOHANG)); assert(status == 0x04057f); assert(ptrace(PTRACE_CONT, leader, 0,0) == 0); return 0; } Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Simon Marchi <simon.marchi@efficios.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pedro Alves <palves@redhat.com> Acked-by: Simon Marchi <simon.marchi@efficios.com> Acked-by: Jan Kratochvil <jan.kratochvil@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* kernel: kexec_file: fix error return code of kexec_calculate_store_digests()Jia-Ju Bai2021-05-221-1/+3
| | | | | | | | | | | | | | | | | [ Upstream commit 31d82c2c787d5cf65fedd35ebbc0c1bd95c1a679 ] When vzalloc() returns NULL to sha_regions, no error return code of kexec_calculate_store_digests() is assigned. To fix this bug, ret is assigned with -ENOMEM in this case. Link: https://lkml.kernel.org/r/20210309083904.24321-1-baijiaju1990@gmail.com Fixes: a43cac0d9dc2 ("kexec: split kexec_file syscall code to kexec_file.c") Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* tracing: Restructure trace_clock_global() to never blockSteven Rostedt (VMware)2021-05-221-14/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit aafe104aa9096827a429bc1358f8260ee565b7cc upstream. It was reported that a fix to the ring buffer recursion detection would cause a hung machine when performing suspend / resume testing. The following backtrace was extracted from debugging that case: Call Trace: trace_clock_global+0x91/0xa0 __rb_reserve_next+0x237/0x460 ring_buffer_lock_reserve+0x12a/0x3f0 trace_buffer_lock_reserve+0x10/0x50 __trace_graph_return+0x1f/0x80 trace_graph_return+0xb7/0xf0 ? trace_clock_global+0x91/0xa0 ftrace_return_to_handler+0x8b/0xf0 ? pv_hash+0xa0/0xa0 return_to_handler+0x15/0x30 ? ftrace_graph_caller+0xa0/0xa0 ? trace_clock_global+0x91/0xa0 ? __rb_reserve_next+0x237/0x460 ? ring_buffer_lock_reserve+0x12a/0x3f0 ? trace_event_buffer_lock_reserve+0x3c/0x120 ? trace_event_buffer_reserve+0x6b/0xc0 ? trace_event_raw_event_device_pm_callback_start+0x125/0x2d0 ? dpm_run_callback+0x3b/0xc0 ? pm_ops_is_empty+0x50/0x50 ? platform_get_irq_byname_optional+0x90/0x90 ? trace_device_pm_callback_start+0x82/0xd0 ? dpm_run_callback+0x49/0xc0 With the following RIP: RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x200 Since the fix to the recursion detection would allow a single recursion to happen while tracing, this lead to the trace_clock_global() taking a spin lock and then trying to take it again: ring_buffer_lock_reserve() { trace_clock_global() { arch_spin_lock() { queued_spin_lock_slowpath() { /* lock taken */ (something else gets traced by function graph tracer) ring_buffer_lock_reserve() { trace_clock_global() { arch_spin_lock() { queued_spin_lock_slowpath() { /* DEAD LOCK! */ Tracing should *never* block, as it can lead to strange lockups like the above. Restructure the trace_clock_global() code to instead of simply taking a lock to update the recorded "prev_time" simply use it, as two events happening on two different CPUs that calls this at the same time, really doesn't matter which one goes first. Use a trylock to grab the lock for updating the prev_time, and if it fails, simply try again the next time. If it failed to be taken, that means something else is already updating it. Link: https://lkml.kernel.org/r/20210430121758.650b6e8a@gandalf.local.home Cc: stable@vger.kernel.org Tested-by: Konstantin Kharlamov <hi-angel@yandex.ru> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Fixes: b02414c8f045 ("ring-buffer: Fix recursion protection transitions between interrupt context") # started showing the problem Fixes: 14131f2f98ac3 ("tracing: implement trace_clock_*() APIs") # where the bug happened Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212761 Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Map all PIDs to command linesSteven Rostedt (VMware)2021-05-221-26/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 785e3c0a3a870e72dc530856136ab4c8dd207128 upstream. The default max PID is set by PID_MAX_DEFAULT, and the tracing infrastructure uses this number to map PIDs to the comm names of the tasks, such output of the trace can show names from the recorded PIDs in the ring buffer. This mapping is also exported to user space via the "saved_cmdlines" file in the tracefs directory. But currently the mapping expects the PIDs to be less than PID_MAX_DEFAULT, which is the default maximum and not the real maximum. Recently, systemd will increases the maximum value of a PID on the system, and when tasks are traced that have a PID higher than PID_MAX_DEFAULT, its comm is not recorded. This leads to the entire trace to have "<...>" as the comm name, which is pretty useless. Instead, keep the array mapping the size of PID_MAX_DEFAULT, but instead of just mapping the index to the comm, map a mask of the PID (PID_MAX_DEFAULT - 1) to the comm, and find the full PID from the map_cmdline_to_pid array (that already exists). This bug goes back to the beginning of ftrace, but hasn't been an issue until user space started increasing the maximum value of PIDs. Link: https://lkml.kernel.org/r/20210427113207.3c601884@gandalf.local.home Cc: stable@vger.kernel.org Fixes: bc0c38d139ec7 ("ftrace: latency tracer infrastructure") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Treat recording comm for idle task as a successJoel Fernandes2021-05-221-1/+5
| | | | | | | | | | | | | | | | | commit eaf260ac04d9b4cf9f458d5c97555bfff2da526e upstream. Currently we stop recording comm for non-idle tasks when switching from/to idle task since we treat that as a record failure. Fix that by treat recording of comm for idle task as a success. Link: http://lkml.kernel.org/r/20170706230023.17942-1-joelaf@google.com Cc: kernel-team@android.com Cc: Ingo Molnar <mingo@redhat.com> Reported-by: Michael Sartain <mikesart@gmail.com> Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Use strlcpy() instead of strcpy() in __trace_find_cmdline()Amey Telawane2021-05-221-1/+1
| | | | | | | | | | | | | | | | | | | commit e09e28671cda63e6308b31798b997639120e2a21 upstream. Strcpy is inherently not safe, and strlcpy() should be used instead. __trace_find_cmdline() uses strcpy() because the comms saved must have a terminating nul character, but it doesn't hurt to add the extra protection of using strlcpy() instead of strcpy(). Link: http://lkml.kernel.org/r/1493806274-13936-1-git-send-email-amit.pundir@linaro.org Signed-off-by: Amey Telawane <ameyt@codeaurora.org> [AmitP: Cherry-picked this commit from CodeAurora kernel/msm-3.10 https://source.codeaurora.org/quic/la/kernel/msm-3.10/commit/?id=2161ae9a70b12cf18ac8e5952a20161ffbccb477] Signed-off-by: Amit Pundir <amit.pundir@linaro.org> [ Updated change log and removed the "- 1" from len parameter ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Handle commands when closing set_ftrace_filter fileSteven Rostedt (VMware)2021-05-221-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8c9af478c06bb1ab1422f90d8ecbc53defd44bc3 upstream. # echo switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter will cause switch_mm to stop tracing by the traceoff command. # echo -n switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter does nothing. The reason is that the parsing in the write function only processes commands if it finished parsing (there is white space written after the command). That's to handle: write(fd, "switch_mm:", 10); write(fd, "traceoff", 8); cases, where the command is broken over multiple writes. The problem is if the file descriptor is closed, then the write call is not processed, and the command needs to be processed in the release code. The release code can handle matching of functions, but does not handle commands. Cc: stable@vger.kernel.org Fixes: eda1e32855656 ("tracing: handle broken names in ftrace filter") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* workqueue: Move the position of debug_work_activate() in __queue_work()Zqiang2021-04-161-1/+1
| | | | | | | | | | | | | | [ Upstream commit 0687c66b5f666b5ad433f4e94251590d9bc9d10e ] The debug_work_activate() is called on the premise that the work can be inserted, because if wq be in WQ_DRAINING status, insert work may be failed. Fixes: e41e704bc4f4 ("workqueue: improve destroy_workqueue() debuggability") Signed-off-by: Zqiang <qiang.zhang@windriver.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* tracing: Add a vmalloc_sync_mappings() for safe measureSteven Rostedt (VMware)2021-04-101-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 11f5efc3ab66284f7aaacc926e9351d658e2577b upstream x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu areas can be complex, to say the least. Mappings may happen at boot up, and if nothing synchronizes the page tables, those page mappings may not be synced till they are used. This causes issues for anything that might touch one of those mappings in the path of the page fault handler. When one of those unmapped mappings is touched in the page fault handler, it will cause another page fault, which in turn will cause a page fault, and leave us in a loop of page faults. Commit 763802b53a42 ("x86/mm: split vmalloc_sync_all()") split vmalloc_sync_all() into vmalloc_sync_unmappings() and vmalloc_sync_mappings(), as on system exit, it did not need to do a full sync on x86_64 (although it still needed to be done on x86_32). By chance, the vmalloc_sync_all() would synchronize the page mappings done at boot up and prevent the per cpu area from being a problem for tracing in the page fault handler. But when that synchronization in the exit of a task became a nop, it caused the problem to appear. Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home Cc: stable@vger.kernel.org Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code") Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> Suggested-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> [sudip: add header] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix stack trace event sizeSteven Rostedt (VMware)2021-04-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9deb193af69d3fd6dd8e47f292b67c805a787010 upstream. Commit cbc3b92ce037 fixed an issue to modify the macros of the stack trace event so that user space could parse it properly. Originally the stack trace format to user space showed that the called stack was a dynamic array. But it is not actually a dynamic array, in the way that other dynamic event arrays worked, and this broke user space parsing for it. The update was to make the array look to have 8 entries in it. Helper functions were added to make it parse it correctly, as the stack was dynamic, but was determined by the size of the event stored. Although this fixed user space on how it read the event, it changed the internal structure used for the stack trace event. It changed the array size from [0] to [8] (added 8 entries). This increased the size of the stack trace event by 8 words. The size reserved on the ring buffer was the size of the stack trace event plus the number of stack entries found in the stack trace. That commit caused the amount to be 8 more than what was needed because it did not expect the caller field to have any size. This produced 8 entries of garbage (and reading random data) from the stack trace event: <idle>-0 [002] d... 1976396.837549: <stack trace> => trace_event_raw_event_sched_switch => __traceiter_sched_switch => __schedule => schedule_idle => do_idle => cpu_startup_entry => secondary_startup_64_no_verify => 0xc8c5e150ffff93de => 0xffff93de => 0 => 0 => 0xc8c5e17800000000 => 0x1f30affff93de => 0x00000004 => 0x200000000 Instead, subtract the size of the caller field from the size of the event to make sure that only the amount needed to store the stack trace is reserved. Link: https://lore.kernel.org/lkml/your-ad-here.call-01617191565-ext-9692@work.hours/ Cc: stable@vger.kernel.org Fixes: cbc3b92ce037 ("tracing: Set kernel_stack's caller size properly") Reported-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* genirq: Disable interrupts for force threaded handlersThomas Gleixner2021-03-241-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 81e2073c175b887398e5bca6c004efa89983f58d upstream. With interrupt force threading all device interrupt handlers are invoked from kernel threads. Contrary to hard interrupt context the invocation only disables bottom halfs, but not interrupts. This was an oversight back then because any code like this will have an issue: thread(irq_A) irq_handler(A) spin_lock(&foo->lock); interrupt(irq_B) irq_handler(B) spin_lock(&foo->lock); This has been triggered with networking (NAPI vs. hrtimers) and console drivers where printk() happens from an interrupt which interrupted the force threaded handler. Now people noticed and started to change the spin_lock() in the handler to spin_lock_irqsave() which affects performance or add IRQF_NOTHREAD to the interrupt request which in turn breaks RT. Fix the root cause and not the symptom and disable interrupts before invoking the force threaded handler which preserves the regular semantics and the usefulness of the interrupt force threading as a general debugging tool. For not RT this is not changing much, except that during the execution of the threaded handler interrupts are delayed until the handler returns. Vs. scheduling and softirq processing there is no difference. For RT kernels there is no issue. Fixes: 8d32a307e4fa ("genirq: Provide forced interrupt threading") Reported-by: Johan Hovold <johan@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Johan Hovold <johan@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20210317143859.513307808@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: fix dead code in attach_to_pi_owner()Thomas Gleixner2021-03-171-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch comes directly from an origin patch (commit 91509e84949fc97e7424521c32a9e227746e0b85) in v4.9. And it is part of a full patch which was originally back-ported to v4.14 as commit e6e00df182908f34360c3c9f2d13cc719362e9c0 The handle_exit_race() function is defined in commit 9c3f39860367 ("futex: Cure exit race"), which never returns -EBUSY. This results in a small piece of dead code in the attach_to_pi_owner() function: int ret = handle_exit_race(uaddr, uval, p); /* Never return -EBUSY */ ... if (ret == -EBUSY) *exiting = p; /* dead code */ The return value -EBUSY is added to handle_exit_race() in upsteam commit ac31c7ff8624409 ("futex: Provide distinct return value when owner is exiting"). This commit was incorporated into v4.9.255, before the function handle_exit_race() was introduced, whitout Modify handle_exit_race(). To fix dead code, extract the change of handle_exit_race() from commit ac31c7ff8624409 ("futex: Provide distinct return value when owner is exiting"), re-incorporated. Lee writes: This commit takes the remaining functional snippet of: ac31c7ff8624409 ("futex: Provide distinct return value when owner is exiting") ... and is the correct fix for this issue. Fixes: 9c3f39860367 ("futex: Cure exit race") Cc: stable@vger.kernel.org # v4.9.258 Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Reviewed-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Cure exit raceThomas Gleixner2021-03-171-6/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit da791a667536bf8322042e38ca85d55a78d3c273 upstream. This patch comes directly from an origin patch (commit 9c3f3986036760c48a92f04b36774aa9f63673f80) in v4.9. Stefan reported, that the glibc tst-robustpi4 test case fails occasionally. That case creates the following race between sys_exit() and sys_futex_lock_pi(): CPU0 CPU1 sys_exit() sys_futex() do_exit() futex_lock_pi() exit_signals(tsk) No waiters: tsk->flags |= PF_EXITING; *uaddr == 0x00000PID mm_release(tsk) Set waiter bit exit_robust_list(tsk) { *uaddr = 0x80000PID; Set owner died attach_to_pi_owner() { *uaddr = 0xC0000000; tsk = get_task(PID); } if (!tsk->flags & PF_EXITING) { ... attach(); tsk->flags |= PF_EXITPIDONE; } else { if (!(tsk->flags & PF_EXITPIDONE)) return -EAGAIN; return -ESRCH; <--- FAIL } ESRCH is returned all the way to user space, which triggers the glibc test case assert. Returning ESRCH unconditionally is wrong here because the user space value has been changed by the exiting task to 0xC0000000, i.e. the FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This is a valid state and the kernel has to handle it, i.e. taking the futex. Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE is set in the task which 'owns' the futex. If the value has changed, let the kernel retry the operation, which includes all regular sanity checks and correctly handles the FUTEX_OWNER_DIED case. If it hasn't changed, then return ESRCH as there is no way to distinguish this case from malfunctioning user space. This happens when the exiting task did not have a robust list, the robust list was corrupted or the user space value in the futex was simply bogus. Reported-by: Stefan Liebler <stli@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467 Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [Lee: Required to satisfy functional dependency from futex back-port. Re-add the missing handle_exit_race() parts from: 3d4775df0a89 ("futex: Replace PF_EXITPIDONE with a state")] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Change locking rulesPeter Zijlstra2021-03-171-26/+112
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 734009e96d1983ad739e5b656e03430b3660c913 upstream. This patch comes directly from an origin patch (commit dc3f2ff11740159080f2e8e359ae0ab57c8e74b6) in v4.9. Currently futex-pi relies on hb->lock to serialize everything. But hb->lock creates another set of problems, especially priority inversions on RT where hb->lock becomes a rt_mutex itself. The rt_mutex::wait_lock is the most obvious protection for keeping the futex user space value and the kernel internal pi_state in sync. Rework and document the locking so rt_mutex::wait_lock is held accross all operations which modify the user space value and the pi state. This allows to invoke rt_mutex_unlock() (including deboost) without holding hb->lock as a next step. Nothing yet relies on the new locking rules. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [Lee: Back-ported in support of a previous futex back-port attempt] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: fix spin_lock() / spin_unlock_irq() imbalanceThomas Schoebel-Theuer2021-03-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch and problem analysis is specific for 4.4 LTS, due to incomplete backporting of other fixes. Later LTS series have different backports. The following is obviously incorrect: static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this, struct futex_hash_bucket *hb) { [...] raw_spin_lock(&pi_state->pi_mutex.wait_lock); [...] raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock); [...] } The 4.4-specific fix should probably go in the direction of b4abf91047c, making everything irq-safe. Probably, backporting of b4abf91047c to 4.4 LTS could thus be another good idea. However, this might involve some more 4.4-specific work and require thorough testing: > git log --oneline v4.4..b4abf91047c -- kernel/futex.c kernel/locking/rtmutex.c | wc -l 10 So this patch is just an obvious quickfix for now. Hint: the lock order is documented in 4.9.y and later. A similar documenting is missing in 4.4.y. Please somebody either backport also, or write a new description, if there would be some differences I cannot easily see at the moment. Without reliable docs, inspection of the locking correctness may become a pain. Signed-off-by: Thomas Schoebel-Theuer <tst@1und1.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Lee Jones <lee.jones@linaro.org> Fixes: 394fc4981426 ("futex: Rework inconsistent rt_mutex/futex_q state") Fixes: 6510e4a2d04f ("futex,rt_mutex: Provide futex specific rt_mutex API") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: fix irq self-deadlock and satisfy assertionThomas Schoebel-Theuer2021-03-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch and problem analysis is specific for 4.4 LTS, due to incomplete backporting of other fixes. Later LTS series have different backports. Since v4.4.257 when CONFIG_PROVE_LOCKING=y the following triggers right after reboot of our pre-life systems which equal our production setup: Mar 03 11:27:33 icpu-test-bap10 kernel: ================================= Mar 03 11:27:33 icpu-test-bap10 kernel: [ INFO: inconsistent lock state ] Mar 03 11:27:33 icpu-test-bap10 kernel: 4.4.259-rc1-grsec+ #730 Not tainted Mar 03 11:27:33 icpu-test-bap10 kernel: --------------------------------- Mar 03 11:27:33 icpu-test-bap10 kernel: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. Mar 03 11:27:33 icpu-test-bap10 kernel: apache2-ssl/9310 [HC0[0]:SC0[0]:HE1:SE1] takes: Mar 03 11:27:33 icpu-test-bap10 kernel: (&p->pi_lock){?.-.-.}, at: [<ffffffff810abb68>] pi_state_update_owner+0x51/0xd7 Mar 03 11:27:33 icpu-test-bap10 kernel: {IN-HARDIRQ-W} state was registered at: Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81088c4a>] __lock_acquire+0x3a7/0xe4a Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81089b01>] lock_acquire+0x18d/0x1bc Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8170151c>] _raw_spin_lock_irqsave+0x3e/0x50 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810719a5>] try_to_wake_up+0x2c/0x210 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81071bf3>] default_wake_function+0xd/0xf Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81083588>] autoremove_wake_function+0x11/0x35 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810830b2>] __wake_up_common+0x48/0x7c Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8108311a>] __wake_up+0x34/0x46 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814c2a23>] megasas_complete_int_cmd+0x31/0x33 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814c60a0>] megasas_complete_cmd+0x570/0x57b Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814d05bc>] complete_cmd_fusion+0x23e/0x33d Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814d0768>] megasas_isr_fusion+0x67/0x74 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81091ae5>] handle_irq_event_percpu+0x134/0x311 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81091cf5>] handle_irq_event+0x33/0x51 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810948b9>] handle_edge_irq+0xa3/0xc2 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81005f7b>] handle_irq+0xf9/0x101 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81005700>] do_IRQ+0x80/0xf5 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81702228>] ret_from_intr+0x0/0x20 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8100cab0>] arch_cpu_idle+0xa/0xc Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81083a5a>] default_idle_call+0x1e/0x20 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81083b9d>] cpu_startup_entry+0x141/0x22f Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff816fb853>] rest_init+0x135/0x13b Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81d5ce99>] start_kernel+0x3fa/0x40a Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81d5c2af>] x86_64_start_reservations+0x2a/0x2c Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81d5c3d0>] x86_64_start_kernel+0x11f/0x12c Mar 03 11:27:33 icpu-test-bap10 kernel: irq event stamp: 1457 Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last enabled at (1457): [<ffffffff81042a69>] get_user_pages_fast+0xeb/0x14f Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last disabled at (1456): [<ffffffff810429dd>] get_user_pages_fast+0x5f/0x14f Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last enabled at (1446): [<ffffffff815e127d>] release_sock+0x142/0x14d Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last disabled at (1444): [<ffffffff815e116f>] release_sock+0x34/0x14d Mar 03 11:27:33 icpu-test-bap10 kernel: other info that might help us debug this: Mar 03 11:27:33 icpu-test-bap10 kernel: Possible unsafe locking scenario: Mar 03 11:27:33 icpu-test-bap10 kernel: CPU0 Mar 03 11:27:33 icpu-test-bap10 kernel: ---- Mar 03 11:27:33 icpu-test-bap10 kernel: lock(&p->pi_lock); Mar 03 11:27:33 icpu-test-bap10 kernel: <Interrupt> Mar 03 11:27:33 icpu-test-bap10 kernel: lock(&p->pi_lock); Mar 03 11:27:33 icpu-test-bap10 kernel: *** DEADLOCK *** Mar 03 11:27:33 icpu-test-bap10 kernel: 2 locks held by apache2-ssl/9310: Mar 03 11:27:33 icpu-test-bap10 kernel: #0: (&(&(__futex_data.queues)[i].lock)->rlock){+.+...}, at: [<ffffffff810ae4e6>] do Mar 03 11:27:33 icpu-test-bap10 kernel: #1: (&lock->wait_lock){+.+...}, at: [<ffffffff810ae53a>] do_futex+0x639/0x809 Mar 03 11:27:33 icpu-test-bap10 kernel: stack backtrace: Mar 03 11:27:33 icpu-test-bap10 kernel: CPU: 13 PID: 9310 UID: 99 Comm: apache2-ssl Not tainted 4.4.259-rc1-grsec+ #730 Mar 03 11:27:33 icpu-test-bap10 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.11.0 11/02/2019 Mar 03 11:27:33 icpu-test-bap10 kernel: 0000000000000000 ffff883fb79bfc00 ffffffff816f8fc2 ffff883ffa66d300 Mar 03 11:27:33 icpu-test-bap10 kernel: ffffffff8eaa71f0 ffff883fb79bfc50 ffffffff81088484 0000000000000000 Mar 03 11:27:33 icpu-test-bap10 kernel: 0000000000000001 0000000000000001 0000000000000002 ffff883ffa66db58 Mar 03 11:27:33 icpu-test-bap10 kernel: Call Trace: Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff816f8fc2>] dump_stack+0x94/0xca Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81088484>] print_usage_bug+0x1bc/0x1d1 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81087d76>] ? check_usage_forwards+0x98/0x98 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810885a5>] mark_lock+0x10c/0x203 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81088cb9>] __lock_acquire+0x416/0xe4a Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] ? pi_state_update_owner+0x51/0xd7 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81089b01>] lock_acquire+0x18d/0x1bc Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81089b01>] ? lock_acquire+0x18d/0x1bc Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] ? pi_state_update_owner+0x51/0xd7 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81700d12>] _raw_spin_lock+0x2a/0x39 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] ? pi_state_update_owner+0x51/0xd7 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] pi_state_update_owner+0x51/0xd7 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810ae5af>] do_futex+0x6ae/0x809 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810ae83d>] SyS_futex+0x133/0x143 Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8100158a>] ? syscall_trace_enter_phase2+0x1a2/0x1bb Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81701848>] tracesys_phase2+0x90/0x95 Bisecting detects 47e452fcf2f in the above specific scenario using apache-ssl, but apparently the missing *_irq() was introduced in 34c8e1c2c02. However, just reverting the old _irq() variants to a similar status than before 34c8e1c2c02, or using _irqsave() / _irqrestore() as some other backports are doing in various places, would not really help. The fundamental problem is the following violation of the assertion lockdep_assert_held(&pi_state->pi_mutex.wait_lock) in pi_state_update_owner(): Mar 03 12:50:03 icpu-test-bap10 kernel: ------------[ cut here ]------------ Mar 03 12:50:03 icpu-test-bap10 kernel: WARNING: CPU: 37 PID: 8488 at kernel/futex.c:844 pi_state_update_owner+0x3d/0xd7() Mar 03 12:50:03 icpu-test-bap10 kernel: Modules linked in: xt_time xt_connlimit xt_connmark xt_NFLOG xt_limit xt_hashlimit veth ip_set_bitmap_port xt_DSCP xt_multiport ip_set_hash_ip xt_owner xt_set ip_set_hash_net xt_state xt_conntrack nf_conntrack_ftp mars lz4_decompress lz4_compress ipmi_devintf x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul hed ipmi_si ipmi_msghandler processor crc32c_intel ehci_pci ehci_hcd usbcore i40e usb_common Mar 03 12:50:03 icpu-test-bap10 kernel: CPU: 37 PID: 8488 UID: 99 Comm: apache2-ssl Not tainted 4.4.259-rc1-grsec+ #737 Mar 03 12:50:03 icpu-test-bap10 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.11.0 11/02/2019 Mar 03 12:50:03 icpu-test-bap10 kernel: 0000000000000000 ffff883f863f7c70 ffffffff816f9002 0000000000000000 Mar 03 12:50:03 icpu-test-bap10 kernel: 0000000000000009 ffff883f863f7ca8 ffffffff8104cda2 ffffffff810abac7 Mar 03 12:50:03 icpu-test-bap10 kernel: ffff883ffbfe5e80 0000000000000000 ffff883f82ed4bc0 00007fc01c9bf000 Mar 03 12:50:03 icpu-test-bap10 kernel: Call Trace: Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff816f9002>] dump_stack+0x94/0xca Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff8104cda2>] warn_slowpath_common+0x94/0xad Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abac7>] ? pi_state_update_owner+0x3d/0xd7 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff8104ce5f>] warn_slowpath_null+0x15/0x17 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abac7>] pi_state_update_owner+0x3d/0xd7 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abea8>] free_pi_state+0x2d/0x73 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abf0b>] unqueue_me_pi+0x1d/0x31 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810ad735>] futex_lock_pi+0x27a/0x2e8 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff81088bca>] ? __lock_acquire+0x327/0xe4a Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810ae6a9>] do_futex+0x784/0x809 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810cfa9a>] ? seccomp_phase1+0xde/0x1e7 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810a4503>] ? current_kernel_time64+0xb/0x31 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810d23c3>] ? current_kernel_time+0xb/0xf Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810ae861>] SyS_futex+0x133/0x143 Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff8100158a>] ? syscall_trace_enter_phase2+0x1a2/0x1bb Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff81701888>] tracesys_phase2+0x90/0x95 Mar 03 12:50:03 icpu-test-bap10 kernel: ---[ end trace 968f95a458dea951 ]--- In order to both (1) prevent the self-deadlock, and (2) to satisfy the assertion at pi_state_update_owner(), some locking with irq disable is needed, at least in the specific call stack. Interestingly, there existed a suchalike locking just before f08a4af5ccb. This is just a quick hotfix, resurrecting some previous locks at the old places, but now using ->wait_lock in place of the previous ->pi_lock (which was in place before f08a4af5ccb). The ->pi_lock is now also taken, by the new code which had been introduced in 34c8e1c2c02. When this patch is applied, both the above splats are no longer triggering at my prelife machines. Without this patch, I cannot ensure stable production at 1&1 Ionos. Hint for further work: I have not yet tested other call paths, since I am under time pressure for security reasons. Hint for further hardening of 4.4.y and probably some more LTS series: Probably some more systematic testing with CONFIG_PROVE_LOCKING (and probably some more options) should be invested in order to make the 4.4 LTS series really "stable" again. Signed-off-by: Thomas Schoebel-Theuer <tst@1und1.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Lee Jones <lee.jones@linaro.org> Fixes: f08a4af5ccb2 ("futex: Use pi_state_update_owner() in put_pi_state()") Fixes: 34c8e1c2c025 ("futex: Provide and use pi_state_update_owner()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Ensure the correct return value from futex_lock_pi()Thomas Gleixner2021-03-071-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | commit 12bb3f7f1b03d5913b3f9d4236a488aa7774dfe9 upstream. In case that futex_lock_pi() was aborted by a signal or a timeout and the task returned without acquiring the rtmutex, but is the designated owner of the futex due to a concurrent futex_unlock_pi() fixup_owner() is invoked to establish consistent state. In that case it invokes fixup_pi_state_owner() which in turn tries to acquire the rtmutex again. If that succeeds then it does not propagate this success to fixup_owner() and futex_lock_pi() returns -EINTR or -ETIMEOUT despite having the futex locked. Return success from fixup_pi_state_owner() in all cases where the current task owns the rtmutex and therefore the futex and propagate it correctly through fixup_owner(). Fixup the other callsite which does not expect a positive return value. Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> [Sharan: Backported patch for kernel 4.4.y. Also folded in is a part of the cleanup patch d7c5ed73b19c("futex: Remove needless goto's")] Signed-off-by: Sharan Turlapati <sturlapati@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Fix OWNER_DEAD fixupPeter Zijlstra2021-03-031-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a97cb0e7b3f4c6297fd857055ae8e895f402f501 upstream. Both Geert and DaveJ reported that the recent futex commit: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") introduced a problem with setting OWNER_DEAD. We set the bit on an uninitialized variable and then entirely optimize it away as a dead-store. Move the setting of the bit to where it is more useful. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") Link: http://lkml.kernel.org/r/20180122103947.GD2228@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Reviewed-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* module: Ignore _GLOBAL_OFFSET_TABLE_ when warning for undefined symbolsFangrui Song2021-03-031-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ebfac7b778fac8b0e8e92ec91d0b055f046b4604 upstream. clang-12 -fno-pic (since https://github.com/llvm/llvm-project/commit/a084c0388e2a59b9556f2de0083333232da3f1d6) can emit `call __stack_chk_fail@PLT` instead of `call __stack_chk_fail` on x86. The two forms should have identical behaviors on x86-64 but the former causes GNU as<2.37 to produce an unreferenced undefined symbol _GLOBAL_OFFSET_TABLE_. (On x86-32, there is an R_386_PC32 vs R_386_PLT32 difference but the linker behavior is identical as far as Linux kernel is concerned.) Simply ignore _GLOBAL_OFFSET_TABLE_ for now, like what scripts/mod/modpost.c:ignore_undef_symbol does. This also fixes the problem for gcc/clang -fpie and -fpic, which may emit `call foo@PLT` for external function calls on x86. Note: ld -z defs and dynamic loaders do not error for unreferenced undefined symbols so the module loader is reading too much. If we ever need to ignore more symbols, the code should be refactored to ignore unreferenced symbols. Cc: <stable@vger.kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1250 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=27178 Reported-by: Marco Elver <elver@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Marco Elver <elver@google.com> Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracepoint: Do not fail unregistering a probe due to memory failureSteven Rostedt (VMware)2021-03-031-16/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit befe6d946551d65cddbd32b9cb0170b0249fd5ed ] The list of tracepoint callbacks is managed by an array that is protected by RCU. To update this array, a new array is allocated, the updates are copied over to the new array, and then the list of functions for the tracepoint is switched over to the new array. After a completion of an RCU grace period, the old array is freed. This process happens for both adding a callback as well as removing one. But on removing a callback, if the new array fails to be allocated, the callback is not removed, and may be used after it is freed by the clients of the tracepoint. There's really no reason to fail if the allocation for a new array fails when removing a function. Instead, the function can simply be replaced by a stub function that could be cleaned up on the next modification of the array. That is, instead of calling the function registered to the tracepoint, it would call a stub function in its place. Link: https://lore.kernel.org/r/20201115055256.65625-1-mmullins@mmlx.us Link: https://lore.kernel.org/r/20201116175107.02db396d@gandalf.local.home Link: https://lore.kernel.org/r/20201117211836.54acaef2@oasis.local.home Link: https://lkml.kernel.org/r/20201118093405.7a6d2290@gandalf.local.home [ Note, this version does use undefined compiler behavior (assuming that a stub function with no parameters or return, can be called by a location that thinks it has parameters but still no return value. Static calls do the same thing, so this trick is not without precedent. There's another solution that uses RCU tricks and is more complex, but can be an alternative if this solution becomes an issue. Link: https://lore.kernel.org/lkml/20210127170721.58bce7cc@gandalf.local.home/ ] Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@chromium.org> Cc: netdev <netdev@vger.kernel.org> Cc: bpf <bpf@vger.kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Florian Weimer <fw@deneb.enyo.de> Fixes: 97e1c18e8d17b ("tracing: Kernel Tracepoints") Reported-by: syzbot+83aa762ef23b6f0d1991@syzkaller.appspotmail.com Reported-by: syzbot+d29e58bb557324e55e5e@syzkaller.appspotmail.com Reported-by: Matt Mullins <mmullins@mmlx.us> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Tested-by: Matt Mullins <mmullins@mmlx.us> Signed-off-by: Sasha Levin <sashal@kernel.org>
* kdb: Make memory allocations more robustSumit Garg2021-03-031-1/+1
| | | | | | | | | | | | | | | | | | | | | commit 93f7a6d818deef69d0ba652d46bae6fbabbf365c upstream. Currently kdb uses in_interrupt() to determine whether its library code has been called from the kgdb trap handler or from a saner calling context such as driver init. This approach is broken because in_interrupt() alone isn't able to determine kgdb trap handler entry from normal task context. This can happen during normal use of basic features such as breakpoints and can also be trivially reproduced using: echo g > /proc/sysrq-trigger We can improve this by adding check for in_dbg_master() instead which explicitly determines if we are running in debugger context. Cc: stable@vger.kernel.org Signed-off-by: Sumit Garg <sumit.garg@linaro.org> Link: https://lore.kernel.org/r/1611313556-4004-1-git-send-email-sumit.garg@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fgraph: Initialize tracing_graph_pause at task creationSteven Rostedt (VMware)2021-02-231-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7e0a9220467dbcfdc5bc62825724f3e52e50ab31 upstream. On some archs, the idle task can call into cpu_suspend(). The cpu_suspend() will disable or pause function graph tracing, as there's some paths in bringing down the CPU that can have issues with its return address being modified. The task_struct structure has a "tracing_graph_pause" atomic counter, that when set to something other than zero, the function graph tracer will not modify the return address. The problem is that the tracing_graph_pause counter is initialized when the function graph tracer is enabled. This can corrupt the counter for the idle task if it is suspended in these architectures. CPU 1 CPU 2 ----- ----- do_idle() cpu_suspend() pause_graph_tracing() task_struct->tracing_graph_pause++ (0 -> 1) start_graph_tracing() for_each_online_cpu(cpu) { ftrace_graph_init_idle_task(cpu) task-struct->tracing_graph_pause = 0 (1 -> 0) unpause_graph_tracing() task_struct->tracing_graph_pause-- (0 -> -1) The above should have gone from 1 to zero, and enabled function graph tracing again. But instead, it is set to -1, which keeps it disabled. There's no reason that the field tracing_graph_pause on the task_struct can not be initialized at boot up. Cc: stable@vger.kernel.org Fixes: 380c4b1411ccd ("tracing/function-graph-tracer: append the tracing_graph_flag") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211339 Reported-by: pierre.gondois@arm.com Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Do not count ftrace events in top level enable outputSteven Rostedt (VMware)2021-02-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 256cfdd6fdf70c6fcf0f7c8ddb0ebd73ce8f3bc9 upstream. The file /sys/kernel/tracing/events/enable is used to enable all events by echoing in "1", or disabling all events when echoing in "0". To know if all events are enabled, disabled, or some are enabled but not all of them, cating the file should show either "1" (all enabled), "0" (all disabled), or "X" (some enabled but not all of them). This works the same as the "enable" files in the individule system directories (like tracing/events/sched/enable). But when all events are enabled, the top level "enable" file shows "X". The reason is that its checking the "ftrace" events, which are special events that only exist for their format files. These include the format for the function tracer events, that are enabled when the function tracer is enabled, but not by the "enable" file. The check includes these events, which will always be disabled, and even though all true events are enabled, the top level "enable" file will show "X" instead of "1". To fix this, have the check test the event's flags to see if it has the "IGNORE_ENABLE" flag set, and if so, not test it. Cc: stable@vger.kernel.org Fixes: 553552ce1796c ("tracing: Combine event filter_active and enable into single flags field") Reported-by: "Yordan Karadzhov (VMware)" <y.karadz@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kretprobe: Avoid re-registration of the same kretprobe earlierWang ShaoBo2021-02-101-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0188b87899ffc4a1d36a0badbe77d56c92fd91dc upstream. Our system encountered a re-init error when re-registering same kretprobe, where the kretprobe_instance in rp->free_instances is illegally accessed after re-init. Implementation to avoid re-registration has been introduced for kprobe before, but lags for register_kretprobe(). We must check if kprobe has been re-registered before re-initializing kretprobe, otherwise it will destroy the data struct of kretprobe registered, which can lead to memory leak, system crash, also some unexpected behaviors. We use check_kprobe_rereg() to check if kprobe has been re-registered before running register_kretprobe()'s body, for giving a warning message and terminate registration process. Link: https://lkml.kernel.org/r/20210128124427.2031088-1-bobo.shaobowang@huawei.com Cc: stable@vger.kernel.org Fixes: 1f0ab40976460 ("kprobes: Prevent re-registration of the same kprobe") [ The above commit should have been done for kretprobes too ] Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@linux.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* elfcore: fix building with clangArnd Bergmann2021-02-102-26/+0
| | | | | | | | | | | | | | | | | | | | | | | commit 6e7b64b9dd6d96537d816ea07ec26b7dedd397b9 upstream. kernel/elfcore.c only contains weak symbols, which triggers a bug with clang in combination with recordmcount: Cannot find symbol for section 2: .text. kernel/elfcore.o: failed Move the empty stubs into linux/elfcore.h as inline functions. As only two architectures use these, just use the architecture specific Kconfig symbols to key off the declaration. Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Barret Rhoden <brho@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ELF/MIPS build fixRalf Baechle2021-02-101-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f43edca7ed08fc02279f2a62015da5cb6aa0ad61 upstream. CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the following linker errors: arch/mips/built-in.o: In function `elf_core_dump': binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs' binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size' binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs' binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data' CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following linker errors: arch/mips/built-in.o: In function `elf_core_dump': binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs' binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size' binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs' binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data' This is because binfmt_elfn32 and binfmt_elfo32 are using symbols from elfcore but for these configurations elfcore will not be built. Fixed by making elfcore selectable by a separate config symbol which unlike the current mechanism can also be used from other directories than kernel/, then having each flavor of ELF that relies on elfcore.o, select it in Kconfig, including CONFIG_MIPS32_N32 and CONFIG_MIPS32_O32 which fixes this issue. Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.org Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Reviewed-by: James Hogan <james.hogan@imgtec.com> Cc: "Maciej W. Rozycki" <macro@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Handle faults correctly for PI futexesLee Jones2021-02-101-18/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From: Thomas Gleixner <tglx@linutronix.de> fixup_pi_state_owner() tries to ensure that the state of the rtmutex, pi_state and the user space value related to the PI futex are consistent before returning to user space. In case that the user space value update faults and the fault cannot be resolved by faulting the page in via fault_in_user_writeable() the function returns with -EFAULT and leaves the rtmutex and pi_state owner state inconsistent. A subsequent futex_unlock_pi() operates on the inconsistent pi_state and releases the rtmutex despite not owning it which can corrupt the RB tree of the rtmutex and cause a subsequent kernel stack use after free. It was suggested to loop forever in fixup_pi_state_owner() if the fault cannot be resolved, but that results in runaway tasks which is especially undesired when the problem happens due to a programming error and not due to malice. As the user space value cannot be fixed up, the proper solution is to make the rtmutex and the pi_state consistent so both have the same owner. This leaves the user space value out of sync. Any subsequent operation on the futex will fail because the 10th rule of PI futexes (pi_state owner and user space value are consistent) has been violated. As a consequence this removes the inept attempts of 'fixing' the situation in case that the current task owns the rtmutex when returning with an unresolvable fault by unlocking the rtmutex which left pi_state::owner and rtmutex::owner out of sync in a different and only slightly less dangerous way. Fixes: 1b7558e457ed ("futexes: fix fault handling in futex_lock_pi") Reported-by: gzobqq@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Simplify fixup_pi_state_owner()Lee Jones2021-02-101-14/+27
| | | | | | | | | | | | | | | From: Thomas Gleixner <tglx@linutronix.de> [ Upstream commit f2dac39d93987f7de1e20b3988c8685523247ae2 ] Too many gotos already and an upcoming fix would make it even more unreadable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Use pi_state_update_owner() in put_pi_state()Lee Jones2021-02-101-4/+1
| | | | | | | | | | | | | | From: Thomas Gleixner <tglx@linutronix.de> [ Upstream commit 6ccc84f917d33312eb2846bd7b567639f585ad6d ] No point in open coding it. This way it gains the extra sanity checks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* rtmutex: Remove unused argument from rt_mutex_proxy_unlock()Lee Jones2021-02-103-5/+3
| | | | | | | | | | | | | | From: Thomas Gleixner <tglx@linutronix.de> [ Upstream commit 2156ac1934166d6deb6cd0f6ffc4c1076ec63697 ] Nothing uses the argument. Remove it as preparation to use pi_state_update_owner(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Provide and use pi_state_update_owner()Lee Jones2021-02-101-31/+33
| | | | | | | | | | | | | | | | | | | | From: Thomas Gleixner <tglx@linutronix.de> [ Upstream commit c5cade200ab9a2a3be9e7f32a752c8d86b502ec7 ] Updating pi_state::owner is done at several places with the same code. Provide a function for it and use that at the obvious places. This is also a preparation for a bug fix to avoid yet another copy of the same code or alternatively introducing a completely unpenetratable mess of gotos. Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Replace pointless printk in fixup_owner()Lee Jones2021-02-101-7/+3
| | | | | | | | | | | | | | | | From: Thomas Gleixner <tglx@linutronix.de> [ Upstream commit 04b79c55201f02ffd675e1231d731365e335c307 ] If that unexpected case of inconsistent arguments ever happens then the futex state is left completely inconsistent and the printk is not really helpful. Replace it with a warning and make the state consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Avoid violating the 10th rule of futexLee Jones2021-02-103-20/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From: Peter Zijlstra <peterz@infradead.org> commit c1e2f0eaf015fb7076d51a339011f2383e6dd389 upstream. Julia reported futex state corruption in the following scenario: waiter waker stealer (prio > waiter) futex(WAIT_REQUEUE_PI, uaddr, uaddr2, timeout=[N ms]) futex_wait_requeue_pi() futex_wait_queue_me() freezable_schedule() <scheduled out> futex(LOCK_PI, uaddr2) futex(CMP_REQUEUE_PI, uaddr, uaddr2, 1, 0) /* requeues waiter to uaddr2 */ futex(UNLOCK_PI, uaddr2) wake_futex_pi() cmp_futex_value_locked(uaddr2, waiter) wake_up_q() <woken by waker> <hrtimer_wakeup() fires, clears sleeper->task> futex(LOCK_PI, uaddr2) __rt_mutex_start_proxy_lock() try_to_take_rt_mutex() /* steals lock */ rt_mutex_set_owner(lock, stealer) <preempted> <scheduled in> rt_mutex_wait_proxy_lock() __rt_mutex_slowlock() try_to_take_rt_mutex() /* fails, lock held by stealer */ if (timeout && !timeout->task) return -ETIMEDOUT; fixup_owner() /* lock wasn't acquired, so, fixup_pi_state_owner skipped */ return -ETIMEDOUT; /* At this point, we've returned -ETIMEDOUT to userspace, but the * futex word shows waiter to be the owner, and the pi_mutex has * stealer as the owner */ futex_lock(LOCK_PI, uaddr2) -> bails with EDEADLK, futex word says we're owner. And suggested that what commit: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state") removes from fixup_owner() looks to be just what is needed. And indeed it is -- I completely missed that requeue_pi could also result in this case. So we need to restore that, except that subsequent patches, like commit: 16ffa12d7425 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock") changed all the locking rules. Even without that, the sequence: - if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) { - locked = 1; - goto out; - } - raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock); - owner = rt_mutex_owner(&q->pi_state->pi_mutex); - if (!owner) - owner = rt_mutex_next_owner(&q->pi_state->pi_mutex); - raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock); - ret = fixup_pi_state_owner(uaddr, q, owner); already suggests there were races; otherwise we'd never have to look at next_owner. So instead of doing 3 consecutive wait_lock sections with who knows what races, we do it all in a single section. Additionally, the usage of pi_state->owner in fixup_owner() was only safe because only the rt_mutex owner would modify it, which this additional case wrecks. Luckily the values can only change away and not to the value we're testing, this means we can do a speculative test and double check once we have the wait_lock. Fixes: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state") Reported-by: Julia Cartwright <julia@ni.com> Reported-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Julia Cartwright <julia@ni.com> Tested-by: Gratian Crisan <gratian.crisan@ni.com> Cc: Darren Hart <dvhart@infradead.org> Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [Lee: Back-ported to solve a dependency] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Rework inconsistent rt_mutex/futex_q stateLee Jones2021-02-101-36/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From: Peter Zijlstra <peterz@infradead.org> [Upstream commit 73d786bd043ebc855f349c81ea805f6b11cbf2aa ] There is a weird state in the futex_unlock_pi() path when it interleaves with a concurrent futex_lock_pi() at the point where it drops hb->lock. In this case, it can happen that the rt_mutex wait_list and the futex_q disagree on pending waiters, in particular rt_mutex will find no pending waiters where futex_q thinks there are. In this case the rt_mutex unlock code cannot assign an owner. The futex side fixup code has to cleanup the inconsistencies with quite a bunch of interesting corner cases. Simplify all this by changing wake_futex_pi() to return -EAGAIN when this situation occurs. This then gives the futex_lock_pi() code the opportunity to continue and the retried futex_unlock_pi() will now observe a coherent state. The only problem is that this breaks RT timeliness guarantees. That is, consider the following scenario: T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1) CPU0 T1 lock_pi() queue_me() <- Waiter is visible preemption T2 unlock_pi() loops with -EAGAIN forever Which is undesirable for PI primitives. Future patches will rectify this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [Lee: Back-ported to solve a dependency] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Remove rt_mutex_deadlock_account_*()Lee Jones2021-02-104-40/+16
| | | | | | | | | | | | | | | | | | | | | From: Peter Zijlstra <peterz@infradead.org> These are unused and clutter up the code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [Lee: Back-ported to solve a dependency] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex,rt_mutex: Provide futex specific rt_mutex APILee Jones2021-02-103-33/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From: Peter Zijlstra <peterz@infradead.org> [ Upstream commit 5293c2efda37775346885c7e924d4ef7018ea60b ] Part of what makes futex_unlock_pi() intricate is that rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop rt_mutex::wait_lock. This means it cannot rely on the atomicy of wait_lock, which would be preferred in order to not rely on hb->lock so much. The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can race with the rt_mutex fastpath, however futexes have their own fast path. Since futexes already have a bunch of separate rt_mutex accessors, complete that set and implement a rt_mutex variant without fastpath for them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [Lee: Back-ported to solve a dependency] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Prevent exit livelockThomas Gleixner2021-02-031-15/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3ef240eaff36b8119ac9e2ea17cbf41179c930ba upstream. Oleg provided the following test case: int main(void) { struct sched_param sp = {}; sp.sched_priority = 2; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); int lock = vfork(); if (!lock) { sp.sched_priority = 1; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); _exit(0); } syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0); return 0; } This creates an unkillable RT process spinning in futex_lock_pi() on a UP machine or if the process is affine to a single CPU. The reason is: parent child set FIFO prio 2 vfork() -> set FIFO prio 1 implies wait_for_child() sched_setscheduler(...) exit() do_exit() .... mm_release() tsk->futex_state = FUTEX_STATE_EXITING; exit_futex(); (NOOP in this case) complete() --> wakes parent sys_futex() loop infinite because tsk->futex_state == FUTEX_STATE_EXITING The same problem can happen just by regular preemption as well: task holds futex ... do_exit() tsk->futex_state = FUTEX_STATE_EXITING; --> preemption (unrelated wakeup of some other higher prio task, e.g. timer) switch_to(other_task) return to user sys_futex() loop infinite as above Just for the fun of it the futex exit cleanup could trigger the wakeup itself before the task sets its futex state to DEAD. To cure this, the handling of the exiting owner is changed so: - A refcount is held on the task - The task pointer is stored in a caller visible location - The caller drops all locks (hash bucket, mmap_sem) and blocks on task::futex_exit_mutex. When the mutex is acquired then the exiting task has completed the cleanup and the state is consistent and can be reevaluated. This is not a pretty solution, but there is no choice other than returning an error code to user space, which would break the state consistency guarantee and open another can of problems including regressions. For stable backports the preparatory commits ac31c7ff8624 .. ba31c1a48538 are required as well, but for anything older than 5.3.y the backports are going to be provided when this hits mainline as the other dependencies for those kernels are definitely not stable material. Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems") Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stable Team <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Provide distinct return value when owner is exitingThomas Gleixner2021-02-031-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | commit ac31c7ff8624409ba3c4901df9237a616c187a5d upstream. attach_to_pi_owner() returns -EAGAIN for various cases: - Owner task is exiting - Futex value has changed The caller drops the held locks (hash bucket, mmap_sem) and retries the operation. In case of the owner task exiting this can result in a live lock. As a preparatory step for seperating those cases, provide a distinct return value (EBUSY) for the owner exiting case. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Add mutex around futex exitThomas Gleixner2021-02-031-0/+16
| | | | | | | | | | | | | | | | commit 3f186d974826847a07bc7964d79ec4eded475ad9 upstream. The mutex will be used in subsequent changes to replace the busy looping of a waiter when the futex owner is currently executing the exit cleanup to prevent a potential live lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Provide state handling for exec() as wellThomas Gleixner2021-02-031-4/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit af8cbda2cfcaa5515d61ec500498d46e9a8247e2 upstream. exec() attempts to handle potentially held futexes gracefully by running the futex exit handling code like exit() does. The current implementation has no protection against concurrent incoming waiters. The reason is that the futex state cannot be set to FUTEX_STATE_DEAD after the cleanup because the task struct is still active and just about to execute the new binary. While its arguably buggy when a task holds a futex over exec(), for consistency sake the state handling can at least cover the actual futex exit cleanup section. This provides state consistency protection accross the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the cleanup has been finished, this cannot prevent subsequent attempts to attach to the task in case that the cleanup was not successfull in mopping up all leftovers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Sanitize exit state handlingThomas Gleixner2021-02-031-7/+10
| | | | | | | | | | | | | | | commit 4a8e991b91aca9e20705d434677ac013974e0e30 upstream. Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock move the state setting into to the lock section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Mark the begin of futex exit explicitlyThomas Gleixner2021-02-032-8/+37
| | | | | | | | | | | | | | | | | | | | | commit 18f694385c4fd77a09851fd301236746ca83f3cb upstream. Instead of relying on PF_EXITING use an explicit state for the futex exit and set it in the futex exit function. This moves the smp barrier and the lock/unlock serialization into the futex code. As with the DEAD state this is restricted to the exit path as exec continues to use the same task struct. This allows to simplify that logic in a next step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Set task::futex_state to DEAD right after handling futex exitThomas Gleixner2021-02-032-1/+1
| | | | | | | | | | | | | | | | | | | commit f24f22435dcc11389acc87e5586239c1819d217c upstream. Setting task::futex_state in do_exit() is rather arbitrarily placed for no reason. Move it into the futex code. Note, this is only done for the exit cleanup as the exec cleanup cannot set the state to FUTEX_STATE_DEAD because the task struct is still in active use. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Split futex_mm_release() for exit/execThomas Gleixner2021-02-032-4/+8
| | | | | | | | | | | | | | | | | | commit 150d71584b12809144b8145b817e83b81158ae5f upstream. To allow separate handling of the futex exit state in the futex exit code for exit and exec, split futex_mm_release() into two functions and invoke them from the corresponding exit/exec_mm_release() callsites. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>