summaryrefslogtreecommitdiffstats
path: root/kernel/signal.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'sched-core-2021-04-28' of ↵Linus Torvalds2021-04-281-11/+48
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and debugfs interfaces to a unified debugfs interface. - Signals: Allow caching one sigqueue object per task, to improve performance & latencies. - Improve newidle_balance() irq-off latencies on systems with a large number of CPU cgroups. - Improve energy-aware scheduling - Improve the PELT metrics for certain workloads - Reintroduce select_idle_smt() to improve load-balancing locality - but without the previous regressions - Add 'scheduler latency debugging': warn after long periods of pending need_resched. This is an opt-in feature that requires the enabling of the LATENCY_WARN scheduler feature, or the use of the resched_latency_warn_ms=xx boot parameter. - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix remaining balance_push() vs. hotplug holes/races - PSI fixes, plus allow /proc/pressure/ files to be written by CAP_SYS_RESOURCE tasks as well - Fix/improve various load-balancing corner cases vs. capacity margins - Fix sched topology on systems with NUMA diameter of 3 or above - Fix PF_KTHREAD vs to_kthread() race - Minor rseq optimizations - Misc cleanups, optimizations, fixes and smaller updates * tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits) cpumask/hotplug: Fix cpu_dying() state tracking kthread: Fix PF_KTHREAD vs to_kthread() race sched/debug: Fix cgroup_path[] serialization sched,psi: Handle potential task count underflow bugs more gracefully sched: Warn on long periods of pending need_resched sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning sched/debug: Rename the sched_debug parameter to sched_verbose sched,fair: Alternative sched_slice() sched: Move /proc/sched_debug to debugfs sched,debug: Convert sysctl sched_domains to debugfs debugfs: Implement debugfs_create_str() sched,preempt: Move preempt_dynamic to debug.c sched: Move SCHED_DEBUG sysctl to debugfs sched: Don't make LATENCYTOP select SCHED_DEBUG sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG sched: Use cpu_dying() to fix balance_push vs hotplug-rollback cpumask: Introduce DYING mask cpumask: Make cpu_{online,possible,present,active}() inline rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs() ...
| * Merge tag 'v5.12-rc8' into sched/core, to pick up fixesIngo Molnar2021-04-201-3/+11
| |\ | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | signal: Allow tasks to cache one sigqueue structThomas Gleixner2021-04-141-2/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The idea for this originates from the real time tree to make signal delivery for realtime applications more efficient. In quite some of these application scenarios a control tasks signals workers to start their computations. There is usually only one signal per worker on flight. This works nicely as long as the kmem cache allocations do not hit the slow path and cause latencies. To cure this an optimistic caching was introduced (limited to RT tasks) which allows a task to cache a single sigqueue in a pointer in task_struct instead of handing it back to the kmem cache after consuming a signal. When the next signal is sent to the task then the cached sigqueue is used instead of allocating a new one. This solved the problem for this set of application scenarios nicely. The task cache is not preallocated so the first signal sent to a task goes always to the cache allocator. The cached sigqueue stays around until the task exits and is freed when task::sighand is dropped. After posting this solution for mainline the discussion came up whether this would be useful in general and should not be limited to realtime tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org One concern leading to the original limitation was to avoid a large amount of pointlessly cached sigqueues in alive tasks. The other concern was vs. RLIMIT_SIGPENDING as these cached sigqueues are not accounted for. The accounting problem is real, but on the other hand slightly academic. After gathering some statistics it turned out that after boot of a regular distro install there are less than 10 sigqueues cached in ~1500 tasks. In case of a 'mass fork and fire signal to child' scenario the extra 80 bytes of memory per task are well in the noise of the overall memory consumption of the fork bomb. If this should be limited then this would need an extra counter in struct user, more atomic instructions and a seperate rlimit. Yet another tunable which is mostly unused. The caching is actually used. After boot and a full kernel compile on a 64CPU machine with make -j128 the number of 'allocations' looks like this: From slab: 23996 From task cache: 52223 I.e. it reduces the number of slab cache operations by ~68%. A typical pattern there is: <...>-58490 __sigqueue_alloc: for 58488 from slab ffff8881132df460 <...>-58488 __sigqueue_free: cache ffff8881132df460 <...>-58488 __sigqueue_alloc: for 1149 from cache ffff8881103dc550 bash-1149 exit_task_sighand: free ffff8881132df460 bash-1149 __sigqueue_free: cache ffff8881103dc550 The interesting sequence is that the exiting task 58488 grabs the sigqueue from bash's task cache to signal exit and bash sticks it back into it's own cache. Lather, rinse and repeat. The caching is probably not noticable for the general use case, but the benefit for latency sensitive applications is clear. While kmem caches are usually just serving from the fast path the slab merging (default) can depending on the usage pattern of the merged slabs cause occasional slow path allocations. The time spared per cached entry is a few micro seconds per signal which is not relevant for e.g. a kernel build, but for signal heavy workloads it's measurable. As there is no real downside of this caching mechanism making it unconditionally available is preferred over more conditional code or new magic tunables. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
| * | signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()Thomas Gleixner2021-04-141-10/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no point in having the conditional at the callsite. Just hand in the allocation mode flag to __sigqueue_alloc() and use it to initialize sigqueue::flags. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210322092258.898677147@linutronix.de
* | | Merge tag 'perf-core-2021-04-28' of ↵Linus Torvalds2021-04-281-0/+13
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event updates from Ingo Molnar: - Improve Intel uncore PMU support: - Parse uncore 'discovery tables' - a new hardware capability enumeration method introduced on the latest Intel platforms. This table is in a well-defined PCI namespace location and is read via MMIO. It is organized in an rbtree. These uncore tables will allow the discovery of standard counter blocks, but fancier counters still need to be enumerated explicitly. - Add Alder Lake support - Improve IIO stacks to PMON mapping support on Skylake servers - Add Intel Alder Lake PMU support - which requires the introduction of 'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big') and Gracemont ('small' - Atom derived) cores. The CPU-side feature set is entirely symmetrical - but on the PMU side there's core type dependent PMU functionality. - Reduce data loss with CPU level hardware tracing on Intel PT / AUX profiling, by fixing the AUX allocation watermark logic. - Improve ring buffer allocation on NUMA systems - Put 'struct perf_event' into their separate kmem_cache pool - Add support for synchronous signals for select perf events. The immediate motivation is to support low-overhead sampling-based race detection for user-space code. The feature consists of the following main changes: - Add thread-only event inheritance via perf_event_attr::inherit_thread, which limits inheritance of events to CLONE_THREAD. - Add the ability for events to not leak through exec(), via perf_event_attr::remove_on_exec. - Allow the generation of SIGTRAP via perf_event_attr::sigtrap, extend siginfo with an u64 ::si_perf, and add the breakpoint information to ::si_addr and ::si_perf if the event is PERF_TYPE_BREAKPOINT. The siginfo support is adequate for breakpoints right now - but the new field can be used to introduce support for other types of metadata passed over siginfo as well. - Misc fixes, cleanups and smaller updates. * tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) signal, perf: Add missing TRAP_PERF case in siginfo_layout() signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures perf/x86: Allow for 8<num_fixed_counters<16 perf/x86/rapl: Add support for Intel Alder Lake perf/x86/cstate: Add Alder Lake CPU support perf/x86/msr: Add Alder Lake CPU support perf/x86/intel/uncore: Add Alder Lake support perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE perf/x86/intel: Add Alder Lake Hybrid support perf/x86: Support filter_match callback perf/x86/intel: Add attr_update for Hybrid PMUs perf/x86: Add structures for the attributes of Hybrid PMUs perf/x86: Register hybrid PMUs perf/x86: Factor out x86_pmu_show_pmu_cap perf/x86: Remove temporary pmu assignment in event_init perf/x86/intel: Factor out intel_pmu_check_extra_regs perf/x86/intel: Factor out intel_pmu_check_event_constraints perf/x86/intel: Factor out intel_pmu_check_num_counters perf/x86: Hybrid PMU support for extra_regs perf/x86: Hybrid PMU support for event constraints ...
| * | | signal, perf: Add missing TRAP_PERF case in siginfo_layout()Marco Elver2021-04-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the missing TRAP_PERF case in siginfo_layout() for interpreting the layout correctly as SIL_PERF_EVENT instead of just SIL_FAULT. This ensures the si_perf field is copied and not just the si_addr field. This was caught and tested by running the perf_events/sigtrap_threads kselftest as a 32-bit binary with a 64-bit kernel. Fixes: fb6cc127e0b6 ("signal: Introduce TRAP_PERF si_code and si_perf to siginfo") Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210422191823.79012-2-elver@google.com
| * | | signal: Introduce TRAP_PERF si_code and si_perf to siginfoMarco Elver2021-04-161-0/+11
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduces the TRAP_PERF si_code, and associated siginfo_t field si_perf. These will be used by the perf event subsystem to send signals (if requested) to the task where an event occurred. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic Link: https://lkml.kernel.org/r/20210408103605.1676875-6-elver@google.com
* | | Merge tag 'livepatching-for-5.13' of ↵Linus Torvalds2021-04-271-3/+1
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching Pull livepatching update from Petr Mladek: - Use TIF_NOTIFY_SIGNAL infrastructure instead of the fake signal * tag 'livepatching-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching: livepatch: Replace the fake signal sending with TIF_NOTIFY_SIGNAL infrastructure
| * | livepatch: Replace the fake signal sending with TIF_NOTIFY_SIGNAL infrastructureMiroslav Benes2021-03-301-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Livepatch sends a fake signal to all remaining blocking tasks of a running transition after a set period of time. It uses TIF_SIGPENDING flag for the purpose. Commit 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") added a generic infrastructure to achieve the same. Replace our bespoke solution with the generic one. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | Revert "signal: don't allow STOP on PF_IO_WORKER threads"Jens Axboe2021-03-271-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597. The IO threads allow and handle SIGSTOP now, so don't special case them anymore in task_set_jobctl_pending(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Revert "kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals"Jens Axboe2021-03-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 6fb8f43cede0e4bd3ead847de78d531424a96be9. The IO threads do allow signals now, including SIGSTOP, and we can allow ptrace attach. Attaching won't reveal anything interesting for the IO threads, but it will allow eg gdb to attach to a task with io_urings and IO threads without complaining. And once attached, it will allow the usual introspection into regular threads. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Revert "signal: don't allow sending any signals to PF_IO_WORKER threads"Jens Axboe2021-03-271-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 5be28c8f85ce99ed2d329d2ad8bdd18ea19473a5. IO threads now take signals just fine, so there's no reason to limit them specifically. Revert the change that prevented that from happening. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | kernel: don't call do_exit() for PF_IO_WORKER threadsJens Axboe2021-03-261-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now we're never calling get_signal() from PF_IO_WORKER threads, but in preparation for doing so, don't handle a fatal signal for them. The workers have state they need to cleanup when exiting, so just return instead of calling do_exit() on their behalf. The threads themselves will detect a fatal signal and do proper shutdown. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | signal: don't allow STOP on PF_IO_WORKER threadsEric W. Biederman2021-03-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like we don't allow normal signals to IO threads, don't deliver a STOP to a task that has PF_IO_WORKER set. The IO threads don't take signals in general, and have no means of flushing out a stop either. Longer term, we may want to look into allowing stop of these threads, as it relates to eg process freezing. For now, this prevents a spin issue if a SIGSTOP is delivered to the parent task. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | | signal: don't allow sending any signals to PF_IO_WORKER threadsJens Axboe2021-03-211-0/+3
| |/ |/| | | | | | | | | | | | | | | | | | | They don't take signals individually, and even if they share signals with the parent task, don't allow them to be delivered through the worker thread. Linux does allow this kind of behavior for regular threads, but it's really a compatability thing that we need not care about for the IO threads. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signalsJens Axboe2021-02-211-2/+2
|/ | | | Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-linus-2021-01-24' of ↵Linus Torvalds2021-01-241-1/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull misc fixes from Christian Brauner: - Jann reported sparse complaints because of a missing __user annotation in a helper we added way back when we added pidfd_send_signal() to avoid compat syscall handling. Fix it. - Yanfei replaces a reference in a comment to the _do_fork() helper I removed a while ago with a reference to the new kernel_clone() replacement - Alexander Guril added a simple coding style fix * tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: kthread: remove comments about old _do_fork() helper Kernel: fork.c: Fix coding style: Do not use {} around single-line statements signal: Add missing __user annotation to copy_siginfo_from_user_any
| * signal: Add missing __user annotation to copy_siginfo_from_user_anyJann Horn2021-01-111-1/+2
| | | | | | | | | | | | | | | | | | copy_siginfo_from_user_any() takes a userspace pointer as second argument; annotate the parameter type accordingly. Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20201207000252.138564-1-jannh@google.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* | task_work: unconditionally run task_work from get_signal()Jens Axboe2021-01-081-0/+3
|/ | | | | | | | | | | | | | | | | | | | | Song reported a boot regression in a kvm image with 5.11-rc, and bisected it down to the below patch. Debugging this issue, turns out that the boot stalled when a task is waiting on a pipe being released. As we no longer run task_work from get_signal() unless it's queued with TWA_SIGNAL, the task goes idle without running the task_work. This prevents ->release() from being called on the pipe, which another boot task is waiting on. For now, re-instate the unconditional task_work run from get_signal(). For 5.12, we'll collapse TWA_RESUME and TWA_SIGNAL, as it no longer makes sense to have a distinction between the two. This will turn task_work notification into a simple boolean, whether to notify or not. Fixes: 98b89b649fce ("signal: kill JOBCTL_TASK_WORK") Reported-by: Song Liu <songliubraving@fb.com> Tested-by: John Stultz <john.stultz@linaro.org> Tested-by: Douglas Anderson <dianders@chromium.org> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang version 11.0.1 Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds2020-12-161-22/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull TIF_NOTIFY_SIGNAL updates from Jens Axboe: "This sits on top of of the core entry/exit and x86 entry branch from the tip tree, which contains the generic and x86 parts of this work. Here we convert the rest of the archs to support TIF_NOTIFY_SIGNAL. With that done, we can get rid of JOBCTL_TASK_WORK from task_work and signal.c, and also remove a deadlock work-around in io_uring around knowing that signal based task_work waking is invoked with the sighand wait queue head lock. The motivation for this work is to decouple signal notify based task_work, of which io_uring is a heavy user of, from sighand. The sighand lock becomes a huge contention point, particularly for threaded workloads where it's shared between threads. Even outside of threaded applications it's slower than it needs to be. Roman Gershman <romger@amazon.com> reported that his networked workload dropped from 1.6M QPS at 80% CPU to 1.0M QPS at 100% CPU after io_uring was changed to use TIF_NOTIFY_SIGNAL. The time was all spent hammering on the sighand lock, showing 57% of the CPU time there [1]. There are further cleanups possible on top of this. One example is TIF_PATCH_PENDING, where a patch already exists to use TIF_NOTIFY_SIGNAL instead. Hopefully this will also lead to more consolidation, but the work stands on its own as well" [1] https://github.com/axboe/liburing/issues/215 * tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block: (28 commits) io_uring: remove 'twa_signal_ok' deadlock work-around kernel: remove checking for TIF_NOTIFY_SIGNAL signal: kill JOBCTL_TASK_WORK io_uring: JOBCTL_TASK_WORK is no longer used by task_work task_work: remove legacy TWA_SIGNAL path sparc: add support for TIF_NOTIFY_SIGNAL riscv: add support for TIF_NOTIFY_SIGNAL nds32: add support for TIF_NOTIFY_SIGNAL ia64: add support for TIF_NOTIFY_SIGNAL h8300: add support for TIF_NOTIFY_SIGNAL c6x: add support for TIF_NOTIFY_SIGNAL alpha: add support for TIF_NOTIFY_SIGNAL xtensa: add support for TIF_NOTIFY_SIGNAL arm: add support for TIF_NOTIFY_SIGNAL microblaze: add support for TIF_NOTIFY_SIGNAL hexagon: add support for TIF_NOTIFY_SIGNAL csky: add support for TIF_NOTIFY_SIGNAL openrisc: add support for TIF_NOTIFY_SIGNAL sh: add support for TIF_NOTIFY_SIGNAL um: add support for TIF_NOTIFY_SIGNAL ...
| * kernel: remove checking for TIF_NOTIFY_SIGNALJens Axboe2020-12-121-2/+0
| | | | | | | | | | | | It's available everywhere now, no need to check or add dummy defines. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * signal: kill JOBCTL_TASK_WORKJens Axboe2020-12-121-20/+0
| | | | | | | | | | | | It's no longer used, get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Merge tag 'core-entry-notify-signal' of ↵Jens Axboe2020-11-091-4/+18
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into tif-task_work.arch Core changes to support TASK_NOTIFY_SIGNAL * tag 'core-entry-notify-signal' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: task_work: Use TIF_NOTIFY_SIGNAL if available entry: Add support for TIF_NOTIFY_SIGNAL signal: Add task_sigpending() helper
* | \ Merge tag 'core-entry-2020-12-14' of ↵Linus Torvalds2020-12-141-4/+18
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core entry/exit updates from Thomas Gleixner: "A set of updates for entry/exit handling: - More generalization of entry/exit functionality - The consolidation work to reclaim TIF flags on x86 and also for non-x86 specific TIF flags which are solely relevant for syscall related work and have been moved into their own storage space. The x86 specific part had to be merged in to avoid a major conflict. - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal delivery mode of task work and results in an impressive performance improvement for io_uring. The non-x86 consolidation of this is going to come seperate via Jens. - The selective syscall redirection facility which provides a clean and efficient way to support the non-Linux syscalls of WINE by catching them at syscall entry and redirecting them to the user space emulation. This can be utilized for other purposes as well and has been designed carefully to avoid overhead for the regular fastpath. This includes the core changes and the x86 support code. - Simplification of the context tracking entry/exit handling for the users of the generic entry code which guarantee the proper ordering and protection. - Preparatory changes to make the generic entry code accomodate S390 specific requirements which are mostly related to their syscall restart mechanism" * tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) entry: Add syscall_exit_to_user_mode_work() entry: Add exit_to_user_mode() wrapper entry_Add_enter_from_user_mode_wrapper entry: Rename exit_to_user_mode() entry: Rename enter_from_user_mode() docs: Document Syscall User Dispatch selftests: Add benchmark for syscall user dispatch selftests: Add kselftest for syscall user dispatch entry: Support Syscall User Dispatch on common syscall entry kernel: Implement selective syscall userspace redirection signal: Expose SYS_USER_DISPATCH si_code type x86: vdso: Expose sigreturn address on vdso to the kernel MAINTAINERS: Add entry for common entry code entry: Fix boot for !CONFIG_GENERIC_ENTRY x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs sched: Detect call to schedule from critical entry code context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK x86: Reclaim unused x86 TI flags ...
| * \ \ Merge branch 'core/urgent' into core/entryThomas Gleixner2020-11-041-1/+1
| |\ \ \ | | |_|/ | |/| | | | | | Pick up the entry fix before further modifications.
| * | | entry: Add support for TIF_NOTIFY_SIGNALJens Axboe2020-10-291-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add TIF_NOTIFY_SIGNAL handling in the generic entry code, which if set, will return true if signal_pending() is used in a wait loop. That causes an exit of the loop so that notify_signal tracehooks can be run. If the wait loop is currently inside a system call, the system call is restarted once task_work has been processed. In preparation for only having arch_do_signal() handle syscall restarts if _TIF_SIGPENDING isn't set, rename it to arch_do_signal_or_restart(). Pass in a boolean that tells the architecture specific signal handler if it should attempt to get a signal, or just process a potential syscall restart. For !CONFIG_GENERIC_ENTRY archs, add the TIF_NOTIFY_SIGNAL handling to get_signal(). This is done to minimize the needed architecture changes to support this feature. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20201026203230.386348-3-axboe@kernel.dk
| * | | signal: Add task_sigpending() helperJens Axboe2020-10-291-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is in preparation for maintaining signal_pending() as the decider of whether or not a schedule() loop should be broken, or continue sleeping. This is different than the core signal use cases, which really need to know whether an actual signal is pending or not. task_sigpending() returns non-zero if TIF_SIGPENDING is set. Only core kernel use cases should care about the distinction between the two, make sure those use the task_sigpending() helper. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20201026203230.386348-2-axboe@kernel.dk
* | | | signal: define the SA_EXPOSE_TAGBITS bit in sa_flagsPeter Collingbourne2020-11-231-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Architectures that support address tagging, such as arm64, may want to expose fault address tag bits to the signal handler to help diagnose memory errors. However, these bits have not been previously set, and their presence may confuse unaware user applications. Therefore, introduce a SA_EXPOSE_TAGBITS flag bit in sa_flags that a signal handler may use to explicitly request that the bits are set. The generic signal handler APIs expect to receive tagged addresses. Architectures may specify how to untag addresses in the case where SA_EXPOSE_TAGBITS is clear by defining the arch_untagged_si_addr function. Signed-off-by: Peter Collingbourne <pcc@google.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Link: https://linux-review.googlesource.com/id/I16dd0ed2081f091fce97be0190cb8caa874c26cb Link: https://lkml.kernel.org/r/13cf24d00ebdd8e1f55caf1821c7c29d54100191.1605904350.git.pcc@google.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* | | | signal: define the SA_UNSUPPORTED bit in sa_flagsPeter Collingbourne2020-11-231-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Define a sa_flags bit, SA_UNSUPPORTED, which will never be supported in the uapi. The purpose of this flag bit is to allow userspace to distinguish an old kernel that does not clear unknown sa_flags bits from a kernel that supports every flag bit. In other words, if userspace does something like: act.sa_flags |= SA_UNSUPPORTED; sigaction(SIGSEGV, &act, 0); sigaction(SIGSEGV, 0, &oldact); and finds that SA_UNSUPPORTED remains set in oldact.sa_flags, it means that the kernel cannot be trusted to have cleared unknown flag bits from sa_flags, so no assumptions about flag bit support can be made. Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Dave Martin <Dave.Martin@arm.com> Link: https://linux-review.googlesource.com/id/Ic2501ad150a3a79c1cf27fb8c99be342e9dffbcb Link: https://lkml.kernel.org/r/bda7ddff8895a9bc4ffc5f3cf3d4d37a32118077.1605582887.git.pcc@google.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* | | | signal: clear non-uapi flag bits when passing/returning sa_flagsPeter Collingbourne2020-11-231-0/+10
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously we were not clearing non-uapi flag bits in sigaction.sa_flags when storing the userspace-provided sa_flags or when returning them via oldact. Start doing so. This allows userspace to detect missing support for flag bits and allows the kernel to use non-uapi bits internally, as we are already doing in arch/x86 for two flag bits. Now that this change is in place, we no longer need the code in arch/x86 that was hiding these bits from userspace, so remove it. This is technically a userspace-visible behavior change for sigaction, as the unknown bits returned via oldact.sa_flags are no longer set. However, we are free to define the behavior for unknown bits exactly because their behavior is currently undefined, so for now we can define the meaning of each of them to be "clear the bit in oldact.sa_flags unless the bit becomes known in the future". Furthermore, this behavior is consistent with OpenBSD [1], illumos [2] and XNU [3] (FreeBSD [4] and NetBSD [5] fail the syscall if unknown bits are set). So there is some precedent for this behavior in other kernels, and in particular in XNU, which is probably the most popular kernel among those that I looked at, which means that this change is less likely to be a compatibility issue. Link: [1] https://github.com/openbsd/src/blob/f634a6a4b5bf832e9c1de77f7894ae2625e74484/sys/kern/kern_sig.c#L278 Link: [2] https://github.com/illumos/illumos-gate/blob/76f19f5fdc974fe5be5c82a556e43a4df93f1de1/usr/src/uts/common/syscall/sigaction.c#L86 Link: [3] https://github.com/apple/darwin-xnu/blob/a449c6a3b8014d9406c2ddbdc81795da24aa7443/bsd/kern/kern_sig.c#L480 Link: [4] https://github.com/freebsd/freebsd/blob/eded70c37057857c6e23fae51f86b8f8f43cd2d0/sys/kern/kern_sig.c#L699 Link: [5] https://github.com/NetBSD/src/blob/3365779becdcedfca206091a645a0e8e22b2946e/sys/kern/sys_sig.c#L473 Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Dave Martin <Dave.Martin@arm.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Link: https://linux-review.googlesource.com/id/I35aab6f5be932505d90f3b3450c083b4db1eca86 Link: https://lkml.kernel.org/r/878dbcb5f47bc9b11881c81f745c0bef5c23f97f.1605235762.git.pcc@google.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* | | ptrace: fix task_join_group_stop() for the case when current is tracedOleg Nesterov2020-11-021-9/+10
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This testcase #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/ptrace.h> #include <sys/wait.h> #include <pthread.h> #include <assert.h> void *tf(void *arg) { return NULL; } int main(void) { int pid = fork(); if (!pid) { kill(getpid(), SIGSTOP); pthread_t th; pthread_create(&th, NULL, tf, NULL); return 0; } waitpid(pid, NULL, WSTOPPED); ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE); waitpid(pid, NULL, 0); ptrace(PTRACE_CONT, pid, 0,0); waitpid(pid, NULL, 0); int status; int thread = waitpid(-1, &status, 0); assert(thread > 0 && thread != pid); assert(status == 0x80137f); return 0; } fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap(). This is because task_join_group_stop() has 2 problems when current is traced: 1. We can't rely on the "JOBCTL_STOP_PENDING" check, a stopped tracee can be woken up by debugger and it can clone another thread which should join the group-stop. We need to check group_stop_count || SIGNAL_STOP_STOPPED. 2. If SIGNAL_STOP_STOPPED is already set, we should not increment sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread should stop without another do_notify_parent_cldstop() report. To clarify, the problem is very old and we should blame ptrace_init_task(). But now that we have task_join_group_stop() it makes more sense to fix this helper to avoid the code duplication. Reported-by: syzbot+3485e3773f7da290eecc@syzkaller.appspotmail.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <christian@brauner.io> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Zhiqiang Liu <liuzhiqiang26@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201019134237.GA18810@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva2020-08-231-1/+1
|/ | | | | | | | | | Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
* task_work: only grab task signal lock when neededJens Axboe2020-08-131-1/+15
| | | | | | | | | | | | | | | | | | | | | | | If JOBCTL_TASK_WORK is already set on the targeted task, then we need not go through {lock,unlock}_task_sighand() to set it again and queue a signal wakeup. This is safe as we're checking it _after_ adding the new task_work with cmpxchg(). The ordering is as follows: task_work_add() get_signal() -------------------------------------------------------------- STORE(task->task_works, new_work); STORE(task->jobctl); mb(); mb(); LOAD(task->jobctl); LOAD(task->task_works); This speeds up TWA_SIGNAL handling quite a bit, which is important now that io_uring is relying on it for all task_work deliveries. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jann Horn <jannh@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* signal: fix typo in dequeue_synchronous_signal()Pavel Machek2020-07-261-1/+1
| | | | | | | | | s/postive/positive/ Signed-off-by: Pavel Machek (CIP) <pavel@denx.de> Link: https://lore.kernel.org/r/20200724090531.GA14409@amd [christian.brauner@ubuntu.com: tweak commit message] Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* task_work: teach task_work_add() to do signal_wake_up()Oleg Nesterov2020-06-301-3/+7
| | | | | | | | | | | | | | | | | | So that the target task will exit the wait_event_interruptible-like loop and call task_work_run() asap. The patch turns "bool notify" into 0,TWA_RESUME,TWA_SIGNAL enum, the new TWA_SIGNAL flag implies signal_wake_up(). However, it needs to avoid the race with recalc_sigpending(), so the patch also adds the new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK. TODO: once this patch is merged we need to change all current users of task_work_add(notify = true) to use TWA_RESUME. Cc: stable@vger.kernel.org # v5.7 Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'work.set_fs-exec' of ↵Linus Torvalds2020-06-011-53/+53
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull uaccess/coredump updates from Al Viro: "set_fs() removal in coredump-related area - mostly Christoph's stuff..." * 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: binfmt_elf_fdpic: remove the set_fs(KERNEL_DS) in elf_fdpic_core_dump binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump binfmt_elf: remove the set_fs in fill_siginfo_note signal: refactor copy_siginfo_to_user32 powerpc/spufs: simplify spufs core dumping powerpc/spufs: stop using access_ok powerpc/spufs: fix copy_to_user while atomic
| * signal: refactor copy_siginfo_to_user32Christoph Hellwig2020-05-051-53/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Factor out a copy_siginfo_to_external32 helper from copy_siginfo_to_user32 that fills out the compat_siginfo, but does so on a kernel space data structure. With that we can let architectures override copy_siginfo_to_user32 with their own implementations using copy_siginfo_to_external32. That allows moving the x32 SIGCHLD purely to x86 architecture code. As a nice side effect copy_siginfo_to_external32 also comes in handy for avoiding a set_fs() call in the coredump code later on. Contains improvements from Eric W. Biederman <ebiederm@xmission.com> and Arnd Bergmann <arnd@arndb.de>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'for-linus' of ↵Linus Torvalds2020-04-231-1/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull SIGCHLD fix from Eric Biederman: "Christof Meerwald reported that do_notify_parent has not been successfully populating si_pid and si_uid for multi-threaded processes. This is the one-liner fix. Strictly speaking a one-liner plus comment" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal: Avoid corrupting si_pid and si_uid in do_notify_parent
| * | signal: Avoid corrupting si_pid and si_uid in do_notify_parentEric W. Biederman2020-04-211-1/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Christof Meerwald <cmeerw@cmeerw.org> writes: > Hi, > > this is probably related to commit > 7a0cf094944e2540758b7f957eb6846d5126f535 (signal: Correct namespace > fixups of si_pid and si_uid). > > With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a > properly set si_pid field - this seems to happen for multi-threaded > child processes. > > A simple test program (based on the sample from the signalfd man page): > > #include <sys/signalfd.h> > #include <signal.h> > #include <unistd.h> > #include <spawn.h> > #include <stdlib.h> > #include <stdio.h> > > #define handle_error(msg) \ > do { perror(msg); exit(EXIT_FAILURE); } while (0) > > int main(int argc, char *argv[]) > { > sigset_t mask; > int sfd; > struct signalfd_siginfo fdsi; > ssize_t s; > > sigemptyset(&mask); > sigaddset(&mask, SIGCHLD); > > if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1) > handle_error("sigprocmask"); > > pid_t chldpid; > char *chldargv[] = { "./sfdclient", NULL }; > posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL); > > sfd = signalfd(-1, &mask, 0); > if (sfd == -1) > handle_error("signalfd"); > > for (;;) { > s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo)); > if (s != sizeof(struct signalfd_siginfo)) > handle_error("read"); > > if (fdsi.ssi_signo == SIGCHLD) { > printf("Got SIGCHLD %d %d %d %d\n", > fdsi.ssi_status, fdsi.ssi_code, > fdsi.ssi_uid, fdsi.ssi_pid); > return 0; > } else { > printf("Read unexpected signal\n"); > } > } > } > > > and a multi-threaded client to test with: > > #include <unistd.h> > #include <pthread.h> > > void *f(void *arg) > { > sleep(100); > } > > int main() > { > pthread_t t[8]; > > for (int i = 0; i != 8; ++i) > { > pthread_create(&t[i], NULL, f, NULL); > } > } > > I tried to do a bit of debugging and what seems to be happening is > that > > /* From an ancestor pid namespace? */ > if (!task_pid_nr_ns(current, task_active_pid_ns(t))) { > > fails inside task_pid_nr_ns because the check for "pid_alive" fails. > > This code seems to be called from do_notify_parent and there we > actually have "tsk != current" (I am assuming both are threads of the > current process?) I instrumented the code with a warning and received the following backtrace: > WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15 > Modules linked in: > CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15 > Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 <0f> 0b e9 3fd > RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046 > RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29 > RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001 > R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140 > R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140 > FS: 0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0 > Call Trace: > send_signal+0x1c8/0x310 > do_notify_parent+0x50f/0x550 > release_task.part.21+0x4fd/0x620 > do_exit+0x6f6/0xaf0 > do_group_exit+0x42/0xb0 > get_signal+0x13b/0xbb0 > do_signal+0x2b/0x670 > ? __audit_syscall_exit+0x24d/0x2b0 > ? rcu_read_lock_sched_held+0x4d/0x60 > ? kfree+0x24c/0x2b0 > do_syscall_64+0x176/0x640 > ? trace_hardirqs_off_thunk+0x1a/0x1c > entry_SYSCALL_64_after_hwframe+0x49/0xb3 The immediate problem is as Christof noticed that "pid_alive(current) == false". This happens because do_notify_parent is called from the last thread to exit in a process after that thread has been reaped. The bigger issue is that do_notify_parent can be called from any process that manages to wait on a thread of a multi-threaded process from wait_task_zombie. So any logic based upon current for do_notify_parent is just nonsense, as current can be pretty much anything. So change do_notify_parent to call __send_signal directly. Inspecting the code it appears this problem has existed since the pid namespace support started handling this case in 2.6.30. This fix only backports to 7a0cf094944e ("signal: Correct namespace fixups of si_pid and si_uid") where the problem logic was moved out of __send_signal and into send_signal. Cc: stable@vger.kernel.org Fixes: 6588c1e3ff01 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary") Ref: 921cf9f63089 ("signals: protect cinit from unblocked SIG_DFL signals") Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/ Reported-by: Christof Meerwald <cmeerw@cmeerw.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | signal: use kill_proc_info instead of kill_pid_info in kill_something_infoZhiqiang Liu2020-04-121-6/+2
| | | | | | | | | | | | | | | | | | | | | | signal.c provides kill_proc_info, we can use it instead of kill_pid_info in kill_something_info func gracefully. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/80236965-f0b5-c888-95ff-855bdec75bb3@huawei.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* | signal: check sig before setting info in kill_pid_usb_asyncioZhiqiang Liu2020-04-121-3/+3
|/ | | | | | | | | | In kill_pid_usb_asyncio, if signal is not valid, we do not need to set info struct. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/f525fd08-1cf7-fb09-d20c-4359145eb940@huawei.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2020-04-021-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull exec/proc updates from Eric Biederman: "This contains two significant pieces of work: the work to sort out proc_flush_task, and the work to solve a deadlock between strace and exec. Fixing proc_flush_task so that it no longer requires a persistent mount makes improvements to proc possible. The removal of the persistent mount solves an old regression that that caused the hidepid mount option to only work on remount not on mount. The regression was found and reported by the Android folks. This further allows Alexey Gladkov's work making proc mount options specific to an individual mount of proc to move forward. The work on exec starts solving a long standing issue with exec that it takes mutexes of blocking userspace applications, which makes exec extremely deadlock prone. For the moment this adds a second mutex with a narrower scope that handles all of the easy cases. Which makes the tricky cases easy to spot. With a little luck the code to solve those deadlocks will be ready by next merge window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits) signal: Extend exec_id to 64bits pidfd: Use new infrastructure to fix deadlocks in execve perf: Use new infrastructure to fix deadlocks in execve proc: io_accounting: Use new infrastructure to fix deadlocks in execve proc: Use new infrastructure to fix deadlocks in execve kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve kernel: doc: remove outdated comment cred.c mm: docs: Fix a comment in process_vm_rw_core selftests/ptrace: add test cases for dead-locks exec: Fix a deadlock in strace exec: Add exec_update_mutex to replace cred_guard_mutex exec: Move exec_mmap right after de_thread in flush_old_exec exec: Move cleanup of posix timers on exec out of de_thread exec: Factor unshare_sighand out of de_thread and call it separately exec: Only compute current once in flush_old_exec pid: Improve the comment about waiting in zap_pid_ns_processes proc: Remove the now unnecessary internal mount of proc uml: Create a private mount of proc for mconsole uml: Don't consult current to find the proc_mnt in mconsole_proc proc: Use a list of inodes to flush from proc ...
| * signal: Extend exec_id to 64bitsEric W. Biederman2020-04-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can been seen that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that is is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains expoiltable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | signal: avoid double atomic counter increments for user accountingLinus Torvalds2020-02-261-9/+14
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When queueing a signal, we increment both the users count of pending signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount of the user struct itself (because we keep a reference to the user in the signal structure in order to correctly account for it when freeing). That turns out to be fairly expensive, because both of them are atomic updates, and particularly under extreme signal handling pressure on big machines, you can get a lot of cache contention on the user struct. That can then cause horrid cacheline ping-pong when you do these multiple accesses. So change the reference counting to only pin the user for the _first_ pending signal, and to unpin it when the last pending signal is dequeued. That means that when a user sees a lot of concurrent signal queuing - which is the only situation when this matters - the only atomic access needed is generally the 'sigpending' count update. This was noticed because of a particularly odd timing artifact on a dual-socket 96C/192T Cascade Lake platform: when you get into bad contention, on that machine for some reason seems to be much worse when the contention happens in the upper 32-byte half of the cacheline. As a result, the kernel test robot will-it-scale 'signal1' benchmark had an odd performance regression simply due to random alignment of the 'struct user_struct' (and pointed to a completely unrelated and apparently nonsensical commit for the regression). Avoiding the double increments (and decrements on the dequeueing side, of course) makes for much less contention and hugely improved performance on that will-it-scale microbenchmark. Quoting Feng Tang: "It makes a big difference, that the performance score is tripled! bump from original 17000 to 54000. Also the gap between 5.0-rc6 and 5.0-rc6+Jiri's patch is reduced to around 2%" [ The "2% gap" is the odd cacheline placement difference on that platform: under the extreme contention case, the effect of which half of the cacheline was hot was 5%, so with the reduced contention the odd timing artifact is reduced too ] It does help in the non-contended case too, but is not nearly as noticeable. Reported-and-tested-by: Feng Tang <feng.tang@intel.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Philip Li <philip.li@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched.h: Annotate sighand_struct with __rcuMadhuparna Bhowmik2020-01-261-1/+1
| | | | | | | | | | | | | | | | | | | | | This patch fixes the following sparse errors by annotating the sighand_struct with __rcu kernel/fork.c:1511:9: error: incompatible types in comparison expression kernel/exit.c:100:19: error: incompatible types in comparison expression kernel/signal.c:1370:27: error: incompatible types in comparison expression This fix introduces the following sparse error in signal.c due to checking the sighand pointer without rcu primitives: kernel/signal.c:1386:21: error: incompatible types in comparison expression This new sparse error is also fixed in this patch. Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20200124045908.26389-1-madhuparnabhowmik10@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* cgroup: freezer: call cgroup_enter_frozen() with preemption disabled in ↵Oleg Nesterov2019-10-111-1/+1
| | | | | | | | | | | | | | | ptrace_stop() ptrace_stop() does preempt_enable_no_resched() to avoid the preemption, but after that cgroup_enter_frozen() does spin_lock/unlock and this adds another preemption point. Reported-and-tested-by: Bruce Ashfield <bruce.ashfield@gmail.com> Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* Merge tag 'core-process-v5.4' of ↵Linus Torvalds2019-09-161-2/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd/waitid updates from Christian Brauner: "This contains two features and various tests. First, it adds support for waiting on process through pidfds by adding the P_PIDFD type to the waitid() syscall. This completes the basic functionality of the pidfd api (cf. [1]). In the meantime we also have a new adition to the userspace projects that make use of the pidfd api. The qt project was nice enough to send a mail pointing out that they have a pr up to switch to the pidfd api (cf. [2]). Second, this tag contains an extension to the waitid() syscall to make it possible to wait on the current process group in a race free manner (even though the actual problem is very unlikely) by specifing 0 together with the P_PGID type. This extension traces back to a discussion on the glibc development mailing list. There are also a range of tests for the features above. Additionally, the test-suite which detected the pidfd-polling race we fixed in [3] is included in this tag" [1] https://lwn.net/Articles/794707/ [2] https://codereview.qt-project.org/c/qt/qtbase/+/108456 [3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state") * tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: waitid: Add support for waiting for the current process group tests: add pidfd poll tests tests: move common definitions and functions into pidfd.h pidfd: add pidfd_wait tests pidfd: add P_PIDFD to waitid()
| * pidfd: add P_PIDFD to waitid()Christian Brauner2019-08-011-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the P_PIDFD type to waitid(). One of the last remaining bits for the pidfd api is to make it possible to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace that want to use the pidfd api to exclusively manage processes can do so now. One of the things this will unblock in the future is the ability to make it possible to retrieve the exit status via waitid(P_PIDFD) for non-parent processes if handed a _suitable_ pidfd that has this feature set. This is similar to what you can do on FreeBSD with kqueue(). It might even end up being possible to wait on a process as a non-parent if an appropriate property is enabled on the pidfd. With P_PIDFD no scoping of the process identified by the pidfd is possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0), waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics equivalent to wait4(pid), waitid(P_PID). Users that need scoping should rely on pid-based wait*() syscalls for now. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io
* | signal: Allow cifs and drbd to receive their terminating signalsEric W. Biederman2019-08-191-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | My recent to change to only use force_sig for a synchronous events wound up breaking signal reception cifs and drbd. I had overlooked the fact that by default kthreads start out with all signals set to SIG_IGN. So a change I thought was safe turned out to have made it impossible for those kernel thread to catch their signals. Reverting the work on force_sig is a bad idea because what the code was doing was very much a misuse of force_sig. As the way force_sig ultimately allowed the signal to happen was to change the signal handler to SIG_DFL. Which after the first signal will allow userspace to send signals to these kernel threads. At least for wake_ack_receiver in drbd that does not appear actively wrong. So correct this problem by adding allow_kernel_signal that will allow signals whose siginfo reports they were sent by the kernel through, but will not allow userspace generated signals, and update cifs and drbd to call allow_kernel_signal in an appropriate place so that their thread can receive this signal. Fixing things this way ensures that userspace won't be able to send signals and cause problems, that it is clear which signals the threads are expecting to receive, and it guarantees that nothing else in the system will be affected. This change was partly inspired by similar cifs and drbd patches that added allow_signal. Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com> Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Cc: Steve French <smfrench@gmail.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: David Laight <David.Laight@ACULAB.COM> Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes") Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig") Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig") Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | kernel/signal.c: fix a kernel-doc markupMauro Carvalho Chehab2019-08-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The kernel-doc parser doesn't handle expressions with %foo*. Instead, when an asterisk should be part of a constant, it uses an alternative notation: `foo*`. Link: http://lkml.kernel.org/r/7f18c2e0b5e39e6b7eb55ddeb043b8b260b49f2d.1563361575.git.mchehab+samsung@kernel.org Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>