summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
...
* bpf: Mark PTR_TO_FUNC register initially with zero offsetDaniel Borkmann2022-01-271-3/+6
| | | | | | | | | | | | | | | | | | commit d400a6cf1c8a57cdf10f35220ead3284320d85ff upstream. Similar as with other pointer types where we use ldimm64, clear the register content to zero first, and then populate the PTR_TO_FUNC type and subprogno number. Currently this is not done, and leads to reuse of stale register tracking data. Given for special ldimm64 cases we always clear the register offset, make it common for all cases, so it won't be forgotten in future. Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: Fix mount source show for bpffsYafang Shao2022-01-271-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1e9d74660d4df625b0889e77018f9e94727ceacd upstream. We noticed our tc ebpf tools can't start after we upgrade our in-house kernel version from 4.19 to 5.10. That is because of the behaviour change in bpffs caused by commit d2935de7e4fd ("vfs: Convert bpf to use the new mount API"). In our tc ebpf tools, we do strict environment check. If the environment is not matched, we won't allow to start the ebpf progs. One of the check is whether bpffs is properly mounted. The mount information of bpffs in kernel-4.19 and kernel-5.10 are as follows: - kernel 4.19 $ mount -t bpf bpffs /sys/fs/bpf $ mount -t bpf bpffs on /sys/fs/bpf type bpf (rw,relatime) - kernel 5.10 $ mount -t bpf bpffs /sys/fs/bpf $ mount -t bpf none on /sys/fs/bpf type bpf (rw,relatime) The device name in kernel-5.10 is displayed as none instead of bpffs, then our environment check fails. Currently we modify the tools to adopt to the kernel behaviour change, but I think we'd better change the kernel code to keep the behavior consistent. After this change, the mount information will be displayed the same with the behavior in kernel-4.19, for example: $ mount -t bpf bpffs /sys/fs/bpf $ mount -t bpf bpffs on /sys/fs/bpf type bpf (rw,relatime) Fixes: d2935de7e4fd ("vfs: Convert bpf to use the new mount API") Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/bpf/20220108134623.32467-1-laoar.shao@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing/osnoise: Properly unhook events if start_per_cpu_kthreads() failsNikita Yushchenko2022-01-271-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0878355b51f5f26632e652c848a8e174bb02d22d upstream. If start_per_cpu_kthreads() called from osnoise_workload_start() returns error, event hooks are left in broken state: unhook_irq_events() called but unhook_thread_events() and unhook_softirq_events() not called, and trace_osnoise_callback_enabled flag not cleared. On the next tracer enable, hooks get not installed due to trace_osnoise_callback_enabled flag. And on the further tracer disable an attempt to remove non-installed hooks happened, hitting a WARN_ON_ONCE() in tracepoint_remove_func(). Fix the error path by adding the missing part of cleanup. While at this, introduce osnoise_unhook_events() to avoid code duplication between this error path and normal tracer disable. Link: https://lkml.kernel.org/r/20220109153459.3701773-1-nikita.yushchenko@virtuozzo.com Cc: stable@vger.kernel.org Fixes: bce29ac9ce0b ("trace: Add osnoise tracer") Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Have syscall trace events use trace_event_buffer_lock_reserve()Steven Rostedt2022-01-271-4/+2
| | | | | | | | | | | | | | | | | | | | | | commit 3e2a56e6f639492311e0a8533f0a7aed60816308 upstream. Currently, the syscall trace events call trace_buffer_lock_reserve() directly, which means that it misses out on some of the filtering optimizations provided by the helper function trace_event_buffer_lock_reserve(). Have the syscall trace events call that instead, as it was missed when adding the update to use the temp buffer when filtering. Link: https://lkml.kernel.org/r/20220107225839.823118570@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 0fc1b09ff1ff4 ("tracing: Use temp buffer when filtering events") Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing/kprobes: 'nmissed' not showed correctly for kretprobeXiangyang Zhang2022-01-271-1/+4
| | | | | | | | | | | | | | | | | | commit dfea08a2116fe327f79d8f4d4b2cf6e0c88be11f upstream. The 'nmissed' column of the 'kprobe_profile' file for kretprobe is not showed correctly, kretprobe can be skipped by two reasons, shortage of kretprobe_instance which is counted by tk->rp.nmissed, and kprobe itself is missed by some reason, so to show the sum. Link: https://lkml.kernel.org/r/20220107150242.5019-1-xyz.sun.ok@gmail.com Cc: stable@vger.kernel.org Fixes: 4a846b443b4e ("tracing/kprobes: Cleanup kprobe tracer code") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Xiangyang Zhang <xyz.sun.ok@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/cpuacct: Fix user/system in shown cpuacct.usage*Andrey Ryabinin2022-01-271-47/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit dd02d4234c9a2214a81c57a16484304a1a51872a upstream. cpuacct has 2 different ways of accounting and showing user and system times. The first one uses cpuacct_account_field() to account times and cpuacct.stat file to expose them. And this one seems to work ok. The second one is uses cpuacct_charge() function for accounting and set of cpuacct.usage* files to show times. Despite some attempts to fix it in the past it still doesn't work. Sometimes while running KVM guest the cpuacct_charge() accounts most of the guest time as system time. This doesn't match with user&system times shown in cpuacct.stat or proc/<pid>/stat. Demonstration: # git clone https://github.com/aryabinin/kvmsample # make # mkdir /sys/fs/cgroup/cpuacct/test # echo $$ > /sys/fs/cgroup/cpuacct/test/tasks # ./kvmsample & # for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done 1976535645 2979839428 3979832704 4983603153 5983604157 Use cpustats accounted in cpuacct_account_field() as the source of user/sys times for cpuacct.usage* files. Make cpuacct_charge() to account only summary execution time. Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and sys_usage") Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20211115164607.23784-3-arbn@yandex-team.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cputime, cpuacct: Include guest time in user time in cpuacct.statAndrey Ryabinin2022-01-271-2/+2
| | | | | | | | | | | | | | | | | | | | | commit 9731698ecb9c851f353ce2496292ff9fcea39dff upstream. cpuacct.stat in no-root cgroups shows user time without guest time included int it. This doesn't match with user time shown in root cpuacct.stat and /proc/<pid>/stat. This also affects cgroup2's cpu.stat in the same way. Make account_guest_time() to add user time to cgroup's cpustat to fix this. Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics") Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20211115164607.23784-1-arbn@yandex-team.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* audit: ensure userspace is penalized the same as the kernel when under pressurePaul Moore2022-01-271-1/+17
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8f110f530635af44fff1f4ee100ecef0bac62510 ] Due to the audit control mutex necessary for serializing audit userspace messages we haven't been able to block/penalize userspace processes that attempt to send audit records while the system is under audit pressure. The result is that privileged userspace applications have a priority boost with respect to audit as they are not bound by the same audit queue throttling as the other tasks on the system. This patch attempts to restore some balance to the system when under audit pressure by blocking these privileged userspace tasks after they have finished their audit processing, and dropped the audit control mutex, but before they return to userspace. Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com> Tested-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* rcutorture: Avoid soft lockup during cpu stallWander Lairson Costa2022-01-271-0/+5
| | | | | | | | | | | | | | [ Upstream commit 5ff7c9f9d7e3e0f6db5b81945fa11b69d62f433a ] If we use the module stall_cpu option, we may get a soft lockup warning in case we also don't pass the stall_cpu_block option. Introduce the stall_no_softlockup option to avoid a soft lockup on cpu stall even if we don't use the stall_cpu_block option. Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* psi: Fix PSI_MEM_FULL state when tasks are in memstall and doing reclaimBrian Chen2022-01-272-18/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf ] We've noticed cases where tasks in a cgroup are stalled on memory but there is little memory FULL pressure since tasks stay on the runqueue in reclaim. A simple example involves a single threaded program that keeps leaking and touching large amounts of memory. It runs in a cgroup with swap enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though there is significant CPU pressure and memory SOME, there is barely any memory FULL since the task enters reclaim and stays on the runqueue. However, this memory-bound task is effectively stalled on memory and we expect memory FULL to match memory SOME in this scenario. The code is confused about memstall && running, thinking there is a stalled task and a productive task when there's only one task: a reclaimer that's counted as both. To fix this, we redefine the condition for PSI_MEM_FULL to check that all running tasks are in an active memstall instead of checking that there are no running tasks. case PSI_MEM_FULL: - return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]); + return unlikely(tasks[NR_MEMSTALL] && + tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]); This will capture reclaimers. It will also capture tasks that called psi_memstall_enter() and are about to sleep, but this should be negligible noise. Signed-off-by: Brian Chen <brianchen118@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* clocksource: Avoid accidental unstable marking of clocksourcesWaiman Long2022-01-271-9/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c86ff8c55b8ae68837b2fa59dc0c203907e9a15f ] Since commit db3a34e17433 ("clocksource: Retry clock read if long delays detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"), it is found that tsc clocksource fallback to hpet can sometimes happen on both Intel and AMD systems especially when they are running stressful benchmarking workloads. Of the 23 systems tested with a v5.14 kernel, 10 of them have switched to hpet clock source during the test run. The result of falling back to hpet is a drastic reduction of performance when running benchmarks. For example, the fio performance tests can drop up to 70% whereas the iperf3 performance can drop up to 80%. 4 hpet fallbacks happened during bootup. They were: [ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable Other fallbacks happen when the systems were running stressful benchmarks. For example: [ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"), changed the skew margin from 100us to 50us. I think this is too small and can easily be exceeded when running some stressful workloads on a thermally stressed system. So it is switched back to 100us. Even a maximum skew margin of 100us may be too small in for some systems when booting up especially if those systems are under thermal stress. To eliminate the case that the large skew is due to the system being too busy slowing down the reading of both the watchdog and the clocksource, an extra consecutive read of watchdog clock is being done to check this. The consecutive watchdog read delay is compared against WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that the system is just too busy. A warning will be printed to the console and the clock skew check is skipped for this round. Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected") Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: Fix verifier support for validation of async callbacksKris Van Hees2022-01-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a5bebc4f00dee47113eed48098c68e88b5ba70e8 ] Commit bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.") added support for BPF_FUNC_timer_set_callback to the __check_func_call() function. The test in __check_func_call() is flaweed because it can mis-interpret a regular BPF-to-BPF pseudo-call as a BPF_FUNC_timer_set_callback callback call. Consider the conditional in the code: if (insn->code == (BPF_JMP | BPF_CALL) && insn->imm == BPF_FUNC_timer_set_callback) { The BPF_FUNC_timer_set_callback has value 170. This means that if you have a BPF program that contains a pseudo-call with an instruction delta of 170, this conditional will be found to be true by the verifier, and it will interpret the pseudo-call as a callback. This leads to a mess with the verification of the program because it makes the wrong assumptions about the nature of this call. Solution: include an explicit check to ensure that insn->src_reg == 0. This ensures that calls cannot be mis-interpreted as an async callback call. Fixes: bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.") Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220105210150.GH1559@oracle.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: Don't promote bogus looking registers after null check.Daniel Borkmann2022-01-271-6/+6
| | | | | | | | | | | | | | | | [ Upstream commit e60b0d12a95dcf16a63225cead4541567f5cb517 ] If we ever get to a point again where we convert a bogus looking <ptr>_or_null typed register containing a non-zero fixed or variable offset, then lets not reset these bounds to zero since they are not and also don't promote the register to a <ptr> type, but instead leave it as <ptr>_or_null. Converting to a unknown register could be an avenue as well, but then if we run into this case it would allow to leak a kernel pointer this way. Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* rcu/exp: Mark current CPU as exp-QS in IPI loop second passFrederic Weisbecker2022-01-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 81f6d49cce2d2fe507e3fddcc4a6db021d9c2e7b ] Expedited RCU grace periods invoke sync_rcu_exp_select_node_cpus(), which takes two passes over the leaf rcu_node structure's CPUs. The first pass gathers up the current CPU and CPUs that are in dynticks idle mode. The workqueue will report a quiescent state on their behalf later. The second pass sends IPIs to the rest of the CPUs, but excludes the current CPU, incorrectly assuming it has been included in the first pass's list of CPUs. Unfortunately the current CPU may have changed between the first and second pass, due to the fact that the various rcu_node structures' ->lock fields have been dropped, thus momentarily enabling preemption. This means that if the second pass's CPU was not on the first pass's list, it will be ignored completely. There will be no IPI sent to it, and there will be no reporting of quiescent states on its behalf. Unfortunately, the expedited grace period will nevertheless be waiting for that CPU to report a quiescent state, but with that CPU having no reason to believe that such a report is needed. The result will be an expedited grace period stall. Fix this by no longer excluding the current CPU from consideration during the second pass. Fixes: b9ad4d6ed18e ("rcu: Avoid self-IPI in sync_rcu_exp_select_node_cpus()") Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* sched/rt: Try to restart rt period timer when rt runtime exceededLi Hua2022-01-271-5/+18
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9b58e976b3b391c0cf02e038d53dd0478ed3013c ] When rt_runtime is modified from -1 to a valid control value, it may cause the task to be throttled all the time. Operations like the following will trigger the bug. E.g: 1. echo -1 > /proc/sys/kernel/sched_rt_runtime_us 2. Run a FIFO task named A that executes while(1) 3. echo 950000 > /proc/sys/kernel/sched_rt_runtime_us When rt_runtime is -1, The rt period timer will not be activated when task A enqueued. And then the task will be throttled after setting rt_runtime to 950,000. The task will always be throttled because the rt period timer is not activated. Fixes: d0b27fa77854 ("sched: rt-group: synchonised bandwidth period") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Li Hua <hucool.lihua@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203033618.11895-1-hucool.lihua@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: Remove config check to enable bpf support for branch recordsKajol Jain2022-01-271-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit db52f57211b4e45f0ebb274e2c877b211dc18591 ] Branch data available to BPF programs can be very useful to get stack traces out of userspace application. Commit fff7b64355ea ("bpf: Add bpf_read_branch_records() helper") added BPF support to capture branch records in x86. Enable this feature also for other architectures as well by removing checks specific to x86. If an architecture doesn't support branch records, bpf_read_branch_records() still has appropriate checks and it will return an -EINVAL in that scenario. Based on UAPI helper doc in include/uapi/linux/bpf.h, unsupported architectures should return -ENOENT in such case. Hence, update the appropriate check to return -ENOENT instead. Selftest 'perf_branches' result on power9 machine which has the branch stacks support: - Before this patch: [command]# ./test_progs -t perf_branches #88/1 perf_branches/perf_branches_hw:FAIL #88/2 perf_branches/perf_branches_no_hw:OK #88 perf_branches:FAIL Summary: 0/1 PASSED, 0 SKIPPED, 1 FAILED - After this patch: [command]# ./test_progs -t perf_branches #88/1 perf_branches/perf_branches_hw:OK #88/2 perf_branches/perf_branches_no_hw:OK #88 perf_branches:OK Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED Selftest 'perf_branches' result on power9 machine which doesn't have branch stack report: - After this patch: [command]# ./test_progs -t perf_branches #88/1 perf_branches/perf_branches_hw:SKIP #88/2 perf_branches/perf_branches_no_hw:OK #88 perf_branches:OK Summary: 1/1 PASSED, 1 SKIPPED, 0 FAILED Fixes: fff7b64355eac ("bpf: Add bpf_read_branch_records() helper") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211206073315.77432-1-kjain@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: Disallow BPF_LOG_KERNEL log level for bpf(BPF_BTF_LOAD)Hou Tao2022-01-272-5/+4
| | | | | | | | | | | | | | | | | [ Upstream commit 866de407444398bc8140ea70de1dba5f91cc34ac ] BPF_LOG_KERNEL is only used internally, so disallow bpf_btf_load() to set log level as BPF_LOG_KERNEL. The same checking has already been done in bpf_check(), so factor out a helper to check the validity of log attributes and use it in both places. Fixes: 8580ac9404f6 ("bpf: Process in-kernel BTF") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20211203053001.740945-1-houtao1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf: Adjust BTF log size limit.Alexei Starovoitov2022-01-271-1/+1
| | | | | | | | | | | | | [ Upstream commit c5a2d43e998a821701029f23e25b62f9188e93ff ] Make BTF log size limit to be the same as the verifier log size limit. Otherwise tools that progressively increase log size and use the same log for BTF loading and program loading will be hitting hard to debug EINVAL. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211201181040.23337-7-alexei.starovoitov@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* sched/fair: Fix per-CPU kthread and wakee stacking for asym CPU capacityVincent Donnefort2022-01-271-1/+2
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 014ba44e8184e1acf93e0cbb7089ee847802f8f0 ] select_idle_sibling() has a special case for tasks woken up by a per-CPU kthread where the selected CPU is the previous one. For asymmetric CPU capacity systems, the assumption was that the wakee couldn't have a bigger utilization during task placement than it used to have during the last activation. That was not considering uclamp.min which can completely change between two task activations and as a consequence mandates the fitness criterion asym_fits_capacity(), even for the exit path described above. Fixes: b4c9c9f15649 ("sched/fair: Prefer prev cpu in asymmetric wakeup path") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211129173115.4006346-1-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* sched/fair: Fix detection of per-CPU kthreads waking a taskVincent Donnefort2022-01-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8b4e74ccb582797f6f0b0a50372ebd9fd2372a27 ] select_idle_sibling() has a special case for tasks woken up by a per-CPU kthread, where the selected CPU is the previous one. However, the current condition for this exit path is incomplete. A task can wake up from an interrupt context (e.g. hrtimer), while a per-CPU kthread is running. A such scenario would spuriously trigger the special case described above. Also, a recent change made the idle task like a regular per-CPU kthread, hence making that situation more likely to happen (is_per_cpu_kthread(swapper) being true now). Checking for task context makes sure select_idle_sibling() will not interpret a wake up from any other context as a wake up by a per-CPU kthread. Fixes: 52262ee567ad ("sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression") Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20211201143450.479472-1-vincent.donnefort@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* dma/pool: create dma atomic pool only if dma zone has managed pagesBaoquan He2022-01-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a674e48c5443d12a8a43c3ac42367aa39505d506 upstream. Currently three dma atomic pools are initialized as long as the relevant kernel codes are built in. While in kdump kernel of x86_64, this is not right when trying to create atomic_pool_dma, because there's no managed pages in DMA zone. In the case, DMA zone only has low 1M memory presented and locked down by memblock allocator. So no pages are added into buddy of DMA zone. Please check commit f1d4d47c5851 ("x86/setup: Always reserve the first 1M of RAM"). Then in kdump kernel of x86_64, it always prints below failure message: DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-0.rc5.20210611git929d931f2b40.42.fc35.x86_64 #1 Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.12.0 06/04/2018 Call Trace: dump_stack+0x7f/0xa1 warn_alloc.cold+0x72/0xd6 __alloc_pages_slowpath.constprop.0+0xf29/0xf50 __alloc_pages+0x24d/0x2c0 alloc_page_interleave+0x13/0xb0 atomic_pool_expand+0x118/0x210 __dma_atomic_pool_init+0x45/0x93 dma_atomic_pool_init+0xdb/0x176 do_one_initcall+0x67/0x320 kernel_init_freeable+0x290/0x2dc kernel_init+0xa/0x111 ret_from_fork+0x22/0x30 Mem-Info: ...... DMA: failed to allocate 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations Here, let's check if DMA zone has managed pages, then create atomic_pool_dma if yes. Otherwise just skip it. Link: https://lkml.kernel.org/r/20211223094435.248523-3-bhe@redhat.com Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified") Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* perf: Protect perf_guest_cbs with RCUSean Christopherson2022-01-201-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ff083a2d972f56bebfd82409ca62e5dfce950961 upstream. Protect perf_guest_cbs with RCU to fix multiple possible errors. Luckily, all paths that read perf_guest_cbs already require RCU protection, e.g. to protect the callback chains, so only the direct perf_guest_cbs touchpoints need to be modified. Bug #1 is a simple lack of WRITE_ONCE/READ_ONCE behavior to ensure perf_guest_cbs isn't reloaded between a !NULL check and a dereference. Fixed via the READ_ONCE() in rcu_dereference(). Bug #2 is that on weakly-ordered architectures, updates to the callbacks themselves are not guaranteed to be visible before the pointer is made visible to readers. Fixed by the smp_store_release() in rcu_assign_pointer() when the new pointer is non-NULL. Bug #3 is that, because the callbacks are global, it's possible for readers to run in parallel with an unregisters, and thus a module implementing the callbacks can be unloaded while readers are in flight, resulting in a use-after-free. Fixed by a synchronize_rcu() call when unregistering callbacks. Bug #1 escaped notice because it's extremely unlikely a compiler will reload perf_guest_cbs in this sequence. perf_guest_cbs does get reloaded for future derefs, e.g. for ->is_user_mode(), but the ->is_in_guest() guard all but guarantees the consumer will win the race, e.g. to nullify perf_guest_cbs, KVM has to completely exit the guest and teardown down all VMs before KVM start its module unload / unregister sequence. This also makes it all but impossible to encounter bug #3. Bug #2 has not been a problem because all architectures that register callbacks are strongly ordered and/or have a static set of callbacks. But with help, unloading kvm_intel can trigger bug #1 e.g. wrapping perf_guest_cbs with READ_ONCE in perf_misc_flags() while spamming kvm_intel module load/unload leads to: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:perf_misc_flags+0x1c/0x70 Call Trace: perf_prepare_sample+0x53/0x6b0 perf_event_output_forward+0x67/0x160 __perf_event_overflow+0x52/0xf0 handle_pmi_common+0x207/0x300 intel_pmu_handle_irq+0xcf/0x410 perf_event_nmi_handler+0x28/0x50 nmi_handle+0xc7/0x260 default_do_nmi+0x6b/0x170 exc_nmi+0x103/0x130 asm_exc_nmi+0x76/0xbf Fixes: 39447b386c84 ("perf: Enhance perf to allow for guest statistic collection from host") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211111020738.2512932-2-seanjc@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: Fix out of bounds access from invalid *_or_null type verificationDaniel Borkmann2022-01-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ no upstream commit given implicitly fixed through the larger refactoring in c25b2ae136039ffa820c26138ed4a5e5f3ab3841 ] While auditing some other code, I noticed missing checks inside the pointer arithmetic simulation, more specifically, adjust_ptr_min_max_vals(). Several *_OR_NULL types are not rejected whereas they are _required_ to be rejected given the expectation is that they get promoted into a 'real' pointer type for the success case, that is, after an explicit != NULL check. One case which stands out and is accessible from unprivileged (iff enabled given disabled by default) is BPF ring buffer. From crafting a PoC, the NULL check can be bypassed through an offset, and its id marking will then lead to promotion of mem_or_null to a mem type. bpf_ringbuf_reserve() helper can trigger this case through passing of reserved flags, for example. func#0 @0 0: R1=ctx(id=0,off=0,imm=0) R10=fp0 0: (7a) *(u64 *)(r10 -8) = 0 1: R1=ctx(id=0,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm 1: (18) r1 = 0x0 3: R1_w=map_ptr(id=0,off=0,ks=0,vs=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm 3: (b7) r2 = 8 4: R1_w=map_ptr(id=0,off=0,ks=0,vs=0,imm=0) R2_w=invP8 R10=fp0 fp-8_w=mmmmmmmm 4: (b7) r3 = 0 5: R1_w=map_ptr(id=0,off=0,ks=0,vs=0,imm=0) R2_w=invP8 R3_w=invP0 R10=fp0 fp-8_w=mmmmmmmm 5: (85) call bpf_ringbuf_reserve#131 6: R0_w=mem_or_null(id=2,ref_obj_id=2,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm refs=2 6: (bf) r6 = r0 7: R0_w=mem_or_null(id=2,ref_obj_id=2,off=0,imm=0) R6_w=mem_or_null(id=2,ref_obj_id=2,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm refs=2 7: (07) r0 += 1 8: R0_w=mem_or_null(id=2,ref_obj_id=2,off=1,imm=0) R6_w=mem_or_null(id=2,ref_obj_id=2,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm refs=2 8: (15) if r0 == 0x0 goto pc+4 R0_w=mem(id=0,ref_obj_id=0,off=0,imm=0) R6_w=mem(id=0,ref_obj_id=2,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm refs=2 9: R0_w=mem(id=0,ref_obj_id=0,off=0,imm=0) R6_w=mem(id=0,ref_obj_id=2,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm refs=2 9: (62) *(u32 *)(r6 +0) = 0 R0_w=mem(id=0,ref_obj_id=0,off=0,imm=0) R6_w=mem(id=0,ref_obj_id=2,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm refs=2 10: R0_w=mem(id=0,ref_obj_id=0,off=0,imm=0) R6_w=mem(id=0,ref_obj_id=2,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm refs=2 10: (bf) r1 = r6 11: R0_w=mem(id=0,ref_obj_id=0,off=0,imm=0) R1_w=mem(id=0,ref_obj_id=2,off=0,imm=0) R6_w=mem(id=0,ref_obj_id=2,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm refs=2 11: (b7) r2 = 0 12: R0_w=mem(id=0,ref_obj_id=0,off=0,imm=0) R1_w=mem(id=0,ref_obj_id=2,off=0,imm=0) R2_w=invP0 R6_w=mem(id=0,ref_obj_id=2,off=0,imm=0) R10=fp0 fp-8_w=mmmmmmmm refs=2 12: (85) call bpf_ringbuf_submit#132 13: R6=invP(id=0) R10=fp0 fp-8=mmmmmmmm 13: (b7) r0 = 0 14: R0_w=invP0 R6=invP(id=0) R10=fp0 fp-8=mmmmmmmm 14: (95) exit from 8 to 13: safe processed 15 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 0 OK All three commits, that is b121b341e598 ("bpf: Add PTR_TO_BTF_ID_OR_NULL support"), 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it"), and the afbf21dce668 ("bpf: Support readonly/readwrite buffers in verifier") suffer the same cause and their *_OR_NULL type pendants must be rejected in adjust_ptr_min_max_vals(). Make the test more robust by reusing reg_type_may_be_null() helper such that we catch all *_OR_NULL types we have today and in future. Note that pointer arithmetic on PTR_TO_BTF_ID, PTR_TO_RDONLY_BUF, and PTR_TO_RDWR_BUF is generally allowed. Fixes: b121b341e598 ("bpf: Add PTR_TO_BTF_ID_OR_NULL support") Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Fixes: afbf21dce668 ("bpf: Support readonly/readwrite buffers in verifier") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* workqueue: Fix unbind_workers() VS wq_worker_running() raceFrederic Weisbecker2022-01-161-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 07edfece8bcb0580a1828d939e6f8d91a8603eb2 upstream. At CPU-hotplug time, unbind_worker() may preempt a worker while it is waking up. In that case the following scenario can happen: unbind_workers() wq_worker_running() -------------- ------------------- if (!(worker->flags & WORKER_NOT_RUNNING)) //PREEMPTED by unbind_workers worker->flags |= WORKER_UNBOUND; [...] atomic_set(&pool->nr_running, 0); //resume to worker atomic_inc(&worker->pool->nr_running); After unbind_worker() resets pool->nr_running, the value is expected to remain 0 until the pool ever gets rebound in case cpu_up() is called on the target CPU in the future. But here the race leaves pool->nr_running with a value of 1, triggering the following warning when the worker goes idle: WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0 Modules linked in: CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Workqueue: 0x0 (rcu_par_gp) RIP: 0010:worker_enter_idle+0x95/0xc0 Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0 RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086 RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140 RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080 R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20 R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140 FS: 0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> worker_thread+0x89/0x3d0 ? process_one_work+0x400/0x400 kthread+0x162/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Also due to this incorrect "nr_running == 1", further queued work may end up not being served, because no worker is awaken at work insert time. This raises rcutorture writer stalls for example. Fix this with disabling preemption in the right place in wq_worker_running(). It's worth noting that if the worker migrates and runs concurrently with unbind_workers(), it is guaranteed to see the WORKER_UNBOUND flag update due to set_cpus_allowed_ptr() acquiring/releasing rq->lock. Fixes: 6d25be5782e4 ("sched/core, workqueues: Distangle worker accounting from rq lock") Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cgroup: Use open-time cgroup namespace for process migration perm checksTejun Heo2022-01-112-9/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e57457641613fef0d147ede8bd6a3047df588b95 upstream. cgroup process migration permission checks are performed at write time as whether a given operation is allowed or not is dependent on the content of the write - the PID. This currently uses current's cgroup namespace which is a potential security weakness as it may allow scenarios where a less privileged process tricks a more privileged one into writing into a fd that it created. This patch makes cgroup remember the cgroup namespace at the time of open and uses it for migration permission checks instad of current's. Note that this only applies to cgroup2 as cgroup1 doesn't have namespace support. This also fixes a use-after-free bug on cgroupns reported in https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com Note that backporting this fix also requires the preceding patch. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reported-by: syzbot+50f5cf33a284ce738b62@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/00000000000048c15c05d0083397@google.com Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option") Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cgroup: Allocate cgroup_file_ctx for kernfs_open_file->privTejun Heo2022-01-113-31/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 0d2b5955b36250a9428c832664f2079cbf723bec upstream. of->priv is currently used by each interface file implementation to store private information. This patch collects the current two private data usages into struct cgroup_file_ctx which is allocated and freed by the common path. This allows generic private data which applies to multiple files, which will be used to in the following patch. Note that cgroup_procs iterator is now embedded as procs.iter in the new cgroup_file_ctx so that it doesn't need to be allocated and freed separately. v2: union dropped from cgroup_file_ctx and the procs iterator is embedded in cgroup_file_ctx as suggested by Linus. v3: Michal pointed out that cgroup1's procs pidlist uses of->priv too. Converted. Didn't change to embedded allocation as cgroup1 pidlists get stored for caching. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cgroup: Use open-time credentials for process migraton perm checksTejun Heo2022-01-112-4/+12
| | | | | | | | | | | | | | | | | | | | | | commit 1756d7994ad85c2479af6ae5a9750b92324685af upstream. cgroup process migration permission checks are performed at write time as whether a given operation is allowed or not is dependent on the content of the write - the PID. This currently uses current's credentials which is a potential security weakness as it may allow scenarios where a less privileged process tricks a more privileged one into writing into a fd that it created. This patch makes both cgroup2 and cgroup1 process migration interfaces to use the credentials saved at the time of open (file->f_cred) instead of current's. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Fixes: 187fe84067bd ("cgroup: require write perm on common ancestor when moving processes on the default hierarchy") Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Tag trace_percpu_buffer as a percpu pointerNaveen N. Rao2022-01-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | commit f28439db470cca8b6b082239314e9fd10bd39034 upstream. Tag trace_percpu_buffer as a percpu pointer to resolve warnings reported by sparse: /linux/kernel/trace/trace.c:3218:46: warning: incorrect type in initializer (different address spaces) /linux/kernel/trace/trace.c:3218:46: expected void const [noderef] __percpu *__vpp_verify /linux/kernel/trace/trace.c:3218:46: got struct trace_buffer_struct * /linux/kernel/trace/trace.c:3234:9: warning: incorrect type in initializer (different address spaces) /linux/kernel/trace/trace.c:3234:9: expected void const [noderef] __percpu *__vpp_verify /linux/kernel/trace/trace.c:3234:9: got int * Link: https://lkml.kernel.org/r/ebabd3f23101d89cb75671b68b6f819f5edc830b.1640255304.git.naveen.n.rao@linux.vnet.ibm.com Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Fixes: 07d777fe8c398 ("tracing: Add percpu buffers for trace_printk()") Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix check for trace_percpu_buffer validity in get_trace_buf()Naveen N. Rao2022-01-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 823e670f7ed616d0ce993075c8afe0217885f79d upstream. With the new osnoise tracer, we are seeing the below splat: Kernel attempted to read user page (c7d880000) - exploit attempt? (uid: 0) BUG: Unable to handle kernel data access on read at 0xc7d880000 Faulting instruction address: 0xc0000000002ffa10 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries ... NIP [c0000000002ffa10] __trace_array_vprintk.part.0+0x70/0x2f0 LR [c0000000002ff9fc] __trace_array_vprintk.part.0+0x5c/0x2f0 Call Trace: [c0000008bdd73b80] [c0000000001c49cc] put_prev_task_fair+0x3c/0x60 (unreliable) [c0000008bdd73be0] [c000000000301430] trace_array_printk_buf+0x70/0x90 [c0000008bdd73c00] [c0000000003178b0] trace_sched_switch_callback+0x250/0x290 [c0000008bdd73c90] [c000000000e70d60] __schedule+0x410/0x710 [c0000008bdd73d40] [c000000000e710c0] schedule+0x60/0x130 [c0000008bdd73d70] [c000000000030614] interrupt_exit_user_prepare_main+0x264/0x270 [c0000008bdd73de0] [c000000000030a70] syscall_exit_prepare+0x150/0x180 [c0000008bdd73e10] [c00000000000c174] system_call_vectored_common+0xf4/0x278 osnoise tracer on ppc64le is triggering osnoise_taint() for negative duration in get_int_safe_duration() called from trace_sched_switch_callback()->thread_exit(). The problem though is that the check for a valid trace_percpu_buffer is incorrect in get_trace_buf(). The check is being done after calculating the pointer for the current cpu, rather than on the main percpu pointer. Fix the check to be against trace_percpu_buffer. Link: https://lkml.kernel.org/r/a920e4272e0b0635cf20c444707cbce1b2c8973d.1640255304.git.naveen.n.rao@linux.vnet.ibm.com Cc: stable@vger.kernel.org Fixes: e2ace001176dc9 ("tracing: Choose static tp_printk buffer by explicit nesting count") Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kernel/crash_core: suppress unknown crashkernel parameter warningPhilipp Rudo2021-12-291-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 71d2bcec2d4d69ff109c497e6611d6c53c8926d4 ] When booting with crashkernel= on the kernel command line a warning similar to Kernel command line: ro console=ttyS0 crashkernel=256M Unknown kernel command line parameters "crashkernel=256M", will be passed to user space. is printed. This comes from crashkernel= being parsed independent from the kernel parameter handling mechanism. So the code in init/main.c doesn't know that crashkernel= is a valid kernel parameter and prints this incorrect warning. Suppress the warning by adding a dummy early_param handler for crashkernel=. Link: https://lkml.kernel.org/r/20211208133443.6867-1-prudo@redhat.com Fixes: 86d1919a4fb0 ("init: print out unknown kernel parameters") Signed-off-by: Philipp Rudo <prudo@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ucounts: Fix rlimit max values checkAlexey Gladkov2021-12-291-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 59ec71575ab440cd5ca0aa53b2a2985b3639fad4 ] The semantics of the rlimit max values differs from ucounts itself. When creating a new userns, we store the current rlimit of the process in ucount_max. Thus, the value of the limit in the parent userns is saved in the created one. The problem is that now we are taking the maximum value for counter from the same userns. So for init_user_ns it will always be RLIM_INFINITY. To fix the problem we need to check the counter value with the max value stored in userns. Reproducer: su - test -c "ulimit -u 3; sleep 5 & sleep 6 & unshare -U --map-root-user sh -c 'sleep 7 & sleep 8 & date; wait'" Before: [1] 175 [2] 176 Fri Nov 26 13:48:20 UTC 2021 [1]- Done sleep 5 [2]+ Done sleep 6 After: [1] 167 [2] 168 sh: fork: retry: Resource temporarily unavailable sh: fork: retry: Resource temporarily unavailable sh: fork: retry: Resource temporarily unavailable sh: fork: retry: Resource temporarily unavailable sh: fork: retry: Resource temporarily unavailable sh: fork: retry: Resource temporarily unavailable sh: fork: retry: Resource temporarily unavailable sh: fork: Interrupted system call [1]- Done sleep 5 [2]+ Done sleep 6 Fixes: c54b245d0118 ("Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace") Reported-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/024ec805f6e16896f0b23e094773790d171d2c1c.1638218242.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* rcu: Mark accesses to rcu_state.n_force_qsPaul E. McKenney2021-12-221-5/+5
| | | | | | | | | | | | commit 2431774f04d1050292054c763070021bade7b151 upstream. This commit marks accesses to the rcu_state.n_force_qs. These data races are hard to make happen, but syzkaller was equal to the task. Reported-by: syzbot+e08a83a1940ec3846cd5@syzkaller.appspotmail.com Acked-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* locking/rtmutex: Fix incorrect condition in rtmutex_spin_on_owner()Zqiang2021-12-221-1/+1
| | | | | | | | | | | | | | | | commit 8f556a326c93213927e683fc32bbf5be1b62540a upstream. Optimistic spinning needs to be terminated when the spinning waiter is not longer the top waiter on the lock, but the condition is negated. It terminates if the waiter is the top waiter, which is defeating the whole purpose. Fixes: c3123c431447 ("locking/rtmutex: Dont dereference waiter lockless") Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211217074207.77425-1-qiang1.zhang@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timekeeping: Really make sure wall_to_monotonic isn't positiveYu Liao2021-12-221-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4e8c11b6b3f0b6a283e898344f154641eda94266 upstream. Even after commit e1d7ba873555 ("time: Always make sure wall_to_monotonic isn't positive") it is still possible to make wall_to_monotonic positive by running the following code: int main(void) { struct timespec time; clock_gettime(CLOCK_MONOTONIC, &time); time.tv_nsec = 0; clock_settime(CLOCK_REALTIME, &time); return 0; } The reason is that the second parameter of timespec64_compare(), ts_delta, may be unnormalized because the delta is calculated with an open coded substraction which causes the comparison of tv_sec to yield the wrong result: wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 } ts_delta = { .tv_sec = -9, .tv_nsec = -900000000 } That makes timespec64_compare() claim that wall_to_monotonic < ts_delta, but actually the result should be wall_to_monotonic > ts_delta. After normalization, the result of timespec64_compare() is correct because the tv_sec comparison is not longer misleading: wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 } ts_delta = { .tv_sec = -10, .tv_nsec = 100000000 } Use timespec64_sub() to ensure that ts_delta is normalized, which fixes the issue. Fixes: e1d7ba873555 ("time: Always make sure wall_to_monotonic isn't positive") Signed-off-by: Yu Liao <liaoyu15@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211213135727.1656662-1-liaoyu15@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* audit: improve robustness of the audit queue handlingPaul Moore2021-12-221-11/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f4b3ee3c85551d2d343a3ba159304066523f730f upstream. If the audit daemon were ever to get stuck in a stopped state the kernel's kauditd_thread() could get blocked attempting to send audit records to the userspace audit daemon. With the kernel thread blocked it is possible that the audit queue could grow unbounded as certain audit record generating events must be exempt from the queue limits else the system enter a deadlock state. This patch resolves this problem by lowering the kernel thread's socket sending timeout from MAX_SCHEDULE_TIMEOUT to HZ/10 and tweaks the kauditd_send_queue() function to better manage the various audit queues when connection problems occur between the kernel and the audit daemon. With this patch, the backlog may temporarily grow beyond the defined limits when the audit daemon is stopped and the system is under heavy audit pressure, but kauditd_thread() will continue to make progress and drain the queues as it would for other connection problems. For example, with the audit daemon put into a stopped state and the system configured to audit every syscall it was still possible to shutdown the system without a kernel panic, deadlock, etc.; granted, the system was slow to shutdown but that is to be expected given the extreme pressure of recording every syscall. The timeout value of HZ/10 was chosen primarily through experimentation and this developer's "gut feeling". There is likely no one perfect value, but as this scenario is limited in scope (root privileges would be needed to send SIGSTOP to the audit daemon), it is likely not worth exposing this as a tunable at present. This can always be done at a later date if it proves necessary. Cc: stable@vger.kernel.org Fixes: 5b52330bbfe63 ("audit: fix auditd/kernel connection state tracking") Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com> Tested-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: Fix kernel address leakage in atomic cmpxchg's r0 aux regDaniel Borkmann2021-12-221-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a82fe085f344ef20b452cd5f481010ff96b5c4cd upstream. The implementation of BPF_CMPXCHG on a high level has the following parameters: .-[old-val] .-[new-val] BPF_R0 = cmpxchg{32,64}(DST_REG + insn->off, BPF_R0, SRC_REG) `-[mem-loc] `-[old-val] Given a BPF insn can only have two registers (dst, src), the R0 is fixed and used as an auxilliary register for input (old value) as well as output (returning old value from memory location). While the verifier performs a number of safety checks, it misses to reject unprivileged programs where R0 contains a pointer as old value. Through brute-forcing it takes about ~16sec on my machine to leak a kernel pointer with BPF_CMPXCHG. The PoC is basically probing for kernel addresses by storing the guessed address into the map slot as a scalar, and using the map value pointer as R0 while SRC_REG has a canary value to detect a matching address. Fix it by checking R0 for pointers, and reject if that's the case for unprivileged programs. Fixes: 5ffa25502b5a ("bpf: Add instructions for atomic_[cmp]xchg") Reported-by: Ryota Shiga (Flatt Security) Acked-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: Make 32->64 bounds propagation slightly more robustDaniel Borkmann2021-12-221-9/+15
| | | | | | | | | | | | | | commit e572ff80f05c33cd0cb4860f864f5c9c044280b6 upstream. Make the bounds propagation in __reg_assign_32_into_64() slightly more robust and readable by aligning it similarly as we did back in the __reg_combine_64_into_32() counterpart. Meaning, only propagate or pessimize them as a smin/smax pair. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: Fix signed bounds propagation after mov32Daniel Borkmann2021-12-221-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3cf2b61eb06765e27fec6799292d9fb46d0b7e60 upstream. For the case where both s32_{min,max}_value bounds are positive, the __reg_assign_32_into_64() directly propagates them to their 64 bit counterparts, otherwise it pessimises them into [0,u32_max] universe and tries to refine them later on by learning through the tnum as per comment in mentioned function. However, that does not always happen, for example, in mov32 operation we call zext_32_to_64(dst_reg) which invokes the __reg_assign_32_into_64() as is without subsequent bounds update as elsewhere thus no refinement based on tnum takes place. Thus, not calling into the __update_reg_bounds() / __reg_deduce_bounds() / __reg_bound_offset() triplet as we do, for example, in case of ALU ops via adjust_scalar_min_max_vals(), will lead to more pessimistic bounds when dumping the full register state: Before fix: 0: (b4) w0 = -1 1: R0_w=invP4294967295 (id=0,imm=ffffffff, smin_value=4294967295,smax_value=4294967295, umin_value=4294967295,umax_value=4294967295, var_off=(0xffffffff; 0x0), s32_min_value=-1,s32_max_value=-1, u32_min_value=-1,u32_max_value=-1) 1: (bc) w0 = w0 2: R0_w=invP4294967295 (id=0,imm=ffffffff, smin_value=0,smax_value=4294967295, umin_value=4294967295,umax_value=4294967295, var_off=(0xffffffff; 0x0), s32_min_value=-1,s32_max_value=-1, u32_min_value=-1,u32_max_value=-1) Technically, the smin_value=0 and smax_value=4294967295 bounds are not incorrect, but given the register is still a constant, they break assumptions about const scalars that smin_value == smax_value and umin_value == umax_value. After fix: 0: (b4) w0 = -1 1: R0_w=invP4294967295 (id=0,imm=ffffffff, smin_value=4294967295,smax_value=4294967295, umin_value=4294967295,umax_value=4294967295, var_off=(0xffffffff; 0x0), s32_min_value=-1,s32_max_value=-1, u32_min_value=-1,u32_max_value=-1) 1: (bc) w0 = w0 2: R0_w=invP4294967295 (id=0,imm=ffffffff, smin_value=4294967295,smax_value=4294967295, umin_value=4294967295,umax_value=4294967295, var_off=(0xffffffff; 0x0), s32_min_value=-1,s32_max_value=-1, u32_min_value=-1,u32_max_value=-1) Without the smin_value == smax_value and umin_value == umax_value invariant being intact for const scalars, it is possible to leak out kernel pointers from unprivileged user space if the latter is enabled. For example, when such registers are involved in pointer arithmtics, then adjust_ptr_min_max_vals() will taint the destination register into an unknown scalar, and the latter can be exported and stored e.g. into a BPF map value. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Kuee K1r0a <liulin063@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: Fix kernel address leakage in atomic fetchDaniel Borkmann2021-12-221-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7d3baf0afa3aa9102d6a521a8e4c41888bb79882 upstream. The change in commit 37086bfdc737 ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH") around check_mem_access() handling is buggy since this would allow for unprivileged users to leak kernel pointers. For example, an atomic fetch/and with -1 on a stack destination which holds a spilled pointer will migrate the spilled register type into a scalar, which can then be exported out of the program (since scalar != pointer) by dumping it into a map value. The original implementation of XADD was preventing this situation by using a double call to check_mem_access() one with BPF_READ and a subsequent one with BPF_WRITE, in both cases passing -1 as a placeholder value instead of register as per XADD semantics since it didn't contain a value fetch. The BPF_READ also included a check in check_stack_read_fixed_off() which rejects the program if the stack slot is of __is_pointer_value() if dst_regno < 0. The latter is to distinguish whether we're dealing with a regular stack spill/ fill or some arithmetical operation which is disallowed on non-scalars, see also 6e7e63cbb023 ("bpf: Forbid XADD on spilled pointers for unprivileged users") for more context on check_mem_access() and its handling of placeholder value -1. One minimally intrusive option to fix the leak is for the BPF_FETCH case to initially check the BPF_READ case via check_mem_access() with -1 as register, followed by the actual load case with non-negative load_reg to propagate stack bounds to registers. Fixes: 37086bfdc737 ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH") Reported-by: <n4ke4mry@gmail.com> Acked-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Fix a kmemleak false positive in tracing_mapChen Jun2021-12-171-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f25667e5980a4333729cac3101e5de1bb851f71a ] Doing the command: echo 'hist:key=common_pid.execname,common_timestamp' > /sys/kernel/debug/tracing/events/xxx/trigger Triggers many kmemleak reports: unreferenced object 0xffff0000c7ea4980 (size 128): comm "bash", pid 338, jiffies 4294912626 (age 9339.324s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000f3469921>] kmem_cache_alloc_trace+0x4c0/0x6f0 [<0000000054ca40c3>] hist_trigger_elt_data_alloc+0x140/0x178 [<00000000633bd154>] tracing_map_init+0x1f8/0x268 [<000000007e814ab9>] event_hist_trigger_func+0xca0/0x1ad0 [<00000000bf8520ed>] trigger_process_regex+0xd4/0x128 [<00000000f549355a>] event_trigger_write+0x7c/0x120 [<00000000b80f898d>] vfs_write+0xc4/0x380 [<00000000823e1055>] ksys_write+0x74/0xf8 [<000000008a9374aa>] __arm64_sys_write+0x24/0x30 [<0000000087124017>] do_el0_svc+0x88/0x1c0 [<00000000efd0dcd1>] el0_svc+0x1c/0x28 [<00000000dbfba9b3>] el0_sync_handler+0x88/0xc0 [<00000000e7399680>] el0_sync+0x148/0x180 unreferenced object 0xffff0000c7ea4980 (size 128): comm "bash", pid 338, jiffies 4294912626 (age 9339.324s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000f3469921>] kmem_cache_alloc_trace+0x4c0/0x6f0 [<0000000054ca40c3>] hist_trigger_elt_data_alloc+0x140/0x178 [<00000000633bd154>] tracing_map_init+0x1f8/0x268 [<000000007e814ab9>] event_hist_trigger_func+0xca0/0x1ad0 [<00000000bf8520ed>] trigger_process_regex+0xd4/0x128 [<00000000f549355a>] event_trigger_write+0x7c/0x120 [<00000000b80f898d>] vfs_write+0xc4/0x380 [<00000000823e1055>] ksys_write+0x74/0xf8 [<000000008a9374aa>] __arm64_sys_write+0x24/0x30 [<0000000087124017>] do_el0_svc+0x88/0x1c0 [<00000000efd0dcd1>] el0_svc+0x1c/0x28 [<00000000dbfba9b3>] el0_sync_handler+0x88/0xc0 [<00000000e7399680>] el0_sync+0x148/0x180 The reason is elts->pages[i] is alloced by get_zeroed_page. and kmemleak will not scan the area alloced by get_zeroed_page. The address stored in elts->pages will be regarded as leaked. That is, the elts->pages[i] will have pointers loaded onto it as well, and without telling kmemleak about it, those pointers will look like memory without a reference. To fix this, call kmemleak_alloc to tell kmemleak to scan elts->pages[i] Link: https://lkml.kernel.org/r/20211124140801.87121-1-chenjun102@huawei.com Signed-off-by: Chen Jun <chenjun102@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* wait: add wake_up_pollfree()Eric Biggers2021-12-141-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 42288cb44c4b5fff7653bc392b583a2b8bd6a8c0 upstream. Several ->poll() implementations are special in that they use a waitqueue whose lifetime is the current task, rather than the struct file as is normally the case. This is okay for blocking polls, since a blocking poll occurs within one task; however, non-blocking polls require another solution. This solution is for the queue to be cleared before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'. However, that has a bug: wake_up_poll() calls __wake_up() with nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters, and the wakeup function for the first one returns a positive value, only that one will be called. That's *not* what's needed for POLLFREE; POLLFREE is special in that it really needs to wake up everyone. Considering the three non-blocking poll systems: - io_uring poll doesn't handle POLLFREE at all, so it is broken anyway. - aio poll is unaffected, since it doesn't support exclusive waits. However, that's fragile, as someone could add this feature later. - epoll doesn't appear to be broken by this, since its wakeup function returns 0 when it sees POLLFREE. But this is fragile. Although there is a workaround (see epoll), it's better to define a function which always sends POLLFREE to all waiters. Add such a function. Also make it verify that the queue really becomes empty after all waiters have been woken up. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timers: implement usleep_idle_range()SeongJae Park2021-12-141-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e4779015fd5d2fb8390c258268addff24d6077c7 upstream. Patch series "mm/damon: Fix fake /proc/loadavg reports", v3. This patchset fixes DAMON's fake load report issue. The first patch makes yet another variant of usleep_range() for this fix, and the second patch fixes the issue of DAMON by making it using the newly introduced function. This patch (of 2): Some kernel threads such as DAMON could need to repeatedly sleep in micro seconds level. Because usleep_range() sleeps in uninterruptible state, however, such threads would make /proc/loadavg reports fake load. To help such cases, this commit implements a variant of usleep_range() called usleep_idle_range(). It is same to usleep_range() but sets the state of the current task as TASK_IDLE while sleeping. Link: https://lkml.kernel.org/r/20211126145015.15862-1-sj@kernel.org Link: https://lkml.kernel.org/r/20211126145015.15862-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: John Stultz <john.stultz@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: Fix the off-by-two error in range markingsMaxim Mikityanskiy2021-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2fa7d94afc1afbb4d702760c058dc2d7ed30f226 upstream. The first commit cited below attempts to fix the off-by-one error that appeared in some comparisons with an open range. Due to this error, arithmetically equivalent pieces of code could get different verdicts from the verifier, for example (pseudocode): // 1. Passes the verifier: if (data + 8 > data_end) return early read *(u64 *)data, i.e. [data; data+7] // 2. Rejected by the verifier (should still pass): if (data + 7 >= data_end) return early read *(u64 *)data, i.e. [data; data+7] The attempted fix, however, shifts the range by one in a wrong direction, so the bug not only remains, but also such piece of code starts failing in the verifier: // 3. Rejected by the verifier, but the check is stricter than in #1. if (data + 8 >= data_end) return early read *(u64 *)data, i.e. [data; data+7] The change performed by that fix converted an off-by-one bug into off-by-two. The second commit cited below added the BPF selftests written to ensure than code chunks like #3 are rejected, however, they should be accepted. This commit fixes the off-by-two error by adjusting new_range in the right direction and fixes the tests by changing the range into the one that should actually fail. Fixes: fb2a311a31d3 ("bpf: fix off by one for range markings with L{T, E} patterns") Fixes: b37242c773b2 ("bpf: add test cases to bpf selftests to cover all access tests") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211130181607.593149-1-maximmi@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/uclamp: Fix rq->uclamp_max not set on first enqueueQais Yousef2021-12-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 315c4f884800c45cb6bd8c90422fad554a8b9588 ] Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to match the woken up task's uclamp_max when the rq is idle. The code was relying on rq->uclamp_max initialized to zero, so on first enqueue static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p, enum uclamp_id clamp_id) { ... if (uc_se->value > READ_ONCE(uc_rq->value)) WRITE_ONCE(uc_rq->value, uc_se->value); } was actually resetting it. But since commit d81ae8aac85c changed the default to 1024, this no longer works. And since rq->uclamp_flags is also initialized to 0, neither above code path nor uclamp_idle_reset() update the rq->uclamp_max on first wake up from idle. This is only visible from first wake up(s) until the first dequeue to idle after enabling the static key. And it only matters if the uclamp_max of this task is < 1024 since only then its uclamp_max will be effectively ignored. Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to ensure uclamp_idle_reset() is called which then will update the rq uclamp_max value as expected. Fixes: d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq") Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* preempt/dynamic: Fix setup_preempt_mode() return valueAndrew Halaney2021-12-081-2/+2
| | | | | | | | | | | | | | [ Upstream commit 9ed20bafc85806ca6c97c9128cec46c3ef80ae86 ] __setup() callbacks expect 1 for success and 0 for failure. Correct the usage here to reflect that. Fixes: 826bfeb37bb4 ("preempt/dynamic: Support dynamic preempt with preempt= boot option") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* tracing/histograms: String compares should not care about signed valuesSteven Rostedt (VMware)2021-12-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | commit 450fec13d9170127678f991698ac1a5b05c02e2f upstream. When comparing two strings for the "onmatch" histogram trigger, fields that are strings use string comparisons, which do not care about being signed or not. Do not fail to match two string fields if one is unsigned char array and the other is a signed char array. Link: https://lore.kernel.org/all/20211129123043.5cfd687a@gandalf.local.home/ Cc: stable@vgerk.kernel.org Cc: Tom Zanussi <zanussi@kernel.org> Cc: Yafang Shao <laoar.shao@gmail.com> Fixes: b05e89ae7cf3b ("tracing: Accept different type for synthetic event fields") Reviewed-by: Masami Hiramatsu <mhiramatsu@kernel.org> Reported-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kprobes: Limit max data_size of the kretprobe instancesMasami Hiramatsu2021-12-081-0/+3
| | | | | | | | | | | | | | | | | | | | | | | commit 6bbfa44116689469267f1a6e3d233b52114139d2 upstream. The 'kprobe::data_size' is unsigned, thus it can not be negative. But if user sets it enough big number (e.g. (size_t)-8), the result of 'data_size + sizeof(struct kretprobe_instance)' becomes smaller than sizeof(struct kretprobe_instance) or zero. In result, the kretprobe_instance are allocated without enough memory, and kretprobe accesses outside of allocated memory. To avoid this issue, introduce a max limitation of the kretprobe::data_size. 4KB per instance should be OK. Link: https://lkml.kernel.org/r/163836995040.432120.10322772773821182925.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: f47cd9b553aa ("kprobes: kretprobe user entry-handler") Reported-by: zhangyue <zhangyue1@kylinos.cn> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Don't use out-of-sync va_list in event printingNikita Yushchenko2021-12-081-0/+12
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2ef75e9bd2c998f1c6f6f23a3744136105ddefd5 ] If trace_seq becomes full, trace_seq_vprintf() no longer consumes arguments from va_list, making va_list out of sync with format processing by trace_check_vprintf(). This causes va_arg() in trace_check_vprintf() to return wrong positional argument, which results into a WARN_ON_ONCE() hit. ftrace_stress_test from LTP triggers this situation. Fix it by explicitly avoiding further use if va_list at the point when it's consistency can no longer be guaranteed. Link: https://lkml.kernel.org/r/20211118145516.13219-1-nikita.yushchenko@virtuozzo.com Signed-off-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* tracing: Check pid filtering when creating eventsSteven Rostedt (VMware)2021-12-011-0/+10
| | | | | | | | | | | | | | | | | | commit 6cb206508b621a9a0a2c35b60540e399225c8243 upstream. When pid filtering is activated in an instance, all of the events trace files for that instance has the PID_FILTER flag set. This determines whether or not pid filtering needs to be done on the event, otherwise the event is executed as normal. If pid filtering is enabled when an event is created (via a dynamic event or modules), its flag is not updated to reflect the current state, and the events are not filtered properly. Cc: stable@vger.kernel.org Fixes: 3fdaf80f4a836 ("tracing: Implement event pid filtering") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/scs: Reset task stack state in bringup_cpu()Mark Rutland2021-12-012-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit dce1ca0525bfdc8a69a9343bc714fbc19a2f04b3 ] To hot unplug a CPU, the idle task on that CPU calls a few layers of C code before finally leaving the kernel. When KASAN is in use, poisoned shadow is left around for each of the active stack frames, and when shadow call stacks are in use. When shadow call stacks (SCS) are in use the task's saved SCS SP is left pointing at an arbitrary point within the task's shadow call stack. When a CPU is offlined than onlined back into the kernel, this stale state can adversely affect execution. Stale KASAN shadow can alias new stackframes and result in bogus KASAN warnings. A stale SCS SP is effectively a memory leak, and prevents a portion of the shadow call stack being used. Across a number of hotplug cycles the idle task's entire shadow call stack can become unusable. We previously fixed the KASAN issue in commit: e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug") ... by removing any stale KASAN stack poison immediately prior to onlining a CPU. Subsequently in commit: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") ... the refactoring left the KASAN and SCS cleanup in one-time idle thread initialization code rather than something invoked prior to each CPU being onlined, breaking both as above. We fixed SCS (but not KASAN) in commit: 63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit") ... but as this runs in the context of the idle task being offlined it's potentially fragile. To fix these consistently and more robustly, reset the SCS SP and KASAN shadow of a CPU's idle task immediately before we online that CPU in bringup_cpu(). This ensures the idle task always has a consistent state when it is running, and removes the need to so so when exiting an idle task. Whenever any thread is created, dup_task_struct() will give the task a stack which is free of KASAN shadow, and initialize the task's SCS SP, so there's no need to specially initialize either for idle thread within init_idle(), as this was only necessary to handle hotplug cycles. I've tested this on arm64 with: * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK ... offlining and onlining CPUS with: | while true; do | for C in /sys/devices/system/cpu/cpu*/online; do | echo 0 > $C; | echo 1 > $C; | done | done Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>