summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
...
* | | | kprobes: Fix possible use-after-free issue on kprobe registrationZheng Yejian2024-04-101-6/+12
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When unloading a module, its state is changing MODULE_STATE_LIVE -> MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change will take a time. `is_module_text_address()` and `__module_text_address()` works with MODULE_STATE_LIVE and MODULE_STATE_GOING. If we use `is_module_text_address()` and `__module_text_address()` separately, there is a chance that the first one is succeeded but the next one is failed because module->state becomes MODULE_STATE_UNFORMED between those operations. In `check_kprobe_address_safe()`, if the second `__module_text_address()` is failed, that is ignored because it expected a kernel_text address. But it may have failed simply because module->state has been changed to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify non-exist module text address (use-after-free). To fix this problem, we should not use separated `is_module_text_address()` and `__module_text_address()`, but use only `__module_text_address()` once and do `try_module_get(module)` which is only available with MODULE_STATE_LIVE. Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/ Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas") Cc: stable@vger.kernel.org Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
* | | Merge tag 'timers-urgent-2024-04-07' of ↵Linus Torvalds2024-04-074-16/+58
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Fix various timer bugs: - Fix a timer migration bug that may result in missed events - Fix timer migration group hierarchy event updates - Fix a PowerPC64 build warning - Fix a handful of DocBook annotation bugs" * tag 'timers-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/migration: Return early on deactivation timers/migration: Fix ignored event due to missing CPU update vdso: Use CONFIG_PAGE_SHIFT in vdso/datapage.h timers: Fix text inconsistencies and spelling tick/sched: Fix struct tick_sched doc warnings tick/sched: Fix various kernel-doc warnings timers: Fix kernel-doc format and add Return values time/timekeeping: Fix kernel-doc warnings and typos time/timecounter: Fix inline documentation
| * | timers/migration: Return early on deactivationAnna-Maria Behnsen2024-04-051-0/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation") removed the logic to return early in tmigr_update_events() on deactivation. With this the problem with a not properly updated first global event in a hierarchy containing only a single group was fixed. But when having a look at this code path with a hierarchy with more than a single level, now unnecessary work is done (example is partially copied from the message of the commit mentioned above): [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = T0:0i, T0:1 / \ [GRP0:0] [GRP0:1] migrator = 0 migrator = NONE active = 0 active = NONE nextevt = T0i, T1 nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 active idle idle idle 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are upper levels' events. CPU 1 is idle and has the timer T1 enqueued. CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2 [GRP1:0] migrator = GRP0:0 active = GRP0:0 nextevt = T0:0i, T0:1 / \ [GRP0:0] [GRP0:1] migrator = NONE migrator = NONE active = NONE active = NONE nextevt = T1 nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 idle idle idle idle 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is pushed as its next expiry and its own event kept as "ignore". Without this early return the following steps happen in tmigr_update_events() when child = null and group = GRP0:0 : lock(GRP0:0->lock); timerqueue_del(GRP0:0, T0i); unlock(GRP0:0->lock); [GRP1:0] migrator = NONE active = NONE nextevt = T0:0, T0:1 / \ [GRP0:0] [GRP0:1] migrator = NONE migrator = NONE active = NONE active = NONE nextevt = T1 nextevt = T2 / \ / \ 0 (T0i) 1 (T1) 2 (T2) 3 idle idle idle idle 2) The change now propagates up to the top. Then tmigr_update_events() updates the group event of GRP0:0 and executes the following steps (child = GRP0:0 and group = GRP0:0): lock(GRP0:0->lock); lock(GRP1:0->lock); evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events in GRP0:0 ... update GRP1:0 group event and timerqueue ... unlock(GRP1:0->lock); unlock(GRP0:0->lock); So the dance in 1) with locking the GRP0:0->lock and removing the T0i from the timerqueue is redundand as this is done nevertheless in 2) when tmigr_next_groupevt(GRP0:0) is executed. Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation") and add a condition into return path to skip the return only, when hierarchy contains a single group. Adapt comments accordingly. Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation") Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87cyr49on2.fsf@somnus
| * | timers/migration: Fix ignored event due to missing CPU updateFrederic Weisbecker2024-04-051-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a group event is updated with its expiry unchanged but a different CPU, that target change may go unnoticed and the event may be propagated up with a stale CPU value. The following depicts a scenario that has been actually observed: [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = TGRP1:0 (T0) / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T0 / \ 0 (T0) 1 (T1) idle idle 0) The hierarchy has 3 levels. The left part (GRP1:0) is all idle, including CPU 0 and CPU 1 which have a timer each: T0 and T1. They have the same expiry value. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = KTIME_MAX / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T0 / \ 0 (T0) 1 (T1) idle idle 1) The migrator in GRP1:1 handles remotely T0. The event is dequeued from the top and T0 executed. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = KTIME_MAX / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T1 / \ 0 1 (T1) idle idle 2) The migrator in GRP1:1 fetches the next timer for CPU 0 and finds none. But it updates the events from its groups, starting with GRP0:0 which now has T1 as its next event. So far so good. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = KTIME_MAX / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T1 / \ 0 1 (T1) idle idle 3) The migrator in GRP1:1 proceeds upward and updates the events in GRP1:0. The child event TGRP0:0 is found queued with the same expiry as before. And therefore it is left unchanged. However the target CPU is not the same but that fact is ignored so TGRP0:0 still points to CPU 0 when it should point to CPU 1. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = TGRP1:0 (T0) / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T1 / \ 0 1 (T1) idle idle 4) The propagation has reached the top level and TGRP1:0, having TGRP0:0 as its first event, also wrongly points to CPU 0. TGRP1:0 is added to the top level group. [GRP2:0] migrator = GRP1:1 active = GRP1:1 nextevt = KTIME_MAX / \ [GRP1:0] [GRP1:1] migrator = NONE [...] active = NONE nextevt = TGRP0:0 (T0) / \ [GRP0:0] [...] migrator = NONE active = NONE nextevt = T1 / \ 0 1 (T1) idle idle 5) The migrator in GRP1:1 dequeues the next event in top level pointing to CPU 0. But since it actually doesn't see any real event in CPU 0, it early returns. 6) T1 is left unhandled until either CPU 0 or CPU 1 wake up. Some other bad scenario may involve trees with just two levels. Fix this with unconditionally updating the CPU of the child event before considering to early return while updating a queued event with an unchanged expiry value. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Link: https://lore.kernel.org/r/Zg2Ct6M2RJAYHgCB@localhost.localdomain
| * | timers: Fix text inconsistencies and spellingRandy Dunlap2024-04-011-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix some text for consistency: s/lvl/level/ in a comment and use correct/full function names in comments. Correct spelling errors as reported by codespell. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240331172652.14086-7-rdunlap@infradead.org
| * | tick/sched: Fix struct tick_sched doc warningsRandy Dunlap2024-04-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix kernel-doc warnings in struct tick_sched: tick-sched.h:103: warning: Function parameter or struct member 'idle_sleeptime_seq' not described in 'tick_sched' tick-sched.h:104: warning: Excess struct member 'nohz_mode' description in 'tick_sched' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240331172652.14086-6-rdunlap@infradead.org
| * | tick/sched: Fix various kernel-doc warningsRandy Dunlap2024-04-011-3/+15
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a slew of kernel-doc warnings in tick-sched.c: tick-sched.c:650: warning: Function parameter or struct member 'now' not described in 'tick_nohz_update_jiffies' tick-sched.c:741: warning: No description found for return value of 'get_cpu_idle_time_us' tick-sched.c:767: warning: No description found for return value of 'get_cpu_iowait_time_us' tick-sched.c:1210: warning: No description found for return value of 'tick_nohz_idle_got_tick' tick-sched.c:1228: warning: No description found for return value of 'tick_nohz_get_next_hrtimer' tick-sched.c:1243: warning: No description found for return value of 'tick_nohz_get_sleep_length' tick-sched.c:1282: warning: Function parameter or struct member 'cpu' not described in 'tick_nohz_get_idle_calls_cpu' tick-sched.c:1282: warning: No description found for return value of 'tick_nohz_get_idle_calls_cpu' tick-sched.c:1294: warning: No description found for return value of 'tick_nohz_get_idle_calls' tick-sched.c:1577: warning: Function parameter or struct member 'hrtimer' not described in 'tick_setup_sched_timer' tick-sched.c:1577: warning: Excess function parameter 'mode' description in 'tick_setup_sched_timer' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240331172652.14086-5-rdunlap@infradead.org
* | Merge tag 'net-6.9-rc3' of ↵Linus Torvalds2024-04-043-8/+40
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, bluetooth and bpf. Fairly usual collection of driver and core fixes. The large selftest accompanying one of the fixes is also becoming a common occurrence. Current release - regressions: - ipv6: fix infinite recursion in fib6_dump_done() - net/rds: fix possible null-deref in newly added error path Current release - new code bugs: - net: do not consume a full cacheline for system_page_pool - bpf: fix bpf_arena-related file descriptor leaks in the verifier - drv: ice: fix freeing uninitialized pointers, fixing misuse of the newfangled __free() auto-cleanup Previous releases - regressions: - x86/bpf: fixes the BPF JIT with retbleed=stuff - xen-netfront: add missing skb_mark_for_recycle, fix page pool accounting leaks, revealed by recently added explicit warning - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6 non-wildcard addresses - Bluetooth: - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with better workarounds to un-break some buggy Qualcomm devices - set conn encrypted before conn establishes, fix re-connecting to some headsets which use slightly unusual sequence of msgs - mptcp: - prevent BPF accessing lowat from a subflow socket - don't account accept() of non-MPC client as fallback to TCP - drv: mana: fix Rx DMA datasize and skb_over_panic - drv: i40e: fix VF MAC filter removal Previous releases - always broken: - gro: various fixes related to UDP tunnels - netns crossing problems, incorrect checksum conversions, and incorrect packet transformations which may lead to panics - bpf: support deferring bpf_link dealloc to after RCU grace period - nf_tables: - release batch on table validation from abort path - release mutex after nft_gc_seq_end from abort path - flush pending destroy work before exit_net release - drv: r8169: skip DASH fw status checks when DASH is disabled" * tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) netfilter: validate user input for expected length net/sched: act_skbmod: prevent kernel-infoleak net: usb: ax88179_178a: avoid the interface always configured as random address net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45() net: ravb: Always update error counters net: ravb: Always process TX descriptor ring netfilter: nf_tables: discard table flag update with pending basechain deletion netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get() netfilter: nf_tables: reject new basechain after table flag update netfilter: nf_tables: flush pending destroy work before exit_net release netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path netfilter: nf_tables: release batch on table validation from abort path Revert "tg3: Remove residual error handling in tg3_suspend" tg3: Remove residual error handling in tg3_suspend net: mana: Fix Rx DMA datasize and skb_over_panic net/sched: fix lockdep splat in qdisc_tree_reduce_backlog() net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping net: stmmac: fix rx queue priority assignment net: txgbe: fix i2c dev name cannot match clkdev net: fec: Set mac_managed_pm during probe ...
| * bpf: fix possible file descriptor leaks in verifierAnton Protopopov2024-03-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The resolve_pseudo_ldimm64() function might have leaked file descriptors when BPF_MAP_TYPE_ARENA was used in a program (some error paths missed a corresponding fdput). Add missing fdputs. v2: remove unrelated changes from the fix Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240329071106.67968-1-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: support deferring bpf_link dealloc to after RCU grace periodAndrii Nakryiko2024-03-282-5/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BPF link for some program types is passed as a "context" which can be used by those BPF programs to look up additional information. E.g., for multi-kprobes and multi-uprobes, link is used to fetch BPF cookie values. Because of this runtime dependency, when bpf_link refcnt drops to zero there could still be active BPF programs running accessing link data. This patch adds generic support to defer bpf_link dealloc callback to after RCU GP, if requested. This is done by exposing two different deallocation callbacks, one synchronous and one deferred. If deferred one is provided, bpf_link_free() will schedule dealloc_deferred() callback to happen after RCU GP. BPF is using two flavors of RCU: "classic" non-sleepable one and RCU tasks trace one. The latter is used when sleepable BPF programs are used. bpf_link_free() accommodates that by checking underlying BPF program's sleepable flag, and goes either through normal RCU GP only for non-sleepable, or through RCU tasks trace GP *and* then normal RCU GP (taking into account rcu_trace_implies_rcu_gp() optimization), if BPF program is sleepable. We use this for multi-kprobe and multi-uprobe links, which dereference link during program run. We also preventively switch raw_tp link to use deferred dealloc callback, as upcoming changes in bpf-next tree expose raw_tp link data (specifically, cookie value) to BPF program at runtime as well. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Reported-by: syzbot+981935d9485a560bfbcb@syzkaller.appspotmail.com Reported-by: syzbot+2cb5a6c573e98db598cc@syzkaller.appspotmail.com Reported-by: syzbot+62d8b26793e8a2bd0516@syzkaller.appspotmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240328052426.3042617-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: put uprobe link's path and task in release callbackAndrii Nakryiko2024-03-281-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to delay putting either path or task to deallocation step. It can be done right after bpf_uprobe_unregister. Between release and dealloc, there could be still some running BPF programs, but they don't access either task or path, only data in link->uprobes, so it is safe to do. On the other hand, doing path_put() in dealloc callback makes this dealloc sleepable because path_put() itself might sleep. Which is problematic due to the need to call uprobe's dealloc through call_rcu(), which is what is done in the next bug fix patch. So solve the problem by releasing these resources early. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240328052426.3042617-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | Merge tag 'kbuild-fixes-v6.9' of ↵Linus Torvalds2024-03-311-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Deduplicate Kconfig entries for CONFIG_CXL_PMU - Fix unselectable choice entry in MIPS Kconfig, and forbid this structure - Remove unused include/asm-generic/export.h - Fix a NULL pointer dereference bug in modpost - Enable -Woverride-init warning consistently with W=1 - Drop KCSAN flags from *.mod.c files * tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kconfig: Fix typo HEIGTH to HEIGHT Documentation/llvm: Note s390 LLVM=1 support with LLVM 18.1.0 and newer kbuild: Disable KCSAN for autogenerated *.mod.c intermediaries kbuild: make -Woverride-init warnings more consistent modpost: do not make find_tosym() return NULL export.h: remove include/asm-generic/export.h kconfig: do not reparent the menu inside a choice block MIPS: move unselectable FIT_IMAGE_FDT_EPM5 out of the "System type" choice cxl: remove CONFIG_CXL_PMU entry in drivers/cxl/Kconfig
| * | kbuild: make -Woverride-init warnings more consistentArnd Bergmann2024-03-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The -Woverride-init warn about code that may be intentional or not, but the inintentional ones tend to be real bugs, so there is a bit of disagreement on whether this warning option should be enabled by default and we have multiple settings in scripts/Makefile.extrawarn as well as individual subsystems. Older versions of clang only supported -Wno-initializer-overrides with the same meaning as gcc's -Woverride-init, though all supported versions now work with both. Because of this difference, an earlier cleanup of mine accidentally turned the clang warning off for W=1 builds and only left it on for W=2, while it's still enabled for gcc with W=1. There is also one driver that only turns the warning off for newer versions of gcc but not other compilers, and some but not all the Makefiles still use a cc-disable-warning conditional that is no longer needed with supported compilers here. Address all of the above by removing the special cases for clang and always turning the warning off unconditionally where it got in the way, using the syntax that is supported by both compilers. Fixes: 2cd3271b7a31 ("kbuild: avoid duplicate warning options") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Hamza Mahfooz <hamza.mahfooz@amd.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
* | | Merge tag 'irq_urgent_for_v6.9_rc2' of ↵Linus Torvalds2024-03-311-2/+7
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Fix an unused function warning on irqchip/irq-armada-370-xp - Fix the IRQ sharing with pinctrl-amd and ACPI OSL * tag 'irq_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/armada-370-xp: Suppress unused-function warning genirq: Introduce IRQF_COND_ONESHOT and use it in pinctrl-amd
| * | genirq: Introduce IRQF_COND_ONESHOT and use it in pinctrl-amdRafael J. Wysocki2024-03-251-2/+7
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a problem when a driver requests a shared interrupt line to run a threaded handler on it without IRQF_ONESHOT set if that flag has been set already for the IRQ in question by somebody else. Namely, the request fails which usually leads to a probe failure even though the driver might have worked just fine with IRQF_ONESHOT, but it does not want to use it by default. Currently, the only way to handle this is to try to request the IRQ without IRQF_ONESHOT, but with IRQF_PROBE_SHARED set and if this fails, try again with IRQF_ONESHOT set. However, this is a bit cumbersome and not very clean. When commit 7a36b901a6eb ("ACPI: OSL: Use a threaded interrupt handler for SCI") switched the ACPI subsystem over to using a threaded interrupt handler for the SCI, it had to use IRQF_ONESHOT for it because that's required due to the way the SCI handler works (it needs to walk all of the enabled GPEs before the interrupt line can be unmasked). The SCI interrupt line is not shared with other users very often due to the SCI handling overhead, but on sone systems it is shared and when the other user of it attempts to install a threaded handler, a flags mismatch related to IRQF_ONESHOT may occur. As it turned out, that happened to the pinctrl-amd driver and so commit 4451e8e8415e ("pinctrl: amd: Add IRQF_ONESHOT to the interrupt request") attempted to address the issue by adding IRQF_ONESHOT to the interrupt flags in that driver, but this is now causing an IRQF_ONESHOT-related mismatch to occur on another system which cannot boot as a result of it. Clearly, pinctrl-amd can work with IRQF_ONESHOT if need be, but it should not set that flag by default, so it needs a way to indicate that to the interrupt subsystem. To that end, introdcuce a new interrupt flag, IRQF_COND_ONESHOT, which will only have effect when the IRQ line is shared and IRQF_ONESHOT has been set for it already, in which case it will be promoted to the latter. This is sufficient for drivers sharing the interrupt line with the SCI as it is requested by the ACPI subsystem before any drivers are probed, so they will always see IRQF_ONESHOT set for the interrupt in question. Fixes: 4451e8e8415e ("pinctrl: amd: Add IRQF_ONESHOT to the interrupt request") Reported-by: Francisco Ayala Le Brun <francisco@videowindow.eu> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Cc: 6.8+ <stable@vger.kernel.org> # 6.8+ Closes: https://lore.kernel.org/lkml/CAN-StX1HqWqi+YW=t+V52-38Mfp5fAz7YHx4aH-CQjgyNiKx3g@mail.gmail.com/ Link: https://lore.kernel.org/r/12417336.O9o76ZdvQC@kreacher
* | Merge tag 'net-6.9-rc2' of ↵Linus Torvalds2024-03-284-11/+56
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf, WiFi and netfilter. Current release - regressions: - ipv6: fix address dump when IPv6 is disabled on an interface Current release - new code bugs: - bpf: temporarily disable atomic operations in BPF arena - nexthop: fix uninitialized variable in nla_put_nh_group_stats() Previous releases - regressions: - bpf: protect against int overflow for stack access size - hsr: fix the promiscuous mode in offload mode - wifi: don't always use FW dump trig - tls: adjust recv return with async crypto and failed copy to userspace - tcp: properly terminate timers for kernel sockets - ice: fix memory corruption bug with suspend and rebuild - at803x: fix kernel panic with at8031_probe - qeth: handle deferred cc1 Previous releases - always broken: - bpf: fix bug in BPF_LDX_MEMSX - netfilter: reject table flag and netdev basechain updates - inet_defrag: prevent sk release while still in use - wifi: pick the version of SESSION_PROTECTION_NOTIF - wwan: t7xx: split 64bit accesses to fix alignment issues - mlxbf_gige: call request_irq() after NAPI initialized - hns3: fix kernel crash when devlink reload during pf initialization" * tag 'net-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) inet: inet_defrag: prevent sk release while still in use Octeontx2-af: fix pause frame configuration in GMP mode net: lan743x: Add set RFE read fifo threshold for PCI1x1x chips net: bcmasp: Remove phy_{suspend/resume} net: bcmasp: Bring up unimac after PHY link up net: phy: qcom: at803x: fix kernel panic with at8031_probe netfilter: arptables: Select NETFILTER_FAMILY_ARP when building arp_tables.c netfilter: nf_tables: skip netdev hook unregistration if table is dormant netfilter: nf_tables: reject table flag and netdev basechain updates netfilter: nf_tables: reject destroy command to remove basechain hooks bpf: update BPF LSM designated reviewer list bpf: Protect against int overflow for stack access size bpf: Check bloom filter map value size bpf: fix warning for crash_kexec selftests: netdevsim: set test timeout to 10 minutes net: wan: framer: Add missing static inline qualifiers mlxbf_gige: call request_irq() after NAPI initialized tls: get psock ref after taking rxlock to avoid leak selftests: tls: add test with a partially invalid iov tls: adjust recv return with async crypto and failed copy to userspace ...
| * | bpf: Protect against int overflow for stack access sizeAndrei Matei2024-03-271-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch re-introduces protection against the size of access to stack memory being negative; the access size can appear negative as a result of overflowing its signed int representation. This should not actually happen, as there are other protections along the way, but we should protect against it anyway. One code path was missing such protections (fixed in the previous patch in the series), causing out-of-bounds array accesses in check_stack_range_initialized(). This patch causes the verification of a program with such a non-sensical access size to fail. This check used to exist in a more indirect way, but was inadvertendly removed in a833a17aeac7. Fixes: a833a17aeac7 ("bpf: Fix verification of indirect var-off stack access") Reported-by: syzbot+33f4297b5f927648741a@syzkaller.appspotmail.com Reported-by: syzbot+aafd0513053a1cbf52ef@syzkaller.appspotmail.com Closes: https://lore.kernel.org/bpf/CAADnVQLORV5PT0iTAhRER+iLBTkByCYNBYyvBSgjN1T31K+gOw@mail.gmail.com/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Link: https://lore.kernel.org/r/20240327024245.318299-3-andreimatei1@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | bpf: Check bloom filter map value sizeAndrei Matei2024-03-271-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a missing check to bloom filter creating, rejecting values above KMALLOC_MAX_SIZE. This brings the bloom map in line with many other map types. The lack of this protection can cause kernel crashes for value sizes that overflow int's. Such a crash was caught by syzkaller. The next patch adds more guard-rails at a lower level. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240327024245.318299-2-andreimatei1@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | bpf: fix warning for crash_kexecHari Bathini2024-03-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With [1], crash dump specific code is moved out of CONFIG_KEXEC_CORE and placed under CONFIG_CRASH_DUMP, where it is more appropriate. And since CONFIG_KEXEC & !CONFIG_CRASH_DUMP build option is supported with that, it led to the below warning: "WARN: resolve_btfids: unresolved symbol crash_kexec" Fix it by using the appropriate #ifdef. [1] https://lore.kernel.org/all/20240124051254.67105-1-bhe@redhat.com/ Acked-by: Baoquan He <bhe@redhat.com> Fixes: 02aff8480533 ("crash: split crash dumping code out from kexec_core.c") Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Link: https://lore.kernel.org/r/20240319080152.36987-1-hbathini@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | Merge tag 'for-netdev' of ↵Paolo Abeni2024-03-262-10/+37
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-03-25 The following pull-request contains BPF updates for your *net* tree. We've added 17 non-merge commits during the last 12 day(s) which contain a total of 19 files changed, 184 insertions(+), 61 deletions(-). The main changes are: 1) Fix an arm64 BPF JIT bug in BPF_LDX_MEMSX implementation's offset handling found via test_bpf module, from Puranjay Mohan. 2) Various fixups to the BPF arena code in particular in the BPF verifier and around BPF selftests to match latest corresponding LLVM implementation, from Puranjay Mohan and Alexei Starovoitov. 3) Fix xsk to not assume that metadata is always requested in TX completion, from Stanislav Fomichev. 4) Fix riscv BPF JIT's kfunc parameter incompatibility between BPF and the riscv ABI which requires sign-extension on int/uint, from Pu Lehui. 5) Fix s390x BPF JIT's bpf_plt pointer arithmetic which triggered a crash when testing struct_ops, from Ilya Leoshkevich. 6) Fix libbpf's arena mmap handling which had incorrect u64-to-pointer cast on 32-bit architectures, from Andrii Nakryiko. 7) Fix libbpf to define MFD_CLOEXEC when not available, from Arnaldo Carvalho de Melo. 8) Fix arm64 BPF JIT implementation for 32bit unconditional bswap which resulted in an incorrect swap as indicated by test_bpf, from Artem Savkov. 9) Fix BPF man page build script to use silent mode, from Hangbin Liu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: riscv, bpf: Fix kfunc parameters incompatibility between bpf and riscv abi bpf: verifier: reject addr_space_cast insn without arena selftests/bpf: verifier_arena: fix mmap address for arm64 bpf: verifier: fix addr_space_cast from as(1) to as(0) libbpf: Define MFD_CLOEXEC if not available arm64: bpf: fix 32bit unconditional bswap bpf, arm64: fix bug in BPF_LDX_MEMSX libbpf: fix u64-to-pointer cast on 32-bit arches s390/bpf: Fix bpf_plt pointer arithmetic xsk: Don't assume metadata is always requested in TX completion selftests/bpf: Add arena test case for 4Gbyte corner case selftests/bpf: Remove hard coded PAGE_SIZE macro. libbpf, selftests/bpf: Adjust libbpf, bpftool, selftests to match LLVM bpf: Clarify bpf_arena comments. MAINTAINERS: Update email address for Quentin Monnet scripts/bpf_doc: Use silent mode when exec make cmd bpf: Temporarily disable atomic operations in BPF arena ==================== Link: https://lore.kernel.org/r/20240325213520.26688-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| | * | bpf: verifier: reject addr_space_cast insn without arenaPuranjay Mohan2024-03-221-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The verifier allows using the addr_space_cast instruction in a program that doesn't have an associated arena. This was caught in the form an invalid memory access in do_misc_fixups() when while converting addr_space_cast to a normal 32-bit mov, env->prog->aux->arena was dereferenced to check for BPF_F_NO_USER_CONV flag. Reject programs that include the addr_space_cast instruction but don't have an associated arena. root@rv-tester:~# ./reproducer Unable to handle kernel access to user memory without uaccess routines at virtual address 0000000000000030 Oops [#1] [<ffffffff8017eeaa>] do_misc_fixups+0x43c/0x1168 [<ffffffff801936d6>] bpf_check+0xda8/0x22b6 [<ffffffff80174b32>] bpf_prog_load+0x486/0x8dc [<ffffffff80176566>] __sys_bpf+0xbd8/0x214e [<ffffffff80177d14>] __riscv_sys_bpf+0x22/0x2a [<ffffffff80d2493a>] do_trap_ecall_u+0x102/0x17c [<ffffffff80d3048c>] ret_from_exception+0x0/0x64 Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.") Reported-by: xingwei lee <xrivendell7@gmail.com> Reported-by: yue sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/bpf/CABOYnLz09O1+2gGVJuCxd_24a-7UueXzV-Ff+Fr+h5EKFDiYCQ@mail.gmail.com/ Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240322153518.11555-1-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | * | bpf: verifier: fix addr_space_cast from as(1) to as(0)Puranjay Mohan2024-03-221-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The verifier currently converts addr_space_cast from as(1) to as(0) that is: BPF_ALU64 | BPF_MOV | BPF_X with off=1 and imm=1 to BPF_ALU | BPF_MOV | BPF_X with imm=1 (32-bit mov) Because of this imm=1, the JITs that have bpf_jit_needs_zext() == true, interpret the converted instruction as BPF_ZEXT_REG(DST) which is a special form of mov32, used for doing explicit zero extension on dst. These JITs will just zero extend the dst reg and will not move the src to dst before the zext. Fix do_misc_fixups() to set imm=0 when converting addr_space_cast to a normal mov32. The JITs that have bpf_jit_needs_zext() == true rely on the verifier to emit zext instructions. Mark dst_reg as subreg when doing cast from as(1) to as(0) so the verifier emits a zext instruction after the mov. Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.") Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240321153939.113996-1-puranjay12@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | * | bpf: Clarify bpf_arena comments.Alexei Starovoitov2024-03-151-7/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clarify two bpf_arena comments, use existing SZ_4G #define, improve page_cnt check. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20240315021834.62988-2-alexei.starovoitov@gmail.com
| | * | bpf: Temporarily disable atomic operations in BPF arenaPuranjay Mohan2024-03-141-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the x86 JIT handling PROBE_MEM32 tagged accesses is not equipped to handle atomic accesses into PTR_TO_ARENA, as no PROBE_MEM32 tagging is performed and no handling is enabled for them. This will lead to unsafety as the offset into arena will dereferenced directly without turning it into a base + offset access into the arena region. Since the changes to the x86 JIT will be fairly involved, for now, temporarily disallow use of PTR_TO_ARENA as the destination operand for atomics until support is added to the JIT backend. Fixes: 2fe99eb0ccf2 ("bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.") Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Message-ID: <20240314174931.98702-1-puranjay12@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | | | Merge tag 'mm-hotfixes-stable-2024-03-27-11-25' of ↵Linus Torvalds2024-03-272-2/+12
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Various hotfixes. About half are cc:stable and the remainder address post-6.8 issues or aren't considered suitable for backporting. zswap figures prominently in the post-6.8 issues - folloup against the large amount of changes we have just made to that code. Apart from that, all over the map" * tag 'mm-hotfixes-stable-2024-03-27-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) crash: use macro to add crashk_res into iomem early for specific arch mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices selftests/mm: fix ARM related issue with fork after pthread_create hexagon: vmlinux.lds.S: handle attributes section userfaultfd: fix deadlock warning when locking src and dst VMAs tmpfs: fix race on handling dquot rbtree selftests/mm: sigbus-wp test requires UFFD_FEATURE_WP_HUGETLBFS_SHMEM mm: zswap: fix writeback shinker GFP_NOIO/GFP_NOFS recursion ARM: prctl: reject PR_SET_MDWE on pre-ARMv6 prctl: generalize PR_SET_MDWE support check to be per-arch MAINTAINERS: remove incorrect M: tag for dm-devel@lists.linux.dev mm: zswap: fix kernel BUG in sg_init_one selftests: mm: restore settings from only parent process tools/Makefile: remove cgroup target mm: cachestat: fix two shmem bugs mm: increase folio batch size mm,page_owner: fix recursion mailmap: update entry for Leonard Crestez init: open /initrd.image with O_LARGEFILE selftests/mm: Fix build with _FORTIFY_SOURCE ...
| * | | | crash: use macro to add crashk_res into iomem early for specific archBaoquan He2024-03-261-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are regression reports[1][2] that crashkernel region on x86_64 can't be added into iomem tree sometime. This causes the later failure of kdump loading. This happened after commit 4a693ce65b18 ("kdump: defer the insertion of crashkernel resources") was merged. Even though, these reported issues are proved to be related to other component, they are just exposed after above commmit applied, I still would like to keep crashk_res and crashk_low_res being added into iomem early as before because the early adding has been always there on x86_64 and working very well. For safety of kdump, Let's change it back. Here, add a macro HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY to limit that only ARCH defining the macro can have the early adding crashk_res/_low_res into iomem. Then define HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY on x86 to enable it. Note: In reserve_crashkernel_low(), there's a remnant of crashk_low_res handling which was mistakenly added back in commit 85fcde402db1 ("kexec: split crashkernel reservation code out from crash_core.c"). [1] [PATCH V2] x86/kexec: do not update E820 kexec table for setup_data https://lore.kernel.org/all/Zfv8iCL6CT2JqLIC@darkstar.users.ipa.redhat.com/T/#u [2] Question about Address Range Validation in Crash Kernel Allocation https://lore.kernel.org/all/4eeac1f733584855965a2ea62fa4da58@huawei.com/T/#u Link: https://lkml.kernel.org/r/ZgDYemRQ2jxjLkq+@MiWiFi-R3L-srv Fixes: 4a693ce65b18 ("kdump: defer the insertion of crashkernel resources") Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Huacai Chen <chenhuacai@loongson.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Li Huafei <lihuafei1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | prctl: generalize PR_SET_MDWE support check to be per-archZev Weiss2024-03-261-2/+5
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "ARM: prctl: Reject PR_SET_MDWE where not supported". I noticed after a recent kernel update that my ARM926 system started segfaulting on any execve() after calling prctl(PR_SET_MDWE). After some investigation it appears that ARMv5 is incapable of providing the appropriate protections for MDWE, since any readable memory is also implicitly executable. The prctl_set_mdwe() function already had some special-case logic added disabling it on PARISC (commit 793838138c15, "prctl: Disable prctl(PR_SET_MDWE) on parisc"); this patch series (1) generalizes that check to use an arch_*() function, and (2) adds a corresponding override for ARM to disable MDWE on pre-ARMv6 CPUs. With the series applied, prctl(PR_SET_MDWE) is rejected on ARMv5 and subsequent execve() calls (as well as mmap(PROT_READ|PROT_WRITE)) can succeed instead of unconditionally failing; on ARMv6 the prctl works as it did previously. [0] https://lore.kernel.org/all/2023112456-linked-nape-bf19@gregkh/ This patch (of 2): There exist systems other than PARISC where MDWE may not be feasible to support; rather than cluttering up the generic code with additional arch-specific logic let's add a generic function for checking MDWE support and allow each arch to override it as needed. Link: https://lkml.kernel.org/r/20240227013546.15769-4-zev@bewilderbeest.net Link: https://lkml.kernel.org/r/20240227013546.15769-5-zev@bewilderbeest.net Signed-off-by: Zev Weiss <zev@bewilderbeest.net> Acked-by: Helge Deller <deller@gmx.de> [parisc] Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Florent Revest <revest@chromium.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Kees Cook <keescook@chromium.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sam James <sam@gentoo.org> Cc: Stefan Roesch <shr@devkernel.io> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: <stable@vger.kernel.org> [6.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | | Merge tag 'probes-fixes-v6.9-rc1' of ↵Linus Torvalds2024-03-271-1/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixlet from Masami Hiramatsu: - tracing/probes: initialize a 'val' local variable with zero. This variable is read by FETCH_OP_ST_EDATA in a loop, and is initialized by FETCH_OP_ARG in the same loop. Since this initialization is not obvious, smatch warns about it. Explicitly initializing 'val' with zero fixes this warning. * tag 'probes-fixes-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: probes: Fix to zero initialize a local variable
| * | | | tracing: probes: Fix to zero initialize a local variableMasami Hiramatsu (Google)2024-03-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix to initialize 'val' local variable with zero. Dan reported that Smatch static code checker reports an error that a local 'val' variable needs to be initialized. Actually, the 'val' is expected to be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So initialize it with zero. Link: https://lore.kernel.org/all/171092223833.237219.17304490075697026697.stgit@devnote2/ Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/ Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe and fprobe)") Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
* | | | | Fix memory leak in posix_clock_open()Linus Torvalds2024-03-271-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the clk ops.open() function returns an error, we don't release the pccontext we allocated for this clock. Re-organize the code slightly to make it all more obvious. Reported-by: Rohit Keshri <rkeshri@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Fixes: 60c6946675fc ("posix-clock: introduce posix_clock_context concept") Cc: Jakub Kicinski <kuba@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linuxfoundation.org>
* | | | | Merge tag 'printk-for-6.9-rc2' of ↵Linus Torvalds2024-03-261-0/+6
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Prevent scheduling in an atomic context when printk() takes over the console flushing duty * tag 'printk-for-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: Update @console_may_schedule in console_trylock_spinning()
| * | | | | printk: Update @console_may_schedule in console_trylock_spinning()John Ogness2024-03-151-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | console_trylock_spinning() may takeover the console lock from a schedulable context. Update @console_may_schedule to make sure it reflects a trylock acquire. Reported-by: Mukesh Ojha <quic_mojha@quicinc.com> Closes: https://lore.kernel.org/lkml/20240222090538.23017-1-quic_mojha@quicinc.com Fixes: dbdda842fe96 ("printk: Add console owner and waiter logic to load balance console writes") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/875xybmo2z.fsf@jogness.linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
* | | | | | Merge tag 'v6.9-p2' of ↵Linus Torvalds2024-03-251-0/+5
|\ \ \ \ \ \ | |_|_|/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes a regression that broke iwd as well as a divide by zero in iaa" * tag 'v6.9-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: iaa - Fix nr_cpus < nr_iaa case Revert "crypto: pkcs7 - remove sha1 support"
| * | | | | Revert "crypto: pkcs7 - remove sha1 support"Eric Biggers2024-03-221-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 16ab7cb5825fc3425c16ad2c6e53d827f382d7c6 because it broke iwd. iwd uses the KEYCTL_PKEY_* UAPIs via its dependency libell, and apparently it is relying on SHA-1 signature support. These UAPIs are fairly obscure, and their documentation does not mention which algorithms they support. iwd really should be using a properly supported userspace crypto library instead. Regardless, since something broke we have to revert the change. It may be possible that some parts of this commit can be reinstated without breaking iwd (e.g. probably the removal of MODULE_SIG_SHA1), but for now this just does a full revert to get things working again. Reported-by: Karel Balej <balejk@matfyz.cz> Closes: https://lore.kernel.org/r/CZSHRUIJ4RKL.34T4EASV5DNJM@matfyz.cz Cc: Dimitri John Ledkov <dimitri.ledkov@canonical.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Karel Balej <balejk@matfyz.cz> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | | | | | Merge tag 'dma-mapping-6.9-2024-03-24' of ↵Linus Torvalds2024-03-241-12/+33
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: "This has a set of swiotlb alignment fixes for sometimes very long standing bugs from Will. We've been discussion them for a while and they should be solid now" * tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE iommu/dma: Force swiotlb_max_mapping_size on an untrusted device swiotlb: Fix alignment checks when both allocation and DMA masks are present swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc() swiotlb: Enforce page alignment in swiotlb_alloc() swiotlb: Fix double-allocation of slots due to broken alignment handling
| * | | | | | swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZEWill Deacon2024-03-131-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For swiotlb allocations >= PAGE_SIZE, the slab search historically adjusted the stride to avoid checking unaligned slots. This had the side-effect of aligning large mapping requests to PAGE_SIZE, but that was broken by 0eee5ae10256 ("swiotlb: fix slot alignment checks"). Since this alignment could be relied upon drivers, reinstate PAGE_SIZE alignment for swiotlb mappings >= PAGE_SIZE. Reported-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | | swiotlb: Fix alignment checks when both allocation and DMA masks are presentWill Deacon2024-03-131-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nicolin reports that swiotlb buffer allocations fail for an NVME device behind an IOMMU using 64KiB pages. This is because we end up with a minimum allocation alignment of 64KiB (for the IOMMU to map the buffer safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME page (i.e. preserving the 4KiB page offset from the original allocation). If the original address is not 4KiB-aligned, the allocation will fail because swiotlb_search_pool_area() erroneously compares these unmasked bits with the 64KiB-aligned candidate allocation. Tweak swiotlb_search_pool_area() so that the DMA alignment mask is reduced based on the required alignment of the allocation. Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers") Link: https://lore.kernel.org/r/cover.1707851466.git.nicolinc@nvidia.com Reported-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | | swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()Will Deacon2024-03-131-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | core-api/dma-api-howto.rst states the following properties of dma_alloc_coherent(): | The CPU virtual address and the DMA address are both guaranteed to | be aligned to the smallest PAGE_SIZE order which is greater than or | equal to the requested size. However, swiotlb_alloc() passes zero for the 'alloc_align_mask' parameter of swiotlb_find_slots() and so this property is not upheld. Instead, allocations larger than a page are aligned to PAGE_SIZE, Calculate the mask corresponding to the page order suitable for holding the allocation and pass that to swiotlb_find_slots(). Fixes: e81e99bacc9f ("swiotlb: Support aligned swiotlb buffers") Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | | swiotlb: Enforce page alignment in swiotlb_alloc()Will Deacon2024-03-131-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When allocating pages from a restricted DMA pool in swiotlb_alloc(), the buffer address is blindly converted to a 'struct page *' that is returned to the caller. In the unlikely event of an allocation bug, page-unaligned addresses are not detected and slots can silently be double-allocated. Add a simple check of the buffer alignment in swiotlb_alloc() to make debugging a little easier if something has gone wonky. Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | | | swiotlb: Fix double-allocation of slots due to broken alignment handlingWill Deacon2024-03-131-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix"), which was a fix for commit 0eee5ae10256 ("swiotlb: fix slot alignment checks"), causes a functional regression with vsock in a virtual machine using bouncing via a restricted DMA SWIOTLB pool. When virtio allocates the virtqueues for the vsock device using dma_alloc_coherent(), the SWIOTLB search can return page-unaligned allocations if 'area->index' was left unaligned by a previous allocation from the buffer: # Final address in brackets is the SWIOTLB address returned to the caller | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800) | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800) | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800) This ends badly (typically buffer corruption and/or a hang) because swiotlb_alloc() is expecting a page-aligned allocation and so blindly returns a pointer to the 'struct page' corresponding to the allocation, therefore double-allocating the first half (2KiB slot) of the 4KiB page. Fix the problem by treating the allocation alignment separately to any additional alignment requirements from the device, using the maximum of the two as the stride to search the buffer slots and taking care to ensure a minimum of page-alignment for buffers larger than a page. This also resolves swiotlb allocation failures occuring due to the inclusion of ~PAGE_MASK in 'iotlb_align_mask' for large allocations and resulting in alignment requirements exceeding swiotlb_max_mapping_size(). Fixes: bbb73a103fbb ("swiotlb: fix a braino in the alignment check fix") Fixes: 0eee5ae10256 ("swiotlb: fix slot alignment checks") Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | | | | | | Merge tag 'timers-urgent-2024-03-23' of ↵Linus Torvalds2024-03-232-3/+20
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two regression fixes for the timer and timer migration code: - Prevent endless timer requeuing which is caused by two CPUs racing out of idle. This happens when the last CPU goes idle and therefore has to ensure to expire the pending global timers and some other CPU come out of idle at the same time and the other CPU wins the race and expires the global queue. This causes the last CPU to chase ghost timers forever and reprogramming it's clockevent device endlessly. Cure this by re-evaluating the wakeup time unconditionally. - The split into local (pinned) and global timers in the timer wheel caused a regression for NOHZ full as it broke the idle tracking of global timers. On NOHZ full this prevents an self IPI being sent which in turn causes the timer to be not programmed and not being expired on time. Restore the idle tracking for the global timer base so that the self IPI condition for NOHZ full is working correctly again" * tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix removed self-IPI on global timer's enqueue in nohz_full timers/migration: Fix endless timer requeue after idle interrupts
| * | | | | | | timers: Fix removed self-IPI on global timer's enqueue in nohz_fullFrederic Weisbecker2024-03-191-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running in nohz_full mode, a task may enqueue a timer while the tick is stopped. However the only places where the timer wheel, alongside the timer migration machinery's decision, may reprogram the next event accordingly with that new timer's expiry are the idle loop or any IRQ tail. However neither the idle task nor an interrupt may run on the CPU if it resumes busy work in userspace for a long while in full dynticks mode. To solve this, the timer enqueue path raises a self-IPI that will re-evaluate the timer wheel on its IRQ tail. This asynchronous solution avoids potential locking inversion. This is supposed to happen both for local and global timers but commit: b2cf7507e186 ("timers: Always queue timers on the local CPU") broke the global timers case with removing the ->is_idle field handling for the global base. As a result, global timers enqueue may go unnoticed in nohz_full. Fix this with restoring the idle tracking of the global timer's base, allowing self-IPIs again on enqueue time. Fixes: b2cf7507e186 ("timers: Always queue timers on the local CPU") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240318230729.15497-3-frederic@kernel.org
| * | | | | | | timers/migration: Fix endless timer requeue after idle interruptsFrederic Weisbecker2024-03-191-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a CPU is an idle migrator, but another CPU wakes up before it, becomes an active migrator and handles the queue, the initial idle migrator may end up endlessly reprogramming its clockevent, chasing ghost timers forever such as in the following scenario: [GRP0:0] migrator = 0 active = 0 nextevt = T1 / \ 0 1 active idle (T1) 0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is the active migrator. [GRP0:0] migrator = NONE active = NONE nextevt = T1 / \ 0 1 idle idle (T1) wakeup = T1 1) CPU 0 is now idle and is therefore the idle migrator. It has programmed its next timer interrupt to handle T1. [GRP0:0] migrator = 1 active = 1 nextevt = KTIME_MAX / \ 0 1 idle active wakeup = T1 2) CPU 1 has woken up, it is now active and it has just handled its own timer T1. 3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote() realize it is not the migrator anymore. So it early returns without observing that T1 has been expired already and therefore without updating its ->wakeup value. 4) CPU 0 goes into tmigr_cpu_new_timer() which also early returns because it doesn't queue a timer of its own. So ->wakeup is left unchanged and the next timer is programmed to fire now. 5) goto 3) forever This results in timer interrupt storms in idle and also in nohz_full (as observed in rcutorture's TREE07 scenario). Fix this with forcing a re-evaluation of tmc->wakeup while trying remote timer handling when the CPU isn't the migrator anymmore. The check is inherently racy but in the worst case the CPU just races setting the KTIME_MAX value that a remote expiry also tries to set. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240318230729.15497-2-frederic@kernel.org
* | | | | | | | Merge tag 'core-entry-2024-03-23' of ↵Linus Torvalds2024-03-231-1/+7
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core entry fix from Thomas Gleixner: "A single fix for the generic entry code: The trace_sys_enter() tracepoint can modify the syscall number via kprobes or BPF in pt_regs, but that requires that the syscall number is re-evaluted from pt_regs after the tracepoint. A seccomp fix in that area removed the re-evaluation so the change does not take effect as the code just uses the locally cached number. Restore the original behaviour by re-evaluating the syscall number after the tracepoint" * tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: entry: Respect changes to system call number by trace_sys_enter()
| * | | | | | | | entry: Respect changes to system call number by trace_sys_enter()André Rösti2024-03-121-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a probe is registered at the trace_sys_enter() tracepoint, and that probe changes the system call number, the old system call still gets executed. This worked correctly until commit b6ec41346103 ("core/entry: Report syscall correctly for trace and audit"), which removed the re-evaluation of the syscall number after the trace point. Restore the original semantics by re-evaluating the system call number after trace_sys_enter(). The performance impact of this re-evaluation is minimal because it only takes place when a trace point is active, and compared to the actual trace point overhead the read from a cache hot variable is negligible. Fixes: b6ec41346103 ("core/entry: Report syscall correctly for trace and audit") Signed-off-by: André Rösti <an.roesti@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240311211704.7262-1-an.roesti@gmail.com
* | | | | | | | | Merge tag 'riscv-for-linus-6.9-mw2' of ↵Linus Torvalds2024-03-222-7/+22
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for various vector-accelerated crypto routines - Hibernation is now enabled for portable kernel builds - mmap_rnd_bits_max is larger on systems with larger VAs - Support for fast GUP - Support for membarrier-based instruction cache synchronization - Support for the Andes hart-level interrupt controller and PMU - Some cleanups around unaligned access speed probing and Kconfig settings - Support for ACPI LPI and CPPC - Various cleanus related to barriers - A handful of fixes * tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (66 commits) riscv: Fix syscall wrapper for >word-size arguments crypto: riscv - add vector crypto accelerated AES-CBC-CTS crypto: riscv - parallelize AES-CBC decryption riscv: Only flush the mm icache when setting an exec pte riscv: Use kcalloc() instead of kzalloc() riscv/barrier: Add missing space after ',' riscv/barrier: Consolidate fence definitions riscv/barrier: Define RISCV_FULL_BARRIER riscv/barrier: Define __{mb,rmb,wmb} RISC-V: defconfig: Enable CONFIG_ACPI_CPPC_CPUFREQ cpufreq: Move CPPC configs to common Kconfig and add RISC-V ACPI: RISC-V: Add CPPC driver ACPI: Enable ACPI_PROCESSOR for RISC-V ACPI: RISC-V: Add LPI driver cpuidle: RISC-V: Move few functions to arch/riscv riscv: Introduce set_compat_task() in asm/compat.h riscv: Introduce is_compat_thread() into compat.h riscv: add compile-time test into is_compat_task() riscv: Replace direct thread flag check with is_compat_task() riscv: Improve arch_get_mmap_end() macro ...
| * | | | | | | | | membarrier: riscv: Provide core serializing commandAndrea Parri2024-02-152-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RISC-V uses xRET instructions on return from interrupt and to go back to user-space; the xRET instruction is not core serializing. Use FENCE.I for providing core serialization as follows: - by calling sync_core_before_usermode() on return from interrupt (cf. ipi_sync_core()), - via switch_mm() and sync_core_before_usermode() (respectively, for uthread->uthread and kthread->uthread transitions) before returning to user-space. On RISC-V, the serialization in switch_mm() is activated by resetting the icache_stale_mask of the mm at prepare_sync_core_cmd(). Suggested-by: Palmer Dabbelt <palmer@dabbelt.com> Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20240131144936.29190-5-parri.andrea@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
| * | | | | | | | | locking: Introduce prepare_sync_core_cmd()Andrea Parri2024-02-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce an architecture function that architectures can use to set up ("prepare") SYNC_CORE commands. The function will be used by RISC-V to update its "deferred icache- flush" data structures (icache_stale_mask). Architectures defining prepare_sync_core_cmd() static inline need to select ARCH_HAS_PREPARE_SYNC_CORE_CMD. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20240131144936.29190-4-parri.andrea@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
| * | | | | | | | | membarrier: Create Documentation/scheduler/membarrier.rstAndrea Parri2024-02-152-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To gather the architecture requirements of the "private/global expedited" membarrier commands. The file will be expanded to integrate further information about the membarrier syscall (as needed/desired in the future). While at it, amend some related inline comments in the membarrier codebase. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20240131144936.29190-3-parri.andrea@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
| * | | | | | | | | membarrier: riscv: Add full memory barrier in switch_mm()Andrea Parri2024-02-151-2/+3
| | |_|_|/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The membarrier system call requires a full memory barrier after storing to rq->curr, before going back to user-space. The barrier is only needed when switching between processes: the barrier is implied by mmdrop() when switching from kernel to userspace, and it's not needed when switching from userspace to kernel. Rely on the feature/mechanism ARCH_HAS_MEMBARRIER_CALLBACKS and on the primitive membarrier_arch_switch_mm(), already adopted by the PowerPC architecture, to insert the required barrier. Fixes: fab957c11efe2f ("RISC-V: Atomic and Locking Code") Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20240131144936.29190-2-parri.andrea@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>