path: root/include
Commit message | Author | Age | Files | Lines
...
* mm: zswap: rename is_zswap_enabled() to zswap_is_enabled() (Yosry Ahmed, 2024-09-12, 1 file, -2/+2)

  [ Upstream commit 2b33a97c94bc44468fc1d54b745269c0cf0b7bb2 ]

  In preparation for introducing a similar function, rename is_zswap_enabled() to use zswap_* prefix like other zswap functions.

  Link: https://lkml.kernel.org/r/20240611024516.1375191-1-yosryahmed@google.com
  Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
  Reviewed-by: Barry Song <baohua@kernel.org>
  Reviewed-by: Nhat Pham <nphamcs@gmail.com>
  Cc: Chengming Zhou <chengming.zhou@linux.dev>
  Cc: Chris Li <chrisl@kernel.org>
  Cc: David Hildenbrand <david@redhat.com>
  Cc: Johannes Weiner <hannes@cmpxchg.org>
  Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Stable-dep-of: e39925734909 ("mm/memcontrol: respect zswap.writeback setting from parent cg too")
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf, net: Fix a potential race in do_sock_getsockopt() (Tze-nan Wu, 2024-09-12, 1 file, -9/+0)

  [ Upstream commit 33f339a1ba54e56bba57ee9a77c71e385ab4825c ]

  There's a potential race when `cgroup_bpf_enabled(CGROUP_GETSOCKOPT)` is false during the execution of `BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN`, but becomes true when `BPF_CGROUP_RUN_PROG_GETSOCKOPT` is called. This inconsistency can lead to `BPF_CGROUP_RUN_PROG_GETSOCKOPT` receiving an "-EFAULT" from `__cgroup_bpf_run_filter_getsockopt(max_optlen=0)`.

  Scenario shown as below:

      `process A`                              `process B`
      -----------                              ------------
      BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN
                                               enable CGROUP_GETSOCKOPT
      BPF_CGROUP_RUN_PROG_GETSOCKOPT (-EFAULT)

  To resolve this, remove the `BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN` macro and directly use `copy_from_sockptr` to ensure that `max_optlen` is always set before `BPF_CGROUP_RUN_PROG_GETSOCKOPT` is invoked.

  Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
  Co-developed-by: Yanghui Li <yanghui.li@mediatek.com>
  Signed-off-by: Yanghui Li <yanghui.li@mediatek.com>
  Co-developed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
  Signed-off-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
  Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
  Acked-by: Stanislav Fomichev <sdf@fomichev.me>
  Acked-by: Alexei Starovoitov <ast@kernel.org>
  Link: https://patch.msgid.link/20240830082518.23243-1-Tze-nan.Wu@mediatek.com
  Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
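  To make the fix concrete, here is a minimal, hedged sketch of the pattern (kernel context; every identifier other than copy_from_sockptr() is illustrative, not the actual do_sock_getsockopt() code):

      int max_optlen;

      /* Always fetch the user-supplied optlen up front, instead of only when
       * the cgroup-BPF getsockopt hook happened to be enabled at this point. */
      if (copy_from_sockptr(&max_optlen, optlen, sizeof(max_optlen)))
              return -EFAULT;

      /* ... perform the getsockopt itself ... */

      /* Even if the hook was enabled concurrently, max_optlen now holds a
       * valid value by the time the hook runs. */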
* Bluetooth: hci_sync: Introduce hci_cmd_sync_run/hci_cmd_sync_run_once (Luiz Augusto von Dentz, 2024-09-12, 1 file, -0/+4)
| | | | | | | | | | | | [ Upstream commit c898f6d7b093bd71e66569cd6797c87d4056f44b ] This introduces hci_cmd_sync_run/hci_cmd_sync_run_once which acts like hci_cmd_sync_queue/hci_cmd_sync_queue_once but runs immediately when already on hdev->cmd_sync_work context. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Stable-dep-of: 227a0cdf4a02 ("Bluetooth: MGMT: Fix not generating command complete for MGMT_OP_DISCONNECT") Signed-off-by: Sasha Levin <sashal@kernel.org>
* regulator: core: Stub devm_regulator_bulk_get_const() if !CONFIG_REGULATOR (Douglas Anderson, 2024-09-12, 1 file, -0/+8)
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 1a5caec7f80ca2e659c03f45378ee26915f4eda2 ] When adding devm_regulator_bulk_get_const() I missed adding a stub for when CONFIG_REGULATOR is not enabled. Under certain conditions (like randconfig testing) this can cause the compiler to reports errors like: error: implicit declaration of function 'devm_regulator_bulk_get_const'; did you mean 'devm_regulator_bulk_get_enable'? Add the stub. Fixes: 1de452a0edda ("regulator: core: Allow drivers to define their init data as const") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202408301813.TesFuSbh-lkp@intel.com/ Cc: Neil Armstrong <neil.armstrong@linaro.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://patch.msgid.link/20240830073511.1.Ib733229a8a19fad8179213c05e1af01b51e42328@changeid Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* netfs, cifs: Fix handling of short DIO read (David Howells, 2024-09-12, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 1da29f2c39b67b846b74205c81bf0ccd96d34727 ] Short DIO reads, particularly in relation to cifs, are not being handled correctly by cifs and netfslib. This can be tested by doing a DIO read of a file where the size of read is larger than the size of the file. When it crosses the EOF, it gets a short read and this gets retried, and in the case of cifs, the retry read fails, with the failure being translated to ENODATA. Fix this by the following means: (1) Add a flag, NETFS_SREQ_HIT_EOF, for the filesystem to set when it detects that the read did hit the EOF. (2) Make the netfslib read assessment stop processing subrequests when it encounters one with that flag set. (3) Return rreq->transferred, the accumulated contiguous amount read to that point, to userspace for a DIO read. (4) Make cifs set the flag and clear the error if the read RPC returned ENODATA. (5) Make cifs set the flag and clear the error if a short read occurred without error and the read-to file position is now at the remote inode size. Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* drm/panthor: Restrict high priorities on group_create (Mary Guillemard, 2024-09-12, 1 file, -1/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5f7762042f8a5377bd8a32844db353c0311a7369 upstream. We were allowing any users to create a high priority group without any permission checks. As a result, this was allowing possible denial of service. We now only allow the DRM master or users with the CAP_SYS_NICE capability to set higher priorities than PANTHOR_GROUP_PRIORITY_MEDIUM. As the sole user of that uAPI lives in Mesa and hardcode a value of MEDIUM [1], this should be safe to do. Additionally, as those checks are performed at the ioctl level, panthor_group_create now only check for priority level validity. [1]https://gitlab.freedesktop.org/mesa/mesa/-/blob/f390835074bdf162a63deb0311d1a6de527f9f89/src/gallium/drivers/panfrost/pan_csf.c#L1038 Signed-off-by: Mary Guillemard <mary.guillemard@collabora.com> Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block") Cc: stable@vger.kernel.org Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240903144955.144278-2-mary.guillemard@collabora.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: mana: Fix error handling in mana_create_txq/rxq's NAPI cleanup (Souradeep Chakrabarti, 2024-09-12, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b6ecc662037694488bfff7c9fd21c405df8411f2 upstream. Currently napi_disable() gets called during rxq and txq cleanup, even before napi is enabled and hrtimer is initialized. It causes kernel panic. ? page_fault_oops+0x136/0x2b0 ? page_counter_cancel+0x2e/0x80 ? do_user_addr_fault+0x2f2/0x640 ? refill_obj_stock+0xc4/0x110 ? exc_page_fault+0x71/0x160 ? asm_exc_page_fault+0x27/0x30 ? __mmdrop+0x10/0x180 ? __mmdrop+0xec/0x180 ? hrtimer_active+0xd/0x50 hrtimer_try_to_cancel+0x2c/0xf0 hrtimer_cancel+0x15/0x30 napi_disable+0x65/0x90 mana_destroy_rxq+0x4c/0x2f0 mana_create_rxq.isra.0+0x56c/0x6d0 ? mana_uncfg_vport+0x50/0x50 mana_alloc_queues+0x21b/0x320 ? skb_dequeue+0x5f/0x80 Cc: stable@vger.kernel.org Fixes: e1b5683ff62e ("net: mana: Move NAPI from EQ to CQ") Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "Bluetooth: MGMT/SMP: Fix address type when using SMP over BREDR/LE"Luiz Augusto von Dentz2024-09-121-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 532f8bcd1c2c4e8112f62e1922fd1703bc0ffce0 upstream. This reverts commit 59b047bc98084f8af2c41483e4d68a5adf2fa7f7 which breaks compatibility with commands like: bluetoothd[46328]: @ MGMT Command: Load.. (0x0013) plen 74 {0x0001} [hci0] Keys: 2 BR/EDR Address: C0:DC:DA:A5:E5:47 (Samsung Electronics Co.,Ltd) Key type: Authenticated key from P-256 (0x03) Central: 0x00 Encryption size: 16 Diversifier[2]: 0000 Randomizer[8]: 0000000000000000 Key[16]: 6ed96089bd9765be2f2c971b0b95f624 LE Address: D7:2A:DE:1E:73:A2 (Static) Key type: Unauthenticated key from P-256 (0x02) Central: 0x00 Encryption size: 16 Diversifier[2]: 0000 Randomizer[8]: 0000000000000000 Key[16]: 87dd2546ededda380ffcdc0a8faa4597 @ MGMT Event: Command Status (0x0002) plen 3 {0x0001} [hci0] Load Long Term Keys (0x0013) Status: Invalid Parameters (0x0d) Cc: stable@vger.kernel.org Link: https://github.com/bluez/bluez/issues/875 Fixes: 59b047bc9808 ("Bluetooth: MGMT/SMP: Fix address type when using SMP over BREDR/LE") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/kaslr: Expose and use the end of the physical memory address space (Thomas Gleixner, 2024-09-12, 1 file, -0/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ea72ce5da22806d5713f3ffb39a6d5ae73841f93 upstream. iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually require to know the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hickups add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <max8rr8@gmail.com> Reported-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-By: Max Ramanouski <max8rr8@gmail.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i2c: Use IS_REACHABLE() for substituting empty ACPI functions (Richard Fitzgerald, 2024-09-08, 1 file, -1/+1)

  [ Upstream commit 71833e79a42178d8a50b5081c98c78ace9325628 ]

  Replace IS_ENABLED() with IS_REACHABLE() to substitute empty stubs for:

    i2c_acpi_get_i2c_resource()
    i2c_acpi_client_count()
    i2c_acpi_find_bus_speed()
    i2c_acpi_new_device_by_fwnode()
    i2c_adapter *i2c_acpi_find_adapter_by_handle()
    i2c_acpi_waive_d0_probe()

  commit f17c06c6608a ("i2c: Fix conditional for substituting empty ACPI functions") partially fixed this conditional to depend on CONFIG_I2C, but used IS_ENABLED(), which is wrong since CONFIG_I2C is tristate. CONFIG_ACPI is boolean but let's also change it to use IS_REACHABLE() to future-proof it against becoming tristate.

  Somehow despite testing various combinations of CONFIG_I2C and CONFIG_ACPI we missed the combination CONFIG_I2C=m, CONFIG_ACPI=y.

  Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
  Fixes: f17c06c6608a ("i2c: Fix conditional for substituting empty ACPI functions")
  Reported-by: kernel test robot <lkp@intel.com>
  Closes: https://lore.kernel.org/oe-kbuild-all/202408141333.gYnaitcV-lkp@intel.com/
  Reviewed-by: Takashi Iwai <tiwai@suse.de>
  Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
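  As an illustration of the macro semantics involved (the function below is a made-up example, not one of the real i2c helpers): IS_ENABLED(CONFIG_I2C) is true for both =y and =m, so built-in code could end up referencing a symbol that lives in a module it cannot link against, whereas IS_REACHABLE(CONFIG_I2C) is only true when the symbol is actually linkable from the current code.

      /* Hypothetical header sketch of the stub pattern, kernel context. */
      #if IS_REACHABLE(CONFIG_I2C) && IS_ENABLED(CONFIG_ACPI)
      u32 example_acpi_find_bus_speed(struct device *dev);
      #else
      static inline u32 example_acpi_find_bus_speed(struct device *dev)
      {
              return 0;       /* empty stub when the real symbol cannot be referenced */
      }
      #endif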
* net: tcp/dccp: prepare for tw_timer un-pinning (Valentin Schneider, 2024-09-08, 1 file, -2/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit b334b924c9b709bc969644fb5c406f5c9d01dceb ] The TCP timewait timer is proving to be problematic for setups where scheduler CPU isolation is achieved at runtime via cpusets (as opposed to statically via isolcpus=domains). What happens there is a CPU goes through tcp_time_wait(), arming the time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer fires, causing interference for the now-isolated CPU. This is conceptually similar to the issue described in commit e02b93124855 ("workqueue: Unbind kworkers before sending them to exit()") Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash lock held. Expand the lock's critical section from inet_twsk_kill() to inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of the timer. IOW, this prevents the following race: tcp_time_wait() inet_twsk_hashdance() inet_twsk_deschedule_put() del_timer_sync() inet_twsk_schedule() Thanks to Paolo Abeni for suggesting to leverage the ehash lock. This also restores a comment from commit ec94c2696f0b ("tcp/dccp: avoid one atomic operation for timewait hashdance") as inet_twsk_hashdance() had a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing. inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize with inet_twsk_hashdance_schedule(). To ease possible regression search, actual un-pin is done in next patch. Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/ Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* net: remove NULL-pointer net parameter in ip_metrics_convert (Jason Xing, 2024-09-08, 2 files, -3/+2)
| | | | | | | | | | | | | | [ Upstream commit 61e2bbafb00e4b9a5de45e6448a7b6b818658576 ] When I was doing some experiments, I found that when using the first parameter, namely, struct net, in ip_metrics_convert() always triggers NULL pointer crash. Then I digged into this part, realizing that we can remove this one due to its uselessness. Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* fsnotify: clear PARENT_WATCHED flags lazily (Amir Goldstein, 2024-09-08, 1 file, -3/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 172e422ffea20a89bfdc672741c1aad6fbb5044e ] In some setups directories can have many (usually negative) dentries. Hence __fsnotify_update_child_dentry_flags() function can take a significant amount of time. Since the bulk of this function happens under inode->i_lock this causes a significant contention on the lock when we remove the watch from the directory as the __fsnotify_update_child_dentry_flags() call from fsnotify_recalc_mask() races with __fsnotify_update_child_dentry_flags() calls from __fsnotify_parent() happening on children. This can lead upto softlockup reports reported by users. Fix the problem by calling fsnotify_update_children_dentry_flags() to set PARENT_WATCHED flags only when parent starts watching children. When parent stops watching children, clear false positive PARENT_WATCHED flags lazily in __fsnotify_parent() for each accessed child. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
* pwm: xilinx: Fix u32 overflow issue in 32-bit width PWM mode. (Ken Sloat, 2024-09-08, 1 file, -1/+1)

  [ Upstream commit 56f45266df67aa0f5b2a6881c8c4d16dbfff6b7d ]

  This timer HW supports 8, 16 and 32-bit timer widths. This driver currently uses a u32 to store the max possible value of the timer. However, statements perform addition of 2 in xilinx_pwm_apply() when calculating the period_cycles and duty_cycles values. Since priv->max is a u32, this will result in an overflow to 1 which will not only be incorrect but fail on range comparison. This results in making it impossible to set the PWM in this timer mode.

  There are two obvious solutions to the current problem:

  1. Cast each instance where overflow occurs to u64.
  2. Change priv->max from a u32 to a u64.

  Solution #1 requires more code modifications, and leaves opportunity to introduce similar overflows if other math statements are added in the future. These may also go undetected if running in non 32-bit timer modes. Solution #2 is the much smaller and cleaner approach and thus the chosen method in this patch.

  This was tested on a Zynq UltraScale+ with multiple instances of the PWM IP.

  Signed-off-by: Ken Sloat <ksloat@designlinxhs.com>
  Reviewed-by: Michal Simek <michal.simek@amd.com>
  Reviewed-by: Sean Anderson <sean.anderson@seco.com>
  Link: https://lore.kernel.org/r/SJ0P222MB0107490C5371B848EF04351CA1E19@SJ0P222MB0107.NAMP222.PROD.OUTLOOK.COM
  Signed-off-by: Michal Simek <michal.simek@amd.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
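  The overflow itself is easy to reproduce in plain C (values are illustrative; in the driver, the maximum count of a 32-bit timer is what ends up in priv->max):

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              uint32_t max_u32 = 0xFFFFFFFFu; /* full count of a 32-bit timer */
              uint64_t max_u64 = 0xFFFFFFFFu; /* same value, widened as in the fix */

              /* 32-bit arithmetic wraps around: prints 1 */
              printf("u32: %u\n", max_u32 + 2);
              /* 64-bit arithmetic keeps the intended value: prints 4294967297 */
              printf("u64: %llu\n", (unsigned long long)(max_u64 + 2));
              return 0;
      }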
* soc: qcom: smem: Add qcom_smem_bust_hwspin_lock_by_host() (Chris Lew, 2024-09-08, 1 file, -0/+2)
| | | | | | | | | | | | | | | | [ Upstream commit 2e3f0d693875db698891ffe89a18121bda5b95b8 ] Add qcom_smem_bust_hwspin_lock_by_host to enable remoteproc to bust the hwspin_lock owned by smem. In the event the remoteproc crashes unexpectedly, the remoteproc driver can invoke this API to try and bust the hwspin_lock and release the lock if still held by the remoteproc device. Signed-off-by: Chris Lew <quic_clew@quicinc.com> Reviewed-by: Bjorn Andersson <quic_bjorande@quicinc.com> Link: https://lore.kernel.org/r/20240529-hwspinlock-bust-v3-3-c8b924ffa5a2@quicinc.com Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* hwspinlock: Introduce hwspin_lock_bust() (Richard Maina, 2024-09-08, 1 file, -0/+6)

  [ Upstream commit 7c327d56597d8de1680cf24e956b704270d3d84a ]

  When a remoteproc crashes or goes down unexpectedly this can result in a state where locks held by the remoteproc will remain locked possibly resulting in deadlock. This new API hwspin_lock_bust() allows hwspinlock implementers to define a bust operation for freeing previously acquired hwspinlocks after verifying ownership of the acquired lock.

  Signed-off-by: Richard Maina <quic_rmaina@quicinc.com>
  Reviewed-by: Bjorn Andersson <andersson@kernel.org>
  Signed-off-by: Chris Lew <quic_clew@quicinc.com>
  Link: https://lore.kernel.org/r/20240529-hwspinlock-bust-v3-1-c8b924ffa5a2@quicinc.com
  Signed-off-by: Bjorn Andersson <andersson@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
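  A hedged usage sketch (the prototype is assumed from the description above rather than copied from the header): a remoteproc driver that detects a crash could ask the hwspinlock core to bust any lock the dead firmware still holds.

      /* Assumed prototype: int hwspin_lock_bust(struct hwspinlock *lock, unsigned int id); */
      static void example_remote_crash_cleanup(struct hwspinlock *lock,
                                               unsigned int remote_owner_id)
      {
              /* Only succeeds after verifying the lock is owned by remote_owner_id. */
              int ret = hwspin_lock_bust(lock, remote_owner_id);

              if (ret)
                      pr_warn("could not bust hwspinlock held by dead remote: %d\n", ret);
      }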
* i2c: Fix conditional for substituting empty ACPI functions (Richard Fitzgerald, 2024-09-08, 1 file, -1/+1)
| | | | | | | | | | | | | | | [ Upstream commit f17c06c6608ad4ecd2ccf321753fb511812d821b ] Add IS_ENABLED(CONFIG_I2C) to the conditional around a bunch of ACPI functions. The conditional around these functions depended only on CONFIG_ACPI. But the functions are implemented in I2C core, so are only present if CONFIG_I2C is enabled. Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ALSA: ump: Transmit RPN/NRPN message at each MSB/LSB data reception (Takashi Iwai, 2024-09-08, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | [ Upstream commit e6ce8a28c768dbbad3f818db286cd0f4c7a921a8 ] The UMP 1.1 spec says that an RPN/NRPN should be sent when one of the following occurs: * a CC 38 is received * a subsequent CC 6 is received * a CC 98, 99, 100, and 101 is received, indicating the last RPN/NRPN message has ended and a new one has started That said, we should send a partial data even if it's not fully filled. Let's change the UMP conversion helper code to follow that rule. Link: https://patch.msgid.link/20240731130528.12600-2-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* scsi: ufs: core: Check LSDBS cap when !mcq (Kyoungrul Kim, 2024-09-08, 2 files, -0/+2)
| | | | | | | | | | | | | | | | [ Upstream commit 0c60eb0cc320fffbb8b10329d276af14f6f5e6bf ] If the user sets use_mcq_mode to 0, the host will try to activate the LSDB mode unconditionally even when the LSDBS of device HCI cap is 1. This makes commands time out and causes device probing to fail. To prevent that problem, check the LSDBS cap when MCQ is not supported. Signed-off-by: Kyoungrul Kim <k831.kim@samsung.com> Link: https://lore.kernel.org/r/20240709232520epcms2p8ebdb5c4fccc30a6221390566589bf122@epcms2p8 Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* soc: qcom: pmic_glink: Fix race during initialization (Bjorn Andersson, 2024-09-04, 1 file, -5/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3568affcddd68743e25aa3ec1647d9b82797757b upstream. As pointed out by Stephen Boyd it is possible that during initialization of the pmic_glink child drivers, the protection-domain notifiers fires, and the associated work is scheduled, before the client registration returns and as a result the local "client" pointer has been initialized. The outcome of this is a NULL pointer dereference as the "client" pointer is blindly dereferenced. Timeline provided by Stephen: CPU0 CPU1 ---- ---- ucsi->client = NULL; devm_pmic_glink_register_client() client->pdr_notify(client->priv, pg->client_state) pmic_glink_ucsi_pdr_notify() schedule_work(&ucsi->register_work) <schedule away> pmic_glink_ucsi_register() ucsi_register() pmic_glink_ucsi_read_version() pmic_glink_ucsi_read() pmic_glink_ucsi_read() pmic_glink_send(ucsi->client) <client is NULL BAD> ucsi->client = client // Too late! This code is identical across the altmode, battery manager and usci child drivers. Resolve this by splitting the allocation of the "client" object and the registration thereof into two operations. This only happens if the protection domain registry is populated at the time of registration, which by the introduction of commit '1ebcde047c54 ("soc: qcom: add pd-mapper implementation")' became much more likely. Reported-by: Amit Pundir <amit.pundir@linaro.org> Closes: https://lore.kernel.org/all/CAMi1Hd2_a7TjA7J9ShrAbNOd_CoZ3D87twmO5t+nZxC9sX18tA@mail.gmail.com/ Reported-by: Johan Hovold <johan@kernel.org> Closes: https://lore.kernel.org/all/ZqiyLvP0gkBnuekL@hovoldconsulting.com/ Reported-by: Stephen Boyd <swboyd@chromium.org> Closes: https://lore.kernel.org/all/CAE-0n52JgfCBWiFQyQWPji8cq_rCsviBpW-m72YitgNfdaEhQg@mail.gmail.com/ Fixes: 58ef4ece1e41 ("soc: qcom: pmic_glink: Introduce base PMIC GLINK driver") Cc: stable@vger.kernel.org Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org> Tested-by: Amit Pundir <amit.pundir@linaro.org> Reviewed-by: Johan Hovold <johan+linaro@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Tested-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> Link: https://lore.kernel.org/r/20240820-pmic-glink-v6-11-races-v3-1-eec53c750a04@quicinc.com Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: busy-poll: use ktime_get_ns() instead of local_clock() (Eric Dumazet, 2024-09-04, 1 file, -1/+1)

  [ Upstream commit 0870b0d8b393dde53106678a1e2cec9dfa52f9b7 ]

  Typically, busy-polling durations are below 100 usec. When/if the busy-poller thread migrates to another cpu, local_clock() can be off by +/-2msec or more for small values of HZ, depending on the platform.

  Use ktime_get_ns() to ensure deterministic behavior, which is the whole point of busy-polling.

  Fixes: 060212928670 ("net: add low latency socket poll")
  Fixes: 9a3c71aa8024 ("net: convert low latency sockets to sched_clock()")
  Fixes: 37089834528b ("sched, net: Fixup busy_loop_us_clock()")
  Signed-off-by: Eric Dumazet <edumazet@google.com>
  Cc: Mina Almasry <almasrymina@google.com>
  Cc: Willem de Bruijn <willemb@google.com>
  Reviewed-by: Joe Damato <jdamato@fastly.com>
  Link: https://patch.msgid.link/20240827114916.223377-1-edumazet@google.com
  Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
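  A minimal sketch of why the clock choice matters (the polling loop and its helpers are hypothetical): ktime_get_ns() stays monotonic and consistent across CPU migration, so an elapsed-time budget check keeps its meaning even if the busy-polling thread is moved to another CPU mid-loop.

      /* Hypothetical busy-poll budget check, kernel context. */
      u64 start_ns = ktime_get_ns();

      while (!example_work_available()) {             /* placeholder condition */
              if (ktime_get_ns() - start_ns > budget_ns)
                      break;                          /* polling budget exhausted */
              cpu_relax();
      }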
* bonding: change ipsec_lock from spin lock to mutex (Jianbo Liu, 2024-09-04, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2aeeef906d5a526dc60cf4af92eda69836c39b1f ] In the cited commit, bond->ipsec_lock is added to protect ipsec_list, hence xdo_dev_state_add and xdo_dev_state_delete are called inside this lock. As ipsec_lock is a spin lock and such xfrmdev ops may sleep, "scheduling while atomic" will be triggered when changing bond's active slave. [ 101.055189] BUG: scheduling while atomic: bash/902/0x00000200 [ 101.055726] Modules linked in: [ 101.058211] CPU: 3 PID: 902 Comm: bash Not tainted 6.9.0-rc4+ #1 [ 101.058760] Hardware name: [ 101.059434] Call Trace: [ 101.059436] <TASK> [ 101.060873] dump_stack_lvl+0x51/0x60 [ 101.061275] __schedule_bug+0x4e/0x60 [ 101.061682] __schedule+0x612/0x7c0 [ 101.062078] ? __mod_timer+0x25c/0x370 [ 101.062486] schedule+0x25/0xd0 [ 101.062845] schedule_timeout+0x77/0xf0 [ 101.063265] ? asm_common_interrupt+0x22/0x40 [ 101.063724] ? __bpf_trace_itimer_state+0x10/0x10 [ 101.064215] __wait_for_common+0x87/0x190 [ 101.064648] ? usleep_range_state+0x90/0x90 [ 101.065091] cmd_exec+0x437/0xb20 [mlx5_core] [ 101.065569] mlx5_cmd_do+0x1e/0x40 [mlx5_core] [ 101.066051] mlx5_cmd_exec+0x18/0x30 [mlx5_core] [ 101.066552] mlx5_crypto_create_dek_key+0xea/0x120 [mlx5_core] [ 101.067163] ? bonding_sysfs_store_option+0x4d/0x80 [bonding] [ 101.067738] ? kmalloc_trace+0x4d/0x350 [ 101.068156] mlx5_ipsec_create_sa_ctx+0x33/0x100 [mlx5_core] [ 101.068747] mlx5e_xfrm_add_state+0x47b/0xaa0 [mlx5_core] [ 101.069312] bond_change_active_slave+0x392/0x900 [bonding] [ 101.069868] bond_option_active_slave_set+0x1c2/0x240 [bonding] [ 101.070454] __bond_opt_set+0xa6/0x430 [bonding] [ 101.070935] __bond_opt_set_notify+0x2f/0x90 [bonding] [ 101.071453] bond_opt_tryset_rtnl+0x72/0xb0 [bonding] [ 101.071965] bonding_sysfs_store_option+0x4d/0x80 [bonding] [ 101.072567] kernfs_fop_write_iter+0x10c/0x1a0 [ 101.073033] vfs_write+0x2d8/0x400 [ 101.073416] ? alloc_fd+0x48/0x180 [ 101.073798] ksys_write+0x5f/0xe0 [ 101.074175] do_syscall_64+0x52/0x110 [ 101.074576] entry_SYSCALL_64_after_hwframe+0x4b/0x53 As bond_ipsec_add_sa_all and bond_ipsec_del_sa_all are only called from bond_change_active_slave, which requires holding the RTNL lock. And bond_ipsec_add_sa and bond_ipsec_del_sa are xfrm state xdo_dev_state_add and xdo_dev_state_delete APIs, which are in user context. So ipsec_lock doesn't have to be spin lock, change it to mutex, and thus the above issue can be resolved. Fixes: 9a5605505d9c ("bonding: Add struct bond_ipesc to manage SA") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jay Vosburgh <jv@jvosburgh.net> Link: https://patch.msgid.link/20240823031056.110999-4-jianbol@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
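  The rule behind the change can be sketched as a before/after fragment (placeholder names, not the actual bonding code): an operation that may sleep, such as a firmware command that waits for completion, must not run inside a spinlock critical section, but is allowed under a mutex.

      /* Before: the xfrm add callback may sleep -> "scheduling while atomic" */
      spin_lock(&bond->ipsec_lock);
      example_xfrm_state_add(xs);             /* placeholder for the sleeping call */
      spin_unlock(&bond->ipsec_lock);

      /* After: ipsec_lock is a mutex and all callers run in sleepable context,
       * so sleeping inside the critical section is fine. */
      mutex_lock(&bond->ipsec_lock);
      example_xfrm_state_add(xs);
      mutex_unlock(&bond->ipsec_lock);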
* netfilter: nf_tables_ipv6: consider network offset in netdev/egress validation (Pablo Neira Ayuso, 2024-09-04, 1 file, -2/+3)
| | | | | | | | | | | [ Upstream commit 70c261d500951cf3ea0fcf32651aab9a65a91471 ] From netdev/egress, skb->len can include the ethernet header, therefore, subtract network offset from skb->len when validating IPv6 packet length. Fixes: 42df6e1d221d ("netfilter: Introduce egress hook") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* netfilter: nf_tables: restore IP sanity checks for netdev/egress (Pablo Neira Ayuso, 2024-09-04, 1 file, -4/+6)
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5fd0628918977a0afdc2e6bc562d8751b5d3b8c5 ] Subtract network offset to skb->len before performing IPv4 header sanity checks, then adjust transport offset from offset from mac header. Jorge Ortiz says: When small UDP packets (< 4 bytes payload) are sent from eth0, `meta l4proto udp` condition is not met because `NFT_PKTINFO_L4PROTO` is not set. This happens because there is a comparison that checks if the transport header offset exceeds the total length. This comparison does not take into account the fact that the skb network offset might be non-zero in egress mode (e.g., 14 bytes for Ethernet header). Fixes: 0ae8e4cca787 ("netfilter: nf_tables: set transport offset from mac header for netdev/egress") Reported-by: Jorge Ortiz <jorge.ortiz.escribano@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* fs/nfsd: fix update of inode attrs in CB_GETATTR (Jeff Layton, 2024-09-04, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7e8ae8486e4471513e2111aba6ac29f2357bed2a ] Currently, we copy the mtime and ctime to the in-core inode and then mark the inode dirty. This is fine for certain types of filesystems, but not all. Some require a real setattr to properly change these values (e.g. ceph or reexported NFS). Fix this code to call notify_change() instead, which is the proper way to effect a setattr. There is one problem though: In this case, the client is holding a write delegation and has sent us attributes to update our cache. We don't want to break the delegation for this since that would defeat the purpose. Add a new ATTR_DELEG flag that makes notify_change bypass the try_break_deleg call. Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* video/aperture: optionally match the device in sysfb_disable() (Alex Deucher, 2024-09-04, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b49420d6a1aeb399e5b107fc6eb8584d0860fbd7 upstream. In aperture_remove_conflicting_pci_devices(), we currently only call sysfb_disable() on vga class devices. This leads to the following problem when the pimary device is not VGA compatible: 1. A PCI device with a non-VGA class is the boot display 2. That device is probed first and it is not a VGA device so sysfb_disable() is not called, but the device resources are freed by aperture_detach_platform_device() 3. Non-primary GPU has a VGA class and it ends up calling sysfb_disable() 4. NULL pointer dereference via sysfb_disable() since the resources have already been freed by aperture_detach_platform_device() when it was called by the other device. Fix this by passing a device pointer to sysfb_disable() and checking the device to determine if we should execute it or not. v2: Fix build when CONFIG_SCREEN_INFO is not set v3: Move device check into the mutex Drop primary variable in aperture_remove_conflicting_pci_devices() Drop __init on pci sysfb_pci_dev_is_enabled() Fixes: 5ae3716cfdcd ("video/aperture: Only remove sysfb on the default vga pci device") Cc: Javier Martinez Canillas <javierm@redhat.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Helge Deller <deller@gmx.de> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240821191135.829765-1-alexander.deucher@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ACPI: video: Add Dell UART backlight controller detection (Hans de Goede, 2024-08-29, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cd8e468efb4fb2742e06328a75b282c35c1abf8d upstream. Dell All In One (AIO) models released after 2017 use a backlight controller board connected to an UART. In DSDT this uart port will be defined as: Name (_HID, "DELL0501") Name (_CID, EisaId ("PNP0501") Commit 484bae9e4d6a ("platform/x86: Add new Dell UART backlight driver") has added support for this, but I neglected to tie this into acpi_video_get_backlight_type(). Now the first AIO has turned up which has not only the DSDT bits for this, but also an actual controller attached to the UART, yet it is not using this controller for backlight control. Add support to acpi_video_get_backlight_type() for a new dell_uart backlight type. So that the existing infra to override the backlight control method on the commandline or with DMI quirks can be used. Fixes: 484bae9e4d6a ("platform/x86: Add new Dell UART backlight driver") Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Andy Shevchenko <andy@kernel.org> Link: https://patch.msgid.link/20240814190159.15650-2-hdegoede@redhat.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* scsi: core: Fix the return value of scsi_logical_block_count() (Chaotian Jing, 2024-08-29, 1 file, -1/+1)

  commit f03e94f23b04c2b71c0044c1534921b3975ef10c upstream.

  scsi_logical_block_count() should return the block count of a given SCSI command. The original implementation ended up shifting twice, leading to an incorrect count being returned. Fix the conversion between bytes and logical blocks.

  Cc: stable@vger.kernel.org
  Fixes: 6a20e21ae1e2 ("scsi: core: Add helper to return number of logical blocks in a request")
  Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com>
  Link: https://lore.kernel.org/r/20240813053534.7720-1-chaotian.jing@mediatek.com
  Reviewed-by: Bart Van Assche <bvanassche@acm.org>
  Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
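  The effect of the double shift can be shown with plain integer arithmetic (numbers invented for illustration): with 4096-byte logical blocks the shift amount is 12, so shifting a byte count right once yields the block count, while shifting twice effectively divides by the block size squared.

      #include <stdio.h>

      int main(void)
      {
              unsigned long long bytes = 1024 * 4096ULL;      /* request covering 1024 blocks */
              unsigned int blk_shift = 12;                    /* log2(4096) */

              printf("correct: %llu blocks\n", bytes >> blk_shift);                   /* 1024 */
              printf("buggy:   %llu blocks\n", (bytes >> blk_shift) >> blk_shift);    /* 0 */
              return 0;
      }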
* kcm: Serialise kcm_sendmsg() for the same socket. (Kuniyuki Iwashima, 2024-08-29, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 807067bf014d4a3ae2cc55bd3de16f22a01eb580 ] syzkaller reported UAF in kcm_release(). [0] The scenario is 1. Thread A builds a skb with MSG_MORE and sets kcm->seq_skb. 2. Thread A resumes building skb from kcm->seq_skb but is blocked by sk_stream_wait_memory() 3. Thread B calls sendmsg() concurrently, finishes building kcm->seq_skb and puts the skb to the write queue 4. Thread A faces an error and finally frees skb that is already in the write queue 5. kcm_release() does double-free the skb in the write queue When a thread is building a MSG_MORE skb, another thread must not touch it. Let's add a per-sk mutex and serialise kcm_sendmsg(). [0]: BUG: KASAN: slab-use-after-free in __skb_unlink include/linux/skbuff.h:2366 [inline] BUG: KASAN: slab-use-after-free in __skb_dequeue include/linux/skbuff.h:2385 [inline] BUG: KASAN: slab-use-after-free in __skb_queue_purge_reason include/linux/skbuff.h:3175 [inline] BUG: KASAN: slab-use-after-free in __skb_queue_purge include/linux/skbuff.h:3181 [inline] BUG: KASAN: slab-use-after-free in kcm_release+0x170/0x4c8 net/kcm/kcmsock.c:1691 Read of size 8 at addr ffff0000ced0fc80 by task syz-executor329/6167 CPU: 1 PID: 6167 Comm: syz-executor329 Tainted: G B 6.8.0-rc5-syzkaller-g9abbc24128bc #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Call trace: dump_backtrace+0x1b8/0x1e4 arch/arm64/kernel/stacktrace.c:291 show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:298 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd0/0x124 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:377 [inline] print_report+0x178/0x518 mm/kasan/report.c:488 kasan_report+0xd8/0x138 mm/kasan/report.c:601 __asan_report_load8_noabort+0x20/0x2c mm/kasan/report_generic.c:381 __skb_unlink include/linux/skbuff.h:2366 [inline] __skb_dequeue include/linux/skbuff.h:2385 [inline] __skb_queue_purge_reason include/linux/skbuff.h:3175 [inline] __skb_queue_purge include/linux/skbuff.h:3181 [inline] kcm_release+0x170/0x4c8 net/kcm/kcmsock.c:1691 __sock_release net/socket.c:659 [inline] sock_close+0xa4/0x1e8 net/socket.c:1421 __fput+0x30c/0x738 fs/file_table.c:376 ____fput+0x20/0x30 fs/file_table.c:404 task_work_run+0x230/0x2e0 kernel/task_work.c:180 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x618/0x1f64 kernel/exit.c:871 do_group_exit+0x194/0x22c kernel/exit.c:1020 get_signal+0x1500/0x15ec kernel/signal.c:2893 do_signal+0x23c/0x3b44 arch/arm64/kernel/signal.c:1249 do_notify_resume+0x74/0x1f4 arch/arm64/kernel/entry-common.c:148 exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline] exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline] el0_svc+0xac/0x168 arch/arm64/kernel/entry-common.c:713 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Allocated by task 6166: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x40/0x78 mm/kasan/common.c:68 kasan_save_alloc_info+0x70/0x84 mm/kasan/generic.c:626 unpoison_slab_object mm/kasan/common.c:314 [inline] __kasan_slab_alloc+0x74/0x8c mm/kasan/common.c:340 kasan_slab_alloc include/linux/kasan.h:201 [inline] 
slab_post_alloc_hook mm/slub.c:3813 [inline] slab_alloc_node mm/slub.c:3860 [inline] kmem_cache_alloc_node+0x204/0x4c0 mm/slub.c:3903 __alloc_skb+0x19c/0x3d8 net/core/skbuff.c:641 alloc_skb include/linux/skbuff.h:1296 [inline] kcm_sendmsg+0x1d3c/0x2124 net/kcm/kcmsock.c:783 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] sock_sendmsg+0x220/0x2c0 net/socket.c:768 splice_to_socket+0x7cc/0xd58 fs/splice.c:889 do_splice_from fs/splice.c:941 [inline] direct_splice_actor+0xec/0x1d8 fs/splice.c:1164 splice_direct_to_actor+0x438/0xa0c fs/splice.c:1108 do_splice_direct_actor fs/splice.c:1207 [inline] do_splice_direct+0x1e4/0x304 fs/splice.c:1233 do_sendfile+0x460/0xb3c fs/read_write.c:1295 __do_sys_sendfile64 fs/read_write.c:1362 [inline] __se_sys_sendfile64 fs/read_write.c:1348 [inline] __arm64_sys_sendfile64+0x160/0x3b4 fs/read_write.c:1348 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155 el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Freed by task 6167: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x40/0x78 mm/kasan/common.c:68 kasan_save_free_info+0x5c/0x74 mm/kasan/generic.c:640 poison_slab_object+0x124/0x18c mm/kasan/common.c:241 __kasan_slab_free+0x3c/0x78 mm/kasan/common.c:257 kasan_slab_free include/linux/kasan.h:184 [inline] slab_free_hook mm/slub.c:2121 [inline] slab_free mm/slub.c:4299 [inline] kmem_cache_free+0x15c/0x3d4 mm/slub.c:4363 kfree_skbmem+0x10c/0x19c __kfree_skb net/core/skbuff.c:1109 [inline] kfree_skb_reason+0x240/0x6f4 net/core/skbuff.c:1144 kfree_skb include/linux/skbuff.h:1244 [inline] kcm_release+0x104/0x4c8 net/kcm/kcmsock.c:1685 __sock_release net/socket.c:659 [inline] sock_close+0xa4/0x1e8 net/socket.c:1421 __fput+0x30c/0x738 fs/file_table.c:376 ____fput+0x20/0x30 fs/file_table.c:404 task_work_run+0x230/0x2e0 kernel/task_work.c:180 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x618/0x1f64 kernel/exit.c:871 do_group_exit+0x194/0x22c kernel/exit.c:1020 get_signal+0x1500/0x15ec kernel/signal.c:2893 do_signal+0x23c/0x3b44 arch/arm64/kernel/signal.c:1249 do_notify_resume+0x74/0x1f4 arch/arm64/kernel/entry-common.c:148 exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline] exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline] el0_svc+0xac/0x168 arch/arm64/kernel/entry-common.c:713 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 The buggy address belongs to the object at ffff0000ced0fc80 which belongs to the cache skbuff_head_cache of size 240 The buggy address is located 0 bytes inside of freed 240-byte region [ffff0000ced0fc80, ffff0000ced0fd70) The buggy address belongs to the physical page: page:00000000d35f4ae4 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10ed0f flags: 0x5ffc00000000800(slab|node=0|zone=2|lastcpupid=0x7ff) page_type: 0xffffffff() raw: 05ffc00000000800 ffff0000c1cbf640 fffffdffc3423100 dead000000000004 raw: 0000000000000000 00000000000c000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff0000ced0fb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff0000ced0fc00: fb fb fb fb fb fb 
fc fc fc fc fc fc fc fc fc fc >ffff0000ced0fc80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff0000ced0fd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc ffff0000ced0fd80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module") Reported-by: syzbot+b72d86aa5df17ce74c60@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b72d86aa5df17ce74c60 Tested-by: syzbot+b72d86aa5df17ce74c60@syzkaller.appspotmail.com Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240815220437.69511-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* net: mscc: ocelot: serialize access to the injection/extraction groups (Vladimir Oltean, 2024-08-29, 1 file, -0/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c5e12ac3beb0dd3a718296b2d8af5528e9ab728e ] As explained by Horatiu Vultur in commit 603ead96582d ("net: sparx5: Add spinlock for frame transmission from CPU") which is for a similar hardware design, multiple CPUs can simultaneously perform injection or extraction. There are only 2 register groups for injection and 2 for extraction, and the driver only uses one of each. So we'd better serialize access using spin locks, otherwise frame corruption is possible. Note that unlike in sparx5, FDMA in ocelot does not have this issue because struct ocelot_fdma_tx_ring already contains an xmit_lock. I guess this is mostly a problem for NXP LS1028A, as that is dual core. I don't think VSC7514 is. So I'm blaming the commit where LS1028A (aka the felix DSA driver) started using register-based packet injection and extraction. Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and register injection (Vladimir Oltean, 2024-08-29, 2 files, -1/+49)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | register injection [ Upstream commit 67c3ca2c5cfe6a50772514e3349b5e7b3b0fac03 ] Problem description ------------------- On an NXP LS1028A (felix DSA driver) with the following configuration: - ocelot-8021q tagging protocol - VLAN-aware bridge (with STP) spanning at least swp0 and swp1 - 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700 - ptp4l on swp0.700 and swp1.700 we see that the ptp4l instances do not see each other's traffic, and they all go to the grand master state due to the ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition. Jumping to the conclusion for the impatient ------------------------------------------- There is a zero-day bug in the ocelot switchdev driver in the way it handles VLAN-tagged packet injection. The correct logic already exists in the source code, in function ocelot_xmit_get_vlan_info() added by commit 5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit"). But it is used only for normal NPI-based injection with the DSA "ocelot" tagging protocol. The other injection code paths (register-based and FDMA-based) roll their own wrong logic. This affects and was noticed on the DSA "ocelot-8021q" protocol because it uses register-based injection. By moving ocelot_xmit_get_vlan_info() to a place that's common for both the DSA tagger and the ocelot switch library, it can also be called from ocelot_port_inject_frame() in ocelot.c. We need to touch the lines with ocelot_ifh_port_set()'s prototype anyway, so let's rename it to something clearer regarding what it does, and add a kernel-doc. ocelot_ifh_set_basic() should do. Investigation notes ------------------- Debugging reveals that PTP event (aka those carrying timestamps, like Sync) frames injected into swp0.700 (but also swp1.700) hit the wire with two VLAN tags: 00000000: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc ~~~~~~~~~~~ 00000010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00 ~~~~~~~~~~~ 00000020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03 00000030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00 00000040: 00 00 The second (unexpected) VLAN tag makes felix_check_xtr_pkt() -> ptp_classify_raw() fail to see these as PTP packets at the link partner's receiving end, and return PTP_CLASS_NONE (because the BPF classifier is not written to expect 2 VLAN tags). The reason why packets have 2 VLAN tags is because the transmission code treats VLAN incorrectly. Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX feature. Therefore, at xmit time, all VLANs should be in the skb head, and none should be in the hwaccel area. This is done by: static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb, netdev_features_t features) { if (skb_vlan_tag_present(skb) && !vlan_hw_offload_capable(features, skb->vlan_proto)) skb = __vlan_hwaccel_push_inside(skb); return skb; } But ocelot_port_inject_frame() handles things incorrectly: ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb)); void ocelot_ifh_port_set(struct sk_buff *skb, void *ifh, int port, u32 rew_op) { (...) if (vlan_tag) ocelot_ifh_set_vlan_tci(ifh, vlan_tag); (...) 
} The way __vlan_hwaccel_push_inside() pushes the tag inside the skb head is by calling: static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb) { skb->vlan_present = 0; } which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get(). This means that ocelot, when it calls skb_vlan_tag_get(), sees (and uses) a residual skb->vlan_tci, while the same VLAN tag is _already_ in the skb head. The trivial fix for double VLAN headers is to replace the content of ocelot_ifh_port_set() with: if (skb_vlan_tag_present(skb)) ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb)); but this would not be correct either, because, as mentioned, vlan_hw_offload_capable() is false for us, so we'd be inserting dead code and we'd always transmit packets with VID=0 in the injection frame header. I can't actually test the ocelot switchdev driver and rely exclusively on code inspection, but I don't think traffic from 8021q uppers has ever been injected properly, and not double-tagged. Thus I'm blaming the introduction of VLAN fields in the injection header - early driver code. As hinted at in the early conclusion, what we _want_ to happen for VLAN transmission was already described once in commit 5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit"). ocelot_xmit_get_vlan_info() intends to ensure that if the port through which we're transmitting is under a VLAN-aware bridge, the outer VLAN tag from the skb head is stripped from there and inserted into the injection frame header (so that the packet is processed in hardware through that actual VLAN). And in all other cases, the packet is sent with VID=0 in the injection frame header, since the port is VLAN-unaware and has logic to strip this VID on egress (making it invisible to the wire). Fixes: 08d02364b12f ("net: mscc: fix the injection header") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* Bluetooth: HCI: Invert LE State quirk to be opt-out rather then opt-in (Luiz Augusto von Dentz, 2024-08-29, 2 files, -8/+11)
| | | | | | | | | | | | | | | [ Upstream commit aae6b81260fd9a7224f7eb4fc440d625852245bb ] This inverts the LE State quirk so by default we assume the controllers would report valid states rather than invalid which is how quirks normally behave, also this would result in HCI command failing it the LE States are really broken thus exposing the controllers that are really broken in this respect. Link: https://github.com/bluez/bluez/issues/584 Fixes: 220915857e29 ("Bluetooth: Adding driver and quirk defs for multi-role LE") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* thermal: gov_bang_bang: Use governor_data to reduce overhead (Rafael J. Wysocki, 2024-08-29, 1 file, -0/+1)

  [ Upstream commit 6e6f58a170ea98e44075b761f2da42a5aec47dfb ]

  After running once, the for_each_trip_desc() loop in bang_bang_manage() is pure needless overhead because it is not going to make any changes unless a new cooling device has been bound to one of the trips in the thermal zone or the system is resuming from sleep.

  For this reason, make bang_bang_manage() set governor_data for the thermal zone and check it upfront to decide whether or not it needs to do anything.

  However, governor_data needs to be reset in some cases to let bang_bang_manage() know that it should walk the trips again, so add an .update_tz() callback to the governor and make the core additionally invoke it during system resume.

  To avoid affecting the other users of that callback unnecessarily, add a special notification reason for system resume, THERMAL_TZ_RESUME, and also pass it to __thermal_zone_device_update() called during system resume for consistency.

  Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  Acked-by: Peter Kästle <peter@piie.net>
  Reviewed-by: Zhang Rui <rui.zhang@intel.com>
  Cc: 6.10+ <stable@vger.kernel.org> # 6.10+
  Link: https://patch.msgid.link/2285575.iZASKD2KPV@rjwysocki.net
  Signed-off-by: Sasha Levin <sashal@kernel.org>
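  A hedged sketch of the caching pattern described above (illustrative only, not the actual gov_bang_bang code; the marker value is arbitrary): the trip walk runs once, records that it has run in governor_data, and later invocations return early until the core clears the field again, e.g. on system resume.

      static void example_bang_bang_manage(struct thermal_zone_device *tz)
      {
              if (tz->governor_data)          /* trips already set up, nothing to do */
                      return;

              /* ... walk every trip once and apply the bang-bang decision ... */

              tz->governor_data = tz;         /* any non-NULL value works as a marker */
      }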
* printk/panic: Allow cpu backtraces to be written into ringbuffer during panic (Ryo Takakura, 2024-08-29, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit bcc954c6caba01fca143162d5fbb90e46aa1ad80 ] commit 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing to ringbuffer") disabled non-panic CPUs to further write messages to ringbuffer after panicked. Since the commit, non-panicked CPU's are not allowed to write to ring buffer after panicked and CPU backtrace which is triggered after panicked to sample non-panicked CPUs' backtrace no longer serves its function as it has nothing to print. Fix the issue by allowing non-panicked CPUs to write into ringbuffer while CPU backtrace is in flight. Fixes: 779dbc2e78d7 ("printk: Avoid non-panic CPUs writing to ringbuffer") Signed-off-by: Ryo Takakura <takakura@valinux.co.jp> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240812072703.339690-1-takakura@valinux.co.jp Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* io_uring/napi: use ktime in busy polling (Pavel Begunkov, 2024-08-29, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 342b2e395d5f34c9f111a818556e617939f83a8c ] It's more natural to use ktime/ns instead of keeping around usec, especially since we're comparing it against user provided timers, so convert napi busy poll internal handling to ktime. It's also nicer since the type (ktime_t vs unsigned long) now tells the unit of measure. Keep everything as ktime, which we convert to/from micro seconds for IORING_[UN]REGISTER_NAPI. The net/ busy polling works seems to work with usec, however it's not real usec as shift by 10 is used to get it from nsecs, see busy_loop_current_time(), so it's easy to get truncated nsec back and we get back better precision. Note, we can further improve it later by removing the truncation and maybe convincing net/ to use ktime/ns instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/95e7ec8d095069a3ed5d40a4bc6f8b586698bc7e.1722003776.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 84f2eecf9501 ("io_uring/napi: check napi_enabled in io_napi_add() before proceeding") Signed-off-by: Sasha Levin <sashal@kernel.org>
* kbuild: remove PROVIDE() for kallsyms symbols (Masahiro Yamada, 2024-08-29, 1 file, -19/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c442db3f49f27e5a60a641b2ac9a3c6320796ed6 ] This reimplements commit 951bcae6c5a0 ("kallsyms: Avoid weak references for kallsyms symbols") because I am not a big fan of PROVIDE(). As an alternative solution, this commit prepends one more kallsyms step. KSYMS .tmp_vmlinux.kallsyms0.S # added AS .tmp_vmlinux.kallsyms0.o # added LD .tmp_vmlinux.btf BTF .btf.vmlinux.bin.o LD .tmp_vmlinux.kallsyms1 NM .tmp_vmlinux.kallsyms1.syms KSYMS .tmp_vmlinux.kallsyms1.S AS .tmp_vmlinux.kallsyms1.o LD .tmp_vmlinux.kallsyms2 NM .tmp_vmlinux.kallsyms2.syms KSYMS .tmp_vmlinux.kallsyms2.S AS .tmp_vmlinux.kallsyms2.o LD vmlinux Step 0 takes /dev/null as input, and generates .tmp_vmlinux.kallsyms0.o, which has a valid kallsyms format with the empty symbol list, and can be linked to vmlinux. Since it is really small, the added compile-time cost is negligible. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Stable-dep-of: 020925ce9299 ("kallsyms: Do not cleanup .llvm.<hash> suffix before sorting symbols") Signed-off-by: Sasha Levin <sashal@kernel.org>
* vsock: fix recursive ->recvmsg callsCong Wang2024-08-291-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 69139d2919dd4aa9a553c8245e7c63e82613e3fc ] After a vsock socket has been added to a BPF sockmap, its prot->recvmsg has been replaced with vsock_bpf_recvmsg(). Thus the following recursion could happen: vsock_bpf_recvmsg() -> __vsock_recvmsg() -> vsock_connectible_recvmsg() -> prot->recvmsg() -> vsock_bpf_recvmsg() again. We need to fix it by calling the original ->recvmsg() without any BPF sockmap logic in __vsock_recvmsg(). Fixes: 634f1a7110b4 ("vsock: support sockmap") Reported-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com Tested-by: syzbot+bdb4bd87b5e22058e2a4@syzkaller.appspotmail.com Cc: Bobby Eshleman <bobby.eshleman@bytedance.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20240812022153.86512-1-xiyou.wangcong@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
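A schematic of the pattern the fix describes, not the actual vsock patch: the path reachable from the sockmap wrapper must call the transport's original recvmsg directly rather than re-dispatching through sk->sk_prot, which sockmap has already rewritten. All names below are illustrative stand-ins:

#include <net/sock.h>

/* Stand-in for the transport's original recvmsg; real vsock symbols differ. */
static int xport_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
			 int flags, int *addr_len)
{
	return 0;	/* placeholder body for the sketch */
}

/* Buggy shape: dispatch through sk->sk_prot, which sockmap has replaced
 * with the BPF wrapper, so the wrapper ends up calling itself. */
static int dispatcher_recvmsg_buggy(struct sock *sk, struct msghdr *msg,
				    size_t len, int flags, int *addr_len)
{
	return sk->sk_prot->recvmsg(sk, msg, len, flags, addr_len);
}

/* Fixed shape: call the saved original recvmsg, bypassing the rewritten
 * sk->sk_prot, so no recursion back into the BPF wrapper is possible. */
static int dispatcher_recvmsg_fixed(struct sock *sk, struct msghdr *msg,
				    size_t len, int flags, int *addr_len)
{
	return xport_recvmsg(sk, msg, len, flags, addr_len);
}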
* bpf: Fix updating attached freplace prog in prog_array mapLeon Hwang2024-08-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit fdad456cbcca739bae1849549c7a999857c56f88 ] The commit f7866c358733 ("bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT") fixed a NULL pointer dereference panic, but didn't fix the underlying issue that updating an attached freplace prog in a prog_array map still fails. Since commit 1c123c567fb1 ("bpf: Resolve fext program type when checking map compatibility"), freplace prog and its target prog are able to tail call each other. And commit 3aac1ead5eb6 ("bpf: Move prog->aux->linked_prog and trampoline into bpf_link on attach") sets prog->aux->dst_prog to NULL after attaching freplace prog to its target prog. After loading freplace, the prog_array's owner type is BPF_PROG_TYPE_SCHED_CLS. Then, after attaching freplace, its prog->aux->dst_prog is NULL. Then, while updating freplace in prog_array, bpf_prog_map_compatible() incorrectly returns false because resolve_prog_type() returns BPF_PROG_TYPE_EXT instead of BPF_PROG_TYPE_SCHED_CLS. After this patch, resolve_prog_type() returns BPF_PROG_TYPE_SCHED_CLS and the update to prog_array can succeed. Fixes: f7866c358733 ("bpf: Fix null pointer dereference in resolve_prog_type() for BPF_PROG_TYPE_EXT") Cc: Toke Høiland-Jørgensen <toke@redhat.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20240728114612.48486-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
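A sketch of the shape such a resolve_prog_type() fix takes: instead of keying off prog->aux->dst_prog (which attach clears), an EXT program falls back to the program type saved at load time. Treat this as an approximation of the helper in bpf_verifier.h, not a verbatim copy of the applied hunk:

/* Sketch: resolve the effective program type of an freplace (EXT) program.
 * saved_dst_prog_type is recorded at load time, so it stays valid even
 * after attach has set prog->aux->dst_prog to NULL. */
static inline enum bpf_prog_type resolve_prog_type(const struct bpf_prog *prog)
{
	return (prog->type == BPF_PROG_TYPE_EXT && prog->aux->saved_dst_prog_type) ?
		prog->aux->saved_dst_prog_type : prog->type;
}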
* alloc_tag: introduce clear_page_tag_ref() helper functionSuren Baghdasaryan2024-08-291-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | commit a8fc28dad6d574582cdf2f7e78c73c59c623df30 upstream. In several cases we are freeing pages which were not allocated using common page allocators. For such cases, in order to keep allocation accounting correct, we should clear the page tag to indicate that the page being freed is expected to not have a valid allocation tag. Introduce clear_page_tag_ref() helper function to be used for this. Link: https://lkml.kernel.org/r/20240813150758.855881-1-surenb@google.com Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.10] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
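Conceptually the helper just wraps the existing "mark this page's tag as empty" sequence behind the mem_alloc_profiling_enabled() check. A rough sketch of that shape; the exact accessor names and signatures are assumptions about the surrounding page-tag API rather than a quote of the merged code:

/* Rough sketch of the helper's shape; accessor names are assumptions. */
static inline void clear_page_tag_ref(struct page *page)
{
	if (mem_alloc_profiling_enabled()) {
		union codetag_ref *ref = get_page_tag_ref(page);

		if (ref) {
			/* mark the freed page as having no valid alloc tag */
			set_codetag_empty(ref);
			put_page_tag_ref(ref);
		}
	}
}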
* net: mana: Fix doorbell out of order violation and avoid unnecessary ↵Long Li2024-08-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | doorbell rings commit 58a63729c957621f1990c3494c702711188ca347 upstream. After napi_complete_done() is called when NAPI is polling in the current process context, another NAPI may be scheduled and start running in softirq on another CPU and may ring the doorbell before the current CPU does. When combined with unnecessary rings when there is no need to arm the CQ, it triggers error paths in the hardware. This patch fixes this by calling napi_complete_done() after doorbell rings. It also limits the number of unnecessary rings when there is no need to arm the CQ. MANA hardware specifies that there must be one doorbell ring every 8 CQ wraparounds. This driver guarantees one doorbell ring as soon as the number of consumed CQEs exceeds 4 CQ wraparounds. In practical workloads, 4 CQ wraparounds prove to be big enough that the driver rarely exceeds this limit before all the napi weight is consumed. To implement this, add a per-CQ counter cq->work_done_since_doorbell, and make sure the CQ is armed as soon as it passes 4 wraparounds. Cc: stable@vger.kernel.org Fixes: e1b5683ff62e ("net: mana: Move NAPI from EQ to CQ") Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Long Li <longli@microsoft.com> Link: https://patch.msgid.link/1723219138-29887-1-git-send-email-longli@linuxonhyperv.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
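A schematic of the counting rule described above. Only work_done_since_doorbell comes from the commit text; every other name is a placeholder and not a real mana driver symbol:

/* Illustrative sketch of the "arm at least every 4 CQ wraparounds" rule. */
struct example_cq {
	u32 cqe_count;			/* CQ depth in entries */
	u32 work_done_since_doorbell;	/* CQEs consumed since last arm */
};

static bool example_cq_should_arm(struct example_cq *cq, u32 work_done)
{
	cq->work_done_since_doorbell += work_done;

	/* HW wants a doorbell at least every 8 CQ wraparounds; arm after
	 * 4 wraparounds' worth of consumed CQEs to stay well inside that. */
	if (cq->work_done_since_doorbell >= 4 * cq->cqe_count) {
		cq->work_done_since_doorbell = 0;
		return true;
	}
	return false;
}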
* mm/hugetlb: fix hugetlb vs. core-mm PT lockingDavid Hildenbrand2024-08-292-3/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5f75cfbd6bb02295ddaed48adf667b6c828ce07b upstream. We recently made GUP's common page table walking code to also walk hugetlb VMAs without most hugetlb special-casing, preparing for the future of having less hugetlb-specific page table walking code in the codebase. Turns out that we missed one page table locking detail: page table locking for hugetlb folios that are not mapped using a single PMD/PUD. Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the page tables, will perform a pte_offset_map_lock() to grab the PTE table lock. However, hugetlb that concurrently modifies these page tables would actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the locks would differ. Something similar can happen right now with hugetlb folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS. This issue can be reproduced [1], for example triggering: [ 3105.936100] ------------[ cut here ]------------ [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188 [ 3105.944634] Modules linked in: [...] [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1 [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024 [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 3105.991108] pc : try_grab_folio+0x11c/0x188 [ 3105.994013] lr : follow_page_pte+0xd8/0x430 [ 3105.996986] sp : ffff80008eafb8f0 [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43 [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48 [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978 [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001 [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000 [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000 [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0 [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080 [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000 [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000 [ 3106.047957] Call trace: [ 3106.049522] try_grab_folio+0x11c/0x188 [ 3106.051996] follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0 [ 3106.055527] follow_page_mask+0x1a0/0x2b8 [ 3106.058118] __get_user_pages+0xf0/0x348 [ 3106.060647] faultin_page_range+0xb0/0x360 [ 3106.063651] do_madvise+0x340/0x598 Let's make huge_pte_lockptr() effectively use the same PT locks as any core-mm page table walker would. Add ptep_lockptr() to obtain the PTE page table lock using a pte pointer -- unfortunately we cannot convert pte_lockptr() because virt_to_page() doesn't work with kmap'ed page tables we can have with CONFIG_HIGHPTE. Handle CONFIG_PGTABLE_LEVELS correctly by checking in reverse order, such that when e.g., CONFIG_PGTABLE_LEVELS==2 with PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE will work as expected. Document why that works. There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb folio being mapped using two PTE page tables. 
While hugetlb wants to take the PMD table lock, core-mm would grab the PTE table lock of one of both PTE page tables. In such corner cases, we have to make sure that both locks match, which is (fortunately!) currently guaranteed for 8xx as it does not support SMP and consequently doesn't use split PT locks. [1] https://lore.kernel.org/all/1bbfcc7f-f222-45a5-ac44-c5a1381c596d@redhat.com/ Link: https://lkml.kernel.org/r/20240801204748.99107-1-david@redhat.com Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Peter Xu <peterx@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
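The core of the fix is making the hugetlb lock lookup mirror what a core-mm walker would pick for the same mapping size, checking the levels from largest to smallest so collapsed-level configurations (e.g. CONFIG_PGTABLE_LEVELS==2) still resolve to the right lock. A sketch of that selection logic, approximating the upstream huge_pte_lockptr() rather than quoting it; ptep_lockptr() is the new helper named in the message:

/* Sketch: pick the same page-table lock a core-mm walker would use for a
 * hugetlb mapping of the given size. Approximation of the upstream helper. */
static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
					   struct mm_struct *mm, pte_t *pte)
{
	const unsigned long size = huge_page_size(h);

	/* Check from the largest level downwards so that, e.g., on a
	 * 2-level configuration where PUD_SIZE == PMD_SIZE the PUD case
	 * wins, matching core-mm's choice for that mapping. */
	if (size >= PUD_SIZE)
		return pud_lockptr(mm, (pud_t *)pte);
	if (size >= PMD_SIZE)
		return pmd_lockptr(mm, (pmd_t *)pte);
	/* cont-PTE style hugetlb (e.g. 64 KiB folios on 4 KiB arm64) */
	return ptep_lockptr(mm, pte);
}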
* fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHAREAl Viro2024-08-291-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9a2fa1472083580b6c66bdaf291f591e1170123a upstream. copy_fd_bitmaps(new, old, count) is expected to copy the first count/BITS_PER_LONG bits from old->full_fds_bits[] and fill the rest with zeroes. What it does is copy enough words (BITS_TO_LONGS(count/BITS_PER_LONG)), then memset the rest. That works fine, *if* all bits past the cutoff point are clear. Otherwise we are risking garbage from the last word we'd copied. For most of the callers that is true - expand_fdtable() has count equal to old->max_fds, so there are no open descriptors past count, let alone fully occupied words in ->open_fds[], which is what bits in ->full_fds_bits[] correspond to. The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds), which is the smallest multiple of BITS_PER_LONG that covers all opened descriptors below max_fds. In the common case (copying on fork()) max_fds is ~0U, so all opened descriptors will be below it and we are fine, for the same reasons that the call in expand_fdtable() is safe. Unfortunately, there is a case where max_fds is less than that and where we might, indeed, end up with junk in ->full_fds_bits[] - close_range(from, to, CLOSE_RANGE_UNSHARE) with * descriptor table being currently shared * 'to' being above the current capacity of descriptor table * 'from' being just under some chunk of opened descriptors. In that case we end up with observably wrong behaviour - e.g. spawn a child with CLONE_FILES, get all descriptors in range 0..127 open, then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending up with descriptor #128, despite #64 being observably not open. The minimally invasive fix would be to deal with that in dup_fd(). If this proves to add measurable overhead, we can go that way, but let's try to fix copy_fd_bitmaps() first. * new helper: bitmap_copy_and_expand(to, from, bits_to_copy, size). * make copy_fd_bitmaps() take the bitmap size in words, rather than bits; its 'count' argument is always a multiple of BITS_PER_LONG, so we are not losing any information, and that way we can use the same helper for all three bitmaps - the compiler will see that count is a multiple of BITS_PER_LONG for the large ones, so it'll generate plain memcpy()+memset(). Reproducer added to tools/testing/selftests/core/close_range_test.c Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
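A sketch of what the described bitmap_copy_and_expand() helper boils down to, using the naming from the message (the helper that actually landed may differ in name and in how it masks a partial last word):

#include <linux/bitmap.h>
#include <linux/string.h>

/* Sketch: copy the first @count bits from @src into @dst and zero the rest
 * of @dst up to @size bits. Callers described above always pass @count as a
 * multiple of BITS_PER_LONG, so no partial-word masking is shown here. */
static inline void bitmap_copy_and_expand(unsigned long *dst,
					  const unsigned long *src,
					  unsigned int count,
					  unsigned int size)
{
	unsigned int copy = BITS_TO_LONGS(count);

	memcpy(dst, src, copy * sizeof(long));
	memset(dst + copy, 0, (BITS_TO_LONGS(size) - copy) * sizeof(long));
}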
* vfs: Don't evict inode under the inode lru traversing contextZhihao Cheng2024-08-291-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2a0629834cd82f05d424bbc193374f9a43d1f87d upstream. The inode reclaiming process(See function prune_icache_sb) collects all reclaimable inodes and mark them with I_FREEING flag at first, at that time, other processes will be stuck if they try getting these inodes (See function find_inode_fast), then the reclaiming process destroy the inodes by function dispose_list(). Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may do inode lookup in the inode evicting callback function, if the inode lookup is operated under the inode lru traversing context, deadlock problems may happen. Case 1: In function ext4_evict_inode(), the ea inode lookup could happen if ea_inode feature is enabled, the lookup process will be stuck under the evicting context like this: 1. File A has inode i_reg and an ea inode i_ea 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea 3. Then, following three processes running like this: PA PB echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // i_reg is added into lru, lru->i_ea->i_reg prune_icache_sb list_lru_walk_one inode_lru_isolate i_ea->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(i_reg) spin_unlock(&i_reg->i_lock) spin_unlock(lru_lock) rm file A i_reg->nlink = 0 iput(i_reg) // i_reg->nlink is 0, do evict ext4_evict_inode ext4_xattr_delete_inode ext4_xattr_inode_dec_ref_all ext4_xattr_inode_iget ext4_iget(i_ea->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(i_ea) ----→ AA deadlock dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&i_ea->i_state) Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file deleting process holds BASEHD's wbuf->io_mutex while getting the xattr inode, which could race with inode reclaiming process(The reclaiming process could try locking BASEHD's wbuf->io_mutex in inode evicting function), then an ABBA deadlock problem would happen as following: 1. File A has inode ia and a xattr(with inode ixa), regular file B has inode ib and a xattr. 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa 3. Then, following three processes running like this: PA PB PC echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // ib and ia are added into lru, lru->ixa->ib->ia prune_icache_sb list_lru_walk_one inode_lru_isolate ixa->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(ib) spin_unlock(&ib->i_lock) spin_unlock(lru_lock) rm file B ib->nlink = 0 rm file A iput(ia) ubifs_evict_inode(ia) ubifs_jnl_delete_inode(ia) ubifs_jnl_write_inode(ia) make_reservation(BASEHD) // Lock wbuf->io_mutex ubifs_iget(ixa->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(ixa) | iput(ib) // ib->nlink is 0, do evict | ubifs_evict_inode | ubifs_jnl_delete_inode(ib) ↓ ubifs_jnl_write_inode ABBA deadlock ←-----make_reservation(BASEHD) dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&ixa->i_state) Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to pin the inode in memory while inode_lru_isolate() reclaims its pages instead of using ordinary inode reference. This way inode deletion cannot be triggered from inode_lru_isolate() thus avoiding the deadlock. evict() is made to wait for I_LRU_ISOLATING to be cleared before proceeding with inode cleanup. 
Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022 Fixes: e50e5129f384 ("ext4: xattr-in-inode support") Fixes: 7959cf3a7506 ("ubifs: journal: Handle xattrs like files") Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
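The mechanism is a new inode state bit rather than an ordinary reference: inode_lru_isolate() marks the inode, reclaims its pages, then clears the bit and wakes any waiter, while evict() waits for the bit to clear before tearing the inode down. A rough sketch of the pinning side; the helper names are illustrative and not necessarily what was merged, only I_LRU_ISOLATING itself comes from the commit text:

/* Rough sketch of pinning an inode with a state flag during LRU isolation. */
static void inode_pin_lru_isolating(struct inode *inode)
{
	lockdep_assert_held(&inode->i_lock);
	WARN_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
	inode->i_state |= I_LRU_ISOLATING;
}

static void inode_unpin_lru_isolating(struct inode *inode)
{
	spin_lock(&inode->i_lock);
	WARN_ON(!(inode->i_state & I_LRU_ISOLATING));
	inode->i_state &= ~I_LRU_ISOLATING;
	/* let a concurrent evict() proceed with inode cleanup */
	wake_up_bit(&inode->i_state, __I_LRU_ISOLATING);
	spin_unlock(&inode->i_lock);
}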
* ACPICA: Add a depth argument to acpi_execute_reg_methods()Rafael J. Wysocki2024-08-291-0/+1
| | | | | | | | | | | | | | | commit cdf65d73e001fde600b18d7e45afadf559425ce5 upstream. A subsequent change will need to pass a depth argument to acpi_execute_reg_methods(), so prepare that function for it. No intentional functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Cc: All applicable <stable@vger.kernel.org> Link: https://patch.msgid.link/8451567.NyiUUSuA9g@rjwysocki.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a ↵David Howells2024-08-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | second writeback flag" commit 8e5ced7804cb9184c4a23f8054551240562a8eda upstream. This reverts commit ae678317b95e760607c7b20b97c9cd4ca9ed6e1a. Revert the patch that removes the deprecated use of PG_private_2 in netfslib for the moment as Ceph is actually still using this to track data copied to the cache. Fixes: ae678317b95e ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag") Reported-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "misc: fastrpc: Restrict untrusted app to attach to privileged PD"Griffin Kroah-Hartman2024-08-291-3/+0
| | | | | | | | | | | | | | | | | | | | commit 9bb5e74b2bf88fbb024bb15ded3b011e02c673be upstream. This reverts commit bab2f5e8fd5d2f759db26b78d9db57412888f187. Joel reported that this commit breaks userspace and stops sensors in SDM845 from working. Also breaks other qcom SoC devices running postmarketOS. Cc: stable <stable@kernel.org> Cc: Ekansh Gupta <quic_ekangupt@quicinc.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reported-by: Joel Selvaraj <joelselvaraj.oss@gmail.com> Link: https://lore.kernel.org/r/9a9f5646-a554-4b65-8122-d212bb665c81@umsystem.edu Signed-off-by: Griffin Kroah-Hartman <griffin@kroah.com> Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Fixes: bab2f5e8fd5d ("misc: fastrpc: Restrict untrusted app to attach to privileged PD") Link: https://lore.kernel.org/r/20240815094920.8242-1-griffin@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "ACPI: EC: Evaluate orphan _REG under EC device"Rafael J. Wysocki2024-08-291-4/+0
| | | | | | | | | | | | | | commit 779bac9994452f6a894524f70c00cfb0cd4b6364 upstream. This reverts commit 0e6b6dedf168 ("Revert "ACPI: EC: Evaluate orphan _REG under EC device") because the problem addressed by it will be addressed differently in what follows. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Cc: All applicable <stable@vger.kernel.org> Link: https://patch.msgid.link/3236716.5fSG56mABF@rjwysocki.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* block: use the right type for stub rq_integrity_vec()Jens Axboe2024-08-141-1/+1
| | | | | | | | | | | | commit 69b6517687a4b1fb250bd8c9c193a0a304c8ba17 upstream. For !CONFIG_BLK_DEV_INTEGRITY, rq_integrity_vec() wasn't updated properly. Fix it up. Fixes: cf546dd289e0 ("block: change rq_integrity_vec to respect the iterator") Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
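For reference, a hedged sketch of what a type-consistent !CONFIG_BLK_DEV_INTEGRITY stub looks like once rq_integrity_vec() returns a struct bio_vec by value rather than a pointer; this approximates the shape of the fix, not necessarily the exact hunk that was applied:

#ifndef CONFIG_BLK_DEV_INTEGRITY
/* Sketch: return an empty bio_vec by value so the stub's return type matches
 * the real rq_integrity_vec(); callers are compiled out in this config. */
static inline struct bio_vec rq_integrity_vec(struct request *rq)
{
	return (struct bio_vec){ };
}
#endif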
* tracefs: Use generic inode RCU for synchronizing freeingSteven Rostedt2024-08-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0b6743bd60a56a701070b89fb80c327a44b7b3e2 upstream. With structure layout randomization enabled for 'struct inode' we need to avoid overlapping any of the RCU-used / initialized-only-once members, e.g. i_lru or i_sb_list to not corrupt related list traversals when making use of the rcu_head. For an unlucky structure layout of 'struct inode' we may end up with the following splat when running the ftrace selftests: [<...>] list_del corruption, ffff888103ee2cb0->next (tracefs_inode_cache+0x0/0x4e0 [slab object]) is NULL (prev is tracefs_inode_cache+0x78/0x4e0 [slab object]) [<...>] ------------[ cut here ]------------ [<...>] kernel BUG at lib/list_debug.c:54! [<...>] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [<...>] CPU: 3 PID: 2550 Comm: mount Tainted: G N 6.8.12-grsec+ #122 ed2f536ca62f28b087b90e3cc906a8d25b3ddc65 [<...>] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [<...>] RIP: 0010:[<ffffffff84656018>] __list_del_entry_valid_or_report+0x138/0x3e0 [<...>] Code: 48 b8 99 fb 65 f2 ff ff ff ff e9 03 5c d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff e9 33 5a d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff <0f> 0b 4c 89 e9 48 89 ea 48 89 ee 48 c7 c7 60 8f dd 89 31 c0 e8 2f [<...>] RSP: 0018:fffffe80416afaf0 EFLAGS: 00010283 [<...>] RAX: 0000000000000098 RBX: ffff888103ee2cb0 RCX: 0000000000000000 [<...>] RDX: ffffffff84655fe8 RSI: ffffffff89dd8b60 RDI: 0000000000000001 [<...>] RBP: ffff888103ee2cb0 R08: 0000000000000001 R09: fffffbd0082d5f25 [<...>] R10: fffffe80416af92f R11: 0000000000000001 R12: fdf99c16731d9b6d [<...>] R13: 0000000000000000 R14: ffff88819ad4b8b8 R15: 0000000000000000 [<...>] RBX: tracefs_inode_cache+0x0/0x4e0 [slab object] [<...>] RDX: __list_del_entry_valid_or_report+0x108/0x3e0 [<...>] RSI: __func__.47+0x4340/0x4400 [<...>] RBP: tracefs_inode_cache+0x0/0x4e0 [slab object] [<...>] RSP: process kstack fffffe80416afaf0+0x7af0/0x8000 [mount 2550 2550] [<...>] R09: kasan shadow of process kstack fffffe80416af928+0x7928/0x8000 [mount 2550 2550] [<...>] R10: process kstack fffffe80416af92f+0x792f/0x8000 [mount 2550 2550] [<...>] R14: tracefs_inode_cache+0x78/0x4e0 [slab object] [<...>] FS: 00006dcb380c1840(0000) GS:ffff8881e0600000(0000) knlGS:0000000000000000 [<...>] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [<...>] CR2: 000076ab72b30e84 CR3: 000000000b088004 CR4: 0000000000360ef0 shadow CR4: 0000000000360ef0 [<...>] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [<...>] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [<...>] ASID: 0003 [<...>] Stack: [<...>] ffffffff818a2315 00000000f5c856ee ffffffff896f1840 ffff888103ee2cb0 [<...>] ffff88812b6b9750 0000000079d714b6 fffffbfff1e9280b ffffffff8f49405f [<...>] 0000000000000001 0000000000000000 ffff888104457280 ffffffff8248b392 [<...>] Call Trace: [<...>] <TASK> [<...>] [<ffffffff818a2315>] ? lock_release+0x175/0x380 fffffe80416afaf0 [<...>] [<ffffffff8248b392>] list_lru_del+0x152/0x740 fffffe80416afb48 [<...>] [<ffffffff8248ba93>] list_lru_del_obj+0x113/0x280 fffffe80416afb88 [<...>] [<ffffffff8940fd19>] ? 
_atomic_dec_and_lock+0x119/0x200 fffffe80416afb90 [<...>] [<ffffffff8295b244>] iput_final+0x1c4/0x9a0 fffffe80416afbb8 [<...>] [<ffffffff8293a52b>] dentry_unlink_inode+0x44b/0xaa0 fffffe80416afbf8 [<...>] [<ffffffff8293fefc>] __dentry_kill+0x23c/0xf00 fffffe80416afc40 [<...>] [<ffffffff8953a85f>] ? __this_cpu_preempt_check+0x1f/0xa0 fffffe80416afc48 [<...>] [<ffffffff82949ce5>] ? shrink_dentry_list+0x1c5/0x760 fffffe80416afc70 [<...>] [<ffffffff82949b71>] ? shrink_dentry_list+0x51/0x760 fffffe80416afc78 [<...>] [<ffffffff82949da8>] shrink_dentry_list+0x288/0x760 fffffe80416afc80 [<...>] [<ffffffff8294ae75>] shrink_dcache_sb+0x155/0x420 fffffe80416afcc8 [<...>] [<ffffffff8953a7c3>] ? debug_smp_processor_id+0x23/0xa0 fffffe80416afce0 [<...>] [<ffffffff8294ad20>] ? do_one_tree+0x140/0x140 fffffe80416afcf8 [<...>] [<ffffffff82997349>] ? do_remount+0x329/0xa00 fffffe80416afd18 [<...>] [<ffffffff83ebf7a1>] ? security_sb_remount+0x81/0x1c0 fffffe80416afd38 [<...>] [<ffffffff82892096>] reconfigure_super+0x856/0x14e0 fffffe80416afd70 [<...>] [<ffffffff815d1327>] ? ns_capable_common+0xe7/0x2a0 fffffe80416afd90 [<...>] [<ffffffff82997436>] do_remount+0x416/0xa00 fffffe80416afdd0 [<...>] [<ffffffff829b2ba4>] path_mount+0x5c4/0x900 fffffe80416afe28 [<...>] [<ffffffff829b25e0>] ? finish_automount+0x13a0/0x13a0 fffffe80416afe60 [<...>] [<ffffffff82903812>] ? user_path_at_empty+0xb2/0x140 fffffe80416afe88 [<...>] [<ffffffff829b2ff5>] do_mount+0x115/0x1c0 fffffe80416afeb8 [<...>] [<ffffffff829b2ee0>] ? path_mount+0x900/0x900 fffffe80416afed8 [<...>] [<ffffffff8272461c>] ? __kasan_check_write+0x1c/0xa0 fffffe80416afee0 [<...>] [<ffffffff829b31cf>] __do_sys_mount+0x12f/0x280 fffffe80416aff30 [<...>] [<ffffffff829b36cd>] __x64_sys_mount+0xcd/0x2e0 fffffe80416aff70 [<...>] [<ffffffff819f8818>] ? 
syscall_trace_enter+0x218/0x380 fffffe80416aff88 [<...>] [<ffffffff8111655e>] x64_sys_call+0x5d5e/0x6720 fffffe80416affa8 [<...>] [<ffffffff8952756d>] do_syscall_64+0xcd/0x3c0 fffffe80416affb8 [<...>] [<ffffffff8100119b>] entry_SYSCALL_64_safe_stack+0x4c/0x87 fffffe80416affe8 [<...>] </TASK> [<...>] <PTREGS> [<...>] RIP: 0033:[<00006dcb382ff66a>] vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map] [<...>] Code: 48 8b 0d 29 18 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f6 17 0d 00 f7 d8 64 89 01 48 [<...>] RSP: 002b:0000763d68192558 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [<...>] RAX: ffffffffffffffda RBX: 00006dcb38433264 RCX: 00006dcb382ff66a [<...>] RDX: 000017c3e0d11210 RSI: 000017c3e0d1a5a0 RDI: 000017c3e0d1ae70 [<...>] RBP: 000017c3e0d10fb0 R08: 000017c3e0d11260 R09: 00006dcb383d1be0 [<...>] R10: 000000000020002e R11: 0000000000000246 R12: 0000000000000000 [<...>] R13: 000017c3e0d1ae70 R14: 000017c3e0d11210 R15: 000017c3e0d10fb0 [<...>] RBX: vm_area_struct[mount 2550 2550 file 6dcb38433000-6dcb38434000 5b 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RCX: vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map] [<...>] RDX: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RSI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RDI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RBP: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RSP: vm_area_struct[mount 2550 2550 anon 763d68173000-763d68195000 7ffffffdd 100133(read|write|mayread|maywrite|growsdown|account)]+0x0/0xb8 [userland map] [<...>] R08: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] R09: vm_area_struct[mount 2550 2550 file 6dcb383d1000-6dcb383d3000 1cd 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] R13: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] R14: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] R15: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] </PTREGS> [<...>] Modules linked in: [<...>] ---[ end trace 0000000000000000 ]--- The list debug message as well as RBX's symbolic value point out that the object in question was allocated from 'tracefs_inode_cache' and that the list's '->next' member is at offset 0. Dumping the layout of the relevant parts of 'struct tracefs_inode' gives the following: struct tracefs_inode { union { struct inode { struct list_head { struct list_head * next; /* 0 8 */ struct list_head * prev; /* 8 8 */ } i_lru; [...] 
} vfs_inode; struct callback_head { void (*func)(struct callback_head *); /* 0 8 */ struct callback_head * next; /* 8 8 */ } rcu; }; [...] }; Above shows that 'vfs_inode.i_lru' overlaps with 'rcu' which will destroy the 'i_lru' list as soon as the 'rcu' member gets used, e.g. in call_rcu() or later when calling the RCU callback. This will disturb concurrent list traversals as well as object reuse which assumes these list heads will keep their integrity. For reproduction, the following diff manually overlays 'i_lru' with 'rcu' as, otherwise, one would require some good portion of luck for gambling an unlucky RANDSTRUCT seed: --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -629,6 +629,7 @@ struct inode { umode_t i_mode; unsigned short i_opflags; kuid_t i_uid; + struct list_head i_lru; /* inode LRU list */ kgid_t i_gid; unsigned int i_flags; @@ -690,7 +691,6 @@ struct inode { u16 i_wb_frn_avg_time; u16 i_wb_frn_history; #endif - struct list_head i_lru; /* inode LRU list */ struct list_head i_sb_list; struct list_head i_wb_list; /* backing dev writeback list */ union { The tracefs inode does not need to supply its own RCU delayed destruction of its inode. The inode code itself offers both a "destroy_inode()" callback that gets called when the last reference of the inode is released, and the "free_inode()" which is called after a RCU synchronization period from the "destroy_inode()". The tracefs code can unlink the inode from its list in the destroy_inode() callback, and the simply free it from the free_inode() callback. This should provide the same protection. Link: https://lore.kernel.org/all/20240807115143.45927-3-minipli@grsecurity.net/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ajay Kaher <ajay.kaher@broadcom.com> Cc: Ilkka =?utf-8?b?TmF1bGFww6TDpA==?= <digirigawa@gmail.com> Link: https://lore.kernel.org/20240807185402.61410544@gandalf.local.home Fixes: baa23a8d4360 ("tracefs: Reset permissions on remount if permissions are options") Reported-by: Mathias Krause <minipli@grsecurity.net> Reported-by: Brad Spengler <spender@grsecurity.net> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
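A sketch of the two-callback arrangement described above, with the inode unlinked from the tracefs list in destroy_inode() and the memory released from free_inode() after the VFS's own RCU grace period. The names approximate the tracefs code; details may differ from the actual fs/tracefs/inode.c change:

/* Approximate sketch of the destroy/free split. */
static void tracefs_destroy_inode(struct inode *inode)
{
	struct tracefs_inode *ti = get_tracefs(inode);
	unsigned long flags;

	/* last reference dropped: unlink from the tracefs-wide list now */
	spin_lock_irqsave(&tracefs_inode_lock, flags);
	list_del_rcu(&ti->list);
	spin_unlock_irqrestore(&tracefs_inode_lock, flags);
}

static void tracefs_free_inode(struct inode *inode)
{
	/* called by the VFS after an RCU grace period; safe to free */
	kmem_cache_free(tracefs_inode_cachep, get_tracefs(inode));
}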
* ASoC: cs35l56: Handle OTP read latency over SoundWireRichard Fitzgerald2024-08-141-0/+5
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit e42066df07c0fcedebb32ed56f8bc39b4bf86337 ] Use the late-read buffer in the CS35L56 SoundWire interface to read OTP memory. The OTP memory has a longer access latency than chip registers and cannot guarantee to return the data value in the SoundWire control response if the bus clock is >4.8 MHz. The Cirrus SoundWire peripheral IP exposes the bridge-to-bus read buffer and status bits. For a read from OTP the bridge status bits are polled to wait for the OTP data to be loaded into the read buffer and the data is then read from there. Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Fixes: e1830f66f6c6 ("ASoC: cs35l56: Add helper functions for amp calibration") Link: https://patch.msgid.link/20240805140839.26042-1-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>