summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* net: micrel: Fix PTP frame parsing for lan8814Horatiu Vultur2024-01-221-0/+11
| | | | | | | | | | | | | | | | | | The HW has the capability to check each frame if it is a PTP frame, which domain it is, which ptp frame type it is, different ip address in the frame. And if one of these checks fail then the frame is not timestamp. Most of these checks were disabled except checking the field minorVersionPTP inside the PTP header. Meaning that once a partner sends a frame compliant to 8021AS which has minorVersionPTP set to 1, then the frame was not timestamp because the HW expected by default a value of 0 in minorVersionPTP. This is exactly the same issue as on lan8841. Fix this issue by removing this check so the userspace can decide on this. Fixes: ece19502834d ("net: phy: micrel: 1588 support for LAN8814 phy") Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Divya Koppera <divya.koppera@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'dpll-fixes'David S. Miller2024-01-223-29/+100
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Arkadiusz Kubalewski says: ==================== dpll: fix unordered unbind/bind registerer issues Fix issues when performing unordered unbind/bind of a kernel modules which are using a dpll device with DPLL_PIN_TYPE_MUX pins. Currently only serialized bind/unbind of such use case works, fix the issues and allow for unserialized kernel module bind order. The issues are observed on the ice driver, i.e., $ echo 0000:af:00.0 > /sys/bus/pci/drivers/ice/unbind $ echo 0000:af:00.1 > /sys/bus/pci/drivers/ice/unbind results in: ice 0000:af:00.0: Removed PTP clock BUG: kernel NULL pointer dereference, address: 0000000000000010 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 7 PID: 71848 Comm: bash Kdump: loaded Not tainted 6.6.0-rc5_next-queue_19th-Oct-2023-01625-g039e5d15e451 #1 Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 RIP: 0010:ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice] Code: 41 57 4d 89 cf 41 56 41 55 4d 89 c5 41 54 55 48 89 f5 53 4c 8b 66 08 48 89 cb 4d 8d b4 24 f0 49 00 00 4c 89 f7 e8 71 ec 1f c5 <0f> b6 5b 10 41 0f b6 84 24 30 4b 00 00 29 c3 41 0f b6 84 24 28 4b RSP: 0018:ffffc902b179fb60 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff8882c1398000 RSI: ffff888c7435cc60 RDI: ffff888c7435cb90 RBP: ffff888c7435cc60 R08: ffffc902b179fbb0 R09: 0000000000000000 R10: ffff888ef1fc8050 R11: fffffffffff82700 R12: ffff888c743581a0 R13: ffffc902b179fbb0 R14: ffff888c7435cb90 R15: 0000000000000000 FS: 00007fdc7dae0740(0000) GS:ffff888c105c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000000132c24002 CR4: 00000000007706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __die+0x20/0x70 ? page_fault_oops+0x76/0x170 ? exc_page_fault+0x65/0x150 ? asm_exc_page_fault+0x22/0x30 ? ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice] ? __pfx_ice_dpll_rclk_state_on_pin_get+0x10/0x10 [ice] dpll_msg_add_pin_parents+0x142/0x1d0 dpll_pin_event_send+0x7d/0x150 dpll_pin_on_pin_unregister+0x3f/0x100 ice_dpll_deinit_pins+0xa1/0x230 [ice] ice_dpll_deinit+0x29/0xe0 [ice] ice_remove+0xcd/0x200 [ice] pci_device_remove+0x33/0xa0 device_release_driver_internal+0x193/0x200 unbind_store+0x9d/0xb0 kernfs_fop_write_iter+0x128/0x1c0 vfs_write+0x2bb/0x3e0 ksys_write+0x5f/0xe0 do_syscall_64+0x59/0x90 ? filp_close+0x1b/0x30 ? do_dup2+0x7d/0xd0 ? syscall_exit_work+0x103/0x130 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 ? syscall_exit_work+0x103/0x130 ? syscall_exit_to_user_mode+0x22/0x40 ? do_syscall_64+0x69/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7fdc7d93eb97 Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007fff2aa91028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fdc7d93eb97 RDX: 000000000000000d RSI: 00005644814ec9b0 RDI: 0000000000000001 RBP: 00005644814ec9b0 R08: 0000000000000000 R09: 00007fdc7d9b14e0 R10: 00007fdc7d9b13e0 R11: 0000000000000246 R12: 000000000000000d R13: 00007fdc7d9fb780 R14: 000000000000000d R15: 00007fdc7d9f69e0 </TASK> Modules linked in: uinput vfio_pci vfio_pci_core vfio_iommu_type1 vfio irqbypass ixgbevf snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore overlay qrtr rfkill vfat fat xfs libcrc32c rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp irdma rapl intel_cstate ib_uverbs iTCO_wdt iTCO_vendor_support acpi_ipmi intel_uncore mei_me ipmi_si pcspkr i2c_i801 ib_core mei ipmi_devintf intel_pch_thermal ioatdma i2c_smbus ipmi_msghandler lpc_ich joydev acpi_power_meter acpi_pad ext4 mbcache jbd2 sd_mod t10_pi sg ast i2c_algo_bit drm_shmem_helper drm_kms_helper ice crct10dif_pclmul ixgbe crc32_pclmul drm crc32c_intel ahci i40e libahci ghash_clmulni_intel libata mdio dca gnss wmi fuse [last unloaded: iavf] CR2: 0000000000000010 v6: - fix memory corruption on error path in patch [v5 2/4] ==================== Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
| * dpll: fix register pin with unregistered parent pinArkadiusz Kubalewski2024-01-221-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of multiple kernel module instances using the same dpll device: if only one registers dpll device, then only that one can register directly connected pins with a dpll device. When unregistered parent is responsible for determining if the muxed pin can be registered with it or not, the drivers need to be loaded in serialized order to work correctly - first the driver instance which registers the direct pins needs to be loaded, then the other instances could register muxed type pins. Allow registration of a pin with a parent even if the parent was not yet registered, thus allow ability for unserialized driver instance load order. Do not WARN_ON notification for unregistered pin, which can be invoked for described case, instead just return error. Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions") Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions") Reviewed-by: Jan Glaza <jan.glaza@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * dpll: fix userspace availability of pinsArkadiusz Kubalewski2024-01-221-2/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If parent pin was unregistered but child pin was not, the userspace would see the "zombie" pins - the ones that were registered with a parent pin (dpll_pin_on_pin_register(..)). Technically those are not available - as there is no dpll device in the system. Do not dump those pins and prevent userspace from any interaction with them. Provide a unified function to determine if the pin is available and use it before acting/responding for user requests. Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions") Reviewed-by: Jan Glaza <jan.glaza@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * dpll: fix pin dump crash for rebound moduleArkadiusz Kubalewski2024-01-223-18/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a kernel module is unbound but the pin resources were not entirely freed (other kernel module instance of the same PCI device have had kept the reference to that pin), and kernel module is again bound, the pin properties would not be updated (the properties are only assigned when memory for the pin is allocated), prop pointer still points to the kernel module memory of the kernel module which was deallocated on the unbind. If the pin dump is invoked in this state, the result is a kernel crash. Prevent the crash by storing persistent pin properties in dpll subsystem, copy the content from the kernel module when pin is allocated, instead of using memory of the kernel module. Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions") Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions") Reviewed-by: Jan Glaza <jan.glaza@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * dpll: fix broken error path in dpll_pin_alloc(..)Arkadiusz Kubalewski2024-01-221-3/+4
|/ | | | | | | | | | | If pin type is not expected, or pin properities failed to allocate memory, the unwind error path shall not destroy pin's xarrays, which were not yet initialized. Add new goto label and use it to fix broken error path. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'tun-fixes'David S. Miller2024-01-221-2/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | Yunjian Wang says: ==================== fixes for tun There are few places on the receive path where packet receives and packet drops were not accounted for. This patchset fixes that issue. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * tun: add missing rx stats accounting in tun_xdp_actYunjian Wang2024-01-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The TUN can be used as vhost-net backend, and it is necessary to count the packets transmitted from TUN to vhost-net/virtio-net. However, there are some places in the receive path that were not taken into account when using XDP. It would be beneficial to also include new accounting for successfully received bytes using dev_sw_netstats_rx_add. Fixes: 761876c857cb ("tap: XDP support") Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tun: fix missing dropped counter in tun_xdp_actYunjian Wang2024-01-221-2/+6
|/ | | | | | | | | | | | | | The commit 8ae1aff0b331 ("tuntap: split out XDP logic") includes dropped counter for XDP_DROP, XDP_ABORTED, and invalid XDP actions. Unfortunately, that commit missed the dropped counter when error occurs during XDP_TX and XDP_REDIRECT actions. This patch fixes this issue. Fixes: 8ae1aff0b331 ("tuntap: split out XDP logic") Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: fix removing a namespace with conflicting altnamesJakub Kicinski2024-01-222-0/+12
| | | | | | | | | | | | | | | | | | | | | | Mark reports a BUG() when a net namespace is removed. kernel BUG at net/core/dev.c:11520! Physical interfaces moved outside of init_net get "refunded" to init_net when that namespace disappears. The main interface name may get overwritten in the process if it would have conflicted. We need to also discard all conflicting altnames. Recent fixes addressed ensuring that altnames get moved with the main interface, which surfaced this problem. Reported-by: Марк Коренберг <socketpair@gmail.com> Link: https://lore.kernel.org/all/CAEmTpZFZ4Sv3KwqFOY2WKDHeZYdi0O7N5H1nTvcGp=SAEavtDg@mail.gmail.com/ Fixes: 7663d522099e ("net: check for altname conflicts when changing netdev's netns") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* idpf: distinguish vports by the dev_port attributeMichal Schmidt2024-01-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | idpf registers multiple netdevs (virtual ports) for one PCI function, but it does not provide a way for userspace to distinguish them with sysfs attributes. Per Documentation/ABI/testing/sysfs-class-net, it is a bug not to set dev_port for independent ports on the same PCI bus, device and function. Without dev_port set, systemd-udevd's default naming policy attempts to assign the same name ("ens2f0") to all four idpf netdevs on my test system and obviously fails, leaving three of them with the initial eth<N> name. With this patch, systemd-udevd is able to assign unique names to the netdevs (e.g. "ens2f0", "ens2f0d1", "ens2f0d2", "ens2f0d3"). The Intel-provided out-of-tree idpf driver already sets dev_port. In this patch I chose to do it in the same place in the idpf_cfg_netdev function. Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: fix busy pollingEric Dumazet2024-01-214-14/+26
| | | | | | | | | | | | | | | | | | | | | | | | Generic sk_busy_loop_end() only looks at sk->sk_receive_queue for presence of packets. Problem is that for UDP sockets after blamed commit, some packets could be present in another queue: udp_sk(sk)->reader_queue In some cases, a busy poller could spin until timeout expiration, even if some packets are available in udp_sk(sk)->reader_queue. v3: - make sk_busy_loop_end() nicer (Willem) v2: - add a READ_ONCE(sk->sk_family) in sk_is_inet() to avoid KCSAN splats. - add a sk_is_inet() check in sk_is_udp() (Willem feedback) - add a sk_is_inet() check in sk_is_tcp(). Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* llc: Drop support for ETH_P_TR_802_2.Kuniyuki Iwashima2024-01-192-11/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot reported an uninit-value bug below. [0] llc supports ETH_P_802_2 (0x0004) and used to support ETH_P_TR_802_2 (0x0011), and syzbot abused the latter to trigger the bug. write$tun(r0, &(0x7f0000000040)={@val={0x0, 0x11}, @val, @mpls={[], @llc={@snap={0xaa, 0x1, ')', "90e5dd"}}}}, 0x16) llc_conn_handler() initialises local variables {saddr,daddr}.mac based on skb in llc_pdu_decode_sa()/llc_pdu_decode_da() and passes them to __llc_lookup(). However, the initialisation is done only when skb->protocol is htons(ETH_P_802_2), otherwise, __llc_lookup_established() and __llc_lookup_listener() will read garbage. The missing initialisation existed prior to commit 211ed865108e ("net: delete all instances of special processing for token ring"). It removed the part to kick out the token ring stuff but forgot to close the door allowing ETH_P_TR_802_2 packets to sneak into llc_rcv(). Let's remove llc_tr_packet_type and complete the deprecation. [0]: BUG: KMSAN: uninit-value in __llc_lookup_established+0xe9d/0xf90 __llc_lookup_established+0xe9d/0xf90 __llc_lookup net/llc/llc_conn.c:611 [inline] llc_conn_handler+0x4bd/0x1360 net/llc/llc_conn.c:791 llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206 __netif_receive_skb_one_core net/core/dev.c:5527 [inline] __netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5641 netif_receive_skb_internal net/core/dev.c:5727 [inline] netif_receive_skb+0x58/0x660 net/core/dev.c:5786 tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1555 tun_get_user+0x53af/0x66d0 drivers/net/tun.c:2002 tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048 call_write_iter include/linux/fs.h:2020 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x8ef/0x1490 fs/read_write.c:584 ksys_write+0x20f/0x4c0 fs/read_write.c:637 __do_sys_write fs/read_write.c:649 [inline] __se_sys_write fs/read_write.c:646 [inline] __x64_sys_write+0x93/0xd0 fs/read_write.c:646 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Local variable daddr created at: llc_conn_handler+0x53/0x1360 net/llc/llc_conn.c:783 llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206 CPU: 1 PID: 5004 Comm: syz-executor994 Not tainted 6.6.0-syzkaller-14500-g1c41041124bd #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 Fixes: 211ed865108e ("net: delete all instances of special processing for token ring") Reported-by: syzbot+b5ad66046b913bc04c6f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b5ad66046b913bc04c6f Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240119015515.61898-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* llc: make llc_ui_sendmsg() more robust against bonding changesEric Dumazet2024-01-191-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot was able to trick llc_ui_sendmsg(), allocating an skb with no headroom, but subsequently trying to push 14 bytes of Ethernet header [1] Like some others, llc_ui_sendmsg() releases the socket lock before calling sock_alloc_send_skb(). Then it acquires it again, but does not redo all the sanity checks that were performed. This fix: - Uses LL_RESERVED_SPACE() to reserve space. - Check all conditions again after socket lock is held again. - Do not account Ethernet header for mtu limitation. [1] skbuff: skb_under_panic: text:ffff800088baa334 len:1514 put:14 head:ffff0000c9c37000 data:ffff0000c9c36ff2 tail:0x5dc end:0x6c0 dev:bond0 kernel BUG at net/core/skbuff.c:193 ! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 6875 Comm: syz-executor.0 Not tainted 6.7.0-rc8-syzkaller-00101-g0802e17d9aca-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : skb_panic net/core/skbuff.c:189 [inline] pc : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203 lr : skb_panic net/core/skbuff.c:189 [inline] lr : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203 sp : ffff800096f97000 x29: ffff800096f97010 x28: ffff80008cc8d668 x27: dfff800000000000 x26: ffff0000cb970c90 x25: 00000000000005dc x24: ffff0000c9c36ff2 x23: ffff0000c9c37000 x22: 00000000000005ea x21: 00000000000006c0 x20: 000000000000000e x19: ffff800088baa334 x18: 1fffe000368261ce x17: ffff80008e4ed000 x16: ffff80008a8310f8 x15: 0000000000000001 x14: 1ffff00012df2d58 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000ff0100 x9 : e28a51f1087e8400 x8 : e28a51f1087e8400 x7 : ffff80008028f8d0 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff800082b78714 x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000089 Call trace: skb_panic net/core/skbuff.c:189 [inline] skb_under_panic+0x13c/0x140 net/core/skbuff.c:203 skb_push+0xf0/0x108 net/core/skbuff.c:2451 eth_header+0x44/0x1f8 net/ethernet/eth.c:83 dev_hard_header include/linux/netdevice.h:3188 [inline] llc_mac_hdr_init+0x110/0x17c net/llc/llc_output.c:33 llc_sap_action_send_xid_c+0x170/0x344 net/llc/llc_s_ac.c:85 llc_exec_sap_trans_actions net/llc/llc_sap.c:153 [inline] llc_sap_next_state net/llc/llc_sap.c:182 [inline] llc_sap_state_process+0x1ec/0x774 net/llc/llc_sap.c:209 llc_build_and_send_xid_pkt+0x12c/0x1c0 net/llc/llc_sap.c:270 llc_ui_sendmsg+0x7bc/0xb1c net/llc/af_llc.c:997 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] sock_sendmsg+0x194/0x274 net/socket.c:767 splice_to_socket+0x7cc/0xd58 fs/splice.c:881 do_splice_from fs/splice.c:933 [inline] direct_splice_actor+0xe4/0x1c0 fs/splice.c:1142 splice_direct_to_actor+0x2a0/0x7e4 fs/splice.c:1088 do_splice_direct+0x20c/0x348 fs/splice.c:1194 do_sendfile+0x4bc/0xc70 fs/read_write.c:1254 __do_sys_sendfile64 fs/read_write.c:1322 [inline] __se_sys_sendfile64 fs/read_write.c:1308 [inline] __arm64_sys_sendfile64+0x160/0x3b4 fs/read_write.c:1308 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155 el0_svc+0x54/0x158 arch/arm64/kernel/entry-common.c:678 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:595 Code: aa1803e6 aa1903e7 a90023f5 94792f6a (d4210000) Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-and-tested-by: syzbot+2a7024e9502df538e8ef@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240118183625.4007013-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* vlan: skip nested type that is not IFLA_VLAN_QOS_MAPPINGLin Ma2024-01-191-0/+4
| | | | | | | | | | | | | | | | | | | | | | | In the vlan_changelink function, a loop is used to parse the nested attributes IFLA_VLAN_EGRESS_QOS and IFLA_VLAN_INGRESS_QOS in order to obtain the struct ifla_vlan_qos_mapping. These two nested attributes are checked in the vlan_validate_qos_map function, which calls nla_validate_nested_deprecated with the vlan_map_policy. However, this deprecated validator applies a LIBERAL strictness, allowing the presence of an attribute with the type IFLA_VLAN_QOS_UNSPEC. Consequently, the loop in vlan_changelink may parse an attribute of type IFLA_VLAN_QOS_UNSPEC and believe it carries a payload of struct ifla_vlan_qos_mapping, which is not necessarily true. To address this issue and ensure compatibility, this patch introduces two type checks that skip attributes whose type is not IFLA_VLAN_QOS_MAPPING. Fixes: 07b5b17e157b ("[VLAN]: Use rtnl_link API") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240118130306.1644001-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge branch 'bnxt_en-bug-fixes'Jakub Kicinski2024-01-195-19/+42
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Michael Chan says: ==================== bnxt_en: Bug fixes This series contains 5 miscellaneous fixes. The fixes include adding delay for FLR, buffer memory leak, RSS table size calculation, ethtool self test kernel warning, and mqprio crash. ==================== Link: https://lore.kernel.org/r/20240117234515.226944-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * bnxt_en: Fix possible crash after creating sw mqprio TCsMichael Chan2024-01-195-11/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The driver relies on netdev_get_num_tc() to get the number of HW offloaded mqprio TCs to allocate and free TX rings. This won't work and can potentially crash the system if software mqprio or taprio TCs have been setup. netdev_get_num_tc() will return the number of software TCs and it may cause the driver to allocate or free more TX rings that it should. Fix it by adding a bp->num_tc field to store the number of HW offload mqprio TCs for the device. Use bp->num_tc instead of netdev_get_num_tc(). This fixes a crash like this: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 42b8404067 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 120 PID: 8661 Comm: ifconfig Kdump: loaded Tainted: G OE 5.18.16 #1 Hardware name: Lenovo ThinkSystem SR650 V3/SB27A92818, BIOS ESE114N-2.12 04/25/2023 RIP: 0010:bnxt_hwrm_cp_ring_alloc_p5+0x10/0x90 [bnxt_en] Code: 41 5c 41 5d 41 5e c3 cc cc cc cc 41 8b 44 24 08 66 89 03 eb c6 e8 b0 f1 7d db 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 <48> 8b 06 48 89 f3 48 81 c6 28 01 00 00 0f b6 96 13 ff ff ff 44 8b RSP: 0018:ff65907660d1fa88 EFLAGS: 00010202 RAX: 0000000000000010 RBX: ff4dde1d907e4980 RCX: f400000000000000 RDX: 0000000000000010 RSI: 0000000000000000 RDI: ff4dde1d907e4980 RBP: ff4dde1d907e4980 R08: 000000000000000f R09: 0000000000000000 R10: ff4dde5f02671800 R11: 0000000000000008 R12: 0000000088888889 R13: 0500000000000000 R14: 00f0000000000000 R15: ff4dde5f02671800 FS: 00007f4b126b5740(0000) GS:ff4dde9bff600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000416f9c6002 CR4: 0000000000771ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> bnxt_hwrm_ring_alloc+0x204/0x770 [bnxt_en] bnxt_init_chip+0x4d/0x680 [bnxt_en] ? bnxt_poll+0x1a0/0x1a0 [bnxt_en] __bnxt_open_nic+0xd2/0x740 [bnxt_en] bnxt_open+0x10b/0x220 [bnxt_en] ? raw_notifier_call_chain+0x41/0x60 __dev_open+0xf3/0x1b0 __dev_change_flags+0x1db/0x250 dev_change_flags+0x21/0x60 devinet_ioctl+0x590/0x720 ? avc_has_extended_perms+0x1b7/0x420 ? _copy_from_user+0x3a/0x60 inet_ioctl+0x189/0x1c0 ? wp_page_copy+0x45a/0x6e0 sock_do_ioctl+0x42/0xf0 ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120 sock_ioctl+0x1ce/0x2e0 __x64_sys_ioctl+0x87/0xc0 do_syscall_64+0x59/0x90 ? syscall_exit_work+0x103/0x130 ? syscall_exit_to_user_mode+0x12/0x30 ? do_syscall_64+0x69/0x90 ? exc_page_fault+0x62/0x150 Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.") Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240117234515.226944-6-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * bnxt_en: Prevent kernel warning when running offline self testMichael Chan2024-01-191-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We call bnxt_half_open_nic() to setup the chip partially to run loopback tests. The rings and buffers are initialized normally so that we can transmit and receive packets in loopback mode. That means page pool buffers are allocated for the aggregation ring just like the normal case. NAPI is not needed because we are just polling for the loopback packets. When we're done with the loopback tests, we call bnxt_half_close_nic() to clean up. When freeing the page pools, we hit a WARN_ON() in page_pool_unlink_napi() because the NAPI state linked to the page pool is uninitialized. The simplest way to avoid this warning is just to initialize the NAPIs during half open and delete the NAPIs during half close. Trying to skip the page pool initialization or skip linking of NAPI during half open will be more complicated. This fix avoids this warning: WARNING: CPU: 4 PID: 46967 at net/core/page_pool.c:946 page_pool_unlink_napi+0x1f/0x30 CPU: 4 PID: 46967 Comm: ethtool Tainted: G S W 6.7.0-rc5+ #22 Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.3.8 08/31/2021 RIP: 0010:page_pool_unlink_napi+0x1f/0x30 Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 8b 47 18 48 85 c0 74 1b 48 8b 50 10 83 e2 01 74 08 8b 40 34 83 f8 ff 74 02 <0f> 0b 48 c7 47 18 00 00 00 00 c3 cc cc cc cc 66 90 90 90 90 90 90 RSP: 0018:ffa000003d0dfbe8 EFLAGS: 00010246 RAX: ff110003607ce640 RBX: ff110010baf5d000 RCX: 0000000000000008 RDX: 0000000000000000 RSI: ff110001e5e522c0 RDI: ff110010baf5d000 RBP: ff11000145539b40 R08: 0000000000000001 R09: ffffffffc063f641 R10: ff110001361eddb8 R11: 000000000040000f R12: 0000000000000001 R13: 000000000000001c R14: ff1100014553a080 R15: 0000000000003fc0 FS: 00007f9301c4f740(0000) GS:ff1100103fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f91344fa8f0 CR3: 00000003527cc005 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x81/0x140 ? page_pool_unlink_napi+0x1f/0x30 ? report_bug+0x102/0x200 ? handle_bug+0x44/0x70 ? exc_invalid_op+0x13/0x60 ? asm_exc_invalid_op+0x16/0x20 ? bnxt_free_ring.isra.123+0xb1/0xd0 [bnxt_en] ? page_pool_unlink_napi+0x1f/0x30 page_pool_destroy+0x3e/0x150 bnxt_free_mem+0x441/0x5e0 [bnxt_en] bnxt_half_close_nic+0x2a/0x40 [bnxt_en] bnxt_self_test+0x21d/0x450 [bnxt_en] __dev_ethtool+0xeda/0x2e30 ? native_queued_spin_lock_slowpath+0x17f/0x2b0 ? __link_object+0xa1/0x160 ? _raw_spin_unlock_irqrestore+0x23/0x40 ? __create_object+0x5f/0x90 ? __kmem_cache_alloc_node+0x317/0x3c0 ? dev_ethtool+0x59/0x170 dev_ethtool+0xa7/0x170 dev_ioctl+0xc3/0x530 sock_do_ioctl+0xa8/0xf0 sock_ioctl+0x270/0x310 __x64_sys_ioctl+0x8c/0xc0 do_syscall_64+0x3e/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 Fixes: 294e39e0d034 ("bnxt: hook NAPIs to page pools") Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240117234515.226944-5-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * bnxt_en: Fix RSS table entries calculation for P5_PLUS chipsMichael Chan2024-01-192-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing formula used in the driver to calculate the number of RSS table entries is to round up the number of RX rings to the next integer multiples of 64 (e.g. 64, 128, 192, ..). This is incorrect. The valid values supported by the chip are 64, 128, 256, 512 only (power of 2 starting from 64). When the number of RX rings is greater than 128, the entry size will likely be wrong. Firmware will round down the invalid value (e.g. 192 rounded down to 128) provided by the driver, causing some RSS rings to not receive any packets. We already have an existing function bnxt_calc_nr_ring_pages() to do this calculation. Use it in bnxt_get_nr_rss_ctxs() to calculate the number of RSS contexts correctly for P5_PLUS chips. Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Fixes: 7b3af4f75b81 ("bnxt_en: Add RSS support for 57500 chips.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240117234515.226944-4-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * bnxt_en: Fix memory leak in bnxt_hwrm_get_rings()Michael Chan2024-01-191-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | bnxt_hwrm_get_rings() can abort and return error when there are not enough ring resources. It aborts without releasing the HWRM DMA buffer, causing a dma_pool_destroy warning when the driver is unloaded: bnxt_en 0000:99:00.0: dma_pool_destroy bnxt_hwrm, 000000005b089ba8 busy Fixes: f1e50b276d37 ("bnxt_en: Fix trimming of P5 RX and TX rings") Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240117234515.226944-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * bnxt_en: Wait for FLR to complete during probeMichael Chan2024-01-191-0/+5
|/ | | | | | | | | | | | | The first message to firmware may fail if the device is undergoing FLR. The driver has some recovery logic for this failure scenario but we must wait 100 msec for FLR to complete before proceeding. Otherwise the recovery will always fail. Fixes: ba02629ff6cb ("bnxt_en: log firmware status on firmware init failure") Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20240117234515.226944-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* tcp: make sure init the accept_queue's spinlocks onceZhengchao Shao2024-01-194-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I run syz's reproduction C program locally, it causes the following issue: pvqspinlock: lock 0xffff9d181cd5c660 has corrupted value 0x0! WARNING: CPU: 19 PID: 21160 at __pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508) Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:__pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508) Code: 73 56 3a ff 90 c3 cc cc cc cc 8b 05 bb 1f 48 01 85 c0 74 05 c3 cc cc cc cc 8b 17 48 89 fe 48 c7 c7 30 20 ce 8f e8 ad 56 42 ff <0f> 0b c3 cc cc cc cc 0f 0b 0f 1f 40 00 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffa8d200604cb8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9d1ef60e0908 RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9d1ef60e0900 RBP: ffff9d181cd5c280 R08: 0000000000000000 R09: 00000000ffff7fff R10: ffffa8d200604b68 R11: ffffffff907dcdc8 R12: 0000000000000000 R13: ffff9d181cd5c660 R14: ffff9d1813a3f330 R15: 0000000000001000 FS: 00007fa110184640(0000) GS:ffff9d1ef60c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000000 CR3: 000000011f65e000 CR4: 00000000000006f0 Call Trace: <IRQ> _raw_spin_unlock (kernel/locking/spinlock.c:186) inet_csk_reqsk_queue_add (net/ipv4/inet_connection_sock.c:1321) inet_csk_complete_hashdance (net/ipv4/inet_connection_sock.c:1358) tcp_check_req (net/ipv4/tcp_minisocks.c:868) tcp_v4_rcv (net/ipv4/tcp_ipv4.c:2260) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205) ip_local_deliver_finish (net/ipv4/ip_input.c:234) __netif_receive_skb_one_core (net/core/dev.c:5529) process_backlog (./include/linux/rcupdate.h:779) __napi_poll (net/core/dev.c:6533) net_rx_action (net/core/dev.c:6604) __do_softirq (./arch/x86/include/asm/jump_label.h:27) do_softirq (kernel/softirq.c:454 kernel/softirq.c:441) </IRQ> <TASK> __local_bh_enable_ip (kernel/softirq.c:381) __dev_queue_xmit (net/core/dev.c:4374) ip_finish_output2 (./include/net/neighbour.h:540 net/ipv4/ip_output.c:235) __ip_queue_xmit (net/ipv4/ip_output.c:535) __tcp_transmit_skb (net/ipv4/tcp_output.c:1462) tcp_rcv_synsent_state_process (net/ipv4/tcp_input.c:6469) tcp_rcv_state_process (net/ipv4/tcp_input.c:6657) tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1929) __release_sock (./include/net/sock.h:1121 net/core/sock.c:2968) release_sock (net/core/sock.c:3536) inet_wait_for_connect (net/ipv4/af_inet.c:609) __inet_stream_connect (net/ipv4/af_inet.c:702) inet_stream_connect (net/ipv4/af_inet.c:748) __sys_connect (./include/linux/file.h:45 net/socket.c:2064) __x64_sys_connect (net/socket.c:2073 net/socket.c:2070 net/socket.c:2070) do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:82) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) RIP: 0033:0x7fa10ff05a3d Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48 RSP: 002b:00007fa110183de8 EFLAGS: 00000202 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000020000054 RCX: 00007fa10ff05a3d RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003 RBP: 00007fa110183e20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fa110184640 R13: 0000000000000000 R14: 00007fa10fe8b060 R15: 00007fff73e23b20 </TASK> The issue triggering process is analyzed as follows: Thread A Thread B tcp_v4_rcv //receive ack TCP packet inet_shutdown tcp_check_req tcp_disconnect //disconnect sock ... tcp_set_state(sk, TCP_CLOSE) inet_csk_complete_hashdance ... inet_csk_reqsk_queue_add inet_listen //start listen spin_lock(&queue->rskq_lock) inet_csk_listen_start ... reqsk_queue_alloc ... spin_lock_init spin_unlock(&queue->rskq_lock) //warning When the socket receives the ACK packet during the three-way handshake, it will hold spinlock. And then the user actively shutdowns the socket and listens to the socket immediately, the spinlock will be initialized. When the socket is going to release the spinlock, a warning is generated. Also the same issue to fastopenq.lock. Move init spinlock to inet_create and inet_accept to make sure init the accept_queue's spinlocks once. Fixes: fff1f3001cc5 ("tcp: add a spinlock to protect struct request_sock_queue") Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path") Reported-by: Ming Shu <sming56@aliyun.com> Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240118012019.1751966-1-shaozhengchao@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* selftests: bonding: Increase timeout to 1200sBenjamin Poirier2024-01-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When tests are run by runner.sh, bond_options.sh gets killed before it can complete: make -C tools/testing/selftests run_tests TARGETS="drivers/net/bonding" [...] # timeout set to 120 # selftests: drivers/net/bonding: bond_options.sh # TEST: prio (active-backup miimon primary_reselect 0) [ OK ] # TEST: prio (active-backup miimon primary_reselect 1) [ OK ] # TEST: prio (active-backup miimon primary_reselect 2) [ OK ] # TEST: prio (active-backup arp_ip_target primary_reselect 0) [ OK ] # TEST: prio (active-backup arp_ip_target primary_reselect 1) [ OK ] # TEST: prio (active-backup arp_ip_target primary_reselect 2) [ OK ] # not ok 7 selftests: drivers/net/bonding: bond_options.sh # TIMEOUT 120 seconds This test includes many sleep statements, at least some of which are related to timers in the operation of the bonding driver itself. Increase the test timeout to allow the test to complete. I ran the test in slightly different VMs (including one without HW virtualization support) and got runtimes of 13m39.760s, 13m31.238s, and 13m2.956s. Use a ~1.5x "safety factor" and set the timeout to 1200s. Fixes: 42a8d4aaea84 ("selftests: bonding: add bonding prio option test") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20240116104402.1203850a@kernel.org/#t Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240118001233.304759-1-bpoirier@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/smc: fix illegal rmb_desc access in SMC-D connection dumpWen Gu2024-01-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A crash was found when dumping SMC-D connections. It can be reproduced by following steps: - run nginx/wrk test: smc_run nginx smc_run wrk -t 16 -c 1000 -d <duration> -H 'Connection: Close' <URL> - continuously dump SMC-D connections in parallel: watch -n 1 'smcss -D' BUG: kernel NULL pointer dereference, address: 0000000000000030 CPU: 2 PID: 7204 Comm: smcss Kdump: loaded Tainted: G E 6.7.0+ #55 RIP: 0010:__smc_diag_dump.constprop.0+0x5e5/0x620 [smc_diag] Call Trace: <TASK> ? __die+0x24/0x70 ? page_fault_oops+0x66/0x150 ? exc_page_fault+0x69/0x140 ? asm_exc_page_fault+0x26/0x30 ? __smc_diag_dump.constprop.0+0x5e5/0x620 [smc_diag] ? __kmalloc_node_track_caller+0x35d/0x430 ? __alloc_skb+0x77/0x170 smc_diag_dump_proto+0xd0/0xf0 [smc_diag] smc_diag_dump+0x26/0x60 [smc_diag] netlink_dump+0x19f/0x320 __netlink_dump_start+0x1dc/0x300 smc_diag_handler_dump+0x6a/0x80 [smc_diag] ? __pfx_smc_diag_dump+0x10/0x10 [smc_diag] sock_diag_rcv_msg+0x121/0x140 ? __pfx_sock_diag_rcv_msg+0x10/0x10 netlink_rcv_skb+0x5a/0x110 sock_diag_rcv+0x28/0x40 netlink_unicast+0x22a/0x330 netlink_sendmsg+0x1f8/0x420 __sock_sendmsg+0xb0/0xc0 ____sys_sendmsg+0x24e/0x300 ? copy_msghdr_from_user+0x62/0x80 ___sys_sendmsg+0x7c/0xd0 ? __do_fault+0x34/0x160 ? do_read_fault+0x5f/0x100 ? do_fault+0xb0/0x110 ? __handle_mm_fault+0x2b0/0x6c0 __sys_sendmsg+0x4d/0x80 do_syscall_64+0x69/0x180 entry_SYSCALL_64_after_hwframe+0x6e/0x76 It is possible that the connection is in process of being established when we dump it. Assumed that the connection has been registered in a link group by smc_conn_create() but the rmb_desc has not yet been initialized by smc_buf_create(), thus causing the illegal access to conn->rmb_desc. So fix it by checking before dump. Fixes: 4b1b7d3b30a6 ("net/smc: add SMC-D diag support") Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'net-6.8-rc1' of ↵Linus Torvalds2024-01-1890-235/+1366
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf and netfilter. Previous releases - regressions: - Revert "net: rtnetlink: Enslave device before bringing it up", breaks the case inverse to the one it was trying to fix - net: dsa: fix oob access in DSA's netdevice event handler dereference netdev_priv() before check its a DSA port - sched: track device in tcf_block_get/put_ext() only for clsact binder types - net: tls, fix WARNING in __sk_msg_free when record becomes full during splice and MORE hint set - sfp-bus: fix SFP mode detect from bitrate - drv: stmmac: prevent DSA tags from breaking COE Previous releases - always broken: - bpf: fix no forward progress in in bpf_iter_udp if output buffer is too small - bpf: reject variable offset alu on registers with a type of PTR_TO_FLOW_KEYS to prevent oob access - netfilter: tighten input validation - net: add more sanity check in virtio_net_hdr_to_skb() - rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid infinite loop - amt: do not use the portion of skb->cb area which may get clobbered - mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option Misc: - spring cleanup of inactive maintainers" * tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) i40e: Include types.h to some headers ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes selftests: mlxsw: qos_pfc: Remove wrong description mlxsw: spectrum_router: Register netdevice notifier before nexthop mlxsw: spectrum_acl_tcam: Fix stack corruption mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure ethtool: netlink: Add missing ethnl_ops_begin/complete selftests: bonding: Add more missing config options selftests: netdevsim: add a config file libbpf: warn on unexpected __arg_ctx type when rewriting BTF selftests/bpf: add tests confirming type logic in kernel for __arg_ctx bpf: enforce types for __arg_ctx-tagged arguments in global subprogs bpf: extract bpf_ctx_convert_map logic and make it more reusable libbpf: feature-detect arg:ctx tag support in kernel ipvs: avoid stat macros calls from preemptible context netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description netfilter: nf_tables: skip dead set elements in netlink dump netfilter: nf_tables: do not allow mismatch field size and set key length ...
| * Merge tag 'nf-24-01-18' of ↵Jakub Kicinski2024-01-1814-63/+125
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net. Slightly larger than usual because this batch includes several patches to tighten the nf_tables control plane to reject inconsistent configuration: 1) Restrict NFTA_SET_POLICY to NFT_SET_POL_PERFORMANCE and NFT_SET_POL_MEMORY. 2) Bail out if a nf_tables expression registers more than 16 netlink attributes which is what struct nft_expr_info allows. 3) Bail out if NFT_EXPR_STATEFUL provides no .clone interface, remove existing fallback to memcpy() when cloning which might accidentally duplicate memory reference to the same object. 4) Fix br_netfilter interaction with neighbour layer. This requires three preparation patches: - Use nf_bridge_get_physinif() in nfnetlink_log - Use nf_bridge_info_exists() to check in br_netfilter context is available in nf_queue. - Pass net to nf_bridge_get_physindev() And finally, the fix which replaces physindev with physinif in nf_bridge_info. Patches from Pavel Tikhomirov. 5) Catch-all deactivation happens in the transaction, hence this oneliner to check for the next generation. This bug uncovered after the removal of the _BUSY bit, which happened in set elements back in summer 2023. 6) Ensure set (total) key length size and concat field length description is consistent, otherwise bail out. 7) Skip set element with the _DEAD flag on from the netlink dump path. A tests occasionally shows that dump is mismatching because GC might lose race to get rid of this element while a netlink dump is in progress. 8) Reject NFT_SET_CONCAT for field_count < 1. 9) Use IP6_INC_STATS in ipvs to fix preemption BUG splat, patch from Fedor Pchelkin. * tag 'nf-24-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: ipvs: avoid stat macros calls from preemptible context netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description netfilter: nf_tables: skip dead set elements in netlink dump netfilter: nf_tables: do not allow mismatch field size and set key length netfilter: nf_tables: check if catch-all set element is active in next generation netfilter: bridge: replace physindev with physinif in nf_bridge_info netfilter: propagate net to nf_bridge_get_physindev netfilter: nf_queue: remove excess nf_bridge variable netfilter: nfnetlink_log: use proper helper for fetching physinif netfilter: nft_limit: do not ignore unsupported flags netfilter: nf_tables: bail out if stateful expression provides no .clone netfilter: nf_tables: validate .maxattr at expression registration netfilter: nf_tables: reject invalid set policy ==================== Link: https://lore.kernel.org/r/20240118161726.14838-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * ipvs: avoid stat macros calls from preemptible contextFedor Pchelkin2024-01-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inside decrement_ttl() upon discovering that the packet ttl has exceeded, __IP_INC_STATS and __IP6_INC_STATS macros can be called from preemptible context having the following backtrace: check_preemption_disabled: 48 callbacks suppressed BUG: using __this_cpu_add() in preemptible [00000000] code: curl/1177 caller is decrement_ttl+0x217/0x830 CPU: 5 PID: 1177 Comm: curl Not tainted 6.7.0+ #34 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xbd/0xe0 check_preemption_disabled+0xd1/0xe0 decrement_ttl+0x217/0x830 __ip_vs_get_out_rt+0x4e0/0x1ef0 ip_vs_nat_xmit+0x205/0xcd0 ip_vs_in_hook+0x9b1/0x26a0 nf_hook_slow+0xc2/0x210 nf_hook+0x1fb/0x770 __ip_local_out+0x33b/0x640 ip_local_out+0x2a/0x490 __ip_queue_xmit+0x990/0x1d10 __tcp_transmit_skb+0x288b/0x3d10 tcp_connect+0x3466/0x5180 tcp_v4_connect+0x1535/0x1bb0 __inet_stream_connect+0x40d/0x1040 inet_stream_connect+0x57/0xa0 __sys_connect_file+0x162/0x1a0 __sys_connect+0x137/0x160 __x64_sys_connect+0x72/0xb0 do_syscall_64+0x6f/0x140 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7fe6dbbc34e0 Use the corresponding preemption-aware variants: IP_INC_STATS and IP6_INC_STATS. Found by Linux Verification Center (linuxtesting.org). Fixes: 8d8e20e2d7bb ("ipvs: Decrement ttl") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: reject NFT_SET_CONCAT with not field length descriptionPablo Neira Ayuso2024-01-171-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | It is still possible to set on the NFT_SET_CONCAT flag by specifying a set size and no field description, report EINVAL in such case. Fixes: 1b6345d4160e ("netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: skip dead set elements in netlink dumpPablo Neira Ayuso2024-01-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Delete from packet path relies on the garbage collector to purge elements with NFT_SET_ELEM_DEAD_BIT on. Skip these dead elements from nf_tables_dump_setelem() path, I very rarely see tests/shell/testcases/maps/typeof_maps_add_delete reports [DUMP FAILED] showing a mismatch in the expected output with an element that should not be there. If the netlink dump happens before GC worker run, it might show dead elements in the ruleset listing. nft_rhash_get() already skips dead elements in nft_rhash_cmp(), therefore, it already does not show the element when getting a single element via netlink control plane. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: do not allow mismatch field size and set key lengthPablo Neira Ayuso2024-01-171-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The set description provides the size of each field in the set whose sum should not mismatch the set key length, bail out otherwise. I did not manage to crash nft_set_pipapo with mismatch fields and set key length so far, but this is UB which must be disallowed. Fixes: f3a2181e16f1 ("netfilter: nf_tables: Support for sets with multiple ranged fields") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: check if catch-all set element is active in next ↵Pablo Neira Ayuso2024-01-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | generation When deactivating the catch-all set element, check the state in the next generation that represents this transaction. This bug uncovered after the recent removal of the element busy mark a2dd0233cbc4 ("netfilter: nf_tables: remove busy mark and gc batch API"). Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Cc: stable@vger.kernel.org Reported-by: lonial con <kongln9170@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: bridge: replace physindev with physinif in nf_bridge_infoPavel Tikhomirov2024-01-176-21/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An skb can be added to a neigh->arp_queue while waiting for an arp reply. Where original skb's skb->dev can be different to neigh's neigh->dev. For instance in case of bridging dnated skb from one veth to another, the skb would be added to a neigh->arp_queue of the bridge. As skb->dev can be reset back to nf_bridge->physindev and used, and as there is no explicit mechanism that prevents this physindev from been freed under us (for instance neigh_flush_dev doesn't cleanup skbs from different device's neigh queue) we can crash on e.g. this stack: arp_process neigh_update skb = __skb_dequeue(&neigh->arp_queue) neigh_resolve_output(..., skb) ... br_nf_dev_xmit br_nf_pre_routing_finish_bridge_slow skb->dev = nf_bridge->physindev br_handle_frame_finish Let's use plain ifindex instead of net_device link. To peek into the original net_device we will use dev_get_by_index_rcu(). Thus either we get device and are safe to use it or we don't get it and drop skb. Fixes: c4e70a87d975 ("netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: propagate net to nf_bridge_get_physindevPavel Tikhomirov2024-01-177-15/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a preparation patch for replacing physindev with physinif on nf_bridge_info structure. We will use dev_get_by_index_rcu to resolve device, when needed, and it requires net to be available. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_queue: remove excess nf_bridge variablePavel Tikhomirov2024-01-171-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't really need nf_bridge variable here. And nf_bridge_info_exists is better replacement for nf_bridge_info_get in case we are only checking for existence. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nfnetlink_log: use proper helper for fetching physinifPavel Tikhomirov2024-01-171-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't use physindev in __build_packet_message except for getting physinif from it. So let's switch to nf_bridge_get_physinif to get what we want directly. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nft_limit: do not ignore unsupported flagsPablo Neira Ayuso2024-01-171-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | Bail out if userspace provides unsupported flags, otherwise future extensions to the limit expression will be silently ignored by the kernel. Fixes: c7862a5f0de5 ("netfilter: nft_limit: allow to invert matching criteria") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: bail out if stateful expression provides no .clonePablo Neira Ayuso2024-01-171-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | All existing NFT_EXPR_STATEFUL provide a .clone interface, remove fallback to copy content of stateful expression since this is never exercised and bail out if .clone interface is not defined. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: validate .maxattr at expression registrationPablo Neira Ayuso2024-01-171-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct nft_expr_info allows to store up to NFT_EXPR_MAXATTR (16) attributes when parsing netlink attributes. Rise a warning in case there is ever a nft expression whose .maxattr goes beyond this number of expressions, in such case, struct nft_expr_info needs to be updated. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: reject invalid set policyPablo Neira Ayuso2024-01-171-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | Report -EINVAL in case userspace provides a unsupported set backend policy. Fixes: c50b960ccc59 ("netfilter: nf_tables: implement proper set selection") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | Merge tag 'for-netdev' of ↵Jakub Kicinski2024-01-1812-51/+806
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-01-18 We've added 10 non-merge commits during the last 5 day(s) which contain a total of 12 files changed, 806 insertions(+), 51 deletions(-). The main changes are: 1) Fix an issue in bpf_iter_udp under backward progress which prevents user space process from finishing iteration, from Martin KaFai Lau. 2) Fix BPF verifier to reject variable offset alu on registers with a type of PTR_TO_FLOW_KEYS to prevent oob access, from Hao Sun. 3) Follow up fixes for kernel- and libbpf-side logic around handling arg:ctx tagged arguments of BPF global subprogs, from Andrii Nakryiko. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: libbpf: warn on unexpected __arg_ctx type when rewriting BTF selftests/bpf: add tests confirming type logic in kernel for __arg_ctx bpf: enforce types for __arg_ctx-tagged arguments in global subprogs bpf: extract bpf_ctx_convert_map logic and make it more reusable libbpf: feature-detect arg:ctx tag support in kernel selftests/bpf: Add test for alu on PTR_TO_FLOW_KEYS bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS selftests/bpf: Test udp and tcp iter batching bpf: Avoid iter->offset making backward progress in bpf_iter_udp bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket ==================== Link: https://lore.kernel.org/r/20240118153936.11769-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * \ Merge branch 'tighten-up-arg-ctx-type-enforcement'Alexei Starovoitov2024-01-175-39/+513
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andrii Nakryiko says: ==================== Tighten up arg:ctx type enforcement Follow up fixes for kernel-side and libbpf-side logic around handling arg:ctx (__arg_ctx) tagged arguments of BPF global subprogs. Patch #1 adds libbpf feature detection of kernel-side __arg_ctx support to avoid unnecessary rewriting BTF types. With stricter kernel-side type enforcement this is now mandatory to avoid problems with using `struct bpf_user_pt_regs_t` instead of actual typedef. For __arg_ctx tagged arguments verifier is now supporting either `bpf_user_pt_regs_t` typedef or resolves it down to the actual struct (pt_regs/user_pt_regs/user_regs_struct), depending on architecture), but for old kernels without __arg_ctx support it's more backwards compatible for libbpf to use `struct bpf_user_pt_regs_t` rewrite which will work on wider range of kernels. So feature detection prevent libbpf accidentally breaking global subprogs on new kernels. We also adjust selftests to do similar feature detection (much simpler, but potentially breaking due to kernel source code refactoring, which is fine for selftests), and skip tests expecting libbpf's BTF type rewrites. Patch #2 is preparatory refactoring for patch #3 which adds type enforcement for arg:ctx tagged global subprog args. See the patch for specifics. Patch #4 adds many new cases to ensure type logic works as expected. Finally, patch #5 adds a relevant subset of kernel-side type checks to __arg_ctx cases that libbpf supports rewrite of. In libbpf's case, type violations are reported as warnings and BTF rewrite is not performed, which will eventually lead to BPF verifier complaining at program verification time. Good care was taken to avoid conflicts between bpf and bpf-next tree (which has few follow up refactorings in the same code area). Once trees converge some of the code will be moved around a bit (and some will be deleted), but with no change to functionality or general shape of the code. v2->v3: - support `bpf_user_pt_regs_t` typedef for KPROBE and PERF_EVENT (CI); v1->v2: - add user_pt_regs and user_regs_struct support for PERF_EVENT (CI); - drop FEAT_ARG_CTX_TAG enum leftover from patch #1; - fix warning about default: without break in the switch (CI). ==================== Link: https://lore.kernel.org/r/20240118033143.3384355-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | | * | libbpf: warn on unexpected __arg_ctx type when rewriting BTFAndrii Nakryiko2024-01-171-9/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On kernel that don't support arg:ctx tag, before adjusting global subprog BTF information to match kernel's expected canonical type names, make sure that types used by user are meaningful, and if not, warn and don't do BTF adjustments. This is similar to checks that kernel performs, but narrower in scope, as only a small subset of BPF program types can be accommodated by libbpf using canonical type names. Libbpf unconditionally allows `struct pt_regs *` for perf_event program types, unlike kernel, which supports that conditionally on architecture. This is done to keep things simple and not cause unnecessary false positives. This seems like a minor and harmless deviation, which in real-world programs will be caught by kernels with arg:ctx tag support anyways. So KISS principle. This logic is hard to test (especially on latest kernels), so manual testing was performed instead. Libbpf emitted the following warning for perf_event program with wrong context argument type: libbpf: prog 'arg_tag_ctx_perf': subprog 'subprog_ctx_tag' arg#0 is expected to be of `struct bpf_perf_event_data *` type Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240118033143.3384355-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | | * | selftests/bpf: add tests confirming type logic in kernel for __arg_ctxAndrii Nakryiko2024-01-171-3/+161
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a bunch of global subprogs across variety of program types to validate expected kernel type enforcement logic for __arg_ctx arguments. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240118033143.3384355-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | | * | bpf: enforce types for __arg_ctx-tagged arguments in global subprogsAndrii Nakryiko2024-01-171-0/+160
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add enforcement of expected types for context arguments tagged with arg:ctx (__arg_ctx) tag. First, any program type will accept generic `void *` context type when combined with __arg_ctx tag. Besides accepting "canonical" struct names and `void *`, for a bunch of program types for which program context is actually a named struct, we allows a bunch of pragmatic exceptions to match real-world and expected usage: - for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as canonical context argument type, where `bpf_user_pt_regs_t` is a *typedef*, not a struct; - for kprobes, we also always accept `struct pt_regs *`, as that's what actually is passed as a context to any kprobe program; - for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`) down to actual struct type and accept `struct pt_regs *`, or `struct user_pt_regs *`, or `struct user_regs_struct *`, depending on the actual struct type kernel architecture points `bpf_user_pt_regs_t` typedef to; otherwise, canonical `struct bpf_perf_event_data *` is expected; - for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's what's expected with BPF_PROG() usage; otherwise, canonical `struct bpf_raw_tracepoint_args *` is expected; - tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *` formats, both are coded as expections as tp_btf is actually a TRACING program type, which has no canonical context type; - iterator programs accept `struct bpf_iter__xxx *` structs, currently with no further iterator-type specific enforcement; - fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`; - classic tracepoint programs, as well as syscall and freplace programs allow any user-provided type. In all other cases kernel will enforce exact match of struct name to expected canonical type. And if user-provided type doesn't match that expectation, verifier will emit helpful message with expected type name. Note a bit unnatural way the check is done after processing all the arguments. This is done to avoid conflict between bpf and bpf-next trees. Once trees converge, a small follow up patch will place a simple btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch (which bpf-next tree patch refactored already), removing duplicated arg:ctx detection logic. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240118033143.3384355-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | | * | bpf: extract bpf_ctx_convert_map logic and make it more reusableAndrii Nakryiko2024-01-172-27/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor btf_get_prog_ctx_type() a bit to allow reuse of bpf_ctx_convert_map logic in more than one places. Simplify interface by returning btf_type instead of btf_member (field reference in BTF). To do the above we need to touch and start untangling btf_translate_to_vmlinux() implementation. We do the bare minimum to not regress anything for btf_translate_to_vmlinux(), but its implementation is very questionable for what it claims to be doing. Mapping kfunc argument types to kernel corresponding types conceptually is quite different from recognizing program context types. Fixing this is out of scope for this change though. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240118033143.3384355-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | | * | libbpf: feature-detect arg:ctx tag support in kernelAndrii Nakryiko2024-01-172-0/+80
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add feature detector of kernel-side arg:ctx (__arg_ctx) tag support. If this is detected, libbpf will avoid doing any __arg_ctx-related BTF rewriting and checks in favor of letting kernel handle this completely. test_global_funcs/ctx_arg_rewrite subtest is adjusted to do the same feature detection (albeit in much simpler, though round-about and inefficient, way), and skip the tests. This is done to still be able to execute this test on older kernels (like in libbpf CI). Note, BPF token series ([0]) does a major refactor and code moving of libbpf-internal feature detection "framework", so to avoid unnecessary conflicts we keep newly added feature detection stand-alone with ad-hoc result caching. Once things settle, there will be a small follow up to re-integrate everything back and move code into its final place in newly-added (by BPF token series) features.c file. [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=814209&state=* Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240118033143.3384355-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | * | selftests/bpf: Add test for alu on PTR_TO_FLOW_KEYSHao Sun2024-01-161-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a test case for PTR_TO_FLOW_KEYS alu. Testing if alu with variable offset on flow_keys is rejected. For the fixed offset success case, we already have C code coverage to verify (e.g. via bpf_flow.c). Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240115082028.9992-2-sunhao.th@gmail.com
| | * | bpf: Reject variable offset alu on PTR_TO_FLOW_KEYSHao Sun2024-01-161-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For PTR_TO_FLOW_KEYS, check_flow_keys_access() only uses fixed off for validation. However, variable offset ptr alu is not prohibited for this ptr kind. So the variable offset is not checked. The following prog is accepted: func#0 @0 0: R1=ctx() R10=fp0 0: (bf) r6 = r1 ; R1=ctx() R6_w=ctx() 1: (79) r7 = *(u64 *)(r6 +144) ; R6_w=ctx() R7_w=flow_keys() 2: (b7) r8 = 1024 ; R8_w=1024 3: (37) r8 /= 1 ; R8_w=scalar() 4: (57) r8 &= 1024 ; R8_w=scalar(smin=smin32=0, smax=umax=smax32=umax32=1024,var_off=(0x0; 0x400)) 5: (0f) r7 += r8 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r8 stack= before 4: (57) r8 &= 1024 mark_precise: frame0: regs=r8 stack= before 3: (37) r8 /= 1 mark_precise: frame0: regs=r8 stack= before 2: (b7) r8 = 1024 6: R7_w=flow_keys(smin=smin32=0,smax=umax=smax32=umax32=1024,var_off =(0x0; 0x400)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1024, var_off=(0x0; 0x400)) 6: (79) r0 = *(u64 *)(r7 +0) ; R0_w=scalar() 7: (95) exit This prog loads flow_keys to r7, and adds the variable offset r8 to r7, and finally causes out-of-bounds access: BUG: unable to handle page fault for address: ffffc90014c80038 [...] Call Trace: <TASK> bpf_dispatcher_nop_func include/linux/bpf.h:1231 [inline] __bpf_prog_run include/linux/filter.h:651 [inline] bpf_prog_run include/linux/filter.h:658 [inline] bpf_prog_run_pin_on_cpu include/linux/filter.h:675 [inline] bpf_flow_dissect+0x15f/0x350 net/core/flow_dissector.c:991 bpf_prog_test_run_flow_dissector+0x39d/0x620 net/bpf/test_run.c:1359 bpf_prog_test_run kernel/bpf/syscall.c:4107 [inline] __sys_bpf+0xf8f/0x4560 kernel/bpf/syscall.c:5475 __do_sys_bpf kernel/bpf/syscall.c:5561 [inline] __se_sys_bpf kernel/bpf/syscall.c:5559 [inline] __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:5559 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b Fix this by rejecting ptr alu with variable offset on flow_keys. Applying the patch rejects the program with "R7 pointer arithmetic on flow_keys prohibited". Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Signed-off-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20240115082028.9992-1-sunhao.th@gmail.com
| | * | Merge branch 'bpf-fix-backward-progress-bug-in-bpf_iter_udp'Alexei Starovoitov2024-01-135-12/+270
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Martin KaFai Lau says: ==================== bpf: Fix backward progress bug in bpf_iter_udp From: Martin KaFai Lau <martin.lau@kernel.org> This patch set fixes an issue in bpf_iter_udp that makes backward progress and prevents the user space process from finishing. There is a test at the end to reproduce the bug. Please see individual patches for details. v3: - Fixed the iter_fd check and local_port check in the patch 3 selftest. (Yonghong) - Moved jhash2 to test_jhash.h in the patch 3. (Yonghong) - Added explanation in the bucket selection in the patch 3. (Yonghong) v2: - Added patch 1 to fix another bug that goes back to the previous bucket - Simplify the fix in patch 2 to always reset iter->offset to 0 - Add a test case to close all udp_sk in a bucket while in the middle of the iteration. ==================== Link: https://lore.kernel.org/r/20240112190530.3751661-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | | * | selftests/bpf: Test udp and tcp iter batchingMartin KaFai Lau2024-01-134-0/+260
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch adds a test to exercise the bpf_iter_udp batching logic. It specifically tests the case that there are multiple so_reuseport udp_sk in a bucket of the udp_table. The test creates two sets of so_reuseport sockets and each set on a different port. Meaning there will be two buckets in the udp_table. The test does the following: 1. read() 3 out of 4 sockets in the first bucket. 2. close() all sockets in the first bucket. This will ensure the current bucket's offset in the kernel does not affect the read() of the following bucket. 3. read() all 4 sockets in the second bucket. The test also reads one udp_sk at a time from the bpf_iter_udp prog. The true case in "do_test(..., bool onebyone)". This is the buggy case that the previous patch fixed. It also tests the "false" case in "do_test(..., bool onebyone)", meaning the userspace reads the whole bucket. There is no bug in this case but adding this test also while at it. Considering the way to have multiple tcp_sk in the same bucket is similar (by using so_reuseport), this patch also tests the bpf_iter_tcp even though the bpf_iter_tcp batching logic works correctly. Both IP v4 and v6 are exercising the same bpf_iter batching code path, so only v6 is tested. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240112190530.3751661-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>