summaryrefslogtreecommitdiffstats
path: root/drivers/infiniband/core
Commit message (Collapse)AuthorAgeFilesLines
* RDMA/core: Silence oversized kvmalloc() warningShay Drory2025-04-091-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzkaller triggered an oversized kvmalloc() warning. Silence it by adding __GFP_NOWARN. syzkaller log: WARNING: CPU: 7 PID: 518 at mm/util.c:665 __kvmalloc_node_noprof+0x175/0x180 CPU: 7 UID: 0 PID: 518 Comm: c_repro Not tainted 6.11.0-rc6+ #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:__kvmalloc_node_noprof+0x175/0x180 RSP: 0018:ffffc90001e67c10 EFLAGS: 00010246 RAX: 0000000000000100 RBX: 0000000000000400 RCX: ffffffff8149d46b RDX: 0000000000000000 RSI: ffff8881030fae80 RDI: 0000000000000002 RBP: 000000712c800000 R08: 0000000000000100 R09: 0000000000000000 R10: ffffc90001e67c10 R11: 0030ae0601000000 R12: 0000000000000000 R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000000 FS: 00007fde79159740(0000) GS:ffff88813bdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000180 CR3: 0000000105eb4005 CR4: 00000000003706b0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ib_umem_odp_get+0x1f6/0x390 mlx5_ib_reg_user_mr+0x1e8/0x450 ib_uverbs_reg_mr+0x28b/0x440 ib_uverbs_write+0x7d3/0xa30 vfs_write+0x1ac/0x6c0 ksys_write+0x134/0x170 ? __sanitizer_cov_trace_pc+0x1c/0x50 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 37824952dc8f ("RDMA/odp: Use kvcalloc for the dma_list and page_list") Signed-off-by: Shay Drory <shayd@nvidia.com> Link: https://patch.msgid.link/c6cb92379de668be94894f49c2cfa40e73f94d56.1742388096.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/cma: Fix workqueue crash in cma_netevent_work_handlerSharath Srinivasan2025-04-091-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct rdma_cm_id has member "struct work_struct net_work" that is reused for enqueuing cma_netevent_work_handler()s onto cma_wq. Below crash[1] can occur if more than one call to cma_netevent_callback() occurs in quick succession, which further enqueues cma_netevent_work_handler()s for the same rdma_cm_id, overwriting any previously queued work-item(s) that was just scheduled to run i.e. there is no guarantee the queued work item may run between two successive calls to cma_netevent_callback() and the 2nd INIT_WORK would overwrite the 1st work item (for the same rdma_cm_id), despite grabbing id_table_lock during enqueue. Also drgn analysis [2] indicates the work item was likely overwritten. Fix this by moving the INIT_WORK() to __rdma_create_id(), so that it doesn't race with any existing queue_work() or its worker thread. [1] Trimmed crash stack: ============================================= BUG: kernel NULL pointer dereference, address: 0000000000000008 kworker/u256:6 ... 6.12.0-0... Workqueue: cma_netevent_work_handler [rdma_cm] (rdma_cm) RIP: 0010:process_one_work+0xba/0x31a Call Trace: worker_thread+0x266/0x3a0 kthread+0xcf/0x100 ret_from_fork+0x31/0x50 ret_from_fork_asm+0x1a/0x30 ============================================= [2] drgn crash analysis: >>> trace = prog.crashed_thread().stack_trace() >>> trace (0) crash_setup_regs (./arch/x86/include/asm/kexec.h:111:15) (1) __crash_kexec (kernel/crash_core.c:122:4) (2) panic (kernel/panic.c:399:3) (3) oops_end (arch/x86/kernel/dumpstack.c:382:3) ... (8) process_one_work (kernel/workqueue.c:3168:2) (9) process_scheduled_works (kernel/workqueue.c:3310:3) (10) worker_thread (kernel/workqueue.c:3391:4) (11) kthread (kernel/kthread.c:389:9) Line workqueue.c:3168 for this kernel version is in process_one_work(): 3168 strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN); >>> trace[8]["work"] *(struct work_struct *)0xffff92577d0a21d8 = { .data = (atomic_long_t){ .counter = (s64)536870912, <=== Note }, .entry = (struct list_head){ .next = (struct list_head *)0xffff924d075924c0, .prev = (struct list_head *)0xffff924d075924c0, }, .func = (work_func_t)cma_netevent_work_handler+0x0 = 0xffffffffc2cec280, } Suspicion is that pwq is NULL: >>> trace[8]["pwq"] (struct pool_workqueue *)<absent> In process_one_work(), pwq is assigned from: struct pool_workqueue *pwq = get_work_pwq(work); and get_work_pwq() is: static struct pool_workqueue *get_work_pwq(struct work_struct *work) { unsigned long data = atomic_long_read(&work->data); if (data & WORK_STRUCT_PWQ) return work_struct_pwq(data); else return NULL; } WORK_STRUCT_PWQ is 0x4: >>> print(repr(prog['WORK_STRUCT_PWQ'])) Object(prog, 'enum work_flags', value=4) But work->data is 536870912 which is 0x20000000. So, get_work_pwq() returns NULL and we crash in process_one_work(): 3168 strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN); ============================================= Fixes: 925d046e7e52 ("RDMA/core: Add a netevent notifier to cma") Cc: stable@vger.kernel.org Co-developed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Sharath Srinivasan <sharath.srinivasan@oracle.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/bf0082f9-5b25-4593-92c6-d130aa8ba439@oracle.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/ucaps: Avoid format-security warningArnd Bergmann2025-04-071-1/+1
| | | | | | | | | | | | | | | | | | | | | Passing a non-literal format string to dev_set_name causes a warning: drivers/infiniband/core/ucaps.c:173:33: error: format string is not a string literal (potentially insecure) [-Werror,-Wformat-security] 173 | ret = dev_set_name(&ucap->dev, ucap_names[type]); | ^~~~~~~~~~~~~~~~ drivers/infiniband/core/ucaps.c:173:33: note: treat the string as an argument to avoid this 173 | ret = dev_set_name(&ucap->dev, ucap_names[type]); | ^ | "%s", Turn the name into the %s argument as suggested by gcc. Fixes: 61e51682816d ("RDMA/uverbs: Introduce UCAP (User CAPabilities) API") Link: https://patch.msgid.link/r/20250314155721.264083-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
* IB/mad: Check available slots before posting receive WRsMaher Sanalla2025-03-191-18/+20
| | | | | | | | | | | | | | | | | | | | | | | The ib_post_receive_mads() function handles posting receive work requests (WRs) to MAD QPs and is called in two cases: 1) When a MAD port is opened. 2) When a receive WQE is consumed upon receiving a new MAD. Whereas, if MADs arrive during the port open phase, a race condition might cause an extra WR to be posted, exceeding the QP’s capacity. This leads to failures such as: infiniband mlx5_0: ib_post_recv failed: -12 infiniband mlx5_0: Couldn't post receive WRs infiniband mlx5_0: Couldn't start port infiniband mlx5_0: Couldn't open port 1 Fix this by checking the current receive count before posting a new WR. If the QP’s receive queue is full, do not post additional WRs. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/c4984ba3c3a98a5711a558bccefcad789587ecf1.1741875592.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/core: Pass port to counter bind/unbind operationsPatrisious Haddad2025-03-182-11/+11
| | | | | | | | | | | | | This will be useful for the next patches in the series since port number is needed for optional counters binding and unbinding. Note that this change is needed since when the operation is done qp->port isn't necessarily initialized yet and can't be used. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/b6f6797844acbd517358e8d2a270ea9b3e6ecba1.1741875070.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/core: Add support to optional-counters binding configurationPatrisious Haddad2025-03-182-11/+35
| | | | | | | | | | | | | | | | | | | Whenever a new counter is created, save inside it the user requested configuration for optional-counters binding, for manual configuration it is requested directly by the user and for the automatic configuration it depends on if the automatic binding was enabled with or without optional-counters binding. This argument will later be used by the driver to determine if to bind the optional-counters as well or not when trying to bind this counter to a QP. It indicates that when binding counters to a QP we also want the currently enabled link optional-counters to be bound as well. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/82f1c357606a16932979ef9a5910122675c74a3a.1741875070.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/core: Create and destroy rdma_counter using rdma_zalloc_drv_obj()Patrisious Haddad2025-03-182-1/+5
| | | | | | | | | | | | | | | | | | Change rdma_counter allocation to use rdma_zalloc_drv_obj() instead of, explicitly allocating at core, in order to be contained inside driver specific structures. Adjust all drivers that use it to have their containing structure, and add driver specific initialization operation. This change is needed to allow upcoming patches to implement optional-counters binding whereas inside each driver specific counter struct his bound optional-counters will be maintained. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/a5a484f421fc2e5595158e61a354fba43272b02d.1741875070.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/core: Fix use-after-free when rename device nameWang Liang2025-03-131-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot reported a slab-use-after-free with the following call trace: ================================================================== BUG: KASAN: slab-use-after-free in nla_put+0xd3/0x150 lib/nlattr.c:1099 Read of size 5 at addr ffff888140ea1c60 by task syz.0.988/10025 CPU: 0 UID: 0 PID: 10025 Comm: syz.0.988 Not tainted 6.14.0-rc4-syzkaller-00859-gf77f12010f67 #0 Hardware name: Google Compute Engine, BIOS Google 02/12/2025 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0x16e/0x5b0 mm/kasan/report.c:521 kasan_report+0x143/0x180 mm/kasan/report.c:634 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 __asan_memcpy+0x29/0x70 mm/kasan/shadow.c:105 nla_put+0xd3/0x150 lib/nlattr.c:1099 nla_put_string include/net/netlink.h:1621 [inline] fill_nldev_handle+0x16e/0x200 drivers/infiniband/core/nldev.c:265 rdma_nl_notify_event+0x561/0xef0 drivers/infiniband/core/nldev.c:2857 ib_device_notify_register+0x22/0x230 drivers/infiniband/core/device.c:1344 ib_register_device+0x1292/0x1460 drivers/infiniband/core/device.c:1460 rxe_register_device+0x233/0x350 drivers/infiniband/sw/rxe/rxe_verbs.c:1540 rxe_net_add+0x74/0xf0 drivers/infiniband/sw/rxe/rxe_net.c:550 rxe_newlink+0xde/0x1a0 drivers/infiniband/sw/rxe/rxe.c:212 nldev_newlink+0x5ea/0x680 drivers/infiniband/core/nldev.c:1795 rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline] rdma_nl_rcv+0x6dd/0x9e0 drivers/infiniband/core/netlink.c:259 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:709 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:724 ____sys_sendmsg+0x53a/0x860 net/socket.c:2564 ___sys_sendmsg net/socket.c:2618 [inline] __sys_sendmsg+0x269/0x350 net/socket.c:2650 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f42d1b8d169 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 ... RSP: 002b:00007f42d2960038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f42d1da6320 RCX: 00007f42d1b8d169 RDX: 0000000000000000 RSI: 00004000000002c0 RDI: 000000000000000c RBP: 00007f42d1c0e2a0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007f42d1da6320 R15: 00007ffe399344a8 </TASK> Allocated by task 10025: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __do_kmalloc_node mm/slub.c:4294 [inline] __kmalloc_node_track_caller_noprof+0x28b/0x4c0 mm/slub.c:4313 __kmemdup_nul mm/util.c:61 [inline] kstrdup+0x42/0x100 mm/util.c:81 kobject_set_name_vargs+0x61/0x120 lib/kobject.c:274 dev_set_name+0xd5/0x120 drivers/base/core.c:3468 assign_name drivers/infiniband/core/device.c:1202 [inline] ib_register_device+0x178/0x1460 drivers/infiniband/core/device.c:1384 rxe_register_device+0x233/0x350 drivers/infiniband/sw/rxe/rxe_verbs.c:1540 rxe_net_add+0x74/0xf0 drivers/infiniband/sw/rxe/rxe_net.c:550 rxe_newlink+0xde/0x1a0 drivers/infiniband/sw/rxe/rxe.c:212 nldev_newlink+0x5ea/0x680 drivers/infiniband/core/nldev.c:1795 rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline] rdma_nl_rcv+0x6dd/0x9e0 drivers/infiniband/core/netlink.c:259 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:709 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:724 ____sys_sendmsg+0x53a/0x860 net/socket.c:2564 ___sys_sendmsg net/socket.c:2618 [inline] __sys_sendmsg+0x269/0x350 net/socket.c:2650 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 10035: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2353 [inline] slab_free mm/slub.c:4609 [inline] kfree+0x196/0x430 mm/slub.c:4757 kobject_rename+0x38f/0x410 lib/kobject.c:524 device_rename+0x16a/0x200 drivers/base/core.c:4525 ib_device_rename+0x270/0x710 drivers/infiniband/core/device.c:402 nldev_set_doit+0x30e/0x4c0 drivers/infiniband/core/nldev.c:1146 rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline] rdma_nl_rcv+0x6dd/0x9e0 drivers/infiniband/core/netlink.c:259 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x7f6/0x990 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x8de/0xcb0 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:709 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:724 ____sys_sendmsg+0x53a/0x860 net/socket.c:2564 ___sys_sendmsg net/socket.c:2618 [inline] __sys_sendmsg+0x269/0x350 net/socket.c:2650 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f This is because if rename device happens, the old name is freed in ib_device_rename() with lock, but ib_device_notify_register() may visit the dev name locklessly by event RDMA_REGISTER_EVENT or RDMA_NETDEV_ATTACH_EVENT. Fix this by hold devices_rwsem in ib_device_notify_register(). Reported-by: syzbot+f60349ba1f9f08df349f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=25bc6f0ed2b88b9eb9b8 Fixes: 9cbed5aab5ae ("RDMA/nldev: Add support for RDMA monitoring") Signed-off-by: Wang Liang <wangliang74@huawei.com> Link: https://patch.msgid.link/20250313092421.944658-1-wangliang74@huawei.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/uverbs: Propagate errors from rdma_lookup_get_uobject()Maher Sanalla2025-03-131-68/+76
| | | | | | | | | | | | | | | | | | | | | | | | Currently, the IB uverbs API calls uobj_get_uobj_read(), which in turn uses the rdma_lookup_get_uobject() helper to retrieve user objects. In case of failure, uobj_get_uobj_read() returns NULL, overriding the error code from rdma_lookup_get_uobject(). The IB uverbs API then translates this NULL to -EINVAL, masking the actual error and complicating debugging. For example, applications calling ibv_modify_qp that fails with EBUSY when retrieving the QP uobject will see the overridden error code EINVAL instead, masking the actual error. Furthermore, based on rdma-core commit: "2a22f1ced5f3 ("Merge pull request #1568 from jakemoroni/master")" Kernel's IB uverbs return values are either ignored and passed on as is to application or overridden with other errnos in a few cases. Thus, to improve error reporting and debuggability, propagate the original error from rdma_lookup_get_uobject() instead of replacing it with EINVAL. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/64f9d3711b183984e939962c2f83383904f97dfb.1740577869.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/uverbs: Add support for UCAPs in context creationChiara Meiohas2025-03-092-0/+23
| | | | | | | | | | | | | Add support for file descriptor array attribute for GET_CONTEXT commands. Check that the file descriptor (fd) array represents fds for valid UCAPs. Store the enabled UCAPs from the fd array as a bitmask in ib_ucontext. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/ebfb30bc947e2259b193c96a319c80e82599045b.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/uverbs: Introduce UCAP (User CAPabilities) APIChiara Meiohas2025-03-093-1/+271
| | | | | | | | | | | | | | | | | | | | | | Implement a new User CAPabilities (UCAP) API to provide fine-grained control over specific firmware features. This approach offers more granular capabilities than the existing Linux capabilities, which may be too generic for certain FW features. This mechanism represents each capability as a character device with root read-write access. Root processes can grant users special privileges by allowing access to these character devices (e.g., using chown). UCAP character devices are located in /dev/infiniband and the class path is /sys/class/infiniband_ucaps. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/5a1379187cd21178e8554afc81a3c941f21af22f.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/core: Fixes infiniband sysctl boundsNicolas Bouchinet2025-03-032-2/+6
| | | | | | | | | | | | | Bound infiniband iwcm and ucma sysctl writings between SYSCTL_ZERO and SYSCTL_INT_MAX. The proc_handler has thus been updated to proc_dointvec_minmax. Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr> Link: https://patch.msgid.link/20250224095826.16458-6-nicolas.bouchinet@clip-os.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/core: Don't expose hw_counters outside of init net namespaceRoman Gushchin2025-03-032-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes") accidentally almost exposed hw counters to non-init net namespaces. It didn't expose them fully, as an attempt to read any of those counters leads to a crash like this one: [42021.807566] BUG: kernel NULL pointer dereference, address: 0000000000000028 [42021.814463] #PF: supervisor read access in kernel mode [42021.819549] #PF: error_code(0x0000) - not-present page [42021.824636] PGD 0 P4D 0 [42021.827145] Oops: 0000 [#1] SMP PTI [42021.830598] CPU: 82 PID: 2843922 Comm: switchto-defaul Kdump: loaded Tainted: G S W I XXX [42021.841697] Hardware name: XXX [42021.849619] RIP: 0010:hw_stat_device_show+0x1e/0x40 [ib_core] [42021.855362] Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 49 89 d0 4c 8b 5e 20 48 8b 8f b8 04 00 00 48 81 c7 f0 fa ff ff <48> 8b 41 28 48 29 ce 48 83 c6 d0 48 c1 ee 04 69 d6 ab aa aa aa 48 [42021.873931] RSP: 0018:ffff97fe90f03da0 EFLAGS: 00010287 [42021.879108] RAX: ffff9406988a8c60 RBX: ffff940e1072d438 RCX: 0000000000000000 [42021.886169] RDX: ffff94085f1aa000 RSI: ffff93c6cbbdbcb0 RDI: ffff940c7517aef0 [42021.893230] RBP: ffff97fe90f03e70 R08: ffff94085f1aa000 R09: 0000000000000000 [42021.900294] R10: ffff94085f1aa000 R11: ffffffffc0775680 R12: ffffffff87ca2530 [42021.907355] R13: ffff940651602840 R14: ffff93c6cbbdbcb0 R15: ffff94085f1aa000 [42021.914418] FS: 00007fda1a3b9700(0000) GS:ffff94453fb80000(0000) knlGS:0000000000000000 [42021.922423] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [42021.928130] CR2: 0000000000000028 CR3: 00000042dcfb8003 CR4: 00000000003726f0 [42021.935194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [42021.942257] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [42021.949324] Call Trace: [42021.951756] <TASK> [42021.953842] [<ffffffff86c58674>] ? show_regs+0x64/0x70 [42021.959030] [<ffffffff86c58468>] ? __die+0x78/0xc0 [42021.963874] [<ffffffff86c9ef75>] ? page_fault_oops+0x2b5/0x3b0 [42021.969749] [<ffffffff87674b92>] ? exc_page_fault+0x1a2/0x3c0 [42021.975549] [<ffffffff87801326>] ? asm_exc_page_fault+0x26/0x30 [42021.981517] [<ffffffffc0775680>] ? __pfx_show_hw_stats+0x10/0x10 [ib_core] [42021.988482] [<ffffffffc077564e>] ? hw_stat_device_show+0x1e/0x40 [ib_core] [42021.995438] [<ffffffff86ac7f8e>] dev_attr_show+0x1e/0x50 [42022.000803] [<ffffffff86a3eeb1>] sysfs_kf_seq_show+0x81/0xe0 [42022.006508] [<ffffffff86a11134>] seq_read_iter+0xf4/0x410 [42022.011954] [<ffffffff869f4b2e>] vfs_read+0x16e/0x2f0 [42022.017058] [<ffffffff869f50ee>] ksys_read+0x6e/0xe0 [42022.022073] [<ffffffff8766f1ca>] do_syscall_64+0x6a/0xa0 [42022.027441] [<ffffffff8780013b>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 The problem can be reproduced using the following steps: ip netns add foo ip netns exec foo bash cat /sys/class/infiniband/mlx4_0/hw_counters/* The panic occurs because of casting the device pointer into an ib_device pointer using container_of() in hw_stat_device_show() is wrong and leads to a memory corruption. However the real problem is that hw counters should never been exposed outside of the non-init net namespace. Fix this by saving the index of the corresponding attribute group (it might be 1 or 2 depending on the presence of driver-specific attributes) and zeroing the pointer to hw_counters group for compat devices during the initialization. With this fix applied hw_counters are not available in a non-init net namespace: find /sys/class/infiniband/mlx4_0/ -name hw_counters /sys/class/infiniband/mlx4_0/ports/1/hw_counters /sys/class/infiniband/mlx4_0/ports/2/hw_counters /sys/class/infiniband/mlx4_0/hw_counters ip netns add foo ip netns exec foo bash find /sys/class/infiniband/mlx4_0/ -name hw_counters Fixes: 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Maher Sanalla <msanalla@nvidia.com> Cc: linux-rdma@vger.kernel.org Cc: linux-kernel@vger.kernel.org Link: https://patch.msgid.link/20250227165420.3430301-1-roman.gushchin@linux.dev Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/core: Fix best page size finding when it can cross SG entriesMichael Margolin2025-02-192-15/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A single scatter-gather entry is limited by a 32 bits "length" field that is practically 4GB - PAGE_SIZE. This means that even when the memory is physically contiguous, we might need more than one entry to represent it. Additionally when using dmabuf, the sg_table might be originated outside the subsystem and optimized for other needs. For instance an SGT of 16GB GPU continuous memory might look like this: (a real life example) dma_address 34401400000, length fffff000 dma_address 345013ff000, length fffff000 dma_address 346013fe000, length fffff000 dma_address 347013fd000, length fffff000 dma_address 348013fc000, length 4000 Since ib_umem_find_best_pgsz works within SG entries, in the above case we will result with the worst possible 4KB page size. Fix this by taking into consideration only the alignment of addresses of real discontinuity points rather than treating SG entries as such, and adjust the page iterator to correctly handle cross SG entry pages. There is currently an assumption that drivers do not ask for pages bigger than maximal DMA size supported by their devices. Reviewed-by: Firas Jahjah <firasj@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> Link: https://patch.msgid.link/20250217141623.12428-1-mrgolin@amazon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/core: Use ib_port_state_to_str() for IB state sysfsMaher Sanalla2025-02-061-13/+1
| | | | | | | | | | Refactor the IB state sysfs implementation to replace the local array used for converting IB state to string with the ib_port_state_to_str() function. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/06affabbbf144f990e64b447918af39483c78c3e.1738586601.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
* IB/cache: Add log messages for IB device state changesMaher Sanalla2025-02-061-0/+6
| | | | | | | | | | | | | Enhance visibility into IB device state transitions by adding log messages to the kernel log (dmesg). Whenever an IB device changes state, a relevant print will be printed, such as: "mlx5_core 0000:08:00.0 mlx5_0: Port: 1 Link DOWN" "mlx5_core 0000:08:00.0 rdmap8s0f0: Port: 2 Link ACTIVE" Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/2d26ccbd669bad99089fa2aebb5cba4014fc4999.1738586601.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
* RDMA/rxe: Make rping work with tun deviceZhu Yanjun2025-02-031-5/+19
| | | | | | | | | | Because the type of tun device is ARPHRD_NONE, in cma, ARPHRD_NONE is not recognized. To RXE device, just as SIW does, ARPHRD_NONE is added to support. Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://patch.msgid.link/20250119172831.3123110-4-yanjun.zhu@linux.dev Signed-off-by: Leon Romanovsky <leon@kernel.org>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2025-01-244-184/+92
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull rdma updates from Jason Gunthorpe: "Lighter that normal, but the now usual collection of driver fixes and small improvements: - Small fixes and minor improvements to cxgb4, bnxt_re, rxe, srp, efa, cxgb4 - Update mlx4 to use the new umem APIs, avoiding direct use of scatterlist - Support ROCEv2 in erdma - Remove various uncalled functions, constify bin_attribute - Provide core infrastructure to catch netdev events and route them to drivers, consolidating duplicated driver code - Fix rare race condition crashes in mlx5 ODP flows" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (63 commits) RDMA/mlx5: Fix implicit ODP use after free RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with error RDMA/qib: Constify 'struct bin_attribute' RDMA/hfi1: Constify 'struct bin_attribute' RDMA/rxe: Fix the warning "__rxe_cleanup+0x12c/0x170 [rdma_rxe]" RDMA/cxgb4: Notify rdma stack for IB_EVENT_QP_LAST_WQE_REACHED event RDMA/bnxt_re: Allocate dev_attr information dynamically RDMA/bnxt_re: Pass the context for ulp_irq_stop RDMA/bnxt_re: Add support to handle DCB_CONFIG_CHANGE event RDMA/bnxt_re: Query firmware defaults of CC params during probe RDMA/bnxt_re: Add Async event handling support bnxt_en: Add ULP call to notify async events RDMA/mlx5: Fix indirect mkey ODP page count MAINTAINERS: Update the bnxt_re maintainers RDMA/hns: Clean up the legacy CONFIG_INFINIBAND_HNS RDMA/rtrs: Add missing deinit() call RDMA/efa: Align interrupt related fields to same type RDMA/bnxt_re: Fix to drop reference to the mmap entry in case of error RDMA/mlx5: Fix link status down event for MPV RDMA/erdma: Support create_ah/destroy_ah in non-sleepable contexts ...
| * RDMA/core: Support link status events dispatchingYuyu Li2024-12-241-0/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the dispatching of link status events is implemented by each RDMA driver independently, and most of them have very similar patterns. Add support for this in ib_core so that we can get rid of duplicate codes in each driver. A new last_port_state is added in ib_port_cache to cache the port state of the last link status events dispatching. The original port_state in ib_port_cache is not used here because it will be updated when ib_dispatch_event() is called, which means it may be changed between two link status events, and may lead to a loss of event dispatching. Some drivers currently have some private stuff in their link status events handler in addition to event dispatching, and cannot be perfectly integrated into the ib_core handling process. For these drivers, add a new ops report_port_event() so that they can keep their current processing. Finally, events of LAG devices are not supported yet in this patch as currently there is no way to obtain ibdev from upper netdev in ib_core. This can be a TODO work after the core have more support for LAG. Signed-off-by: Yuyu Li <liyuyu6@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/core: Add ib_query_netdev_port() to query netdev port by IB device.Yuyu Li2024-12-241-7/+32
| | | | | | | | | | | | | | | | Query the port number of a netdev associated with an ibdev. Signed-off-by: Yuyu Li <liyuyu6@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/core: Remove unused ib_copy_path_rec_from_userDr. David Alan Gilbert2024-12-241-42/+0
| | | | | | | | | | | | | | | | | | | | | | | | ib_copy_path_rec_from_user() has been unused since 2019's commit a1a8e4a85cf7 ("rdma: Delete the ib_ucm module") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20241221014021.343979-5-linux@treblig.org Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/core: Remove unused ibdev_printkDr. David Alan Gilbert2024-12-241-17/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The last use of ibdev_printk() was removed in 2019 by commit b2299e83815c ("RDMA: Delete DEBUG code") Remove it. Note: The __ibdev_printk() is still used in the idev_err etc functions so leave that. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20241221014021.343979-4-linux@treblig.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/core: Remove unused ib_find_exact_cached_pkeyDr. David Alan Gilbert2024-12-241-35/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | The last use of ib_find_exact_cached_pkey() was removed in 2012 by commit 2c75d2ccb6e5 ("IB/mlx4: Fix QP1 P_Key processing in the Primary Physical Function (PPF)") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20241221014021.343979-3-linux@treblig.org Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/core: Remove unused ib_ud_header_unpackDr. David Alan Gilbert2024-12-241-83/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ib_ud_header_unpack() is unused, and I can't see any sign of it ever having been used in git. The only reference I can find is from December 2004 BKrev: 41d30034XNbBUl0XnyC6ig9V61Nf-A when it looks like it was added. Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20241221014021.343979-2-linux@treblig.org Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2025-01-033-8/+26
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull rdma fixes from Jason Gunthorpe: "A lot of fixes accumulated over the holiday break: - Static tool fixes, value is already proven to be NULL, possible integer overflow - Many bnxt_re fixes: - Crashes due to a mismatch in the maximum SGE list size - Don't waste memory for user QPs by creating kernel-only structures - Fix compatability issues with older HW in some of the new HW features recently introduced: RTS->RTS feature, work around 9096 - Do not allow destroy_qp to fail - Validate QP MTU against device limits - Add missing validation on madatory QP attributes for RTR->RTS - Report port_num in query_qp as required by the spec - Fix creation of QPs of the maximum queue size, and in the variable mode - Allow all QPs to be used on newer HW by limiting a work around only to HW it affects - Use the correct MSN table size for variable mode QPs - Add missing locking in create_qp() accessing the qp_tbl - Form WQE buffers correctly when some of the buffers are 0 hop - Don't crash on QP destroy if the userspace doesn't setup the dip_ctx - Add the missing QP flush handler call on the DWQE path to avoid hanging on error recovery - Consistently use ENXIO for return codes if the devices is fatally errored - Try again to fix VLAN support on iwarp, previous fix was reverted due to breaking other cards - Correct error path return code for rdma netlink events - Remove the seperate net_device pointer in siw and rxe which syzkaller found a way to UAF - Fix a UAF of a stack ib_sge in rtrs - Fix a regression where old mlx5 devices and FW were wrongly activing new device features and failing" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (28 commits) RDMA/mlx5: Enable multiplane mode only when it is supported RDMA/bnxt_re: Fix error recovery sequence RDMA/rtrs: Ensure 'ib_sge list' is accessible RDMA/rxe: Remove the direct link to net_device RDMA/hns: Fix missing flush CQE for DWQE RDMA/hns: Fix warning storm caused by invalid input in IO path RDMA/hns: Fix accessing invalid dip_ctx during destroying QP RDMA/hns: Fix mapping error of zero-hop WQE buffer RDMA/bnxt_re: Fix the locking while accessing the QP table RDMA/bnxt_re: Fix MSN table size for variable wqe mode RDMA/bnxt_re: Add send queue size check for variable wqe RDMA/bnxt_re: Disable use of reserved wqes RDMA/bnxt_re: Fix max_qp_wrs reported RDMA/siw: Remove direct link to net_device RDMA/nldev: Set error code in rdma_nl_notify_event RDMA/bnxt_re: Fix reporting hw_ver in query_device RDMA/bnxt_re: Fix to export port num to ib_query_qp RDMA/bnxt_re: Fix setting mandatory attributes for modify_qp RDMA/bnxt_re: Add check for path mtu in modify_qp RDMA/bnxt_re: Fix the check for 9060 condition ...
| * | RDMA/nldev: Set error code in rdma_nl_notify_eventChiara Meiohas2024-12-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of error set the error code before the goto. Fixes: 6ff57a2ea7c2 ("RDMA/nldev: Fix NULL pointer dereferences issue in rdma_nl_notify_event") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-rdma/a84a2fc3-33b6-46da-a1bd-3343fa07eaf9@stanley.mountain/ Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/13eb25961923f5de9eb9ecbbc94e26113d6049ef.1733815944.git.leonro@nvidia.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * | RDMA/core: Fix ENODEV error for iWARP test over vlanAnumula Murali Mohan Reddy2024-12-101-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If traffic is over vlan, cma_validate_port() fails to match net_device ifindex with bound_if_index and results in ENODEV error. As iWARP gid table is static, it contains entry corresponding to only one net device which is either real netdev or vlan netdev for cases like siw attached to a vlan interface. This patch fixes the issue by assigning bound_if_index with net device index, if real net device obtained from bound if index matches with net device retrieved from gid table Fixes: f8ef1be816bf ("RDMA/cma: Avoid GID lookups on iWARP devices") Link: https://lore.kernel.org/all/ZzNgdrjo1kSCGbRz@chelsio.com/ Signed-off-by: Anumula Murali Mohan Reddy <anumula@chelsio.com> Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Link: https://patch.msgid.link/20241203140052.3985-1-anumula@chelsio.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * | RDMA/uverbs: Prevent integer overflow issueDan Carpenter2024-12-041-7/+9
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the expression "cmd.wqe_size * cmd.wr_count", both variables are u32 values that come from the user so the multiplication can lead to integer wrapping. Then we pass the result to uverbs_request_next_ptr() which also could potentially wrap. The "cmd.sge_count * sizeof(struct ib_uverbs_sge)" multiplication can also overflow on 32bit systems although it's fine on 64bit systems. This patch does two things. First, I've re-arranged the condition in uverbs_request_next_ptr() so that the use controlled variable "len" is on one side of the comparison by itself without any math. Then I've modified all the callers to use size_mul() for the multiplications. Fixes: 67cdb40ca444 ("[IB] uverbs: Implement more commands") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/b8765ab3-c2da-4611-aae0-ddd6ba173d23@stanley.mountain Signed-off-by: Leon Romanovsky <leon@kernel.org>
* / module: Convert symbol namespace to string literalPeter Zijlstra2024-12-021-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2024-11-227-123/+240
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull rdma updates from Jason Gunthorpe: "Seveal fixes scattered across the drivers and a few new features: - Minor updates and bug fixes to hfi1, efa, iopob, bnxt, hns - Force disassociate the userspace FD when hns does an async reset - bnxt new features for optimized modify QP to skip certain stayes, CQ coalescing, better debug dumping - mlx5 new data placement ordering feature - Faster destruction of mlx5 devx HW objects - Improvements to RDMA CM mad handling" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (51 commits) RDMA/bnxt_re: Correct the sequence of device suspend RDMA/bnxt_re: Use the default mode of congestion control RDMA/bnxt_re: Support different traffic class IB/cm: Rework sending DREQ when destroying a cm_id IB/cm: Do not hold reference on cm_id unless needed IB/cm: Explicitly mark if a response MAD is a retransmission RDMA/mlx5: Move events notifier registration to be after device registration RDMA/bnxt_re: Cache MSIx info to a local structure RDMA/bnxt_re: Refurbish CQ to NQ hash calculation RDMA/bnxt_re: Refactor NQ allocation RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved RDMA/hns: Fix different dgids mapping to the same dip_idx RDMA/bnxt_re: Add set_func_resources support for P5/P7 adapters RDMA/bnxt_re: Enhance RoCE SRIOV resource configuration design bnxt_en: Add support for RoCE sriov configuration RDMA/hns: Fix NULL pointer derefernce in hns_roce_map_mr_sg() RDMA/hns: Fix out-of-order issue of requester when setting FENCE RDMA/nldev: Add IB device and net device rename events RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation RDMA/core: Move ib_uverbs_file struct to uverbs_types.h ...
| * IB/cm: Rework sending DREQ when destroying a cm_idSean Hefty2024-11-171-21/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A DREQ is sent in 2 situations: 1. When requested by the user. This DREQ has to wait for a DREP, which will be routed to the user. 2. When the cm_id is destroyed. This DREQ is generated by the CM to notify the peer that the connection has been destroyed. In the latter case, any DREP that is received will be discarded. There's no need to hold a reference on the cm_id. Today, both situations are covered by the same function: cm_send_dreq_locked(). When invoked in the cm_id destroy path, the cm_id reference would be held until the DREQ completes, blocking the destruction. Because it could take several seconds to minutes before the DREQ receives a DREP, the destroy call posts a send for the DREQ then immediately cancels the MAD. However, cancellation is not immediate in the MAD layer. There could still be a delay before the MAD layer returns the DREQ to the CM. Moreover, the only guarantee is that the DREQ will be sent at most once. Introduce a separate flow for sending a DREQ when destroying the cm_id. The new flow will not hold a reference on the cm_id, allowing it to be cleaned up immediately. The cancellation trick is no longer needed. The MAD layer will send the DREQ exactly once. Signed-off-by: Sean Hefty <shefty@nvidia.com> Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/a288a098b8e0550305755fd4a7937431699317f4.1731495873.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * IB/cm: Do not hold reference on cm_id unless neededSean Hefty2024-11-171-42/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Typically, when the CM sends a MAD it bumps a reference count on the associated cm_id. There are some exceptions, such as when the MAD is a direct response to a receive MAD. For example, the CM may generate an MRA in response to a duplicate REQ. But, in general, if a MAD may be sent as a result of the user invoking an API call (e.g. ib_send_cm_rep(), ib_send_cm_rtu(), etc.), a reference is taken on the cm_id. This reference is necessary if the MAD requires a response. The reference allows routing a response MAD back to the cm_id, or, if no response is received, allows updating the cm_id state to reflect the failure. For MADs which do not generate a response from the target, however, there's no need to hold a reference on the cm_id. Such MADs will not be retried by the MAD layer and their completions do not change the state of the cm_id. There are 2 internal calls used to allocate MADs which take a reference on the cm_id: cm_alloc_msg() and cm_alloc_priv_msg(). The latter calls the former. It turns out that all other places where cm_alloc_msg() is called are for MADs that do not generate a response from the target: sending an RTU, DREP, REJ, MRA, or SIDR REP. In all of these cases, there's no need to hold a reference on the cm_id. The benefit of dropping unneeded references is that it allows destruction of the cm_id to proceed immediately. Currently, the cm_destroy_id() call blocks as long as there's a reference held on the cm_id. Worse, is that cm_destroy_id() will send MADs, which it then needs to complete. Sending the MADs is beneficial, as they notify the peer that a connection is being destroyed. However, since the MADs hold a reference on the cm_id, they block destruction and cannot be retried. Move cm_id referencing from cm_alloc_msg() to cm_alloc_priv_msg(). The latter should hold a reference on the cm_id in all cases but one, which will be handled in a separate patch. cm_alloc_priv_msg() is used when sending a REQ, REP, DREQ, and SIDR REQ, all of which require a response. Also, merge common code into cm_alloc_priv_msg() and combine the freeing of all messages which do not need a response. Signed-off-by: Sean Hefty <shefty@nvidia.com> Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/1f0f96acace72790ecf89087fc765dead960189e.1731495873.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * IB/cm: Explicitly mark if a response MAD is a retransmissionSean Hefty2024-11-171-20/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In several situations the CM may send a reply to a received MAD without the reply being directly linked with a cm_id. For example, it may send a REJ in response to a REQ which does not match a listener. Or, it may send a DREP in response to a DREQ if the cm_id has already been destroyed. This can happen if the original DREP was lost and the DREQ was retried. When such a response MAD completes, it updates a counter tracking how many MADs were retried. However, not all response MADs issued directly by the CM may be retries. The REJ mentioned in the example above is such a case. To distinguish between responses which were retries versus those that are not, the send_handler performs the following check: is a retry if the response is not associated with a cm_id and the response is not a REJ message. Replace this indirect method of checking if a response is a retry with an explicit check. Note that these retries are generated directly by the CM, rather than retried by the MAD layer. This change will be needed by later changes which would otherwise break the indirect check. Signed-off-by: Sean Hefty <shefty@nvidia.com> Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/1ee6e2a68f8de1992b9da23aa1d7e3f9f25e0036.1731495873.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/nldev: Add IB device and net device rename eventsChiara Meiohas2024-11-042-2/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement event sending for IB device rename and IB device port associated netdevice rename. In iproute2, rdma monitor displays the IB device name, port and the netdevice name when displaying event info. Since users can modiy these names, we track and notify on renaming events. Note: In order to receive netdevice rename events, drivers must use the ib_device_set_netdev() API when attaching net devices to IB devices. $ rdma monitor $ rmmod mlx5_ib [UNREGISTER] dev 1 rocep8s0f1 [UNREGISTER] dev 0 rocep8s0f0 $ modprobe mlx5_ib [REGISTER] dev 2 mlx5_0 [NETDEV_ATTACH] dev 2 mlx5_0 port 1 netdev 4 eth2 [REGISTER] dev 3 mlx5_1 [NETDEV_ATTACH] dev 3 mlx5_1 port 1 netdev 5 eth3 [RENAME] dev 2 rocep8s0f0 [RENAME] dev 3 rocep8s0f1 $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev [UNREGISTER] dev 2 rocep8s0f0 [REGISTER] dev 4 mlx5_0 [NETDEV_ATTACH] dev 4 mlx5_0 port 30 netdev 4 eth2 [RENAME] dev 4 rdmap8s0f0 $ echo 4 > /sys/class/net/eth2/device/sriov_numvfs [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 2 netdev 7 eth4 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 3 netdev 8 eth5 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 4 netdev 9 eth6 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 5 netdev 10 eth7 [REGISTER] dev 5 mlx5_0 [NETDEV_ATTACH] dev 5 mlx5_0 port 1 netdev 11 eth8 [REGISTER] dev 6 mlx5_1 [NETDEV_ATTACH] dev 6 mlx5_1 port 1 netdev 12 eth9 [RENAME] dev 5 rocep8s0f0v0 [RENAME] dev 6 rocep8s0f0v1 [REGISTER] dev 7 mlx5_0 [NETDEV_ATTACH] dev 7 mlx5_0 port 1 netdev 13 eth10 [RENAME] dev 7 rocep8s0f0v2 [REGISTER] dev 8 mlx5_0 [NETDEV_ATTACH] dev 8 mlx5_0 port 1 netdev 14 eth11 [RENAME] dev 8 rocep8s0f0v3 $ ip link set eth2 name myeth2 [NETDEV_RENAME] netdev 4 myeth2 $ ip link set eth1 name myeth1 ** no events received, because eth1 is not attached to an IB device ** Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/093c978ef2766fd3ab4ff8798eeb68f2f11582f6.1730367038.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/core: Move ib_uverbs_file struct to uverbs_types.hPatrisious Haddad2024-11-042-33/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | In light of the previous commit, make the ib_uverbs_file accessible to drivers by moving its definition to uverbs_types.h, to allow drivers to freely access the struct argument and create a personalized cleanup flow. For the same reason expose uverbs_try_lock_object function to allow driver to safely access the uverbs objects. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/29b718e0dca35daa5f496320a39284fc1f5a1722.1730373303.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/core: Add device ufile cleanup operationPatrisious Haddad2024-11-042-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | Add a driver operation to allow preemptive cleanup of ufile HW resources before the standard ufile cleanup flow begins. Thus, expediting the final cleanup phase which leads to fast teardown overall. This allows the use of driver specific clean up procedures to make the cleanup process more efficient. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/cabe00d75132b5732cb515944e3c500a01fb0b4a.1730373303.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/core: Implement RoCE GID port rescan and export delete functionChiara Meiohas2024-11-041-4/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rdma_roce_rescan_port() scans all network devices in the system and adds the gids if relevant to the RoCE device port. When not in bonding mode it adds the GIDs of the netdevice in this port. When in bonding mode it adds the GIDs of both the port's netdevice and the bond master netdevice. Export roce_del_all_netdev_gids(), which removes all GIDs associated with a specific netdevice for a given port. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/674d498da4637a1503ff1367e28bd09ff942fd5e.1730381292.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/core: Provide rdma_user_mmap_disassociate() to disassociate mmap pagesChengchang Tang2024-10-072-2/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a new api rdma_user_mmap_disassociate() for drivers to disassociate mmap pages for a device. Since drivers can now disassociate mmaps by calling this api, introduce a new disassociation_lock to specifically prevent races between this disassociation process and new mmaps. And thus the old hw_destroy_rwsem is not needed in this api. Signed-off-by: Chengchang Tang <tangchengchang@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20240927103323.1897094-2-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
* | Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2024-11-182-18/+9
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...
| * | deal with the last remaing boolean uses of fd_file()Al Viro2024-11-031-5/+3
| | | | | | | | | | | | | | | Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | fdget(), more trivial conversionsAl Viro2024-11-031-13/+6
| |/ | | | | | | | | | | | | | | | | | | all failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. [xfs_ioc_commit_range() chunk moved here as well] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Revert "RDMA/core: Fix ENODEV error for iWARP test over vlan"Leon Romanovsky2024-11-121-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The citied commit in Fixes line caused to regression for udaddy [1] application. It doesn't work over VLANs anymore. Client: ifconfig eth2 1.1.1.1 ip link add link eth2 name p0.3597 type vlan protocol 802.1Q id 3597 ip link set dev p0.3597 up ip addr add 2.2.2.2/16 dev p0.3597 udaddy -S 847 -C 220 -c 2 -t 0 -s 2.2.2.3 -b 2.2.2.2 Server: ifconfig eth2 1.1.1.3 ip link add link eth2 name p0.3597 type vlan protocol 802.1Q id 3597 ip link set dev p0.3597 up ip addr add 2.2.2.3/16 dev p0.3597 udaddy -S 847 -C 220 -c 2 -t 0 -b 2.2.2.3 [1] https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/examples/udaddy.c Fixes: 5069d7e202f6 ("RDMA/core: Fix ENODEV error for iWARP test over vlan") Reported-by: Leon Romanovsky <leonro@nvidia.com> Closes: https://lore.kernel.org/all/20241110130746.GA48891@unreal Link: https://patch.msgid.link/bb9d403419b2b9566da5b8bf0761fa8377927e49.1731401658.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
* | RDMA/core: Fix ENODEV error for iWARP test over vlanAnumula Murali Mohan Reddy2024-10-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If traffic is over vlan, cma_validate_port() fails to match vlan net_device ifindex with bound_if_index and results in ENODEV error. It is because rdma_copy_src_l2_addr() always assigns bound_if_index with real net_device ifindex. This patch fixes the issue by assigning bound_if_index with vlan net_device index if traffic is over vlan. Fixes: f8ef1be816bf ("RDMA/cma: Avoid GID lookups on iWARP devices") Signed-off-by: Anumula Murali Mohan Reddy <anumula@chelsio.com> Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Link: https://patch.msgid.link/20241008114334.146702-1-anumula@chelsio.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
* | RDMA/nldev: Fix NULL pointer dereferences issue in rdma_nl_notify_eventQianqiang Liu2024-10-081-0/+2
|/ | | | | | | | | | | nlmsg_put() may return a NULL pointer assigned to nlh, which will later be dereferenced in nlmsg_end(). Fixes: 9cbed5aab5ae ("RDMA/nldev: Add support for RDMA monitoring") Link: https://patch.msgid.link/r/Zva71Yf3F94uxi5A@iZbp1asjb3cy8ks0srf007Z Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
* [tree-wide] finally take no_llseek outAl Viro2024-09-273-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2024-09-249-65/+267
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull rdma updates from Jason Gunthorpe: "Usual collection of small improvements and fixes, nothing especially stands out to me here. The new multipath PCI feature is a sign of things to come, I think we will see more of this in the next 10 years. Broadcom and HNS continue to update their drivers for their new HW generations. Summary: - Bug fixes and minor improvments in cxgb4, siw, mlx5, rxe, efa, rts, hfi, erdma, hns, irdma - Code cleanups/typos/etc. Tidy alloc_ordered_workqueue() calls - Multipath PCI for mlx5 - Variable size work queue, SRQ changes, and relaxed ordering for new bnxt HW - New ODP fault resolution FW protocol in mlx5 - New 'rdma monitor' netlink mechanism" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (99 commits) RDMA/bnxt_re: Remove the unused variable en_dev RDMA/nldev: Add missing break in rdma_nl_notify_err_msg() RDMA/irdma: fix error message in irdma_modify_qp_roce() RDMA/cxgb4: Added NULL check for lookup_atid RDMA/hns: Fix ah error counter in sw stat not increasing RDMA/bnxt_re: Recover the device when FW error is detected RDMA/bnxt_re: Group all operations under add_device and remove_device RDMA/bnxt_re: Use the aux device for L2 ULP callbacks RDMA/bnxt_re: Change aux driver data to en_info to hold more information RDMA/nldev: Expose whether RDMA monitoring is supported RDMA/nldev: Add support for RDMA monitoring RDMA/mlx5: Use IB set_netdev and get_netdev functions RDMA/device: Remove optimization in ib_device_get_netdev() RDMA/mlx5: Initialize phys_port_cnt earlier in RDMA device creation RDMA/mlx5: Obtain upper net device only when needed RDMA/mlx5: Check RoCE LAG status before getting netdev RDMA/mlx5: Consider the query_vuid cap for data_direct net/mlx5: Handle memory scheme ODP capabilities RDMA/mlx5: Add implicit MR handling to ODP memory scheme RDMA/mlx5: Add handling for memory scheme page fault events ...
| * RDMA/nldev: Add missing break in rdma_nl_notify_err_msg()Nathan Chancellor2024-09-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clang warns (or errors with CONFIG_WERROR=y): drivers/infiniband/core/nldev.c:2795:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] 2795 | default: | ^ Clang is a little more pedantic than GCC, which does not warn when falling through to a case that is just break or return. Clang's version is more in line with the kernel's own stance in deprecated.rst, which states that all switch/case blocks must end in either break, fallthrough, continue, goto, or return. Add the missing break to silence the warning. Fixes: 9cbed5aab5ae ("RDMA/nldev: Add support for RDMA monitoring") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Link: https://patch.msgid.link/20240916-rdma-fix-clang-fallthrough-nl_notify_err_msg-v1-1-89de6a7423f1@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/nldev: Expose whether RDMA monitoring is supportedChiara Meiohas2024-09-131-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend the "rdma sys" command to display whether RDMA monitoring is supported. RDMA monitoring is not supported in mlx4 because it does not use the ib_device_set_netdev() API, which sends the RDMA events. Example output for kernel where monitoring is supported: $ rdma sys show netns shared privileged-qkey off monitor on copy-on-fork on Example output for kernel where monitoring is not supported: $ rdma sys show netns shared privileged-qkey off monitor off copy-on-fork on Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/20240909173025.30422-8-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/nldev: Add support for RDMA monitoringChiara Meiohas2024-09-133-0/+160
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a new netlink command to allow rdma event monitoring. The rdma events supported now are IB device registration/unregistration and net device attachment/detachment. Example output of rdma monitor and the commands which trigger the events: $ rdma monitor $ rmmod mlx5_ib [UNREGISTER] dev 1 rocep8s0f1 [UNREGISTER] dev 0 rocep8s0f0 $ modprobe mlx5_ib [REGISTER] dev 2 mlx5_0 [NETDEV_ATTACH] dev 2 mlx5_0 port 1 netdev 4 eth2 [REGISTER] dev 3 mlx5_1 [NETDEV_ATTACH] dev 3 mlx5_1 port 1 netdev 5 eth3 $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev [UNREGISTER] dev 2 rocep8s0f0 [REGISTER] dev 4 mlx5_0 [NETDEV_ATTACH] dev 4 mlx5_0 port 30 netdev 4 eth2 $ echo 4 > /sys/class/net/eth2/device/sriov_numvfs [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 2 netdev 7 eth4 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 3 netdev 8 eth5 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 4 netdev 9 eth6 [NETDEV_ATTACH] dev 4 rdmap8s0f0 port 5 netdev 10 eth7 [REGISTER] dev 5 mlx5_0 [NETDEV_ATTACH] dev 5 mlx5_0 port 1 netdev 11 eth8 [REGISTER] dev 6 mlx5_0 [NETDEV_ATTACH] dev 6 mlx5_0 port 1 netdev 12 eth9 [REGISTER] dev 7 mlx5_0 [NETDEV_ATTACH] dev 7 mlx5_0 port 1 netdev 13 eth10 [REGISTER] dev 8 mlx5_0 [NETDEV_ATTACH] dev 8 mlx5_0 port 1 netdev 14 eth11 $ echo 0 > /sys/class/net/eth2/device/sriov_numvfs [UNREGISTER] dev 5 rocep8s0f0v0 [UNREGISTER] dev 6 rocep8s0f0v1 [UNREGISTER] dev 7 rocep8s0f0v2 [UNREGISTER] dev 8 rocep8s0f0v3 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 2 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 3 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 4 [NETDEV_DETACH] dev 4 rdmap8s0f0 port 5 Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/20240909173025.30422-7-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
| * RDMA/mlx5: Use IB set_netdev and get_netdev functionsChiara Meiohas2024-09-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The IB layer provides a common interface to store and get net devices associated to an IB device port (ib_device_set_netdev() and ib_device_get_netdev()). Previously, mlx5_ib stored and managed the associated net devices internally. Replace internal net device management in mlx5_ib with ib_device_set_netdev() when attaching/detaching a net device and ib_device_get_netdev() when retrieving the net device. Export ib_device_get_netdev(). For mlx5 representors/PFs/VFs and lag creation we replace the netdev assignments with the IB set/get netdev functions. In active-backup mode lag the active slave net device is stored in the lag itself. To assure the net device stored in a lag bond IB device is the active slave we implement the following: - mlx5_core: when modifying the slave of a bond we send the internal driver event MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE. - mlx5_ib: when catching the event call ib_device_set_netdev() This patch also ensures the correct IB events are sent in switchdev lag. While at it, when in multiport eswitch mode, only a single IB device is created for all ports. The said IB device will receive all netdev events of its VFs once loaded, thus to avoid overwriting the mapping of PF IB device to PF netdev, ignore NETDEV_REGISTER events if the ib device has already been mapped to a netdev. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/20240909173025.30422-6-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>