path: root/drivers/infiniband

* RDMA/bnxt_re: Avoid calling wake_up threads from spin_lock context (Kashyap Desai, 2023-07-19; 1 file, -3/+10)

  [ Upstream commit 3099bcdc19b701f732f638ee45679858c08559bb ]

  bnxt_qplib_service_creq can be called from interrupt, tasklet or process context, so the function takes the irq variant of spin_lock. But when wake_up is invoked with the lock held, it puts the calling context to sleep.

    [exception RIP: __wake_up_common+190]
    RIP: ffffffffb7539d7e RSP: ffffa73300207ad8 RFLAGS: 00000083
    RAX: 0000000000000001 RBX: ffff91fa295f69b8 RCX: dead000000000200
    RDX: ffffa733344af940 RSI: ffffa73336527940 RDI: ffffa73336527940
    RBP: 000000000000001c R8: 0000000000000002 R9: 00000000000299c0
    R10: 0000017230de82c5 R11: 0000000000000002 R12: ffffa73300207b28
    R13: 0000000000000000 R14: ffffa733341bf928 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

  Call the wakeup after releasing the lock.

  Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
  Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Link: https://lore.kernel.org/r/1686308514-11996-3-git-send-email-selvin.xavier@broadcom.com
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

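  The fix pattern here is a general one: make the wake-up decision while holding the lock, but issue wake_up() only after the unlock. A minimal kernel-style sketch with illustrative names (demo_ctx, cmdq_wait), not the actual bnxt_re symbols:

    #include <linux/spinlock.h>
    #include <linux/wait.h>

    struct demo_ctx {
            spinlock_t lock;
            wait_queue_head_t cmdq_wait;
            bool cmd_done;
    };

    static void demo_complete_cmd(struct demo_ctx *ctx)
    {
            unsigned long flags;
            bool wake;

            spin_lock_irqsave(&ctx->lock, flags);
            ctx->cmd_done = true;
            wake = true;                    /* record the decision only */
            spin_unlock_irqrestore(&ctx->lock, flags);

            if (wake)                       /* safe: no spinlock held now */
                    wake_up(&ctx->cmdq_wait);
    }
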
* RDMA/bnxt_re: wraparound mbox producer index (Kashyap Desai, 2023-07-19; 1 file, -2/+2)

  [ Upstream commit 0af91306e17ef3d18e5f100aa58aa787869118af ]

  The driver is not handling the wraparound of the mbox producer index correctly: currently the wraparound happens once u32 max is reached. Bit 31 of the producer index register is special and should be set only once, for the first command. Because the producer index overflow sets bit 31 again after a long time, the FW re-runs its initialization sequence, and this causes a FW hang.

  The fix is to wrap the mbox producer index around once it reaches u16 max.

  Fixes: cee0c7bba486 ("RDMA/bnxt_re: Refactor command queue management code")
  Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
  Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Link: https://lore.kernel.org/r/1686308514-11996-2-git-send-email-selvin.xavier@broadcom.com
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

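  A sketch of the index arithmetic being described; FIRST_CMD_BIT is a hypothetical name for the special bit 31 (the real register layout lives in the bnxt_re headers):

    #include <linux/bits.h>
    #include <linux/types.h>

    #define FIRST_CMD_BIT   BIT(31)         /* hypothetical name */

    /* Advance the mbox producer index, wrapping at u16 max so the
     * 32-bit value can never creep up into bit 31 on its own. */
    static u32 demo_next_prod(u32 prod, bool first_cmd)
    {
            u32 idx = (prod + 1) & 0xffff;

            /* bit 31 is set exactly once, for the first command */
            return first_cmd ? (idx | FIRST_CMD_BIT) : idx;
    }
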
* RDMA/rxe: Fix access checks in rxe_check_bind_mw (Bob Pearson, 2023-07-19; 1 file, -8/+9)

  [ Upstream commit 425e1c9018fdf25cb4531606cc92d9d01a55534f ]

  The subroutine rxe_check_bind_mw() in rxe_mw.c performs checks on the mw access flags before they are set, so they always succeed. This patch instead checks the access flags passed in the send wqe.

  Fixes: 32a577b4c3a9 ("RDMA/rxe: Add support for bind MW work requests")
  Link: https://lore.kernel.org/r/20230530221334.89432-4-rpearsonhpe@gmail.com
  Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/rxe: Replace pr_xxx by rxe_dbg_xxx in rxe_mw.c (Bob Pearson, 2023-07-19; 1 file, -10/+10)

  [ Upstream commit e8a87efdf87455454d0a14fd486c679769bfeee2 ]

  Replace calls to pr_xxx() in rxe_mw.c with rxe_dbg_xxx().

  Link: https://lore.kernel.org/r/20221103171013.20659-6-rpearsonhpe@gmail.com
  Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Stable-dep-of: 425e1c9018fd ("RDMA/rxe: Fix access checks in rxe_check_bind_mw")
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/rxe: Add ibdev_dbg macros for rxe (Bob Pearson, 2023-07-19; 1 file, -0/+19)

  [ Upstream commit 4554bac48a8c464ff00136a64efe8847e4da4ea8 ]

  Add macros borrowed from siw to call the dynamic debug macro ibdev_dbg.

  Link: https://lore.kernel.org/r/20221103171013.20659-2-rpearsonhpe@gmail.com
  Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Stable-dep-of: 425e1c9018fd ("RDMA/rxe: Fix access checks in rxe_check_bind_mw")
  Signed-off-by: Sasha Levin <sashal@kernel.org>

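  A sketch of what such wrappers typically look like; the exact macro names and per-object variants in the rxe patch are assumptions here, but ibdev_dbg() itself is the in-tree dynamic-debug helper from <rdma/ib_verbs.h>:

    #include <rdma/ib_verbs.h>

    /* Route driver debug prints through the ib_device so they can be
     * toggled per-device with dynamic debug. */
    #define demo_dbg(rxe, fmt, ...)                                   \
            ibdev_dbg(&(rxe)->ib_dev, "%s: " fmt, __func__,           \
                      ##__VA_ARGS__)

    #define demo_dbg_qp(qp, fmt, ...)                                 \
            ibdev_dbg((qp)->ibqp.device, "qp#%d %s: " fmt,            \
                      (qp)->elem.index, __func__, ##__VA_ARGS__)
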
* RDMA/hns: Fix hns_roce_table_get return value (Chengchang Tang, 2023-07-19; 1 file, -3/+4)

  [ Upstream commit cf5b608fb0e369c473a8303cad6ddb386505e5b8 ]

  The return value of set_hem has been hardcoded to ENODEV, which causes diagnostic information to be lost.

  Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
  Link: https://lore.kernel.org/r/20230523121641.3132102-3-huangjunxian6@hisilicon.com
  Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
  Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* IB/hfi1: Fix wrong mmu_node used for user SDMA packet after invalidate (Brendan Cunningham, 2023-07-19; 9 files, -145/+177)

  [ Upstream commit c9358de193ecfb360c3ce75f27ce839ca0b0bc8c ]

  The hfi1 user SDMA pinned-page cache will leave a stale cache entry when the cache-entry's virtual address range is invalidated but that cache entry is in-use by an outstanding SDMA request. Subsequent user SDMA requests with buffers in or spanning the virtual address range of the stale cache entry will result in packets constructed from the wrong memory, the physical pages pointed to by the stale cache entry.

  To fix this, remove mmu_rb_node cache entries from the mmu_rb_handler cache independent of the cache entry's refcount. Add 'struct kref refcount' to struct mmu_rb_node and manage mmu_rb_node lifetime with kref_get() and kref_put().

  mmu_rb_node.refcount makes sdma_mmu_node.refcount redundant. Remove 'atomic_t refcount' from struct sdma_mmu_node and change sdma_mmu_node code to use mmu_rb_node.refcount.

  Move the mmu_rb_handler destructor call after a wait-for-SDMA-request-completion call so mmu_rb_nodes that need mmu_rb_handler's workqueue to queue themselves up for destruction from an interrupt context may do so.

  Fixes: f48ad614c100 ("IB/hfi1: Move driver out of staging")
  Fixes: 00cbce5cbf88 ("IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests")
  Link: https://lore.kernel.org/r/168451393605.3700681.13493776139032178861.stgit@awfm-02.cornelisnetworks.com
  Reviewed-by: Dean Luick <dean.luick@cornelisnetworks.com>
  Signed-off-by: Brendan Cunningham <bcunningham@cornelisnetworks.com>
  Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

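  The kref-based lifetime scheme described above is a standard kernel pattern. A minimal sketch with simplified stand-in types (the real mmu_rb_node also carries interval-tree linkage and a pinned-page list):

    #include <linux/kref.h>
    #include <linux/slab.h>

    struct demo_rb_node {
            struct kref refcount;
            /* ... interval-tree linkage, pinned pages ... */
    };

    static void demo_rb_node_release(struct kref *kref)
    {
            struct demo_rb_node *node =
                    container_of(kref, struct demo_rb_node, refcount);

            /* unpin pages here, then free */
            kfree(node);
    }

    /* Invalidate path: drop only the cache's own reference. In-flight
     * SDMA requests hold their own kref, so the node is freed when the
     * last request completes, never out from under them. */
    static void demo_cache_remove(struct demo_rb_node *node)
    {
            kref_put(&node->refcount, demo_rb_node_release);
    }
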
* RDMA/irdma: avoid fortify-string warning in irdma_clr_wqes (Arnd Bergmann, 2023-07-19; 1 file, -4/+6)

  [ Upstream commit b002760f877c0d91ecd3c78565b52f4bbac379dd ]

  Commit df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3") triggers a warning for fortified memset():

    In function 'fortify_memset_chk',
        inlined from 'irdma_clr_wqes' at drivers/infiniband/hw/irdma/uk.c:103:4:
    include/linux/fortify-string.h:493:25: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
      493 |  __write_overflow_field(p_size_field, size);
          |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The problem here is that the inner array only has four 8-byte elements, so clearing 4096 bytes overflows it. As this structure is part of an outer array, change the code to pass a pointer to the irdma_qp_quanta instead, and change the size argument for readability, matching the comment above it.

  Fixes: 551c46edc769 ("RDMA/irdma: Add user/kernel shared libraries")
  Link: https://lore.kernel.org/r/20230523111859.2197825-1-arnd@kernel.org
  Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

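  The underlying idea: fortify checks the size of the field the destination pointer names, so a large clear must be expressed against an object that really is that large. A hedged sketch with made-up types (sizes here are illustrative, not the real irdma_qp_quanta layout):

    #include <linux/string.h>
    #include <linux/types.h>

    struct demo_quanta {
            __le64 elem[4];                 /* 32 bytes */
    };

    /* BAD (fortified): memset(&sq[i].elem[0], 0, 4096) writes far
     * beyond the 32-byte field the pointer names.
     *
     * GOOD: point at whole quanta and size the clear in quanta, so the
     * destination object and the length agree. */
    static void demo_clr_wqes(struct demo_quanta *sq,
                              unsigned int first, unsigned int cnt)
    {
            memset(&sq[first], 0, cnt * sizeof(struct demo_quanta));
    }
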
* RDMA/bnxt_re: Fix to remove an unnecessary log (Kalesh AP, 2023-07-19; 1 file, -4/+1)

  [ Upstream commit 43774bc156614346fe5dacabc8e8c229167f2536 ]

  During destroy_qp, driver sets the qp handle in the existing CQEs belonging to the QP being destroyed to NULL. As a result, a poll_cq after destroy_qp can report unnecessary messages. Remove this noise from system logs.

  Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
  Link: https://lore.kernel.org/r/1684478897-12247-6-git-send-email-selvin.xavier@broadcom.com
  Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/bnxt_re: Remove a redundant check inside bnxt_re_update_gid (Kalesh AP, 2023-07-19; 1 file, -7/+1)

  [ Upstream commit b989f90cef0af48aa5679b6a75476371705ec53c ]

  The NULL check inside bnxt_re_update_gid() always returns false: if sgid_tbl->tbl were not allocated, dev_init would have failed.

  Fixes: 5fac5b1b297f ("RDMA/bnxt_re: Add vlan tag for untagged RoCE traffic when PFC is configured")
  Link: https://lore.kernel.org/r/1684478897-12247-5-git-send-email-selvin.xavier@broadcom.com
  Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
  Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
  Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/bnxt_re: Use unique names while registering interrupts (Kalesh AP, 2023-07-19; 4 files, -5/+25)

  [ Upstream commit ff2e4bfd162cf66a112a81509e419805add44d64 ]

  bnxt_re currently uses the names "bnxt_qplib_creq" and "bnxt_qplib_nq-0" while registering IRQs. There is no way to distinguish the IRQs of different device ports when there are multiple IB devices registered. This makes things worse in scenarios where one wants to pin the IRQs of a device port to certain CPUs.

  Fixed the code to use unique names which include the PCI BDF information while registering interrupts, like "bnxt_re-nq-0@pci:0000:65:00.0" and "bnxt_re-creq@pci:0000:65:00.1".

  Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
  Link: https://lore.kernel.org/r/1684478897-12247-4-git-send-email-selvin.xavier@broadcom.com
  Reviewed-by: Bhargava Chenna Marreddy <bhargava.marreddy@broadcom.com>
  Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

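  One way to build such a name, sketched with hypothetical helper and field names; kasprintf(), pci_name() and request_irq() are the real kernel APIs, and the caller owns the name string for the IRQ's lifetime:

    #include <linux/interrupt.h>
    #include <linux/kernel.h>
    #include <linux/pci.h>
    #include <linux/slab.h>

    static int demo_request_nq_irq(struct pci_dev *pdev, int vector,
                                   int nq_idx, irq_handler_t handler,
                                   void *ctx, char **name_out)
    {
            char *name;
            int rc;

            /* e.g. "bnxt_re-nq-0@pci:0000:65:00.0" */
            name = kasprintf(GFP_KERNEL, "bnxt_re-nq-%d@pci:%s",
                             nq_idx, pci_name(pdev));
            if (!name)
                    return -ENOMEM;

            rc = request_irq(vector, handler, 0, name, ctx);
            if (rc)
                    kfree(name);
            else
                    *name_out = name;       /* free only after free_irq() */
            return rc;
    }
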
* RDMA/bnxt_re: Fix to remove unnecessary return labels (Kalesh AP, 2023-07-19; 1 file, -5/+2)

  [ Upstream commit 9b3ee47796f529e5bc31a355d6cb756d68a7079a ]

  If there is no cleanup needed then just return directly. This cleans up the code and improves readability.

  Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
  Link: https://lore.kernel.org/r/1684478897-12247-3-git-send-email-selvin.xavier@broadcom.com
  Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
  Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
  Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/bnxt_re: Disable/kill tasklet only if it is enabled (Selvin Xavier, 2023-07-19; 3 files, -14/+28)

  [ Upstream commit ab112ee7899d6171da5acd77a7ed7ae103f488de ]

  When the ulp hook to start the IRQ fails because the rings are not available, tasklets are not enabled. In this case, when the driver is unloaded, it calls CREQ tasklet_kill. This causes an indefinite hang, as the tasklet was never enabled.

  The driver shouldn't call tasklet_kill if the tasklet is not enabled. So use the creq->requested and nq->requested flags to identify whether the tasklets/irqs are registered, and check this flag while scheduling the tasklet from the ISR. Also, add cleanup to disable the tasklet in case request_irq fails during start_irq.

  Check the return value of bnxt_qplib_rcfw_start_irq; in case bnxt_qplib_rcfw_start_irq fails, return from bnxt_re_start_irq without attempting to start the NQ IRQs.

  Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
  Link: https://lore.kernel.org/r/1684478897-12247-2-git-send-email-selvin.xavier@broadcom.com
  Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

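  A sketch of the guard being described, with a simplified context struct; the point is that tasklet_disable()/tasklet_kill() are only legal on a tasklet that was actually set up:

    #include <linux/interrupt.h>

    struct demo_nq {
            struct tasklet_struct worker;
            bool requested;         /* IRQ registered, tasklet enabled */
    };

    static void demo_stop_irq(struct demo_nq *nq, bool kill)
    {
            /* If start_irq failed, the tasklet was never enabled and
             * killing it would hang indefinitely. */
            if (!nq->requested)
                    return;

            nq->requested = false;
            tasklet_disable(&nq->worker);
            if (kill)
                    tasklet_kill(&nq->worker);
    }
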
* IB/isert: Fix incorrect release of isert connection (Saravanan Vajravel, 2023-06-21; 1 file, -2/+0)

  [ Upstream commit 699826f4e30ab76a62c238c86fbef7e826639c8d ]

  The ib_isert module is releasing the isert connection both in the isert_wait_conn() handler and in the isert_free_conn() handler. The isert_wait_conn() handler is expected to wait for the iSCSI session logout operation to complete; it should free the isert connection only in the isert_free_conn() handler.

  When a bunch of iSER targets are cleared, this issue can lead to a use-after-free, as the isert connection is released twice.

  Fixes: b02efbfc9a05 ("iser-target: Fix implicit termination of connections")
  Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Link: https://lore.kernel.org/r/20230606102531.162967-4-saravanan.vajravel@broadcom.com
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* IB/isert: Fix possible list corruption in CMA handler (Saravanan Vajravel, 2023-06-21; 1 file, -0/+4)

  [ Upstream commit 7651e2d6c5b359a28c2d4c904fec6608d1021ca8 ]

  When the ib_isert module receives a connection error event, it releases the isert session and removes the corresponding list node, but it doesn't take the appropriate mutex lock while removing the list node. This can lead to linked-list corruption.

  Fixes: bd3792205aae ("iser-target: Fix pending connections handling in target stack shutdown sequnce")
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
  Link: https://lore.kernel.org/r/20230606102531.162967-3-saravanan.vajravel@broadcom.com
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* IB/isert: Fix dead lock in ib_isert (Saravanan Vajravel, 2023-06-21; 1 file, -2/+8)

  [ Upstream commit 691b0480933f0ce88a81ed1d1a0aff340ff6293a ]

  When an iSER session is released, the ib_isert module takes a mutex lock and releases all pending connections. As part of this, ib_isert destroys the rdma cm_id. To destroy the cm_id, the rdma_cm module sends CM events to the CMA handler of ib_isert, and this handler takes the same mutex lock. Hence it leads to a deadlock between the ib_isert and rdma_cm modules.

  For the fix, create a local list of pending connections and release the connections outside of the mutex lock.

  Calltrace:

    [ 1229.791410] INFO: task kworker/10:1:642 blocked for more than 120 seconds.
    [ 1229.791416] Tainted: G OE --------- - - 4.18.0-372.9.1.el8.x86_64 #1
    [ 1229.791418] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1229.791419] task:kworker/10:1 state:D stack: 0 pid: 642 ppid: 2 flags:0x80004000
    [ 1229.791424] Workqueue: ib_cm cm_work_handler [ib_cm]
    [ 1229.791436] Call Trace:
    [ 1229.791438] __schedule+0x2d1/0x830
    [ 1229.791445] ? select_idle_sibling+0x23/0x6f0
    [ 1229.791449] schedule+0x35/0xa0
    [ 1229.791451] schedule_preempt_disabled+0xa/0x10
    [ 1229.791453] __mutex_lock.isra.7+0x310/0x420
    [ 1229.791456] ? select_task_rq_fair+0x351/0x990
    [ 1229.791459] isert_cma_handler+0x224/0x330 [ib_isert]
    [ 1229.791463] ? ttwu_queue_wakelist+0x159/0x170
    [ 1229.791466] cma_cm_event_handler+0x25/0xd0 [rdma_cm]
    [ 1229.791474] cma_ib_handler+0xa7/0x2e0 [rdma_cm]
    [ 1229.791478] cm_process_work+0x22/0xf0 [ib_cm]
    [ 1229.791483] cm_work_handler+0xf4/0xf30 [ib_cm]
    [ 1229.791487] ? move_linked_works+0x6e/0xa0
    [ 1229.791490] process_one_work+0x1a7/0x360
    [ 1229.791491] ? create_worker+0x1a0/0x1a0
    [ 1229.791493] worker_thread+0x30/0x390
    [ 1229.791494] ? create_worker+0x1a0/0x1a0
    [ 1229.791495] kthread+0x10a/0x120
    [ 1229.791497] ? set_kthread_struct+0x40/0x40
    [ 1229.791499] ret_from_fork+0x1f/0x40
    [ 1229.791739] INFO: task targetcli:28666 blocked for more than 120 seconds.
    [ 1229.791740] Tainted: G OE --------- - - 4.18.0-372.9.1.el8.x86_64 #1
    [ 1229.791741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 1229.791742] task:targetcli state:D stack: 0 pid:28666 ppid: 5510 flags:0x00004080
    [ 1229.791743] Call Trace:
    [ 1229.791744] __schedule+0x2d1/0x830
    [ 1229.791746] schedule+0x35/0xa0
    [ 1229.791748] schedule_preempt_disabled+0xa/0x10
    [ 1229.791749] __mutex_lock.isra.7+0x310/0x420
    [ 1229.791751] rdma_destroy_id+0x15/0x20 [rdma_cm]
    [ 1229.791755] isert_connect_release+0x115/0x130 [ib_isert]
    [ 1229.791757] isert_free_np+0x87/0x140 [ib_isert]
    [ 1229.791761] iscsit_del_np+0x74/0x120 [iscsi_target_mod]
    [ 1229.791776] lio_target_np_driver_store+0xe9/0x140 [iscsi_target_mod]
    [ 1229.791784] configfs_write_file+0xb2/0x110
    [ 1229.791788] vfs_write+0xa5/0x1a0
    [ 1229.791792] ksys_write+0x4f/0xb0
    [ 1229.791794] do_syscall_64+0x5b/0x1a0
    [ 1229.791798] entry_SYSCALL_64_after_hwframe+0x65/0xca

  Fixes: bd3792205aae ("iser-target: Fix pending connections handling in target stack shutdown sequnce")
  Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
  Link: https://lore.kernel.org/r/20230606102531.162967-2-saravanan.vajravel@broadcom.com
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

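  The "drain onto a local list, then release outside the lock" idiom described above is a standard one. A minimal sketch (demo_* names are illustrative, not the isert symbols):

    #include <linux/list.h>
    #include <linux/mutex.h>

    struct demo_conn {
            struct list_head node;
            /* ... rdma cm_id etc. ... */
    };

    static LIST_HEAD(pending_conns);
    static DEFINE_MUTEX(conn_mutex);

    static void demo_conn_release(struct demo_conn *conn)
    {
            /* e.g. rdma_destroy_id(conn->cm_id); kfree(conn); */
    }

    static void demo_release_all(void)
    {
            struct demo_conn *conn, *tmp;
            LIST_HEAD(local);

            /* detach everything while holding the lock... */
            mutex_lock(&conn_mutex);
            list_splice_init(&pending_conns, &local);
            mutex_unlock(&conn_mutex);

            /* ...then release with the lock dropped, so CM callbacks
             * that grab conn_mutex can still make progress. */
            list_for_each_entry_safe(conn, tmp, &local, node) {
                    list_del(&conn->node);
                    demo_conn_release(conn);
            }
    }
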
* RDMA/mlx5: Fix affinity assignment (Mark Bloch, 2023-06-21; 2 files, -0/+6)

  [ Upstream commit 617f5db1a626f18d5cbb7c7faf7bf8f9ea12be78 ]

  The cited commit aimed to ensure that Virtual Functions (VFs) assign a queue affinity to a Queue Pair (QP) to distribute traffic when the LAG master creates a hardware LAG. If the affinity was set while the hardware was not in LAG, the firmware would ignore the affinity value.

  However, this commit unintentionally assigned an affinity to QPs on the LAG master's VPORT even if the RDMA device was not marked as LAG-enabled. In most cases, this was not an issue because when the hardware entered hardware LAG configuration, the RDMA device of the LAG master would be destroyed and a new one would be created, marked as LAG-enabled.

  The problem arises when a user configures Equal-Cost Multipath (ECMP). In ECMP mode, traffic can be directed to different physical ports based on the queue affinity, which is intended for use by VPORTS other than the E-Switch manager. ECMP mode is supported only if both E-Switch managers are in switchdev mode and the appropriate route is configured via IP. In this configuration, the RDMA device is not destroyed, and we retain the RDMA device that is not marked as LAG-enabled.

  To ensure correct behavior, Send Queues (SQs) opened by the E-Switch manager through verbs should be assigned strict affinity. This means they will only be able to communicate through the native physical port associated with the E-Switch manager. This will prevent the firmware from assigning affinity and will not allow the SQs to be remapped in case of failover.

  Fixes: 802dcc7fc5ec ("RDMA/mlx5: Support TX port affinity for VF drivers in LAG mode")
  Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
  Signed-off-by: Mark Bloch <mbloch@nvidia.com>
  Link: https://lore.kernel.org/r/425b05f4da840bc684b0f7e8ebf61aeb5cef09b0.1685960567.git.leon@kernel.org
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* IB/uverbs: Fix to consider event queue closing also upon non-blocking mode (Yishai Hadas, 2023-06-21; 1 file, -7/+5)

  [ Upstream commit 62fab312fa1683e812e605db20d4f22de3e3fb2f ]

  Fix ib_uverbs_event_read() to consider event queue closing also upon non-blocking mode.

  Once the queue is closed (e.g. in the hot-plug flow), all the existing events are cleaned up as part of ib_uverbs_free_event_queue(). An application that uses the non-blocking FD mode should get -EIO in that case, to let it know that the device was removed already. Otherwise, it can lose the indication that the device was removed and won't recover.

  As part of that, refactor the code to have a single flow with regards to 'is_closed' for both blocking and non-blocking modes.

  Fixes: 14e23bd6d221 ("RDMA/core: Fix locking in ib_uverbs_event_read")
  Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
  Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
  Link: https://lore.kernel.org/r/97b00116a1e1e13f8dc4ec38a5ea81cf8c030210.1685960567.git.leon@kernel.org
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

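  A sketch of the single-flow check being described (locking omitted for brevity; the field and helper names are simplified stand-ins for the uverbs ones):

    #include <linux/fs.h>
    #include <linux/list.h>
    #include <linux/wait.h>

    struct demo_ev_queue {
            struct list_head event_list;
            wait_queue_head_t poll_wait;
            bool is_closed;
    };

    static int demo_wait_for_event(struct demo_ev_queue *q,
                                   struct file *filp)
    {
            while (list_empty(&q->event_list)) {
                    if (q->is_closed)
                            return -EIO;    /* device already removed */
                    if (filp->f_flags & O_NONBLOCK)
                            return -EAGAIN;
                    if (wait_event_interruptible(q->poll_wait,
                                    !list_empty(&q->event_list) ||
                                    q->is_closed))
                            return -ERESTARTSYS;
            }
            return 0;                       /* an event is available */
    }
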
* RDMA/cma: Always set static rate to 0 for RoCE (Mark Zhang, 2023-06-21; 1 file, -2/+2)

  [ Upstream commit 58030c76cce473b6cfd630bbecb97215def0dff8 ]

  Set the static rate to 0, as it should be discovered by path query and has no meaning for RoCE. This also avoids using the rtnl lock and the ethtool API, which is a bottleneck when trying to set up many rdma-cm connections at the same time, especially with multiple processes.

  Fixes: 3c86aa70bf67 ("RDMA/cm: Add RDMA CM support for IBoE devices")
  Signed-off-by: Mark Zhang <markzhang@nvidia.com>
  Link: https://lore.kernel.org/r/f72a4f8b667b803aee9fa794069f61afb5839ce4.1685960567.git.leon@kernel.org
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/mlx5: Create an indirect flow table for steering anchor (Mark Bloch, 2023-06-21; 3 files, -7/+296)

  [ Upstream commit e1f4a52ac171dd863fe89055e749ef5e0a0bc5ce ]

  A misbehaved user can create a steering anchor that points to a kernel flow table and then destroy the anchor without freeing the associated STC. This creates a problem as the kernel can't destroy the flow table since there is still a reference to it. As a result, this can exhaust all available flow table resources, preventing other users from using the RDMA device.

  To prevent this problem, a solution is implemented where a special flow table with two steering rules is created when a user creates a steering anchor for the first time. The rules include one that drops all traffic and another that points to the kernel flow table. If the steering anchor is destroyed, only the rule pointing to the kernel's flow table is removed. Any traffic reaching the special flow table after that is dropped.

  Since the special flow table is not destroyed when the steering anchor is destroyed, any issues are prevented from occurring. The remaining resources are only destroyed when the RDMA device is destroyed, which happens after all DEVX objects are freed, including the STCs, thus mitigating the issue.

  Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace")
  Signed-off-by: Mark Bloch <mbloch@nvidia.com>
  Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
  Link: https://lore.kernel.org/r/b4a88a871d651fa4e8f98d552553c1cfe9ba2cd6.1685960567.git.leon@kernel.org
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions (Maher Sanalla, 2023-06-21; 1 file, -0/+3)

  [ Upstream commit ee4d269eccfea6c17b18281bef482700d898e86f ]

  Delay drop data is initiated for PFs that have the capability of rq_delay_drop and are in roce profile. However, PFs with RAW ethernet profile do not initiate delay drop data on function load, causing a kernel panic if delay drop struct members are accessed later on, in case a dropless RQ is created.

  Thus, stage the delay drop initialization as part of the RAW ethernet PF loading process.

  Fixes: b5ca15ad7e61 ("IB/mlx5: Add proper representors support")
  Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
  Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
  Link: https://lore.kernel.org/r/2e9d386785043d48c38711826eb910315c1de141.1685960567.git.leon@kernel.org
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/rxe: Fix the use-before-initialization error of resp_pkts (Zhu Yanjun, 2023-06-21; 1 file, -4/+3)

  [ Upstream commit 2a62b6210ce876c596086ab8fd4c8a0c3d10611a ]

  In the following:

    Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:88 [inline]
    dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
    assign_lock_key kernel/locking/lockdep.c:982 [inline]
    register_lock_class+0xdb6/0x1120 kernel/locking/lockdep.c:1295
    __lock_acquire+0x10a/0x5df0 kernel/locking/lockdep.c:4951
    lock_acquire kernel/locking/lockdep.c:5691 [inline]
    lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5656
    __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
    _raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
    skb_dequeue+0x20/0x180 net/core/skbuff.c:3639
    drain_resp_pkts drivers/infiniband/sw/rxe/rxe_comp.c:555 [inline]
    rxe_completer+0x250d/0x3cc0 drivers/infiniband/sw/rxe/rxe_comp.c:652
    rxe_qp_do_cleanup+0x1be/0x820 drivers/infiniband/sw/rxe/rxe_qp.c:761
    execute_in_process_context+0x3b/0x150 kernel/workqueue.c:3473
    __rxe_cleanup+0x21e/0x370 drivers/infiniband/sw/rxe/rxe_pool.c:233
    rxe_create_qp+0x3f6/0x5f0 drivers/infiniband/sw/rxe/rxe_verbs.c:583

  This is a use-before-initialization problem. It happens because rxe_qp_do_cleanup is called during error unwind before the struct has been fully initialized. Move the initialization of the skb earlier.

  Fixes: 8700e3e7c485 ("Soft RoCE driver")
  Link: https://lore.kernel.org/r/20230602035408.741534-1-yanjun.zhu@intel.com
  Reported-by: syzbot+eba589d8f49c73d356da@syzkaller.appspotmail.com
  Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/rxe: Removed unused name from rxe_task struct (Bob Pearson, 2023-06-21; 3 files, -12/+5)

  [ Upstream commit de669ae8af49ceed0eed44f5b3d51dc62affc5e4 ]

  The name field in struct rxe_task is never used. This patch removes it.

  Link: https://lore.kernel.org/r/20221021200118.2163-4-rpearsonhpe@gmail.com
  Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
  Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Stable-dep-of: 2a62b6210ce8 ("RDMA/rxe: Fix the use-before-initialization error of resp_pkts")
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/rxe: Fix ref count error in check_rkey() (Bob Pearson, 2023-06-21; 1 file, -1/+2)

  [ Upstream commit b00683422fd79dd07c9b75efdce1660e5e19150e ]

  There is a reference count error in the error path code and a potential race in check_rkey() in rxe_resp.c. When looking up the rkey for a memory window, the reference to the mw from rxe_lookup_mw() is dropped before a reference is taken on the mr referenced by the mw. If the mr is destroyed immediately after the call to rxe_put(mw), the mr pointer is unprotected and may end up pointing at freed memory. The rxe_get(mr) call should take place before the rxe_put(mw) call.

  All error paths in check_rkey() call rxe_put(mw) if mw is not NULL, but at this point it has already been put once. The mw pointer should be set to NULL after the rxe_put(mw) call to prevent this from happening.

  Fixes: cdd0b85675ae ("RDMA/rxe: Implement memory access through MWs")
  Link: https://lore.kernel.org/r/20230517211509.1819998-1-rpearsonhpe@gmail.com
  Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

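  The ordering rule generalizes: pin the inner object before dropping the outer reference that protects it, and NULL the outer pointer so shared error paths cannot put it twice. A generic kref sketch (demo types, not the rxe ones):

    #include <linux/kref.h>
    #include <linux/slab.h>

    struct demo_inner { struct kref ref; };

    struct demo_outer {
            struct kref ref;
            struct demo_inner *inner;
    };

    static void demo_inner_release(struct kref *k)
    {
            kfree(container_of(k, struct demo_inner, ref));
    }

    static void demo_outer_release(struct kref *k)
    {
            kfree(container_of(k, struct demo_outer, ref));
    }

    static struct demo_inner *demo_take_inner(struct demo_outer **pout)
    {
            struct demo_inner *in = (*pout)->inner;

            kref_get(&in->ref);                          /* 1: pin inner first */
            kref_put(&(*pout)->ref, demo_outer_release); /* 2: now drop outer */
            *pout = NULL;                                /* 3: block double-put */
            return in;
    }
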
* RDMA/rxe: Fix packet length checks (Bob Pearson, 2023-06-21; 1 file, -0/+6)

  [ Upstream commit 9a3763e87379c97a78b7c6c6f40720b1e877174f ]

  In rxe_net.c a received packet, from udp or loopback, is passed to rxe_rcv() in rxe_recv.c as a udp packet, i.e. skb->data is pointing at the udp header. But rxe_rcv() makes length checks to verify the packet is long enough to hold the roce headers as if it were a roce packet, i.e. skb->data pointing at the bth header. A runt packet would appear to have 8 more bytes than it actually does, which may lead to incorrect behavior.

  This patch calls skb_pull() to adjust the skb to point at the bth header before calling rxe_rcv(), which fixes this error.

  Fixes: 8700e3e7c485 ("Soft RoCE driver")
  Link: https://lore.kernel.org/r/20230517172242.1806340-1-rpearsonhpe@gmail.com
  Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

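  A sketch of the adjustment, with demo_roce_rcv() standing in for rxe_rcv(); skb_pull() returns NULL when the skb is shorter than the requested length, which also catches runt packets:

    #include <linux/skbuff.h>
    #include <linux/udp.h>

    static int demo_roce_rcv(struct sk_buff *skb)
    {
            /* stands in for rxe_rcv(): expects skb->data at the BTH */
            kfree_skb(skb);
            return 0;
    }

    static int demo_udp_rcv(struct sk_buff *skb)
    {
            /* advance skb->data past the UDP header so the RoCE
             * receive path sees the BTH where it expects it */
            if (!skb_pull(skb, sizeof(struct udphdr))) {
                    kfree_skb(skb);         /* runt packet */
                    return 0;
            }

            return demo_roce_rcv(skb);
    }
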
* RDMA/rtrs: Fix rxe_dealloc_pd warning (Li Zhijian, 2023-06-21; 1 file, -32/+23)

  [ Upstream commit 9c29c8c7df0688f358d2df5ddd16c97c2f7292b4 ]

  In the current design:

  1. The PD and clt_path->s.dev are shared among connections.
  2. Every con[n]'s cleanup phase calls destroy_con_cq_qp().
  3. clt_path->s.dev is always decreased in destroy_con_cq_qp(), and when clt_path->s.dev becomes zero, it destroys the PD.
  4. When con[1] fails to create, con[1] does not take clt_path->s.dev, but it still tries to decrease clt_path->s.dev.

  So, in case create_cm(con[0]) succeeds but create_cm(con[1]) fails, destroy_con_cq_qp(con[1]) will be called first, and it will destroy the PD while this PD is still taken by con[0].

  Here, we refactor the error paths of create_cm() and init_conns() so that we do the cleanup in the order the objects were created.

  The warning occurs when destroying an RXE PD whose reference count is not zero:

    rnbd_client L597: Mapping device /dev/nvme0n1 on session client, (access_mode: rw, nr_poll_queues: 0)
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 26407 at drivers/infiniband/sw/rxe/rxe_pool.c:256 __rxe_cleanup+0x13a/0x170 [rdma_rxe]
    Modules linked in: rpcrdma rdma_ucm ib_iser rnbd_client libiscsi rtrs_client scsi_transport_iscsi rtrs_core rdma_cm iw_cm ib_cm crc32_generic rdma_rxe udp_tunnel ib_uverbs ib_core kmem device_dax nd_pmem dax_pmem nd_vme crc32c_intel fuse nvme_core nfit libnvdimm dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 26407 Comm: rnbd-client.sh Kdump: loaded Not tainted 6.2.0-rc6-roce-flush+ #53
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    RIP: 0010:__rxe_cleanup+0x13a/0x170 [rdma_rxe]
    Code: 45 84 e4 0f 84 5a ff ff ff 48 89 ef e8 5f 18 71 f9 84 c0 75 90 be c8 00 00 00 48 89 ef e8 be 89 1f fa 85 c0 0f 85 7b ff ff ff <0f> 0b 41 bc ea ff ff ff e9 71 ff ff ff e8 84 7f 1f fa e9 d0 fe ff
    RSP: 0018:ffffb09880b6f5f0 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff99401f15d6a8 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffffffffbac8234b RDI: 00000000ffffffff
    RBP: ffff99401f15d6d0 R08: 0000000000000001 R09: 0000000000000001
    R10: 0000000000002d82 R11: 0000000000000000 R12: 0000000000000001
    R13: ffff994101eff208 R14: ffffb09880b6f6a0 R15: 00000000fffffe00
    FS: 00007fe113904740(0000) GS:ffff99413bc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ff6cde656c8 CR3: 000000001f108004 CR4: 00000000001706f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    <TASK>
    rxe_dealloc_pd+0x16/0x20 [rdma_rxe]
    ib_dealloc_pd_user+0x4b/0x80 [ib_core]
    rtrs_ib_dev_put+0x79/0xd0 [rtrs_core]
    destroy_con_cq_qp+0x8a/0xa0 [rtrs_client]
    init_path+0x1e7/0x9a0 [rtrs_client]
    ? __pfx_autoremove_wake_function+0x10/0x10
    ? lock_is_held_type+0xd7/0x130
    ? rcu_read_lock_sched_held+0x43/0x80
    ? pcpu_alloc+0x3dd/0x7d0
    ? rtrs_clt_init_stats+0x18/0x40 [rtrs_client]
    rtrs_clt_open+0x24f/0x5a0 [rtrs_client]
    ? __pfx_rnbd_clt_link_ev+0x10/0x10 [rnbd_client]
    rnbd_clt_map_device+0x6a5/0xe10 [rnbd_client]

  Fixes: 6a98d71daea1 ("RDMA/rtrs: client: main functionality")
  Link: https://lore.kernel.org/r/1682384563-2-4-git-send-email-lizhijian@fujitsu.com
  Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
  Acked-by: Jack Wang <jinpu.wang@ionos.com>
  Tested-by: Jack Wang <jinpu.wang@ionos.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/rtrs: Fix the last iu->buf leak in err path (Li Zhijian, 2023-06-21; 1 file, -1/+3)

  [ Upstream commit 3bf3a7c6985c625f64e73baefdaa36f1c2045a29 ]

  The last iu->buf will leak if ib_dma_mapping_error() fails.

  Fixes: c0894b3ea69d ("RDMA/rtrs: core: lib functions shared between client and server modules")
  Link: https://lore.kernel.org/r/1682384563-2-3-git-send-email-lizhijian@fujitsu.com
  Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
  Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
  Acked-by: Jack Wang <jinpu.wang@ionos.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/uverbs: Restrict usage of privileged QKEYs (Edward Srouji, 2023-06-21; 1 file, -1/+6)

  commit 0cadb4db79e1d9eea66711c4031e435c2191907e upstream.

  According to the IB specification rel-1.6, section 3.5.3: "QKEYs with the most significant bit set are considered controlled QKEYs, and a HCA does not allow a consumer to arbitrarily specify a controlled QKEY."

  Thus, block non-privileged users from setting such a QKEY.

  Cc: stable@vger.kernel.org
  Fixes: bc38a6abdd5a ("[PATCH] IB uverbs: core implementation")
  Signed-off-by: Edward Srouji <edwards@nvidia.com>
  Link: https://lore.kernel.org/r/c00c809ddafaaf87d6f6cb827978670989a511b3.1685960567.git.leon@kernel.org
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

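  A sketch of the kind of check described; whether the upstream patch gates on CAP_NET_RAW exactly is an assumption here, but the controlled-QKEY test itself follows straight from the spec text quoted above:

    #include <linux/capability.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    static int demo_validate_qkey(u32 qkey)
    {
            /* MSB set => controlled QKEY (IB spec 1.6, sec. 3.5.3) */
            if ((qkey & 0x80000000) && !capable(CAP_NET_RAW))
                    return -EPERM;

            return 0;
    }
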
* RDMA/irdma: Fix Local Invalidate fencing (Mustafa Ismail, 2023-06-09; 1 file, -0/+1)

  [ Upstream commit 5842d1d9c1b0d17e0c29eae65ae1f245f83682dd ]

  If the local invalidate fence is indicated in the WR, only the read fence is currently being set in the WQE. Fix this to set both the read and local fence in the WQE.

  Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
  Link: https://lore.kernel.org/r/20230522155654.1309-4-shiraz.saleem@intel.com
  Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
  Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/irdma: Prevent QP use after free (Mustafa Ismail, 2023-06-09; 1 file, -5/+6)

  [ Upstream commit c8f304d75f6c6cc679a73f89591f9a915da38f09 ]

  There is a window where the poll cq may use a QP that has been freed. This can happen if a CQE is polled before irdma_clean_cqes() can clear the CQEs related to the QP and the destroy QP races to free the QP memory; the QP structures are then used in irdma_poll_cq.

  Fix this by moving the clearing of CQEs before the reference is removed and the QP is destroyed.

  Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
  Link: https://lore.kernel.org/r/20230522155654.1309-3-shiraz.saleem@intel.com
  Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
  Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/bnxt_re: Fix return value of bnxt_re_process_raw_qp_pkt_rx (Kalesh AP, 2023-06-09; 1 file, -3/+1)

  [ Upstream commit 0fa0d520e2a878cb4c94c4dc84395905d3f14f54 ]

  bnxt_re_process_raw_qp_pkt_rx() always returns 0 and ignores the return value of bnxt_re_post_send_shadow_qp().

  Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
  Link: https://lore.kernel.org/r/1684397461-23082-3-git-send-email-selvin.xavier@broadcom.com
  Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
  Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
  Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/bnxt_re: Fix a possible memory leak (Kalesh AP, 2023-06-09; 1 file, -5/+6)

  [ Upstream commit 349e3c0cf239cc01d58a1e6c749e171de014cd6a ]

  Inside bnxt_qplib_create_cq(), when the check for NULL DPI fails, driver returns directly without freeing the memory allocated inside the bnxt_qplib_alloc_init_hwq() routine.

  Fixed this by moving the check for NULL DPI before invoking bnxt_qplib_alloc_init_hwq().

  Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
  Link: https://lore.kernel.org/r/1684397461-23082-2-git-send-email-selvin.xavier@broadcom.com
  Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
  Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/hns: Modify the value of long message loopback slice (Yangyang Li, 2023-06-09; 1 file, -5/+3)

  [ Upstream commit 56518a603fd2bf74762d176ac980572db84a3e14 ]

  Long message loopback slice is used for achieving traffic balance between QPs. It prevents the problem of QPs with large traffic occupying the hardware pipeline for a long time while QPs with small traffic cannot be scheduled.

  Currently, its maximum value is set to 16K, which means only after a QP sends 16K will the second QP be scheduled. This value is too large, which will lead to unbalanced traffic scheduling, and thus it needs to be modified. The setting range of the long message loopback slice is modified to be from 1024 (the lower limit supported by hardware) to mtu. Actual testing shows that this value can significantly reduce error in hardware traffic scheduling.

  This solution is compatible with both HIP08 and HIP09. The modified lp_pktn_ini has a maximum value of 2 (when mtu is 256), so the range-checking code for lp_pktn_ini is no longer necessary and needs to be deleted.

  Fixes: 0e60778efb07 ("RDMA/hns: Modify the value of MAX_LP_MSG_LEN to meet hardware compatibility")
  Link: https://lore.kernel.org/r/20230512092245.344442-4-huangjunxian6@hisilicon.com
  Signed-off-by: Yangyang Li <liyangyang20@huawei.com>
  Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/hns: Fix base address table allocation (Chengchang Tang, 2023-06-09; 1 file, -0/+43)

  [ Upstream commit 7f3969b14f356dd65fa95b3528eb05c32e68bc06 ]

  For hns, the specification of an entry-type resource (e.g. WQE/CQE/EQE) depends on the BT page size, the buf page size and the hopnum. For user mode, the buf page size depends on UMEM. Therefore, the actual specification is controlled by the BT page size and the hopnum.

  The current BT page size and hopnum are obtained from firmware. This makes the driver inflexible and introduces unnecessary constraints; resource allocation failures occur in many scenarios.

  This patch will calculate whether the BT page size set by firmware is sufficient before allocating BT, and increase the BT page size if it is insufficient.

  Fixes: 1133401412a9 ("RDMA/hns: Optimize base address table config flow for qp buffer")
  Link: https://lore.kernel.org/r/20230512092245.344442-3-huangjunxian6@hisilicon.com
  Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
  Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/hns: Fix timeout attr in query qp for HIP08 (Chengchang Tang, 2023-06-09; 2 files, -3/+16)

  [ Upstream commit 58caa2a51ad4fd21763696cc6c4defc9fc1b4b4f ]

  On HIP08, the queried timeout attr is different from the timeout attr configured by the user. It is found by the rdma-core testcase test_rdmacm_async_traffic:

    ======================================================================
    FAIL: test_rdmacm_async_traffic (tests.test_rdmacm.CMTestCase)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "./tests/test_rdmacm.py", line 33, in test_rdmacm_async_traffic
        self.two_nodes_rdmacm_traffic(CMAsyncConnection, self.rdmacm_traffic,
      File "./tests/base.py", line 382, in two_nodes_rdmacm_traffic
        raise(res)
    AssertionError

  Fixes: 926a01dc000d ("RDMA/hns: Add QP operations support for hip08 SoC")
  Link: https://lore.kernel.org/r/20230512092245.344442-2-huangjunxian6@hisilicon.com
  Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
  Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/efa: Fix unsupported page sizes in device (Yonatan Nachum, 2023-06-09; 1 file, -1/+1)

  [ Upstream commit 866422cdddcdf59d8c68e9472d49ba1be29b5fcf ]

  The device uses 4KB size blocks for the user pages indirect list, while the driver creates those blocks with the size of PAGE_SIZE of the kernel. On kernels with PAGE_SIZE different than 4KB (ARM RHEL), this leads to a failure on register MR with an indirect list, because of the miscommunication between driver and device.

  Fixes: 40909f664d27 ("RDMA/efa: Add EFA verbs implementation")
  Link: https://lore.kernel.org/r/20230511115103.13876-1-ynachum@amazon.com
  Reviewed-by: Firas Jahjah <firasj@amazon.com>
  Reviewed-by: Michael Margolin <mrgolin@amazon.com>
  Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/bnxt_re: Fix the page_size used during the MR creation (Selvin Xavier, 2023-06-09; 2 files, -14/+5)

  [ Upstream commit 08c7f09356e45d093d1867c7a3c6ac6526e2f98b ]

  The driver populates the list of pages used for a Memory Region wrongly when the page size is more than the system page size. This causes a failure for applications that create an MR with a page size of 2M. Since HW can support multiple page sizes, pass the correct page size while creating the MR.

  Also, the driver need not adjust the number of pages when HW Queues are created with user memory; it should work with the number of dma blocks returned by ib_umem_num_dma_blocks. Fix this calculation as well.

  Fixes: 0c4dcd602817 ("RDMA/bnxt_re: Refactor hardware queue memory allocation")
  Fixes: f6919d56388c ("RDMA/bnxt_re: Code refactor while populating user MRs")
  Link: https://lore.kernel.org/r/1683484169-9539-1-git-send-email-selvin.xavier@broadcom.com
  Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
  Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
  Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/rxe: Fix the error "trying to register non-static key in rxe_cleanup_task" (Zhu Yanjun, 2023-06-05; 1 file, -2/+5)

  [ Upstream commit b2b1ddc457458fecd1c6f385baa9fbda5f0c63ad ]

  In the function rxe_create_qp(), rxe_qp_from_init() is called to initialize the qp; internally, things like rxe_init_task are not set up until rxe_qp_init_req(). If an error occurs before this point, the unwind will call rxe_cleanup(), and eventually rxe_qp_do_cleanup()/rxe_cleanup_task(), which will oops when trying to access the uninitialized spinlock. With this fix, if rxe_init_task was not executed, rxe_cleanup_task will not be called.

  Reported-by: syzbot+cfcc1a3c85be15a40cba@syzkaller.appspotmail.com
  Link: https://syzkaller.appspot.com/bug?id=fd85757b74b3eb59f904138486f755f71e090df8
  Fixes: 8700e3e7c485 ("Soft RoCE driver")
  Fixes: 2d4b21e0a291 ("IB/rxe: Prevent from completer to operate on non valid QP")
  Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
  Link: https://lore.kernel.org/r/20230413101115.1366068-1-yanjun.zhu@intel.com
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/mlx5: Use correct device num_ports when modify DC (Mark Zhang, 2023-05-11; 1 file, -1/+1)

  [ Upstream commit 746aa3c8cb1a650ff2583497ac646e505831b9b9 ]

  Just like other QP types, when modifying DC, the port_num should be compared with dev->num_ports instead of HCA_CAP.num_ports. Otherwise Multi-port vHCA on DC may not work.

  Fixes: 776a3906b692 ("IB/mlx5: Add support for DC target QP")
  Link: https://lore.kernel.org/r/20230420013906.1244185-1-markzhang@nvidia.com
  Signed-off-by: Mark Zhang <markzhang@nvidia.com>
  Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/mlx5: Fix flow counter query via DEVX (Mark Bloch, 2023-05-11; 1 file, -5/+26)

  [ Upstream commit 3e358ea8614ddfbc59ca7a3f5dff5dde2b350b2c ]

  The commit cited in the "fixes" tag added bulk support for flow counters, but it didn't account for the fact that it is also possible to query a counter using a non-base id if the counter was allocated as bulk.

  When a user performs a query, validate that the flow counter id given in the mailbox is inside the valid range, taking the bulk value into account.

  Fixes: 208d70f562e5 ("IB/mlx5: Support flow counters offset for bulk counters")
  Signed-off-by: Mark Bloch <mbloch@nvidia.com>
  Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
  Link: https://lore.kernel.org/r/79d7fbe291690128e44672418934256254d93115.1681377114.git.leon@kernel.org
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMR (Avihai Horon, 2023-05-11; 1 file, -2/+4)

  [ Upstream commit d43b020b0f82c088ef8ff3196ef00575a97d200e ]

  relaxed_ordering_read HCA capability is set if both the device supports relaxed ordering (RO) read and RO is set in PCI config space. RO in PCI config space can change during runtime. This will change the value of relaxed_ordering_read HCA capability in FW, but the driver will not see it since it queries the capabilities only once.

  This can lead to the following scenario:

  1. RO in PCI config space is enabled.
  2. User creates MKey without RO.
  3. RO in PCI config space is disabled. As a result, relaxed_ordering_read HCA capability is turned off in FW but remains on in driver copy of the capabilities.
  4. User requests to reconfig the MKey with RO via UMR.
  5. Driver will try to reconfig the MKey with RO read although it shouldn't (as relaxed_ordering_read HCA capability is really off).

  To fix this, check pcie_relaxed_ordering_enabled() before setting RO read in UMR.

  Fixes: 896ec9735336 ("RDMA/mlx5: Set mkey relaxed ordering by UMR with ConnectX-7")
  Signed-off-by: Avihai Horon <avihaih@nvidia.com>
  Reviewed-by: Shay Drory <shayd@nvidia.com>
  Link: https://lore.kernel.org/r/8d39eb8317e7bed1a354311a20ae707788fd94ed.1681131553.git.leon@kernel.org
  Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

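  A sketch of the gating check; pcie_relaxed_ordering_enabled() is the real helper from <linux/pci.h> that inspects the device's PCI config space, while the flag and function names here are stand-ins:

    #include <linux/pci.h>
    #include <linux/types.h>

    /* Decide whether an MKey reconfigured via UMR may enable relaxed
     * ordering reads: trust the boot-time HCA capability only if RO
     * is still enabled in PCI config space right now. */
    static bool demo_umr_can_use_ro(struct pci_dev *pdev,
                                    bool cap_ro_read)
    {
            return cap_ro_read && pcie_relaxed_ordering_enabled(pdev);
    }
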
* IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests (Patrick Kelsey, 2023-05-11; 11 files, -304/+423)

  [ Upstream commit 00cbce5cbf88459cd1aa1d60d0f1df15477df127 ]

  hfi1 user SDMA request processing has two bugs that can cause data corruption for user SDMA requests that have multiple payload iovecs, where an iovec other than the tail iovec does not run up to the page boundary for the buffer pointed to by that iovec.

  Here are the specific bugs:

  1. user_sdma_txadd() does not use struct user_sdma_iovec->iov.iov_len. Rather, user_sdma_txadd() will add up to PAGE_SIZE bytes from the iovec to the packet, even if some of those bytes are past iovec->iov.iov_len and are thus not intended to be in the packet.
  2. user_sdma_txadd() and user_sdma_send_pkts() fail to advance to the next iovec in user_sdma_request->iovs when the current iovec is not PAGE_SIZE and does not contain enough data to complete the packet. The transmitted packet will contain the wrong data from the iovec pages.

  This has not been an issue with SDMA packets from hfi1 Verbs or PSM2 because they only produce iovecs that end short of PAGE_SIZE as the tail iovec of an SDMA request.

  Fixing these bugs exposes other bugs with the SDMA pin cache (struct mmu_rb_handler) that get in the way of supporting user SDMA requests with multiple payload iovecs whose buffers do not end at PAGE_SIZE. So this commit fixes those issues as well.

  Here are the mmu_rb_handler bugs that non-PAGE_SIZE-end multi-iovec payload user SDMA requests can hit:

  1. Overlapping memory ranges in mmu_rb_handler will result in duplicate pinnings.
  2. When extending an existing mmu_rb_handler entry (struct mmu_rb_node), the mmu_rb code (1) removes the existing entry under a lock, (2) releases that lock, pins the new pages, (3) then reacquires the lock to insert the extended mmu_rb_node. If someone else comes in and inserts an overlapping entry between (2) and (3), insert in (3) will fail. The failure path code in this case unpins _all_ pages in either the original mmu_rb_node or the new mmu_rb_node that was inserted between (2) and (3).
  3. In hfi1_mmu_rb_remove_unless_exact(), mmu_rb_node->refcount is incremented outside of mmu_rb_handler->lock. As a result, mmu_rb_node could be evicted by another thread that gets mmu_rb_handler->lock and checks mmu_rb_node->refcount before mmu_rb_node->refcount is incremented.
  4. Related to #2 above, the SDMA request submission failure path does not check mmu_rb_node->refcount before freeing the mmu_rb_node object. If there are other SDMA requests in progress whose iovecs have pointers to the now-freed mmu_rb_node(s), those pointers to the now-freed mmu_rb nodes will be dereferenced when those SDMA requests complete.

  Fixes: 7be85676f1d1 ("IB/hfi1: Don't remove RB entry when not needed.")
  Fixes: 7724105686e7 ("IB/hfi1: add driver files")
  Signed-off-by: Brendan Cunningham <bcunningham@cornelisnetworks.com>
  Signed-off-by: Patrick Kelsey <pat.kelsey@cornelisnetworks.com>
  Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
  Link: https://lore.kernel.org/r/168088636445.3027109.10054635277810177889.stgit@252.162.96.66.static.eigbox.net
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* IB/hfi1: Fix SDMA mmu_rb_node not being evicted in LRU order (Patrick Kelsey, 2023-05-11; 1 file, -7/+6)

  [ Upstream commit 9fe8fec5e43d5a80f43cbf61aaada1b047a1eb61 ]

  hfi1_mmu_rb_remove_unless_exact() did not move mmu_rb_node objects in mmu_rb_handler->lru_list after getting a cache hit on an mmu_rb_node. As a result, hfi1_mmu_rb_evict() was not guaranteed to evict truly least-recently used nodes.

  This could be a performance issue for an application when that application:

  - Uses some long-lived buffers frequently.
  - Uses a large number of buffers once.
  - Hits the mmu_rb_handler cache size or pinned-page limits, forcing mmu_rb_handler cache entries to be evicted.

  In this case, the one-time use buffers cause the long-lived buffer entries to eventually filter to the end of the LRU list, where hfi1_mmu_rb_evict() will consider evicting a frequently-used long-lived entry instead of evicting one of the one-time use entries.

  Fix this by inserting new mmu_rb_node at the tail of mmu_rb_handler->lru_list and moving an mmu_rb_node to the tail of mmu_rb_handler->lru_list when the mmu_rb_node is a hit in hfi1_mmu_rb_remove_unless_exact(). Change hfi1_mmu_rb_evict() to evict from the head of mmu_rb_handler->lru_list instead of the tail.

  Fixes: 0636e9ab8355 ("IB/hfi1: Add cache evict LRU list")
  Signed-off-by: Brendan Cunningham <bcunningham@cornelisnetworks.com>
  Signed-off-by: Patrick Kelsey <pat.kelsey@cornelisnetworks.com>
  Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
  Link: https://lore.kernel.org/r/168088635931.3027109.10423156330761536044.stgit@252.162.96.66.static.eigbox.net
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

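  The LRU discipline described above reduces to three list operations. A minimal sketch with stand-in types (the real handler also locks around these):

    #include <linux/list.h>

    struct demo_handler { struct list_head lru_list; };
    struct demo_node { struct list_head list; };

    /* New and freshly-hit nodes go to the tail... */
    static void demo_insert(struct demo_handler *h, struct demo_node *n)
    {
            list_add_tail(&n->list, &h->lru_list);
    }

    static void demo_touch(struct demo_handler *h, struct demo_node *n)
    {
            list_move_tail(&n->list, &h->lru_list);  /* cache hit */
    }

    /* ...so the head really is the least-recently used entry. */
    static struct demo_node *demo_pick_victim(struct demo_handler *h)
    {
            return list_first_entry_or_null(&h->lru_list,
                                            struct demo_node, list);
    }
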
* RDMA/srpt: Add a check for valid 'mad_agent' pointer (Saravanan Vajravel, 2023-05-11; 1 file, -10/+13)

  [ Upstream commit eca5cd9474cd26d62f9756f536e2e656d3f62f3a ]

  When unregistering a MAD agent, the srpt module has a non-null check for the 'mad_agent' pointer before invoking ib_unregister_mad_agent(). This check can pass if the 'mad_agent' variable holds an error value. The 'mad_agent' can have an error value for a short window when srpt_add_one() and srpt_remove_one() execute simultaneously.

  In the srpt module, add a valid pointer check for 'sport->mad_agent' before unregistering the MAD agent. This issue can be hit when the RoCE driver unregisters the ib_device.

  Stack Trace:

    BUG: kernel NULL pointer dereference, address: 000000000000004d
    PGD 145003067 P4D 145003067 PUD 2324fe067 PMD 0
    Oops: 0002 [#1] PREEMPT SMP NOPTI
    CPU: 10 PID: 4459 Comm: kworker/u80:0 Kdump: loaded Tainted: P
    Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS 2.5.4 01/13/2020
    Workqueue: bnxt_re bnxt_re_task [bnxt_re]
    RIP: 0010:_raw_spin_lock_irqsave+0x19/0x40
    Call Trace:
    ib_unregister_mad_agent+0x46/0x2f0 [ib_core]
    IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
    ? __schedule+0x20b/0x560
    srpt_unregister_mad_agent+0x93/0xd0 [ib_srpt]
    srpt_remove_one+0x20/0x150 [ib_srpt]
    remove_client_context+0x88/0xd0 [ib_core]
    bond0: (slave p2p1): link status definitely up, 100000 Mbps full duplex
    disable_device+0x8a/0x160 [ib_core]
    bond0: active interface up!
    (NULL device *): Bonding Info Received: rdev: 000000006c0b8247
    __ib_unregister_device+0x42/0xb0 [ib_core]
    (NULL device *): Master: mode: 4 num_slaves:2
    ib_unregister_device+0x22/0x30 [ib_core]
    (NULL device *): Slave: id: 105069936 name:p2p1 link:0 state:0
    bnxt_re_stopqps_and_ib_uninit+0x83/0x90 [bnxt_re]
    bnxt_re_alloc_lag+0x12e/0x4e0 [bnxt_re]

  Fixes: a42d985bd5b2 ("ib_srpt: Initial SRP Target merge for v3.3-rc1")
  Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
  Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
  Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
  Link: https://lore.kernel.org/r/20230406042549.507328-1-saravanan.vajravel@broadcom.com
  Reviewed-by: Bart Van Assche <bvanassche@acm.org>
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

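  The pitfall is that ib_register_mad_agent() returns an ERR_PTR on failure, so a plain non-NULL test can pass on an error value. A sketch of the guarded teardown (demo_port is a stand-in for the srpt port struct):

    #include <linux/err.h>
    #include <rdma/ib_mad.h>

    struct demo_port {
            struct ib_mad_agent *mad_agent;
    };

    static void demo_unregister_mad_agent(struct demo_port *sport)
    {
            /* reject both NULL and ERR_PTR values */
            if (!IS_ERR_OR_NULL(sport->mad_agent))
                    ib_unregister_mad_agent(sport->mad_agent);

            sport->mad_agent = NULL;
    }
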
* RDMA/cm: Trace icm_send_rej event before the cm state is reset (Mark Zhang, 2023-05-11; 1 file, -1/+2)

  [ Upstream commit bd9de1badac7e4ff6780365d4aa38983f5e2a436 ]

  Trace the icm_send_rej event before the cm state is reset to idle, so that the correct cm state will be logged. For example, when an incoming request is rejected, the old trace log was:

    icm_send_rej: local_id=961102742 remote_id=3829151631 state=IDLE reason=REJ_CONSUMER_DEFINED

  With this patch:

    icm_send_rej: local_id=312971016 remote_id=3778819983 state=MRA_REQ_SENT reason=REJ_CONSUMER_DEFINED

  Fixes: 8dc105befe16 ("RDMA/cm: Add tracepoints to track MAD send operations")
  Signed-off-by: Mark Zhang <markzhang@nvidia.com>
  Link: https://lore.kernel.org/r/20230330072351.481200-1-markzhang@nvidia.com
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/siw: Remove namespace check from siw_netdev_event() (Tetsuo Handa, 2023-05-11; 1 file, -3/+0)

  [ Upstream commit 266e9b3475ba82212062771fdbc40be0e3c06ec8 ]

  syzbot is reporting that siw_netdev_event(NETDEV_UNREGISTER) cannot destroy a siw_device created after unshare(CLONE_NEWNET), due to the net namespace check. It seems that this check was there by mistake and should be removed.

  Reported-by: syzbot <syzbot+5e70d01ee8985ae62a3b@syzkaller.appspotmail.com>
  Link: https://syzkaller.appspot.com/bug?extid=5e70d01ee8985ae62a3b
  Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
  Suggested-by: Leon Romanovsky <leon@kernel.org>
  Fixes: bdcf26bf9b3a ("rdma/siw: network and RDMA core interface")
  Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
  Link: https://lore.kernel.org/r/a44e9ac5-44e2-d575-9e30-02483cc7ffd1@I-love.SAKURA.ne.jp
  Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/erdma: Use fixed hardware page size (Cheng Xu, 2023-05-11; 2 files, -8/+13)

  [ Upstream commit d649c638dc26f3501da510cf7fceb5c15ca54258 ]

  Hardware's page size is 4096, but the kernel's page size may vary. Driver should use hardware's page size when communicating with hardware.

  Fixes: 155055771704 ("RDMA/erdma: Add verbs implementation")
  Link: https://lore.kernel.org/r/20230307102924.70577-2-chengyou@linux.alibaba.com
  Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
  Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/mlx4: Prevent shift wrapping in set_user_sq_size() (Dan Carpenter, 2023-05-11; 1 file, -2/+6)

  [ Upstream commit d50b3c73f1ac20dabc53dc6e9d64ce9c79a331eb ]

  The ucmd->log_sq_bb_count variable is controlled by the user so this shift can wrap. Fix it by using check_shl_overflow() in the same way that it was done in commit 515f60004ed9 ("RDMA/hns: Prevent undefined behavior in hns_roce_set_user_sq_size()").

  Fixes: 839041329fd3 ("IB/mlx4: Sanity check userspace send queue sizes")
  Signed-off-by: Dan Carpenter <error27@gmail.com>
  Link: https://lore.kernel.org/r/a8dfbd1d-c019-4556-930b-bab1ded73b10@kili.mountain
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

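  check_shl_overflow() from <linux/overflow.h> returns true when the shift would overflow the destination type, which makes the guard a one-liner. A sketch with hypothetical names:

    #include <linux/errno.h>
    #include <linux/overflow.h>
    #include <linux/types.h>

    static int demo_user_sq_size(u32 log_sq_bb_count, u32 *entries)
    {
            u32 cnt;

            /* reject user-controlled shifts that would wrap */
            if (check_shl_overflow(1u, log_sq_bb_count, &cnt))
                    return -EINVAL;

            *entries = cnt;
            return 0;
    }
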
* RDMA/rdmavt: Delete unnecessary NULL check (Natalia Petrova, 2023-05-11; 1 file, -2/+0)

  [ Upstream commit b73a0b80c69de77d8d4942abb37066531c0169b2 ]

  There is no need to check 'rdi->qp_dev' for NULL. The field 'qp_dev' is created in rvt_register_device(), which will fail if the 'qp_dev' allocation fails in rvt_driver_qp_init(). Otherwise this pointer doesn't change and is passed to rvt_qp_exit() in the next step.

  Found by Linux Verification Center (linuxtesting.org) with SVACE.

  Fixes: 0acb0cc7ecc1 ("IB/rdmavt: Initialize and teardown of qpn table")
  Signed-off-by: Natalia Petrova <n.petrova@fintech.ru>
  Link: https://lore.kernel.org/r/20230303124408.16685-1-n.petrova@fintech.ru
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>

* RDMA/siw: Fix potential page_array out of range access (Daniil Dulov, 2023-05-11; 1 file, -1/+1)

  [ Upstream commit 271bfcfb83a9f77cbae3d6e1a16e3c14132922f0 ]

  When seg is equal to MAX_ARRAY, the loop should break; otherwise it will result in an out-of-range access.

  Found by Linux Verification Center (linuxtesting.org) with SVACE.

  Fixes: b9be6f18cf9e ("rdma/siw: transmit path")
  Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru>
  Link: https://lore.kernel.org/r/20230227091751.589612-1-d.dulov@aladdin.ru
  Signed-off-by: Leon Romanovsky <leon@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>