summaryrefslogtreecommitdiffstats
path: root/drivers/net
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'mlx5-fixes-2023-05-31' of ↵Jakub Kicinski2023-06-013-14/+12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 fixes 2023-05-31 This series provides bug fixes to mlx5 driver. * tag 'mlx5-fixes-2023-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Read embedded cpu after init bit cleared net/mlx5e: Fix error handling in mlx5e_refresh_tirs net/mlx5: Ensure af_desc.mask is properly initialized net/mlx5: Fix setting of irq->map.index for static IRQ case net/mlx5: Remove rmap also in case dynamic MSIX not supported ==================== Link: https://lore.kernel.org/r/20230601031051.131529-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/mlx5: Read embedded cpu after init bit clearedMoshe Shemesh2023-05-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | During driver load it reads embedded_cpu bit from initialization segment, but the initialization segment is readable only after initialization bit is cleared. Move the call to mlx5_read_embedded_cpu() right after initialization bit cleared. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Fixes: 591905ba9679 ("net/mlx5: Introduce Mellanox SmartNIC and modify page management logic") Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Fix error handling in mlx5e_refresh_tirsSaeed Mahameed2023-05-311-7/+4
| | | | | | | | | | | | | | | | | | | | Allocation failure is outside the critical lock section and should return immediately rather than jumping to the unlock section. Also unlock as soon as required and remove the now redundant jump label. Fixes: 80a2a9026b24 ("net/mlx5e: Add a lock on tir list") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Ensure af_desc.mask is properly initializedChuck Lever2023-05-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ 9.837087] mlx5_core 0000:02:00.0: firmware version: 16.35.2000 [ 9.843126] mlx5_core 0000:02:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link) [ 10.311515] mlx5_core 0000:02:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps [ 10.321948] mlx5_core 0000:02:00.0: E-Switch: Total vports 2, per vport: max uc(128) max mc(2048) [ 10.344324] mlx5_core 0000:02:00.0: mlx5_pcie_event:301:(pid 88): PCIe slot advertised sufficient power (27W). [ 10.354339] BUG: unable to handle page fault for address: ffffffff8ff0ade0 [ 10.361206] #PF: supervisor read access in kernel mode [ 10.366335] #PF: error_code(0x0000) - not-present page [ 10.371467] PGD 81ec39067 P4D 81ec39067 PUD 81ec3a063 PMD 114b07063 PTE 800ffff7e10f5062 [ 10.379544] Oops: 0000 [#1] PREEMPT SMP PTI [ 10.383721] CPU: 0 PID: 117 Comm: kworker/0:6 Not tainted 6.3.0-13028-g7222f123c983 #1 [ 10.391625] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017 [ 10.398750] Workqueue: events work_for_cpu_fn [ 10.403108] RIP: 0010:__bitmap_or+0x10/0x26 [ 10.407286] Code: 85 c0 0f 95 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 89 c9 31 c0 48 83 c1 3f 48 c1 e9 06 39 c> [ 10.426024] RSP: 0000:ffffb45a0078f7b0 EFLAGS: 00010097 [ 10.431240] RAX: 0000000000000000 RBX: ffffffff8ff0adc0 RCX: 0000000000000004 [ 10.438365] RDX: ffff9156801967d0 RSI: ffffffff8ff0ade0 RDI: ffff9156801967b0 [ 10.445489] RBP: ffffb45a0078f7e8 R08: 0000000000000030 R09: 0000000000000000 [ 10.452613] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000ec [ 10.459737] R13: ffffffff8ff0ade0 R14: 0000000000000001 R15: 0000000000000020 [ 10.466862] FS: 0000000000000000(0000) GS:ffff9165bfc00000(0000) knlGS:0000000000000000 [ 10.474936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 10.480674] CR2: ffffffff8ff0ade0 CR3: 00000001011ae003 CR4: 00000000003706f0 [ 10.487800] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 10.494922] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 10.502046] Call Trace: [ 10.504493] <TASK> [ 10.506589] ? matrix_alloc_area.constprop.0+0x43/0x9a [ 10.511729] ? prepare_namespace+0x84/0x174 [ 10.515914] irq_matrix_reserve_managed+0x56/0x10c [ 10.520699] x86_vector_alloc_irqs+0x1d2/0x31e [ 10.525146] irq_domain_alloc_irqs_hierarchy+0x39/0x3f [ 10.530284] irq_domain_alloc_irqs_parent+0x1a/0x2a [ 10.535155] intel_irq_remapping_alloc+0x59/0x5e9 [ 10.539859] ? kmem_cache_debug_flags+0x11/0x26 [ 10.544383] ? __radix_tree_lookup+0x39/0xb9 [ 10.548649] irq_domain_alloc_irqs_hierarchy+0x39/0x3f [ 10.553779] irq_domain_alloc_irqs_parent+0x1a/0x2a [ 10.558650] msi_domain_alloc+0x8c/0x120 [ 10.567697] irq_domain_alloc_irqs_locked+0x11d/0x286 [ 10.572741] __irq_domain_alloc_irqs+0x72/0x93 [ 10.577179] __msi_domain_alloc_irqs+0x193/0x3f1 [ 10.581789] ? __xa_alloc+0xcf/0xe2 [ 10.585273] msi_domain_alloc_irq_at+0xa8/0xfe [ 10.589711] pci_msix_alloc_irq_at+0x47/0x5c The crash is due to matrix_alloc_area() attempting to access per-CPU memory for CPUs that are not present on the system. The CPU mask passed into reserve_managed_vector() via it's @irqd parameter is corrupted because it contains uninitialized stack data. Fixes: bbac70c74183 ("net/mlx5: Use newer affinity descriptor") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Fix setting of irq->map.index for static IRQ caseNiklas Schnelle2023-05-311-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When dynamic IRQ allocation is not supported all IRQs are allocated up front in mlx5_irq_table_create() instead of dynamically as part of mlx5_irq_alloc(). In the latter dynamic case irq->map.index is set via the mapping returned by pci_msix_alloc_irq_at(). In the static case and prior to commit 1da438c0ae02 ("net/mlx5: Fix indexing of mlx5_irq") irq->map.index was set in mlx5_irq_alloc() twice once initially to 0 and then to the requested index before storing in the xarray. After this commit it is only set to 0 which breaks all other IRQ mappings. Fix this by setting irq->map.index to the requested index together with irq->map.virq and improve the related comment to make it clearer which cases it deals with. Cc: Chuck Lever III <chuck.lever@oracle.com> Tested-by: Mark Brown <broonie@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Eli Cohen <elic@nvidia.com> Fixes: 1da438c0ae02 ("net/mlx5: Fix indexing of mlx5_irq") Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Cédric Le Goater <clg@redhat.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Remove rmap also in case dynamic MSIX not supportedShay Drory2023-05-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mlx5 add IRQs to rmap upon MSIX request, and mlx5 remove rmap from MSIX only if msi_map.index is populated. However, msi_map.index is populated only when dynamic MSIX is supported. This results in freeing IRQs without removing them from rmap, which triggers the bellow WARN_ON[1]. rmap is a feature which have no relation to dynamic MSIX. Hence, remove the check of msi_map.index when removing IRQ from rmap. [1] [ 200.307160 ] WARNING: CPU: 20 PID: 1702 at kernel/irq/manage.c:2034 free_irq+0x2ac/0x358 [ 200.316990 ] CPU: 20 PID: 1702 Comm: modprobe Not tainted 6.4.0-rc3_for_upstream_min_debug_2023_05_24_14_02 #1 [ 200.318939 ] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 200.321659 ] pc : free_irq+0x2ac/0x358 [ 200.322400 ] lr : free_irq+0x20/0x358 [ 200.337865 ] Call trace: [ 200.338360 ] free_irq+0x2ac/0x358 [ 200.339029 ] irq_release+0x58/0xd0 [mlx5_core] [ 200.340093 ] mlx5_irqs_release_vectors+0x80/0xb0 [mlx5_core] [ 200.341344 ] destroy_comp_eqs+0x120/0x170 [mlx5_core] [ 200.342469 ] mlx5_eq_table_destroy+0x1c/0x38 [mlx5_core] [ 200.343645 ] mlx5_unload+0x8c/0xc8 [mlx5_core] [ 200.344652 ] mlx5_uninit_one+0x78/0x118 [mlx5_core] [ 200.345745 ] remove_one+0x80/0x108 [mlx5_core] [ 200.346752 ] pci_device_remove+0x40/0xd8 [ 200.347554 ] device_remove+0x50/0x88 [ 200.348272 ] device_release_driver_internal+0x1c4/0x228 [ 200.349312 ] driver_detach+0x54/0xa0 [ 200.350030 ] bus_remove_driver+0x74/0x100 [ 200.350833 ] driver_unregister+0x34/0x68 [ 200.351619 ] pci_unregister_driver+0x28/0xa0 [ 200.352476 ] mlx5_cleanup+0x14/0x2210 [mlx5_core] [ 200.353536 ] __arm64_sys_delete_module+0x190/0x2e8 [ 200.354495 ] el0_svc_common.constprop.0+0x6c/0x1d0 [ 200.355455 ] do_el0_svc+0x38/0x98 [ 200.356122 ] el0_svc+0x1c/0x80 [ 200.356739 ] el0t_64_sync_handler+0xb4/0x130 [ 200.357604 ] el0t_64_sync+0x174/0x178 [ 200.358345 ] ---[ end trace 0000000000000000 ]--- Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* | ice: recycle/free all of the fragments from multi-buffer frameMaciej Fijalkowski2023-06-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ice driver caches next_to_clean value at the beginning of ice_clean_rx_irq() in order to remember the first buffer that has to be freed/recycled after main Rx processing loop. The end boundary is indicated by first descriptor of frame that Rx processing loop has ended its duties. Note that if mentioned loop ended in the middle of gathering multi-buffer frame, next_to_clean would be pointing to the descriptor in the middle of the frame BUT freeing/recycling stage will stop at the first descriptor. This means that next iteration of ice_clean_rx_irq() will miss the (first_desc, next_to_clean - 1) entries. When running various 9K MTU workloads, such splats were observed: [ 540.780716] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 540.787787] #PF: supervisor read access in kernel mode [ 540.793002] #PF: error_code(0x0000) - not-present page [ 540.798218] PGD 0 P4D 0 [ 540.800801] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 540.805231] CPU: 18 PID: 3984 Comm: xskxceiver Tainted: G W 6.3.0-rc7+ #96 [ 540.813619] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [ 540.824209] RIP: 0010:ice_clean_rx_irq+0x2b6/0xf00 [ice] [ 540.829678] Code: 74 24 10 e9 aa 00 00 00 8b 55 78 41 31 57 10 41 09 c4 4d 85 ff 0f 84 83 00 00 00 49 8b 57 08 41 8b 4f 1c 65 8b 35 1a fa 4b 3f <48> 8b 02 48 c1 e8 3a 39 c6 0f 85 a2 00 00 00 f6 42 08 02 0f 85 98 [ 540.848717] RSP: 0018:ffffc9000f42fc50 EFLAGS: 00010282 [ 540.854029] RAX: 0000000000000004 RBX: 0000000000000002 RCX: 000000000000fffe [ 540.861272] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff [ 540.868519] RBP: ffff88984a05ac00 R08: 0000000000000000 R09: dead000000000100 [ 540.875760] R10: ffff88983fffcd00 R11: 000000000010f2b8 R12: 0000000000000004 [ 540.883008] R13: 0000000000000003 R14: 0000000000000800 R15: ffff889847a10040 [ 540.890253] FS: 00007f6ddf7fe640(0000) GS:ffff88afdf800000(0000) knlGS:0000000000000000 [ 540.898465] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 540.904299] CR2: 0000000000000000 CR3: 000000010d3da001 CR4: 00000000007706e0 [ 540.911542] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 540.918789] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 540.926032] PKRU: 55555554 [ 540.928790] Call Trace: [ 540.931276] <TASK> [ 540.933418] ice_napi_poll+0x4ca/0x6d0 [ice] [ 540.937804] ? __pfx_ice_napi_poll+0x10/0x10 [ice] [ 540.942716] napi_busy_loop+0xd7/0x320 [ 540.946537] xsk_recvmsg+0x143/0x170 [ 540.950178] sock_recvmsg+0x99/0xa0 [ 540.953729] __sys_recvfrom+0xa8/0x120 [ 540.957543] ? do_futex+0xbd/0x1d0 [ 540.961008] ? __x64_sys_futex+0x73/0x1d0 [ 540.965083] __x64_sys_recvfrom+0x20/0x30 [ 540.969155] do_syscall_64+0x38/0x90 [ 540.972796] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 540.977934] RIP: 0033:0x7f6de5f27934 To fix this, set cached_ntc to first_desc so that at the end, when freeing/recycling buffers, descriptors from first to ntc are not missed. Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20230531154457.3216621-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: phy: mxl-gpy: extend interrupt fix to all impacted variantsXu Liang2023-06-011-13/+3
| | | | | | | | | | | | | | | | | | | | | | The interrupt fix in commit 97a89ed101bb should be applied on all variants of GPY2xx PHY and GPY115C. Fixes: 97a89ed101bb ("net: phy: mxl-gpy: disable interrupts on GPY215 by default") Signed-off-by: Xu Liang <lxu@maxlinear.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230531074822.39136-1-lxu@maxlinear.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: renesas: rswitch: Fix return value in error path of xmitYoshihiro Shimoda2023-06-011-1/+1
| | | | | | | | | | | | | | | | | | | | Fix return value in the error path of rswitch_start_xmit(). If TX queues are full, this function should return NETDEV_TX_BUSY. Fixes: 3590918b5d07 ("net: ethernet: renesas: Add support for "Ethernet Switch"") Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Link: https://lore.kernel.org/r/20230529073817.1145208-1-yoshihiro.shimoda.uh@renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: dsa: mv88e6xxx: Increase wait after reset deactivationAndreas Svensson2023-06-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A switch held in reset by default needs to wait longer until we can reliably detect it. An issue was observed when testing on the Marvell 88E6393X (Link Street). The driver failed to detect the switch on some upstarts. Increasing the wait time after reset deactivation solves this issue. The updated wait time is now also the same as the wait time in the mv88e6xxx_hardware_reset function. Fixes: 7b75e49de424 ("net: dsa: mv88e6xxx: wait after reset deactivation") Signed-off-by: Andreas Svensson <andreas.svensson@axis.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230530145223.1223993-1-andreas.svensson@axis.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | net: ipa: Use correct value for IPA_STATUS_SIZEBert Karwatzki2023-06-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | IPA_STATUS_SIZE was introduced in commit b8dc7d0eea5a as a replacement for the size of the removed struct ipa_status which had size sizeof(__le32[8]). Use this value as IPA_STATUS_SIZE. Fixes: b8dc7d0eea5a ("net: ipa: stop using sizeof(status)") Signed-off-by: Bert Karwatzki <spasswolf@web.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230531103618.102608-1-spasswolf@web.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | sfc: fix error unwinds in TC offloadEdward Cree2023-05-311-15/+12
|/ | | | | | | | | | | | | | | | | | Failure ladders weren't exactly unwinding what the function had done up to that point; most seriously, when we encountered an already offloaded rule, the failure path tried to remove the new rule from the hashtable, which would in fact remove the already-present 'old' rule (since it has the same key) from the table, and leak its resources. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/r/202305200745.xmIlkqjH-lkp@intel.com/ Fixes: d902e1a737d4 ("sfc: bare bones TC offload on EF100") Fixes: 17654d84b47c ("sfc: add offloading of 'foreign' TC (decap) rules") Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230530202527.53115-1-edward.cree@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: mana: Fix perf regression: remove rx_cqes, tx_cqes countersHaiyang Zhang2023-05-302-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The apc->eth_stats.rx_cqes is one per NIC (vport), and it's on the frequent and parallel code path of all queues. So, r/w into this single shared variable by many threads on different CPUs creates a lot caching and memory overhead, hence perf regression. And, it's not accurate due to the high volume concurrent r/w. For example, a workload is iperf with 128 threads, and with RPS enabled. We saw perf regression of 25% with the previous patch adding the counters. And this patch eliminates the regression. Since the error path of mana_poll_rx_cq() already has warnings, so keeping the counter and convert it to a per-queue variable is not necessary. So, just remove this counter from this high frequency code path. Also, remove the tx_cqes counter for the same reason. We have warnings & other counters for errors on that path, and don't need to count every normal cqe processing. Cc: stable@vger.kernel.org Fixes: bd7fc6e1957c ("net: mana: Add new MANA VF performance counters for easier troubleshooting") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/1685115537-31675-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* net: usb: qmi_wwan: Set DTR quirk for BroadMobi BM818Sebastian Krzyszkowiak2023-05-291-1/+1
| | | | | | | | | | | BM818 is based on Qualcomm MDM9607 chipset. Fixes: 9a07406b00cd ("net: usb: qmi_wwan: Add the BroadMobi BM818 card") Cc: stable@vger.kernel.org Signed-off-by: Sebastian Krzyszkowiak <sebastian.krzyszkowiak@puri.sm> Acked-by: Bjørn Mork <bjorn@mork.no> Link: https://lore.kernel.org/r/20230526-bm818-dtr-v1-1-64bbfa6ba8af@puri.sm Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* amd-xgbe: fix the false linkup in xgbe_phy_statusRaju Rangoju2023-05-261-3/+9
| | | | | | | | | | | | | | | | | | | In the event of a change in XGBE mode, the current auto-negotiation needs to be reset and the AN cycle needs to be re-triggerred. However, the current code ignores the return value of xgbe_set_mode(), leading to false information as the link is declared without checking the status register. Fix this by propagating the mode switch status information to xgbe_phy_status(). Fixes: e57f7a3feaef ("amd-xgbe: Prepare for working with more than one type of phy") Co-developed-by: Sudheesh Mavila <sudheesh.mavila@amd.com> Signed-off-by: Sudheesh Mavila <sudheesh.mavila@amd.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'mlx5-fixes-2023-05-24' of ↵Jakub Kicinski2023-05-2517-167/+241
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 fixes 2023-05-24 This series includes bug fixes for the mlx5 driver. * tag 'mlx5-fixes-2023-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: Documentation: net/mlx5: Wrap notes in admonition blocks Documentation: net/mlx5: Add blank line separator before numbered lists Documentation: net/mlx5: Use bullet and definition lists for vnic counters description Documentation: net/mlx5: Wrap vnic reporter devlink commands in code blocks net/mlx5: Fix check for allocation failure in comp_irqs_request_pci() net/mlx5: DR, Add missing mutex init/destroy in pattern manager net/mlx5e: Move Ethernet driver debugfs to profile init callback net/mlx5e: Don't attach netdev profile while handling internal error net/mlx5: Fix post parse infra to only parse every action once net/mlx5e: Use query_special_contexts cmd only once per mdev net/mlx5: fw_tracer, Fix event handling net/mlx5: SF, Drain health before removing device net/mlx5: Drain health before unregistering devlink net/mlx5e: Do not update SBCM when prio2buffer command is invalid net/mlx5e: Consider internal buffers size in port buffer calculations net/mlx5e: Prevent encap offload when neigh update is running net/mlx5e: Extract remaining tunnel encap code to dedicated file ==================== Link: https://lore.kernel.org/r/20230525034847.99268-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/mlx5: Fix check for allocation failure in comp_irqs_request_pci()Dan Carpenter2023-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | This function accidentally dereferences "cpus" instead of returning directly. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202305200354.KV3jU94w-lkp@intel.com/ Fixes: b48a0f72bc3e ("net/mlx5: Refactor completion irq request/release code") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: DR, Add missing mutex init/destroy in pattern managerYevgeny Kliteynik2023-05-241-0/+3
| | | | | | | | | | | | | | | | | | Add missing mutex init/destroy as caught by the lock's debug warning: DEBUG_LOCKS_WARN_ON(lock->magic != lock) Fixes: da5d0027d666 ("net/mlx5: DR, Add cache for modify header pattern") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Move Ethernet driver debugfs to profile init callbackJianbo Liu2023-05-242-5/+11
| | | | | | | | | | | | | | | | | | | | | | As priv->dfs_root is cleared, and therefore missed, when change eswitch mode, move the creation of the root debugfs to the init callback of mlx5e_nic_profile and mlx5e_uplink_rep_profile, and the destruction to the cleanup callback for symmeter. Fixes: 288eca60cc31 ("net/mlx5e: Add Ethernet driver debugfs") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Don't attach netdev profile while handling internal errorDmytro Linkin2023-05-241-4/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As part of switchdev mode disablement, driver changes port netdevice profile from uplink to nic. If this process is triggered by health recovery flow (PCI reset, for ex.) profile attach would fail because all fw commands aborted when internal error flag is set. As a result, nic netdevice profile is not attached and driver fails to rollback to uplink profile, which leave driver in broken state and cause crash later. To handle broken state do netdevice profile initialization only instead of full attachment and release mdev resources on driver suspend as expected. Actual netdevice attachment is done during driver load. Fixes: c4d7eb57687f ("net/mxl5e: Add change profile method") Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Fix post parse infra to only parse every action onceVlad Buslov2023-05-243-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Caller of mlx5e_tc_act_post_parse() needs it to parse only the subset of actions starting after previous split and ending at the current action. However, that range is not provided as arguments and mlx5e_tc_act_post_parse() uses generic flow_action_for_each() that iterates over all flow actions. Not only this is redundant, it also causes a bug when mlx5e_tc_act->post_parse() callback is not idempotent since it will be called for every split. For example, ct action tc_act_post_parse_ct() callback obtains a reference to mlx5_ct_ft instance and calling it several times during parsing stage will cause reference counter imbalance. Fix the issue by providing a proper action range of the current split subset to mlx5e_tc_act_post_parse() and only calling mlx5e_tc_act->post_parse() for actions inside the subset range. Fixes: 8300f225268b ("net/mlx5e: Create new flow attr for multi table actions") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Use query_special_contexts cmd only once per mdevDragos Tatulea2023-05-243-21/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't query the firmware so many times (num rqs * num wqes * wqe frags) because it slows down linearly the interface creation time when the product is larger. Do it only once per mdev and store the result in mlx5e_param. Due to helper function being called from different files, move it to an appropriate location. Rename the function with a proper prefix and add a small cleanup. This fix applies only for legacy rq. Fixes: 1b1e4868836a ("net/mlx5e: Use query_special_contexts for mkeys") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: fw_tracer, Fix event handlingShay Drory2023-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | mlx5 driver needs to parse traces with event_id inside the range of first_string_trace and num_string_trace. However, mlx5 is parsing all events with event_id >= first_string_trace. Fix it by checking for the correct range. Fixes: c71ad41ccb0c ("net/mlx5: FW tracer, events handling") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: SF, Drain health before removing deviceShay Drory2023-05-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | There is no point in recovery during device removal. Also, if health work started need to wait for it to avoid races and NULL pointer access. Hence, drain health WQ before removing device. Fixes: 1958fc2f0712 ("net/mlx5: SF, Add auxiliary device driver") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Drain health before unregistering devlinkShay Drory2023-05-241-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | mlx5 health mechanism is using devlink APIs, which are using devlink notify APIs. After the cited patch, using devlink notify APIs after devlink is unregistered triggers a WARN_ON(). Hence, drain health WQ before devlink is unregistered. Fixes: cf530217408e ("devlink: Notify users when objects are accessible") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Do not update SBCM when prio2buffer command is invalidMaher Sanalla2023-05-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The shared buffer pools configuration which are stored in the SBCM register are updated when the user changes the prio2buffer mapping. However, in case the user desired prio2buffer change is invalid, which can occur due to mapping a lossless priority to a not large enough buffer, the SBCM update should not be performed, as the user command is failed. Thus, Perform the SBCM update only after xoff threshold calculation is performed and the user prio2buffer mapping is validated. Fixes: a440030d8946 ("net/mlx5e: Update shared buffer along with device buffer changes") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Consider internal buffers size in port buffer calculationsMaher Sanalla2023-05-243-21/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when a user triggers a change in port buffer headroom (buffers 0-7), the driver checks that the requested headroom does not exceed the total port buffer size. However, this check does not take into account the internal buffers (buffers 8-9), which are also part of the total port buffer. This can result in treating invalid port buffer change requests as valid, causing unintended changes to the shared buffer. To address this, include the internal buffers size in the calculation of available port buffer space which ensures that port buffer requests do not exceed the correct limit. Furthermore, remove internal buffers (8-9) size from the total_size calculation as these buffers are reserved for internal use and are not exposed to the user. While at it, add verbosity to the debug prints in mlx5e_port_query_buffer() function to ease future debugging. Fixes: ecdf2dadee8e ("net/mlx5e: Receive buffer support for DCBX") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Prevent encap offload when neigh update is runningChris Mi2023-05-241-17/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cited commit adds a compeletion to remove dependency on rtnl lock. But it causes a deadlock for multiple encapsulations: crash> bt ffff8aece8a64000 PID: 1514557 TASK: ffff8aece8a64000 CPU: 3 COMMAND: "tc" #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418 #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898 #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8 #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core] #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core] #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core] #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core] #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core] #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core] #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core] #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8 #13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower] #14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower] #15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047 #16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31 #17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853 #18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835 #19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27 #20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245 #21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482 #22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a #23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2 #24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2 #25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f #26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8 #27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c crash> bt 0xffff8aeb07544000 PID: 1110766 TASK: ffff8aeb07544000 CPU: 0 COMMAND: "kworker/u20:9" #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45 #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418 #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88 #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core] #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core] #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core] #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012 #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f After the first encap is attached, flow will be added to encap entry's flows list. If neigh update is running at this time, the following encaps of the flow can't hold the encap_tbl_lock and sleep. If neigh update thread is waiting for that flow's init_done, deadlock happens. Fix it by holding lock outside of the for loop. If neigh update is running, prevent encap flows from offloading. Since the lock is held outside of the for loop, concurrent creation of encap entries is not allowed. So remove unnecessary wait_for_completion call for res_ready. Fixes: 95435ad7999b ("net/mlx5e: Only access fully initialized flows in neigh update") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Extract remaining tunnel encap code to dedicated fileChris Mi2023-05-243-87/+94
| | | | | | | | | | | | | | | | | | | | | | Move set_encap_dests() and clean_encap_dests() to the tunnel encap dedicated file. And rename them to mlx5e_tc_tun_encap_dests_set() and mlx5e_tc_tun_encap_dests_unset(). No functional change in this patch. It is needed in the next patch. Signed-off-by: Chris Mi <cmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* | net: stmmac: fix call trace when stmmac_xdp_xmit() is invokedWei Fang2023-05-252-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We encountered a kernel call trace issue which was related to ndo_xdp_xmit callback on our i.MX8MP platform. The reproduce steps show as follows. 1. The FEC port (eth0) connects to a PC port, and the PC uses pktgen_sample03_burst_single_flow.sh to generate packets and send these packets to the FEC port. Notice that the script must be executed before step 2. 2. Run the "./xdp_redirect eth0 eth1" command on i.MX8MP, the eth1 interface is the dwmac. Then there will be a call trace issue soon. Please see the log for more details. The root cause is that the NETDEV_XDP_ACT_NDO_XMIT feature is enabled by default, so when the step 2 command is exexcuted and packets have already been sent to eth0, the stmmac_xdp_xmit() starts running before the stmmac_xdp_set_prog() finishes. To resolve this issue, we disable the NETDEV_XDP_ACT_NDO_XMIT feature by default and turn on/off this feature when the bpf program is installed/uninstalled which just like the other ethernet drivers. Call Trace log: [ 306.311271] ------------[ cut here ]------------ [ 306.315910] WARNING: CPU: 0 PID: 15 at lib/timerqueue.c:55 timerqueue_del+0x68/0x70 [ 306.323590] Modules linked in: [ 306.326654] CPU: 0 PID: 15 Comm: ksoftirqd/0 Not tainted 6.4.0-rc1+ #37 [ 306.333277] Hardware name: NXP i.MX8MPlus EVK board (DT) [ 306.338591] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 306.345561] pc : timerqueue_del+0x68/0x70 [ 306.349577] lr : __remove_hrtimer+0x5c/0xa0 [ 306.353777] sp : ffff80000b7c3920 [ 306.357094] x29: ffff80000b7c3920 x28: 0000000000000000 x27: 0000000000000001 [ 306.364244] x26: ffff80000a763a40 x25: ffff0000d0285a00 x24: 0000000000000001 [ 306.371390] x23: 0000000000000001 x22: ffff000179389a40 x21: 0000000000000000 [ 306.378537] x20: ffff000179389aa0 x19: ffff0000d2951308 x18: 0000000000001000 [ 306.385686] x17: f1d3000000000000 x16: 00000000c39c1000 x15: 55e99bbe00001a00 [ 306.392835] x14: 09000900120aa8c0 x13: e49af1d300000000 x12: 000000000000c39c [ 306.399987] x11: 100055e99bbe0000 x10: ffff8000090b1048 x9 : ffff8000081603fc [ 306.407133] x8 : 000000000000003c x7 : 000000000000003c x6 : 0000000000000001 [ 306.414284] x5 : ffff0000d2950980 x4 : 0000000000000000 x3 : 0000000000000000 [ 306.421432] x2 : 0000000000000001 x1 : ffff0000d2951308 x0 : ffff0000d2951308 [ 306.428585] Call trace: [ 306.431035] timerqueue_del+0x68/0x70 [ 306.434706] __remove_hrtimer+0x5c/0xa0 [ 306.438549] hrtimer_start_range_ns+0x2bc/0x370 [ 306.443089] stmmac_xdp_xmit+0x174/0x1b0 [ 306.447021] bq_xmit_all+0x194/0x4b0 [ 306.450612] __dev_flush+0x4c/0x98 [ 306.454024] xdp_do_flush+0x18/0x38 [ 306.457522] fec_enet_rx_napi+0x6c8/0xc68 [ 306.461539] __napi_poll+0x40/0x220 [ 306.465038] net_rx_action+0xf8/0x240 [ 306.468707] __do_softirq+0x128/0x3a8 [ 306.472378] run_ksoftirqd+0x40/0x58 [ 306.475961] smpboot_thread_fn+0x1c4/0x288 [ 306.480068] kthread+0x124/0x138 [ 306.483305] ret_from_fork+0x10/0x20 [ 306.486889] ---[ end trace 0000000000000000 ]--- Fixes: 66c0e13ad236 ("drivers: net: turn on XDP features") Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230524125714.357337-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: mellanox: mlxbf_gige: Fix skb_panic splat under memory pressureThomas Bogendoerfer2023-05-251-6/+7
| | | | | | | | | | | | | | | | | | | | | | Do skb_put() after a new skb has been successfully allocated otherwise the reused skb leads to skb_panics or incorrect packet sizes. Fixes: f92e1869d74e ("Add Mellanox BlueField Gigabit Ethernet driver") Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230524194908.147145-1-tbogendoerfer@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: phy: mscc: enable VSC8501/2 RGMII RX clockDavid Epping2023-05-242-26/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By default the VSC8501 and VSC8502 RGMII/GMII/MII RX_CLK output is disabled. To allow packet forwarding towards the MAC it needs to be enabled. For other PHYs supported by this driver the clock output is enabled by default. Fixes: d3169863310d ("net: phy: mscc: add support for VSC8502") Signed-off-by: David Epping <david.epping@missinglinkelectronics.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: phy: mscc: remove unnecessary phydev lockingDavid Epping2023-05-241-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | Holding the struct phy_device (phydev) lock is unnecessary when accessing phydev->interface in the PHY driver .config_init method, which is the only place that vsc85xx_rgmii_set_skews() is called from. The phy_modify_paged() function implements required MDIO bus level locking, which can not be achieved by a phydev lock. Signed-off-by: David Epping <david.epping@missinglinkelectronics.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: phy: mscc: add support for VSC8501David Epping2023-05-242-0/+26
| | | | | | | | | | | | | | | | | | | | The VSC8501 PHY can use the same driver implementation as the VSC8502. Adding the PHY ID and copying the handler functions of VSC8502 is sufficient to operate it. Signed-off-by: David Epping <david.epping@missinglinkelectronics.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: phy: mscc: add VSC8502 to MODULE_DEVICE_TABLEDavid Epping2023-05-241-0/+1
|/ | | | | | | | | | The mscc driver implements support for VSC8502, so its ID should be in the MODULE_DEVICE_TABLE for automatic loading. Signed-off-by: David Epping <david.epping@missinglinkelectronics.com> Fixes: d3169863310d ("net: phy: mscc: add support for VSC8502") Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge tag 'mlx5-fixes-2023-05-22' of ↵David S. Miller2023-05-2416-58/+173
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-fixes-2023-05-22 This series provides bug fixes for the mlx5 driver. Please pull and let me know if there is any problem. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/mlx5: Fix indexing of mlx5_irqShay Drory2023-05-221-4/+5
| | | | | | | | | | | | | | | | | | | | | | After the cited patch, mlx5_irq xarray index can be different then mlx5_irq MSIX table index. Fix it by storing both mlx5_irq xarray index and MSIX table index. Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Fix irq affinity managementShay Drory2023-05-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | The cited patch deny the user of changing the affinity of mlx5 irqs, which break backward compatibility. Hence, allow the user to change the affinity of mlx5 irqs. Fixes: bbac70c74183 ("net/mlx5: Use newer affinity descriptor") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Free irqs only on shutdown callbackShay Drory2023-05-223-1/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever a shutdown is invoked, free irqs only and keep mlx5_irq synthetic wrapper intact in order to avoid use-after-free on system shutdown. for example: ================================================================== BUG: KASAN: use-after-free in _find_first_bit+0x66/0x80 Read of size 8 at addr ffff88823fc0d318 by task kworker/u192:0/13608 CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G B W O 6.1.21-cloudflare-kasan-2023.3.21 #1 Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021 Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core] Call Trace: <TASK> dump_stack_lvl+0x34/0x48 print_report+0x170/0x473 ? _find_first_bit+0x66/0x80 kasan_report+0xad/0x130 ? _find_first_bit+0x66/0x80 _find_first_bit+0x66/0x80 mlx5e_open_channels+0x3c5/0x3a10 [mlx5_core] ? console_unlock+0x2fa/0x430 ? _raw_spin_lock_irqsave+0x8d/0xf0 ? _raw_spin_unlock_irqrestore+0x42/0x80 ? preempt_count_add+0x7d/0x150 ? __wake_up_klogd.part.0+0x7d/0xc0 ? vprintk_emit+0xfe/0x2c0 ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core] ? dev_attr_show.cold+0x35/0x35 ? devlink_health_do_dump.part.0+0x174/0x340 ? devlink_health_report+0x504/0x810 ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core] ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core] ? process_one_work+0x680/0x1050 mlx5e_safe_switch_params+0x156/0x220 [mlx5_core] ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core] ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core] mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core] ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0 devlink_health_reporter_recover+0xa6/0x1f0 devlink_health_report+0x2f7/0x810 ? vsnprintf+0x854/0x15e0 mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core] ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core] ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core] ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core] ? newidle_balance+0x9b7/0xe30 ? psi_group_change+0x6a7/0xb80 ? mutex_lock+0x96/0xf0 ? __mutex_lock_slowpath+0x10/0x10 mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core] process_one_work+0x680/0x1050 worker_thread+0x5a0/0xeb0 ? process_one_work+0x1050/0x1050 kthread+0x2a2/0x340 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x22/0x30 </TASK> Freed by task 1: kasan_save_stack+0x23/0x50 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x40 ____kasan_slab_free+0x169/0x1d0 slab_free_freelist_hook+0xd2/0x190 __kmem_cache_free+0x1a1/0x2f0 irq_pool_free+0x138/0x200 [mlx5_core] mlx5_irq_table_destroy+0xf6/0x170 [mlx5_core] mlx5_core_eq_free_irqs+0x74/0xf0 [mlx5_core] shutdown+0x194/0x1aa [mlx5_core] pci_device_shutdown+0x75/0x120 device_shutdown+0x35c/0x620 kernel_restart+0x60/0xa0 __do_sys_reboot+0x1cb/0x2c0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x4b/0xb5 The buggy address belongs to the object at ffff88823fc0d300 which belongs to the cache kmalloc-192 of size 192 The buggy address is located 24 bytes inside of 192-byte region [ffff88823fc0d300, ffff88823fc0d3c0) The buggy address belongs to the physical page: page:0000000010139587 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x23fc0c head:0000000010139587 order:1 compound_mapcount:0 compound_pincount:0 flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff) raw: 002ffff800010200 0000000000000000 dead000000000122 ffff88810004ca00 raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88823fc0d200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88823fc0d280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc >ffff88823fc0d300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88823fc0d380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff88823fc0d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== general protection fault, probably for non-canonical address 0xdffffc005c40d7ac: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: probably user-memory-access in range [0x00000002e206bd60-0x00000002e206bd67] CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G B W O 6.1.21-cloudflare-kasan-2023.3.21 #1 Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021 Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core] RIP: 0010:__alloc_pages+0x141/0x5c0 Call Trace: <TASK> ? sysvec_apic_timer_interrupt+0xa0/0xc0 ? asm_sysvec_apic_timer_interrupt+0x16/0x20 ? __alloc_pages_slowpath.constprop.0+0x1ec0/0x1ec0 ? _raw_spin_unlock_irqrestore+0x3d/0x80 __kmalloc_large_node+0x80/0x120 ? kvmalloc_node+0x4e/0x170 __kmalloc_node+0xd4/0x150 kvmalloc_node+0x4e/0x170 mlx5e_open_channels+0x631/0x3a10 [mlx5_core] ? console_unlock+0x2fa/0x430 ? _raw_spin_lock_irqsave+0x8d/0xf0 ? _raw_spin_unlock_irqrestore+0x42/0x80 ? preempt_count_add+0x7d/0x150 ? __wake_up_klogd.part.0+0x7d/0xc0 ? vprintk_emit+0xfe/0x2c0 ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core] ? dev_attr_show.cold+0x35/0x35 ? devlink_health_do_dump.part.0+0x174/0x340 ? devlink_health_report+0x504/0x810 ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core] ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core] ? process_one_work+0x680/0x1050 mlx5e_safe_switch_params+0x156/0x220 [mlx5_core] ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core] ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core] mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core] ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0 devlink_health_reporter_recover+0xa6/0x1f0 devlink_health_report+0x2f7/0x810 ? vsnprintf+0x854/0x15e0 mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core] ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core] ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core] ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core] ? newidle_balance+0x9b7/0xe30 ? psi_group_change+0x6a7/0xb80 ? mutex_lock+0x96/0xf0 ? __mutex_lock_slowpath+0x10/0x10 mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core] process_one_work+0x680/0x1050 worker_thread+0x5a0/0xeb0 ? process_one_work+0x1050/0x1050 kthread+0x2a2/0x340 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x22/0x30 </TASK> ---[ end trace 0000000000000000 ]--- RIP: 0010:__alloc_pages+0x141/0x5c0 Code: e0 39 a3 96 89 e9 b8 22 01 32 01 83 e1 0f 48 89 fa 01 c9 48 c1 ea 03 d3 f8 83 e0 03 89 44 24 6c 48 b8 00 00 00 00 00 fc ff df <80> 3c 02 00 0f 85 fc 03 00 00 89 e8 4a 8b 14 f5 e0 39 a3 96 4c 89 RSP: 0018:ffff888251f0f438 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 1ffff1104a3e1e8b RCX: 0000000000000000 RDX: 000000005c40d7ac RSI: 0000000000000003 RDI: 00000002e206bd60 RBP: 0000000000052dc0 R08: ffff8882b0044218 R09: ffff8882b0045e8a R10: fffffbfff300fefc R11: ffff888167af4000 R12: 0000000000000003 R13: 0000000000000000 R14: 00000000696c7070 R15: ffff8882373f4380 FS: 0000000000000000(0000) GS:ffff88bf2be80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005641d031eee8 CR3: 0000002e7ca14000 CR4: 0000000000350ee0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]---] Reported-by: Frederick Lawler <fred@cloudflare.com> Link: https://lore.kernel.org/netdev/be5b9271-7507-19c5-ded1-fa78f1980e69@cloudflare.com Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Devcom, serialize devcom registrationShay Drory2023-05-221-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | From one hand, mlx5 driver is allowing to probe PFs in parallel. From the other hand, devcom, which is a share resource between PFs, is registered without any lock. This might resulted in memory problems. Hence, use the global mlx5_dev_list_lock in order to serialize devcom registration. Fixes: fadd59fc50d0 ("net/mlx5: Introduce inter-device communication mechanism") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Devcom, fix error flow in mlx5_devcom_register_deviceShay Drory2023-05-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | In case devcom allocation is failed, mlx5 is always freeing the priv. However, this priv might have been allocated by a different thread, and freeing it might lead to use-after-free bugs. Fix it by freeing the priv only in case it was allocated by the running thread. Fixes: fadd59fc50d0 ("net/mlx5: Introduce inter-device communication mechanism") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: E-switch, Devcom, sync devcom events and devcom comp registerShay Drory2023-05-222-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | devcom events are sent to all registered component. Following the cited patch, it is possible for two components, e.g.: two eswitches, to send devcom events, while both components are registered. This means eswitch layer will do double un/pairing, which is double allocation and free of resources, even though only one un/pairing is needed. flow example: cpu0 cpu1 ---- ---- mlx5_devlink_eswitch_mode_set(dev0) esw_offloads_devcom_init() mlx5_devcom_register_component(esw0) mlx5_devlink_eswitch_mode_set(dev1) esw_offloads_devcom_init() mlx5_devcom_register_component(esw1) mlx5_devcom_send_event() mlx5_devcom_send_event() Hence, check whether the eswitches are already un/paired before free/allocation of resources. Fixes: 09b278462f16 ("net: devlink: enable parallel ops on netlink interface") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: TC, Fix using eswitch mapping in nic modePaul Blakey2023-05-221-7/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cited patch is using the eswitch object mapping pool while in nic mode where it isn't initialized. This results in the trace below [0]. Fix that by using either nic or eswitch object mapping pool depending if eswitch is enabled or not. [0]: [ 826.446057] ================================================================== [ 826.446729] BUG: KASAN: slab-use-after-free in mlx5_add_flow_rules+0x30/0x490 [mlx5_core] [ 826.447515] Read of size 8 at addr ffff888194485830 by task tc/6233 [ 826.448243] CPU: 16 PID: 6233 Comm: tc Tainted: G W 6.3.0-rc6+ #1 [ 826.448890] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 826.449785] Call Trace: [ 826.450052] <TASK> [ 826.450302] dump_stack_lvl+0x33/0x50 [ 826.450650] print_report+0xc2/0x610 [ 826.450998] ? __virt_addr_valid+0xb1/0x130 [ 826.451385] ? mlx5_add_flow_rules+0x30/0x490 [mlx5_core] [ 826.451935] kasan_report+0xae/0xe0 [ 826.452276] ? mlx5_add_flow_rules+0x30/0x490 [mlx5_core] [ 826.452829] mlx5_add_flow_rules+0x30/0x490 [mlx5_core] [ 826.453368] ? __kmalloc_node+0x5a/0x120 [ 826.453733] esw_add_restore_rule+0x20f/0x270 [mlx5_core] [ 826.454288] ? mlx5_eswitch_add_send_to_vport_meta_rule+0x260/0x260 [mlx5_core] [ 826.455011] ? mutex_unlock+0x80/0xd0 [ 826.455361] ? __mutex_unlock_slowpath.constprop.0+0x210/0x210 [ 826.455862] ? mapping_add+0x2cb/0x440 [mlx5_core] [ 826.456425] mlx5e_tc_action_miss_mapping_get+0x139/0x180 [mlx5_core] [ 826.457058] ? mlx5e_tc_update_skb_nic+0xb0/0xb0 [mlx5_core] [ 826.457636] ? __kasan_kmalloc+0x77/0x90 [ 826.458000] ? __kmalloc+0x57/0x120 [ 826.458336] mlx5_tc_ct_flow_offload+0x325/0xe40 [mlx5_core] [ 826.458916] ? ct_kernel_enter.constprop.0+0x48/0xa0 [ 826.459360] ? mlx5_tc_ct_parse_action+0xf0/0xf0 [mlx5_core] [ 826.459933] ? mlx5e_mod_hdr_attach+0x491/0x520 [mlx5_core] [ 826.460507] ? mlx5e_mod_hdr_get+0x12/0x20 [mlx5_core] [ 826.461046] ? mlx5e_tc_attach_mod_hdr+0x154/0x170 [mlx5_core] [ 826.461635] mlx5e_configure_flower+0x969/0x2110 [mlx5_core] [ 826.462217] ? _raw_spin_lock_bh+0x85/0xe0 [ 826.462597] ? __mlx5e_add_fdb_flow+0x750/0x750 [mlx5_core] [ 826.463163] ? kasan_save_stack+0x2e/0x40 [ 826.463534] ? down_read+0x115/0x1b0 [ 826.463878] ? down_write_killable+0x110/0x110 [ 826.464288] ? tc_setup_action.part.0+0x9f/0x3b0 [ 826.464701] ? mlx5e_is_uplink_rep+0x4c/0x90 [mlx5_core] [ 826.465253] ? mlx5e_tc_reoffload_flows_work+0x130/0x130 [mlx5_core] [ 826.465878] tc_setup_cb_add+0x112/0x250 [ 826.466247] fl_hw_replace_filter+0x230/0x310 [cls_flower] [ 826.466724] ? fl_hw_destroy_filter+0x1a0/0x1a0 [cls_flower] [ 826.467212] fl_change+0x14e1/0x2030 [cls_flower] [ 826.467636] ? sock_def_readable+0x89/0x120 [ 826.468019] ? fl_tmplt_create+0x2d0/0x2d0 [cls_flower] [ 826.468509] ? kasan_unpoison+0x23/0x50 [ 826.468873] ? get_random_u16+0x180/0x180 [ 826.469244] ? __radix_tree_lookup+0x2b/0x130 [ 826.469640] ? fl_get+0x7b/0x140 [cls_flower] [ 826.470042] ? fl_mask_put+0x200/0x200 [cls_flower] [ 826.470478] ? __mutex_unlock_slowpath.constprop.0+0x210/0x210 [ 826.470973] ? fl_tmplt_create+0x2d0/0x2d0 [cls_flower] [ 826.471427] tc_new_tfilter+0x644/0x1050 [ 826.471795] ? tc_get_tfilter+0x860/0x860 [ 826.472170] ? __thaw_task+0x130/0x130 [ 826.472525] ? arch_stack_walk+0x98/0xf0 [ 826.472892] ? cap_capable+0x9f/0xd0 [ 826.473235] ? security_capable+0x47/0x60 [ 826.473608] rtnetlink_rcv_msg+0x1d5/0x550 [ 826.473985] ? rtnl_calcit.isra.0+0x1f0/0x1f0 [ 826.474383] ? __stack_depot_save+0x35/0x4c0 [ 826.474779] ? kasan_save_stack+0x2e/0x40 [ 826.475149] ? kasan_save_stack+0x1e/0x40 [ 826.475518] ? __kasan_record_aux_stack+0x9f/0xb0 [ 826.475939] ? task_work_add+0x77/0x1c0 [ 826.476305] netlink_rcv_skb+0xe0/0x210 [ 826.476661] ? rtnl_calcit.isra.0+0x1f0/0x1f0 [ 826.477057] ? netlink_ack+0x7c0/0x7c0 [ 826.477412] ? rhashtable_jhash2+0xef/0x150 [ 826.477796] ? _copy_from_iter+0x105/0x770 [ 826.484386] netlink_unicast+0x346/0x490 [ 826.484755] ? netlink_attachskb+0x400/0x400 [ 826.485145] ? kernel_text_address+0xc2/0xd0 [ 826.485535] netlink_sendmsg+0x3b0/0x6c0 [ 826.485902] ? kernel_text_address+0xc2/0xd0 [ 826.486296] ? netlink_unicast+0x490/0x490 [ 826.486671] ? iovec_from_user.part.0+0x7a/0x1a0 [ 826.487083] ? netlink_unicast+0x490/0x490 [ 826.487461] sock_sendmsg+0x73/0xc0 [ 826.487803] ____sys_sendmsg+0x364/0x380 [ 826.488186] ? import_iovec+0x7/0x10 [ 826.488531] ? kernel_sendmsg+0x30/0x30 [ 826.488893] ? __copy_msghdr+0x180/0x180 [ 826.489258] ? kasan_save_stack+0x2e/0x40 [ 826.489629] ? kasan_save_stack+0x1e/0x40 [ 826.490002] ? __kasan_record_aux_stack+0x9f/0xb0 [ 826.490424] ? __call_rcu_common.constprop.0+0x46/0x580 [ 826.490876] ___sys_sendmsg+0xdf/0x140 [ 826.491231] ? copy_msghdr_from_user+0x110/0x110 [ 826.491649] ? fget_raw+0x120/0x120 [ 826.491988] ? ___sys_recvmsg+0xd9/0x130 [ 826.492355] ? folio_batch_add_and_move+0x80/0xa0 [ 826.492776] ? _raw_spin_lock+0x7a/0xd0 [ 826.493137] ? _raw_spin_lock+0x7a/0xd0 [ 826.493500] ? _raw_read_lock_irq+0x30/0x30 [ 826.493880] ? kasan_set_track+0x21/0x30 [ 826.494249] ? kasan_save_free_info+0x2a/0x40 [ 826.494650] ? do_sys_openat2+0xff/0x270 [ 826.495016] ? __fget_light+0x1b5/0x200 [ 826.495377] ? __virt_addr_valid+0xb1/0x130 [ 826.495763] __sys_sendmsg+0xb2/0x130 [ 826.496118] ? __sys_sendmsg_sock+0x20/0x20 [ 826.496501] ? __x64_sys_rseq+0x2e0/0x2e0 [ 826.496874] ? do_user_addr_fault+0x276/0x820 [ 826.497273] ? fpregs_assert_state_consistent+0x52/0x60 [ 826.497727] ? exit_to_user_mode_prepare+0x30/0x120 [ 826.498158] do_syscall_64+0x3d/0x90 [ 826.498502] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 826.498949] RIP: 0033:0x7f9b67f4f887 [ 826.499294] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 [ 826.500742] RSP: 002b:00007fff5d1a5498 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 826.501395] RAX: ffffffffffffffda RBX: 0000000064413ce6 RCX: 00007f9b67f4f887 [ 826.501975] RDX: 0000000000000000 RSI: 00007fff5d1a5500 RDI: 0000000000000003 [ 826.502556] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001 [ 826.503135] R10: 00007f9b67e08708 R11: 0000000000000246 R12: 0000000000000001 [ 826.503714] R13: 0000000000000001 R14: 00007fff5d1a9800 R15: 0000000000485400 [ 826.504304] </TASK> [ 826.504753] Allocated by task 3764: [ 826.505090] kasan_save_stack+0x1e/0x40 [ 826.505453] kasan_set_track+0x21/0x30 [ 826.505810] __kasan_kmalloc+0x77/0x90 [ 826.506164] __mlx5_create_flow_table+0x16d/0xbb0 [mlx5_core] [ 826.506742] esw_offloads_enable+0x60d/0xfb0 [mlx5_core] [ 826.507292] mlx5_eswitch_enable_locked+0x4d3/0x680 [mlx5_core] [ 826.507885] mlx5_devlink_eswitch_mode_set+0x2a3/0x580 [mlx5_core] [ 826.508513] devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0 [ 826.508969] genl_family_rcv_msg_doit.isra.0+0x146/0x1c0 [ 826.509427] genl_rcv_msg+0x28d/0x3e0 [ 826.509772] netlink_rcv_skb+0xe0/0x210 [ 826.510133] genl_rcv+0x24/0x40 [ 826.510448] netlink_unicast+0x346/0x490 [ 826.510810] netlink_sendmsg+0x3b0/0x6c0 [ 826.511179] sock_sendmsg+0x73/0xc0 [ 826.511519] __sys_sendto+0x18d/0x220 [ 826.511867] __x64_sys_sendto+0x72/0x80 [ 826.512232] do_syscall_64+0x3d/0x90 [ 826.512576] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 826.513220] Freed by task 5674: [ 826.513535] kasan_save_stack+0x1e/0x40 [ 826.513893] kasan_set_track+0x21/0x30 [ 826.514245] kasan_save_free_info+0x2a/0x40 [ 826.514629] ____kasan_slab_free+0x11a/0x1b0 [ 826.515021] __kmem_cache_free+0x14d/0x280 [ 826.515399] tree_put_node+0x109/0x1c0 [mlx5_core] [ 826.515907] mlx5_destroy_flow_table+0x119/0x630 [mlx5_core] [ 826.516481] esw_offloads_steering_cleanup+0xe7/0x150 [mlx5_core] [ 826.517084] esw_offloads_disable+0xe0/0x160 [mlx5_core] [ 826.517632] mlx5_eswitch_disable_locked+0x26c/0x290 [mlx5_core] [ 826.518225] mlx5_devlink_eswitch_mode_set+0x128/0x580 [mlx5_core] [ 826.518834] devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0 [ 826.519286] genl_family_rcv_msg_doit.isra.0+0x146/0x1c0 [ 826.519748] genl_rcv_msg+0x28d/0x3e0 [ 826.520101] netlink_rcv_skb+0xe0/0x210 [ 826.520458] genl_rcv+0x24/0x40 [ 826.520771] netlink_unicast+0x346/0x490 [ 826.521137] netlink_sendmsg+0x3b0/0x6c0 [ 826.521505] sock_sendmsg+0x73/0xc0 [ 826.521842] __sys_sendto+0x18d/0x220 [ 826.522191] __x64_sys_sendto+0x72/0x80 [ 826.522554] do_syscall_64+0x3d/0x90 [ 826.522894] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 826.523540] Last potentially related work creation: [ 826.523969] kasan_save_stack+0x1e/0x40 [ 826.524331] __kasan_record_aux_stack+0x9f/0xb0 [ 826.524739] insert_work+0x30/0x130 [ 826.525078] __queue_work+0x34b/0x690 [ 826.525426] queue_work_on+0x48/0x50 [ 826.525766] __rhashtable_remove_fast_one+0x4af/0x4d0 [mlx5_core] [ 826.526365] del_sw_flow_group+0x1b5/0x270 [mlx5_core] [ 826.526898] tree_put_node+0x109/0x1c0 [mlx5_core] [ 826.527407] esw_offloads_steering_cleanup+0xd3/0x150 [mlx5_core] [ 826.528009] esw_offloads_disable+0xe0/0x160 [mlx5_core] [ 826.528616] mlx5_eswitch_disable_locked+0x26c/0x290 [mlx5_core] [ 826.529218] mlx5_devlink_eswitch_mode_set+0x128/0x580 [mlx5_core] [ 826.529823] devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0 [ 826.530276] genl_family_rcv_msg_doit.isra.0+0x146/0x1c0 [ 826.530733] genl_rcv_msg+0x28d/0x3e0 [ 826.531079] netlink_rcv_skb+0xe0/0x210 [ 826.531439] genl_rcv+0x24/0x40 [ 826.531755] netlink_unicast+0x346/0x490 [ 826.532123] netlink_sendmsg+0x3b0/0x6c0 [ 826.532487] sock_sendmsg+0x73/0xc0 [ 826.532825] __sys_sendto+0x18d/0x220 [ 826.533175] __x64_sys_sendto+0x72/0x80 [ 826.533533] do_syscall_64+0x3d/0x90 [ 826.533877] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 826.534521] The buggy address belongs to the object at ffff888194485800 which belongs to the cache kmalloc-512 of size 512 [ 826.535506] The buggy address is located 48 bytes inside of freed 512-byte region [ffff888194485800, ffff888194485a00) [ 826.536666] The buggy address belongs to the physical page: [ 826.537138] page:00000000d75841dd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x194480 [ 826.537915] head:00000000d75841dd order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 826.538595] flags: 0x200000000010200(slab|head|node=0|zone=2) [ 826.539089] raw: 0200000000010200 ffff888100042c80 ffffea0004523800 dead000000000002 [ 826.539755] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 [ 826.540417] page dumped because: kasan: bad access detected [ 826.541095] Memory state around the buggy address: [ 826.541519] ffff888194485700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 826.542149] ffff888194485780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 826.542773] >ffff888194485800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 826.543400] ^ [ 826.543822] ffff888194485880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 826.544452] ffff888194485900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 826.545079] ================================================================== Fixes: 6702782845a5 ("net/mlx5e: TC, Set CT miss to the specific ct action instance") Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Fix SQ wake logic in ptp napi_poll contextRahul Rameshbabu2023-05-223-7/+16
| | | | | | | | | | | | | | | | | | | | | | Check in the mlx5e_ptp_poll_ts_cq context if the ptp tx sq should be woken up. Before change, the ptp tx sq may never wake up if the ptp tx ts skb fifo is full when mlx5e_poll_tx_cq checks if the queue should be woken up. Fixes: 1880bc4e4a96 ("net/mlx5e: Add TX port timestamp support") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Fix deadlock in tc route query codeVlad Buslov2023-05-223-20/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cited commit causes ABBA deadlock[0] when peer flows are created while holding the devcom rw semaphore. Due to peer flows offload implementation the lock is taken much higher up the call chain and there is no obvious way to easily fix the deadlock. Instead, since tc route query code needs the peer eswitch structure only to perform a lookup in xarray and doesn't perform any sleeping operations with it, refactor the code for lockless execution in following ways: - RCUify the devcom 'data' pointer. When resetting the pointer synchronously wait for RCU grace period before returning. This is fine since devcom is currently only used for synchronization of pairing/unpairing of eswitches which is rare and already expensive as-is. - Wrap all usages of 'paired' boolean in {READ|WRITE}_ONCE(). The flag has already been used in some unlocked contexts without proper annotations (e.g. users of mlx5_devcom_is_paired() function), but it wasn't an issue since all relevant code paths checked it again after obtaining the devcom semaphore. Now it is also used by mlx5_devcom_get_peer_data_rcu() as "best effort" check to return NULL when devcom is being unpaired. Note that while RCU read lock doesn't prevent the unpaired flag from being changed concurrently it still guarantees that reader can continue to use 'data'. - Refactor mlx5e_tc_query_route_vport() function to use new mlx5_devcom_get_peer_data_rcu() API which fixes the deadlock. [0]: [ 164.599612] ====================================================== [ 164.600142] WARNING: possible circular locking dependency detected [ 164.600667] 6.3.0-rc3+ #1 Not tainted [ 164.601021] ------------------------------------------------------ [ 164.601557] handler1/3456 is trying to acquire lock: [ 164.601998] ffff88811f1714b0 (&esw->offloads.encap_tbl_lock){+.+.}-{3:3}, at: mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core] [ 164.603078] but task is already holding lock: [ 164.603617] ffff88810137fc98 (&comp->sem){++++}-{3:3}, at: mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core] [ 164.604459] which lock already depends on the new lock. [ 164.605190] the existing dependency chain (in reverse order) is: [ 164.605848] -> #1 (&comp->sem){++++}-{3:3}: [ 164.606380] down_read+0x39/0x50 [ 164.606772] mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core] [ 164.607336] mlx5e_tc_query_route_vport+0x86/0xc0 [mlx5_core] [ 164.607914] mlx5e_tc_tun_route_lookup+0x1a4/0x1d0 [mlx5_core] [ 164.608495] mlx5e_attach_decap_route+0xc6/0x1e0 [mlx5_core] [ 164.609063] mlx5e_tc_add_fdb_flow+0x1ea/0x360 [mlx5_core] [ 164.609627] __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core] [ 164.610175] mlx5e_configure_flower+0x952/0x1a20 [mlx5_core] [ 164.610741] tc_setup_cb_add+0xd4/0x200 [ 164.611146] fl_hw_replace_filter+0x14c/0x1f0 [cls_flower] [ 164.611661] fl_change+0xc95/0x18a0 [cls_flower] [ 164.612116] tc_new_tfilter+0x3fc/0xd20 [ 164.612516] rtnetlink_rcv_msg+0x418/0x5b0 [ 164.612936] netlink_rcv_skb+0x54/0x100 [ 164.613339] netlink_unicast+0x190/0x250 [ 164.613746] netlink_sendmsg+0x245/0x4a0 [ 164.614150] sock_sendmsg+0x38/0x60 [ 164.614522] ____sys_sendmsg+0x1d0/0x1e0 [ 164.614934] ___sys_sendmsg+0x80/0xc0 [ 164.615320] __sys_sendmsg+0x51/0x90 [ 164.615701] do_syscall_64+0x3d/0x90 [ 164.616083] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 164.616568] -> #0 (&esw->offloads.encap_tbl_lock){+.+.}-{3:3}: [ 164.617210] __lock_acquire+0x159e/0x26e0 [ 164.617638] lock_acquire+0xc2/0x2a0 [ 164.618018] __mutex_lock+0x92/0xcd0 [ 164.618401] mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core] [ 164.618943] post_process_attr+0x153/0x2d0 [mlx5_core] [ 164.619471] mlx5e_tc_add_fdb_flow+0x164/0x360 [mlx5_core] [ 164.620021] __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core] [ 164.620564] mlx5e_configure_flower+0xe33/0x1a20 [mlx5_core] [ 164.621125] tc_setup_cb_add+0xd4/0x200 [ 164.621531] fl_hw_replace_filter+0x14c/0x1f0 [cls_flower] [ 164.622047] fl_change+0xc95/0x18a0 [cls_flower] [ 164.622500] tc_new_tfilter+0x3fc/0xd20 [ 164.622906] rtnetlink_rcv_msg+0x418/0x5b0 [ 164.623324] netlink_rcv_skb+0x54/0x100 [ 164.623727] netlink_unicast+0x190/0x250 [ 164.624138] netlink_sendmsg+0x245/0x4a0 [ 164.624544] sock_sendmsg+0x38/0x60 [ 164.624919] ____sys_sendmsg+0x1d0/0x1e0 [ 164.625340] ___sys_sendmsg+0x80/0xc0 [ 164.625731] __sys_sendmsg+0x51/0x90 [ 164.626117] do_syscall_64+0x3d/0x90 [ 164.626502] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 164.626995] other info that might help us debug this: [ 164.627725] Possible unsafe locking scenario: [ 164.628268] CPU0 CPU1 [ 164.628683] ---- ---- [ 164.629098] lock(&comp->sem); [ 164.629421] lock(&esw->offloads.encap_tbl_lock); [ 164.630066] lock(&comp->sem); [ 164.630555] lock(&esw->offloads.encap_tbl_lock); [ 164.630993] *** DEADLOCK *** [ 164.631575] 3 locks held by handler1/3456: [ 164.631962] #0: ffff888124b75130 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200 [ 164.632703] #1: ffff888116e512b8 (&esw->mode_lock){++++}-{3:3}, at: mlx5_esw_hold+0x39/0x50 [mlx5_core] [ 164.633552] #2: ffff88810137fc98 (&comp->sem){++++}-{3:3}, at: mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core] [ 164.634435] stack backtrace: [ 164.634883] CPU: 17 PID: 3456 Comm: handler1 Not tainted 6.3.0-rc3+ #1 [ 164.635431] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 164.636340] Call Trace: [ 164.636616] <TASK> [ 164.636863] dump_stack_lvl+0x47/0x70 [ 164.637217] check_noncircular+0xfe/0x110 [ 164.637601] __lock_acquire+0x159e/0x26e0 [ 164.637977] ? mlx5_cmd_set_fte+0x5b0/0x830 [mlx5_core] [ 164.638472] lock_acquire+0xc2/0x2a0 [ 164.638828] ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core] [ 164.639339] ? lock_is_held_type+0x98/0x110 [ 164.639728] __mutex_lock+0x92/0xcd0 [ 164.640074] ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core] [ 164.640576] ? __lock_acquire+0x382/0x26e0 [ 164.640958] ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core] [ 164.641468] ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core] [ 164.641965] mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core] [ 164.642454] ? lock_release+0xbf/0x240 [ 164.642819] post_process_attr+0x153/0x2d0 [mlx5_core] [ 164.643318] mlx5e_tc_add_fdb_flow+0x164/0x360 [mlx5_core] [ 164.643835] __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core] [ 164.644340] mlx5e_configure_flower+0xe33/0x1a20 [mlx5_core] [ 164.644862] ? lock_acquire+0xc2/0x2a0 [ 164.645219] tc_setup_cb_add+0xd4/0x200 [ 164.645588] fl_hw_replace_filter+0x14c/0x1f0 [cls_flower] [ 164.646067] fl_change+0xc95/0x18a0 [cls_flower] [ 164.646488] tc_new_tfilter+0x3fc/0xd20 [ 164.646861] ? tc_del_tfilter+0x810/0x810 [ 164.647236] rtnetlink_rcv_msg+0x418/0x5b0 [ 164.647621] ? rtnl_setlink+0x160/0x160 [ 164.647982] netlink_rcv_skb+0x54/0x100 [ 164.648348] netlink_unicast+0x190/0x250 [ 164.648722] netlink_sendmsg+0x245/0x4a0 [ 164.649090] sock_sendmsg+0x38/0x60 [ 164.649434] ____sys_sendmsg+0x1d0/0x1e0 [ 164.649804] ? copy_msghdr_from_user+0x6d/0xa0 [ 164.650213] ___sys_sendmsg+0x80/0xc0 [ 164.650563] ? lock_acquire+0xc2/0x2a0 [ 164.650926] ? lock_acquire+0xc2/0x2a0 [ 164.651286] ? __fget_files+0x5/0x190 [ 164.651644] ? find_held_lock+0x2b/0x80 [ 164.652006] ? __fget_files+0xb9/0x190 [ 164.652365] ? lock_release+0xbf/0x240 [ 164.652723] ? __fget_files+0xd3/0x190 [ 164.653079] __sys_sendmsg+0x51/0x90 [ 164.653435] do_syscall_64+0x3d/0x90 [ 164.653784] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 164.654229] RIP: 0033:0x7f378054f8bd [ 164.654577] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 6a c3 f4 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 be c3 f4 ff 48 [ 164.656041] RSP: 002b:00007f377fa114b0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e [ 164.656701] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f378054f8bd [ 164.657297] RDX: 0000000000000000 RSI: 00007f377fa11540 RDI: 0000000000000014 [ 164.657885] RBP: 00007f377fa12278 R08: 0000000000000000 R09: 000000000000015c [ 164.658472] R10: 00007f377fa123d0 R11: 0000000000000293 R12: 0000560962d99bd0 [ 164.665317] R13: 0000000000000000 R14: 0000560962d99bd0 R15: 00007f377fa11540 Fixes: f9d196bd632b ("net/mlx5e: Use correct eswitch for stack devices with lag") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Fix error message when failing to allocate device memoryRoi Dayan2023-05-221-1/+1
| | | | | | | | | | | | | | | | Fix spacing for the error and also the correct error code pointer. Fixes: c9b9dcb430b3 ("net/mlx5: Move device memory management to mlx5_core") Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5e: Use correct encap attribute during invalidationVlad Buslov2023-05-221-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With introduction of post action infrastructure most of the users of encap attribute had been modified in order to obtain the correct attribute by calling mlx5e_tc_get_encap_attr() helper instead of assuming encap action is always on default attribute. However, the cited commit didn't modify mlx5e_invalidate_encap() which prevents it from destroying correct modify header action which leads to a warning [0]. Fix the issue by using correct attribute. [0]: Feb 21 09:47:35 c-237-177-40-045 kernel: WARNING: CPU: 17 PID: 654 at drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:684 mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core] Feb 21 09:47:35 c-237-177-40-045 kernel: RIP: 0010:mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core] Feb 21 09:47:35 c-237-177-40-045 kernel: Call Trace: Feb 21 09:47:35 c-237-177-40-045 kernel: <TASK> Feb 21 09:47:35 c-237-177-40-045 kernel: mlx5e_tc_fib_event_work+0x8e3/0x1f60 [mlx5_core] Feb 21 09:47:35 c-237-177-40-045 kernel: ? mlx5e_take_all_encap_flows+0xe0/0xe0 [mlx5_core] Feb 21 09:47:35 c-237-177-40-045 kernel: ? lock_downgrade+0x6d0/0x6d0 Feb 21 09:47:35 c-237-177-40-045 kernel: ? lockdep_hardirqs_on_prepare+0x273/0x3f0 Feb 21 09:47:35 c-237-177-40-045 kernel: ? lockdep_hardirqs_on_prepare+0x273/0x3f0 Feb 21 09:47:35 c-237-177-40-045 kernel: process_one_work+0x7c2/0x1310 Feb 21 09:47:35 c-237-177-40-045 kernel: ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0 Feb 21 09:47:35 c-237-177-40-045 kernel: ? pwq_dec_nr_in_flight+0x230/0x230 Feb 21 09:47:35 c-237-177-40-045 kernel: ? rwlock_bug.part.0+0x90/0x90 Feb 21 09:47:35 c-237-177-40-045 kernel: worker_thread+0x59d/0xec0 Feb 21 09:47:35 c-237-177-40-045 kernel: ? __kthread_parkme+0xd9/0x1d0 Fixes: 8300f225268b ("net/mlx5e: Create new flow attr for multi table actions") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: DR, Check force-loopback RC QP capability independently from RoCEYevgeny Kliteynik2023-05-221-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | SW Steering uses RC QP for writing STEs to ICM. This writingis done in LB (loopback), and FL (force-loopback) QP is preferred for performance. FL is available when RoCE is enabled or disabled based on RoCE caps. This patch adds reading of FL capability from HCA caps in addition to the existing reading from RoCE caps, thus fixing the case where we didn't have loopback enabled when RoCE was disabled. Fixes: 7304d603a57a ("net/mlx5: DR, Add support for force-loopback QP") Signed-off-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: DR, Fix crc32 calculation to work on big-endian (BE) CPUsErez Shitrit2023-05-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When calculating crc for hash index we use the function crc32 that calculates for little-endian (LE) arch. Then we convert it to network endianness using htonl(), but it's wrong to do the conversion in BE archs since the crc32 value is already LE. The solution is to switch the bytes from the crc result for all types of arc. Fixes: 40416d8ede65 ("net/mlx5: DR, Replace CRC32 implementation to use kernel lib") Signed-off-by: Erez Shitrit <erezsh@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Handle pairing of E-switch via uplink un/load APIsShay Drory2023-05-223-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case user switch a device from switchdev mode to legacy mode, mlx5 first unpair the E-switch and afterwards unload the uplink vport. From the other hand, in case user remove or reload a device, mlx5 first unload the uplink vport and afterwards unpair the E-switch. The latter is causing a bug[1], hence, handle pairing of E-switch as part of uplink un/load APIs. [1] In case VF_LAG is used, every tc fdb flow is duplicated to the peer esw. However, the original esw keeps a pointer to this duplicated flow, not the peer esw. e.g.: if user create tc fdb flow over esw0, the flow is duplicated over esw1, in FW/HW, but in SW, esw0 keeps a pointer to the duplicated flow. During module unload while a peer tc fdb flow is still offloaded, in case the first device to be removed is the peer device (esw1 in the example above), the peer net-dev is destroyed, and so the mlx5e_priv is memset to 0. Afterwards, the peer device is trying to unpair himself from the original device (esw0 in the example above). Unpair API invoke the original device to clear peer flow from its eswitch (esw0), but the peer flow, which is stored over the original eswitch (esw0), is trying to use the peer mlx5e_priv, which is memset to 0 and result in bellow kernel-oops. [ 157.964081 ] BUG: unable to handle page fault for address: 000000000002ce60 [ 157.964662 ] #PF: supervisor read access in kernel mode [ 157.965123 ] #PF: error_code(0x0000) - not-present page [ 157.965582 ] PGD 0 P4D 0 [ 157.965866 ] Oops: 0000 [#1] SMP [ 157.967670 ] RIP: 0010:mlx5e_tc_del_fdb_flow+0x48/0x460 [mlx5_core] [ 157.976164 ] Call Trace: [ 157.976437 ] <TASK> [ 157.976690 ] __mlx5e_tc_del_fdb_peer_flow+0xe6/0x100 [mlx5_core] [ 157.977230 ] mlx5e_tc_clean_fdb_peer_flows+0x67/0x90 [mlx5_core] [ 157.977767 ] mlx5_esw_offloads_unpair+0x2d/0x1e0 [mlx5_core] [ 157.984653 ] mlx5_esw_offloads_devcom_event+0xbf/0x130 [mlx5_core] [ 157.985212 ] mlx5_devcom_send_event+0xa3/0xb0 [mlx5_core] [ 157.985714 ] esw_offloads_disable+0x5a/0x110 [mlx5_core] [ 157.986209 ] mlx5_eswitch_disable_locked+0x152/0x170 [mlx5_core] [ 157.986757 ] mlx5_eswitch_disable+0x51/0x80 [mlx5_core] [ 157.987248 ] mlx5_unload+0x2a/0xb0 [mlx5_core] [ 157.987678 ] mlx5_uninit_one+0x5f/0xd0 [mlx5_core] [ 157.988127 ] remove_one+0x64/0xe0 [mlx5_core] [ 157.988549 ] pci_device_remove+0x31/0xa0 [ 157.988933 ] device_release_driver_internal+0x18f/0x1f0 [ 157.989402 ] driver_detach+0x3f/0x80 [ 157.989754 ] bus_remove_driver+0x70/0xf0 [ 157.990129 ] pci_unregister_driver+0x34/0x90 [ 157.990537 ] mlx5_cleanup+0xc/0x1c [mlx5_core] [ 157.990972 ] __x64_sys_delete_module+0x15a/0x250 [ 157.991398 ] ? exit_to_user_mode_prepare+0xea/0x110 [ 157.991840 ] do_syscall_64+0x3d/0x90 [ 157.992198 ] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: 04de7dda7394 ("net/mlx5e: Infrastructure for duplicated offloading of TC flows") Fixes: 1418ddd96afd ("net/mlx5e: Duplicate offloaded TC eswitch rules under uplink LAG") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>