path: root/drivers/net/ethernet/mellanox/mlx5/core/en.h
Commit message | Author | Age | Files | Lines
* net/mlx5e: SHAMPO, Increase timeout to improve latency (Dragos Tatulea, 10 days ago; 1 file, -1/+1)

During latency tests (netperf TCP_RR) a 30% degradation of HW GRO vs SW GRO was observed. This is due to SHAMPO triggering timeout filler CQEs instead of delivering the CQE for the packet.

Having a short timeout for SHAMPO doesn't bring any benefits as it is the driver that does the merging, not the hardware. On the contrary, it can have a negative impact: additional filler CQEs are generated due to the timeout. As there is no way to disable this timeout, this change sets it to the maximum value.

Instead of using the packet_merge.timeout parameter, which is also used for LRO, set the value directly when filling in the rest of the SHAMPO parameters in mlx5e_build_rq_param().

Fixes: 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808144107.2095424-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
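A rough sketch of the idea, not the actual patch: the device advertises a small set of supported timeout periods in its HCA capabilities, and "set it to the maximum" amounts to scanning that array for the largest entry when building the RQ parameters. The capability array name follows the LRO-era naming in mlx5; the helper itself is hypothetical.

/* Hypothetical sketch: pick the largest timeout the device supports,
 * instead of the (LRO-shared) packet_merge.timeout value. */
static u32 mlx5e_shampo_max_timeout(struct mlx5_core_dev *mdev)
{
	u32 max = 0;
	int i;

	for (i = 0; i < MLX5E_LRO_TIMEOUT_ARR_SIZE; i++)
		max = max_t(u32, max,
			    MLX5_CAP_ETH(mdev, lro_timer_supported_periods[i]));
	return max;
}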
* net: Add struct kernel_ethtool_ts_info (Kory Maincent, 2024-07-15; 1 file, -1/+1)

In preparation for new hwtstamp UAPI, we would otherwise be limited to struct ethtool_ts_info, which is currently passed in fixed binary format through the ETHTOOL_GET_TS_INFO ethtool ioctl. It would be good if new kernel code already started operating on an extensible kernel variant of that structure, similar in concept to struct kernel_hwtstamp_config vs struct hwtstamp_config.

Since struct ethtool_ts_info is in include/uapi/linux/ethtool.h, here we introduce the kernel-only structure in include/linux/ethtool.h. The manual copy is then made in the function called by ETHTOOL_GET_TS_INFO.

Acked-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20240709-feature_ptp_netnext-v17-6-b5317f50df2a@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
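The pattern, roughly: keep the fixed UAPI struct at the ioctl boundary and let everything else operate on a kernel-only twin that is free to grow. A hedged sketch of what such a pair and the manual copy can look like; the field lists are illustrative, not the exact upstream definitions.

/* UAPI struct: layout is ABI and cannot change (include/uapi/linux/ethtool.h). */
struct ethtool_ts_info_uapi_sketch {
	__u32 cmd;
	__u32 so_timestamping;
	__s32 phc_index;
	__u32 tx_types;
	__u32 tx_reserved[3];
	__u32 rx_filters;
	__u32 rx_reserved[3];
};

/* Kernel-only variant: new members can be appended over time. */
struct kernel_ethtool_ts_info_sketch {
	u32 cmd;
	u32 so_timestamping;
	int phc_index;
	u32 tx_types;
	u32 rx_filters;
};

/* At the ioctl boundary, copy field by field into the fixed format. */
static void ts_info_to_uapi(struct ethtool_ts_info_uapi_sketch *out,
			    const struct kernel_ethtool_ts_info_sketch *in)
{
	memset(out, 0, sizeof(*out)); /* reserved fields stay zero */
	out->cmd = in->cmd;
	out->so_timestamping = in->so_timestamping;
	out->phc_index = in->phc_index;
	out->tx_types = in->tx_types;
	out->rx_filters = in->rx_filters;
}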
* net/mlx5e: Add txq to sq stats mapping (Joe Damato, 2024-06-17; 1 file, -0/+2)

mlx5 currently maps txqs to an sq via priv->txq2sq. It is useful to map txqs to sq_stats, as well, for direct access to stats. Add priv->txq2sq_stats and insert mappings. The mappings will be used next to tabulate stats information.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
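Conceptually these are two parallel lookup tables indexed by txq number; a minimal sketch under assumed surrounding context (the wrapper struct and helper are illustrative, only the member names follow the commit):

/* Sketch: parallel txq lookup tables, kept in sync at channel activation. */
struct mlx5e_priv_sketch {
	struct mlx5e_txqsq **txq2sq;          /* txq index -> SQ */
	struct mlx5e_sq_stats **txq2sq_stats; /* txq index -> SQ stats */
};

static void map_txq(struct mlx5e_priv_sketch *priv, int txq,
		    struct mlx5e_txqsq *sq, struct mlx5e_sq_stats *stats)
{
	/* Stats readers can now index txq2sq_stats directly instead of
	 * dereferencing the SQ itself. */
	priv->txq2sq[txq] = sq;
	priv->txq2sq_stats[txq] = stats;
}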
* net/mlx5e: SHAMPO, Use KSMs instead of KLMs (Yoray Zack, 2024-06-05; 1 file, -19/+1)

A KSM Mkey is a KLM Mkey with a fixed buffer size. Due to this fact, it is a faster mechanism than KLM.

The SHAMPO feature used KLM Mkeys for memory mappings of its headers buffer. As it used KLMs with the same buffer size for each entry, we can use KSMs instead.

This commit changes the Mkeys that map the SHAMPO headers buffer from KLMs to KSMs.

Signed-off-by: Yoray Zack <yorayz@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/mlx5e: SHAMPO, Simplify header page release in teardown (Dragos Tatulea, 2024-06-05; 1 file, -1/+1)

The function that releases SHAMPO header pages (mlx5e_shampo_dealloc_hd) has some complicated logic that comes from the fact that it is called twice during teardown:

1) To release the posted header pages that didn't get any completions.
2) To release all remaining header pages.

This flow is not necessary: all header pages can be released from the driver side in one go. Furthermore, the above flow is buggy. Taking the 8 headers per page example:

1) Release fragments 5-7. Page will be released.
2) Release remaining fragments 0-4. The bits in the header will indicate that the page needs releasing. But this is incorrect: the page was released in step 1.

This patch releases all header pages in one go. This simplifies the header page cleanup function. For consistency, the datapath header page release API (mlx5e_free_rx_shampo_hd_entry()) is used.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/mlx5e: Modifying channels number and updating TX queues (Carolina Jubran, 2024-05-13; 1 file, -0/+1)

It is not appropriate for the mlx5e_num_channels_changed function to be called solely for updating the TX queues when the channels number has not been changed.

Move the code responsible for updating the TC and TX queues out of mlx5e_num_channels_changed into a new function called mlx5e_update_tc_and_tx_queues, to be used when only the TC and TX queues need updating and the channels number remains unchanged.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240512124306.740898-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/mlx5e: Implement ethtool callbacks for supporting per-queue coalescing (Rahul Rameshbabu, 2024-04-22; 1 file, -0/+4)

Use mlx5 on-the-fly coalescing configuration support to enable individual channel configuration.

Co-developed-by: Nabil S. Alramli <dev@nalramli.com>
Signed-off-by: Nabil S. Alramli <dev@nalramli.com>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240419080445.417574-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
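For reference, per-queue coalescing rides on two hooks in struct ethtool_ops that take a queue index alongside the usual coalesce struct. The hook names and shapes below reflect the kernel's per-queue coalescing interface; the mlx5e handler names are assumptions and the bodies are omitted.

/* Sketch: wiring up per-queue coalescing in ethtool_ops. */
static int mlx5e_get_per_queue_coalesce(struct net_device *dev, u32 queue,
					struct ethtool_coalesce *coal);
static int mlx5e_set_per_queue_coalesce(struct net_device *dev, u32 queue,
					struct ethtool_coalesce *coal);

static const struct ethtool_ops mlx5e_ethtool_ops_sketch = {
	/* Operate on a single queue index rather than the whole device. */
	.get_per_queue_coalesce = mlx5e_get_per_queue_coalesce,
	.set_per_queue_coalesce = mlx5e_set_per_queue_coalesce,
};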
* net/mlx5e: Support updating coalescing configuration without resetting channels (Rahul Rameshbabu, 2024-04-22; 1 file, -0/+20)

When CQE mode or DIM state is changed, gracefully reconfigure channels to handle the new configuration. Previously, the driver would create new channels reflecting the changes rather than update the original channels.

Co-developed-by: Nabil S. Alramli <dev@nalramli.com>
Signed-off-by: Nabil S. Alramli <dev@nalramli.com>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240419080445.417574-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/mlx5e: Dynamically allocate DIM structure for SQs/RQs (Rahul Rameshbabu, 2024-04-22; 1 file, -2/+2)

Make it possible for the DIM structure to be torn down while an SQ or RQ is still active. Changing the CQ period mode is an example where the previous sampling done with the DIM structure would need to be invalidated.

Co-developed-by: Nabil S. Alramli <dev@nalramli.com>
Signed-off-by: Nabil S. Alramli <dev@nalramli.com>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240419080445.417574-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/mlx5e: Move DIM function declarations to en/dim.h (Rahul Rameshbabu, 2024-04-22; 1 file, -2/+0)

Create a header specifically for DIM-related declarations. Move existing DIM-specific functionality from en.h. Future DIM-related functionality will be declared in en/dim.h in subsequent patches.

Co-developed-by: Nabil S. Alramli <dev@nalramli.com>
Signed-off-by: Nabil S. Alramli <dev@nalramli.com>
Co-developed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240419080445.417574-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/mlx5e: Un-expose functions in en.h (Tariq Toukan, 2024-04-05; 1 file, -12/+0)

Un-expose functions that are not used outside of their .c file; make them static.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://lore.kernel.org/r/20240404173357.123307-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/mlx5: Convert uintX_t to uX (Gal Pressman, 2024-04-03; 1 file, -1/+1)

In the kernel, the preferred types are uX.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240402133043.56322-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/mlx5e: Support per-mdev queue counter (Tariq Toukan, 2024-03-07; 1 file, -3/+4)

Each queue counter object counts some events (in hardware) for the RQs that are attached to it, like events of packet drops due to no receive WQE (rx_out_of_buffer).

Each RQ can be attached to a queue counter only within the same vhca. To still cover all RQs with these counters, we create multiple instances, one per vhca. The result that's shown to the user is now the sum of all instances.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
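In outline, reporting becomes a read-and-sum over the per-vhca instances. A hedged sketch of that aggregation; the instance struct and query helper below are hypothetical stand-ins, not the driver's actual code.

/* Hypothetical sketch: one queue counter instance per vhca, summed on read. */
struct q_counter_instance {
	struct mlx5_core_dev *mdev; /* the vhca this counter lives on */
	u32 counter_id;
};

/* Hypothetical per-instance query. */
u64 query_rx_out_of_buffer(struct mlx5_core_dev *mdev, u32 counter_id);

static u64 report_rx_out_of_buffer(struct q_counter_instance *inst, int n)
{
	u64 total = 0;
	int i;

	for (i = 0; i < n; i++)
		total += query_rx_out_of_buffer(inst[i].mdev, inst[i].counter_id);
	return total; /* the value shown to the user */
}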
* net/mlx5e: Let channels be SD-aware (Tariq Toukan, 2024-03-07; 1 file, -0/+1)

Distribute the channels between the different SD-devices to achieve local NUMA node performance on multiple NUMA nodes.

Each channel works against one specific mdev, creating all datapath queues against it. We distribute channels to mdevs in a round-robin policy. Example for 2 mdevs and 6 channels:

+-------+---------+
| ch ix | mdev ix |
+-------+---------+
|   0   |    0    |
|   1   |    1    |
|   2   |    0    |
|   3   |    1    |
|   4   |    0    |
|   5   |    1    |
+-------+---------+

This round-robin distribution policy is preferred over another suggested intuitive distribution, in which we first distribute one half of the channels to mdev #0 and then the second half to mdev #1. We prefer round-robin for a reason: it is less influenced by changes in the number of channels. The mapping between channel index and mdev is fixed, no matter how many channels the user configures. As the channel stats are persistent across channel closure, changing the mapping every single time would make the accumulated stats less representative of the channel's history.

Per-channel objects should stop using the primary mdev (priv->mdev) directly, and instead move to using their own channel's mdev.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
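The mapping itself reduces to a single modulo, which is why it stays fixed per channel index regardless of the configured channel count; a minimal sketch (the helper name is assumed):

/* Round-robin channel -> mdev mapping: channel i always lands on the
 * same mdev, no matter how many channels are configured. */
static struct mlx5_core_dev *sd_channel_mdev(struct mlx5_core_dev **mdevs,
					     int num_mdevs, int ch_ix)
{
	return mdevs[ch_ix % num_mdevs];
}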
* net/mlx5e: Create EN core HW resources for all secondary devices (Tariq Toukan, 2024-03-07; 1 file, -0/+1)

Traffic queues will be created on all devices, including the secondaries. Create the needed core layer resources for them as well.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: Use the correct lag ports number when creating TISes (Saeed Mahameed, 2024-01-24; 1 file, -1/+1)

The cited commit moved the code of mlx5e_create_tises() and changed the loop to create TISes over the MLX5_MAX_PORTS constant value, instead of getting the correct number of lag ports supported by the device, which can cause FW errors on devices with fewer than MLX5_MAX_PORTS ports. Change that back to mlx5e_get_num_lag_ports(mdev).

Also, IPoIB interfaces create their own TISes; they don't use the eth TISes. Pass a flag to indicate that.

This fixes the following errors that might appear in the kernel log:
mlx5_cmd_out_err:808:(pid 650): CREATE_TIS(0x912) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x595b5d), err(-22)
mlx5e_create_mdev_resources:174:(pid 650): alloc tises failed, -22

Fixes: b25bd37c859f ("net/mlx5: Move TISes from priv to mdev HW resources")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* Revert "mlx5 updates 2023-12-20"Jakub Kicinski2024-01-071-11/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert "net/mlx5: Implement management PF Ethernet profile" This reverts commit 22c4640698a1d47606b5a4264a584e8046641784. Revert "net/mlx5: Enable SD feature" This reverts commit c88c49ac9c18fb7c3fa431126de1d8f8f555e912. Revert "net/mlx5e: Block TLS device offload on combined SD netdev" This reverts commit 83a59ce0057b7753d7fbece194b89622c663b2a6. Revert "net/mlx5e: Support per-mdev queue counter" This reverts commit d72baceb92539a178d2610b0e9ceb75706a75b55. Revert "net/mlx5e: Support cross-vhca RSS" This reverts commit c73a3ab8fa6e93a783bd563938d7cf00d62d5d34. Revert "net/mlx5e: Let channels be SD-aware" This reverts commit e4f9686bdee7b4dd89e0ed63cd03606e4bda4ced. Revert "net/mlx5e: Create EN core HW resources for all secondary devices" This reverts commit c4fb94aa822d6c9d05fc3c5aee35c7e339061dc1. Revert "net/mlx5e: Create single netdev per SD group" This reverts commit e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad. Revert "net/mlx5: SD, Add informative prints in kernel log" This reverts commit c82d360325112ccc512fc11a3b68cdcdf04a1478. Revert "net/mlx5: SD, Implement steering for primary and secondaries" This reverts commit 605fcce33b2d1beb0139b6e5913fa0b2062116b2. Revert "net/mlx5: SD, Implement devcom communication and primary election" This reverts commit a45af9a96740873db9a4b5bb493ce2ad81ccb4d5. Revert "net/mlx5: SD, Implement basic query and instantiation" This reverts commit 63b9ce944c0e26c44c42cdd5095c2e9851c1a8ff. Revert "net/mlx5: SD, Introduce SD lib" This reverts commit 4a04a31f49320d078b8078e1da4b0e2faca5dfa3. Revert "net/mlx5: Fix query of sd_group field" This reverts commit e04984a37398b3f4f5a79c993b94c6b1224184cc. Revert "net/mlx5e: Use the correct lag ports number when creating TISes" This reverts commit a7e7b40c4bc115dbf2a2bb453d7bbb2e0ea99703. There are some unanswered questions on the list, and we don't have any docs. Given the lack of replies so far and the fact that v6.8 merge window has started - let's revert this and revisit for v6.9. Link: https://lore.kernel.org/all/20231221005721.186607-1-saeed@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/mlx5: Implement management PF Ethernet profile (Armen Ratner, 2023-12-20; 1 file, -0/+4)

Add management PF modules, which introduce support for the structures needed to create the resources for the MGMT PF to work. Also, add the necessary calls and functions to establish this functionality.

Signed-off-by: Armen Ratner <armeng@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Daniel Jurgens <danielj@nvidia.com>
* net/mlx5e: Support per-mdev queue counter (Tariq Toukan, 2023-12-20; 1 file, -3/+4)

Each queue counter object counts some events (in hardware) for the RQs that are attached to it, like events of packet drops due to no receive WQE (rx_out_of_buffer).

Each RQ can be attached to a queue counter only within the same vhca. To still cover all RQs with these counters, we create multiple instances, one per vhca. The result that's shown to the user is now the sum of all instances.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: Let channels be SD-aware (Tariq Toukan, 2023-12-20; 1 file, -0/+1)

Distribute the channels between the different SD-devices to achieve local NUMA node performance on multiple NUMA nodes.

Each channel works against one specific mdev, creating all datapath queues against it. We distribute channels to mdevs in a round-robin policy. Example for 2 mdevs and 6 channels:

+-------+---------+
| ch ix | mdev ix |
+-------+---------+
|   0   |    0    |
|   1   |    1    |
|   2   |    0    |
|   3   |    1    |
|   4   |    0    |
|   5   |    1    |
+-------+---------+

This round-robin distribution policy is preferred over another suggested intuitive distribution, in which we first distribute one half of the channels to mdev #0 and then the second half to mdev #1. We prefer round-robin for a reason: it is less influenced by changes in the number of channels. The mapping between channel index and mdev is fixed, no matter how many channels the user configures. As the channel stats are persistent across channel closure, changing the mapping every single time would make the accumulated stats less representative of the channel's history.

Per-channel objects should stop using the primary mdev (priv->mdev) directly, and instead move to using their own channel's mdev.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: Create EN core HW resources for all secondary devices (Tariq Toukan, 2023-12-20; 1 file, -0/+1)

Traffic queues will be created on all devices, including the secondaries. Create the needed core layer resources for them as well.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: Use the correct lag ports number when creating TISes (Saeed Mahameed, 2023-12-20; 1 file, -1/+1)

The cited commit moved the code of mlx5e_create_tises() and changed the loop to create TISes over the MLX5_MAX_PORTS constant value, instead of getting the correct number of lag ports supported by the device, which can cause FW errors on devices with fewer than MLX5_MAX_PORTS ports. Change that back to mlx5e_get_num_lag_ports(mdev).

Also, IPoIB interfaces create their own TISes; they don't use the eth TISes. Pass a flag to indicate that.

Fixes: b25bd37c859f ("net/mlx5: Move TISes from priv to mdev HW resources")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* Merge tag 'mlx5-updates-2023-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux (David S. Miller, 2023-12-15; 1 file, -10/+15)

Saeed Mahameed says:

====================
mlx5-updates-2023-12-13

Preparation for the mlx5e socket direct feature. Socket direct will allow multiple PF devices attached to different NUMA nodes to share the same physical port. The following series is a small refactoring series in preparation to support socket direct in the following submission.

Highlights:
- Define required device registers and bits related to socket direct
- Flow steering re-arrangements
- Generalize TX objects (TISes) and store them in a common object; this will be useful in the next series for per-function object management
- Decouple raw CQ objects from their parent netdev priv
- Prepare devcom for Socket Direct device group discovery

Please see the individual patches for more information.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/mlx5e: Decouple CQ from priv (Tariq Toukan, 2023-12-13; 1 file, -2/+4)

Make the CQ struct and methods independent of "priv"; use more basic arguments instead. This will ease the transition to a netdev with multiple mdevs.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
| * net/mlx5: Move TISes from priv to mdev HW resources (Tariq Toukan, 2023-12-13; 1 file, -8/+11)

The transport interface send (TIS) object is responsible for performing all transport-related operations of the transmit side. Messages from Send Queues get segmented and transmitted by the TIS, including all transport-required implications; e.g., in the case of large send offload, the TIS is responsible for the segmentation.

These are stateless objects and can be used by multiple netdevs (e.g. representors) who share the same core device. Providing the TISes as a service from the core layer to the netdev layer reduces the number of replicated TIS objects (in case of multiple netdevs), and will ease the transition to a netdev with multiple mdevs.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (Jakub Kicinski, 2023-12-14; 1 file, -0/+1)

Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/intel/iavf/iavf_ethtool.c
  3a0b5a2929fd ("iavf: Introduce new state machines for flow director")
  95260816b489 ("iavf: use iavf_schedule_aq_request() helper")
https://lore.kernel.org/all/84e12519-04dc-bd80-bc34-8cf50d7898ce@intel.com/

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  c13e268c0768 ("bnxt_en: Fix HWTSTAMP_FILTER_ALL packet timestamp logic")
  c2f8063309da ("bnxt_en: Refactor RX VLAN acceleration logic.")
  a7445d69809f ("bnxt_en: Add support for new RX and TPA_START completion types for P7")
  1c7fd6ee2fe4 ("bnxt_en: Rename some macros for the P5 chips")
https://lore.kernel.org/all/20231211110022.27926ad9@canb.auug.org.au/

drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c
  bd6781c18cb5 ("bnxt_en: Fix wrong return value check in bnxt_close_nic()")
  84793a499578 ("bnxt_en: Skip nic close/open when configuring tstamp filters")
https://lore.kernel.org/all/20231214113041.3a0c003c@canb.auug.org.au/

drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c
  3d7a3f2612d7 ("net/mlx5: Nack sync reset request when HotPlug is enabled")
  cecf44ea1a1f ("net/mlx5: Allow sync reset flow when BF MGT interface device is present")
https://lore.kernel.org/all/20231211110328.76c925af@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | net/mlx5e: Fix possible deadlock on mlx5e_tx_timeout_work (Moshe Shemesh, 2023-12-04; 1 file, -0/+1)

Due to the cited patch, devlink health commands take the devlink lock, and this may result in a deadlock for mlx5e_tx_reporter, as it takes the local state_lock before calling devlink health report; on the other hand, devlink health commands, such as diagnose for the same reporter, take the local state_lock after taking the devlink lock (see kernel log below).

To fix it, remove the local state_lock from mlx5e_tx_timeout_work() before calling devlink_health_report(), and take care to cancel the work before any call to close channels, which may free the SQs that should be handled by the work. Before cancel_work_sync(), use current_work() to check we are not calling it from within the work, as mlx5e_tx_timeout_work() itself may close the channels and reopen them as part of the recovery flow.

While removing state_lock from mlx5e_tx_timeout_work(), keep rtnl_lock to ensure no change in netdev->real_num_tx_queues, but use rtnl_trylock() and a flag to avoid a deadlock by calling cancel_work_sync() before closing the channels while holding rtnl_lock too.

Kernel log:

======================================================
WARNING: possible circular locking dependency detected
6.0.0-rc3_for_upstream_debug_2022_08_30_13_10 #1 Not tainted
------------------------------------------------------
kworker/u16:2/65 is trying to acquire lock:
ffff888122f6c2f8 (&devlink->lock_key#2){+.+.}-{3:3}, at: devlink_health_report+0x2f1/0x7e0

but task is already holding lock:
ffff888121d20be0 (&priv->state_lock){+.+.}-{3:3}, at: mlx5e_tx_timeout_work+0x70/0x280 [mlx5_core]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&priv->state_lock){+.+.}-{3:3}:
       __mutex_lock+0x12c/0x14b0
       mlx5e_rx_reporter_diagnose+0x71/0x700 [mlx5_core]
       devlink_nl_cmd_health_reporter_diagnose_doit+0x212/0xa50
       genl_family_rcv_msg_doit+0x1e9/0x2f0
       genl_rcv_msg+0x2e9/0x530
       netlink_rcv_skb+0x11d/0x340
       genl_rcv+0x24/0x40
       netlink_unicast+0x438/0x710
       netlink_sendmsg+0x788/0xc40
       sock_sendmsg+0xb0/0xe0
       __sys_sendto+0x1c1/0x290
       __x64_sys_sendto+0xdd/0x1b0
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0

-> #0 (&devlink->lock_key#2){+.+.}-{3:3}:
       __lock_acquire+0x2c8a/0x6200
       lock_acquire+0x1c1/0x550
       __mutex_lock+0x12c/0x14b0
       devlink_health_report+0x2f1/0x7e0
       mlx5e_health_report+0xc9/0xd7 [mlx5_core]
       mlx5e_reporter_tx_timeout+0x2ab/0x3d0 [mlx5_core]
       mlx5e_tx_timeout_work+0x1c1/0x280 [mlx5_core]
       process_one_work+0x7c2/0x1340
       worker_thread+0x59d/0xec0
       kthread+0x28f/0x330
       ret_from_fork+0x1f/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&priv->state_lock);
                               lock(&devlink->lock_key#2);
                               lock(&priv->state_lock);
  lock(&devlink->lock_key#2);

 *** DEADLOCK ***

4 locks held by kworker/u16:2/65:
 #0: ffff88811a55b138 ((wq_completion)mlx5e#2){+.+.}-{0:0}, at: process_one_work+0x6e2/0x1340
 #1: ffff888101de7db8 ((work_completion)(&priv->tx_timeout_work)){+.+.}-{0:0}, at: process_one_work+0x70f/0x1340
 #2: ffffffff84ce8328 (rtnl_mutex){+.+.}-{3:3}, at: mlx5e_tx_timeout_work+0x53/0x280 [mlx5_core]
 #3: ffff888121d20be0 (&priv->state_lock){+.+.}-{3:3}, at: mlx5e_tx_timeout_work+0x70/0x280 [mlx5_core]

stack backtrace:
CPU: 1 PID: 65 Comm: kworker/u16:2 Not tainted 6.0.0-rc3_for_upstream_debug_2022_08_30_13_10 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
Call Trace:
<TASK>
 dump_stack_lvl+0x57/0x7d
 check_noncircular+0x278/0x300
 ? print_circular_bug+0x460/0x460
 ? find_held_lock+0x2d/0x110
 ? __stack_depot_save+0x24c/0x520
 ? alloc_chain_hlocks+0x228/0x700
 __lock_acquire+0x2c8a/0x6200
 ? register_lock_class+0x1860/0x1860
 ? kasan_save_stack+0x1e/0x40
 ? kasan_set_free_info+0x20/0x30
 ? ____kasan_slab_free+0x11d/0x1b0
 ? kfree+0x1ba/0x520
 ? devlink_health_do_dump.part.0+0x171/0x3a0
 ? devlink_health_report+0x3d5/0x7e0
 lock_acquire+0x1c1/0x550
 ? devlink_health_report+0x2f1/0x7e0
 ? lockdep_hardirqs_on_prepare+0x400/0x400
 ? find_held_lock+0x2d/0x110
 __mutex_lock+0x12c/0x14b0
 ? devlink_health_report+0x2f1/0x7e0
 ? devlink_health_report+0x2f1/0x7e0
 ? mutex_lock_io_nested+0x1320/0x1320
 ? trace_hardirqs_on+0x2d/0x100
 ? bit_wait_io_timeout+0x170/0x170
 ? devlink_health_do_dump.part.0+0x171/0x3a0
 ? kfree+0x1ba/0x520
 ? devlink_health_do_dump.part.0+0x171/0x3a0
 devlink_health_report+0x2f1/0x7e0
 mlx5e_health_report+0xc9/0xd7 [mlx5_core]
 mlx5e_reporter_tx_timeout+0x2ab/0x3d0 [mlx5_core]
 ? lockdep_hardirqs_on_prepare+0x400/0x400
 ? mlx5e_reporter_tx_err_cqe+0x1b0/0x1b0 [mlx5_core]
 ? mlx5e_tx_reporter_timeout_dump+0x70/0x70 [mlx5_core]
 ? mlx5e_tx_reporter_dump_sq+0x320/0x320 [mlx5_core]
 ? mlx5e_tx_timeout_work+0x70/0x280 [mlx5_core]
 ? mutex_lock_io_nested+0x1320/0x1320
 ? process_one_work+0x70f/0x1340
 ? lockdep_hardirqs_on_prepare+0x400/0x400
 ? lock_downgrade+0x6e0/0x6e0
 mlx5e_tx_timeout_work+0x1c1/0x280 [mlx5_core]
 process_one_work+0x7c2/0x1340
 ? lockdep_hardirqs_on_prepare+0x400/0x400
 ? pwq_dec_nr_in_flight+0x230/0x230
 ? rwlock_bug.part.0+0x90/0x90
 worker_thread+0x59d/0xec0
 ? process_one_work+0x1340/0x1340
 kthread+0x28f/0x330
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
</TASK>

Fixes: c90005b5f75c ("devlink: Hold the instance lock in health callbacks")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* | | net: ethtool: pass a pointer to parameters to get/set_rxfh ethtool ops (Ahmed Zaki, 2023-12-13; 1 file, -3/+3)

The get/set_rxfh ethtool ops currently take the rxfh (RSS) parameters as direct function arguments. This forces us to change the API (and all drivers' functions) every time some new parameters are added.

This is part 1/2 of the fix, as suggested in [1]:

- First, simplify the code by always providing a pointer to all params (indir, key and func); the fact that some of them may be NULL seems like a weird historic thing or a premature optimization. It will simplify the drivers if all pointers are always present.
- Then make the functions take a dev pointer, and a pointer to a single struct wrapping all arguments. The set_* should also take an extack.

Link: https://lore.kernel.org/netdev/20231121152906.2dd5f487@kernel.org/ [1]
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://lore.kernel.org/r/20231213003321.605376-2-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
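The wrapped-parameter shape this series moves toward looks roughly like the sketch below; treat the member list and the op signatures as an approximation of the direction described above, not the exact upstream definitions.

/* Approximate shape of the single wrapper struct the ops migrate to. */
struct ethtool_rxfh_param_sketch {
	u8 hfunc;       /* hash function (ETH_RSS_HASH_*) */
	u32 indir_size;
	u32 *indir;     /* RSS indirection table */
	u32 key_size;
	u8 *key;        /* RSS hash key */
};

/* New op shapes: one pointer instead of a growing argument list,
 * plus an extack on the setter. */
struct ethtool_ops_excerpt {
	int (*get_rxfh)(struct net_device *dev,
			struct ethtool_rxfh_param_sketch *rxfh);
	int (*set_rxfh)(struct net_device *dev,
			struct ethtool_rxfh_param_sketch *rxfh,
			struct netlink_ext_ack *extack);
};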
* | net/mlx5e: Implement AF_XDP TX timestamp and checksum offload (Stanislav Fomichev, 2023-11-29; 1 file, -1/+3)

TX timestamp:
- requires passing the clock; not sure I'm passing the correct one (from cq->mdev), but the timestamp value looks convincing

TX checksum:
- looks like the device does packet parsing (and doesn't accept custom start/offset), so I'm ignoring user offsets

Cc: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231127190319.1190813-5-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* net/mlx5e: Increase max supported channels number to 256 (Adham Faris, 2023-10-14; 1 file, -1/+1)

Increase the maximum supported number of channels to 256 (it is not extended further due to testing limitations).

Signed-off-by: Adham Faris <afaris@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: Preparations for supporting larger number of channels (Adham Faris, 2023-10-14; 1 file, -2/+3)

Data center server CPU counts keep getting larger with time. Currently, our driver limits the number of channels to 128. The maximum channels number is enforced and bounded by hardcoded defines (en.h/MLX5E_MAX_NUM_CHANNELS), even though the device and machine (CPU count) can allow more.

Refactor the current implementation in order to handle more channels. The maximum supported channels number will be increased in the followup patch.

Introduce the RQT size calculation/allocation scheme below:

1) Preserve the current RQT size of 256 for channel counts up to 128 (the old limit).
2) For greater channel counts, the RQT size is calculated by multiplying the channels number by 2 and rounding up the result to the nearest power of 2. If the calculated RQT size exceeds the maximum size supported by the NIC, fall back to this maximum RQT size (1 << log_max_rqt_size).

Since the RQT size is no longer static, allocate and free the indirection table SW shadow dynamically.

Signed-off-by: Adham Faris <afaris@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
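The sizing rule above translates to a few lines of code. A hedged sketch with assumed names: log_max_rqt_size stands in for the NIC capability, and the helper itself is hypothetical.

/* Sketch of the RQT sizing scheme described above. */
#define INDIR_RQT_SIZE_SKETCH 256 /* legacy fixed size, kept up to 128 channels */

static u32 rqt_size_for_channels(u32 num_channels, u8 log_max_rqt_size)
{
	u32 max_size = 1U << log_max_rqt_size; /* NIC limit */

	if (num_channels <= 128)
		return INDIR_RQT_SIZE_SKETCH;
	/* 2x channels, rounded up to a power of two, clipped to the NIC max. */
	return min_t(u32, roundup_pow_of_two(2 * num_channels), max_size);
}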
* net/mlx5: Handle IPsec steering upon master unbind/bind (Patrisious Haddad, 2023-10-02; 1 file, -0/+7)

When the master device is unbound, make sure to clean up all of the steering rules and flow tables that were created over the master, in order to allow proper unbinding of the master, and for ethernet traffic to continue to work independently.

Upon bringing the master device back up and attaching the slave to it, check if the slave already has IPsec configured and, if so, reconfigure the rules needed to support RoCE traffic.

Note that while the master device is unbound, the user is unable to configure IPsec again, since they are in a kind of illegal state: in MPV mode but with a slave that has no master. However, if IPsec was configured beforehand, it will continue to work for ethernet traffic while the master is unbound, and will continue to work for all traffic when the master is bound back again.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/8434e88912c588affe51b34669900382a132e873.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* net/mlx5: Register mlx5e priv to devcom in MPV mode (Patrisious Haddad, 2023-10-02; 1 file, -0/+1)

If the device is in MPV mode, the ethernet driver will now register for events from the IB driver about core device affiliation or de-affiliation. Use the key provided in said event to connect each mlx5e priv instance to its master counterpart. This way the ethernet driver is aware of which core device is its master, and can learn more besides, such as whether the partner device has IPsec configured or not.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/279adfa0aa3a1957a339086f2c1739a50b8e4b68.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* Merge branch 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux (Jakub Kicinski, 2023-08-24; 1 file, -1/+1)

Leon Romanovsky says:

====================
mlx5 MACsec RoCEv2 support

From Patrisious:

This series extends the previously added MACsec offload support to cover RoCE traffic as well.

In order to achieve that, we need to configure MACsec with offload between the two endpoints, like below:

REMOTE_MAC=10:70:fd:43:71:c0

* ip addr add 1.1.1.1/16 dev eth2
* ip link set dev eth2 up
* ip link add link eth2 macsec0 type macsec encrypt on
* ip macsec offload macsec0 mac
* ip macsec add macsec0 tx sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
* ip macsec add macsec0 rx port 1 address $REMOTE_MAC
* ip macsec add macsec0 rx port 1 address $REMOTE_MAC sa 0 pn 1 on key 01 ead3664f508eb06c40ac7104cdae4ce5
* ip addr add 10.1.0.1/16 dev macsec0
* ip link set dev macsec0 up

And in a similar manner on the other machine, while noting the keys order would be reversed and the MAC address of the other machine.

RDMA traffic is separated through relevant GID entries, and in case of an IP ambiguity issue - meaning we have a physical GID and a MACsec GID with the same IP/GID - we disable our physical GID in order to force the user to only use the MACsec GID.

v0: https://lore.kernel.org/netdev/20230813064703.574082-1-leon@kernel.org/

* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletion
  net/mlx5: Add RoCE MACsec steering infrastructure in core
  net/mlx5: Configure MACsec steering for ingress RoCEv2 traffic
  net/mlx5: Configure MACsec steering for egress RoCEv2 traffic
  IB/core: Reorder GID delete code for RoCE
  net/mlx5: Add MACsec priorities in RDMA namespaces
  RDMA/mlx5: Implement MACsec gid addition and deletion
  net/mlx5: Maintain fs_id xarray per MACsec device inside macsec steering
  net/mlx5: Remove netdevice from MACsec steering
  net/mlx5e: Move MACsec flow steering and statistics database from ethernet to core
  net/mlx5e: Rename MACsec flow steering functions/parameters to suit core naming style
  net/mlx5: Remove dependency of macsec flow steering on ethernet
  net/mlx5e: Move MACsec flow steering operations to be used as core library
  macsec: add functions to get macsec real netdevice and check offload
====================

Link: https://lore.kernel.org/r/20230821073833.59042-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net/mlx5e: Move MACsec flow steering operations to be used as core library (Patrisious Haddad, 2023-08-20; 1 file, -1/+1)

Move the MACsec flow steering operations (macsec_fs) from core/en_accel to core/lib. This mandates moving the MACsec statistics structure from the general MACsec code header (en_accel/macsec.h) to the macsec_fs header, to remove the macsec_fs.h dependency on en_accel/macsec.h.

This lays the ground for RoCE MACsec by moving all the data that will need to be accessed by both ethernet MACsec and RoCE MACsec to be shared at the core.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
* | net/mlx5: Rename mlx5_comp_vectors_count() to mlx5_comp_vectors_max() (Maher Sanalla, 2023-08-07; 1 file, -1/+1)

To accurately represent its purpose, rename the function that retrieves the value of maximum vectors from mlx5_comp_vectors_count() to mlx5_comp_vectors_max().

Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* | net/mlx5e: Make flow classification filters static (Parav Pandit, 2023-07-27; 1 file, -3/+0)

The get and set flow classification filters are used in a single file. Hence, make them static.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: Remove mlx5e_dbg() and msglvl support (Gal Pressman, 2023-06-16; 1 file, -10/+0)

The msglvl support was implemented using the mlx5e_dbg() macro, which is rarely used in the driver and is not very useful when you can just use dynamic debug instead. Remove mlx5e_dbg() and convert its usages to netdev_dbg().

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: Remove RX page cache leftovers (Tariq Toukan, 2023-06-07; 1 file, -7/+0)

Remove unused definitions left after the removal of the RX page cache feature.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: Use query_special_contexts cmd only once per mdev (Dragos Tatulea, 2023-05-24; 1 file, -0/+1)

Don't query the firmware so many times (num rqs * num wqes * wqe frags), because it slows down interface creation time linearly as that product grows. Do it only once per mdev and store the result in mlx5e_param.

Due to the helper function being called from different files, move it to an appropriate location. Rename the function with a proper prefix and add a small cleanup.

This fix applies only to legacy rq.

Fixes: 1b1e4868836a ("net/mlx5e: Use query_special_contexts for mkeys")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ (Tariq Toukan, 2023-04-19; 1 file, -0/+1)

Here we add support for multi-buffer XDP handling in Striding RQ, which is our default out-of-the-box RQ type. Before this series, loading such an XDP program would fail, until you switched to the legacy RQ (by unsetting the rx_striding_rq priv-flag).

To overcome the lack of headroom and tailroom between the strides, we allocate a side page to be used for the descriptor (xdp_buff / skb) and the linear part. When an XDP program is attached, we structure the xdp_buff so that it contains no data in the linear part, and the whole packet resides in the fragments.

In case of XDP_PASS, where an SKB still needs to be created, we copy up to 256 bytes to its linear part, to match the current behavior and satisfy functions that assume finding the packet headers in the SKB linear part (like eth_type_trans).

Performance testing:

Packet rate test, 64 bytes, 32 channels, MTU 9000 bytes.
CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz.
NIC: ConnectX-6 Dx, at 100 Gbps.

+----------+-------------+-------------+---------+
| Test     | Legacy RQ   | Striding RQ | Speedup |
+----------+-------------+-------------+---------+
| XDP_DROP | 101,615,544 | 117,191,020 | +15%    |
+----------+-------------+-------------+---------+
| XDP_TX   |  95,608,169 | 117,043,422 | +22%    |
+----------+-------------+-------------+---------+

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net/mlx5e: XDP, Use multiple single-entry objects in xdpi_fifo (Tariq Toukan, 2023-04-19; 1 file, -1/+1)

Here we fix the current wi->num_pkts abuse, as it was used to indicate multiple xdpi entries in the xdpi_fifo.

Instead, reduce mlx5e_xdp_info to the size of a single field, making it a union of unions. Per packet, use as many instances as needed to provide the information needed at the time of completion. The sequence of xdpi instances pushed is well defined, derived by the xmit_mode.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net/mlx5e: Introduce extended version for mlx5e_xmit_data (Tariq Toukan, 2023-04-19; 1 file, -1/+0)

Introduce struct mlx5e_xmit_data_frags to be used for non-linear xmit buffers. Let it include a sinfo pointer.

Take one bit from the len field to indicate if the descriptor has fragments and can be casted up into the extended version. Zero-init to make sure has_frags, and potentially future fields, are zero when not explicitly assigned.

Another field will be added in a downstream patch to indicate and point to dma addresses of the different frags, for redirect-in requests. This simplifies the mlx5e_xmit_xdp_frame/mlx5e_xmit_xdp_frame_mpwqe function params.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net/mlx5e: Move struct mlx5e_xmit_data to datapath header (Tariq Toukan, 2023-04-19; 1 file, -6/+1)

Move the TX datapath struct from the generic en.h to the datapath txrx.h header, where it belongs.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net/mlx5e: Move XDP struct and enum to XDP header (Tariq Toukan, 2023-04-19; 1 file, -35/+0)

Move struct mlx5e_xdp_info and enum mlx5e_xdp_xmit_mode from the generic en.h to the XDP header, where they belong.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net/mlx5e: RX, Break the wqe bulk refill in smaller chunks (Dragos Tatulea, 2023-03-28; 1 file, -0/+1)

To avoid overflowing the page pool's cache, don't release the whole bulk, which is usually larger than the cache refill size. Group release+alloc instead into cache refill units that allow releasing to the cache and then allocating from the cache.

A refill_unit variable is added as an iteration unit over the wqe_bulk when doing release+alloc.

For a single ring, single core, default MTU (1500) TCP stream test, the number of pages allocated from the cache directly (rx_pp_recycle_cached) increases from 0% to 52%:

+------------------------------------------------+
| Page Pool stats (/sec)   | Before  | After     |
+--------------------------+---------+-----------+
| rx_pp_alloc_fast         | 2145422 | 2193802   |
| rx_pp_alloc_slow         | 2       | 0         |
| rx_pp_alloc_empty        | 2       | 0         |
| rx_pp_alloc_refill       | 34059   | 16634     |
| rx_pp_alloc_waive        | 0       | 0         |
| rx_pp_recycle_cached     | 0       | 1145818   |
| rx_pp_recycle_cache_full | 0       | 0         |
| rx_pp_recycle_ring       | 2179361 | 1064616   |
| rx_pp_recycle_ring_full  | 121     | 0         |
+------------------------------------------------+

With this patch, the performance for legacy rq for the above test is back to baseline.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
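In outline, the bulk is walked in refill-unit-sized chunks, releasing a chunk's old fragments before allocating its replacements so the releases land in (and the allocations hit) the page pool cache. A hedged sketch with hypothetical helpers; none of these names are the driver's actual identifiers.

/* Hypothetical helpers standing in for the real release/alloc paths. */
struct mlx5e_rq;
void release_wqes(struct mlx5e_rq *rq, int start, int count);
void alloc_wqes(struct mlx5e_rq *rq, int start, int count);

/* Sketch: chunked release+alloc over a WQE bulk. */
static void refill_wqe_bulk(struct mlx5e_rq *rq, int bulk, int refill_unit)
{
	int done = 0;

	while (done < bulk) {
		int chunk = min(refill_unit, bulk - done);

		/* Release first so pages enter the cache, then allocate
		 * from that same cache before it can overflow. */
		release_wqes(rq, done, chunk);
		alloc_wqes(rq, done, chunk);
		done += chunk;
	}
}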
* net/mlx5e: RX, Increase WQE bulk size for legacy rq (Dragos Tatulea, 2023-03-28; 1 file, -1/+1)

Deferred page release was added to legacy rq, but its desired effect (the driver releasing the last fragment to the page pool cache) is not yet visible due to the WQE bulks being too small.

This patch increases the WQE bulk size to span 512 KB and clips it to one quarter of the rx queue size.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
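The sizing described above reduces to something like the following; the helper and parameter names are assumptions, not the driver's actual identifiers:

/* Hypothetical sketch: bulk spans 512 KB worth of WQEs, clipped to a
 * quarter of the RX queue size. */
static u32 wqe_bulk_size(u32 wqe_size_bytes, u32 wq_size)
{
	u32 bulk = DIV_ROUND_UP(512 * 1024, wqe_size_bytes);

	return min(bulk, wq_size / 4);
}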
* net/mlx5e: RX, Defer page release in legacy rq for better recycling (Dragos Tatulea, 2023-03-28; 1 file, -0/+1)

Currently, fragmented pages from the page pool can be released in two ways:

1) In the mlx5e driver, when trimming off the unused fragments AND the associated skb fragments have been released. This path allows recycling of pages to the page pool cache (allow_direct == true).
2) On the skb release path (last fragment release), which will always release pages to the page pool ring (allow_direct == false).

Whichever is releasing the last fragment will be decisive on where the page gets released: the cache or the ring. So we obviously want to maximize for doing the release from 1.

This patch does that by deferring the release of page fragments right before requesting new ones from the page pool. A flag is added to make sure that there's no release before the first alloc and that XDP_TX fragments are not released prematurely.

This is a preparation patch that doesn't unlock the performance improvements yet. A followup patch will do that.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: RX, Change wqe last_in_page field from bool to bit flags (Dragos Tatulea, 2023-03-28; 1 file, -1/+5)

Change the bool flag to a bitfield, as we'll use it in a downstream patch in the series to add signaling about skipping a fragment release.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name (Dragos Tatulea, 2023-03-28; 1 file, -1/+1)

The xdp_xmit_bitmap currently serves only one purpose: to avoid releasing pages that are still in use due to XDP TX. A following patch will use this bitmap in a slightly different context, but for the same purpose. So rename the bitmap to a more generic name that reflects the purpose, not the context.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>