summaryrefslogtreecommitdiffstats
path: root/drivers/net/ethernet/intel/ice
Commit message (Collapse)AuthorAgeFilesLines
* treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner2025-04-052-3/+3
| | | | | | | | | | timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2025-03-265-23/+55
|\ | | | | | | | | | | | | | | | | | | | | | | Merge in late fixes to prepare for the 6.15 net-next PR. No conflicts, adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c 919f9f497dbc ("eth: bnxt: fix out-of-range access of vnic_info array") fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * ice: fix using untrusted value of pkt_len in ice_vc_fdir_parse_raw()Mateusz Polchlopek2025-03-181-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | Fix using the untrusted value of proto->raw.pkt_len in function ice_vc_fdir_parse_raw() by verifying if it does not exceed the VIRTCHNL_MAX_SIZE_RAW_PACKET value. Fixes: 99f419df8a5c ("ice: enable FDIR filters from raw binary patterns for VFs") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: fix input validation for virtchnl BWLukasz Czapnik2025-03-181-3/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add missing validation of tc and queue id values sent by a VF in ice_vc_cfg_q_bw(). Additionally fixed logged value in the warning message, where max_tx_rate was incorrectly referenced instead of min_tx_rate. Also correct error handling in this function by properly exiting when invalid configuration is detected. Fixes: 015307754a19 ("ice: Support VF queue rate limit and quanta size configuration") Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com> Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: validate queue quanta parameters to prevent OOB accessJan Glaza2025-03-181-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | Add queue wraparound prevention in quanta configuration. Ensure end_qid does not overflow by validating start_qid and num_queues. Fixes: 015307754a19 ("ice: Support VF queue rate limit and quanta size configuration") Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jan Glaza <jan.glaza@intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: stop truncating queue ids when checkingJan Glaza2025-03-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Queue IDs can be up to 4096, fix invalid check to stop truncating IDs to 8 bits. Fixes: bf93bf791cec8 ("ice: introduce ice_virtchnl.c and ice_virtchnl.h") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jan Glaza <jan.glaza@intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: fix reservation of resources for RDMA when disabledJesse Brandeburg2025-03-181-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the CONFIG_INFINIBAND_IRDMA symbol is not enabled as a module or a built-in, then don't let the driver reserve resources for RDMA. The result of this change is a large savings in resources for older kernels, and a cleaner driver configuration for the IRDMA=n case for old and new kernels. Implement this by avoiding enabling the RDMA capability when scanning hardware capabilities. Note: Loading the out-of-tree irdma driver in connection to the in-kernel ice driver, is not supported, and should not be attempted, especially when disabling IRDMA in the kernel config. Fixes: d25a0fc41c1f ("ice: Initialize RDMA support") Signed-off-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Acked-by: Dave Ertman <david.m.ertman@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: ensure periodic output start time is in the futureKarol Kolacinski2025-03-181-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On E800 series hardware, if the start time for a periodic output signal is programmed into GLTSYN_TGT_H and GLTSYN_TGT_L registers, the hardware logic locks up and the periodic output signal never starts. Any future attempt to reprogram the clock function is futile as the hardware will not reset until a power on. The ice_ptp_cfg_perout function has logic to prevent this, as it checks if the requested start time is in the past. If so, a new start time is calculated by rounding up. Since commit d755a7e129a5 ("ice: Cache perout/extts requests and check flags"), the rounding is done to the nearest multiple of the clock period, rather than to a full second. This is more accurate, since it ensures the signal matches the user request precisely. Unfortunately, there is a race condition with this rounding logic. If the current time is close to the multiple of the period, we could calculate a target time that is extremely soon. It takes time for the software to program the registers, during which time this requested start time could become a start time in the past. If that happens, the periodic output signal will lock up. For large enough periods, or for the logic prior to the mentioned commit, this is unlikely. However, with the new logic rounding to the period and with a small enough period, this becomes inevitable. For example, attempting to enable a 10MHz signal requires a period of 100 nanoseconds. This means in the *best* case, we have 99 nanoseconds to program the clock output. This is essentially impossible, and thus such a small period practically guarantees that the clock output function will lock up. To fix this, add some slop to the clock time used to check if the start time is in the past. Because it is not critical that output signals start immediately, but it *is* critical that we do not brick the function, 0.5 seconds is selected. This does mean that any requested output will be delayed by at least 0.5 seconds. This slop is applied before rounding, so that we always round up to the nearest multiple of the period that is at least 0.5 seconds in the future, ensuring a minimum of 0.5 seconds to program the clock output registers. Finally, to ensure that the hardware registers programming the clock output complete in a timely manner, add a write flush to the end of ice_ptp_write_perout. This ensures we don't risk any issue with PCIe transaction batching. Strictly speaking, this fixes a race condition all the way back at the initial implementation of periodic output programming, as it is theoretically possible to trigger this bug even on the old logic when always rounding to a full second. However, the window is narrow, and the code has been refactored heavily since then, making a direct backport not apply cleanly. Fixes: d755a7e129a5 ("ice: Cache perout/extts requests and check flags") Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: health.c: fix compilation on gcc 7.5Przemek Kitszel2025-03-181-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GCC 7 is not as good as GCC 8+ in telling what is a compile-time const, and thus could be used for static storage. Fortunately keeping strings as const arrays is enough to make old gcc happy. Excerpt from the report: My GCC is: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0. CC [M] drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.o drivers/net/ethernet/intel/ice/devlink/health.c:35:3: error: initializer element is not constant ice_common_port_solutions, {ice_port_number_label}}, ^~~~~~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/intel/ice/devlink/health.c:35:3: note: (near initialization for 'ice_health_status_lookup[0].solution') drivers/net/ethernet/intel/ice/devlink/health.c:35:31: error: initializer element is not constant ice_common_port_solutions, {ice_port_number_label}}, ^~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/intel/ice/devlink/health.c:35:31: note: (near initialization for 'ice_health_status_lookup[0].data_label[0]') drivers/net/ethernet/intel/ice/devlink/health.c:37:46: error: initializer element is not constant "Change or replace the module or cable.", {ice_port_number_label}}, ^~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/intel/ice/devlink/health.c:37:46: note: (near initialization for 'ice_health_status_lookup[1].data_label[0]') drivers/net/ethernet/intel/ice/devlink/health.c:39:3: error: initializer element is not constant ice_common_port_solutions, {ice_port_number_label}}, ^~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: 85d6164ec56d ("ice: add fw and port health reporters") Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Closes: https://lore.kernel.org/netdev/CY8PR11MB7134BF7A46D71E50D25FA7A989F72@CY8PR11MB7134.namprd11.prod.outlook.com Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Suggested-by: Simon Horman <horms@kernel.org> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
* | ice: E825C PHY register cleanupKarol Kolacinski2025-03-181-17/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Minor PTP register refactor, including logical grouping E825C 1-step timestamping registers. Remove unused register definitions (PHY_REG_GPCS_BITSLIP, PHY_REG_REVISION). Also, apply preferred GENMASK macro (instead of ICE_M) for register fields definition affected by this patch. Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250310174502.3708121-5-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | ice: Refactor E825C PHY registers info structKarol Kolacinski2025-03-183-65/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify ice_phy_reg_info_eth56g struct definition to include base address for the very first quad. Use base address info and 'step' value to determine address for specific PHY quad. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250310174502.3708121-4-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | ice: rename ice_ptp_init_phc_eth56g functionKarol Kolacinski2025-03-181-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor the code by changing ice_ptp_init_phc_eth56g function name to ice_ptp_init_phc_e825, to be consistent with the naming pattern for other devices. Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250310174502.3708121-3-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | ice: Add E830 checksum offload supportPaul Greenwalt2025-03-187-4/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | E830 supports raw receive and generic transmit checksum offloads. Raw receive checksum support is provided by hardware calculating the checksum over the whole packet, regardless of type. The calculated checksum is provided to driver in the Rx flex descriptor. Then the driver assigns the checksum to skb->csum and sets skb->ip_summed to CHECKSUM_COMPLETE. Generic transmit checksum support is provided by hardware calculating the checksum given two offsets: the start offset to begin checksum calculation, and the offset to insert the calculated checksum in the packet. Support is advertised to the stack using NETIF_F_HW_CSUM feature. E830 has the following limitations when both generic transmit checksum offload and TCP Segmentation Offload (TSO) are enabled: 1. Inner packet header modification is not supported. This restriction includes the inability to alter TCP flags, such as the push flag. As a result, this limitation can impact the receiver's ability to coalesce packets, potentially degrading network throughput. 2. The Maximum Segment Size (MSS) is limited to 1023 bytes, which prevents support of Maximum Transmission Unit (MTU) greater than 1063 bytes. Therefore NETIF_F_HW_CSUM and NETIF_F_ALL_TSO features are mutually exclusive. NETIF_F_HW_CSUM hardware feature support is indicated but is not enabled by default. Instead, IP checksums and NETIF_F_ALL_TSO are the defaults. Enforcement of mutual exclusivity of NETIF_F_HW_CSUM and NETIF_F_ALL_TSO is done in ice_set_features(). Mutual exclusivity of IP checksums and NETIF_F_HW_CSUM is handled by netdev_fix_features(). When NETIF_F_HW_CSUM is requested the provided skb->csum_start and skb->csum_offset are passed to hardware in the Tx context descriptor generic checksum (GCS) parameters. Hardware calculates the 1's complement from skb->csum_start to the end of the packet, and inserts the result in the packet at skb->csum_offset. Co-developed-by: Alice Michael <alice.michael@intel.com> Signed-off-by: Alice Michael <alice.michael@intel.com> Co-developed-by: Eric Joyner <eric.joyner@intel.com> Signed-off-by: Eric Joyner <eric.joyner@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250310174502.3708121-2-anthony.l.nguyen@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni2025-03-137-32/+33
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: tools/testing/selftests/drivers/net/ping.py 75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py") de94e8697405 ("selftests: drv-net: store addresses in dict indexed by ipver") https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/ net/core/devmem.c a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") 1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations") https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/ Adjacent changes: tools/testing/selftests/net/Makefile 6f50175ccad4 ("selftests: Add IPv6 link-local address generation tests for GRE devices.") 2e5584e0f913 ("selftests/net: expand cmsg_ipv6.sh with ipv4") drivers/net/ethernet/broadcom/bnxt/bnxt.c 661958552eda ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic") fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| * ice: register devlink prior to creating health reportersPrzemek Kitszel2025-03-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ice_health_init() was introduced in the commit 2a82874a3b7b ("ice: add Tx hang devlink health reporter"). The call to it should have been put after ice_devlink_register(). It went unnoticed until next reporter by Konrad, which receives events from FW. FW is reporting all events, also from prior driver load, and thus it is not unlikely to have something at the very beginning. And that results in a splat: [ 24.455950] ? devlink_recover_notify.constprop.0+0x198/0x1b0 [ 24.455973] devlink_health_report+0x5d/0x2a0 [ 24.455976] ? __pfx_ice_health_status_lookup_compare+0x10/0x10 [ice] [ 24.456044] ice_process_health_status_event+0x1b7/0x200 [ice] Do the analogous thing for deinit patch. Fixes: 85d6164ec56d ("ice: add fw and port health reporters") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Konrad Knitter <konrad.knitter@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: Fix switchdev slow-path in LAGMarcin Szycik2025-03-052-1/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ever since removing switchdev control VSI and using PF for port representor Tx/Rx, switchdev slow-path has been working improperly after failover in SR-IOV LAG. LAG assumes that the first uplink to be added to the aggregate will own VFs and have switchdev configured. After failing-over to the other uplink, representors are still configured to Tx through the uplink they are set up on, which fails because that uplink is now down. On failover, update all PRs on primary uplink to use the currently active uplink for Tx. Call netif_keep_dst(), as the secondary uplink might not be in switchdev mode. Also make sure to call ice_eswitch_set_target_vsi() if uplink is in LAG. On the Rx path, representors are already working properly, because default Tx from VFs is set to PF owning the eswitch. After failover the same PF is receiving traffic from VFs, even though link is down. Fixes: defd52455aee ("ice: do Tx through PF netdev in slow-path") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: fix memory leak in aRFS after resetGrzegorz Nitka2025-03-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix aRFS (accelerated Receive Flow Steering) structures memory leak by adding a checker to verify if aRFS memory is already allocated while configuring VSI. aRFS objects are allocated in two cases: - as part of VSI initialization (at probe), and - as part of reset handling However, VSI reconfiguration executed during reset involves memory allocation one more time, without prior releasing already allocated resources. This led to the memory leak with the following signature: [root@os-delivery ~]# cat /sys/kernel/debug/kmemleak unreferenced object 0xff3c1ca7252e6000 (size 8192): comm "kworker/0:0", pid 8, jiffies 4296833052 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 0): [<ffffffff991ec485>] __kmalloc_cache_noprof+0x275/0x340 [<ffffffffc0a6e06a>] ice_init_arfs+0x3a/0xe0 [ice] [<ffffffffc09f1027>] ice_vsi_cfg_def+0x607/0x850 [ice] [<ffffffffc09f244b>] ice_vsi_setup+0x5b/0x130 [ice] [<ffffffffc09c2131>] ice_init+0x1c1/0x460 [ice] [<ffffffffc09c64af>] ice_probe+0x2af/0x520 [ice] [<ffffffff994fbcd3>] local_pci_probe+0x43/0xa0 [<ffffffff98f07103>] work_for_cpu_fn+0x13/0x20 [<ffffffff98f0b6d9>] process_one_work+0x179/0x390 [<ffffffff98f0c1e9>] worker_thread+0x239/0x340 [<ffffffff98f14abc>] kthread+0xcc/0x100 [<ffffffff98e45a6d>] ret_from_fork+0x2d/0x50 [<ffffffff98e083ba>] ret_from_fork_asm+0x1a/0x30 ... Fixes: 28bf26724fdb ("ice: Implement aRFS") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: do not configure destination override for switchdevLarysa Zaremba2025-03-053-28/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After switchdev is enabled and disabled later, LLDP packets sending stops, despite working perfectly fine before and during switchdev state. To reproduce (creating/destroying VF is what triggers the reconfiguration): devlink dev eswitch set pci/<address> mode switchdev echo '2' > /sys/class/net/<ifname>/device/sriov_numvfs echo '0' > /sys/class/net/<ifname>/device/sriov_numvfs This happens because LLDP relies on the destination override functionality. It needs to 1) set a flag in the descriptor, 2) set the VSI permission to make it valid. The permissions are set when the PF VSI is first configured, but switchdev then enables it for the uplink VSI (which is always the PF) once more when configured and disables when deconfigured, which leads to software-generated LLDP packets being blocked. Do not modify the destination override permissions when configuring switchdev, as the enabled state is the default configuration that is never modified. Fixes: 1a1c40df2e80 ("ice: set and release switchdev environment") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
* | ice: dpll: Remove newline at the end of a netlink error messageGal Pressman2025-02-271-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | Netlink error messages should not have a newline at the end of the string. Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Acked-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250226093904.6632-6-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2025-02-274-6/+11
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c fa52f15c745c ("net: cadence: macb: Synchronize stats calculations") 75696dd0fd72 ("net: cadence: macb: Convert to get_stats64") https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au Adjacent changes: drivers/net/ethernet/intel/ice/ice_sriov.c 79990cf5e7ad ("ice: Fix deinitializing VF in error path") a203163274a4 ("ice: simplify VF MSI-X managing") net/ipv4/tcp.c 18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace") 297d389e9e5b ("net: prefix devmem specific helpers") net/mptcp/subflow.c 8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join") c3349a22c200 ("mptcp: consolidate subflow cleanup") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * ice: Avoid setting default Rx VSI twice in switchdev setupMarcin Szycik2025-02-251-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As part of switchdev environment setup, uplink VSI is configured as default for both Tx and Rx. Default Rx VSI is also used by promiscuous mode. If promisc mode is enabled and an attempt to enter switchdev mode is made, the setup will fail because Rx VSI is already configured as default (rule exists). Reproducer: devlink dev eswitch set $PF1_PCI mode switchdev ip l s $PF1 up ip l s $PF1 promisc on echo 1 > /sys/class/net/$PF1/device/sriov_numvfs In switchdev setup, use ice_set_dflt_vsi() instead of plain ice_cfg_dflt_vsi(), which avoids repeating setting default VSI for Rx if it's already configured. Fixes: 50d62022f455 ("ice: default Tx rule instead of to queue") Reported-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Closes: https://lore.kernel.org/intel-wired-lan/PH0PR11MB50138B635F2E5CEB7075325D961F2@PH0PR11MB5013.namprd11.prod.outlook.com Reviewed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250224190647.3601930-3-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * ice: Fix deinitializing VF in error pathMarcin Szycik2025-02-253-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If ice_ena_vfs() fails after calling ice_create_vf_entries(), it frees all VFs without removing them from snapshot PF-VF mailbox list, leading to list corruption. Reproducer: devlink dev eswitch set $PF1_PCI mode switchdev ip l s $PF1 up ip l s $PF1 promisc on sleep 1 echo 1 > /sys/class/net/$PF1/device/sriov_numvfs sleep 1 echo 1 > /sys/class/net/$PF1/device/sriov_numvfs Trace (minimized): list_add corruption. next->prev should be prev (ffff8882e241c6f0), but was 0000000000000000. (next=ffff888455da1330). kernel BUG at lib/list_debug.c:29! RIP: 0010:__list_add_valid_or_report+0xa6/0x100 ice_mbx_init_vf_info+0xa7/0x180 [ice] ice_initialize_vf_entry+0x1fa/0x250 [ice] ice_sriov_configure+0x8d7/0x1520 [ice] ? __percpu_ref_switch_mode+0x1b1/0x5d0 ? __pfx_ice_sriov_configure+0x10/0x10 [ice] Sometimes a KASAN report can be seen instead with a similar stack trace: BUG: KASAN: use-after-free in __list_add_valid_or_report+0xf1/0x100 VFs are added to this list in ice_mbx_init_vf_info(), but only removed in ice_free_vfs(). Move the removing to ice_free_vf_entries(), which is also being called in other places where VFs are being removed (including ice_free_vfs() itself). Fixes: 8cd8a6b17d27 ("ice: move VF overflow message count into struct ice_mbx_vf_info") Reported-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Closes: https://lore.kernel.org/intel-wired-lan/PH0PR11MB50138B635F2E5CEB7075325D961F2@PH0PR11MB5013.namprd11.prod.outlook.com Reviewed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250224190647.3601930-2-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ice: use napi's irq affinity and rmap IRQ notifiersAhmed Zaki2025-02-266-93/+6
| | | | | | | | | | | | | | | | | | Delete the driver CPU affinity and aRFS rmap info, use the core's API instead. Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Link: https://patch.msgid.link/20250224232228.990783-5-ahmed.zaki@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ice: clear NAPI's IRQ numbers in ice_vsi_clear_napi_queues()Ahmed Zaki2025-02-261-1/+8
| | | | | | | | | | | | | | | | | | We set the NAPI's IRQ number in ice_vsi_set_napi_queues(). Clear the NAPI's IRQ in ice_vsi_clear_napi_queues(). Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Link: https://patch.msgid.link/20250224232228.990783-4-ahmed.zaki@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ethtool: Symmetric OR-XOR RSS hashGal Pressman2025-02-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an additional type of symmetric RSS hash type: OR-XOR. The "Symmetric-OR-XOR" algorithm transforms the input as follows: (SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT) Change 'cap_rss_sym_xor_supported' to 'supported_input_xfrm', a bitmap of supported RXH_XFRM_* types. Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250224174416.499070-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge branch '100GbE' of ↵Jakub Kicinski2025-02-178-12/+104
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ice, iavf: Add support for Rx timestamping Mateusz Polchlopek says: Initially, during VF creation it registers the PTP clock in the system and negotiates with PF it's capabilities. In the meantime the PF enables the Flexible Descriptor for VF. Only this type of descriptor allows to receive Rx timestamps. Enabling virtual clock would be possible, though it would probably perform poorly due to the lack of direct time access. Enable timestamping should be done using userspace tools, e.g. hwstamp_ctl -i $VF -r 14 In order to report the timestamps to userspace, the VF extends timestamp to 40b. To support this feature the flexible descriptors and PTP part in iavf driver have been introduced. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: iavf: add support for Rx timestamps to hotpath iavf: handle set and get timestamps ops iavf: Implement checking DD desc field iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors iavf: define Rx descriptors as qwords libeth: move idpf_rx_csum_decoded and idpf_rx_extracted iavf: periodically cache PHC time iavf: add support for indirect access to PHC time iavf: add initial framework for registering PTP clock iavf: negotiate PTP capabilities iavf: add support for negotiating flexible RXDID format virtchnl: add enumeration for the rxdid format ice: support Rx timestamp on flex descriptor virtchnl: add support for enabling PTP on iAVF ==================== Link: https://patch.msgid.link/20250214192739.1175740-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | iavf: add support for negotiating flexible RXDID formatJacob Keller2025-02-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable support for VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC, to enable the VF driver the ability to determine what Rx descriptor formats are available. This requires sending an additional message during initialization and reset, the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS. This operation requests the supported Rx descriptor IDs available from the PF. This is treated the same way that VLAN V2 capabilities are handled. Add a new set of extended capability flags, used to process send and receipt of the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS message. This ensures we finish negotiating for the supported descriptor formats prior to beginning configuration of receive queues. This change stores the supported format bitmap into the iavf_adapter structure. Additionally, if VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC is enabled by the PF, we need to make sure that the Rx queue configuration specifies the format. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * | ice: support Rx timestamp on flex descriptorSimei Su2025-02-148-10/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To support Rx timestamp offload, VIRTCHNL_OP_1588_PTP_CAPS is sent by the VF to request PTP capability and responded by the PF what capability is enabled for that VF. Hardware captures timestamps which contain only 32 bits of nominal nanoseconds, as opposed to the 64bit timestamps that the stack expects. To convert 32b to 64b, we need a current PHC time. VIRTCHNL_OP_1588_PTP_GET_TIME is sent by the VF and responded by the PF with the current PHC time. Signed-off-by: Simei Su <simei.su@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
* | | ice: Fix signedness bug in ice_init_interrupt_scheme()Dan Carpenter2025-02-141-2/+2
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If pci_alloc_irq_vectors() can't allocate the minimum number of vectors then it returns -ENOSPC so there is no need to check for that in the caller. In fact, because pf->msix.min is an unsigned int, it means that any negative error codes are type promoted to high positive values and treated as success. So here, the "return -ENOMEM;" is unreachable code. Check for negatives instead. Now that we're only dealing with error codes, it's easier to propagate the error code from pci_alloc_irq_vectors() instead of hardcoding -ENOMEM. Fixes: 79d97b8cf9a8 ("ice: remove splitting MSI-X between features") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/b16e4f01-4c85-46e2-b602-fce529293559@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge branch '100GbE' of ↵Jakub Kicinski2025-02-1115-548/+735
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-02-10 (ice, igc, e1000e) For ice: Karol, Jake, and Michal add PTP support for E830 devices. Karol refactors and cleans up PTP code. Jake allows for a common cross-timestamp implementation to be shared for all devices and Michal adds E830 support. Mateusz cleans up initial Flow Director rule creation to loop rather than duplicate repeated similar calls. For igc: Siang adjust calls to remove need for close and open calls on loading XDP program. For e1000e: Gerhard Engleder batches register writes for writing multicast table on real-time kernels. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: e1000e: Fix real-time violations on link up igc: Avoid unnecessary link down event in XDP_SETUP_PROG process ice: refactor ice_fdir_create_dflt_rules() function ice: Implement PTP support for E830 devices ice: Refactor ice_ptp_init_tx_* ice: Add unified ice_capture_crosststamp ice: Process TSYN IRQ in a separate function ice: Use FIELD_PREP for timestamp values ice: Remove unnecessary ice_is_e8xx() functions ice: Don't check device type when checking GNSS presence ==================== Link: https://patch.msgid.link/20250210192352.3799673-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | ice: refactor ice_fdir_create_dflt_rules() functionMateusz Polchlopek2025-02-101-12/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Flow Director function ice_fdir_create_dflt_rules() calls few times function ice_create_init_fdir_rule() each time with different enum ice_fltr_ptype parameter. Next step is to return error code if error occurred. Change the code to store all necessary default rules in constant array and call ice_create_init_fdir_rule() in the loop. It makes it easy to extend the list of default rules in the future, without the need of duplicate code more and more. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * | ice: Implement PTP support for E830 devicesMichal Michalik2025-02-105-6/+261
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add specific functions and definitions for E830 devices to enable PTP support. E830 devices support direct write to GLTSYN_ registers without shadow registers and 64 bit read of PHC time. Enable PTM for E830 device, which is required for cross timestamp and and dependency on PCIE_PTM for ICE_HWTS. Check X86_FEATURE_ART for E830 as it may not be present in the CPU. Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Co-developed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Co-developed-by: Milena Olech <milena.olech@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Michal Michalik <michal.michalik@intel.com> Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * | ice: Refactor ice_ptp_init_tx_*Karol Kolacinski2025-02-102-39/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unify ice_ptp_init_tx_* functions for most of the MAC types except E82X. This simplifies the code for the future use with new MAC types. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * | ice: Add unified ice_capture_crosststampJacob Keller2025-02-101-75/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Devices supported by ice driver use essentially the same logic for performing a crosstimestamp. The only difference is that E830 hardware has different offsets. Instead of having multiple implementations, combine them into a single ice_capture_crosststamp() function. To support both hardware types, the ice_capture_crosststamp function must be able to determine the appropriate registers to access. To handle this, pass a custom context structure instead of the PF pointer. This structure, ice_crosststamp_ctx, contains a pointer to the PF, and a pointer to the device configuration structure. This new structure also will make it easier to implement historic snapshot support in a future commit. The device configuration structure is a static const data which defines the offsets and flags for the various registers. This includes the lock register, the cross timestamp control register, the upper and lower ART system time capture registers, and the upper and lower device time capture registers for each timer index. Use the configuration structure to access all of the registers in ice_capture_crosststamp(). Ensure that we don't over-run the device time array by checking that the timer index is 0 or 1. Previously this was simply assumed, and it would cause the device to read an incorrect and likely garbage register. It does feel like there should be a kernel interface for managing register offsets like this, but the closest thing I saw was <linux/regmap.h> which is interesting but not quite what we're looking for... Use rd32_poll_timeout() to read lock_reg and ctl_reg. Add snapshot system time for historic interpolation. Remove X86_FEATURE_ART and X86_FEATURE_TSC_KNOWN_FREQ from all E82X devices because those are SoCs, which will always have those features. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * | ice: Process TSYN IRQ in a separate functionKarol Kolacinski2025-02-103-16/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify TSYN IRQ processing by moving it to a separate function and having appropriate behavior per PHY model, instead of multiple conditions not related to HW, but to specific timestamping modes. When PTP is not enabled in the kernel, don't process timestamps and return IRQ_HANDLED. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * | ice: Use FIELD_PREP for timestamp valuesKarol Kolacinski2025-02-102-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of using shifts and casts, use FIELD_PREP after reading 40b timestamp values. Rename a couple defines for better clarity and consistency. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * | ice: Remove unnecessary ice_is_e8xx() functionsKarol Kolacinski2025-02-108-279/+147
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove unnecessary ice_is_e8xx() functions and PHY model. Instead, use MAC type where applicable. Don't check device type in ice_ptp_maybe_trigger_tx_interrupt(), because in reality it depends on the ready bitmap, which only E810 does not have. Call ice_ptp_cfg_phy_interrupt() unconditionally, because all further function calls check the MAC type anyway and this allows simpler code in the future with addition of the new MAC types. Reorder ICE_MAC_* cases in switches in ice_ptp* as in enum ice_mac_type. Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * | ice: Don't check device type when checking GNSS presenceKarol Kolacinski2025-02-107-116/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't check if the device type is E810T as non-E810T devices can support GNSS too and PCA9575 check is enough to determine if GNSS is present or not. Rename ice_gnss_is_gps_present() to ice_gnss_is_module_present() because GNSS module supports multiple GNSS providers, not only GPS. Move functions related to PCA9575 from ice_ptp_hw.c to ice_common.c to be able to access them when PTP is disabled in the kernel, but GNSS is enabled. Remove logical AND with ICE_AQC_LINK_TOPO_NODE_TYPE_M in ice_get_pca9575_handle(), which has no effect, and reorder device type checks to check the device_id first, then set other variables. Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
* | | ice: use generic unrolled_count() macroAlexander Lobakin2025-02-102-9/+3
|/ / | | | | | | | | | | | | | | | | | | | | | | | | ice, same as i40e, has a custom loop unrolling macros for unrolling Tx descriptors filling on XSk xmit. Replace ice defs with generic unrolled_count(), which is also more convenient as it allows passing defines as its argument, not hardcoded values, while the loop declaration will still be usual for-loop. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250206182630.3914318-4-aleksander.lobakin@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge branch 'for-next' of ↵Jakub Kicinski2025-02-0610-420/+269
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ice: managing MSI-X in driver Michal Swiatkowski says: It is another try to allow user to manage amount of MSI-X used for each feature in ice. First was via devlink resources API, it wasn't accepted in upstream. Also static MSI-X allocation using devlink resources isn't really user friendly. This try is using more dynamic way. "Dynamic" across whole kernel when platform supports it and "dynamic" across the driver when not. To achieve that reuse global devlink parameter pf_msix_max and pf_msix_min. It fits how ice hardware counts MSI-X. In case of ice amount of MSI-X reported on PCI is a whole MSI-X for the card (with MSI-X for VFs also). Having pf_msix_max allow user to statically set how many MSI-X he wants on PF and how many should be reserved for VFs. pf_msix_min is used to set minimum number of MSI-X with which ice driver should probe correctly. Meaning of this field in case of dynamic vs static allocation: - on system with dynamic MSI-X allocation support * alloc pf_msix_min as static, rest will be allocated dynamically - on system without dynamic MSI-X allocation support * try alloc pf_msix_max as static, minimum acceptable result is pf_msix_min As Jesse and Piotr suggested pf_msix_max and pf_msix_min can (an probably should) be stored in NVM. This patchset isn't implementing that. Dynamic (kernel or driver) way means that splitting MSI-X across the RDMA and eth in case there is a MSI-X shortage isn't correct. Can work when dynamic is only on driver site, but can't when dynamic is on kernel site. Let's remove this code and move to MSI-X allocation feature by feature. If there is no more MSI-X for a feature, a feature is working with less MSI-X or it is turned off. There is a regression here. With MSI-X splitting user can run RDMA and eth even on system with not enough MSI-X. Now only eth will work. RDMA can be turned on by changing number of PF queues (lowering) and reprobe RDMA driver. Example: 72 CPU number, eth, RDMA and flow director (1 MSI-X), 1 MSI-X for OICR on PF, and 1 more for RDMA. Card is using 1 + 72 + 1 + 72 + 1 = 147. We set pf_msix_min = 2, pf_msix_max = 128 OICR: 1 eth: 72 flow director: 1 RDMA: 128 - 74 = 54 We can change number of queues on pf to 36 and do devlink reinit OICR: 1 eth: 36 RDMA: 73 flow director: 1 We can also (implemented in "ice: enable_rdma devlink param") turned RDMA off. OICR: 1 eth: 72 RDMA: 0 (turned off) flow director: 1 After this changes we have a static base vector for SRIOV (SIOV probably in the feature). Last patch from this series is simplifying managing VF MSI-X code based on static vector. Now changing queues using ethtool is also changing MSI-X. If there is enough MSI-X it is always one to one. When there is not enough there will be more queues than MSI-X. * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: init flow director before RDMA ice: simplify VF MSI-X managing ice: enable_rdma devlink param ice: treat dyn_allowed only as suggestion ice, irdma: move interrupts code to irdma ice: get rid of num_lan_msix field ice: remove splitting MSI-X between features ice: devlink PF MSI-X max and min parameter ice: count combined queues using Rx/Tx count ==================== Link: https://patch.msgid.link/20250205185512.895887-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * ice: init flow director before RDMAMichal Swiatkowski2025-02-051-2/+4
| | | | | | | | | | | | | | | | | | | | Flow director needs only one MSI-X. Load it before RDMA to save MSI-X for it. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: simplify VF MSI-X managingMichal Swiatkowski2025-02-054-173/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After implementing pf->msix.max field, base vector for other use cases (like VFs) can be fixed. This simplify code when changing MSI-X amount on particular VF, because there is no need to move a base vector. A fixed base vector allows to reserve vectors from the beginning instead of from the end, which is also simpler in code. Store total and rest value in the same struct as max and min for PF. Move tracking vectors from ice_sriov.c to ice_irq.c as it can be also use for other none PF use cases (SIOV). Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: enable_rdma devlink paramMichal Swiatkowski2025-02-052-1/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement enable_rdma devlink parameter to allow user to turn RDMA feature on and off. It is useful when there is no enough interrupts and user doesn't need RDMA feature. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jan Sokolowski <jan.sokolowski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: treat dyn_allowed only as suggestionMichal Swiatkowski2025-02-052-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It can be needed to have some MSI-X allocated as static and rest as dynamic. For example on PF VSI. We want to always have minimum one MSI-X on it, because of that it is allocated as a static one, rest can be dynamic if it is supported. Change the ice_get_irq_res() to allow using static entries if they are free even if caller wants dynamic one. Adjust limit values to the new approach. Min and max in limit means the values that are valid, so decrease max and num_static by one. Set vsi::irq_dyn_alloc if dynamic allocation is supported. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice, irdma: move interrupts code to irdmaMichal Swiatkowski2025-02-053-52/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | Move responsibility of MSI-X requesting for RDMA feature from ice driver to irdma driver. It is done to allow simple fallback when there is not enough MSI-X available. Change amount of MSI-X used for control from 4 to 1, as it isn't needed to have more than one MSI-X for this purpose. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: get rid of num_lan_msix fieldMichal Swiatkowski2025-02-055-29/+24
| | | | | | | | | | | | | | | | | | | | | | | | Remove the field to allow having more queues than MSI-X on VSI. As default the number will be the same, but if there won't be more MSI-X available VSI can run with at least one MSI-X. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: remove splitting MSI-X between featuresMichal Swiatkowski2025-02-052-158/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With dynamic approach to alloc MSI-X there is no sense to statically split MSI-X between PF features. Splitting was also calculating needed MSI-X. Move this part to separate function and use as max value. Remove ICE_ESWITCH_MSIX, as there is no need for additional MSI-X for switchdev. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: devlink PF MSI-X max and min parameterMichal Swiatkowski2025-02-053-0/+95
| | | | | | | | | | | | | | | | | | | | | | Use generic devlink PF MSI-X parameter to allow user to change MSI-X range. Add notes about this parameters into ice devlink documentation. Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
| * ice: count combined queues using Rx/Tx countMichal Swiatkowski2025-02-041-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | Previous implementation assumes that there is 1:1 matching between vectors and queues. It isn't always true. Get minimum value from Rx/Tx queues to determine combined queues number. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
* | Merge branch '100GbE' of ↵Jakub Kicinski2025-02-033-91/+103
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== ice: fix Rx data path for heavy 9k MTU traffic Maciej Fijalkowski says: This patchset fixes a pretty nasty issue that was reported by RedHat folks which occurred after ~30 minutes (this value varied, just trying here to state that it was not observed immediately but rather after a considerable longer amount of time) when ice driver was tortured with jumbo frames via mix of iperf traffic executed simultaneously with wrk/nginx on client/server sides (HTTP and TCP workloads basically). The reported splats were spanning across all the bad things that can happen to the state of page - refcount underflow, use-after-free, etc. One of these looked as follows: [ 2084.019891] BUG: Bad page state in process swapper/34 pfn:97fcd0 [ 2084.025990] page:00000000a60ee772 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x97fcd0 [ 2084.035462] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff) [ 2084.041990] raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000 [ 2084.049730] raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000 [ 2084.057468] page dumped because: nonzero _refcount [ 2084.062260] Modules linked in: bonding tls sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 irqd [ 2084.137829] CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Not tainted 5.14.0-427.37.1.el9_4.x86_64 #1 [ 2084.147039] Hardware name: Dell Inc. PowerEdge R750/0216NK, BIOS 1.13.2 12/19/2023 [ 2084.154604] Call Trace: [ 2084.157058] <IRQ> [ 2084.159080] dump_stack_lvl+0x34/0x48 [ 2084.162752] bad_page.cold+0x63/0x94 [ 2084.166333] check_new_pages+0xb3/0xe0 [ 2084.170083] rmqueue_bulk+0x2d2/0x9e0 [ 2084.173749] ? ktime_get+0x35/0xa0 [ 2084.177159] rmqueue_pcplist+0x13b/0x210 [ 2084.181081] rmqueue+0x7d3/0xd40 [ 2084.184316] ? xas_load+0x9/0xa0 [ 2084.187547] ? xas_find+0x183/0x1d0 [ 2084.191041] ? xa_find_after+0xd0/0x130 [ 2084.194879] ? intel_iommu_iotlb_sync_map+0x89/0xe0 [ 2084.199759] get_page_from_freelist+0x11f/0x530 [ 2084.204291] __alloc_pages+0xf2/0x250 [ 2084.207958] ice_alloc_rx_bufs+0xcc/0x1c0 [ice] [ 2084.212543] ice_clean_rx_irq+0x631/0xa20 [ice] [ 2084.217111] ice_napi_poll+0xdf/0x2a0 [ice] [ 2084.221330] __napi_poll+0x27/0x170 [ 2084.224824] net_rx_action+0x233/0x2f0 [ 2084.228575] __do_softirq+0xc7/0x2ac [ 2084.232155] __irq_exit_rcu+0xa1/0xc0 [ 2084.235821] common_interrupt+0x80/0xa0 [ 2084.239662] </IRQ> [ 2084.241768] <TASK> The fix is mostly about reverting what was done in commit 1dc1a7e7f410 ("ice: Centrallize Rx buffer recycling") followed by proper timing on page_count() storage and then removing the ice_rx_buf::act related logic (which was mostly introduced for purposes from cited commit). Special thanks to Xu Du for providing reproducer and Jacob Keller for initial extensive analysis. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ice: stop storing XDP verdict within ice_rx_buf ice: gather page_count()'s of each frag right before XDP prog call ice: put Rx buffers after being done with current frame ==================== Link: https://patch.msgid.link/20250131185415.3741532-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>