summaryrefslogtreecommitdiffstats
path: root/drivers/net/ethernet/mellanox/mlxsw
Commit message (Collapse)AuthorAgeFilesLines
* mlxsw: spectrum_ipip: Fix memory leak when changing remote IPv6 addressIdo Schimmel3 days1-2/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device stores IPv6 addresses that are used for encapsulation in linear memory that is managed by the driver. Changing the remote address of an ip6gre net device never worked properly, but since cited commit the following reproducer [1] would result in a warning [2] and a memory leak [3]. The problem is that the new remote address is never added by the driver to its hash table (and therefore the device) and the old address is never removed from it. Fix by programming the new address when the configuration of the ip6gre net device changes and removing the old one. If the address did not change, then the above would result in increasing the reference count of the address and then decreasing it. [1] # ip link add name bla up type ip6gre local 2001:db8:1::1 remote 2001:db8:2::1 tos inherit ttl inherit # ip link set dev bla type ip6gre remote 2001:db8:3::1 # ip link del dev bla # devlink dev reload pci/0000:01:00.0 [2] WARNING: CPU: 0 PID: 1682 at drivers/net/ethernet/mellanox/mlxsw/spectrum.c:3002 mlxsw_sp_ipv6_addr_put+0x140/0x1d0 Modules linked in: CPU: 0 UID: 0 PID: 1682 Comm: ip Not tainted 6.12.0-rc3-custom-g86b5b55bc835 #151 Hardware name: Nvidia SN5600/VMOD0013, BIOS 5.13 05/31/2023 RIP: 0010:mlxsw_sp_ipv6_addr_put+0x140/0x1d0 [...] Call Trace: <TASK> mlxsw_sp_router_netdevice_event+0x55f/0x1240 notifier_call_chain+0x5a/0xd0 call_netdevice_notifiers_info+0x39/0x90 unregister_netdevice_many_notify+0x63e/0x9d0 rtnl_dellink+0x16b/0x3a0 rtnetlink_rcv_msg+0x142/0x3f0 netlink_rcv_skb+0x50/0x100 netlink_unicast+0x242/0x390 netlink_sendmsg+0x1de/0x420 ____sys_sendmsg+0x2bd/0x320 ___sys_sendmsg+0x9a/0xe0 __sys_sendmsg+0x7a/0xd0 do_syscall_64+0x9e/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f [3] unreferenced object 0xffff898081f597a0 (size 32): comm "ip", pid 1626, jiffies 4294719324 hex dump (first 32 bytes): 20 01 0d b8 00 02 00 00 00 00 00 00 00 00 00 01 ............... 21 49 61 83 80 89 ff ff 00 00 00 00 01 00 00 00 !Ia............. backtrace (crc fd9be911): [<00000000df89c55d>] __kmalloc_cache_noprof+0x1da/0x260 [<00000000ff2a1ddb>] mlxsw_sp_ipv6_addr_kvdl_index_get+0x281/0x340 [<000000009ddd445d>] mlxsw_sp_router_netdevice_event+0x47b/0x1240 [<00000000743e7757>] notifier_call_chain+0x5a/0xd0 [<000000007c7b9e13>] call_netdevice_notifiers_info+0x39/0x90 [<000000002509645d>] register_netdevice+0x5f7/0x7a0 [<00000000c2e7d2a9>] ip6gre_newlink_common.isra.0+0x65/0x130 [<0000000087cd6d8d>] ip6gre_newlink+0x72/0x120 [<000000004df7c7cc>] rtnl_newlink+0x471/0xa20 [<0000000057ed632a>] rtnetlink_rcv_msg+0x142/0x3f0 [<0000000032e0d5b5>] netlink_rcv_skb+0x50/0x100 [<00000000908bca63>] netlink_unicast+0x242/0x390 [<00000000cdbe1c87>] netlink_sendmsg+0x1de/0x420 [<0000000011db153e>] ____sys_sendmsg+0x2bd/0x320 [<000000003b6d53eb>] ___sys_sendmsg+0x9a/0xe0 [<00000000cae27c62>] __sys_sendmsg+0x7a/0xd0 Fixes: cf42911523e0 ("mlxsw: spectrum_ipip: Use common hash table for IPv6 address mapping") Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/e91012edc5a6cb9df37b78fd377f669381facfcb.1729866134.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* mlxsw: pci: Sync Rx buffers for deviceAmit Cohen3 days1-1/+2
| | | | | | | | | | | | | | | | | Non-coherent architectures, like ARM, may require invalidating caches before the device can use the DMA mapped memory, which means that before posting pages to device, drivers should sync the memory for device. Sync for device can be configured as page pool responsibility. Set the relevant flag and define max_len for sync. Cc: Jiri Pirko <jiri@resnulli.us> Fixes: b5b60bb491b2 ("mlxsw: pci: Use page pool for Rx buffers allocation") Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/92e01f05c4f506a4f0a9b39c10175dcc01994910.1729866134.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* mlxsw: pci: Sync Rx buffers for CPUAmit Cohen3 days1-7/+15
| | | | | | | | | | | | | | | | | When Rx packet is received, drivers should sync the pages for CPU, to ensure the CPU reads the data written by the device and not stale data from its cache. Add the missing sync call in Rx path, sync the actual length of data for each fragment. Cc: Jiri Pirko <jiri@resnulli.us> Fixes: b5b60bb491b2 ("mlxsw: pci: Use page pool for Rx buffers allocation") Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/461486fac91755ca4e04c2068c102250026dcd0b.1729866134.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* mlxsw: spectrum_ptp: Add missing verification before pushing Tx headerAmit Cohen3 days1-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | Tx header should be pushed for each packet which is transmitted via Spectrum ASICs. The cited commit moved the call to skb_cow_head() from mlxsw_sp_port_xmit() to functions which handle Tx header. In case that mlxsw_sp->ptp_ops->txhdr_construct() is used to handle Tx header, and txhdr_construct() is mlxsw_sp_ptp_txhdr_construct(), there is no call for skb_cow_head() before pushing Tx header size to SKB. This flow is relevant for Spectrum-1 and Spectrum-4, for PTP packets. Add the missing call to skb_cow_head() to make sure that there is both enough room to push the Tx header and that the SKB header is not cloned and can be modified. An additional set will be sent to net-next to centralize the handling of the Tx header by pushing it to every packet just before transmission. Cc: Richard Cochran <richardcochran@gmail.com> Fixes: 24157bc69f45 ("mlxsw: Send PTP packets as data packets to overcome a limitation") Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/5145780b07ebbb5d3b3570f311254a3a2d554a44.1729866134.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* mlxsw: spectrum_router: fix xa_store() error checkingYuan Can11 days1-6/+3
| | | | | | | | | | | | | | It is meant to use xa_err() to extract the error encoded in the return value of xa_store(). Fixes: 44c2fbebe18a ("mlxsw: spectrum_router: Share nexthop counters in resilient groups") Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20241017023223.74180-1-yuancan@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* Merge tag 'thermal-6.12-rc1' of ↵Linus Torvalds2024-09-161-84/+31
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull thermal control updates from Rafael Wysocki: "These mostly continue to rework the thermal core and the thermal zone driver interface to make the code more straightforward and reduce bloat The most significant piece of this work is a change of the code related to binding cooling devices to thermal zones which, among other things, replaces two previously existing thermal zone operations with one allowing driver implementations to be much simpler There is also a new thermal core testing module allowing mock thermal zones to be created and controlled via debugfs in order to exercise the thermal core functionality. It is expected to be used for implementing thermal core self tests in the future Apart from the above, there are assorted thermal driver updates Specifics: - Update some thermal drivers to eliminate thermal_zone_get_trip() calls from them and get rid of that function (Rafael Wysocki) - Update the thermal sysfs code to store trip point attributes in trip descriptors and get to trip points via attribute pointers (Rafael Wysocki) - Move the computation of the low and high boundaries for thermal_zone_set_trips() to __thermal_zone_device_update() (Daniel Lezcano) - Introduce a debugfs-based facility for thermal core testing (Rafael Wysocki) - Replace the thermal zone .bind() and .unbind() callbacks for binding cooling devices to thermal zones with one .should_bind() callback used for deciding whether or not a given cooling devices should be bound to a given trip point in a given thermal zone (Rafael Wysocki) - Eliminate code that has no more users after the other changes, drop some redundant checks from the thermal core and clean it up (Rafael Wysocki) - Fix rounding of delay jiffies in the thermal core (Rafael Wysocki) - Refuse to accept trip point temperature or hysteresis that would lead to an invalid threshold value when setting them via sysfs (Rafael Wysocki) - Adjust states of all uninitialized instances in the .manage() callback of the Bang-bang thermal governor (Rafael Wysocki) - Drop a couple of redundant checks along with the code depending on them from the thermal core (Rafael Wysocki) - Rearrange the thermal core to avoid redundant checks and simplify control flow in a couple of code paths (Rafael Wysocki) - Add power domain DT bindings for new Amlogic SoCs (Georges Stark) - Switch from CONFIG_PM_SLEEP guards to pm_sleep_ptr() in the ST driver and add a Kconfig dependency on THERMAL_OF subsystem for the STi driver (Raphael Gallais-Pou) - Simplify the error code path in the probe functions in the brcmstb driver with the helo of dev_err_probe() (Yan Zhen) - Make imx_sc_thermal use dev_err_probe() (Alexander Stein) - Remove trailing space after \n newline in the Renesas driver (Colin Ian King) - Add DT binding compatible string for the SA8255p to the tsens thermal driver (Nikunj Kela) - Use the devm_clk_get_enabled() helpers to simplify the init routine in the sprd thermal driver (Huan Yang) - Remove __maybe_unused notations for the functions by using the new RUNTIME_PM_OPS() and SYSTEM_SLEEP_PM_OPS() macros on the IMx and Qoriq drivers (Fabio Estevam) - Remove unused declarations from the ti-soc-thermal driver's header file as the functions in question were removed previously (Zhang Zekun)" * tag 'thermal-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (48 commits) thermal: core: Drop thermal_zone_device_is_enabled() thermal: core: Check passive delay in monitor_thermal_zone() thermal: core: Drop dead code from monitor_thermal_zone() thermal: core: Drop redundant lockdep_assert_held() thermal: gov_bang_bang: Adjust states of all uninitialized instances thermal: sysfs: Add sanity checks for trip temperature and hysteresis thermal/drivers/imx_sc_thermal: Use dev_err_probe thermal/drivers/ti-soc-thermal: Remove unused declarations thermal/drivers/imx: Remove __maybe_unused notations thermal/drivers/qoriq: Remove __maybe_unused notations thermal/drivers/sprd: Use devm_clk_get_enabled() helpers dt-bindings: thermal: tsens: document support on SA8255p thermal/drivers/renesas: Remove trailing space after \n newline thermal/drivers/brcmstb_thermal: Simplify with dev_err_probe() thermal/drivers/sti: Depend on THERMAL_OF subsystem thermal/drivers/st: Switch from CONFIG_PM_SLEEP guards to pm_sleep_ptr() dt-bindings: thermal: amlogic,thermal: add optional power-domains thermal: core: Drop tz field from struct thermal_instance thermal: core: Drop redundant checks from thermal_bind_cdev_to_trip() thermal: core: Rename cdev-to-thermal-zone bind/unbind functions ...
| * mlxsw: core_thermal: Use the .should_bind() thermal zone callbackRafael J. Wysocki2024-08-221-84/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make the mlxsw core_thermal driver use the .should_bind() thermal zone callback to provide the thermal core with the information on whether or not to bind the given cooling device to the given trip point in the given thermal zone. If it returns 'true', the thermal core will bind the cooling device to the trip and the corresponding unbinding will be taken care of automatically by the core on the removal of the involved thermal zone or cooling device. It replaces the .bind() and .unbind() thermal zone callbacks (in 3 places) which assumed the same trip points ordering in the driver and in the thermal core (that may not be true any more in the future). The .bind() callbacks used loops over trip point indices to call thermal_zone_bind_cooling_device() for the same cdev (once it had been verified) and all of the trip points, but they passed different 'upper' and 'lower' values to it for each trip. To retain the original functionality, the .should_bind() callbacks need to use the same 'upper' and 'lower' values that would be used by the corresponding .bind() callbacks when they are about to return 'true'. To that end, the 'priv' field of each trip is set during the thermal zone initialization to point to the corresponding 'state' object containing the maximum and minimum cooling states of the cooling device. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/2216931.Icojqenx9y@rjwysocki.net
* | mlxsw: spectrum: Remove setting of RX software timestampGal Pressman2024-09-062-20/+6
| | | | | | | | | | | | | | | | | | | | | | | | The responsibility for reporting of RX software timestamp has moved to the core layer (see __ethtool_get_ts_info()), remove usage from the device drivers. Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netdev_features: convert NETIF_F_NETNS_LOCAL to dev->netns_localAlexander Lobakin2024-09-031-2/+3
| | | | | | | | | | | | | | | | | | | | | | "Interface can't change network namespaces" is rather an attribute, not a feature, and it can't be changed via Ethtool. Make it a "cold" private flag instead of a netdev_feature and free one more bit. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | netdev_features: convert NETIF_F_LLTX to dev->lltxAlexander Lobakin2024-09-031-1/+2
| | | | | | | | | | | | | | | | | | | | NETIF_F_LLTX can't be changed via Ethtool and is not a feature, rather an attribute, very similar to IFF_NO_QUEUE (and hot). Free one netdev_features_t bit and make it a "hot" private flag. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | mlxsw: core_thermal: Fix -Wformat-truncation warningIdo Schimmel2024-07-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The name of a thermal zone device cannot be longer than 19 characters ('THERMAL_NAME_LENGTH - 1'). The format string 'mlxsw-lc%d-module%d' can exceed this limitation if the maximum number of line cards cannot be represented using a single digit and the maximum number of transceiver modules cannot be represented using two digits. This is not the case with current systems nor future ones. Therefore, increase the size of the result buffer beyond 'THERMAL_NAME_LENGTH' and suppress the following build warning [1]. If this limitation is ever exceeded, we will know about it since the thermal core validates the thermal device's name during registration. [1] drivers/net/ethernet/mellanox/mlxsw/core_thermal.c: In function ‘mlxsw_thermal_modules_init.part.0’: drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:418:70: error: ‘%d’ directive output may be truncated writing between 1 and 3 bytes into a region of size between 2 and 4 [-Werror=format-truncation=] 418 | snprintf(tz_name, sizeof(tz_name), "mlxsw-lc%d-module%d", | ^~ In function ‘mlxsw_thermal_module_tz_init’, inlined from ‘mlxsw_thermal_module_init’ at drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:465:9, inlined from ‘mlxsw_thermal_modules_init.part.0’ at drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:500:9: drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:418:52: note: directive argument in the range [1, 256] 418 | snprintf(tz_name, sizeof(tz_name), "mlxsw-lc%d-module%d", | ^~~~~~~~~~~~~~~~~~~~~ drivers/net/ethernet/mellanox/mlxsw/core_thermal.c:418:17: note: ‘snprintf’ output between 18 and 22 bytes into a destination of size 20 418 | snprintf(tz_name, sizeof(tz_name), "mlxsw-lc%d-module%d", | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 419 | module_tz->slot_index, module_tz->module + 1); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/583a70c6dbe75e6bf0c2c58abbb3470a860d2dc3.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Remove unnecessary assignmentsIdo Schimmel2024-07-311-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | Setting both pointers to NULL is unnecessary since the code never checks whether these pointers are NULL or not. Remove the assignments. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/ea646f5d7642fffd5393fa23650660ab8f77a511.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Remove unnecessary checksIdo Schimmel2024-07-311-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | mlxsw_thermal_module_fini() cannot be invoked with a thermal module which is NULL or which is not associated with a thermal zone, so remove these checks. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/8db5fe0a3a28ba09a15d4102cc03f7e8ca7675be.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Simplify rollbackIdo Schimmel2024-07-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | During rollback, instead of calling mlxsw_thermal_module_fini() for all the modules, only call it for modules that were successfully initialized. This is not a bug fix since mlxsw_thermal_module_fini() first checks that the module was initialized. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/905bebc45f6e246031f0c5c177bba8efe11e05f5.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Make mlxsw_thermal_module_{init, fini} symmetricIdo Schimmel2024-07-311-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | mlxsw_thermal_module_fini() de-initializes the module's thermal zone, but mlxsw_thermal_module_init() does not initialize it. Make both functions symmetric by moving the initialization of the module's thermal zone to mlxsw_thermal_module_init(). Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/a661ad468f8ad0d7d533d8334e4abf61dfe34342.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Remove unused argumentsIdo Schimmel2024-07-311-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | 'dev' and 'core' arguments are not used by mlxsw_thermal_module_init(). Remove them. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/563fc7383f61809a306b9954872219eaaf3c689b.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Fold two loops into oneIdo Schimmel2024-07-311-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | There is no need to traverse the same array twice. Do it once by folding both loops into one. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/81756744ed532aaa9249a83fc08757accfe8b07c.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Remove another unnecessary checkIdo Schimmel2024-07-311-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mlxsw_thermal_modules_init() allocates an array of modules and then initializes each entry by calling mlxsw_thermal_module_init() which among other things initializes the 'parent' pointer of the entry. mlxsw_thermal_modules_init() then traverses over the array again, but skips over entries that do not have their 'parent' pointer set which is impossible given the above. Therefore, remove the unnecessary check. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/fb3e8ded422a441436431d5785b900f11ffc9621.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Remove unnecessary checkIdo Schimmel2024-07-311-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mlxsw_thermal_modules_init() allocates an array of modules and then calls mlxsw_thermal_module_init() to initialize each entry in the array. It is therefore impossible for mlxsw_thermal_module_init() to encounter an entry that is already initialized and has its 'parent' pointer set. Remove the unnecessary check. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Call thermal_zone_device_unregister() unconditionallyIdo Schimmel2024-07-311-8/+2
|/ | | | | | | | | | | | | | The function returns immediately if the thermal zone pointer is NULL so there is no need to check it before calling the function. Remove the check. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/0bd251aa8ce03d3c951983aa6b4300d8205b88a7.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge branch 'net-make-timestamping-selectable'Jakub Kicinski2024-07-154-9/+9
|\ | | | | | | | | | | | | | | First part of "net: Make timestamping selectable" from Kory Maincent. Change the driver-facing type already to lower rebasing pain. Link: https://lore.kernel.org/20240709-feature_ptp_netnext-v17-0-b5317f50df2a@bootlin.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: Add struct kernel_ethtool_ts_infoKory Maincent2024-07-154-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In prevision to add new UAPI for hwtstamp we will be limited to the struct ethtool_ts_info that is currently passed in fixed binary format through the ETHTOOL_GET_TS_INFO ethtool ioctl. It would be good if new kernel code already started operating on an extensible kernel variant of that structure, similar in concept to struct kernel_hwtstamp_config vs struct hwtstamp_config. Since struct ethtool_ts_info is in include/uapi/linux/ethtool.h, here we introduce the kernel-only structure in include/linux/ethtool.h. The manual copy is then made in the function called by ETHTOOL_GET_TS_INFO. Acked-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20240709-feature_ptp_netnext-v17-6-b5317f50df2a@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: pci: Lock configuration space of upstream bridge during resetIdo Schimmel2024-07-091-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The driver triggers a "Secondary Bus Reset" (SBR) by calling __pci_reset_function_locked() which asserts the SBR bit in the "Bridge Control Register" in the configuration space of the upstream bridge for 2ms. This is done without locking the configuration space of the upstream bridge port, allowing user space to access it concurrently. Linux 6.11 will start warning about such unlocked resets [1][2]: pcieport 0000:00:01.0: unlocked secondary bus reset via: pci_reset_bus_function+0x51c/0x6a0 Avoid the warning and the concurrent access by locking the configuration space of the upstream bridge prior to the reset and unlocking it afterwards. [1] https://lore.kernel.org/all/171711746953.1628941.4692125082286867825.stgit@dwillia2-xfh.jf.intel.com/ [2] https://lore.kernel.org/all/20240531213150.GA610983@bhelgaas/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/9937b0afdb50f2f2825945393c94c093c04a5897.1720447210.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: core_thermal: Report valid current state during cooling device ↵Ido Schimmel2024-07-091-26/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | registration Commit 31a0fa0019b0 ("thermal/debugfs: Pass cooling device state to thermal_debug_cdev_add()") changed the thermal core to read the current state of the cooling device as part of the cooling device's registration. This is incompatible with the current implementation of the cooling device operations in mlxsw, leading to initialization failure with errors such as: mlxsw_spectrum 0000:01:00.0: Failed to register cooling device mlxsw_spectrum 0000:01:00.0: cannot register bus device The reason for the failure is that when the get current state operation is invoked the driver tries to derive the index of the cooling device by walking a per thermal zone array and looking for the matching cooling device pointer. However, the pointer is returned from the registration function and therefore only set in the array after the registration. The issue was later fixed by commit 1af89dedc8a5 ("thermal: core: Do not fail cdev registration because of invalid initial state") by not failing the registration of the cooling device if it cannot report a valid current state during registration, although drivers are responsible for ensuring that this will not happen. Therefore, make sure the driver is able to report a valid current state for the cooling device during registration by passing to the registration function a per cooling device private data that already has the cooling device index populated. While at it, call thermal_cooling_device_unregister() unconditionally since the function returns immediately if the cooling device pointer is NULL. Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/c823c4678b6b7afb902c35b3551c81a053afd110.1720447210.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: Warn about invalid accesses to array fieldsPetr Machata2024-07-091-0/+4
|/ | | | | | | | | | | | | | | | A forgotten or buggy variable initialization can cause out-of-bounds access to a register or other item array field. For an overflow, such access would mangle adjacent parts of the register payload. For an underflow, due to all variables being unsigned, the access would likely trample unrelated memory. Since neither is correct, replace these accesses with accesses at the index of 0, and warn about the issue. Suggested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/b988fb265c2f6c1206fe12d5bfdcfa188b7672d1.1720447210.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-07-041-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h 219343755eae ("net: phy: aquantia: add missing include guards") 61578f679378 ("net: phy: aquantia: add support for PHY LEDs") drivers/net/ethernet/wangxun/libwx/wx_hw.c bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx") b501d261a5b3 ("net: txgbe: add FDIR ATR support") https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/ include/linux/mlx5/mlx5_ifc.h 048a403648fc ("net/mlx5: IFC updates for changing max EQs") 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO") https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/ Adjacent changes: drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c 4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference") 3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard") include/net/mac80211.h 816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP") 5a009b42e041 ("wifi: mac80211: track changes in AP's TPE") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * mlxsw: core_linecards: Fix double memory deallocation in case of invalid INI ↵Aleksandr Mishin2024-07-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | file In case of invalid INI file mlxsw_linecard_types_init() deallocates memory but doesn't reset pointer to NULL and returns 0. In case of any error occurred after mlxsw_linecard_types_init() call, mlxsw_linecards_init() calls mlxsw_linecard_types_fini() which performs memory deallocation again. Add pointer reset to NULL. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: b217127e5e4e ("mlxsw: core_linecards: Add line card objects and implement provisioning") Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/20240703203251.8871-1-amishin@t-argos.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: Implement ethtool operation to write to a transceiver module EEPROMIdo Schimmel2024-06-284-0/+93
| | | | | | | | | | | | | | | | | | | | | | Implement the ethtool_ops::set_module_eeprom_by_page operation to allow ethtool to write to a transceiver module EEPROM, in a similar fashion to the ethtool_ops::get_module_eeprom_by_page operation. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-06-273-9/+31
|\| | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: e3f02f32a050 ("ionic: fix kernel panic due to multi-buffer handling") d9c04209990b ("ionic: Mark error paths in the data path as unlikely") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * mlxsw: spectrum_buffers: Fix memory corruptions on Spectrum-4 systemsIdo Schimmel2024-06-211-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following two shared buffer operations make use of the Shared Buffer Status Register (SBSR): # devlink sb occupancy snapshot pci/0000:01:00.0 # devlink sb occupancy clearmax pci/0000:01:00.0 The register has two masks of 256 bits to denote on which ingress / egress ports the register should operate on. Spectrum-4 has more than 256 ports, so the register was extended by cited commit with a new 'port_page' field. However, when filling the register's payload, the driver specifies the ports as absolute numbers and not relative to the first port of the port page, resulting in memory corruptions [1]. Fix by specifying the ports relative to the first port of the port page. [1] BUG: KASAN: slab-use-after-free in mlxsw_sp_sb_occ_snapshot+0xb6d/0xbc0 Read of size 1 at addr ffff8881068cb00f by task devlink/1566 [...] Call Trace: <TASK> dump_stack_lvl+0xc6/0x120 print_report+0xce/0x670 kasan_report+0xd7/0x110 mlxsw_sp_sb_occ_snapshot+0xb6d/0xbc0 mlxsw_devlink_sb_occ_snapshot+0x75/0xb0 devlink_nl_sb_occ_snapshot_doit+0x1f9/0x2a0 genl_family_rcv_msg_doit+0x20c/0x300 genl_rcv_msg+0x567/0x800 netlink_rcv_skb+0x170/0x450 genl_rcv+0x2d/0x40 netlink_unicast+0x547/0x830 netlink_sendmsg+0x8d4/0xdb0 __sys_sendto+0x49b/0x510 __x64_sys_sendto+0xe5/0x1c0 do_syscall_64+0xc1/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f [...] Allocated by task 1: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x8f/0xa0 copy_verifier_state+0xbc2/0xfb0 do_check_common+0x2c51/0xc7e0 bpf_check+0x5107/0x9960 bpf_prog_load+0xf0e/0x2690 __sys_bpf+0x1a61/0x49d0 __x64_sys_bpf+0x7d/0xc0 do_syscall_64+0xc1/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 1: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 poison_slab_object+0x109/0x170 __kasan_slab_free+0x14/0x30 kfree+0xca/0x2b0 free_verifier_state+0xce/0x270 do_check_common+0x4828/0xc7e0 bpf_check+0x5107/0x9960 bpf_prog_load+0xf0e/0x2690 __sys_bpf+0x1a61/0x49d0 __x64_sys_bpf+0x7d/0xc0 do_syscall_64+0xc1/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: f8538aec88b4 ("mlxsw: Add support for more than 256 ports in SBSR register") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mlxsw: pci: Fix driver initialization with Spectrum-4Ido Schimmel2024-06-212-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cited commit added support for a new reset flow ("all reset") which is deeper than the existing reset flow ("software reset") and allows the device's PCI firmware to be upgraded. In the new flow the driver first tells the firmware that "all reset" is required by issuing a new reset command (i.e., MRSR.command=6) and then triggers the reset by having the PCI core issue a secondary bus reset (SBR). However, due to a race condition in the device's firmware the device is not always able to recover from this reset, resulting in initialization failures [1]. New firmware versions include a fix for the bug and advertise it using a new capability bit in the Management Capabilities Mask (MCAM) register. Avoid initialization failures by reading the new capability bit and triggering the new reset flow only if the bit is set. If the bit is not set, trigger a normal PCI hot reset by skipping the call to the Management Reset and Shutdown Register (MRSR). Normal PCI hot reset is weaker than "all reset", but it results in a fully operational driver and allows users to flash a new firmware, if they want to. [1] mlxsw_spectrum4 0000:01:00.0: not ready 1023ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 2047ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 4095ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 8191ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 16383ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 32767ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 65535ms after bus reset; giving up mlxsw_spectrum4 0000:01:00.0: PCI function reset failed with -25 mlxsw_spectrum4 0000:01:00.0: cannot register bus device mlxsw_spectrum4: probe of 0000:01:00.0 failed with error -25 Fixes: f257c73e5356 ("mlxsw: pci: Add support for new reset flow") Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Maksym Yaremchuk <maksymy@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mlxsw: pci: Use fragmented buffersAmit Cohen2024-06-261-34/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | WQE (Work Queue Element) includes 3 scatter/gather entries for buffers. The buffer can be split into 3 parts, software should set address and byte count of each part. A previous patch-set used page pool to allocate buffers, to simplify the change, we first used one continuous buffer, which was allocated with order > 0. This patch improves page pool usage to allocate the exact number of pages which are required for packet. As part of init, fill WQE.address[x] and WQE.byte_count* with pages which are allocated from the pool. Fill x entries according to number of scatter/gather entries which are required for maximum packet size. When a packet is received, check the actual size and replace only the used pages. Save bytes for software overhead only as part of the first entry. This change also requires using fragmented SKB, till now all the buffer was in the linear part. Note that 'skb->truesize' is decreased for small packets. For now the maximum buffer size is 3 * PAGE_SIZE which is enough, in case that the driver will support larger MTU, we can use 'order' to allocate more than one page per scatter/gather entry. This change significantly reduces memory consumption of mlxsw driver. The footprint is reduced by 26%. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/ee38898c692e7f644a7f3ea4d33aeddb4dd917d2.1719321422.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: pci: Store number of scatter/gather entries for maximum packet sizeAmit Cohen2024-06-261-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A previous patch-set used page pool for Rx buffers allocations. To simplify the change, we first used page pool for one allocation per packet - one continuous buffer is allocated for each packet. This can be improved by using fragmented buffers, then memory consumption will be significantly reduced. WQE (Work Queue Element) includes up to 3 scatter/gather entries for data. As preparation for fragmented buffer usage, calculate number of scatter/gather entries which are required for packet according to maximum MTU and store it for future use. For now use PAGE_SIZE for each entry, which means that maximum buffer size is 3 * PAGE_SIZE. This is enough for the maximum MTU which is supported in the driver now (10K). Warn in an unlikely case of maximum MTU which requires more than 3 pages, for now this warn should not happen with standard page size (>=4K) and maximum MTU (10K). Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/98c3e3adb7e727e571ac538faf67cef262cec4fc.1719321422.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: Drop explicit initialization of struct i2c_device_id::driver_data to 0Uwe Kleine-König2024-06-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These drivers don't use the driver_data member of struct i2c_device_id, so don't explicitly initialize this member. This prepares putting driver_data in an anonymous union which requires either no initialization or named designators. But it's also a nice cleanup on its own. While add it, also remove commas after the sentinel entries. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Reviewed-by: Petr Machata <petrm@nvidia.com> # For mlxsw Reviewed-by: Kory Maincent <Kory.maincent@bootlin.com> Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au> # for mctp-i2c Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20240625083853.2205977-2-u.kleine-koenig@baylibre.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: pci: Use napi_consume_skb() to free SKB as part of Tx completionAmit Cohen2024-06-191-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, as part of Tx completion, the driver calls dev_kfree_skb_any() to free the SKB. For this flow, the correct function is napi_consume_skb(). This function and dev_consume_skb_any() were added to be used for consumed SKBs, which were not dropped, so the skb:kfree_skb tracepoint is not triggered, and we can get better diagnostics about dropped packets. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/a9f9f3dc884c0d1be4bd4c9d72030c88c7ac004f.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: pci: Do not store SKB for RDQ elementsAmit Cohen2024-06-191-12/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous patch used page pool to allocate buffers for RDQ. With this change, 'elem_info->u.rdq.skb' is not used anymore, as we do not allocate SKB before getting the packet, we hold page pointer and build the SKB around it once packet is received. Remove the union and store SKB pointer for SDQ only. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/23a531008936dc9a1a298643fb1e4f9a7b8e6eb3.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: pci: Optimize data buffer accessAmit Cohen2024-06-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Before accessing data buffer, call net_prefetch() to load it into the cache. This change improves driver performance, CPU can handle about 7.1% more packets per second. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/1fa07c510890866a6f201163ab7e78890ba28b3b.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: pci: Use page pool for Rx buffers allocationAmit Cohen2024-06-191-39/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As part of driver init, all Rx queues are filled with buffers for hardware usage. Later, when a packet is received, a new buffer should be allocated to be used by hardware instead of the received buffer. Packet's processing time includes allocation time, which can be improved using page pool. Using page pool, DMA mapping is done only for first allocation of buffers. As subsequent buffers allocation avoid DMA mapping, it results in performance improvement. The purpose of page pool is to allocate pages fast from cache without locking. This lockless guarantee naturally comes from running under a NAPI. Use page pool to allocate the data buffer only, so hardware will use it to fill the packet. At completion time, attach the data buffer (now filled with packet payload) to new SKB which is allocated around the received buffer. SKB building at completion time prevents cache miss for each packet, as now the SKB is allocated right before packets will be handled by networking stack. Page pool for each Rx queue enhances Rx side performance by reclaiming buffers back to each queue specific pool. This change significantly improves driver performance, CPU can handle about 345% of the packets per second it previously handled. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/1cf788a8f43c70aae6d526018ef77becb27ad6d3.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: pci: Initialize page pool per CQAmit Cohen2024-06-192-1/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Next patch will use page pool to allocate buffers for RDQ. Initialize page pool for each CQ, which is mapped 1:1 to RDQ. Page pool for each Rx queue enhances Rx side performance by reclaiming buffers back to each queue specific pool. When only one NAPI instance is the consumer of pages from page pool, it is recommended to pass it as part of 'page_pool_params', then page pool APIs will be done without special locks. mlxsw driver holds NAPI instance per CQ, so add page pool per CQ and use the existing NAPI instance. For now, pages are not allocated from the pool, next patch will use it. Some notes regarding 'page_pool_params': * Use PP_FLAG_DMA_MAP to allow page pool handles DMA mapping, for now do not use sync flag, as only the device writes to this memory and we read it only when it finishes writing there. This will probably be changed when we will support XDP. * Define 'order' according to maximum MTU and take into account software overhead. Some round up are done, which means that we allocate more pages than we really need. This can be improved later by using fragmented buffers. * Use pool_size = MLXSW_PCI_WQE_COUNT. This will be the size of 'ptr_ring', and should be the maximum amount of packets that page pool will allocate memory for. In our case, this is the queue size, defined as MLXSW_PCI_WQE_COUNT. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/02e5856ae7c572d4293ce6bb92c286ee6cfec800.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: pci: Store CQ pointer as part of RDQ structureAmit Cohen2024-06-191-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Next patches will add support for page pool in mlxsw driver. Page pool will be used to allocate buffers for RDQ and will use NAPI instance of the appropriate CQ (RDQ is mapped 1:1 to CQ). To allow pool initialization as part of CQ init, when NAPI is initialized, page_pool structure will be as part of CQ structure. Later, the allocations for RDQ will be done from the pool in the appropriate CQ. To allow access to the appropriate pool, set CQ pointer as part of RDQ initialization. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/d60918ca1e142a554af1df9c1152cdac83854a3b.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: pci: Split NAPI setup/teardown into two stepsAmit Cohen2024-06-191-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mlxsw_pci_cq_napi_setup() includes both NAPI initialization and enablement, similar to teardown function. Next patches will add support for page pool in mlxsw driver, then we use NAPI instance for page pool. Page pool initialization should be done before NAPI enablement, same for page pool destruction which should be done after NAPI disablement. As preparation, split NAPI setup/teardown into two steps, then page pool setup will be done between the phases. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/8dbf37e859f07247498fca17109b8858ff2b0498.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: Use the same maximum MTU value throughout the driverAmit Cohen2024-06-143-26/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the driver uses two different values for maximum MTU, one is stored in mlxsw_port->dev->max_mtu and the second is stored in mlxsw_port->max_mtu. The second one is set to value which is queried from firmware. This value was never tested, and unfortunately is not really supported. That means that with the existing code, user can set MTU to X, which is not really supported by firmware and which is bigger than buffer size which is allocated in pci. To make the driver consistent, use only mlxsw_port->dev->max_mtu for maximum MTU value, for buffers headroom add Ethernet frame headers, which are not included in mlxsw_port->dev->max_mtu. Remove mlxsw_port->max_mtu. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/89fa6f804386b918d337e736e14ac291bb947483.1718275854.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: spectrum: Set more accurate values for netdevice min/max MTUAmit Cohen2024-06-141-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the driver uses ETH_MAX_MTU as maximum MTU of netdevices, instead, use the accurate value which is supported by the driver. Subtract Ethernet headers which are taken into account by hardware for MTU checking, as described in the previous patch. Set minimum MTU to ETH_MIN_MTU, as zero MTU is not really supported. With this change: a. The stack will do the MTU checking, so we can remove it from the driver. b. User space will be able to query the actual MTU limits. Before this patch: $ ip -j -d link show dev swp1 | jq | grep mtu "mtu": 1500, "min_mtu": 0, "max_mtu": 65535, With this patch: $ ip -j -d link show dev swp1 | jq | grep mtu "mtu": 1500, "min_mtu": 68, "max_mtu": 10218, Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/be8232e38c196ecb607f82c5e000ea427ce22abb.1718275854.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: Adjust MTU value to hardware checkAmit Cohen2024-06-142-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ethernet frame consists of - Ethernet header, payload, FCS. The MTU value which is used by user is the size of the payload, which means that when user sets MTU to X, the total frame size will be larger due to the addition of the Ethernet header and FCS. Spectrum ASICs take into account Ethernet header and FCS as part of packet size for MTU check. Adjust MTU value when user sets MTU, to configure the MTU size which is required by hardware. The Tx header length which was used by the driver is not relevant for such calculation, take into account Ethernet header (with VLAN extension) and FCS. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/f3203c2477bb8ed18b1e79642fa3e3713e1e55bb.1718275854.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: port: Edit maximum MTU valueAmit Cohen2024-06-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Currently mlxsw driver supports up to 10000 bytes for maximum MTU, this value is not accurate, we can support up to 10K bytes. Change the value to the maximum supported MTU by firmware. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/666f51681234aeef09d771833ccb6e94bd323c88.1718275854.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: spectrum_router: Apply user-defined multipath hash seedPetr Machata2024-06-121-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When Spectrum machines compute hash for the purposes of ECMP routing, they use a seed specified through RECR_v2 (Router ECMP Configuration Register). Up until now mlxsw computed the seed by hashing the machine's base MAC. Now that we can optionally have a user-provided seed, use that if possible. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240607151357.421181-4-petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | mlxsw: spectrum_acl: Fix ACL scale regression and firmware errorsIdo Schimmel2024-06-103-17/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ACLs that reside in the algorithmic TCAM (A-TCAM) in Spectrum-2 and newer ASICs can share the same mask if their masks only differ in up to 8 consecutive bits. For example, consider the following filters: # tc filter add dev swp1 ingress pref 1 proto ip flower dst_ip 192.0.2.0/24 action drop # tc filter add dev swp1 ingress pref 1 proto ip flower dst_ip 198.51.100.128/25 action drop The second filter can use the same mask as the first (dst_ip/24) with a delta of 1 bit. However, the above only works because the two filters have different values in the common unmasked part (dst_ip/24). When entries have the same value in the common unmasked part they create undesired collisions in the device since many entries now have the same key. This leads to firmware errors such as [1] and to a reduced scale. Fix by adjusting the hash table key to only include the value in the common unmasked part. That is, without including the delta bits. That way the driver will detect the collision during filter insertion and spill the filter into the circuit TCAM (C-TCAM). Add a test case that fails without the fix and adjust existing cases that check C-TCAM spillage according to the above limitation. [1] mlxsw_spectrum2 0000:06:00.0: EMAD reg access failed (tid=3379b18a00003394,reg_id=3027(ptce3),type=write,status=8(resource not available)) Fixes: c22291f7cf45 ("mlxsw: spectrum: acl: Implement delta for ERP") Reported-by: Alexander Zubkov <green@qrator.net> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mlxsw: spectrum_acl_erp: Fix object nesting warningIdo Schimmel2024-06-101-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ACLs in Spectrum-2 and newer ASICs can reside in the algorithmic TCAM (A-TCAM) or in the ordinary circuit TCAM (C-TCAM). The former can contain more ACLs (i.e., tc filters), but the number of masks in each region (i.e., tc chain) is limited. In order to mitigate the effects of the above limitation, the device allows filters to share a single mask if their masks only differ in up to 8 consecutive bits. For example, dst_ip/25 can be represented using dst_ip/24 with a delta of 1 bit. The C-TCAM does not have a limit on the number of masks being used (and therefore does not support mask aggregation), but can contain a limited number of filters. The driver uses the "objagg" library to perform the mask aggregation by passing it objects that consist of the filter's mask and whether the filter is to be inserted into the A-TCAM or the C-TCAM since filters in different TCAMs cannot share a mask. The set of created objects is dependent on the insertion order of the filters and is not necessarily optimal. Therefore, the driver will periodically ask the library to compute a more optimal set ("hints") by looking at all the existing objects. When the library asks the driver whether two objects can be aggregated the driver only compares the provided masks and ignores the A-TCAM / C-TCAM indication. This is the right thing to do since the goal is to move as many filters as possible to the A-TCAM. The driver also forbids two identical masks from being aggregated since this can only happen if one was intentionally put in the C-TCAM to avoid a conflict in the A-TCAM. The above can result in the following set of hints: H1: {mask X, A-TCAM} -> H2: {mask Y, A-TCAM} // X is Y + delta H3: {mask Y, C-TCAM} -> H4: {mask Z, A-TCAM} // Y is Z + delta After getting the hints from the library the driver will start migrating filters from one region to another while consulting the computed hints and instructing the device to perform a lookup in both regions during the transition. Assuming a filter with mask X is being migrated into the A-TCAM in the new region, the hints lookup will return H1. Since H2 is the parent of H1, the library will try to find the object associated with it and create it if necessary in which case another hints lookup (recursive) will be performed. This hints lookup for {mask Y, A-TCAM} will either return H2 or H3 since the driver passes the library an object comparison function that ignores the A-TCAM / C-TCAM indication. This can eventually lead to nested objects which are not supported by the library [1]. Fix by removing the object comparison function from both the driver and the library as the driver was the only user. That way the lookup will only return exact matches. I do not have a reliable reproducer that can reproduce the issue in a timely manner, but before the fix the issue would reproduce in several minutes and with the fix it does not reproduce in over an hour. Note that the current usefulness of the hints is limited because they include the C-TCAM indication and represent aggregation that cannot actually happen. This will be addressed in net-next. [1] WARNING: CPU: 0 PID: 153 at lib/objagg.c:170 objagg_obj_parent_assign+0xb5/0xd0 Modules linked in: CPU: 0 PID: 153 Comm: kworker/0:18 Not tainted 6.9.0-rc6-custom-g70fbc2c1c38b #42 Hardware name: Mellanox Technologies Ltd. MSN3700C/VMOD0008, BIOS 5.11 10/10/2018 Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work RIP: 0010:objagg_obj_parent_assign+0xb5/0xd0 [...] Call Trace: <TASK> __objagg_obj_get+0x2bb/0x580 objagg_obj_get+0xe/0x80 mlxsw_sp_acl_erp_mask_get+0xb5/0xf0 mlxsw_sp_acl_atcam_entry_add+0xe8/0x3c0 mlxsw_sp_acl_tcam_entry_create+0x5e/0xa0 mlxsw_sp_acl_tcam_vchunk_migrate_one+0x16b/0x270 mlxsw_sp_acl_tcam_vregion_rehash_work+0xbe/0x510 process_one_work+0x151/0x370 Fixes: 9069a3817d82 ("lib: objagg: implement optimization hints assembly and use hints for object creation") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mlxsw: spectrum_acl_atcam: Fix wrong commentIdo Schimmel2024-06-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The key is encoded, not encrypted. Fixes: c22291f7cf45 ("mlxsw: spectrum: acl: Implement delta for ERP") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mlxsw: spectrum_router: Constify struct devlink_dpipe_table_opsChristophe JAILLET2024-06-051-4/+4
|/ | | | | | | | | | | | | | | | | | | | | | | 'struct devlink_dpipe_table_ops' are not modified in this driver. Constifying these structures moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 15557 712 0 16269 3f8d drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.o After: ===== text data bss dec hex filename 15789 488 0 16277 3f95 drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>