summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* xen-netback: use default TX queue size for vifsRoger Pau Monne2023-10-081-4/+0
| | | | | | | | | | | | | | | | | | Do not set netback interfaces (vifs) default TX queue size to the ring size. The TX queue size is not related to the ring size, and using the ring size (32) as the queue size can lead to packet drops. Note the TX side of the vif interface in the netback domain is the one receiving packets to be injected to the guest. Do not explicitly set the TX queue length to any value when creating the interface, and instead use the system default. Note that the queue length can also be adjusted at runtime. Fixes: f942dc2552b8 ('xen network backend driver') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* mlxsw: fix mlxsw_sp2_nve_vxlan_learning_set() return typeDan Carpenter2023-10-081-2/+2
| | | | | | | | | | | The mlxsw_sp2_nve_vxlan_learning_set() function is supposed to return zero on success or negative error codes. So it needs to be type int instead of bool. Fixes: 4ee70efab68d ("mlxsw: spectrum_nve: Add support for VXLAN on Spectrum-2") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'ravb-fix-use-after-free-issues'Jakub Kicinski2023-10-061-2/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yoshihiro Shimoda says: ==================== ravb: Fix use-after-free issues This patch series fixes use-after-free issues in ravb_remove(). The original patch is made by Zheng Wang [1]. And, I made the patch 1/2 which I found other issue in the ravb_remove(). [1] https://lore.kernel.org/netdev/20230725030026.1664873-1-zyytlz.wz@163.com/ v1: https://lore.kernel.org/all/20231004091253.4194205-1-yoshihiro.shimoda.uh@renesas.com/ ==================== Link: https://lore.kernel.org/r/20231005011201.14368-1-yoshihiro.shimoda.uh@renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * ravb: Fix use-after-free issue in ravb_tx_timeout_work()Yoshihiro Shimoda2023-10-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ravb_stop() should call cancel_work_sync(). Otherwise, ravb_tx_timeout_work() is possible to use the freed priv after ravb_remove() was called like below: CPU0 CPU1 ravb_tx_timeout() ravb_remove() unregister_netdev() free_netdev(ndev) // free priv ravb_tx_timeout_work() // use priv unregister_netdev() will call .ndo_stop() so that ravb_stop() is called. And, after phy_stop() is called, netif_carrier_off() is also called. So that .ndo_tx_timeout() will not be called after phy_stop(). Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") Reported-by: Zheng Wang <zyytlz.wz@163.com> Closes: https://lore.kernel.org/netdev/20230725030026.1664873-1-zyytlz.wz@163.com/ Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://lore.kernel.org/r/20231005011201.14368-3-yoshihiro.shimoda.uh@renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * ravb: Fix up dma_free_coherent() call in ravb_remove()Yoshihiro Shimoda2023-10-061-2/+2
|/ | | | | | | | | | | | In ravb_remove(), dma_free_coherent() should be call after unregister_netdev(). Otherwise, this controller is possible to use the freed buffer. Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://lore.kernel.org/r/20231005011201.14368-2-yoshihiro.shimoda.uh@renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* devlink: Hold devlink lock on health reporter dump getMoshe Shemesh2023-10-061-14/+16
| | | | | | | | | | | | | | | | | | | | | Devlink health dump get callback should take devlink lock as any other devlink callback. Otherwise, since devlink_mutex was removed, this callback is not protected from a race of the reporter being destroyed while handling the callback. Add devlink lock to the callback and to any call for devlink_health_do_dump(). This should be safe as non of the drivers dump callback implementation takes devlink lock. As devlink lock is added to any callback of dump, the reporter dump_lock is now redundant and can be removed. Fixes: d3efc2a6a6d8 ("net: devlink: remove devlink_mutex") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/r/1696510216-189379-1-git-send-email-moshe@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: sched: cls_u32: Fix allocation size in u32_init()Gustavo A. R. Silva2023-10-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d61491a51f7e ("net/sched: cls_u32: Replace one-element array with flexible-array member") incorrecly replaced an instance of `sizeof(*tp_c)` with `struct_size(tp_c, hlist->ht, 1)`. This results in a an over-allocation of 8 bytes. This change is wrong because `hlist` in `struct tc_u_common` is a pointer: net/sched/cls_u32.c: struct tc_u_common { struct tc_u_hnode __rcu *hlist; void *ptr; int refcnt; struct idr handle_idr; struct hlist_node hnode; long knodes; }; So, the use of `struct_size()` makes no sense: we don't need to allocate any extra space for a flexible-array member. `sizeof(*tp_c)` is just fine. So, `struct_size(tp_c, hlist->ht, 1)` translates to: sizeof(*tp_c) + sizeof(tp_c->hlist->ht) == sizeof(struct tc_u_common) + sizeof(struct tc_u_knode *) == 144 + 8 == 0x98 (byes) ^^^ | unnecessary extra allocation size $ pahole -C tc_u_common net/sched/cls_u32.o struct tc_u_common { struct tc_u_hnode * hlist; /* 0 8 */ void * ptr; /* 8 8 */ int refcnt; /* 16 4 */ /* XXX 4 bytes hole, try to pack */ struct idr handle_idr; /* 24 96 */ /* --- cacheline 1 boundary (64 bytes) was 56 bytes ago --- */ struct hlist_node hnode; /* 120 16 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ long int knodes; /* 136 8 */ /* size: 144, cachelines: 3, members: 6 */ /* sum members: 140, holes: 1, sum holes: 4 */ /* last cacheline: 16 bytes */ }; And with `sizeof(*tp_c)`, we have: sizeof(*tp_c) == sizeof(struct tc_u_common) == 144 == 0x90 (bytes) which is the correct and original allocation size. Fix this issue by replacing `struct_size(tp_c, hlist->ht, 1)` with `sizeof(*tp_c)`, and avoid allocating 8 too many bytes. The following difference in binary output is expected and reflects the desired change: | net/sched/cls_u32.o | @@ -6148,7 +6148,7 @@ | include/linux/slab.h:599 | 2cf5: mov 0x0(%rip),%rdi # 2cfc <u32_init+0xfc> | 2cf8: R_X86_64_PC32 kmalloc_caches+0xc |- 2cfc: mov $0x98,%edx |+ 2cfc: mov $0x90,%edx Reported-by: Alejandro Colomar <alx@kernel.org> Closes: https://lore.kernel.org/lkml/09b4a2ce-da74-3a19-6961-67883f634d98@kernel.org/ Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'qca8k-fixes'David S. Miller2023-10-061-2/+13
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Marek Behún says: ==================== net: dsa: qca8k: fix qca8k driver for Turris 1.x this is v2 of https://lore.kernel.org/netdev/20231002104612.21898-1-kabel@kernel.org/ Changes since v1: - fixed a typo in commit message noticed by Simon Horman ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: dsa: qca8k: fix potential MDIO bus conflict when accessing internal ↵Marek Behún2023-10-061-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PHYs via management frames Besides the QCA8337 switch the Turris 1.x device has on it's MDIO bus also Micron ethernet PHY (dedicated to the WAN port). We've been experiencing a strange behavior of the WAN ethernet interface, wherein the WAN PHY started timing out the MDIO accesses, for example when the interface was brought down and then back up. Bisecting led to commit 2cd548566384 ("net: dsa: qca8k: add support for phy read/write with mgmt Ethernet"), which added support to access the QCA8337 switch's internal PHYs via management ethernet frames. Connecting the MDIO bus pins onto an oscilloscope, I was able to see that the MDIO bus was active whenever a request to read/write an internal PHY register was done via an management ethernet frame. My theory is that when the switch core always communicates with the internal PHYs via the MDIO bus, even when externally we request the access via ethernet. This MDIO bus is the same one via which the switch and internal PHYs are accessible to the board, and the board may have other devices connected on this bus. An ASCII illustration may give more insight: +---------+ +----| | | | WAN PHY | | +--| | | | +---------+ | | | | +----------------------------------+ | | | QCA8337 | MDC | | | +-------+ | ------o-+--|--------o------------o--| | | MDIO | | | | | PHY 1 |-|--to RJ45 --------o--|---o----+---------o--+--| | | | | | | | +-------+ | | +-------------+ | o--| | | | | MDIO MDC | | | | PHY 2 |-|--to RJ45 eth1 | | | o--+--| | | -----------|-|port0 | | | +-------+ | | | | | o--| | | | | switch core | | | | PHY 3 |-|--to RJ45 | +-------------+ o--+--| | | | | | +-------+ | | | o--| ... | | +----------------------------------+ When we send a request to read an internal PHY register via an ethernet management frame via eth1, the switch core receives the ethernet frame on port 0 and then communicates with the internal PHY via MDIO. At this time, other potential devices, such as the WAN PHY on Turris 1.x, cannot use the MDIO bus, since it may cause a bus conflict. Fix this issue by locking the MDIO bus even when we are accessing the PHY registers via ethernet management frames. Fixes: 2cd548566384 ("net: dsa: qca8k: add support for phy read/write with mgmt Ethernet") Signed-off-by: Marek Behún <kabel@kernel.org> Reviewed-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: dsa: qca8k: fix regmap bulk read/write methods on big endian systemsMarek Behún2023-10-061-2/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c766e077d927 ("net: dsa: qca8k: convert to regmap read/write API") introduced bulk read/write methods to qca8k's regmap. The regmap bulk read/write methods get the register address in a buffer passed as a void pointer parameter (the same buffer contains also the read/written values). The register address occupies only as many bytes as it requires at the beginning of this buffer. For example if the .reg_bits member in regmap_config is 16 (as is the case for this driver), the register address occupies only the first 2 bytes in this buffer, so it can be cast to u16. But the original commit implementing these bulk read/write methods cast the buffer to u32: u32 reg = *(u32 *)reg_buf & U16_MAX; taking the first 4 bytes. This works on little endian systems where the first 2 bytes of the buffer correspond to the low 16-bits, but it obviously cannot work on big endian systems. Fix this by casting the beginning of the buffer to u16 as u32 reg = *(u16 *)reg_buf; Fixes: c766e077d927 ("net: dsa: qca8k: convert to regmap read/write API") Signed-off-by: Marek Behún <kabel@kernel.org> Tested-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'lynx-28g-fixes'David S. Miller2023-10-061-3/+24
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Vladimir Oltean says: ==================== Fixes for lynx-28g PHY driver This series fixes some issues in the Lynx 28G SerDes driver, namely an oops when unloading the module, a race between the periodic workqueue and the PHY API, and a race between phy_set_mode_ext() calls on multiple lanes on the same SerDes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * phy: lynx-28g: serialize concurrent phy_set_mode_ext() calls to shared registersVladimir Oltean2023-10-061-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The protocol converter configuration registers PCC8, PCCC, PCCD (implemented by the driver), as well as others, control protocol converters from multiple lanes (each represented as a different struct phy). So, if there are simultaneous calls to phy_set_mode_ext() to lanes sharing the same PCC register (either for the "old" or for the "new" protocol), corruption of the values programmed to hardware is possible, because lynx_28g_rmw() has no locking. Add a spinlock in the struct lynx_28g_priv shared by all lanes, and take the global spinlock from the phy_ops :: set_mode() implementation. There are no other callers which modify PCC registers. Fixes: 8f73b37cf3fb ("phy: add support for the Layerscape SerDes 28G") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * phy: lynx-28g: lock PHY while performing CDR lock workaroundVladimir Oltean2023-10-061-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | lynx_28g_cdr_lock_check() runs once per second in a workqueue to reset the lane receiver if the CDR has not locked onto bit transitions in the RX stream. But the PHY consumer may do stuff with the PHY simultaneously, and that isn't okay. Block concurrent generic PHY calls by holding the PHY mutex from this workqueue. Fixes: 8f73b37cf3fb ("phy: add support for the Layerscape SerDes 28G") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * phy: lynx-28g: cancel the CDR check work item on the remove pathIoana Ciornei2023-10-061-0/+9
|/ | | | | | | | | | | The blamed commit added the CDR check work item but didn't cancel it on the remove path. Fix this by adding a remove function which takes care of it. Fixes: 8f73b37cf3fb ("phy: add support for the Layerscape SerDes 28G") Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'net-6.6-rc5' of ↵Linus Torvalds2023-10-05112-597/+1355
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Bluetooth, netfilter, BPF and WiFi. I didn't collect precise data but feels like we've got a lot of 6.5 fixes here. WiFi fixes are most user-awaited. Current release - regressions: - Bluetooth: fix hci_link_tx_to RCU lock usage Current release - new code bugs: - bpf: mprog: fix maximum program check on mprog attachment - eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns() Previous releases - regressions: - ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling - vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it doesn't handle zero length like we expected - wifi: - cfg80211: fix cqm_config access race, fix crashes with brcmfmac - iwlwifi: mvm: handle PS changes in vif_cfg_changed - mac80211: fix mesh id corruption on 32 bit systems - mt76: mt76x02: fix MT76x0 external LNA gain handling - Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER - l2tp: fix handling of transhdrlen in __ip{,6}_append_data() - dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent - eth: stmmac: fix the incorrect parameter after refactoring Previous releases - always broken: - net: replace calls to sock->ops->connect() with kernel_connect(), prevent address rewrite in kernel_bind(); otherwise BPF hooks may modify arguments, unexpectedly to the caller - tcp: fix delayed ACKs when reads and writes align with MSS - bpf: - verifier: unconditionally reset backtrack_state masks on global func exit - s390: let arch_prepare_bpf_trampoline return program size, fix struct_ops offsets - sockmap: fix accounting of available bytes in presence of PEEKs - sockmap: reject sk_msg egress redirects to non-TCP sockets - ipv4/fib: send netlink notify when delete source address routes - ethtool: plca: fix width of reads when parsing netlink commands - netfilter: nft_payload: rebuild vlan header on h_proto access - Bluetooth: hci_codec: fix leaking memory of local_codecs - eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids - eth: stmmac: - dwmac-stm32: fix resume on STM32 MCU - remove buggy and unneeded stmmac_poll_controller, depend on NAPI - ibmveth: always recompute TCP pseudo-header checksum, fix use of the driver with Open vSwitch - wifi: - rtw88: rtw8723d: fix MAC address offset in EEPROM - mt76: fix lock dependency problem for wed_lock - mwifiex: sanity check data reported by the device - iwlwifi: ensure ack flag is properly cleared - iwlwifi: mvm: fix a memory corruption due to bad pointer arithm - iwlwifi: mvm: fix incorrect usage of scan API Misc: - wifi: mac80211: work around Cisco AP 9115 VHT MPDU length" * tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits) MAINTAINERS: update Matthieu's email address mptcp: userspace pm allow creating id 0 subflow mptcp: fix delegated action races net: stmmac: remove unneeded stmmac_poll_controller net: lan743x: also select PHYLIB net: ethernet: mediatek: disable irq before schedule napi net: mana: Fix oversized sge0 for GSO packets net: mana: Fix the tso_bytes calculation net: mana: Fix TX CQE error handling netlink: annotate data-races around sk->sk_err sctp: update hb timer immediately after users change hb_interval sctp: update transport state when processing a dupcook packet tcp: fix delayed ACKs for MSS boundary condition tcp: fix quick-ack counting to count actual ACKs of new data page_pool: fix documentation typos tipc: fix a potential deadlock on &tx->lock net: stmmac: dwmac-stm32: fix resume on STM32 MCU ipv4: Set offload_failed flag in fibmatch results netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure netfilter: nf_tables: Deduplicate nft_register_obj audit logs ...
| * Merge branch 'mptcp-fixes-and-maintainer-email-update-for-v6-6'Jakub Kicinski2023-10-056-46/+36
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mat Martineau says: ==================== mptcp: Fixes and maintainer email update for v6.6 Patch 1 addresses a race condition in MPTCP "delegated actions" infrastructure. Affects v5.19 and later. Patch 2 removes an unnecessary restriction that did not allow additional outgoing subflows using the local address of the initial MPTCP subflow. v5.16 and later. Patch 3 updates Matthieu's email address. ==================== Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-0-28de4ac663ae@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * MAINTAINERS: update Matthieu's email addressMatthieu Baerts2023-10-052-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use my kernel.org account instead. The other one will bounce by the end of the year. Signed-off-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-3-28de4ac663ae@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * mptcp: userspace pm allow creating id 0 subflowGeliang Tang2023-10-051-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch drops id 0 limitation in mptcp_nl_cmd_sf_create() to allow creating additional subflows with the local addr ID 0. There is no reason not to allow additional subflows from this local address: we should be able to create new subflows from the initial endpoint. This limitation was breaking fullmesh support from userspace. Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/391 Cc: stable@vger.kernel.org Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-2-28de4ac663ae@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * mptcp: fix delegated action racesPaolo Abeni2023-10-053-39/+34
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The delegated action infrastructure is prone to the following race: different CPUs can try to schedule different delegated actions on the same subflow at the same time. Each of them will check different bits via mptcp_subflow_delegate(), and will try to schedule the action on the related per-cpu napi instance. Depending on the timing, both can observe an empty delegated list node, causing the same entry to be added simultaneously on two different lists. The root cause is that the delegated actions infra does not provide a single synchronization point. Address the issue reserving an additional bit to mark the subflow as scheduled for delegation. Acquiring such bit guarantee the caller to own the delegated list node, and being able to safely schedule the subflow. Clear such bit only when the subflow scheduling is completed, ensuring proper barrier in place. Additionally swap the meaning of the delegated_action bitmask, to allow the usage of the existing helper to set multiple bit at once. Fixes: bcd97734318d ("mptcp: use delegate action to schedule 3rd ack retrans") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-1-28de4ac663ae@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: stmmac: remove unneeded stmmac_poll_controllerRemi Pommarel2023-10-051-30/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using netconsole netpoll_poll_dev could be called from interrupt context, thus using disable_irq() would cause the following kernel warning with CONFIG_DEBUG_ATOMIC_SLEEP enabled: BUG: sleeping function called from invalid context at kernel/irq/manage.c:137 in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 10, name: ksoftirqd/0 CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G W 5.15.42-00075-g816b502b2298-dirty #117 Hardware name: aml (r1) (DT) Call trace: dump_backtrace+0x0/0x270 show_stack+0x14/0x20 dump_stack_lvl+0x8c/0xac dump_stack+0x18/0x30 ___might_sleep+0x150/0x194 __might_sleep+0x64/0xbc synchronize_irq+0x8c/0x150 disable_irq+0x2c/0x40 stmmac_poll_controller+0x140/0x1a0 netpoll_poll_dev+0x6c/0x220 netpoll_send_skb+0x308/0x390 netpoll_send_udp+0x418/0x760 write_msg+0x118/0x140 [netconsole] console_unlock+0x404/0x500 vprintk_emit+0x118/0x250 dev_vprintk_emit+0x19c/0x1cc dev_printk_emit+0x90/0xa8 __dev_printk+0x78/0x9c _dev_warn+0xa4/0xbc ath10k_warn+0xe8/0xf0 [ath10k_core] ath10k_htt_txrx_compl_task+0x790/0x7fc [ath10k_core] ath10k_pci_napi_poll+0x98/0x1f4 [ath10k_pci] __napi_poll+0x58/0x1f4 net_rx_action+0x504/0x590 _stext+0x1b8/0x418 run_ksoftirqd+0x74/0xa4 smpboot_thread_fn+0x210/0x3c0 kthread+0x1fc/0x210 ret_from_fork+0x10/0x20 Since [0] .ndo_poll_controller is only needed if driver doesn't or partially use NAPI. Because stmmac does so, stmmac_poll_controller can be removed fixing the above warning. [0] commit ac3d9dd034e5 ("netpoll: make ndo_poll_controller() optional") Cc: <stable@vger.kernel.org> # 5.15.x Fixes: 47dd7a540b8a ("net: add support for STMicroelectronics Ethernet controllers") Signed-off-by: Remi Pommarel <repk@triplefau.lt> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/1c156a6d8c9170bd6a17825f2277115525b4d50f.1696429960.git.repk@triplefau.lt Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: lan743x: also select PHYLIBRandy Dunlap2023-10-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since FIXED_PHY depends on PHYLIB, PHYLIB needs to be set to avoid a kconfig warning: WARNING: unmet direct dependencies detected for FIXED_PHY Depends on [n]: NETDEVICES [=y] && PHYLIB [=n] Selected by [y]: - LAN743X [=y] && NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MICROCHIP [=y] && PCI [=y] && PTP_1588_CLOCK_OPTIONAL [=y] Fixes: 73c4d1b307ae ("net: lan743x: select FIXED_PHY") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Closes: lore.kernel.org/r/202309261802.JPbRHwti-lkp@intel.com Cc: Bryan Whitehead <bryan.whitehead@microchip.com> Cc: UNGLinuxDriver@microchip.com Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Simon Horman <horms@kernel.org> # build-tested Link: https://lore.kernel.org/r/20231002193544.14529-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: ethernet: mediatek: disable irq before schedule napiChristian Marangi2023-10-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While searching for possible refactor of napi_schedule_prep and __napi_schedule it was notice that the mtk eth driver disable the interrupt for rx and tx AFTER napi is scheduled. While this is a very hard to repro case it might happen to have situation where the interrupt is disabled and never enabled again as the napi completes and the interrupt is enabled before. This is caused by the fact that a napi driven by interrupt expect a logic with: 1. interrupt received. napi prepared -> interrupt disabled -> napi scheduled 2. napi triggered. ring cleared -> interrupt enabled -> wait for new interrupt To prevent this case, disable the interrupt BEFORE the napi is scheduled. Fixes: 656e705243fd ("net-next: mediatek: add support for MT7623 ethernet") Cc: stable@vger.kernel.org Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Link: https://lore.kernel.org/r/20231002140805.568-1-ansuelsmth@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| * Merge branch 'net-mana-fix-some-tx-processing-bugs'Paolo Abeni2023-10-052-67/+149
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Haiyang Zhang says: ==================== net: mana: Fix some TX processing bugs Fix TX processing bugs on error handling, tso_bytes calculation, and sge0 size. ==================== Link: https://lore.kernel.org/r/1696020147-14989-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| | * net: mana: Fix oversized sge0 for GSO packetsHaiyang Zhang2023-10-052-58/+138
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Handle the case when GSO SKB linear length is too large. MANA NIC requires GSO packets to put only the header part to SGE0, otherwise the TX queue may stop at the HW level. So, use 2 SGEs for the skb linear part which contains more than the packet header. Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| | * net: mana: Fix the tso_bytes calculationHaiyang Zhang2023-10-051-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sizeof(struct hop_jumbo_hdr) is not part of tso_bytes, so remove the subtraction from header size. Cc: stable@vger.kernel.org Fixes: bd7fc6e1957c ("net: mana: Add new MANA VF performance counters for easier troubleshooting") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| | * net: mana: Fix TX CQE error handlingHaiyang Zhang2023-10-051-7/+11
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For an unknown TX CQE error type (probably from a newer hardware), still free the SKB, update the queue tail, etc., otherwise the accounting will be wrong. Also, TX errors can be triggered by injecting corrupted packets, so replace the WARN_ONCE to ratelimited error logging. Cc: stable@vger.kernel.org Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| * netlink: annotate data-races around sk->sk_errEric Dumazet2023-10-041-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot caught another data-race in netlink when setting sk->sk_err. Annotate all of them for good measure. BUG: KCSAN: data-race in netlink_recvmsg / netlink_recvmsg write to 0xffff8881613bb220 of 4 bytes by task 28147 on cpu 0: netlink_recvmsg+0x448/0x780 net/netlink/af_netlink.c:1994 sock_recvmsg_nosec net/socket.c:1027 [inline] sock_recvmsg net/socket.c:1049 [inline] __sys_recvfrom+0x1f4/0x2e0 net/socket.c:2229 __do_sys_recvfrom net/socket.c:2247 [inline] __se_sys_recvfrom net/socket.c:2243 [inline] __x64_sys_recvfrom+0x78/0x90 net/socket.c:2243 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd write to 0xffff8881613bb220 of 4 bytes by task 28146 on cpu 1: netlink_recvmsg+0x448/0x780 net/netlink/af_netlink.c:1994 sock_recvmsg_nosec net/socket.c:1027 [inline] sock_recvmsg net/socket.c:1049 [inline] __sys_recvfrom+0x1f4/0x2e0 net/socket.c:2229 __do_sys_recvfrom net/socket.c:2247 [inline] __se_sys_recvfrom net/socket.c:2243 [inline] __x64_sys_recvfrom+0x78/0x90 net/socket.c:2243 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x00000000 -> 0x00000016 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 28146 Comm: syz-executor.0 Not tainted 6.6.0-rc3-syzkaller-00055-g9ed22ae6be81 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231003183455.3410550-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * sctp: update hb timer immediately after users change hb_intervalXin Long2023-10-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when hb_interval is changed by users, it won't take effect until the next expiry of hb timer. As the default value is 30s, users have to wait up to 30s to wait its hb_interval update to work. This becomes pretty bad in containers where a much smaller value is usually set on hb_interval. This patch improves it by resetting the hb timer immediately once the value of hb_interval is updated by users. Note that we don't address the already existing 'problem' when sending a heartbeat 'on demand' if one hb has just been sent(from the timer) mentioned in: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg590224.html Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * sctp: update transport state when processing a dupcook packetXin Long2023-10-041-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During the 4-way handshake, the transport's state is set to ACTIVE in sctp_process_init() when processing INIT_ACK chunk on client or COOKIE_ECHO chunk on server. In the collision scenario below: 192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885] 192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO] 192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state, sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it creates a new association and sets its transport to ACTIVE then updates to the old association in sctp_assoc_update(). However, in sctp_assoc_update(), it will skip the transport update if it finds a transport with the same ipaddr already existing in the old asoc, and this causes the old asoc's transport state not to move to ACTIVE after the handshake. This means if DATA retransmission happens at this moment, it won't be able to enter PF state because of the check 'transport->state == SCTP_ACTIVE' in sctp_do_8_2_transport_strike(). This patch fixes it by updating the transport in sctp_assoc_update() with sctp_assoc_add_peer() where it updates the transport state if there is already a transport with the same ipaddr exists in the old asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/r/fd17356abe49713ded425250cc1ae51e9f5846c6.1696172325.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * tcp: fix delayed ACKs for MSS boundary conditionNeal Cardwell2023-10-041-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit fixes poor delayed ACK behavior that can cause poor TCP latency in a particular boundary condition: when an application makes a TCP socket write that is an exact multiple of the MSS size. The problem is that there is painful boundary discontinuity in the current delayed ACK behavior. With the current delayed ACK behavior, we have: (1) If an app reads data when > 1*MSS is unacknowledged, then tcp_cleanup_rbuf() ACKs immediately because of: tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss || (2) If an app reads all received data, and the packets were < 1*MSS, and either (a) the app is not ping-pong or (b) we received two packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause of: ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) || ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) && !inet_csk_in_pingpong_mode(sk))) && (3) *However*: if an app reads exactly 1*MSS of data, tcp_cleanup_rbuf() does not send an immediate ACK. This is true even if the app is not ping-pong and the 1*MSS of data had the PSH bit set, suggesting the sending application completed an application write. Thus if the app is not ping-pong, we have this painful case where >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a write whose last skb is an exact multiple of 1*MSS can get a 40ms delayed ACK. This means that any app that transfers data in one direction and takes care to align write size or packet size with MSS can suffer this problem. With receive zero copy making 4KB MSS values more common, it is becoming more common to have application writes naturally align with MSS, and more applications are likely to encounter this delayed ACK problem. The fix in this commit is to refine the delayed ACK heuristics with a simple check: immediately ACK a received 1*MSS skb with PSH bit set if the app reads all data. Why? If an skb has a len of exactly 1*MSS and has the PSH bit set then it is likely the end of an application write. So more data may not be arriving soon, and yet the data sender may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send an ACK immediately if the app reads all of the data and is not ping-pong. Note that this logic is also executed for the case where len > MSS, but in that case this logic does not matter (and does not hurt) because tcp_cleanup_rbuf() will always ACK immediately if the app reads data and there is more than an MSS of unACKed data. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Xin Guo <guoxin0309@gmail.com> Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * tcp: fix quick-ack counting to count actual ACKs of new dataNeal Cardwell2023-10-042-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit fixes quick-ack counting so that it only considers that a quick-ack has been provided if we are sending an ACK that newly acknowledges data. The code was erroneously using the number of data segments in outgoing skbs when deciding how many quick-ack credits to remove. This logic does not make sense, and could cause poor performance in request-response workloads, like RPC traffic, where requests or responses can be multi-segment skbs. When a TCP connection decides to send N quick-acks, that is to accelerate the cwnd growth of the congestion control module controlling the remote endpoint of the TCP connection. That quick-ack decision is purely about the incoming data and outgoing ACKs. It has nothing to do with the outgoing data or the size of outgoing data. And in particular, an ACK only serves the intended purpose of allowing the remote congestion control to grow the congestion window quickly if the ACK is ACKing or SACKing new data. The fix is simple: only count packets as serving the goal of the quickack mechanism if they are ACKing/SACKing new data. We can tell whether this is the case by checking inet_csk_ack_scheduled(), since we schedule an ACK exactly when we are ACKing/SACKing new data. Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * Merge tag 'nf-23-10-04' of ↵Jakub Kicinski2023-10-049-62/+395
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter patches for net First patch resolves a regression with vlan header matching, this was broken since 6.5 release. From myself. Second patch fixes an ancient problem with sctp connection tracking in case INIT_ACK packets are delayed. This comes with a selftest, both patches from Xin Long. Patch 4 extends the existing nftables audit selftest, from Phil Sutter. Patch 5, also from Phil, avoids a situation where nftables would emit an audit record twice. This was broken since 5.13 days. Patch 6, from myself, avoids spurious insertion failure if we encounter an overlapping but expired range during element insertion with the 'nft_set_rbtree' backend. This problem exists since 6.2. * tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure netfilter: nf_tables: Deduplicate nft_register_obj audit logs selftests: netfilter: Extend nft_audit.sh selftests: netfilter: test for sctp collision processing in nf_conntrack netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp netfilter: nft_payload: rebuild vlan header on h_proto access ==================== Link: https://lore.kernel.org/r/20231004141405.28749-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failureFlorian Westphal2023-10-041-17/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nft_rbtree_gc_elem() walks back and removes the end interval element that comes before the expired element. There is a small chance that we've cached this element as 'rbe_ge'. If this happens, we hold and test a pointer that has been queued for freeing. It also causes spurious insertion failures: $ cat test-testcases-sets-0044interval_overlap_0.1/testout.log Error: Could not process rule: File exists add element t s { 0 - 2 } ^^^^^^ Failed to insert 0 - 2 given: table ip t { set s { type inet_service flags interval,timeout timeout 2s gc-interval 2s } } The set (rbtree) is empty. The 'failure' doesn't happen on next attempt. Reason is that when we try to insert, the tree may hold an expired element that collides with the range we're adding. While we do evict/erase this element, we can trip over this check: if (rbe_ge && nft_rbtree_interval_end(rbe_ge) && nft_rbtree_interval_end(new)) return -ENOTEMPTY; rbe_ge was erased by the synchronous gc, we should not have done this check. Next attempt won't find it, so retry results in successful insertion. Restart in-kernel to avoid such spurious errors. Such restart are rare, unless userspace intentionally adds very large numbers of elements with very short timeouts while setting a huge gc interval. Even in this case, this cannot loop forever, on each retry an existing element has been removed. As the caller is holding the transaction mutex, its impossible for a second entity to add more expiring elements to the tree. After this it also becomes feasible to remove the async gc worker and perform all garbage collection from the commit path. Fixes: c9e6978e2725 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection") Signed-off-by: Florian Westphal <fw@strlen.de>
| | * netfilter: nf_tables: Deduplicate nft_register_obj audit logsPhil Sutter2023-10-042-16/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When adding/updating an object, the transaction handler emits suitable audit log entries already, the one in nft_obj_notify() is redundant. To fix that (and retain the audit logging from objects' 'update' callback), Introduce an "audit log free" variant for internal use. Fixes: c520292f29b8 ("audit: log nftables configuration change events once per table") Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Acked-by: Paul Moore <paul@paul-moore.com> (Audit) Signed-off-by: Florian Westphal <fw@strlen.de>
| | * selftests: netfilter: Extend nft_audit.shPhil Sutter2023-10-041-16/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add tests for sets and elements and deletion of all kinds. Also reorder rule reset tests: By moving the bulk rule add command up, the two 'reset rules' tests become identical. While at it, fix for a failing bulk rule add test's error status getting lost due to its use in a pipe. Avoid this by using a temporary file. Headings in diff output for failing tests contain no useful data, strip them. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>
| | * selftests: netfilter: test for sctp collision processing in nf_conntrackXin Long2023-10-043-2/+191
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a test case to reproduce the SCTP DATA chunk retransmission timeout issue caused by the improper SCTP collision processing in netfilter nf_conntrack_proto_sctp. In this test, client sends a INIT chunk, but the INIT_ACK replied from server is delayed until the server sends a INIT chunk to start a new connection from its side. After the connection is complete from server side, the delayed INIT_ACK arrives in nf_conntrack_proto_sctp. The delayed INIT_ACK should be dropped in nf_conntrack_proto_sctp instead of updating the vtag with the out-of-date init_tag, otherwise, the vtag in DATA chunks later sent by client don't match the vtag in the conntrack entry and the DATA chunks get dropped. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
| | * netfilter: handle the connecting collision properly in nf_conntrack_proto_sctpXin Long2023-10-042-10/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In Scenario A and B below, as the delayed INIT_ACK always changes the peer vtag, SCTP ct with the incorrect vtag may cause packet loss. Scenario A: INIT_ACK is delayed until the peer receives its own INIT_ACK 192.168.1.2 > 192.168.1.1: [INIT] [init tag: 1328086772] 192.168.1.1 > 192.168.1.2: [INIT] [init tag: 1414468151] 192.168.1.2 > 192.168.1.1: [INIT ACK] [init tag: 1328086772] 192.168.1.1 > 192.168.1.2: [INIT ACK] [init tag: 1650211246] * 192.168.1.2 > 192.168.1.1: [COOKIE ECHO] 192.168.1.1 > 192.168.1.2: [COOKIE ECHO] 192.168.1.2 > 192.168.1.1: [COOKIE ACK] Scenario B: INIT_ACK is delayed until the peer completes its own handshake 192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885] 192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO] 192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] * This patch fixes it as below: In SCTP_CID_INIT processing: - clear ct->proto.sctp.init[!dir] if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir]. (Scenario E) - set ct->proto.sctp.init[dir]. In SCTP_CID_INIT_ACK processing: - drop it if !ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] && ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario B, Scenario C) - drop it if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario A) In SCTP_CID_COOKIE_ACK processing: - clear ct->proto.sctp.init[dir] and ct->proto.sctp.init[!dir]. (Scenario D) Also, it's important to allow the ct state to move forward with cookie_echo and cookie_ack from the opposite dir for the collision scenarios. There are also other Scenarios where it should allow the packet through, addressed by the processing above: Scenario C: new CT is created by INIT_ACK. Scenario D: start INIT on the existing ESTABLISHED ct. Scenario E: start INIT after the old collision on the existing ESTABLISHED ct. 192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408] 192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885] (both side are stopped, then start new connection again in hours) 192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 242308742] Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
| | * netfilter: nft_payload: rebuild vlan header on h_proto accessFlorian Westphal2023-10-041-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nft can perform merging of adjacent payload requests. This means that: ether saddr 00:11 ... ether type 8021ad ... is a single payload expression, for 8 bytes, starting at the ethernet source offset. Check that offset+length is fully within the source/destination mac addersses. This bug prevents 'ether type' from matching the correct h_proto in case vlan tag got stripped. Fixes: de6843be3082 ("netfilter: nft_payload: rebuild vlan header when needed") Reported-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: Florian Westphal <fw@strlen.de>
| * | page_pool: fix documentation typosRandy Dunlap2023-10-041-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Correct grammar for better readability. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/20231001003846.29541-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | tipc: fix a potential deadlock on &tx->lockChengfeng Ye2023-10-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems that tipc_crypto_key_revoke() could be be invoked by wokequeue tipc_crypto_work_rx() under process context and timer/rx callback under softirq context, thus the lock acquisition on &tx->lock seems better use spin_lock_bh() to prevent possible deadlock. This flaw was found by an experimental static analysis tool I am developing for irq-related deadlock. tipc_crypto_work_rx() <workqueue> --> tipc_crypto_key_distr() --> tipc_bcast_xmit() --> tipc_bcbase_xmit() --> tipc_bearer_bc_xmit() --> tipc_crypto_xmit() --> tipc_ehdr_build() --> tipc_crypto_key_revoke() --> spin_lock(&tx->lock) <timer interrupt> --> tipc_disc_timeout() --> tipc_bearer_xmit_skb() --> tipc_crypto_xmit() --> tipc_ehdr_build() --> tipc_crypto_key_revoke() --> spin_lock(&tx->lock) <deadlock here> Signed-off-by: Chengfeng Ye <dg573847474@gmail.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Fixes: fc1b6d6de220 ("tipc: introduce TIPC encryption & authentication") Link: https://lore.kernel.org/r/20230927181414.59928-1-dg573847474@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | net: stmmac: dwmac-stm32: fix resume on STM32 MCUBen Wolsieffer2023-10-041-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The STM32MP1 keeps clk_rx enabled during suspend, and therefore the driver does not enable the clock in stm32_dwmac_init() if the device was suspended. The problem is that this same code runs on STM32 MCUs, which do disable clk_rx during suspend, causing the clock to never be re-enabled on resume. This patch adds a variant flag to indicate that clk_rx remains enabled during suspend, and uses this to decide whether to enable the clock in stm32_dwmac_init() if the device was suspended. This approach fixes this specific bug with limited opportunity for unintended side-effects, but I have a follow up patch that will refactor the clock configuration and hopefully make it less error prone. Fixes: 6528e02cc9ff ("net: ethernet: stmmac: add adaptation for stm32mp157c.") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230927175749.1419774-1-ben.wolsieffer@hefring.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | ipv4: Set offload_failed flag in fibmatch resultsBenjamin Poirier2023-10-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to a small omission, the offload_failed flag is missing from ipv4 fibmatch results. Make sure it is set correctly. The issue can be witnessed using the following commands: echo "1 1" > /sys/bus/netdevsim/new_device ip link add dummy1 up type dummy ip route add 192.0.2.0/24 dev dummy1 echo 1 > /sys/kernel/debug/netdevsim/netdevsim1/fib/fail_route_offload ip route add 198.51.100.0/24 dev dummy1 ip route # 192.168.15.0/24 has rt_trap # 198.51.100.0/24 has rt_offload_failed ip route get 192.168.15.1 fibmatch # Result has rt_trap ip route get 198.51.100.1 fibmatch # Result differs from the route shown by `ip route`, it is missing # rt_offload_failed ip link del dev dummy1 echo 1 > /sys/bus/netdevsim/del_device Fixes: 36c5100e859d ("IPv4: Add "offload failed" indication to routes") Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230926182730.231208-1-bpoirier@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | Merge tag 'wireless-2023-09-27' of ↵Jakub Kicinski2023-10-0430-176/+333
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Quite a collection of fixes this time, really too many to list individually. Many stack fixes, even rfkill (found by simulation and the new eevdf scheduler)! Also a bigger maintainers file cleanup, to remove old and redundant information. * tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (32 commits) wifi: iwlwifi: mvm: Fix incorrect usage of scan API wifi: mac80211: Create resources for disabled links wifi: cfg80211: avoid leaking stack data into trace wifi: mac80211: allow transmitting EAPOL frames with tainted key wifi: mac80211: work around Cisco AP 9115 VHT MPDU length wifi: cfg80211: Fix 6GHz scan configuration wifi: mac80211: fix potential key leak wifi: mac80211: fix potential key use-after-free wifi: mt76: mt76x02: fix MT76x0 external LNA gain handling wifi: brcmfmac: Replace 1-element arrays with flexible arrays wifi: mwifiex: Fix oob check condition in mwifiex_process_rx_packet wifi: rtw88: rtw8723d: Fix MAC address offset in EEPROM rfkill: sync before userspace visibility/changes wifi: mac80211: fix mesh id corruption on 32 bit systems wifi: cfg80211: add missing kernel-doc for cqm_rssi_work wifi: cfg80211: fix cqm_config access race wifi: iwlwifi: mvm: Fix a memory corruption issue wifi: iwlwifi: Ensure ack flag is properly cleared. wifi: iwlwifi: dbg_ini: fix structure packing iwlwifi: mvm: handle PS changes in vif_cfg_changed ... ==================== Link: https://lore.kernel.org/r/20230927095835.25803-2-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | wifi: iwlwifi: mvm: Fix incorrect usage of scan APIIlan Peer2023-09-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The support for using link ID in the scan request API was only added in version 16. However, the code wrongly enabled this API usage also for older versions. Fix it. Reported-by: Antoine Beaupré <anarcat@debian.org> Fixes: e98b23d0d7b8 ("wifi: iwlwifi: mvm: Add support for SCAN API version 16") Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230926165546.086e635fbbe6.Ia660f35ca0b1079f2c2ea92fd8d14d8101a89d03@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | wifi: mac80211: Create resources for disabled linksBenjamin Berg2023-09-261-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When associating to an MLD AP, links may be disabled. Create all resources associated with a disabled link so that we can later enable it without having to create these resources on the fly. Fixes: 6d543b34dbcf ("wifi: mac80211: Support disabled links during association") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://lore.kernel.org/r/20230925173028.f9afdb26f6c7.I4e6e199aaefc1bf017362d64f3869645fa6830b5@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | wifi: cfg80211: avoid leaking stack data into traceBenjamin Berg2023-09-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the structure is not initialized then boolean types might be copied into the tracing data without being initialised. This causes data from the stack to leak into the trace and also triggers a UBSAN failure which can easily be avoided here. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://lore.kernel.org/r/20230925171855.a9271ef53b05.I8180bae663984c91a3e036b87f36a640ba409817@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | wifi: mac80211: allow transmitting EAPOL frames with tainted keyWen Gong2023-09-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lower layer device driver stop/wake TX by calling ieee80211_stop_queue()/ ieee80211_wake_queue() while hw scan. Sometimes hw scan and PTK rekey are running in parallel, when M4 sent from wpa_supplicant arrive while the TX queue is stopped, then the M4 will pending send, and then new key install from wpa_supplicant. After TX queue wake up by lower layer device driver, the M4 will be dropped by below call stack. When key install started, the current key flag is set KEY_FLAG_TAINTED in ieee80211_pairwise_rekey(), and then mac80211 wait key install complete by lower layer device driver. Meanwhile ieee80211_tx_h_select_key() will return TX_DROP for the M4 in step 12 below, and then ieee80211_free_txskb() called by ieee80211_tx_dequeue(), so the M4 will not send and free, then the rekey process failed becaue AP not receive M4. Please see details in steps below. There are a interval between KEY_FLAG_TAINTED set for current key flag and install key complete by lower layer device driver, the KEY_FLAG_TAINTED is set in this interval, all packet including M4 will be dropped in this interval, the interval is step 8~13 as below. issue steps: TX thread install key thread 1. stop_queue -idle- 2. sending M4 -idle- 3. M4 pending -idle- 4. -idle- starting install key from wpa_supplicant 5. -idle- =>ieee80211_key_replace() 6. -idle- =>ieee80211_pairwise_rekey() and set currently key->flags |= KEY_FLAG_TAINTED 7. -idle- =>ieee80211_key_enable_hw_accel() 8. -idle- =>drv_set_key() and waiting key install complete from lower layer device driver 9. wake_queue -waiting state- 10. re-sending M4 -waiting state- 11. =>ieee80211_tx_h_select_key() -waiting state- 12. drop M4 by KEY_FLAG_TAINTED -waiting state- 13. -idle- install key complete with success/fail success: clear flag KEY_FLAG_TAINTED fail: start disconnect Hence add check in step 11 above to allow the EAPOL send out in the interval. If lower layer device driver use the old key/cipher to encrypt the M4, then AP received/decrypt M4 correctly, after M4 send out, lower layer device driver install the new key/cipher to hardware and return success. If lower layer device driver use new key/cipher to send the M4, then AP will/should drop the M4, then it is same result with this issue, AP will/ should kick out station as well as this issue. issue log: kworker/u16:4-5238 [000] 6456.108926: stop_queue: phy1 queue:0, reason:0 wpa_supplicant-961 [003] 6456.119737: rdev_tx_control_port: wiphy_name=phy1 name=wlan0 ifindex=6 dest=ARRAY[9e, 05, 31, 20, 9b, d0] proto=36488 unencrypted=0 wpa_supplicant-961 [003] 6456.119839: rdev_return_int_cookie: phy1, returned 0, cookie: 504 wpa_supplicant-961 [003] 6456.120287: rdev_add_key: phy1, netdev:wlan0(6), key_index: 0, mode: 0, pairwise: true, mac addr: 9e:05:31:20:9b:d0 wpa_supplicant-961 [003] 6456.120453: drv_set_key: phy1 vif:wlan0(2) sta:9e:05:31:20:9b:d0 cipher:0xfac04, flags=0x9, keyidx=0, hw_key_idx=0 kworker/u16:9-3829 [001] 6456.168240: wake_queue: phy1 queue:0, reason:0 kworker/u16:9-3829 [001] 6456.168255: drv_wake_tx_queue: phy1 vif:wlan0(2) sta:9e:05:31:20:9b:d0 ac:0 tid:7 kworker/u16:9-3829 [001] 6456.168305: cfg80211_control_port_tx_status: wdev(1), cookie: 504, ack: false wpa_supplicant-961 [003] 6459.167982: drv_return_int: phy1 - -110 issue call stack: nl80211_frame_tx_status+0x230/0x340 [cfg80211] cfg80211_control_port_tx_status+0x1c/0x28 [cfg80211] ieee80211_report_used_skb+0x374/0x3e8 [mac80211] ieee80211_free_txskb+0x24/0x40 [mac80211] ieee80211_tx_dequeue+0x644/0x954 [mac80211] ath10k_mac_tx_push_txq+0xac/0x238 [ath10k_core] ath10k_mac_op_wake_tx_queue+0xac/0xe0 [ath10k_core] drv_wake_tx_queue+0x80/0x168 [mac80211] __ieee80211_wake_txqs+0xe8/0x1c8 [mac80211] _ieee80211_wake_txqs+0xb4/0x120 [mac80211] ieee80211_wake_txqs+0x48/0x80 [mac80211] tasklet_action_common+0xa8/0x254 tasklet_action+0x2c/0x38 __do_softirq+0xdc/0x384 Signed-off-by: Wen Gong <quic_wgong@quicinc.com> Link: https://lore.kernel.org/r/20230801064751.25803-1-quic_wgong@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | wifi: mac80211: work around Cisco AP 9115 VHT MPDU lengthJohannes Berg2023-09-256-7/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cisco AP module 9115 with FW 17.3 has a bug and sends a too large maximum MPDU length in the association response (indicating 12k) that it cannot actually process. Work around that by taking the minimum between what's in the association response and the BSS elements (from beacon or probe response). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230918140607.d1966a9a532e.I090225babb7cd4d1081ee9acd40e7de7e41c15ae@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | wifi: cfg80211: Fix 6GHz scan configurationIlan Peer2023-09-251-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the scan request includes a non broadcast BSSID, when adding the scan parameters for 6GHz collocated scanning, do not include entries that do not match the given BSSID. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20230918140607.6d31d2a96baf.I6c4e3e3075d1d1878ee41f45190fdc6b86f18708@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | wifi: mac80211: fix potential key leakJohannes Berg2023-09-251-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When returning from ieee80211_key_link(), the key needs to have been freed or successfully installed. This was missed in a number of error paths, fix it. Signed-off-by: Johannes Berg <johannes.berg@intel.com>