summaryrefslogtreecommitdiffstats
path: root/net/openvswitch
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski7 days1-1/+2
|\ | | | | | | | | | | Merge in late fixes to prepare for the 6.10 net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: openvswitch: fix overwriting ct original tuple for ICMPv6Ilya Maximets10 days1-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OVS_PACKET_CMD_EXECUTE has 3 main attributes: - OVS_PACKET_ATTR_KEY - Packet metadata in a netlink format. - OVS_PACKET_ATTR_PACKET - Binary packet content. - OVS_PACKET_ATTR_ACTIONS - Actions to execute on the packet. OVS_PACKET_ATTR_KEY is parsed first to populate sw_flow_key structure with the metadata like conntrack state, input port, recirculation id, etc. Then the packet itself gets parsed to populate the rest of the keys from the packet headers. Whenever the packet parsing code starts parsing the ICMPv6 header, it first zeroes out fields in the key corresponding to Neighbor Discovery information even if it is not an ND packet. It is an 'ipv6.nd' field. However, the 'ipv6' is a union that shares the space between 'nd' and 'ct_orig' that holds the original tuple conntrack metadata parsed from the OVS_PACKET_ATTR_KEY. ND packets should not normally have conntrack state, so it's fine to share the space, but normal ICMPv6 Echo packets or maybe other types of ICMPv6 can have the state attached and it should not be overwritten. The issue results in all but the last 4 bytes of the destination address being wiped from the original conntrack tuple leading to incorrect packet matching and potentially executing wrong actions in case this packet recirculates within the datapath or goes back to userspace. ND fields should not be accessed in non-ND packets, so not clearing them should be fine. Executing memset() only for actual ND packets to avoid the issue. Initializing the whole thing before parsing is needed because ND packet may not contain all the options. The issue only affects the OVS_PACKET_CMD_EXECUTE path and doesn't affect packets entering OVS datapath from network interfaces, because in this case CT metadata is populated from skb after the packet is already parsed. Fixes: 9dd7f8907c37 ("openvswitch: Add original direction conntrack tuple to sw_flow_key.") Reported-by: Antonin Bas <antonin.bas@broadcom.com> Closes: https://github.com/openvswitch/ovs-issues/issues/327 Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20240509094228.1035477-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-04-251-2/+2
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_prueth.c net/mac80211/chan.c 89884459a0b9 ("wifi: mac80211: fix idle calculation with multi-link") 87f5500285fb ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()") https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/ net/unix/garbage.c 1971d13ffa84 ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") 4090fa373f0e ("af_unix: Replace garbage collection algorithm.") drivers/net/ethernet/ti/icssg/icssg_prueth.c drivers/net/ethernet/ti/icssg/icssg_common.c 4dcd0e83ea1d ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()") e2dc7bfd677f ("net: ti: icssg-prueth: Move common functions into a separate file") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: openvswitch: Fix Use-After-Free in ovs_ct_exitHyunwoo Kim2024-04-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since kfree_rcu, which is called in the hlist_for_each_entry_rcu traversal of ovs_ct_limit_exit, is not part of the RCU read critical section, it is possible that the RCU grace period will pass during the traversal and the key will be free. To prevent this, it should be changed to hlist_for_each_entry_safe. Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit") Signed-off-by: Hyunwoo Kim <v4bel@theori.io> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/ZiYvzQN/Ry5oeFQW@v4bel-B760M-AORUS-ELITE-AX Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: openvswitch: Release reference to netdevJun Gu2024-04-241-2/+6
| | | | | | | | | | | | | | | | | | | | | | dev_get_by_name will provide a reference on the netdev. So ensure that the reference of netdev is released after completed. Fixes: 2540088b836f ("net: openvswitch: Check vport netdev name") Signed-off-by: Jun Gu <jun.gu@easystack.cn> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20240423073751.52706-1-jun.gu@easystack.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: openvswitch: Check vport netdev nameJun Gu2024-04-221-1/+4
| | | | | | | | | | | | | | | | | | | | | | Ensure that the provided netdev name is not one of its aliases to prevent unnecessary creation and destruction of the vport by ovs-vswitchd. Signed-off-by: Jun Gu <jun.gu@easystack.cn> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20240419061425.132723-1-jun.gu@easystack.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2024-04-111-2/+3
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. Conflicts: net/unix/garbage.c 47d8ac011fe1 ("af_unix: Fix garbage collector racing against connect()") 4090fa373f0e ("af_unix: Replace garbage collection algorithm.") Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c faa12ca24558 ("bnxt_en: Reset PTP tx_avail after possible firmware reset") b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()") drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c 7ac10c7d728d ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()") 194fad5b2781 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions") drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 958f56e48385 ("net/mlx5e: Un-expose functions in en.h") 49e6c9387051 ("net/mlx5e: RSS, Block XOR hash with over 128 channels") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: openvswitch: fix unwanted error log on timeout policy probingIlya Maximets2024-04-051-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On startup, ovs-vswitchd probes different datapath features including support for timeout policies. While probing, it tries to execute certain operations with OVS_PACKET_ATTR_PROBE or OVS_FLOW_ATTR_PROBE attributes set. These attributes tell the openvswitch module to not log any errors when they occur as it is expected that some of the probes will fail. For some reason, setting the timeout policy ignores the PROBE attribute and logs a failure anyway. This is causing the following kernel log on each re-start of ovs-vswitchd: kernel: Failed to associated timeout policy `ovs_test_tp' Fix that by using the same logging macro that all other messages are using. The message will still be printed at info level when needed and will be rate limited, but with a net rate limiter instead of generic printk one. The nf_ct_set_timeout() itself will still print some info messages, but at least this change makes logging in openvswitch module more consistent. Fixes: 06bd2bdf19d2 ("openvswitch: Add timeout support to ct action") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20240403203803.2137962-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | genetlink: remove linux/genetlink.hJakub Kicinski2024-04-011-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | genetlink.h is a shell of what used to be a combined uAPI and kernel header over a decade ago. It has fewer than 10 lines of code. Merge it into net/genetlink.h. In some ways it'd be better to keep the combined header under linux/ but it would make looking through git history harder. Acked-by: Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20240329175710.291749-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: openvswitch: remove unnecessary linux/genetlink.h includeJakub Kicinski2024-04-011-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | The only legit reason I could think of for net/genetlink.h and linux/genetlink.h to be separate would be if one was included by other headers and we wanted to keep it lightweight. That is not the case, net/openvswitch/meter.h includes linux/genetlink.h but for no apparent reason (for struct genl_family perhaps? it's not necessary, types of externs do not need to be known). Link: https://lore.kernel.org/r/20240329175710.291749-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ip_tunnel: convert __be16 tunnel flags to bitmapsAlexander Lobakin2024-04-011-24/+37
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically, tunnel flags like TUNNEL_CSUM or TUNNEL_ERSPAN_OPT have been defined as __be16. Now all of those 16 bits are occupied and there's no more free space for new flags. It can't be simply switched to a bigger container with no adjustments to the values, since it's an explicit Endian storage, and on LE systems (__be16)0x0001 equals to (__be64)0x0001000000000000. We could probably define new 64-bit flags depending on the Endianness, i.e. (__be64)0x0001 on BE and (__be64)0x00010000... on LE, but that would introduce an Endianness dependency and spawn a ton of Sparse warnings. To mitigate them, all of those places which were adjusted with this change would be touched anyway, so why not define stuff properly if there's no choice. Define IP_TUNNEL_*_BIT counterparts as a bit number instead of the value already coded and a fistful of <16 <-> bitmap> converters and helpers. The two flags which have a different bit position are SIT_ISATAP_BIT and VTI_ISVTI_BIT, as they were defined not as __cpu_to_be16(), but as (__force __be16), i.e. had different positions on LE and BE. Now they both have strongly defined places. Change all __be16 fields which were used to store those flags, to IP_TUNNEL_DECLARE_FLAGS() -> DECLARE_BITMAP(__IP_TUNNEL_FLAG_NUM) -> unsigned long[1] for now, and replace all TUNNEL_* occurrences to their bitmap counterparts. Use the converters in the places which talk to the userspace, hardware (NFP) or other hosts (GRE header). The rest must explicitly use the new flags only. This must be done at once, otherwise there will be too many conversions throughout the code in the intermediate commits. Finally, disable the old __be16 flags for use in the kernel code (except for the two 'irregular' flags mentioned above), to prevent any accidental (mis)use of them. For the userspace, nothing is changed, only additions were made. Most noticeable bloat-o-meter difference (.text): vmlinux: 307/-1 (306) gre.ko: 62/0 (62) ip_gre.ko: 941/-217 (724) [*] ip_tunnel.ko: 390/-900 (-510) [**] ip_vti.ko: 138/0 (138) ip6_gre.ko: 534/-18 (516) [*] ip6_tunnel.ko: 118/-10 (108) [*] gre_flags_to_tnl_flags() grew, but still is inlined [**] ip_tunnel_find() got uninlined, hence such decrease The average code size increase in non-extreme case is 100-200 bytes per module, mostly due to sizeof(long) > sizeof(__be16), as %__IP_TUNNEL_FLAG_NUM is less than %BITS_PER_LONG and the compilers are able to expand the majority of bitmap_*() calls here into direct operations on scalars. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: openvswitch: limit the number of recursions from action setsAaron Conole2024-02-091-16/+33
| | | | | | | | | | | | | | | | | | | | | | | | | The ovs module allows for some actions to recursively contain an action list for complex scenarios, such as sampling, checking lengths, etc. When these actions are copied into the internal flow table, they are evaluated to validate that such actions make sense, and these calls happen recursively. The ovs-vswitchd userspace won't emit more than 16 recursion levels deep. However, the module has no such limit and will happily accept limits larger than 16 levels nested. Prevent this by tracking the number of recursions happening and manually limiting it to 16 levels nested. The initial implementation of the sample action would track this depth and prevent more than 3 levels of recursion, but this was removed to support the clone use case, rather than limited at the current userspace limit. Fixes: 798c166173ff ("openvswitch: Optimize sample action for the clone use cases") Signed-off-by: Aaron Conole <aconole@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240207132416.1488485-2-aconole@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/sched: act_ct: Always fill offloading tuple iifidxVlad Buslov2023-11-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Referenced commit doesn't always set iifidx when offloading the flow to hardware. Fix the following cases: - nf_conn_act_ct_ext_fill() is called before extension is created with nf_conn_act_ct_ext_add() in tcf_ct_act(). This can cause rule offload with unspecified iifidx when connection is offloaded after only single original-direction packet has been processed by tc data path. Always fill the new nf_conn_act_ct_ext instance after creating it in nf_conn_act_ct_ext_add(). - Offloading of unidirectional UDP NEW connections is now supported, but ct flow iifidx field is not updated when connection is promoted to bidirectional which can result reply-direction iifidx to be zero when refreshing the connection. Fill in the extension and update flow iifidx before calling flow_offload_refresh(). Fixes: 9795ded7f924 ("net/sched: act_ct: Fill offloading tuple iifidx") Reviewed-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Fixes: 6a9bad0069cf ("net/sched: act_ct: offload UDP NEW connections") Link: https://lore.kernel.org/r/20231103151410.764271-1-vladbu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: openvswitch: Annotate struct mask_array with __counted_byChristophe JAILLET2023-10-171-1/+1
| | | | | | | | | | | | | | Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/ca5c8049f58bb933f231afd0816e30a5aaa0eddd.1697264974.git.christophe.jaillet@wanadoo.fr Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* net: openvswitch: Use struct_size()Christophe JAILLET2023-10-171-5/+2
| | | | | | | | | | Use struct_size() instead of hand writing it. This is less verbose and more robust. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/e5122b4ff878cbf3ed72653a395ad5c4da04dc1e.1697264974.git.christophe.jaillet@wanadoo.fr Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* net: openvswitch: Annotate struct dp_meter with __counted_byKees Cook2023-10-021-1/+1
| | | | | | | | | | | | | | | | | | | Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct dp_meter. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Pravin B Shelar <pshelar@ovn.org> Cc: dev@openvswitch.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230922172858.3822653-12-keescook@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: openvswitch: Annotate struct dp_meter_instance with __counted_byKees Cook2023-10-021-1/+1
| | | | | | | | | | | | | | | | | | | Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct dp_meter_instance. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Pravin B Shelar <pshelar@ovn.org> Cc: dev@openvswitch.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230922172858.3822653-10-keescook@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* openvswitch: reduce stack usage in do_execute_actionsIlya Maximets2023-10-011-12/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_execute_actions() function can be called recursively multiple times while executing actions that require pipeline forking or recirculations. It may also be re-entered multiple times if the packet leaves openvswitch module and re-enters it through a different port. Currently, there is a 256-byte array allocated on stack in this function that is supposed to hold NSH header. Compilers tend to pre-allocate that space right at the beginning of the function: a88: 48 81 ec b0 01 00 00 sub $0x1b0,%rsp NSH is not a very common protocol, but the space is allocated on every recursive call or re-entry multiplying the wasted stack space. Move the stack allocation to push_nsh() function that is only used if NSH actions are actually present. push_nsh() is also a simple function without a possibility for re-entry, so the stack is returned right away. With this change the preallocated space is reduced by 256 B per call: b18: 48 81 ec b0 00 00 00 sub $0xb0,%rsp Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Eelco Chaudron echaudro@redhat.com Reviewed-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dst: remove unnecessary input parameter in dst_alloc and dst_initZhengchao Shao2023-09-121-2/+2
| | | | | | | | | | | | | | Since commit 1202cdd66531("Remove DECnet support from kernel") has been merged, all callers pass in the initial_ref value of 1 when they call dst_alloc(). Therefore, remove initial_ref when the dst_alloc() is declared and replace initial_ref with 1 in dst_alloc(). Also when all callers call dst_init(), the value of initial_ref is 1. Therefore, remove the input parameter initial_ref of the dst_init() and replace initial_ref with the value 1 in dst_init. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://lore.kernel.org/r/20230911125045.346390-1-shaozhengchao@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2023-08-181-4/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/sfc/tc.c fa165e194997 ("sfc: don't unregister flow_indr if it was never registered") 3bf969e88ada ("sfc: add MAE table machinery for conntrack table") https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: openvswitch: reject negative ifindexJakub Kicinski2023-08-151-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent changes in net-next (commit 759ab1edb56c ("net: store netdevs in an xarray")) refactored the handling of pre-assigned ifindexes and let syzbot surface a latent problem in ovs. ovs does not validate ifindex, making it possible to create netdev ports with negative ifindex values. It's easy to repro with YNL: $ ./cli.py --spec netlink/specs/ovs_datapath.yaml \ --do new \ --json '{"upcall-pid": 1, "name":"my-dp"}' $ ./cli.py --spec netlink/specs/ovs_vport.yaml \ --do new \ --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}' $ ip link show -65536: some-port0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 7a:48:21:ad:0b:fb brd ff:ff:ff:ff:ff:ff ... Validate the inputs. Now the second command correctly returns: $ ./cli.py --spec netlink/specs/ovs_vport.yaml \ --do new \ --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}' lib.ynl.NlError: Netlink error: Numerical result out of range nl_len = 108 (92) nl_flags = 0x300 nl_type = 2 error: -34 extack: {'msg': 'integer out of range', 'unknown': [[type:4 len:36] b'\x0c\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x03\x00\xff\xff\xff\x7f\x00\x00\x00\x00\x08\x00\x01\x00\x08\x00\x00\x00'], 'bad-attr': '.ifindex'} Accept 0 since it used to be silently ignored. Fixes: 54c4ef34c4b6 ("openvswitch: allow specifying ifindex of new interfaces") Reported-by: syzbot+7456b5dcf65111553320@syzkaller.appspotmail.com Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20230814203840.2908710-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | genetlink: remove userhdr from struct genl_infoJakub Kicinski2023-08-153-19/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | Only three families use info->userhdr today and going forward we discourage using fixed headers in new families. So having the pointer to user header in struct genl_info is an overkill. Compute the header pointer at runtime. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: openvswitch: add misc error drop reasonsAdrian Moreno2023-08-143-8/+18
| | | | | | | | | | | | | | | | | | Use drop reasons from include/net/dropreason-core.h when a reasonable candidate exists. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: openvswitch: add meter drop reasonAdrian Moreno2023-08-142-1/+2
| | | | | | | | | | | | | | | | | | By using an independent drop reason it makes it easy to distinguish between QoS-triggered or flow-triggered drop. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: openvswitch: add explicit drop actionEric Garver2023-08-143-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From: Eric Garver <eric@garver.life> This adds an explicit drop action. This is used by OVS to drop packets for which it cannot determine what to do. An explicit action in the kernel allows passing the reason _why_ the packet is being dropped or zero to indicate no particular error happened (i.e: OVS intentionally dropped the packet). Since the error codes coming from userspace mean nothing for the kernel, we squash all of them into only two drop reasons: - OVS_DROP_EXPLICIT_WITH_ERROR to indicate a non-zero value was passed - OVS_DROP_EXPLICIT to indicate a zero value was passed (no error) e.g. trace all OVS dropped skbs # perf trace -e skb:kfree_skb --filter="reason >= 0x30000" [..] 106.023 ping/2465 skb:kfree_skb(skbaddr: 0xffffa0e8765f2000, \ location:0xffffffffc0d9b462, protocol: 2048, reason: 196611) reason: 196611 --> 0x30003 (OVS_DROP_EXPLICIT) Also, this patch allows ovs-dpctl.py to add explicit drop actions as: "drop" -> implicit empty-action drop "drop(0)" -> explicit non-error action drop "drop(42)" -> explicit error action drop Signed-off-by: Eric Garver <eric@garver.life> Co-developed-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: openvswitch: add action error drop reasonAdrian Moreno2023-08-142-1/+2
| | | | | | | | | | | | | | | | | | Add a drop reason for packets that are dropped because an action returns a non-zero error code. Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: openvswitch: add last-action drop reasonAdrian Moreno2023-08-143-2/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create a new drop reason subsystem for openvswitch and add the first drop reason to represent last-action drops. Last-action drops happen when a flow has an empty action list or there is no action that consumes the packet (output, userspace, recirc, etc). It is the most common way in which OVS drops packets. Implementation-wise, most of these skb-consuming actions already call "consume_skb" internally and return directly from within the do_execute_actions() loop so with minimal changes we can assume that any skb that exits the loop normally is a packet drop. Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | openvswitch: set IPS_CONFIRMED in tmpl status only when commit is set in ↵Xin Long2023-07-201-68/+10
|/ | | | | | | | | | | | | conntrack By not setting IPS_CONFIRMED in tmpl that allows the exp not to be removed from the hashtable when lookup, we can simplify the exp processing code a lot in openvswitch conntrack. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* net: openvswitch: add support for l4 symmetric hashingAaron Conole2023-06-122-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | Since its introduction, the ovs module execute_hash action allowed hash algorithms other than the skb->l4_hash to be used. However, additional hash algorithms were not implemented. This means flows requiring different hash distributions weren't able to use the kernel datapath. Now, introduce support for symmetric hashing algorithm as an alternative hash supported by the ovs module using the flow dissector. Output of flow using l4_sym hash: recirc_id(0),in_port(3),eth(),eth_type(0x0800), ipv4(dst=64.0.0.0/192.0.0.0,proto=6,frag=no), packets:30473425, bytes:45902883702, used:0.000s, flags:SP., actions:hash(sym_l4(0)),recirc(0xd) Some performance testing with no GRO/GSO, two veths, single flow: hash(l4(0)): 4.35 GBits/s hash(l4_sym(0)): 4.24 GBits/s Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: move gso declarations and functions to their own filesEric Dumazet2023-06-102-0/+2
| | | | | | | | | | | Move declarations into include/net/gso.h and code into net/core/gso.c Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2023-06-082-21/+16
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. Conflicts: net/sched/sch_taprio.c d636fc5dd692 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping") dced11ef84fb ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()") net/ipv4/sysctl_net_ipv4.c e209fee4118f ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294") ccce324dabfe ("tcp: make the first N SYN RTO backoffs linear") https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: openvswitch: fix upcall counter access before allocationEelco Chaudron2023-06-072-21/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the per cpu upcall counters are allocated after the vport is created and inserted into the system. This could lead to the datapath accessing the counters before they are allocated resulting in a kernel Oops. Here is an example: PID: 59693 TASK: ffff0005f4f51500 CPU: 0 COMMAND: "ovs-vswitchd" #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4 #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60 #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58 #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388 #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68 #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch] ... PID: 58682 TASK: ffff0005b2f0bf00 CPU: 0 COMMAND: "kworker/0:3" #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758 #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994 #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8 #3 [ffff80000a5d3120] die at ffffb70f0628234c #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8 #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4 #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4 #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710 #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74 #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac #10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24 #11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc #12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch] #13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch] #14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch] #15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch] #16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch] #17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90 We moved the per cpu upcall counter allocation to the existing vport alloc and free functions to solve this. Fixes: 95637d91fefd ("net: openvswitch: release vport resources on failure") Fixes: 1933ea365aa7 ("net: openvswitch: Add support to count upcall packets") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: openvswitch: Use struct_size()Christophe JAILLET2023-05-171-3/+1
|/ | | | | | | | | | Use struct_size() instead of hand writing it. This is less verbose and more informative. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/e7746fbbd62371d286081d5266e88bbe8d3fe9f0.1683388991.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: openvswitch: fix race on port outputFelix Huettner2023-04-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | assume the following setup on a single machine: 1. An openvswitch instance with one bridge and default flows 2. two network namespaces "server" and "client" 3. two ovs interfaces "server" and "client" on the bridge 4. for each ovs interface a veth pair with a matching name and 32 rx and tx queues 5. move the ends of the veth pairs to the respective network namespaces 6. assign ip addresses to each of the veth ends in the namespaces (needs to be the same subnet) 7. start some http server on the server network namespace 8. test if a client in the client namespace can reach the http server when following the actions below the host has a chance of getting a cpu stuck in a infinite loop: 1. send a large amount of parallel requests to the http server (around 3000 curls should work) 2. in parallel delete the network namespace (do not delete interfaces or stop the server, just kill the namespace) there is a low chance that this will cause the below kernel cpu stuck message. If this does not happen just retry. Below there is also the output of bpftrace for the functions mentioned in the output. The series of events happening here is: 1. the network namespace is deleted calling `unregister_netdevice_many_notify` somewhere in the process 2. this sets first `NETREG_UNREGISTERING` on both ends of the veth and then runs `synchronize_net` 3. it then calls `call_netdevice_notifiers` with `NETDEV_UNREGISTER` 4. this is then handled by `dp_device_event` which calls `ovs_netdev_detach_dev` (if a vport is found, which is the case for the veth interface attached to ovs) 5. this removes the rx_handlers of the device but does not prevent packages to be sent to the device 6. `dp_device_event` then queues the vport deletion to work in background as a ovs_lock is needed that we do not hold in the unregistration path 7. `unregister_netdevice_many_notify` continues to call `netdev_unregister_kobject` which sets `real_num_tx_queues` to 0 8. port deletion continues (but details are not relevant for this issue) 9. at some future point the background task deletes the vport If after 7. but before 9. a packet is send to the ovs vport (which is not deleted at this point in time) which forwards it to the `dev_queue_xmit` flow even though the device is unregistering. In `skb_tx_hash` (which is called in the `dev_queue_xmit`) path there is a while loop (if the packet has a rx_queue recorded) that is infinite if `dev->real_num_tx_queues` is zero. To prevent this from happening we update `do_output` to handle devices without carrier the same as if the device is not found (which would be the code path after 9. is done). Additionally we now produce a warning in `skb_tx_hash` if we will hit the infinite loop. bpftrace (first word is function name): __dev_queue_xmit server: real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 1, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 1 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 2, reg_state: 1 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 6, reg_state: 2 ovs_netdev_detach_dev server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, reg_state: 2 netdev_rx_handler_unregister server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 netdev_rx_handler_unregister ret server: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 27, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 22, reg_state: 2 dp_device_event server: real_num_tx_queues: 1 cpu 9, pid: 21024, tid: 21024, event 18, reg_state: 2 netdev_unregister_kobject: real_num_tx_queues: 1, cpu: 9, pid: 21024, tid: 21024 synchronize_rcu_expedited: cpu 9, pid: 21024, tid: 21024 ovs_vport_send server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 __dev_queue_xmit server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 netdev_core_pick_tx server: addr: 0xffff9f0a46d4a000 real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024, skb_addr: 0xffff9edb6f207000, reg_state: 2 broken device server: real_num_tx_queues: 0, cpu: 2, pid: 28024, tid: 28024 ovs_dp_detach_port server: real_num_tx_queues: 0 cpu 9, pid: 9124, tid: 9124, reg_state: 2 synchronize_rcu_expedited: cpu 9, pid: 33604, tid: 33604 stuck message: watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [curl:1929279] Modules linked in: veth pktgen bridge stp llc ip_set_hash_net nft_counter xt_set nft_compat nf_tables ip_set_hash_ip ip_set nfnetlink_cttimeout nfnetlink openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tls binfmt_misc nls_iso8859_1 input_leds joydev serio_raw dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel drm efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel virtio_net ahci net_failover crypto_simd cryptd psmouse libahci virtio_blk failover CPU: 5 PID: 1929279 Comm: curl Not tainted 5.15.0-67-generic #74-Ubuntu Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:netdev_pick_tx+0xf1/0x320 Code: 00 00 8d 48 ff 0f b7 c1 66 39 ca 0f 86 e9 01 00 00 45 0f b7 ff 41 39 c7 0f 87 5b 01 00 00 44 29 f8 41 39 c7 0f 87 4f 01 00 00 <eb> f2 0f 1f 44 00 00 49 8b 94 24 28 04 00 00 48 85 d2 0f 84 53 01 RSP: 0018:ffffb78b40298820 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff9c8773adc2e0 RCX: 000000000000083f RDX: 0000000000000000 RSI: ffff9c8773adc2e0 RDI: ffff9c870a25e000 RBP: ffffb78b40298858 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9c870a25e000 R13: ffff9c870a25e000 R14: ffff9c87fe043480 R15: 0000000000000000 FS: 00007f7b80008f00(0000) GS:ffff9c8e5f740000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7b80f6a0b0 CR3: 0000000329d66000 CR4: 0000000000350ee0 Call Trace: <IRQ> netdev_core_pick_tx+0xa4/0xb0 __dev_queue_xmit+0xf8/0x510 ? __bpf_prog_exit+0x1e/0x30 dev_queue_xmit+0x10/0x20 ovs_vport_send+0xad/0x170 [openvswitch] do_output+0x59/0x180 [openvswitch] do_execute_actions+0xa80/0xaa0 [openvswitch] ? kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? flow_lookup.constprop.0+0x5c/0x110 [openvswitch] ovs_execute_actions+0x4c/0x120 [openvswitch] ovs_dp_process_packet+0xa1/0x200 [openvswitch] ? ovs_ct_update_key.isra.0+0xa8/0x120 [openvswitch] ? ovs_ct_fill_key+0x1d/0x30 [openvswitch] ? ovs_flow_key_extract+0x2db/0x350 [openvswitch] ovs_vport_receive+0x77/0xd0 [openvswitch] ? __htab_map_lookup_elem+0x4e/0x60 ? bpf_prog_680e8aff8547aec1_kfree+0x3b/0x714 ? trace_call_bpf+0xc8/0x150 ? kfree+0x1/0x250 ? kfree+0x1/0x250 ? kprobe_perf_func+0x4f/0x2b0 ? kprobe_perf_func+0x4f/0x2b0 ? __mod_memcg_lruvec_state+0x63/0xe0 netdev_port_receive+0xc4/0x180 [openvswitch] ? netdev_port_receive+0x180/0x180 [openvswitch] netdev_frame_hook+0x1f/0x40 [openvswitch] __netif_receive_skb_core.constprop.0+0x23d/0xf00 __netif_receive_skb_one_core+0x3f/0xa0 __netif_receive_skb+0x15/0x60 process_backlog+0x9e/0x170 __napi_poll+0x33/0x180 net_rx_action+0x126/0x280 ? ttwu_do_activate+0x72/0xf0 __do_softirq+0xd9/0x2e7 ? rcu_report_exp_cpu_mult+0x1b0/0x1b0 do_softirq+0x7d/0xb0 </IRQ> <TASK> __local_bh_enable_ip+0x54/0x60 ip_finish_output2+0x191/0x460 __ip_finish_output+0xb7/0x180 ip_finish_output+0x2e/0xc0 ip_output+0x78/0x100 ? __ip_finish_output+0x180/0x180 ip_local_out+0x5e/0x70 __ip_queue_xmit+0x184/0x440 ? tcp_syn_options+0x1f9/0x300 ip_queue_xmit+0x15/0x20 __tcp_transmit_skb+0x910/0x9c0 ? __mod_memcg_state+0x44/0xa0 tcp_connect+0x437/0x4e0 ? ktime_get_with_offset+0x60/0xf0 tcp_v4_connect+0x436/0x530 __inet_stream_connect+0xd4/0x3a0 ? kprobe_perf_func+0x4f/0x2b0 ? aa_sk_perm+0x43/0x1c0 inet_stream_connect+0x3b/0x60 __sys_connect_file+0x63/0x70 __sys_connect+0xa6/0xd0 ? setfl+0x108/0x170 ? do_fcntl+0xe8/0x5a0 __x64_sys_connect+0x18/0x20 do_syscall_64+0x5c/0xc0 ? __x64_sys_fcntl+0xa9/0xd0 ? exit_to_user_mode_prepare+0x37/0xb0 ? syscall_exit_to_user_mode+0x27/0x50 ? do_syscall_64+0x69/0xc0 ? __sys_setsockopt+0xea/0x1e0 ? exit_to_user_mode_prepare+0x37/0xb0 ? syscall_exit_to_user_mode+0x27/0x50 ? __x64_sys_setsockopt+0x1f/0x30 ? do_syscall_64+0x69/0xc0 ? irqentry_exit+0x1d/0x30 ? exc_page_fault+0x89/0x170 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7f7b8101c6a7 Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89 RSP: 002b:00007ffffd6b2198 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b8101c6a7 RDX: 0000000000000010 RSI: 00007ffffd6b2360 RDI: 0000000000000005 RBP: 0000561f1370d560 R08: 00002795ad21d1ac R09: 0030312e302e302e R10: 00007ffffd73f080 R11: 0000000000000246 R12: 0000561f1370c410 R13: 0000000000000000 R14: 0000000000000005 R15: 0000000000000000 </TASK> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Co-developed-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Luca Czesla <luca.czesla@mail.schwarz> Signed-off-by: Felix Huettner <felix.huettner@mail.schwarz> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/ZC0pBXBAgh7c76CA@kernel-bug-kernel-bug Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge tag 'mm-nonmm-stable-2023-02-20-15-29' of ↵Linus Torvalds2023-02-231-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "There is no particular theme here - mainly quick hits all over the tree. Most notable is a set of zlib changes from Mikhail Zaslonko which enhances and fixes zlib's use of S390 hardware support: 'lib/zlib: Set of s390 DFLTCC related patches for kernel zlib'" * tag 'mm-nonmm-stable-2023-02-20-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (55 commits) Update CREDITS file entry for Jesper Juhl sparc: allow PM configs for sparc32 COMPILE_TEST hung_task: print message when hung_task_warnings gets down to zero. arch/Kconfig: fix indentation scripts/tags.sh: fix the Kconfig tags generation when using latest ctags nilfs2: prevent WARNING in nilfs_dat_commit_end() lib/zlib: remove redundation assignement of avail_in dfltcc_gdht() lib/Kconfig.debug: do not enable DEBUG_PREEMPT by default lib/zlib: DFLTCC always switch to software inflate for Z_PACKET_FLUSH option lib/zlib: DFLTCC support inflate with small window lib/zlib: Split deflate and inflate states for DFLTCC lib/zlib: DFLTCC not writing header bits when avail_out == 0 lib/zlib: fix DFLTCC ignoring flush modes when avail_in == 0 lib/zlib: fix DFLTCC not flushing EOBS when creating raw streams lib/zlib: implement switching between DFLTCC and software lib/zlib: adjust offset calculation for dfltcc_state nilfs2: replace WARN_ONs for invalid DAT metadata block requests scripts/spelling.txt: add "exsits" pattern and fix typo instances fs: gracefully handle ->get_block not mapping bh in __mpage_writepage cramfs: Kconfig: fix spelling & punctuation ...
| * scripts/spelling.txt: add "exsits" pattern and fix typo instancesLuca Ceresoli2023-02-021-1/+1
| | | | | | | | | | | | | | | | | | | | Fix typos and add the following to the scripts/spelling.txt: exsits||exists Link: https://lkml.kernel.org/r/20230126152205.959277-1-luca.ceresoli@bootlin.com Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | net/sched: cls_api: Support hardware miss to tc actionPaul Blakey2023-02-201-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For drivers to support partial offload of a filter's action list, add support for action miss to specify an action instance to continue from in sw. CT action in particular can't be fully offloaded, as new connections need to be handled in software. This imposes other limitations on the actions that can be offloaded together with the CT action, such as packet modifications. Assign each action on a filter's action list a unique miss_cookie which drivers can then use to fill action_miss part of the tc skb extension. On getting back this miss_cookie, find the action instance with relevant cookie and continue classifying from there. Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2023-02-171-1/+3
|\ \ | | | | | | | | | | | | | | | Some of the devlink bits were tricky, but I think I got it right. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: openvswitch: fix possible memory leak in ovs_meter_cmd_set()Hangyu Hua2023-02-131-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | old_meter needs to be free after it is detached regardless of whether the new meter is successfully attached. Fixes: c7c4c44c9a95 ("net: openvswitch: expand the meters supported number") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: extract nf_ct_handle_fragments to nf_conntrack_ovsXin Long2023-02-101-44/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now handle_fragments() in OVS and TC have the similar code, and this patch removes the duplicate code by moving the function to nf_conntrack_ovs. Note that skb_clear_hash(skb) or skb->ignore_df = 1 should be done only when defrag returns 0, as it does in other places in kernel. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | openvswitch: move key and ovs_cb update out of handle_fragmentsXin Long2023-02-101-12/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch has no functional changes and just moves key and ovs_cb update out of handle_fragments, and skb_clear_hash() and skb->ignore_df change into handle_fragments(), to make it easier to move the duplicate code from handle_fragments() into nf_conntrack_ovs later. Note that it changes to pass info->family to handle_fragments() instead of key for the packet type check, as info->family is set according to key->eth.type in ovs_ct_copy_action() when creating the action. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | net: extract nf_ct_skb_network_trim function to nf_conntrack_ovsXin Long2023-02-101-32/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are almost the same code in ovs_skb_network_trim() and tcf_ct_skb_network_trim(), this patch extracts them into a function nf_ct_skb_network_trim() and moves the function to nf_conntrack_ovs. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | net: create nf_conntrack_ovs for ovs and tc useXin Long2023-02-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to nf_nat_ovs created by Commit ebddb1404900 ("net: move the nat function to nf_nat_ovs for ovs and tc"), this patch is to create nf_conntrack_ovs to get these functions shared by OVS and TC only. There are nf_ct_helper() and nf_ct_add_helper() from nf_conntrak_helper in this patch, and will be more in the following patches. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | openvswitch: Use string_is_terminated() helperAndy Shevchenko2023-02-091-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use string_is_terminated() helper instead of cpecific memchr() call. This shows better the intention of the call. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230208133153.22528-3-andriy.shevchenko@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | net: openvswitch: reduce cpu_used_mask memoryEddy Tao2023-02-063-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use actual CPU number instead of hardcoded value to decide the size of 'cpu_used_mask' in 'struct sw_flow'. Below is the reason. 'struct cpumask cpu_used_mask' is embedded in struct sw_flow. Its size is hardcoded to CONFIG_NR_CPUS bits, which can be 8192 by default, it costs memory and slows down ovs_flow_alloc. To address this: Redefine cpu_used_mask to pointer. Append cpumask_size() bytes after 'stat' to hold cpumask. Initialization cpu_used_mask right after stats_last_writer. APIs like cpumask_next and cpumask_set_cpu never access bits beyond cpu count, cpumask_size() bytes of memory is enough. Signed-off-by: Eddy Tao <taoyuan_eddy@hotmail.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/OS3P286MB229570CCED618B20355D227AF5D59@OS3P286MB2295.JPNP286.PROD.OUTLOOK.COM Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2023-02-021-6/+6
|\| | | | | | | | | | | | | | | | | | | | | | | | | | net/core/gro.c 7d2c89b32587 ("skb: Do mix page pool and page referenced frags in GRO") b1a78b9b9886 ("net: add support for ipv4 big tcp") https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | net: openvswitch: fix flow memory leak in ovs_flow_cmd_newFedor Pchelkin2023-02-021-6/+6
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzkaller reports a memory leak of new_flow in ovs_flow_cmd_new() as it is not freed when an allocation of a key fails. BUG: memory leak unreferenced object 0xffff888116668000 (size 632): comm "syz-executor231", pid 1090, jiffies 4294844701 (age 18.871s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000defa3494>] kmem_cache_zalloc include/linux/slab.h:654 [inline] [<00000000defa3494>] ovs_flow_alloc+0x19/0x180 net/openvswitch/flow_table.c:77 [<00000000c67d8873>] ovs_flow_cmd_new+0x1de/0xd40 net/openvswitch/datapath.c:957 [<0000000010a539a8>] genl_family_rcv_msg_doit+0x22d/0x330 net/netlink/genetlink.c:739 [<00000000dff3302d>] genl_family_rcv_msg net/netlink/genetlink.c:783 [inline] [<00000000dff3302d>] genl_rcv_msg+0x328/0x590 net/netlink/genetlink.c:800 [<000000000286dd87>] netlink_rcv_skb+0x153/0x430 net/netlink/af_netlink.c:2515 [<0000000061fed410>] genl_rcv+0x24/0x40 net/netlink/genetlink.c:811 [<000000009dc0f111>] netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] [<000000009dc0f111>] netlink_unicast+0x545/0x7f0 net/netlink/af_netlink.c:1339 [<000000004a5ee816>] netlink_sendmsg+0x8e7/0xde0 net/netlink/af_netlink.c:1934 [<00000000482b476f>] sock_sendmsg_nosec net/socket.c:651 [inline] [<00000000482b476f>] sock_sendmsg+0x152/0x190 net/socket.c:671 [<00000000698574ba>] ____sys_sendmsg+0x70a/0x870 net/socket.c:2356 [<00000000d28d9e11>] ___sys_sendmsg+0xf3/0x170 net/socket.c:2410 [<0000000083ba9120>] __sys_sendmsg+0xe5/0x1b0 net/socket.c:2439 [<00000000c00628f8>] do_syscall_64+0x30/0x40 arch/x86/entry/common.c:46 [<000000004abfdcf4>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 To fix this the patch rearranges the goto labels to reflect the order of object allocations and adds appropriate goto statements on the error paths. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 68bb10101e6b ("openvswitch: Fix flow lookup to use unmasked key") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230201210218.361970-1-pchelkin@ispras.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* / openvswitch: use skb_ip_totlen in conntrackXin Long2023-02-011-1/+1
|/ | | | | | | | | | IPv4 GSO packets may get processed in ovs_skb_network_trim(), and we need to use skb_ip_totlen() to get iph totlen. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: openvswitch: release vport resources on failureAaron Conole2022-12-211-2/+6
| | | | | | | | | | | | | | | | A recent commit introducing upcall packet accounting failed to properly release the vport object when the per-cpu stats struct couldn't be allocated. This can cause dangling pointers to dp objects long after they've been released. Cc: wangchuanlei <wangchuanlei@inspur.com> Fixes: 1933ea365aa7 ("net: openvswitch: Add support to count upcall packets") Reported-by: syzbot+8f4e2dcfcb3209ac35f9@syzkaller.appspotmail.com Signed-off-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://lore.kernel.org/r/20221220212717.526780-1-aconole@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* openvswitch: Fix flow lookup to use unmasked keyEelco Chaudron2022-12-161-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit mentioned below causes the ovs_flow_tbl_lookup() function to be called with the masked key. However, it's supposed to be called with the unmasked key. This due to the fact that the datapath supports installing wider flows, and OVS relies on this behavior. For example if ipv4(src=1.1.1.1/192.0.0.0, dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/ 128.0.0.0) is allowed to be added. However, if we try to add a wildcard rule, the installation fails: $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \ ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2 $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \ ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2 ovs-vswitchd: updating flow table (File exists) The reason is that the key used to determine if the flow is already present in the system uses the original key ANDed with the mask. This results in the IP address not being part of the (miniflow) key, i.e., being substituted with an all-zero value. When doing the actual lookup, this results in the key wrongfully matching the first flow, and therefore the flow does not get installed. This change reverses the commit below, but rather than having the key on the stack, it's allocated. Fixes: 190aa3e77880 ("openvswitch: Fix Frame-size larger than 1024 bytes warning.") Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>