summaryrefslogtreecommitdiffstats
path: root/net/core
Commit message (Collapse)AuthorAgeFilesLines
...
* | | | | net: use indirect call wrappers for skb_copy_datagram_iter()Eric Dumazet2020-03-251-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP recvmsg() calls skb_copy_datagram_iter(), which calls an indirect function (cb pointing to simple_copy_to_iter()) for every MSS (fragment) present in the skb. CONFIG_RETPOLINE=y forces a very expensive operation that we can avoid thanks to indirect call wrappers. This patch gives a 13% increase of performance on a single flow, if the bottleneck is the thread reading the TCP socket. Fixes: 950fcaecd5cc ("datagram: consolidate datagram copy to iter helpers") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | devlink: Only pass packet trap group identifier in trap structureIdo Schimmel2020-03-231-4/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Packet trap groups are now explicitly registered by drivers and not implicitly registered when the packet traps are registered. Therefore, there is no need to encode entire group structure the trap is associated with inside the trap structure. Instead, only pass the group identifier. Refer to it as initial group identifier, as future patches will allow user space to move traps between groups. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | devlink: Stop reference counting packet trap groupsIdo Schimmel2020-03-231-96/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that drivers explicitly register their supported packet trap groups there is no for devlink to create them on-demand and destroy them when their reference count reaches zero. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | devlink: Add API to register packet trap groupsIdo Schimmel2020-03-231-0/+117
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, packet trap groups are implicitly registered by drivers upon packet trap registration. When the traps are registered, each is associated with a group and the group is created by devlink, if it does not exist already. This makes it difficult for drivers to pass additional attributes for the groups. Therefore, as a preparation for future patches that require passing additional group attributes, add an API to explicitly register / unregister these groups. Next patches will convert existing drivers to use this API. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net: Make skb_segment not to compute checksum if network controller supports ↵Yadu Kishore2020-03-231-8/+15
| |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | checksumming Problem: TCP checksum in the output path is not being offloaded during GSO in the following case: The network driver does not support scatter-gather but supports checksum offload with NETIF_F_HW_CSUM. Cause: skb_segment calls skb_copy_and_csum_bits if the network driver does not announce NETIF_F_SG. It does not check if the driver supports NETIF_F_HW_CSUM. So for devices which might want to offload checksum but do not support SG there is currently no way to do so if GSO is enabled. Solution: In skb_segment check if the network controller does checksum and if so call skb_copy_bits instead of skb_copy_and_csum_bits. Testing: Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled in the network driver. Observed the TCP checksum offload is not happening since the skbs received by the driver in the output path have skb->ip_summed set to CHECKSUM_NONE. With the patch ran iperf TCP traffic and observed that TCP checksum is being offloaded with skb->ip_summed set to CHECKSUM_PARTIAL. Also tested with the patch by disabling NETIF_F_HW_CSUM in the driver to cover the newly introduced if-else code path in skb_segment. Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | netfilter: revert introduction of egress hookDaniel Borkmann2020-03-181-22/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts the following commits: 8537f78647c0 ("netfilter: Introduce egress hook") 5418d3881e1f ("netfilter: Generalize ingress hook") b030f194aed2 ("netfilter: Rename ingress hook include file") >From the discussion in [0], the author's main motivation to add a hook in fast path is for an out of tree kernel module, which is a red flag to begin with. Other mentioned potential use cases like NAT{64,46} is on future extensions w/o concrete code in the tree yet. Revert as suggested [1] given the weak justification to add more hooks to critical fast-path. [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/ [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Miller <davem@davemloft.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Alexei Starovoitov <ast@kernel.org> Nacked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2020-03-171-5/+22
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey. 2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi. Follow up patch to clean up this new mode, from Dan Carpenter. 3) Add support for geneve tunnel options, from Xin Long. 4) Make sets built-in and remove modular infrastructure for sets, from Florian Westphal. 5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing. 6) Statify nft_pipapo_get, from Chen Wandun. 7) Use C99 flexible-array member, from Gustavo A. R. Silva. 8) More descriptive variable names for bitwise, from Jeremy Sowden. 9) Four patches to add tunnel device hardware offload to the flowtable infrastructure, from wenxu. 10) pipapo set supports for 8-bit grouping, from Stefano Brivio. 11) pipapo can switch between nibble and byte grouping, also from Stefano. 12) Add AVX2 vectorized version of pipapo, from Stefano Brivio. 13) Update pipapo to be use it for single ranges, from Stefano. 14) Add stateful expression support to elements via control plane, eg. counter per element. 15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal. 15) Add new egress hook, from Lukas Wunner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | netfilter: Introduce egress hookLukas Wunner2020-03-181-3/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key") introduced the ability to classify packets on ingress. Allow the same on egress. Position the hook immediately before a packet is handed to tc and then sent out on an interface, thereby mirroring the ingress order. This order allows marking packets in the netfilter egress hook and subsequently using the mark in tc. Another benefit of this order is consistency with a lot of existing documentation which says that egress tc is performed after netfilter hooks. Egress hooks already exist for the most common protocols, such as NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because they are executed earlier during packet processing. However for more exotic protocols, there is currently no provision to apply netfilter on egress. A common workaround is to enslave the interface to a bridge and use ebtables, or to resort to tc. But when the ingress hook was introduced, consensus was that users should be given the choice to use netfilter or tc, whichever tool suits their needs best: https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/ This hook is also useful for NAT46/NAT64, tunneling and filtering of locally generated af_packet traffic such as dhclient. There have also been occasional user requests for a netfilter egress hook in the past, e.g.: https://www.spinics.net/lists/netfilter/msg50038.html Performance measurements with pktgen surprisingly show a speedup rather than a slowdown with this commit: * Without this commit: Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags) 2920481pps 1401Mb/sec (1401830880bps) errors: 0 * With this commit: Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags) 2941410pps 1411Mb/sec (1411876800bps) errors: 0 * Without this commit + tc egress: Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags) 2562631pps 1230Mb/sec (1230062880bps) errors: 0 * With this commit + tc egress: Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags) 2659259pps 1276Mb/sec (1276444320bps) errors: 0 * With this commit + nft egress: Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags) 2413320pps 1158Mb/sec (1158393600bps) errors: 0 Tested on a bare-metal Core i7-3615QM, each measurement was performed three times to verify that the numbers are stable. Commands to perform a measurement: modprobe pktgen echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000 Commands for testing tc egress: tc qdisc add dev lo clsact tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32 Commands for testing nft egress: nft add table netdev t nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \} nft add rule netdev t co ip daddr 4.3.2.1/32 drop All testing was performed on the loopback interface to avoid distorting measurements by the packet handling in the low-level Ethernet driver. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | netfilter: Generalize ingress hookLukas Wunner2020-03-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for addition of a netfilter egress hook by generalizing the ingress hook introduced by commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). In particular, rename and refactor the ingress hook's static inlines such that they can be reused for an egress hook. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | netfilter: Rename ingress hook include fileLukas Wunner2020-03-181-1/+1
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for addition of a netfilter egress hook by renaming <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>. The egress hook also necessitates a refactoring of the include file, but that is done in a separate commit to ease reviewing. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* / | | net: ethtool: require drivers to set supported_coalesce_paramsJakub Kicinski2020-03-171-0/+4
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that all in-tree drivers have been updated we can make the supported_coalesce_params mandatory. To save debugging time in case some driver was missed (or is out of tree) add a warning when netdev is registered with set_coalesce but without supported_coalesce_params. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2020-03-132-44/+178
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2020-03-13 The following pull-request contains BPF updates for your *net-next* tree. We've added 86 non-merge commits during the last 12 day(s) which contain a total of 107 files changed, 5771 insertions(+), 1700 deletions(-). The main changes are: 1) Add modify_return attach type which allows to attach to a function via BPF trampoline and is run after the fentry and before the fexit programs and can pass a return code to the original caller, from KP Singh. 2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher objects to be visible in /proc/kallsyms so they can be annotated in stack traces, from Jiri Olsa. 3) Extend BPF sockmap to allow for UDP next to existing TCP support in order in order to enable this for BPF based socket dispatch, from Lorenz Bauer. 4) Introduce a new bpftool 'prog profile' command which attaches to existing BPF programs via fentry and fexit hooks and reads out hardware counters during that period, from Song Liu. Example usage: bpftool prog profile id 337 duration 3 cycles instructions llc_misses 4228 run_cnt 3403698 cycles (84.08%) 3525294 instructions # 1.04 insn per cycle (84.05%) 13 llc_misses # 3.69 LLC misses per million isns (83.50%) 5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition of a new bpf_link abstraction to keep in particular BPF tracing programs attached even when the applicaion owning them exits, from Andrii Nakryiko. 6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering and which returns the PID as seen by the init namespace, from Carlos Neira. 7) Refactor of RISC-V JIT code to move out common pieces and addition of a new RV32G BPF JIT compiler, from Luke Nelson. 8) Add gso_size context member to __sk_buff in order to be able to know whether a given skb is GSO or not, from Willem de Bruijn. 9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output implementation but can be called from tracepoint programs, from Eelco Chaudron. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | bpf: Add bpf_trampoline_ name prefix for DECLARE_BPF_DISPATCHERBjörn Töpel2020-03-131-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adding bpf_trampoline_ name prefix for DECLARE_BPF_DISPATCHER, so all the dispatchers have the common name prefix. And also a small '_' cleanup for bpf_dispatcher_nopfunc function name. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200312195610.346362-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | | bpf: Add bpf_xdp_output() helperEelco Chaudron2020-03-121-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce new helper that reuses existing xdp perf_event output implementation, but can be called from raw_tracepoint programs that receive 'struct xdp_buff *' as a tracepoint argument. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial
| * | | bpf: sockmap: Add UDP supportLorenz Bauer2020-03-091-4/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow adding hashed UDP sockets to sockmaps. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200309111243.6982-9-lmb@cloudflare.com
| * | | bpf: sockmap: Simplify sock_map_init_protoLorenz Bauer2020-03-091-15/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can take advantage of the fact that both callers of sock_map_init_proto are holding a RCU read lock, and have verified that psock is valid. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200309111243.6982-7-lmb@cloudflare.com
| * | | bpf: sockmap: Move generic sockmap hooks from BPF TCPLorenz Bauer2020-03-091-5/+101
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The init, close and unhash handlers from TCP sockmap are generic, and can be reused by UDP sockmap. Move the helpers into the sockmap code base and expose them. This requires tcp_bpf_get_proto and tcp_bpf_clone to be conditional on BPF_STREAM_PARSER. The moved functions are unmodified, except that sk_psock_unlink is renamed to sock_map_unlink to better match its behaviour. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200309111243.6982-6-lmb@cloudflare.com
| * | | bpf: tcp: Move assertions into tcp_bpf_get_protoLorenz Bauer2020-03-091-16/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to ensure that sk->sk_prot uses certain callbacks, so that code that directly calls e.g. tcp_sendmsg in certain corner cases works. To avoid spurious asserts, we must to do this only if sk_psock_update_proto has not yet been called. The same invariants apply for tcp_bpf_check_v6_needs_rebuild, so move the call as well. Doing so allows us to merge tcp_bpf_init and tcp_bpf_reinit. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200309111243.6982-4-lmb@cloudflare.com
| * | | bpf: sockmap: Only check ULP for TCP socketsLorenz Bauer2020-03-091-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sock map code checks that a socket does not have an active upper layer protocol before inserting it into the map. This requires casting via inet_csk, which isn't valid for UDP sockets. Guard checks for ULP by checking inet_sk(sk)->is_icsk first. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200309111243.6982-2-lmb@cloudflare.com
| * | | bpf: Add gso_size to __sk_buffWillem de Bruijn2020-03-031-14/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BPF programs may want to know whether an skb is gso. The canonical answer is skb_is_gso(skb), which tests that gso_size != 0. Expose this field in the same manner as gso_segs. That field itself is not a sufficient signal, as the comment in skb_shared_info makes clear: gso_segs may be zero, e.g., from dodgy sources. Also prepare net/bpf/test_run for upcoming BPF_PROG_TEST_RUN tests of the feature. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200303200503.226217-2-willemdebruijn.kernel@gmail.com
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-03-123-23/+62
|\ \ \ \ | | |/ / | |/| | | | | | | | | | | | | | Minor overlapping changes, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: memcg: late association of sock to memcgShakeel Butt2020-03-101-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a TCP socket is allocated in IRQ context or cloned from unassociated (i.e. not associated to a memcg) in IRQ context then it will remain unassociated for its whole life. Almost half of the TCPs created on the system are created in IRQ context, so, memory used by such sockets will not be accounted by the memcg. This issue is more widespread in cgroup v1 where network memory accounting is opt-in but it can happen in cgroup v2 if the source socket for the cloning was created in root memcg. To fix the issue, just do the association of the sockets at the accept() time in the process context and then force charge the memory buffer already used and reserved by the socket. Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | cgroup, netclassid: periodically release file_lock on classid updatingDmitry Yakunin2020-03-091-10/+37
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In our production environment we have faced with problem that updating classid in cgroup with heavy tasks cause long freeze of the file tables in this tasks. By heavy tasks we understand tasks with many threads and opened sockets (e.g. balancers). This freeze leads to an increase number of client timeouts. This patch implements following logic to fix this issue: аfter iterating 1000 file descriptors file table lock will be released thus providing a time gap for socket creation/deletion. Now update is non atomic and socket may be skipped using calls: dup2(oldfd, newfd); close(oldfd); But this case is not typical. Moreover before this patch skip is possible too by hiding socket fd in unix socket buffer. New sockets will be allocated with updated classid because cgroup state is updated before start of the file descriptors iteration. So in common cases this patch has no side effects. Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | devlink: validate length of region addr/lenJakub Kicinski2020-03-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DEVLINK_ATTR_REGION_CHUNK_ADDR and DEVLINK_ATTR_REGION_CHUNK_LEN lack entries in the netlink policy. Corresponding nla_get_u64()s may read beyond the end of the message. Fixes: 4e54795a27f5 ("devlink: Add support for region snapshot read command") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | devlink: validate length of param valuesJakub Kicinski2020-03-031-12/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DEVLINK_ATTR_PARAM_VALUE_DATA may have different types so it's not checked by the normal netlink policy. Make sure the attribute length is what we expect. Fixes: e3b7ca18ad7b ("devlink: Add param set command") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | flow_offload: Add flow_match_ct to get rule ct matchPaul Blakey2020-03-121-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | Add relevant getter for ct info dissector. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'ct-offload' of ↵David S. Miller2020-03-121-1/+2
|\ \ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
| * | | net: sched: Pass ingress block to tcf_classify_ingressPaul Blakey2020-02-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On ingress and cls_act qdiscs init, save the block on ingress mini_Qdisc and and pass it on to ingress classification, so it can be used for the looking up a specified chain index. Co-developed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * | | net: sched: Introduce ingress classification functionPaul Blakey2020-02-191-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TC multi chain configuration can cause offloaded tc chains to miss in hardware after jumping to some chain. In such cases the software should continue from the chain that missed in hardware, as the hardware may have manipulated the packet and updated some counters. Currently a single tcf classification function serves both ingress and egress. However, multi chain miss processing (get tc skb extension on hw miss, set tc skb extension on tc miss) should happen only on ingress. Refactor the code to use ingress classification function, and move setting the tc skb extension from general classification to it, as a prestep for supporting the hw miss scenario. Co-developed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
* | | | Revert "net: sched: make newly activated qdiscs visible"Julian Wiedmann2020-03-121-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 4cda75275f9f89f9485b0ca4d6950c95258a9bce from net-next. Brown bag time. Michal noticed that this change doesn't work at all when netif_set_real_num_tx_queues() gets called prior to an initial dev_activate(), as for instance igb does. Doing so dies with: [ 40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400 [ 40.586922] #PF: supervisor read access in kernel mode [ 40.592668] #PF: error_code(0x0000) - not-present page [ 40.598405] PGD 0 P4D 0 [ 40.601234] Oops: 0000 [#1] PREEMPT SMP PTI [ 40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G E 5.6.0-rc3-ethnl.50-default #1 [ 40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013 [ 40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90 [ 40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a [ 40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203 [ 40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000 [ 40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0 [ 40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208 [ 40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000 [ 40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000 [ 40.699750] FS: 00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000 [ 40.708782] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0 [ 40.723156] Call Trace: [ 40.725888] dev_qdisc_set_real_num_tx_queues+0x61/0x90 [ 40.731725] netif_set_real_num_tx_queues+0x94/0x1d0 [ 40.737286] __igb_open+0x19a/0x5d0 [igb] [ 40.741767] __dev_open+0xbb/0x150 [ 40.745567] __dev_change_flags+0x157/0x1a0 [ 40.750240] dev_change_flags+0x23/0x60 [...] Fixes: 4cda75275f9f ("net: sched: make newly activated qdiscs visible") Reported-by: Michal Kubecek <mkubecek@suse.cz> CC: Michal Kubecek <mkubecek@suse.cz> CC: Eric Dumazet <edumazet@google.com> CC: Jamal Hadi Salim <jhs@mojatatu.com> CC: Cong Wang <xiyou.wangcong@gmail.com> CC: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: sched: make newly activated qdiscs visibleJulian Wiedmann2020-03-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In their .attach callback, mq[prio] only add the qdiscs of the currently active TX queues to the device's qdisc hash list. If a user later increases the number of active TX queues, their qdiscs are not visible via eg. 'tc qdisc show'. Add a hook to netif_set_real_num_tx_queues() that walks all active TX queues and adds those which are missing to the hash list. CC: Eric Dumazet <edumazet@google.com> CC: Jamal Hadi Salim <jhs@mojatatu.com> CC: Cong Wang <xiyou.wangcong@gmail.com> CC: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | pktgen: Allow on loopback deviceLukas Wunner2020-03-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When pktgen is used to measure the performance of dev_queue_xmit() packet handling in the core, it is preferable to not hand down packets to a low-level Ethernet driver as it would distort the measurements. Allow using pktgen on the loopback device, thus constraining measurements to core code. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | devlink: Introduce devlink port flavour virtualParav Pandit2020-03-031-0/+2
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently mlx5 PCI PF and VF devlink devices register their ports as physical port in non-representors mode. Introduce a new port flavour as virtual so that virtual devices can register 'virtual' flavour to make it more clear to users. An example of one PCI PF and 2 PCI virtual functions, each having one devlink port. $ devlink port show pci/0000:06:00.0/1: type eth netdev ens2f0 flavour physical port 0 pci/0000:06:00.2/1: type eth netdev ens2f2 flavour virtual port 0 pci/0000:06:00.3/1: type eth netdev ens2f3 flavour virtual port 0 Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2020-02-293-15/+280
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-02-28 The following pull-request contains BPF updates for your *net-next* tree. We've added 41 non-merge commits during the last 7 day(s) which contain a total of 49 files changed, 1383 insertions(+), 499 deletions(-). The main changes are: 1) BPF and Real-Time nicely co-exist. 2) bpftool feature improvements. 3) retrieve bpf_sk_storage via INET_DIAG. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | bpf: INET_DIAG support in bpf_sk_storageMartin KaFai Lau2020-02-271-6/+277
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds INET_DIAG support to bpf_sk_storage. 1. Although this series adds bpf_sk_storage diag capability to inet sk, bpf_sk_storage is in general applicable to all fullsock. Hence, the bpf_sk_storage logic will operate on SK_DIAG_* nlattr. The caller will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as the argument. 2. The request will be like: INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in latter patch) SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32) SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32) ...... Considering there could have multiple bpf_sk_storages in a sk, instead of reusing INET_DIAG_INFO ("ss -i"), the user can select some specific bpf_sk_storage to dump by specifying an array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD. If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages of a sk. 3. The reply will be like: INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in latter patch) SK_DIAG_BPF_STORAGE (nla_nest) SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32) SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit) SK_DIAG_BPF_STORAGE (nla_nest) SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32) SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit) ...... 4. Unlike other INET_DIAG info of a sk which is pretty static, the size required to dump the bpf_sk_storage(s) of a sk is dynamic as the system adding more bpf_sk_storage_map. It is hard to set a static min_dump_alloc size. Hence, this series learns it at the runtime and adjust the cb->min_dump_alloc as it iterates all sk(s) of a system. The "unsigned int *res_diag_size" in bpf_sk_storage_diag_put() is for this purpose. The next patch will update the cb->min_dump_alloc as it iterates the sk(s). Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com
| * | | bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites.David Miller2020-02-242-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All of these cases are strictly of the form: preempt_disable(); BPF_PROG_RUN(...); preempt_enable(); Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN() with: migrate_disable(); BPF_PROG_RUN(...); migrate_enable(); On non RT enabled kernels this maps to preempt_disable/enable() and on RT enabled kernels this solely prevents migration, which is sufficient as there is no requirement to prevent reentrancy to any BPF program from a preempting task. The only requirement is that the program stays on the same CPU. Therefore, this is a trivially correct transformation. The seccomp loop does not need protection over the loop. It only needs protection per BPF filter program [ tglx: Converted to bpf_prog_run_pin_on_cpu() ] Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de
* | | | net: datagram: drop 'destructor' argument from several helpersPaolo Abeni2020-02-281-18/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only users for such argument are the UDP protocol and the UNIX socket family. We can safely reclaim the accounted memory directly from the UDP code and, after the previous patch, we can do scm stats accounting outside the datagram helpers. Overall this cleans up a bit some datagram-related helpers, and avoids an indirect call per packet in the UDP receive path. v1 -> v2: - call scm_stat_del() only when not peeking - Kirill - fix build issue with CONFIG_INET_ESPINTCP Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: core: Replace zero-length array with flexible-array memberGustavo A. R. Silva2020-02-283-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-02-272-15/+25
|\ \ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | The mptcp conflict was overlapping additions. The SMC conflict was an additional and removal happening at the same time. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: core: devlink.c: Use built-in RCU list checkingMadhuparna Bhowmik2020-02-261-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | list_for_each_entry_rcu() has built-in RCU and lock checking. Pass cond argument to list_for_each_entry_rcu() to silence false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled. The devlink->lock is held when devlink_dpipe_table_find() is called in non RCU read side section. Therefore, pass struct devlink to devlink_dpipe_table_find() for lockdep checking. Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: Fix Tx hash bound checkingAmritha Nambiar2020-02-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes the lower and upper bounds when there are multiple TCs and traffic is on the the same TC on the same device. The lower bound is represented by 'qoffset' and the upper limit for hash value is 'qcount + qoffset'. This gives a clean Rx to Tx queue mapping when there are multiple TCs, as the queue indices for upper TCs will be offset by 'qoffset'. v2: Fixed commit description based on comments. Fixes: 1b837d489e06 ("net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash") Fixes: eadec877ce9c ("net: Add support for subordinate traffic classes to netdev_pick_tx") Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: core: devlink.c: Hold devlink->lock from the beginning of ↵Madhuparna Bhowmik2020-02-231-7/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | devlink_dpipe_table_register() devlink_dpipe_table_find() should be called under either rcu_read_lock() or devlink->lock. devlink_dpipe_table_register() calls devlink_dpipe_table_find() without holding the lock and acquires it later. Therefore hold the devlink->lock from the beginning of devlink_dpipe_table_register(). Suggested-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: fix sysfs permssions when device changes network namespaceChristian Brauner2020-02-261-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we moved all the helpers in place and make use netdev_change_owner() to fixup the permissions when moving network devices between network namespaces. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net-sysfs: add queue_change_owner()Christian Brauner2020-02-261-0/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a function to change the owner of the queue entries for a network device when it is moved between network namespaces. Currently, when moving network devices between network namespaces the ownership of the corresponding queue sysfs entries are not changed. This leads to problems when tools try to operate on the corresponding sysfs files. Fix this. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net-sysfs: add netdev_change_owner()Christian Brauner2020-02-262-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a function to change the owner of a network device when it is moved between network namespaces. Currently, when moving network devices between network namespaces the ownership of the corresponding sysfs entries is not changed. This leads to problems when tools try to operate on the corresponding sysfs files. This leads to a bug whereby a network device that is created in a network namespaces owned by a user namespace will have its corresponding sysfs entry owned by the root user of the corresponding user namespace. If such a network device has to be moved back to the host network namespace the permissions will still be set to the user namespaces. This means unprivileged users can e.g. trigger uevents for such incorrectly owned devices. They can also modify the settings of the device itself. Both of these things are unwanted. For example, workloads will create network devices in the host network namespace. Other tools will then proceed to move such devices between network namespaces owner by other user namespaces. While the ownership of the device itself is updated in net/core/net-sysfs.c:dev_change_net_namespace() the corresponding sysfs entry for the device is not: drwxr-xr-x 5 nobody nobody 0 Jan 25 18:08 . drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 .. -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 power -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down drwxr-xr-x 4 nobody nobody 0 Jan 25 18:09 queues -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 statistics lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:08 subsystem -> ../../../../class/net -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent However, if a device is created directly in the network namespace then the device's sysfs permissions will be correctly updated: drwxr-xr-x 5 root root 0 Jan 25 18:12 . drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 .. -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_assign_type -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_len -r--r--r-- 1 root root 4096 Jan 25 18:12 address -r--r--r-- 1 root root 4096 Jan 25 18:12 broadcast -rw-r--r-- 1 root root 4096 Jan 25 18:12 carrier -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_changes -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_down_count -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_up_count -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_id -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_port -r--r--r-- 1 root root 4096 Jan 25 18:12 dormant -r--r--r-- 1 root root 4096 Jan 25 18:12 duplex -rw-r--r-- 1 root root 4096 Jan 25 18:12 flags -rw-r--r-- 1 root root 4096 Jan 25 18:12 gro_flush_timeout -rw-r--r-- 1 root root 4096 Jan 25 18:12 ifalias -r--r--r-- 1 root root 4096 Jan 25 18:12 ifindex -r--r--r-- 1 root root 4096 Jan 25 18:12 iflink -r--r--r-- 1 root root 4096 Jan 25 18:12 link_mode -rw-r--r-- 1 root root 4096 Jan 25 18:12 mtu -r--r--r-- 1 root root 4096 Jan 25 18:12 name_assign_type -rw-r--r-- 1 root root 4096 Jan 25 18:12 netdev_group -r--r--r-- 1 root root 4096 Jan 25 18:12 operstate -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_id -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_name -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_switch_id drwxr-xr-x 2 root root 0 Jan 25 18:12 power -rw-r--r-- 1 root root 4096 Jan 25 18:12 proto_down drwxr-xr-x 4 root root 0 Jan 25 18:12 queues -r--r--r-- 1 root root 4096 Jan 25 18:12 speed drwxr-xr-x 2 root root 0 Jan 25 18:12 statistics lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:12 subsystem -> ../../../../class/net -rw-r--r-- 1 root root 4096 Jan 25 18:12 tx_queue_len -r--r--r-- 1 root root 4096 Jan 25 18:12 type -rw-r--r-- 1 root root 4096 Jan 25 18:12 uevent Now, when creating a network device in a network namespace owned by a user namespace and moving it to the host the permissions will be set to the id that the user namespace root user has been mapped to on the host leading to all sorts of permission issues: 458752 drwxr-xr-x 5 458752 458752 0 Jan 25 18:12 . drwxr-xr-x 9 root root 0 Jan 25 18:08 .. -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_assign_type -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_len -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 address -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 broadcast -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_changes -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_down_count -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_up_count -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_id -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_port -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dormant -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 duplex -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 flags -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 gro_flush_timeout -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 ifalias -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 ifindex -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 iflink -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 link_mode -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 mtu -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 name_assign_type -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 netdev_group -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 operstate -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_id -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_name -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_switch_id drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 power -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 proto_down drwxr-xr-x 4 458752 458752 0 Jan 25 18:12 queues -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 speed drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 statistics lrwxrwxrwx 1 root root 0 Jan 25 18:12 subsystem -> ../../../../class/net -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 tx_queue_len -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 type -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 uevent Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | devlink: extend devlink_trap_report() to accept cookie and passJiri Pirko2020-02-251-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add cookie argument to devlink_trap_report() allowing driver to pass in the user cookie. Pass on the cookie down to drop monitor code. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | drop_monitor: extend by passing cookie from driverJiri Pirko2020-02-251-1/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If driver passed along the cookie, push it through Netlink. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | devlink: add trap metadata type for cookieJiri Pirko2020-02-251-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow driver to indicate cookie metadata for registered traps. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | flow_offload: pass action cookie through offload structuresJiri Pirko2020-02-251-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend struct flow_action_entry in order to hold TC action cookie specified by user inserting the action. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | devlink: add ACL generic packet trapsJiri Pirko2020-02-241-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add packet traps that can report packets that were dropped during ACL processing. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>