summaryrefslogtreecommitdiffstats
path: root/net/ipv6
Commit message (Collapse)AuthorAgeFilesLines
* fib: rules: remove duplicated nla policiesFlorian Westphal2021-12-162-10/+0
| | | | | | | | | The attributes are identical in all implementations so move the ipv4 one into the core and remove the per-family nla policies. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* ipv6: use GFP_ATOMIC in rt6_probe()Eric Dumazet2021-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot reminded me that rt6_probe() can run from atomic contexts. stack backtrace: CPU: 1 PID: 7461 Comm: syz-executor.2 Not tainted 5.16.0-rc4-next-20211210-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_usage_bug kernel/locking/lockdep.c:203 [inline] valid_state kernel/locking/lockdep.c:3945 [inline] mark_lock_irq kernel/locking/lockdep.c:4148 [inline] mark_lock.cold+0x61/0x8e kernel/locking/lockdep.c:4605 mark_usage kernel/locking/lockdep.c:4500 [inline] __lock_acquire+0x11d5/0x54a0 kernel/locking/lockdep.c:4981 lock_acquire kernel/locking/lockdep.c:5639 [inline] lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604 __fs_reclaim_acquire mm/page_alloc.c:4550 [inline] fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4564 might_alloc include/linux/sched/mm.h:253 [inline] slab_pre_alloc_hook mm/slab.h:739 [inline] slab_alloc_node mm/slub.c:3145 [inline] slab_alloc mm/slub.c:3239 [inline] kmem_cache_alloc_trace+0x3b/0x2c0 mm/slub.c:3256 kmalloc include/linux/slab.h:581 [inline] kzalloc include/linux/slab.h:715 [inline] ref_tracker_alloc+0xe1/0x430 lib/ref_tracker.c:74 netdev_tracker_alloc include/linux/netdevice.h:3860 [inline] dev_hold_track include/linux/netdevice.h:3877 [inline] rt6_probe net/ipv6/route.c:661 [inline] find_match.part.0+0xac9/0xd00 net/ipv6/route.c:752 find_match net/ipv6/route.c:825 [inline] __find_rr_leaf+0x17f/0xd20 net/ipv6/route.c:826 find_rr_leaf net/ipv6/route.c:847 [inline] rt6_select net/ipv6/route.c:891 [inline] fib6_table_lookup+0x649/0xa20 net/ipv6/route.c:2185 ip6_pol_route+0x1c5/0x11e0 net/ipv6/route.c:2221 pol_lookup_func include/net/ip6_fib.h:580 [inline] fib6_rule_lookup+0x52a/0x6f0 net/ipv6/fib6_rules.c:120 ip6_route_output_flags_noref+0x2e2/0x380 net/ipv6/route.c:2629 ip6_route_output_flags+0x72/0x320 net/ipv6/route.c:2642 ip6_route_output include/net/ip6_route.h:98 [inline] ip6_dst_lookup_tail+0x5ab/0x1620 net/ipv6/ip6_output.c:1070 ip6_dst_lookup_flow+0x8c/0x1d0 net/ipv6/ip6_output.c:1200 geneve_get_v6_dst+0x46f/0x9a0 drivers/net/geneve.c:858 geneve6_xmit_skb drivers/net/geneve.c:991 [inline] geneve_xmit+0x520/0x3530 drivers/net/geneve.c:1074 __netdev_start_xmit include/linux/netdevice.h:4685 [inline] netdev_start_xmit include/linux/netdevice.h:4699 [inline] xmit_one net/core/dev.c:3473 [inline] dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3489 __dev_queue_xmit+0x2983/0x3640 net/core/dev.c:4112 neigh_resolve_output net/core/neighbour.c:1522 [inline] neigh_resolve_output+0x50e/0x820 net/core/neighbour.c:1502 neigh_output include/net/neighbour.h:541 [inline] ip6_finish_output2+0x56e/0x14f0 net/ipv6/ip6_output.c:126 __ip6_finish_output net/ipv6/ip6_output.c:191 [inline] __ip6_finish_output+0x61e/0xe80 net/ipv6/ip6_output.c:170 ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201 NF_HOOK_COND include/linux/netfilter.h:296 [inline] ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224 dst_output include/net/dst.h:451 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508 ndisc_send_rs+0x12e/0x6f0 net/ipv6/ndisc.c:702 addrconf_rs_timer+0x3f2/0x820 net/ipv6/addrconf.c:3898 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421 expire_timers kernel/time/timer.c:1466 [inline] __run_timers.part.0+0x675/0xa20 kernel/time/timer.c:1734 __run_timers kernel/time/timer.c:1715 [inline] run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097 </IRQ> Fixes: fb67510ba9bd ("ipv6: add net device refcount tracker to rt6_probe_deferred()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211214025806.3456382-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-12-091-0/+8
|\ | | | | | | | | | | No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * seg6: fix the iif in the IPv6 socket control blockAndrea Mayer2021-12-091-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an IPv4 packet is received, the ip_rcv_core(...) sets the receiving interface index into the IPv4 socket control block (v5.16-rc4, net/ipv4/ip_input.c line 510): IPCB(skb)->iif = skb->skb_iif; If that IPv4 packet is meant to be encapsulated in an outer IPv6+SRH header, the seg6_do_srh_encap(...) performs the required encapsulation. In this case, the seg6_do_srh_encap function clears the IPv6 socket control block (v5.16-rc4 net/ipv6/seg6_iptunnel.c line 163): memset(IP6CB(skb), 0, sizeof(*IP6CB(skb))); The memset(...) was introduced in commit ef489749aae5 ("ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation") a long time ago (2019-01-29). Since the IPv6 socket control block and the IPv4 socket control block share the same memory area (skb->cb), the receiving interface index info is lost (IP6CB(skb)->iif is set to zero). As a side effect, that condition triggers a NULL pointer dereference if commit 0857d6f8c759 ("ipv6: When forwarding count rx stats on the orig netdev") is applied. To fix that issue, we set the IP6CB(skb)->iif with the index of the receiving interface once again. Fixes: ef489749aae5 ("ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation") Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20211208195409.12169-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | xfrm: use net device refcount tracker helpersEric Dumazet2021-12-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfrm4_fill_dst() and xfrm6_fill_dst() build dst, getting a device reference that will likely be released by standard dst_release() code. We have to track these references or risk a warning if CONFIG_NET_DEV_REFCNT_TRACKER=y Note to XFRM maintainers : Error path in xfrm6_fill_dst() releases the reference, but does not clear xdst->u.dst.dev, so I wonder if this could lead to double dev_put() in some cases, where a dst_release() _is_ called by the callers in their error path. This extra dev_put() was added in commit 84c4a9dfbf430 ("xfrm6: release dev before returning error") Fixes: 9038c320001d ("net: dst: add net device refcount tracking to dst_entry") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <amwang@redhat.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Link: https://lore.kernel.org/r/20211207193203.2706158-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | inet: add net device refcount tracker to struct fib_nh_commonEric Dumazet2021-12-071-0/+2
| | | | | | | | | | Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ipmr, ip6mr: add net device refcount tracker to struct vif_deviceEric Dumazet2021-12-061-1/+2
| | | | | | | | | | Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ipv6: add net device refcount tracker to struct inet6_devEric Dumazet2021-12-062-3/+3
| | | | | | | | | | Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ipv6: add net device refcount tracker to struct ip6_tnlEric Dumazet2021-12-063-8/+8
| | | | | | | | | | Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | sit: add net device refcount tracking to ip_tunnelEric Dumazet2021-12-061-2/+2
| | | | | | | | | | | | | | | | Note that other ip_tunnel users do not seem to hold a reference on tunnel->dev. Probably needs some investigations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ipv6: add net device refcount tracker to rt6_probe_deferred()Eric Dumazet2021-12-061-2/+3
| | | | | | | | | | Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: dst: add net device refcount tracking to dst_entryEric Dumazet2021-12-061-2/+3
| | | | | | | | | | | | | | We want to track all dev_hold()/dev_put() to ease leak hunting. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-12-022-5/+5
|\| | | | | | | Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * gro: Fix inconsistent indentingJiapeng Chong2021-12-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | Eliminate the follow smatch warning: net/ipv6/ip6_offload.c:249 ipv6_gro_receive() warn: inconsistent indenting. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: fix memory leak in fib6_rule_suppressmsizanoen12021-11-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel leaks memory when a `fib` rule is present in IPv6 nftables firewall rules and a suppress_prefix rule is present in the IPv6 routing rules (used by certain tools such as wg-quick). In such scenarios, every incoming packet will leak an allocation in `ip6_dst_cache` slab cache. After some hours of `bpftrace`-ing and source code reading, I tracked down the issue to ca7a03c41753 ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule"). The problem with that change is that the generic `args->flags` always have `FIB_LOOKUP_NOREF` set[1][2] but the IPv6-specific flag `RT6_LOOKUP_F_DST_NOREF` might not be, leading to `fib6_rule_suppress` not decreasing the refcount when needed. How to reproduce: - Add the following nftables rule to a prerouting chain: meta nfproto ipv6 fib saddr . mark . iif oif missing drop This can be done with: sudo nft create table inet test sudo nft create chain inet test test_chain '{ type filter hook prerouting priority filter + 10; policy accept; }' sudo nft add rule inet test test_chain meta nfproto ipv6 fib saddr . mark . iif oif missing drop - Run: sudo ip -6 rule add table main suppress_prefixlength 0 - Watch `sudo slabtop -o | grep ip6_dst_cache` to see memory usage increase with every incoming ipv6 packet. This patch exposes the protocol-specific flags to the protocol specific `suppress` function, and check the protocol-specific `flags` argument for RT6_LOOKUP_F_DST_NOREF instead of the generic FIB_LOOKUP_NOREF when decreasing the refcount, like this. [1]: https://github.com/torvalds/linux/blob/ca7a03c4175366a92cee0ccc4fec0038c3266e26/net/ipv6/fib6_rules.c#L71 [2]: https://github.com/torvalds/linux/blob/ca7a03c4175366a92cee0ccc4fec0038c3266e26/net/ipv6/fib6_rules.c#L99 Link: https://bugzilla.kernel.org/show_bug.cgi?id=215105 Fixes: ca7a03c41753 ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule") Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ipv6: use the new fib6_nh_release_dsts helper in fib6_nh_releaseNikolay Aleksandrov2021-11-301-18/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | We can remove a bit of code duplication by reusing the new fib6_nh_release_dsts helper in fib6_nh_release. Their only difference is that fib6_nh_release's version doesn't use atomic operation to swap the pointers because it assumes the fib6_nh is no longer visible, while fib6_nh_release_dsts can be used anywhere. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-11-263-1/+21
|\| | | | | | | | | | | | | | | | | | | drivers/net/ipa/ipa_main.c 8afc7e471ad3 ("net: ipa: separate disabling setup from modem stop") 76b5fbcd6b47 ("net: ipa: kill ipa_modem_init()") Duplicated include, drop one. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: ipv6: add fib6_nh_release_dsts stubNikolay Aleksandrov2021-11-222-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We need a way to release a fib6_nh's per-cpu dsts when replacing nexthops otherwise we can end up with stale per-cpu dsts which hold net device references, so add a new IPv6 stub called fib6_nh_release_dsts. It must be used after an RCU grace period, so no new dsts can be created through a group's nexthop entry. Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed so it doesn't need a dummy stub when IPv6 is not enabled. Fixes: 7bf4796dd099 ("nexthops: add support for replace") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: fix typos in __ip6_finish_output()Eric Dumazet2021-11-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We deal with IPv6 packets, so we need to use IP6CB(skb)->flags and IP6SKB_REROUTED, instead of IPCB(skb)->flags and IPSKB_REROUTED Found by code inspection, please double check that fixing this bug does not surface other bugs. Fixes: 09ee9dba9611 ("ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tobias Brunner <tobias@strongswan.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Tested-by: Tobias Brunner <tobias@strongswan.org> Acked-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net-ipv6: changes to ->tclass (via IPV6_TCLASS) should sk_dst_reset()Maciej Żenczykowski2021-11-241-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is to match ipv4 behaviour, see __ip_sock_set_tos() implementation. Technically for ipv6 this might not be required because normally we do not allow tclass to influence routing, yet the cli tooling does support it: lpk11:~# ip -6 rule add pref 5 tos 45 lookup 5 lpk11:~# ip -6 rule 5: from all tos 0x45 lookup 5 and in general dscp/tclass based routing does make sense. We already have cases where dscp can affect vlan priority and/or transmit queue (especially on wifi). So let's just make things match. Easier to reason about and no harm. Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20211123223208.1117871-1-zenczykowski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net-ipv6: do not allow IPV6_TCLASS to muck with tcp's ECNMaciej Żenczykowski2021-11-241-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is to match ipv4 behaviour, see __ip_sock_set_tos() implementation at ipv4/ip_sockglue.c:579 void __ip_sock_set_tos(struct sock *sk, int val) { if (sk->sk_type == SOCK_STREAM) { val &= ~INET_ECN_MASK; val |= inet_sk(sk)->tos & INET_ECN_MASK; } if (inet_sk(sk)->tos != val) { inet_sk(sk)->tos = val; sk->sk_priority = rt_tos2priority(val); sk_dst_reset(sk); } } Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211123223154.1117794-1-zenczykowski@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | gro: remove rcu_read_lock/rcu_read_unlock from gro_complete handlersEric Dumazet2021-11-241-6/+2
| | | | | | | | | | | | | | | | | | | | All gro_complete() handlers are called from napi_gro_complete() while rcu_read_lock() has been called. There is no point stacking more rcu_read_lock() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | gro: remove rcu_read_lock/rcu_read_unlock from gro_receive handlersEric Dumazet2021-11-242-7/+1
| | | | | | | | | | | | | | | | | | | | All gro_receive() handlers are called from dev_gro_receive() while rcu_read_lock() has been called. There is no point stacking more rcu_read_lock() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ipv6: Use memset_after() to zero rt6_infoKees Cook2021-11-191-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memset(), avoid intentionally writing across neighboring fields. Use memset_after() to clear everything after the dst_entry member of struct rt6_info. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-11-181-0/+6
|\| | | | | | | Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * ipv6: check return value of ipv6_skip_exthdrJordy Zomer2021-11-181-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | The offset value is used in pointer math on skb->data. Since ipv6_skip_exthdr may return -1 the pointer to uh and th may not point to the actual udp and tcp headers and potentially overwrite other stuff. This is why I think this should be checked. EDIT: added {}'s, thanks Kees Signed-off-by: Jordy Zomer <jordy@pwning.systems> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: ah6: use swap() to make code cleanerYao Jing2021-11-181-4/+1
| | | | | | | | | | | | | | | | | | Use the macro 'swap()' defined in 'include/linux/minmax.h' to avoid opencoding it. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Yao Jing <yao.jing2@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4/raw: support binding to nonlocal addressesRiccardo Paolo Bestetti2021-11-171-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support to inet v4 raw sockets for binding to nonlocal addresses through the IP_FREEBIND and IP_TRANSPARENT socket options, as well as the ipv4.ip_nonlocal_bind kernel parameter. Add helper function to inet_sock.h to check for bind address validity on the base of the address type and whether nonlocal address are enabled for the socket via any of the sockopts/sysctl, deduplicating checks in ipv4/ping.c, ipv4/af_inet.c, ipv6/af_inet6.c (for mapped v4->v6 addresses), and ipv4/raw.c. Add test cases with IP[V6]_FREEBIND verifying that both v4 and v6 raw sockets support binding to nonlocal addresses after the change. Add necessary support for the test cases to nettest. Signed-off-by: Riccardo Paolo Bestetti <pbl@bestov.io> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20211117090010.125393-1-pbl@bestov.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: align static siphash keysEric Dumazet2021-11-162-2/+2
| | | | | | | | | | | | | | | | | | | | siphash keys use 16 bytes. Define siphash_aligned_key_t macro so that we can make sure they are not crossing a cache line boundary. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: drop nopreempt requirement on sock_prot_inuse_add()Eric Dumazet2021-11-161-4/+4
| | | | | | | | | | | | | | | | | | | | This is distracting really, let's make this simpler, because many callers had to take care of this by themselves, even if on x86 this adds more code than really needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: move gro definitions to include/net/gro.hEric Dumazet2021-11-163-0/+3
| | | | | | | | | | | | | | | | include/linux/netdevice.h became too big, move gro stuff into include/net/gro.h Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: defer skb freeing after socket lock is releasedEric Dumazet2021-11-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp recvmsg() (or rx zerocopy) spends a fair amount of time freeing skbs after their payload has been consumed. A typical ~64KB GRO packet has to release ~45 page references, eventually going to page allocator for each of them. Currently, this freeing is performed while socket lock is held, meaning that there is a high chance that BH handler has to queue incoming packets to tcp socket backlog. This can cause additional latencies, because the user thread has to process the backlog at release_sock() time, and while doing so, additional frames can be added by BH handler. This patch adds logic to defer these frees after socket lock is released, or directly from BH handler if possible. Being able to free these skbs from BH handler helps a lot, because this avoids the usual alloc/free assymetry, when BH handler and user thread do not run on same cpu or NUMA node. One cpu can now be fully utilized for the kernel->user copy, and another cpu is handling BH processing and skb/page allocs/frees (assuming RFS is not forcing use of a single CPU) Tested: 100Gbit NIC Max throughput for one TCP_STREAM flow, over 10 runs MTU : 1500 Before: 55 Gbit After: 66 Gbit MTU : 4096+(headers) Before: 82 Gbit After: 95 Gbit Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: add RETPOLINE mitigation to sk_backlog_rcvEric Dumazet2021-11-161-2/+3
| | | | | | | | | | | | | | | | Use INDIRECT_CALL_INET() to avoid an indirect call when/if CONFIG_RETPOLINE=y Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: remove sk_route_nocapsEric Dumazet2021-11-161-1/+1
| | | | | | | | | | | | | | | | | | Instead of using a full netdev_features_t, we can use a single bit, as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from sk->sk_route_cap. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: small optimization in tcp_v6_send_check()Eric Dumazet2021-11-161-3/+1
| | | | | | | | | | | | | | | | | | | | For TCP flows, inet6_sk(sk)->saddr has the same value than sk->sk_v6_rcv_saddr. Using sk->sk_v6_rcv_saddr increases data locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski2021-11-152-8/+8
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2021-11-15 We've added 72 non-merge commits during the last 13 day(s) which contain a total of 171 files changed, 2728 insertions(+), 1143 deletions(-). The main changes are: 1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to BTF such that BPF verifier will be able to detect misuse, from Yonghong Song. 2) Big batch of libbpf improvements including various fixes, future proofing APIs, and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko. 3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush. 4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to ensure exception handling still works, from Russell King and Alan Maguire. 5) Add a new bpf_find_vma() helper for tracing to map an address to the backing file such as shared library, from Song Liu. 6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump, updating documentation and bash-completion among others, from Quentin Monnet. 7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as the API is heavily tailored around perf and is non-generic, from Dave Marchevsky. 8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an opt-out for more relaxed BPF program requirements, from Stanislav Fomichev. 9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits) bpftool: Use libbpf_get_error() to check error bpftool: Fix mixed indentation in documentation bpftool: Update the lists of names for maps and prog-attach types bpftool: Fix indent in option lists in the documentation bpftool: Remove inclusion of utilities.mak from Makefiles bpftool: Fix memory leak in prog_dump() selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning selftests/bpf: Fix an unused-but-set-variable compiler warning bpf: Introduce btf_tracing_ids bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs bpftool: Enable libbpf's strict mode by default docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support selftests/bpf: Clarify llvm dependency with btf_tag selftest selftests/bpf: Add a C test for btf_type_tag selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests selftests/bpf: Test libbpf API function btf__add_type_tag() bpftool: Support BTF_KIND_TYPE_TAG libbpf: Support BTF_KIND_TYPE_TAG ... ==================== Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | bpf: Add ingress_ifindex to bpf_sk_lookupMark Pashmfouroush2021-11-102-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It may be helpful to have access to the ifindex during bpf socket lookup. An example may be to scope certain socket lookup logic to specific interfaces, i.e. an interface may be made exempt from custom lookup code. Add the ifindex of the arriving connection to the bpf_sk_lookup API. Signed-off-by: Mark Pashmfouroush <markpash@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211110111016.5670-2-markpash@cloudflare.com
* | | ipv6: Remove duplicate statementsluo penghao2021-11-141-1/+0
| |/ |/| | | | | | | | | | | | | This statement is repeated with the initialization statement Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: luo penghao <luo.penghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: remove useless assignment to newinet in tcp_v6_syn_recv_sock()Nghia Le2021-11-051-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | The newinet value is initialized with inet_sk() in a block code to handle sockets for the ETH_P_IP protocol. Along this code path, newinet is never read. Thus, assignment to newinet is needless and can be removed. Signed-off-by: Nghia Le <nghialm78@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211104143740.32446-1-nghialm78@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: udp6: replace __UDP_INC_STATS() with __UDP6_INC_STATS()Menglong Dong2021-11-031-3/+3
| | | | | | | | | | | | | | | | | | __UDP_INC_STATS() is used in udpv6_queue_rcv_one_skb() when encap_rcv() fails. __UDP6_INC_STATS() should be used here, so replace it with __UDP6_INC_STATS(). Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net:ipv6:Remove unneeded semicolonZhang Mingyu2021-11-031-1/+1
|/ | | | | | | | | Eliminate the following coccinelle check warning: net/ipv6/seg6.c:381:2-3 Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Zhang Mingyu <zhang.mingyu@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2021-11-011-1/+1
|\ | | | | | | | | | | Merge in the fixes we had queued in case there was another -rc. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * udp6: allow SO_MARK ctrl msg to affect routingJakub Kicinski2021-11-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c6af0c227a22 ("ip: support SO_MARK cmsg") added propagation of SO_MARK from cmsg to skb->mark. For IPv4 and raw sockets the mark also affects route lookup, but in case of IPv6 the flow info is initialized before cmsg is parsed. Fixes: c6af0c227a22 ("ip: support SO_MARK cmsg") Reported-and-tested-by: Xintong Hu <huxintong@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ndisc: introduce ndisc_evict_nocarrier sysctl parameterJames Prestwood2021-11-012-1/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In most situations the neighbor discovery cache should be cleared on a NOCARRIER event which is currently done unconditionally. But for wireless roams the neighbor discovery cache can and should remain intact since the underlying network has not changed. This patch introduces a sysctl option ndisc_evict_nocarrier which can be disabled by a wireless supplicant during a roam. This allows packets to be sent after a roam immediately without having to wait for neighbor discovery. A user reported roughly a 1 second delay after a roam before packets could be sent out (note, on IPv4). This delay was due to the ARP cache being cleared. During testing of this same scenario using IPv6 no delay was noticed, but regardless there is no reason to clear the ndisc cache for wireless roams. Signed-off-by: James Prestwood <prestwoj@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ipv6: enable net.ipv6.route.max_size sysctl in network namespaceAlexander Kuznetsov2021-10-281-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to increase route cache size in network namespace created with user namespace. Currently ipv6 route settings are disabled for non-initial network namespaces. We can allow this sysctl and it will be safe since commit <6126891c6d4f> because route cache account to kmem, that is why users from user namespace can not DOS system. Signed-off-by: Alexander Kuznetsov <wwfq@yandex-team.ru> Acked-by: Dmitry Yakunin <zeil@yandex-team.ru> Acked-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: do not clear skb->csum if already zeroEric Dumazet2021-10-281-1/+0
| | | | | | | | | | | | | | Freshly allocated skbs have their csum field cleared already. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: remove races in inet{6}_getname()Eric Dumazet2021-10-271-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot reported data-races in inet_getname() multiple times, it is time we fix this instead of pretending applications should not trigger them. getsockname() and getpeername() are not really considered fast path. v2: added the missing BPF_CGROUP_RUN_SA_PROG() declaration needed when CONFIG_CGROUP_BPF=n, as reported by kernel test robot <lkp@intel.com> syzbot typical report: BUG: KCSAN: data-race in __inet_hash_connect / inet_getname write to 0xffff888136d66cf8 of 2 bytes by task 14374 on cpu 1: __inet_hash_connect+0x7ec/0x950 net/ipv4/inet_hashtables.c:831 inet_hash_connect+0x85/0x90 net/ipv4/inet_hashtables.c:853 tcp_v4_connect+0x782/0xbb0 net/ipv4/tcp_ipv4.c:275 __inet_stream_connect+0x156/0x6e0 net/ipv4/af_inet.c:664 inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:728 __sys_connect_file net/socket.c:1896 [inline] __sys_connect+0x254/0x290 net/socket.c:1913 __do_sys_connect net/socket.c:1923 [inline] __se_sys_connect net/socket.c:1920 [inline] __x64_sys_connect+0x3d/0x50 net/socket.c:1920 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888136d66cf8 of 2 bytes by task 14408 on cpu 0: inet_getname+0x11f/0x170 net/ipv4/af_inet.c:790 __sys_getsockname+0x11d/0x1b0 net/socket.c:1946 __do_sys_getsockname net/socket.c:1961 [inline] __se_sys_getsockname net/socket.c:1958 [inline] __x64_sys_getsockname+0x3e/0x50 net/socket.c:1958 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000 -> 0xdee0 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 5.15.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20211026213014.3026708-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ipv6/tcp: small drop monitor changesEric Dumazet2021-10-251-2/+2
| | | | | | | | | | | | | | | | | | Two kfree_skb() calls must be replaced by consume_skb() for skbs that are not technically dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ipv6: guard IPV6_MINHOPCOUNT with a static keyEric Dumazet2021-10-252-8/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | RFC 5082 IPV6_MINHOPCOUNT is rarely used on hosts. Add a static key to remove from TCP fast path useless code, and potential cache line miss to fetch tcp_inet6_sk(sk)->min_hopcount Note that once ip6_min_hopcount static key has been enabled, it stays enabled until next boot. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ipv6: annotate data races around np->min_hopcountEric Dumazet2021-10-252-3/+8
| | | | | | | | | | | | | | | | No report yet from KCSAN, yet worth documenting the races. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>