summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* net: remove bond_slave_has_mac_rcu()Jakub Kicinski2022-01-271-14/+0
| | | | | | | No caller since v3.16. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: remove linkmode_change_bit()Jakub Kicinski2022-01-271-5/+0
| | | | | | | | | No callers since v5.7, the initial use case seems pretty esoteric so removing this should not harm the completeness of the API. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: remove net_invalid_timestamp()Jakub Kicinski2022-01-271-5/+0
| | | | | | | No callers since v3.15. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* mii: remove mii_lpa_to_linkmode_lpa_sgmii()Jakub Kicinski2022-01-271-17/+0
| | | | | | | | The only caller of mii_lpa_to_linkmode_lpa_sgmii() disappeared in v5.10. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Namespaceify min_adv_mss sysctl knobxu xin2022-01-271-0/+1
| | | | | | | | | | Different netns has different requirement on the setting of min_adv_mss sysctl which the advertised MSS will be never lower than. Enable min_adv_mss to be configured per network namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: allocate tcp_death_row outside of struct netns_ipv4Eric Dumazet2022-01-261-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I forgot tcp had per netns tracking of timewait sockets, and their sysctl to change the limit. After 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()"), whole struct net can be freed before last tw socket is freed. We need to allocate a separate struct inet_timewait_death_row object per netns. tw_count becomes a refcount and gains associated debugging infrastructure. BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46 Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690 CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events pwq_unbound_release_workfn Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421 expire_timers kernel/time/timer.c:1466 [inline] __run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734 __run_timers kernel/time/timer.c:1715 [inline] run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638 RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328 Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206 RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001 RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246 R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0 wq_unregister_lockdep kernel/workqueue.c:3508 [inline] pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307 worker_thread+0x657/0x1110 kernel/workqueue.c:2454 kthread+0x2e9/0x3a0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> Allocated by task 3635: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:437 [inline] __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470 kasan_slab_alloc include/linux/kasan.h:260 [inline] slab_post_alloc_hook mm/slab.h:732 [inline] slab_alloc_node mm/slub.c:3230 [inline] slab_alloc mm/slub.c:3238 [inline] kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243 kmem_cache_zalloc include/linux/slab.h:705 [inline] net_alloc net/core/net_namespace.c:407 [inline] copy_net_ns+0x125/0x760 net/core/net_namespace.c:462 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226 ksys_unshare+0x445/0x920 kernel/fork.c:3048 __do_sys_unshare kernel/fork.c:3119 [inline] __se_sys_unshare kernel/fork.c:3117 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88807d5f9a80 which belongs to the cache net_namespace of size 6528 The buggy address is located 1216 bytes inside of 6528-byte region [ffff88807d5f9a80, ffff88807d5fb400) The buggy address belongs to the page: page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8 head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0 memcg:ffff888070023001 flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000 raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950 prep_new_page mm/page_alloc.c:2434 [inline] get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389 alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271 alloc_slab_page mm/slub.c:1799 [inline] allocate_slab mm/slub.c:1944 [inline] new_slab+0x28a/0x3b0 mm/slub.c:2004 ___slab_alloc+0x87c/0xe90 mm/slub.c:3018 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105 slab_alloc_node mm/slub.c:3196 [inline] slab_alloc mm/slub.c:3238 [inline] kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243 kmem_cache_zalloc include/linux/slab.h:705 [inline] net_alloc net/core/net_namespace.c:407 [inline] copy_net_ns+0x125/0x760 net/core/net_namespace.c:462 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226 ksys_unshare+0x445/0x920 kernel/fork.c:3048 __do_sys_unshare kernel/fork.c:3119 [inline] __se_sys_unshare kernel/fork.c:3117 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1352 [inline] free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404 free_unref_page_prepare mm/page_alloc.c:3325 [inline] free_unref_page+0x19/0x690 mm/page_alloc.c:3404 skb_free_head net/core/skbuff.c:655 [inline] skb_release_data+0x65d/0x790 net/core/skbuff.c:677 skb_release_all net/core/skbuff.c:742 [inline] __kfree_skb net/core/skbuff.c:756 [inline] consume_skb net/core/skbuff.c:914 [inline] consume_skb+0xc2/0x160 net/core/skbuff.c:908 skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325 netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998 sock_recvmsg_nosec net/socket.c:948 [inline] sock_recvmsg net/socket.c:966 [inline] sock_recvmsg net/socket.c:962 [inline] ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632 ___sys_recvmsg+0x127/0x200 net/socket.c:2674 __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Memory state around the buggy address: ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reported-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: stmmac/xpcs: convert to pcs_validate()Russell King (Oracle)2022-01-261-2/+0
| | | | | | | | | | | stmmac explicitly calls the xpcs driver to validate the ethtool linkmodes. This is no longer necessary as phylink now supports validation through a PCS method. Convert both drivers to use this new mechanism. Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> # Intel EHL Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: xpcs: add support for retrieving supported interface modesRussell King (Oracle)2022-01-261-0/+1
| | | | | | | | | | | | Add a function to the xpcs driver to retrieve the supported PHY interface modes, which can be used by drivers to fill in phylink's supported_interfaces mask. We validate the interface bit index to ensure that it fits within the bitmap as xpcs lists PHY_INTERFACE_MODE_MAX in an entry. Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com> # Intel EHL Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4/tcp: do not use per netns ctl socketsEric Dumazet2022-01-251-1/+0
| | | | | | | | | | | | | | | | | | | | | | TCP ipv4 uses per-cpu/per-netns ctl sockets in order to send RST and some ACK packets (on behalf of TIMEWAIT sockets). This adds memory and cpu costs, which do not seem needed. Now typical servers have 256 or more cores, this adds considerable tax to netns users. tcp sockets are used from BH context, are not receiving packets, and do not store any persistent state but the 'struct net' pointer in order to be able to use IPv4 output functions. Note that I attempted a related change in the past, that had to be hot-fixed in commit bdbbb8527b6f ("ipv4: tcp: get rid of ugly unicast_sock") This patch could very well surface old bugs, on layers not taking care of sk->sk_kern_sock properly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: do not use per netns icmp socketsEric Dumazet2022-01-251-1/+0
| | | | | | | | | | | | | | | | | | | | | | | Back in linux-2.6.25 (commit 98c6d1b261e7 "[NETNS]: Make icmpv6_sk per namespace.", we added private per-cpu/per-netns ipv6 icmp sockets. This adds memory and cpu costs, which do not seem needed. Now typical servers have 256 or more cores, this adds considerable tax to netns users. icmp sockets are used from BH context, are not receiving packets, and do not store any persistent state but the 'struct net' pointer. icmpv6_xmit_lock() already makes sure to lock the chosen per-cpu socket. This patch has a considerable impact on the number of netns that the worker thread in cleanup_net() can dismantle per second, because ip6mr_sk_done() is no longer called, meaning we no longer acquire the rtnl mutex, competing with other threads adding new netns. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: do not use per netns icmp socketsEric Dumazet2022-01-251-1/+0
| | | | | | | | | | | | | | | | | | Back in linux-2.6.25 (commit 4a6ad7a141cb "[NETNS]: Make icmp_sk per namespace."), we added private per-cpu/per-netns ipv4 icmp sockets. This adds memory and cpu costs, which do not seem needed. Now typical servers have 256 or more cores, this adds considerable tax to netns users. icmp sockets are used from BH context, are not receiving packets, and do not store any persistent state but the 'struct net' pointer. icmp_xmit_lock() already makes sure to lock the chosen per-cpu socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp/dccp: get rid of inet_twsk_purge()Eric Dumazet2022-01-251-2/+0
| | | | | | | | | | | Prior patches in the series made sure tw_timer_handler() can be fired after netns has been dismantled/freed. We no longer have to scan a potentially big TCP ehash table at netns dismantle. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp/dccp: no longer use twsk_net(tw) from tw_timer_handler()Eric Dumazet2022-01-251-3/+2
| | | | | | | | | | | | | | We will soon get rid of inet_twsk_purge(). This means that tw_timer_handler() might fire after a netns has been dismantled/freed. Instead of adding a function (and data structure) to find a netns from tw->tw_net_cookie, just update the SNMP counters a bit earlier, when the netns is known to be alive. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp/dccp: add tw->tw_bslotEric Dumazet2022-01-251-0/+1
| | | | | | | | | | | We want to allow inet_twsk_kill() working even if netns has been dismantled/freed, to get rid of inet_twsk_purge(). This patch adds tw->tw_bslot to cache the bind bucket slot so that inet_twsk_kill() no longer needs to dereference twsk_net(tw) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski2022-01-249-84/+293
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2022-01-24 We've added 80 non-merge commits during the last 14 day(s) which contain a total of 128 files changed, 4990 insertions(+), 895 deletions(-). The main changes are: 1) Add XDP multi-buffer support and implement it for the mvneta driver, from Lorenzo Bianconi, Eelco Chaudron and Toke Høiland-Jørgensen. 2) Add unstable conntrack lookup helpers for BPF by using the BPF kfunc infra, from Kumar Kartikeya Dwivedi. 3) Extend BPF cgroup programs to export custom ret value to userspace via two helpers bpf_get_retval() and bpf_set_retval(), from YiFei Zhu. 4) Add support for AF_UNIX iterator batching, from Kuniyuki Iwashima. 5) Complete missing UAPI BPF helper description and change bpf_doc.py script to enforce consistent & complete helper documentation, from Usama Arif. 6) Deprecate libbpf's legacy BPF map definitions and streamline XDP APIs to follow tc-based APIs, from Andrii Nakryiko. 7) Support BPF_PROG_QUERY for BPF programs attached to sockmap, from Di Zhu. 8) Deprecate libbpf's bpf_map__def() API and replace users with proper getters and setters, from Christy Lee. 9) Extend libbpf's btf__add_btf() with an additional hashmap for strings to reduce overhead, from Kui-Feng Lee. 10) Fix bpftool and libbpf error handling related to libbpf's hashmap__new() utility function, from Mauricio Vásquez. 11) Add support to BTF program names in bpftool's program dump, from Raman Shukhau. 12) Fix resolve_btfids build to pick up host flags, from Connor O'Brien. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (80 commits) selftests, bpf: Do not yet switch to new libbpf XDP APIs selftests, xsk: Fix rx_full stats test bpf: Fix flexible_array.cocci warnings xdp: disable XDP_REDIRECT for xdp frags bpf: selftests: add CPUMAP/DEVMAP selftests for xdp frags bpf: selftests: introduce bpf_xdp_{load,store}_bytes selftest net: xdp: introduce bpf_xdp_pointer utility routine bpf: generalise tail call map compatibility check libbpf: Add SEC name for xdp frags programs bpf: selftests: update xdp_adjust_tail selftest to include xdp frags bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature bpf: introduce frags support to bpf_prog_test_run_xdp() bpf: move user_size out of bpf_test_init bpf: add frags support to xdp copy helpers bpf: add frags support to the bpf_xdp_adjust_tail() API bpf: introduce bpf_xdp_get_buff_len helper net: mvneta: enable jumbo frames if the loaded XDP program support frags bpf: introduce BPF_F_XDP_HAS_FRAGS flag in prog_flags loading the ebpf program net: mvneta: add frags support to XDP_TX xdp: add frags support to xdp_return_{buff/frame} ... ==================== Link: https://lore.kernel.org/r/20220124221235.18993-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: xdp: introduce bpf_xdp_pointer utility routineLorenzo Bianconi2022-01-211-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to skb_header_pointer, introduce bpf_xdp_pointer utility routine to return a pointer to a given position in the xdp_buff if the requested area (offset + len) is contained in a contiguous memory area otherwise it will be copied in a bounce buffer provided by the caller. Similar to the tc counterpart, introduce the two following xdp helpers: - bpf_xdp_load_bytes - bpf_xdp_store_bytes Reviewed-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/ab285c1efdd5b7a9d361348b1e7d3ef49f6382b3.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: generalise tail call map compatibility checkToke Hoiland-Jorgensen2022-01-211-11/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The check for tail call map compatibility ensures that tail calls only happen between maps of the same type. To ensure backwards compatibility for XDP frags we need a similar type of check for cpumap and devmap programs, so move the state from bpf_array_aux into bpf_map, add xdp_has_frags to the check, and apply the same check to cpumap and devmap. Acked-by: John Fastabend <john.fastabend@gmail.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Toke Hoiland-Jorgensen <toke@redhat.com> Link: https://lore.kernel.org/r/f19fd97c0328a39927f3ad03e1ca6b43fd53cdfd.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: add frags support to the bpf_xdp_adjust_tail() APIEelco Chaudron2022-01-211-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change adds support for tail growing and shrinking for XDP frags. When called on a non-linear packet with a grow request, it will work on the last fragment of the packet. So the maximum grow size is the last fragments tailroom, i.e. no new buffer will be allocated. A XDP frags capable driver is expected to set frag_size in xdp_rxq_info data structure to notify the XDP core the fragment size. frag_size set to 0 is interpreted by the XDP core as tail growing is not allowed. Introduce __xdp_rxq_info_reg utility routine to initialize frag_size field. When shrinking, it will work from the last fragment, all the way down to the base buffer depending on the shrinking size. It's important to mention that once you shrink down the fragment(s) are freed, so you can not grow again to the original size. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/eabda3485dda4f2f158b477729337327e609461d.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: introduce bpf_xdp_get_buff_len helperLorenzo Bianconi2022-01-212-0/+21
| | | | | | | | | | | | | | | | | | | | | | Introduce bpf_xdp_get_buff_len helper in order to return the xdp buffer total size (linear and paged area) Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/aac9ac3504c84026cf66a3c71b7c5ae89bc991be.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: introduce BPF_F_XDP_HAS_FRAGS flag in prog_flags loading the ebpf programLorenzo Bianconi2022-01-212-0/+6
| | | | | | | | | | | | | | | | | | | | | | Introduce BPF_F_XDP_HAS_FRAGS and the related field in bpf_prog_aux in order to notify the driver the loaded program support xdp frags. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/db2e8075b7032a356003f407d1b0deb99adaa0ed.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * xdp: add frags support to xdp_return_{buff/frame}Lorenzo Bianconi2022-01-211-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | Take into account if the received xdp_buff/xdp_frame is non-linear recycling/returning the frame memory to the allocator or into xdp_frame_bulk. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/a961069febc868508ce1bdf5e53a343eb4e57cb2.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * net: xdp: add xdp_update_skb_shared_info utility routineLorenzo Bianconi2022-01-211-1/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce xdp_update_skb_shared_info routine to update frags array metadata in skb_shared_info data structure converting to a skb from a xdp_buff or xdp_frame. According to the current skb_shared_info architecture in xdp_frame/xdp_buff and to the xdp frags support, there is no need to run skb_add_rx_frag() and reset frags array converting the buffer to a skb since the frag array will be in the same position for xdp_buff/xdp_frame and for the skb, we just need to update memory metadata. Introduce XDP_FLAGS_PF_MEMALLOC flag in xdp_buff_flags in order to mark the xdp_buff or xdp_frame as under memory-pressure if pages of the frags array are under memory pressure. Doing so we can avoid looping over all fragments in xdp_update_skb_shared_info routine. The driver is expected to set the flag constructing the xdp_buffer using xdp_buff_set_frag_pfmemalloc utility routine. Rely on xdp_update_skb_shared_info in __xdp_build_skb_from_frame routine converting the non-linear xdp_frame to a skb after performing a XDP_REDIRECT. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/bfd23fb8a8d7438724f7819c567cdf99ffd6226f.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * xdp: introduce flags field in xdp_buff/xdp_frameLorenzo Bianconi2022-01-211-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce flags field in xdp_frame and xdp_buffer data structures to define additional buffer features. At the moment the only supported buffer feature is frags bit (XDP_FLAGS_HAS_FRAGS). frags bit is used to specify if this is a linear buffer (XDP_FLAGS_HAS_FRAGS not set) or a frags frame (XDP_FLAGS_HAS_FRAGS set). In the latter case the driver is expected to initialize the skb_shared_info structure at the end of the first buffer to link together subsequent buffers belonging to the same frame. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/e389f14f3a162c0a5bc6a2e1aa8dd01a90be117d.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * net: skbuff: add size metadata to skb_shared_info for xdpLorenzo Bianconi2022-01-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce xdp_frags_size field in skb_shared_info data structure to store xdp_buff/xdp_frame frame paged size (xdp_frags_size will be used in xdp frags support). In order to not increase skb_shared_info size we will use a hole due to skb_shared_info alignment. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/8a849819a3e0a143d540f78a3a5add76e17e980d.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: support BPF_PROG_QUERY for progs attached to sockmapDi Zhu2022-01-201-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now there is no way to query whether BPF programs are attached to a sockmap or not. we can use the standard interface in libbpf to query, such as: bpf_prog_query(mapFd, BPF_SK_SKB_STREAM_PARSER, 0, NULL, ...); the mapFd is the fd of sockmap. Signed-off-by: Di Zhu <zhudi2@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20220119014005.1209-1-zhudi2@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Add cgroup helpers bpf_{get,set}_retval to get/set syscall return valueYiFei Zhu2022-01-192-4/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The helpers continue to use int for retval because all the hooks are int-returning rather than long-returning. The return value of bpf_set_retval is int for future-proofing, in case in the future there may be errors trying to set the retval. After the previous patch, if a program rejects a syscall by returning 0, an -EPERM will be generated no matter if the retval is already set to -err. This patch change it being forced only if retval is not -err. This is because we want to support, for example, invoking bpf_set_retval(-EINVAL) and return 0, and have the syscall return value be -EINVAL not -EPERM. For BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY, the prior behavior is that, if the return value is NET_XMIT_DROP, the packet is silently dropped. We preserve this behavior for backward compatibility reasons, so even if an errno is set, the errno does not return to caller. However, setting a non-err to retval cannot propagate so this is not allowed and we return a -EFAULT in that case. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/b4013fd5d16bed0b01977c1fafdeae12e1de61fb.1639619851.git.zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Move getsockopt retval to struct bpf_cg_run_ctxYiFei Zhu2022-01-192-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The retval value is moved to struct bpf_cg_run_ctx for ease of access in different prog types with different context structs layouts. The helper implementation (to be added in a later patch in the series) can simply perform a container_of from current->bpf_ctx to retrieve bpf_cg_run_ctx. Unfortunately, there is no easy way to access the current task_struct via the verifier BPF bytecode rewrite, aside from possibly calling a helper, so a pointer to current task is added to struct bpf_sockopt_kern so that the rewritten BPF bytecode can access struct bpf_cg_run_ctx with an indirection. For backward compatibility, if a getsockopt program rejects a syscall by returning 0, an -EPERM will be generated, by having the BPF_PROG_RUN_ARRAY_CG family macros automatically set the retval to -EPERM. Unlike prior to this patch, this -EPERM will be visible to ctx->retval for any other hooks down the line in the prog array. Additionally, the restriction that getsockopt filters can only set the retval to 0 is removed, considering that certain getsockopt implementations may return optlen. Filters are now able to set the value arbitrarily. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/73b0325f5c29912ccea7ea57ec1ed4d388fc1d37.1639619851.git.zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Make BPF_PROG_RUN_ARRAY return -err instead of allow booleanYiFei Zhu2022-01-191-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now BPF_PROG_RUN_ARRAY and related macros return 1 or 0 for whether the prog array allows or rejects whatever is being hooked. The caller of these macros then return -EPERM or continue processing based on thw macro's return value. Unforunately this is inflexible, since -EPERM is the only err that can be returned. This patch should be a no-op; it prepares for the next patch. The returning of the -EPERM is moved to inside the macros, so the outer functions are directly returning what the macros returned if they are non-zero. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Reviewed-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/788abcdca55886d1f43274c918eaa9f792a9f33b.1639619851.git.zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * uapi/bpf: Add missing description and returns for helper documentationUsama Arif2022-01-191-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | Both description and returns section will become mandatory for helpers and syscalls in a later commit to generate man pages. This commit also adds in the documentation that BPF_PROG_RUN is an alias for BPF_PROG_TEST_RUN for anyone searching for the syscall in the generated man pages. Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220119114442.1452088-1-usama.arif@bytedance.com
| * net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPFKumar Kartikeya Dwivedi2022-01-181-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change adds conntrack lookup helpers using the unstable kfunc call interface for the XDP and TC-BPF hooks. The primary usecase is implementing a synproxy in XDP, see Maxim's patchset [0]. Export get_net_ns_by_id as nf_conntrack_bpf.c needs to call it. This object is only built when CONFIG_DEBUG_INFO_BTF_MODULES is enabled. [0]: https://lore.kernel.org/bpf/20211019144655.3483197-1-maximmi@nvidia.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220114163953.1455836-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Add reference tracking support to kfuncKumar Kartikeya Dwivedi2022-01-181-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds verifier support for PTR_TO_BTF_ID return type of kfunc to be a reference, by reusing acquire_reference_state/release_reference support for existing in-kernel bpf helpers. We make use of the three kfunc types: - BTF_KFUNC_TYPE_ACQUIRE Return true if kfunc_btf_id is an acquire kfunc. This will acquire_reference_state for the returned PTR_TO_BTF_ID (this is the only allow return value). Note that acquire kfunc must always return a PTR_TO_BTF_ID{_OR_NULL}, otherwise the program is rejected. - BTF_KFUNC_TYPE_RELEASE Return true if kfunc_btf_id is a release kfunc. This will release the reference to the passed in PTR_TO_BTF_ID which has a reference state (from earlier acquire kfunc). The btf_check_func_arg_match returns the regno (of argument register, hence > 0) if the kfunc is a release kfunc, and a proper referenced PTR_TO_BTF_ID is being passed to it. This is similar to how helper call check uses bpf_call_arg_meta to store the ref_obj_id that is later used to release the reference. Similar to in-kernel helper, we only allow passing one referenced PTR_TO_BTF_ID as an argument. It can also be passed in to normal kfunc, but in case of release kfunc there must always be one PTR_TO_BTF_ID argument that is referenced. - BTF_KFUNC_TYPE_RET_NULL For kfunc returning PTR_TO_BTF_ID, tells if it can be NULL, hence force caller to mark the pointer not null (using check) before accessing it. Note that taking into account the case fixed by commit 93c230e3f5bd ("bpf: Enforce id generation for all may-be-null register type") we assign a non-zero id for mark_ptr_or_null_reg logic. Later, if more return types are supported by kfunc, which have a _OR_NULL variant, it might be better to move this id generation under a common reg_type_may_be_null check, similar to the case in the commit. Referenced PTR_TO_BTF_ID is currently only limited to kfunc, but can be extended in the future to other BPF helpers as well. For now, we can rely on the btf_struct_ids_match check to ensure we get the pointer to the expected struct type. In the future, care needs to be taken to avoid ambiguity for reference PTR_TO_BTF_ID passed to release function, in case multiple candidates can release same BTF ID. e.g. there might be two release kfuncs (or kfunc and helper): foo(struct abc *p); bar(struct abc *p); ... such that both release a PTR_TO_BTF_ID with btf_id of struct abc. In this case we would need to track the acquire function corresponding to the release function to avoid type confusion, and store this information in the register state so that an incorrect program can be rejected. This is not a problem right now, hence it is left as an exercise for the future patch introducing such a case in the kernel. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220114163953.1455836-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Introduce mem, size argument pair support for kfuncKumar Kartikeya Dwivedi2022-01-181-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BPF helpers can associate two adjacent arguments together to pass memory of certain size, using ARG_PTR_TO_MEM and ARG_CONST_SIZE arguments. Since we don't use bpf_func_proto for kfunc, we need to leverage BTF to implement similar support. The ARG_CONST_SIZE processing for helpers is refactored into a common check_mem_size_reg helper that is shared with kfunc as well. kfunc ptr_to_mem support follows logic similar to global functions, where verification is done as if pointer is not null, even when it may be null. This leads to a simple to follow rule for writing kfunc: always check the argument pointer for NULL, except when it is PTR_TO_CTX. Also, the PTR_TO_CTX case is also only safe when the helper expecting pointer to program ctx is not exposed to other programs where same struct is not ctx type. In that case, the type check will fall through to other cases and would permit passing other types of pointers, possibly NULL at runtime. Currently, we require the size argument to be suffixed with "__sz" in the parameter name. This information is then recorded in kernel BTF and verified during function argument checking. In the future we can use BTF tagging instead, and modify the kernel function definitions. This will be a purely kernel-side change. This allows us to have some form of backwards compatibility for structures that are passed in to the kernel function with their size, and allow variable length structures to be passed in if they are accompanied by a size parameter. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220114163953.1455836-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Remove check_kfunc_call callback and old kfunc BTF ID APIKumar Kartikeya Dwivedi2022-01-182-52/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Completely remove the old code for check_kfunc_call to help it work with modules, and also the callback itself. The previous commit adds infrastructure to register all sets and put them in vmlinux or module BTF, and concatenates all related sets organized by the hook and the type. Once populated, these sets remain immutable for the lifetime of the struct btf. Also, since we don't need the 'owner' module anywhere when doing check_kfunc_call, drop the 'btf_modp' module parameter from find_kfunc_desc_btf. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220114163953.1455836-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Populate kfunc BTF ID sets in struct btfKumar Kartikeya Dwivedi2022-01-182-6/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch prepares the kernel to support putting all kinds of kfunc BTF ID sets in the struct btf itself. The various kernel subsystems will make register_btf_kfunc_id_set call in the initcalls (for built-in code and modules). The 'hook' is one of the many program types, e.g. XDP and TC/SCHED_CLS, STRUCT_OPS, and 'types' are check (allowed or not), acquire, release, and ret_null (with PTR_TO_BTF_ID_OR_NULL return type). A maximum of BTF_KFUNC_SET_MAX_CNT (32) kfunc BTF IDs are permitted in a set of certain hook and type for vmlinux sets, since they are allocated on demand, and otherwise set as NULL. Module sets can only be registered once per hook and type, hence they are directly assigned. A new btf_kfunc_id_set_contains function is exposed for use in verifier, this new method is faster than the existing list searching method, and is also automatic. It also lets other code not care whether the set is unallocated or not. Note that module code can only do single register_btf_kfunc_id_set call per hook. This is why sorting is only done for in-kernel vmlinux sets, because there might be multiple sets for the same hook and type that must be concatenated, hence sorting them is required to ensure bsearch in btf_id_set_contains continues to work correctly. Next commit will update the kernel users to make use of this infrastructure. Finally, add __maybe_unused annotation for BTF ID macros for the !CONFIG_DEBUG_INFO_BTF case, so that they don't produce warnings during build time. The previous patch is also needed to provide synchronization against initialization for module BTF's kfunc_set_tab introduced here, as described below: The kfunc_set_tab pointer in struct btf is write-once (if we consider the registration phase (comprised of multiple register_btf_kfunc_id_set calls) as a single operation). In this sense, once it has been fully prepared, it isn't modified, only used for lookup (from the verifier context). For btf_vmlinux, it is initialized fully during the do_initcalls phase, which happens fairly early in the boot process, before any processes are present. This also eliminates the possibility of bpf_check being called at that point, thus relieving us of ensuring any synchronization between the registration and lookup function (btf_kfunc_id_set_contains). However, the case for module BTF is a bit tricky. The BTF is parsed, prepared, and published from the MODULE_STATE_COMING notifier callback. After this, the module initcalls are invoked, where our registration function will be called to populate the kfunc_set_tab for module BTF. At this point, BTF may be available to userspace while its corresponding module is still intializing. A BTF fd can then be passed to verifier using bpf syscall (e.g. for kfunc call insn). Hence, there is a race window where verifier may concurrently try to lookup the kfunc_set_tab. To prevent this race, we must ensure the operations are serialized, or waiting for the __init functions to complete. In the earlier registration API, this race was alleviated as verifier bpf_check_mod_kfunc_call didn't find the kfunc BTF ID until it was added by the registration function (called usually at the end of module __init function after all module resources have been initialized). If the verifier made the check_kfunc_call before kfunc BTF ID was added to the list, it would fail verification (saying call isn't allowed). The access to list was protected using a mutex. Now, it would still fail verification, but for a different reason (returning ENXIO due to the failed btf_try_get_module call in add_kfunc_call), because if the __init call is in progress the module will be in the middle of MODULE_STATE_COMING -> MODULE_STATE_LIVE transition, and the BTF_MODULE_LIVE flag for btf_module instance will not be set, so the btf_try_get_module call will fail. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220114163953.1455836-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | bonding: use rcu_dereference_rtnl when get bonding active slaveHangbin Liu2022-01-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | bond_option_active_slave_get_rcu() should not be used in rtnl_mutex as it use rcu_dereference(). Replace to rcu_dereference_rtnl() so we also can use this function in rtnl protected context. With this update, we can rmeove the rcu_read_lock/unlock in bonding .ndo_eth_ioctl and .get_ts_info. Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com> Fixes: 94dd016ae538 ("bond: pass get_ts_info and SIOC[SG]HWTSTAMP ioctl to active device") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: annotate accesses to fn->fn_sernumEric Dumazet2022-01-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct fib6_node's fn_sernum field can be read while other threads change it. Add READ_ONCE()/WRITE_ONCE() annotations. Do not change existing smp barriers in fib6_get_cookie_safe() and __fib6_update_sernum_upto_root() syzbot reported: BUG: KCSAN: data-race in fib6_clean_node / inet6_csk_route_socket write to 0xffff88813df62e2c of 4 bytes by task 1920 on cpu 1: fib6_clean_node+0xc2/0x260 net/ipv6/ip6_fib.c:2178 fib6_walk_continue+0x38e/0x430 net/ipv6/ip6_fib.c:2112 fib6_walk net/ipv6/ip6_fib.c:2160 [inline] fib6_clean_tree net/ipv6/ip6_fib.c:2240 [inline] __fib6_clean_all+0x1a9/0x2e0 net/ipv6/ip6_fib.c:2256 fib6_flush_trees+0x6c/0x80 net/ipv6/ip6_fib.c:2281 rt_genid_bump_ipv6 include/net/net_namespace.h:488 [inline] addrconf_dad_completed+0x57f/0x870 net/ipv6/addrconf.c:4230 addrconf_dad_work+0x908/0x1170 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 worker_thread+0x616/0xa70 kernel/workqueue.c:2454 kthread+0x1bf/0x1e0 kernel/kthread.c:359 ret_from_fork+0x1f/0x30 read to 0xffff88813df62e2c of 4 bytes by task 15701 on cpu 0: fib6_get_cookie_safe include/net/ip6_fib.h:285 [inline] rt6_get_cookie include/net/ip6_fib.h:306 [inline] ip6_dst_store include/net/ip6_route.h:234 [inline] inet6_csk_route_socket+0x352/0x3c0 net/ipv6/inet6_connection_sock.c:109 inet6_csk_xmit+0x91/0x1e0 net/ipv6/inet6_connection_sock.c:121 __tcp_transmit_skb+0x1323/0x1840 net/ipv4/tcp_output.c:1402 tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline] tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680 __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864 tcp_push+0x2d9/0x2f0 net/ipv4/tcp.c:725 mptcp_push_release net/mptcp/protocol.c:1491 [inline] __mptcp_push_pending+0x46c/0x490 net/mptcp/protocol.c:1578 mptcp_sendmsg+0x9ec/0xa50 net/mptcp/protocol.c:1764 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:643 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] kernel_sendmsg+0x97/0xd0 net/socket.c:745 sock_no_sendpage+0x84/0xb0 net/core/sock.c:3086 inet_sendpage+0x9d/0xc0 net/ipv4/af_inet.c:834 kernel_sendpage+0x187/0x200 net/socket.c:3492 sock_sendpage+0x5a/0x70 net/socket.c:1007 pipe_to_sendpage+0x128/0x160 fs/splice.c:364 splice_from_pipe_feed fs/splice.c:418 [inline] __splice_from_pipe+0x207/0x500 fs/splice.c:562 splice_from_pipe fs/splice.c:597 [inline] generic_splice_sendpage+0x94/0xd0 fs/splice.c:746 do_splice_from fs/splice.c:767 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:936 splice_direct_to_actor+0x345/0x650 fs/splice.c:891 do_splice_direct+0x106/0x190 fs/splice.c:979 do_sendfile+0x675/0xc40 fs/read_write.c:1245 __do_sys_sendfile64 fs/read_write.c:1310 [inline] __se_sys_sendfile64 fs/read_write.c:1296 [inline] __x64_sys_sendfile64+0x102/0x140 fs/read_write.c:1296 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000026f -> 0x00000271 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 15701 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 The Fixes tag I chose is probably arbitrary, I do not think we need to backport this patch to older kernels. Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220120174112.1126644-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | tcp: Add a stub for sk_defer_free_flush()Gal Pressman2022-01-201-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When compiling the kernel with CONFIG_INET disabled, the sk_defer_free_flush() should be defined as a nop. This resolves the following compilation error: ld: net/core/sock.o: in function `sk_defer_free_flush': ./include/net/tcp.h:1378: undefined reference to `__sk_defer_free_flush' Fixes: 79074a72d335 ("net: Flush deferred skb free on socket destroy") Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220120123440.9088-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | ethtool: Fix link extended state for big endianMoshe Tal2022-01-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The link extended sub-states are assigned as enum that is an integer size but read from a union as u8, this is working for small values on little endian systems but for big endian this always give 0. Fix the variable in the union to match the enum size. Fixes: ecc31c60240b ("ethtool: Add link extended state") Signed-off-by: Moshe Tal <moshet@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix information leakage in /proc/net/ptypeCongyu Liu2022-01-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In one net namespace, after creating a packet socket without binding it to a device, users in other net namespaces can observe the new `packet_type` added by this packet socket by reading `/proc/net/ptype` file. This is minor information leakage as packet socket is namespace aware. Add a net pointer in `packet_type` to keep the net namespace of of corresponding packet socket. In `ptype_seq_show`, this net pointer must be checked when it is not NULL. Fixes: 2feb27dbe00c ("[NETNS]: Minor information leak via /proc/net/ptype file.") Signed-off-by: Congyu Liu <liu3101@purdue.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag 'net-5.17-rc1' of ↵Linus Torvalds2022-01-206-8/+28
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, bpf. Quite a handful of old regression fixes but most of those are pre-5.16. Current release - regressions: - fix memory leaks in the skb free deferral scheme if upper layer protocols are used, i.e. in-kernel TCP readers like TLS Current release - new code bugs: - nf_tables: fix NULL check typo in _clone() functions - change the default to y for Vertexcom vendor Kconfig - a couple of fixes to incorrect uses of ref tracking - two fixes for constifying netdev->dev_addr Previous releases - regressions: - bpf: - various verifier fixes mainly around register offset handling when passed to helper functions - fix mount source displayed for bpffs (none -> bpffs) - bonding: - fix extraction of ports for connection hash calculation - fix bond_xmit_broadcast return value when some devices are down - phy: marvell: add Marvell specific PHY loopback - sch_api: don't skip qdisc attach on ingress, prevent ref leak - htb: restore minimal packet size handling in rate control - sfp: fix high power modules without diagnostic monitoring - mscc: ocelot: - don't let phylink re-enable TX PAUSE on the NPI port - don't dereference NULL pointers with shared tc filters - smsc95xx: correct reset handling for LAN9514 - cpsw: avoid alignment faults by taking NET_IP_ALIGN into account - phy: micrel: use kszphy_suspend/_resume for irq aware devices, avoid races with the interrupt Previous releases - always broken: - xdp: check prog type before updating BPF link - smc: resolve various races around abnormal connection termination - sit: allow encapsulated IPv6 traffic to be delivered locally - axienet: fix init/reset handling, add missing barriers, read the right status words, stop queues correctly - add missing dev_put() in sock_timestamping_bind_phc() Misc: - ipv4: prevent accidentally passing RTO_ONLINK to ip_route_output_key_hash() by sanitizing flags - ipv4: avoid quadratic behavior in netns dismantle - stmmac: dwmac-oxnas: add support for OX810SE - fsl: xgmac_mdio: add workaround for erratum A-009885" * tag 'net-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (92 commits) ipv4: add net_hash_mix() dispersion to fib_info_laddrhash keys ipv4: avoid quadratic behavior in netns dismantle net/fsl: xgmac_mdio: Fix incorrect iounmap when removing module powerpc/fsl/dts: Enable WA for erratum A-009885 on fman3l MDIO buses dt-bindings: net: Document fsl,erratum-a009885 net/fsl: xgmac_mdio: Add workaround for erratum A-009885 net: mscc: ocelot: fix using match before it is set net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices net: cpsw: avoid alignment faults by taking NET_IP_ALIGN into account nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind() net: axienet: increase default TX ring size to 128 net: axienet: fix for TX busy handling net: axienet: fix number of TX ring slots for available check net: axienet: Fix TX ring slot available check net: axienet: limit minimum TX ring size net: axienet: add missing memory barriers net: axienet: reset core on initialization prior to MDIO access net: axienet: Wait for PhyRstCmplt after core reset net: axienet: increase reset timeout bpf, selftests: Add ringbuf memory type confusion test ...
| * \ Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski2022-01-182-4/+9
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2022-01-19 We've added 12 non-merge commits during the last 8 day(s) which contain a total of 12 files changed, 262 insertions(+), 64 deletions(-). The main changes are: 1) Various verifier fixes mainly around register offset handling when passed to helper functions, from Daniel Borkmann. 2) Fix XDP BPF link handling to assert program type, from Toke Høiland-Jørgensen. 3) Fix regression in mount parameter handling for BPF fs, from Yafang Shao. 4) Fix incorrect integer literal when marking scratched stack slots in verifier, from Christy Lee. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, selftests: Add ringbuf memory type confusion test bpf, selftests: Add various ringbuf tests with invalid offset bpf: Fix ringbuf memory type confusion when passing to helpers bpf: Fix out of bounds access for ringbuf helpers bpf: Generally fix helper register offset check bpf: Mark PTR_TO_FUNC register initially with zero offset bpf: Generalize check_ctx_reg for reuse with other types bpf: Fix incorrect integer literal used for marking scratched stack. bpf/selftests: Add check for updating XDP bpf_link with wrong program type bpf/selftests: convert xdp_link test to ASSERT_* macros xdp: check prog type before updating BPF link bpf: Fix mount source show for bpffs ==================== Link: https://lore.kernel.org/r/20220119011825.9082-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | bpf: Fix ringbuf memory type confusion when passing to helpersDaniel Borkmann2022-01-191-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bpf_ringbuf_submit() and bpf_ringbuf_discard() have ARG_PTR_TO_ALLOC_MEM in their bpf_func_proto definition as their first argument, and thus both expect the result from a prior bpf_ringbuf_reserve() call which has a return type of RET_PTR_TO_ALLOC_MEM_OR_NULL. While the non-NULL memory from bpf_ringbuf_reserve() can be passed to other helpers, the two sinks (bpf_ringbuf_submit(), bpf_ringbuf_discard()) right now only enforce a register type of PTR_TO_MEM. This can lead to potential type confusion since it would allow other PTR_TO_MEM memory to be passed into the two sinks which did not come from bpf_ringbuf_reserve(). Add a new MEM_ALLOC composable type attribute for PTR_TO_MEM, and enforce that: - bpf_ringbuf_reserve() returns NULL or PTR_TO_MEM | MEM_ALLOC - bpf_ringbuf_submit() and bpf_ringbuf_discard() only take PTR_TO_MEM | MEM_ALLOC but not plain PTR_TO_MEM arguments via ARG_PTR_TO_ALLOC_MEM - however, other helpers might treat PTR_TO_MEM | MEM_ALLOC as plain PTR_TO_MEM to populate the memory area when they use ARG_PTR_TO_{UNINIT_,}MEM in their func proto description Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
| | * | bpf: Generalize check_ctx_reg for reuse with other typesDaniel Borkmann2022-01-191-2/+2
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | Generalize the check_ctx_reg() helper function into a more generic named one so that it can be reused for other register types as well to check whether their offset is non-zero. No functional change. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
| * | net_sched: restore "mpu xxx" handlingKevin Bracey2022-01-131-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 56b765b79e9a ("htb: improved accuracy at high rates") broke "overhead X", "linklayer atm" and "mpu X" attributes. "overhead X" and "linklayer atm" have already been fixed. This restores the "mpu X" handling, as might be used by DOCSIS or Ethernet shaping: tc class add ... htb rate X overhead 4 mpu 64 The code being fixed is used by htb, tbf and act_police. Cake has its own mpu handling. qdisc_calculate_pkt_len still uses the size table containing values adjusted for mpu by user space. iproute2 tc has always passed mpu into the kernel via a tc_ratespec structure, but the kernel never directly acted on it, merely stored it so that it could be read back by `tc class show`. Rather, tc would generate length-to-time tables that included the mpu (and linklayer) in their construction, and the kernel used those tables. Since v3.7, the tables were no longer used. Along with "mpu", this also broke "overhead" and "linklayer" which were fixed in 01cb71d2d47b ("net_sched: restore "overhead xxx" handling", v3.10) and 8a8e3d84b171 ("net_sched: restore "linklayer atm" handling", v3.11). "overhead" was fixed by simply restoring use of tc_ratespec::overhead - this had originally been used by the kernel but was initially omitted from the new non-table-based calculations. "linklayer" had been handled in the table like "mpu", but the mode was not originally passed in tc_ratespec. The new implementation was made to handle it by getting new versions of tc to pass the mode in an extended tc_ratespec, and for older versions of tc the table contents were analysed at load time to deduce linklayer. As "mpu" has always been given to the kernel in tc_ratespec, accompanying the mpu-based table, we can restore system functionality with no userspace change by making the kernel act on the tc_ratespec value. Fixes: 56b765b79e9a ("htb: improved accuracy at high rates") Signed-off-by: Kevin Bracey <kevin@bracey.fi> Cc: Eric Dumazet <edumazet@google.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Vimalkumar <j.vimal@gmail.com> Link: https://lore.kernel.org/r/20220112170210.1014351-1-kevin@bracey.fi Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | inet: frags: annotate races around fqdir->dead and fqdir->high_threshEric Dumazet2022-01-132-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | Both fields can be read/written without synchronization, add proper accessors and documentation. Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: sched: do not allocate a tracker in tcf_exts_init()Eric Dumazet2022-01-111-1/+3
| |/ | | | | | | | | | | | | | | | | | | | | While struct tcf_exts has a net pointer, it is not refcounted until tcf_exts_get_net() is called. Fixes: dbdcda634ce3 ("net: sched: add netns refcount tracker to struct tcf_exts") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220110094750.236478-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2022-01-2015-85/+145
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge more updates from Andrew Morton: "55 patches. Subsystems affected by this patch series: percpu, procfs, sysctl, misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2, hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (55 commits) lib: remove redundant assignment to variable ret ubsan: remove CONFIG_UBSAN_OBJECT_SIZE kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB btrfs: use generic Kconfig option for 256kB page size limit arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB configs: introduce debug.config for CI-like setup delayacct: track delays from memory compact Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact delayacct: cleanup flags in struct task_delay_info and functions use it delayacct: fix incomplete disable operation when switch enable to disable delayacct: support swapin delay accounting for swapping without blkio panic: remove oops_id panic: use error_report_end tracepoint on warnings fs/adfs: remove unneeded variable make code cleaner FAT: use io_schedule_timeout() instead of congestion_wait() hfsplus: use struct_group_attr() for memcpy() region nilfs2: remove redundant pointer sbufs fs/binfmt_elf: use PT_LOAD p_align values for static PIE const_structs.checkpatch: add frequently used ops structs ...
| * | delayacct: track delays from memory compactwangyong2022-01-202-1/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Delay accounting does not track the delay of memory compact. When there is not enough free memory, tasks can spend a amount of their time waiting for compact. To get the impact of tasks in direct memory compact, measure the delay when allocating memory through memory compact. Also update tools/accounting/getdelays.c: / # ./getdelays_next -di -p 304 print delayacct stats ON printing IO accounting PID 304 CPU count real total virtual total delay total delay average 277 780000000 849039485 18877296 0.068ms IO count delay total delay average 0 0 0ms SWAP count delay total delay average 0 0 0ms RECLAIM count delay total delay average 5 11088812685 2217ms THRASHING count delay total delay average 0 0 0ms COMPACT count delay total delay average 3 72758 0ms watch: read=0, write=0, cancelled_write=0 Link: https://lkml.kernel.org/r/1638619795-71451-1-git-send-email-wang.yong12@zte.com.cn Signed-off-by: wangyong <wang.yong12@zte.com.cn> Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn> Reviewed-by: Zhang Wenya <zhang.wenya1@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | delayacct: cleanup flags in struct task_delay_info and functions use itYang Yang2022-01-201-17/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Flags in struct task_delay_info is used to distinguish the difference between swapin and blkio delay acountings. But after patch "delayacct: support swapin delay accounting for swapping without blkio", there is no need to do that since swapin and blkio delay accounting use their own functions. Link: https://lkml.kernel.org/r/20211124065958.36703-1-yang.yang29@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | delayacct: fix incomplete disable operation when switch enable to disableYang Yang2022-01-201-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a task is created after delayacct is enabled, kernel will do all the delay accountings for that task. The problems is if user disables delayacct by set /proc/sys/kernel/task_delayacct to zero, only blkio delay accounting is disabled. Now disable all the kinds of delay accountings when /proc/sys/kernel/task_delayacct sets to zero. Link: https://lkml.kernel.org/r/20211123140342.32962-1-ran.xiaokai@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Reported-by: Zeal Robot <zealci@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>