summaryrefslogtreecommitdiffstats
path: root/include/net
Commit message (Collapse)AuthorAgeFilesLines
* dccp/tcp: Fixup bhash2 bucket when connect() fails.Kuniyuki Iwashima2022-11-221-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a socket bound to a wildcard address fails to connect(), we only reset saddr and keep the port. Then, we have to fix up the bhash2 bucket; otherwise, the bucket has an inconsistent address in the list. Also, listen() for such a socket will fire the WARN_ON() in inet_csk_get_port(). [0] Note that when a system runs out of memory, we give up fixing the bucket and unlink sk from bhash and bhash2 by inet_put_port(). [0]: WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1)) Modules linked in: CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014 RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1)) Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202 RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000 RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280 FS: 00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205) inet_listen (net/ipv4/af_inet.c:228) __sys_listen (net/socket.c:1810) __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) RIP: 0033:0x7f8ac051de5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032 RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004 RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007 R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388 R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000 </TASK> Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Reported-by: syzbot <syzkaller@googlegroups.com> Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* dccp/tcp: Update saddr under bhash's lock.Kuniyuki Iwashima2022-11-221-1/+1
| | | | | | | | | | | | | | | | When we call connect() for a socket bound to a wildcard address, we update saddr locklessly. However, it could result in a data race; another thread iterating over bhash might see a corrupted address. Let's update saddr under the bhash bucket's lock. Fixes: 3df80d9320bc ("[DCCP]: Introduce DCCPv6") Fixes: 7c657876b63c ("[DCCP]: Initial implementation") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: neigh: decrement the family specific qlenThomas Zeitlhofer2022-11-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") introduced the length counter qlen in struct neigh_parms. There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and while the family specific qlen is incremented in pneigh_enqueue(), the mentioned commit decrements always the IPv4/ARP specific qlen, regardless of the currently processed family, in pneigh_queue_purge() and neigh_proxy_process(). As a result, with IPv6/ND, the family specific qlen is only incremented (and never decremented) until it exceeds PROXY_QLEN, and then, according to the check in pneigh_enqueue(), neighbor solicitations are not answered anymore. As an example, this is noted when using the subnet-router anycast address to access a Linux router. After a certain amount of time (in the observed case, qlen exceeded PROXY_QLEN after two days), the Linux router stops answering neighbor solicitations for its subnet-router anycast address and effectively becomes unreachable. Another result with IPv6/ND is that the IPv4/ARP specific qlen is decremented more often than incremented. This leads to negative qlen values, as a signed integer has been used for the length counter qlen, and potentially to an integer overflow. Fix this by introducing the helper function neigh_parms_qlen_dec(), which decrements the family specific qlen. Thereby, make use of the existing helper function neigh_get_dev_parms_rcu(), whose definition therefore needs to be placed earlier in neighbour.c. Take the family member from struct neigh_table to determine the currently processed family and appropriately call neigh_parms_qlen_dec() from pneigh_queue_purge() and neigh_proxy_process(). Additionally, use an unsigned integer for the length counter qlen. Fixes: 0ff4eb3d5ebb ("neighbour: make proxy_queue.qlen limit per-device") Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use struct_group to copy ip/ipv6 header addressesHangbin Liu2022-11-172-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kernel test robot reported warnings when build bonding module with make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/net/bonding/: from ../drivers/net/bonding/bond_main.c:35: In function ‘fortify_memcpy_chk’, inlined from ‘iph_to_flow_copy_v4addrs’ at ../include/net/ip.h:566:2, inlined from ‘bond_flow_ip’ at ../drivers/net/bonding/bond_main.c:3984:3: ../include/linux/fortify-string.h:413:25: warning: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of f ield (2nd parameter); maybe use struct_group()? [-Wattribute-warning] 413 | __read_overflow2_field(q_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In function ‘fortify_memcpy_chk’, inlined from ‘iph_to_flow_copy_v6addrs’ at ../include/net/ipv6.h:900:2, inlined from ‘bond_flow_ip’ at ../drivers/net/bonding/bond_main.c:3994:3: ../include/linux/fortify-string.h:413:25: warning: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of f ield (2nd parameter); maybe use struct_group()? [-Wattribute-warning] 413 | __read_overflow2_field(q_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This is because we try to copy the whole ip/ip6 address to the flow_key, while we only point the to ip/ip6 saddr. Note that since these are UAPI headers, __struct_group() is used to avoid the compiler warnings. Reported-by: kernel test robot <lkp@intel.com> Fixes: c3f8324188fa ("net: Add full IPv6 addresses to flow_keys") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20221115142400.1204786-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* l2tp: Serialize access to sk_user_data with sk_callback_lockJakub Sitnicki2022-11-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk->sk_user_data has multiple users, which are not compatible with each other. Writers must synchronize by grabbing the sk->sk_callback_lock. l2tp currently fails to grab the lock when modifying the underlying tunnel socket fields. Fix it by adding appropriate locking. We err on the side of safety and grab the sk_callback_lock also inside the sk_destruct callback overridden by l2tp, even though there should be no refs allowing access to the sock at the time when sk_destruct gets called. v4: - serialize write to sk_user_data in l2tp sk_destruct v3: - switch from sock lock to sk_callback_lock - document write-protection for sk_user_data v2: - update Fixes to point to origin of the bug - use real names in Reported/Tested-by tags Cc: Tom Parkin <tparkin@katalix.com> Fixes: 3557baabf280 ("[L2TP]: PPP over L2TP driver core") Reported-by: Haowei Yan <g1042620637@gmail.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netlink: introduce bigendian integer typesFlorian Westphal2022-11-011-9/+10
| | | | | | | | | | | | | | | | | | | | Jakub reported that the addition of the "network_byte_order" member in struct nla_policy increases size of 32bit platforms. Instead of scraping the bit from elsewhere Johannes suggested to add explicit NLA_BE types instead, so do this here. NLA_POLICY_MAX_BE() macro is removed again, there is no need for it: NLA_POLICY_MAX(NLA_BE.., ..) will do the right thing. NLA_BE64 can be added later. Fixes: 08724ef69907 ("netlink: introduce NLA_POLICY_MAX_BE") Reported-by: Jakub Kicinski <kuba@kernel.org> Suggested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20221031123407.9158-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: remove SOCK_SUPPORT_ZC from sockmapPavel Begunkov2022-10-281-0/+7
| | | | | | | | | | | sockmap replaces ->sk_prot with its own callbacks, we should remove SOCK_SUPPORT_ZC as the new proto doesn't support msghdr::ubuf_info. Cc: <stable@vger.kernel.org> # 6.0 Reported-by: Jakub Kicinski <kuba@kernel.org> Fixes: e993ffe3da4bc ("net: flag sockets supporting msghdr originated zerocopy") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* netlink: hide validation union fields from kdocJakub Kicinski2022-10-281-13/+18
| | | | | | | | | | | | | Mark the validation fields as private, users shouldn't set them directly and they are too complicated to explain in a more succinct way (there's already a long explanation in the comment above). The strict_start_type field is set directly and has a dedicated comment so move that above the "private" section. Link: https://lore.kernel.org/r/20221027212107.2639255-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* genetlink: piggy back on resv_op to default to a reject policyJakub Kicinski2022-10-241-1/+9
| | | | | | | | | | | | | | | | | To keep backward compatibility we used to leave attribute parsing to the family if no policy is specified. This becomes tedious as we move to more strict validation. Families must define reject all policies if they don't want any attributes accepted. Piggy back on the resv_start_op field as the switchover point. AFAICT only ethtool has added new commands since the resv_start_op was defined, and it has per-op policies so this should be a no-op. Nonetheless the patch should still go into v6.1 for consistency. Link: https://lore.kernel.org/all/20221019125745.3f2e7659@kernel.org/ Link: https://lore.kernel.org/r/20221021193532.1511293-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net-memcg: avoid stalls when under memory pressureJakub Kicinski2022-10-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As Shakeel explains the commit under Fixes had the unintended side-effect of no longer pre-loading the cached memory allowance. Even tho we previously dropped the first packet received when over memory limit - the consecutive ones would get thru by using the cache. The charging was happening in batches of 128kB, so we'd let in 128kB (truesize) worth of packets per one drop. After the change we no longer force charge, there will be no cache filling side effects. This causes significant drops and connection stalls for workloads which use a lot of page cache, since we can't reclaim page cache under GFP_NOWAIT. Some of the latency can be recovered by improving SACK reneg handling but nowhere near enough to get back to the pre-5.15 performance (the application I'm experimenting with still sees 5-10x worst latency). Apply the suggested workaround of using GFP_ATOMIC. We will now be more permissive than previously as we'll drop _no_ packets in softirq when under pressure. But I can't think of any good and simple way to address that within networking. Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/ Suggested-by: Shakeel Butt <shakeelb@google.com> Fixes: 4b1327be9fe5 ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()") Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge tag 'net-6.1-rc2' of ↵Linus Torvalds2022-10-202-9/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Current release - regressions: - revert "net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}" - revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()" - dsa: uninitialized variable in dsa_slave_netdevice_event() - eth: sunhme: uninitialized variable in happy_meal_init() Current release - new code bugs: - eth: octeontx2: fix resource not freed after malloc Previous releases - regressions: - sched: fix return value of qdisc ingress handling on success - sched: fix race condition in qdisc_graft() - udp: update reuse->has_conns under reuseport_lock. - tls: strp: make sure the TCP skbs do not have overlapping data - hsr: avoid possible NULL deref in skb_clone() - tipc: fix an information leak in tipc_topsrv_kern_subscr - phylink: add mac_managed_pm in phylink_config structure - eth: i40e: fix DMA mappings leak - eth: hyperv: fix a RX-path warning - eth: mtk: fix memory leaks Previous releases - always broken: - sched: cake: fix null pointer access issue when cake_init() fails" * tag 'net-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits) net: phy: dp83822: disable MDI crossover status change interrupt net: sched: fix race condition in qdisc_graft() net: hns: fix possible memory leak in hnae_ae_register() wwan_hwsim: fix possible memory leak in wwan_hwsim_dev_new() sfc: include vport_id in filter spec hash and equal() genetlink: fix kdoc warnings selftests: add selftest for chaining of tc ingress handling to egress net: Fix return value of qdisc ingress handling on success net: sched: sfb: fix null pointer access issue when sfb_init() fails Revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()" net: sched: cake: fix null pointer access issue when cake_init() fails ethernet: marvell: octeontx2 Fix resource not freed after malloc netfilter: nf_tables: relax NFTA_SET_ELEM_KEY_END set flags requirements netfilter: rpfilter/fib: Set ->flowic_uid correctly for user namespaces. ionic: catch NULL pointer issue on reconfig net: hsr: avoid possible NULL deref in skb_clone() bnxt_en: fix memory leak in bnxt_nvm_test() ip6mr: fix UAF issue in ip6mr_sk_done() when addrconf_init_net() failed udp: Update reuse->has_conns under reuseport_lock. net: ethernet: mediatek: ppe: Remove the unused function mtk_foe_entry_usable() ...
| * genetlink: fix kdoc warningsJakub Kicinski2022-10-191-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | Address a bunch of kdoc warnings: include/net/genetlink.h:81: warning: Function parameter or member 'module' not described in 'genl_family' include/net/genetlink.h:243: warning: expecting prototype for struct genl_info. Prototype was for struct genl_dumpit_info instead include/net/genetlink.h:419: warning: Function parameter or member 'net' not described in 'genlmsg_unicast' include/net/genetlink.h:438: warning: expecting prototype for gennlmsg_data(). Prototype was for genlmsg_data() instead include/net/genetlink.h:244: warning: Function parameter or member 'op' not described in 'genl_dumpit_info' Link: https://lore.kernel.org/r/20221018231310.1040482-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * udp: Update reuse->has_conns under reuseport_lock.Kuniyuki Iwashima2022-10-181-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we call connect() for a UDP socket in a reuseport group, we have to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel could select a unconnected socket wrongly for packets sent to the connected socket. However, the current way to set has_conns is illegal and possible to trigger that problem. reuseport_has_conns() changes has_conns under rcu_read_lock(), which upgrades the RCU reader to the updater. Then, it must do the update under the updater's lock, reuseport_lock, but it doesn't for now. For this reason, there is a race below where we fail to set has_conns resulting in the wrong socket selection. To avoid the race, let's split the reader and updater with proper locking. cpu1 cpu2 +----+ +----+ __ip[46]_datagram_connect() reuseport_grow() . . |- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size) | . | | |- rcu_read_lock() | |- reuse = rcu_dereference(sk->sk_reuseport_cb) | | | | | /* reuse->has_conns == 0 here */ | | |- more_reuse->has_conns = reuse->has_conns | |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */ | | | | | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, | | | more_reuse) | `- rcu_read_unlock() `- kfree_rcu(reuse, rcu) | |- sk->sk_state = TCP_ESTABLISHED Note the likely(reuse) in reuseport_has_conns_set() is always true, but we put the test there for ease of review. [0] For the record, usually, sk_reuseport_cb is changed under lock_sock(). The only exception is reuseport_grow() & TCP reqsk migration case. 1) shutdown() TCP listener, which is moved into the latter part of reuse->socks[] to migrate reqsk. 2) New listen() overflows reuse->socks[] and call reuseport_grow(). 3) reuse->max_socks overflows u16 with the new listener. 4) reuseport_grow() pops the old shutdown()ed listener from the array and update its sk->sk_reuseport_cb as NULL without lock_sock(). shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(), but, reuseport_has_conns_set() is called only for UDP under lock_sock(), so likely(reuse) never be false in reuseport_has_conns_set(). [0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/ Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | Merge tag 'random-6.1-rc1-for-linus' of ↵Linus Torvalds2022-10-163-3/+3
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull more random number generator updates from Jason Donenfeld: "This time with some large scale treewide cleanups. The intent of this pull is to clean up the way callers fetch random integers. The current rules for doing this right are: - If you want a secure or an insecure random u64, use get_random_u64() - If you want a secure or an insecure random u32, use get_random_u32() The old function prandom_u32() has been deprecated for a while now and is just a wrapper around get_random_u32(). Same for get_random_int(). - If you want a secure or an insecure random u16, use get_random_u16() - If you want a secure or an insecure random u8, use get_random_u8() - If you want secure or insecure random bytes, use get_random_bytes(). The old function prandom_bytes() has been deprecated for a while now and has long been a wrapper around get_random_bytes() - If you want a non-uniform random u32, u16, or u8 bounded by a certain open interval maximum, use prandom_u32_max() I say "non-uniform", because it doesn't do any rejection sampling or divisions. Hence, it stays within the prandom_*() namespace, not the get_random_*() namespace. I'm currently investigating a "uniform" function for 6.2. We'll see what comes of that. By applying these rules uniformly, we get several benefits: - By using prandom_u32_max() with an upper-bound that the compiler can prove at compile-time is ≤65536 or ≤256, internally get_random_u16() or get_random_u8() is used, which wastes fewer batched random bytes, and hence has higher throughput. - By using prandom_u32_max() instead of %, when the upper-bound is not a constant, division is still avoided, because prandom_u32_max() uses a faster multiplication-based trick instead. - By using get_random_u16() or get_random_u8() in cases where the return value is intended to indeed be a u16 or a u8, we waste fewer batched random bytes, and hence have higher throughput. This series was originally done by hand while I was on an airplane without Internet. Later, Kees and I worked on retroactively figuring out what could be done with Coccinelle and what had to be done manually, and then we split things up based on that. So while this touches a lot of files, the actual amount of code that's hand fiddled is comfortably small" * tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: remove unused functions treewide: use get_random_bytes() when possible treewide: use get_random_u32() when possible treewide: use get_random_{u8,u16}() when possible, part 2 treewide: use get_random_{u8,u16}() when possible, part 1 treewide: use prandom_u32_max() when possible, part 2 treewide: use prandom_u32_max() when possible, part 1
| * treewide: use get_random_u32() when possibleJason A. Donenfeld2022-10-113-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* | Merge tag 'net-6.1-rc1' of ↵Linus Torvalds2022-10-134-12/+12
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, and wifi. Current release - regressions: - Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs", it may cause crashes when the qdisc is reconfigured - inet: ping: fix splat due to packet allocation refactoring in inet - tcp: clean up kernel listener's reqsk in inet_twsk_purge(), fix UAF due to races when per-netns hash table is used Current release - new code bugs: - eth: adin1110: check in netdev_event that netdev belongs to driver - fixes for PTR_ERR() vs NULL bugs in driver code, from Dan and co. Previous releases - regressions: - ipv4: handle attempt to delete multipath route when fib_info contains an nh reference, avoid oob access - wifi: fix handful of bugs in the new Multi-BSSID code - wifi: mt76: fix rate reporting / throughput regression on mt7915 and newer, fix checksum offload - wifi: iwlwifi: mvm: fix double list_add at iwl_mvm_mac_wake_tx_queue (other cases) - wifi: mac80211: do not drop packets smaller than the LLC-SNAP header on fast-rx Previous releases - always broken: - ieee802154: don't warn zero-sized raw_sendmsg() - ipv6: ping: fix wrong checksum for large frames - mctp: prevent double key removal and unref - tcp/udp: fix memory leaks and races around IPV6_ADDRFORM - hv_netvsc: fix race between VF offering and VF association message Misc: - remove -Warray-bounds silencing in the drivers, compilers fixed" * tag 'net-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits) sunhme: fix an IS_ERR() vs NULL check in probe net: marvell: prestera: fix a couple NULL vs IS_ERR() checks kcm: avoid potential race in kcm_tx_work tcp: Clean up kernel listener's reqsk in inet_twsk_purge() net: phy: micrel: Fixes FIELD_GET assertion openvswitch: add nf_ct_is_confirmed check before assigning the helper tcp: Fix data races around icsk->icsk_af_ops. ipv6: Fix data races around sk->sk_prot. tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct(). udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM). tcp/udp: Fix memory leak in ipv6_renew_options(). mctp: prevent double key removal and unref selftests: netfilter: Fix nft_fib.sh for all.rp_filter=1 netfilter: rpfilter/fib: Populate flowic_l3mdev field selftests: netfilter: Test reverse path filtering net/mlx5: Make ASO poll CQ usable in atomic context tcp: cdg: allow tcp_cdg_release() to be called multiple times inet: ping: fix recent breakage ipv6: ping: fix wrong checksum for large frames net: ethernet: ti: am65-cpsw: set correct devlink flavour for unused ports ...
| * tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().Kuniyuki Iwashima2022-10-123-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally, inet6_sk(sk)->XXX were changed under lock_sock(), so we were able to clean them up by calling inet6_destroy_sock() during the IPv6 -> IPv4 conversion by IPV6_ADDRFORM. However, commit 03485f2adcde ("udpv6: Add lockless sendmsg() support") added a lockless memory allocation path, which could cause a memory leak: setsockopt(IPV6_ADDRFORM) sendmsg() +-----------------------+ +-------+ - do_ipv6_setsockopt(sk, ...) - udpv6_sendmsg(sk, ...) - sockopt_lock_sock(sk) ^._ called via udpv6_prot - lock_sock(sk) before WRITE_ONCE() - WRITE_ONCE(sk->sk_prot, &tcp_prot) - inet6_destroy_sock() - if (!corkreq) - sockopt_release_sock(sk) - ip6_make_skb(sk, ...) - release_sock(sk) ^._ lockless fast path for the non-corking case - __ip6_append_data(sk, ...) - ipv6_local_rxpmtu(sk, ...) - xchg(&np->rxpmtu, skb) ^._ rxpmtu is never freed. - goto out_no_dst; - lock_sock(sk) For now, rxpmtu is only the case, but not to miss the future change and a similar bug fixed in commit e27326009a3d ("net: ping6: Fix memleak in ipv6_renew_options()."), let's set a new function to IPv6 sk->sk_destruct() and call inet6_cleanup_sock() there. Since the conversion does not change sk->sk_destruct(), we can guarantee that we can clean up IPv6 resources finally. We can now remove all inet6_destroy_sock() calls from IPv6 protocol specific ->destroy() functions, but such changes are invasive to backport. So they can be posted as a follow-up later for net-next. Fixes: 03485f2adcde ("udpv6: Add lockless sendmsg() support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * udp: Call inet6_destroy_sock() in setsockopt(IPV6_ADDRFORM).Kuniyuki Iwashima2022-10-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support") forgot to add a change to free inet6_sk(sk)->rxpmtu while converting an IPv6 socket into IPv4 with IPV6_ADDRFORM. After conversion, sk_prot is changed to udp_prot and ->destroy() never cleans it up, resulting in a memory leak. This is due to the discrepancy between inet6_destroy_sock() and IPV6_ADDRFORM, so let's call inet6_destroy_sock() from IPV6_ADDRFORM to remove the difference. However, this is not enough for now because rxpmtu can be changed without lock_sock() after commit 03485f2adcde ("udpv6: Add lockless sendmsg() support"). We will fix this case in the following patch. Note we will rename inet6_destroy_sock() to inet6_cleanup_sock() and remove unnecessary inet6_destroy_sock() calls in sk_prot->destroy() in the future. Fixes: 4b340ae20d0e ("IPv6: Complete IPV6_DONTFRAG support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * net: ieee802154: return -EINVAL for unknown addr typeAlexander Aring2022-10-071-3/+9
| | | | | | | | | | | | | | | | | | | | This patch adds handling to return -EINVAL for an unknown addr type. The current behaviour is to return 0 as successful but the size of an unknown addr type is not defined and should return an error like -EINVAL. Fixes: 94160108a70c ("net/ieee802154: fix uninit value bug in dgram_sendmsg") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag '9p-for-6.1' of https://github.com/martinetd/linuxLinus Torvalds2022-10-102-0/+8
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull 9p updates from Dominique Martinet: "Smaller buffers for small messages and fixes. The highlight of this is Christian's patch to allocate smaller buffers for most metadata requests: 9p with a big msize would try to allocate large buffers when just 4 or 8k would be more than enough; this brings in nice performance improvements. There's also a few fixes for problems reported by syzkaller (thanks to Schspa Shi, Tetsuo Handa for tests and feedback/patches) as well as some minor cleanup" * tag '9p-for-6.1' of https://github.com/martinetd/linux: net/9p: clarify trans_fd parse_opt failure handling net/9p: add __init/__exit annotations to module init/exit funcs net/9p: use a dedicated spinlock for trans_fd 9p/trans_fd: always use O_NONBLOCK read/write net/9p: allocate appropriate reduced message buffers net/9p: add 'pooled_rbuffers' flag to struct p9_trans_module net/9p: add p9_msg_buf_size() 9p: add P9_ERRMAX for 9p2000 and 9p2000.u net/9p: split message size argument into 't_size' and 'r_size' pair 9p: trans_fd/p9_conn_cancel: drop client lock earlier
| * net/9p: add 'pooled_rbuffers' flag to struct p9_trans_moduleChristian Schoenebeck2022-10-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | This is a preparatory change for the subsequent patch: the RDMA transport pulls the buffers for its 9p response messages from a shared pool. [1] So this case has to be considered when choosing an appropriate response message size in the subsequent patch. Link: https://lore.kernel.org/all/Ys3jjg52EIyITPua@codewreck.org/ [1] Link: https://lkml.kernel.org/r/79d24310226bc4eb037892b5c097ec4ad4819a03.1657920926.git.linux_oss@crudebyte.com Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
| * 9p: add P9_ERRMAX for 9p2000 and 9p2000.uChristian Schoenebeck2022-10-051-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add P9_ERRMAX macro to 9P protocol header which reflects the maximum error string length of Rerror replies for 9p2000 and 9p2000.u protocol versions. Unfortunately a maximum error string length is not defined by the 9p2000 spec, picking 128 as value for now, as this seems to be a common max. size for POSIX error strings in practice. 9p2000.L protocol version uses Rlerror replies instead which does not contain an error string. Link: https://lkml.kernel.org/r/3f23191d21032e7c14852b1e1a4ae26417a36739.1657920926.git.linux_oss@crudebyte.com Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2022-10-032-2/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge in the left-over fixes before the net-next pull-request. Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c ae3ed15da588 ("net: ethernet: mtk_eth_soc: fix state in __mtk_foe_entry_clear") 9d8cb4c096ab ("net: ethernet: mtk_eth_soc: add foe_entry_size to mtk_eth_soc") https://lore.kernel.org/all/6cb6893b-4921-a068-4c30-1109795110bb@tessares.net/ kernel/bpf/helpers.c 8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF") 5679ff2f138f ("bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF") 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types") https://lore.kernel.org/all/20221003201957.13149-1-daniel@iogearbox.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * \ Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski2022-10-031-1/+1
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2022-10-03 We've added 10 non-merge commits during the last 23 day(s) which contain a total of 14 files changed, 130 insertions(+), 69 deletions(-). The main changes are: 1) Fix dynptr helper API to gate behind CAP_BPF given it was not intended for unprivileged BPF programs, from Kumar Kartikeya Dwivedi. 2) Fix need_wakeup flag inheritance from umem buffer pool for shared xsk sockets, from Jalal Mostafa. 3) Fix truncated last_member_type_id in btf_struct_resolve() which had a wrong storage type, from Lorenz Bauer. 4) Fix xsk back-pressure mechanism on tx when amount of produced descriptors to CQ is lower than what was grabbed from xsk tx ring, from Maciej Fijalkowski. 5) Fix wrong cgroup attach flags being displayed to effective progs, from Pu Lehui. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: xsk: Inherit need_wakeup flag for shared sockets bpf: Gate dynptr API behind CAP_BPF selftests/bpf: Adapt cgroup effective query uapi change bpftool: Fix wrong cgroup attach flags being assigned to effective progs bpf, cgroup: Reject prog_attach_flags array when effective query bpf: Ensure correct locking around vulnerable function find_vpid() bpf: btf: fix truncated last_member_type_id in btf_struct_resolve selftests/xsk: Add missing close() on netns fd xsk: Fix backpressure mechanism on Tx MAINTAINERS: Add include/linux/tnum.h to BPF CORE ==================== Link: https://lore.kernel.org/r/20221003201957.13149-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | xsk: Inherit need_wakeup flag for shared socketsJalal Mostafa2022-09-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flag for need_wakeup is not set for xsks with `XDP_SHARED_UMEM` flag and of different queue ids and/or devices. They should inherit the flag from the first socket buffer pool since no flags can be specified once `XDP_SHARED_UMEM` is specified. Fixes: b5aea28dca134 ("xsk: Add shared umem support between queue ids") Signed-off-by: Jalal Mostafa <jalal.a.mostapha@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220921135701.10199-1-jalal.a.mostapha@gmail.com
| * | | tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limitedNeal Cardwell2022-09-301-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit fixes a bug in the tracking of max_packets_out and is_cwnd_limited. This bug can cause the connection to fail to remember that is_cwnd_limited is true, causing the connection to fail to grow cwnd when it should, causing throughput to be lower than it should be. The following event sequence is an example that triggers the bug: (a) The connection is cwnd_limited, but packets_out is not at its peak due to TSO deferral deciding not to send another skb yet. In such cases the connection can advance max_packets_seq and set tp->is_cwnd_limited to true and max_packets_out to a small number. (b) Then later in the round trip the connection is pacing-limited (not cwnd-limited), and packets_out is larger. In such cases the connection would raise max_packets_out to a bigger number but (unexpectedly) flip tp->is_cwnd_limited from true to false. This commit fixes that bug. One straightforward fix would be to separately track (a) the next window after max_packets_out reaches a maximum, and (b) the next window after tp->is_cwnd_limited is set to true. But this would require consuming an extra u32 sequence number. Instead, to save space we track only the most important information. Specifically, we track the strongest available signal of the degree to which the cwnd is fully utilized: (1) If the connection is cwnd-limited then we remember that fact for the current window. (2) If the connection not cwnd-limited then we track the maximum number of outstanding packets in the current window. In particular, note that the new logic cannot trigger the buggy (a)/(b) sequence above because with the new logic a condition where tp->packets_out > tp->max_packets_out can only trigger an update of tp->is_cwnd_limited if tp->is_cwnd_limited is false. This first showed up in a testing of a BBRv2 dev branch, but this buggy behavior highlighted a general issue with the tcp_cwnd_validate() logic that can cause cwnd to fail to increase at the proper rate for any TCP congestion control, including Reno or CUBIC. Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski2022-10-031-1/+24
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2022-10-03 We've added 143 non-merge commits during the last 27 day(s) which contain a total of 151 files changed, 8321 insertions(+), 1402 deletions(-). The main changes are: 1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu. 2) Add support for struct-based arguments for trampoline based BPF programs, from Yonghong Song. 3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa. 4) Batch of improvements to veristat selftest tool in particular to add CSV output, a comparison mode for CSV outputs and filtering, from Andrii Nakryiko. 5) Add preparatory changes needed for the BPF core for upcoming BPF HID support, from Benjamin Tissoires. 6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program types, from Daniel Xu. 7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler. 8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer / single-kernel-consumer semantics for BPF ring buffer, from David Vernet. 9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF hashtab's bucket lock, from Hou Tao. 10) Allow creating an iterator that loops through only the resources of one task/thread instead of all, from Kui-Feng Lee. 11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi. 12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT entry which is not yet inserted, from Lorenzo Bianconi. 13) Remove invalid recursion check for struct_ops for TCP congestion control BPF programs, from Martin KaFai Lau. 14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu. 15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa. 16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen. 17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu. 18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta. 19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits) net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Documentation: bpf: Add implementation notes documentations to table of contents bpf, docs: Delete misformatted table. selftests/xsk: Fix double free bpftool: Fix error message of strerror libbpf: Fix overrun in netlink attribute iteration selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged" samples/bpf: Fix typo in xdp_router_ipv4 sample bpftool: Remove unused struct event_ring_info bpftool: Remove unused struct btf_attach_point bpf, docs: Add TOC and fix formatting. bpf, docs: Add Clang note about BPF_ALU bpf, docs: Move Clang notes to a separate file bpf, docs: Linux byteswap note bpf, docs: Move legacy packet instructions to a separate file selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline ... ==================== Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.cLorenzo Bianconi2022-10-031-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove circular dependency between nf_nat module and nf_conntrack one moving bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Fixes: 0fabd2aa199f ("net: netfilter: add bpf_ct_set_nat_info kfunc helper") Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Yauheni Kaliuta <ykaliuta@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/51a65513d2cda3eeb0754842e8025ab3966068d8.1664490511.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | | | bpf: Move nf_conn extern declarations to filter.hDaniel Xu2022-09-201-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're seeing the following new warnings on netdev/build_32bit and netdev/build_allmodconfig_warn CI jobs: ../net/core/filter.c:8608:1: warning: symbol 'nf_conn_btf_access_lock' was not declared. Should it be static? ../net/core/filter.c:8611:5: warning: symbol 'nfct_bsa' was not declared. Should it be static? Fix by ensuring extern declaration is present while compiling filter.o. Fixes: 864b656f82cc ("bpf: Add support for writing to nf_conn:mark") Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/2bd2e0283df36d8a4119605878edb1838d144174.1663683114.git.dxu@dxuuu.xyz Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
| * | | | bpf: Rename nfct_bsa to nfct_btf_struct_accessDaniel Xu2022-09-201-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The former name was a little hard to guess. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/73adc72385c8b162391fbfb404f0b6d4c5cc55d7.1663683114.git.dxu@dxuuu.xyz Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
| * | | | bpf: Remove unused btf_struct_access stubDaniel Xu2022-09-201-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This stub was not being used anywhere. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/590e7bd6172ffe0f3d7b51cd40e8ded941aaf7e8.1663683114.git.dxu@dxuuu.xyz Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
| * | | | bpf: Add support for writing to nf_conn:markDaniel Xu2022-09-101-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support direct writes to nf_conn:mark from TC and XDP prog types. This is useful when applications want to store per-connection metadata. This is also particularly useful for applications that run both bpf and iptables/nftables because the latter can trivially access this metadata. One example use case would be if a bpf prog is responsible for advanced packet classification and iptables/nftables is later used for routing due to pre-existing/legacy code. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/ebca06dea366e3e7e861c12f375a548cc4c61108.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | | | | net: Remove DECnet leftovers from flow.h.Guillaume Nault2022-10-031-26/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DECnet was removed by commit 1202cdd66531 ("Remove DECnet support from kernel"). Let's also revome its flow structure. Compile-tested only (allmodconfig). Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net: Add helper function to parse netlink msg of ip_tunnel_parmLiu Jian2022-10-031-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add ip_tunnel_netlink_parms to parse netlink msg of ip_tunnel_parm. Reduces duplicate code, no actual functional changes. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | net: Add helper function to parse netlink msg of ip_tunnel_encapLiu Jian2022-10-031-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add ip_tunnel_netlink_encap_parms to parse netlink msg of ip_tunnel_encap. Reduces duplicate code, no actual functional changes. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | Merge branch 'master' of ↵David S. Miller2022-10-033-8/+49
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== 1) Refactor selftests to use an array of structs in xfrm_fill_key(). From Gautam Menghani. 2) Drop an unused argument from xfrm_policy_match. From Hongbin Wang. 3) Support collect metadata mode for xfrm interfaces. From Eyal Birger. 4) Add netlink extack support to xfrm. From Sabrina Dubroca. Please note, there is a merge conflict in: include/net/dst_metadata.h between commit: 0a28bfd4971f ("net/macsec: Add MACsec skb_metadata_dst Tx Data path support") from the net-next tree and commit: 5182a5d48c3d ("net: allow storing xfrm interface metadata in metadata_dst") from the ipsec-next tree. Can be solved as done in linux-next. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | xfrm: ipcomp: add extack to ipcomp{4,6}_init_stateSabrina Dubroca2022-09-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | And the shared helper ipcomp_init_state. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | xfrm: pass extack down to xfrm_type ->init_stateSabrina Dubroca2022-09-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | xfrm: add extack support to xfrm_init_replaySabrina Dubroca2022-09-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | xfrm: add extack to __xfrm_init_stateSabrina Dubroca2022-09-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | xfrm: add extack support to xfrm_dev_state_addSabrina Dubroca2022-09-221-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | xfrm: lwtunnel: add lwtunnel support for xfrm interfaces in collect_md modeEyal Birger2022-08-291-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow specifying the xfrm interface if_id and link as part of a route metadata using the lwtunnel infrastructure. This allows for example using a single xfrm interface in collect_md mode as the target of multiple routes each specifying a different if_id. With the appropriate changes to iproute2, considering an xfrm device ipsec1 in collect_md mode one can for example add a route specifying an if_id like so: ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1 In which case traffic routed to the device via this route would use if_id in the xfrm interface policy lookup. Or in the context of vrf, one can also specify the "link" property: ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1 link_dev eth15 Note: LWT_XFRM_LINK uses NLA_U32 similar to IFLA_XFRM_LINK even though internally "link" is signed. This is consistent with other _LINK attributes in other devices as well as in bpf and should not have an effect as device indexes can't be negative. Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | xfrm: interface: support collect metadata modeEyal Birger2022-08-291-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds support for 'collect_md' mode on xfrm interfaces. Each net can have one collect_md device, created by providing the IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be altered and has no if_id or link device attributes. On transmit to this device, the if_id is fetched from the attached dst metadata on the skb. If exists, the link property is also fetched from the metadata. The dst metadata type used is METADATA_XFRM which holds these properties. On the receive side, xfrmi_rcv_cb() populates a dst metadata for each packet received and attaches it to the skb. The if_id used in this case is fetched from the xfrm state, and the link is fetched from the incoming device. This information can later be used by upper layers such as tc, ebpf, and ip rules. Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst metadata is postponed until after scrubing. Similarly, xfrm_input() is adapted to avoid dropping metadata dsts by only dropping 'valid' (skb_valid_dst(skb) == true) dsts. Policy matching on packets arriving from collect_md xfrmi devices is done by using the xfrm state existing in the skb's sec_path. The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session() is changed to keep the details of the if_id extraction tucked away in xfrm_interface.c. Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | net: allow storing xfrm interface metadata in metadata_dstEyal Birger2022-08-291-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFRM interfaces provide the association of various XFRM transformations to a netdevice using an 'if_id' identifier common to both the XFRM data structures (polcies, states) and the interface. The if_id is configured by the controlling entity (usually the IKE daemon) and can be used by the administrator to define logical relations between different connections. For example, different connections can share the if_id identifier so that they pass through the same interface, . However, currently it is not possible for connections using a different if_id to use the same interface while retaining the logical separation between them, without using additional criteria such as skb marks or different traffic selectors. When having a large number of connections, it is useful to have a the logical separation offered by the if_id identifier but use a single network interface. Similar to the way collect_md mode is used in IP tunnels. This patch attempts to enable different configuration mechanisms - such as ebpf programs, LWT encapsulations, and TC - to attach metadata to skbs which would carry the if_id. This way a single xfrm interface in collect_md mode can demux traffic based on this configuration on tx and provide this metadata on rx. The XFRM metadata is somewhat similar to ip tunnel metadata in that it has an "id", and shares similar configuration entities (bpf, tc, ...), however, it does not necessarily represent an IP tunnel or use other ip tunnel information, and also has an optional "link" property which can be used for affecting underlying routing decisions. Additional xfrm related criteria may also be added in the future. Therefore, a new metadata type is introduced, to be used in subsequent patches in the xfrm interface and configuration entities. Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | | | | net: sched: cls_api: introduce tc_cls_bind_class() helperZhengchao Shao2022-10-021-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All the bind_class callback duplicate the same logic, this patch introduces tc_cls_bind_class() helper for common usage. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | net: dsa: remove bool devlink_port_setupVladimir Oltean2022-09-301-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since dsa_port_devlink_setup() and dsa_port_devlink_teardown() are already called from code paths which only execute once per port (due to the existing bool dp->setup), keeping another dp->devlink_port_setup is redundant, because we can already manage to balance the calls properly (and not call teardown when setup was never called, or call setup twice, or things like that). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | | | | net: devlink: add port_init/fini() helpers to allow ↵Jiri Pirko2022-09-301-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pre-register/post-unregister functions Lifetime of some of the devlink objects, like regions, is currently forced to be different for devlink instance and devlink port instance (per-port regions). The reason is that for devlink ports, the internal structures initialization happens only after devlink_port_register() is called. To resolve this inconsistency, introduce new set of helpers to allow driver to initialize devlink pointer and region list before devlink_register() is called. That allows port regions to be created before devlink port registration and destroyed after devlink port unregistration. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | | | | net: devlink: introduce a flag to indicate devlink port being registeredJiri Pirko2022-09-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of relying on devlink pointer not being initialized, introduce an extra flag to indicate if devlink port is registered. This is needed as later on devlink pointer is going to be initialized even in case devlink port is not registered yet. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | | | | Merge tag 'for-net-next-2022-09-30' of ↵Jakub Kicinski2022-09-305-3/+80
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next - Add RTL8761BUV device (Edimax BT-8500) - Add a new PID/VID 13d3/3583 for MT7921 - Add Realtek RTL8852C support ID 0x13D3:0x3592 - Add VID/PID 0489/e0e0 for MediaTek MT7921 - Add a new VID/PID 0e8d/0608 for MT7921 - Add a new PID/VID 13d3/3578 for MT7921 - Add BT device 0cb8:c549 from RTW8852AE - Add support for Intel Magnetor * tag 'for-net-next-2022-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (49 commits) Bluetooth: hci_sync: Fix not indicating power state Bluetooth: L2CAP: Fix user-after-free Bluetooth: Call shutdown for HCI_USER_CHANNEL Bluetooth: Prevent double register of suspend Bluetooth: hci_core: Fix not handling link timeouts propertly Bluetooth: hci_event: Make sure ISO events don't affect non-ISO connections Bluetooth: hci_debugfs: Fix not checking conn->debugfs Bluetooth: hci_sysfs: Fix attempting to call device_add multiple times Bluetooth: MGMT: fix zalloc-simple.cocci warnings Bluetooth: hci_{ldisc,serdev}: check percpu_init_rwsem() failure Bluetooth: use hdev->workqueue when queuing hdev->{cmd,ncmd}_timer works Bluetooth: L2CAP: initialize delayed works at l2cap_chan_create() Bluetooth: RFCOMM: Fix possible deadlock on socket shutdown/release Bluetooth: hci_sync: allow advertise when scan without RPA Bluetooth: btusb: Add a new VID/PID 0e8d/0608 for MT7921 Bluetooth: btusb: Add a new PID/VID 13d3/3583 for MT7921 Bluetooth: avoid hci_dev_test_and_set_flag() in mgmt_init_hdev() Bluetooth: btintel: Mark Intel controller to support LE_STATES quirk Bluetooth: btintel: Add support for Magnetor Bluetooth: btusb: Add a new PID/VID 13d3/3578 for MT7921 ... ==================== Link: https://lore.kernel.org/r/20221001004602.297366-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | Bluetooth: Fix HCIGETDEVINFO regressionLuiz Augusto von Dentz2022-09-081-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent changes breaks HCIGETDEVINFO since it changes the size of hci_dev_info. Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>