summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* svcrdma: Fix page leak in svc_rdma_recv_read_chunk()Chuck Lever2020-07-131-7/+21
| | | | | | | | | | Commit 07d0ff3b0cd2 ("svcrdma: Clean up Read chunk path") moved the page saver logic so that it gets executed event when an error occurs. In that case, the I/O is never posted, and those pages are then leaked. Errors in this path, however, are quite rare. Fixes: 07d0ff3b0cd2 ("svcrdma: Clean up Read chunk path") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Augment server-side rpcgss tracepointsChuck Lever2020-07-132-38/+82
| | | | | | | | | | | | | | | | Add similar tracepoints to those that were recently added on the client side to track failures in the integ and priv unwrap paths. And, let's collect the seqno-specific tracepoints together with a common naming convention. Regarding the gss_check_seq_num() changes: everywhere else treats the GSS sequence number as an unsigned 32-bit integer. As far back as 2.6.12, I couldn't find a compelling reason to do things differently here. As a defensive change it's better to eliminate needless implicit sign conversions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds2020-07-1079-402/+548
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Restore previous behavior of CAP_SYS_ADMIN wrt loading networking BPF programs, from Maciej Żenczykowski. 2) Fix dropped broadcasts in mac80211 code, from Seevalamuthu Mariappan. 3) Slay memory leak in nl80211 bss color attribute parsing code, from Luca Coelho. 4) Get route from skb properly in ip_route_use_hint(), from Miaohe Lin. 5) Don't allow anything other than ARPHRD_ETHER in llc code, from Eric Dumazet. 6) xsk code dips too deeply into DMA mapping implementation internals. Add dma_need_sync and use it. From Christoph Hellwig 7) Enforce power-of-2 for BPF ringbuf sizes. From Andrii Nakryiko. 8) Check for disallowed attributes when loading flow dissector BPF programs. From Lorenz Bauer. 9) Correct packet injection to L3 tunnel devices via AF_PACKET, from Jason A. Donenfeld. 10) Don't advertise checksum offload on ipa devices that don't support it. From Alex Elder. 11) Resolve several issues in TCP MD5 signature support. Missing memory barriers, bogus options emitted when using syncookies, and failure to allow md5 key changes in established states. All from Eric Dumazet. 12) Fix interface leak in hsr code, from Taehee Yoo. 13) VF reset fixes in hns3 driver, from Huazhong Tan. 14) Make loopback work again with ipv6 anycast, from David Ahern. 15) Fix TX starvation under high load in fec driver, from Tobias Waldekranz. 16) MLD2 payload lengths not checked properly in bridge multicast code, from Linus Lüssing. 17) Packet scheduler code that wants to find the inner protocol currently only works for one level of VLAN encapsulation. Allow Q-in-Q situations to work properly here, from Toke Høiland-Jørgensen. 18) Fix route leak in l2tp, from Xin Long. 19) Resolve conflict between the sk->sk_user_data usage of bpf reuseport support and various protocols. From Martin KaFai Lau. 20) Fix socket cgroup v2 reference counting in some situations, from Cong Wang. 21) Cure memory leak in mlx5 connection tracking offload support, from Eli Britstein. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) mlxsw: pci: Fix use-after-free in case of failed devlink reload mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON() net: macb: fix call to pm_runtime in the suspend/resume functions net: macb: fix macb_suspend() by removing call to netif_carrier_off() net: macb: fix macb_get/set_wol() when moving to phylink net: macb: mark device wake capable when "magic-packet" property present net: macb: fix wakeup test in runtime suspend/resume routines bnxt_en: fix NULL dereference in case SR-IOV configuration fails libbpf: Fix libbpf hashmap on (I)LP32 architectures net/mlx5e: CT: Fix memory leak in cleanup net/mlx5e: Fix port buffers cell size value net/mlx5e: Fix 50G per lane indication net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash net/mlx5e: Fix VXLAN configuration restore after function reload net/mlx5e: Fix usage of rcu-protected pointer net/mxl5e: Verify that rpriv is not NULL net/mlx5: E-Switch, Fix vlan or qos setting in legacy mode net/mlx5: Fix eeprom support for SFP module cgroup: Fix sock_cgroup_data on big-endian. selftests: bpf: Fix detach from sockmap tests ...
| * tcp: make sure listeners don't initialize congestion-control stateChristoph Paasch2020-07-092-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzkaller found its way into setsockopt with TCP_CONGESTION "cdg". tcp_cdg_init() does a kcalloc to store the gradients. As sk_clone_lock just copies all the memory, the allocated pointer will be copied as well, if the app called setsockopt(..., TCP_CONGESTION) on the listener. If now the socket will be destroyed before the congestion-control has properly been initialized (through a call to tcp_init_transfer), we will end up freeing memory that does not belong to that particular socket, opening the door to a double-free: [ 11.413102] ================================================================== [ 11.414181] BUG: KASAN: double-free or invalid-free in tcp_cleanup_congestion_control+0x58/0xd0 [ 11.415329] [ 11.415560] CPU: 3 PID: 4884 Comm: syz-executor.5 Not tainted 5.8.0-rc2 #80 [ 11.416544] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 11.418148] Call Trace: [ 11.418534] <IRQ> [ 11.418834] dump_stack+0x7d/0xb0 [ 11.419297] print_address_description.constprop.0+0x1a/0x210 [ 11.422079] kasan_report_invalid_free+0x51/0x80 [ 11.423433] __kasan_slab_free+0x15e/0x170 [ 11.424761] kfree+0x8c/0x230 [ 11.425157] tcp_cleanup_congestion_control+0x58/0xd0 [ 11.425872] tcp_v4_destroy_sock+0x57/0x5a0 [ 11.426493] inet_csk_destroy_sock+0x153/0x2c0 [ 11.427093] tcp_v4_syn_recv_sock+0xb29/0x1100 [ 11.427731] tcp_get_cookie_sock+0xc3/0x4a0 [ 11.429457] cookie_v4_check+0x13d0/0x2500 [ 11.433189] tcp_v4_do_rcv+0x60e/0x780 [ 11.433727] tcp_v4_rcv+0x2869/0x2e10 [ 11.437143] ip_protocol_deliver_rcu+0x23/0x190 [ 11.437810] ip_local_deliver+0x294/0x350 [ 11.439566] __netif_receive_skb_one_core+0x15d/0x1a0 [ 11.441995] process_backlog+0x1b1/0x6b0 [ 11.443148] net_rx_action+0x37e/0xc40 [ 11.445361] __do_softirq+0x18c/0x61a [ 11.445881] asm_call_on_stack+0x12/0x20 [ 11.446409] </IRQ> [ 11.446716] do_softirq_own_stack+0x34/0x40 [ 11.447259] do_softirq.part.0+0x26/0x30 [ 11.447827] __local_bh_enable_ip+0x46/0x50 [ 11.448406] ip_finish_output2+0x60f/0x1bc0 [ 11.450109] __ip_queue_xmit+0x71c/0x1b60 [ 11.451861] __tcp_transmit_skb+0x1727/0x3bb0 [ 11.453789] tcp_rcv_state_process+0x3070/0x4d3a [ 11.456810] tcp_v4_do_rcv+0x2ad/0x780 [ 11.457995] __release_sock+0x14b/0x2c0 [ 11.458529] release_sock+0x4a/0x170 [ 11.459005] __inet_stream_connect+0x467/0xc80 [ 11.461435] inet_stream_connect+0x4e/0xa0 [ 11.462043] __sys_connect+0x204/0x270 [ 11.465515] __x64_sys_connect+0x6a/0xb0 [ 11.466088] do_syscall_64+0x3e/0x70 [ 11.466617] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 11.467341] RIP: 0033:0x7f56046dc469 [ 11.467844] Code: Bad RIP value. [ 11.468282] RSP: 002b:00007f5604dccdd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a [ 11.469326] RAX: ffffffffffffffda RBX: 000000000068bf00 RCX: 00007f56046dc469 [ 11.470379] RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000004 [ 11.471311] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000 [ 11.472286] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 11.473341] R13: 000000000041427c R14: 00007f5604dcd5c0 R15: 0000000000000003 [ 11.474321] [ 11.474527] Allocated by task 4884: [ 11.475031] save_stack+0x1b/0x40 [ 11.475548] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 11.476182] tcp_cdg_init+0xf0/0x150 [ 11.476744] tcp_init_congestion_control+0x9b/0x3a0 [ 11.477435] tcp_set_congestion_control+0x270/0x32f [ 11.478088] do_tcp_setsockopt.isra.0+0x521/0x1a00 [ 11.478744] __sys_setsockopt+0xff/0x1e0 [ 11.479259] __x64_sys_setsockopt+0xb5/0x150 [ 11.479895] do_syscall_64+0x3e/0x70 [ 11.480395] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 11.481097] [ 11.481321] Freed by task 4872: [ 11.481783] save_stack+0x1b/0x40 [ 11.482230] __kasan_slab_free+0x12c/0x170 [ 11.482839] kfree+0x8c/0x230 [ 11.483240] tcp_cleanup_congestion_control+0x58/0xd0 [ 11.483948] tcp_v4_destroy_sock+0x57/0x5a0 [ 11.484502] inet_csk_destroy_sock+0x153/0x2c0 [ 11.485144] tcp_close+0x932/0xfe0 [ 11.485642] inet_release+0xc1/0x1c0 [ 11.486131] __sock_release+0xc0/0x270 [ 11.486697] sock_close+0xc/0x10 [ 11.487145] __fput+0x277/0x780 [ 11.487632] task_work_run+0xeb/0x180 [ 11.488118] __prepare_exit_to_usermode+0x15a/0x160 [ 11.488834] do_syscall_64+0x4a/0x70 [ 11.489326] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Wei Wang fixed a part of these CDG-malloc issues with commit c12014440750 ("tcp: memset ca_priv data to 0 properly"). This patch here fixes the listener-scenario: We make sure that listeners setting the congestion-control through setsockopt won't initialize it (thus CDG never allocates on listeners). For those who use AF_UNSPEC to reuse a socket, tcp_disconnect() is changed to cleanup afterwards. (The issue can be reproduced at least down to v4.4.x.) Cc: Wei Wang <weiwan@google.com> Cc: Eric Dumazet <edumazet@google.com> Fixes: 2b0a8c9eee81 ("tcp: add CDG congestion control") Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ethtool: fix genlmsg_put() failure handling in ethnl_default_dumpit()Michal Kubecek2020-07-091-14/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the genlmsg_put() call in ethnl_default_dumpit() fails, we bail out without checking if we already have some messages in current skb like we do with ethnl_default_dump_one() failure later. Therefore if existing messages almost fill up the buffer so that there is not enough space even for netlink and genetlink header, we lose all prepared messages and return and error. Rather than duplicating the skb->len check, move the genlmsg_put(), genlmsg_cancel() and genlmsg_end() calls into ethnl_default_dump_one(). This is also more logical as all message composition will be in ethnl_default_dump_one() and only iteration logic will be left in ethnl_default_dumpit(). Fixes: 728480f12442 ("ethtool: default handlers for GET requests") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: fix a memory leak in atm_tc_init()Cong Wang2020-07-091-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When tcf_block_get() fails inside atm_tc_init(), atm_tc_put() is called to release the qdisc p->link.q. But the flow->ref prevents it to do so, as the flow->ref is still zero. Fix this by moving the p->link.ref initialization before tcf_block_get(). Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure") Reported-and-tested-by: syzbot+d411cff6ab29cc2c311b@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix retransmission on unicast linksHamish Martin2020-07-081-8/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A scenario has been observed where a 'bc_init' message for a link is not retransmitted if it fails to be received by the peer. This leads to the peer never establishing the link fully and it discarding all other data received on the link. In this scenario the message is lost in transit to the peer. The issue is traced to the 'nxt_retr' field of the skb not being initialised for links that aren't a bc_sndlink. This leads to the comparison in tipc_link_advance_transmq() that gates whether to attempt retransmission of a message performing in an undesirable way. Depending on the relative value of 'jiffies', this comparison: time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr) may return true or false given that 'nxt_retr' remains at the uninitialised value of 0 for non bc_sndlinks. This is most noticeable shortly after boot when jiffies is initialised to a high value (to flush out rollover bugs) and we compare a jiffies of, say, 4294940189 to zero. In that case time_before returns 'true' leading to the skb not being retransmitted. The fix is to ensure that all skbs have a valid 'nxt_retr' time set for them and this is achieved by refactoring the setting of this value into a central function. With this fix, transmission losses of 'bc_init' messages do not stall the link establishment forever because the 'bc_init' message is retransmitted and the link eventually establishes correctly. Fixes: 382f598fb66b ("tipc: reduce duplicate packets for unicast traffic") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
| * l2tp: remove skb_dst_set() from l2tp_xmit_skb()Xin Long2020-07-081-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the tx path of l2tp, l2tp_xmit_skb() calls skb_dst_set() to set skb's dst. However, it will eventually call inet6_csk_xmit() or ip_queue_xmit() where skb's dst will be overwritten by: skb_dst_set_noref(skb, dst); without releasing the old dst in skb. Then it causes dst/dev refcnt leak: unregister_netdevice: waiting for eth0 to become free. Usage count = 1 This can be reproduced by simply running: # modprobe l2tp_eth && modprobe l2tp_ip # sh ./tools/testing/selftests/net/l2tp.sh So before going to inet6_csk_xmit() or ip_queue_xmit(), skb's dst should be dropped. This patch is to fix it by removing skb_dst_set() from l2tp_xmit_skb() and moving skb_dst_drop() into l2tp_xmit_core(). Fixes: 3557baabf280 ("[L2TP]: PPP over L2TP driver core") Reported-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: James Chapman <jchapman@katalix.com> Tested-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: tolerate future SMCD versionsUrsula Braun2020-07-082-13/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CLC proposal messages of future SMCD versions could be larger than SMCD V1 CLC proposal messages. To enable toleration in SMC V1 the receival of CLC proposal messages is adapted: * accept larger length values in CLC proposal * check trailing eye catcher for incoming CLC proposal with V1 length only * receive the whole CLC proposal even in cases it does not fit into the V1 buffer Fixes: e7b7a64a8493d ("smc: support variable CLC proposal messages") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: switch smcd_dev_list spinlock to mutexUrsula Braun2020-07-084-18/+20
| | | | | | | | | | | | | | | | | | | | | | The similar smc_ib_devices spinlock has been converted to a mutex. Protecting the smcd_dev_list by a mutex is possible as well. This patch converts the smcd_dev_list spinlock to a mutex. Fixes: c6ba7c9ba43d ("net/smc: add base infrastructure for SMC-D and ISM") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: fix sleep bug in smc_pnet_find_roce_resource()Ursula Braun2020-07-084-18/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tests showed this BUG: [572555.252867] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935 [572555.252876] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 131031, name: smcapp [572555.252879] INFO: lockdep is turned off. [572555.252883] CPU: 1 PID: 131031 Comm: smcapp Tainted: G O 5.7.0-rc3uschi+ #356 [572555.252885] Hardware name: IBM 3906 M03 703 (LPAR) [572555.252887] Call Trace: [572555.252896] [<00000000ac364554>] show_stack+0x94/0xe8 [572555.252901] [<00000000aca1f400>] dump_stack+0xa0/0xe0 [572555.252906] [<00000000ac3c8c10>] ___might_sleep+0x260/0x280 [572555.252910] [<00000000acdc0c98>] __mutex_lock+0x48/0x940 [572555.252912] [<00000000acdc15c2>] mutex_lock_nested+0x32/0x40 [572555.252975] [<000003ff801762d0>] mlx5_lag_get_roce_netdev+0x30/0xc0 [mlx5_core] [572555.252996] [<000003ff801fb3aa>] mlx5_ib_get_netdev+0x3a/0xe0 [mlx5_ib] [572555.253007] [<000003ff80063848>] smc_pnet_find_roce_resource+0x1d8/0x310 [smc] [572555.253011] [<000003ff800602f0>] __smc_connect+0x1f0/0x3e0 [smc] [572555.253015] [<000003ff80060634>] smc_connect+0x154/0x190 [smc] [572555.253022] [<00000000acbed8d4>] __sys_connect+0x94/0xd0 [572555.253025] [<00000000acbef620>] __s390x_sys_socketcall+0x170/0x360 [572555.253028] [<00000000acdc6800>] system_call+0x298/0x2b8 [572555.253030] INFO: lockdep is turned off. Function smc_pnet_find_rdma_dev() might be called from smc_pnet_find_roce_resource(). It holds the smc_ib_devices list spinlock while calling infiniband op get_netdev(). At least for mlx5 the get_netdev operation wants mutex serialization, which conflicts with the smc_ib_devices spinlock. This patch switches the smc_ib_devices spinlock into a mutex to allow sleeping when calling get_netdev(). Fixes: a4cf0443c414 ("smc: introduce SMC as an IB-client") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: fix work request handlingKarsten Graul2020-07-082-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | Wait for pending sends only when smc_switch_conns() found a link to move the connections to. Do not wait during link freeing, this can lead to permanent hang situations. And refuse to provide a new tx slot on an unusable link. Fixes: c6f02ebeea3a ("net/smc: switch connections to alternate link") Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: separate LLC wait queues for flow and messagesKarsten Graul2020-07-083-48/+77
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There might be races in scenarios where both SMC link groups are on the same system. Prevent that by creating separate wait queues for LLC flows and messages. Switch to non-interruptable versions of wait_event() and wake_up() for the llc flow waiter to make sure the waiters get control sequentially. Fine tune the llc_flow_lock to include the assignment of the message. Write to system log when an unexpected message was dropped. And remove an extra indirection and use the existing local variable lgr in smc_llc_enqueue(). Fixes: 555da9af827d ("net/smc: add event-based llc_flow framework") Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bridge: mcast: Fix MLD2 Report IPv6 payload length checkLinus Lüssing2020-07-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e57f61858b7c ("net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling") introduced a bug in the IPv6 header payload length check which would potentially lead to rejecting a valid MLD2 Report: The check needs to take into account the 2 bytes for the "Number of Sources" field in the "Multicast Address Record" before reading it. And not the size of a pointer to this field. Fixes: e57f61858b7c ("net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling") Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/sched: act_ct: add miss tcf_lastuse_update.wenxu2020-07-071-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When tcf_ct_act execute the tcf_lastuse_update should be update or the used stats never update filter protocol ip pref 3 flower chain 0 filter protocol ip pref 3 flower chain 0 handle 0x1 eth_type ipv4 dst_ip 1.1.1.1 ip_flags frag/firstfrag skip_hw not_in_hw action order 1: ct zone 1 nat pipe index 1 ref 1 bind 1 installed 103 sec used 103 sec Action statistics: Sent 151500 bytes 101 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 cookie 4519c04dc64a1a295787aab13b6a50fb Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mptcp: fix DSS map generation on fin retransmissionPaolo Abeni2020-07-071-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The RFC 8684 mandates that no-data DATA FIN packets should carry a DSS with 0 sequence number and data len equal to 1. Currently, on FIN retransmission we re-use the existing mapping; if the previous fin transmission was part of a partially acked data packet, we could end-up writing in the egress packet a non-compliant DSS. The above will be detected by a "Bad mapping" warning on the receiver side. This change addresses the issue explicitly checking for 0 len packet when adding the DATA_FIN option. Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets") Reported-by: syzbot+42a07faa5923cfaeb9c9@syzkaller.appspotmail.com Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsgSabrina Dubroca2020-07-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IPv4 ping sockets don't set fl4.fl4_icmp_{type,code}, which leads to incomplete IPsec ACQUIRE messages being sent to userspace. Currently, both raw sockets and IPv6 ping sockets set those fields. Expected output of "ip xfrm monitor": acquire proto esp sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 8 code 0 dev ens4 policy src 10.0.2.15/32 dst 8.8.8.8/32 <snip> Currently with ping sockets: acquire proto esp sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 0 code 0 dev ens4 policy src 10.0.2.15/32 dst 8.8.8.8/32 <snip> The Libreswan test suite found this problem after Fedora changed the value for the sysctl net.ipv4.ping_group_range. Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Reported-by: Paul Wouters <pwouters@redhat.com> Tested-by: Paul Wouters <pwouters@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * cgroup: fix cgroup_sk_alloc() for sk_clone_lock()Cong Wang2020-07-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we clone a socket in sk_clone_lock(), its sk_cgrp_data is copied, so the cgroup refcnt must be taken too. And, unlike the sk_alloc() path, sock_update_netprioidx() is not called here. Therefore, it is safe and necessary to grab the cgroup refcnt even when cgroup_sk_alloc is disabled. sk_clone_lock() is in BH context anyway, the in_interrupt() would terminate this function if called there. And for sk_alloc() skcd->val is always zero. So it's safe to factor out the code to make it more readable. The global variable 'cgroup_sk_alloc_disabled' is used to determine whether to take these reference counts. It is impossible to make the reference counting correct unless we save this bit of information in skcd->val. So, add a new bit there to record whether the socket has already taken the reference counts. This obviously relies on kmalloc() to align cgroup pointers to at least 4 bytes, ARCH_KMALLOC_MINALIGN is certainly larger than that. This bug seems to be introduced since the beginning, commit d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") tried to fix it but not compeletely. It seems not easy to trigger until the recent commit 090e28b229af ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged. Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Reported-by: Cameron Berkenpas <cam@neo-zeon.de> Reported-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reported-by: Daniël Sonck <dsonck92@gmail.com> Reported-by: Zhang Qiang <qiang.zhang@windriver.com> Tested-by: Cameron Berkenpas <cam@neo-zeon.de> Tested-by: Peter Geis <pgwipeout@gmail.com> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Zefan Li <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Fix use of anycast address with loopbackDavid Ahern2020-07-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Thomas reported a regression with IPv6 and anycast using the following reproducer: echo 1 > /proc/sys/net/ipv6/conf/all/forwarding ip -6 a add fc12::1/16 dev lo sleep 2 echo "pinging lo" ping6 -c 2 fc12:: The conversion of addrconf_f6i_alloc to use ip6_route_info_create missed the use of fib6_is_reject which checks addresses added to the loopback interface and sets the REJECT flag as needed. Update fib6_is_reject for loopback checks to handle RTF_ANYCAST addresses. Fixes: c7a1ce397ada ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create") Reported-by: thomas.gambier@nexedi.com Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: fib6_select_path can not use out path for nexthop objectsDavid Ahern2020-07-061-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Brian reported a crash in IPv6 code when using rpfilter with a setup running FRR and external nexthop objects. The root cause of the crash is fib6_select_path setting fib6_nh in the result to NULL because of an improper check for nexthop objects. More specifically, rpfilter invokes ip6_route_lookup with flowi6_oif set causing fib6_select_path to be called with have_oif_match set. fib6_select_path has early check on have_oif_match and jumps to the out label which presumes a builtin fib6_nh. This path is invalid for nexthop objects; for external nexthops fib6_select_path needs to just return if the fib6_nh has already been set in the result otherwise it returns after the call to nexthop_path_fib6_result. Update the check on have_oif_match to not bail on external nexthops. Update selftests for this problem. Fixes: f88d8ea67fbd ("ipv6: Plumb support for nexthop object in a fib6_info") Reported-by: Brian Rak <brak@choopa.com> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * hsr: fix interface leak in error path of hsr_dev_finalize()Taehee Yoo2020-07-041-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To release hsr(upper) interface, it should release its own lower interfaces first. Then, hsr(upper) interface can be released safely. In the current code of error path of hsr_dev_finalize(), it releases hsr interface before releasing a lower interface. So, a warning occurs, which warns about the leak of lower interfaces. In order to fix this problem, changing the ordering of the error path of hsr_dev_finalize() is needed. Test commands: ip link add dummy0 type dummy ip link add dummy1 type dummy ip link add dummy2 type dummy ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1 ip link add hsr1 type hsr slave1 dummy2 slave2 dummy0 Splat looks like: [ 214.923127][ C2] WARNING: CPU: 2 PID: 1093 at net/core/dev.c:8992 rollback_registered_many+0x986/0xcf0 [ 214.923129][ C2] Modules linked in: hsr dummy openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipx [ 214.923154][ C2] CPU: 2 PID: 1093 Comm: ip Not tainted 5.8.0-rc2+ #623 [ 214.923156][ C2] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 214.923157][ C2] RIP: 0010:rollback_registered_many+0x986/0xcf0 [ 214.923160][ C2] Code: 41 8b 4e cc 45 31 c0 31 d2 4c 89 ee 48 89 df e8 e0 47 ff ff 85 c0 0f 84 cd fc ff ff 5 [ 214.923162][ C2] RSP: 0018:ffff8880c5156f28 EFLAGS: 00010287 [ 214.923165][ C2] RAX: ffff8880d1dad458 RBX: ffff8880bd1b9000 RCX: ffffffffb929d243 [ 214.923167][ C2] RDX: 1ffffffff77e63f0 RSI: 0000000000000008 RDI: ffffffffbbf31f80 [ 214.923168][ C2] RBP: dffffc0000000000 R08: fffffbfff77e63f1 R09: fffffbfff77e63f1 [ 214.923170][ C2] R10: ffffffffbbf31f87 R11: 0000000000000001 R12: ffff8880c51570a0 [ 214.923172][ C2] R13: ffff8880bd1b90b8 R14: ffff8880c5157048 R15: ffff8880d1dacc40 [ 214.923174][ C2] FS: 00007fdd257a20c0(0000) GS:ffff8880da200000(0000) knlGS:0000000000000000 [ 214.923175][ C2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 214.923177][ C2] CR2: 00007ffd78beb038 CR3: 00000000be544005 CR4: 00000000000606e0 [ 214.923179][ C2] Call Trace: [ 214.923180][ C2] ? netif_set_real_num_tx_queues+0x780/0x780 [ 214.923182][ C2] ? dev_validate_mtu+0x140/0x140 [ 214.923183][ C2] ? synchronize_rcu.part.79+0x85/0xd0 [ 214.923185][ C2] ? synchronize_rcu_expedited+0xbb0/0xbb0 [ 214.923187][ C2] rollback_registered+0xc8/0x170 [ 214.923188][ C2] ? rollback_registered_many+0xcf0/0xcf0 [ 214.923190][ C2] unregister_netdevice_queue+0x18b/0x240 [ 214.923191][ C2] hsr_dev_finalize+0x56e/0x6e0 [hsr] [ 214.923192][ C2] hsr_newlink+0x36b/0x450 [hsr] [ 214.923194][ C2] ? hsr_dellink+0x70/0x70 [hsr] [ 214.923195][ C2] ? rtnl_create_link+0x2e4/0xb00 [ 214.923197][ C2] ? __netlink_ns_capable+0xc3/0xf0 [ 214.923198][ C2] __rtnl_newlink+0xbdb/0x1270 [ ... ] Fixes: e0a4b99773d3 ("hsr: use upper/lower device infrastructure") Reported-by: syzbot+7f1c020f68dab95aab59@syzkaller.appspotmail.com Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2020-07-045-5/+7
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Use kvfree() to release vmalloc()'ed areas in ipset, from Eric Dumazet. 2) UAF in nfnetlink_queue from the nf_conntrack_update() path. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * netfilter: conntrack: refetch conntrack after nf_conntrack_update()Pablo Neira Ayuso2020-07-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __nf_conntrack_update() might refresh the conntrack object that is attached to the skbuff. Otherwise, this triggers UAF. [ 633.200434] ================================================================== [ 633.200472] BUG: KASAN: use-after-free in nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200478] Read of size 1 at addr ffff888370804c00 by task nfqnl_test/6769 [ 633.200487] CPU: 1 PID: 6769 Comm: nfqnl_test Not tainted 5.8.0-rc2+ #388 [ 633.200490] Hardware name: LENOVO 23259H1/23259H1, BIOS G2ET32WW (1.12 ) 05/30/2012 [ 633.200491] Call Trace: [ 633.200499] dump_stack+0x7c/0xb0 [ 633.200526] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200532] print_address_description.constprop.6+0x1a/0x200 [ 633.200539] ? _raw_write_lock_irqsave+0xc0/0xc0 [ 633.200568] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200594] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200598] kasan_report.cold.9+0x1f/0x42 [ 633.200604] ? call_rcu+0x2c0/0x390 [ 633.200633] ? nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200659] nf_conntrack_update+0x34e/0x770 [nf_conntrack] [ 633.200687] ? nf_conntrack_find_get+0x30/0x30 [nf_conntrack] Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1436 Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: ipset: call ip_set_free() instead of kfree()Eric Dumazet2020-06-304-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever ip_set_alloc() is used, allocated memory can either use kmalloc() or vmalloc(). We should call kvfree() or ip_set_free() invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 21935 Comm: syz-executor.3 Not tainted 5.8.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__phys_addr+0xa7/0x110 arch/x86/mm/physaddr.c:28 Code: 1d 7a 09 4c 89 e3 31 ff 48 d3 eb 48 89 de e8 d0 58 3f 00 48 85 db 75 0d e8 26 5c 3f 00 4c 89 e0 5b 5d 41 5c c3 e8 19 5c 3f 00 <0f> 0b e8 12 5c 3f 00 48 c7 c0 10 10 a8 89 48 ba 00 00 00 00 00 fc RSP: 0000:ffffc900018572c0 EFLAGS: 00010046 RAX: 0000000000040000 RBX: 0000000000000001 RCX: ffffc9000fac3000 RDX: 0000000000040000 RSI: ffffffff8133f437 RDI: 0000000000000007 RBP: ffffc90098aff000 R08: 0000000000000000 R09: ffff8880ae636cdb R10: 0000000000000000 R11: 0000000000000000 R12: 0000408018aff000 R13: 0000000000080000 R14: 000000000000001d R15: ffffc900018573d8 FS: 00007fc540c66700(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc9dcd67200 CR3: 0000000059411000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: virt_to_head_page include/linux/mm.h:841 [inline] virt_to_cache mm/slab.h:474 [inline] kfree+0x77/0x2c0 mm/slab.c:3749 hash_net_create+0xbb2/0xd70 net/netfilter/ipset/ip_set_hash_gen.h:1536 ip_set_create+0x6a2/0x13c0 net/netfilter/ipset/ip_set_core.c:1128 nfnetlink_rcv_msg+0xbe8/0xea0 net/netfilter/nfnetlink.c:230 netlink_rcv_skb+0x15a/0x430 net/netlink/af_netlink.c:2469 nfnetlink_rcv+0x1ac/0x420 net/netfilter/nfnetlink.c:564 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1329 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1918 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2352 ___sys_sendmsg+0xf3/0x170 net/socket.c:2406 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2439 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45cb19 Code: Bad RIP value. RSP: 002b:00007fc540c65c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000004fed80 RCX: 000000000045cb19 RDX: 0000000000000000 RSI: 0000000020001080 RDI: 0000000000000003 RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 000000000000095e R14: 00000000004cc295 R15: 00007fc540c666d4 Fixes: f66ee0410b1c ("netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports") Fixes: 03c8b234e61a ("netfilter: ipset: Generalize extensions support") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | sched: consistently handle layer3 header accesses in the presence of VLANsToke Høiland-Jørgensen2020-07-0316-32/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are a couple of places in net/sched/ that check skb->protocol and act on the value there. However, in the presence of VLAN tags, the value stored in skb->protocol can be inconsistent based on whether VLAN acceleration is enabled. The commit quoted in the Fixes tag below fixed the users of skb->protocol to use a helper that will always see the VLAN ethertype. However, most of the callers don't actually handle the VLAN ethertype, but expect to find the IP header type in the protocol field. This means that things like changing the ECN field, or parsing diffserv values, stops working if there's a VLAN tag, or if there are multiple nested VLAN tags (QinQ). To fix this, change the helper to take an argument that indicates whether the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we make sure to skip all of them, so behaviour is consistent even in QinQ mode. To make the helper usable from the ECN code, move it to if_vlan.h instead of pkt_sched.h. v3: - Remove empty lines - Move vlan variable definitions inside loop in skb_protocol() - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and bpf_skb_ecn_set_ce() v2: - Use eth_type_vlan() helper in skb_protocol() - Also fix code that reads skb->protocol directly - Change a couple of 'if/else if' statements to switch constructs to avoid calling the helper twice Reported-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com> Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: md5: allow changing MD5 keys in all socket statesEric Dumazet2020-07-021-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This essentially reverts commit 721230326891 ("tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets") Mathieu reported that many vendors BGP implementations can actually switch TCP MD5 on established flows. Quoting Mathieu : Here is a list of a few network vendors along with their behavior with respect to TCP MD5: - Cisco: Allows for password to be changed, but within the hold-down timer (~180 seconds). - Juniper: When password is initially set on active connection it will reset, but after that any subsequent password changes no network resets. - Nokia: No notes on if they flap the tcp connection or not. - Ericsson/RedBack: Allows for 2 password (old/new) to co-exist until both sides are ok with new passwords. - Meta-Switch: Expects the password to be set before a connection is attempted, but no further info on whether they reset the TCP connection on a change. - Avaya: Disable the neighbor, then set password, then re-enable. - Zebos: Would normally allow the change when socket connected. We can revert my prior change because commit 9424e2e7ad93 ("tcp: md5: fix potential overestimation of TCP option space") removed the leak of 4 kernel bytes to the wire that was the main reason for my patch. While doing my investigations, I found a bug when a MD5 key is changed, leading to these commits that stable teams want to consider before backporting this revert : Commit 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()") Commit e6ced831ef11 ("tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers") Fixes: 721230326891 "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets" Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: fix SO_RCVLOWAT possible hangs under high mem pressureEric Dumazet2020-07-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever tcp_try_rmem_schedule() returns an error, we are under trouble and should make sure to wakeup readers so that they can drain socket queues and eventually make room. Fixes: 03f45c883c6f ("tcp: avoid extra wakeups for SO_RCVLOWAT users") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ip: Fix SO_MARK in RST, ACK and ICMP packetsWillem de Bruijn2020-07-013-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When no full socket is available, skbs are sent over a per-netns control socket. Its sk_mark is temporarily adjusted to match that of the real (request or timewait) socket or to reflect an incoming skb, so that the outgoing skb inherits this in __ip_make_skb. Introduction of the socket cookie mark field broke this. Now the skb is set through the cookie and cork: <caller> # init sockc.mark from sk_mark or cmsg ip_append_data ip_setup_cork # convert sockc.mark to cork mark ip_push_pending_frames ip_finish_skb __ip_make_skb # set skb->mark to cork mark But I missed these special control sockets. Update all callers of __ip(6)_make_skb that were originally missed. For IPv6, the same two icmp(v6) paths are affected. The third case is not, as commit 92e55f412cff ("tcp: don't annotate mark on control socket from tcp_v6_send_response()") replaced the ctl_sk->sk_mark with passing the mark field directly as a function argument. That commit predates the commit that introduced the bug. Fixes: c6af0c227a22 ("ip: support SO_MARK cmsg") Signed-off-by: Willem de Bruijn <willemb@google.com> Reported-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: md5: do not send silly options in SYNCOOKIESEric Dumazet2020-07-011-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever cookie_init_timestamp() has been used to encode ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK. Otherwise, tcp_synack_options() will still advertize options like WSCALE that we can not deduce later when receiving the packet from the client to complete 3WHS. Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet, but we can not know for sure that all TCP stacks have the same logic. Before the fix a tcpdump would exhibit this wrong exchange : 10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0 10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0 10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0 10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12 10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0 After this patch the exchange looks saner : 11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0 11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0 11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0 11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12 11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0 11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12 11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0 Of course, no SACK block will ever be added later, but nothing should break. Technically, we could remove the 4 nops included in MD5+TS options, but again some stacks could break seeing not conventional alignment. Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | rds: If one path needs re-connection, check all and re-connectRao Shoaib2020-07-013-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In testing with mprds enabled, Oracle Cluster nodes after reboot were not able to communicate with others nodes and so failed to rejoin the cluster. Peers with lower IP address initiated connection but the node could not respond as it choose a different path and could not initiate a connection as it had a higher IP address. With this patch, when a node sends out a packet and the selected path is down, all other paths are also checked and any down paths are re-connected. Reviewed-by: Ka-cheong Poon <ka-cheong.poon@oracle.com> Reviewed-by: David Edmondson <david.edmondson@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriersEric Dumazet2020-07-012-9/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | My prior fix went a bit too far, according to Herbert and Mathieu. Since we accept that concurrent TCP MD5 lookups might see inconsistent keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb() Clearing all key->key[] is needed to avoid possible KMSAN reports, if key->keylen is increased. Since tcp_md5_do_add() is not fast path, using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler. data_race() was added in linux-5.8 and will prevent KCSAN reports, this can safely be removed in stable backports, if data_race() is not yet backported. v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add() Fixes: 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Marco Elver <elver@google.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | genetlink: remove genl_bindSean Tranchetti2020-07-011-49/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A potential deadlock can occur during registering or unregistering a new generic netlink family between the main nl_table_lock and the cb_lock where each thread wants the lock held by the other, as demonstrated below. 1) Thread 1 is performing a netlink_bind() operation on a socket. As part of this call, it will call netlink_lock_table(), incrementing the nl_table_users count to 1. 2) Thread 2 is registering (or unregistering) a genl_family via the genl_(un)register_family() API. The cb_lock semaphore will be taken for writing. 3) Thread 1 will call genl_bind() as part of the bind operation to handle subscribing to GENL multicast groups at the request of the user. It will attempt to take the cb_lock semaphore for reading, but it will fail and be scheduled away, waiting for Thread 2 to finish the write. 4) Thread 2 will call netlink_table_grab() during the (un)registration call. However, as Thread 1 has incremented nl_table_users, it will not be able to proceed, and both threads will be stuck waiting for the other. genl_bind() is a noop, unless a genl_family implements the mcast_bind() function to handle setting up family-specific multicast operations. Since no one in-tree uses this functionality as Cong pointed out, simply removing the genl_bind() function will remove the possibility for deadlock, as there is no attempt by Thread 1 above to take the cb_lock semaphore. Fixes: c380d9a7afff ("genetlink: pass multicast bind/unbind to families") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Johannes Berg <johannes.berg@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: qrtr: Fix an out of bounds read qrtr_endpoint_post()Dan Carpenter2020-06-301-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code assumes that the user passed in enough data for a qrtr_hdr_v1 or qrtr_hdr_v2 struct, but it's not necessarily true. If the buffer is too small then it will read beyond the end. Reported-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Reported-by: syzbot+b8fe393f999a291a9ea6@syzkaller.appspotmail.com Fixes: 194ccc88297a ("net: qrtr: Support decoding incoming v2 packets") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()Eric Dumazet2020-06-302-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD5 keys are read with RCU protection, and tcp_md5_do_add() might update in-place a prior key. Normally, typical RCU updates would allocate a new piece of memory. In this case only key->key and key->keylen might be updated, and we do not care if an incoming packet could see the old key, the new one, or some intermediate value, since changing the key on a live flow is known to be problematic anyway. We only want to make sure that in the case key->keylen is changed, cpus in tcp_md5_hash_key() wont try to use uninitialized data, or crash because key->keylen was read twice to feed sg_init_one() and ahash_request_set_crypt() Fixes: 9ea88a153001 ("tcp: md5: check md5 signature without socket lock") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: qrtr: free flow in __qrtr_node_releaseCarl Huang2020-06-301-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flow is allocated in qrtr_tx_wait, but not freed when qrtr node is released. (*slot) becomes NULL after radix_tree_iter_delete is called in __qrtr_node_release. The fix is to save (*slot) to a vairable and then free it. This memory leak is catched when kmemleak is enabled in kernel, the report looks like below: unreferenced object 0xffffa0de69e08420 (size 32): comm "kworker/u16:3", pid 176, jiffies 4294918275 (age 82858.876s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 28 84 e0 69 de a0 ff ff ........(..i.... 28 84 e0 69 de a0 ff ff 03 00 00 00 00 00 00 00 (..i............ backtrace: [<00000000e252af0a>] qrtr_node_enqueue+0x38e/0x400 [qrtr] [<000000009cea437f>] qrtr_sendmsg+0x1e0/0x2a0 [qrtr] [<000000008bddbba4>] sock_sendmsg+0x5b/0x60 [<0000000003beb43a>] qmi_send_message.isra.3+0xbe/0x110 [qmi_helpers] [<000000009c9ae7de>] qmi_send_request+0x1c/0x20 [qmi_helpers] Signed-off-by: Carl Huang <cjhuang@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2020-06-305-84/+97
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2020-06-30 The following pull-request contains BPF updates for your *net* tree. We've added 28 non-merge commits during the last 9 day(s) which contain a total of 35 files changed, 486 insertions(+), 232 deletions(-). The main changes are: 1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer types, from Yonghong Song. 2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki. 3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb internals and integrate it properly into DMA core, from Christoph Hellwig. 4) Fix RCU splat from recent changes to avoid skipping ingress policy when kTLS is enabled, from John Fastabend. 5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its position masking to work, from Andrii Nakryiko. 6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading of network programs, from Maciej Żenczykowski. 7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer. 8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | bpf: Add tests for PTR_TO_BTF_ID vs. null comparisonYonghong Song2020-06-301-1/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add two tests for PTR_TO_BTF_ID vs. null ptr comparison, one for PTR_TO_BTF_ID in the ctx structure and the other for PTR_TO_BTF_ID after one level pointer chasing. In both cases, the test ensures condition is not removed. For example, for this test struct bpf_fentry_test_t { struct bpf_fentry_test_t *a; }; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) { if (arg == 0) test7_result = 1; return 0; } Before the previous verifier change, we have xlated codes: int test7(long long unsigned int * ctx): ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) 0: (79) r1 = *(u64 *)(r1 +0) ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) 1: (b4) w0 = 0 2: (95) exit After the previous verifier change, we have: int test7(long long unsigned int * ctx): ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) 0: (79) r1 = *(u64 *)(r1 +0) ; if (arg == 0) 1: (55) if r1 != 0x0 goto pc+4 ; test7_result = 1; 2: (18) r1 = map[id:6][0]+48 4: (b7) r2 = 1 5: (7b) *(u64 *)(r1 +0) = r2 ; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) 6: (b4) w0 = 0 7: (95) exit Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630171241.2523875-1-yhs@fb.com
| | * | bpf: sockmap: Require attach_bpf_fd when detaching a programLorenz Bauer2020-06-301-5/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sockmap code currently ignores the value of attach_bpf_fd when detaching a program. This is contrary to the usual behaviour of checking that attach_bpf_fd represents the currently attached program. Ensure that attach_bpf_fd is indeed the currently attached program. It turns out that all sockmap selftests already do this, which indicates that this is unlikely to cause breakage. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
| | * | bpf: sockmap: Check value of unused args to BPF_PROG_ATTACHLorenz Bauer2020-06-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using BPF_PROG_ATTACH on a sockmap program currently understands no flags or replace_bpf_fd, but accepts any value. Return EINVAL instead. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-4-lmb@cloudflare.com
| | * | bpf, netns: Keep attached programs in bpf_prog_arrayJakub Sitnicki2020-06-301-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for having multi-prog attachments for new netns attach types by storing programs to run in a bpf_prog_array, which is well suited for iterating over programs and running them in sequence. After this change bpf(PROG_QUERY) may block to allocate memory in bpf_prog_array_copy_to_user() for collected program IDs. This forces a change in how we protect access to the attached program in the query callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from an RCU read lock to holding a mutex that serializes updaters. Because we allow only one BPF flow_dissector program to be attached to netns at all times, the bpf_prog_array pointed by net->bpf.run_array is always either detached (null) or one element long. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
| | * | flow_dissector: Pull BPF program assignment up to bpf-netnsJakub Sitnicki2020-06-301-11/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for using bpf_prog_array to store attached programs by moving out code that updates the attached program out of flow dissector. Managing bpf_prog_array is more involved than updating a single bpf_prog pointer. This will let us do it all from one place, bpf/net_namespace.c, in the subsequent patch. No functional change intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
| | * | xsk: Use dma_need_sync instead of reimplenting itChristoph Hellwig2020-06-301-47/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the dma_need_sync helper instead of (not always entirely correctly) poking into the dma-mapping internals. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-5-hch@lst.de
| | * | xsk: Remove a double pool->dev assignment in xp_dma_mapChristoph Hellwig2020-06-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ->dev is already assigned at the top of the function, remove the duplicate one at the end. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-4-hch@lst.de
| | * | xsk: Replace the cheap_dma flag with a dma_need_sync flagChristoph Hellwig2020-06-301-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Invert the polarity and better name the flag so that the use case is properly documented. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-3-hch@lst.de
| | * | bpf, sockmap: RCU dereferenced psock may be used outside RCU blockJohn Fastabend2020-06-281-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an ingress verdict program specifies message sizes greater than skb->len and there is an ENOMEM error due to memory pressure we may call the rcv_msg handler outside the strp_data_ready() caller context. This is because on an ENOMEM error the strparser will retry from a workqueue. The caller currently protects the use of psock by calling the strp_data_ready() inside a rcu_read_lock/unlock block. But, in above workqueue error case the psock is accessed outside the read_lock/unlock block of the caller. So instead of using psock directly we must do a look up against the sk again to ensure the psock is available. There is an an ugly piece here where we must handle the case where we paused the strp and removed the psock. On psock removal we first pause the strparser and then remove the psock. If the strparser is paused while an skb is scheduled on the workqueue the skb will be dropped on the flow and kfree_skb() is called. If the workqueue manages to get called before we pause the strparser but runs the rcvmsg callback after the psock is removed we will hit the unlikely case where we run the sockmap rcvmsg handler but do not have a psock. For now we will follow strparser logic and drop the skb on the floor with skb_kfree(). This is ugly because the data is dropped. To date this has not caused problems in practice because either the application controlling the sockmap is coordinating with the datapath so that skbs are "flushed" before removal or we simply wait for the sock to be closed before removing it. This patch fixes the describe RCU bug and dropping the skb doesn't make things worse. Future patches will improve this by allowing the normal case where skbs are not merged to skip the strparser altogether. In practice many (most?) use cases have no need to merge skbs so its both a code complexity hit as seen above and a performance issue. For example, in the Cilium case we always set the strparser up to return sbks 1:1 without any merging and have avoided above issues. Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
| | * | bpf, sockmap: RCU splat with redirect and strparser error or TLSJohn Fastabend2020-06-281-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two paths to generate the below RCU splat the first and most obvious is the result of the BPF verdict program issuing a redirect on a TLS socket (This is the splat shown below). Unlike the non-TLS case the caller of the *strp_read() hooks does not wrap the call in a rcu_read_lock/unlock. Then if the BPF program issues a redirect action we hit the RCU splat. However, in the non-TLS socket case the splat appears to be relatively rare, because the skmsg caller into the strp_data_ready() is wrapped in a rcu_read_lock/unlock. Shown here, static void sk_psock_strp_data_ready(struct sock *sk) { struct sk_psock *psock; rcu_read_lock(); psock = sk_psock(sk); if (likely(psock)) { if (tls_sw_has_ctx_rx(sk)) { psock->parser.saved_data_ready(sk); } else { write_lock_bh(&sk->sk_callback_lock); strp_data_ready(&psock->parser.strp); write_unlock_bh(&sk->sk_callback_lock); } } rcu_read_unlock(); } If the above was the only way to run the verdict program we would be safe. But, there is a case where the strparser may throw an ENOMEM error while parsing the skb. This is a result of a failed skb_clone, or alloc_skb_for_msg while building a new merged skb when the msg length needed spans multiple skbs. This will in turn put the skb on the strp_wrk workqueue in the strparser code. The skb will later be dequeued and verdict programs run, but now from a different context without the rcu_read_lock()/unlock() critical section in sk_psock_strp_data_ready() shown above. In practice I have not seen this yet, because as far as I know most users of the verdict programs are also only working on single skbs. In this case no merge happens which could trigger the above ENOMEM errors. In addition the system would need to be under memory pressure. For example, we can't hit the above case in selftests because we missed having tests to merge skbs. (Added in later patch) To fix the below splat extend the rcu_read_lock/unnlock block to include the call to sk_psock_tls_verdict_apply(). This will fix both TLS redirect case and non-TLS redirect+error case. Also remove psock from the sk_psock_tls_verdict_apply() function signature its not used there. [ 1095.937597] WARNING: suspicious RCU usage [ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W [ 1095.944363] ----------------------------- [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage! [ 1095.950866] [ 1095.950866] other info that might help us debug this: [ 1095.950866] [ 1095.957146] [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1 [ 1095.961482] 1 lock held by test_sockmap/15970: [ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls] [ 1095.968568] [ 1095.968568] stack backtrace: [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1 [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 1095.980519] Call Trace: [ 1095.982191] dump_stack+0x8f/0xd0 [ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0 [ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250 [ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls] v2: Improve commit message to identify non-TLS redirect plus error case condition as well as more common TLS case. In the process I decided doing the rcu_read_unlock followed by the lock/unlock inside branches was unnecessarily complex. We can just extend the current rcu block and get the same effeective without the shuffling and branching. Thanks Martin! Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls") Reported-by: Jakub Sitnicki <jakub@cloudflare.com> Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
| * | | net: xfrmi: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld2020-06-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The xfrm interface uses skb->protocol to determine packet type, and bails out if it's not set. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and xfrmi rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: sit: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld2020-06-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sit uses skb->protocol to determine packet type, and bails out if it's not set. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and sit rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: vti: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld2020-06-302-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Vti uses skb->protocol to determine packet type, and bails out if it's not set. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and vti rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: ipip: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld2020-06-302-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ipip uses skb->protocol to determine packet type, and bails out if it's not set. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and ipip rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>