summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds2019-12-2250-274/+583
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso, including adding a missing ipv6 match description. 2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi Bhat. 3) Fix uninit value in bond_neigh_init(), from Eric Dumazet. 4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold. 5) Fix use after free in tipc_disc_rcv(), from Tuong Lien. 6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul Chaignon. 7) Multicast MAC limit test is off by one in qede, from Manish Chopra. 8) Fix established socket lookup race when socket goes from TCP_ESTABLISHED to TCP_LISTEN, because there lacks an intervening RCU grace period. From Eric Dumazet. 9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet. 10) Fix active backup transition after link failure in bonding, from Mahesh Bandewar. 11) Avoid zero sized hash table in gtp driver, from Taehee Yoo. 12) Fix wrong interface passed to ->mac_link_up(), from Russell King. 13) Fix DSA egress flooding settings in b53, from Florian Fainelli. 14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost. 15) Fix double free in dpaa2-ptp code, from Ioana Ciornei. 16) Reject invalid MTU values in stmmac, from Jose Abreu. 17) Fix refcount leak in error path of u32 classifier, from Davide Caratti. 18) Fix regression causing iwlwifi firmware crashes on boot, from Anders Kaseorg. 19) Fix inverted return value logic in llc2 code, from Chan Shu Tak. 20) Disable hardware GRO when XDP is attached to qede, frm Manish Chopra. 21) Since we encode state in the low pointer bits, dst metrics must be at least 4 byte aligned, which is not necessarily true on m68k. Add annotations to fix this, from Geert Uytterhoeven. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits) sfc: Include XDP packet headroom in buffer step size. sfc: fix channel allocation with brute force net: dst: Force 4-byte alignment of dst_metrics selftests: pmtu: fix init mtu value in description hv_netvsc: Fix unwanted rx_table reset net: phy: ensure that phy IDs are correctly typed mod_devicetable: fix PHY module format qede: Disable hardware gro when xdp prog is installed net: ena: fix issues in setting interrupt moderation params in ethtool net: ena: fix default tx interrupt moderation interval net/smc: unregister ib devices in reboot_event net: stmmac: platform: Fix MDIO init for platforms without PHY llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c) net: hisilicon: Fix a BUG trigered by wrong bytes_compl net: dsa: ksz: use common define for tag len s390/qeth: don't return -ENOTSUPP to userspace s390/qeth: fix promiscuous mode after reset s390/qeth: handle error due to unsupported transport mode cxgb4: fix refcount init for TC-MQPRIO offload tc-testing: initial tdc selftests for cls_u32 ...
| * net/smc: unregister ib devices in reboot_eventKarsten Graul2019-12-201-1/+1
| | | | | | | | | | | | | | | | | | | | In the reboot_event handler, unregister the ib devices and enable the IB layer to release the devices before the reboot. Fixes: a33a803cfe64 ("net/smc: guarantee removal of link groups in reboot") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)Chan Shu Tak, Alex2019-12-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a frame with NULL DSAP is received, llc_station_rcv is called. In turn, llc_stat_ev_rx_null_dsap_xid_c is called to check if it is a NULL XID frame. The return statement of llc_stat_ev_rx_null_dsap_xid_c returns 1 when the incoming frame is not a NULL XID frame and 0 otherwise. Hence, a NULL XID response is returned unexpectedly, e.g. when the incoming frame is a NULL TEST command. To fix the error, simply remove the conditional operator. A similar error in llc_stat_ev_rx_null_dsap_test_c is also fixed. Signed-off-by: Chan Shu Tak, Alex <alexchan@task.com.hk> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: dsa: ksz: use common define for tag lenMichael Grzeschik2019-12-201-5/+3
| | | | | | | | | | | | | | | | | | Remove special taglen define KSZ8795_INGRESS_TAG_LEN and use generic KSZ_INGRESS_TAG_LEN instead. Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/sched: cls_u32: fix refcount leak in the error path of u32_change()Davide Caratti2019-12-191-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when users replace cls_u32 filters with new ones having wrong parameters, so that u32_change() fails to validate them, the kernel doesn't roll-back correctly, and leaves semi-configured rules. Fix this in u32_walk(), avoiding a call to the walker function on filters that don't have a match rule connected. The side effect is, these "empty" filters are not even dumped when present; but that shouldn't be a problem as long as we are restoring the original behaviour, where semi-configured filters were not even added in the error path of u32_change(). Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2019-12-193-8/+17
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2019-12-19 The following pull-request contains BPF updates for your *net* tree. We've added 10 non-merge commits during the last 8 day(s) which contain a total of 21 files changed, 269 insertions(+), 108 deletions(-). The main changes are: 1) Fix lack of synchronization between xsk wakeup and destroying resources used by xsk wakeup, from Maxim Mikityanskiy. 2) Fix pruning with tail call patching, untrack programs in case of verifier error and fix a cgroup local storage tracking bug, from Daniel Borkmann. 3) Fix clearing skb->tstamp in bpf_redirect() when going from ingress to egress which otherwise cause issues e.g. on fq qdisc, from Lorenz Bauer. 4) Fix compile warning of unused proc_dointvec_minmax_bpf_restricted() when only cBPF is present, from Alexander Lobakin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * net, sysctl: Fix compiler warning when only cBPF is presentAlexander Lobakin2019-12-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | proc_dointvec_minmax_bpf_restricted() has been firstly introduced in commit 2e4a30983b0f ("bpf: restrict access to core bpf sysctls") under CONFIG_HAVE_EBPF_JIT. Then, this ifdef has been removed in ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations"), because a new sysctl, bpf_jit_limit, made use of it. Finally, this parameter has become long instead of integer with fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K") and thus, a new proc_dolongvec_minmax_bpf_restricted() has been added. With this last change, we got back to that proc_dointvec_minmax_bpf_restricted() is used only under CONFIG_HAVE_EBPF_JIT, but the corresponding ifdef has not been brought back. So, in configurations like CONFIG_BPF_JIT=y && CONFIG_HAVE_EBPF_JIT=n since v4.20 we have: CC net/core/sysctl_net_core.o net/core/sysctl_net_core.c:292:1: warning: ‘proc_dointvec_minmax_bpf_restricted’ defined but not used [-Wunused-function] 292 | proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Suppress this by guarding it with CONFIG_HAVE_EBPF_JIT again. Fixes: fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K") Signed-off-by: Alexander Lobakin <alobakin@dlink.ru> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191218091821.7080-1-alobakin@dlink.ru
| | * xsk: Add rcu_read_lock around the XSK wakeupMaxim Mikityanskiy2019-12-191-8/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The XSK wakeup callback in drivers makes some sanity checks before triggering NAPI. However, some configuration changes may occur during this function that affect the result of those checks. For example, the interface can go down, and all the resources will be destroyed after the checks in the wakeup function, but before it attempts to use these resources. Wrap this callback in rcu_read_lock to allow driver to synchronize_rcu before actually destroying the resources. xsk_wakeup is a new function that encapsulates calling ndo_xsk_wakeup wrapped into the RCU lock. After this commit, xsk_poll starts using xsk_wakeup and checks xs->zc instead of ndo_xsk_wakeup != NULL to decide ndo_xsk_wakeup should be called. It also fixes a bug introduced with the need_wakeup feature: a non-zero-copy socket may be used with a driver supporting zero-copy, and in this case ndo_xsk_wakeup should not be called, so the xs->zc check is the correct one. Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191217162023.16011-2-maximmi@mellanox.com
| | * bpf: Clear skb->tstamp in bpf_redirect when necessaryLorenz Bauer2019-12-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Redirecting a packet from ingress to egress by using bpf_redirect breaks if the egress interface has an fq qdisc installed. This is the same problem as fixed in 'commit 8203e2d844d3 ("net: clear skb->tstamp in forwarding paths") Clear skb->tstamp when redirecting into the egress path. Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.") Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20191213180817.2510-1-lmb@cloudflare.com
| * | net: nfc: nci: fix a possible sleep-in-atomic-context bug in ↵Jia-Ju Bai2019-12-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nci_uart_tty_receive() The kernel may sleep while holding a spinlock. The function call path (from bottom to top) in Linux 4.19 is: net/nfc/nci/uart.c, 349: nci_skb_alloc in nci_uart_default_recv_buf net/nfc/nci/uart.c, 255: (FUNC_PTR)nci_uart_default_recv_buf in nci_uart_tty_receive net/nfc/nci/uart.c, 254: spin_lock in nci_uart_tty_receive nci_skb_alloc(GFP_KERNEL) can sleep at runtime. (FUNC_PTR) means a function pointer is called. To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC for nci_skb_alloc(). This bug is found by a static analysis tool STCheck written by myself. Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net-sysfs: Call dev_hold always in rx_queue_add_kobjectJouni Hogander2019-12-171-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dev_hold has to be called always in rx_queue_add_kobject. Otherwise usage count drops below 0 in case of failure in kobject_init_and_add. Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject") Reported-by: syzbot <syzbot+30209ea299c09d8785c9@syzkaller.appspotmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Miller <davem@davemloft.net> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Jouni Hogander <jouni.hogander@unikie.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: dsa: make unexported dsa_link_touch() staticBen Dooks (Codethink)2019-12-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dsa_link_touch() is not exported, or defined outside of the file it is in so make it static to avoid the following warning: net/dsa/dsa2.c:127:17: warning: symbol 'dsa_link_touch' was not declared. Should it be static? Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: annotate lockless accesses to sk->sk_pacing_shiftEric Dumazet2019-12-173-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sk->sk_pacing_shift can be read and written without lock synchronization. This patch adds annotations to document this fact and avoid future syzbot complains. This might also avoid unexpected false sharing in sk_pacing_shift_update(), as the compiler could remove the conditional check and always write over sk->sk_pacing_shift : if (sk->sk_pacing_shift != val) sk->sk_pacing_shift = val; Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sctp: fix memleak on err handling of stream initializationMarcelo Ricardo Leitner2019-12-171-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot reported a memory leak when an allocation fails within genradix_prealloc() for output streams. That's because genradix_prealloc() leaves initialized members initialized when the issue happens and SCTP stack will abort the current initialization but without cleaning up such members. The fix here is to always call genradix_free() when genradix_prealloc() fails, for output and also input streams, as it suffers from the same issue. Reported-by: syzbot+772d9e36c490b18d51d1@syzkaller.appspotmail.com Fixes: 2075e50caf5e ("sctp: convert to genradix") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Tested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge tag 'mac80211-for-net-2019-10-16' of ↵David S. Miller2019-12-168-27/+80
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== A handful of fixes: * disable AQL on most drivers, addressing the iwlwifi issues * fix double-free on network namespace changes * fix TID field in frames injected through monitor interfaces * fix ieee80211_calc_rx_airtime() * fix NULL pointer dereference in rfkill (and remove BUG_ON) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | rfkill: Fix incorrect check to avoid NULL pointer dereferenceAditya Pakki2019-12-161-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In rfkill_register, the struct rfkill pointer is first derefernced and then checked for NULL. This patch removes the BUG_ON and returns an error to the caller in case rfkill is NULL. Signed-off-by: Aditya Pakki <pakki001@umn.edu> Link: https://lore.kernel.org/r/20191215153409.21696-1-pakki001@umn.edu Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | mac80211: Turn AQL into an NL80211_EXT_FEATUREToke Høiland-Jørgensen2019-12-135-24/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of just having an airtime flag in debugfs, turn AQL into a proper NL80211_EXT_FEATURE, so drivers can turn it on when they are ready, and so we also expose the presence of the feature to userspace. This also has the effect of flipping the default, so drivers have to opt in to using AQL instead of getting it by default with TXQs. To keep functionality the same as pre-patch, we set this feature for ath10k (which is where it is needed the most). While we're at it, split out the debugfs interface so AQL gets its own per-station debugfs file instead of using the 'airtime' file. [Johannes:] This effectively disables AQL for iwlwifi, where it fixes a number of issues: * TSO in iwlwifi is causing underflows and associated warnings in AQL * HE (802.11ax) rates aren't reported properly so at HE rates, AQL could never have a valid estimate (it'd use 6 Mbps instead of up to 2400!) Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20191212111437.224294-1-toke@redhat.com Fixes: 3ace10f5b5ad ("mac80211: Implement Airtime-based Queue Limit (AQL)") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | mac80211: airtime: Fix an off by one in ieee80211_calc_rx_airtime()Dan Carpenter2019-12-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code was copied from mt76 and inherited an off by one bug from there. The > should be >= so that we don't read one element beyond the end of the array. Fixes: db3e1c40cf2f ("mac80211: Import airtime calculation code from mt76") Reported-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20191126120910.ftr4t7me3by32aiz@kili.mountain Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | cfg80211: fix double-free after changing network namespaceStefan Bühler2019-12-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If wdev->wext.keys was initialized it didn't get reset to NULL on unregister (and it doesn't get set in cfg80211_init_wdev either), but wdev is reused if unregister was triggered through cfg80211_switch_netns. The next unregister (for whatever reason) will try to free wdev->wext.keys again. Signed-off-by: Stefan Bühler <source@stbuehler.de> Link: https://lore.kernel.org/r/20191126100543.782023-1-stefan.buehler@tik.uni-stuttgart.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| | * | mac80211: fix TID field in monitor mode transmitFredrik Olofsson2019-12-131-0/+9
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix overwriting of the qos_ctrl.tid field for encrypted frames injected on a monitor interface. While qos_ctrl.tid is not encrypted, it's used as an input into the encryption algorithm so it's protected, and thus cannot be modified after encryption. For injected frames, the encryption may already have been done in userspace, so we cannot change any fields. Before passing the frame to the driver, the qos_ctrl.tid field is updated from skb->priority. Prior to dbd50a851c50 skb->priority was updated in ieee80211_select_queue_80211(), but this function is no longer always called. Update skb->priority in ieee80211_monitor_start_xmit() so that the value is stored, and when later code 'modifies' the TID it really sets it to the same value as before, preserving the encryption. Fixes: dbd50a851c50 ("mac80211: only allocate one queue when using iTXQs") Signed-off-by: Fredrik Olofsson <fredrik.olofsson@anyfinetworks.com> Link: https://lore.kernel.org/r/20191119133451.14711-1-fredrik.olofsson@anyfinetworks.com [rewrite commit message based on our discussion] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * | vsock/virtio: add WARN_ON check on virtio_transport_get_ops()Stefano Garzarella2019-12-161-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | virtio_transport_get_ops() and virtio_transport_send_pkt_info() can only be used on connecting/connected sockets, since a socket assigned to a transport is required. This patch adds a WARN_ON() on virtio_transport_get_ops() to check this requirement, a comment and a returned error on virtio_transport_send_pkt_info(), Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | vsock/virtio: fix null-pointer dereference in virtio_transport_recv_listen()Stefano Garzarella2019-12-161-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With multi-transport support, listener sockets are not bound to any transport. So, calling virtio_transport_reset(), when an error occurs, on a listener socket produces the following null-pointer dereference: BUG: kernel NULL pointer dereference, address: 00000000000000e8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 20 Comm: kworker/0:1 Not tainted 5.5.0-rc1-ste-00003-gb4be21f316ac-dirty #56 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 Workqueue: virtio_vsock virtio_transport_rx_work [vmw_vsock_virtio_transport] RIP: 0010:virtio_transport_send_pkt_info+0x20/0x130 [vmw_vsock_virtio_transport_common] Code: 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 e5 41 57 41 56 41 55 49 89 f5 41 54 49 89 fc 53 48 83 ec 10 44 8b 76 20 e8 c0 ba fe ff <48> 8b 80 e8 00 00 00 e8 64 e3 7d c1 45 8b 45 00 41 8b 8c 24 d4 02 RSP: 0018:ffffc900000b7d08 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff88807bf12728 RCX: 0000000000000000 RDX: ffff88807bf12700 RSI: ffffc900000b7d50 RDI: ffff888035c84000 RBP: ffffc900000b7d40 R08: ffff888035c84000 R09: ffffc900000b7d08 R10: ffff8880781de800 R11: 0000000000000018 R12: ffff888035c84000 R13: ffffc900000b7d50 R14: 0000000000000000 R15: ffff88807bf12724 FS: 0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000e8 CR3: 00000000790f4004 CR4: 0000000000160ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: virtio_transport_reset+0x59/0x70 [vmw_vsock_virtio_transport_common] virtio_transport_recv_pkt+0x5bb/0xe50 [vmw_vsock_virtio_transport_common] ? detach_buf_split+0xf1/0x130 virtio_transport_rx_work+0xba/0x130 [vmw_vsock_virtio_transport] process_one_work+0x1c0/0x300 worker_thread+0x45/0x3c0 kthread+0xfc/0x130 ? current_work+0x40/0x40 ? kthread_park+0x90/0x90 ret_from_fork+0x35/0x40 Modules linked in: sunrpc kvm_intel kvm vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common irqbypass vsock virtio_rng rng_core CR2: 00000000000000e8 ---[ end trace e75400e2ea2fa824 ]--- This happens because virtio_transport_reset() calls virtio_transport_send_pkt_info() that can be used only on connecting/connected sockets. This patch fixes the issue, using virtio_transport_reset_no_sock() instead of virtio_transport_reset() when we are handling a listener socket. Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/smc: add fallback check to connect()Ursula Braun2019-12-151-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FASTOPEN setsockopt() or sendmsg() may switch the SMC socket to fallback mode. Once fallback mode is active, the native TCP socket functions are called. Nevertheless there is a small race window, when FASTOPEN setsockopt/sendmsg runs in parallel to a connect(), and switch the socket into fallback mode before connect() takes the sock lock. Make sure the SMC-specific connect setup is omitted in this case. This way a syzbot-reported refcount problem is fixed, triggered by different threads running non-blocking connect() and FASTOPEN_KEY setsockopt. Reported-by: syzbot+96d3f9ff6a86d37e44c8@syzkaller.appspotmail.com Fixes: 6d6dd528d5af ("net/smc: fix refcount non-blocking connect() -part 2") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
| * | tcp: refine rule to allow EPOLLOUT generation under mem pressureEric Dumazet2019-12-131-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the time commit ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty") was added to the kernel, we still had a single write queue, combining rtx and write queues. Once we moved the rtx queue into a separate rb-tree, testing if sk_write_queue is empty has been suboptimal. Indeed, if we have packets in the rtx queue, we probably want to delay the EPOLLOUT generation at the time incoming packets will free them, making room, but more importantly avoiding flooding application with EPOLLOUT events. Solution is to use tcp_rtx_and_write_queues_empty() helper. Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
| * | tcp: refine tcp_write_queue_empty() implementationEric Dumazet2019-12-131-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to how tcp_sendmsg() is implemented, we can have an empty skb at the tail of the write queue. Most [1] tcp_write_queue_empty() callers want to know if there is anything to send (payload and/or FIN) Instead of checking if the sk_write_queue is empty, we need to test if tp->write_seq == tp->snd_nxt [1] tcp_send_fin() was the only caller that expected to see if an skb was in the write queue, I have changed the code to reuse the tcp_write_queue_tail() result. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
| * | tcp: do not send empty skb from tcp_write_xmit()Eric Dumazet2019-12-131-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases") in linux-4.14 stable triggered various bugs. One of them has been fixed in commit ba2ddb43f270 ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but we still have crashes in some occasions. Root-cause is that when tcp_sendmsg() has allocated a fresh skb and could not append a fragment before being blocked in sk_stream_wait_memory(), tcp_write_xmit() might be called and decide to send this fresh and empty skb. Sending an empty packet is not only silly, it might have caused many issues we had in the past with tp->packets_out being out of sync. Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Christoph Paasch <cpaasch@apple.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Jason Baron <jbaron@akamai.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
| * | tcp/dccp: fix possible race __inet_lookup_established()Eric Dumazet2019-12-133-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Michal Kubecek and Firo Yang did a very nice analysis of crashes happening in __inet_lookup_established(). Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN (via a close()/socket()/listen() cycle) without a RCU grace period, I should not have changed listeners linkage in their hash table. They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt), so that a lookup can detect a socket in a hash list was moved in another one. Since we added code in commit d296ba60d8e2 ("soreuseport: Resolve merge conflict for v4/v6 ordering fix"), we have to add hlist_nulls_add_tail_rcu() helper. Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Michal Kubecek <mkubecek@suse.cz> Reported-by: Firo Yang <firo.yang@suse.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/ Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
| * | ipv6/addrconf: only check invalid header values when NETLINK_F_STRICT_CHK is setHangbin Liu2019-12-131-4/+4
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 4b1373de73a3 ("net: ipv6: addr: perform strict checks also for doit handlers") we add strict check for inet6_rtm_getaddr(). But we did the invalid header values check before checking if NETLINK_F_STRICT_CHK is set. This may break backwards compatibility if user already set the ifm->ifa_prefixlen, ifm->ifa_flags, ifm->ifa_scope in their netlink code. I didn't move the nlmsg_len check because I thought it's a valid check. Reported-by: Jianlin Shi <jishi@redhat.com> Fixes: 4b1373de73a3 ("net: ipv6: addr: perform strict checks also for doit handlers") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
| * tipc: fix use-after-free in tipc_disc_rcv()Tuong Lien2019-12-101-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In the function 'tipc_disc_rcv()', the 'msg_peer_net_hash()' is called to read the header data field but after the message skb has been freed, that might result in a garbage value... This commit fixes it by defining a new local variable to store the data first, just like the other header fields' handling. Fixes: f73b12812a3d ("tipc: improve throughput between nodes in netns") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix retrans failure due to wrong destinationTuong Lien2019-12-101-14/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a user message is sent, TIPC will check if the socket has faced a congestion at link layer. If that happens, it will make a sleep to wait for the congestion to disappear. This leaves a gap for other users to take over the socket (e.g. multi threads) since the socket is released as well. Also, in case of connectionless (e.g. SOCK_RDM), user is free to send messages to various destinations (e.g. via 'sendto()'), then the socket's preformatted header has to be updated correspondingly prior to the actual payload message building. Unfortunately, the latter action is done before the first action which causes a condition issue that the destination of a certain message can be modified incorrectly in the middle, leading to wrong destination when that message is built. Consequently, when the message is sent to the link layer, it gets stuck there forever because the peer node will simply reject it. After a number of retransmission attempts, the link is eventually taken down and the retransmission failure is reported. This commit fixes the problem by rearranging the order of actions to prevent the race condition from occurring, so the message building is 'atomic' and its header will not be modified by anyone. Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix potential hanging after b/rcast changingTuong Lien2019-12-101-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit c55c8edafa91 ("tipc: smooth change between replicast and broadcast"), we allow instant switching between replicast and broadcast by sending a dummy 'SYN' packet on the last used link to synchronize packets on the links. The 'SYN' message is an object of link congestion also, so if that happens, a 'SOCK_WAKEUP' will be scheduled to be sent back to the socket... However, in that commit, we simply use the same socket 'cong_link_cnt' counter for both the 'SYN' & normal payload message sending. Therefore, if both the replicast & broadcast links are congested, the counter will be not updated correctly but overwritten by the latter congestion. Later on, when the 'SOCK_WAKEUP' messages are processed, the counter is reduced one by one and eventually overflowed. Consequently, further activities on the socket will only wait for the false congestion signal to disappear but never been met. Because sending the 'SYN' message is vital for the mechanism, it should be done anyway. This commit fixes the issue by marking the message with an error code e.g. 'TIPC_ERR_NO_PORT', so its sending should not face a link congestion, there is no need to touch the socket 'cong_link_cnt' either. In addition, in the event of any error (e.g. -ENOBUFS), we will purge the entire payload message queue and make a return immediately. Fixes: c55c8edafa91 ("tipc: smooth change between replicast and broadcast") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix name table rbtree issuesTuong Lien2019-12-101-100/+179
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current rbtree for service ranges in the name table is built based on the 'lower' & 'upper' range values resulting in a flaw in the rbtree searching. Some issues have been observed in case of range overlapping: Case #1: unable to withdraw a name entry: After some name services are bound, all of them are withdrawn by user but one remains in the name table forever. This corrupts the table and that service becomes dummy i.e. no real port. E.g. / {22, 22} / / ---> {10, 50} / \ / \ {10, 30} {20, 60} The node {10, 30} cannot be removed since the rbtree searching stops at the node's ancestor i.e. {10, 50}, so starting from it will never reach the finding node. Case #2: failed to send data in some cases: E.g. Two service ranges: {20, 60}, {10, 50} are bound. The rbtree for this service will be one of the two cases below depending on the order of the bindings: {20, 60} {10, 50} <-- / \ / \ / \ / \ {10, 50} NIL <-- NIL {20, 60} (a) (b) Now, try to send some data to service {30}, there will be two results: (a): Failed, no route to host. (b): Ok. The reason is that the rbtree searching will stop at the pointing node as shown above. Case #3: Same as case #2b above but if the data sending's scope is local and the {10, 50} is published by a peer node, then it will result in 'no route to host' even though the other {20, 60} is for example on the local node which should be able to get the data. The issues are actually due to the way we built the rbtree. This commit fixes it by introducing an additional field to each node - named 'max', which is the largest 'upper' of that node subtree. The 'max' value for each subtrees will be propagated correctly whenever a node is inserted/ removed or the tree is rebalanced by the augmented rbtree callbacks. By this way, we can change the rbtree searching appoarch to solve the issues above. Another benefit from this is that we can now improve the searching for a next range matching e.g. in case of multicast, so get rid of the unneeded looping over all nodes in the tree. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| * af_packet: set defaule value for tmoMao Wenan2019-12-091-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is softlockup when using TPACKET_V3: ... NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms! (__irq_svc) from [<c0558a0c>] (_raw_spin_unlock_irqrestore+0x44/0x54) (_raw_spin_unlock_irqrestore) from [<c027b7e8>] (mod_timer+0x210/0x25c) (mod_timer) from [<c0549c30>] (prb_retire_rx_blk_timer_expired+0x68/0x11c) (prb_retire_rx_blk_timer_expired) from [<c027a7ac>] (call_timer_fn+0x90/0x17c) (call_timer_fn) from [<c027ab6c>] (run_timer_softirq+0x2d4/0x2fc) (run_timer_softirq) from [<c021eaf4>] (__do_softirq+0x218/0x318) (__do_softirq) from [<c021eea0>] (irq_exit+0x88/0xac) (irq_exit) from [<c0240130>] (msa_irq_exit+0x11c/0x1d4) (msa_irq_exit) from [<c0209cf0>] (handle_IPI+0x650/0x7f4) (handle_IPI) from [<c02015bc>] (gic_handle_irq+0x108/0x118) (gic_handle_irq) from [<c0558ee4>] (__irq_usr+0x44/0x5c) ... If __ethtool_get_link_ksettings() is failed in prb_calc_retire_blk_tmo(), msec and tmo will be zero, so tov_in_jiffies is zero and the timer expire for retire_blk_timer is turn to mod_timer(&pkc->retire_blk_timer, jiffies + 0), which will trigger cpu usage of softirq is 100%. Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") Tested-by: Xiao Jiangfeng <xiaojiangfeng@huawei.com> Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2019-12-0911-54/+109
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Wait for rcu grace period after releasing netns in ctnetlink, from Florian Westphal. 2) Incorrect command type in flowtable offload ndo invocation, from wenxu. 3) Incorrect callback type in flowtable offload flow tuple updates, also from wenxu. 4) Fix compile warning on flowtable offload infrastructure due to possible reference to uninitialized variable, from Nathan Chancellor. 5) Do not inline nf_ct_resolve_clash(), this is called from slow path / stress situations. From Florian Westphal. 6) Missing IPv6 flow selector description in flowtable offload. 7) Missing check for NETDEV_UNREGISTER in nf_tables offload infrastructure, from wenxu. 8) Update NAT selftest to use randomized netns names, from Florian Westphal. 9) Restore nfqueue bridge support, from Marco Oliverio. 10) Compilation warning in SCTP_CHUNKMAP_*() on xt_sctp header. From Phil Sutter. 11) Fix bogus lookup/get match for non-anonymous rbtree sets. 12) Missing netlink validation for NFT_SET_ELEM_INTERVAL_END elements. 13) Missing netlink validation for NFT_DATA_VALUE after nft_data_init(). 14) If rule specifies no actions, offload infrastructure returns EOPNOTSUPP. 15) Module refcount leak in object updates. 16) Missing sanitization for ARP traffic from br_netfilter, from Eric Dumazet. 17) Compilation breakage on big-endian due to incorrect memcpy() size in the flowtable offload infrastructure. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * netfilter: nf_flow_table_offload: Correct memcpy size for flow_overload_mangle()Pablo Neira Ayuso2019-12-091-31/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In function 'memcpy', inlined from 'flow_offload_mangle' at net/netfilter/nf_flow_table_offload.c:112:2, inlined from 'flow_offload_port_dnat' at net/netfilter/nf_flow_table_offload.c:373:2, inlined from 'nf_flow_rule_route_ipv4' at net/netfilter/nf_flow_table_offload.c:424:3: ./include/linux/string.h:376:4: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter 376 | __read_overflow2(); | ^~~~~~~~~~~~~~~~~~ The original u8* was done in the hope to make this more adaptable but consensus is to keep this like it is in tc pedit. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Reported-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: bridge: make sure to pull arp header in br_nf_forward_arp()Eric Dumazet2019-12-091-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot is kind enough to remind us we need to call skb_may_pull() BUG: KMSAN: uninit-value in br_nf_forward_arp+0xe61/0x1230 net/bridge/br_netfilter_hooks.c:665 CPU: 1 PID: 11631 Comm: syz-executor.1 Not tainted 5.4.0-rc8-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x220 lib/dump_stack.c:118 kmsan_report+0x128/0x220 mm/kmsan/kmsan_report.c:108 __msan_warning+0x64/0xc0 mm/kmsan/kmsan_instr.c:245 br_nf_forward_arp+0xe61/0x1230 net/bridge/br_netfilter_hooks.c:665 nf_hook_entry_hookfn include/linux/netfilter.h:135 [inline] nf_hook_slow+0x18b/0x3f0 net/netfilter/core.c:512 nf_hook include/linux/netfilter.h:260 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] __br_forward+0x78f/0xe30 net/bridge/br_forward.c:109 br_flood+0xef0/0xfe0 net/bridge/br_forward.c:234 br_handle_frame_finish+0x1a77/0x1c20 net/bridge/br_input.c:162 nf_hook_bridge_pre net/bridge/br_input.c:245 [inline] br_handle_frame+0xfb6/0x1eb0 net/bridge/br_input.c:348 __netif_receive_skb_core+0x20b9/0x51a0 net/core/dev.c:4830 __netif_receive_skb_one_core net/core/dev.c:4927 [inline] __netif_receive_skb net/core/dev.c:5043 [inline] process_backlog+0x610/0x13c0 net/core/dev.c:5874 napi_poll net/core/dev.c:6311 [inline] net_rx_action+0x7a6/0x1aa0 net/core/dev.c:6379 __do_softirq+0x4a1/0x83a kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1091 </IRQ> do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x184/0x1d0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:688 [inline] __dev_queue_xmit+0x38e8/0x4200 net/core/dev.c:3819 dev_queue_xmit+0x4b/0x60 net/core/dev.c:3825 packet_snd net/packet/af_packet.c:2959 [inline] packet_sendmsg+0x8234/0x9100 net/packet/af_packet.c:2984 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg net/socket.c:657 [inline] __sys_sendto+0xc44/0xc70 net/socket.c:1952 __do_sys_sendto net/socket.c:1964 [inline] __se_sys_sendto+0x107/0x130 net/socket.c:1960 __x64_sys_sendto+0x6e/0x90 net/socket.c:1960 do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45a679 Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f0a3c9e5c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 000000000045a679 RDX: 000000000000000e RSI: 0000000020000200 RDI: 0000000000000003 RBP: 000000000075bf20 R08: 00000000200000c0 R09: 0000000000000014 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0a3c9e66d4 R13: 00000000004c8ec1 R14: 00000000004dfe28 R15: 00000000ffffffff Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:149 [inline] kmsan_internal_poison_shadow+0x5c/0x110 mm/kmsan/kmsan.c:132 kmsan_slab_alloc+0x97/0x100 mm/kmsan/kmsan_hooks.c:86 slab_alloc_node mm/slub.c:2773 [inline] __kmalloc_node_track_caller+0xe27/0x11a0 mm/slub.c:4381 __kmalloc_reserve net/core/skbuff.c:141 [inline] __alloc_skb+0x306/0xa10 net/core/skbuff.c:209 alloc_skb include/linux/skbuff.h:1049 [inline] alloc_skb_with_frags+0x18c/0xa80 net/core/skbuff.c:5662 sock_alloc_send_pskb+0xafd/0x10a0 net/core/sock.c:2244 packet_alloc_skb net/packet/af_packet.c:2807 [inline] packet_snd net/packet/af_packet.c:2902 [inline] packet_sendmsg+0x63a6/0x9100 net/packet/af_packet.c:2984 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg net/socket.c:657 [inline] __sys_sendto+0xc44/0xc70 net/socket.c:1952 __do_sys_sendto net/socket.c:1964 [inline] __se_sys_sendto+0x107/0x130 net/socket.c:1960 __x64_sys_sendto+0x6e/0x90 net/socket.c:1960 do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: c4e70a87d975 ("netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables_offload: return EOPNOTSUPP if rule specifies no actionsPablo Neira Ayuso2019-12-091-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | If the rule only specifies the matching side, return EOPNOTSUPP. Otherwise, the front-end relies on the drivers to reject this rule. Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: skip module reference count bump on object updatesPablo Neira Ayuso2019-12-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Use __nft_obj_type_get() instead, otherwise there is a module reference counter leak. Fixes: d62d0ba97b58 ("netfilter: nf_tables: Introduce stateful object update operation") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: validate NFT_DATA_VALUE after nft_data_init()Pablo Neira Ayuso2019-12-094-3/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Userspace might bogusly sent NFT_DATA_VERDICT in several netlink attributes that assume NFT_DATA_VALUE. Moreover, make sure that error path invokes nft_data_release() to decrement the reference count on the chain object. Fixes: 96518518cc41 ("netfilter: add nftables") Fixes: 0f3cd9b36977 ("netfilter: nf_tables: add range expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables: validate NFT_SET_ELEM_INTERVAL_ENDPablo Neira Ayuso2019-12-091-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | Only NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_FLAGS make sense for elements whose NFT_SET_ELEM_INTERVAL_END flag is set on. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nft_set_rbtree: bogus lookup/get on consecutive elements in named ↵Pablo Neira Ayuso2019-12-091-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sets The existing rbtree implementation might store consecutive elements where the closing element and the opening element might overlap, eg. [ a, a+1) [ a+1, a+2) This patch removes the optimization for non-anonymous sets in the exact matching case, where it is assumed to stop searching in case that the closing element is found. Instead, invalidate candidate interval and keep looking further in the tree. The lookup/get operation might return false, while there is an element in the rbtree. Moreover, the get operation returns true as if a+2 would be in the tree. This happens with named sets after several set updates. The existing lookup optimization (that only works for the anonymous sets) might not reach the opening [ a+1,... element if the closing ...,a+1) is found in first place when walking over the rbtree. Hence, walking the full tree in that case is needed. This patch fixes the lookup and get operations. Fixes: e701001e7cbe ("netfilter: nft_rbtree: allow adjacent intervals with dynamic updates") Fixes: ba0e4d9917b4 ("netfilter: nf_tables: get set elements via netlink") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_queue: enqueue skbs with NULL dstMarco Oliverio2019-12-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bridge packets that are forwarded have skb->dst == NULL and get dropped by the check introduced by b60a77386b1d4868f72f6353d35dabe5fbe981f2 (net: make skb_dst_force return true when dst is refcounted). To fix this we check skb_dst() before skb_dst_force(), so we don't drop skb packet with dst == NULL. This holds also for skb at the PRE_ROUTING hook so we remove the second check. Fixes: b60a77386b1d ("net: make skb_dst_force return true when dst is refcounted") Signed-off-by: Marco Oliverio <marco.oliverio@tanaza.com> Signed-off-by: Rocco Folino <rocco.folino@tanaza.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_tables_offload: Check for the NETDEV_UNREGISTER eventwenxu2019-12-021-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Check for the NETDEV_UNREGISTER event from the nft_offload_netdev_event function, which is the event that actually triggers the clean up. Fixes: 06d392cbe3db ("netfilter: nf_tables_offload: remove rules when the device unregisters") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_flow_table_offload: add IPv6 match descriptionPablo Neira Ayuso2019-11-301-1/+11
| | | | | | | | | | | | | | | | | | | | | Add missing IPv6 matching description to flow_rule object. Fixes: 5c27d8d76ce8 ("netfilter: nf_flow_table_offload: add IPv6 support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: conntrack: tell compiler to not inline nf_ct_resolve_clashFlorian Westphal2019-11-301-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At this time compiler inlines it, but this code will not be executed under normal conditions. Also, no inlining allows to use "nf_ct_resolve_clash%return" perf probe. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_flow_table_offload: Don't use offset uninitialized in ↵Nathan Chancellor2019-11-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | flow_offload_port_{d,s}nat Clang warns (trimmed the second warning for brevity): ../net/netfilter/nf_flow_table_offload.c:342:2: warning: variable 'offset' is used uninitialized whenever switch default is taken [-Wsometimes-uninitialized] default: ^~~~~~~ ../net/netfilter/nf_flow_table_offload.c:346:57: note: uninitialized use occurs here flow_offload_mangle(entry, flow_offload_l4proto(flow), offset, ^~~~~~ ../net/netfilter/nf_flow_table_offload.c:331:12: note: initialize the variable 'offset' to silence this warning u32 offset; ^ = 0 Match what was done in the flow_offload_ipv{4,6}_{d,s}nat functions and just return in the default case, since port would also be uninitialized. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Link: https://github.com/ClangBuiltLinux/linux/issues/780 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: kernelci.org bot <bot@kernelci.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_flow_table_offload: Fix block_cb tc_setup_type as ↵wenxu2019-11-301-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TC_SETUP_CLSFLOWER Add/del/stats flows through block_cb call must set the tc_setup_type as TC_SETUP_CLSFLOWER. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: nf_flow_table_offload: Fix block setup as TC_SETUP_FT cmdwenxu2019-11-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Set up block through TC_SETUP_FT command. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * netfilter: ctnetlink: netns exit must wait for callbacksFlorian Westphal2019-11-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Curtis Taylor and Jon Maxwell reported and debugged a crash on 3.10 based kernel. Crash occurs in ctnetlink_conntrack_events because net->nfnl socket is NULL. The nfnl socket was set to NULL by netns destruction running on another cpu. The exiting network namespace calls the relevant destructors in the following order: 1. ctnetlink_net_exit_batch This nulls out the event callback pointer in struct netns. 2. nfnetlink_net_exit_batch This nulls net->nfnl socket and frees it. 3. nf_conntrack_cleanup_net_list This removes all remaining conntrack entries. This is order is correct. The only explanation for the crash so ar is: cpu1: conntrack is dying, eviction occurs: -> nf_ct_delete() -> nf_conntrack_event_report \ -> nf_conntrack_eventmask_report -> notify->fcn() (== ctnetlink_conntrack_events). cpu1: a. fetches rcu protected pointer to obtain ctnetlink event callback. b. gets interrupted. cpu2: runs netns exit handlers: a runs ctnetlink destructor, event cb pointer set to NULL. b runs nfnetlink destructor, nfnl socket is closed and set to NULL. cpu1: c. resumes and trips over NULL net->nfnl. Problem appears to be that ctnetlink_net_exit_batch only prevents future callers of nf_conntrack_eventmask_report() from obtaining the callback. It doesn't wait of other cpus that might have already obtained the callbacks address. I don't see anything in upstream kernels that would prevent similar crash: We need to wait for all cpus to have exited the event callback. Fixes: 9592a5c01e79dbc59eb56fa ("netfilter: ctnetlink: netns support") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | net/x25: add new state X25_STATE_5Martin Schiller2019-12-092-0/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is needed, because if the flag X25_ACCPT_APPRV_FLAG is not set on a socket (manual call confirmation) and the channel is cleared by remote before the manual call confirmation was sent, this situation needs to be handled. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: David S. Miller <davem@davemloft.net>