summaryrefslogtreecommitdiffstats
path: root/include/net
Commit message (Collapse)AuthorAgeFilesLines
* tcp: be more careful in tcp_fragment()Eric Dumazet2019-07-211-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some applications set tiny SO_SNDBUF values and expect TCP to just work. Recent patches to address CVE-2019-11478 broke them in case of losses, since retransmits might be prevented. We should allow these flows to make progress. This patch allows the first and last skb in retransmit queue to be split even if memory limits are hit. It also adds the some room due to the fact that tcp_sendmsg() and tcp_sendpage() might overshoot sk_wmem_queued by about one full TSO skb (64KB size). Note this allowance was already present in stable backports for kernels < 4.15 Note for < 4.15 backports : tcp_rtx_queue_tail() will probably look like : static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk) { struct sk_buff *skb = tcp_send_head(sk); return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk); } Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com> Tested-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Christoph Paasch <cpaasch@apple.com> Cc: Jonathan Looney <jtl@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* nl80211: fix VENDOR_CMD_RAW_DATAJohannes Berg2019-07-201-1/+1
| | | | | | | | | | Since ERR_PTR() is an inline, not a macro, just open-code it here so it's usable as an initializer, fixing the build in brcmfmac. Reported-by: Arend Van Spriel <arend.vanspriel@broadcom.com> Fixes: 901bb9891855 ("nl80211: require and validate vendor command policy") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* net: flow_offload: add flow_block structure and use itPablo Neira Ayuso2019-07-193-4/+15
| | | | | | | | | | | | This object stores the flow block callbacks that are attached to this block. Update flow_block_cb_lookup() to take this new object. This patch restores the block sharing feature. Fixes: da3eeb904ff4 ("net: flow_offload: add list handling functions") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: flow_offload: rename tc_setup_cb_t to flow_setup_cb_tPablo Neira Ayuso2019-07-193-13/+15
| | | | | | | | Rename this type definition and adapt users. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: flow_offload: remove netns parameter from flow_block_cb_alloc()Pablo Neira Ayuso2019-07-191-2/+1
| | | | | | | | | | No need to annotate the netns on the flow block callback object, flow_block_cb_is_busy() already checks for used blocks. Fixes: d63db30c8537 ("net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2019-07-192-3/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix a deadlock when module is requested via netlink_bind() in nfnetlink, from Florian Westphal. 2) Fix ipt_rpfilter and ip6t_rpfilter with VRF, from Miaohe Lin. 3) Skip master comparison in SIP helper to fix expectation clash under two valid scenarios, from xiao ruizhu. 4) Remove obsolete comments in nf_conntrack codebase, from Yonatan Goldschmidt. 5) Fix redirect extension module autoload, from Christian Hesse. 6) Fix incorrect mssg option sent to client in synproxy, from Fernando Fernandez. 7) Fix incorrect window calculations in TCP conntrack, from Florian Westphal. 8) Don't bail out when updating basechain policy due to recent offload works, also from Florian. 9) Allow symhash to use modulus 1 as other hash extensions do, from Laura.Garcia. 10) Missing NAT chain module autoload for the inet family, from Phil Sutter. 11) Fix missing adjustment of TCP RST packet in synproxy, from Fernando Fernandez. 12) Skip EAGAIN path when nft_meta_bridge is built-in or not selected. 13) Conntrack bridge does not depend on nf_tables_bridge. 14) Turn NF_TABLES_BRIDGE into tristate to fix possible link break of nft_meta_bridge, from Arnd Bergmann. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * netfilter: synproxy: fix erroneous tcp mss optionFernando Fernandez Mancera2019-07-161-0/+1
| | | | | | | | | | | | | | | | | | Now synproxy sends the mss value set by the user on client syn-ack packet instead of the mss value that client announced. Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nf_conntrack_sip: fix expectation clashxiao ruizhu2019-07-161-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When conntracks change during a dialog, SDP messages may be sent from different conntracks to establish expects with identical tuples. In this case expects conflict may be detected for the 2nd SDP message and end up with a process failure. The fixing here is to reuse an existing expect who has the same tuple for a different conntrack if any. Here are two scenarios for the case. 1) SERVER CPE | INVITE SDP | 5060 |<----------------------|5060 | 100 Trying | 5060 |---------------------->|5060 | 183 SDP | 5060 |---------------------->|5060 ===> Conntrack 1 | PRACK | 50601 |<----------------------|5060 | 200 OK (PRACK) | 50601 |---------------------->|5060 | 200 OK (INVITE) | 5060 |---------------------->|5060 | ACK | 50601 |<----------------------|5060 | | |<--- RTP stream ------>| | | | INVITE SDP (t38) | 50601 |---------------------->|5060 ===> Conntrack 2 With a certain configuration in the CPE, SIP messages "183 with SDP" and "re-INVITE with SDP t38" will go through the sip helper to create expects for RTP and RTCP. It is okay to create RTP and RTCP expects for "183", whose master connection source port is 5060, and destination port is 5060. In the "183" message, port in Contact header changes to 50601 (from the original 5060). So the following requests e.g. PRACK and ACK are sent to port 50601. It is a different conntrack (let call Conntrack 2) from the original INVITE (let call Conntrack 1) due to the port difference. In this example, after the call is established, there is RTP stream but no RTCP stream for Conntrack 1, so the RTP expect created upon "183" is cleared, and RTCP expect created for Conntrack 1 retains. When "re-INVITE with SDP t38" arrives to create RTP&RTCP expects, current ALG implementation will call nf_ct_expect_related() for RTP and RTCP. The expects tuples are identical to those for Conntrack 1. RTP expect for Conntrack 2 succeeds in creation as the one for Conntrack 1 has been removed. RTCP expect for Conntrack 2 fails in creation because it has idential tuples and 'conflict' with the one retained for Conntrack 1. And then result in a failure in processing of the re-INVITE. 2) SERVER A CPE | REGISTER | 5060 |<------------------| 5060 ==> CT1 | 200 | 5060 |------------------>| 5060 | | | INVITE SDP(1) | 5060 |<------------------| 5060 | 300(multi choice) | 5060 |------------------>| 5060 SERVER B | ACK | 5060 |<------------------| 5060 | INVITE SDP(2) | 5060 |-------------------->| 5060 ==> CT2 | 100 | 5060 |<--------------------| 5060 | 200(contact changes)| 5060 |<--------------------| 5060 | ACK | 5060 |-------------------->| 50601 ==> CT3 | | |<--- RTP stream ---->| | | | BYE | 5060 |<--------------------| 50601 | 200 | 5060 |-------------------->| 50601 | INVITE SDP(3) | 5060 |<------------------| 5060 ==> CT1 CPE sends an INVITE request(1) to Server A, and creates a RTP&RTCP expect pair for this Conntrack 1 (CT1). Server A responds 300 to redirect to Server B. The RTP&RTCP expect pairs created on CT1 are removed upon 300 response. CPE sends the INVITE request(2) to Server B, and creates an expect pair for the new conntrack (due to destination address difference), let call CT2. Server B changes the port to 50601 in 200 OK response, and the following requests ACK and BYE from CPE are sent to 50601. The call is established. There is RTP stream and no RTCP stream. So RTP expect is removed and RTCP expect for CT2 retains. As BYE request is sent from port 50601, it is another conntrack, let call CT3, different from CT2 due to the port difference. So the BYE request will not remove the RTCP expect for CT2. Then another outgoing call is made, with the same RTP port being used (not definitely but possibly). CPE firstly sends the INVITE request(3) to Server A, and tries to create a RTP&RTCP expect pairs for this CT1. In current ALG implementation, the RTCP expect for CT1 fails in creation because it 'conflicts' with the residual one for CT2. As a result the INVITE request fails to send. Signed-off-by: xiao ruizhu <katrina.xiaorz@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | tcp: fix tcp_set_congestion_control() use from bpf hookEric Dumazet2019-07-181-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Neal reported incorrect use of ns_capable() from bpf hook. bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) -> ns_capable_common() -> current_cred() -> rcu_dereference_protected(current->cred, 1) Accessing 'current' in bpf context makes no sense, since packets are processed from softirq context. As Neal stated : The capability check in tcp_set_congestion_control() was written assuming a system call context, and then was reused from a BPF call site. The fix is to add a new parameter to tcp_set_congestion_control(), so that the ns_capable() call is only performed under the right context. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lawrence Brakmo <brakmo@fb.com> Reported-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: Fix NULL-pointer dereference in tc_indr_block_ing_cmd()Vlad Buslov2019-07-121-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After recent refactoring of block offlads infrastructure, indr_dev->block pointer is dereferenced before it is verified to be non-NULL. Example stack trace where this behavior leads to NULL-pointer dereference error when creating vxlan dev on system with mlx5 NIC with offloads enabled: [ 1157.852938] ================================================================== [ 1157.866877] BUG: KASAN: null-ptr-deref in tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1157.880877] Read of size 4 at addr 0000000000000090 by task ip/3829 [ 1157.901637] CPU: 22 PID: 3829 Comm: ip Not tainted 5.2.0-rc6+ #488 [ 1157.914438] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 1157.929031] Call Trace: [ 1157.938318] dump_stack+0x9a/0xeb [ 1157.948362] ? tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1157.960262] ? tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1157.972082] __kasan_report+0x176/0x192 [ 1157.982513] ? tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1157.994348] kasan_report+0xe/0x20 [ 1158.004324] tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1158.015950] ? tcf_block_setup+0x430/0x430 [ 1158.026558] ? kasan_unpoison_shadow+0x30/0x40 [ 1158.037464] __tc_indr_block_cb_register+0x5f5/0xf20 [ 1158.049288] ? mlx5e_rep_indr_tc_block_unbind+0xa0/0xa0 [mlx5_core] [ 1158.062344] ? tc_indr_block_dev_put.part.47+0x5c0/0x5c0 [ 1158.074498] ? rdma_roce_rescan_device+0x20/0x20 [ib_core] [ 1158.086580] ? br_device_event+0x98/0x480 [bridge] [ 1158.097870] ? strcmp+0x30/0x50 [ 1158.107578] mlx5e_nic_rep_netdevice_event+0xdd/0x180 [mlx5_core] [ 1158.120212] notifier_call_chain+0x6d/0xa0 [ 1158.130753] register_netdevice+0x6fc/0x7e0 [ 1158.141322] ? netdev_change_features+0xa0/0xa0 [ 1158.152218] ? vxlan_config_apply+0x210/0x310 [vxlan] [ 1158.163593] __vxlan_dev_create+0x2ad/0x520 [vxlan] [ 1158.174770] ? vxlan_changelink+0x490/0x490 [vxlan] [ 1158.185870] ? rcu_read_unlock+0x60/0x60 [vxlan] [ 1158.196798] vxlan_newlink+0x99/0xf0 [vxlan] [ 1158.207303] ? __vxlan_dev_create+0x520/0x520 [vxlan] [ 1158.218601] ? rtnl_create_link+0x3d0/0x450 [ 1158.228900] __rtnl_newlink+0x8a7/0xb00 [ 1158.238701] ? stack_access_ok+0x35/0x80 [ 1158.248450] ? rtnl_link_unregister+0x1a0/0x1a0 [ 1158.258735] ? find_held_lock+0x6d/0xd0 [ 1158.268379] ? is_bpf_text_address+0x67/0xf0 [ 1158.278330] ? lock_acquire+0xc1/0x1f0 [ 1158.287686] ? is_bpf_text_address+0x5/0xf0 [ 1158.297449] ? is_bpf_text_address+0x86/0xf0 [ 1158.307310] ? kernel_text_address+0xec/0x100 [ 1158.317155] ? arch_stack_walk+0x92/0xe0 [ 1158.326497] ? __kernel_text_address+0xe/0x30 [ 1158.336213] ? unwind_get_return_address+0x2f/0x50 [ 1158.346267] ? create_prof_cpu_mask+0x20/0x20 [ 1158.355936] ? arch_stack_walk+0x92/0xe0 [ 1158.365117] ? stack_trace_save+0x8a/0xb0 [ 1158.374272] ? stack_trace_consume_entry+0x80/0x80 [ 1158.384226] ? match_held_lock+0x33/0x210 [ 1158.393216] ? kasan_unpoison_shadow+0x30/0x40 [ 1158.402593] rtnl_newlink+0x53/0x80 [ 1158.410925] rtnetlink_rcv_msg+0x3a5/0x600 [ 1158.419777] ? validate_linkmsg+0x400/0x400 [ 1158.428620] ? find_held_lock+0x6d/0xd0 [ 1158.437117] ? match_held_lock+0x1b/0x210 [ 1158.445760] ? validate_linkmsg+0x400/0x400 [ 1158.454642] netlink_rcv_skb+0xc7/0x1f0 [ 1158.463150] ? netlink_ack+0x470/0x470 [ 1158.471538] ? netlink_deliver_tap+0x1f3/0x5a0 [ 1158.480607] netlink_unicast+0x2ae/0x350 [ 1158.489099] ? netlink_attachskb+0x340/0x340 [ 1158.497935] ? _copy_from_iter_full+0xde/0x3b0 [ 1158.506945] ? __virt_addr_valid+0xb6/0xf0 [ 1158.515578] ? __check_object_size+0x159/0x240 [ 1158.524515] netlink_sendmsg+0x4d3/0x630 [ 1158.532879] ? netlink_unicast+0x350/0x350 [ 1158.541400] ? netlink_unicast+0x350/0x350 [ 1158.549805] sock_sendmsg+0x94/0xa0 [ 1158.557561] ___sys_sendmsg+0x49d/0x570 [ 1158.565625] ? copy_msghdr_from_user+0x210/0x210 [ 1158.574457] ? __fput+0x1e2/0x330 [ 1158.581948] ? __kasan_slab_free+0x130/0x180 [ 1158.590407] ? kmem_cache_free+0xb6/0x2d0 [ 1158.598574] ? mark_lock+0xc7/0x790 [ 1158.606177] ? task_work_run+0xcf/0x100 [ 1158.614165] ? exit_to_usermode_loop+0x102/0x110 [ 1158.622954] ? __lock_acquire+0x963/0x1ee0 [ 1158.631199] ? lockdep_hardirqs_on+0x260/0x260 [ 1158.639777] ? match_held_lock+0x1b/0x210 [ 1158.647918] ? lockdep_hardirqs_on+0x260/0x260 [ 1158.656501] ? match_held_lock+0x1b/0x210 [ 1158.664643] ? __fget_light+0xa6/0xe0 [ 1158.672423] ? __sys_sendmsg+0xd2/0x150 [ 1158.680334] __sys_sendmsg+0xd2/0x150 [ 1158.688063] ? __ia32_sys_shutdown+0x30/0x30 [ 1158.696435] ? lock_downgrade+0x2e0/0x2e0 [ 1158.704541] ? mark_held_locks+0x1a/0x90 [ 1158.712611] ? mark_held_locks+0x1a/0x90 [ 1158.720619] ? do_syscall_64+0x1e/0x2c0 [ 1158.728530] do_syscall_64+0x78/0x2c0 [ 1158.736254] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1158.745414] RIP: 0033:0x7f62d505cb87 [ 1158.753070] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 f3 c3 0f 1f 80 00 00[87/1817] 48 89 f3 48 [ 1158.780924] RSP: 002b:00007fffd9832268 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 1158.793204] RAX: ffffffffffffffda RBX: 000000005d26048f RCX: 00007f62d505cb87 [ 1158.805111] RDX: 0000000000000000 RSI: 00007fffd98322d0 RDI: 0000000000000003 [ 1158.817055] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006 [ 1158.828987] R10: 00007f62d50ce260 R11: 0000000000000246 R12: 0000000000000001 [ 1158.840909] R13: 000000000067e540 R14: 0000000000000000 R15: 000000000067ed20 [ 1158.852873] ================================================================== Introduce new function tcf_block_non_null_shared() that verifies block pointer before dereferencing it to obtain index. Use the function in tc_indr_block_ing_cmd() to prevent NULL pointer dereference. Fixes: 955bcb6ea0df ("drivers: net: use flow block API") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: fib_rules: do not flow dissect local packetsPetar Penkov2019-07-111-2/+2
| | | | | | | | | Rules matching on loopback iif do not need early flow dissection as the packet originates from the host. Stop counting such rules in fib_rule_requires_fldissect Signed-off-by: Petar Penkov <ppenkov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2019-07-1161-350/+1634
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: "Some highlights from this development cycle: 1) Big refactoring of ipv6 route and neigh handling to support nexthop objects configurable as units from userspace. From David Ahern. 2) Convert explored_states in BPF verifier into a hash table, significantly decreased state held for programs with bpf2bpf calls, from Alexei Starovoitov. 3) Implement bpf_send_signal() helper, from Yonghong Song. 4) Various classifier enhancements to mvpp2 driver, from Maxime Chevallier. 5) Add aRFS support to hns3 driver, from Jian Shen. 6) Fix use after free in inet frags by allocating fqdirs dynamically and reworking how rhashtable dismantle occurs, from Eric Dumazet. 7) Add act_ctinfo packet classifier action, from Kevin Darbyshire-Bryant. 8) Add TFO key backup infrastructure, from Jason Baron. 9) Remove several old and unused ISDN drivers, from Arnd Bergmann. 10) Add devlink notifications for flash update status to mlxsw driver, from Jiri Pirko. 11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski. 12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes. 13) Various enhancements to ipv6 flow label handling, from Eric Dumazet and Willem de Bruijn. 14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van der Merwe, and others. 15) Various improvements to axienet driver including converting it to phylink, from Robert Hancock. 16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean. 17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana Radulescu. 18) Add devlink health reporting to mlx5, from Moshe Shemesh. 19) Convert stmmac over to phylink, from Jose Abreu. 20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from Shalom Toledo. 21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera. 22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel. 23) Track spill/fill of constants in BPF verifier, from Alexei Starovoitov. 24) Support bounded loops in BPF, from Alexei Starovoitov. 25) Various page_pool API fixes and improvements, from Jesper Dangaard Brouer. 26) Just like ipv4, support ref-countless ipv6 route handling. From Wei Wang. 27) Support VLAN offloading in aquantia driver, from Igor Russkikh. 28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy. 29) Add flower GRE encap/decap support to nfp driver, from Pieter Jansen van Vuuren. 30) Protect against stack overflow when using act_mirred, from John Hurley. 31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen. 32) Use page_pool API in netsec driver, Ilias Apalodimas. 33) Add Google gve network driver, from Catherine Sullivan. 34) More indirect call avoidance, from Paolo Abeni. 35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan. 36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek. 37) Add MPLS manipulation actions to TC, from John Hurley. 38) Add sending a packet to connection tracking from TC actions, and then allow flower classifier matching on conntrack state. From Paul Blakey. 39) Netfilter hw offload support, from Pablo Neira Ayuso" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits) net/mlx5e: Return in default case statement in tx_post_resync_params mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync(). net: dsa: add support for BRIDGE_MROUTER attribute pkt_sched: Include const.h net: netsec: remove static declaration for netsec_set_tx_de() net: netsec: remove superfluous if statement netfilter: nf_tables: add hardware offload support net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload net: flow_offload: add flow_block_cb_is_busy() and use it net: sched: remove tcf block API drivers: net: use flow block API net: sched: use flow block API net: flow_offload: add flow_block_cb_{priv, incref, decref}() net: flow_offload: add list handling functions net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free() net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_* net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND net: flow_offload: add flow_block_cb_setup_simple() net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC ...
| * netfilter: nf_tables: add hardware offload supportPablo Neira Ayuso2019-07-092-0/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds hardware offload support for nftables through the existing netdev_ops->ndo_setup_tc() interface, the TC_SETUP_CLSFLOWER classifier and the flow rule API. This hardware offload support is available for the NFPROTO_NETDEV family and the ingress hook. Each nftables expression has a new ->offload interface, that is used to populate the flow rule object that is attached to the transaction object. There is a new per-table NFT_TABLE_F_HW flag, that is set on to offload an entire table, including all of its chains. This patch supports for basic metadata (layer 3 and 4 protocol numbers), 5-tuple payload matching and the accept/drop actions; this also includes basechain hardware offload only. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: flow_offload: rename tc_cls_flower_offload to flow_cls_offloadPablo Neira Ayuso2019-07-092-35/+35
| | | | | | | | | | | | | | | | | | | | | | | | And any other existing fields in this structure that refer to tc. Specifically: * tc_cls_flower_offload_flow_rule() to flow_cls_offload_flow_rule(). * TC_CLSFLOWER_* to FLOW_CLS_*. * tc_cls_common_offload to tc_cls_common_offload. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: flow_offload: add flow_block_cb_is_busy() and use itPablo Neira Ayuso2019-07-091-0/+3
| | | | | | | | | | | | | | | | | | This patch adds a function to check if flow block callback is already in use. Call this new function from flow_block_cb_setup_simple() and from drivers. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: remove tcf block APIPablo Neira Ayuso2019-07-091-69/+0
| | | | | | | | | | | | | | Unused, now replaced by flow block API. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * drivers: net: use flow block APIPablo Neira Ayuso2019-07-092-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch updates flow_block_cb_setup_simple() to use the flow block API. Several drivers are also adjusted to use it. This patch introduces the per-driver list of flow blocks to account for blocks that are already in use. Remove tc_block_offload alias. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: flow_offload: add flow_block_cb_{priv, incref, decref}()Pablo Neira Ayuso2019-07-091-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch completes the flow block API to introduce: * flow_block_cb_priv() to access callback private data. * flow_block_cb_incref() to bump reference counter on this flow block. * flow_block_cb_decref() to decrement the reference counter. These functions are taken from the existing tcf_block_cb_priv(), tcf_block_cb_incref() and tcf_block_cb_decref(). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: flow_offload: add list handling functionsPablo Neira Ayuso2019-07-091-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the list handling functions for the flow block API: * flow_block_cb_lookup() allows drivers to look up for existing flow blocks. * flow_block_cb_add() adds a flow block to the per driver list to be registered by the core. * flow_block_cb_remove() to remove a flow block from the list of existing flow blocks per driver and to request the core to unregister this. The flow block API also annotates the netns this flow block belongs to. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()Pablo Neira Ayuso2019-07-091-0/+14
| | | | | | | | | | | | | | Add a new helper function to allocate flow_block_cb objects. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*Pablo Neira Ayuso2019-07-092-5/+4
| | | | | | | | | | | | | | | | Rename from TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_* and remove temporary tcf_block_binder_type alias. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BINDPablo Neira Ayuso2019-07-092-3/+2
| | | | | | | | | | | | | | | | Rename from TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND and remove temporary tc_block_command alias. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: flow_offload: add flow_block_cb_setup_simple()Pablo Neira Ayuso2019-07-092-17/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most drivers do the same thing to set up the flow block callbacks, this patch adds a helper function to do this. This preparation patch reduces the number of changes to adapt the existing drivers to use the flow block callback API. This new helper function takes a flow block list per-driver, which is set to NULL until this driver list is used. This patch also introduces the flow_block_command and flow_block_binder_type enumerations, which are renamed to use FLOW_BLOCK_* in follow up patches. There are three definitions (aliases) in order to reduce the number of updates in this patch, which go away once drivers are fully adapted to use this flow block API. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/flow_dissector: add connection tracking dissectionPaul Blakey2019-07-091-0/+15
| | | | | | | | | | | | | | | | | | | | Retreives connection tracking zone, mark, label, and state from a SKB. Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/sched: Introduce action ctPaul Blakey2019-07-092-0/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow sending a packet to conntrack module for connection tracking. The packet will be marked with conntrack connection's state, and any metadata such as conntrack mark and label. This state metadata can later be matched against with tc classifers, for example with the flower classifier as below. In addition to committing new connections the user can optionally specific a zone to track within, set a mark/label and configure nat with an address range and port range. Usage is as follows: $ tc qdisc add dev ens1f0_0 ingress $ tc qdisc add dev ens1f0_1 ingress $ tc filter add dev ens1f0_0 ingress \ prio 1 chain 0 proto ip \ flower ip_proto tcp ct_state -trk \ action ct zone 2 pipe \ action goto chain 2 $ tc filter add dev ens1f0_0 ingress \ prio 1 chain 2 proto ip \ flower ct_state +trk+new \ action ct zone 2 commit mark 0xbb nat src addr 5.5.5.7 pipe \ action mirred egress redirect dev ens1f0_1 $ tc filter add dev ens1f0_0 ingress \ prio 1 chain 2 proto ip \ flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \ action ct nat pipe \ action mirred egress redirect dev ens1f0_1 $ tc filter add dev ens1f0_1 ingress \ prio 1 chain 0 proto ip \ flower ip_proto tcp ct_state -trk \ action ct zone 2 pipe \ action goto chain 1 $ tc filter add dev ens1f0_1 ingress \ prio 1 chain 1 proto ip \ flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \ action ct nat pipe \ action mirred egress redirect dev ens1f0_0 Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Changelog: V5->V6: Added CONFIG_NF_DEFRAG_IPV6 in handle fragments ipv6 case V4->V5: Reordered nf_conntrack_put() in tcf_ct_skb_nfct_cached() V3->V4: Added strict_start_type for act_ct policy V2->V3: Fixed david's comments: Removed extra newline after rcu in tcf_ct_params , and indent of break in act_ct.c V1->V2: Fixed parsing of ranges TCA_CT_NAT_IPV6_MAX as 'else' case overwritten ipv4 max Refactored NAT_PORT_MIN_MAX range handling as well Added ipv4/ipv6 defragmentation Removed extra skb pull push of nw offset in exectute nat Refactored tcf_ct_skb_network_trim after pull Removed TCA_ACT_CT define Signed-off-by: David S. Miller <davem@davemloft.net>
| * devlink: Introduce PCI VF port flavour and port attributeParav Pandit2019-07-091-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In an eswitch, PCI VF may have port which is normally represented using a representor netdevice. To have better visibility of eswitch port, its association with VF, and its representor netdevice, introduce a PCI VF port flavour. When devlink port flavour is PCI VF, fill up PCI VF attributes of the port. Extend port name creation using PCI PF and VF number scheme on best effort basis, so that vendor drivers can skip defining their own scheme. $ devlink port show pci/0000:05:00.0/0: type eth netdev eth0 flavour pcipf pfnum 0 pci/0000:05:00.0/1: type eth netdev eth1 flavour pcivf pfnum 0 vfnum 0 pci/0000:05:00.0/2: type eth netdev eth2 flavour pcivf pfnum 0 vfnum 1 Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * devlink: Introduce PCI PF port flavour and port attributeParav Pandit2019-07-091-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In an eswitch, PCI PF may have port which is normally represented using a representor netdevice. To have better visibility of eswitch port, its association with PF and a representor netdevice, introduce a PCI PF port flavour and port attriute. When devlink port flavour is PCI PF, fill up PCI PF attributes of the port. Extend port name creation using PCI PF number on best effort basis. So that vendor drivers can skip defining their own scheme. $ devlink port show pci/0000:05:00.0/0: type eth netdev eth0 flavour pcipf pfnum 0 Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * devlink: Refactor physical port attributesParav Pandit2019-07-091-2/+11
| | | | | | | | | | | | | | | | | | | | To support additional devlink port flavours and to support few common and few different port attributes, move physical port attributes to a different structure. Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/tls: don't clear TX resync flag on errorDirk van der Merwe2019-07-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a return code for the tls_dev_resync callback. When the driver TX resync fails, kernel can retry the resync again until it succeeds. This prevents drivers from attempting to offload TLS packets if the connection is known to be out of sync. We don't worry about the RX resync since they will be retried naturally as more encrypted records get received. Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: rename sp strm_interleave to ep intl_enableXin Long2019-07-081-1/+1
| | | | | | | | | | | | | | | | Like other endpoint features, strm_interleave should be moved to sctp_endpoint and renamed to intl_enable. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: rename asoc intl_enable to asoc peer.intl_capableXin Long2019-07-081-16/+17
| | | | | | | | | | | | | | | | To keep consistent with other asoc features, we move intl_enable to peer.intl_capable in asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: remove prsctp_enable from asocXin Long2019-07-081-2/+1
| | | | | | | | | | | | | | | | Like reconf_enable, prsctp_enable should also be removed from asoc, as asoc->peer.prsctp_capable has taken its job. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: remove reconf_enable from asocXin Long2019-07-081-2/+1
| | | | | | | | | | | | | | | | | | asoc's reconf support is actually decided by the 4-shakehand negotiation, not something that users can set by sockopt. asoc->peer.reconf_capable is working for this. So remove it from asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: add mpls manipulation actions to TCJohn Hurley2019-07-081-0/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, TC offers the ability to match on the MPLS fields of a packet through the use of the flow_dissector_key_mpls struct. However, as yet, TC actions do not allow the modification or manipulation of such fields. Add a new module that registers TC action ops to allow manipulation of MPLS. This includes the ability to push and pop headers as well as modify the contents of new or existing headers. A further action to decrement the TTL field of an MPLS header is also provided with a new helper added to support this. Examples of the usage of the new action with flower rules to push and pop MPLS labels are: tc filter add dev eth0 protocol ip parent ffff: flower \ action mpls push protocol mpls_uc label 123 \ action mirred egress redirect dev eth1 tc filter add dev eth0 protocol mpls_uc parent ffff: flower \ action mpls pop protocol ipv4 \ action mirred egress redirect dev eth1 Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2019-07-085-5/+16
| |\ | | | | | | | | | | | | | | | Two cases of overlapping changes, nothing fancy. Signed-off-by: David S. Miller <davem@davemloft.net>
| | * Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2019-07-031-0/+5
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2019-07-03 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH on BE architectures, from Jiong. 2) Fix several bugs in the x32 BPF JIT for handling shifts by 0, from Luke and Xi. 3) Fix NULL pointer deref in btf_type_is_resolve_source_only(), from Stanislav. 4) Properly handle the check that forwarding is enabled on the device in bpf_ipv6_fib_lookup() helper code, from Anton. 5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit alignment such as m68k, from Baruch. 6) Fix kernel hanging in unregister_netdevice loop while unregistering device bound to XDP socket, from Ilya. 7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan. 8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri. 9) Fix bpftool to use correct argument in cgroup errors, from Jakub. 10) Fix detaching dummy prog in XDP redirect sample code, from Prashant. 11) Add Jonathan to AF_XDP reviewers, from Björn. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * xdp: fix hang while unregistering device bound to xdp socketIlya Maximets2019-07-031-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Device that bound to XDP socket will not have zero refcount until the userspace application will not close it. This leads to hang inside 'netdev_wait_allrefs()' if device unregistering requested: # ip link del p1 < hang on recvmsg on netlink socket > # ps -x | grep ip 5126 pts/0 D+ 0:00 ip link del p1 # journalctl -b Jun 05 07:19:16 kernel: unregister_netdevice: waiting for p1 to become free. Usage count = 1 Jun 05 07:19:27 kernel: unregister_netdevice: waiting for p1 to become free. Usage count = 1 ... Fix that by implementing NETDEV_UNREGISTER event notification handler to properly clean up all the resources and unref device. This should also allow socket killing via ss(8) utility. Fixes: 965a99098443 ("xsk: add support for bind for Rx") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | net/tls: make sure offload also gets the keys wipedJakub Kicinski2019-07-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 86029d10af18 ("tls: zero the crypto information from tls_context before freeing") added memzero_explicit() calls to clear the key material before freeing struct tls_context, but it missed tls_device.c has its own way of freeing this structure. Replace the missing free. Fixes: 86029d10af18 ("tls: zero the crypto information from tls_context before freeing") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net:gue.h:Fix shifting signed 32-bit value by 31 bits problemVandana BN2019-07-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix GUE_PFLAG_REMCSUM to use "U" cast to avoid shifting signed 32-bit value by 31 bits problem. Signed-off-by: Vandana BN <bnvandana@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: dst.h: Fix shifting signed 32-bit value by 31 bits problemVandana BN2019-07-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix DST_FEATURE_ECN_CA to use "U" cast to avoid shifting signed 32-bit value by 31 bits problem. Signed-off-by: Vandana BN <bnvandana@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: make skb_dst_force return true when dst is refcountedFlorian Westphal2019-06-291-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | netfilter did not expect that skb_dst_force() can cause skb to lose its dst entry. I got a bug report with a skb->dst NULL dereference in netfilter output path. The backtrace contains nf_reinject(), so the dst might have been cleared when skb got queued to userspace. Other users were fixed via if (skb_dst(skb)) { skb_dst_force(skb); if (!skb_dst(skb)) goto handle_err; } But I think its preferable to make the 'dst might be cleared' part of the function explicit. In netfilter case, skb with a null dst is expected when queueing in prerouting hook, so drop skb for the other hooks. v2: v1 of this patch returned true in case skb had no dst entry. Eric said: Say if we have two skb_dst_force() calls for some reason on the same skb, only the first one will return false. This now returns false even when skb had no dst, as per Erics suggestion, so callers might need to check skb_dst() first before skb_dst_force(). Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2019-06-281-2/+4
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter fixes for net: 1) Fix memleak reported by syzkaller when registering IPVS hooks, patch from Julian Anastasov. 2) Fix memory leak in start_sync_thread, also from Julian. 3) Fix conntrack deletion via ctnetlink, from Felix Kaechele. 4) Fix reject for ICMP due to incorrect checksum handling, from He Zhe. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | ipvs: fix tinfo memory leak in start_sync_threadJulian Anastasov2019-06-251-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzkaller reports for memory leak in start_sync_thread [1] As Eric points out, kthread may start and stop before the threadfn function is called, so there is no chance the data (tinfo in our case) to be released in thread. Fix this by releasing tinfo in the controlling code instead. [1] BUG: memory leak unreferenced object 0xffff8881206bf700 (size 32): comm "syz-executor761", pid 7268, jiffies 4294943441 (age 20.470s) hex dump (first 32 bytes): 00 40 7c 09 81 88 ff ff 80 45 b8 21 81 88 ff ff .@|......E.!.... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<0000000057619e23>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<0000000057619e23>] slab_post_alloc_hook mm/slab.h:439 [inline] [<0000000057619e23>] slab_alloc mm/slab.c:3326 [inline] [<0000000057619e23>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 [<0000000086ce5479>] kmalloc include/linux/slab.h:547 [inline] [<0000000086ce5479>] start_sync_thread+0x5d2/0xe10 net/netfilter/ipvs/ip_vs_sync.c:1862 [<000000001a9229cc>] do_ip_vs_set_ctl+0x4c5/0x780 net/netfilter/ipvs/ip_vs_ctl.c:2402 [<00000000ece457c8>] nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] [<00000000ece457c8>] nf_setsockopt+0x4c/0x80 net/netfilter/nf_sockopt.c:115 [<00000000942f62d4>] ip_setsockopt net/ipv4/ip_sockglue.c:1258 [inline] [<00000000942f62d4>] ip_setsockopt+0x9b/0xb0 net/ipv4/ip_sockglue.c:1238 [<00000000a56a8ffd>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616 [<00000000fa895401>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130 [<0000000095eef4cf>] __sys_setsockopt+0x98/0x120 net/socket.c:2078 [<000000009747cf88>] __do_sys_setsockopt net/socket.c:2089 [inline] [<000000009747cf88>] __se_sys_setsockopt net/socket.c:2086 [inline] [<000000009747cf88>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086 [<00000000ded8ba80>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301 [<00000000893b4ac8>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+7e2e50c8adfccd2e5041@syzkaller.appspotmail.com Suggested-by: Eric Biggers <ebiggers@kernel.org> Fixes: 998e7a76804b ("ipvs: Use kthread_run() instead of doing a double-fork via kernel_thread()") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | ipv6: elide flowlabel check if no exclusive leases existWillem de Bruijn2019-07-081-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Processes can request ipv6 flowlabels with cmsg IPV6_FLOWINFO. If not set, by default an autogenerated flowlabel is selected. Explicit flowlabels require a control operation per label plus a datapath check on every connection (every datagram if unconnected). This is particularly expensive on unconnected sockets multiplexing many flows, such as QUIC. In the common case, where no lease is exclusive, the check can be safely elided, as both lease request and check trivially succeed. Indeed, autoflowlabel does the same even with exclusive leases. Elide the check if no process has requested an exclusive lease. fl6_sock_lookup previously returns either a reference to a lease or NULL to denote failure. Modify to return a real error and update all callers. On return NULL, they can use the label and will elide the atomic_dec in fl6_sock_release. This is an optimization. Robust applications still have to revert to requesting leases if the fast path fails due to an exclusive lease. Changes RFC->v1: - use static_key_false_deferred to rate limit jump label operations - call static_key_deferred_flush to stop timers on exit - move decrement out of RCU context - defer optimization also if opt data is associated with a lease - updated all fp6_sock_lookup callers, not just udp Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | coallocate socket_wq with socket itselfAl Viro2019-07-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | socket->wq is assign-once, set when we are initializing both struct socket it's in and struct socket_wq it points to. As the matter of fact, the only reason for separate allocation was the ability to RCU-delay freeing of socket_wq. RCU-delaying the freeing of socket itself gets rid of that need, so we can just fold struct socket_wq into the end of struct socket and simplify the life both for sock_alloc_inode() (one allocation instead of two) and for tun/tap oddballs, where we used to embed struct socket and struct socket_wq into the same structure (now - embedding just the struct socket). Note that reference to struct socket_wq in struct sock does remain a reference - that's unchanged. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2019-07-082-3/+3
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2019-07-09 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Lots of libbpf improvements: i) addition of new APIs to attach BPF programs to tracing entities such as {k,u}probes or tracepoints, ii) improve specification of BTF-defined maps by eliminating the need for data initialization for some of the members, iii) addition of a high-level API for setting up and polling perf buffers for BPF event output helpers, all from Andrii. 2) Add "prog run" subcommand to bpftool in order to test-run programs through the kernel testing infrastructure of BPF, from Quentin. 3) Improve verifier for BPF sockaddr programs to support 8-byte stores for user_ip6 and msg_src_ip6 members given clang tends to generate such stores, from Stanislav. 4) Enable the new BPF JIT zero-extension optimization for further riscv64 ALU ops, from Luke. 5) Fix a bpftool json JIT dump crash on powerpc, from Jiri. 6) Fix an AF_XDP race in generic XDP's receive path, from Ilya. 7) Various smaller fixes from Ilya, Yue and Arnd. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | xdp: fix race on generic receive pathIlya Maximets2019-07-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike driver mode, generic xdp receive could be triggered by different threads on different CPU cores at the same time leading to the fill and rx queue breakage. For example, this could happen while sending packets from two processes to the first interface of veth pair while the second part of it is open with AF_XDP socket. Need to take a lock for each generic receive to avoid race. Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Tested-by: William Tu <u9012063@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | | bpf: avoid unused variable warning in tcp_bpf_rtt()Arnd Bergmann2019-07-081-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_BPF is disabled, we get a warning for an unused variable: In file included from drivers/target/target_core_device.c:26: include/net/tcp.h:2226:19: error: unused variable 'tp' [-Werror,-Wunused-variable] struct tcp_sock *tp = tcp_sk(sk); The variable is only used in one place, so it can be replaced with its value there to avoid the warning. Fixes: 23729ff23186 ("bpf: add BPF_CGROUP_SOCK_OPS callback that is executed on every RTT") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | | | net: core: page_pool: add user refcnt and reintroduce page_pool_destroyIvan Khoronzhuk2019-07-081-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jesper recently removed page_pool_destroy() (from driver invocation) and moved shutdown and free of page_pool into xdp_rxq_info_unreg(), in-order to handle in-flight packets/pages. This created an asymmetry in drivers create/destroy pairs. This patch reintroduce page_pool_destroy and add page_pool user refcnt. This serves the purpose to simplify drivers error handling as driver now drivers always calls page_pool_destroy() and don't need to track if xdp_rxq_info_reg_mem_model() was unsuccessful. This could be used for a special cases where a single RX-queue (with a single page_pool) provides packets for two net_device'es, and thus needs to register the same page_pool twice with two xdp_rxq_info structures. This patch is primarily to ease API usage for drivers. The recently merged netsec driver, actually have a bug in this area, which is solved by this API change. This patch is a modified version of Ivan Khoronzhuk's original patch. Link: https://lore.kernel.org/netdev/20190625175948.24771-2-ivan.khoronzhuk@linaro.org/ Fixes: 5c67bf0ec4d0 ("net: netsec: Use page_pool API") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2019-07-084-2/+51
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: 1) Move bridge keys in nft_meta to nft_meta_bridge, from wenxu. 2) Support for bridge pvid matching, from wenxu. 3) Support for bridge vlan protocol matching, also from wenxu. 4) Add br_vlan_get_pvid_rcu(), to fetch the bridge port pvid from packet path. 5) Prefer specific family extension in nf_tables. 6) Autoload specific family extension in case it is missing. 7) Add synproxy support to nf_tables, from Fernando Fernandez Mancera. 8) Support for GRE encapsulation in IPVS, from Vadim Fedorenko. 9) ICMP handling for GRE encapsulation, from Julian Anastasov. 10) Remove unused parameter in nf_queue, from Florian Westphal. 11) Replace seq_printf() by seq_puts() in nf_log, from Markus Elfring. 12) Rename nf_SYNPROXY.h => nf_synproxy.h before this header becomes public. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>