summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* netfilter: nf_tables: add support for matching IPv4 optionsStephen Suryaputra2019-06-213-0/+136
| | | | | | | | | | This is the kernel change for the overall changes with this description: Add capability to have rules matching IPv4 options. This is developed mainly to support dropping of IP packets with loose and/or strict source route route options. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: synproxy: fix manual bump of the reference counterFernando Fernandez Mancera2019-06-211-1/+0
| | | | | | | | This operation is handled by nf_synproxy_ipv4_init() now. Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: bridge: Fix non-untagged fragment packetwenxu2019-06-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | ip netns exec ns1 ip a a dev eth0 10.0.0.7/24 ip netns exec ns2 ip link a link eth0 name vlan type vlan id 200 ip netns exec ns2 ip a a dev vlan 10.0.0.8/24 ip l add dev br0 type bridge vlan_filtering 1 brctl addif br0 veth1 brctl addif br0 veth2 bridge vlan add dev veth1 vid 200 pvid untagged bridge vlan add dev veth2 vid 200 A two fragment packet sent from ns2 contains the vlan tag 200. In the bridge conntrack, this packet will defrag to one skb with fraglist. When the packet is forwarded to ns1 through veth1, the first skb vlan tag will be cleared by the "untagged" flags. But the vlan tag in the second skb is still tagged, so the second fragment ends up with tag 200 to ns1. So if the first fragment packet doesn't contain the vlan tag, all of the remain should not contain vlan tag. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: fix nf_conntrack_bridge/ipv6 link errorArnd Bergmann2019-06-211-4/+12
| | | | | | | | | | | | | | | | | | | When CONFIG_IPV6 is disabled, the bridge netfilter code produces a link error: ERROR: "br_ip6_fragment" [net/bridge/netfilter/nf_conntrack_bridge.ko] undefined! ERROR: "nf_ct_frag6_gather" [net/bridge/netfilter/nf_conntrack_bridge.ko] undefined! The problem is that it assumes that whenever IPV6 is not a loadable module, we can call the functions direction. This is clearly not true when IPV6 is disabled. There are two other functions defined like this in linux/netfilter_ipv6.h, so change them all the same way. Fixes: 764dd163ac92 ("netfilter: nf_conntrack_bridge: add support for IPv6") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: bridge: prevent UAF in brnf_exit_net()Christian Brauner2019-06-201-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prevent a UAF in brnf_exit_net(). When unregister_net_sysctl_table() is called the ctl_hdr pointer will obviously be freed and so accessing it righter after is invalid. Fix this by stashing a pointer to the table we want to free before we unregister the sysctl header. Note that syzkaller falsely chased this down to the drm tree so the Fixes tag that syzkaller requested would be wrong. This commit uses a different but the correct Fixes tag. /* Splat */ BUG: KASAN: use-after-free in br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline] BUG: KASAN: use-after-free in brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141 Read of size 8 at addr ffff8880a4078d60 by task kworker/u4:4/8749 CPU: 0 PID: 8749 Comm: kworker/u4:4 Not tainted 5.2.0-rc5-next-20190618 #17 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351 __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 br_netfilter_sysctl_exit_net net/bridge/br_netfilter_hooks.c:1121 [inline] brnf_exit_net+0x38c/0x3a0 net/bridge/br_netfilter_hooks.c:1141 ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154 cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553 process_one_work+0x989/0x1790 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x354/0x420 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Allocated by task 11374: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503 __do_kmalloc mm/slab.c:3645 [inline] __kmalloc+0x15c/0x740 mm/slab.c:3654 kmalloc include/linux/slab.h:552 [inline] kzalloc include/linux/slab.h:743 [inline] __register_sysctl_table+0xc7/0xef0 fs/proc/proc_sysctl.c:1327 register_net_sysctl+0x29/0x30 net/sysctl_net.c:121 br_netfilter_sysctl_init_net net/bridge/br_netfilter_hooks.c:1105 [inline] brnf_init_net+0x379/0x6a0 net/bridge/br_netfilter_hooks.c:1126 ops_init+0xb3/0x410 net/core/net_namespace.c:130 setup_net+0x2d3/0x740 net/core/net_namespace.c:316 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:103 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:202 ksys_unshare+0x444/0x980 kernel/fork.c:2822 __do_sys_unshare kernel/fork.c:2890 [inline] __se_sys_unshare kernel/fork.c:2888 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:2888 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3417 [inline] kfree+0x10a/0x2c0 mm/slab.c:3746 __rcu_reclaim kernel/rcu/rcu.h:215 [inline] rcu_do_batch kernel/rcu/tree.c:2092 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline] rcu_core+0xcc7/0x1500 kernel/rcu/tree.c:2291 __do_softirq+0x25c/0x94c kernel/softirq.c:292 The buggy address belongs to the object at ffff8880a4078d40 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 32 bytes inside of 512-byte region [ffff8880a4078d40, ffff8880a4078f40) The buggy address belongs to the page: page:ffffea0002901e00 refcount:1 mapcount:0 mapping:ffff8880aa400a80 index:0xffff8880a40785c0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea0001d636c8 ffffea0001b07308 ffff8880aa400a80 raw: ffff8880a40785c0 ffff8880a40780c0 0000000100000004 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880a4078c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a4078c80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc > ffff8880a4078d00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff8880a4078d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a4078e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Reported-by: syzbot+43a3fa52c0d9c5c94f41@syzkaller.appspotmail.com Fixes: 22567590b2e6 ("netfilter: bridge: namespace bridge netfilter sysctls") Signed-off-by: Christian Brauner <christian@brauner.io> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: synproxy: use nf_cookie_v6_check() from corePablo Neira Ayuso2019-06-201-1/+1
| | | | | | | | This helper function is never used and it is intended to avoid a direct dependency with the ipv6 module. Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: synproxy: fix building syncookie callsArnd Bergmann2019-06-202-6/+10
| | | | | | | | | | | | | | | | | | | | | When either CONFIG_IPV6 or CONFIG_SYN_COOKIES are disabled, the kernel fails to build: include/linux/netfilter_ipv6.h:180:9: error: implicit declaration of function '__cookie_v6_init_sequence' [-Werror,-Wimplicit-function-declaration] return __cookie_v6_init_sequence(iph, th, mssp); include/linux/netfilter_ipv6.h:194:9: error: implicit declaration of function '__cookie_v6_check' [-Werror,-Wimplicit-function-declaration] return __cookie_v6_check(iph, th, cookie); net/ipv6/netfilter.c:237:26: error: use of undeclared identifier '__cookie_v6_init_sequence'; did you mean 'cookie_init_sequence'? net/ipv6/netfilter.c:238:21: error: use of undeclared identifier '__cookie_v6_check'; did you mean '__cookie_v4_check'? Fix the IS_ENABLED() checks to match the function declaration and definitions for these. Fixes: 3006a5224f15 ("netfilter: synproxy: remove module dependency on IPv6 SYNPROXY") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: nf_tables: enable set expiration time for set elementsLaura Garcia Liebana2019-06-193-8/+22
| | | | | | | | | | | | | | | | | Currently, the expiration of every element in a set or map is a read-only parameter generated at kernel side. This change will permit to set a certain expiration date per element that will be required, for example, during stateful replication among several nodes. This patch handles the NFTA_SET_ELEM_EXPIRATION in order to configure the expiration parameter per element, or will use the timeout in the case that the expiration is not set. Signed-off-by: Laura Garcia Liebana <nevola@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: nft_ct: fix null pointer in ct expectations supportStéphane Veyret2019-06-191-0/+4
| | | | | | | | | nf_ct_helper_ext_add may return null, which must then be checked. Fixes: 857b46027d6f ("netfilter: nft_ct: add ct expectations support") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Stéphane Veyret <sveyret@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: synproxy: ensure zero is returned on non-error return pathColin Ian King2019-06-191-2/+2
| | | | | | | | | | | Currently functions nf_synproxy_{ipc4|ipv6}_init return an uninitialized garbage value in variable ret on a successful return. Fix this by returning zero on success. Addresses-Coverity: ("Uninitialized scalar variable") Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXYFernando Fernandez Mancera2019-06-175-847/+920
| | | | | | | | | Add common functions into nf_synproxy_core.c to prepare for nftables support. The prototypes of the functions used by {ipt, ip6t}_SYNPROXY are in the new file nf_synproxy.h Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: synproxy: remove module dependency on IPv6 SYNPROXYFernando Fernandez Mancera2019-06-172-0/+38
| | | | | | | | | This is a prerequisite for the infrastructure module NETFILTER_SYNPROXY. The new module is needed to avoid duplicated code for the SYNPROXY nftables support. Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: synproxy: add common uapi for SYNPROXY infrastructureFernando Fernandez Mancera2019-06-172-11/+26
| | | | | | | | This new UAPI file is going to be used by the xt and nft common SYNPROXY infrastructure. It is needed to avoid duplicated code. Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* Merge branch 'master' of git://blackhole.kfki.hu/nf-nextPablo Neira Ayuso2019-06-1734-130/+104
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jozsef Kadlecsik says: ==================== ipset patches for nf-next - Remove useless memset() calls, nla_parse_nested/nla_parse erase the tb array properly, from Florent Fourcot. - Merge the uadd and udel functions, the code is nicer this way, also from Florent Fourcot. - Add a missing check for the return value of a nla_parse[_deprecated] call, from Aditya Pakki. - Add the last missing check for the return value of nla_parse[_deprecated] call. - Fix error path and release the references properly in set_target_v3_checkentry(). - Fix memory accounting which is reported to userspace for hash types on resize, from Stefano Brivio. - Update my email address to kadlec@netfilter.org. The patch covers all places in the source tree where my kadlec@blackhole.kfki.hu address could be found. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * Update my email addressJozsef Kadlecsik2019-06-1034-49/+49
| | | | | | | | | | | | | | | | | | It's better to use my kadlec@netfilter.org email address in the source code. I might not be able to use kadlec@blackhole.kfki.hu in the future. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| * ipset: Fix memory accounting for hash types on resizeStefano Brivio2019-06-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a fresh array block is allocated during resize, the current in-memory set size should be increased by the size of the block, not replaced by it. Before the fix, adding entries to a hash set type, leading to a table resize, caused an inconsistent memory size to be reported. This becomes more obvious when swapping sets with similar sizes: # cat hash_ip_size.sh #!/bin/sh FAIL_RETRIES=10 tries=0 while [ ${tries} -lt ${FAIL_RETRIES} ]; do ipset create t1 hash:ip for i in `seq 1 4345`; do ipset add t1 1.2.$((i / 255)).$((i % 255)) done t1_init="$(ipset list t1|sed -n 's/Size in memory: \(.*\)/\1/p')" ipset create t2 hash:ip for i in `seq 1 4360`; do ipset add t2 1.2.$((i / 255)).$((i % 255)) done t2_init="$(ipset list t2|sed -n 's/Size in memory: \(.*\)/\1/p')" ipset swap t1 t2 t1_swap="$(ipset list t1|sed -n 's/Size in memory: \(.*\)/\1/p')" t2_swap="$(ipset list t2|sed -n 's/Size in memory: \(.*\)/\1/p')" ipset destroy t1 ipset destroy t2 tries=$((tries + 1)) if [ ${t1_init} -lt 10000 ] || [ ${t2_init} -lt 10000 ]; then echo "FAIL after ${tries} tries:" echo "T1 size ${t1_init}, after swap ${t1_swap}" echo "T2 size ${t2_init}, after swap ${t2_swap}" exit 1 fi done echo "PASS" # echo -n 'func hash_ip4_resize +p' > /sys/kernel/debug/dynamic_debug/control # ./hash_ip_size.sh [ 2035.018673] attempt to resize set t1 from 10 to 11, t 00000000fe6551fa [ 2035.078583] set t1 resized from 10 (00000000fe6551fa) to 11 (00000000172a0163) [ 2035.080353] Table destroy by resize 00000000fe6551fa FAIL after 4 tries: T1 size 9064, after swap 71128 T2 size 71128, after swap 9064 Reported-by: NOYB <JunkYardMail1@Frontier.com> Fixes: 9e41f26a505c ("netfilter: ipset: Count non-static extension memory for userspace") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| * netfilter: ipset: Fix error path in set_target_v3_checkentry()Jozsef Kadlecsik2019-06-101-20/+21
| | | | | | | | | | | | Fix error path and release the references properly. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| * netfilter: ipset: Fix the last missing check of nla_parse_deprecated()Jozsef Kadlecsik2019-06-101-4/+6
| | | | | | | | | | | | | | In dump_init() the outdated comment was incorrect and we had a missing validation check of nla_parse_deprecated(). Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| * netfilter: ipset: fix a missing check of nla_parseAditya Pakki2019-06-101-3/+7
| | | | | | | | | | | | | | | | | | When nla_parse fails, we should not use the results (the first argument). The fix checks if it fails, and if so, returns its error code upstream. Signed-off-by: Aditya Pakki <pakki001@umn.edu> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| * netfilter: ipset: merge uadd and udel functionsFlorent Fourcot2019-06-101-51/+20
| | | | | | | | | | | | | | | | Both functions are using exactly the same code, except the command value passed to call_ad function. Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
| * netfilter: ipset: remove useless memset() callsFlorent Fourcot2019-06-101-2/+0
| | | | | | | | | | | | | | | | | | One of the memset call is buggy: it does not erase full array, but only pointer size. Moreover, after a check, first step of nla_parse_nested/nla_parse is to erase tb array as well. We can remove both calls safely. Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
* | netfilter: bridge: namespace bridge netfilter sysctlsChristian Brauner2019-06-171-45/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the /proc/sys/net/bridge folder is only created in the initial network namespace. This patch ensures that the /proc/sys/net/bridge folder is available in each network namespace if the module is loaded and disappears from all network namespaces when the module is unloaded. In doing so the patch makes the sysctls: bridge-nf-call-arptables bridge-nf-call-ip6tables bridge-nf-call-iptables bridge-nf-filter-pppoe-tagged bridge-nf-filter-vlan-tagged bridge-nf-pass-vlan-input-dev apply per network namespace. This unblocks some use-cases where users would like to e.g. not do bridge filtering for bridges in a specific network namespace while doing so for bridges located in another network namespace. The netfilter rules are afaict already per network namespace so it should be safe for users to specify whether bridge devices inside a network namespace are supposed to go through iptables et al. or not. Also, this can already be done per-bridge by setting an option for each individual bridge via Netlink. It should also be possible to do this for all bridges in a network namespace via sysctls. Cc: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | netfilter: bridge: port sysctls to use brnf_netChristian Brauner2019-06-173-60/+107
| | | | | | | | | | | | | | | | | | | | This ports the sysctls to use struct brnf_net. With this patch we make it possible to namespace the br_netfilter module in the following patch. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | netfilter: xt_owner: bail out with EINVAL in case of unsupported flagsPablo Neira Ayuso2019-06-172-0/+8
| | | | | | | | | | | | Reject flags that are not supported with EINVAL. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | netfilter: conntrack: small conntrack lookup optimizationFlorian Westphal2019-06-172-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ____nf_conntrack_find() performs checks on the conntrack objects in this order: 1. if (nf_ct_is_expired(ct)) This fetches ct->timeout, in third cache line. The hnnode that is used to store the list pointers resides in the first (origin) or second (reply tuple) cache lines. This test rarely passes, but its necessary to reap obsolete entries. 2. if (nf_ct_is_dying(ct)) This fetches ct->status, also in third cache line. The test is useless, and can be removed: Consider: cpu0 cpu1 ct = ____nf_conntrack_find() atomic_inc_not_zero(ct) -> ok nf_ct_key_equal -> ok is_dying -> DYING bit not set, ok set_bit(ct, DYING); ... unhash ... etc. return ct -> returning a ct with dying bit set, despite having a test for it. This (unlikely) case is fine - refcount prevents ct from getting free'd. 3. if (nf_ct_key_equal(h, tuple, zone, net)) nf_ct_key_equal checks in following order: 1. Tuple equal (first or second cacheline) 2. Zone equal (third cacheline) 3. confirmed bit set (->status, third cacheline) 4. net namespace match (third cacheline). Swapping "timeout" and "cpu" places timeout in the first cacheline. This has two advantages: 1. For a conntrack that won't even match the original tuple, we will now only fetch the first and maybe the second cacheline instead of always accessing the 3rd one as well. 2. in case of TCP ct->timeout changes frequently because we reduce/increase it when there are packets outstanding in the network. The first cacheline contains both the reference count and the ct spinlock, i.e. moving timeout there avoids writes to 3rd cacheline. The restart sequence in __nf_conntrack_find() is removed, if we found a candidate, but then fail to increment the refcount or discover the tuple has changed (object recycling), just pretend we did not find an entry. A second lookup won't find anything until another CPU adds a new conntrack with identical tuple into the hash table, which is very unlikely. We have the confirmation-time checks (when we hold hash lock) that deal with identical entries and even perform clash resolution in some cases. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | netfilter: nft_ct: add ct expectations supportStéphane Veyret2019-06-172-3/+149
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows to add, list and delete expectations via nft objref infrastructure and assigning these expectations via nft rule. This allows manual port triggering when no helper is defined to manage a specific protocol. For example, if I have an online game which protocol is based on initial connection to TCP port 9753 of the server, and where the server opens a connection to port 9876, I can set rules as follow: table ip filter { ct expectation mygame { protocol udp; dport 9876; timeout 2m; size 1; } chain input { type filter hook input priority 0; policy drop; tcp dport 9753 ct expectation set "mygame"; } chain output { type filter hook output priority 0; policy drop; udp dport 9876 ct status expected accept; } } Signed-off-by: Stéphane Veyret <sveyret@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: ipv6: Fix undefined symbol nf_ct_frag6_gatherwenxu2019-06-061-1/+3
| | | | | | | | | | | CONFIG_NETFILTER=m and CONFIG_NF_DEFRAG_IPV6 is not set ERROR: "nf_ct_frag6_gather" [net/ipv6/ipv6.ko] undefined! Fixes: c9bb6165a16e ("netfilter: nf_conntrack_bridge: fix CONFIG_IPV6=y") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* inet_connection_sock: remove unused parameter of reqsk_queue_unlink funcZhiqiang Liu2019-06-051-3/+2
| | | | | | | | | small cleanup: "struct request_sock_queue *queue" parameter of reqsk_queue_unlink func is never used in the func, so we can remove it. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: phy: remove state PHY_FORCINGHeiner Kallweit2019-06-052-35/+2
| | | | | | | | | | | In the early days of phylib we had a functionality that changed to the next lower speed in fixed mode if no link was established after a certain period of time. This functionality has been removed years ago, and state PHY_FORCING isn't needed any longer. Instead we can go from UP to RUNNING or NOLINK directly (same as in autoneg mode). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: rds: add per rds connection cache statisticsZhu Yanjun2019-06-052-0/+4
| | | | | | | | | | | | | | | | | | | | | | | The variable cache_allocs is to indicate how many frags (KiB) are in one rds connection frag cache. The command "rds-info -Iv" will output the rds connection cache statistics as below: " RDS IB Connections: LocalAddr RemoteAddr Tos SL LocalDev RemoteDev 1.1.1.14 1.1.1.14 58 255 fe80::2:c903:a:7a31 fe80::2:c903:a:7a31 send_wr=256, recv_wr=1024, send_sge=8, rdma_mr_max=4096, rdma_mr_size=257, cache_allocs=12 " This means that there are about 12KiB frag in this rds connection frag cache. Since rds.h in rds-tools is not related with the kernel rds.h, the change in kernel rds.h does not affect rds-tools. rds-info in rds-tools 2.0.5 and 2.0.6 is tested with this commit. It works well. Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'dwmac-mediatek'David S. Miller2019-06-053-3/+15
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Biao Huang says: ==================== complete dwmac-mediatek driver and fix flow control issue Changes in v2: patch#1: there is no extra action in mediatek_dwmac_remove, remove it v1: This series mainly complete dwmac-mediatek driver: 1. add power on/off operations for dwmac-mediatek. 2. disable rx watchdog to reduce rx path reponding time. 3. change the default value of tx-frames from 25 to 1, so ptp4l will test pass by default. and also fix the issue that flow control won't be disabled any more once being enabled. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: stmmac: dwmac4: fix flow control issueBiao Huang2019-06-051-2/+6
| | | | | | | | | | | | | | | | | | | | | | Current dwmac4_flow_ctrl will not clear GMAC_RX_FLOW_CTRL_RFE/GMAC_RX_FLOW_CTRL_RFE bits, so MAC hw will keep flow control on although expecting flow control off by ethtool. Add codes to fix it. Fixes: 477286b53f55 ("stmmac: add GMAC4 core support") Signed-off-by: Biao Huang <biao.huang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: stmmac: modify default value of tx-framesBiao Huang2019-06-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the default value of tx-frames is 25, it's too late when passing tstamp to stack, then the ptp4l will fail: ptp4l -i eth0 -f gPTP.cfg -m ptp4l: selected /dev/ptp0 as PTP clock ptp4l: port 1: INITIALIZING to LISTENING on INITIALIZE ptp4l: port 0: INITIALIZING to LISTENING on INITIALIZE ptp4l: port 1: link up ptp4l: timed out while polling for tx timestamp ptp4l: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug ptp4l: port 1: send peer delay response failed ptp4l: port 1: LISTENING to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED) ptp4l tests pass when changing the tx-frames from 25 to 1 with ethtool -C option. It should be fine to set tx-frames default value to 1, so ptp4l will pass by default. Signed-off-by: Biao Huang <biao.huang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: stmmac: dwmac-mediatek: disable rx watchdogBiao Huang2019-06-051-0/+1
| | | | | | | | | | | | | | | | | | disable rx watchdog for dwmac-mediatek, then the hw will issue a rx interrupt once receiving a packet, so the responding time for rx path will be reduced. Signed-off-by: Biao Huang <biao.huang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: stmmac: dwmac-mediatek: enable Ethernet power domainBiao Huang2019-06-051-0/+7
|/ | | | | | | add Ethernet power on/off operations in init/exit flow. Signed-off-by: Biao Huang <biao.huang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* drivers: net: vxlan: drop unneeded likely() call around IS_ERR()Enrico Weigelt2019-06-051-1/+1
| | | | | | | | IS_ERR() already calls unlikely(), so this extra likely() call around the !IS_ERR() is not needed. Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv6: drop unneeded likely() call around IS_ERR()Enrico Weigelt2019-06-052-2/+2
| | | | | | | | IS_ERR() already calls unlikely(), so this extra unlikely() call around IS_ERR() is not needed. Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: drop unneeded likely() call around IS_ERR()Enrico Weigelt2019-06-054-4/+4
| | | | | | | | IS_ERR() already calls unlikely(), so this extra unlikely() call around IS_ERR() is not needed. Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: openvswitch: drop unneeded likely() call around IS_ERR()Enrico Weigelt2019-06-051-1/+1
| | | | | | | | IS_ERR() already calls unlikely(), so this extra likely() call around the !IS_ERR() is not needed. Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: socket: drop unneeded likely() call around IS_ERR()Enrico Weigelt2019-06-051-1/+1
| | | | | | | | IS_ERR() already calls unlikely(), so this extra likely() call around the !IS_ERR() is not needed. Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* nfp: flower: use struct_size() helperGustavo A. R. Silva2019-06-051-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct nfp_tun_active_tuns { ... struct route_ip_info { __be32 ipv4; __be32 egress_port; __be32 extra[2]; } tun_info[]; }; Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. So, replace the following form: sizeof(struct nfp_tun_active_tuns) + sizeof(struct route_ip_info) * count with: struct_size(payload, tun_info, count) This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* i40e: Check and set the PF driver state first in i40e_ndo_set_vf_macLihong Yang2019-06-051-5/+5
| | | | | | | | | | | | The PF driver state flag __I40E_VIRTCHNL_OP_PENDING needs to be checked and set at the beginning of i40e_ndo_set_vf_mac. Otherwise, if there are error conditions before it, the flag will be cleared unexpectedly by this function to cause potential race conditions. Hence move the check to the top of this function. Signed-off-by: Lihong Yang <lihong.yang@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* i40e: Do not check VF state in i40e_ndo_get_vf_configLihong Yang2019-06-051-4/+2
| | | | | | | | | | | | | The VF configuration returned in i40e_ndo_get_vf_config is already stored by the PF. There is no dependency on any specific state of the VF to return the configuration. Drop the check against I40E_VF_STATE_INIT since it is not needed. Signed-off-by: Lihong Yang <lihong.yang@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch '10GbE' of ↵David S. Miller2019-06-0513-124/+200
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 10GbE Intel Wired LAN Driver Updates 2019-06-05 This series contains updates to mainly ixgbe, with a few updates to i40e, net, ice and hns2 driver. Jan adds support for tracking each queue pair for whether or not AF_XDP zero copy is enabled. Also updated the ixgbe driver to use the netdev-provided umems so that we do not need to contain these structures in our own adapter structure. William Tu provides two fixes for AF_XDP statistics which were causing incorrect counts. Jake reduces the PTP transmit timestamp timeout from 15 seconds to 1 second, which is still well after the maximum expected delay. Also fixes an issues with the PTP SDP pin setup which was not properly aligning on a full second, so updated the code to account for the cyclecounter multiplier and simplify the code to make the intent of the calculations more clear. Updated the function header comments to help with the code documentation. Added support for SDP/PPS output for x550 devices, which is slightly different than x540 devices that currently have this support. Anirudh adds a new define for Link Layer Discovery Protocol to the networking core, so that drivers do not have to create and use their own definitions. In addition, update all the drivers currently defining their own LLDP define to use the new networking core define. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ixgbevf: fix a missing check of ixgbevf_write_msg_read_ackKangjie Lu2019-06-051-3/+2
| | | | | | | | | | | | | | | | | | If ixgbevf_write_msg_read_ack fails, return its error code upstream Signed-off-by: Kangjie Lu <kjlu@umn.edu> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * ixgbe: implement support for SDP/PPS output on X550 hardwareJacob Keller2019-06-052-5/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to the X540 hardware, enable support for generating a 1pps output signal on SDP0. This support is slightly different to the X540 hardware, because of the register layout changes. First, the system time register is now represented in 'cycles' and 'billions of cycles'. Second, we need to also program the TSSDP register, as well as the ESDP register. Third, the clock output uses only FREQOUT, instead of a full 64bit value for the output clock period. Finally, we have to use the ST0 bit instead of the SYNCLK bit in the TSAUXC register. This support should work even for the hardware with a higher frequency clock, as it carefully takes into account the multiply and shift of the cycle counter used. We also set the pps configuration to 1, since we now support generating a pulse per second output. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * net: hns3: Use LLDP ethertype define ETH_P_LLDPAnirudh Venkataramanan2019-06-052-2/+1
| | | | | | | | | | | | | | | | Remove references to HCLGE_MAC_ETHERTYPE_LLDP and use ETH_P_LLDP instead. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * ice: Use LLDP ethertype define ETH_P_LLDPJeff Kirsher2019-06-051-3/+1
| | | | | | | | | | | | | | Instead of using a local define for the LLDP ethertype, use the kernel define ETH_P_LLDP. Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * ixgbe: Use LLDP ethertype define ETH_P_LLDPAnirudh Venkataramanan2019-06-052-3/+1
| | | | | | | | | | | | | | | | Remove references to IXGBE_ETH_P_LLD and use ETH_P_LLDP instead. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * i40e: Use LLDP ethertype define ETH_P_LLDPAnirudh Venkataramanan2019-06-052-4/+2
| | | | | | | | | | | | | | | | Remove references to I40E_ETH_P_LLDP and use ETH_P_LLDP instead. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>