summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* net: x25: fix one potential use-after-free issuelinzhang2017-05-183-11/+22
| | | | | | | | | | | | The function x25_init is not properly unregister related resources on error handler.It is will result in kernel oops if x25_init init failed, so add properly unregister call on error handler. Also, i adjust the coding style and make x25_register_sysctl properly return failure. Signed-off-by: linzhang <xiaolou4617@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bpf: adjust verifier heuristicsDaniel Borkmann2017-05-171-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current limits with regards to processing program paths do not really reflect today's needs anymore due to programs becoming more complex and verifier smarter, keeping track of more data such as const ALU operations, alignment tracking, spilling of PTR_TO_MAP_VALUE_ADJ registers, and other features allowing for smarter matching of what LLVM generates. This also comes with the side-effect that we result in fewer opportunities to prune search states and thus often need to do more work to prove safety than in the past due to different register states and stack layout where we mismatch. Generally, it's quite hard to determine what caused a sudden increase in complexity, it could be caused by something as trivial as a single branch somewhere at the beginning of the program where LLVM assigned a stack slot that is marked differently throughout other branches and thus causing a mismatch, where verifier then needs to prove safety for the whole rest of the program. Subsequently, programs with even less than half the insn size limit can get rejected. We noticed that while some programs load fine under pre 4.11, they get rejected due to hitting limits on more recent kernels. We saw that in the vast majority of cases (90+%) pruning failed due to register mismatches. In case of stack mismatches, majority of cases failed due to different stack slot types (invalid, spill, misc) rather than differences in spilled registers. This patch makes pruning more aggressive by also adding markers that sit at conditional jumps as well. Currently, we only mark jump targets for pruning. For example in direct packet access, these are usually error paths where we bail out. We found that adding these markers, it can reduce number of processed insns by up to 30%. Another option is to ignore reg->id in probing PTR_TO_MAP_VALUE_OR_NULL registers, which can help pruning slightly as well by up to 7% observed complexity reduction as stand-alone. Meaning, if a previous path with register type PTR_TO_MAP_VALUE_OR_NULL for map X was found to be safe, then in the current state a PTR_TO_MAP_VALUE_OR_NULL register for the same map X must be safe as well. Last but not least the patch also adds a scheduling point and bumps the current limit for instructions to be processed to a more adequate value. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Check ip6_find_1stfragopt() return value properly.David S. Miller2017-05-173-12/+12
| | | | | | | | | Do not use unsigned variables to see if it returns a negative error or not. Fixes: 2423496af35d ("ipv6: Prevent overrun when parsing v6 header options") Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* selftests/bpf: fix broken build due to types.hYonghong Song2017-05-172-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | Commit 0a5539f66133 ("bpf: Provide a linux/types.h override for bpf selftests.") caused a build failure for tools/testing/selftest/bpf because of some missing types: $ make -C tools/testing/selftests/bpf/ ... In file included from /home/yhs/work/net-next/tools/testing/selftests/bpf/test_pkt_access.c:8: ../../../include/uapi/linux/bpf.h:170:3: error: unknown type name '__aligned_u64' __aligned_u64 key; ... /usr/include/linux/swab.h:160:8: error: unknown type name '__always_inline' static __always_inline __u16 __swab16p(const __u16 *p) ... The type __aligned_u64 is defined in linux:include/uapi/linux/types.h. The fix is to copy missing type definition into tools/testing/selftests/bpf/include/uapi/linux/types.h. Adding additional include "string.h" resolves __always_inline issue. Fixes: 0a5539f66133 ("bpf: Provide a linux/types.h override for bpf selftests.") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'bnxt_en-DCBX-fixes'David S. Miller2017-05-172-4/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | Michael Chan says: ==================== bnxt_en: DCBX fixes. 2 bug fixes for the case where the NIC's firmware DCBX agent is enabled. With these fixes, we will return the proper information to lldpad. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * bnxt_en: Check status of firmware DCBX agent before setting DCB_CAP_DCBX_HOST.Michael Chan2017-05-171-2/+4
| | | | | | | | | | | | | | | | Otherwise, all the host based DCBX settings from lldpad will fail if the firmware DCBX agent is running. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bnxt_en: Call bnxt_dcb_init() after getting firmware DCBX configuration.Michael Chan2017-05-171-2/+1
|/ | | | | | | | | | In the current code, bnxt_dcb_init() is called too early before we determine if the firmware DCBX agent is running or not. As a result, we are not setting the DCB_CAP_DCBX_HOST and DCB_CAP_DCBX_LLD_MANAGED flags properly to report to DCBNL. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: fix compile error in skb_orphan_partial()Eric Dumazet2017-05-171-3/+0
| | | | | | | | | | | | | | | | | | | If CONFIG_INET is not set, net/core/sock.c can not compile : net/core/sock.c: In function ‘skb_orphan_partial’: net/core/sock.c:1810:2: error: implicit declaration of function ‘skb_is_tcp_pure_ack’ [-Werror=implicit-function-declaration] if (skb_is_tcp_pure_ack(skb)) ^ Fix this by always including <net/tcp.h> Fixes: f6ba8d33cfbb ("netem: fix skb_orphan_partial()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Prevent overrun when parsing v6 header optionsCraig Gallek2017-05-174-6/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The KASAN warning repoted below was discovered with a syzkaller program. The reproducer is basically: int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP); send(s, &one_byte_of_data, 1, MSG_MORE); send(s, &more_than_mtu_bytes_data, 2000, 0); The socket() call sets the nexthdr field of the v6 header to NEXTHDR_HOP, the first send call primes the payload with a non zero byte of data, and the second send call triggers the fragmentation path. The fragmentation code tries to parse the header options in order to figure out where to insert the fragment option. Since nexthdr points to an invalid option, the calculation of the size of the network header can made to be much larger than the linear section of the skb and data is read outside of it. This fix makes ip6_find_1stfrag return an error if it detects running out-of-bounds. [ 42.361487] ================================================================== [ 42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730 [ 42.365471] Read of size 840 at addr ffff88000969e798 by task ip6_fragment-oo/3789 [ 42.366469] [ 42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41 [ 42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014 [ 42.368824] Call Trace: [ 42.369183] dump_stack+0xb3/0x10b [ 42.369664] print_address_description+0x73/0x290 [ 42.370325] kasan_report+0x252/0x370 [ 42.370839] ? ip6_fragment+0x11c8/0x3730 [ 42.371396] check_memory_region+0x13c/0x1a0 [ 42.371978] memcpy+0x23/0x50 [ 42.372395] ip6_fragment+0x11c8/0x3730 [ 42.372920] ? nf_ct_expect_unregister_notifier+0x110/0x110 [ 42.373681] ? ip6_copy_metadata+0x7f0/0x7f0 [ 42.374263] ? ip6_forward+0x2e30/0x2e30 [ 42.374803] ip6_finish_output+0x584/0x990 [ 42.375350] ip6_output+0x1b7/0x690 [ 42.375836] ? ip6_finish_output+0x990/0x990 [ 42.376411] ? ip6_fragment+0x3730/0x3730 [ 42.376968] ip6_local_out+0x95/0x160 [ 42.377471] ip6_send_skb+0xa1/0x330 [ 42.377969] ip6_push_pending_frames+0xb3/0xe0 [ 42.378589] rawv6_sendmsg+0x2051/0x2db0 [ 42.379129] ? rawv6_bind+0x8b0/0x8b0 [ 42.379633] ? _copy_from_user+0x84/0xe0 [ 42.380193] ? debug_check_no_locks_freed+0x290/0x290 [ 42.380878] ? ___sys_sendmsg+0x162/0x930 [ 42.381427] ? rcu_read_lock_sched_held+0xa3/0x120 [ 42.382074] ? sock_has_perm+0x1f6/0x290 [ 42.382614] ? ___sys_sendmsg+0x167/0x930 [ 42.383173] ? lock_downgrade+0x660/0x660 [ 42.383727] inet_sendmsg+0x123/0x500 [ 42.384226] ? inet_sendmsg+0x123/0x500 [ 42.384748] ? inet_recvmsg+0x540/0x540 [ 42.385263] sock_sendmsg+0xca/0x110 [ 42.385758] SYSC_sendto+0x217/0x380 [ 42.386249] ? SYSC_connect+0x310/0x310 [ 42.386783] ? __might_fault+0x110/0x1d0 [ 42.387324] ? lock_downgrade+0x660/0x660 [ 42.387880] ? __fget_light+0xa1/0x1f0 [ 42.388403] ? __fdget+0x18/0x20 [ 42.388851] ? sock_common_setsockopt+0x95/0xd0 [ 42.389472] ? SyS_setsockopt+0x17f/0x260 [ 42.390021] ? entry_SYSCALL_64_fastpath+0x5/0xbe [ 42.390650] SyS_sendto+0x40/0x50 [ 42.391103] entry_SYSCALL_64_fastpath+0x1f/0xbe [ 42.391731] RIP: 0033:0x7fbbb711e383 [ 42.392217] RSP: 002b:00007ffff4d34f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 42.393235] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbbb711e383 [ 42.394195] RDX: 0000000000001000 RSI: 00007ffff4d34f60 RDI: 0000000000000003 [ 42.395145] RBP: 0000000000000046 R08: 00007ffff4d34f40 R09: 0000000000000018 [ 42.396056] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400aad [ 42.396598] R13: 0000000000000066 R14: 00007ffff4d34ee0 R15: 00007fbbb717af00 [ 42.397257] [ 42.397411] Allocated by task 3789: [ 42.397702] save_stack_trace+0x16/0x20 [ 42.398005] save_stack+0x46/0xd0 [ 42.398267] kasan_kmalloc+0xad/0xe0 [ 42.398548] kasan_slab_alloc+0x12/0x20 [ 42.398848] __kmalloc_node_track_caller+0xcb/0x380 [ 42.399224] __kmalloc_reserve.isra.32+0x41/0xe0 [ 42.399654] __alloc_skb+0xf8/0x580 [ 42.400003] sock_wmalloc+0xab/0xf0 [ 42.400346] __ip6_append_data.isra.41+0x2472/0x33d0 [ 42.400813] ip6_append_data+0x1a8/0x2f0 [ 42.401122] rawv6_sendmsg+0x11ee/0x2db0 [ 42.401505] inet_sendmsg+0x123/0x500 [ 42.401860] sock_sendmsg+0xca/0x110 [ 42.402209] ___sys_sendmsg+0x7cb/0x930 [ 42.402582] __sys_sendmsg+0xd9/0x190 [ 42.402941] SyS_sendmsg+0x2d/0x50 [ 42.403273] entry_SYSCALL_64_fastpath+0x1f/0xbe [ 42.403718] [ 42.403871] Freed by task 1794: [ 42.404146] save_stack_trace+0x16/0x20 [ 42.404515] save_stack+0x46/0xd0 [ 42.404827] kasan_slab_free+0x72/0xc0 [ 42.405167] kfree+0xe8/0x2b0 [ 42.405462] skb_free_head+0x74/0xb0 [ 42.405806] skb_release_data+0x30e/0x3a0 [ 42.406198] skb_release_all+0x4a/0x60 [ 42.406563] consume_skb+0x113/0x2e0 [ 42.406910] skb_free_datagram+0x1a/0xe0 [ 42.407288] netlink_recvmsg+0x60d/0xe40 [ 42.407667] sock_recvmsg+0xd7/0x110 [ 42.408022] ___sys_recvmsg+0x25c/0x580 [ 42.408395] __sys_recvmsg+0xd6/0x190 [ 42.408753] SyS_recvmsg+0x2d/0x50 [ 42.409086] entry_SYSCALL_64_fastpath+0x1f/0xbe [ 42.409513] [ 42.409665] The buggy address belongs to the object at ffff88000969e780 [ 42.409665] which belongs to the cache kmalloc-512 of size 512 [ 42.410846] The buggy address is located 24 bytes inside of [ 42.410846] 512-byte region [ffff88000969e780, ffff88000969e980) [ 42.411941] The buggy address belongs to the page: [ 42.412405] page:ffffea000025a780 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0 [ 42.413298] flags: 0x100000000008100(slab|head) [ 42.413729] raw: 0100000000008100 0000000000000000 0000000000000000 00000001800c000c [ 42.414387] raw: ffffea00002a9500 0000000900000007 ffff88000c401280 0000000000000000 [ 42.415074] page dumped because: kasan: bad access detected [ 42.415604] [ 42.415757] Memory state around the buggy address: [ 42.416222] ffff88000969e880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 42.416904] ffff88000969e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 42.417591] >ffff88000969e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 42.418273] ^ [ 42.418588] ffff88000969ea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 42.419273] ffff88000969ea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 42.419882] ================================================================== Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* neighbour: update neigh timestamps iff update is effectiveIhar Hrachyshka2017-05-171-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's a common practice to send gratuitous ARPs after moving an IP address to another device to speed up healing of a service. To fulfill service availability constraints, the timing of network peers updating their caches to point to a new location of an IP address can be particularly important. Sometimes neigh_update calls won't touch neither lladdr nor state, for example if an update arrives in locktime interval. The neigh->updated value is tested by the protocol specific neigh code, which in turn will influence whether NEIGH_UPDATE_F_OVERRIDE gets set in the call to neigh_update() or not. As a result, we may effectively ignore the update request, bailing out of touching the neigh entry, except that we still bump its timestamps inside neigh_update. This may be a problem for updates arriving in quick succession. For example, consider the following scenario: A service is moved to another device with its IP address. The new device sends three gratuitous ARP requests into the network with ~1 seconds interval between them. Just before the first request arrives to one of network peer nodes, its neigh entry for the IP address transitions from STALE to DELAY. This transition, among other things, updates neigh->updated. Once the kernel receives the first gratuitous ARP, it ignores it because its arrival time is inside the locktime interval. The kernel still bumps neigh->updated. Then the second gratuitous ARP request arrives, and it's also ignored because it's still in the (new) locktime interval. Same happens for the third request. The node eventually heals itself (after delay_first_probe_time seconds since the initial transition to DELAY state), but it just wasted some time and require a new ARP request/reply round trip. This unfortunate behaviour both puts more load on the network, as well as reduces service availability. This patch changes neigh_update so that it bumps neigh->updated (as well as neigh->confirmed) only once we are sure that either lladdr or entry state will change). In the scenario described above, it means that the second gratuitous ARP request will actually update the entry lladdr. Ideally, we would update the neigh entry on the very first gratuitous ARP request. The locktime mechanism is designed to ignore ARP updates in a short timeframe after a previous ARP update was honoured by the kernel layer. This would require tracking timestamps for state transitions separately from timestamps when actual updates are received. This would probably involve changes in neighbour struct. Therefore, the patch doesn't tackle the issue of the first gratuitous APR ignored, leaving it for a follow-up. Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* arp: honour gratuitous ARP _replies_Ihar Hrachyshka2017-05-171-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When arp_accept is 1, gratuitous ARPs are supposed to override matching entries irrespective of whether they arrive during locktime. This was implemented in commit 56022a8fdd87 ("ipv4: arp: update neighbour address when a gratuitous arp is received and arp_accept is set") There is a glitch in the patch though. RFC 2002, section 4.6, "ARP, Proxy ARP, and Gratuitous ARP", defines gratuitous ARPs so that they can be either of Request or Reply type. Those Reply gratuitous ARPs can be triggered with standard tooling, for example, arping -A option does just that. This patch fixes the glitch, making both Request and Reply flavours of gratuitous ARPs to behave identically. As per RFC, if gratuitous ARPs are of Reply type, their Target Hardware Address field should also be set to the link-layer address to which this cache entry should be updated. The field is present in ARP over Ethernet but not in IEEE 1394. In this patch, I don't consider any broadcasted ARP replies as gratuitous if the field is not present, to conform the standard. It's not clear whether there is such a thing for IEEE 1394 as a gratuitous ARP reply; until it's cleared up, we will ignore such broadcasts. Note that they will still update existing ARP cache entries, assuming they arrive out of locktime time interval. Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mlx5e: add CONFIG_INET dependencyArnd Bergmann2017-05-161-1/+1
| | | | | | | | | | | | | | | | | | We now reference the arp_tbl, which requires IPv4 support to be enabled in the kernel, otherwise we get a link error: drivers/net/built-in.o: In function `mlx5e_tc_update_neigh_used_value': (.text+0x16afec): undefined reference to `arp_tbl' drivers/net/built-in.o: In function `mlx5e_rep_neigh_init': en_rep.c:(.text+0x16c16d): undefined reference to `arp_tbl' drivers/net/built-in.o: In function `mlx5e_rep_netevent_event': en_rep.c:(.text+0x16cbb5): undefined reference to `arp_tbl' This adds a Kconfig dependency for it. Fixes: 232c001398ae ("net/mlx5e: Add support to neighbour update flow") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Improve handling of failures on link and route dumpsDavid Ahern2017-05-163-28/+49
| | | | | | | | | | | | | | | | | | | | In general, rtnetlink dumps do not anticipate failure to dump a single object (e.g., link or route) on a single pass. As both route and link objects have grown via more attributes, that is no longer a given. netlink dumps can handle a failure if the dump function returns an error; specifically, netlink_dump adds the return code to the response if it is <= 0 so userspace is notified of the failure. The missing piece is the rtnetlink dump functions returning the error. Fix route and link dump functions to return the errors if no object is added to an skb (detected by skb->len != 0). IPv6 route dumps (rt6_dump_route) already return the error; this patch updates IPv4 and link dumps. Other dump functions may need to be ajusted as well. Reported-by: Jan Moskyto Matejka <mq@ucw.cz> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Add warning about remote memory exposureChristoph Hellwig2017-05-161-0/+4
| | | | | | | | | | The driver explicitly bypasses APIs to register all memory once a connection is made, and thus allows remote access to memory. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Leon Romanovsky <leon@kernel.org> Acked-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* smc: switch to usage of IB_PD_UNSAFE_GLOBAL_RKEYUrsula Braun2017-05-165-37/+8
| | | | | | | | | | | | | | Currently, SMC enables remote access to physical memory when a user has successfully configured and established an SMC-connection until ten minutes after the last SMC connection is closed. Because this is considered a security risk, drivers are supposed to use IB_PD_UNSAFE_GLOBAL_RKEY in such a case. This patch changes the current SMC code to use IB_PD_UNSAFE_GLOBAL_RKEY. This improves user awareness, but does not remove the security risk itself. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipmr: vrf: Find VIFs using the actual deviceThomas Winter2017-05-161-2/+16
| | | | | | | | | | | | | | | | The skb->dev that is passed into ip_mr_input is the loX device for VRFs. When we lookup a vif for this dev, none is found as we do not create vifs for loopbacks. Instead lookup a vif for the actual device that the packet was received on, eg the vlan. Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz> cc: David Ahern <dsa@cumulusnetworks.com> cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> cc: roopa <roopa@cumulusnetworks.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: eliminate negative reordering in tcp_clean_rtx_queueSoheil Hassas Yeganeh2017-05-161-1/+1
| | | | | | | | | | | | | | | | | | | | | tcp_ack() can call tcp_fragment() which may dededuct the value tp->fackets_out when MSS changes. When prior_fackets is larger than tp->fackets_out, tcp_clean_rtx_queue() can invoke tcp_update_reordering() with negative values. This results in absurd tp->reodering values higher than sysctl_tcp_max_reordering. Note that tcp_update_reordering indeeds sets tp->reordering to min(sysctl_tcp_max_reordering, metric), but because the comparison is signed, a negative metric always wins. Fixes: c7caf8d3ed7a ("[TCP]: Fix reord detection due to snd_una covered holes") Reported-by: Rebecca Isaacs <risaacs@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2017-05-1574-249/+1026
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Track alignment in BPF verifier so that legitimate programs won't be rejected on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures. 2) Make tail calls work properly in arm64 BPF JIT, from Deniel Borkmann. 3) Make the configuration and semantics Generic XDP make more sense and don't allow both generic XDP and a driver specific instance to be active at the same time. Also from Daniel. 4) Don't crash on resume in xen-netfront, from Vitaly Kuznetsov. 5) Fix use-after-free in VRF driver, from Gao Feng. 6) Use netdev_alloc_skb_ip_align() to avoid unaligned IP headers in qca_spi driver, from Stefan Wahren. 7) Always run cleanup routines in BPF samples when we get SIGTERM, from Andy Gospodarek. 8) The mdio phy code should bring PHYs out of reset using the shared GPIO lines before invoking bus->reset(). From Florian Fainelli. 9) Some USB descriptor access endian fixes in various drivers from Johan Hovold. 10) Handle PAUSE advertisements properly in mlx5 driver, from Gal Pressman. 11) Fix reversed test in mlx5e_setup_tc(), from Saeed Mahameed. 12) Cure netdev leak in AF_PACKET when using timestamping via control messages. From Douglas Caetano dos Santos. 13) netcp doesn't support HWTSTAMP_FILTER_ALl, reject it. From Miroslav Lichvar. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits) ldmvsw: stop the clean timer at beginning of remove ldmvsw: unregistering netdev before disable hardware net: netcp: fix check of requested timestamping filter ipv6: avoid dad-failures for addresses with NODAD qed: Fix uninitialized data in aRFS infrastructure mdio: mux: fix device_node_continue.cocci warnings net/packet: fix missing net_device reference release net/mlx4_core: Use min3 to select number of MSI-X vectors macvlan: Fix performance issues with vlan tagged packets net: stmmac: use correct pointer when printing normal descriptor ring net/mlx5: Use underlay QPN from the root name space net/mlx5e: IPoIB, Only support regular RQ for now net/mlx5e: Fix setup TC ndo net/mlx5e: Fix ethtool pause support and advertise reporting net/mlx5e: Use the correct pause values for ethtool advertising vmxnet3: ensure that adapter is in proper state during force_close sfc: revert changes to NIC revision numbers net: ch9200: add missing USB-descriptor endianness conversions net: irda: irda-usb: fix firmware name on big-endian hosts net: dsa: mv88e6xxx: add default case to switch ...
| * Merge branch 'ldmsw-fixes'David S. Miller2017-05-151-2/+2
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shannon Nelson says: ==================== ldmvsw: port removal stability Under heavy reboot stress testing we found a couple of timing issues when removing the device that could cause the kernel great heartburn, addressed by these two patches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * ldmvsw: stop the clean timer at beginning of removeShannon Nelson2017-05-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stop the clean timer earlier to be sure there's no asynchronous interference while stopping the port. Orabug: 25748241 Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * ldmvsw: unregistering netdev before disable hardwareThomas Tai2017-05-151-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running LDom binding/unbinding test, kernel may panic in ldmvsw_open(). It is more likely that because we're removing the ldc connection before unregistering the netdev in vsw_port_remove(), we set up a window of time where one process could be removing the device while another trying to UP the device. This also sometimes causes vio handshake error due to opening a device without closing it completely. We should unregister the netdev before we disable the "hardware". Orabug: 25980913, 25925306 Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: netcp: fix check of requested timestamping filterMiroslav Lichvar2017-05-151-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | The driver doesn't support timestamping of all received packets and should return error when trying to enable the HWTSTAMP_FILTER_ALL filter. Cc: WingMan Kwok <w-kwok2@ti.com> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge tag 'mlx5-fixes-2017-05-12-V2' of ↵David S. Miller2017-05-1510-23/+49
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== Mellanox, mlx5 fixes 2017-05-12 This series contains some mlx5 fixes for net. Please pull and let me know if there's any problem. For -stable: ("net/mlx5e: Fix ethtool pause support and advertise reporting") kernels >= 4.8 ("net/mlx5e: Use the correct pause values for ethtool advertising") kernels >= 4.8 v1->v2: Dropped statistics spinlock patch, it needs some extra work. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * net/mlx5: Use underlay QPN from the root name spaceYishai Hadas2017-05-148-19/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Root flow table is dynamically changed by the underlying flow steering layer, and IPoIB/ULPs have no idea what will be the root flow table in the future, hence we need a dynamic infrastructure to move Underlay QPs with the root flow table. Fixes: b3ba51498bdd ("net/mlx5: Refactor create flow table method to accept underlay QP") Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| | * net/mlx5e: IPoIB, Only support regular RQ for nowSaeed Mahameed2017-05-141-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | IPoIB doesn't support striding RQ at the moment, for this we need to explicitly choose non striding RQ in IPoIB init, even if the HW supports it. Fixes: 8f493ffd88ea ("net/mlx5e: IPoIB, RX steering RSS RQTs and TIRs") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| | * net/mlx5e: Fix setup TC ndoSaeed Mahameed2017-05-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Fail-safe support patches introduced a trivial bug, setup tc callback is doing a wrong check of the netdevice state, the fix is simply to invert the condition. Fixes: 6f9485af4020 ("net/mlx5e: Fail safe tc setup") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| | * net/mlx5e: Fix ethtool pause support and advertise reportingGal Pressman2017-05-141-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pause bit should set when RX pause is on, not TX pause. Also, setting Asym_Pause is incorrect, and should be turned off. Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API") Signed-off-by: Gal Pressman <galp@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| | * net/mlx5e: Use the correct pause values for ethtool advertisingGal Pressman2017-05-141-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Query the operational pause from firmware (PFCC register) instead of always passing zeros. Fixes: 665bc53969d7 ("net/mlx5e: Use new ethtool get/set link ksettings API") Signed-off-by: Gal Pressman <galp@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * | ipv6: avoid dad-failures for addresses with NODADMahesh Bandewar2017-05-151-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every address gets added with TENTATIVE flag even for the addresses with IFA_F_NODAD flag and dad-work is scheduled for them. During this DAD process we realize it's an address with NODAD and complete the process without sending any probe. However the TENTATIVE flags stays on the address for sometime enough to cause misinterpretation when we receive a NS. While processing NS, if the address has TENTATIVE flag, we mark it DADFAILED and endup with an address that was originally configured as NODAD with DADFAILED. We can't avoid scheduling dad_work for addresses with NODAD but we can avoid adding TENTATIVE flag to avoid this racy situation. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | qed: Fix uninitialized data in aRFS infrastructureMintz, Yuval2017-05-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current memset is using incorrect type of variable, causing the upper-half of the strucutre to be left uninitialized and causing: ethernet/qlogic/qed/qed_init_fw_funcs.c: In function 'qed_set_rfs_mode_disable': ethernet/qlogic/qed/qed_init_fw_funcs.c:993:3: error: '*((void *)&ramline+4)' is used uninitialized in this function [-Werror=uninitialized] Fixes: d51e4af5c209 ("qed: aRFS infrastructure support") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mdio: mux: fix device_node_continue.cocci warningsJulia Lawall2017-05-151-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Device node iterators put the previous value of the index variable, so an explicit put causes a double put. In particular, of_mdiobus_register can fail before doing anything interesting, so one could view it as a no-op from the reference count point of view. Generated by: scripts/coccinelle/iterators/device_node_continue.cocci CC: Jon Mason <jon.mason@broadcom.com> Signed-off-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/packet: fix missing net_device reference releaseDouglas Caetano dos Santos2017-05-151-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using a TX ring buffer, if an error occurs processing a control message (e.g. invalid message), the net_device reference is not released. Fixes c14ac9451c348 ("sock: enable timestamping using control messages") Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/mlx4_core: Use min3 to select number of MSI-X vectorsyuval.shaia@oracle.com2017-05-151-6/+4
| | | | | | | | | | | | | | | | | | Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | macvlan: Fix performance issues with vlan tagged packetsVlad Yasevich2017-05-151-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Macvlan always turns on offload features that have sofware fallback (NETIF_GSO_SOFTWARE). This allows much higher guest-guest communications over macvtap. However, macvtap does not turn on these features for vlan tagged traffic. As a result, depending on the HW that mactap is configured on, the performance of guest-guest communication over a vlan is very inconsistent. If the HW supports TSO/UFO over vlans, then the performance will be fine. If not, the the performance will suffer greatly since the VM may continue using TSO/UFO, and will force the host segment the traffic and possibly overlow the macvtap queue. This patch adds the always on offloads to vlan_features. This makes sure that any vlan tagged traffic between 2 guest will not be segmented needlessly. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: stmmac: use correct pointer when printing normal descriptor ringNiklas Cassel2017-05-151-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | There are two pointers in sysfs_display_ring, one that increments if using normal dma descriptors, another if using extended dma descriptors. When printing the normal dma descriptors, the wrong pointer is used, thus the printed descriptor addresses are incorrect. Signed-off-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * vmxnet3: ensure that adapter is in proper state during force_closeNeil Horman2017-05-121-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several paths in vmxnet3, where settings changes cause the adapter to be brought down and back up (vmxnet3_set_ringparam among them). Should part of the reset operation fail, these paths call vmxnet3_force_close, which enables all napi instances prior to calling dev_close (with the expectation that vmxnet3_close will then properly disable them again). However, vmxnet3_force_close neglects to clear VMXNET3_STATE_BIT_QUIESCED prior to calling dev_close. As a result vmxnet3_quiesce_dev (called from vmxnet3_close), returns early, and leaves all the napi instances in a enabled state while the device itself is closed. If a device in this state is activated again, napi_enable will be called on already enabled napi_instances, leading to a BUG halt. The fix is to simply enausre that the QUIESCED bit is cleared in vmxnet3_force_close to allow quesence to be completed properly on close. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Shrikrishna Khare <skhare@vmware.com> CC: "VMware, Inc." <pv-drivers@vmware.com> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sfc: revert changes to NIC revision numbersBert Kenward2017-05-121-2/+6
| | | | | | | | | | | | | | | | | | | | The revision enum values (eg EFX_REV_HUNT_A0) form part of our API, and are included in ethtool. If these are inconsistent then ethtool will print garbage for a register dump (ethtool -d). Fixes: 5a6681e22c14 ("sfc: separate out SFC4000 ("Falcon") support into new sfc-falcon driver") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ch9200: add missing USB-descriptor endianness conversionsJohan Hovold2017-05-121-2/+2
| | | | | | | | | | | | | | | | Add the missing endianness conversions to a debug statement printing the USB device-descriptor idVendor and idProduct fields during probe. Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: irda: irda-usb: fix firmware name on big-endian hostsJohan Hovold2017-05-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | Add missing endianness conversion when using the USB device-descriptor bcdDevice field to construct a firmware file name. Fixes: 8ef80aef118e ("[IRDA]: irda-usb.c: STIR421x cleanups") Cc: stable <stable@vger.kernel.org> # 2.6.18 Cc: Nick Fedchik <nfedchik@atlantic-link.com.ua> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: dsa: mv88e6xxx: add default case to switchGustavo A. R. Silva2017-05-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | Add default case to switch in order to avoid any chance of using an uninitialized variable _low_, in case s->type does not match any of the listed case values. Addresses-Coverity-ID: 1398130 Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: fix src address selection if using secondary addresses for ipv6Xin Long2017-05-121-17/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0ca50d12fe46 ("sctp: fix src address selection if using secondary addresses") has fixed a src address selection issue when using secondary addresses for ipv4. Now sctp ipv6 also has the similar issue. When using a secondary address, sctp_v6_get_dst tries to choose the saddr which has the most same bits with the daddr by sctp_v6_addr_match_len. It may make some cases not work as expected. hostA: [1] fd21:356b:459a:cf10::11 (eth1) [2] fd21:356b:459a:cf20::11 (eth2) hostB: [a] fd21:356b:459a:cf30::2 (eth1) [b] fd21:356b:459a:cf40::2 (eth2) route from hostA to hostB: fd21:356b:459a:cf30::/64 dev eth1 metric 1024 mtu 1500 The expected path should be: fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2 But addr[2] matches addr[a] more bits than addr[1] does, according to sctp_v6_addr_match_len. It causes the path to be: fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2 This patch is to fix it with the same way as Marcelo's fix for sctp ipv4. As no ip_dev_find for ipv6, this patch is to use ipv6_chk_addr to check if the saddr is in a dev instead. Note that for backwards compatibility, it will still do the addr_match_len check here when no optimal is found. Reported-by: Patrick Talbert <ptalbert@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: phy: Call bus->reset() after releasing PHYs from resetFlorian Fainelli2017-05-111-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The API convention makes it that a given MDIO bus reset should be able to access PHY devices in its reset() callback and perform additional MDIO accesses in order to bring the bus and PHYs in a working state. Commit 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.") broke that contract by first calling bus->reset() and then release all PHYs from reset using their shared GPIO line, so restore the expected functionality here. Fixes: 69226896ad63 ("mdio_bus: Issue GPIO RESET to PHYs.") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bpf: Handle multiple variable additions into packet pointers in verifier.David S. Miller2017-05-112-1/+38
| | | | | | | | | | | | | | | | | | We must accumulate into reg->aux_off rather than use a plain assignment. Add a test for this situation to test_align. Reported-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: make macro tipc_wait_for_cond() smp safeJon Paul Maloy2017-05-111-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The macro tipc_wait_for_cond() is embedding the macro sk_wait_event() to fulfil its task. The latter, in turn, is evaluating the stated condition outside the socket lock context. This is problematic if the condition is accessing non-trivial data structures which may be altered by incoming interrupts, as is the case with the cong_links() linked list, used by socket to keep track of the current set of congested links. We sometimes see crashes when this list is accessed by a condition function at the same time as a SOCK_WAKEUP interrupt is removing an element from the list. We fix this by expanding selected parts of sk_wait_event() into the outer macro, while ensuring that all evaluations of a given condition are performed under socket lock protection. Fixes: commit 365ad353c256 ("tipc: reduce risk of user starvation during link congestion") Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * samples/bpf: run cleanup routines when receiving SIGTERMAndy Gospodarek2017-05-117-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shahid Habib noticed that when xdp1 was killed from a different console the xdp program was not cleaned-up properly in the kernel and it continued to forward traffic. Most of the applications in samples/bpf cleanup properly, but only when getting SIGINT. Since kill defaults to using SIGTERM, add support to cleanup when the application receives either SIGINT or SIGTERM. Signed-off-by: Andy Gospodarek <andy@greyhouse.net> Reported-by: Shahid Habib <shahid.habib@broadcom.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ethernet: aquantia: remove redundant checks on error statusColin Ian King2017-05-112-23/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The error status err is initialized as zero and then being checked several times to see if it is less than zero even when it has not been updated. It may seem that the err should be assigned to the return code of the call to the various *offload_en_set calls and then we check for failure, however, these functions are void and never actually return any status. Since these error checks are redundant we can remove these as well as err and the error exit label err_exit. Detected by CoverityScan, CID#1398313 and CID#1398306 ("Logically dead code") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Acked-by: Pavel Belous <pavel.belous@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bpf: Remove commented out debugging hack in test_align.David S. Miller2017-05-111-1/+0
| | | | | | | | | | Reported-by: Alexander Alemayhu <alexander@alemayhu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'qlcnic-fixes'David S. Miller2017-05-114-2/+40
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Manish Chopra says: ==================== qlcnic: Bug fix and update version This series has one fix and bumps up driver version. Please consider applying to "net" ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * qlcnic: Update version to 5.3.66Chopra, Manish2017-05-111-2/+2
| | | | | | | | | | | | | | | | | | | | | Bumping up the version as couple of fixes added after 5.3.65 Signed-off-by: Manish Chopra <manish.chopra@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * qlcnic: Fix link configuration with autoneg disabledChopra, Manish2017-05-113-0/+38
| |/ | | | | | | | | | | | | | | | | | | | | | | Currently driver returns error on speed configurations for 83xx adapter's non XGBE ports, due to this link doesn't come up on the ports using 1000Base-T as a connector with autoneg disabled. This patch fixes this with initializing appropriate port type based on queried module/connector types from hardware before any speed/autoneg configuration. Signed-off-by: Manish Chopra <manish.chopra@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>