summaryrefslogtreecommitdiffstats
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* mm: memcontrol: lockless page countersJohannes Weiner2014-12-101-43/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memory is internally accounted in bytes, using spinlock-protected 64-bit counters, even though the smallest accounting delta is a page. The counter interface is also convoluted and does too many things. Introduce a new lockless word-sized page counter API, then change all memory accounting over to it. The translation from and to bytes then only happens when interfacing with userspace. The removed locking overhead is noticable when scaling beyond the per-cpu charge caches - on a 4-socket machine with 144-threads, the following test shows the performance differences of 288 memcgs concurrently running a page fault benchmark: vanilla: 18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% ) 1,380,638 context-switches # 0.074 K/sec ( +- 0.75% ) 24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% ) 1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% ) 50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% ) 1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% ) 1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% ) 132.474343877 seconds time elapsed ( +- 0.21% ) lockless: 12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% ) 832,850 context-switches # 0.068 K/sec ( +- 0.54% ) 15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% ) 1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% ) 32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% ) 2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% ) 1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% ) 91.369330729 seconds time elapsed ( +- 0.45% ) On top of improved scalability, this also gets rid of the icky long long types in the very heart of memcg, which is great for 32 bit and also makes the code a lot more readable. Notable differences between the old and new API: - res_counter_charge() and res_counter_charge_nofail() become page_counter_try_charge() and page_counter_charge() resp. to match the more common kernel naming scheme of try_do()/do() - res_counter_uncharge_until() is only ever used to cancel a local counter and never to uncharge bigger segments of a hierarchy, so it's replaced by the simpler page_counter_cancel() - res_counter_set_limit() is replaced by page_counter_limit(), which expects its callers to serialize against themselves - res_counter_memparse_write_strategy() is replaced by page_counter_limit(), which rounds down to the nearest page size - rather than up. This is more reasonable for explicitely requested hard upper limits. - to keep charging light-weight, page_counter_try_charge() charges speculatively, only to roll back if the result exceeds the limit. Because of this, a failing bigger charge can temporarily lock out smaller charges that would otherwise succeed. The error is bounded to the difference between the smallest and the biggest possible charge size, so for memcg, this means that a failing THP charge can send base page charges into reclaim upto 2MB (4MB) before the limit would have been reached. This should be acceptable. [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse] [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2014-12-093-20/+18
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle are: - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y. This instruments might_sleep() checks to catch places that nest blocking primitives - such as mutex usage in a wait loop. Such bugs can result in hard to debug races/hangs. Another category of invalid nesting that this facility will detect is the calling of blocking functions from within schedule() -> sched_submit_work() -> blk_schedule_flush_plug(). There's some potential for false positives (if secondary blocking primitives themselves are not ready yet for this facility), but the kernel will warn once about such bugs per bootup, so the warning isn't much of a nuisance. This feature comes with a number of fixes, for problems uncovered with it, so no messages are expected normally. - Another round of sched/numa optimizations and refinements, for CONFIG_NUMA_BALANCING=y. - Another round of sched/dl fixes and refinements. Plus various smaller fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) sched: Add missing rcu protection to wake_up_all_idle_cpus sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK sched/numa: Init numa balancing fields of init_task sched/deadline: Remove unnecessary definitions in cpudeadline.h sched/cpupri: Remove unnecessary definitions in cpupri.h sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task() sched/fair: Fix stale overloaded status in the busiest group finding logic sched: Move p->nr_cpus_allowed check to select_task_rq() sched/completion: Document when to use wait_for_completion_io_*() sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC sched/fair: Kill task_struct::numa_entry and numa_group::task_list sched: Refactor task_struct to use numa_faults instead of numa_* pointers sched/deadline: Don't check CONFIG_SMP in switched_from_dl() sched/deadline: Reschedule from switched_from_dl() after a successful pull sched/deadline: Push task away if the deadline is equal to curr during wakeup sched/deadline: Add deadline rq status print sched/deadline: Fix artificial overrun introduced by yield_task_dl() sched/rt: Clean up check_preempt_equal_prio() sched/core: Use dl_bw_of() under rcu_read_lock_sched() sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu ...
| * Merge branch 'sched/urgent' into sched/core, to pick up fixes before ↵Ingo Molnar2014-11-1657-324/+870
| |\ | | | | | | | | | | | | | | | applying more changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | netdev, sched/wait: Fix sleeping inside wait eventPeter Zijlstra2014-11-042-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rtnl_lock_unregistering*() take rtnl_lock() -- a mutex -- inside a wait loop. The wait loop relies on current->state to function, but so does mutex_lock(), nesting them makes for the inner to destroy the outer state. Fix this using the new wait_woken() bits. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <cwang@twopensource.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jerry Chu <hkchu@google.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: sfeldma@cumulusnetworks.com <sfeldma@cumulusnetworks.com> Cc: stephen hemminger <stephen@networkplumber.org> Cc: Tom Gundersen <teg@jklm.no> Cc: Tom Herbert <therbert@google.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/20141029173110.GE15602@worktop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | rfcomm, sched/wait: Fix broken wait constructPeter Zijlstra2014-11-041-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rfcomm_run() is a tad broken in that is has a nested wait loop. One cannot rely on p->state for the outer wait because the inner wait will overwrite it. Fix this using the new wait_woken() facility. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Alexander Holler <holler@ahsoftware.de> Cc: David S. Miller <davem@davemloft.net> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: Joe Perches <joe@perches.com> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Seung-Woo Kim <sw0312.kim@samsung.com> Cc: Vignesh Raman <Vignesh_Raman@mentor.com> Cc: linux-bluetooth@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | rtnetlink: release net refcnt on error in do_setlink()Nicolas Dichtel2014-11-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rtnl_link_get_net() holds a reference on the 'struct net', we need to release it in case of error. CC: Eric W. Biederman <ebiederm@xmission.com> Fixes: b51642f6d77b ("net: Enable a userns root rtnl calls that are safe for unprivilged users") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2014-11-2714-33/+67
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "Several small fixes here: 1) Don't crash in tg3 driver when the number of tx queues has been configured to be different from the number of rx queues. From Thadeu Lima de Souza Cascardo. 2) VLAN filter not disabled properly in promisc mode in ixgbe driver, from Vlad Yasevich. 3) Fix OOPS on dellink op in VTI tunnel driver, from Xin Long. 4) IPV6 GRE driver WCCP code checks skb->protocol for ETH_P_IP instead of ETH_P_IPV6, whoops. From Yuri Chislov. 5) Socket matching in ping driver is buggy when packet AF does not match socket's AF. Fix from Jane Zhou. 6) Fix checksum calculation errors in VXLAN due to where the udp_tunnel6_xmit_skb() helper gets it's saddr/daddr from. From Alexander Duyck. 7) Fix 5G detection problem in rtlwifi driver, from Larry Finger. 8) Fix NULL deref in tcp_v{4,6}_send_reset, from Eric Dumazet. 9) Various missing netlink attribute verifications in bridging code, from Thomas Graf. 10) tcp_recvmsg() unconditionally calls ipv4 ip_recv_error even for ipv6 sockets, whoops. Fix from Willem de Bruijn" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits) net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socks bridge: Sanitize IFLA_EXT_MASK for AF_BRIDGE:RTM_GETLINK bridge: Add missing policy entry for IFLA_BRPORT_FAST_LEAVE net: Check for presence of IFLA_AF_SPEC net: Validate IFLA_BRIDGE_MODE attribute length bridge: Validate IFLA_BRIDGE_FLAGS attribute length stmmac: platform: fix default values of the filter bins setting net/mlx4_core: Limit count field to 24 bits in qp_alloc_res net: dsa: bcm_sf2: reset switch prior to initialization net: dsa: bcm_sf2: fix unmapping registers in case of errors tg3: fix ring init when there are more TX than RX channels tcp: fix possible NULL dereference in tcp_vX_send_reset() rtlwifi: Change order in device startup rtlwifi: rtl8821ae: Fix 5G detection problem Revert "netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpse" vxlan: Fix boolean flip in VXLAN_F_UDP_ZERO_CSUM6_[TX|RX] ip6_udp_tunnel: Fix checksum calculation net-timestamp: Fix a documentation typo net/ping: handle protocol mismatching scenario af_packet: fix sparse warning ...
| * | | net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socksWillem de Bruijn2014-11-263-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TCP timestamping introduced MSG_ERRQUEUE handling for TCP sockets. If the socket is of family AF_INET6, call ipv6_recv_error instead of ip_recv_error. This change is more complex than a single branch due to the loadable ipv6 module. It reuses a pre-existing indirect function call from ping. The ping code is safe to call, because it is part of the core ipv6 module and always present when AF_INET6 sockets are active. Fixes: 4ed2d765 (net-timestamp: TCP timestamping) Signed-off-by: Willem de Bruijn <willemb@google.com> ---- It may also be worthwhile to add WARN_ON_ONCE(sk->family == AF_INET6) to ip_recv_error. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | bridge: Sanitize IFLA_EXT_MASK for AF_BRIDGE:RTM_GETLINKThomas Graf2014-11-261-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only search for IFLA_EXT_MASK if the message actually carries a ifinfomsg header and validate minimal length requirements for IFLA_EXT_MASK. Fixes: 6cbdceeb ("bridge: Dump vlan information from a bridge port") Cc: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | bridge: Add missing policy entry for IFLA_BRPORT_FAST_LEAVEThomas Graf2014-11-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes: c2d3babf ("bridge: implement multicast fast leave") Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | bridge: Validate IFLA_BRIDGE_FLAGS attribute lengthThomas Graf2014-11-261-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Payload is currently accessed blindly and may exceed valid message boundaries. Fixes: 407af3299 ("bridge: Add netlink interface to configure vlans on bridge ports") Cc: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: fix possible NULL dereference in tcp_vX_send_reset()Eric Dumazet2014-11-252-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit ca777eff51f7 ("tcp: remove dst refcount false sharing for prequeue mode") we have to relax check against skb dst in tcp_v[46]_send_reset() if prequeue dropped the dst. If a socket is provided, a full lookup was done to find this socket, so the dst test can be skipped. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=88191 Reported-by: Jaša Bartelj <jasa.bartelj@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Daniel Borkmann <dborkman@redhat.com> Fixes: ca777eff51f7 ("tcp: remove dst refcount false sharing for prequeue mode") Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Revert "netfilter: conntrack: fix race in __nf_conntrack_confirm against ↵Pablo Neira2014-11-251-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | get_next_corpse" This reverts commit 5195c14c8b27cc0b18220ddbf0e5ad3328a04187. If the conntrack clashes with an existing one, it is left out of the unconfirmed list, thus, crashing when dropping the packet and releasing the conntrack since golden rule is that conntracks are always placed in any of the existing lists for traceability reasons. Reported-by: Daniel Borkmann <dborkman@redhat.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=88841 Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ip6_udp_tunnel: Fix checksum calculationAlexander Duyck2014-11-251-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The UDP checksum calculation for VXLAN tunnels is currently using the socket addresses instead of the actual packet source and destination addresses. As a result the checksum calculated is incorrect in some cases. Also uh->check was being set twice, first it was set to 0, and then it is set again in udp6_set_csum. This change removes the redundant assignment to 0. Fixes: acbf74a7 ("vxlan: Refactor vxlan driver to make use of the common UDP tunnel functions.") Cc: Andy Zhou <azhou@nicira.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net/ping: handle protocol mismatching scenarioJane Zhou2014-11-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ping_lookup() may return a wrong sock if sk_buff's and sock's protocols dont' match. For example, sk_buff's protocol is ETH_P_IPV6, but sock's sk_family is AF_INET, in that case, if sk->sk_bound_dev_if is zero, a wrong sock will be returned. the fix is to "continue" the searching, if no matching, return NULL. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Jane Zhou <a17711@motorola.com> Signed-off-by: Yiwei Zhao <gbjc64@motorola.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | af_packet: fix sparse warningMichael S. Tsirkin2014-11-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | af_packet produces lots of these: net/packet/af_packet.c:384:39: warning: incorrect type in return expression (different modifiers) net/packet/af_packet.c:384:39: expected struct page [pure] * net/packet/af_packet.c:384:39: got struct page * this seems to be because sparse does not realize that _pure refers to function, not the returned pointer. Tweak code slightly to avoid the warning. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv6: gre: fix wrong skb->protocol in WCCPYuri Chislov2014-11-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using GRE redirection in WCCP, it sets the wrong skb->protocol, that is, ETH_P_IP instead of ETH_P_IPV6 for the encapuslated traffic. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Cc: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: Yuri Chislov <yuri.chislov@gmail.com> Tested-by: Yuri Chislov <yuri.chislov@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ip_tunnel: the lack of vti_link_ops' dellink() cause kernel paniclucien2014-11-232-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now the vti_link_ops do not point the .dellink, for fb tunnel device (ip_vti0), the net_device will be removed as the default .dellink is unregister_netdevice_queue,but the tunnel still in the tunnel list, then if we add a new vti tunnel, in ip_tunnel_find(): hlist_for_each_entry_rcu(t, head, hash_node) { if (local == t->parms.iph.saddr && remote == t->parms.iph.daddr && link == t->parms.link && ==> type == t->dev->type && ip_tunnel_key_match(&t->parms, flags, key)) break; } the panic will happen, cause dev of ip_tunnel *t is null: [ 3835.072977] IP: [<ffffffffa04103fd>] ip_tunnel_find+0x9d/0xc0 [ip_tunnel] [ 3835.073008] PGD b2c21067 PUD b7277067 PMD 0 [ 3835.073008] Oops: 0000 [#1] SMP ..... [ 3835.073008] Stack: [ 3835.073008] ffff8800b72d77f0 ffffffffa0411924 ffff8800bb956000 ffff8800b72d78e0 [ 3835.073008] ffff8800b72d78a0 0000000000000000 ffffffffa040d100 ffff8800b72d7858 [ 3835.073008] ffffffffa040b2e3 0000000000000000 0000000000000000 0000000000000000 [ 3835.073008] Call Trace: [ 3835.073008] [<ffffffffa0411924>] ip_tunnel_newlink+0x64/0x160 [ip_tunnel] [ 3835.073008] [<ffffffffa040b2e3>] vti_newlink+0x43/0x70 [ip_vti] [ 3835.073008] [<ffffffff8150d4da>] rtnl_newlink+0x4fa/0x5f0 [ 3835.073008] [<ffffffff812f68bb>] ? nla_strlcpy+0x5b/0x70 [ 3835.073008] [<ffffffff81508fb0>] ? rtnl_link_ops_get+0x40/0x60 [ 3835.073008] [<ffffffff8150d11f>] ? rtnl_newlink+0x13f/0x5f0 [ 3835.073008] [<ffffffff81509cf4>] rtnetlink_rcv_msg+0xa4/0x270 [ 3835.073008] [<ffffffff8126adf5>] ? sock_has_perm+0x75/0x90 [ 3835.073008] [<ffffffff81509c50>] ? rtnetlink_rcv+0x30/0x30 [ 3835.073008] [<ffffffff81529e39>] netlink_rcv_skb+0xa9/0xc0 [ 3835.073008] [<ffffffff81509c48>] rtnetlink_rcv+0x28/0x30 .... modprobe ip_vti ip link del ip_vti0 type vti ip link add ip_vti0 type vti rmmod ip_vti do that one or more times, kernel will panic. fix it by assigning ip_tunnel_dellink to vti_link_ops' dellink, in which we skip the unregister of fb tunnel device. do the same on ip6_vti. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv6: Do not treat a GSO_TCPV4 request from UDP tunnel over IPv6 as invalidAlexander Duyck2014-11-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds SKB_GSO_TCPV4 to the list of supported GSO types handled by the IPv6 GSO offloads. Without this change VXLAN tunnels running over IPv6 do not currently handle IPv4 TCP TSO requests correctly and end up handing the non-segmented frame off to the device. Below is the before and after for a simple netperf TCP_STREAM test between two endpoints tunneling IPv4 over a VXLAN tunnel running on IPv6 on top of a 1Gb/s network adapter. Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.29 0.88 Before 87380 16384 16384 10.03 895.69 After Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2014-11-251-11/+16
|\ \ \ \ | |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd bugfixes from Bruce Fields: "These fix one mishandling of the case when security labels are configured out, and two races in the 4.1 backchannel code" * 'for-3.18' of git://linux-nfs.org/~bfields/linux: nfsd: Fix slot wake up race in the nfsv4.1 callback code SUNRPC: Fix locking around callback channel reply receive nfsd: correctly define v4.2 support attributes
| * | | SUNRPC: Fix locking around callback channel reply receiveTrond Myklebust2014-11-191-11/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both xprt_lookup_rqst() and xprt_complete_rqst() require that you take the transport lock in order to avoid races with xprt_transmit(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2014-11-2122-128/+123
|\ \ \ \ | |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix BUG when decrypting empty packets in mac80211, from Ronald Wahl. 2) nf_nat_range is not fully initialized and this is copied back to userspace, from Daniel Borkmann. 3) Fix read past end of b uffer in netfilter ipset, also from Dan Carpenter. 4) Signed integer overflow in ipv4 address mask creation helper inet_make_mask(), from Vincent BENAYOUN. 5) VXLAN, be2net, mlx4_en, and qlcnic need ->ndo_gso_check() methods to properly describe the device's capabilities, from Joe Stringer. 6) Fix memory leaks and checksum miscalculations in openvswitch, from Pravin B SHelar and Jesse Gross. 7) FIB rules passes back ambiguous error code for unreachable routes, making behavior confusing for userspace. Fix from Panu Matilainen. 8) ieee802154fake_probe() doesn't release resources properly on error, from Alexey Khoroshilov. 9) Fix skb_over_panic in add_grhead(), from Daniel Borkmann. 10) Fix access of stale slave pointers in bonding code, from Nikolay Aleksandrov. 11) Fix stack info leak in PPP pptp code, from Mathias Krause. 12) Cure locking bug in IPX stack, from Jiri Bohac. 13) Revert SKB fclone memory freeing optimization that is racey and can allow accesses to freed up memory, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (71 commits) tcp: Restore RFC5961-compliant behavior for SYN packets net: Revert "net: avoid one atomic operation in skb_clone()" virtio-net: validate features during probe cxgb4 : Fix DCB priority groups being returned in wrong order ipx: fix locking regression in ipx_sendmsg and ipx_recvmsg openvswitch: Don't validate IPv6 label masks. pptp: fix stack info leak in pptp_getname() brcmfmac: don't include linux/unaligned/access_ok.h cxgb4i : Don't block unload/cxgb4 unload when remote closes TCP connection ipv6: delete protocol and unregister rtnetlink when cleanup net/mlx4_en: Add VXLAN ndo calls to the PF net device ops too bonding: fix curr_active_slave/carrier with loadbalance arp monitoring mac80211: minstrel_ht: fix a crash in rate sorting vxlan: Inline vxlan_gso_check(). can: m_can: update to support CAN FD features can: m_can: fix incorrect error messages can: m_can: add missing delay after setting CCCR_INIT bit can: m_can: fix not set can_dlc for remote frame can: m_can: fix possible sleep in napi poll can: m_can: add missing message RAM initialization ...
| * | | tcp: Restore RFC5961-compliant behavior for SYN packetsCalvin Owens2014-11-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c3ae62af8e755 ("tcp: should drop incoming frames without ACK flag set") was created to mitigate a security vulnerability in which a local attacker is able to inject data into locally-opened sockets by using TCP protocol statistics in procfs to quickly find the correct sequence number. This broke the RFC5961 requirement to send a challenge ACK in response to spurious RST packets, which was subsequently fixed by commit 7b514a886ba50 ("tcp: accept RST without ACK flag"). Unfortunately, the RFC5961 requirement that spurious SYN packets be handled in a similar manner remains broken. RFC5961 section 4 states that: ... the handling of the SYN in the synchronized state SHOULD be performed as follows: 1) If the SYN bit is set, irrespective of the sequence number, TCP MUST send an ACK (also referred to as challenge ACK) to the remote peer: <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> After sending the acknowledgment, TCP MUST drop the unacceptable segment and stop processing further. By sending an ACK, the remote peer is challenged to confirm the loss of the previous connection and the request to start a new connection. A legitimate peer, after restart, would not have a TCB in the synchronized state. Thus, when the ACK arrives, the peer should send a RST segment back with the sequence number derived from the ACK field that caused the RST. This RST will confirm that the remote peer has indeed closed the previous connection. Upon receipt of a valid RST, the local TCP endpoint MUST terminate its connection. The local TCP endpoint should then rely on SYN retransmission from the remote end to re-establish the connection. This patch lets SYN packets through the discard added in c3ae62af8e755, so that spurious SYN packets are properly dealt with as per the RFC. The challenge ACK is sent unconditionally and is rate-limited, so the original vulnerability is not reintroduced by this patch. Signed-off-by: Calvin Owens <calvinowens@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: Revert "net: avoid one atomic operation in skb_clone()"Eric Dumazet2014-11-211-17/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Not sure what I was thinking, but doing anything after releasing a refcount is suicidal or/and embarrassing. By the time we set skb->fclone to SKB_FCLONE_FREE, another cpu could have released last reference and freed whole skb. We potentially corrupt memory or trap if CONFIG_DEBUG_PAGEALLOC is set. Reported-by: Chris Mason <clm@fb.com> Fixes: ce1a4ea3f1258 ("net: avoid one atomic operation in skb_clone()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2014-11-212-3/+12
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains two bugfixes for your net tree, they are: 1) Validate netlink group from nfnetlink to avoid an out of bound array access. This should only happen with superuser priviledges though. Discovered by Andrey Ryabinin using trinity. 2) Don't push ethernet header before calling the netfilter output hook for multicast traffic, this breaks ebtables since it expects to see skb->data pointing to the network header, patch from Linus Luessing. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | bridge: fix netfilter/NF_BR_LOCAL_OUT for own, locally generated queriesLinus Lüssing2014-11-171-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ebtables on the OUTPUT chain (NF_BR_LOCAL_OUT) would not work as expected for both locally generated IGMP and MLD queries. The IP header specific filter options are off by 14 Bytes for netfilter (actual output on interfaces is fine). NF_HOOK() expects the skb->data to point to the IP header, not the ethernet one (while dev_queue_xmit() does not). Luckily there is an br_dev_queue_push_xmit() helper function already - let's just use that. Introduced by eb1d16414339a6e113d89e2cca2556005d7ce919 ("bridge: Add core IGMP snooping support") Ebtables example: $ ebtables -I OUTPUT -p IPv6 -o eth1 --logical-out br0 \ --log --log-level 6 --log-ip6 --log-prefix="~EBT: " -j DROP before (broken): ~EBT: IN= OUT=eth1 MAC source = 02:04:64:a4:39:c2 \ MAC dest = 33:33:00:00:00:01 proto = 0x86dd IPv6 \ SRC=64a4:39c2:86dd:6000:0000:0020:0001:fe80 IPv6 \ DST=0000:0000:0000:0004:64ff:fea4:39c2:ff02, \ IPv6 priority=0x3, Next Header=2 after (working): ~EBT: IN= OUT=eth1 MAC source = 02:04:64:a4:39:c2 \ MAC dest = 33:33:00:00:00:01 proto = 0x86dd IPv6 \ SRC=fe80:0000:0000:0000:0004:64ff:fea4:39c2 IPv6 \ DST=ff02:0000:0000:0000:0000:0000:0000:0001, \ IPv6 priority=0x0, Next Header=0 Signed-off-by: Linus Lüssing <linus.luessing@web.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | netfilter: nfnetlink: fix insufficient validation in nfnetlink_bindPablo Neira Ayuso2014-11-171-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure the netlink group exists, otherwise you can trigger an out of bound array memory access from the netlink_bind() path. This splat can only be triggered only by superuser. [ 180.203600] UBSan: Undefined behaviour in ../net/netfilter/nfnetlink.c:467:28 [ 180.204249] index 9 is out of range for type 'int [9]' [ 180.204697] CPU: 0 PID: 1771 Comm: trinity-main Not tainted 3.18.0-rc4-mm1+ #122 [ 180.205365] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org +04/01/2014 [ 180.206498] 0000000000000018 0000000000000000 0000000000000009 ffff88007bdf7da8 [ 180.207220] ffffffff82b0ef5f 0000000000000092 ffffffff845ae2e0 ffff88007bdf7db8 [ 180.207887] ffffffff8199e489 ffff88007bdf7e18 ffffffff8199ea22 0000003900000000 [ 180.208639] Call Trace: [ 180.208857] dump_stack (lib/dump_stack.c:52) [ 180.209370] ubsan_epilogue (lib/ubsan.c:174) [ 180.209849] __ubsan_handle_out_of_bounds (lib/ubsan.c:400) [ 180.210512] nfnetlink_bind (net/netfilter/nfnetlink.c:467) [ 180.210986] netlink_bind (net/netlink/af_netlink.c:1483) [ 180.211495] SYSC_bind (net/socket.c:1541) Moreover, define the missing nf_tables and nf_acct multicast groups too. Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | Merge tag 'master-2014-11-20' of ↵David S. Miller2014-11-211-9/+6
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless John W. Linville says: ==================== pull request: wireless 2014-11-20 Please full this little batch of fixes intended for the 3.18 stream! For the mac80211 patch, Johannes says: "Here's another last minute fix, for minstrel HT crashing depending on the value of some uninitialised stack." On top of that... Ben Greear fixes an ath9k regression in which a BSSID mask is miscalculated. Dmitry Torokhov corrects an error handling routing in brcmfmac which was checking an unsigned variable for a negative value. Johannes Berg avoids a build problem in brcmfmac for arches where linux/unaligned/access_ok.h and asm/unaligned.h conflict. Mathy Vanhoef addresses another brcmfmac issue so as to eliminate a use-after-free of the URB transfer buffer if a timeout occurs. Please let me know if there are problems! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * \ \ \ Merge tag 'mac80211-for-john-2014-11-18' of ↵John W. Linville2014-11-191-9/+6
| | |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg <johannes@sipsolutions.net> says: "Here's another last minute fix, for minstrel HT crashing depending on the value of some uninitialised stack." Signed-off-by: John W. Linville <linville@tuxdriver.com>
| | | * | | | mac80211: minstrel_ht: fix a crash in rate sortingFelix Fietkau2014-11-181-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 5935839ad73583781b8bbe8d91412f6826e218a4 "mac80211: improve minstrel_ht rate sorting by throughput & probability" introduced a crash on rate sorting that occurs when the rate added to the sorting array is faster than all the previous rates. Due to an off-by-one error, it reads the rate index from tp_list[-1], which contains uninitialized stack garbage, and then uses the resulting index for accessing the group rate stats, leading to a crash if the garbage value is big enough. Cc: Thomas Huehn <thomas@net.t-labs.tu-berlin.de> Reported-by: Jouni Malinen <j@w1.fi> Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * | | | | | ipx: fix locking regression in ipx_sendmsg and ipx_recvmsgJiri Bohac2014-11-201-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes an old regression introduced by commit b0d0d915 (ipx: remove the BKL). When a recvmsg syscall blocks waiting for new data, no data can be sent on the same socket with sendmsg because ipx_recvmsg() sleeps with the socket locked. This breaks mars-nwe (NetWare emulator): - the ncpserv process reads the request using recvmsg - ncpserv forks and spawns nwconn - ncpserv calls a (blocking) recvmsg and waits for new requests - nwconn deadlocks in sendmsg on the same socket Commit b0d0d915 has simply replaced BKL locking with lock_sock/release_sock. Unlike now, BKL got unlocked while sleeping, so a blocking recvmsg did not block a concurrent sendmsg. Only keep the socket locked while actually working with the socket data and release it prior to calling skb_recv_datagram(). Signed-off-by: Jiri Bohac <jbohac@suse.cz> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | openvswitch: Don't validate IPv6 label masks.Joe Stringer2014-11-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When userspace doesn't provide a mask, OVS datapath generates a fully unwildcarded mask for the flow by copying the flow and setting all bits in all fields. For IPv6 label, this creates a mask that matches on the upper 12 bits, causing the following error: openvswitch: netlink: Invalid IPv6 flow label value (value=ffffffff, max=fffff) This patch ignores the label validation check for masks, avoiding this error. Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | ipv6: delete protocol and unregister rtnetlink when cleanupDuan Jiong2014-11-191-0/+4
| | |_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pim6_protocol was added when initiation, but it not deleted. Similarly, unregister RTNL_FAMILY_IP6MR rtnetlink. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUsDaniel Borkmann2014-11-162-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has been reported that generating an MLD listener report on devices with large MTUs (e.g. 9000) and a high number of IPv6 addresses can trigger a skb_over_panic(): skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20 head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0 dev:port1 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:100! invalid opcode: 0000 [#1] SMP Modules linked in: ixgbe(O) CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4 [...] Call Trace: <IRQ> [<ffffffff80578226>] ? skb_put+0x3a/0x3b [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4 [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45 [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68 [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182 [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3 [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46 [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70 mld_newpack() skb allocations are usually requested with dev->mtu in size, since commit 72e09ad107e7 ("ipv6: avoid high order allocations") we have changed the limit in order to be less likely to fail. However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb) macros, which determine if we may end up doing an skb_put() for adding another record. To avoid possible fragmentation, we check the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong assumption as the actual max allocation size can be much smaller. The IGMP case doesn't have this issue as commit 57e1ab6eaddc ("igmp: refine skb allocations") stores the allocation size in the cb[]. Set a reserved_tailroom to make it fit into the MTU and use skb_availroom() helper instead. This also allows to get rid of igmp_skb_size(). Reported-by: Wei Liu <lw1a2.jing@gmail.com> Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: David L Stevens <david.stevens@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | Merge branch 'net_ovs' of ↵David S. Miller2014-11-163-12/+21
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pshelar/openvswitch Pravin B Shelar says: ==================== Open vSwitch Following fixes are accumulated in ovs-repo. Three of them are related to protocol processing, one is related to memory leak in case of error and one is to fix race. Patch "Validate IPv6 flow key and mask values" has conflicts with net-next, Let me know if you want me to send the patch for net-next. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | openvswitch: Validate IPv6 flow key and mask values.Jarno Rajahalme2014-11-141-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reject flow label key and mask values with invalid bits set. Introduced by commit 3fdbd1ce11e5 ("openvswitch: add ipv6 'set' action"). Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
| | * | | | | openvswitch: Convert dp rcu read operation to locked operationsPravin B Shelar2014-11-141-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dp read operations depends on ovs_dp_cmd_fill_info(). This API needs to looup vport to find dp name, but vport lookup can fail. Therefore to keep vport reference alive we need to take ovs lock. Introduced by commit 6093ae9abac1 ("openvswitch: Minimize dp and vport critical sections"). Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
| | * | | | | openvswitch: Fix NDP flow mask validationDaniele Di Proietto2014-11-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | match_validate() enforce that a mask matching on NDP attributes has also an exact match on ICMPv6 type. The ICMPv6 type, which is 8-bit wide, is stored in the 'tp.src' field of 'struct sw_flow_key', which is 16-bit wide. Therefore, an exact match on ICMPv6 type should only check the first 8 bits. This commit fixes a bug that prevented flows with an exact match on NDP field from being installed Introduced by commit 03f0d916aa03 ("openvswitch: Mega flow implementation"). Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
| | * | | | | openvswitch: Fix checksum calculation when modifying ICMPv6 packets.Jesse Gross2014-11-141-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The checksum of ICMPv6 packets uses the IP pseudoheader as part of the calculation, unlike ICMP in IPv4. This was not implemented, which means that modifying the IP addresses of an ICMPv6 packet would cause the checksum to no longer be correct as the psuedoheader did not match. Introduced by commit 3fdbd1ce11e5 ("openvswitch: add ipv6 'set' action"). Reported-by: Neal Shrader <icosahedral@gmail.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
| | * | | | | openvswitch: Fix memory leak.Pravin B Shelar2014-11-141-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Need to free memory in case of sample action error. Introduced by commit 651887b0c22cffcfce7eb9c ("openvswitch: Sample action without side effects"). Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
| * | | | | | dcbnl : Disable software interrupts before taking dcb_lockAnish Bhatt2014-11-161-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Solves possible lockup issues that can be seen from firmware DCB agents calling into the DCB app api. DCB firmware event queues can be tied in with NAPI so that dcb events are generated in softIRQ context. This can results in calls to dcb_*app() functions which try to take the dcb_lock. If the the event triggers while we also have the dcb_lock because lldpad or some other agent happened to be issuing a get/set command we could see a cpu lockup. This code was not originally written with firmware agents in mind, hence grabbing dcb_lock from softIRQ context was not considered. Signed-off-by: Anish Bhatt <anish@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2014-11-167-56/+32
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter updates for your net tree, they are: 1) Fix missing initialization of the range structure (allocated in the stack) in nft_masq_{ipv4, ipv6}_eval, from Daniel Borkmann. 2) Make sure the data we receive from userspace contains the req_version structure, otherwise return an error incomplete on truncated input. From Dan Carpenter. 3) Fix handling og skb->sk which may cause incorrect handling of connections from a local process. Via Simon Horman, patch from Calvin Owens. 4) Fix wrong netns in nft_compat when setting target and match params structure. 5) Relax chain type validation in nft_compat that was recently included, this broke the matches that need to be run from the route chain type. Now iptables-test.py automated regression tests report success again and we avoid the only possible problematic case, which is the use of nat targets out of nat chain type. 6) Use match->table to validate the tablename, instead of the match->name. Again patch for nft_compat. 7) Restore the synchronous release of objects from the commit and abort path in nf_tables. This is causing two major problems: splats when using nft_compat, given that matches and targets may sleep and call_rcu is invoked from softirq context. Moreover Patrick reported possible event notification reordering when rules refer to anonymous sets. 8) Fix race condition in between packets that are being confirmed by conntrack and the ctnetlink flush operation. This happens since the removal of the central spinlock. Thanks to Jesper D. Brouer to looking into this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | | netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpsebill bonaparte2014-11-141-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After removal of the central spinlock nf_conntrack_lock, in commit 93bb0ceb75be2 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock"), it is possible to race against get_next_corpse(). The race is against the get_next_corpse() cleanup on the "unconfirmed" list (a per-cpu list with seperate locking), which set the DYING bit. Fix this race, in __nf_conntrack_confirm(), by removing the CT from unconfirmed list before checking the DYING bit. In case race occured, re-add the CT to the dying list. While at this, fix coding style of the comment that has been updated. Fixes: 93bb0ceb75be2 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock") Reported-by: bill bonaparte <programme110@gmail.com> Signed-off-by: bill bonaparte <programme110@gmail.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | | netfilter: nf_tables: restore synchronous object release from commit/abortPablo Neira Ayuso2014-11-121-16/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing xtables matches and targets, when used from nft_compat, may sleep from the destroy path, ie. when removing rules. Since the objects are released via call_rcu from softirq context, this results in lockdep splats and possible lockups that may be hard to reproduce. Patrick also indicated that delayed object release via call_rcu can cause us problems in the ordering of event notifications when anonymous sets are in place. So, this patch restores the synchronous object release from the commit and abort paths. This includes a call to synchronize_rcu() to make sure that no packets are walking on the objects that are going to be released. This is slowier though, but it's simple and it resolves the aforementioned problems. This is a partial revert of c7c32e7 ("netfilter: nf_tables: defer all object release via rcu") that was introduced in 3.16 to speed up interaction with userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | | netfilter: nft_compat: use the match->table to validate dependenciesPablo Neira Ayuso2014-11-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of the match->name, which is of course not relevant. Fixes: f3f5dde ("netfilter: nft_compat: validate chain type in match/target") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | | netfilter: nft_compat: relax chain type validationPablo Neira Ayuso2014-11-121-30/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check for nat chain dependency only, which is the one that can actually crash the kernel. Don't care if mangle, filter and security specific match and targets are used out of their scope, they are harmless. This restores iptables-compat with mangle specific match/target when used out of the OUTPUT chain, that are actually emulated through filter chains, which broke when performing strict validation. Fixes: f3f5dde ("netfilter: nft_compat: validate chain type in match/target") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | | netfilter: nft_compat: use current net namespacePablo Neira Ayuso2014-11-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of init_net when using xtables over nftables compat. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | | ipvs: Keep skb->sk when allocating headroom on tunnel xmitCalvin Owens2014-11-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ip_vs_prepare_tunneled_skb() ignores ->sk when allocating a new skb, either unconditionally setting ->sk to NULL or allowing the uninitialized ->sk from a newly allocated skb to leak through to the caller. This patch properly copies ->sk and increments its reference count. Signed-off-by: Calvin Owens <calvinowens@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
| | * | | | | | netfilter: ipset: small potential read beyond the end of bufferDan Carpenter2014-11-111-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We could be reading 8 bytes into a 4 byte buffer here. It seems harmless but adding a check is the right thing to do and it silences a static checker warning. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | | | | | netfilter: nft_masq: fix uninitialized range in nft_masq_{ipv4, ipv6}_evalDaniel Borkmann2014-11-102-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When transferring from the original range in nf_nat_masquerade_{ipv4,ipv6}() we copy over values from stack in from min_proto/max_proto due to uninitialized range variable in both, nft_masq_{ipv4,ipv6}_eval. As we only initialize flags at this time from nft_masq struct, just zero out the rest. Fixes: 9ba1f726bec09 ("netfilter: nf_tables: add new nft_masq expression") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>