path: root/net
Commit message (Author, Date; files changed, -deleted/+added)
* netlink: remove mmapped netlink support (Florian Westphal, 2016-02-18; 4 files, -808/+9)
  mmapped netlink has a number of unresolved issues:

  - TX zerocopy support had to be disabled more than a year ago via
    commit 4682a0358639b29cf ("netlink: Always copy on mmap TX.")
    because the content of the mmapped area can change after netlink
    attribute validation but before message processing.

  - RX support was implemented mainly to speed up nfqueue dumping
    packet payload to userspace. However, since commit
    ae08ce0021087a5d812d2 ("netfilter: nfnetlink_queue: zero copy
    support") we avoid one copy with the socket-based interface too
    (via the skb_zerocopy helper).

  The other problem is that skbs attached to mmapped netlink sockets
  behave differently from normal skbs:

  - they don't have a shinfo area, so all functions that use
    skb_shinfo() (e.g. skb_clone) cannot be used.

  - reserving headroom prevents userspace from seeing the content, as
    it expects the message to start at skb->head. See for instance
    commit aa3a022094fa ("netlink: not trim skb for mmaped socket when
    dump").

  - skbs handed e.g. to netlink_ack must have a non-NULL skb->sk, else
    we crash because it needs the sk to check if a tx ring is attached.
    This is also not obvious and leads to non-intuitive bug fixes such
    as 7c7bdf359 ("netfilter: nfnetlink: use original skbuff when
    acking batches").

  mmapped netlink also didn't play nicely with the skb_zerocopy helper
  used by nfqueue and openvswitch. Daniel Borkmann fixed this via
  commit 6bb0fef489f6 ("netlink, mmap: fix edge-case leakages in nf
  queue zero-copy"), but at the cost of also needing to provide the
  remaining length to the allocation function.

  nfqueue also has problems when used with mmapped rx netlink:

  - mmapped netlink doesn't allow use of nfqueue batch verdict
    messages. The problem is that in the mmap case, the allocation time
    also determines the ordering in which the frame will be seen by
    userspace (A allocating before B means that A is located in an
    earlier ring slot, but B might still get a lower sequence number
    than A, since the seqno is decided later). To fix this we would
    need to extend the spinlocked region to also cover the allocation
    and message setup, which isn't desirable.

  - nfqueue can now be configured to queue large (GSO) skbs to
    userspace. Queueing GSO packets is faster than having to force a
    software segmentation in the kernel, so this is a desirable option.
    However, with an mmap-based ring one has to use 64kb per ring slot
    element, else mmap has to fall back to the socket path
    (NL_MMAP_STATUS_COPY) for all large packets.

  To use the mmap interface, userspace not only has to probe for mmap
  netlink support, it also has to implement a recv/socket receive path
  in order to handle messages that exceed the size of an rx ring
  element; a sketch of this dual path follows below.

  Cc: Daniel Borkmann <daniel@iogearbox.net>
  Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
  Cc: Pablo Neira Ayuso <pablo@netfilter.org>
  Cc: Patrick McHardy <kaber@trash.net>
  Cc: Thomas Graf <tgraf@suug.ch>
  Signed-off-by: Florian Westphal <fw@strlen.de>
  Signed-off-by: David S. Miller <davem@davemloft.net>
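  That dual-path requirement is what made the mmap interface awkward to
  consume. A minimal userspace sketch, assuming the pre-removal
  NETLINK_RX_RING setsockopt and struct nl_mmap_req uapi; after this
  commit the setsockopt simply fails and plain recv() is the only path:

      #include <linux/netlink.h>   /* pre-4.6: struct nl_mmap_req */
      #include <stdbool.h>
      #include <sys/socket.h>

      static bool probe_rx_ring(int fd)
      {
              struct nl_mmap_req req = {
                      .nm_block_size = 4096,
                      .nm_block_nr   = 64,
                      .nm_frame_size = 2048,
                      .nm_frame_nr   = 64 * (4096 / 2048),
              };
              /* Fails on kernels with this commit applied; the caller
               * must then use the plain recv() receive path, which it
               * needs anyway for frames larger than nm_frame_size
               * (NL_MMAP_STATUS_COPY). */
              return setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING,
                                &req, sizeof(req)) == 0;
      }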
* net_sched: Improve readability of filter processing (Jamal Hadi Salim, 2016-02-18; 1 file, -1/+1)
  Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
  Acked-by: Daniel Borkmann <daniel@iogearbox.net>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* bridge: switchdev: Offload VLAN flags to hardware bridge (Ido Schimmel, 2016-02-18; 1 file, -0/+11)
  When VLANs are created / destroyed on a VLAN filtering bridge (MASTER
  flag set), the configuration is passed down to the hardware. However,
  when only the flags (e.g. PVID) are toggled, the configuration is
  done in the software bridge alone. While it is possible to pass these
  flags to hardware when invoked with the SELF flag set, this creates
  inconsistency with regards to the way the VLANs are initially
  configured.

  Pass the flags down to the hardware even when the VLAN already exists
  and only the flags are toggled.

  Signed-off-by: Ido Schimmel <idosch@mellanox.com>
  Signed-off-by: Jiri Pirko <jiri@mellanox.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net-sysfs: remove unused fmt_long_hex (Colin Ian King, 2016-02-18; 1 file, -1/+0)
  Ever since commit 04ed3e741d0f133e02bed7fa5c98edba128f90e7 ("net:
  change netdev->features to u32") the format string fmt_long_hex has
  not been used, so we may as well remove it.

  Signed-off-by: Colin Ian King <colin.king@canonical.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* vlan: change return type of vlan_proc_rem_dev (Zhang Shengju, 2016-02-17; 2 files, -4/+3)
  Since the function vlan_proc_rem_dev() can only return 0, it's better
  for it to return void instead of int.

  Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Remove inet_lro library (Ben Hutchings, 2016-02-17; 3 files, -383/+0)
  There are no longer any in-tree drivers that use it.

  Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* af_llc: fix types on llc_ui_wait_for_conn (One Thousand Gnomes, 2016-02-17; 1 file, -2/+2)
  The timeout is a long; we return it truncated if it is huge.
  Basically harmless, as the only caller does a boolean check, but tidy
  it up anyway. (64-bit build tested this time. Thank you, 0day.)

  Signed-off-by: Alan Cox <alan@linux.intel.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: remove the unused sctp_datamsg_free() (Xin Long, 2016-02-17; 1 file, -13/+0)
  Since commit 8b570dc9f7b6 ("sctp: only drop the reference on the
  datamsg after sending a msg") switched sctp_sendmsg to
  sctp_datamsg_put instead of sctp_datamsg_free, this function has had
  no user in sctp, so remove it.

  Signed-off-by: Xin Long <lucien.xin@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: remove rcu_read_lock in sctp_seq_dump_remote_addrs() (Xin Long, 2016-02-17; 1 file, -2/+0)
  sctp_seq_dump_remote_addrs is only called by sctp_assocs_seq_show(),
  which is already protected by the rcu_read_lock taken in
  rhashtable_walk_start(), so remove the redundant one.

  Signed-off-by: Xin Long <lucien.xin@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: move rcu_read_lock from __sctp_lookup_association to sctp_lookup_association (Xin Long, 2016-02-17; 1 file, -2/+2)
  __sctp_lookup_association() is only invoked by sctp_v4_err() and
  sctp_rcv(), both of which run in rx BH context and are already
  protected by rcu_read_lock [see ip_local_deliver_finish() /
  ipv6_rcv()]. So we can drop the locking there and take the
  rcu_read_lock only in sctp_lookup_association().

  Signed-off-by: Xin Long <lucien.xin@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* core: remove unneeded headers for net cgroup controllers (Rosen, Rami, 2016-02-17; 2 files, -2/+0)
  Commit 3ed80a6 ("cgroup: drop module support") made including
  module.h redundant in the net cgroup controllers,
  netclassid_cgroup.c and netprio_cgroup.c. This patch removes those
  includes.

  Signed-off-by: Rami Rosen <rami.rosen@intel.com>
  Acked-by: Tejun Heo <tj@kernel.org>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add tc offload feature flag (John Fastabend, 2016-02-17; 1 file, -0/+1)
  It's useful to be able to turn off the qdisc offload feature at a
  per-device level. This gives us a big hammer to enable/disable
  offloading; more fine-grained control (i.e. per rule) may be
  supported later.

  Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
  Acked-by: Jiri Pirko <jiri@mellanox.com>
  Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: add cls_u32 offload hooks for netdevs (John Fastabend, 2016-02-17; 1 file, -2/+97)
  This patch allows netdev drivers to consume cls_u32 offloads via the
  ndo_setup_tc ndo op. This aligns with how network drivers have been
  doing qdisc offloads for mqprio.

  Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
  Acked-by: Jiri Pirko <jiri@mellanox.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: rework setup_tc ndo op to consume general tc operand (John Fastabend, 2016-02-17; 1 file, -3/+6)
  This patch updates setup_tc so we can pass additional parameters into
  the ndo op in a generic way. To do this we provide a structured union
  and a type flag. This lets each classifier and qdisc provide its own
  set of attributes without having to add new ndo ops or grow the
  signature of the callback; see the sketch after this entry.

  Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
  Acked-by: Jiri Pirko <jiri@mellanox.com>
  Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
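  Taken together with the cls_u32 hooks above and the handle rework in
  the next entry, the ndo now dispatches on a typed operand. A hedged
  sketch of the shape (field and enum names approximate this series,
  and the mydrv_* helpers are hypothetical):

      /* Structured operand passed through ndo_setup_tc(). */
      struct tc_to_netdev {
              unsigned int type;              /* e.g. TC_SETUP_MQPRIO, TC_SETUP_CLSU32 */
              union {
                      u8 tc;                  /* mqprio: number of traffic classes */
                      struct tc_cls_u32_offload *cls_u32;
              };
      };

      /* A driver then switches on type inside its ndo_setup_tc: */
      static int mydrv_setup_tc(struct net_device *dev, u32 handle,
                                __be16 proto, struct tc_to_netdev *ntc)
      {
              switch (ntc->type) {
              case TC_SETUP_MQPRIO:
                      return mydrv_setup_mqprio(dev, ntc->tc);    /* hypothetical */
              case TC_SETUP_CLSU32:
                      return mydrv_setup_u32(dev, ntc->cls_u32);  /* hypothetical */
              default:
                      return -EINVAL;
              }
      }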
* net: rework ndo tc op to consume additional qdisc handle parameter (John Fastabend, 2016-02-17; 1 file, -2/+3)
  The ndo_setup_tc() op was added to support drivers offloading tx
  qdiscs, however only support for mqprio was ever added. So we only
  ever added support for passing the number of traffic classes to the
  driver.

  This patch generalizes the ndo_setup_tc op so that a handle can be
  provided to indicate if the offload is for ingress or egress or
  potentially even child qdiscs.

  CC: Murali Karicheri <m-karicheri2@ti.com>
  CC: Shradha Shah <sshah@solarflare.com>
  CC: Or Gerlitz <ogerlitz@mellanox.com>
  CC: Ariel Elior <ariel.elior@qlogic.com>
  CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  CC: Bruce Allan <bruce.w.allan@intel.com>
  CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
  CC: Don Skidmore <donald.c.skidmore@intel.com>
  Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
  Acked-by: Jiri Pirko <jiri@mellanox.com>
  Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Export ip fragment sysctl to unprivileged users (Nikolay Borisov, 2016-02-16; 1 file, -4/+0)
  Now that all the ip fragmentation related sysctls are namespaceified
  there is no reason to hide them anymore from "root" users inside
  containers.

  Signed-off-by: Nikolay Borisov <kernel@kyup.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: namespacify ip fragment max dist sysctl knob (Nikolay Borisov, 2016-02-16; 1 file, -12/+13)
  Signed-off-by: Nikolay Borisov <kernel@kyup.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: namespacify ip_early_demux sysctl knob (Nikolay Borisov, 2016-02-16; 3 files, -12/+10)
  Signed-off-by: Nikolay Borisov <kernel@kyup.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Namespacify ip_dynaddr sysctl knob (Nikolay Borisov, 2016-02-16; 2 files, -15/+10)
  Signed-off-by: Nikolay Borisov <kernel@kyup.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* igmp: net: Move igmp namespace init to correct file (Nikolay Borisov, 2016-02-16; 2 files, -6/+14)
  When the igmp-related sysctls were namespacified, their
  initialization was erroneously put into the tcp socket namespace
  constructor. This patch moves the relevant code into the igmp
  namespace constructor to keep things consistent. Also sprinkle some
  #ifdefs to silence warnings.

  Signed-off-by: Nikolay Borisov <kernel@kyup.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Namespaceify ip_default_ttl sysctl knob (Nikolay Borisov, 2016-02-16; 6 files, -18/+23)
  Signed-off-by: Nikolay Borisov <kernel@kyup.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add tcpi_min_rtt and tcpi_notsent_bytes to tcp_info (Eric Dumazet, 2016-02-16; 1 file, -0/+6)
  tcpi_min_rtt reports the minimal rtt observed by the TCP stack for
  the flow, in usec units. It might be ~0U if not yet known.

  tcpi_notsent_bytes reports the amount of bytes in the write queue
  that were not yet sent.

  This is done in a single patch to not add a temporary 32-bit padding
  hole in tcp_info.

  Signed-off-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
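  Userspace reads both fields with the usual TCP_INFO getsockopt. A
  minimal sketch, assuming a connected TCP socket fd and a struct
  tcp_info recent enough to carry the new fields (e.g. from
  <linux/tcp.h>):

      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <netinet/in.h>     /* IPPROTO_TCP */
      #include <linux/tcp.h>      /* struct tcp_info, TCP_INFO */

      static void dump_new_tcp_info(int fd)
      {
              struct tcp_info ti;
              socklen_t len = sizeof(ti);

              memset(&ti, 0, sizeof(ti));
              if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len))
                      return;
              /* tcpi_min_rtt is in usec; ~0U means not yet sampled */
              printf("min_rtt=%u us notsent=%u bytes\n",
                     ti.tcpi_min_rtt, ti.tcpi_notsent_bytes);
      }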
* net/ipv4: add dst cache support for gre lwtunnels (Paolo Abeni, 2016-02-16; 1 file, -3/+10)
  In case of UDP traffic with datagram length below the MTU this gives
  about a 4% performance increase.

  Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  Suggested-and-Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add dst_cache to ovs vxlan lwtunnel (Paolo Abeni, 2016-02-16; 3 files, -1/+16)
  In case of UDP traffic with datagram length below the MTU this gives
  about a 2% performance increase when tunneling over ipv4 and about
  60% when tunneling over ipv6.

  Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* ip_tunnel: replace dst_cache with generic implementation (Paolo Abeni, 2016-02-16; 3 files, -73/+23)
  The current ip_tunnel cache implementation is prone to a race that
  will cause the wrong dst to be cached on concurrent dst cache miss
  and ip tunnel update via netlink. Replacing it with the generic
  implementation fixes the issue.

  Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: replace dst_cache ip6_tunnel implementation with the generic one (Paolo Abeni, 2016-02-16; 4 files, -104/+14)
  This also fixes a potential race in the existing tunnel code, which
  could lead to the wrong dst being permanently cached:

    CPU1:                               CPU2:
    <xmit on ip6_tunnel>
    <cache lookup fails>
    dst = ip6_route_output(...)
                                        <tunnel params are changed via nl>
                                        dst_cache_reset() // no effect,
                                                          // the cache is empty
    dst_cache_set() // the wrong dst
                    // is permanently stored
                    // into the cache

  With the new dst_cache implementation the above race is not possible,
  since the first cache lookup after dst_cache_reset will fail due to
  the timestamp check.

  Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add dst_cache support (Paolo Abeni, 2016-02-16; 3 files, -0/+173)
  This patch adds a generic, lockless dst cache implementation. The
  need for a lock is avoided by updating the dst cache fields only in
  per-cpu scope, and by requiring that the cache manipulation functions
  be invoked with the local bh disabled.

  The refresh_ts and reset_ts fields are used to ensure the cache
  consistency in case of concurrent cache updates (dst_cache_set*) and
  reset operations (dst_cache_reset). Consider the following scenario:

    CPU1:                               CPU2:
    <cache lookup with empty cache: it fails>
    <get dst via uncached route lookup>
                                        <related configuration changes>
                                        dst_cache_reset()
    dst_cache_set()

  The dst entry passed to dst_cache_set() should not be used for later
  dst cache lookups, because it was obtained using old configuration
  values. Since the refresh_ts is updated only on dst_cache lookup, the
  cached value in the above scenario will be discarded on the next
  lookup.

  Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
  Signed-off-by: David S. Miller <davem@davemloft.net>
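  The consumer side used by the tunnel patches above is small. A hedged
  sketch of an ipv4 xmit path using the new API ('tun' and the prepared
  flowi4 are illustrative); callers must run with local BHs disabled:

      struct rtable *rt;
      struct flowi4 fl4;
      __be32 saddr;

      /* ... fill fl4 from tunnel parameters ... */
      rt = dst_cache_get_ip4(&tun->dst_cache, &saddr);
      if (!rt) {
              rt = ip_route_output_key(net, &fl4);    /* uncached lookup */
              if (!IS_ERR(rt))
                      dst_cache_set_ip4(&tun->dst_cache, &rt->dst,
                                        fl4.saddr);
      }
      /* on tunnel reconfiguration: dst_cache_reset(&tun->dst_cache); */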
* tipc: refactor node xmit and fix memory leaks (Richard Alpe, 2016-02-16; 2 files, -24/+38)
  Refactor tipc_node_xmit() to fail fast and fail early. Fix several
  potential memory leaks in unexpected error paths.

  Reported-by: Dmitry Vyukov <dvyukov@google.com>
  Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
  Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* ethtool: ensure channel counts are within bounds during SCHANNELS (Keller, Jacob E, 2016-02-16; 1 file, -2/+11)
  Add a sanity check to ensure that all requested channel sizes are
  within bounds, which should reduce errors in driver implementations.

  Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* ethtool: correctly ensure {GS}CHANNELS doesn't conflict with GS{RXFH} (Keller, Jacob E, 2016-02-16; 1 file, -0/+55)
  Ethernet drivers implementing both {GS}RXFH and {GS}CHANNELS ethtool
  ops incorrectly allow SCHANNELS when it would conflict with the
  settings from SRXFH. This occurs because it is not possible for
  drivers to understand whether their Rx flow indirection table has
  been configured or is in the default state.

  In addition, drivers currently behave in various ways when increasing
  the number of Rx channels. Some drivers will always destroy the Rx
  flow indirection table when this occurs, whether it has been set by
  the user or not. Other drivers will attempt to preserve the table
  even if the user has never modified it from the default driver
  settings. Neither of these situations is desirable, because it leads
  to unexpected behavior or loss of user configuration.

  The correct behavior is to simply return -EINVAL when SCHANNELS would
  conflict with the current Rx flow table settings. However, it should
  only do so if the current settings were modified by the user. If we
  required that the new settings never conflict with the current
  (default) Rx flow settings, we would force users to first reduce
  their Rx flow settings and then reduce the number of Rx channels.

  This patch proposes a solution implemented in net/core/ethtool.c
  which ensures that all drivers behave correctly. It checks whether
  the RXFH table has been configured to non-default settings, and
  stores this information in a private netdev flag. When the number of
  channels is requested to change, it first ensures that the current Rx
  flow table is not going to assign flows to now-disabled channels.

  Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
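  The conflict check itself is small once the "user configured" bit
  exists. A hedged sketch of the core idea (the max-ring helper is
  hypothetical, not the exact net/core/ethtool.c code):

      /* Reject a new RX channel count that would leave a user-configured
       * RX flow indirection table pointing at disabled channels. */
      static int check_rxfh_vs_channels(struct net_device *dev, u32 new_rx)
      {
              if (!netif_is_rxfh_configured(dev))
                      return 0;   /* default table: driver may rebuild it */

              if (mydrv_rxfh_max_ring_index(dev) >= new_rx)   /* hypothetical */
                      return -EINVAL;
              return 0;
      }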
* net: ip_tunnel: remove 'csum_help' argument to iptunnel_handle_offloads (Edward Cree, 2016-02-12; 6 files, -23/+14)
  All users now pass false, so we can remove it, and remove the code
  that was conditional upon it.

  Signed-off-by: Edward Cree <ecree@solarflare.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: gre: Implement LCO for GRE over IPv4 (Edward Cree, 2016-02-12; 1 file, -3/+13)
  Signed-off-by: Edward Cree <ecree@solarflare.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* fou: enable LCO in FOU and GUE (Edward Cree, 2016-02-12; 1 file, -8/+6)
  Signed-off-by: Edward Cree <ecree@solarflare.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: udp: always set up for CHECKSUM_PARTIAL offload (Edward Cree, 2016-02-12; 2 files, -25/+2)
  If the dst device doesn't support it, it'll get fixed up later anyway
  by validate_xmit_skb(). Also, this allows us to take advantage of LCO
  to avoid summing the payload multiple times.

  Signed-off-by: Edward Cree <ecree@solarflare.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: local checksum offload for encapsulation (Edward Cree, 2016-02-12; 3 files, -22/+22)
  The arithmetic properties of the ones-complement checksum mean that a
  correctly checksummed inner packet, including its checksum, has a
  ones-complement sum depending only on whatever value was used to
  initialise the checksum field before checksumming (in the case of TCP
  and UDP, this is the ones-complement sum of the pseudo header,
  complemented).

  Consequently, if we are going to offload the inner checksum with
  CHECKSUM_PARTIAL, we can compute the outer checksum based only on the
  packet data not covered by the inner checksum, and the initial value
  of the inner checksum field.

  Signed-off-by: Edward Cree <ecree@solarflare.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
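  In code, that reduces to summing just the outer headers, seeded with
  the complement of the inner checksum field. This is approximately the
  lco_csum() helper the series adds (a sketch reconstructed from the
  description above, not a verbatim copy of the header):

      static inline __wsum lco_csum(struct sk_buff *skb)
      {
              unsigned char *csum_start = skb_checksum_start(skb);
              unsigned char *l4_hdr = skb_transport_header(skb);
              __wsum partial;

              /* Start with the complement of the inner checksum field,
               * which holds the (complemented) pseudo-header sum. */
              partial = ~csum_unfold(*(__force __sum16 *)(csum_start +
                                                          skb->csum_offset));

              /* Sum only the outer headers, i.e. the region between the
               * outer L4 header and the start of the inner checksummed
               * data. */
              return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
      }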
* tcp/dccp: better use of ephemeral ports in bind() (Eric Dumazet, 2016-02-12; 1 file, -126/+114)
  Implement the strategy used in __inet_hash_connect() in the opposite
  way: try to find a candidate using odd ports, then fall back to even
  ports.

  We no longer disable BH for the whole traversal, but one bucket at a
  time. We also use cond_resched() to yield the cpu to other tasks if
  needed.

  I removed one indentation level and tried to mirror the loop we have
  in __inet_hash_connect() and its variable names to ease code
  maintenance.

  Signed-off-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp/dccp: better use of ephemeral ports in connect() (Eric Dumazet, 2016-02-12; 1 file, -85/+85)
  In commit 07f4c90062f8 ("tcp/dccp: try to not exhaust
  ip_local_port_range in connect()"), I added a very simple heuristic,
  so that we got better chances to use even ports, and allow bind()
  users to have more available slots.

  It gave nice results, but with more than 200,000 TCP sessions on a
  typical server, the ~30,000 ephemeral ports are still a rare
  resource. I chose to go a step further, by looking at all even ports,
  and if none was available, falling back to odd ports. The companion
  patch does the same in bind(), but in the opposite way.

  I've seen exec times of up to 30ms on busy servers, so I no longer
  disable BH for the whole traversal, but only for each hash bucket. I
  also call cond_resched() to be gentle to other tasks.

  Signed-off-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
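  The parity trick in both patches is a biased walk over the port
  range. A simplified sketch of the connect()-side idea (even first,
  odd fallback; bind() swaps the order), leaving out the hashing,
  locking and cond_resched() details:

      #include <stdbool.h>

      /* Try every even candidate port first, then every odd one. */
      static int pick_port(int low, int high, bool (*in_use)(int port))
      {
              int port, parity;

              for (parity = 0; parity <= 1; parity++) {   /* 0: even, 1: odd */
                      /* advance to the first port of the wanted parity */
                      for (port = low + ((low ^ parity) & 1);
                           port <= high; port += 2)
                              if (!in_use(port))
                                      return port;
              }
              return -1;      /* range exhausted */
      }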
* net: bulk free SKBs that were delay free'ed due to IRQ context (Jesper Dangaard Brouer, 2016-02-11; 2 files, -3/+13)
  The network stack defers SKB frees in case the free happens in IRQ
  context or when IRQs are disabled. This happens in
  __dev_kfree_skb_irq(), which adds SKBs that were free'ed during IRQ
  to the softirq completion queue (softnet_data.completion_queue).

  These SKBs are naturally delayed, and cleaned up during
  NET_TX_SOFTIRQ in the function net_tx_action(). Take advantage of
  this and use the skb defer and flush API, as we are already in
  softirq context.

  For modern drivers this rarely happens, although most drivers do call
  dev_kfree_skb_any(), which detects the situation and calls
  __dev_kfree_skb_irq() when needed. This is because netpoll can call
  from IRQ context.

  Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
  Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: bulk free infrastructure for NAPI context, use napi_consume_skb (Jesper Dangaard Brouer, 2016-02-11; 2 files, -6/+78)
  Discovered that the network stack was hitting the kmem_cache/SLUB
  slowpath when freeing SKBs. Doing bulk free with kmem_cache_free_bulk
  can speed up this slowpath.

  NAPI context is a bit special; let's take advantage of that for bulk
  free'ing SKBs. In NAPI context we are running in softirq, which gives
  us certain protection. A softirq can run on several CPUs at once, BUT
  the important part is that a softirq will never preempt another
  softirq running on the same CPU. This gives us the opportunity to
  access per-cpu variables in softirq context.

  Extend napi_alloc_cache (which before only contained page_frag_cache)
  to be a struct with a small array-based stack for holding SKBs.
  Introduce a SKB defer and flush API for accessing this.

  Introduce napi_consume_skb() as a replacement for e.g.
  dev_consume_skb_any() when running in NAPI context. A small trick to
  handle/detect if we are called from netpoll is to see if the budget
  is 0. In that case, we need to invoke dev_consume_skb_irq().

  Joint work with Alexander Duyck.

  Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
  Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
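  From a driver's perspective the new call is a drop-in for
  dev_consume_skb_any() in the tx completion path. A hedged sketch (the
  ring structure and descriptor walk are schematic, not from a real
  driver):

      /* Inside a driver's NAPI poll tx-clean loop: pass the poll budget
       * through so napi_consume_skb() can detect the netpoll case
       * (budget == 0) and fall back to dev_consume_skb_irq(). */
      static void mydrv_clean_tx(struct mydrv_ring *ring, int budget)
      {
              struct sk_buff *skb;

              while ((skb = mydrv_next_done_skb(ring)) != NULL)  /* hypothetical */
                      napi_consume_skb(skb, budget);
      }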
* igmp: Namespacify igmp_qrv sysctl knob (Nikolay Borisov, 2016-02-11; 3 files, -22/+28)
  Signed-off-by: Nikolay Borisov <kernel@kyup.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* igmp: Namespaceify igmp_llm_reports sysctl knob (Nikolay Borisov, 2016-02-11; 3 files, -12/+18)
  This was initially introduced in df2cf4a78e488d26 ("IGMP: Inhibit
  reports for local multicast groups") by defining the sysctl in the
  ipv4_net_table array; however, it was never implemented to be
  namespace aware. Fix this by changing the code accordingly.

  Signed-off-by: David S. Miller <davem@davemloft.net>
* igmp: Namespaceify igmp_max_msf sysctl knob (Nikolay Borisov, 2016-02-11; 4 files, -13/+12)
  Signed-off-by: Nikolay Borisov <kernel@kyup.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* igmp: Namespaceify igmp_max_memberships sysctl knob (Nikolay Borisov, 2016-02-11; 3 files, -10/+10)
  Signed-off-by: Nikolay Borisov <kernel@kyup.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* openvswitch: allow management from inside user namespaces (Tycho Andersen, 2016-02-11; 2 files, -10/+14)
  Operations with the GENL_ADMIN_PERM flag fail permission checks
  because this flag means we call netlink_capable, which uses the init
  user ns. Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM,
  for operations which should be allowed inside a user namespace.

  The motivation for this is to be able to run openvswitch in
  unprivileged containers. I've tested this and it seems to work, but I
  really have no idea about the security consequences of this patch, so
  thoughts would be much appreciated.

  v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each
      function
  v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of
      one massive one

  Reported-by: James Page <james.page@canonical.com>
  Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
  CC: Eric Biederman <ebiederm@xmission.com>
  CC: Pravin Shelar <pshelar@ovn.org>
  CC: Justin Pettit <jpettit@nicira.com>
  CC: "David S. Miller" <davem@davemloft.net>
  Acked-by: Pravin B Shelar <pshelar@ovn.org>
  Signed-off-by: David S. Miller <davem@davemloft.net>
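  For a generic netlink family this amounts to one flag per operation.
  A hedged sketch of what an ops entry looks like with the new flag
  (handler and policy names are illustrative placeholders):

      /* An op that is now permitted for a caller holding CAP_NET_ADMIN
       * in the socket's user namespace, not just in init_user_ns. */
      static const struct genl_ops dp_example_ops[] = {
              {
                      .cmd    = OVS_DP_CMD_NEW,
                      .flags  = GENL_UNS_ADMIN_PERM,  /* was GENL_ADMIN_PERM */
                      .policy = datapath_policy,      /* hypothetical */
                      .doit   = my_dp_cmd_new,        /* hypothetical */
              },
      };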
* rds: duplicate include net/tcp.h (stephen hemminger, 2016-02-11; 1 file, -1/+0)
  Duplicate include detected.

  Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Allow tunnels to use inner checksum offloads with outer checksums needed (Alexander Duyck, 2016-02-11; 1 file, -2/+1)
  This patch enables us to use inner checksum offloads if provided by
  hardware, with outer checksums computed by software.

  It basically reduces encap_hdr_csum to an advisory flag for now, but
  based on the fact that SCTP may be getting segmentation support
  before long, I thought we may want to keep it, as it is possible we
  may need to support CRC32c and 1's complement checksum in the same
  packet at some point in the future.

  Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
  Acked-by: Tom Herbert <tom@herbertland.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: Use uh->len instead of skb->len to compute checksum in segmentation (Alexander Duyck, 2016-02-11; 1 file, -15/+13)
  The segmentation code was having to do a bunch of work to pull
  skb->len and strip the udp header offset before the value could be
  used to adjust the checksum. Instead of doing all this work we can
  just use the value that goes into uh->len, since that is the correct
  value with the correct byte order that we need anyway. By using this
  value we can save ourselves a bunch of pain, as there is no need to
  do multiple byte swaps.

  Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
  Acked-by: Tom Herbert <tom@herbertland.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* udp: Clean up the use of flags in UDP segmentation offload (Alexander Duyck, 2016-02-11; 1 file, -19/+18)
  This patch goes through and cleans up the logic related to several of
  the control flags used in UDP segmentation. Specifically, the use of
  dont_encap isn't really needed, as we can just check the skb for
  CHECKSUM_PARTIAL, and if it isn't set then we don't need to update
  the internal headers. As such we can just drop that value.

  Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
  Acked-by: Tom Herbert <tom@herbertland.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* gre: Use inner_proto to obtain inner header protocol (Alexander Duyck, 2016-02-11; 1 file, -4/+2)
  Instead of parsing headers to determine the inner protocol we can
  just pull the value from inner_proto.

  Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
* gre: Use GSO flags to determine csum need instead of GRE flags (Alexander Duyck, 2016-02-11; 1 file, -34/+30)
  This patch updates the gre checksum path to follow something much
  closer to the UDP checksum path. By doing this we can avoid needing
  to do as much header inspection, and can just make use of the fields
  we were already reading in the sk_buff structure.

  Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>