path: root/net
Commit message  (Author, Age, Files, Lines changed)
* net: Add skb_free_frag to replace use of put_page in freeing skb->head  (Alexander Duyck, 2015-05-12, 1 file, -4/+6)

    This change adds a function called skb_free_frag which is meant to
    complement the function netdev_alloc_frag. The general idea is to enable a
    more lightweight version of page freeing, since we don't actually need all
    the overhead of a put_page, and we don't quite fit the model of
    __free_pages.

    Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
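    A minimal sketch of the helper this describes, assuming it simply hands the
    buffer back to the page-fragment free path described in the next entry (the
    exact in-tree definition may differ):

        /* free an skb->head that was allocated as a page fragment */
        static inline void skb_free_frag(void *addr)
        {
                __free_page_frag(addr);
        }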
* mm/net: Rename and move page fragment handling from net/ to mm/  (Alexander Duyck, 2015-05-12, 1 file, -94/+6)

    This change moves the __alloc_page_frag functionality out of the networking
    stack and into the page allocation portion of mm. The idea is to help make
    this maintainable by placing it with other page allocation functions.

    Since we are moving it from skbuff.c to page_alloc.c I have also renamed
    the basic defines and structure from netdev_alloc_cache to page_frag_cache
    to reflect that this is now part of a different kernel subsystem.

    I have also added a simple __free_page_frag function which can handle
    freeing the frags based on the skb->head pointer. The model for this is
    based off of __free_pages since we don't actually need to deal with all of
    the cases that put_page handles. I incorporated the virt_to_head_page call
    and compound_order into the function as it actually allows for a
    significant size reduction by reducing code duplication.

    Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
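    Following the description above, the free routine could look roughly like
    the sketch below; the use of __free_pages_ok as the underlying free call is
    an assumption, the rest follows the names given in the message:

        void __free_page_frag(void *addr)
        {
                struct page *page = virt_to_head_page(addr);

                if (unlikely(put_page_testzero(page)))
                        __free_pages_ok(page, compound_order(page));
        }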
* net: Store virtual address instead of page in netdev_alloc_cache  (Alexander Duyck, 2015-05-12, 1 file, -23/+32)

    This change makes it so that we store the virtual address of the page in
    the netdev_alloc_cache instead of the page pointer. The idea behind this is
    to avoid multiple calls to page_address since the virtual address is
    required for every access, but the page pointer is only needed at
    allocation or reset of the page.

    While I was at it I also reordered the netdev_alloc_cache structure a bit
    so that the size is always 16 bytes by dropping size in the case where
    PAGE_SIZE is greater than or equal to 32KB.

    Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
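    Reordered this way, the cache structure might look roughly like the sketch
    below (the field and macro names here are assumptions drawn from the
    description, not the exact upstream layout):

        struct netdev_alloc_cache {
                void *va;                       /* virtual address of the page */
        #if (PAGE_SIZE < NETDEV_FRAG_PAGE_MAX_SIZE)
                __u16 offset;
                __u16 size;
        #else
                __u32 offset;                   /* size dropped for >= 32KB pages */
        #endif
                unsigned int pagecnt_bias;      /* avoids dirtying page->_count */
        };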
* net: Use cached copy of pfmemalloc to avoid accessing page  (Alexander Duyck, 2015-05-12, 1 file, -62/+79)

    While testing I found that the testing for pfmemalloc in build_skb was
    rather expensive. I found the issue to be two-fold. First we have to get
    from the virtual address to the head page and that comes at the cost of
    something like 11 cycles. Then there is the cost for reading pfmemalloc out
    of the head page which can be cache cold due to the fact that
    put_page_testzero is likely invalidating the cache-line on one or more CPUs
    as the fragments can be shared.

    To avoid this extra expense I have added a pfmemalloc member to the
    netdev_alloc_cache. I then pushed pieces of __alloc_rx_skb into
    __napi_alloc_skb and __netdev_alloc_skb so that I could rewrite them to
    make use of the cached pfmemalloc value. The result is that my perf traces
    show a reduction from 9.28% overhead to 3.7% for the code covered by
    build_skb, __alloc_rx_skb, and __napi_alloc_skb when performing a test with
    the packet being dropped instead of being handed to napi_gro_receive.

    Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: deprecate enqueue_root()  (Eric Dumazet, 2015-05-11, 1 file, -2/+2)

    The only remaining enqueue_root() user is netem, and it does not look
    necessary: qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: further simplify handle_ing  (Daniel Borkmann, 2015-05-11, 2 files, -61/+27)

    Ingress qdisc has no other purpose than calling into tc_classify() that
    executes attached classifier(s) and action(s). It has a 1:1 relationship to
    dev->ingress_queue. After having commit 087c1a601ad7 ("net: sched: run
    ingress qdisc without locks") removed the central ingress lock, one major
    contention point is gone.

    The extra indirection layers, however, are not necessary for calling into
    ingress qdisc. pktgen calling locally into netif_receive_skb() with a dummy
    u32, single CPU result on a Supermicro X10SLM-F, Xeon E3-1240: before
    ~21.1 Mpps, after patch ~22.9 Mpps.

    We can redirect the private classifier list to the netdev directly, without
    changing any classifier API bits (!) and execute on that from handle_ing()
    side. The __QDISC_STATE_DEACTIVATE test can be removed, ingress qdisc
    doesn't have a queue and thus dev_deactivate_queue() is also not
    applicable, ingress_cl_list provides similar behaviour. In other words,
    ingress qdisc acts like a TCQ_F_BUILTIN qdisc.

    One next possible step is the removal of the dev's ingress (dummy)
    netdev_queue, and to only have the list member in the netdevice itself.

    Note, the filter chain is RCU protected and individual filter elements are
    being kfree'd by the sched subsystem after the RCU grace period. The RCU
    read lock is being held by __netif_receive_skb_core().

    Joint work with Alexei Starovoitov.

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: consolidate handle_ing and ing_filter  (Daniel Borkmann, 2015-05-11, 1 file, -30/+16)

    Given quite some code has been removed from ing_filter(), we can just
    consolidate that function into handle_ing() and get rid of a few
    instructions at the same time.

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Alexei Starovoitov <ast@plumgrid.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* net: kill sk_change_net and sk_release_kernel  (Eric W. Biederman, 2015-05-11, 1 file, -19/+0)

    These functions are no longer needed and no longer used; kill them.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* netlink: Create kernel netlink sockets in the proper network namespace  (Eric W. Biederman, 2015-05-11, 1 file, -8/+6)

    Utilize the new functionality of sk_alloc so that nothing needs to be done
    to suppress the reference counting on kernel sockets.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Modify sk_alloc to not reference count the netns of kernel sockets.  (Eric W. Biederman, 2015-05-11, 6 files, -44/+27)

    Now that sk_alloc knows when a kernel socket is being allocated, modify it
    to not reference count the network namespace of kernel sockets.

    Keep track of whether a socket needs reference counting by adding a flag to
    struct sock called sk_net_refcnt.

    Update all of the callers of sock_create_kern to stop using sk_change_net
    and sk_release_kernel, as those hacks to avoid reference counting a kernel
    socket are no longer needed.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
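    A hedged sketch of how the allocation path could use the new flag; the
    helper name sk_set_net below is purely illustrative, not an upstream
    function:

        /* record the netns on the socket, but only pin it for user sockets */
        static void sk_set_net(struct sock *sk, struct net *net, int kern)
        {
                sk->sk_net_refcnt = kern ? 0 : 1;
                if (likely(sk->sk_net_refcnt))
                        get_net(net);           /* user sockets hold a reference */
                sock_net_set(sk, net);          /* kernel sockets just remember it */
        }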
* net: Pass kern from net_proto_family.create to sk_alloc  (Eric W. Biederman, 2015-05-11, 48 files, -89/+90)

    In preparation for changing how struct net is refcounted on kernel sockets,
    pass the knowledge that we are creating a kernel socket from
    sock_create_kern through to sk_alloc.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add a struct net parameter to sock_create_kern  (Eric W. Biederman, 2015-05-11, 9 files, -14/+14)

    This is long overdue, and is part of cleaning up how we allocate kernel
    sockets that don't reference count struct net.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
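    After this change the prototype looks roughly like this (a sketch based on
    the description; the caller now names the namespace explicitly):

        int sock_create_kern(struct net *net, int family, int type, int protocol,
                             struct socket **res);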
* tun: Utilize the normal socket network namespace refcounting.  (Eric W. Biederman, 2015-05-11, 1 file, -3/+0)

    There is no need for tun to do the weird network namespace refcounting. The
    existing network namespace refcounting in tfile has almost exactly the same
    lifetime. So rewrite the code to use the struct sock network namespace
    refcounting and remove the unnecessary hand rolled network namespace
    refcounting and the unnecessary tfile->net.

    This change allows the tun code to directly call sock_put, bypassing
    sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.

    Remove the now unnecessary tun_release so that if anything tries to use the
    sock_release code path the kernel will oops, and let us know about the bug.

    The macvtap code already uses its internal socket this way.

    Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* codel: add ce_threshold attribute  (Eric Dumazet, 2015-05-10, 2 files, -3/+27)

    For DCTCP or similar ECN based deployments on fabrics with shallow buffers,
    hosts are responsible for a good part of the buffering.

    This patch adds an optional ce_threshold to codel & fq_codel qdiscs, so
    that DCTCP can have feedback from queuing in the host. A DCTCP-enabled
    egress port simply has a queue occupancy threshold above which ECT packets
    get a CE mark.

    In codel language this translates to a sojourn time, so that one doesn't
    have to worry about bytes or bandwidth but delays.

    This makes the host an active participant in the health of the whole
    network. This also helps with experimenting with DCTCP in a setup without a
    DCTCP-compliant fabric.

    In the following example, ce_threshold is set to 1ms, and we can see from
    'ldelay xxx us' that TCP is not trying to go around the 5ms codel target.
    The queue has more capacity to absorb inelastic bursts (say from UDP
    traffic), as queues are maintained at an optimal level.

    lpaa23:~# ./tc -s -d qd sh dev eth1
    qdisc mq 1: dev eth1 root
     Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
     backlog 3108242b 364p requeues 42961
    qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
     Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
     rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
      count 0 lastcount 0 ldelay 1.0ms drop_next 0us
      maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
    qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
     Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
     rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
      count 0 lastcount 0 ldelay 694us drop_next 0us
      maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
    qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
     Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
     rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
      count 0 lastcount 0 ldelay 889us drop_next 0us
      maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
    ...

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Florian Westphal <fw@strlen.de>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Glenn Judd <glenn.judd@morganstanley.com>
    Cc: Nandita Dukkipati <nanditad@google.com>
    Cc: Neal Cardwell <ncardwell@google.com>
    Cc: Yuchung Cheng <ycheng@google.com>
    Acked-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
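    A hedged sketch of the marking logic this adds on the codel dequeue path;
    the codel_* helpers are the usual ones, but the exact field names and
    placement are assumptions:

        /* if the packet sat in the queue longer than ce_threshold, CE-mark it */
        if (skb && codel_time_after(now - codel_get_enqueue_time(skb),
                                    params->ce_threshold) &&
            INET_ECN_set_ce(skb))
                stats->ce_mark++;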
* pktgen: introduce xmit_mode '<start_xmit|netif_receive>'  (Alexei Starovoitov, 2015-05-09, 1 file, -5/+77)

    Introduce xmit_mode 'netif_receive' for pktgen which generates the packets
    using familiar pktgen commands, but feeds them into netif_receive_skb()
    instead of ndo_start_xmit(). The default mode is called 'start_xmit'.

    It is designed to test netif_receive_skb and ingress qdisc performance
    only. Make sure to understand how it works before using it for other rx
    benchmarking.

    Sample script 'pktgen.sh':

    #!/bin/bash
    function pgset() {
      local result
      echo $1 > $PGDEV
      result=`cat $PGDEV | fgrep "Result: OK:"`
      if [ "$result" = "" ]; then
        cat $PGDEV | fgrep Result:
      fi
    }
    [ -z "$1" ] && echo "Usage: $0 DEV" && exit 1
    ETH=$1

    PGDEV=/proc/net/pktgen/kpktgend_0
    pgset "rem_device_all"
    pgset "add_device $ETH"

    PGDEV=/proc/net/pktgen/$ETH
    pgset "xmit_mode netif_receive"
    pgset "pkt_size 60"
    pgset "dst 198.18.0.1"
    pgset "dst_mac 90:e2:ba:ff:ff:ff"
    pgset "count 10000000"
    pgset "burst 32"

    PGDEV=/proc/net/pktgen/pgctrl
    echo "Running... ctrl^C to stop"
    pgset "start"
    echo "Done"

    cat /proc/net/pktgen/$ETH

    Usage:
    $ sudo ./pktgen.sh eth2
    ...
    Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
      43033682pps 20656Mb/sec (20656167360bps) errors: 10000000

    Raw netif_receive_skb speed should be ~43 million packets per second on
    3.7Ghz x86 and 'perf report' should look like:

      37.69%  kpktgend_0  [kernel.vmlinux]  [k] __netif_receive_skb_core
      25.81%  kpktgend_0  [kernel.vmlinux]  [k] kfree_skb
       7.22%  kpktgend_0  [kernel.vmlinux]  [k] ip_rcv
       5.68%  kpktgend_0  [pktgen]          [k] pktgen_thread_worker

    If fib_table_lookup is seen on top, it means the skb was processed by the
    stack. To benchmark netif_receive_skb only, make sure that the 'dst_mac' of
    your pktgen script is different from the receiving device's mac and it will
    be dropped by ip_rcv.

    Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* pktgen: adjust flag NO_TIMESTAMP to be more pktgen compliant  (Jesper Dangaard Brouer, 2015-05-09, 1 file, -0/+3)

    Allow the flag NO_TIMESTAMP to turn timestamping on again, like other
    flags, with a negation of the flag, i.e. !NO_TIMESTAMP.

    Also document the option flag NO_TIMESTAMP.

    Fixes: afb84b626184 ("pktgen: add flag NO_TIMESTAMP to disable timestamping")
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* netlink: allow to listen "all" netns  (Nicolas Dichtel, 2015-05-09, 2 files, -6/+56)

    More accurately, listen to all netns that have a nsid assigned into the
    netns where the netlink socket is opened.

    For this purpose, a netlink socket option is added:
    NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
    socket will receive netlink notifications from all netns that have a nsid
    assigned into the netns where the socket has been opened. The nsid is sent
    to userland via ancillary data.

    With this patch, a daemon needs only one socket to listen to many netns.
    This is useful when the number of netns is high.

    Because 0 is a valid value for a nsid, the field nsid_is_set indicates
    whether the field nsid is valid or not. skb->cb is initialized to 0 on skb
    allocation, thus we are sure that we will never send a nsid 0 by error to
    the userland.

    Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Acked-by: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: David S. Miller <davem@davemloft.net>
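    A hedged user-space sketch of how a daemon could opt in to this; the option
    name follows the description above, the SOL_NETLINK fallback value is an
    assumption, and error handling is trimmed:

        #include <unistd.h>
        #include <sys/socket.h>
        #include <linux/netlink.h>

        #ifndef SOL_NETLINK
        #define SOL_NETLINK 270         /* assumed value; normally from headers */
        #endif

        int open_listen_all_nsid(void)
        {
                int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
                int on = 1;

                if (fd < 0)
                        return -1;
                /* receive notifications from every netns that has a nsid in the
                 * current netns; the originating nsid arrives as ancillary data
                 */
                if (setsockopt(fd, SOL_NETLINK, NETLINK_LISTEN_ALL_NSID,
                               &on, sizeof(on)) < 0) {
                        close(fd);
                        return -1;
                }
                return fd;
        }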
* netlink: rename private flags and states  (Nicolas Dichtel, 2015-05-09, 1 file, -29/+30)

    These flags and states have the same prefix (NETLINK_) as netlink socket
    options. To avoid confusion and to be able to name a flag like a socket
    option, let's use another prefix: NETLINK_[S|F]_.

    Note: a comment has been fixed; it was talking about the
    NETLINK_RECV_NO_ENOBUFS socket option instead of NETLINK_NO_ENOBUFS.

    Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Acked-by: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* netns: use a spin_lock to protect nsid management  (Nicolas Dichtel, 2015-05-09, 1 file, -13/+44)

    Before this patch, nsids were protected by the rtnl lock. The goal of this
    patch is to be able to find a nsid without needing to hold the rtnl lock.

    The next patch will introduce a netlink socket option to listen to all
    netns that have a nsid assigned into the netns where the socket is opened.
    Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
    avoid a recursive lock (nsids are notified via rtnl). This was the main
    reason for the previous patch.

    Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* netns: notify new nsid outside __peernet2id()  (Nicolas Dichtel, 2015-05-09, 1 file, -14/+27)

    There is no functional change with this patch. It will ease the refactoring
    of the locking system that protects nsids and the support of the netlink
    socket option NETLINK_LISTEN_ALL_NSID.

    Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Acked-by: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* netns: rename peernet2id() to peernet2id_alloc()  (Nicolas Dichtel, 2015-05-09, 2 files, -3/+3)

    In a following commit, a new function will be introduced to only look up a
    nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
    existing function is renamed.

    Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Acked-by: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* netns: always provide the id to rtnl_net_fill()  (Nicolas Dichtel, 2015-05-09, 1 file, -20/+11)

    The goal of this commit is to prepare the rework of the locking of nsid
    protection. After this patch, rtnl_net_notifyid() will no longer call
    __peernet2id(), i.e. no idr_* operation inside this function.

    Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Acked-by: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* netns: returns always an id in __peernet2id()  (Nicolas Dichtel, 2015-05-09, 1 file, -11/+8)

    All callers of this function expect a nsid, not an error. Thus, return
    NETNSA_NSID_NOT_ASSIGNED in case of error so that callers don't have to
    convert the error to NETNSA_NSID_NOT_ASSIGNED.

    Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
    Acked-by: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: set SOCK_NOSPACE under memory pressure  (Jason Baron, 2015-05-09, 1 file, -1/+5)

    Under tcp memory pressure, calling epoll_wait() in edge triggered mode
    after -EAGAIN can result in an indefinite hang in epoll_wait(), even when
    there is sufficient memory available to continue making progress. The
    problem is that when __sk_mem_schedule() returns 0 under memory pressure,
    we do not set the SOCK_NOSPACE flag in the tcp write paths (tcp_sendmsg()
    or do_tcp_sendpages()). Then, since SOCK_NOSPACE is used to trigger wakeups
    when incoming acks create sufficient new space in the write queue, all
    outstanding packets are acked, but we never wake up with the EPOLLOUT that
    we are expecting from epoll_wait().

    This issue is currently limited to epoll() when used in edge trigger mode,
    since 'tcp_poll()' does in fact currently set SOCK_NOSPACE. This is
    sufficient for poll()/select() and epoll() in level trigger mode. However,
    in edge trigger mode, epoll() is relying on the write path to set
    SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we can only call
    epoll_wait() after read/write return -EAGAIN. Thus, in the case of the
    socket write, we are relying on the fact that tcp_sendmsg()/network write
    paths are going to issue a wakeup for us at some point in the future when
    we get -EAGAIN.

    Normally, epoll() edge trigger works fine when we've exceeded the
    sk->sndbuf, because in that case we do set SOCK_NOSPACE. However, when we
    return -EAGAIN from the write path because we are over the tcp memory
    limits and not because we are over the sndbuf, we are never going to get
    another wakeup.

    I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule() will
    return 0, or failure more readily with SO_SNDBUF:

    1) create socket and set SO_SNDBUF to N
    2) add socket as edge trigger
    3) write to socket and block in epoll on -EAGAIN
    4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem

    The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory() when
    the socket is non-blocking. Note that SOCK_NOSPACE, in addition to waking
    up outstanding waiters, is also used to expand the size of the sk->sndbuf.
    However, we will not expand it by setting it in this case because
    tcp_should_expand_sndbuf() ensures that no expansion occurs when we are
    under tcp memory pressure.

    Note that we could still hang if sk->sk_wmem_queue is 0 when we get the
    -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we are
    waiting for an event that will never happen. I believe that this case is
    harder to hit (and I did not hit it in my testing), in that over the tcp
    'soft' memory limits we continue to guarantee a minimum write buffer size.
    Perhaps we could return -ENOSPC in this case, or maybe we could simply
    issue a wakeup in this case, such that we keep retrying the write. Note
    that this case is not specific to epoll() ET, but rather would affect
    blocking sockets as well. So I view this patch as bringing epoll()
    edge-trigger into sync with the current poll()/select()/epoll() level
    trigger and blocking sockets behavior.

    Signed-off-by: Jason Baron <jbaron@akamai.com>
    Acked-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
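    A hedged sketch of the idea described above, as it might look inside
    sk_stream_wait_memory(); the surrounding code and the noblock/timeo_p
    variable names are assumptions, not the exact upstream diff:

        /* about to bail out of a non-blocking write with -EAGAIN: make sure
         * later ACK-driven space wakeups (EPOLLOUT) will reach the caller
         */
        if (!*timeo_p) {
                if (noblock)
                        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
                goto do_nonblock;
        }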
* seccomp, filter: add and use bpf_prog_create_from_user from seccomp  (Daniel Borkmann, 2015-05-09, 1 file, -2/+49)

    Seccomp has always been a special candidate when it comes to preparation of
    its filters in seccomp_prepare_filter(). Due to the extra checks and filter
    rewrite it partially duplicates code and has BPF internals exposed.

    This patch adds a generic API inside the BPF code that seccomp can use and
    thus keep its filter preparation code minimal and more maintainable. The
    other side-effect is that now classic JITs can add seccomp support as well
    by only providing a BPF_LDX | BPF_W | BPF_ABS translation.

    Tested with seccomp and BPF test suites.

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Nicolas Schichan <nschichan@freebox.fr>
    Cc: Alexei Starovoitov <ast@plumgrid.com>
    Cc: Kees Cook <keescook@chromium.org>
    Acked-by: Alexei Starovoitov <ast@plumgrid.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
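    The new entry point could look roughly like this (a sketch; the exact
    parameter list and the bpf_aux_classic_check_t type are assumptions based
    on this and the related entries below):

        int bpf_prog_create_from_user(struct bpf_prog **pfp,
                                      struct sock_fprog *fprog,
                                      bpf_aux_classic_check_t trans);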
* net: filter: add __GFP_NOWARN flag for larger kmem allocs  (Daniel Borkmann, 2015-05-09, 1 file, -3/+6)

    When seccomp BPF was added, it was discussed to add the __GFP_NOWARN flag
    for their configuration path as, for example, up-to-32K allocations are
    more prone to fail under stress. As we're going to reuse the BPF API, add
    __GFP_NOWARN flags where larger kmalloc() and friends allocations could
    fail.

    It doesn't make much sense to pass around __GFP_NOWARN everywhere as an
    extra argument only for seccomp while we just as well could run into
    similar issues for socket filters, where it's not desired to have a user
    application throw a WARN() due to allocation failure.

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Nicolas Schichan <nschichan@freebox.fr>
    Cc: Alexei Starovoitov <ast@plumgrid.com>
    Cc: Kees Cook <keescook@chromium.org>
    Acked-by: Alexei Starovoitov <ast@plumgrid.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* seccomp: simplify seccomp_prepare_filter and reuse bpf_prepare_filter  (Nicolas Schichan, 2015-05-09, 1 file, -4/+4)

    Remove the calls to bpf_check_classic(), bpf_convert_filter() and
    bpf_migrate_runtime() and let bpf_prepare_filter() take care of that
    instead.

    seccomp_check_filter() is passed to bpf_prepare_filter() so that it gets
    called from there, after bpf_check_classic().

    We can now remove the exposure of two internal classic BPF functions
    previously used by seccomp. The export of the bpf_check_classic() symbol,
    previously known as sk_chk_filter(), has been there since pre-git times,
    and no in-tree module was using it; therefore remove it.

    Joint work with Daniel Borkmann.

    Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Alexei Starovoitov <ast@plumgrid.com>
    Cc: Kees Cook <keescook@chromium.org>
    Acked-by: Alexei Starovoitov <ast@plumgrid.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* net: filter: add a callback to allow classic post-verifier transformations  (Nicolas Schichan, 2015-05-09, 1 file, -3/+15)

    This is in preparation for use by the seccomp code; the rationale is not to
    duplicate additional code within the seccomp layer, but instead to have it
    abstracted and hidden within the classic BPF API.

    As an interim step, this now also makes bpf_prepare_filter() visible (not
    as an exported symbol though), so that seccomp can reuse that code path
    instead of reimplementing it.

    Joint work with Daniel Borkmann.

    Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Alexei Starovoitov <ast@plumgrid.com>
    Cc: Kees Cook <keescook@chromium.org>
    Acked-by: Alexei Starovoitov <ast@plumgrid.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
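    A hedged sketch of what such a post-verifier callback hook could look like
    (the typedef name and parameters are assumptions):

        /* called by bpf_prepare_filter() after bpf_check_classic() succeeds,
         * letting callers such as seccomp rewrite the classic program
         */
        typedef int (*bpf_aux_classic_check_t)(struct sock_filter *filter,
                                               unsigned int flen);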
* Merge tag 'mac80211-next-for-davem-2015-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next  (David S. Miller, 2015-05-09, 27 files, -568/+1215)

    Johannes Berg says:

    ====================
    Lots of updates for net-next for this cycle. As usual, we have a lot of
    small fixes and cleanups, the bigger items are:
     * proper mac80211 rate control locking, to fix some random crashes (this
       required changing other locking as well)
     * mac80211 "fast-xmit", a mechanism to reduce, in most cases, the amount
       of code we execute while going from ndo_start_xmit() to the driver
     * this also clears the way for properly supporting S/G and checksum and
       segmentation offloads
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>
| * mac80211: add missing documentation for rate_ctrl_lock  (Johannes Berg, 2015-05-06, 1 file, -0/+2)

    This was missed in the previous patch; add some documentation for
    rate_ctrl_lock to avoid docbook warnings.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * cfg80211: change GO_CONCURRENT to IR_CONCURRENT for STA  (Arik Nemtsov, 2015-05-06, 3 files, -20/+26)

    The GO_CONCURRENT regulatory definition can be extended to station
    interfaces requesting to IR as part of TDLS off-channel operations. Rename
    the GO_CONCURRENT flag to IR_CONCURRENT and allow the added use-case.

    Change internal users of GO_CONCURRENT to use the new definition.

    Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
    Reviewed-by: Johannes Berg <johannes.berg@intel.com>
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * cfg80211: Allow GO concurrent relaxation after BSS disconnection  (Avraham Stern, 2015-05-06, 1 file, -10/+19)

    If a P2P GO was allowed on a channel because of the GO concurrent
    relaxation, i.e., another station interface was associated to an AP on the
    same channel or the same UNII band, and the station interface disconnected
    from the AP, allow the following use cases unless the channel is marked as
    indoor only and the device is not operating in an indoor environment:

    1. Allow the P2P GO to stay on its current channel. The rationale behind
       this is that if the channel or UNII band were allowed by the AP they
       could still be used to continue the P2P GO operation, and avoid
       connection breakage.

    2. Allow another P2P GO to start on the same channel or another channel
       that is in the same UNII band as the previously instantiated P2P GO.

    Signed-off-by: Avraham Stern <avraham.stern@intel.com>
    Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
    Signed-off-by: Ilan Peer <ilan.peer@intel.com>
    Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: validate cipher scheme PN length better  (Johannes Berg, 2015-05-06, 2 files, -5/+10)

    Currently, a cipher scheme can advertise an arbitrarily long sequence
    counter, but mac80211 only supports up to 16 bytes and the initial value
    from userspace will be truncated.

    Fix two things:
     * don't allow the driver to register anything longer than the 16 bytes
       that mac80211 reserves space for
     * require userspace to specify a starting value with the correct length
       (or none at all)

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: extend get_key() to return PN for all ciphers  (Johannes Berg, 2015-05-06, 3 files, -4/+12)

    For ciphers not supported by mac80211, the function currently doesn't
    return any PN data. Fix this by extending the driver's get_key_seq() a
    little more to allow moving arbitrary PN data.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: extend get_tkip_seq to all keys  (Johannes Berg, 2015-05-06, 3 files, -57/+87)

    Extend the function to read the TKIP IV32/IV16 to read the IV/PN for all
    ciphers in order to allow drivers with full hardware crypto to properly
    support this.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: remove useless skb->encapsulation check  (Johannes Berg, 2015-05-05, 1 file, -6/+2)

    No current (and, as far as I know, no planned) wifi devices support
    encapsulation checksum offload, so remove the useless test here.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: make LED triggering depend on activation  (Johannes Berg, 2015-05-05, 4 files, -100/+194)

    When LED triggers are compiled in, but not used, mac80211 will still call
    them to update the status. This isn't really a problem for the assoc and
    radio ones, but the TX/RX (and to a certain extent TPT) ones can be called
    very frequently (for every packet.)

    In order to avoid that when they're not used, track their activation and
    call the corresponding trigger (and in the TPT case, account for
    throughput) only when the trigger is actually used by an LED.

    Additionally, make those trigger functions inlines since they're only used
    once in the remaining code.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: make LED trigger names const  (Johannes Berg, 2015-05-05, 1 file, -8/+9)

    This is just a code cleanup; make the LED trigger names const as they're
    not expected to be modified by drivers.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: clean up station debugfs  (Johannes Berg, 2015-05-05, 1 file, -84/+0)

    Remove items that can be retrieved through nl80211. This also removes two
    items (tx_packets and tx_bytes) where only the VO counter was exposed since
    they are split up per AC but in the debugfs file only the first AC was
    shown.

    Also remove the useless "dev" file - the stations have long been in a
    sub-directory of the netdev so there's no need for that any more.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: remove sta->tx_fragments counter  (Johannes Berg, 2015-05-05, 4 files, -7/+1)

    This counter is unsafe with concurrent TX and is only exposed through
    debugfs and ethtool. Instead of trying to fix it, just remove it for now;
    if it's really needed then it should be exposed through nl80211 and in a
    way that drivers that do the fragmentation in the device could support it
    as well.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: move dot11 counters under MAC80211_DEBUG_COUNTERS  (Johannes Berg, 2015-05-05, 5 files, -27/+31)

    Since these counters can only be read through debugfs, there's very little
    point in maintaining them all the time. However, even just making them
    depend on debugfs is pointless - they're not normally used. Additionally, a
    number of them aren't even concurrency safe.

    Move them under MAC80211_DEBUG_COUNTERS so they're normally not even
    compiled in.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: clean up global debugfs statistics  (Johannes Berg, 2015-05-05, 3 files, -52/+28)

    The debugfs statistics macros are pointlessly verbose, so change that macro
    to just have a single argument. While at it, remove the unused counters and
    rename rx_expand_skb_head2 to the better rx_expand_skb_head_defrag.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: force off channel transmission for public action frames  (Matti Gottlieb, 2015-04-24, 1 file, -1/+7)

    Currently while associated to an AP and sending a (public) action frame to
    a different AP on the same channel, the action frame will be sent like a
    regular tx frame without going off channel. When power save is enabled this
    can cause problems, since the device can go into power save and miss the
    response to the action frame that is sent by the other AP.

    Force off-channel transmission to avoid this issue in case
     - HW offchannel is used,
     - the user didn't forbid transmitting frames off channel
     - the frame is not sent to the AP that we are associated with (if it is we
       assume the response would be bufferable)

    Signed-off-by: Matti Gottlieb <matti.gottlieb@intel.com>
    Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
    [reword commit message a bit]
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: notify the driver on reordering buffer timeout  (Emmanuel Grumbach, 2015-04-24, 1 file, -0/+9)

    When frames time out in the reordering buffer, it is a good indication that
    something went wrong and the driver may want to know about that to take
    action or trigger debug flows.

    It is pointless to notify the driver about each frame that is released.
    Notify each time the timer fires.

    Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: notify the driver upon BAR Rx  (Emmanuel Grumbach, 2015-04-24, 1 file, -0/+8)

    When we receive a BAR, this typically means that our peer doesn't hear our
    Block-Acks or that we can't hear its frames. Either way, it is a good
    indication that the link is in a bad condition. This is why it can serve as
    a probe to the driver. Use the event_callback callback for this.

    Since more events with the same data will be added in the future, the
    structure that describes the data attached to the event is given a generic
    name: ieee80211_ba_event. This also means that from now on, the
    event_callback can't sleep.

    Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: fix ignored HT/VHT override configs  (Chaya Rachel Ivgi, 2015-04-24, 1 file, -21/+36)

    HT and VHT override configurations were ignored during association and
    applied only when the first beacon was received, or not applied at all. Fix
    the code to apply HT/VHT overrides during association. This is a bit tricky
    since the channel was already configured during authentication and we don't
    want to reconfigure it unless there's really a change.

    Signed-off-by: Chaya Rachel Ivgi <chaya.rachel.ivgi@intel.com>
    Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * cfg80211: allow the plink state blocking for user managed mesh  (Chun-Yeow Yeoh, 2015-04-24, 1 file, -1/+2)

    wpa_supplicant or authsae handles the mesh peering in user space, but the
    plink state is still managed in kernel space. Currently, there is no
    implementation by wpa_supplicant or authsae to block the plink state after
    it is set to ESTAB.

    By applying this patch, we can use "iw mesh0 station set <MAC address>
    plink_action block" to block the peer mesh STA. This is useful for
    experimental purposes.

    Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@gmail.com>
    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: use per-CPU TX/RX statistics  (Johannes Berg, 2015-04-24, 3 files, -12/+70)

    This isn't all that relevant for RX right now, but TX can be concurrent due
    to multi-queue and the accounting is therefore broken. Use the standard
    per-CPU statistics to avoid this.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
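    A generic sketch of the per-CPU counter idiom referred to here; the struct
    and function names below illustrate the pattern and are not necessarily the
    exact ones mac80211 adopted:

        struct pkt_stats {
                u64 packets;
                u64 bytes;
                struct u64_stats_sync syncp;    /* 64-bit reads on 32-bit CPUs */
        };

        /* hot-path update: each CPU touches only its own counters, so no
         * locking or shared cacheline bouncing is needed
         */
        static void pkt_stats_tx(struct pkt_stats __percpu *stats,
                                 unsigned int len)
        {
                struct pkt_stats *s = this_cpu_ptr(stats);

                u64_stats_update_begin(&s->syncp);
                s->packets++;
                s->bytes += len;
                u64_stats_update_end(&s->syncp);
        }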
| * mac80211: don't update dev->trans_start  (Johannes Berg, 2015-04-24, 1 file, -2/+0)

    This isn't necessary any more as the stack will automatically update the
    TXQ's trans_start after calling ndo_start_xmit().

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: OCB: remove pointless check for broadcast BSSID  (Johannes Berg, 2015-04-24, 1 file, -5/+1)

    The OCB input path already checked that the BSSID is the broadcast address,
    so the later check can never fail.

    Signed-off-by: Johannes Berg <johannes.berg@intel.com>