summaryrefslogtreecommitdiffstats
path: root/net/sched
Commit message (Collapse)AuthorAgeFilesLines
...
* | net_sched: cbq: remove a flaky use of qdisc_is_throttled()Eric Dumazet2016-06-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | So far no qdisc ever unset the throttled bit at enqueue() time, so CBQ usage of qdisc_is_throttled() was flaky. Since __QDISC_STATE_THROTTLED set/unset is way too expensive considering that only CBQ was eventually caring for this status, it would make sense to implement a Qdisc ops ->is_throttled() if we find that this is needed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: sch_plug: use a private throttled statusEric Dumazet2016-06-101-6/+8
| | | | | | | | | | | | | | | | We want to get rid of generic qdisc throttled management, so this qdisc has to use a private flag. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net, cls: allow for deleting all filters for given parentDaniel Borkmann2016-06-101-4/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a possibility where the user can just specify the parent and all filters under that parent are then being purged. Currently, for example for scripting, one needs to specify pref/prio to have a well-defined number for 'tc filter del' command for addressing the previously created instance or additionally filter handle in case of priorities being the same. Improve usage by allowing the option for tc to specify the parent and removing the whole chain for that given parent. Example usage after patch, no tc changes required: # tc qdisc replace dev foo clsact # tc filter add dev foo egress bpf da obj ./bpf.o # tc filter add dev foo egress bpf da obj ./bpf.o # tc filter show dev foo egress filter protocol all pref 49151 bpf filter protocol all pref 49151 bpf handle 0x1 bpf.o:[classifier] direct-action filter protocol all pref 49152 bpf filter protocol all pref 49152 bpf handle 0x1 bpf.o:[classifier] direct-action # tc filter del dev foo egress # tc filter show dev foo egress # Previously, RTM_DELTFILTER requests with invalid prio of 0 were rejected, so only netlink requests with RTM_NEWTFILTER and NLM_F_CREATE flag were allowed where the kernel would auto-generate a pref/prio. We can piggyback on that and use prio of 0 as a wildcard for requests of RTM_DELTFILTER. For notifying tc netlink monitoring users (e.g. libnl uses this for caching), there are two options, that is, sending individual tfilter_notify() notifications for each tcf_proto, or sending a single one indicating wildcard removal. I tried both and there are pros and cons for each, eventually I decided for sending individual tfilter_notify(), so that user space can support this seamlessly and there won't be a mess of changing each and every application to make sure expectations from the kernel won't break when they don't understand single notification. Since linear chains don't really scale, I expect only a handful of classifiers to be attached at max for a given parent anyway. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-06-1012-75/+106
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: net/sched/act_police.c net/sched/sch_drr.c net/sched/sch_hfsc.c net/sched/sch_prio.c net/sched/sch_red.c net/sched/sch_tbf.c In net-next the drop methods of the packet schedulers got removed, so the bug fixes to them in 'net' are irrelevant. A packet action unload crash fix conflicts with the addition of the new firstuse timestamp. Signed-off-by: David S. Miller <davem@davemloft.net>
| * qfq: don't leak skb if kzalloc failsFlorian Westphal2016-06-081-2/+4
| | | | | | | | | | | | | | | | | | | | When we need to create a new aggregate to enqueue the skb we call kzalloc. If that fails we returned ENOBUFS without freeing the skb. Spotted during code review. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: cls_u32: be more strict about skip-sw flag for knodesJakub Kicinski2016-06-081-18/+19
| | | | | | | | | | | | | | | | | | Return an error if user requested skip-sw and the underlaying hardware cannot handle tc offloads (or offloads are disabled). This patch fixes the knode handling. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: cls_u32: catch all hardware offload errorsJakub Kicinski2016-06-081-1/+7
| | | | | | | | | | | | | | | | Errors reported by u32_replace_hw_hnode() were not propagated. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: fix tc_should_offload for specific clsact classesDaniel Borkmann2016-06-073-7/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When offloading classifiers such as u32 or flower to hardware, and the qdisc is clsact (TC_H_CLSACT), then we need to differentiate its classes, since not all of them handle ingress, therefore we must leave those in software path. Add a .tcf_cl_offload() callback, so we can generically handle them, tested on ixgbe. Fixes: 10cbc6843446 ("net/sched: cls_flower: Hardware offloaded filters statistics support") Fixes: 5b33f48842fa ("net/flower: Introduce hardware offload support") Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * act_police: fix a crash during removalWANG Cong2016-06-071-22/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The police action is using its own code to initialize tcf hash info, which makes us to forgot to initialize a->hinfo correctly. Fix this by calling the helper function tcf_hash_create() directly. This patch fixed the following crash: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 IP: [<ffffffff810c099f>] __lock_acquire+0xd3/0xf91 PGD d3c34067 PUD d3e18067 PMD 0 Oops: 0000 [#1] SMP CPU: 2 PID: 853 Comm: tc Not tainted 4.6.0+ #87 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8800d3e28040 ti: ffff8800d3f6c000 task.ti: ffff8800d3f6c000 RIP: 0010:[<ffffffff810c099f>] [<ffffffff810c099f>] __lock_acquire+0xd3/0xf91 RSP: 0000:ffff88011b203c80 EFLAGS: 00010002 RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000028 RBP: ffff88011b203d40 R08: 0000000000000001 R09: 0000000000000000 R10: ffff88011b203d58 R11: ffff88011b208000 R12: 0000000000000001 R13: ffff8800d3e28040 R14: 0000000000000028 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88011b200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000028 CR3: 00000000d4be1000 CR4: 00000000000006e0 Stack: ffff8800d3e289c0 0000000000000046 000000001b203d60 ffffffff00000000 0000000000000000 ffff880000000000 0000000000000000 ffffffff00000000 ffffffff8187142c ffff88011b203ce8 ffff88011b203ce8 ffffffff8101dbfc Call Trace: <IRQ> [<ffffffff8187142c>] ? __tcf_hash_release+0x77/0xd1 [<ffffffff8101dbfc>] ? native_sched_clock+0x1a/0x35 [<ffffffff8101dbfc>] ? native_sched_clock+0x1a/0x35 [<ffffffff810a9604>] ? sched_clock_local+0x11/0x78 [<ffffffff810bf6a1>] ? mark_lock+0x24/0x201 [<ffffffff810c1dbd>] lock_acquire+0x120/0x1b4 [<ffffffff810c1dbd>] ? lock_acquire+0x120/0x1b4 [<ffffffff8187142c>] ? __tcf_hash_release+0x77/0xd1 [<ffffffff81aad89f>] _raw_spin_lock_bh+0x3c/0x72 [<ffffffff8187142c>] ? __tcf_hash_release+0x77/0xd1 [<ffffffff8187142c>] __tcf_hash_release+0x77/0xd1 [<ffffffff81871a27>] tcf_action_destroy+0x49/0x7c [<ffffffff81870b1c>] tcf_exts_destroy+0x20/0x2d [<ffffffff8189273b>] u32_destroy_key+0x1b/0x4d [<ffffffff81892788>] u32_delete_key_freepf_rcu+0x1b/0x1d [<ffffffff810de3b8>] rcu_process_callbacks+0x610/0x82e [<ffffffff8189276d>] ? u32_destroy_key+0x4d/0x4d [<ffffffff81ab0bc1>] __do_softirq+0x191/0x3f4 Fixes: ddf97ccdd7cb ("net_sched: add network namespace support for tc actions") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fq_codel: return non zero qlen in class dumpsEric Dumazet2016-06-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | We properly scan the flow list to count number of packets, but John passed 0 to gnet_stats_copy_queue() so we report a zero value to user space instead of the result. Fixes: 640158536632 ("net: sched: restrict use of qstats qlen") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.r.fastabend@intel.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: cls_u32: be more strict about skip-sw flagJakub Kicinski2016-06-071-10/+11
| | | | | | | | | | | | | | | | Return an error if user requested skip-sw and the underlaying hardware cannot handle tc offloads (or offloads are disabled). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: cls_u32: fix error code for invalid flagsJakub Kicinski2016-06-071-1/+1
| | | | | | | | | | | | | | | | | | | | 'err' variable is not set in this test, we would return whatever previous test set 'err' to. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fq_codel: fix NET_XMIT_CN behaviorEric Dumazet2016-06-071-6/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | My prior attempt to fix the backlogs of parents failed. If we return NET_XMIT_CN, our parents wont increase their backlog, so our qdisc_tree_reduce_backlog() should take this into account. v2: Florian Westphal pointed out that we could drop the packet, so we need to save qdisc_pkt_len(skb) in a temp variable before calling fq_codel_drop() Fixes: 9d18562a2278 ("fq_codel: add batch ability to fq_codel_drop()") Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Reported-by: Stas Nichiporovich <stasn77@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: keep backlog updated with qlenWANG Cong2016-06-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | For gso_skb we only update qlen, backlog should be updated too. Note, it is correct to just update these stats at one layer, because the gso_skb is cached there. Reported-by: Stas Nichiporovich <stasn77@gmail.com> Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sch_tbf: update backlog as wellWANG Cong2016-06-031-0/+4
| | | | | | | | | | | | | | Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sch_red: update backlog as wellWANG Cong2016-06-031-0/+4
| | | | | | | | | | | | | | Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sch_drr: update backlog as wellWANG Cong2016-06-031-0/+4
| | | | | | | | | | | | | | Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sch_prio: update backlog as wellWANG Cong2016-06-031-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | We need to update backlog too when we update qlen. Joint work with Stas. Reported-by: Stas Nichiporovich <stasn77@gmail.com> Tested-by: Stas Nichiporovich <stasn77@gmail.com> Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sch_hfsc: always keep backlog updatedWANG Cong2016-06-031-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | hfsc updates backlog lazily, that is only when we dump the stats. This is problematic after we begin to update backlog in qdisc_tree_reduce_backlog(). Reported-by: Stas Nichiporovich <stasn77@gmail.com> Tested-by: Stas Nichiporovich <stasn77@gmail.com> Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: fix qdisc->running lockdep annotationsEric Dumazet2016-06-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | 1) qdisc_run_begin() is really using the equivalent of a trylock. Instead of using write_seqcount_begin(), use a combination of raw_write_seqcount_begin() and correct lockdep annotation. 2) sch_direct_xmit() should use regular spin_lock(root_lock) Fixes: f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sched: remove qdisc->dropFlorian Westphal2016-06-0818-359/+0
| | | | | | | | | | | | | | | | | | after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no more callers of ->drop() outside of other ->drop functions, i.e. nothing calls them. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sched: remove qdisc_rehape_failFlorian Westphal2016-06-084-7/+7
| | | | | | | | | | | | | | | | After the removal of TCA_CBQ_POLICE in cbq scheduler qdisc->reshape_fail is always NULL, i.e. qdisc_rehape_fail is now the same as qdisc_drop. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | cbq: remove TCA_CBQ_POLICE supportFlorian Westphal2016-06-081-94/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iproute2 doesn't implement any cbq option that results in this attribute being sent to kernel. To make use of it, user would have to - patch iproute2 - add a class - attach a qdisc to the class (default pfifo doesn't work as q->handle is 0 and cbq_set_police() is a no-op in this case) - re-'add' the same class (tc class change ...) again - user must also specifiy a defmap (e.g. 'split 1:0 defmap 3f'), since this 'police' feature relies on its presence - the added qdisc must be one of bfifo, pfifo or netem If all of these conditions are met and _some_ leaf qdiscs, namely p/bfifo, netem, plug or tbf would drop a packet, kernel calls back into cbq, which will attempt to re-queue the skb into a different class as indicated by the parents' defmap entry for TC_PRIO_BESTEFFORT. [ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ]. This feature, which isn't documented or implemented in iproute2, and isn't implemented consistently (most qdiscs like sfq, codel, etc drop right away instead of attempting this reclassification) is the sole reason for the reshape_fail and __parent member in Qdisc struct. So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP so userspace knows we don't support it, and then remove no-longer needed infrastructure in followup commit. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | cbq: remove TCA_CBQ_OVL_STRATEGY supportFlorian Westphal2016-06-081-160/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | since initial revision of cbq in 2004 iproute 2 has never implemented support for TCA_CBQ_OVL_STRATEGY, which is what needs to be set to activate the class->drop() call (TC_CBQ_OVL_DROP strategy must be set by userspace value must be set by userspace). David Miller says: It seems really safe to kill this thing off, flag an error if someone tries to set the attribute, and therefore kill off all of the non-default cbq_ovl_*() functions. A followup commit can then remove all .drop qdisc methods since this removed the only caller. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: do not acquire qdisc spinlock in qdisc/class stats dumpEric Dumazet2016-06-0714-44/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host agent [1] are problematic at scale : For each qdisc/class found in the dump, we currently lock the root qdisc spinlock in order to get stats. Sampling stats every 5 seconds from thousands of HTB classes is a challenge when the root qdisc spinlock is under high pressure. Not only the dumps take time, they also slow down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases. An audit of existing qdiscs showed that sch_fq_codel is the only qdisc that might need the qdisc lock in fq_codel_dump_stats() and fq_codel_dump_class_stats() In v2 of this patch, I now use the Qdisc running seqcount to provide consistent reads of packets/bytes counters, regardless of 32/64 bit arches. I also changed rate estimators to use the same infrastructure so that they no longer need to lock root qdisc lock. [1] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kevin Athey <kda@google.com> Cc: Xiaotian Pei <xiaotian@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: transform qdisc running bit into a seqcountEric Dumazet2016-06-071-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | Instead of using a single bit (__QDISC___STATE_RUNNING) in sch->__state, use a seqcount. This adds lockdep support, but more importantly it will allow us to sample qdisc/class statistics without having to grab qdisc root lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net sched: indentation and other OCD stylistic fixesJamal Hadi Salim2016-06-076-17/+28
| | | | | | | | | | Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
* | net sched actions: aggregate dumping of actions timeinfoJamal Hadi Salim2016-06-0712-48/+21
| | | | | | | | | | Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net sched actions: introduce timestamp for firsttime useJamal Hadi Salim2016-06-0713-0/+14
| | | | | | | | | | | | | | | | Useful to know when the action was first used for accounting (and debugging) Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net sched: actions use tcf_lastuse_update for consistencyJamal Hadi Salim2016-06-0710-12/+11
| | | | | | | | | | Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/sched: cls_flower: Introduce support in SKIP SW flagAmir Vadai2016-06-071-9/+22
|/ | | | | | | | | | | | | | | | | | | | | | | | | | In order to make a filter processed only by hardware, skip_sw flag should be supplied. This is an addition to the already existing skip_hw flag (filter will be processed by software only). If no flag is specified, filter will be processed by both software and hardware. If only hardware offloaded filters exist, fl_classify() will return without doing anything. A following userspace patch will be sent once kernel patch is accepted. Example: tc filter add dev enp0s9 protocol ip prio 20 parent ffff: \ flower \ ip_proto 6 \ indev enp0s9 \ skip_sw \ action skbedit mark 0x1234 Signed-off-by: Amir Vadai <amirva@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net sched actions: policer missing timestamp processingJamal Hadi Salim2016-05-241-0/+11
| | | | | | | | Policer was not dumping or updating timestamps Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: avoid too many hrtimer_start() callsEric Dumazet2016-05-242-10/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I found a serious performance bug in packet schedulers using hrtimers. sch_htb and sch_fq are definitely impacted by this problem. We constantly rearm high resolution timers if some packets are throttled in one (or more) class, and other packets are flying through qdisc on another (non throttled) class. hrtimer_start() does not have the mod_timer() trick of doing nothing if expires value does not change : if (timer_pending(timer) && timer->expires == expires) return 1; This issue is particularly visible when multiple cpus can queue/dequeue packets on the same qdisc, as hrtimer code has to lock a remote base. I used following fix : 1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding it. 2) Cache watchdog prior expiration. hrtimer might provide this, but I prefer to not rely on some hrtimer internal. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: close another race condition in tcf_mirred_release()WANG Cong2016-05-171-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We saw the following extra refcount release on veth device: kernel: [7957821.463992] unregister_netdevice: waiting for mesos50284 to become free. Usage count = -1 Since we heavily use mirred action to redirect packets to veth, I think this is caused by the following race condition: CPU0: tcf_mirred_release(): (in RCU callback) struct net_device *dev = rcu_dereference_protected(m->tcfm_dev, 1); CPU1: mirred_device_event(): spin_lock_bh(&mirred_list_lock); list_for_each_entry(m, &mirred_list, tcfm_list) { if (rcu_access_pointer(m->tcfm_dev) == dev) { dev_put(dev); /* Note : no rcu grace period necessary, as * net_device are already rcu protected. */ RCU_INIT_POINTER(m->tcfm_dev, NULL); } } spin_unlock_bh(&mirred_list_lock); CPU0: tcf_mirred_release(): spin_lock_bh(&mirred_list_lock); list_del(&m->tcfm_list); spin_unlock_bh(&mirred_list_lock); if (dev) // <======== Stil refers to the old m->tcfm_dev dev_put(dev); // <======== dev_put() is called on it again The action init code path is good because it is impossible to modify an action that is being removed. So, fix this by moving everything under the spinlock. Fixes: 2ee22a90c7af ("net_sched: act_mirred: remove spinlock in fast path") Fixes: 6bd00b850635 ("act_mirred: fix a race condition on mirred_list") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* fq_codel: fix memory limitation driftEric Dumazet2016-05-161-1/+2
| | | | | | | | | | | | memory_usage must be decreased in dequeue_func(), not in fq_codel_dequeue(), otherwise packets dropped by Codel algo are missing this decrease. Also we need to clear memory_usage in fq_codel_reset() Fixes: 95b58430abe7 ("fq_codel: add memory limitation per queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: cls_flower: Hardware offloaded filters statistics supportAmir Vadai2016-05-161-0/+21
| | | | | | | | | Introduce a new command in ndo_setup_tc() for hardware offloaded filters, to call the NIC driver, and make it update the statistics. This will be done before dumping the filter and its statistics. Signed-off-by: Amir Vadai <amirva@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: act_gact: Update statistics when offloaded to hardwareAmir Vadai2016-05-161-0/+15
| | | | | | | | Implement the stats_update callback that will be called by NIC drivers for hardware offloaded filters. Signed-off-by: Amir Vadai <amirva@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: cls_u32: Add support for skip-sw flag to tc u32 classifier.Samudrala, Sridhar2016-05-161-9/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On devices that support TC U32 offloads, this flag enables a filter to be added only to HW. skip-sw and skip-hw are mutually exclusive flags. By default without any flags, the filter is added to both HW and SW, but no error checks are done in case of failure to add to HW. With skip-sw, failure to add to HW is treated as an error. Here is a sample script that adds 2 filters, one with skip-sw and the other with skip-hw flag. # add ingress qdisc tc qdisc add dev p4p1 ingress # enable hw tc offload. ethtool -K p4p1 hw-tc-offload on # add u32 filter with skip-sw flag. tc filter add dev p4p1 parent ffff: protocol ip prio 99 \ handle 800:0:1 u32 ht 800: flowid 800:1 \ skip-sw \ match ip src 192.168.1.0/24 \ action drop # add u32 filter with skip-hw flag. tc filter add dev p4p1 parent ffff: protocol ip prio 99 \ handle 800:0:2 u32 ht 800: flowid 800:2 \ skip-hw \ match ip src 192.168.2.0/24 \ action drop Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-05-156-36/+74
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next' because we no longer have a per-netns conntrack hash. The ip_gre.c conflict as well as the iwlwifi ones were cases of overlapping changes. Conflicts: drivers/net/wireless/intel/iwlwifi/mvm/tx.c net/ipv4/ip_gre.c net/netfilter/nf_conntrack_core.c Signed-off-by: David S. Miller <davem@davemloft.net>
| * net sched: ife action fix late bindingJamal Hadi Salim2016-05-101-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The process below was broken and is fixed with this patch. //add an ife action and give it an instance id of 1 sudo tc actions add action ife encode \ type 0xDEAD allow mark dst 02:15:15:15:15:15 index 1 //create a filter which binds to ife action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:11 action ife index 1 Message before fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net sched: skbedit action fix late bindingJamal Hadi Salim2016-05-101-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The process below was broken and is fixed with this patch. //add a skbedit action and give it an instance id of 1 sudo tc actions add action skbedit mark 10 index 1 //create a filter which binds to skbedit action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action skbedit index 1 Message before fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net sched: simple action fix late bindingJamal Hadi Salim2016-05-101-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The process below was broken and is fixed with this patch. //add a simple action and give it an instance id of 1 sudo tc actions add action simple sdata "foobar" index 1 //create a filter which binds to simple action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action simple index 1 Message before fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net sched: mirred action fix late bindingJamal Hadi Salim2016-05-101-6/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The process below was broken and is fixed with this patch. //add an mirred action and give it an instance id of 1 sudo tc actions add action mirred egress mirror dev $MDEV index 1 //create a filter which binds to mirred action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action mirred index 1 Message before bug fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net sched: ipt action fix late bindingJamal Hadi Salim2016-05-101-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This was broken and is fixed with this patch. //add an ipt action and give it an instance id of 1 sudo tc actions add action ipt -j mark --set-mark 2 index 1 //create a filter which binds to ipt action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32\ match ip dst 17.0.0.1/32 flowid 1:10 action ipt index 1 Message before bug fix was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net sched: vlan action fix late bindingJamal Hadi Salim2016-05-101-6/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Late vlan action binding was broken and is fixed with this patch. //add a vlan action to pop and give it an instance id of 1 sudo tc actions add action vlan pop index 1 //create filter which binds to vlan action id 1 sudo tc filter add dev $DEV parent ffff: protocol ip prio 1 u32 \ match ip dst 17.0.0.1/32 flowid 1:1 action vlan index 1 current message(before bug fix) was: RTNETLINK answers: Invalid argument We have an error talking to the kernel Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | fq_codel: add memory limitation per queueEric Dumazet2016-05-081-3/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On small embedded routers, one wants to control maximal amount of memory used by fq_codel, instead of controlling number of packets or bytes, since GRO/TSO make these not practical. Assuming skb->truesize is accurate, we have to keep track of skb->truesize sum for skbs in queue. This patch adds a new TCA_FQ_CODEL_MEMORY_LIMIT attribute. I chose a default value of 32 MBytes, which looks reasonable even for heavy duty usages. (Prior fq_codel users should not be hurt when they upgrade their kernels) Two fields are added to tc_fq_codel_qd_stats to report : - Current memory usage - Number of drops caused by memory limits # tc qd replace dev eth1 root est 1sec 4sec fq_codel memory_limit 4M .. # tc -s -d qd sh dev eth1 qdisc fq_codel 8008: root refcnt 257 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 4Mb ecn Sent 2083566791363 bytes 1376214889 pkt (dropped 4994406, overlimits 0 requeues 21705223) rate 9841Mbit 812549pps backlog 3906120b 376p requeues 21705223 maxpacket 68130 drop_overlimit 4994406 new_flow_count 28855414 ecn_mark 0 memory_used 4190048 drop_overmemory 4994406 new_flows_len 1 old_flows_len 177 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Dave Täht <dave.taht@gmail.com> Cc: Sebastian Möller <moeller0@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bpf: wire in data and data_end for cls_act_bpfAlexei Starovoitov2016-05-062-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | allow cls_bpf and act_bpf programs access skb->data and skb->data_end pointers. The bpf helpers that change skb->data need to update data_end pointer as well. The verifier checks that programs always reload data, data_end pointers after calls to such bpf helpers. We cannot add 'data_end' pointer to struct qdisc_skb_cb directly, since it's embedded as-is by infiniband ipoib, so wrapper struct is needed. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: remove dev->trans_startFlorian Westphal2016-05-041-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | previous patches removed all direct accesses to dev->trans_start, so change the netif_trans_update helper to update trans_start of netdev queue 0 instead and then remove trans_start from struct net_device. AFAICS a lot of the netif_trans_update() invocations are now useless because they occur in ndo_start_xmit and driver doesn't set LLTX (i.e. stack already took care of the update). As I can't test any of them it seems better to just leave them alone. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | treewide: replace dev->trans_start update with helperFlorian Westphal2016-05-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace all trans_start updates with netif_trans_update helper. change was done via spatch: struct net_device *d; @@ - d->trans_start = jiffies + netif_trans_update(d) Compile tested only. Cc: user-mode-linux-devel@lists.sourceforge.net Cc: linux-xtensa@linux-xtensa.org Cc: linux1394-devel@lists.sourceforge.net Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: MPT-FusionLinux.pdl@broadcom.com Cc: linux-scsi@vger.kernel.org Cc: linux-can@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linux-omap@vger.kernel.org Cc: linux-hams@vger.kernel.org Cc: linux-usb@vger.kernel.org Cc: linux-wireless@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: devel@driverdev.osuosl.org Cc: b.a.t.m.a.n@lists.open-mesh.org Cc: linux-bluetooth@vger.kernel.org Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com> Acked-by: Mugunthan V N <mugunthanvnm@ti.com> Acked-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-05-041-2/+59
|\| | | | | | | | | | | | | | | | | | | Conflicts: net/ipv4/ip_gre.c Minor conflicts between tunnel bug fixes in net and ipv6 tunnel cleanups in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>