summaryrefslogtreecommitdiffstats
path: root/net/sched
Commit message (Collapse)AuthorAgeFilesLines
* net: filter: split 'struct sk_filter' into socket and bpf partsAlexei Starovoitov2014-08-021-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clean up names related to socket filtering and bpf in the following way: - everything that deals with sockets keeps 'sk_*' prefix - everything that is pure BPF is changed to 'bpf_*' prefix split 'struct sk_filter' into struct sk_filter { atomic_t refcnt; struct rcu_head rcu; struct bpf_prog *prog; }; and struct bpf_prog { u32 jited:1, len:31; struct sock_fprog_kern *orig_prog; unsigned int (*bpf_func)(const struct sk_buff *skb, const struct bpf_insn *filter); union { struct sock_filter insns[0]; struct bpf_insn insnsi[0]; struct work_struct work; }; }; so that 'struct bpf_prog' can be used independent of sockets and cleans up 'unattached' bpf use cases split SK_RUN_FILTER macro into: SK_RUN_FILTER to be used with 'struct sk_filter *' and BPF_PROG_RUN to be used with 'struct bpf_prog *' __sk_filter_release(struct sk_filter *) gains __bpf_prog_release(struct bpf_prog *) helper function also perform related renames for the functions that work with 'struct bpf_prog *', since they're on the same lines: sk_filter_size -> bpf_prog_size sk_filter_select_runtime -> bpf_prog_select_runtime sk_filter_free -> bpf_prog_free sk_unattached_filter_create -> bpf_prog_create sk_unattached_filter_destroy -> bpf_prog_destroy sk_store_orig_filter -> bpf_prog_store_orig_filter sk_release_orig_filter -> bpf_release_orig_filter __sk_migrate_filter -> bpf_migrate_filter __sk_prepare_filter -> bpf_prepare_filter API for attaching classic BPF to a socket stays the same: sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *) and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program which is used by sockets, tun, af_packet API for 'unattached' BPF programs becomes: bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *) and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: remove exceptional & on function nameHimangi Saraogi2014-07-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | In this file, function names are otherwise used as pointers without &. A simplified version of the Coccinelle semantic patch that makes this change is as follows: // <smpl> @r@ identifier f; @@ f(...) { ... } @@ identifier r.f; @@ - &f + f // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-07-221-5/+14
|\ | | | | | | | | | | | | | | | | Conflicts: drivers/infiniband/hw/cxgb4/device.c The cxgb4 conflict was simply overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: avoid generating same handle for u32 filtersCong Wang2014-07-201-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When kernel generates a handle for a u32 filter, it tries to start from the max in the bucket. So when we have a filter with the max (fff) handle, it will cause kernel always generates the same handle for new filters. This can be shown by the following command: tc qdisc add dev eth0 ingress tc filter add dev eth0 parent ffff: protocol ip pref 770 handle 800::fff u32 match ip protocol 1 0xff tc filter add dev eth0 parent ffff: protocol ip pref 770 u32 match ip protocol 1 0xff ... we will get some u32 filters with same handle: # tc filter show dev eth0 parent ffff: filter protocol ip pref 770 u32 filter protocol ip pref 770 u32 fh 800: ht divisor 1 filter protocol ip pref 770 u32 fh 800::fff order 4095 key ht 800 bkt 0 match 00010000/00ff0000 at 8 filter protocol ip pref 770 u32 fh 800::fff order 4095 key ht 800 bkt 0 match 00010000/00ff0000 at 8 filter protocol ip pref 770 u32 fh 800::fff order 4095 key ht 800 bkt 0 match 00010000/00ff0000 at 8 filter protocol ip pref 770 u32 fh 800::fff order 4095 key ht 800 bkt 0 match 00010000/00ff0000 at 8 handles should be unique. This patch fixes it by looking up a bitmap, so that can guarantee the handle is as unique as possible. For compatibility, we still start from 0x800. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: hold tcf_lock in netdevice notifierCong Wang2014-07-201-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | We modify mirred action (m->tcfm_dev) in netdev event, we need to prevent on-going mirred actions from reading freed m->tcfm_dev. So we need to acquire this spin lock. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: cancel nest attribute on failure in tcf_exts_dump()Cong Wang2014-07-171-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | Like other places, we need to cancel the nest attribute after we start. Fortunately the netlink message will not be sent on failure, so it's not a big problem at all. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: set name_assign_type in alloc_netdev()Tom Gundersen2014-07-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert all users to pass NET_NAME_UNKNOWN. Coccinelle patch: @@ expression sizeof_priv, name, setup, txqs, rxqs, count; @@ ( -alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs) +alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs) | -alloc_netdev_mq(sizeof_priv, name, setup, count) +alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count) | -alloc_netdev(sizeof_priv, name, setup) +alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup) ) v9: move comments here from the wrong commit Signed-off-by: Tom Gundersen <teg@jklm.no> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix some typos in commentYing Xue2014-07-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | In commit 371121057607e3127e19b3fa094330181b5b031e("net: QDISC_STATE_RUNNING dont need atomic bit ops") the __QDISC_STATE_RUNNING is renamed to __QDISC___STATE_RUNNING, but the old names existing in comment are not replaced with the new name completely. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: em_canid: remove useless statements from em_canid_changeDuan Jiong2014-06-211-7/+0
|/ | | | | | | | tcf_ematch is allocated by kzalloc in function tcf_em_tree_validate(), so cm_old is always NULL. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: drr: warn when qdisc is not work conservingFlorian Westphal2014-06-112-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The DRR scheduler requires that items on the active list are work conserving, i.e. do not hold on to skbs for throttling purposes, etc. Attaching e.g. tbf renders DRR useless because all other classes on the active list are delayed as well. So, warn users that this configuration won't work as expected; we already do this in couple of other qdiscs, see e.g. commit b00355db3f88d96810a60011a30cfb2c3469409d ('pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler') The 'const' change is needed to avoid compiler warning ("discards 'const' qualifier from pointer target type"). tested with: drr_hier() { parent=$1 classes=$2 for i in $(seq 1 $classes); do classid=$parent$(printf %x $i) tc class add dev eth0 parent $parent classid $classid drr tc qdisc add dev eth0 parent $classid tbf rate 64kbit burst 256kbit limit 64kbit done } tc qdisc add dev eth0 root handle 1: drr drr_hier 1: 32 tc filter add dev eth0 protocol all pref 1 parent 1: handle 1 flow hash keys dst perturb 1 divisor 32 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use the new API kvfree()WANG Cong2014-06-056-34/+6
| | | | | | | | | It is available since v3.15-rc5. Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-05-241-10/+20
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/bonding/bond_alb.c drivers/net/ethernet/altera/altera_msgdma.c drivers/net/ethernet/altera/altera_sgdma.c net/ipv6/xfrm6_output.c Several cases of overlapping changes. The xfrm6_output.c has a bug fix which overlaps the renaming of skb->local_df to skb->ignore_df. In the Altera TSE driver cases, the register access cleanups in net-next overlapped with bug fixes done in net. Similarly a bug fix to send ALB packets in the bonding driver using the right source address overlaps with cleanups in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: fix an oops in tcindex filterCong Wang2014-05-211-10/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kelly reported the following crash: IP: [<ffffffff817a993d>] tcf_action_exec+0x46/0x90 PGD 3009067 PUD 300c067 PMD 11ff30067 PTE 800000011634b060 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 1 PID: 639 Comm: dhclient Not tainted 3.15.0-rc4+ #342 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8801169ecd00 ti: ffff8800d21b8000 task.ti: ffff8800d21b8000 RIP: 0010:[<ffffffff817a993d>] [<ffffffff817a993d>] tcf_action_exec+0x46/0x90 RSP: 0018:ffff8800d21b9b90 EFLAGS: 00010283 RAX: 00000000ffffffff RBX: ffff88011634b8e8 RCX: ffff8800cf7133d8 RDX: ffff88011634b900 RSI: ffff8800cf7133e0 RDI: ffff8800d210f840 RBP: ffff8800d21b9bb0 R08: ffffffff8287bf60 R09: 0000000000000001 R10: ffff8800d2b22b24 R11: 0000000000000001 R12: ffff8800d210f840 R13: ffff8800d21b9c50 R14: ffff8800cf7133e0 R15: ffff8800cad433d8 FS: 00007f49723e1840(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88011634b8f0 CR3: 00000000ce469000 CR4: 00000000000006e0 Stack: ffff8800d2170188 ffff8800d210f840 ffff8800d2171b90 0000000000000000 ffff8800d21b9be8 ffffffff817c55bb ffff8800d21b9c50 ffff8800d2171b90 ffff8800d210f840 ffff8800d21b0300 ffff8800d21b9c50 ffff8800d21b9c18 Call Trace: [<ffffffff817c55bb>] tcindex_classify+0x88/0x9b [<ffffffff817a7f7d>] tc_classify_compat+0x3e/0x7b [<ffffffff817a7fdf>] tc_classify+0x25/0x9f [<ffffffff817b0e68>] htb_enqueue+0x55/0x27a [<ffffffff817b6c2e>] dsmark_enqueue+0x165/0x1a4 [<ffffffff81775642>] __dev_queue_xmit+0x35e/0x536 [<ffffffff8177582a>] dev_queue_xmit+0x10/0x12 [<ffffffff818f8ecd>] packet_sendmsg+0xb26/0xb9a [<ffffffff810b1507>] ? __lock_acquire+0x3ae/0xdf3 [<ffffffff8175cf08>] __sock_sendmsg_nosec+0x25/0x27 [<ffffffff8175d916>] sock_aio_write+0xd0/0xe7 [<ffffffff8117d6b8>] do_sync_write+0x59/0x78 [<ffffffff8117d84d>] vfs_write+0xb5/0x10a [<ffffffff8117d96a>] SyS_write+0x49/0x7f [<ffffffff8198e212>] system_call_fastpath+0x16/0x1b This is because we memcpy struct tcindex_filter_result which contains struct tcf_exts, obviously struct list_head can not be simply copied. This is a regression introduced by commit 33be627159913b094bb578 (net_sched: act: use standard struct list_head). It's not very easy to fix it as the code is a mess: if (old_r) memcpy(&cr, r, sizeof(cr)); else { memset(&cr, 0, sizeof(cr)); tcf_exts_init(&cr.exts, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE); } ... tcf_exts_change(tp, &cr.exts, &e); ... memcpy(r, &cr, sizeof(cr)); the above code should equal to: tcindex_filter_result_init(&cr); if (old_r) cr.res = r->res; ... if (old_r) tcf_exts_change(tp, &r->exts, &e); else tcf_exts_change(tp, &cr.exts, &e); ... r->res = cr.res; after this change, since there is no need to copy struct tcf_exts. And it also fixes other places zero'ing struct's contains struct tcf_exts. Fixes: commit 33be627159913b0 (net_sched: act: use standard struct list_head) Reported-by: Kelly Anderson <kelly@xilka.com> Tested-by: Kelly Anderson <kelly@xilka.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: filter: let unattached filters use sock_fprog_kernDaniel Borkmann2014-05-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sk_unattached_filter_create() API is used by BPF filters that are not directly attached or related to sockets, and are used in team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own internal managment of obtaining filter blocks and thus already have them in kernel memory and set up before calling into sk_unattached_filter_create(). As a result, due to __user annotation in sock_fprog, sparse triggers false positives (incorrect type in assignment [different address space]) when filters are set up before passing them to sk_unattached_filter_create(). Therefore, let sk_unattached_filter_create() API use sock_fprog_kern to overcome this issue. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sch_hhf: fix comparison of qlen and limitYang Yingliang2014-05-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | When I use the following command, eth0 cannot send any packets. #tc qdisc add dev eth0 root handle 1: hhf limit 1 Because qlen need be smaller than limit, all packets were dropped. Fix this by qlen *<=* limit. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-05-124-10/+11
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/altera/altera_sgdma.c net/netlink/af_netlink.c net/sched/cls_api.c net/sched/sch_api.c The netlink conflict dealt with moving to netlink_capable() and netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations in non-init namespaces. These were simple transformations from netlink_capable to netlink_ns_capable. The Altera driver conflict was simply code removal overlapping some void pointer cast cleanups in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: lock imbalance in hhf qdiscJohn Fastabend2014-05-041-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | hhf_change() takes the sch_tree_lock and releases it but misses the error cases. Fix the missed case here. To reproduce try a command like this, # tc qdisc change dev p3p2 root hhf quantum 40960 non_hh_weight 300000 Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Use netlink_ns_capable to verify the permisions of netlink messagesEric W. Biederman2014-04-243-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible by passing a netlink socket to a more privileged executable and then to fool that executable into writing to the socket data that happens to be valid netlink message to do something that privileged executable did not intend to do. To keep this from happening replace bare capable and ns_capable calls with netlink_capable, netlink_net_calls and netlink_ns_capable calls. Which act the same as the previous calls except they verify that the opener of the socket had the desired permissions as well. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Allow tc changes in user namespacesStéphane Graber2014-05-022-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | This switches a few remaining capable(CAP_NET_ADMIN) to ns_capable so that root in a user namespace may set tc rules inside that namespace. Signed-off-by: Stéphane Graber <stgraber@ubuntu.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sched, act: allow to clear all actions as wellCong Wang2014-04-271-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | When we change the list of action on a given filter, currently we don't change it to empty. This is a bug, we should allow to change to whatever users given. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sched, cls: check if we could overwrite actions when changing a filterCong Wang2014-04-2710-39/+41
|/ | | | | | | | | | | | | | | | | | | | | When actions are attached to a filter, they are a part of the filter itself, so when changing a filter we should allow to overwrite the actions inside as well. In my specific case, when I tried to _append_ a new action to an existing filter which already has an action, I got EEXIST since kernel refused to overwrite the existing one in kernel. This patch checks if we are changing the filter checking NLM_F_CREATE flag (Sigh, filters don't use NLM_F_REPLACE...) and then passes the boolean down to actions. This fixes the problem above. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net-sysfs: expose number of carrier on/off changesdavid decotigny2014-03-311-0/+2
| | | | | | | | | | | | | | | | | This allows to monitor carrier on/off transitions and detect link flapping issues: - new /sys/class/net/X/carrier_changes - new rtnetlink IFLA_CARRIER_CHANGES (getlink) Tested: - grep . /sys/class/net/*/carrier_changes + ip link set dev X down/up + plug/unplug cable - updated iproute2: prints IFLA_CARRIER_CHANGES - iproute2 20121211-2 (debian): unchanged behavior Signed-off-by: David Decotigny <decot@googlers.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: use no more than one page in struct fw_headEric Dumazet2014-03-181-26/+7
| | | | | | | | | | | | | | | | | | | | In commit b4e9b520ca5d ("[NET_SCHED]: Add mask support to fwmark classifier") Patrick added an u32 field in fw_head, making it slightly bigger than one page. Lets use 256 slots to make fw_hash() more straight forward, and move @mask to the beginning of the structure as we often use a small number of skb->mark. @mask and first hash buckets share the same cache line. This brings back the memory usage to less than 4000 bytes, and permits John to add a rcu_head at the end of the structure later without any worry. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: John Fastabend <john.fastabend@gmail.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-03-141-3/+4
|\ | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/usb/r8152.c drivers/net/xen-netback/netback.c Both the r8152 and netback conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * pkt_sched: fq: do not hold qdisc lock while allocating memoryEric Dumazet2014-03-101-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Resizing fq hash table allocates memory while holding qdisc spinlock, with BH disabled. This is definitely not good, as allocation might sleep. We can drop the lock and get it when needed, we hold RTNL so no other changes can happen at the same time. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler") Signed-off-by: David S. Miller <davem@davemloft.net>
| * pkt_sched: move the sanity test in qdisc_list_add()Eric Dumazet2014-03-101-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | The WARN_ON(root == &noop_qdisc)) added in qdisc_list_add() can trigger in normal conditions when devices are not up. It should be done only right before the list_add_tail() call. Fixes: e57a784d8cae4 ("pkt_sched: set root qdisc before change() in attach_default_qdiscs()") Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Mirco Tischler <mt-ml@gmx.de> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: return nla_nest_end() instead of skb->lenYang Yingliang2014-03-138-18/+9
| | | | | | | | | | | | | | | | nla_nest_end() already has return skb->len, so replace return skb->len with return nla_nest_end instead(). Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | pkt_sched: add cond_resched() to class and qdisc dumpEric Dumazet2014-03-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | We have seen delays of more than 50ms in class or qdisc dumps, in case device is under high TX stress, even with the prior 4KB per skb limit. Add cond_resched() to give a chance to higher prio tasks to get cpu. Signed-off-by; Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | pkt_sched: do not use rcu in tc_dump_qdisc()Eric Dumazet2014-03-111-4/+2
| | | | | | | | | | | | | | | | | | | | | | Like all rtnetlink dump operations, we hold RTNL in tc_dump_qdisc(), so we do not need to use rcu protection to protect list of netdevices. This will allow preemption to occur, thus reducing latencies. Following patch adds explicit cond_resched() calls. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | pkt_sched: fq: do not hold qdisc lock while allocating memoryEric Dumazet2014-03-081-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Resizing fq hash table allocates memory while holding qdisc spinlock, with BH disabled. This is definitely not good, as allocation might sleep. We can drop the lock and get it when needed, we hold RTNL so no other changes can happen at the same time. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler") Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: htb: do not acquire qdisc lock in dump operationsEric Dumazet2014-03-061-12/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | htb_dump() and htb_dump_class() do not strictly need to acquire qdisc lock to fetch qdisc and/or class parameters. We hold RTNL and no changes can occur. This reduces by 50% qdisc lock pressure while doing tc qdisc|class dump operations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-03-051-12/+12
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/wireless/ath/ath9k/recv.c drivers/net/wireless/mwifiex/pcie.c net/ipv6/sit.c The SIT driver conflict consists of a bug fix being done by hand in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper was created (netdev_alloc_pcpu_stats()) which takes care of this. The two wireless conflicts were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * sch_tbf: Fix potential memory leak in tbf_change().Hiroaki SHIMODA2014-02-271-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | The allocated child qdisc is not freed in error conditions. Defer the allocation after user configuration turns out to be valid and acceptable. Fixes: cc106e441a63b ("net: sched: tbf: fix the calculation of max_size") Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Cc: Yang Yingliang <yangyingliang@huawei.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sch_tbf: Remove holes in struct tbf_sched_data.Hiroaki SHIMODA2014-03-031-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | On x86_64 we have 3 holes in struct tbf_sched_data. The member peak_present can be replaced with peak.rate_bytes_ps, because peak.rate_bytes_ps is set only when peak is specified in tbf_change(). tbf_peak_present() is introduced to test peak.rate_bytes_ps. The member max_size is moved to fill 32bit hole. Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-02-191-5/+16
|\| | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/bonding/bond_3ad.h drivers/net/bonding/bond_main.c Two minor conflicts in bonding, both of which were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: Cleanup PIE commentsVijay Subramanian2014-02-131-5/+16
| | | | | | | | | | | | | | | | | | | | | | Fix incorrect comment reported by Norbert Kiesel. Edit another comment to add more details. Also add references to algorithm (IETF draft and paper) to top of file. Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> CC: Mythili Prabhu <mysuryan@cisco.com> CC: Norbert Kiesel <nkiesel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sch_netem: replace magic numbers with enumerate in get_loss_clgYang Yingliang2014-02-171-2/+2
| | | | | | | | | | | | | | Replace two magic numbers which intialize clgstate::state. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sch_netem: replace magic numbers with enumerate in GE modelYang Yingliang2014-02-141-4/+9
| | | | | | | | | | | | | | | | Replace some magic numbers which describe states of GE model loss generator with enumerate. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sch_netem: change some func's param from "struct Qdisc *" to "struct ↵Yang Yingliang2014-02-141-15/+10
| | | | | | | | | | | | | | | | | | | | | | netem_sched_data *" In netem_change(), we have already get "struct netem_sched_data *q". Replace params of get_correlation() and other similar functions with "struct netem_sched_data *q". Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sch_netem: return errcode before setting paramsYang Yingliang2014-02-141-10/+29
| | | | | | | | | | | | | | | | | | | | get_dist_table() and get_loss_clg() may be failed. These two functions should be called after setting the members of qdisc_priv(sch), or it will break the old settings while either of them is failed. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: act: clean up tca_action_flush()WANG Cong2014-02-121-30/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | We could allocate tc_action on stack in tca_action_flush(), since it is not large. Also, we could use create_a() in tcf_action_get_1(). Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: act: refuse to remove bound action outsideWANG Cong2014-02-121-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | When an action is bonnd to a filter, there is no point to remove it outside. Currently we just silently decrease the refcnt, we should reject this explicitly with EPERM. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: act: move tcf_hashinfo_init() into tcf_register_action()WANG Cong2014-02-1210-81/+28
| | | | | | | | | | | | | | | | Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: act: refactor cleanup opsWANG Cong2014-02-1210-63/+21
| | | | | | | | | | | | | | | | | | | | | | | | For bindcnt and refcnt etc., they are common for all actions, not need to repeat such operations for their own, they can be unified now. Actions just need to do its specific cleanup if needed. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net_sched: act: hide struct tcf_common from APIWANG Cong2014-02-1210-181/+112
|/ | | | | | | | | | | | Now we can totally hide it from modules. tcf_hash_*() API's will operate on struct tc_action, modules don't need to care about the details. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add and use skb_gso_transport_seglen()Florian Westphal2014-01-261-10/+3
| | | | | | | | | This moves part of Eric Dumazets skb_gso_seglen helper from tbf sched to skbuff core so it may be reused by upcoming ip forwarding path patch. Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sch_htb: let skb->priority refer to non-leaf classHarry Mason2014-01-221-3/+8
| | | | | | | | | | If the class in skb->priority is not a leaf, apply filters from the selected class, not the qdisc. This lets netfilter or user space partially classify the packet. Signed-off-by: Harry Mason <harry.mason@smoothwall.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* reciprocal_divide: update/correction of the algorithmHannes Frederic Sowa2014-01-211-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jakub Zawadzki noticed that some divisions by reciprocal_divide() were not correct [1][2], which he could also show with BPF code after divisions are transformed into reciprocal_value() for runtime invariance which can be passed to reciprocal_divide() later on; reverse in BPF dump ended up with a different, off-by-one K in some situations. This has been fixed by Eric Dumazet in commit aee636c4809fa5 ("bpf: do not use reciprocal divide"). This follow-up patch improves reciprocal_value() and reciprocal_divide() to work in all cases by using Granlund and Montgomery method, so that also future use is safe and without any non-obvious side-effects. Known problems with the old implementation were that division by 1 always returned 0 and some off-by-ones when the dividend and divisor where very large. This seemed to not be problematic with its current users, as far as we can tell. Eric Dumazet checked for the slab usage, we cannot surely say so in the case of flex_array. Still, in order to fix that, we propose an extension from the original implementation from commit 6a2d7a955d8d resp. [3][4], by using the algorithm proposed in "Division by Invariant Integers Using Multiplication" [5], Torbjörn Granlund and Peter L. Montgomery, that is, pseudocode for q = n/d where q, n, d is in u32 universe: 1) Initialization: int l = ceil(log_2 d) uword m' = floor((1<<32)*((1<<l)-d)/d)+1 int sh_1 = min(l,1) int sh_2 = max(l-1,0) 2) For q = n/d, all uword: uword t = (n*m')>>32 q = (t+((n-t)>>sh_1))>>sh_2 The assembler implementation from Agner Fog [6] also helped a lot while implementing. We have tested the implementation on x86_64, ppc64, i686, s390x; on x86_64/haswell we're still half the latency compared to normal divide. Joint work with Daniel Borkmann. [1] http://www.wireshark.org/~darkjames/reciprocal-buggy.c [2] http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c [3] https://gmplib.org/~tege/division-paper.pdf [4] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html [5] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556 [6] http://www.agner.org/optimize/asmlib.zip Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Austin S Hemmelgarn <ahferroin7@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: Jesse Gross <jesse@nicira.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Jay Vosburgh <fubar@us.ibm.com> Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* random32: add prandom_u32_max and convert open coded usersDaniel Borkmann2014-01-211-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Many functions have open coded a function that returns a random number in range [0,N-1]. Under the assumption that we have a PRNG such as taus113 with being well distributed in [0, ~0U] space, we can implement such a function as uword t = (n*m')>>32, where m' is a random number obtained from PRNG, n the right open interval border and t our resulting random number, with n,m',t in u32 universe. Lets go with Joe and simply call it prandom_u32_max(), although technically we have an right open interval endpoint, but that we have documented. Other users can further be migrated to the new prandom_u32_max() function later on; for now, we need to make sure to migrate reciprocal_divide() users for the reciprocal_divide() follow-up fixup since their function signatures are going to change. Joint work with Hannes Frederic Sowa. Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: act: export tcf_hash_search() instead of tcf_hash_lookup()WANG Cong2014-01-212-9/+5
| | | | | | | | | | So that we will not expose struct tcf_common to modules. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>