summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* samples: bpf: force IPv4 in pingJakub Kicinski2019-03-015-5/+5
| | | | | | | | | | | | | ping localhost may default of IPv6 on modern systems, but samples are trying to only parse IPv4. Force IPv4. samples/bpf/tracex1_user.c doesn't interpret the packet so we don't care which IP version will be used there. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* selftests/bpf: use __bpf_constant_htons in test_prog.c for flow dissectorStanislav Fomichev2019-03-011-2/+2
| | | | | | | | | | | | | | | | Older GCC (<4.8) isn't smart enough to optimize !__builtin_constant_p() branch in bpf_htons. I recently fixed it for pkt_v4 and pkt_v6 in commit a0517a0f7ef23 ("selftests/bpf: use __bpf_constant_htons in test_prog.c"), but later added another bunch of bpf_htons in commit bf0f0fd939451 ("selftests/bpf: add simple BPF_PROG_TEST_RUN examples for flow dissector"). Fixes: bf0f0fd939451 ("selftests/bpf: add simple BPF_PROG_TEST_RUN examples for flow dissector") Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: add missing entries to bpf_helpers.hWillem de Bruijn2019-03-011-0/+30
| | | | | | | | | This header defines the BPF functions enumerated in uapi/linux.bpf.h in a callable format. Expand to include all registered functions. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: fix build without bpf_syscallAlexei Starovoitov2019-03-011-2/+5
| | | | | | | | | | wrap bpf_stats_enabled sysctl with #ifdef Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 492ecee892c2 ("bpf: enable program stats") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* Merge branch 'inner_map_spin_lock-fix'Alexei Starovoitov2019-02-272-0/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Yonghong Song says: ==================== The inner_map_meta->spin_lock_off is not set correctly during map creation for BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS. This may lead verifier error due to misinformation. This patch set fixed the issue with Patch #1 for the kernel change and Patch #2 for enhanced selftest test_maps. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * tools/bpf: selftests: add map lookup to test_map_in_map bpf progYonghong Song2019-02-271-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bpf_map_lookup_elem is added in the bpf program. Without previous patch, the test change will trigger the following error: $ ./test_maps ... ; value_p = bpf_map_lookup_elem(map, &key); 20: (bf) r1 = r7 21: (bf) r2 = r8 22: (85) call bpf_map_lookup_elem#1 ; if (!value_p || *value_p != 123) 23: (15) if r0 == 0x0 goto pc+16 R0=map_value(id=2,off=0,ks=4,vs=4,imm=0) R6=inv1 R7=map_ptr(id=0,off=0,ks=4,vs=4,imm=0) R8=fp-8,call_-1 R10=fp0,call_-1 fp-8=mmmmmmmm ; if (!value_p || *value_p != 123) 24: (61) r1 = *(u32 *)(r0 +0) R0=map_value(id=2,off=0,ks=4,vs=4,imm=0) R6=inv1 R7=map_ptr(id=0,off=0,ks=4,vs=4,imm=0) R8=fp-8,call_-1 R10=fp0,call_-1 fp-8=mmmmmmmm bpf_spin_lock cannot be accessed directly by load/store With the kernel fix in the previous commit, the error goes away. Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: set inner_map_meta->spin_lock_off correctlyYonghong Song2019-02-271-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit d83525ca62cf ("bpf: introduce bpf_spin_lock") introduced bpf_spin_lock and the field spin_lock_off in kernel internal structure bpf_map has the following meaning: >=0 valid offset, <0 error For every map created, the kernel will ensure spin_lock_off has correct value. Currently, bpf_map->spin_lock_off is not copied from the inner map to the map_in_map inner_map_meta during a map_in_map type map creation, so inner_map_meta->spin_lock_off = 0. This will give verifier wrong information that inner_map has bpf_spin_lock and the bpf_spin_lock is defined at offset 0. An access to offset 0 of a value pointer will trigger the following error: bpf_spin_lock cannot be accessed directly by load/store This patch fixed the issue by copy inner map's spin_lock_off value to inner_map_meta->spin_lock_off. Fixes: d83525ca62cf ("bpf: introduce bpf_spin_lock") Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* samples: bpf: fix: broken sample regarding removed functionDaniel T. Lee2019-02-273-3/+3
| | | | | | | | | | | | | | | | | Currently, running sample "task_fd_query" and "tracex3" occurs the following error. On kernel v5.0-rc* this sample will be unavailable due to the removal of function 'blk_start_request' at commit "a1ce35f". (function removed, as "Single Queue IO scheduler" no longer exists) $ sudo ./task_fd_query failed to create kprobe 'blk_start_request' error 'No such file or directory' This commit will change the function 'blk_start_request' to 'blk_mq_start_request' to fix the broken sample. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* Merge branch 'bpf-prog-stats'Daniel Borkmann2019-02-2710-7/+148
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexei Starovoitov says: ==================== Introduce per program stats to monitor the usage BPF. v2->v3: - rename to run_time_ns/run_cnt everywhere v1->v2: - fixed u64 stats on 32-bit archs. Thanks Eric - use more verbose run_time_ns in json output as suggested by Andrii - refactored prog_alloc and clarified behavior of stats in subprogs ==================== Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * tools/bpftool: recognize bpf_prog_info run_time_ns and run_cntAlexei Starovoitov2019-02-272-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | $ bpftool p s 1: kprobe tag a56587d488d216c9 gpl run_time_ns 79786 run_cnt 8 loaded_at 2019-02-22T12:22:51-0800 uid 0 xlated 352B not jited memlock 4096B $ bpftool --json --pretty p s [{ "id": 1, "type": "kprobe", "tag": "a56587d488d216c9", "gpl_compatible": true, "run_time_ns": 79786, "run_cnt": 8, "loaded_at": 1550866971, "uid": 0, "bytes_xlated": 352, "jited": false, "bytes_memlock": 4096 } ] Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * tools/bpf: sync bpf.h into toolsAlexei Starovoitov2019-02-271-0/+2
| | | | | | | | | | | | | | sync bpf.h into tools directory Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf: expose program stats via bpf_prog_infoAlexei Starovoitov2019-02-272-0/+7
| | | | | | | | | | | | | | | | Return bpf program run_time_ns and run_cnt via bpf_prog_info Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf: enable program statsAlexei Starovoitov2019-02-276-6/+129
|/ | | | | | | | | | | | | | | | | | | | | | | | JITed BPF programs are indistinguishable from kernel functions, but unlike kernel code BPF code can be changed often. Typical approach of "perf record" + "perf report" profiling and tuning of kernel code works just as well for BPF programs, but kernel code doesn't need to be monitored whereas BPF programs do. Users load and run large amount of BPF programs. These BPF stats allow tools monitor the usage of BPF on the server. The monitoring tools will turn sysctl kernel.bpf_stats_enabled on and off for few seconds to sample average cost of the programs. Aggregated data over hours and days will provide an insight into cost of BPF and alarms can trigger in case given program suddenly gets more expensive. The cost of two sched_clock() per program invocation adds ~20 nsec. Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down from ~10 nsec to ~30 nsec. static_key minimizes the cost of the stats collection. There is no measurable difference before/after this patch with kernel.bpf_stats_enabled=0 Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* Merge branch 'bpf-libbpf-af-xdp'Daniel Borkmann2019-02-2513-652/+1376
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Magnus Karlsson says: ==================== This patch proposes to add AF_XDP support to libbpf. The main reason for this is to facilitate writing applications that use AF_XDP by offering higher-level APIs that hide many of the details of the AF_XDP uapi. This is in the same vein as libbpf facilitates XDP adoption by offering easy-to-use higher level interfaces of XDP functionality. Hopefully this will facilitate adoption of AF_XDP, make applications using it simpler and smaller, and finally also make it possible for applications to benefit from optimizations in the AF_XDP user space access code. Previously, people just copied and pasted the code from the sample application into their application, which is not desirable. The proposed interface is composed of two parts: * Low-level access interface to the four rings and the packet * High-level control plane interface for creating and setting up umems and AF_XDP sockets. This interface also loads a simple XDP program that routes all traffic on a queue up to the AF_XDP socket. The sample program has been updated to use this new interface and in that process it lost roughly 300 lines of code. I cannot detect any performance degradations due to the use of this library instead of the previous functions that were inlined in the sample application. But I did measure this on a slower machine and not the Broadwell that we normally use. The rings are now called xsk_ring and when a producer operates on it. It is xsk_ring_prod and for a consumer it is xsk_ring_cons. This way we can get some compile time error checking that the rings are used correctly. Comments and contenplations: * The current behaviour is that the library loads an XDP program (if requested to do so) but the clean up of this program is left to the application. It would be possible to implement this cleanup in the library, but it would require state to be kept on netdev level, which there is none at the moment, and the synchronization of this between processes. All this adding complexity. But when we get an XDP program per queue id, then it becomes trivial to also remove the XDP program when the application exits. This proposal from Jesper, Björn and others will also improve the performance of libbpf, since most of the XDP program code can be removed when that feature is supported. * In a future release, I am planning on adding a higher level data plane interface too. This will be based around recvmsg and sendmsg with the use of struct iovec for batching, without the user having to know anything about the underlying four rings of an AF_XDP socket. There will be one semantic difference though from the standard recvmsg and that is that the kernel will fill in the iovecs instead of the application. But the rest should be the same as the libc versions so that application writers feel at home. Patch 1: adds AF_XDP support in libbpf Patch 2: updates the xdpsock sample application to use the libbpf functions Patch 3: Documentation update to help first time users Changes v5 to v6: * Fixed prog_fd bug found by Xiaolong Ye. Thanks! Changes v4 to v5: * Added a FAQ to the documentation * Removed xsk_umem__get_data and renamed xsk_umem__get_dat_raw to xsk_umem__get_data * Replaced the netlink code with bpf_get_link_xdp_id() * Dynamic allocation of the map sizes. They are now sized after the max number of queueus on the netdev in question. Changes v3 to v4: * Dropped the pr_*() patch in favor of Yonghong Song's patch set * Addressed the review comments of Daniel Borkmann, mainly leaking of file descriptors at clean up and making the data plane APIs all static inline (with the exception of xsk_umem__get_data that uses an internal structure I do not want to expose). * Fixed the netlink callback as suggested by Maciej Fijalkowski. * Removed an unecessary include in the sample program as spotted by Ilia Fillipov. Changes v2 to v3: * Added automatic loading of a simple XDP program that routes all traffic on a queue up to the AF_XDP socket. This program loading can be disabled. * Updated function names to be consistent with the libbpf naming convention * Moved all code to xsk.[ch] * Removed all the XDP program loading code from the sample since this is now done by libbpf * The initialization functions now return a handle as suggested by Alexei * const statements added in the API where applicable. Changes v1 to v2: * Fixed cleanup of library state on error. * Moved API to initial version * Prefixed all public functions by xsk__ instead of xsk_ * Added comment about changed default ring sizes, batch size and umem size in the sample application commit message * The library now only creates an Rx or Tx ring if the respective parameter is != NULL ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * xsk: add FAQ to facilitate for first time usersMagnus Karlsson2019-02-251-1/+35
| | | | | | | | | | | | | | | | | | Added an FAQ section in Documentation/networking/af_xdp.rst to help first time users with common problems. As problems are getting identified, entries will be added to the FAQ. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * samples/bpf: convert xdpsock to use libbpf for AF_XDP accessMagnus Karlsson2019-02-254-648/+261
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit converts the xdpsock sample application to use the AF_XDP functions present in libbpf. This cuts down the size of it by nearly 300 lines of code. The default ring sizes plus the batch size has been increased and the size of the umem area has decreased. This so that the sample application will provide higher throughput. Note also that the shared umem code has been removed from the sample as this is not supported by libbpf at this point in time. Tested-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * libbpf: add support for using AF_XDP socketsMagnus Karlsson2019-02-258-3/+1080
|/ | | | | | | | | | | | | | | | | | | | | | | | This commit adds AF_XDP support to libbpf. The main reason for this is to facilitate writing applications that use AF_XDP by offering higher-level APIs that hide many of the details of the AF_XDP uapi. This is in the same vein as libbpf facilitates XDP adoption by offering easy-to-use higher level interfaces of XDP functionality. Hopefully this will facilitate adoption of AF_XDP, make applications using it simpler and smaller, and finally also make it possible for applications to benefit from optimizations in the AF_XDP user space access code. Previously, people just copied and pasted the code from the sample application into their application, which is not desirable. The interface is composed of two parts: * Low-level access interface to the four rings and the packet * High-level control plane interface for creating and setting up umems and af_xdp sockets as well as a simple XDP program. Tested-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* selftests/bpf: make sure signal interrupts BPF_PROG_TEST_RUNStanislav Fomichev2019-02-251-0/+44
| | | | | | | | | | | | | | | Simple test that I used to reproduce the issue in the previous commit: Do BPF_PROG_TEST_RUN with max iterations, each program is 4096 simple move instructions. File alarm in 0.1 second and check that bpf_prog_test_run is interrupted (i.e. test doesn't hang). Note: reposting this for bpf-next to avoid linux-next conflict. In this version I test both BPF_PROG_TYPE_SOCKET_FILTER (which uses generic bpf_test_run implementation) and BPF_PROG_TYPE_FLOW_DISSECTOR (which has it own loop with preempt handling in bpf_prog_test_run_flow_dissector). Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf/test_run: fix unkillable BPF_PROG_TEST_RUN for flow dissectorStanislav Fomichev2019-02-251-6/+20
| | | | | | | | | | | | | | | | | | | | | Syzbot found out that running BPF_PROG_TEST_RUN with repeat=0xffffffff makes process unkillable. The problem is that when CONFIG_PREEMPT is enabled, we never see need_resched() return true. This is due to the fact that preempt_enable() (which we do in bpf_test_run_one on each iteration) now handles resched if it's needed. Let's disable preemption for the whole run, not per test. In this case we can properly see whether resched is needed. Let's also properly return -EINTR to the userspace in case of a signal interrupt. This is a follow up for a recently fixed issue in bpf_test_run, see commit df1a2cb7c74b ("bpf/test_run: fix unkillable BPF_PROG_TEST_RUN"). Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: test_bpf: turn off preemption in function __run_onceAnders Roxell2019-02-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running BPF test suite the following splat occurs: [ 415.930950] test_bpf: #0 TAX jited:0 [ 415.931067] BUG: assuming atomic context at lib/test_bpf.c:6674 [ 415.946169] in_atomic(): 0, irqs_disabled(): 0, pid: 11556, name: modprobe [ 415.953176] INFO: lockdep is turned off. [ 415.957207] CPU: 1 PID: 11556 Comm: modprobe Tainted: G W 5.0.0-rc7-next-20190220 #1 [ 415.966328] Hardware name: HiKey Development Board (DT) [ 415.971592] Call trace: [ 415.974069] dump_backtrace+0x0/0x160 [ 415.977761] show_stack+0x24/0x30 [ 415.981104] dump_stack+0xc8/0x114 [ 415.984534] __cant_sleep+0xf0/0x108 [ 415.988145] test_bpf_init+0x5e0/0x1000 [test_bpf] [ 415.992971] do_one_initcall+0x90/0x428 [ 415.996837] do_init_module+0x60/0x1e4 [ 416.000614] load_module+0x1de0/0x1f50 [ 416.004391] __se_sys_finit_module+0xc8/0xe0 [ 416.008691] __arm64_sys_finit_module+0x24/0x30 [ 416.013255] el0_svc_common+0x78/0x130 [ 416.017031] el0_svc_handler+0x38/0x78 [ 416.020806] el0_svc+0x8/0xc Rework so that preemption is disabled when we loop over function 'BPF_PROG_RUN(...)'. Fixes: 568f196756ad ("bpf: check that BPF programs run with preemption disabled") Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* samples/bpf: Fix dummy program unloading for xdp_redirect samplesToke Høiland-Jørgensen2019-02-222-2/+2
| | | | | | | | | | | | | The xdp_redirect and xdp_redirect_map sample programs both load a dummy program onto the egress interfaces. However, the unload code checks these programs against the wrong id number, and thus refuses to unload them. Fix the comparison to avoid this. Fixes: 3b7a8ec2dec3 ("samples/bpf: Check the prog id before exiting") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* seccomp, bpf: disable preemption before calling into bpf progAlexei Starovoitov2019-02-221-0/+2
| | | | | | | | | All BPF programs must be called with preemption disabled. Fixes: 568f196756ad ("bpf: check that BPF programs run with preemption disabled") Reported-by: syzbot+8bf19ee2aa580de7a2a7@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: add skb->queue_mapping write access from tc clsactJesper Dangaard Brouer2019-02-191-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The skb->queue_mapping already have read access, via __sk_buff->queue_mapping. This patch allow BPF tc qdisc clsact write access to the queue_mapping via tc_cls_act_is_valid_access. Also handle that the value NO_QUEUE_MAPPING is not allowed. It is already possible to change this via TC filter action skbedit tc-skbedit(8). Due to the lack of TC examples, lets show one: # tc qdisc add dev ixgbe1 clsact # tc filter add dev ixgbe1 ingress matchall action skbedit queue_mapping 5 # tc filter list dev ixgbe1 ingress The most common mistake is that XPS (Transmit Packet Steering) takes precedence over setting skb->queue_mapping. XPS is configured per DEVICE via /sys/class/net/DEVICE/queues/tx-*/xps_cpus via a CPU hex mask. To disable set mask=00. The purpose of changing skb->queue_mapping is to influence the selection of the net_device "txq" (struct netdev_queue), which influence selection of the qdisc "root_lock" (via txq->qdisc->q.lock) and txq->_xmit_lock. When using the MQ qdisc the txq->qdisc points to different qdiscs and associated locks, and HARD_TX_LOCK (txq->_xmit_lock), allowing for CPU scalability. Due to lack of TC examples, lets show howto attach clsact BPF programs: # tc qdisc add dev ixgbe2 clsact # tc filter add dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu # tc filter list dev ixgbe2 egress Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: check that BPF programs run with preemption disabledPeter Zijlstra2019-02-193-3/+41
| | | | | | | | | | | | | | Introduce cant_sleep() macro for annotation of functions that cannot sleep. Use it in BPF_PROG_RUN to catch execution of BPF programs in preemptable context. Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: bpftool, fix documentation for attach typesAlban Crequy2019-02-193-4/+4
| | | | | | | | | | | | | | | | | bpftool has support for attach types "stream_verdict" and "stream_parser" but the documentation was referring to them as "skb_verdict" and "skb_parse". The inconsistency comes from commit b7d3826c2ed6 ("bpf: bpftool, add support for attaching programs to maps"). This patch changes the documentation to match the implementation: - "bpftool prog help" - man pages - bash completion Signed-off-by: Alban Crequy <alban@kinvolk.io> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bnx2x: Remove set but not used variable 'mfw_vn'YueHaibing2019-02-181-3/+1
| | | | | | | | | | | | | | Fixes gcc '-Wunused-but-set-variable' warning: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c: In function 'bnx2x_get_hwinfo': drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c:11940:10: warning: variable 'mfw_vn' set but not used [-Wunused-but-set-variable] It's never used since introduction. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Sudarsana Reddy Kalluru <skalluru@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'net-phy-add-helpers-for-handling-C45-10GBT-AN-register-values'David S. Miller2019-02-182-9/+20
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Heiner Kallweit says: ==================== net: phy: add helpers for handling C45 10GBT AN register values Similar to the existing helpers for the Clause 22 registers add helpers to deal with converting Clause 45 advertisement registers to / from link mode bitmaps. Note that these helpers are defined in linux/mdio.h, not like the Clause 22 helpers in linux/mii.h. Reason is that the Clause 45 register constants are defined in uapi/linux/mdio.h. And uapi/linux/mdio.h includes linux/mii.h before defining the C45 register constants. v2: - Remove few helpers which aren't used by this series. They will follow together with the users. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: phy: use mii_10gbt_stat_mod_linkmode_lpa_t in genphy_c45_read_lpaHeiner Kallweit2019-02-181-9/+1
| | | | | | | | | | | | | | | | | | Use mii_10gbt_stat_mod_linkmode_lpa_t() in genphy_c45_read_lpa() to simplify the code. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: phy: add helper mii_10gbt_stat_mod_linkmode_lpa_tHeiner Kallweit2019-02-181-0/+19
|/ | | | | | | | | | | | | | | | | Similar to the existing helpers for the Clause 22 registers add helper mii_10gbt_stat_mod_linkmode_lpa_t. Note that this helper is defined in linux/mdio.h, not like the Clause 22 helpers in linux/mii.h. Reason is that the Clause 45 register constants are defined in uapi/linux/mdio.h. And uapi/linux/mdio.h includes linux/mii.h before defining the C45 register constants. v2: - remove helpers that don't have users in this series Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* liquidio: using NULL instead of plain integerYueHaibing2019-02-182-2/+2
| | | | | | | | | | Fix following warning: drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c:1453:35: warning: Using plain integer as NULL pointer drivers/net/ethernet/cavium/liquidio/lio_main.c:2910:23: warning: Using plain integer as NULL pointer Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* r8169: remove unneeded mmiowb barriersHeiner Kallweit2019-02-181-6/+0
| | | | | | | | | | | | writex() has implicit barriers, that's what makes it different from writex_relaxed(). Therefore these calls to mmiowb() can be removed. This patch was recently reverted due to a dependency with another problematic patch. But because it didn't contribute to the problem it was rebased and can be resubmitted. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dsa: Implement flow_dissect callback for tag_dsa.Rundong Ge2019-02-182-0/+18
| | | | | | | | | | | | | | RPS not work for DSA devices since the 'skb_get_hash' will always get the invalid hash for dsa tagged packets. "[PATCH] tag_mtk: add flow_dissect callback to the ops struct" introduced the flow_dissect callback to get the right hash for MTK tagged packet. Tag_dsa and tag_edsa also need to implement the callback. Signed-off-by: Rundong Ge <rdong.ge@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: using kfree_rcu() to simplify the codeWei Yongjun2019-02-181-6/+1
| | | | | | | | The callback function of call_rcu() just calls a kfree(), so we can use kfree_rcu() instead of call_rcu() + callback function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mdio_bus: Fix PTR_ERR() usage after initialization to constantYueHaibing2019-02-181-5/+6
| | | | | | | | | | | | | Fix coccinelle warning: ./drivers/net/phy/mdio_bus.c:51:5-12: ERROR: PTR_ERR applied after initialization to constant on line 44 ./drivers/net/phy/mdio_bus.c:52:5-12: ERROR: PTR_ERR applied after initialization to constant on line 44 fix this by using IS_ERR before PTR_ERR Fixes: bafbdd527d56 ("phylib: Add device reset GPIO support") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mlxsw: spectrum: Change IP2ME CPU policer rate and burst size valuesShalom Toledo2019-02-181-2/+2
| | | | | | | | | | | The IP2ME packet trap is triggered by packets hitting local routes. After evaluating current defaults used by the driver it was decided to reduce the amount of traffic generated by this trap to 1Kpps and increase the burst size. This is inline with similarly deployed systems. Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: hamradio: remove unused hweight*() definesMasahiro Yamada2019-02-181-26/+0
| | | | | | | | This file does not use hweight*() at all, and the definition is surrounded by #if 0 ... #endif. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2019-02-1822-63/+244
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for you net-next tree: 1) Missing NFTA_RULE_POSITION_ID netlink attribute validation, from Phil Sutter. 2) Restrict matching on tunnel metadata to rx/tx path, from wenxu. 3) Avoid indirect calls for IPV6=y, from Florian Westphal. 4) Add two indirections to prepare merger of IPV4 and IPV6 nat modules, from Florian Westphal. 5) Broken indentation in ctnetlink, from Colin Ian King. 6) Patches to use struct_size() from netfilter and IPVS, from Gustavo A. R. Silva. 7) Display kernel splat only once in case of racing to confirm conntrack from bridge plus nfqueue setups, from Chieh-Min Wang. 8) Skip checksum validation for layer 4 protocols that don't need it, patch from Alin Nastac. 9) Sparse warning due to symbol that should be static in CLUSTERIP, from Wei Yongjun. 10) Add new toggle to disable SDP payload translation when media endpoint is reachable though the same interface as the signalling peer, from Alin Nastac. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * netfilter: nf_conntrack_sip: add sip_external_media logicAlin Nastac2019-02-161-0/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When enabled, the sip_external_media logic will leave SDP payload untouched when it detects that interface towards INVITEd party is the same with the one towards media endpoint. The typical scenario for this logic is when a LAN SIP agent has more than one IP address (uses a different address for media streams than the one used on signalling stream) and it also forwards calls to a voice mailbox located on the WAN side. In such case sip_direct_media must be disabled (so normal calls could be handled by the SIP helper), but media streams that are not traversing this router must also be excluded from address translation (e.g. call forwards). Signed-off-by: Alin Nastac <alin.nastac@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: ipt_CLUSTERIP: make symbol 'cip_netdev_notifier' staticWei Yongjun2019-02-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | Fixes the following sparse warnings: net/ipv4/netfilter/ipt_CLUSTERIP.c:867:23: warning: symbol 'cip_netdev_notifier' was not declared. Should it be static? Fixes: 5a86d68bcf02 ("netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: reject: skip csum verification for protocols that don't support itAlin Nastac2019-02-136-12/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some protocols have other means to verify the payload integrity (AH, ESP, SCTP) while others are incompatible with nf_ip(6)_checksum implementation because checksum is either optional or might be partial (UDPLITE, DCCP, GRE). Because nf_ip(6)_checksum was used to validate the packets, ip(6)tables REJECT rules were not capable to generate ICMP(v6) errors for the protocols mentioned above. This commit also fixes the incorrect pseudo-header protocol used for IPv4 packets that carry other transport protocols than TCP or UDP (pseudo-header used protocol 0 iso the proper value). Signed-off-by: Alin Nastac <alin.nastac@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in ↵Chieh-Min Wang2019-02-121-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __nf_conntrack_confirm For bridge(br_flood) or broadcast/multicast packets, they could clone skb with unconfirmed conntrack which break the rule that unconfirmed skb->_nfct is never shared. With nfqueue running on my system, the race can be easily reproduced with following warning calltrace: [13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P W 4.4.60 #7744 [13257.707568] Hardware name: Qualcomm (Flattened Device Tree) [13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14) [13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8) [13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0) [13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24) [13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618) [13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc) [13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8) [13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0) [13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c) [13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8) [13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0) [13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4) [13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac) [13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170) [13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420) [13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248) [13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0) [13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c) [13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368) [13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44) [13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200) [13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64) [13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34) The original code just triggered the warning but do nothing. It will caused the shared conntrack moves to the dying list and the packet be droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack). - Reproduce steps: +----------------------------+ | br0(bridge) | | | +-+---------+---------+------+ | eth0| | eth1| | eth2| | | | | | | +--+--+ +--+--+ +---+-+ | | | | | | +--+-+ +-+--+ +--+-+ | PC1| | PC2| | PC3| +----+ +----+ +----+ iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass ps: Our nfq userspace program will set mark on packets whose connection has already been processed. PC1 sends broadcast packets simulated by hping3: hping3 --rand-source --udp 192.168.1.255 -i u100 - Broadcast racing flow chart is as follow: br_handle_frame BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish) // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage br_handle_frame_finish // check if this packet is broadcast br_flood_forward br_flood list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port maybe_deliver deliver_clone skb = skb_clone(skb) __br_forward BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...) // queue in our nfq and received by our userspace program // goto __nf_conntrack_confirm with process context on CPU 1 br_pass_frame_up BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...) // goto __nf_conntrack_confirm with softirq context on CPU 0 Because conntrack confirm can happen at both INPUT and POSTROUTING stage. So with NFQUEUE running, skb->_nfct with the same unconfirmed conntrack could race on different core. This patch fixes a repeating kernel splat, now it is only displayed once. Signed-off-by: Chieh-Min Wang <chiehminw@synology.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: xt_recent: Use struct_size() in kvzalloc()Gustavo A. R. Silva2019-02-121-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; size = sizeof(struct foo) + count * sizeof(void *); instance = alloc(size, GFP_KERNEL) Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: size = struct_size(instance, entry, count); instance = alloc(size, GFP_KERNEL) Notice that, in this case, variable sz is not necessary, hence it is removed. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * ipvs: Use struct_size() helperGustavo A. R. Silva2019-02-121-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; size = sizeof(struct foo) + count * sizeof(struct boo); instance = alloc(size, GFP_KERNEL) Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: size = struct_size(instance, entry, count); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: conntrack: fix indentation issueColin Ian King2019-02-121-1/+1
| | | | | | | | | | | | | | A statement in an if block is not indented correctly. Fix this. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: ipv6: avoid indirect calls for IPV6=y caseFlorian Westphal2019-02-045-42/+68
| | | | | | | | | | | | | | | | | | | | | | | | indirect calls are only needed if ipv6 is a module. Add helpers to abstract the v6ops indirections and use them instead. fragment, reroute and route_input are kept as indirect calls. The first two are not not used in hot path and route_input is only used by bridge netfilter. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nat: remove module dependency on ipv6 coreFlorian Westphal2019-02-044-3/+45
| | | | | | | | | | | | | | | | | | | | | | | | nf_nat_ipv6 calls two ipv6 core functions, so add those to v6ops to avoid the module dependency. This is a prerequisite for merging ipv4 and ipv6 nat implementations. Add wrappers to avoid the indirection if ipv6 is builtin. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_tunnel: Add NFTA_TUNNEL_MODE optionswenxu2019-02-042-2/+41
| | | | | | | | | | | | | | | | | | nft "tunnel" expr match both the tun_info of RX and TX. This patch provide the NFTA_TUNNEL_MODE to individually match the tun_info of RX or TX. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nf_tables: add NFTA_RULE_POSITION_ID to nla_policyFlorian Westphal2019-01-291-0/+1
| | | | | | | | | | | | | | | | Fixes: 75dd48e2e420a ("netfilter: nf_tables: Support RULE_ID reference in new rule") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | net: hns3: make function hclge_set_all_vf_rst() staticWei Yongjun2019-02-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | Fixes the following sparse warning: drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:2431:5: warning: symbol 'hclge_set_all_vf_rst' was not declared. Should it be static? Fixes: aa5c4f175be6 ("net: hns3: add reset handling for VF when doing PF reset") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ptr_ring: remove duplicated include from ptr_ring.hYueHaibing2019-02-171-1/+0
| | | | | | | | | | | | | | Remove duplicated include. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>