path: root/include
Commit message  (Author, Date, Files, Lines)
* bpf: Add a bpf_snprintf helper  (Florent Revest, 2021-04-19, 2 files, -0/+29)
The implementation takes inspiration from the existing bpf_trace_printk helper, but there are a few differences: to allow for a large number of format specifiers, parameters are provided in an array, as in bpf_seq_printf. Because the output string takes two arguments and the array of parameters also takes two, the format string needs to fit in one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to a zero-terminated read-only map, so we don't need a format string length arg. Because the format string is known at verification time, we also do a first pass of format string validation in the verifier logic. This makes debugging easier.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-4-revest@chromium.org
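A minimal sketch of calling the new helper from a BPF program (hedged: assumes libbpf's bpf_helpers.h is included; the variables are illustrative, only the helper itself comes from this patch):

    static const char fmt[] = "pid=%d comm=%s";  /* must live in read-only memory */
    __u64 data[2];
    char out[64], comm[16];

    bpf_get_current_comm(comm, sizeof(comm));
    data[0] = bpf_get_current_pid_tgid() >> 32;
    data[1] = (__u64)(long)comm;                 /* %s consumes a pointer from the array */
    bpf_snprintf(out, sizeof(out), fmt, data, sizeof(data));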
* bpf: Add an ARG_PTR_TO_CONST_STR argument type  (Florent Revest, 2021-04-19, 1 file, -0/+1)
This type provides the guarantee that an argument is going to be a const pointer to somewhere in a read-only map value. It also checks that this pointer is followed by a zero character before the end of the map value.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-3-revest@chromium.org
* bpf: Factorize bpf_trace_printk and bpf_seq_printf  (Florent Revest, 2021-04-19, 1 file, -0/+20)
Two helpers (trace_printk and seq_printf) have very similar implementations of format string parsing, and a third one is coming (snprintf). To avoid code duplication and make the code easier to maintain, this moves the operations associated with format string parsing (validation and argument sanitization) into one generic function. The implementations of the two existing helpers had already drifted quite a bit, so unifying them entailed a lot of changes:
- bpf_trace_printk always expected fmt[fmt_size] to be the terminating NULL character; this is no longer true, the first 0 is terminating.
- bpf_trace_printk now supports %% (which produces the percentage char).
- bpf_trace_printk now skips width formatting fields.
- bpf_trace_printk now supports the X modifier (capital hexadecimal).
- bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6.
- argument casting on 32 bit has been simplified into one macro, using an enum instead of obscure int increments.
- bpf_seq_printf now uses bpf_trace_copy_string instead of strncpy_from_kernel_nofault and handles the %pks %pus specifiers.
- bpf_seq_printf now prints longs correctly on 32 bit architectures.
- both were changed to use a global per-cpu tmp buffer instead of one stack buffer for trace_printk and 6 small buffers for seq_printf.
- to avoid per-cpu buffer usage conflicts, these helpers disable preemption while the per-cpu buffer is in use.
- both helpers now support the %ps and %pS specifiers to print symbols.
The implementation is also moved from bpf_trace.c to helpers.c because the upcoming bpf_snprintf helper will be made available to all BPF programs and will need it.
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-2-revest@chromium.org
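For example, after this change a specifier mix that previously only worked in bpf_seq_printf should also work in bpf_trace_printk (illustrative sketch; daddr and val are placeholder locals on the program's stack):

    static const char fmt[] = "dst=%pI4 flags=%X done=100%%\n";
    __be32 daddr = 0;  /* placeholder; %pI4 dereferences the pointer it is given */
    __u32 val = 0;     /* placeholder */
    bpf_trace_printk(fmt, sizeof(fmt), &daddr, val);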
* bpf: Return target info when a tracing bpf_link is queried  (Toke Høiland-Jørgensen, 2021-04-13, 2 files, -0/+11)
There is currently no way to discover the target of a tracing program attachment after the fact. Add this information to bpf_link_info and return it when querying the bpf_link fd.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210413091607.58945-1-toke@redhat.com
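From user space, the new fields can be read back with a plain info query (hedged sketch; bpf_obj_get_info_by_fd() is existing libbpf API, while the two target_* field names follow this patch's bpf_link_info extension):

    struct bpf_link_info info = {};
    __u32 len = sizeof(info);

    if (!bpf_obj_get_info_by_fd(link_fd, &info, &len) &&
        info.type == BPF_LINK_TYPE_TRACING)
            printf("target_obj_id=%u target_btf_id=%u\n",
                   info.tracing.target_obj_id, info.tracing.target_btf_id);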
* libbpf: Clarify flags in ringbuf helpers  (Pedro Tammela, 2021-04-12, 1 file, -0/+16)
In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment. For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a notification to the process if needed.
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210412192434.944343-1-pctammela@mojatatu.com
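A sketch of the documented semantics (BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP are the existing UAPI flags; the map and event layout are illustrative):

    struct event { __u32 pid; };

    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 4096);  /* must be a multiple of page size */
    } rb SEC(".maps");

    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);  /* flags must be 0 */
    if (!e)
            return 0;
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_ringbuf_submit(e, 0);  /* 0: notify the consumer if needed;
                                * BPF_RB_NO_WAKEUP / BPF_RB_FORCE_WAKEUP override that */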
* skmsg: Pass psock pointer to ->psock_update_sk_prot()  (Cong Wang, 2021-04-12, 4 files, -5/+9)
Using sk_psock() to retrieve the psock pointer from the sock requires the RCU read lock, but we already get the psock pointer before calling ->psock_update_sk_prot() in both cases, so we can just pass it without bothering with sk_psock().
Fixes: 8a59f9d1e3d4 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
Reported-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210407032111.33398-1-xiyou.wangcong@gmail.com
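After this fix, the hook's shape in struct proto is (hedged sketch, as this series leaves it):

    int (*psock_update_sk_prot)(struct sock *sk, struct sk_psock *psock,
                                bool restore);

so implementations no longer need an RCU-protected sk_psock() lookup of their own.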
* bpf: Document PROG_TEST_RUN limitations  (Joe Stringer, 2021-04-12, 1 file, -0/+21)
Per net/bpf/test_run.c, particular prog types have additional restrictions around the parameters that can be provided, so document these in the header. I didn't bother documenting the limitation on duration for raw tracepoints since that's an output parameter anyway. Tested with ./tools/testing/selftests/bpf/test_doc_build.sh.
Suggested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210410174549.816482-1-joe@cilium.io
* bpf: Remove repeated struct btf_type declaration  (Wan Jiabing, 2021-04-03, 1 file, -1/+0)
struct btf_type is declared twice. One declaration is at line 35; the one below is not needed, hence remove the duplicate.
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210401072037.995849-1-wanjiabing@vivo.com
* bpf, cgroup: Delete repeated struct bpf_prog declaration  (Wan Jiabing, 2021-04-03, 1 file, -1/+0)
struct bpf_prog is declared twice. There is one declaration at line 18 that is independent of the macro, so the one below is not needed. Remove the duplicate.
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210401064637.993327-1-wanjiabing@vivo.com
* tcp: reorder tcp_congestion_ops for better cache locality  (Eric Dumazet, 2021-04-02, 1 file, -15/+27)
Group all the often used fields in the first cache line, to reduce cache line misses.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net: reorganize fields in netns_mib  (Eric Dumazet, 2021-04-02, 1 file, -10/+20)
Order fields to increase locality for the most used protocols. udplite and icmp are moved to the end, as is proc_net_devsnmp6, which is not used in the fast path. This potentially saves one cache line miss for typical TCP/UDP over IPv4/IPv6.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: add mptcp reset option support  (Florian Westphal, 2021-04-02, 2 files, -2/+27)
The MPTCP reset option allows carrying an MPTCP-specific error code that provides more information on the nature of a connection reset. Received reset option data gets stored in the subflow context so it can be sent to userspace via the 'subflow closed' netlink event. When a subflow is closed, the desired error code that should be sent to the peer is also placed in the subflow context structure. If a reset is sent before subflow establishment could complete, e.g. on HMAC failure during an MP_JOIN operation, the MPTCP skb extension is used to store the reset information.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller, 2021-04-02, 11 files, -48/+179)
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-04-01
The following pull-request contains BPF updates for your *net-next* tree. We've added 68 non-merge commits during the last 7 day(s) which contain a total of 70 files changed, 2944 insertions(+), 1139 deletions(-).
The main changes are:
1) UDP support for sockmap, from Cong.
2) Verifier merge conflict resolution fix, from Daniel.
3) xsk selftests enhancements, from Maciej.
4) Unstable helpers aka kernel func calling, from Martin.
5) Batched ops for LPM map, from Pedro.
6) Fix race in bpf_get_local_storage, from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
| * skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()  (Cong Wang, 2021-04-01, 2 files, -2/+4)
Although these two functions are only used by TCP, they are not specific to TCP at all: both operate on skmsg and ingress_msg, so they fit in net/core/skmsg.c very well. And we will need them for non-TCP, so rename and move them to skmsg.c and export them to modules.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-13-xiyou.wangcong@gmail.com
| * udp: Implement ->read_sock() for sockmap  (Cong Wang, 2021-04-01, 1 file, -0/+2)
This is similar to tcp_read_sock(), except we do not need to worry about connections; we just need to retrieve skbs from the UDP receive queue. Note, the return value of ->read_sock() is unused in sk_psock_verdict_data_ready(), and UDP still does not support splice() due to lack of ->splice_read(), so users cannot reach udp_read_sock() directly.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-12-xiyou.wangcong@gmail.com
| * sock: Introduce sk->sk_prot->psock_update_sk_prot()  (Cong Wang, 2021-04-01, 4 files, -15/+8)
Currently sockmap calls into each protocol to update the struct proto and replace it. This certainly won't work when the protocol is implemented as a module, for example, AF_UNIX. Introduce a new op, sk->sk_prot->psock_update_sk_prot(), so each protocol can implement its own way to replace the struct proto. This also helps get rid of symbol dependencies on CONFIG_INET.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
| * sock_map: Introduce BPF_SK_SKB_VERDICT  (Cong Wang, 2021-04-01, 2 files, -0/+3)
Reusing BPF_SK_SKB_STREAM_VERDICT is possible, but its name is confusing and, more importantly, we still want to distinguish them from user space. So we can just reuse the stream verdict code but introduce a new type of eBPF program, skb_verdict. Users are not allowed to attach stream_verdict and skb_verdict programs to the same map.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-10-xiyou.wangcong@gmail.com
| * skmsg: Use rcu work for destroying psock  (Cong Wang, 2021-04-01, 1 file, -4/+1)
The RCU callback sk_psock_destroy() only queues the work psock->gc, so we can just switch to rcu work to simplify the code.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-6-xiyou.wangcong@gmail.com
| * skmsg: Avoid lock_sock() in sk_psock_backlog()  (Cong Wang, 2021-04-01, 1 file, -0/+2)
We do not have to lock the sock to avoid losing sk_socket; instead we can purge all the ingress queues when we close the socket. Sending or receiving packets after orphaning the socket makes no sense. We do purge these queues when psock refcnt reaches zero, but here we want to purge them explicitly in sock_map_close(). There are also some nasty race conditions on testing the bit SK_PSOCK_TX_ENABLED and queuing/canceling the psock work; we can expand psock->ingress_lock a bit to protect them too. As noticed by John, we still have to lock the psock->work, because the same work item could be running concurrently on different CPUs.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-5-xiyou.wangcong@gmail.com
| * net: Introduce skb_send_sock() for sock_map  (Cong Wang, 2021-04-01, 1 file, -0/+1)
We only have skb_send_sock_locked(), which requires callers to use lock_sock(). Introduce a variant, skb_send_sock(), which takes the lock on its own, so callers no longer need to lock it. This will save us from adding a ->sendmsg_locked for each protocol. To reuse the code, pass function pointers to __skb_send_sock() and build skb_send_sock() and skb_send_sock_locked() on top.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-4-xiyou.wangcong@gmail.com
| * skmsg: Introduce a spinlock to protect ingress_msg  (Cong Wang, 2021-04-01, 1 file, -0/+46)
Currently we rely on lock_sock to protect ingress_msg; it is too big a hammer for this, and we can actually just use a spinlock to protect this list, like other skb queues are protected. __tcp_bpf_recvmsg() is still special because of peeking; it still has to use lock_sock.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210331023237.41094-3-xiyou.wangcong@gmail.com
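A minimal sketch of the resulting locking pattern (helper names mirror the patch description; treat the exact signatures as assumptions):

    static inline void sk_psock_queue_msg(struct sk_psock *psock,
                                          struct sk_msg *msg)
    {
            spin_lock_bh(&psock->ingress_lock);
            list_add_tail(&msg->list, &psock->ingress_msg);
            spin_unlock_bh(&psock->ingress_lock);
    }

    static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
    {
            struct sk_msg *msg;

            spin_lock_bh(&psock->ingress_lock);
            msg = list_first_entry_or_null(&psock->ingress_msg,
                                           struct sk_msg, list);
            if (msg)
                    list_del(&msg->list);
            spin_unlock_bh(&psock->ingress_lock);
            return msg;
    }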
| * bpf: Remove unused bpf_load_pointer  (He Fengqing, 2021-03-30, 1 file, -9/+0)
Remove the unused bpf_load_pointer function in filter.h. Its last user was removed with 24dea04767e6 ("bpf, x32: remove ld_abs/ld_ind").
Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210330024843.3479844-1-hefengqing@huawei.com
| * bpf: selftests: Add kfunc_call test  (Martin KaFai Lau, 2021-03-26, 1 file, -0/+6)
This patch adds a few kernel functions bpf_kfunc_call_test*() for the selftest's test_run purpose. They will be allowed for the tc_cls prog. The selftest calling the kernel functions bpf_kfunc_call_test*() is also added in this patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015252.1551395-1-kafai@fb.com
| * bpf: Support bpf program calling kernel function  (Martin KaFai Lau, 2021-03-26, 4 files, -0/+30)
This patch adds support to the BPF verifier to allow bpf programs to call kernel functions directly. The use case included in this set is to allow bpf-tcp-cc to directly call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()"). Those functions have already been used by some kernel tcp-cc implementations.
This set will also allow the bpf-tcp-cc program to directly call the kernel tcp-cc implementation. For example, a bpf_dctcp may only want to implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly from the kernel tcp_dctcp.c instead of reimplementing (or copy-and-pasting) them.
The tcp-cc kernel functions mentioned above will be whitelisted for the struct_ops bpf-tcp-cc programs to use in a later patch. The whitelisted functions are not bound to a fixed ABI contract. Those functions have already been used by the existing kernel tcp-cc. If any of them changes, both in-tree and out-of-tree kernel tcp-cc implementations have to be changed. The same goes for the struct_ops bpf-tcp-cc programs, which have to be adjusted accordingly.
This patch makes the required changes in the BPF verifier. The first change is in btf.c: it adds a case in "btf_check_func_arg_match()". When the passed-in "btf->kernel_btf == true", it means matching the verifier regs' states with a kernel function. This will handle the PTR_TO_BTF_ID reg. It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET, and PTR_TO_TCP_SOCK to their kernel btf_id.
In the later libbpf patch, the insn calling a kernel function will look like:

    insn->code == (BPF_JMP | BPF_CALL)
    insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
    insn->imm == func_btf_id /* btf_id of the running kernel */

[ For future calling-function-in-kernel-module support, an array of module btf_fds can be passed at load time and insn->off can be used to index into this array. ]
At the early stage of the verifier, it will collect all kernel function calls into "struct bpf_kfunc_desc". Those descriptors are stored in "prog->aux->kfunc_tab" and will be available to the JIT. Since this "add" operation is similar to the current "add_subprog()" and looks for the same insn->code, they are done together in the new "add_subprog_and_kfunc()".
In the "do_check()" stage, the new "check_kfunc_call()" is added to verify the kernel function call instruction:
1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE. A new bpf_verifier_ops "check_kfunc_call" is added to do that. The bpf-tcp-cc struct_ops program will implement this function in a later patch.
2. Call "btf_check_kfunc_args_match()" to ensure the regs can be used as the args of a kernel function.
3. Mark the regs' type, subreg_def, and zext_dst.
At the later do_misc_fixups() stage, the new fixup_kfunc_call() will replace the insn->imm with the function address (relative to __bpf_call_base). If needed, the JIT can find the btf_func_model by calling the new bpf_jit_find_kfunc_model(prog, insn). With the imm set to the function address, "bpftool prog dump xlated" will be able to display the kernel function calls the same way as it displays other bpf helper calls.
A gpl_compatible program is required to call kernel functions. This feature currently requires JIT.
The verifier selftests are adjusted because of the changes in the verbose log in add_subprog_and_kfunc().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015142.1544736-1-kafai@fb.com
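From the program side, a call ends up looking like the selftest pattern (hedged sketch; bpf_kfunc_call_test2() is one of the test kfuncs added by the companion selftest patch above):

    extern int bpf_kfunc_call_test2(struct sock *sk, __u32 a, __u32 b) __ksym;

    SEC("classifier")
    int kfunc_call_test(struct __sk_buff *skb)
    {
            struct bpf_sock *sk = skb->sk;

            if (!sk)
                    return -1;
            return bpf_kfunc_call_test2((struct sock *)sk, 1, 2);
    }

    char _license[] SEC("license") = "GPL";  /* kfunc calls require a GPL-compatible prog */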
| * bpf: Refactor btf_check_func_arg_match  (Martin KaFai Lau, 2021-03-26, 2 files, -2/+7)
This patch moves the subprog-specific logic from btf_check_func_arg_match() to the new btf_check_subprog_arg_match(). The core logic is left in btf_check_func_arg_match(), which will be reused later to check the kernel function call. The "if (!btf_type_is_ptr(t))" is checked first to improve the indentation, which will be useful for a later patch. Some of the "btf_kind_str[]" usages are replaced with the shortcut "btf_type_str(t)".
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015136.1544504-1-kafai@fb.com
| * bpf: Simplify freeing logic in linfo and jited_linfo  (Martin KaFai Lau, 2021-03-26, 1 file, -2/+1)
This patch simplifies the linfo freeing logic by combining "bpf_prog_free_jited_linfo()" and "bpf_prog_free_unused_jited_linfo()" into the new "bpf_prog_jit_attempt_done()". It is prep work for the kernel function call support; in a later patch, freeing the kernel function call descriptors will also be done in "bpf_prog_jit_attempt_done()". "bpf_prog_free_linfo()" is removed since it is only called by "__bpf_prog_put_noref()"; kvfree() is called directly instead. It also takes this chance to s/kcalloc/kvcalloc/ for the jited_linfo allocation.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015130.1544323-1-kafai@fb.com
| * bpf: struct sock is declared twice in bpf_sk_storage header  (Wan Jiabing, 2021-03-26, 1 file, -1/+0)
struct sock has been declared twice; remove the duplicate.
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210325070602.858024-1-wanjiabing@vivo.com
| * bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper  (Yonghong Song, 2021-03-25, 2 files, -12/+67)
Jiri Olsa reported a bug ([1]) in the kernel where the cgroup local storage pointer may be NULL in the bpf_get_local_storage() helper. There are two issues uncovered by this bug: (1) kprobe or tracepoint progs incorrectly set cgroup local storage before the prog run, and (2) due to the change from preempt_disable to migrate_disable, preemption is possible and the percpu storage might be overwritten by other tasks. Issue (1) is fixed in [2]. This patch addresses issue (2). The following shows how things can go wrong:

    task 1: bpf_cgroup_storage_set() for percpu local storage
            preemption happens
    task 2: bpf_cgroup_storage_set() for percpu local storage
            preemption happens
    task 1: run bpf program

Task 1 will effectively use the percpu local storage set by task 2, which will be either NULL or incorrect. Instead of just one common local storage per cpu, this patch fixes the issue by permitting 8 local storages per cpu, each identified by a task_struct pointer. This way, we allow at most 8 nested preemptions between bpf_cgroup_storage_set() and bpf_cgroup_storage_unset(). The percpu local storage slot is released (calling bpf_cgroup_storage_unset()) by the same task after the bpf program finishes running. bpf_test_run() is also fixed to use the new bpf_cgroup_storage_set() interface. The patch is tested on top of [2] with the reproducer in [1]. Without this patch, the kernel emits an error within 2-3 minutes; with this patch, after one hour, still no error.
[1] https://lore.kernel.org/bpf/CAKH8qBuXCfUz=w8L+Fj74OaUpbosO29niYwTki7e3Ag044_aww@mail.gmail.com/T
[2] https://lore.kernel.org/bpf/20210309185028.3763817-1-yhs@fb.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/bpf/20210323055146.3334476-1-yhs@fb.com
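The resulting usage pattern around a program run, as described above (hedged sketch; the exact signatures are assumptions based on this commit message):

    ret = bpf_cgroup_storage_set(storage);  /* claim one of the 8 per-cpu slots,
                                             * keyed by the current task */
    if (!ret) {
            BPF_PROG_RUN(prog, ctx);
            bpf_cgroup_storage_unset();     /* the same task releases the slot */
    }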
| * bpf: Fix typo 'accesible' into 'accessible'  (Ricardo Ribalda, 2021-03-26, 1 file, -1/+1)
Trivial fix.
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210318202223.164873-8-ribalda@chromium.org
* | include: net: Remove repeated struct declaration  (Wan Jiabing, 2021-04-01, 1 file, -1/+0)
struct ctl_table_header is declared twice. One declaration is at line 46; the one below is not needed. Remove the duplicate.
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: move ip6_dst_ops first in netns_ipv6  (Eric Dumazet, 2021-03-31, 1 file, -1/+3)
ip6_dst_ops has cache line alignment. Moving it to the beginning of netns_ipv6 removes a 48-byte hole and shrinks netns_ipv6 from 12 to 11 cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: convert eligible sysctls to u8  (Eric Dumazet, 2021-03-31, 1 file, -12/+12)
Convert most sysctls that can fit in a byte.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: convert tcp_comp_sack_nr sysctl to u8  (Eric Dumazet, 2021-03-31, 1 file, -1/+1)
tcp_comp_sack_nr's max value was already 255.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: convert igmp_link_local_mcast_reports sysctl to u8  (Eric Dumazet, 2021-03-31, 1 file, -1/+1)
This sysctl is a bool and can use less storage.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: convert fib_multipath_{use_neigh|hash_policy} sysctls to u8  (Eric Dumazet, 2021-03-31, 1 file, -2/+2)
Make room for better packing of netns_ipv4.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: convert udp_l3mdev_accept sysctl to u8  (Eric Dumazet, 2021-03-31, 1 file, -1/+1)
Reduce the footprint of sysctls.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: convert fib_notify_on_flag_change sysctl to u8  (Eric Dumazet, 2021-03-31, 1 file, -1/+1)
Reduce the footprint of sysctls.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: shrink netns_ipv4 by another cache line  (Eric Dumazet, 2021-03-31, 1 file, -3/+3)
By shuffling around some fields to remove 8 bytes of hole, we can save one cache line. pahole result before/after the patch:

    /* size: 768, cachelines: 12, members: 139 */
    /* sum members: 673, holes: 11, sum holes: 39 */
    /* padding: 56 */
    /* paddings: 2, sum paddings: 7 */
    /* forced alignments: 1 */
->
    /* size: 704, cachelines: 11, members: 139 */
    /* sum members: 673, holes: 10, sum holes: 31 */
    /* paddings: 2, sum paddings: 7 */
    /* forced alignments: 1 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: shrink inet_timewait_death_row by 48 bytes  (Eric Dumazet, 2021-03-31, 1 file, -2/+5)
struct inet_timewait_death_row uses two cache lines, because we want tw_count to use a full cache line to avoid false sharing. Rework its definition and placement in netns_ipv4 so that:
1) We add 60 bytes of padding after tw_count to avoid false sharing, knowing that tcp_death_row will have the ____cacheline_aligned_in_smp attribute.
2) We do not risk padding before tcp_death_row, because we move it to the beginning of netns_ipv4, even if new fields are added later.
3) We do not waste 48 bytes of padding after it.
Note that I have not changed dccp. pahole result for struct netns_ipv4 before/after the patch:

    /* size: 832, cachelines: 13, members: 139 */
    /* sum members: 721, holes: 12, sum holes: 95 */
    /* padding: 16 */
    /* paddings: 2, sum paddings: 55 */
->
    /* size: 768, cachelines: 12, members: 139 */
    /* sum members: 673, holes: 11, sum holes: 39 */
    /* padding: 56 */
    /* paddings: 2, sum paddings: 7 */
    /* forced alignments: 1 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | ethtool: support FEC settings over netlink  (Jakub Kicinski, 2021-03-31, 1 file, -0/+17)
Add the FEC API to netlink. This is not a 1-to-1 conversion. FEC settings already depend on link modes to tell the user which modes are supported. Take this further and use link modes for manual configuration. The old struct ethtool_fecparam is still used to talk to the drivers, so we need to translate back and forth. We can revisit the internal API if the number of FEC encodings starts to grow. Enforce only one active FEC bit (by using a bit position rather than another mask).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | vxlan: allow L4 GRO passthrough  (Paolo Abeni, 2021-03-30, 1 file, -0/+6)
When passing up a UDP GSO packet with L4 aggregation, there is no need to segment it at the vxlan level. We can propagate the packet untouched and let it be segmented later, if needed. Introduce a helper to let the UDP socket accept any L4 aggregation and use it in the vxlan driver.
v1 -> v2:
- updated to use the newly introduced UDP socket 'accept*' fields
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | udp: never accept GSO_FRAGLIST packets  (Paolo Abeni, 2021-03-30, 1 file, -3/+13)
Currently the UDP protocol delivers GSO_FRAGLIST packets to sockets without the expected segmentation. This change addresses the issue by introducing and maintaining a couple of new fields to explicitly accept SKB_GSO_UDP_L4 or GSO_FRAGLIST packets. It additionally updates udp_unexpected_gso() accordingly. UDP sockets enabling UDP_GRO still keep accept_udp_fraglist zeroed.
v1 -> v2:
- use 2 bits instead of a whole GSO bitmask (Willem)
Fixes: 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
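A sketch of the acceptance check this implies (accept_udp_fraglist is named in the message; accept_udp_l4 and the exact layout are assumptions):

    static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff *skb)
    {
            if (!skb_is_gso(skb))
                    return false;
            if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) &&
                !udp_sk(sk)->accept_udp_l4)
                    return true;
            if ((skb_shinfo(skb)->gso_type & SKB_GSO_FRAGLIST) &&
                !udp_sk(sk)->accept_udp_fraglist)
                    return true;
            return false;
    }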
* | udp: fixup csum for GSO receive slow path  (Paolo Abeni, 2021-03-30, 1 file, -0/+23)
When UDP packets generated locally by a socket with UDP_SEGMENT traverse the following path:

    UDP tunnel (xmit) -> veth (segmentation) -> veth (gro) ->
    UDP tunnel (rx) -> UDP socket (no UDP_GRO)

ip_summed will be set to CHECKSUM_PARTIAL at creation time, and such checksum mode will be preserved in the above path up to the UDP tunnel receive code, where we have:

    __iptunnel_pull_header() -> skb_pull_rcsum() ->
    skb_postpull_rcsum() -> __skb_postpull_rcsum()

The latter will convert the skb to CHECKSUM_NONE. The UDP GSO packet will be later segmented as part of the rx socket receive operation, and will present CHECKSUM_NONE after segmentation. Additionally, the segmented packets' UDP CB still refers to the original GSO packet len. Overall that causes unexpected/wrong csum validation errors later in the UDP receive path. We could possibly address the issue with some additional checks and csum mangling in the UDP tunnel code; since the issue affects only this UDP receive slow path, let's set a suitable csum status there. Note that SKB_GSO_UDP_L4 or SKB_GSO_FRAGLIST packets lacking a UDP encapsulation present a valid checksum when landing in udp_queue_rcv_skb(), as the UDP checksum has been validated by the GRO engine.
v2 -> v3:
- even more verbose commit message and comments
v1 -> v2:
- restrict the csum update to the packets strictly needing them
- hopefully clarify the commit message and code comments
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: add ipv6_dev_find to stubs  (Andreas Roeseler, 2021-03-30, 1 file, -0/+2)
Add ipv6_dev_find to ipv6_stub to allow lookup of net_devices by IPv6 address in net/ipv4/icmp.c.
Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: add sysctl for enabling RFC 8335 PROBE messages  (Andreas Roeseler, 2021-03-30, 1 file, -0/+1)
Section 8 of RFC 8335 specifies potential security concerns of responding to PROBE requests, and states that nodes that support PROBE functionality MUST be able to enable/disable responses and that responses MUST be disabled by default.
Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | ICMPV6: add support for RFC 8335 PROBE  (Andreas Roeseler, 2021-03-30, 1 file, -0/+3)
Add definitions for the ICMPv6 types of Extended Echo Request and Extended Echo Reply, as defined by sections 2 and 3 of RFC 8335.
Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | icmp: add support for RFC 8335 PROBE  (Andreas Roeseler, 2021-03-30, 1 file, -0/+42)
Add definitions for PROBE ICMP types and codes.
Add AFI definitions for IP and IPv6 as specified by IANA.
Add a struct to represent the additional header when probing by IP address (ctype == 3), for use in parsing incoming PROBE messages.
Add a struct to represent the entire Interface Identification Object (IIO) section of an incoming PROBE packet.
Signed-off-by: Andreas Roeseler <andreas.a.roeseler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | can: bittiming: add CAN_KBPS, CAN_MBPS and CAN_MHZ macros  (Vincent Mailhol, 2021-03-30, 1 file, -0/+8)
Add three macros to improve the readability of big bit timing numbers:
- CAN_KBPS: kilobits per second (one thousand)
- CAN_MBPS: megabits per second (one million)
- CAN_MHZ: megahertz (one million hertz)
Example:

    u32 bitrate_max = 8 * CAN_MBPS;
    struct can_clock clock = {.freq = 80 * CAN_MHZ};

instead of:

    u32 bitrate_max = 8000000;
    struct can_clock clock = {.freq = 80000000};

Apply the new macros to driver/net/can/dev/bittiming.c.
Link: https://lore.kernel.org/r/20210306054040.76483-1-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
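Given the stated units, the macros boil down to plain multipliers; a minimal sketch consistent with the example above:

    #define CAN_KBPS 1000UL     /* kilobits per second */
    #define CAN_MBPS 1000000UL  /* megabits per second */
    #define CAN_MHZ  1000000UL  /* megahertz */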
* | can: bittiming: add calculation for CAN FD Transmitter Delay Compensation (TDC)  (Vincent Mailhol, 2021-03-30, 1 file, -0/+6)
The logic for the tdco calculation is to just reuse the normal sample point: tdco = sp. Because the sample point is expressed in tenths of a percent and the tdco is expressed in time quanta, a conversion is needed. At the end, ssp = tdcv + tdco = tdcv + sp. Another popular method is to set tdco to the middle of the bit: tdc->tdco = can_bit_time(dbt) / 2. During benchmark tests, we could not find a clear advantage for either of the two methods. The tdco calculation is triggered each time the data_bittiming is changed, so that users relying on automated calculation can use the netlink interface the exact same way without needing new parameters. For example, a command such as:

    ip link set canX type can bitrate 500000 dbitrate 4000000 fd on

would trigger the calculation. A user using CONFIG_CAN_CALC_BITTIMING who does not want automated calculation needs to manually set tdco to zero, for example with:

    ip link set canX type can tdco 0 bitrate 500000 dbitrate 4000000 fd on

(if the tdco parameter is provided in a previous command, it will be overwritten). If tdcv is set to zero (the default), it is automatically calculated by the transceiver for each frame. As such, there is no code in the kernel to calculate it. tdcf has no automated calculation function because we could not figure out a formula for this parameter.
Link: https://lore.kernel.org/r/20210224002008.4158-6-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
* | can: dev: reorder struct can_priv members for better packing  (Vincent Mailhol, 2021-03-30, 1 file, -6/+7)
Save eight bytes of holes on x86-64 architectures by reordering struct can_priv members.
Before:

    $ pahole -C can_priv drivers/net/can/dev/dev.o
    struct can_priv {
        struct net_device * dev;                             /*   0    8 */
        struct can_device_stats can_stats;                   /*   8   24 */
        struct can_bittiming bittiming;                      /*  32   32 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct can_bittiming data_bittiming;                 /*  64   32 */
        const struct can_bittiming_const * bittiming_const;  /*  96    8 */
        const struct can_bittiming_const * data_bittiming_const; /* 104 8 */
        struct can_tdc tdc;                                  /* 112   12 */
        /* XXX 4 bytes hole, try to pack */
        /* --- cacheline 2 boundary (128 bytes) --- */
        const struct can_tdc_const * tdc_const;              /* 128    8 */
        const u16 * termination_const;                       /* 136    8 */
        unsigned int termination_const_cnt;                  /* 144    4 */
        u16 termination;                                     /* 148    2 */
        /* XXX 2 bytes hole, try to pack */
        const u32 * bitrate_const;                           /* 152    8 */
        unsigned int bitrate_const_cnt;                      /* 160    4 */
        /* XXX 4 bytes hole, try to pack */
        const u32 * data_bitrate_const;                      /* 168    8 */
        unsigned int data_bitrate_const_cnt;                 /* 176    4 */
        u32 bitrate_max;                                     /* 180    4 */
        struct can_clock clock;                              /* 184    4 */
        enum can_state state;                                /* 188    4 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        u32 ctrlmode;                                        /* 192    4 */
        u32 ctrlmode_supported;                              /* 196    4 */
        u32 ctrlmode_static;                                 /* 200    4 */
        int restart_ms;                                      /* 204    4 */
        struct delayed_work restart_work;                    /* 208  168 */
        /* XXX last struct has 4 bytes of padding */
        /* --- cacheline 5 boundary (320 bytes) was 56 bytes ago --- */
        int (*do_set_bittiming)(struct net_device *);        /* 376    8 */
        /* --- cacheline 6 boundary (384 bytes) --- */
        int (*do_set_data_bittiming)(struct net_device *);   /* 384    8 */
        int (*do_set_mode)(struct net_device *, enum can_mode); /* 392 8 */
        int (*do_set_termination)(struct net_device *, u16); /* 400    8 */
        int (*do_get_state)(const struct net_device *, enum can_state *); /* 408 8 */
        int (*do_get_berr_counter)(const struct net_device *, struct can_berr_counter *); /* 416 8 */
        unsigned int echo_skb_max;                           /* 424    4 */
        /* XXX 4 bytes hole, try to pack */
        struct sk_buff * * echo_skb;                         /* 432    8 */

        /* size: 440, cachelines: 7, members: 31 */
        /* sum members: 426, holes: 4, sum holes: 14 */
        /* paddings: 1, sum paddings: 4 */
        /* last cacheline: 56 bytes */
    };

After:

    $ pahole -C can_priv drivers/net/can/dev/dev.o
    struct can_priv {
        struct net_device * dev;                             /*   0    8 */
        struct can_device_stats can_stats;                   /*   8   24 */
        const struct can_bittiming_const * bittiming_const;  /*  32    8 */
        const struct can_bittiming_const * data_bittiming_const; /* 40 8 */
        struct can_bittiming bittiming;                      /*  48   32 */
        /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
        struct can_bittiming data_bittiming;                 /*  80   32 */
        const struct can_tdc_const * tdc_const;              /* 112    8 */
        struct can_tdc tdc;                                  /* 120   12 */
        /* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */
        unsigned int bitrate_const_cnt;                      /* 132    4 */
        const u32 * bitrate_const;                           /* 136    8 */
        const u32 * data_bitrate_const;                      /* 144    8 */
        unsigned int data_bitrate_const_cnt;                 /* 152    4 */
        u32 bitrate_max;                                     /* 156    4 */
        struct can_clock clock;                              /* 160    4 */
        unsigned int termination_const_cnt;                  /* 164    4 */
        const u16 * termination_const;                       /* 168    8 */
        u16 termination;                                     /* 176    2 */
        /* XXX 2 bytes hole, try to pack */
        enum can_state state;                                /* 180    4 */
        u32 ctrlmode;                                        /* 184    4 */
        u32 ctrlmode_supported;                              /* 188    4 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        u32 ctrlmode_static;                                 /* 192    4 */
        int restart_ms;                                      /* 196    4 */
        struct delayed_work restart_work;                    /* 200  168 */
        /* XXX last struct has 4 bytes of padding */
        /* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
        int (*do_set_bittiming)(struct net_device *);        /* 368    8 */
        int (*do_set_data_bittiming)(struct net_device *);   /* 376    8 */
        /* --- cacheline 6 boundary (384 bytes) --- */
        int (*do_set_mode)(struct net_device *, enum can_mode); /* 384 8 */
        int (*do_set_termination)(struct net_device *, u16); /* 392    8 */
        int (*do_get_state)(const struct net_device *, enum can_state *); /* 400 8 */
        int (*do_get_berr_counter)(const struct net_device *, struct can_berr_counter *); /* 408 8 */
        unsigned int echo_skb_max;                           /* 416    4 */
        /* XXX 4 bytes hole, try to pack */
        struct sk_buff * * echo_skb;                         /* 424    8 */

        /* size: 432, cachelines: 7, members: 31 */
        /* sum members: 426, holes: 2, sum holes: 6 */
        /* paddings: 1, sum paddings: 4 */
        /* last cacheline: 48 bytes */
    };

Link: https://lore.kernel.org/r/20210224002008.4158-3-mailhol.vincent@wanadoo.fr
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>