summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* bpf, bpftool: enable recvmsg attach typesDaniel Borkmann2019-06-065-6/+15
| | | | | | | | | | Trivial patch to bpftool in order to complete enabling attaching programs to BPF_CGROUP_UDP{4,6}_RECVMSG. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf, libbpf: enable recvmsg attach typesDaniel Borkmann2019-06-061-0/+4
| | | | | | | | Another trivial patch to libbpf in order to enable identifying and attaching programs to BPF_CGROUP_UDP{4,6}_RECVMSG by section name. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: sync tooling uapi headerDaniel Borkmann2019-06-061-0/+2
| | | | | | | | | | | Sync BPF uapi header in order to pull in BPF_CGROUP_UDP{4,6}_RECVMSG attach types. This is done and preferred as an extra patch in order to ease sync of libbpf. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: fix unconnected udp hooksDaniel Borkmann2019-06-067-4/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* tools: bpftool: Fix JSON output when lookup failsKrzesimir Nowak2019-06-051-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 9a5ab8bf1d6d ("tools: bpftool: turn err() and info() macros into functions") one case of error reporting was special cased, so it could report a lookup error for a specific key when dumping the map element. What the code forgot to do is to wrap the key and value keys into a JSON object, so an example output of pretty JSON dump of a sockhash map (which does not support looking up its values) is: [ "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x00" ], "value": { "error": "Operation not supported" }, "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x01" ], "value": { "error": "Operation not supported" } ] Note the key-value pairs inside the toplevel array. They should be wrapped inside a JSON object, otherwise it is an invalid JSON. This commit fixes this, so the output now is: [{ "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x00" ], "value": { "error": "Operation not supported" } },{ "key": ["0x0a","0x41","0x00","0x02","0x1f","0x78","0x00","0x01" ], "value": { "error": "Operation not supported" } } ] Fixes: 9a5ab8bf1d6d ("tools: bpftool: turn err() and info() macros into functions") Cc: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io> Acked-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* selftests/bpf: move test_lirc_mode2_user to TEST_GEN_PROGS_EXTENDEDHangbin Liu2019-06-051-3/+4
| | | | | | | | | test_lirc_mode2_user is included in test_lirc_mode2.sh test and should not be run directly. Fixes: 6bdd533cee9a ("bpf: add selftest for lirc_mode2 type program") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* Merge branch 'reuseport-fixes'Alexei Starovoitov2019-06-032-3/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Martin Lau says: ==================== s series has fixes when running reuseport's bpf_prog for udp lookup. If there is reuseport's bpf_prog, the common issue is the reuseport code path expects skb->data pointing to the transport header (udphdr here). A couple of commits broke this expectation. The issue is specific to running bpf_prog, so bpf tag is used for this series. Please refer to the individual commit message for details. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: udp: Avoid calling reuseport's bpf_prog from udp_groMartin KaFai Lau2019-06-032-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the commit a6024562ffd7 ("udp: Add GRO functions to UDP socket") added udp[46]_lib_lookup_skb to the udp_gro code path, it broke the reuseport_select_sock() assumption that skb->data is pointing to the transport header. This patch follows an earlier __udp6_lib_err() fix by passing a NULL skb to avoid calling the reuseport's bpf_prog. Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_errMartin KaFai Lau2019-06-031-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | __udp6_lib_err() may be called when handling icmpv6 message. For example, the icmpv6 toobig(type=2). __udp6_lib_lookup() is then called which may call reuseport_select_sock(). reuseport_select_sock() will call into a bpf_prog (if there is one). reuseport_select_sock() is expecting the skb->data pointing to the transport header (udphdr in this case). For example, run_bpf_filter() is pulling the transport header. However, in the __udp6_lib_err() path, the skb->data is pointing to the ipv6hdr instead of the udphdr. One option is to pull and push the ipv6hdr in __udp6_lib_err(). Instead of doing this, this patch follows how the original commit 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") was done in IPv4, which has passed a NULL skb pointer to reuseport_select_sock(). Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Cc: Craig Gallek <kraig@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf, riscv: clear high 32 bits for ALU32 add/sub/neg/lsh/rsh/arshLuke Nelson2019-05-311-0/+18
| | | | | | | | | | | | | | | | | | | | | In BPF, 32-bit ALU operations should zero-extend their results into the 64-bit registers. The current BPF JIT on RISC-V emits incorrect instructions that perform sign extension only (e.g., addw, subw) on 32-bit add, sub, lsh, rsh, arsh, and neg. This behavior diverges from the interpreter and JITs for other architectures. This patch fixes the bugs by performing zero extension on the destination register of 32-bit ALU operations. Fixes: 2353ecc6f91f ("bpf, riscv: add BPF JIT for RV64G") Cc: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Björn Töpel <bjorn.topel@gmail.com> Reviewed-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* libbpf: Return btf_fd for load_sk_storage_btfMichal Rostecki2019-05-313-23/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before this change, function load_sk_storage_btf expected that libbpf__probe_raw_btf was returning a BTF descriptor, but in fact it was returning an information about whether the probe was successful (0 or 1). load_sk_storage_btf was using that value as an argument of the close function, which was resulting in closing stdout and thus terminating the process which called that function. That bug was visible in bpftool. `bpftool feature` subcommand was always exiting too early (because of closed stdout) and it didn't display all requested probes. `bpftool -j feature` or `bpftool -p feature` were not returning a valid json object. This change renames the libbpf__probe_raw_btf function to libbpf__load_raw_btf, which now returns a BTF descriptor, as expected in load_sk_storage_btf. v2: - Fix typo in the commit message. v3: - Simplify BTF descriptor handling in bpf_object__probe_btf_* functions. - Rename libbpf__probe_raw_btf function to libbpf__load_raw_btf and return a BTF descriptor. v4: - Fix typo in the commit message. Fixes: d7c4b3980c18 ("libbpf: detect supported kernel BTF features and sanitize BTF") Signed-off-by: Michal Rostecki <mrostecki@opensuse.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* selftests: bpf: fix compiler warning in flow_dissector testAlakesh Haloi2019-05-291-0/+1
| | | | | | | | | | | | | | | Add missing header file following compiler warning: prog_tests/flow_dissector.c: In function ‘tx_tap’: prog_tests/flow_dissector.c:175:9: warning: implicit declaration of function ‘writev’; did you mean ‘write’? [-Wimplicit-function-declaration] return writev(fd, iov, ARRAY_SIZE(iov)); ^~~~~~ write Fixes: 0905beec9f52 ("selftests/bpf: run flow dissector tests in skb-less mode") Signed-off-by: Alakesh Haloi <alakesh.haloi@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* Merge branch 'bpf-subreg-tests'Daniel Borkmann2019-05-292-39/+533
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Jiong Wang says: ==================== JIT back-ends need to guarantee high 32-bit cleared whenever one eBPF insn write low 32-bit sub-register only. It is possible that some JIT back-ends have failed doing this and are silently generating wrong image. This set completes the unit tests, so bug on this could be exposed in JITs. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * selftests: bpf: complete sub-register zero extension checksJiong Wang2019-05-291-11/+505
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | eBPF ISA specification requires high 32-bit cleared when only low 32-bit sub-register is written. JIT back-ends must guarantee this semantics when doing code-gen. This patch complete unit tests for all of those insns that could be visible to JIT back-ends and defining sub-registers, if JIT back-ends failed to guarantee the mentioned semantics, these unit tests will fail. Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * selftests: bpf: move sub-register zero extension checks into subreg.cJiong Wang2019-05-292-39/+39
|/ | | | | | | | | | | | | It is better to centralize all sub-register zero extension checks into an independent file. This patch takes the first step to move existing sub-register zero extension checks into subreg.c. Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: sockmap, fix use after free from sleep in psock backlog workqueueJohn Fastabend2019-05-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Backlog work for psock (sk_psock_backlog) might sleep while waiting for memory to free up when sending packets. However, while sleeping the socket may be closed and removed from the map by the user space side. This breaks an assumption in sk_stream_wait_memory, which expects the wait queue to be still there when it wakes up resulting in a use-after-free shown below. To fix his mark sendmsg as MSG_DONTWAIT to avoid the sleep altogether. We already set the flag for the sendpage case but we missed the case were sendmsg is used. Sockmap is currently the only user of skb_send_sock_locked() so only the sockmap paths should be impacted. ================================================================== BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70 Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110 CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 Workqueue: events sk_psock_backlog Call Trace: print_address_description+0x6e/0x2b0 ? remove_wait_queue+0x31/0x70 kasan_report+0xfd/0x177 ? remove_wait_queue+0x31/0x70 ? remove_wait_queue+0x31/0x70 remove_wait_queue+0x31/0x70 sk_stream_wait_memory+0x4dd/0x5f0 ? sk_stream_wait_close+0x1b0/0x1b0 ? wait_woken+0xc0/0xc0 ? tcp_current_mss+0xc5/0x110 tcp_sendmsg_locked+0x634/0x15d0 ? tcp_set_state+0x2e0/0x2e0 ? __kasan_slab_free+0x1d1/0x230 ? kmem_cache_free+0x70/0x140 ? sk_psock_backlog+0x40c/0x4b0 ? process_one_work+0x40b/0x660 ? worker_thread+0x82/0x680 ? kthread+0x1b9/0x1e0 ? ret_from_fork+0x1f/0x30 ? check_preempt_curr+0xaf/0x130 ? iov_iter_kvec+0x5f/0x70 ? kernel_sendmsg_locked+0xa0/0xe0 skb_send_sock_locked+0x273/0x3c0 ? skb_splice_bits+0x180/0x180 ? start_thread+0xe0/0xe0 ? update_min_vruntime.constprop.27+0x88/0xc0 sk_psock_backlog+0xb3/0x4b0 ? strscpy+0xbf/0x1e0 process_one_work+0x40b/0x660 worker_thread+0x82/0x680 ? process_one_work+0x660/0x660 kthread+0x1b9/0x1e0 ? __kthread_create_on_node+0x250/0x250 ret_from_fork+0x1f/0x30 Fixes: 20bf50de3028c ("skbuff: Function to send an skbuf on a socket") Reported-by: Jakub Sitnicki <jakub@cloudflare.com> Tested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: sockmap, restore sk_write_space when psock gets droppedJakub Sitnicki2019-05-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once psock gets unlinked from its sock (sk_psock_drop), user-space can still trigger a call to sk->sk_write_space by setting TCP_NOTSENT_LOWAT socket option. This causes a null-ptr-deref because we try to read psock->saved_write_space from sk_psock_write_space: ================================================================== BUG: KASAN: null-ptr-deref in sk_psock_write_space+0x69/0x80 Read of size 8 at addr 00000000000001a0 by task sockmap-echo/131 CPU: 0 PID: 131 Comm: sockmap-echo Not tainted 5.2.0-rc1-00094-gf49aa1de9836 #81 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014 Call Trace: ? sk_psock_write_space+0x69/0x80 __kasan_report.cold.2+0x5/0x3f ? sk_psock_write_space+0x69/0x80 kasan_report+0xe/0x20 sk_psock_write_space+0x69/0x80 tcp_setsockopt+0x69a/0xfc0 ? tcp_shutdown+0x70/0x70 ? fsnotify+0x5b0/0x5f0 ? remove_wait_queue+0x90/0x90 ? __fget_light+0xa5/0xf0 __sys_setsockopt+0xe6/0x180 ? sockfd_lookup_light+0xb0/0xb0 ? vfs_write+0x195/0x210 ? ksys_write+0xc9/0x150 ? __x64_sys_read+0x50/0x50 ? __bpf_trace_x86_fpu+0x10/0x10 __x64_sys_setsockopt+0x61/0x70 do_syscall_64+0xc5/0x520 ? vmacache_find+0xc0/0x110 ? syscall_return_slowpath+0x110/0x110 ? handle_mm_fault+0xb4/0x110 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe ? trace_hardirqs_off_caller+0x4b/0x120 ? trace_hardirqs_off_thunk+0x1a/0x3a entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f2e5e7cdcce Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b1 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8a 11 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffed011b778 EFLAGS: 00000206 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2e5e7cdcce RDX: 0000000000000019 RSI: 0000000000000006 RDI: 0000000000000007 RBP: 00007ffed011b790 R08: 0000000000000004 R09: 00007f2e5e84ee80 R10: 00007ffed011b788 R11: 0000000000000206 R12: 00007ffed011b78c R13: 00007ffed011b788 R14: 0000000000000007 R15: 0000000000000068 ================================================================== Restore the saved sk_write_space callback when psock is being dropped to fix the crash. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* selftests: bpf: add zero extend checks for ALU32 and/or/xorBjörn Töpel2019-05-231-0/+39
| | | | | | | | | | Add three tests to test_verifier/basic_instr that make sure that the high 32-bits of the destination register is cleared after an ALU32 and/or/xor. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf, riscv: clear target register high 32-bits for and/or/xor on ALU32Björn Töpel2019-05-231-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using 32-bit subregisters (ALU32), the RISC-V JIT would not clear the high 32-bits of the target register and therefore generate incorrect code. E.g., in the following code: $ cat test.c unsigned int f(unsigned long long a, unsigned int b) { return (unsigned int)a & b; } $ clang-9 -target bpf -O2 -emit-llvm -S test.c -o - | \ llc-9 -mattr=+alu32 -mcpu=v3 .text .file "test.c" .globl f .p2align 3 .type f,@function f: r0 = r1 w0 &= w2 exit .Lfunc_end0: .size f, .Lfunc_end0-f The JIT would not clear the high 32-bits of r0 after the and-operation, which in this case might give an incorrect return value. After this patch, that is not the case, and the upper 32-bits are cleared. Reported-by: Jiong Wang <jiong.wang@netronome.com> Fixes: 2353ecc6f91f ("bpf, riscv: add BPF JIT for RV64G") Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* samples, bpf: suppress compiler warningMatteo Croce2019-05-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | GCC 9 fails to calculate the size of local constant strings and produces a false positive: samples/bpf/task_fd_query_user.c: In function ‘test_debug_fs_uprobe’: samples/bpf/task_fd_query_user.c:242:67: warning: ‘%s’ directive output may be truncated writing up to 255 bytes into a region of size 215 [-Wformat-truncation=] 242 | snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%ss/%s/id", | ^~ 243 | event_type, event_alias); | ~~~~~~~~~~~ samples/bpf/task_fd_query_user.c:242:2: note: ‘snprintf’ output between 45 and 300 bytes into a destination of size 256 242 | snprintf(buf, sizeof(buf), "/sys/kernel/debug/tracing/events/%ss/%s/id", | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 243 | event_type, event_alias); | ~~~~~~~~~~~~~~~~~~~~~~~~ Workaround this by lowering the buffer size to a reasonable value. Related GCC Bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83431 Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* samples, bpf: fix to change the buffer size for read()Chang-Hsien Tsai2019-05-211-1/+1
| | | | | | | | | | | | | | | | | | If the trace for read is larger than 4096, the return value sz will be 4096. This results in off-by-one error on buf: static char buf[4096]; ssize_t sz; sz = read(trace_fd, buf, sizeof(buf)); if (sz > 0) { buf[sz] = 0; puts(buf); } Signed-off-by: Chang-Hsien Tsai <luke.tw@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: Check sk_fullsock() before returning from bpf_sk_lookup()Martin KaFai Lau2019-05-211-2/+14
| | | | | | | | | | | | | | | | | | | | | | The BPF_FUNC_sk_lookup_xxx helpers return RET_PTR_TO_SOCKET_OR_NULL. Meaning a fullsock ptr and its fullsock's fields in bpf_sock can be accessed, e.g. type, protocol, mark and priority. Some new helper, like bpf_sk_storage_get(), also expects ARG_PTR_TO_SOCKET is a fullsock. bpf_sk_lookup() currently calls sk_to_full_sk() before returning. However, the ptr returned from sk_to_full_sk() is not guaranteed to be a fullsock. For example, it cannot get a fullsock if sk is in TCP_TIME_WAIT. This patch checks for sk_fullsock() before returning. If it is not a fullsock, sock_gen_put() is called if needed and then returns NULL. Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF") Cc: Joe Stringer <joe@isovalent.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Joe Stringer <joe@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: fix out-of-bounds read in __bpf_skc_lookupLorenz Bauer2019-05-211-1/+7
| | | | | | | | | | | | | | __bpf_skc_lookup takes a socket tuple and the length of the tuple as an argument. Based on the length, it decides which address family to pass to the helper function sk_lookup. In case of AF_INET6, it fails to verify that the length of the tuple is long enough. sk_lookup may therefore access data past the end of the tuple. Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* Documentation/networking: fix af_xdp.rst Sphinx warningsRandy Dunlap2019-05-211-4/+4
| | | | | | | | | | | | | | | | Fix Sphinx warnings in Documentation/networking/af_xdp.rst by adding indentation: Documentation/networking/af_xdp.rst:319: WARNING: Literal block expected; none found. Documentation/networking/af_xdp.rst:326: WARNING: Literal block expected; none found. Fixes: 0f4a9b7d4ecb ("xsk: add FAQ to facilitate for first time users") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* net: stmmac: dma channel control register need to be init firstWeifeng Voon2019-05-201-4/+4
| | | | | | | | | | | | | stmmac_init_chan() needs to be called before stmmac_init_rx_chan() and stmmac_init_tx_chan(). This is because if PBLx8 is to be used, "DMA_CH(#i)_Control.PBLx8" needs to be set before programming "DMA_CH(#i)_TX_Control.TxPBL" and "DMA_CH(#i)_RX_Control.RxPBL". Fixes: 47f2a9ce527a ("net: stmmac: dma channel init prepared for multiple queues") Reviewed-by: Zhang, Baoli <baoli.zhang@intel.com> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Weifeng Voon <weifeng.voon@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: stmmac: fix ethtool flow control not able to get/setTan, Tee Min2019-05-201-2/+2
| | | | | | | | | | | | | | | | Currently ethtool was not able to get/set the flow control due to a missing "!". It will always return -EOPNOTSUPP even the device is flow control supported. This patch fixes the condition check for ethtool flow control get/set function for ETHTOOL_LINK_MODE_Asym_Pause_BIT. Fixes: 3c1bcc8614db (“net: ethernet: Convert phydev advertize and supported from u32 to link mode”) Signed-off-by: Tan, Tee Min <tee.min.tan@intel.com> Reviewed-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Voon, Weifeng <weifeng.voon@intel.com@intel.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: qrtr: Fix message type of outgoing packetsBjorn Andersson2019-05-201-2/+2
| | | | | | | | | | | | | | | QRTR packets has a message type in the header, which is repeated in the control header. For control packets we therefor copy the type from beginning of the outgoing payload and use that as message type. For non-control messages an endianness fix introduced in v5.2-rc1 caused the type to be 0, rather than QRTR_TYPE_DATA, causing all messages to be dropped by the receiver. Fix this by converting and using qrtr_type, which will remain QRTR_TYPE_DATA for non-control messages. Fixes: 8f5e24514cbd ("net: qrtr: use protocol endiannes variable") Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* networking: : fix typos in code commentsWeitao Hou2019-05-201-2/+2
| | | | | | | fix accelleration to acceleration Signed-off-by: Weitao Hou <houweitaoo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ptp: Fix example program to match kernel.Richard Cochran2019-05-201-84/+1
| | | | | | | | | | | | | Ever since commit 3a06c7ac24f9 ("posix-clocks: Remove interval timer facility and mmap/fasync callbacks") the possibility of PHC based posix timers has been removed. In addition it will probably never make sense to implement this functionality. This patch removes the misleading example code which seems to suggest that posix timers for PHC devices will ever be a thing. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* fddi: fix typos in code commentsWeitao Hou2019-05-201-2/+2
| | | | | | | fix abord to abort Signed-off-by: Weitao Hou <houweitaoo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'kselftests-fib_rule_tests-fix'David S. Miller2019-05-201-1/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Hangbin Liu says: ==================== kselftests: fib_rule_tests: fix "from $SRC_IP iif $DEV" match testing As all the IPv4 testing addresses are in the same subnet and egress device == ingress device, to pass "from $SRC_IP iif $DEV" match test, we need enable forwarding to get the route entry. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * selftests: fib_rule_tests: enable forwarding before ipv4 from/iif testHangbin Liu2019-05-201-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | As all the testing addresses are in the same subnet and egress device == ingress device. We need enable forwarding to get the route entry. Also disable rp_filer separately as some distributions enable it in startup scripts. Fixes: 65b2b4939a64 ("selftests: net: initial fib rule tests") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * selftests: fib_rule_tests: fix local IPv4 address typoHangbin Liu2019-05-201-1/+1
|/ | | | | | | | | | The IPv4 testing address are all in 192.51.100.0 subnet. It doesn't make sense to set a 198.51.100.1 local address. Should be a typo. Fixes: 65b2b4939a64 ("selftests: net: initial fib rule tests") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tipc: Avoid copying bytes beyond the supplied dataChris Packham2019-05-201-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TLV_SET is called with a data pointer and a len parameter that tells us how many bytes are pointed to by data. When invoking memcpy() we need to careful to only copy len bytes. Previously we would copy TLV_LENGTH(len) bytes which would copy an extra 4 bytes past the end of the data pointer which newer GCC versions complain about. In file included from test.c:17: In function 'TLV_SET', inlined from 'test' at test.c:186:5: /usr/include/linux/tipc_config.h:317:3: warning: 'memcpy' forming offset [33, 36] is out of the bounds [0, 32] of object 'bearer_name' with type 'char[32]' [-Warray-bounds] memcpy(TLV_DATA(tlv_ptr), data, tlv_len); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ test.c: In function 'test': test.c::161:10: note: 'bearer_name' declared here char bearer_name[TIPC_MAX_BEARER_NAME]; ^~~~~~~~~~~ We still want to ensure any padding bytes at the end are initialised, do this with a explicit memset() rather than copy bytes past the end of data. Apply the same logic to TCM_SET. Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'net-readx_poll_timeout'David S. Miller2019-05-203-20/+17
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Benedikt Spranger says: ==================== Convert mdio wait function to use readx_poll_timeout() On loaded systems with a preemptible kernel both functions axienet_mdio_wait_until_ready() and xemaclite_mdio_wait() may report a false positive error return. Convert both functions to use readx_poll_timeout() to handle the situation in a safe manner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * 2/2] net: xilinx_emaclite: use readx_poll_timeout() in mdio wait functionKurt Kanzenbach2019-05-201-10/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On loaded systems with a preemptible kernel the mdio_wait() function may report an error while everything is working fine: xemaclite_mdio_wait(): xemaclite_readl() -> chip not ready --> interrupt here (other work for some time / chip become ready) if (time_before_eq(end, jiffies)) --> false positive error report Replace the current code with readx_poll_timeout() which takes care of the situation. Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * 1/2] net: axienet: use readx_poll_timeout() in mdio wait functionKurt Kanzenbach2019-05-202-10/+11
|/ | | | | | | | | | | | | | | | | | On loaded systems with a preemptible kernel the mdio_wait() function may report an error while everything is working fine: axienet_mdio_wait_until_ready(): axienet_ior() -> chip not ready --> interrupt here (other work for some time / chip become ready) if (time_before_eq(end, jiffies)) --> false positive error report Replace the current code with readx_poll_timeout() which take care of the situation. Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Benedikt Spranger <b.spranger@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* vlan: Mark expected switch fall-throughGustavo A. R. Silva2019-05-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warning: net/8021q/vlan_dev.c: In function ‘vlan_dev_ioctl’: net/8021q/vlan_dev.c:374:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (!net_eq(dev_net(dev), &init_net)) ^ net/8021q/vlan_dev.c:376:2: note: here case SIOCGMIIPHY: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* macvlan: Mark expected switch fall-throughGustavo A. R. Silva2019-05-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warning: drivers/net/macvlan.c: In function ‘macvlan_do_ioctl’: drivers/net/macvlan.c:839:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (!net_eq(dev_net(dev), &init_net)) ^ drivers/net/macvlan.c:841:2: note: here case SIOCGHWTSTAMP: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/mlx4_en: ethtool, Remove unsupported SFP EEPROM high pages queryErez Alfasi2019-05-202-6/+3
| | | | | | | | | | | | | | Querying EEPROM high pages data for SFP module is currently not supported by our driver but is still tried, resulting in invalid FW queries. Set the EEPROM ethtool data length to 256 for SFP module to limit the reading for page 0 only and prevent invalid FW queries. Fixes: 7202da8b7f71 ("ethtool, net/mlx4_en: Cable info, get_module_info/eeprom ethtool support") Signed-off-by: Erez Alfasi <ereza@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tipc: fix modprobe tipc failed after switch order of device registrationJunwei Hu2019-05-203-10/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Error message printed: modprobe: ERROR: could not insert 'tipc': Address family not supported by protocol. when modprobe tipc after the following patch: switch order of device registration, commit 7e27e8d6130c ("tipc: switch order of device registration to fix a crash") Because sock_create_kern(net, AF_TIPC, ...) called by tipc_topsrv_create_listener() in the initialization process of tipc_init_net(), so tipc_socket_init() must be execute before that. Meanwhile, tipc_net_id need to be initialized when sock_create() called, and tipc_socket_init() is no need to be called for each namespace. I add a variable tipc_topsrv_net_ops, and split the register_pernet_subsys() of tipc into two parts, and split tipc_socket_init() with initialization of pernet params. By the way, I fixed resources rollback error when tipc_bcast_init() failed in tipc_init_net(). Fixes: 7e27e8d6130c ("tipc: switch order of device registration to fix a crash") Signed-off-by: Junwei Hu <hujunwei4@huawei.com> Reported-by: Wang Wang <wangwang2@huawei.com> Reported-by: syzbot+1e8114b61079bfe9cbc5@syzkaller.appspotmail.com Reviewed-by: Kang Zhou <zhoukang7@huawei.com> Reviewed-by: Suanming Mou <mousuanming@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'for-5.2-rc1-tag' of ↵Linus Torvalds2019-05-208-26/+97
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Notable highlights: - fixes for some long-standing bugs in fsync that were quite hard to catch but now finaly fixed - some fixups to error handling paths that did not properly clean up (locking, memory) - fix to space reservation for inheriting properties" * tag 'for-5.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: tree-checker: detect file extent items with overlapping ranges Btrfs: fix race between ranged fsync and writeback of adjacent ranges Btrfs: avoid fallback to transaction commit during fsync of files with holes btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytes btrfs: sysfs: don't leak memory when failing add fsid btrfs: sysfs: Fix error path kobject memory leak Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path btrfs: use the existing reserved items for our first prop for inheritance btrfs: don't double unlock on error in btrfs_punch_hole btrfs: Check the compression level before getting a workspace
| * Btrfs: tree-checker: detect file extent items with overlapping rangesFilipe Manana2019-05-161-4/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Having file extent items with ranges that overlap each other is a serious issue that leads to all sorts of corruptions and crashes (like a BUG_ON() during the course of __btrfs_drop_extents() when it traims file extent items). Therefore teach the tree checker to detect such cases. This is motivated by a recently fixed bug (race between ranged full fsync and writeback or adjacent ranges). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix race between ranged fsync and writeback of adjacent rangesFilipe Manana2019-05-161-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode) that happens to be ranged, which happens during a msync() or writes for files opened with O_SYNC for example, we can end up with a corrupt log, due to different file extent items representing ranges that overlap with each other, or hit some assertion failures. When doing a ranged fsync we only flush delalloc and wait for ordered exents within that range. If while we are logging items from our inode ordered extents for adjacent ranges complete, we end up in a race that can make us insert the file extent items that overlap with others we logged previously and the assertion failures. For example, if tree-log.c:copy_items() receives a leaf that has the following file extents items, all with a length of 4K and therefore there is an implicit hole in the range 68K to 72K - 1: (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... It copies them to the log tree. However due to the need to detect implicit holes, it may release the path, in order to look at the previous leaf to detect an implicit hole, and then later it will search again in the tree for the first file extent item key, with the goal of locking again the leaf (which might have changed due to concurrent changes to other inodes). However when it locks again the leaf containing the first key, the key corresponding to the extent at offset 72K may not be there anymore since there is an ordered extent for that range that is finishing (that is, somewhere in the middle of btrfs_finish_ordered_io()), and it just removed the file extent item but has not yet replaced it with a new file extent item, so the part of copy_items() that does hole detection will decide that there is a hole in the range starting from 68K to 76K - 1, and therefore insert a file extent item to represent that hole, having a key offset of 68K. After that we now have a log tree with 2 different extent items that have overlapping ranges: 1) The file extent item copied before copy_items() released the path, which has a key offset of 72K and a length of 4K, representing the file range 72K to 76K - 1. 2) And a file extent item representing a hole that has a key offset of 68K and a length of 8K, representing the range 68K to 76K - 1. This item was inserted after releasing the path, and overlaps with the extent item inserted before. The overlapping extent items can cause all sorts of unpredictable and incorrect behaviour, either when replayed or if a fast (non full) fsync happens later, which can trigger a BUG_ON() when calling btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a trace like the following: [61666.783269] ------------[ cut here ]------------ [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182! [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP (...) [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000 [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs] [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246 [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000 [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937 [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000 [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418 [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000 [61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000 [61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0 [61666.786253] Call Trace: [61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs] [61666.786253] ? time_hardirqs_on+0x9/0x14 [61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs] [61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs] [61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs] [61666.786253] ? lock_acquire+0x131/0x1c5 [61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs] [61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? lockref_get_not_zero+0x2c/0x34 [61666.786253] ? rcu_read_unlock+0x3e/0x5d [61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs] [61666.786253] btrfs_sync_file+0x317/0x42c [btrfs] [61666.786253] vfs_fsync_range+0x8c/0x9e [61666.786253] SyS_msync+0x13c/0x1c9 [61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad A sample of a corrupt log tree leaf with overlapping extents I got from running btrfs/072: item 14 key (295 108 200704) itemoff 2599 itemsize 53 extent data disk bytenr 0 nr 0 extent data offset 0 nr 458752 ram 458752 item 15 key (295 108 659456) itemoff 2546 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 606208 nr 163840 ram 770048 item 16 key (295 108 663552) itemoff 2493 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 610304 nr 155648 ram 770048 item 17 key (295 108 819200) itemoff 2440 itemsize 53 extent data disk bytenr 4334788608 nr 4096 extent data offset 0 nr 4096 ram 4096 The file extent item at offset 659456 (item 15) ends at offset 823296 (659456 + 163840) while the next file extent item (item 16) starts at offset 663552. Another different problem that the race can trigger is a failure in the assertions at tree-log.c:copy_items(), which expect that the first file extent item key we found before releasing the path exists after we have released path and that the last key we found before releasing the path also exists after releasing the path: $ cat -n fs/btrfs/tree-log.c 4080 if (need_find_last_extent) { 4081 /* btrfs_prev_leaf could return 1 without releasing the path */ 4082 btrfs_release_path(src_path); 4083 ret = btrfs_search_slot(NULL, inode->root, &first_key, 4084 src_path, 0, 0); 4085 if (ret < 0) 4086 return ret; 4087 ASSERT(ret == 0); (...) 4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) { 4104 ret = btrfs_next_leaf(inode->root, src_path); 4105 if (ret < 0) 4106 return ret; 4107 ASSERT(ret == 0); 4108 src = src_path->nodes[0]; 4109 i = 0; 4110 need_find_last_extent = true; 4111 } (...) The second assertion implicitly expects that the last key before the path release still exists, because the surrounding while loop only stops after we have found that key. When this assertion fails it produces a stack like this: [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107 [139590.037406] ------------[ cut here ]------------ [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546! [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1 (...) [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs] (...) [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282 [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000 [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868 [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000 [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013 [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001 [139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000 [139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0 [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [139590.044250] Call Trace: [139590.044631] copy_items+0xa3f/0x1000 [btrfs] [139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs] [139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs] [139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs] [139590.046143] ? do_raw_spin_unlock+0x49/0xc0 [139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs] [139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs] [139590.047592] __vfs_write+0x129/0x1c0 [139590.047932] vfs_write+0xc2/0x1b0 [139590.048270] ksys_write+0x55/0xc0 [139590.048608] do_syscall_64+0x60/0x1b0 [139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe [139590.049287] RIP: 0033:0x7f2efc4be190 (...) [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190 [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003 [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60 [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003 [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000 (...) [139590.055128] ---[ end trace 193f35d0215cdeeb ]--- So fix this race between a full ranged fsync and writeback of adjacent ranges by flushing all delalloc and waiting for all ordered extents to complete before logging the inode. This is the simplest way to solve the problem because currently the full fsync path does not deal with ranges at all (it assumes a full range from 0 to LLONG_MAX) and it always needs to look at adjacent ranges for hole detection. For use cases of ranged fsyncs this can make a few fsyncs slower but on the other hand it can make some following fsyncs to other ranges do less work or no need to do anything at all. A full fsync is rare anyway and happens only once after loading/creating an inode and once after less common operations such as a shrinking truncate. This is an issue that exists for a long time, and was often triggered by generic/127, because it does mmap'ed writes and msync (which triggers a ranged fsync). Adding support for the tree checker to detect overlapping extents (next patch in the series) and trigger a WARN() when such cases are found, and then calling btrfs_check_leaf_full() at the end of btrfs_insert_file_extent() made the issue much easier to detect. Running btrfs/072 with that change to the tree checker and making fsstress open files always with O_SYNC made it much easier to trigger the issue (as triggering it with generic/127 is very rare). CC: stable@vger.kernel.org # 3.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: avoid fallback to transaction commit during fsync of files with holesFilipe Manana2019-05-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a file that has holes and has file extent items spanning two or more leafs, we can end up falling to back to a full transaction commit due to a logic bug that leads to failure to insert a duplicate file extent item that is meant to represent a hole between the last file extent item of a leaf and the first file extent item in the next leaf. The failure (EEXIST error) leads to a transaction commit (as most errors when logging an inode do). For example, we have the two following leafs: Leaf N: ----------------------------------------------- | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) | ----------------------------------------------- The file extent item at the end of leaf N has a length of 4Kb, representing the file range from 64K to 68K - 1. Leaf N + 1: ----------------------------------------------- | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... | ----------------------------------------------- The file extent item at the first slot of leaf N + 1 has a length of 4Kb too, representing the file range from 72K to 76K - 1. During the full fsync path, when we are at tree-log.c:copy_items() with leaf N as a parameter, after processing the last file extent item, that represents the extent at offset 64K, we take a look at the first file extent item at the next leaf (leaf N + 1), and notice there's a 4K hole between the two extents, and therefore we insert a file extent item representing that hole, starting at file offset 68K and ending at offset 72K - 1. However we don't update the value of *last_extent, which is used to represent the end offset (plus 1, non-inclusive end) of the last file extent item inserted in the log, so it stays with a value of 68K and not with a value of 72K. Then, when copy_items() is called for leaf N + 1, because the value of *last_extent is smaller then the offset of the first extent item in the leaf (68K < 72K), we look at the last file extent item in the previous leaf (leaf N) and see it there's a 4K gap between it and our first file extent item (again, 68K < 72K), so we decide to insert a file extent item representing the hole, starting at file offset 68K and ending at offset 72K - 1, this insertion will fail with -EEXIST being returned from btrfs_insert_file_extent() because we already inserted a file extent item representing a hole for this offset (68K) in the previous call to copy_items(), when processing leaf N. The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(), which falls back to a full transaction commit. Fix this by adjusting *last_extent after inserting a hole when we had to look at the next leaf. Fixes: 4ee3fad34a9c ("Btrfs: fix fsync after hole punching when using no-holes feature") Cc: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytesQu Wenruo2019-05-161-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ddf30cf03fb5 ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()") refactored add_pinned_bytes(), but during that refactor, there are two callers which add the pinned bytes instead of subtracting. That refactor misses those two caller, causing incorrect pinned bytes calculation and resulting unexpected ENOSPC error. Fix it by adding a new parameter @sign to restore the original behavior. Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: ddf30cf03fb5 ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: sysfs: don't leak memory when failing add fsidTobin C. Harding2019-05-161-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A failed call to kobject_init_and_add() must be followed by a call to kobject_put(). Currently in the error path when adding fs_devices we are missing this call. This could be fixed by calling btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid(). Here we choose the second option because it prevents the slightly unusual error path handling requirements of kobject from leaking out into btrfs functions. Add a call to kobject_put() in the error path of kobject_add_and_init(). This causes the release method to be called if kobject_init_and_add() fails. open_tree() is the function that calls btrfs_sysfs_add_fsid() and the error code in this function is already written with the assumption that the release method is called during the error path of open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the fail_fsdev_sysfs label). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: sysfs: Fix error path kobject memory leakTobin C. Harding2019-05-161-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a call to kobject_init_and_add() fails we must call kobject_put() otherwise we leak memory. Calling kobject_put() when kobject_init_and_add() fails drops the refcount back to 0 and calls the ktype release method (which in turn calls the percpu destroy and kfree). Add call to kobject_put() in the error path of call to kobject_init_and_add(). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: do not abort transaction at btrfs_update_root() after failure to COW pathFilipe Manana2019-05-091-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when we fail to COW a path at btrfs_update_root() we end up always aborting the transaction. However all the current callers of btrfs_update_root() are able to deal with errors returned from it, many do end up aborting the transaction themselves (directly or not, such as the transaction commit path), other BUG_ON() or just gracefully cancel whatever they were doing. When syncing the fsync log, we call btrfs_update_root() through tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log sync code does not abort the transaction, instead it gracefully handles the error and returns -EAGAIN to the fsync handler, so that it falls back to a transaction commit. Any other error different from -ENOSPC, makes the log sync code abort the transaction. So remove the transaction abort from btrfs_update_log() when we fail to COW a path to update the root item, so that if an -ENOSPC failure happens we avoid aborting the current transaction and have a chance of the fsync succeeding after falling back to a transaction commit. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413 Fixes: 79787eaab46121 ("btrfs: replace many BUG_ONs with proper error handling") Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: use the existing reserved items for our first prop for inheritanceJosef Bacik2019-05-091-8/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We're now reserving an extra items worth of space for property inheritance. We only have one property at the moment so this covers us, but if we add more in the future this will allow us to not get bitten by the extra space reservation. If we do add more properties in the future we should re-visit how we calculate the space reservation needs by the callers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ refreshed on top of prop/xattr cleanups ] Signed-off-by: David Sterba <dsterba@suse.com>