summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2020-06-307-80/+173
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2020-06-30 The following pull-request contains BPF updates for your *net* tree. We've added 28 non-merge commits during the last 9 day(s) which contain a total of 35 files changed, 486 insertions(+), 232 deletions(-). The main changes are: 1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer types, from Yonghong Song. 2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki. 3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb internals and integrate it properly into DMA core, from Christoph Hellwig. 4) Fix RCU splat from recent changes to avoid skipping ingress policy when kTLS is enabled, from John Fastabend. 5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its position masking to work, from Andrii Nakryiko. 6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading of network programs, from Maciej Żenczykowski. 7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer. 8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * bpf: Fix an incorrect branch elimination by verifierYonghong Song2020-06-301-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Wenbo reported an issue in [1] where a checking of null pointer is evaluated as always false. In this particular case, the program type is tp_btf and the pointer to compare is a PTR_TO_BTF_ID. The current verifier considers PTR_TO_BTF_ID always reprents a non-null pointer, hence all PTR_TO_BTF_ID compares to 0 will be evaluated as always not-equal, which resulted in the branch elimination. For example, struct bpf_fentry_test_t { struct bpf_fentry_test_t *a; }; int BPF_PROG(test7, struct bpf_fentry_test_t *arg) { if (arg == 0) test7_result = 1; return 0; } int BPF_PROG(test8, struct bpf_fentry_test_t *arg) { if (arg->a == 0) test8_result = 1; return 0; } In above bpf programs, both branch arg == 0 and arg->a == 0 are removed. This may not be what developer expected. The bug is introduced by Commit cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ"), where PTR_TO_BTF_ID is considered to be non-null when evaluting pointer vs. scalar comparison. This may be added considering we have PTR_TO_BTF_ID_OR_NULL in the verifier as well. PTR_TO_BTF_ID_OR_NULL is added to explicitly requires a non-NULL testing in selective cases. The current generic pointer tracing framework in verifier always assigns PTR_TO_BTF_ID so users does not need to check NULL pointer at every pointer level like a->b->c->d. We may not want to assign every PTR_TO_BTF_ID as PTR_TO_BTF_ID_OR_NULL as this will require a null test before pointer dereference which may cause inconvenience for developers. But we could avoid branch elimination to preserve original code intention. This patch simply removed PTR_TO_BTD_ID from reg_type_not_null() in verifier, which prevented the above branches from being eliminated. [1]: https://lore.kernel.org/bpf/79dbb7c0-449d-83eb-5f4f-7af0cc269168@fb.com/T/ Fixes: cac616db39c2 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ") Reported-by: Wenbo Zhang <ethercflow@gmail.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630171240.2523722-1-yhs@fb.com
| * bpf, netns: Fix use-after-free in pernet pre_exit callbackJakub Sitnicki2020-06-301-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Iterating over BPF links attached to network namespace in pre_exit hook is not safe, even if there is just one. Once link gets auto-detached, that is its back-pointer to net object is set to NULL, the link can be released and freed without waiting on netns_bpf_mutex, effectively causing the list element we are operating on to be freed. This leads to use-after-free when trying to access the next element on the list, as reported by KASAN. Bug can be triggered by destroying a network namespace, while also releasing a link attached to this network namespace. | ================================================================== | BUG: KASAN: use-after-free in netns_bpf_pernet_pre_exit+0xd9/0x130 | Read of size 8 at addr ffff888119e0d778 by task kworker/u8:2/177 | | CPU: 3 PID: 177 Comm: kworker/u8:2 Not tainted 5.8.0-rc1-00197-ga0c04c9d1008-dirty #776 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 | Workqueue: netns cleanup_net | Call Trace: | dump_stack+0x9e/0xe0 | print_address_description.constprop.0+0x3a/0x60 | ? netns_bpf_pernet_pre_exit+0xd9/0x130 | kasan_report.cold+0x1f/0x40 | ? netns_bpf_pernet_pre_exit+0xd9/0x130 | netns_bpf_pernet_pre_exit+0xd9/0x130 | cleanup_net+0x30b/0x5b0 | ? unregister_pernet_device+0x50/0x50 | ? rcu_read_lock_bh_held+0xb0/0xb0 | ? _raw_spin_unlock_irq+0x24/0x50 | process_one_work+0x4d1/0xa10 | ? lock_release+0x3e0/0x3e0 | ? pwq_dec_nr_in_flight+0x110/0x110 | ? rwlock_bug.part.0+0x60/0x60 | worker_thread+0x7a/0x5c0 | ? process_one_work+0xa10/0xa10 | kthread+0x1e3/0x240 | ? kthread_create_on_node+0xd0/0xd0 | ret_from_fork+0x1f/0x30 | | Allocated by task 280: | save_stack+0x1b/0x40 | __kasan_kmalloc.constprop.0+0xc2/0xd0 | netns_bpf_link_create+0xfe/0x650 | __do_sys_bpf+0x153a/0x2a50 | do_syscall_64+0x59/0x300 | entry_SYSCALL_64_after_hwframe+0x44/0xa9 | | Freed by task 198: | save_stack+0x1b/0x40 | __kasan_slab_free+0x12f/0x180 | kfree+0xed/0x350 | process_one_work+0x4d1/0xa10 | worker_thread+0x7a/0x5c0 | kthread+0x1e3/0x240 | ret_from_fork+0x1f/0x30 | | The buggy address belongs to the object at ffff888119e0d700 | which belongs to the cache kmalloc-192 of size 192 | The buggy address is located 120 bytes inside of | 192-byte region [ffff888119e0d700, ffff888119e0d7c0) | The buggy address belongs to the page: | page:ffffea0004678340 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 | flags: 0x2fffe0000000200(slab) | raw: 02fffe0000000200 ffffea00045ba8c0 0000000600000006 ffff88811a80ea80 | raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000 | page dumped because: kasan: bad access detected | | Memory state around the buggy address: | ffff888119e0d600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ffff888119e0d680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc | >ffff888119e0d700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ^ | ffff888119e0d780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc | ffff888119e0d800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb | ================================================================== Remove the "fast-path" for releasing a link that got auto-detached by a dying network namespace to fix it. This way as long as link is on the list and netns_bpf mutex is held, we have a guarantee that link memory can be accessed. An alternative way to fix this issue would be to safely iterate over the list of links and ensure there is no access to link object after detaching it. But, at the moment, optimizing synchronization overhead on link release without a workload in mind seems like an overkill. Fixes: ab53cad90eb1 ("bpf, netns: Keep a list of attached bpf_link's") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200630164541.1329993-1-jakub@cloudflare.com
| * bpf: sockmap: Require attach_bpf_fd when detaching a programLorenz Bauer2020-06-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sockmap code currently ignores the value of attach_bpf_fd when detaching a program. This is contrary to the usual behaviour of checking that attach_bpf_fd represents the currently attached program. Ensure that attach_bpf_fd is indeed the currently attached program. It turns out that all sockmap selftests already do this, which indicates that this is unlikely to cause breakage. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
| * bpf: flow_dissector: Check value of unused flags to BPF_PROG_DETACHLorenz Bauer2020-06-302-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using BPF_PROG_DETACH on a flow dissector program supports neither attach_flags nor attach_bpf_fd. Yet no value is enforced for them. Enforce that attach_flags are zero, and require the current program to be passed via attach_bpf_fd. This allows us to remove the check for CAP_SYS_ADMIN, since userspace can now no longer remove arbitrary flow dissector programs. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-3-lmb@cloudflare.com
| * bpf: flow_dissector: Check value of unused flags to BPF_PROG_ATTACHLorenz Bauer2020-06-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using BPF_PROG_ATTACH on a flow dissector program supports neither target_fd, attach_flags or replace_bpf_fd but accepts any value. Enforce that all of them are zero. This is fine for replace_bpf_fd since its presence is indicated by BPF_F_REPLACE. It's more problematic for target_fd, since zero is a valid fd. Should we want to use the flag later on we'd have to add an exception for fd 0. The alternative is to force a value like -1. This requires more changes to tests. There is also precedent for using 0, since bpf_iter uses this for target_fd as well. Fixes: b27f7bb590ba ("flow_dissector: Move out netns_bpf prog callbacks") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200629095630.7933-2-lmb@cloudflare.com
| * bpf, netns: Keep a list of attached bpf_link'sJakub Sitnicki2020-06-301-19/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To support multi-prog link-based attachments for new netns attach types, we need to keep track of more than one bpf_link per attach type. Hence, convert net->bpf.links into a list, that currently can be either empty or have just one item. Instead of reusing bpf_prog_list from bpf-cgroup, we link together bpf_netns_link's themselves. This makes list management simpler as we don't have to allocate, initialize, and later release list elements. We can do this because multi-prog attachment will be available only for bpf_link, and we don't need to build a list of programs attached directly and indirectly via links. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-4-jakub@cloudflare.com
| * bpf, netns: Keep attached programs in bpf_prog_arrayJakub Sitnicki2020-06-301-38/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for having multi-prog attachments for new netns attach types by storing programs to run in a bpf_prog_array, which is well suited for iterating over programs and running them in sequence. After this change bpf(PROG_QUERY) may block to allocate memory in bpf_prog_array_copy_to_user() for collected program IDs. This forces a change in how we protect access to the attached program in the query callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from an RCU read lock to holding a mutex that serializes updaters. Because we allow only one BPF flow_dissector program to be attached to netns at all times, the bpf_prog_array pointed by net->bpf.run_array is always either detached (null) or one element long. No functional changes intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
| * flow_dissector: Pull BPF program assignment up to bpf-netnsJakub Sitnicki2020-06-301-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for using bpf_prog_array to store attached programs by moving out code that updates the attached program out of flow dissector. Managing bpf_prog_array is more involved than updating a single bpf_prog pointer. This will let us do it all from one place, bpf/net_namespace.c, in the subsequent patch. No functional change intended. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
| * bpf: Enforce BPF ringbuf size to be the power of 2Andrii Nakryiko2020-06-301-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | BPF ringbuf assumes the size to be a multiple of page size and the power of 2 value. The latter is important to avoid division while calculating position inside the ring buffer and using (N-1) mask instead. This patch fixes omission to enforce power-of-2 size rule. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200630061500.1804799-1-andriin@fb.com
| * dma-mapping: Add a new dma_need_sync APIChristoph Hellwig2020-06-302-0/+16
| | | | | | | | | | | | | | | | | | Add a new API to check if calls to dma_sync_single_for_{device,cpu} are required for a given DMA streaming mapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200629130359.2690853-2-hch@lst.de
| * bpf: Do not allow btf_ctx_access with __int128 typesJohn Fastabend2020-06-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To ensure btf_ctx_access() is safe the verifier checks that the BTF arg type is an int, enum, or pointer. When the function does the BTF arg lookup it uses the calculation 'arg = off / 8' using the fact that registers are 8B. This requires that the first arg is in the first reg, the second in the second, and so on. However, for __int128 the arg will consume two registers by default LLVM implementation. So this will cause the arg layout assumed by the 'arg = off / 8' calculation to be incorrect. Because __int128 is uncommon this patch applies the easiest fix and will force int types to be sizeof(u64) or smaller so that they will fit in a single register. v2: remove unneeded parens per Andrii's feedback Fixes: 9e15db66136a1 ("bpf: Implement accurate raw_tp context access via BTF") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/159303723962.11287.13309537171132420717.stgit@john-Precision-5820-Tower
| * bpf: Restore behaviour of CAP_SYS_ADMIN allowing the loading of networking ↵Maciej Żenczykowski2020-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bpf programs This is a fix for a regression in commit 2c78ee898d8f ("bpf: Implement CAP_BPF"). Before the above commit it was possible to load network bpf programs with just the CAP_SYS_ADMIN privilege. The Android bpfloader happens to run in such a configuration (it has SYS_ADMIN but not NET_ADMIN) and creates maps and loads bpf programs for later use by Android's netd (which has NET_ADMIN but not SYS_ADMIN). Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF") Reported-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: John Stultz <john.stultz@linaro.org> Link: https://lore.kernel.org/bpf/20200620212616.93894-1-zenczykowski@gmail.com
| * bpf: Set the number of exception entries properly for subprogramsYonghong Song2020-06-231-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, if a bpf program has more than one subprograms, each program will be jitted separately. For programs with bpf-to-bpf calls the prog->aux->num_exentries is not setup properly. For example, with bpf_iter_netlink.c modified to force one function to be not inlined and with CONFIG_BPF_JIT_ALWAYS_ON the following error is seen: $ ./test_progs -n 3/3 ... libbpf: failed to load program 'iter/netlink' libbpf: failed to load object 'bpf_iter_netlink' libbpf: failed to load BPF skeleton 'bpf_iter_netlink': -4007 test_netlink:FAIL:bpf_iter_netlink__open_and_load skeleton open_and_load failed #3/3 netlink:FAIL The dmesg shows the following errors: ex gen bug which is triggered by the following code in arch/x86/net/bpf_jit_comp.c: if (excnt >= bpf_prog->aux->num_exentries) { pr_err("ex gen bug\n"); return -EFAULT; } This patch fixes the issue by computing proper num_exentries for each subprogram before calling JIT. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds2020-06-253-25/+40
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Don't insert ESP trailer twice in IPSEC code, from Huy Nguyen. 2) The default crypto algorithm selection in Kconfig for IPSEC is out of touch with modern reality, fix this up. From Eric Biggers. 3) bpftool is missing an entry for BPF_MAP_TYPE_RINGBUF, from Andrii Nakryiko. 4) Missing init of ->frame_sz in xdp_convert_zc_to_xdp_frame(), from Hangbin Liu. 5) Adjust packet alignment handling in ax88179_178a driver to match what the hardware actually does. From Jeremy Kerr. 6) register_netdevice can leak in the case one of the notifiers fail, from Yang Yingliang. 7) Use after free in ip_tunnel_lookup(), from Taehee Yoo. 8) VLAN checks in sja1105 DSA driver need adjustments, from Vladimir Oltean. 9) tg3 driver can sleep forever when we get enough EEH errors, fix from David Christensen. 10) Missing {READ,WRITE}_ONCE() annotations in various Intel ethernet drivers, from Ciara Loftus. 11) Fix scanning loop break condition in of_mdiobus_register(), from Florian Fainelli. 12) MTU limit is incorrect in ibmveth driver, from Thomas Falcon. 13) Endianness fix in mlxsw, from Ido Schimmel. 14) Use after free in smsc95xx usbnet driver, from Tuomas Tynkkynen. 15) Missing bridge mrp configuration validation, from Horatiu Vultur. 16) Fix circular netns references in wireguard, from Jason A. Donenfeld. 17) PTP initialization on recovery is not done properly in qed driver, from Alexander Lobakin. 18) Endian conversion of L4 ports in filters of cxgb4 driver is wrong, from Rahul Lakkireddy. 19) Don't clear bound device TX queue of socket prematurely otherwise we get problems with ktls hw offloading, from Tariq Toukan. 20) ipset can do atomics on unaligned memory, fix from Russell King. 21) Align ethernet addresses properly in bridging code, from Thomas Martitz. 22) Don't advertise ipv4 addresses on SCTP sockets having ipv6only set, from Marcelo Ricardo Leitner. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (149 commits) rds: transport module should be auto loaded when transport is set sch_cake: fix a few style nits sch_cake: don't call diffserv parsing code when it is not needed sch_cake: don't try to reallocate or unshare skb unconditionally ethtool: fix error handling in linkstate_prepare_data() wil6210: account for napi_gro_receive never returning GRO_DROP hns: do not cast return value of napi_gro_receive to null socionext: account for napi_gro_receive never returning GRO_DROP wireguard: receive: account for napi_gro_receive never returning GRO_DROP vxlan: fix last fdb index during dump of fdb with nhid sctp: Don't advertise IPv4 addresses if ipv6only is set on the socket tc-testing: avoid action cookies with odd length. bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT net: dsa: sja1105: fix tc-gate schedule with single element net: dsa: sja1105: recalculate gating subschedule after deleting tc-gate rules net: dsa: sja1105: unconditionally free old gating config net: dsa: sja1105: move sja1105_compose_gating_subschedule at the top net: macb: free resources on failure path of at91ether_open() net: macb: call pm_runtime_put_sync on failure path ...
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2020-06-173-25/+40
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexei Starovoitov says: ==================== pull-request: bpf 2020-06-17 The following pull-request contains BPF updates for your *net* tree. We've added 10 non-merge commits during the last 2 day(s) which contain a total of 14 files changed, 158 insertions(+), 59 deletions(-). The main changes are: 1) Important fix for bpf_probe_read_kernel_str() return value, from Andrii. 2) [gs]etsockopt fix for large optlen, from Stanislav. 3) devmap allocation fix, from Toke. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZEStanislav Fomichev2020-06-171-20/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Attaching to these hooks can break iptables because its optval is usually quite big, or at least bigger than the current PAGE_SIZE limit. David also mentioned some SCTP options can be big (around 256k). For such optvals we expose only the first PAGE_SIZE bytes to the BPF program. BPF program has two options: 1. Set ctx->optlen to 0 to indicate that the BPF's optval should be ignored and the kernel should use original userspace value. 2. Set ctx->optlen to something that's smaller than the PAGE_SIZE. v5: * use ctx->optlen == 0 with trimmed buffer (Alexei Starovoitov) * update the docs accordingly v4: * use temporary buffer to avoid optval == optval_end == NULL; this removes the corner case in the verifier that might assume non-zero PTR_TO_PACKET/PTR_TO_PACKET_END. v3: * don't increase the limit, bypass the argument v2: * proper comments formatting (Jakub Kicinski) Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: David Laight <David.Laight@ACULAB.COM> Link: https://lore.kernel.org/bpf/20200617010416.93086-1-sdf@google.com
| | * devmap: Use bpf_map_area_alloc() for allocating hash bucketsToke Høiland-Jørgensen2020-06-171-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzkaller discovered that creating a hash of type devmap_hash with a large number of entries can hit the memory allocator limit for allocating contiguous memory regions. There's really no reason to use kmalloc_array() directly in the devmap code, so just switch it to the existing bpf_map_area_alloc() function that is used elsewhere. Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200616142829.114173-1-toke@redhat.com
| | * bpf: bpf_probe_read_kernel_str() has to return amount of data read on successAndrii Nakryiko2020-06-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During recent refactorings, bpf_probe_read_kernel_str() started returning 0 on success, instead of amount of data successfully read. This majorly breaks applications relying on bpf_probe_read_kernel_str() and bpf_probe_read_str() and their results. Fix this by returning actual number of bytes read. Fixes: 8d92db5c04d1 ("bpf: rework the compat kernel probe handling") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200616050432.1902042-1-andriin@fb.com
* | | Merge tag 'trace-v5.8-rc2' of ↵Linus Torvalds2020-06-253-6/+27
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Four small fixes: - Fix a ringbuffer bug for nested events having time go backwards - Fix a config dependency for boot time tracing to depend on synthetic events instead of histograms. - Fix trigger format parsing to handle multiple spaces - Fix bootconfig to handle failures in multiple events" * tag 'trace-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/boottime: Fix kprobe multiple events tracing: Fix event trigger to accept redundant spaces tracing/boot: Fix config dependency for synthedic event ring-buffer: Zero out time extend if it is nested and not absolute
| * | | tracing/boottime: Fix kprobe multiple eventsSascha Ortmann2020-06-231-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix boottime kprobe events to report and abort after each failure when adding probes. As an example, when we try to set multiprobe kprobe events in bootconfig like this: ftrace.event.kprobes.vfsevents { probes = "vfs_read $arg1 $arg2,, !error! not reported;?", // leads to error "vfs_write $arg1 $arg2" } This will not work as expected. After commit da0f1f4167e3af69e ("tracing/boottime: Fix kprobe event API usage"), the function trace_boot_add_kprobe_event will not produce any error message when adding a probe fails at kprobe_event_gen_cmd_start. Furthermore, we continue to add probes when kprobe_event_gen_cmd_end fails (and kprobe_event_gen_cmd_start did not fail). In this case the function even returns successfully when the last call to kprobe_event_gen_cmd_end is successful. The behaviour of reporting and aborting after failures is not consistent. The function trace_boot_add_kprobe_event now reports each failure and stops adding probes immediately. Link: https://lkml.kernel.org/r/20200618163301.25854-1-sascha.ortmann@stud.uni-hannover.de Cc: stable@vger.kernel.org Cc: linux-kernel@i4.cs.fau.de Co-developed-by: Maximilian Werner <maximilian.werner96@gmail.com> Fixes: da0f1f4167e3 ("tracing/boottime: Fix kprobe event API usage") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Maximilian Werner <maximilian.werner96@gmail.com> Signed-off-by: Sascha Ortmann <sascha.ortmann@stud.uni-hannover.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | tracing: Fix event trigger to accept redundant spacesMasami Hiramatsu2020-06-231-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the event trigger to accept redundant spaces in the trigger input. For example, these return -EINVAL echo " traceon" > events/ftrace/print/trigger echo "traceon if common_pid == 0" > events/ftrace/print/trigger echo "disable_event:kmem:kmalloc " > events/ftrace/print/trigger But these are hard to find what is wrong. To fix this issue, use skip_spaces() to remove spaces in front of actual tokens, and set NULL if there is no token. Link: http://lkml.kernel.org/r/159262476352.185015.5261566783045364186.stgit@devnote2 Cc: Tom Zanussi <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: 85f2b08268c0 ("tracing: Add basic event trigger framework") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | tracing/boot: Fix config dependency for synthedic eventMasami Hiramatsu2020-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 726721a51838 ("tracing: Move synthetic events to a separate file") decoupled synthetic event from histogram, boot-time tracing also has to check CONFIG_SYNTH_EVENT instead of CONFIG_HIST_TRIGGERS. Link: http://lkml.kernel.org/r/159262475441.185015.5300725180746017555.stgit@devnote2 Fixes: 726721a51838 ("tracing: Move synthetic events to a separate file") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | ring-buffer: Zero out time extend if it is nested and not absoluteSteven Rostedt (VMware)2020-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the ring buffer makes events that happen in interrupts that preempt another event have a delta of zero. (Hopefully we can change this soon). But this is to deal with the races of updating a global counter with lockless and nesting functions updating deltas. With the addition of absolute time stamps, the time extend didn't follow this rule. A time extend can happen if two events happen longer than 2^27 nanoseconds appart, as the delta time field in each event is only 27 bits. If that happens, then a time extend is injected with 2^59 bits of nanoseconds to use (18 years). But if the 2^27 nanoseconds happen between two events, and as it is writing the event, an interrupt triggers, it will see the 2^27 difference as well and inject a time extend of its own. But a recent change made the time extend logic not take into account the nesting, and this can cause two time extend deltas to happen moving the time stamp much further ahead than the current time. This gets all reset when the ring buffer moves to the next page, but that can cause time to appear to go backwards. This was observed in a trace-cmd recording, and since the data is saved in a file, with trace-cmd report --debug, it was possible to see that this indeed did happen! bash-52501 110d... 81778.908247: sched_switch: bash:52501 [120] S ==> swapper/110:0 [120] [12770284:0x2e8:64] <idle>-0 110d... 81778.908757: sched_switch: swapper/110:0 [120] R ==> bash:52501 [120] [509947:0x32c:64] TIME EXTEND: delta:306454770 length:0 bash-52501 110.... 81779.215212: sched_swap_numa: src_pid=52501 src_tgid=52388 src_ngid=52501 src_cpu=110 src_nid=2 dst_pid=52509 dst_tgid=52388 dst_ngid=52501 dst_cpu=49 dst_nid=1 [0:0x378:48] TIME EXTEND: delta:306458165 length:0 bash-52501 110dNh. 81779.521670: sched_wakeup: migration/110:565 [0] success=1 CPU:110 [0:0x3b4:40] and at the next page, caused the time to go backwards: bash-52504 110d... 81779.685411: sched_switch: bash:52504 [120] S ==> swapper/110:0 [120] [8347057:0xfb4:64] CPU:110 [SUBBUFFER START] [81779379165886:0x1320000] <idle>-0 110dN.. 81779.379166: sched_wakeup: bash:52504 [120] success=1 CPU:110 [0:0x10:40] <idle>-0 110d... 81779.379167: sched_switch: swapper/110:0 [120] R ==> bash:52504 [120] [1168:0x3c:64] Link: https://lkml.kernel.org/r/20200622151815.345d1bf5@oasis.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: stable@vger.kernel.org Fixes: dc4e2801d400b ("ring-buffer: Redefine the unimplemented RINGBUF_TYPE_TIME_STAMP") Reported-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
* | | | Merge tag 'for-linus-2020-06-24' of ↵Linus Torvalds2020-06-241-1/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread fix from Christian Brauner: "This fixes a regression introduced with 303cc571d107 ("nsproxy: attach to namespaces via pidfds"). The LTP testsuite reported a regression where users would now see EBADF returned instead of EINVAL when an fd was passed that referred to an open file but the file was not a namespace file. Fix this by continuing to report EINVAL and add a regression test" * tag 'for-linus-2020-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: test for setns() EINVAL regression nsproxy: restore EINVAL for non-namespace file descriptor
| * | | | nsproxy: restore EINVAL for non-namespace file descriptorChristian Brauner2020-06-171-1/+1
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The LTP testsuite reported a regression where users would now see EBADF returned instead of EINVAL when an fd was passed that referred to an open file but the file was not a nsfd. Fix this by continuing to report EINVAL. Reported-by: kernel test robot <rong.a.chen@intel.com> Cc: Jan Stancek <jstancek@redhat.com> Cc: Cyril Hrubis <chrubis@suse.cz> Link: https://lore.kernel.org/lkml/20200615085836.GR12456@shao2-debian Fixes: 303cc571d107 ("nsproxy: attach to namespaces via pidfds") Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* | | | Revert "kernel/printk: add kmsg SEEK_CUR handling"Jason A. Donenfeld2020-06-211-10/+0
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 8ece3b3eb576a78d2e67ad4c3a80a39fa6708809. This commit broke userspace. Bash uses ESPIPE to determine whether or not the file should be read using "unbuffered I/O", which means reading 1 byte at a time instead of 128 bytes at a time. I used to use bash to read through kmsg in a really quite nasty way: while read -t 0.1 -r line 2>/dev/null || [[ $? -ne 142 ]]; do echo "SARU $line" done < /dev/kmsg This will show all lines that can fit into the 128 byte buffer, and skip lines that don't. That's pretty awful, but at least it worked. With this change, bash now tries to do 1-byte reads, which means it skips all the lines, which is worse than before. Now, I don't really care very much about this, and I'm already look for a workaround. But I did just spend an hour trying to figure out why my scripts were broken. Either way, it makes no difference to me personally whether this is reverted, but it might be something to consider. If you declare that "trying to read /dev/kmsg with bash is terminally stupid anyway," I might be inclined to agree with you. But do note that bash uses lseek(fd, 0, SEEK_CUR)==>ESPIPE to determine whether or not it's reading from a pipe. Cc: Bruno Meneguele <bmeneg@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'trace-v5.8-rc1' of ↵Linus Torvalds2020-06-208-27/+88
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Have recordmcount work with > 64K sections (to support LTO) - kprobe RCU fixes - Correct a kprobe critical section with missing mutex - Remove redundant arch_disarm_kprobe() call - Fix lockup when kretprobe triggers within kprobe_flush_task() - Fix memory leak in fetch_op_data operations - Fix sleep in atomic in ftrace trace array sample code - Free up memory on failure in sample trace array code - Fix incorrect reporting of function_graph fields in format file - Fix quote within quote parsing in bootconfig - Fix return value of bootconfig tool - Add testcases for bootconfig tool - Fix maybe uninitialized warning in ftrace pid file code - Remove unused variable in tracing_iter_reset() - Fix some typos * tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Fix maybe-uninitialized compiler warning tools/bootconfig: Add testcase for show-command and quotes test tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig tools/bootconfig: Fix to use correct quotes for value proc/bootconfig: Fix to use correct quotes for value tracing: Remove unused event variable in tracing_iter_reset tracing/probe: Fix memleak in fetch_op_data operations trace: Fix typo in allocate_ftrace_ops()'s comment tracing: Make ftrace packed events have align of 1 sample-trace-array: Remove trace_array 'sample-instance' sample-trace-array: Fix sleeping function called from invalid context kretprobe: Prevent triggering kretprobe from within kprobe_flush_task kprobes: Remove redundant arch_disarm_kprobe() call kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex kprobes: Use non RCU traversal APIs on kprobe_tables if possible kprobes: Suppress the suspicious RCU warning on kprobes recordmcount: support >64k sections
| * | | ftrace: Fix maybe-uninitialized compiler warningKaitao Cheng2020-06-171-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During build compiler reports some 'false positive' warnings about variables {'seq_ops', 'filtered_pids', 'other_pids'} may be used uninitialized. This patch silences these warnings. Also delete some useless spaces Link: https://lkml.kernel.org/r/20200529141214.37648-1-pilgrimtao@gmail.com Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | tracing: Remove unused event variable in tracing_iter_resetYangHui2020-06-161-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We do not use the event variable, just remove it. Signed-off-by: YangHui <yanghui.def@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | tracing/probe: Fix memleak in fetch_op_data operationsVamshi K Sthambamkadi2020-06-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kmemleak report: [<57dcc2ca>] __kmalloc_track_caller+0x139/0x2b0 [<f1c45d0f>] kstrndup+0x37/0x80 [<f9761eb0>] parse_probe_arg.isra.7+0x3cc/0x630 [<055bf2ba>] traceprobe_parse_probe_arg+0x2f5/0x810 [<655a7766>] trace_kprobe_create+0x2ca/0x950 [<4fc6a02a>] create_or_delete_trace_kprobe+0xf/0x30 [<6d1c8a52>] trace_run_command+0x67/0x80 [<be812cc0>] trace_parse_run_command+0xa7/0x140 [<aecfe401>] probes_write+0x10/0x20 [<2027641c>] __vfs_write+0x30/0x1e0 [<6a4aeee1>] vfs_write+0x96/0x1b0 [<3517fb7d>] ksys_write+0x53/0xc0 [<dad91db7>] __ia32_sys_write+0x15/0x20 [<da347f64>] do_syscall_32_irqs_on+0x3d/0x260 [<fd0b7e7d>] do_fast_syscall_32+0x39/0xb0 [<ea5ae810>] entry_SYSENTER_32+0xaf/0x102 Post parse_probe_arg(), the FETCH_OP_DATA operation type is overwritten to FETCH_OP_ST_STRING, as a result memory is never freed since traceprobe_free_probe_arg() iterates only over SYMBOL and DATA op types Setup fetch string operation correctly after fetch_op_data operation. Link: https://lkml.kernel.org/r/20200615143034.GA1734@cosmos Cc: stable@vger.kernel.org Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | trace: Fix typo in allocate_ftrace_ops()'s commentWei Yang2020-06-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No functional change, just correct the word. Link: https://lkml.kernel.org/r/20200610033251.31713-1-richard.weiyang@linux.alibaba.com Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | tracing: Make ftrace packed events have align of 1Steven Rostedt (VMware)2020-06-163-7/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using trace-cmd on 5.6-rt for the function graph tracer, the output was corrupted. It gave output like this: funcgraph_entry: func=0xffffffff depth=38982 funcgraph_entry: func=0x1ffffffff depth=16044 funcgraph_exit: func=0xffffffff overrun=0x92539aaf00000000 calltime=0x92539c9900000072 rettime=0x100000072 depth=11084 funcgraph_exit: func=0xffffffff overrun=0x9253946e00000000 calltime=0x92539e2100000072 rettime=0x72 depth=26033702 funcgraph_entry: func=0xffffffff depth=85798 funcgraph_entry: func=0x1ffffffff depth=12044 The reason was because the tracefs/events/ftrace/funcgraph_entry/exit format file was incorrect. The -rt kernel adds more common fields to the trace events. Namely, common_migrate_disable and common_preempt_lazy_count. Each is one byte in size. This changes the alignment of the normal payload. Most events are aligned normally, but the function and function graph events are defined with a "PACKED" macro, that packs their payload. As the offsets displayed in the format files are now calculated by an aligned field, the aligned field for function and function graph events should be 1, not their normal alignment. With aligning of the funcgraph_entry event, the format file has: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:unsigned char common_migrate_disable; offset:8; size:1; signed:0; field:unsigned char common_preempt_lazy_count; offset:9; size:1; signed:0; field:unsigned long func; offset:16; size:8; signed:0; field:int depth; offset:24; size:4; signed:1; But the actual alignment is: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:unsigned char common_migrate_disable; offset:8; size:1; signed:0; field:unsigned char common_preempt_lazy_count; offset:9; size:1; signed:0; field:unsigned long func; offset:12; size:8; signed:0; field:int depth; offset:20; size:4; signed:1; Link: https://lkml.kernel.org/r/20200609220041.2a3b527f@oasis.local.home Cc: stable@vger.kernel.org Fixes: 04ae87a52074e ("ftrace: Rework event_create_dir()") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | kretprobe: Prevent triggering kretprobe from within kprobe_flush_taskJiri Olsa2020-06-161-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ziqian reported lockup when adding retprobe on _raw_spin_lock_irqsave. My test was also able to trigger lockdep output: ============================================ WARNING: possible recursive locking detected 5.6.0-rc6+ #6 Not tainted -------------------------------------------- sched-messaging/2767 is trying to acquire lock: ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0 but task is already holding lock: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(kretprobe_table_locks[i].lock)); lock(&(kretprobe_table_locks[i].lock)); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by sched-messaging/2767: #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50 stack backtrace: CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6 Call Trace: dump_stack+0x96/0xe0 __lock_acquire.cold.57+0x173/0x2b7 ? native_queued_spin_lock_slowpath+0x42b/0x9e0 ? lockdep_hardirqs_on+0x590/0x590 ? __lock_acquire+0xf63/0x4030 lock_acquire+0x15a/0x3d0 ? kretprobe_hash_lock+0x52/0xa0 _raw_spin_lock_irqsave+0x36/0x70 ? kretprobe_hash_lock+0x52/0xa0 kretprobe_hash_lock+0x52/0xa0 trampoline_handler+0xf8/0x940 ? kprobe_fault_handler+0x380/0x380 ? find_held_lock+0x3a/0x1c0 kretprobe_trampoline+0x25/0x50 ? lock_acquired+0x392/0xbc0 ? _raw_spin_lock_irqsave+0x50/0x70 ? __get_valid_kprobe+0x1f0/0x1f0 ? _raw_spin_unlock_irqrestore+0x3b/0x40 ? finish_task_switch+0x4b9/0x6d0 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 The code within the kretprobe handler checks for probe reentrancy, so we won't trigger any _raw_spin_lock_irqsave probe in there. The problem is in outside kprobe_flush_task, where we call: kprobe_flush_task kretprobe_table_lock raw_spin_lock_irqsave _raw_spin_lock_irqsave where _raw_spin_lock_irqsave triggers the kretprobe and installs kretprobe_trampoline handler on _raw_spin_lock_irqsave return. The kretprobe_trampoline handler is then executed with already locked kretprobe_table_locks, and first thing it does is to lock kretprobe_table_locks ;-) the whole lockup path like: kprobe_flush_task kretprobe_table_lock raw_spin_lock_irqsave _raw_spin_lock_irqsave ---> probe triggered, kretprobe_trampoline installed ---> kretprobe_table_locks locked kretprobe_trampoline trampoline_handler kretprobe_hash_lock(current, &head, &flags); <--- deadlock Adding kprobe_busy_begin/end helpers that mark code with fake probe installed to prevent triggering of another kprobe within this code. Using these helpers in kprobe_flush_task, so the probe recursion protection check is hit and the probe is never set to prevent above lockup. Link: http://lkml.kernel.org/r/158927059835.27680.7011202830041561604.stgit@devnote2 Fixes: ef53d9c5e4da ("kprobes: improve kretprobe scalability with hashed locking") Cc: Ingo Molnar <mingo@kernel.org> Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Reported-by: "Ziqian SUN (Zamir)" <zsun@redhat.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | kprobes: Remove redundant arch_disarm_kprobe() callMasami Hiramatsu2020-06-161-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix to remove redundant arch_disarm_kprobe() call in force_unoptimize_kprobe(). This arch_disarm_kprobe() will be invoked if the kprobe is optimized but disabled, but that means the kprobe (optprobe) is unused (and unoptimized) state. In that case, unoptimize_kprobe() puts it in freeing_list and kprobe_optimizer (do_unoptimize_kprobes()) automatically disarm it. Thus this arch_disarm_kprobe() is redundant. Link: http://lkml.kernel.org/r/158927058719.27680.17183632908465341189.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutexMasami Hiramatsu2020-06-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In kprobe_optimizer() kick_kprobe_optimizer() is called without kprobe_mutex, but this can race with other caller which is protected by kprobe_mutex. To fix that, expand kprobe_mutex protected area to protect kick_kprobe_optimizer() call. Link: http://lkml.kernel.org/r/158927057586.27680.5036330063955940456.stgit@devnote2 Fixes: cd7ebe2298ff ("kprobes: Use text_poke_smp_batch for optimizing") Cc: Ingo Molnar <mingo@kernel.org> Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ziqian SUN <zsun@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | kprobes: Use non RCU traversal APIs on kprobe_tables if possibleMasami Hiramatsu2020-06-161-9/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current kprobes uses RCU traversal APIs on kprobe_tables even if it is safe because kprobe_mutex is locked. Make those traversals to non-RCU APIs where the kprobe_mutex is locked. Link: http://lkml.kernel.org/r/158927056452.27680.9710575332163005121.stgit@devnote2 Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | kprobes: Suppress the suspicious RCU warning on kprobesMasami Hiramatsu2020-06-161-1/+2
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Anders reported that the lockdep warns that suspicious RCU list usage in register_kprobe() (detected by CONFIG_PROVE_RCU_LIST.) This is because get_kprobe() access kprobe_table[] by hlist_for_each_entry_rcu() without rcu_read_lock. If we call get_kprobe() from the breakpoint handler context, it is run with preempt disabled, so this is not a problem. But in other cases, instead of rcu_read_lock(), we locks kprobe_mutex so that the kprobe_table[] is not updated. So, current code is safe, but still not good from the view point of RCU. Joel suggested that we can silent that warning by passing lockdep_is_held() to the last argument of hlist_for_each_entry_rcu(). Add lockdep_is_held(&kprobe_mutex) at the end of the hlist_for_each_entry_rcu() to suppress the warning. Link: http://lkml.kernel.org/r/158927055350.27680.10261450713467997503.stgit@devnote2 Reported-by: Anders Roxell <anders.roxell@linaro.org> Suggested-by: Joel Fernandes <joel@joelfernandes.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
* | | Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-blockLinus Torvalds2020-06-191-10/+20
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: - Use import_uuid() where appropriate (Andy) - bcache fixes (Coly, Mauricio, Zhiqiang) - blktrace sparse warnings fix (Jan) - blktrace concurrent setup fix (Luis) - blkdev_get use-after-free fix (Jason) - Ensure all blk-mq maps are updated (Weiping) - Loop invalidate bdev fix (Zheng) * tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block: block: make function 'kill_bdev' static loop: replace kill_bdev with invalidate_bdev partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense block: update hctx map when use multiple maps blktrace: Avoid sparse warnings when assigning q->blk_trace blktrace: break out of blktrace setup on concurrent calls block: Fix use-after-free in blkdev_get() trace/events/block.h: drop kernel-doc for dropped function parameter blk-mq: Remove redundant 'return' statement bcache: pr_info() format clean up in bcache_device_init() bcache: use delayed kworker fo asynchronous devices registration bcache: check and adjust logical block size for backing devices bcache: fix potential deadlock problem in btree_gc_coalesce
| * | | blktrace: Avoid sparse warnings when assigning q->blk_traceJan Kara2020-06-171-11/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mostly for historical reasons, q->blk_trace is assigned through xchg() and cmpxchg() atomic operations. Although this is correct, sparse complains about this because it violates rcu annotations since commit c780e86dd48e ("blktrace: Protect q->blk_trace with RCU") which started to use rcu for accessing q->blk_trace. Furthermore there's no real need for atomic operations anymore since all changes to q->blk_trace happen under q->blk_trace_mutex and since it also makes more sense to check if q->blk_trace is set with the mutex held earlier. So let's just replace xchg() with rcu_replace_pointer() and cmpxchg() with explicit check and rcu_assign_pointer(). This makes the code more efficient and sparse happy. Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blktrace: break out of blktrace setup on concurrent callsLuis Chamberlain2020-06-171-0/+13
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We use one blktrace per request_queue, that means one per the entire disk. So we cannot run one blktrace on say /dev/vda and then /dev/vda1, or just two calls on /dev/vda. We check for concurrent setup only at the very end of the blktrace setup though. If we try to run two concurrent blktraces on the same block device the second one will fail, and the first one seems to go on. However when one tries to kill the first one one will see things like this: The kernel will show these: ``` debugfs: File 'dropped' in directory 'nvme1n1' already present! debugfs: File 'msg' in directory 'nvme1n1' already present! debugfs: File 'trace0' in directory 'nvme1n1' already present! `` And userspace just sees this error message for the second call: ``` blktrace /dev/nvme1n1 BLKTRACESETUP(2) /dev/nvme1n1 failed: 5/Input/output error ``` The first userspace process #1 will also claim that the files were taken underneath their nose as well. The files are taken away form the first process given that when the second blktrace fails, it will follow up with a BLKTRACESTOP and BLKTRACETEARDOWN. This means that even if go-happy process #1 is waiting for blktrace data, we *have* been asked to take teardown the blktrace. This can easily be reproduced with break-blktrace [0] run_0005.sh test. Just break out early if we know we're already going to fail, this will prevent trying to create the files all over again, which we know still exist. [0] https://github.com/mcgrof/break-blktrace Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge branch 'hch' (maccess patches from Christoph Hellwig)Linus Torvalds2020-06-188-23/+25
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge non-faulting memory access cleanups from Christoph Hellwig: "Andrew and I decided to drop the patches implementing your suggested rename of the probe_kernel_* and probe_user_* helpers from -mm as there were way to many conflicts. After -rc1 might be a good time for this as all the conflicts are resolved now" This also adds a type safety checking patch on top of the renaming series to make the subtle behavioral difference between 'get_user()' and 'get_kernel_nofault()' less potentially dangerous and surprising. * emailed patches from Christoph Hellwig <hch@lst.de>: maccess: make get_kernel_nofault() check for minimal type compatibility maccess: rename probe_kernel_address to get_kernel_nofault maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
| * | | maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofaultChristoph Hellwig2020-06-172-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Better describe what these functions do. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofaultChristoph Hellwig2020-06-178-20/+22
| |/ / | | | | | | | | | | | | | | | | | | | | | Better describe what these functions do. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'dma-mapping-5.8-3' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds2020-06-172-9/+8
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | Pull dma-mapping fixes from Christoph Hellwig: "Fixes for the SEV atomic pool (Geert Uytterhoeven and David Rientjes)" * tag 'dma-mapping-5.8-3' of git://git.infradead.org/users/hch/dma-mapping: dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL dma-pool: fix too large DMA pools on medium memory size systems
| * | dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOLDavid Rientjes2020-06-151-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DMA_REMAP is an unnecessary requirement for AMD SEV, which requires DMA_COHERENT_POOL, so avoid selecting it when it is otherwise unnecessary. The only other requirement for DMA coherent pools is DMA_DIRECT_REMAP, so ensure that properly selects the config option when needed. Fixes: 82fef0ad811f ("x86/mm: unencrypted non-blocking DMA allocations use coherent pools") Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Signed-off-by: David Rientjes <rientjes@google.com> Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | dma-pool: fix too large DMA pools on medium memory size systemsGeert Uytterhoeven2020-06-091-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On systems with at least 32 MiB, but less than 32 GiB of RAM, the DMA memory pools are much larger than intended (e.g. 2 MiB instead of 128 KiB on a 256 MiB system). Fix this by correcting the calculation of the number of GiBs of RAM in the system. Invert the order of the min/max operations, to keep on calculating in pages until the last step, which aids readability. Fixes: 1d659236fb43c4d2 ("dma-pool: scale the default DMA coherent pool size with memory capacity") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | | tracing/probe: Replace zero-length array with flexible-arrayGustavo A. R. Silva2020-06-151-1/+1
| |/ |/| | | | | | | | | | | | | | | | | | | | | There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
* | Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of ↵Linus Torvalds2020-06-141-1/+14
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://github.com/micah-morton/linux Pull SafeSetID update from Micah Morton: "Add additional LSM hooks for SafeSetID SafeSetID is capable of making allow/deny decisions for set*uid calls on a system, and we want to add similar functionality for set*gid calls. The work to do that is not yet complete, so probably won't make it in for v5.8, but we are looking to get this simple patch in for v5.8 since we have it ready. We are planning on the rest of the work for extending the SafeSetID LSM being merged during the v5.9 merge window" * tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux: security: Add LSM hooks to set*gid syscalls
| * | security: Add LSM hooks to set*gid syscallsThomas Cedeno2020-06-141-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SafeSetID LSM uses the security_task_fix_setuid hook to filter set*uid() syscalls according to its configured security policy. In preparation for adding analagous support in the LSM for set*gid() syscalls, we add the requisite hook here. Tested by putting print statements in the security_task_fix_setgid hook and seeing them get hit during kernel boot. Signed-off-by: Thomas Cedeno <thomascedeno@google.com> Signed-off-by: Micah Morton <mortonm@chromium.org>