summaryrefslogtreecommitdiffstats
path: root/kernel/bpf/memalloc.c
Commit message (Collapse)AuthorAgeFilesLines
* bpf: Remove unnecessary cpu == 0 check in memallocYonghong Song2024-01-041-1/+1
| | | | | | | | | | | | | | | | | After merging the patch set [1] to reduce memory usage for bpf_global_percpu_ma, Alexei found a redundant check (cpu == 0) in function bpf_mem_alloc_percpu_unit_init() ([2]). Indeed, the check is unnecessary since c->unit_size will be all NULL or all non-NULL for all cpus before for_each_possible_cpu() loop. Removing the check makes code less confusing. [1] https://lore.kernel.org/all/20231222031729.1287957-1-yonghong.song@linux.dev/ [2] https://lore.kernel.org/all/20231222031745.1289082-1-yonghong.song@linux.dev/ Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240104165744.702239-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Use smaller low/high marks for percpu allocationYonghong Song2024-01-031-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | Currently, refill low/high marks are set with the assumption of normal non-percpu memory allocation. For example, for an allocation size 256, for non-percpu memory allocation, low mark is 32 and high mark is 96, resulting in the batch allocation of 48 elements and the allocated memory will be 48 * 256 = 12KB for this particular cpu. Assuming an 128-cpu system, the total memory consumption across all cpus will be 12K * 128 = 1.5MB memory. This might be okay for non-percpu allocation, but may not be good for percpu allocation, which will consume 1.5MB * 128 = 192MB memory in the worst case if every cpu has a chance of memory allocation. In practice, percpu allocation is very rare compared to non-percpu allocation. So let us have smaller low/high marks which can avoid unnecessary memory consumption. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231222031755.1289671-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Refill only one percpu element in memallocYonghong Song2024-01-031-4/+9
| | | | | | | | | | | | | | Typically for percpu map element or data structure, once allocated, most operations are lookup or in-place update. Deletion are really rare. Currently, for percpu data strcture, 4 elements will be refilled if the size is <= 256. Let us just do with one element for percpu data. For example, for size 256 and 128 cpus, the potential saving will be 3 * 256 * 128 * 128 = 12MB. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031750.1289290-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Allow per unit prefill for non-fix-size percpu memory allocatorYonghong Song2024-01-031-1/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 41a5db8d8161 ("Add support for non-fix-size percpu mem allocation") added support for non-fix-size percpu memory allocation. Such allocation will allocate percpu memory for all buckets on all cpus and the memory consumption is in the order to quadratic. For example, let us say, 4 cpus, unit size 16 bytes, so each cpu has 16 * 4 = 64 bytes, with 4 cpus, total will be 64 * 4 = 256 bytes. Then let us say, 8 cpus with the same unit size, each cpu has 16 * 8 = 128 bytes, with 8 cpus, total will be 128 * 8 = 1024 bytes. So if the number of cpus doubles, the number of memory consumption will be 4 times. So for a system with large number of cpus, the memory consumption goes up quickly with quadratic order. For example, for 4KB percpu allocation, 128 cpus. The total memory consumption will 4KB * 128 * 128 = 64MB. Things will become worse if the number of cpus is bigger (e.g., 512, 1024, etc.) In Commit 41a5db8d8161, the non-fix-size percpu memory allocation is done in boot time, so for system with large number of cpus, the initial percpu memory consumption is very visible. For example, for 128 cpu system, the total percpu memory allocation will be at least (16 + 32 + 64 + 96 + 128 + 196 + 256 + 512 + 1024 + 2048 + 4096) * 128 * 128 = ~138MB. which is pretty big. It will be even bigger for larger number of cpus. Note that the current prefill also allocates 4 entries if the unit size is less than 256. So on top of 138MB memory consumption, this will add more consumption with 3 * (16 + 32 + 64 + 96 + 128 + 196 + 256) * 128 * 128 = ~38MB. Next patch will try to reduce this memory consumption. Later on, Commit 1fda5bb66ad8 ("bpf: Do not allocate percpu memory at init stage") moved the non-fix-size percpu memory allocation to bpf verificaiton stage. Once a particular bpf_percpu_obj_new() is called by bpf program, the memory allocator will try to fill in the cache with all sizes, causing the same amount of percpu memory consumption as in the boot stage. To reduce the initial percpu memory consumption for non-fix-size percpu memory allocation, instead of filling the cache with all supported allocation sizes, this patch intends to fill the cache only for the requested size. As typically users will not use large percpu data structure, this can save memory significantly. For example, the allocation size is 64 bytes with 128 cpus. Then total percpu memory amount will be 64 * 128 * 128 = 1MB, much less than previous 138MB. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231222031745.1289082-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Add objcg to bpf_mem_allocYonghong Song2024-01-031-5/+6
| | | | | | | | | | | | The objcg is a bpf_mem_alloc level property since all bpf_mem_cache's are with the same objcg. This patch made such a property explicit. The next patch will use this property to save and restore objcg for percpu unit allocator. Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031739.1288590-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Avoid unnecessary extra percpu memory allocationYonghong Song2024-01-031-1/+3
| | | | | | | | | | | | | | | | | | Currently, for percpu memory allocation, say if the user requests allocation size to be 32 bytes, the actually calculated size will be 40 bytes and it further rounds to 64 bytes, and eventually 64 bytes are allocated, wasting 32-byte memory. Change bpf_mem_alloc() to calculate the cache index based on the user-provided allocation size so unnecessary extra memory can be avoided. Suggested-by: Hou Tao <houtao1@huawei.com> Acked-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231222031734.1288400-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Use c->unit_size to select target cache during freeHou Tao2023-12-201-94/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At present, bpf memory allocator uses check_obj_size() to ensure that ksize() of allocated pointer is equal with the unit_size of used bpf_mem_cache. Its purpose is to prevent bpf_mem_free() from selecting a bpf_mem_cache which has different unit_size compared with the bpf_mem_cache used for allocation. But as reported by lkp, the return value of ksize() or kmalloc_size_roundup() may change due to slab merge and it will lead to the warning report in check_obj_size(). The reported warning happened as follows: (1) in bpf_mem_cache_adjust_size(), kmalloc_size_roundup(96) returns the object_size of kmalloc-96 instead of kmalloc-cg-96. The object_size of kmalloc-96 is 96, so size_index for 96 is not adjusted accordingly. (2) the object_size of kmalloc-cg-96 is adjust from 96 to 128 due to slab merge in __kmem_cache_alias(). For SLAB, SLAB_HWCACHE_ALIGN is enabled by default for kmalloc slab, so align is 64 and size is 128 for kmalloc-cg-96. SLUB has a similar merge logic, but its object_size will not be changed, because its align is 8 under x86-64. (3) when unit_alloc() does kmalloc_node(96, __GFP_ACCOUNT, node), ksize() returns 128 instead of 96 for the returned pointer. (4) the warning in check_obj_size() is triggered. Considering the slab merge can happen in anytime (e.g, a slab created in a new module), the following case is also possible: during the initialization of bpf_global_ma, there is no slab merge and ksize() for a 96-bytes object returns 96. But after that a new slab created by a kernel module is merged to kmalloc-cg-96 and the object_size of kmalloc-cg-96 is adjust from 96 to 128 (which is possible for x86-64 + CONFIG_SLAB, because its alignment requirement is 64 for 96-bytes slab). So soon or later, when bpf_global_ma frees a 96-byte-sized pointer which is allocated from bpf_mem_cache with unit_size=96, bpf_mem_free() will free the pointer through a bpf_mem_cache in which unit_size is 128, because the return value of ksize() changes. The warning for the mismatch will be triggered again. A feasible fix is introducing similar APIs compared with ksize() and kmalloc_size_roundup() to return the actually-allocated size instead of size which may change due to slab merge, but it will introduce unnecessary dependency on the implementation details of mm subsystem. As for now the pointer of bpf_mem_cache is saved in the 8-bytes area (or 4-bytes under 32-bit host) above the returned pointer, using unit_size in the saved bpf_mem_cache to select the target cache instead of inferring the size from the pointer itself. Beside no extra dependency on mm subsystem, the performance for bpf_mem_free_rcu() is also improved as shown below. Before applying the patch, the performances of bpf_mem_alloc() and bpf_mem_free_rcu() on 8-CPUs VM with one producer are as follows: kmalloc : alloc 11.69 ± 0.28M/s free 29.58 ± 0.93M/s percpu : alloc 14.11 ± 0.52M/s free 14.29 ± 0.99M/s After apply the patch, the performance for bpf_mem_free_rcu() increases 9% and 146% for kmalloc memory and per-cpu memory respectively: kmalloc: alloc 11.01 ± 0.03M/s free 32.42 ± 0.48M/s percpu: alloc 12.84 ± 0.12M/s free 35.24 ± 0.23M/s After the fixes, there is no need to adjust size_index to fix the mismatch between allocation and free, so remove it as well. Also return NULL instead of ZERO_SIZE_PTR for zero-sized alloc in bpf_mem_alloc(), because there is no bpf_mem_cache pointer saved above ZERO_SIZE_PTR. Fixes: 9077fc228f09 ("bpf: Use kmalloc_size_roundup() to adjust size_index") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/bpf/202310302113.9f8fe705-oliver.sang@intel.com Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231216131052.27621-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()Hou Tao2023-11-261-0/+2
| | | | | | | | | | | | | bpf_mem_cache_alloc_flags() may call __alloc() directly when there is no free object in free list, but it doesn't initialize the allocation hint for the returned pointer. It may lead to bad memory dereference when freeing the pointer, so fix it by initializing the allocation hint. Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231111043821.2258513-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Add more WARN_ON_ONCE checks for mismatched alloc and freeHou Tao2023-10-261-0/+4
| | | | | | | | | | | | | | | There are two possible mismatched alloc and free cases in BPF memory allocator: 1) allocate from cache X but free by cache Y with a different unit_size 2) allocate from per-cpu cache but free by kmalloc cache or vice versa So add more WARN_ON_ONCE checks in free_bulk() and __free_by_rcu() to spot these mismatched alloc and free early. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231021014959.3563841-1-houtao@huaweicloud.com
* bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}()Hou Tao2023-10-201-2/+14
| | | | | | | | | | | | | | | For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is allocated by kmalloc() and its size is fixed (16-bytes on x86-64). So no matter which cache allocates the dynamic per-cpu area, on x86-64 cache[2] will always be used to free the per-cpu area. Fix the unbalance by checking whether the bpf memory allocator is per-cpu or not and use pcpu_alloc_size() instead of ksize() to find the correct cache for per-cpu free. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Re-enable unit_size checking for global per-cpu allocatorHou Tao2023-10-201-10/+12
| | | | | | | | | With pcpu_alloc_size() in place, check whether or not the size of the dynamic per-cpu area is matched with unit_size. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2023-10-051-25/+19
|\ | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. No conflicts (or adjacent changes of note). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * bpf: Use kmalloc_size_roundup() to adjust size_indexHou Tao2023-09-301-25/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit d52b59315bf5 ("bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as reported by Nathan, the adjustment is not enough, because __kmalloc_minalign() also decides the minimal alignment of slab object as shown in new_kmalloc_cache() and its value may be greater than KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM). Instead of invoking __kmalloc_minalign() in bpf subsystem to find the maximal alignment, just using kmalloc_size_roundup() directly to get the corresponding slab object size for each allocation size. If these two sizes are unmatched, adjust size_index to select a bpf_mem_cache with unit_size equal to the object_size of the underlying slab cache for the allocation size. Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/ Signed-off-by: Hou Tao <houtao1@huawei.com> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com> Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni2023-09-211-4/+90
|\| | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| * bpf: Skip unit_size checking for global per-cpu allocatorHou Tao2023-09-151-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | For global per-cpu allocator, the size of free object in free list doesn't match with unit_size and now there is no way to get the size of per-cpu pointer saved in free object, so just skip the checking. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/bpf/20230913133436.0eeec4cb@canb.auug.org.au/ Signed-off-by: Hou Tao <houtao1@huawei.com> Tested-by: Biju Das <biju.das.jz@bp.renesas.com> Link: https://lore.kernel.org/r/20230913135943.3137292-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Ensure unit_size is matched with slab cache object sizeHou Tao2023-09-111-2/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | Add extra check in bpf_mem_alloc_init() to ensure the unit_size of bpf_mem_cache is matched with the object_size of underlying slab cache. If these two sizes are unmatched, print a warning once and return -EINVAL in bpf_mem_alloc_init(), so the mismatch can be found early and the potential issue can be prevented. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230908133923.2675053-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Don't prefill for unused bpf_mem_cacheHou Tao2023-09-111-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | When the unit_size of a bpf_mem_cache is unmatched with the object_size of the underlying slab cache, the bpf_mem_cache will not be used, and the allocation will be redirected to a bpf_mem_cache with a bigger unit_size instead, so there is no need to prefill for these unused bpf_mem_caches. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230908133923.2675053-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZEHou Tao2023-09-111-0/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following warning was reported when running "./test_progs -a link_api -a linked_list" on a RISC-V QEMU VM: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill Modules linked in: bpf_testmod(OE) CPU: 3 PID: 261 Comm: test_progs- ... 6.5.0-rc5-01743-gdcb152bb8328 #2 Hardware name: riscv-virtio,qemu (DT) epc : bpf_mem_refill+0x1fc/0x206 ra : irq_work_single+0x68/0x70 epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20 gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600 t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70 s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000 a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060 a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049 s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000 s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30 s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000 t5 : ff6000008fd28278 t6 : 0000000000040000 [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206 [<ffffffff8015fe84>] irq_work_single+0x68/0x70 [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36 [<ffffffff8015fefa>] irq_work_run+0x38/0x66 [<ffffffff8000828a>] handle_IPI+0x3a/0xb4 [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8 [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88 [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36 [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86 [<ffffffff812b6904>] do_irq+0x66/0x98 ---[ end trace 0000000000000000 ]--- The warning is due to WARN_ON_ONCE(tgt->unit_size != c->unit_size) in free_bulk(). The direct reason is that a object is allocated and freed by bpf_mem_caches with different unit_size. The root cause is that KMALLOC_MIN_SIZE is 64 and there is no 96-bytes slab cache in the specific VM. When linked_list test allocates a 72-bytes object through bpf_obj_new(), bpf_global_ma will allocate it from a bpf_mem_cache with 96-bytes unit_size, but this bpf_mem_cache is backed by 128-bytes slab cache. When the object is freed, bpf_mem_free() uses ksize() to choose the corresponding bpf_mem_cache. Because the object is allocated from 128-bytes slab cache, ksize() returns 128, bpf_mem_free() chooses a 128-bytes bpf_mem_cache to free the object and triggers the warning. A similar warning will also be reported when using CONFIG_SLAB instead of CONFIG_SLUB in a x86-64 kernel. Because CONFIG_SLUB defines KMALLOC_MIN_SIZE as 8 but CONFIG_SLAB defines KMALLOC_MIN_SIZE as 32. An alternative fix is to use kmalloc_size_round() in bpf_mem_alloc() to choose a bpf_mem_cache which has the same unit_size with the backing slab cache, but it may introduce performance degradation, so fix the warning by adjusting the indexes in size_index according to the value of KMALLOC_MIN_SIZE just like setup_kmalloc_cache_index_table() does. Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Björn Töpel <bjorn@kernel.org> Closes: https://lore.kernel.org/bpf/87jztjmmy4.fsf@all.your.base.are.belong.to.us Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230908133923.2675053-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | bpf: Enable IRQ after irq_work_raise() completes in unit_free{_rcu}()Hou Tao2023-09-081-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both unit_free() and unit_free_rcu() invoke irq_work_raise() to free freed objects back to slab and the invocation may also be preempted by unit_alloc() and unit_alloc() may return NULL unexpectedly as shown in the following case: task A task B unit_free() // high_watermark = 48 // free_cnt = 49 after free irq_work_raise() // mark irq work as IRQ_WORK_PENDING irq_work_claim() // task B preempts task A unit_alloc() // free_cnt = 48 after alloc // does unit_alloc() 32-times ...... // free_cnt = 16 unit_alloc() // free_cnt = 15 after alloc // irq work is already PENDING, // so just return irq_work_raise() // does unit_alloc() 15-times ...... // free_cnt = 0 unit_alloc() // free_cnt = 0 before alloc return NULL Fix it by enabling IRQ after irq_work_raise() completes. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | bpf: Enable IRQ after irq_work_raise() completes in unit_alloc()Hou Tao2023-09-081-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing stress test for qp-trie, bpf_mem_alloc() returned NULL unexpectedly because all qp-trie operations were initiated from bpf syscalls and there was still available free memory. bpf_obj_new() has the same problem as shown by the following selftest. The failure is due to the preemption. irq_work_raise() will invoke irq_work_claim() first to mark the irq work as pending and then inovke __irq_work_queue_local() to raise an IPI. So when the current task which is invoking irq_work_raise() is preempted by other task, unit_alloc() may return NULL for preemption task as shown below: task A task B unit_alloc() // low_watermark = 32 // free_cnt = 31 after alloc irq_work_raise() // mark irq work as IRQ_WORK_PENDING irq_work_claim() // task B preempts task A unit_alloc() // free_cnt = 30 after alloc // irq work is already PENDING, // so just return irq_work_raise() // does unit_alloc() 30-times ...... unit_alloc() // free_cnt = 0 before alloc return NULL Fix it by enabling IRQ after irq_work_raise() completes. An alternative fix is using preempt_{disable|enable}_notrace() pair, but it may have extra overhead. Another feasible fix is to only disable preemption or IRQ before invoking irq_work_queue() and enable preemption or IRQ after the invocation completes, but it can't handle the case when c->low_watermark is 1. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | bpf: Add support for non-fix-size percpu mem allocationYonghong Song2023-09-081-8/+6
|/ | | | | | | | | | | This is needed for later percpu mem allocation when the allocation is done by bpf program. For such cases, a global bpf_global_percpu_ma is added where a flexible allocation size is needed. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152734.1995725-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Non-atomically allocate freelist during prefillYiFei Zhu2023-07-281-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In internal testing of test_maps, we sometimes observed failures like: test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *): Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed. where the errno is ENOMEM. After some troubleshooting and enabling the warnings, we saw: [ 91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left [ 91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G N 6.1.38-smp-DEV #7 [ 91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023 [ 91.304721] Call Trace: [ 91.304724] <TASK> [ 91.304730] [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88 [ 91.304737] [<ffffffffa7ef83f8>] dump_stack+0x10/0x18 [ 91.304738] [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870 [ 91.304741] [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20 [ 91.304743] [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0 [ 91.304746] [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0 [ 91.304747] [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650 [ 91.304750] [<ffffffffa751d6e0>] map_create+0x140/0x2e0 [ 91.304752] [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0 [ 91.304753] [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30 [ 91.304754] [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80 [ 91.304756] [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd This makes sense, because in atomic context, percpu allocation would not create new chunks; it would only create in non-atomic contexts. And if during prefill all precpu chunks are full, -ENOMEM would happen immediately upon next unit_alloc. Prefill phase does not actually run in atomic context, so we can use this fact to allocate non-atomically with GFP_KERNEL instead of GFP_NOWAIT. This avoids the immediate -ENOMEM. GFP_NOWAIT has to be used in unit_alloc when bpf program runs in atomic context. Even if bpf program runs in non-atomic context, in most cases, rcu read lock is enabled for the program so GFP_NOWAIT is still needed. This is often also the case for BPF_MAP_UPDATE_ELEM syscalls. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230728043359.3324347-1-zhuyifei@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: work around -Wuninitialized warningArnd Bergmann2023-07-251-6/+6
| | | | | | | | | | | | | | | | | | | | | Splitting these out into separate helper functions means that we actually pass an uninitialized variable into another function call if dec_active() happens to not be inlined, and CONFIG_PREEMPT_RT is disabled: kernel/bpf/memalloc.c: In function 'add_obj_to_free_list': kernel/bpf/memalloc.c:200:9: error: 'flags' is used uninitialized [-Werror=uninitialized] 200 | dec_active(c, flags); Avoid this by passing the flags by reference, so they either get initialized and dereferenced through a pointer, or the pointer never gets accessed at all. Fixes: 18e027b1c7c6d ("bpf: Factor out inc/dec of active flag into helpers.") Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20230725202653.2905259-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Add object leak check.Hou Tao2023-07-121-0/+35
| | | | | | | | | | The object leak check is cheap. Do it unconditionally to spot difficult races in bpf_mem_alloc. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230706033447.54696-15-alexei.starovoitov@gmail.com
* bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().Alexei Starovoitov2023-07-121-3/+126
| | | | | | | | | | | | | | | | | | | | | Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu(). Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into per-cpu free list the _rcu() flavor waits for RCU grace period and then moves objects into free_by_rcu_ttrace list where they are waiting for RCU task trace grace period to be freed into slab. The life cycle of objects: alloc: dequeue free_llist free: enqeueu free_llist free_rcu: enqueue free_by_rcu -> waiting_for_gp free_llist above high watermark -> free_by_rcu_ttrace after RCU GP waiting_for_gp -> free_by_rcu_ttrace free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-13-alexei.starovoitov@gmail.com
* bpf: Allow reuse from waiting_for_gp_ttrace list.Alexei Starovoitov2023-07-121-6/+10
| | | | | | | | | alloc_bulk() can reuse elements from free_by_rcu_ttrace. Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230706033447.54696-10-alexei.starovoitov@gmail.com
* bpf: Add a hint to allocated objects.Alexei Starovoitov2023-07-121-19/+31
| | | | | | | | | | | | | | | | To address OOM issue when one cpu is allocating and another cpu is freeing add a target bpf_mem_cache hint to allocated objects and when local cpu free_llist overflows free to that bpf_mem_cache. The hint addresses the OOM while maintaining the same performance for common case when alloc/free are done on the same cpu. Note that do_call_rcu_ttrace() now has to check 'draining' flag in one more case, since do_call_rcu_ttrace() is called not only for current cpu. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-9-alexei.starovoitov@gmail.com
* bpf: Change bpf_mem_cache draining process.Alexei Starovoitov2023-07-121-9/+9
| | | | | | | | | | | | | | | | | | | The next patch will introduce cross-cpu llist access and existing irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't guarantee that llist on cpu A are empty. The free_bulk() on cpu B might add objects back to llist of cpu A. Add 'bool draining' flag. The modified sequence looks like: for_each_cpu: WRITE_ONCE(c->draining, true); // do_call_rcu_ttrace() won't be doing call_rcu() any more irq_work_sync(); // wait for irq_work callback (free_bulk) to finish drain_mem_cache(); // free all objects rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-8-alexei.starovoitov@gmail.com
* bpf: Further refactor alloc_bulk().Alexei Starovoitov2023-07-121-12/+18
| | | | | | | | | | | | | In certain scenarios alloc_bulk() might be taking free objects mainly from free_by_rcu_ttrace list. In such case get_memcg() and set_active_memcg() are redundant, but they show up in perf profile. Split the loop and only set memcg when allocating from slab. No performance difference in this patch alone, but it helps in combination with further patches. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-7-alexei.starovoitov@gmail.com
* bpf: Factor out inc/dec of active flag into helpers.Alexei Starovoitov2023-07-121-12/+18
| | | | | | | | | | Factor out local_inc/dec_return(&c->active) into helpers. No functional changes. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-6-alexei.starovoitov@gmail.com
* bpf: Refactor alloc_bulk().Alexei Starovoitov2023-07-121-20/+26
| | | | | | | | | | Factor out inner body of alloc_bulk into separate helper. No functional changes. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-5-alexei.starovoitov@gmail.com
* bpf: Let free_all() return the number of freed elements.Alexei Starovoitov2023-07-121-2/+6
| | | | | | | | | | | | | | | | | | | | Let free_all() helper return the number of freed elements. It's not used in this patch, but helps in debug/development of bpf_mem_alloc. For example this diff for __free_rcu(): - free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size); + printk("cpu %d freed %d objs after tasks trace\n", raw_smp_processor_id(), + free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size)); would show how busy RCU tasks trace is. In artificial benchmark where one cpu is allocating and different cpu is freeing the RCU tasks trace won't be able to keep up and the list of objects would keep growing from thousands to millions and eventually OOMing. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-4-alexei.starovoitov@gmail.com
* bpf: Simplify code of destroy_mem_alloc() with kmemdup().Alexei Starovoitov2023-07-121-5/+2
| | | | | | | | | Use kmemdup() to simplify the code. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-3-alexei.starovoitov@gmail.com
* bpf: Rename few bpf_mem_alloc fields.Alexei Starovoitov2023-07-121-28/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | Rename: - struct rcu_head rcu; - struct llist_head free_by_rcu; - struct llist_head waiting_for_gp; - atomic_t call_rcu_in_progress; + struct llist_head free_by_rcu_ttrace; + struct llist_head waiting_for_gp_ttrace; + struct rcu_head rcu_ttrace; + atomic_t call_rcu_ttrace_in_progress; ... - static void do_call_rcu(struct bpf_mem_cache *c) + static void do_call_rcu_ttrace(struct bpf_mem_cache *c) to better indicate intended use. The 'tasks trace' is shortened to 'ttrace' to reduce verbosity. No functional changes. Later patches will add free_by_rcu/waiting_for_gp fields to be used with normal RCU. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-2-alexei.starovoitov@gmail.com
* bpf: Factor out a common helper free_all()Hou Tao2023-06-061-15/+16
| | | | | | | | | Factor out a common helper free_all() to free all normal elements or per-cpu elements on a lock-less list. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230606035310.4026145-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Add a few bpf mem allocator functionsMartin KaFai Lau2023-03-251-9/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a few bpf mem allocator functions which will be used in the bpf_local_storage in a later patch. bpf_mem_cache_alloc_flags(..., gfp_t flags) is added. When the flags == GFP_KERNEL, it will fallback to __alloc(..., GFP_KERNEL). bpf_local_storage knows its running context is sleepable (GFP_KERNEL) and provides a better guarantee on memory allocation. bpf_local_storage has some uncommon cases that its selem cannot be reused immediately. It handles its own rcu_head and goes through a rcu_trace gp and then free it. bpf_mem_cache_raw_free() is added for direct free purpose without leaking the LLIST_NODE_SZ internal knowledge. During free time, the 'struct bpf_mem_alloc *ma' is no longer available. However, the caller should know if it is percpu memory or not and it can call different raw_free functions. bpf_local_storage does not support percpu value, so only the non-percpu 'bpf_mem_cache_raw_free()' is added in this patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20230322215246.1675516-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Zeroing allocated object from slab in bpf memory allocatorHou Tao2023-02-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the freed element in bpf memory allocator may be immediately reused, for htab map the reuse will reinitialize special fields in map value (e.g., bpf_spin_lock), but lookup procedure may still access these special fields, and it may lead to hard-lockup as shown below: NMI backtrace for cpu 16 CPU: 16 PID: 2574 Comm: htab.bin Tainted: G L 6.1.0+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), RIP: 0010:queued_spin_lock_slowpath+0x283/0x2c0 ...... Call Trace: <TASK> copy_map_value_locked+0xb7/0x170 bpf_map_copy_value+0x113/0x3c0 __sys_bpf+0x1c67/0x2780 __x64_sys_bpf+0x1c/0x20 do_syscall_64+0x30/0x60 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ...... </TASK> For htab map, just like the preallocated case, these is no need to initialize these special fields in map value again once these fields have been initialized. For preallocated htab map, these fields are initialized through __GFP_ZERO in bpf_map_area_alloc(), so do the similar thing for non-preallocated htab in bpf memory allocator. And there is no need to use __GFP_ZERO for per-cpu bpf memory allocator, because __alloc_percpu_gfp() does it implicitly. Fixes: 0fd7c5d43339 ("bpf: Optimize call_rcu in non-preallocated hash map.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230215082132.3856544-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: allow to disable bpf map memory accountingYafang Shao2023-02-101-1/+2
| | | | | | | | | | | | | We can simply set root memcg as the map's memcg to disable bpf memory accounting. bpf_map_area_alloc is a little special as it gets the memcg from current rather than from the map, so we need to disable GFP_ACCOUNT specifically for it. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Link: https://lore.kernel.org/r/20230210154734.4416-4-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Fix off-by-one error in bpf_mem_cache_idx()Hou Tao2023-01-181-1/+1
| | | | | | | | | | | According to the definition of sizes[NUM_CACHES], when the size passed to bpf_mem_cache_size() is 256, it should return 6 instead 7. Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230118084630.3750680-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Skip rcu_barrier() if rcu_trace_implies_rcu_gp() is trueHou Tao2022-12-081-1/+9
| | | | | | | | | | | | | | | If there are pending rcu callback, free_mem_alloc() will use rcu_barrier_tasks_trace() and rcu_barrier() to wait for the pending __free_rcu_tasks_trace() and __free_rcu() callback. If rcu_trace_implies_rcu_gp() is true, there will be no pending __free_rcu(), so it will be OK to skip rcu_barrier() as well. Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221209010947.3130477-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Reuse freed element in free_by_rcu during allocationHou Tao2022-12-081-3/+18
| | | | | | | | | | | | | | | | | | | | When there are batched freeing operations on a specific CPU, part of the freed elements ((high_watermark - lower_watermark) / 2 + 1) will be indirectly moved into waiting_for_gp list through free_by_rcu list. After call_rcu_in_progress becomes false again, the remaining elements in free_by_rcu list will be moved to waiting_for_gp list by the next invocation of free_bulk(). However if the expiration of RCU tasks trace grace period is relatively slow, none element in free_by_rcu list will be moved. So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to allocate a new object, in alloc_bulk() just check whether or not there is freed element in free_by_rcu list and reuse it if available. Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221209010947.3130477-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2022-10-241-2/+16
|\ | | | | | | | | | | | | | | include/linux/net.h a5ef058dc4d9 ("net: introduce and use custom sockopt socket flag") e993ffe3da4b ("net: flag sockets supporting msghdr originated zerocopy") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * bpf: Use __llist_del_all() whenever possbile during memory drainingHou Tao2022-10-211-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | Except for waiting_for_gp list, there are no concurrent operations on free_by_rcu, free_llist and free_llist_extra lists, so use __llist_del_all() instead of llist_del_all(). waiting_for_gp list can be deleted by RCU callback concurrently, so still use llist_del_all(). Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221021114913.60508-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Wait for busy refill_work when destroying bpf memory allocatorHou Tao2022-10-211-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A busy irq work is an unfinished irq work and it can be either in the pending state or in the running state. When destroying bpf memory allocator, refill_work may be busy for PREEMPT_RT kernel in which irq work is invoked in a per-CPU RT-kthread. It is also possible for kernel with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host or mips) and irq work is inovked in timer interrupt. The busy refill_work leads to various issues. The obvious one is that there will be concurrent operations on free_by_rcu and free_list between irq work and memory draining. Another one is call_rcu_in_progress will not be reliable for the checking of pending RCU callback because do_call_rcu() may have not been invoked by irq work yet. The other is there will be use-after-free if irq work is freed before the callback of irq work is invoked as shown below: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0 Oops: 0010 [#1] PREEMPT_RT SMP CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:0x0 Code: Unable to access opcode bytes at 0xffffffffffffffd6. RSP: 0018:ffffadc080293e78 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000 RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388 ...... Call Trace: <TASK> irq_work_single+0x24/0x60 irq_work_run_list+0x24/0x30 run_irq_workd+0x23/0x30 smpboot_thread_fn+0x203/0x300 kthread+0x126/0x150 ret_from_fork+0x1f/0x30 </TASK> Considering the ease of concurrency handling, no overhead for irq_work_sync() under non-PREEMPT_RT kernel and has-irq-work-interrupt kernel and the short wait time used for irq_work_sync() under PREEMPT_RT (When running two test_maps on PREEMPT_RT kernel and 72-cpus host, the max wait time is about 8ms and the 99th percentile is 10us), just using irq_work_sync() to wait for busy refill_work to complete before memory draining and memory freeing. Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.") Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221021114913.60508-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocatorHou Tao2022-10-181-5/+10
|/ | | | | | | | | | | | | | | | The memory free logic in bpf memory allocator chains a RCU Tasks Trace grace period and a normal RCU grace period one after the other, so it can ensure that both sleepable and non-sleepable programs have finished. With the introduction of rcu_trace_implies_rcu_gp(), __free_rcu_tasks_trace() can check whether or not a normal RCU grace period has also passed after a RCU Tasks Trace grace period has passed. If it is true, freeing these elements directly, else freeing through call_rcu(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221014113946.965131-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Check whether or not node is NULL before free it in free_bulkHou Tao2022-09-201-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | llnode could be NULL if there are new allocations after the checking of c-free_cnt > c->high_watermark in bpf_mem_refill() and before the calling of __llist_del_first() in free_bulk (e.g. a PREEMPT_RT kernel or allocation in NMI context). And it will incur oops as shown below: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT_RT SMP CPU: 39 PID: 373 Comm: irq_work/39 Tainted: G W 6.0.0-rc6-rt9+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:bpf_mem_refill+0x66/0x130 ...... Call Trace: <TASK> irq_work_single+0x24/0x60 irq_work_run_list+0x24/0x30 run_irq_workd+0x18/0x20 smpboot_thread_fn+0x13f/0x2c0 kthread+0x121/0x140 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Simply fixing it by checking whether or not llnode is NULL in free_bulk(). Fixes: 8d5a8011b35d ("bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20220919144811.3570825-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Replace __ksize with ksize.Alexei Starovoitov2022-09-061-1/+1
| | | | | | | __ksize() was made private. Use ksize() instead. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.Alexei Starovoitov2022-09-051-16/+64
| | | | | | | | | | | | | | | | | | User space might be creating and destroying a lot of hash maps. Synchronous rcu_barrier-s in a destruction path of hash map delay freeing of hash buckets and other map memory and may cause artificial OOM situation under stress. Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc: - remove rcu_barrier from hash map, since htab doesn't use call_rcu directly and there are no callback to wait for. - bpf_mem_alloc has call_rcu_in_progress flag that indicates pending callbacks. Use it to avoid barriers in fast path. - When barriers are needed copy bpf_mem_alloc into temp structure and wait for rcu barrier-s in the worker to let the rest of hash map freeing to proceed. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com
* bpf: Remove usage of kmem_cache from bpf_mem_cache.Alexei Starovoitov2022-09-051-36/+14
| | | | | | | | | | | | | | | | | | | For bpf_mem_cache based hash maps the following stress test: for (i = 1; i <= 512; i <<= 1) for (j = 1; j <= 1 << 18; j <<= 1) fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, i, j, 2, 0); creates many kmem_cache-s that are not mergeable in debug kernels and consume unnecessary amount of memory. Turned out bpf_mem_cache's free_list logic does batching well, so usage of kmem_cache for fixes size allocations doesn't bring any performance benefits vs normal kmalloc. Hence get rid of kmem_cache in bpf_mem_cache. That saves memory, speeds up map create/destroy operations, while maintains hash map update/delete performance. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220902211058.60789-16-alexei.starovoitov@gmail.com
* bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.Alexei Starovoitov2022-09-051-1/+14
| | | | | | | | | | | | | Use call_rcu_tasks_trace() to wait for sleepable progs to finish. Then use call_rcu() to wait for normal progs to finish and finally do free_one() on each element when freeing objects into global memory pool. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220902211058.60789-14-alexei.starovoitov@gmail.com