path: root/mm/kfence/core.c
Commit history (each entry lists the commit subject, author, date, files changed and lines changed, followed by the commit message)
* mm: introduce slabobj_ext to support slab object extensions (Suren Baghdasaryan, 2024-04-25, 1 file, -7/+7)

    Currently slab pages can store only vectors of obj_cgroup pointers in
    page->memcg_data. Introduce a slabobj_ext structure to allow more data to
    be stored for each slab object. Wrap obj_cgroup into slabobj_ext to support
    the current functionality while allowing slabobj_ext to be extended in the
    future.

    Link: https://lkml.kernel.org/r/20240321163705.3067592-7-surenb@google.com
    Signed-off-by: Suren Baghdasaryan <surenb@google.com>
    Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Tested-by: Kees Cook <keescook@chromium.org>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alex Gaynor <alex.gaynor@gmail.com>
    Cc: Alice Ryhl <aliceryhl@google.com>
    Cc: Andreas Hindborg <a.hindborg@samsung.com>
    Cc: Benno Lossin <benno.lossin@proton.me>
    Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
    Cc: Boqun Feng <boqun.feng@gmail.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Dennis Zhou <dennis@kernel.org>
    Cc: Gary Guo <gary@garyguo.net>
    Cc: Kent Overstreet <kent.overstreet@linux.dev>
    Cc: Miguel Ojeda <ojeda@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
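A minimal sketch of the slabobj_ext idea described above (the exact upstream definition may differ; shown only to make the wrapping concrete):

    /* One slabobj_ext per slab object, stored in a per-slab vector; it wraps
     * what used to be a bare obj_cgroup pointer so more fields can be added.
     */
    struct slabobj_ext {
    #ifdef CONFIG_MEMCG_KMEM
            struct obj_cgroup *objcg;       /* previously lived in page->memcg_data */
    #endif
    } __aligned(8);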
* KFENCE: cleanup kfence_guarded_alloc() after CONFIG_SLAB removal (Vlastimil Babka, 2023-12-05, 1 file, -4/+0)

    Some struct slab fields are initialized differently for SLAB and SLUB, so
    we can simplify now that SLUB is the only remaining allocator.

    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Marco Elver <elver@google.com>
    Acked-by: David Rientjes <rientjes@google.com>
    Tested-by: David Rientjes <rientjes@google.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
* Merge tag 'loongarch-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson (Linus Torvalds, 2023-09-08, 1 file, -2/+3)

    Pull LoongArch updates from Huacai Chen:
     - Allow usage of LSX/LASX in the kernel, and use them for SIMD-optimized
       RAID5/RAID6 routines
     - Add Loongson Binary Translation (LBT) extension support
     - Add basic KGDB & KDB support
     - Add building with kcov coverage
     - Add KFENCE (Kernel Electric-Fence) support
     - Add KASAN (Kernel Address Sanitizer) support
     - Some bug fixes and other small changes
     - Update the default config file

    * tag 'loongarch-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: (25 commits)
      LoongArch: Update Loongson-3 default config file
      LoongArch: Add KASAN (Kernel Address Sanitizer) support
      LoongArch: Simplify the processing of jumping new kernel for KASLR
      kasan: Add (pmd|pud)_init for LoongArch zero_(pud|p4d)_populate process
      kasan: Add __HAVE_ARCH_SHADOW_MAP to support arch specific mapping
      LoongArch: Add KFENCE (Kernel Electric-Fence) support
      LoongArch: Get partial stack information when providing regs parameter
      LoongArch: mm: Add page table mapped mode support for virt_to_page()
      kfence: Defer the assignment of the local variable addr
      LoongArch: Allow building with kcov coverage
      LoongArch: Provide kaslr_offset() to get kernel offset
      LoongArch: Add basic KGDB & KDB support
      LoongArch: Add Loongson Binary Translation (LBT) extension support
      raid6: Add LoongArch SIMD recovery implementation
      raid6: Add LoongArch SIMD syndrome calculation
      LoongArch: Add SIMD-optimized XOR routines
      LoongArch: Allow usage of LSX/LASX in the kernel
      LoongArch: Define symbol 'fault' as a local label in fpu.S
      LoongArch: Adjust {copy, clear}_user exception handler behavior
      LoongArch: Use static defined zero page rather than allocated
      ...
| * kfence: Defer the assignment of the local variable addr (Enze Li, 2023-09-06, 1 file, -2/+3)

    The LoongArch architecture is different from other architectures: it needs
    to update __kfence_pool during arch_kfence_init_pool(). Move the assignment
    of the local variable addr in kfence_init_pool() so that the case of
    __kfence_pool being updated in arch_kfence_init_pool() is supported.

    Acked-by: Marco Elver <elver@google.com>
    Signed-off-by: Enze Li <lienze@kylinos.cn>
    Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
* | mm: kfence: allocate kfence_metadata at runtime (Peng Zhang, 2023-08-18, 1 file, -37/+86)

    kfence_metadata is currently a static array. For the purpose of allocating
    a scalable __kfence_pool, we first change it to runtime allocation of
    metadata. Since the size of a kfence_metadata object is 1160 bytes, we can
    save at least 72 pages (with the default 256 objects) when KFENCE is not
    enabled.

    [akpm@linux-foundation.org: restore newline, per Marco]
    Link: https://lkml.kernel.org/r/20230718073019.52513-1-zhangpeng.00@bytedance.com
    Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
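A minimal sketch of the change described above, assuming the metadata is sized from CONFIG_KFENCE_NUM_OBJECTS and allocated from memblock at early init (helper name is illustrative, not the upstream function):

    static struct kfence_metadata *kfence_metadata __read_mostly;

    static bool __init kfence_alloc_metadata_sketch(void)
    {
            /* ~1160 bytes per object; only paid when KFENCE is brought up. */
            size_t size = CONFIG_KFENCE_NUM_OBJECTS * sizeof(struct kfence_metadata);

            kfence_metadata = memblock_alloc(PAGE_ALIGN(size), PAGE_SIZE);
            return kfence_metadata != NULL;
    }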
* Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2023-04-27, 1 file, -21/+49)
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. 
- Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
| * mm: kfence: improve the performance of __kfence_alloc() and __kfence_free() (Peng Zhang, 2023-04-18, 1 file, -21/+49)

    In __kfence_alloc() and __kfence_free(), we set and check the canary.
    Assuming the size of the object is close to 0, nearly 4k memory accesses
    are required because the canary is set and checked byte by byte.

    The canary is currently defined like this:

        KFENCE_CANARY_PATTERN(addr) ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))

    Observe that the canary only depends on the lower three bits of the
    address, so every 8 bytes of canary are the same. We can access the canary
    8 bytes at a time instead of byte by byte, reducing the nearly 4k memory
    accesses to 4k/8.

    Using the bcc tool funclatency to measure the latency of __kfence_alloc()
    and __kfence_free(), the numbers (with the latency distribution removed)
    are posted below. Though different object sizes will have an impact on the
    measurement, we ignore it for now and assume the average object size is
    roughly equal.

    Before patching:
        __kfence_alloc: avg = 5055 nsecs, total: 5515252 nsecs, count: 1091
        __kfence_free:  avg = 5319 nsecs, total: 9735130 nsecs, count: 1830
    After patching:
        __kfence_alloc: avg = 3597 nsecs, total: 6428491 nsecs, count: 1787
        __kfence_free:  avg = 3046 nsecs, total: 3415390 nsecs, count: 1121

    The numbers indicate a ~30% - ~40% performance improvement.

    Link: https://lkml.kernel.org/r/20230403122738.6006-1-zhangpeng.00@bytedance.com
    Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
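A minimal sketch of the 8-byte canary comparison described above (macro and function names are illustrative, little-endian layout is assumed; the upstream helpers differ in detail):

    /* The canary byte depends only on the low 3 address bits, so an aligned
     * 64-bit word of canary bytes is a compile-time constant and can be
     * compared with one load instead of eight.
     */
    #define CANARY_BYTE(addr)  ((u8)0xaa ^ (u8)((unsigned long)(addr) & 0x7))
    #define CANARY_WORD        (0xaaaaaaaaaaaaaaaaULL ^ 0x0706050403020100ULL)

    static bool canary_ok_sketch(const void *start, size_t len)
    {
            const u64 *p = start;           /* assume start/len are 8-byte aligned */
            size_t i;

            for (i = 0; i < len / sizeof(u64); i++) {
                    if (p[i] != CANARY_WORD)
                            return false;   /* corrupted: fall back to byte-wise reporting */
            }
            return true;
    }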
* | Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux (Linus Torvalds, 2023-04-25, 1 file, -0/+4)
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "ACPI: - Improve error reporting when failing to manage SDEI on AGDI device removal Assembly routines: - Improve register constraints so that the compiler can make use of the zero register instead of moving an immediate #0 into a GPR - Allow the compiler to allocate the registers used for CAS instructions CPU features and system registers: - Cleanups to the way in which CPU features are identified from the ID register fields - Extend system register definition generation to handle Enum types when defining shared register fields - Generate definitions for new _EL2 registers and add new fields for ID_AA64PFR1_EL1 - Allow SVE to be disabled separately from SME on the kernel command-line Tracing: - Support for "direct calls" in ftrace, which enables BPF tracing for arm64 Kdump: - Don't bother unmapping the crashkernel from the linear mapping, which then allows us to use huge (block) mappings and reduce TLB pressure when a crashkernel is loaded. Memory management: - Try again to remove data cache invalidation from the coherent DMA allocation path - Simplify the fixmap code by mapping at page granularity - Allow the kfence pool to be allocated early, preventing the rest of the linear mapping from being forced to page granularity Perf and PMU: - Move CPU PMU code out to drivers/perf/ where it can be reused by the 32-bit ARM architecture when running on ARMv8 CPUs - Fix race between CPU PMU probing and pKVM host de-privilege - Add support for Apple M2 CPU PMU - Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event dynamically, depending on what the CPU actually supports - Minor fixes and cleanups to system PMU drivers Stack tracing: - Use the XPACLRI instruction to strip PAC from pointers, rather than rolling our own function in C - Remove redundant PAC removal for toolchains that handle this in their builtins - Make backtracing more resilient in the face of instrumentation Miscellaneous: - Fix single-step with KGDB - Remove harmless warning when 'nokaslr' is passed on the kernel command-line - Minor fixes and cleanups across the board" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits) KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege arm64: kexec: include reboot.h arm64: delete dead code in this_cpu_set_vectors() arm64/cpufeature: Use helper macro to specify ID register for capabilites drivers/perf: hisi: add NULL check for name drivers/perf: hisi: Remove redundant initialized of pmu->name arm64/cpufeature: Consistently use symbolic constants for min_field_value arm64/cpufeature: Pull out helper for CPUID register definitions arm64/sysreg: Convert HFGITR_EL2 to automatic generation ACPI: AGDI: Improve error reporting for problems during .remove() arm64: kernel: Fix kernel warning when nokaslr is passed to commandline perf/arm-cmn: Fix port detection for CMN-700 arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step arm64: move PAC masks to <asm/pointer_auth.h> arm64: use XPACLRI to strip PAC arm64: avoid 
redundant PAC stripping in __builtin_return_address() arm64/sme: Fix some comments of ARM SME arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2() arm64/signal: Use system_supports_tpidr2() to check TPIDR2 arm64/idreg: Don't disable SME when disabling SVE ...
| * mm,kfence: decouple kfence from page granularity mapping judgement (Zhenhua Huang, 2023-03-27, 1 file, -0/+4)

    KFENCE only needs its pool to be mapped at page granularity if it is
    initialized early. The previous check was a bit over-protective. From [1],
    Mark suggested to "just map the KFENCE region a page granularity". So
    decouple it from the check and do page granularity mapping for the kfence
    pool only. Note that late initialization of the kfence pool still requires
    page granularity mapping.

    Page granularity mapping in theory costs more memory (2M per 1GB) on the
    arm64 platform. As tested on QEMU (emulated 1GB RAM) with gki_defconfig,
    also turning off rodata protection:

    Before:
        [root@liebao ]# cat /proc/meminfo
        MemTotal:         999484 kB
    After:
        [root@liebao ]# cat /proc/meminfo
        MemTotal:        1001480 kB

    To implement this, also relocate the kfence pool allocation before the
    linear mapping is set up: arm64_kfence_alloc_pool() allocates the physical
    address, and __kfence_pool is set after the linear mapping is set up.

    LINK: [1] https://lore.kernel.org/linux-arm-kernel/Y+IsdrvDNILA59UN@FVFF77S0Q05N/
    Suggested-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Link: https://lore.kernel.org/r/1679066974-690-1-git-send-email-quic_zhenhuah@quicinc.com
    Signed-off-by: Will Deacon <will@kernel.org>
* | mm: kfence: fix handling discontiguous page (Muchun Song, 2023-03-28, 1 file, -2/+2)

    The struct pages could be discontiguous when the kfence pool is allocated
    via alloc_contig_pages() with CONFIG_SPARSEMEM and
    !CONFIG_SPARSEMEM_VMEMMAP. This may result in setting PG_slab and
    memcg_data on an arbitrary address (which may not be used as a struct
    page), and in the worst case might corrupt the kernel. So the iteration
    should use nth_page().

    Link: https://lkml.kernel.org/r/20230323025003.94447-1-songmuchun@bytedance.com
    Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: SeongJae Park <sjpark@amazon.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
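A minimal sketch of the nth_page() iteration described above (helper name and callback are illustrative):

    /* With CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP the struct pages
     * backing the pool need not be contiguous, so plain pointer arithmetic
     * on the first struct page is unsafe.
     */
    static void for_each_pool_page_sketch(void (*fn)(struct page *page))
    {
            struct page *first = virt_to_page((void *)__kfence_pool);
            unsigned long i;

            for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++)
                    fn(nth_page(first, i));         /* not: first + i */
    }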
* | mm: kfence: fix PG_slab and memcg_data clearing (Muchun Song, 2023-03-28, 1 file, -15/+15)

    PG_slab and memcg_data are not reset when KFENCE fails to initialize the
    kfence pool at runtime, which results in a "Bad page state" report when the
    kfence pool is freed back to the buddy allocator. The check for whether the
    page is a compound head page seems unnecessary, since we already guarantee
    this when allocating the kfence pool; remove the check to simplify the
    code.

    Link: https://lkml.kernel.org/r/20230320030059.20189-1-songmuchun@bytedance.com
    Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: Marco Elver <elver@google.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: SeongJae Park <sjpark@amazon.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: kfence: fix using kfence_metadata without initialization in show_object() (Muchun Song, 2023-03-23, 1 file, -2/+8)

    The variable kfence_metadata is initialized in kfence_init_pool(), so it is
    left uninitialized if kfence is disabled after booting. In this case,
    kfence_metadata will be used (e.g. the ->lock and ->state fields) without
    initialization when reading /sys/kernel/debug/kfence/objects, and there
    will be a warning if CONFIG_DEBUG_SPINLOCK is enabled. Fix it by creating
    the debugfs files only when necessary.

    Link: https://lkml.kernel.org/r/20230315034441.44321-1-songmuchun@bytedance.com
    Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Tested-by: Marco Elver <elver@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Jann Horn <jannh@google.com>
    Cc: SeongJae Park <sjpark@amazon.de>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2022-12-13, 1 file, -11/+1)
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - More userfaultfs work from Peter Xu - Several convert-to-folios series from Sidhartha Kumar and Huang Ying - Some filemap cleanups from Vishal Moola - David Hildenbrand added the ability to selftest anon memory COW handling - Some cpuset simplifications from Liu Shixin - Addition of vmalloc tracing support by Uladzislau Rezki - Some pagecache folioifications and simplifications from Matthew Wilcox - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages - DAMON cleanups and tuneups from SeongJae Park - Tony Luck fixed the handling of COW faults against poisoned pages - Peter Xu utilized the PTE marker code for handling swapin errors - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand - zram support for multiple compression streams from Sergey Senozhatsky - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway - Mel Gorman altered the page allocator so that local IRQs can remnain enabled during per-cpu page allocations - Vishal Moola removed the try_to_release_page() wrapper - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache - David Hildenbrand did some cleanup and repair work on KSM COW breaking - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range() - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect - Christoph Hellwig has removed the .writepage() method from several filesystems. 
They only need .writepages() - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting - David Hildenbrand has fixed some of our MM selftests for 32-bit machines - Many singleton patches, as usual * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits) mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio mm: mmu_gather: allow more than one batch of delayed rmaps mm: fix typo in struct pglist_data code comment kmsan: fix memcpy tests mm: add cond_resched() in swapin_walk_pmd_entry() mm: do not show fs mm pc for VM_LOCKONFAULT pages selftests/vm: ksm_functional_tests: fixes for 32bit selftests/vm: cow: fix compile warning on 32bit selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem mm,thp,rmap: fix races between updates of subpages_mapcount mm: memcg: fix swapcached stat accounting mm: add nodes= arg to memory.reclaim mm: disable top-tier fallback to reclaim on proactive reclaim selftests: cgroup: make sure reclaim target memcg is unprotected selftests: cgroup: refactor proactive reclaim code to reclaim_until() mm: memcg: fix stale protection of reclaim target memcg mm/mmap: properly unaccount memory on mas_preallocate() failure omfs: remove ->writepage jfs: remove ->writepage ...
| * mm/kfence: remove hung_task cruft (Pavankumar Kondeti, 2022-11-30, 1 file, -11/+1)

    Commit fdf756f71271 ("sched: Fix more TASK_state comparisons") makes
    hung_task no longer monitor TASK_IDLE tasks, so the special handling to
    work around hung_task warnings is not required anymore.

    Link: https://lkml.kernel.org/r/1667986006-25420-1-git-send-email-quic_pkondeti@quicinc.com
    Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | treewide: use get_random_u32_below() instead of deprecated function (Jason A. Donenfeld, 2022-11-18, 1 file, -2/+2)

    This is a simple mechanical transformation done by:

        @@
        expression E;
        @@
        - prandom_u32_max
        + get_random_u32_below
          (E)

    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
    Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
    Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
    Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2022-10-10, 1 file, -13/+9)
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. 
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
| * mm: kfence: convert to DEFINE_SEQ_ATTRIBUTE (Liu Shixin, 2022-10-03, 1 file, -13/+2)

    Use the DEFINE_SEQ_ATTRIBUTE helper macro to simplify the code.

    Link: https://lkml.kernel.org/r/20220909083140.3592919-1-liushixin2@huawei.com
    Signed-off-by: Liu Shixin <liushixin2@huawei.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Tested-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
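A minimal sketch of the DEFINE_SEQ_ATTRIBUTE() pattern (the seq_operations callbacks shown are assumed to exist already; the macro expects a '<name>_sops' and generates '<name>_open()' and '<name>_fops', including .release = seq_release):

    static const struct seq_operations objects_sops = {
            .start = start_object,
            .next  = next_object,
            .stop  = stop_object,
            .show  = show_object,
    };
    DEFINE_SEQ_ATTRIBUTE(objects);  /* emits objects_open() and objects_fops */

    /* ... later, e.g. in a debugfs init function:
     *      debugfs_create_file("objects", 0400, dir, NULL, &objects_fops);
     */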
| * kfence: add sysfs interface to disable kfence for selected slabs. (Imran Khan, 2022-09-11, 1 file, -0/+7)

    By default a kfence allocation can happen for any slab object whose size is
    up to PAGE_SIZE, as long as that allocation is the first allocation after
    expiration of the kfence sample interval. But in certain debugging
    scenarios we may be interested in debugging corruptions involving some
    specific slab objects like dentry or ext4_* etc. In such cases limiting
    kfence to allocations involving only specific slab objects will increase
    the probability of catching the issue, since the kfence pool will not be
    consumed by other slab objects.

    This patch introduces a sysfs interface
    '/sys/kernel/slab/<name>/skip_kfence' to disable kfence for specific slabs.
    Having the interface work in this way does not impact the current/default
    behavior of kfence and allows us to use kfence for specific slabs (when
    needed) as well. The decision to skip/use kfence is taken depending on
    whether kmem_cache.flags has the (newly introduced) SLAB_SKIP_KFENCE flag
    set or not.

    Link: https://lkml.kernel.org/r/20220814195353.2540848-1-imran.f.khan@oracle.com
    Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
    Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
    Reviewed-by: Marco Elver <elver@google.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
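A minimal sketch of how the SLAB_SKIP_KFENCE gate works (wrapper name is illustrative); a cache opts out with, e.g., 'echo 1 > /sys/kernel/slab/dentry/skip_kfence':

    static void *kfence_alloc_gate_sketch(struct kmem_cache *s, size_t size, gfp_t flags)
    {
            /* Cache opted out via /sys/kernel/slab/<name>/skip_kfence */
            if (s->flags & SLAB_SKIP_KFENCE)
                    return NULL;            /* fall back to the normal slab path */

            return __kfence_alloc(s, size, flags);
    }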
* | kfence: use better stack hash seed (Jason A. Donenfeld, 2022-09-29, 1 file, -1/+1)

    As of the prior commit, the RNG will have incorporated both a cycle counter
    value and RDRAND, in addition to various other environmental noise.
    Therefore, using get_random_u32() will supply a stronger seed than simply
    using random_get_entropy().

    N.B.: random_get_entropy() should be considered an internal API of random.c
    and not generally consumed.

    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Marco Elver <elver@google.com>
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2022-08-05, 1 file, -2/+2)
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
| * mm: kfence: pass a pointer to virt_to_page() (Linus Walleij, 2022-07-17, 1 file, -2/+2)

    Functions that work on a pointer to virtual memory, such as virt_to_pfn()
    and users of that function such as virt_to_page(), are supposed to be
    passed a pointer to virtual memory, ideally a (void *) or other pointer.
    However, since many architectures implement virt_to_pfn() as a macro, this
    function becomes polymorphic and accepts both an (unsigned long) and a
    (void *). If we instead implement a proper virt_to_pfn(void *addr)
    function, the following happens (occurred on arch/arm):

        mm/kfence/core.c:558:30: warning: passing argument 1 of 'virt_to_pfn'
        makes pointer from integer without a cast [-Wint-conversion]

    In one case we can refer to __kfence_pool directly (and that is a proper
    (char *) pointer) and in the other call site we use an explicit cast.

    Link: https://lkml.kernel.org/r/20220630084124.691207-4-linus.walleij@linaro.org
    Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
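A minimal sketch of the cast described above (helper name is illustrative; KFENCE metadata stores the object address as an unsigned long):

    static struct page *metadata_to_page_sketch(const struct kfence_metadata *meta)
    {
            /* meta->addr is an unsigned long, hence the explicit cast */
            return virt_to_page((void *)meta->addr);
    }

    /* __kfence_pool itself is a char *, so it can be passed directly:
     *      virt_to_page(__kfence_pool);
     */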
* | mm: kfence: apply kmemleak_ignore_phys on early allocated pool (Yee Lee, 2022-07-18, 1 file, -9/+9)

    This patch solves two issues.

    (1) The pool allocated by memblock needs to be unregistered from kmemleak
        scanning. Apply kmemleak_ignore_phys() to replace the original
        kmemleak_free(), as its address is now stored in the phys tree.

    (2) The pool allocated late by the page allocator doesn't need to be
        unregistered. Move the freeing operation out of its call path.

    Link: https://lkml.kernel.org/r/20220628113714.7792-2-yee.lee@mediatek.com
    Fixes: 0c24e061196c21d5 ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA")
    Signed-off-by: Yee Lee <yee.lee@mediatek.com>
    Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
    Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
    Suggested-by: Marco Elver <elver@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/kfence: select random number before taking raw lock (Jason A. Donenfeld, 2022-06-16, 1 file, -2/+5)

    The RNG uses vanilla spinlocks, not raw spinlocks, so kfence should pick
    its random numbers before taking its raw spinlocks. This also has the nice
    effect of doing less work inside the lock. It should fix a splat that Geert
    saw with CONFIG_PROVE_RAW_LOCK_NESTING:

        dump_backtrace.part.0+0x98/0xc0
        show_stack+0x14/0x28
        dump_stack_lvl+0xac/0xec
        dump_stack+0x14/0x2c
        __lock_acquire+0x388/0x10a0
        lock_acquire+0x190/0x2c0
        _raw_spin_lock_irqsave+0x6c/0x94
        crng_make_state+0x148/0x1e4
        _get_random_bytes.part.0+0x4c/0xe8
        get_random_u32+0x4c/0x140
        __kfence_alloc+0x460/0x5c4
        kmem_cache_alloc_trace+0x194/0x1dc
        __kthread_create_on_node+0x5c/0x1a8
        kthread_create_on_node+0x58/0x7c
        printk_start_kthread.part.0+0x34/0xa8
        printk_activate_kthreads+0x4c/0x54
        do_one_initcall+0xec/0x278
        kernel_init_freeable+0x11c/0x214
        kernel_init+0x24/0x124
        ret_from_fork+0x10/0x20

    Link: https://lkml.kernel.org/r/20220609123319.17576-1-Jason@zx2c4.com
    Fixes: d4150779e60f ("random32: use real rng for non-deterministic randomness")
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
    Reviewed-by: Marco Elver <elver@google.com>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Cc: John Ogness <john.ogness@linutronix.de>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
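A minimal sketch of the reordering described above (variable use is simplified; kfence_freelist_lock is the raw spinlock guarding the freelist):

    static void guarded_alloc_sketch(void)
    {
            unsigned long flags;
            /* Pick randomness before the raw lock: get_random_u32() may take a
             * plain (non-raw) spinlock internally.
             */
            const bool random_right = get_random_u32() & 1;

            raw_spin_lock_irqsave(&kfence_freelist_lock, flags);
            if (random_right) {
                    /* ... place the object against the right guard page ... */
            }
            /* ... freelist manipulation, no RNG calls under the raw lock ... */
            raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
    }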
* Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2022-05-26, 1 file, -1/+39)
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Almost all of MM here. A few things are still getting finished off, reviewed, etc. - Yang Shi has improved the behaviour of khugepaged collapsing of readonly file-backed transparent hugepages. - Johannes Weiner has arranged for zswap memory use to be tracked and managed on a per-cgroup basis. - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime enablement of the recent huge page vmemmap optimization feature. - Baolin Wang contributes a series to fix some issues around hugetlb pagetable invalidation. - Zhenwei Pi has fixed some interactions between hwpoisoned pages and virtualization. - Tong Tiangen has enabled the use of the presently x86-only page_table_check debugging feature on arm64 and riscv. - David Vernet has done some fixup work on the memcg selftests. - Peter Xu has taught userfaultfd to handle write protection faults against shmem- and hugetlbfs-backed files. - More DAMON development from SeongJae Park - adding online tuning of the feature and support for monitoring of fixed virtual address ranges. Also easier discovery of which monitoring operations are available. - Nadav Amit has done some optimization of TLB flushing during mprotect(). - Neil Brown continues to labor away at improving our swap-over-NFS support. - David Hildenbrand has some fixes to anon page COWing versus get_user_pages(). - Peng Liu fixed some errors in the core hugetlb code. - Joao Martins has reduced the amount of memory consumed by device-dax's compound devmaps. - Some cleanups of the arch-specific pagemap code from Anshuman Khandual. - Muchun Song has found and fixed some errors in the TLB flushing of transparent hugepages. - Roman Gushchin has done more work on the memcg selftests. ... and, of course, many smaller fixes and cleanups. Notably, the customary million cleanup serieses from Miaohe Lin" * tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits) mm: kfence: use PAGE_ALIGNED helper selftests: vm: add the "settings" file with timeout variable selftests: vm: add "test_hmm.sh" to TEST_FILES selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests selftests: vm: add migration to the .gitignore selftests/vm/pkeys: fix typo in comment ksm: fix typo in comment selftests: vm: add process_mrelease tests Revert "mm/vmscan: never demote for memcg reclaim" mm/kfence: print disabling or re-enabling message include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace" include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion" mm: fix a potential infinite loop in start_isolate_page_range() MAINTAINERS: add Muchun as co-maintainer for HugeTLB zram: fix Kconfig dependency warning mm/shmem: fix shmem folio swapoff hang cgroup: fix an error handling path in alloc_pagecache_max_30M() mm: damon: use HPAGE_PMD_SIZE tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate nodemask.h: fix compilation error with GCC12 ...
| * mm/kfence: print disabling or re-enabling message (Jackie Liu, 2022-05-25, 1 file, -1/+5)

    Print a message when KFENCE is disabled or re-enabled, so the status change
    shows up in dmesg and is recorded by syslog. Also, set kfence_enabled to
    false only when needed.

    Link: https://lkml.kernel.org/r/20220518073105.3160335-1-liu.yun@linux.dev
    Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
    Co-developed-by: Marco Elver <elver@google.com>
    Signed-off-by: Marco Elver <elver@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * kfence: enable check kfence canary on panic via boot param (huangshaobo, 2022-05-13, 1 file, -0/+34)

    Out-of-bounds accesses that aren't caught by a guard page will result in
    corruption of canary memory. In pathological cases, where an object has
    certain alignment requirements, an out-of-bounds access might never be
    caught by the guard page. Such corruptions, however, are only detected on
    kfree() normally. If the bug causes the kernel to panic before kfree(),
    KFENCE has no opportunity to report the issue. Such corruptions may also
    indicate failing memory or other faults.

    To provide some more information in such cases, add the option to check
    canary bytes on panic. This might help narrow the search for the panic
    cause; but, due to only having the allocation stack trace, such reports are
    difficult to use to diagnose an issue alone. In most cases, such reports
    are not actionable on their own, which is why this is an opt-in feature
    (disabled by default).

    [akpm@linux-foundation.org: add __read_mostly, per Marco]
    Link: https://lkml.kernel.org/r/20220425022456.44300-1-huangshaobo6@huawei.com
    Signed-off-by: huangshaobo <huangshaobo6@huawei.com>
    Suggested-by: chenzefeng <chenzefeng2@huawei.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Xiaoming Ni <nixiaoming@huawei.com>
    Cc: Wangbing <wangbing6@huawei.com>
    Cc: Jubin Zhong <zhongjubin@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
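A minimal sketch of the opt-in panic hook described above (shapes follow the patch description; treat names as illustrative):

    static bool kfence_check_on_panic __read_mostly;
    module_param_named(check_on_panic, kfence_check_on_panic, bool, 0444);

    static int kfence_check_canary_callback(struct notifier_block *nb,
                                            unsigned long reason, void *arg)
    {
            if (kfence_check_on_panic)
                    kfence_check_all_canary();      /* report corrupted canaries now */
            return NOTIFY_OK;
    }

    static struct notifier_block kfence_check_canary_notifier = {
            .notifier_call = kfence_check_canary_callback,
    };

    /* registered during init:
     *      atomic_notifier_chain_register(&panic_notifier_list,
     *                                     &kfence_check_canary_notifier);
     */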
* | mm/kfence: reset PG_slab and memcg_data before freeing __kfence_pool (Hyeonggon Yoo, 2022-05-09, 1 file, -0/+10)

    When kfence fails to initialize the kfence pool, it frees the pool, but it
    does not reset memcg_data and the PG_slab flag. Below is a BUG caused by
    this. Fix it by resetting memcg_data and the PG_slab flag before freeing.

        [ 0.089149] BUG: Bad page state in process swapper/0 pfn:3d8e06
        [ 0.089149] page:ffffea46cf638180 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3d8e06
        [ 0.089150] memcg:ffffffff94a475d1
        [ 0.089150] flags: 0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
        [ 0.089151] raw: 0017ffffc0000200 ffffea46cf638188 ffffea46cf638188 0000000000000000
        [ 0.089152] raw: 0000000000000000 0000000000000000 00000000ffffffff ffffffff94a475d1
        [ 0.089152] page dumped because: page still charged to cgroup
        [ 0.089153] Modules linked in:
        [ 0.089153] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B W 5.18.0-rc1+ #965
        [ 0.089154] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
        [ 0.089154] Call Trace:
        [ 0.089155]  <TASK>
        [ 0.089155]  dump_stack_lvl+0x49/0x5f
        [ 0.089157]  dump_stack+0x10/0x12
        [ 0.089158]  bad_page.cold+0x63/0x94
        [ 0.089159]  check_free_page_bad+0x66/0x70
        [ 0.089160]  __free_pages_ok+0x423/0x530
        [ 0.089161]  __free_pages_core+0x8e/0xa0
        [ 0.089162]  memblock_free_pages+0x10/0x12
        [ 0.089164]  memblock_free_late+0x8f/0xb9
        [ 0.089165]  kfence_init+0x68/0x92
        [ 0.089166]  start_kernel+0x789/0x992
        [ 0.089167]  x86_64_start_reservations+0x24/0x26
        [ 0.089168]  x86_64_start_kernel+0xa9/0xaf
        [ 0.089170]  secondary_startup_64_no_verify+0xd5/0xdb
        [ 0.089171]  </TASK>

    Link: https://lkml.kernel.org/r/YnPG3pQrqfcgOlVa@hyeyoo
    Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
    Fixes: 8f0b36497303 ("mm: kfence: fix objcgs vector allocation")
    Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm, kfence: support kmem_dump_obj() for KFENCE objects (Marco Elver, 2022-04-15, 1 file, -21/+0)

    Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
    producing garbage data due to the object not actually being maintained by
    SLAB or SLUB.

    Fix this by implementing __kfence_obj_info() that copies relevant
    information to struct kmem_obj_info when the object was allocated by
    KFENCE; this is called by a common kmem_obj_info(), which also calls the
    slab/slub/slob specific variant now called __kmem_obj_info().

    For completeness, kmem_dump_obj() now displays if the object was allocated
    by KFENCE.

    Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
    Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
    Fixes: b89fb5ef0ce6 ("mm, kfence: insert KFENCE hooks for SLUB")
    Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB")
    Signed-off-by: Marco Elver <elver@google.com>
    Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz> [slab]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
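A minimal sketch of the common dispatch described above (simplified):

    /* Common entry point: try KFENCE first, then the slab allocator's helper. */
    void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
    {
            if (__kfence_obj_info(kpp, object, slab))
                    return;                 /* object lives in the KFENCE pool */
            __kmem_obj_info(kpp, object, slab);
    }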
* mm: kfence: fix objcgs vector allocation (Muchun Song, 2022-04-01, 1 file, -1/+10)

    If a kfence object is allocated to be used for an objcgs vector, that slot
    of the pool ends up being occupied permanently, since the vector is never
    freed. The solutions could be (1) freeing the vector when the kfence object
    is freed or (2) allocating all vectors statically. Since the memory
    consumption of object vectors is low, it is better to choose (2) to fix the
    issue, and it also reduces the overhead of allocating vectors in the
    future.

    Link: https://lkml.kernel.org/r/20220328132843.16624-1-songmuchun@bytedance.com
    Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB")
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kfence: allow use of a deferrable timer (Marco Elver, 2022-03-22, 1 file, -2/+13)

    Allow the use of a deferrable timer, which does not force CPU wake-ups when
    the system is idle. A consequence is that the sample interval becomes very
    unpredictable, to the point that it is not guaranteed that the KFENCE KUnit
    test still passes. Nevertheless, on power-constrained systems this may be
    preferable, so let's give the user the option should they accept the above
    trade-off.

    Link: https://lkml.kernel.org/r/20220308141415.3168078-1-elver@google.com
    Signed-off-by: Marco Elver <elver@google.com>
    Reviewed-by: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
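A minimal sketch of the deferrable-timer option described above (the Kconfig symbol is assumed to follow the patch; toggle_allocation_gate is KFENCE's existing sample-interval worker):

    /* A deferrable work item does not wake an idle CPU just to toggle the
     * allocation gate, at the cost of an unpredictable sample interval.
     */
    #ifdef CONFIG_KFENCE_DEFERRABLE
    static DECLARE_DEFERRABLE_WORK(kfence_timer, toggle_allocation_gate);
    #else
    static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);
    #endif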
* kfence: alloc kfence_pool after system startup (Tianchen Ding, 2022-03-22, 1 file, -21/+90)

    Allow enabling KFENCE after system startup by allocating its pool via the
    page allocator. This provides the flexibility to enable KFENCE even if it
    wasn't enabled at boot time.

    Link: https://lkml.kernel.org/r/20220307074516.6920-3-dtcccc@linux.alibaba.com
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Tested-by: Peng Liu <liupeng256@huawei.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
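A minimal sketch of late pool allocation as described in this series (function name is illustrative and error handling is simplified):

    static int kfence_init_late_pool_sketch(void)
    {
    #ifdef CONFIG_CONTIG_ALLOC
            struct page *pages = alloc_contig_pages(KFENCE_POOL_SIZE / PAGE_SIZE,
                                                    GFP_KERNEL, first_online_node,
                                                    NULL);
            if (!pages)
                    return -ENOMEM;
            __kfence_pool = page_to_virt(pages);
    #else
            /* Bounded by MAX_ORDER, so this path only works for small pools. */
            __kfence_pool = alloc_pages_exact(KFENCE_POOL_SIZE, GFP_KERNEL);
            if (!__kfence_pool)
                    return -ENOMEM;
    #endif
            return 0;
    }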
* kfence: allow re-enabling KFENCE after system startup (Tianchen Ding, 2022-03-22, 1 file, -3/+18)

    Patch series "provide the flexibility to enable KFENCE", v3.

    If CONFIG_CONTIG_ALLOC is not supported, we fall back to trying
    alloc_pages_exact(). Allocating pages this way is limited by MAX_ORDER
    (default 11), so we will not support allocating the kfence pool after
    system startup with a large KFENCE_NUM_OBJECTS. When handling failures in
    kfence_init_pool_late(), we pair free_pages_exact() with
    alloc_pages_exact() for compatibility consideration, though it actually
    does the same as free_contig_range().

    This patch (of 2):

    Once KFENCE is disabled by:

        echo 0 > /sys/module/kfence/parameters/sample_interval

    it can never be re-enabled until the next reboot. Allow re-enabling it by
    writing a positive number to sample_interval.

    Link: https://lkml.kernel.org/r/20220307074516.6920-1-dtcccc@linux.alibaba.com
    Link: https://lkml.kernel.org/r/20220307074516.6920-2-dtcccc@linux.alibaba.com
    Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kfence: make test case compatible with run time set sample interval (Peng Liu, 2022-02-11, 1 file, -1/+2)

    The parameter kfence_sample_interval can be set via a boot parameter or
    later from the shell, which is convenient for automated tests and KFENCE
    parameter optimization. However, the KFENCE test case just uses the
    compile-time CONFIG_KFENCE_SAMPLE_INTERVAL, which makes the test case not
    run as users desired. Export kfence_sample_interval, so that the KFENCE
    test case can use the run-time-set sample interval.

    Link: https://lkml.kernel.org/r/20220207034432.185532-1-liupeng256@huawei.com
    Signed-off-by: Peng Liu <liupeng256@huawei.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Sumit Semwal <sumit.semwal@linaro.org>
    Cc: Christian König <christian.koenig@amd.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'slab-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab (Linus Torvalds, 2022-01-10, 1 file, -8/+9)

    Pull slab updates from Vlastimil Babka:

     - Separate struct slab from struct page - an offshoot of the page folio
       work. Struct page fields used by slab allocators are moved from struct
       page to a new struct slab, that uses the same physical storage. Similar
       to struct folio, it always is a head page. This brings better type
       safety, separation of large kmalloc allocations from true slabs, and
       cleanup of related objcg code.

     - A SLAB_MERGE_DEFAULT config optimization.

    * tag 'slab-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (33 commits)
      mm/slob: Remove unnecessary page_mapcount_reset() function call
      bootmem: Use page->index instead of page->freelist
      zsmalloc: Stop using slab fields in struct page
      mm/slub: Define struct slab fields for CONFIG_SLUB_CPU_PARTIAL only when enabled
      mm/slub: Simplify struct slab slabs field definition
      mm/sl*b: Differentiate struct slab fields by sl*b implementations
      mm/kfence: Convert kfence_guarded_alloc() to struct slab
      mm/kasan: Convert to struct folio and struct slab
      mm/slob: Convert SLOB to use struct slab and struct folio
      mm/memcg: Convert slab objcgs from struct page to struct slab
      mm: Convert struct page to struct slab in functions used by other subsystems
      mm/slab: Finish struct page to struct slab conversion
      mm/slab: Convert most struct page to struct slab by spatch
      mm/slab: Convert kmem_getpages() and kmem_freepages() to struct slab
      mm/slub: Finish struct page to struct slab conversion
      mm/slub: Convert most struct page to struct slab by spatch
      mm/slub: Convert pfmemalloc_match() to take a struct slab
      mm/slub: Convert __free_slab() to use struct slab
      mm/slub: Convert alloc_slab_page() to return a struct slab
      mm/slub: Convert print_page_info() to print_slab_info()
      ...
| * mm/sl*b: Differentiate struct slab fields by sl*b implementationsVlastimil Babka2022-01-061-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With a struct slab definition separate from struct page, we can go further and define only fields that the chosen sl*b implementation uses. This means everything between __page_flags and __page_refcount placeholders now depends on the chosen CONFIG_SL*B. Some fields exist in all implementations (slab_list) but can be part of a union in some, so it's simpler to repeat them than complicate the definition with ifdefs even more. The patch doesn't change physical offsets of the fields, although it could be done later - for example it's now clear that tighter packing in SLOB could be possible. This should also prevent accidental use of fields that don't exist in given implementation. Before this patch virt_to_cache() and cache_from_obj() were visible for SLOB (albeit not used), although they rely on the slab_cache field that isn't set by SLOB. With this patch it's now a compile error, so these functions are now hidden behind an #ifndef CONFIG_SLOB. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <guro@fb.com> Tested-by: Marco Elver <elver@google.com> # kfence Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <kasan-dev@googlegroups.com>
| * mm/kfence: Convert kfence_guarded_alloc() to struct slabVlastimil Babka2022-01-061-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | The function sets some fields that are being moved from struct page to struct slab so it needs to be converted. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <kasan-dev@googlegroups.com>
* | kfence: fix memory leak when cat kfence objectsBaokun Li2021-12-251-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hulk robot reported a kmemleak problem: unreferenced object 0xffff93d1d8cc02e8 (size 248): comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s) hex dump (first 32 bytes): 00 40 85 19 d4 93 ff ff 00 10 00 00 00 00 00 00 .@.............. 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: seq_open+0x2a/0x80 full_proxy_open+0x167/0x1e0 do_dentry_open+0x1e1/0x3a0 path_openat+0x961/0xa20 do_filp_open+0xae/0x120 do_sys_openat2+0x216/0x2f0 do_sys_open+0x57/0x80 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 unreferenced object 0xffff93d419854000 (size 4096): comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s) hex dump (first 32 bytes): 6b 66 65 6e 63 65 2d 23 32 35 30 3a 20 30 78 30 kfence-#250: 0x0 30 30 30 30 30 30 30 37 35 34 62 64 61 31 32 2d 0000000754bda12- backtrace: seq_read_iter+0x313/0x440 seq_read+0x14b/0x1a0 full_proxy_read+0x56/0x80 vfs_read+0xa5/0x1b0 ksys_read+0xa0/0xf0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 I find that we can easily reproduce this problem with the following commands: cat /sys/kernel/debug/kfence/objects echo scan > /sys/kernel/debug/kmemleak cat /sys/kernel/debug/kmemleak The leaked memory is allocated in the stack below: do_syscall_64 do_sys_open do_dentry_open full_proxy_open seq_open ---> alloc seq_file vfs_read full_proxy_read seq_read seq_read_iter traverse ---> alloc seq_buf And it should have been released in the following process: do_syscall_64 syscall_exit_to_user_mode exit_to_user_mode_prepare task_work_run ____fput __fput full_proxy_release ---> free here However, the release function corresponding to file_operations is not implemented in kfence. As a result, a memory leak occurs. Therefore, the solution to this problem is to implement the corresponding release function. Link: https://lkml.kernel.org/r/20211206133628.2822545-1-libaokun1@huawei.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reported-by: Hulk Robot <hulkci@huawei.com> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
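The shape of the fix, as a hedged sketch (the open handler's name is illustrative): the seq_file-backed debugfs file gains a release handler, so the seq_file and its read buffer are freed on close.

  static const struct file_operations objects_fops = {
          .open    = open_objects,   /* seq_open()-based open (illustrative name) */
          .read    = seq_read,
          .llseek  = seq_lseek,
          .release = seq_release,    /* frees what seq_open() allocated; previously missing */
  };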
* kfence: always use static branches to guard kfence_alloc()Marco Elver2021-11-061-9/+7
| | | | | | | | | | | | | | | | | | | | | Regardless of KFENCE mode (CONFIG_KFENCE_STATIC_KEYS: either using static keys to gate allocations, or using a simple dynamic branch), always use a static branch to avoid the dynamic branch in kfence_alloc() if KFENCE was disabled at boot. For CONFIG_KFENCE_STATIC_KEYS=n, this now avoids the dynamic branch if KFENCE was disabled at boot. To simplify, also unifies the location where kfence_allocation_gate is read-checked to just be inline in kfence_alloc(). Link: https://lkml.kernel.org/r/20211019102524.2807208-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
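A hedged sketch of the resulting fast path (the real helper lives in include/linux/kfence.h; the exact branch layout may differ):

  static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
  {
  #if defined(CONFIG_KFENCE_STATIC_KEYS) || CONFIG_KFENCE_SAMPLE_INTERVAL == 0
          if (!static_branch_unlikely(&kfence_allocation_key))
                  return NULL;
  #else
          if (!static_branch_likely(&kfence_allocation_key))
                  return NULL;
  #endif
          if (likely(atomic_read(&kfence_allocation_gate)))
                  return NULL;
          return __kfence_alloc(s, size, flags);
  }

Either way, a KFENCE that was disabled at boot now costs only a patched-out static branch on the allocation fast path.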
* kfence: shorten critical sections of alloc/freeMarco Elver2021-11-061-17/+21
| | | | | | | | | | | | | | | | | | | | | | | | Initializing memory and setting/checking the canary bytes is relatively expensive, and doing so in the meta->lock critical sections extends the duration with preemption and interrupts disabled unnecessarily. Any reads to meta->addr and meta->size in kfence_guarded_alloc() and kfence_guarded_free() don't require locking meta->lock as long as the object is removed from the freelist: only kfence_guarded_alloc() sets meta->addr and meta->size after removing it from the freelist, which requires a preceding kfence_guarded_free() returning it to the list or the initial state. Therefore move reads to meta->addr and meta->size, including expensive memory initialization using them, out of meta->lock critical sections. Link: https://lkml.kernel.org/r/20210930153706.2105471-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kfence: limit currently covered allocations when pool nearly fullMarco Elver2021-11-061-2/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of KFENCE's main design principles is that with increasing uptime, allocation coverage increases sufficiently to detect previously undetected bugs. We have observed that frequent long-lived allocations of the same source (e.g. pagecache) tend to permanently fill up the KFENCE pool with increasing system uptime, thus breaking the above requirement. The workaround thus far had been increasing the sample interval and/or increasing the KFENCE pool size, but is no reliable solution. To ensure diverse coverage of allocations, limit currently covered allocations of the same source once pool utilization reaches 75% (configurable via `kfence.skip_covered_thresh`) or above. The effect is retaining reasonable allocation coverage when the pool is close to full. A side-effect is that this also limits frequent long-lived allocations of the same source filling up the pool permanently. Uniqueness of an allocation for coverage purposes is based on its (partial) allocation stack trace (the source). A Counting Bloom filter is used to check if an allocation is covered; if the allocation is currently covered, the allocation is skipped by KFENCE. Testing was done using: (a) a synthetic workload that performs frequent long-lived allocations (default config values; sample_interval=1; num_objects=63), and (b) normal desktop workloads on an otherwise idle machine where the problem was first reported after a few days of uptime (default config values). In both test cases the sampled allocation rate no longer drops to zero at any point. In the case of (b) we observe (after 2 days uptime) 15% unique allocations in the pool, 77% pool utilization, with 20% "skipped allocations (covered)". [elver@google.com: simplify and just use hash_32(), use more random stack_hash_seed] Link: https://lkml.kernel.org/r/YU3MRGaCaJiYht5g@elver.google.com [elver@google.com: fix 32 bit] Link: https://lkml.kernel.org/r/20210923104803.2620285-4-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Jann Horn <jannh@google.com> Cc: Taras Madan <tarasmadan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
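A rough sketch of the Counting Bloom filter check (table size, number of hash functions, and all identifiers below are illustrative, not the in-tree names): allocations add +1 for their stack hash, frees subtract 1, and a lookup that finds every probed counter non-zero treats the allocation as already covered.

  #define COVERED_ENTRIES (1 << 10)   /* illustrative table size */
  #define COVERED_HFUNCS  2           /* illustrative number of hash functions */
  static atomic_t covered[COVERED_ENTRIES];

  static bool covered_contains(u32 stack_hash)
  {
          int i;

          for (i = 0; i < COVERED_HFUNCS; i++) {
                  if (!atomic_read(&covered[stack_hash % COVERED_ENTRIES]))
                          return false;                   /* definitely not covered */
                  stack_hash = hash_32(stack_hash, 32);   /* derive the next probe index */
          }
          return true;    /* probably covered: skip this allocation */
  }

  static void covered_add(u32 stack_hash, int val)        /* val = +1 on alloc, -1 on free */
  {
          int i;

          for (i = 0; i < COVERED_HFUNCS; i++) {
                  atomic_add(val, &covered[stack_hash % COVERED_ENTRIES]);
                  stack_hash = hash_32(stack_hash, 32);
          }
  }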
* kfence: move saving stack trace of allocations into __kfence_alloc()Marco Elver2021-11-061-11/+24
| | | | | | | | | | | | | | | | | Move the saving of the stack trace of allocations into __kfence_alloc(), so that the stack entries array can be used outside of kfence_guarded_alloc() and we avoid potentially unwinding the stack multiple times. Link: https://lkml.kernel.org/r/20210923104803.2620285-3-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Jann Horn <jannh@google.com> Cc: Taras Madan <tarasmadan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kfence: count unexpectedly skipped allocationsMarco Elver2021-11-061-3/+13
| | | | | | | | | | | | | | | | | | Maintain a counter to count allocations that are skipped due to being incompatible (oversized, incompatible gfp flags) or no capacity. This is to compute the fraction of allocations that could not be serviced by KFENCE, which we expect to be rare. Link: https://lkml.kernel.org/r/20210923104803.2620285-2-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Aleksandr Nogikh <nogikh@google.com> Cc: Jann Horn <jannh@google.com> Cc: Taras Madan <tarasmadan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kfence: show cpu and timestamp in alloc/free infoMarco Elver2021-09-081-0/+3
| | | | | | | | | | | | | | | Record cpu and timestamp on allocations and frees, and show them in reports. Upon an error, this can help correlate earlier messages in the kernel log via allocation and free timestamps. Link: https://lkml.kernel.org/r/20210714175312.2947941-1-elver@google.com Suggested-by: Joern Engel <joern@purestorage.com> Signed-off-by: Marco Elver <elver@google.com> Acked-by: Alexander Potapenko <glider@google.com> Acked-by: Joern Engel <joern@purestorage.com> Cc: Yuanyuan Zhong <yzhong@purestorage.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
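A hedged sketch of the two extra fields recorded per alloc/free track (member names approximate):

  track->cpu = raw_smp_processor_id();    /* CPU the alloc/free happened on */
  track->ts_nsec = local_clock();         /* cheap monotonic timestamp, printed in reports */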
* kfence: skip all GFP_ZONEMASK allocationsAlexander Potapenko2021-07-231-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | Allocation requests outside ZONE_NORMAL (MOVABLE, HIGHMEM or DMA) cannot be fulfilled by KFENCE, because KFENCE memory pool is located in a zone different from the requested one. Because callers of kmem_cache_alloc() may actually rely on the allocation to reside in the requested zone (e.g. memory allocations done with __GFP_DMA must be DMAable), skip all allocations done with GFP_ZONEMASK and/or respective SLAB flags (SLAB_CACHE_DMA and SLAB_CACHE_DMA32). Link: https://lkml.kernel.org/r/20210714092222.1890268-2-glider@google.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Souptick Joarder <jrdr.linux@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
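A hedged sketch of the early-out this adds near the top of __kfence_alloc():

  if ((flags & GFP_ZONEMASK) ||
      (s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32)))
          return NULL;    /* KFENCE pool lives in ZONE_NORMAL; honor zone/DMA constraints */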
* kfence: move the size check to the beginning of __kfence_alloc()Alexander Potapenko2021-07-231-3/+7
| | | | | | | | | | | | | | | | | | | Check the allocation size before toggling kfence_allocation_gate. This way allocations that can't be served by KFENCE will not result in waiting for another CONFIG_KFENCE_SAMPLE_INTERVAL without allocating anything. Link: https://lkml.kernel.org/r/20210714092222.1890268-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
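A hedged sketch of the reordering in __kfence_alloc(); the gate handling is shown schematically and may differ from the exact code of that era:

  if (size > PAGE_SIZE)           /* cannot be served by KFENCE anyway */
          return NULL;
  if (atomic_inc_return(&kfence_allocation_gate) > 1)
          return NULL;            /* this sample interval is already consumed */
  /* ... proceed to kfence_guarded_alloc() ... */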
* kfence: unconditionally use unbound work queueMarco Elver2021-07-011-2/+2
| | | | | | | | | | | | | | | | | Unconditionally use unbound work queue, and not just if wq_power_efficient is true. Because if the system is idle, KFENCE may wait, and by being run on the unbound work queue, we permit the scheduler to make better scheduling decisions and not require pinning KFENCE to the same CPU upon waking up. Link: https://lkml.kernel.org/r/20210521111630.472579-1-elver@google.com Fixes: 36f0b35d0894 ("kfence: use power-efficient work queue to run delayed work") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Hillf Danton <hdanton@sina.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
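A hedged sketch of the scheduling call after this change (the delayed-work and interval names follow the rest of core.c):

  queue_delayed_work(system_unbound_wq, &kfence_timer,
                     msecs_to_jiffies(kfence_sample_interval));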
* kfence: use TASK_IDLE when awaiting allocationMarco Elver2021-06-051-3/+3
| | | | | | | | | | | | | | | | | | | | | Since wait_event() uses TASK_UNINTERRUPTIBLE by default, waiting for an allocation counts towards load. However, for KFENCE, this does not make any sense, since there is no busy work we're awaiting. Instead, use TASK_IDLE via wait_event_idle() to not count towards load. BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1185565 Link: https://lkml.kernel.org/r/20210521083209.3740269-1-elver@google.com Fixes: 407f1d8c1b5f ("kfence: await for allocation using wait_event") Signed-off-by: Marco Elver <elver@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Hillf Danton <hdanton@sina.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
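A hedged sketch of the wait after this change; TASK_IDLE sleeps do not contribute to the load average:

  wait_event_idle(allocation_wait, atomic_read(&kfence_allocation_gate));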
* kfence: use power-efficient work queue to run delayed workMarco Elver2021-05-051-2/+3
| | | | | | | | | | | | | | | | Use the power-efficient work queue, to avoid the pathological case where we keep pinning ourselves on the same possibly idle CPU on systems that want to be power-efficient (https://lwn.net/Articles/731052/). Link: https://lkml.kernel.org/r/20210421105132.3965998-4-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jann Horn <jannh@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kfence: maximize allocation wait timeout durationMarco Elver2021-05-051-1/+11
| | | | | | | | | | | | | | | | | | | | | | The allocation wait timeout was initially added because of warnings due to CONFIG_DETECT_HUNG_TASK=y [1]. While the 1 sec timeout is sufficient to resolve the warnings (given the hung task timeout must be 1 sec or larger) it may cause unnecessary wake-ups if the system is idle: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com Fix it by computing the timeout duration in terms of the current sysctl_hung_task_timeout_secs value. Link: https://lkml.kernel.org/r/20210421105132.3965998-3-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jann Horn <jannh@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
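A hedged sketch of the timeout computation (the exact divisor is illustrative):

  if (sysctl_hung_task_timeout_secs)
          /* stay safely below the hung-task warning threshold */
          wait_event_timeout(allocation_wait, atomic_read(&kfence_allocation_gate),
                             sysctl_hung_task_timeout_secs * HZ / 2);
  else
          wait_event(allocation_wait, atomic_read(&kfence_allocation_gate));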
* kfence: await for allocation using wait_eventMarco Elver2021-05-051-16/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "kfence: optimize timer scheduling", v2. We have observed that mostly-idle systems with KFENCE enabled wake up otherwise idle CPUs, preventing such to enter a lower power state. Debugging revealed that KFENCE spends too much active time in toggle_allocation_gate(). While the first version of KFENCE was using all the right bits to be scheduling optimal, and thus power efficient, by simply using wait_event() + wake_up(), that code was unfortunately removed. As KFENCE was exposed to various different configs and tests, the scheduling optimal code slowly disappeared. First because of hung task warnings, and finally because of deadlocks when an allocation is made by timer code with debug objects enabled. Clearly, the "fixes" were not too friendly for devices that want to be power efficient. Therefore, let's try a little harder to fix the hung task and deadlock problems that we have with wait_event() + wake_up(), while remaining as scheduling friendly and power efficient as possible. Crucially, we need to defer the wake_up() to an irq_work, avoiding any potential for deadlock. The result with this series is that on the devices where we observed a power regression, power usage returns back to baseline levels. This patch (of 3): On mostly-idle systems, we have observed that toggle_allocation_gate() is a cause of frequent wake-ups, preventing an otherwise idle CPU to go into a lower power state. A late change in KFENCE's development, due to a potential deadlock [1], required changing the scheduling-friendly wait_event_timeout() and wake_up() to an open-coded wait-loop using schedule_timeout(). [1] https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com To avoid unnecessary wake-ups, switch to using wait_event_timeout(). Unfortunately, we still cannot use a version with direct wake_up() in __kfence_alloc() due to the same potential for deadlock as in [1]. Instead, add a level of indirection via an irq_work that is scheduled if we determine that the kfence_timer requires a wake_up(). Link: https://lkml.kernel.org/r/20210421105132.3965998-1-elver@google.com Link: https://lkml.kernel.org/r/20210421105132.3965998-2-elver@google.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
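A hedged sketch of the irq_work indirection described above (names follow the commit message's intent; details may differ from the final code):

  static void wake_up_kfence_timer(struct irq_work *work)
  {
          wake_up(&allocation_wait);
  }
  static DEFINE_IRQ_WORK(wake_up_kfence_timer_work, wake_up_kfence_timer);

  /* In __kfence_alloc(), once this allocation consumes the sample gate: */
  if (waitqueue_active(&allocation_wait))
          irq_work_queue(&wake_up_kfence_timer_work);

Deferring the wake-up to an irq_work avoids calling wake_up() from allocation contexts (e.g. timer code with debug objects) where it could deadlock, while keeping the timer thread on a scheduling-friendly wait_event().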