summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'riscv-for-linus-6.9-mw2' of ↵Linus Torvalds2024-03-221-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for various vector-accelerated crypto routines - Hibernation is now enabled for portable kernel builds - mmap_rnd_bits_max is larger on systems with larger VAs - Support for fast GUP - Support for membarrier-based instruction cache synchronization - Support for the Andes hart-level interrupt controller and PMU - Some cleanups around unaligned access speed probing and Kconfig settings - Support for ACPI LPI and CPPC - Various cleanus related to barriers - A handful of fixes * tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (66 commits) riscv: Fix syscall wrapper for >word-size arguments crypto: riscv - add vector crypto accelerated AES-CBC-CTS crypto: riscv - parallelize AES-CBC decryption riscv: Only flush the mm icache when setting an exec pte riscv: Use kcalloc() instead of kzalloc() riscv/barrier: Add missing space after ',' riscv/barrier: Consolidate fence definitions riscv/barrier: Define RISCV_FULL_BARRIER riscv/barrier: Define __{mb,rmb,wmb} RISC-V: defconfig: Enable CONFIG_ACPI_CPPC_CPUFREQ cpufreq: Move CPPC configs to common Kconfig and add RISC-V ACPI: RISC-V: Add CPPC driver ACPI: Enable ACPI_PROCESSOR for RISC-V ACPI: RISC-V: Add LPI driver cpuidle: RISC-V: Move few functions to arch/riscv riscv: Introduce set_compat_task() in asm/compat.h riscv: Introduce is_compat_thread() into compat.h riscv: add compile-time test into is_compat_task() riscv: Replace direct thread flag check with is_compat_task() riscv: Improve arch_get_mmap_end() macro ...
| * mm: Change mmap_rnd_bits_max to __ro_after_initSami Tolvanen2024-01-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Allow mmap_rnd_bits_max to be updated on architectures that determine virtual address space size at runtime instead of relying on Kconfig options by changing it from const to __ro_after_init. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Link: https://lore.kernel.org/r/20230929211155.3910949-5-samitolvanen@google.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
* | Merge tag 'kbuild-v6.9' of ↵Linus Torvalds2024-03-211-4/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Generate a list of built DTB files (arch/*/boot/dts/dtbs-list) - Use more threads when building Debian packages in parallel - Fix warnings shown during the RPM kernel package uninstallation - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to Makefile - Support GCC's -fmin-function-alignment flag - Fix a null pointer dereference bug in modpost - Add the DTB support to the RPM package - Various fixes and cleanups in Kconfig * tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits) kconfig: tests: test dependency after shuffling choices kconfig: tests: add a test for randconfig with dependent choices kconfig: tests: support KCONFIG_SEED for the randconfig runner kbuild: rpm-pkg: add dtb files in kernel rpm kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig() kconfig: check prompt for choice while parsing kconfig: lxdialog: remove unused dialog colors kconfig: lxdialog: fix button color for blackbg theme modpost: fix null pointer dereference kbuild: remove GCC's default -Wpacked-bitfield-compat flag kbuild: unexport abs_srctree and abs_objtree kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1 kconfig: remove named choice support kconfig: use linked list in get_symbol_str() to iterate over menus kconfig: link menus to a symbol kbuild: fix inconsistent indentation in top Makefile kbuild: Use -fmin-function-alignment when available alpha: merge two entries for CONFIG_ALPHA_GAMMA alpha: merge two entries for CONFIG_ALPHA_EV4 kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj) ...
| * | treewide: replace or remove redundant def_bool in Kconfig filesMasahiro Yamada2024-02-201-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | 'def_bool X' is a shorthand for 'bool' plus 'default X'. 'def_bool' is redundant where 'bool' is already present, so 'def_bool X' can be replaced with 'default X', or removed if X is 'n'. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
* | | Merge tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefsLinus Torvalds2024-03-151-0/+13
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull bcachefs updates from Kent Overstreet: - Subvolume children btree; this is needed for providing a userspace interface for walking subvolumes, which will come later - Lots of improvements to directory structure checking - Improved journal pipelining, significantly improving performance on high iodepth write workloads - Discard path improvements: the discard path is more efficient, and no longer flushes the journal unnecessarily - Buffered write path can now avoid taking the inode lock - new mm helper: memalloc_flags_{save|restore} - mempool now does kvmalloc mempools * tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs: (128 commits) bcachefs: time_stats: shrink time_stat_buffer for better alignment bcachefs: time_stats: split stats-with-quantiles into a separate structure bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet bcachefs: time_stats: add larger units bcachefs: pull out time_stats.[ch] bcachefs: reconstruct_alloc cleanup bcachefs: fix bch_folio_sector padding bcachefs: Fix btree key cache coherency during replay bcachefs: Always flush write buffer in delete_dead_inodes() bcachefs: Fix order of gc_done passes bcachefs: fix deletion of indirect extents in btree_gc bcachefs: Prefer struct_size over open coded arithmetic bcachefs: Kill unused flags argument to btree_split() bcachefs: Check for writing superblocks with nonsense member seq fields bcachefs: fix bch2_journal_buf_to_text() lib/generic-radix-tree.c: Make nodes more reasonably sized bcachefs: copy_(to|from)_user_errcode() bcachefs: Split out bkey_types.h bcachefs: fix lost journal buf wakeup due to improved pipelining bcachefs: intercept mountoption value for bool type ...
| * | | mempool: kvmalloc poolKent Overstreet2024-03-131-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add mempool_init_kvmalloc_pool() and mempool_create_kvmalloc_pool(), which wrap kvmalloc() instead of kmalloc() - kmalloc() with a vmalloc() fallback. This is part of a bcachefs cleanup - dropping an internal kvpmalloc() helper (which predates kvmalloc()) along with mempool helpers; this replaces the bcachefs-private kvpmalloc_pool. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: linux-mm@kvack.org
* | | | Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of ↵Linus Torvalds2024-03-141-1/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. * tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) nilfs2: prevent kernel bug at submit_bh_wbc() nilfs2: fix failure to detect DAT corruption in btree and direct mappings ocfs2: enable ocfs2_listxattr for special files ocfs2: remove SLAB_MEM_SPREAD flag usage assoc_array: fix the return value in assoc_array_insert_mid_shortcut() buildid: use kmap_local_page() watchdog/core: remove sysctl handlers from public header nilfs2: use div64_ul() instead of do_div() mul_u64_u64_div_u64: increase precision by conditionally swapping a and b kexec: copy only happens before uchunk goes to zero get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't abuse ksig->info.si_signo and ksig->sig const_structs.checkpatch: add device_type Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() list: leverage list_is_head() for list_entry_is_head() nilfs2: MAINTAINERS: drop unreachable project mirror site smp: make __smp_processor_id() 0-argument macro fat: fix uninitialized field in nostale filehandles ...
| * | | | treewide: update LLVM Bugzilla linksNathan Chancellor2024-02-221-1/+1
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | LLVM moved their issue tracker from their own Bugzilla instance to GitHub issues. While all of the links are still valid, they may not necessarily show the most up to date information around the issues, as all updates will occur on GitHub, not Bugzilla. Another complication is that the Bugzilla issue number is not always the same as the GitHub issue number. Thankfully, LLVM maintains this mapping through two shortlinks: https://llvm.org/bz<num> -> https://bugs.llvm.org/show_bug.cgi?id=<num> https://llvm.org/pr<num> -> https://github.com/llvm/llvm-project/issues/<mapped_num> Switch all "https://bugs.llvm.org/show_bug.cgi?id=<num>" links to the "https://llvm.org/pr<num>" shortlink so that the links show the most up to date information. Each migrated issue links back to the Bugzilla entry, so there should be no loss of fidelity of information here. Link: https://lkml.kernel.org/r/20240109-update-llvm-links-v1-3-eb09b59db071@kernel.org Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Fangrui Song <maskray@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Mykola Lysenko <mykolal@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | | Merge tag 'mm-stable-2024-03-13-20-04' of ↵Linus Torvalds2024-03-1462-2944/+5490
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ...
| * | | | mm/zswap: remove the memcpy if acomp is not sleepableBarry Song2024-03-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most compressors are actually CPU-based and won't sleep during compression and decompression. We should remove the redundant memcpy for them. This patch checks if the algorithm is sleepable by testing the CRYPTO_ALG_ASYNC algorithm flag. Generally speaking, async and sleepable are semantically similar but not equal. But for compress drivers, they are basically equal at least due to the below facts. Firstly, scompress drivers - crypto/deflate.c, lz4.c, zstd.c, lzo.c etc have no sleep. Secondly, zRAM has been using these scompress drivers for years in atomic contexts, and never worried those drivers going to sleep. One exception is that an async driver can sometimes still return synchronously per Herbert's clarification. In this case, we are still having a redundant memcpy. But we can't know if one particular acomp request will sleep or not unless crypto can expose more details for each specific request from offload drivers. Link: https://lkml.kernel.org/r/20240222081135.173040-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David S. Miller <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | memtest: use {READ,WRITE}_ONCE in memory scanningQiang Zhang2024-03-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | memtest failed to find bad memory when compiled with clang. So use {WRITE,READ}_ONCE to access memory to avoid compiler over optimization. Link: https://lkml.kernel.org/r/20240312080422.691222-1-qiang4.zhang@intel.com Signed-off-by: Qiang Zhang <qiang4.zhang@intel.com> Cc: Bill Wendling <morbo@google.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: prohibit the last subpage from reusing the entire large folioBarry Song2024-03-131-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a Copy-on-Write (CoW) scenario, the last subpage will reuse the entire large folio, resulting in the waste of (nr_pages - 1) pages. This wasted memory remains allocated until it is either unmapped or memory reclamation occurs. The following small program can serve as evidence of this behavior main() { #define SIZE 1024 * 1024 * 1024UL void *p = malloc(SIZE); memset(p, 0x11, SIZE); if (fork() == 0) _exit(0); memset(p, 0x12, SIZE); printf("done\n"); while(1); } For example, using a 1024KiB mTHP by: echo always > /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled (1) w/o the patch, it takes 2GiB, Before running the test program, / # free -m total used free shared buff/cache available Mem: 5754 84 5692 0 17 5669 Swap: 0 0 0 / # /a.out & / # done After running the test program, / # free -m total used free shared buff/cache available Mem: 5754 2149 3627 0 19 3605 Swap: 0 0 0 (2) w/ the patch, it takes 1GiB only, Before running the test program, / # free -m total used free shared buff/cache available Mem: 5754 89 5687 0 17 5664 Swap: 0 0 0 / # /a.out & / # done After running the test program, / # free -m total used free shared buff/cache available Mem: 5754 1122 4655 0 17 4632 Swap: 0 0 0 This patch migrates the last subpage to a small folio and immediately returns the large folio to the system. It benefits both memory availability and anti-fragmentation. Link: https://lkml.kernel.org/r/20240308092721.144735-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/huge_memory: skip invalid debugfs new_order input for folio splitZi Yan2024-03-121-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | User can put arbitrary new_order via debugfs for folio split test. Although new_order check is added to split_huge_page_to_list_order() in the prior commit, these two additional checks can avoid unnecessary folio locking and split_folio_to_order() calls. Link: https://lkml.kernel.org/r/20240307181854.138928-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-mm/7dda9283-b437-4cf8-ab0d-83c330deb9c0@moroto.mountain/ Cc: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/huge_memory: check new folio order when split a folioZi Yan2024-03-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A folio can only be split into lower orders. Since there are no new_order checks in debugfs, any new_order can be passed via debugfs into split_huge_page_to_list_to_order(). Check new_order to make sure it is smaller than input folio order. Link: https://lkml.kernel.org/r/20240307181854.138928-1-zi.yan@sent.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-mm/7dda9283-b437-4cf8-ab0d-83c330deb9c0@moroto.mountain/ Cc: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failureByungchul Park2024-03-121-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With cache_trim_mode on, reclaim logic doesn't bother reclaiming anon pages. However, it should be more careful to use the mode because it's going to prevent anon pages from being reclaimed even if there are a huge number of anon pages that are cold and should be reclaimed. Even worse, that leads kswapd_failures to reach MAX_RECLAIM_RETRIES and stopping kswapd from functioning until direct reclaim eventually works to resume kswapd. So kswapd needs to retry its scan priority loop with cache_trim_mode off again if the mode doesn't work for reclaim. The problematic behavior can be reproduced by: CONFIG_NUMA_BALANCING enabled sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING numa node0 (8GB local memory, 16 CPUs) numa node1 (8GB slow tier memory, no CPUs) Sequence: 1) echo 3 > /proc/sys/vm/drop_caches 2) To emulate the system with full of cold memory in local DRAM, run the following dummy program and never touch the region: mmap(0, 8 * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0); 3) Run any memory intensive work e.g. XSBench. 4) Check if numa balancing is working e.i. promotion/demotion. 5) Iterate 1) ~ 4) until numa balancing stops. With this, you could see that promotion/demotion are not working because kswapd has stopped due to ->kswapd_failures >= MAX_RECLAIM_RETRIES. Interesting vmstat delta's differences between before and after are like: +-----------------------+-------------------------------+ | interesting vmstat | before | after | +-----------------------+-------------------------------+ | nr_inactive_anon | 321935 | 1664772 | | nr_active_anon | 1780700 | 437834 | | nr_inactive_file | 30425 | 40882 | | nr_active_file | 14961 | 3012 | | pgpromote_success | 356 | 1293122 | | pgpromote_candidate | 21953245 | 1824148 | | pgactivate | 1844523 | 3311907 | | pgdeactivate | 50634 | 1554069 | | pgfault | 31100294 | 6518806 | | pgdemote_kswapd | 30856 | 2230821 | | pgscan_kswapd | 1861981 | 7667629 | | pgscan_anon | 1822930 | 7610583 | | pgscan_file | 39051 | 57046 | | pgsteal_anon | 386 | 2192033 | | pgsteal_file | 30470 | 38788 | | pageoutrun | 30 | 412 | | numa_hint_faults | 27418279 | 2875955 | | numa_pages_migrated | 356 | 1293122 | +-----------------------+-------------------------------+ Link: https://lkml.kernel.org/r/20240304082118.20499-1-byungchul@sk.com Signed-off-by: Byungchul Park <byungchul@sk.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: add an explicit smp_wmb() to UFFDIO_CONTINUEJames Houghton2024-03-122-4/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Users of UFFDIO_CONTINUE may reasonably assume that a write memory barrier is included as part of UFFDIO_CONTINUE. That is, a user may believe that all writes it has done to a page that it is now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone subsequently reading the page through the newly mapped virtual memory region. Today, such a user happens to be correct. mmget_not_zero(), for example, is called as part of UFFDIO_CONTINUE (and comes before any PTE updates), and it implicitly gives us a write barrier. To be resilient against future changes, include an explicit smp_wmb(). While we're at it, optimize the smp_wmb() that is already incidentally present for the HugeTLB case. Merely making a syscall does not generally imply the memory ordering constraints that we need (including on x86). Link: https://lkml.kernel.org/r/20240307010250.3847179-1-jthoughton@google.com Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: fix list corruption in put_pages_listMatthew Wilcox (Oracle)2024-03-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | My recent change to put_pages_list() dereferences folio->lru.next after returning the folio to the page allocator. Usually this is now on the pcp list with other free folios, so we try to free an already-free folio. This only happens with lists that have more than 15 entries, so it wasn't immediately discovered. Revert to using list_for_each_safe() so we dereference lru.next before disposing of the folio. Link: https://lkml.kernel.org/r/20240306212749.1823380-1-willy@infradead.org Fixes: 24835f899c01 ("mm: use free_unref_folios() in put_pages_list()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com> Closes: https://lore.kernel.org/intel-gfx/SJ1PR11MB61292145F3B79DA58ADDDA63B9232@SJ1PR11MB6129.namprd11.prod.outlook.com/ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: remove folio from deferred split list before uncharging itMatthew Wilcox (Oracle)2024-03-122-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When freeing a large folio, we must remove it from the deferred split list before we uncharge it as each memcg has its own deferred split list (with associated lock) and removing a folio from the deferred split list while holding the wrong lock will corrupt that list and cause various related problems. Link: https://lore.kernel.org/linux-mm/367a14f7-340e-4b29-90ae-bc3fcefdd5f4@arm.com/ Link: https://lkml.kernel.org/r/20240311191835.312162-1-willy@infradead.org Fixes: f77171d241e3 (mm: allow non-hugetlb large folios to be batch processed) Fixes: 29f3843026cf (mm: free folios directly in move_folios_to_lru()) Fixes: bc2ff4cbc329 (mm: free folios in a batch in shrink_folio_list()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Debugged-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | filemap: avoid unnecessary major faults in filemap_fault()ZhangPeng2024-03-061-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) in application, which leading to an unexpected issue[1]. This is caused by temporarily cleared PTE during a read+clear/modify/write update of the PTE, eg, do_numa_page()/change_pte_range(). For the data segment of the user-mode program, the global variable area is a private mapping. After the pagecache is loaded, the private anonymous page is generated after the COW is triggered. Mlockall can lock COW pages (anonymous pages), but the original file pages cannot be locked and may be reclaimed. If the global variable (private anon page) is accessed when vmf->pte is zeroed in numa fault, a file page fault will be triggered. At this time, the original private file page may have been reclaimed. If the page cache is not available at this time, a major fault will be triggered and the file will be read, causing additional overhead. This issue affects our traffic analysis service. The inbound traffic is heavy. If a major fault occurs, the I/O schedule is triggered and the original I/O is suspended. Generally, the I/O schedule is 0.7 ms. If other applications are operating disks, the system needs to wait for more than 10 ms. However, the inbound traffic is heavy and the NIC buffer is small. As a result, packet loss occurs. But the traffic analysis service can't tolerate packet loss. Fix this by holding PTL and rechecking the PTE in filemap_fault() before triggering a major fault. We do this check only if vma is VM_LOCKED to reduce the performance impact in common scenarios. In our product environment, there were 7 major faults every 12 hours. After the patch is applied, no major fault have been triggered. Testing file page read and write page fault performance in ext4 and ramdisk using will-it-scale[2] on a x86 physical machine. The data is the average change compared with the mainline after the patch is applied. The test results are within the range of fluctuation. We do this check only if vma is VM_LOCKED, therefore, no performance regressions is caused for most common cases. The test results are as follows: processes processes_idle threads threads_idle ext4 private file write: 0.22% 0.26% 1.21% -0.15% ext4 private file read: 0.03% 1.00% 1.39% 0.34% ext4 shared file write: -0.50% -0.02% -0.14% -0.02% ramdisk private file write: 0.07% 0.02% 0.53% 0.04% ramdisk private file read: 0.01% 1.60% -0.32% -0.02% [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ [2] https://github.com/antonblanchard/will-it-scale/ Link: https://lkml.kernel.org/r/20240306083809.1236634-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm,page_owner: drop unnecessary checkOscar Salvador2024-03-061-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | stackdepot only saves stack_records which size is greather than 0, so we cannot possibly have empty stack_records. Drop the check. Link: https://lkml.kernel.org/r/20240306123217.29774-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm,page_owner: check for null stack_record before bumping its refcountOscar Salvador2024-03-061-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "page_owner: Fixup and cleanup". This patchset consists of a fixup by an error that was reported by intel robot, where it seems to be that by the time page_owner gets initialized, stackdepot has already depleted its allocation space and returns 0-handles, turning that into null stack_records when trying to retrieve the stack_record. I was not able to reproduce that from the config because it booted fine for me, but when setting e.g: dummy_handle to 0 artificially, I could see the same error that was reported. The second patch is a cleanup that can also lead to a compilation warning. This patch (of 2): Although the retrieval of the stack_records for {dummy,failure}_handle happen when page_owner gets initialized, there seems to be some situations where stackdepot space has been already depleted by then, so we get 0-handles which make stack_records being NULL for those cases. Be careful to 1) only bump stack_records refcount and 2) only access stack_record fields if we actually have a non-null stack_record between hands. Link: https://lkml.kernel.org/r/20240306123217.29774-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20240306123217.29774-2-osalvador@suse.de Fixes: 4bedfb314bdd ("mm,page_owner: implement the tracking of the stacks count") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202403051032.e2f865a-lkp@intel.com Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: swap: fix race between free_swap_and_cache() and swapoff()Ryan Roberts2024-03-061-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was previously a theoretical window where swapoff() could run and teardown a swap_info_struct while a call to free_swap_and_cache() was running in another thread. This could cause, amongst other bad possibilities, swap_page_trans_huge_swapped() (called by free_swap_and_cache()) to access the freed memory for swap_map. This is a theoretical problem and I haven't been able to provoke it from a test case. But there has been agreement based on code review that this is possible (see link below). Fix it by using get_swap_device()/put_swap_device(), which will stall swapoff(). There was an extra check in _swap_info_get() to confirm that the swap entry was not free. This isn't present in get_swap_device() because it doesn't make sense in general due to the race between getting the reference and swapoff. So I've added an equivalent check directly in free_swap_and_cache(). Details of how to provoke one possible issue (thanks to David Hildenbrand for deriving this): --8<----- __swap_entry_free() might be the last user and result in "count == SWAP_HAS_CACHE". swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0. So the question is: could someone reclaim the folio and turn si->inuse_pages==0, before we completed swap_page_trans_huge_swapped(). Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are still references by swap entries. Process 1 still references subpage 0 via swap entry. Process 2 still references subpage 1 via swap entry. Process 1 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE [then, preempted in the hypervisor etc.] Process 2 quits. Calls free_swap_and_cache(). -> count == SWAP_HAS_CACHE Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls __try_to_reclaim_swap(). __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()-> put_swap_folio()->free_swap_slot()->swapcache_free_entries()-> swap_entry_free()->swap_range_free()-> ... WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); What stops swapoff to succeed after process 2 reclaimed the swap cache but before process1 finished its call to swap_page_trans_huge_swapped()? --8<----- Link: https://lkml.kernel.org/r/20240306140356.3974886-1-ryan.roberts@arm.com Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch") Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/kasan: use pXd_leaf() in shadow_mapped()Peter Xu2024-03-061-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is an old trick in shadow_mapped() to use pXd_bad() to detect huge pages. After commit 93fab1b22ef7 ("mm: add generic p?d_leaf() macros") we have a global API for huge mappings. Use that to replace the trick. Link: https://lkml.kernel.org/r/20240305043750.93762-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/zswap: global lru and shrinker shared by all zswap_pools fixChengming Zhou2024-03-061-39/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit bf9b7df23cb3 ("mm/zswap: global lru and shrinker shared by all zswap_pools") introduced a new lock to protect zswap_next_shrink, instead of reusing zswap_pools_lock. But the problem is that it's initialized only when zswap enabled, which causes bug if zswap_memcg_offline_cleanup() called without zswap enabled. Fix it by using DEFINE_SPINLOCK() to statically initialize them and define them as multiple static variables to keep in consistent with the existing global variables in zswap. Link: https://lkml.kernel.org/r/20240305075345.1493214-1-chengming.zhou@linux.dev Fixes: bf9b7df23cb3 ("mm/zswap: global lru and shrinker shared by all zswap_pools") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202403051008.a8cf8a94-lkp@intel.com Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: memory: fix shift-out-of-bounds in fault_around_bytes_setKefeng Wang2024-03-061-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rounddown_pow_of_two(0) is undefined, so val = 0 is not allowed in the fault_around_bytes_set(), and leads to shift-out-of-bounds, UBSAN: shift-out-of-bounds in include/linux/log2.h:67:13 shift exponent 4294967295 is too large for 64-bit type 'long unsigned int' CPU: 7 PID: 107 Comm: sh Not tainted 6.8.0-rc6-next-20240301 #294 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: dump_backtrace+0x94/0xec show_stack+0x18/0x24 dump_stack_lvl+0x78/0x90 dump_stack+0x18/0x24 ubsan_epilogue+0x10/0x44 __ubsan_handle_shift_out_of_bounds+0x98/0x134 fault_around_bytes_set+0xa4/0xb0 simple_attr_write_xsigned.isra.0+0xe4/0x1ac simple_attr_write+0x18/0x24 debugfs_attr_write+0x4c/0x98 vfs_write+0xd0/0x4b0 ksys_write+0x6c/0xfc __arm64_sys_write+0x1c/0x28 invoke_syscall+0x44/0x104 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x34/0xdc el0t_64_sync_handler+0xc0/0xc4 el0t_64_sync+0x190/0x194 ---[ end trace ]--- Fix it by setting the minimum val to PAGE_SIZE. Link: https://lkml.kernel.org/r/20240302064312.2358924-1-wangkefeng.wang@huawei.com Fixes: 53d36a56d8c4 ("mm: prefer fault_around_pages to fault_around_bytes") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: Yue Sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/all/CAEkJfYPim6DQqW1GqCiHLdh2-eweqk1fGyXqs3JM+8e1qGge8w@mail.gmail.com/ Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: page_alloc: use div64_ul() instead of do_div()Thorsten Blum2024-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes Coccinelle/coccicheck warning reported by do_div.cocci. Compared to do_div(), div64_ul() does not implicitly cast the divisor and does not unnecessarily calculate the remainder. Link: https://lkml.kernel.org/r/20240228224911.1164-2-thorsten.blum@toblux.com Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/mempolicy: use a folio in do_mbind()Matthew Wilcox (Oracle)2024-03-061-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We actually add folios to the pagelist already, but then work with them as pages. Removes a call to compound_head() in PageKsm() and removes a reference to page->index. Link: https://lkml.kernel.org/r/20240229153015.1996829-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Gregory Price <gregory.price@memverge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: make folio_pte_batch available outside of mm/memory.cBarry Song2024-03-062-76/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | madvise, mprotect and some others might need folio_pte_batch to check if a range of PTEs are completely mapped to a large folio with contiguous physical addresses. Let's make it available in mm/internal.h. While at it, add proper kernel doc and sanity-check more input parameters using two additional VM_WARN_ON_FOLIO(). [21cnbao@gmail.com: build fix] Link: https://lkml.kernel.org/r/CAGsJ_4wWzG-37D82vqP_zt+Fcbz+URVe5oXLBc4M5wbN8A_gpQ@mail.gmail.com [david@redhat.com: improve the doc for the exported func] Link: https://lkml.kernel.org/r/20240227104201.337988-1-21cnbao@gmail.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: constify more page/folio testsMatthew Wilcox (Oracle)2024-03-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Constify the flag tests that aren't automatically generated and the tests that look like flag tests but are more complicated. Link: https://lkml.kernel.org/r/20240227192337.757313-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: make dump_page() take a const argumentMatthew Wilcox (Oracle)2024-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that __dump_page() takes a const argument, we can make dump_page() take a const struct page too. Link: https://lkml.kernel.org/r/20240227192337.757313-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: add __dump_folio()Matthew Wilcox (Oracle)2024-03-061-55/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Turn __dump_page() into a wrapper around __dump_folio(). Snapshot the page & folio into a stack variable so we don't hit BUG_ON() if an allocation is freed under us and what was a folio pointer becomes a pointer to a tail page. [willy@infradead.org: fix build issue] Link: https://lkml.kernel.org/r/ZeAKCyTn_xS3O9cE@casper.infradead.org [willy@infradead.org: fix __dump_folio] Link: https://lkml.kernel.org/r/ZeJJegP8zM7S9GTy@casper.infradead.org [willy@infradead.org: fix pointer confusion] Link: https://lkml.kernel.org/r/ZeYa00ixxC4k1ot-@casper.infradead.org [akpm@linux-foundation.org: s/printk/pr_warn/] Link: https://lkml.kernel.org/r/20240227192337.757313-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | hugetlb: parallelize 1G hugetlb initializationGang Li2024-03-061-8/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimizing the initialization speed of 1G huge pages through parallelization. 1G hugetlbs are allocated from bootmem, a process that is already very fast and does not currently require optimization. Therefore, we focus on parallelizing only the initialization phase in `gather_bootmem_prealloc`. Here are some test results: test case no patch(ms) patched(ms) saved ------------------- -------------- ------------- -------- 256c2T(4 node) 1G 4745 2024 57.34% 128c1T(2 node) 1G 3358 1712 49.02% 12T 1G 77000 18300 76.23% [akpm@linux-foundation.org: s/initialied/initialized/, per Alexey] Link: https://lkml.kernel.org/r/20240222140422.393911-9-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | hugetlb: parallelize 2M hugetlb allocation and initializationGang Li2024-03-061-17/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By distributing both the allocation and the initialization tasks across multiple threads, the initialization of 2M hugetlb will be faster, thereby improving the boot speed. Here are some test results: test case no patch(ms) patched(ms) saved ------------------- -------------- ------------- -------- 256c2T(4 node) 2M 3336 1051 68.52% 128c1T(2 node) 2M 1943 716 63.15% Link: https://lkml.kernel.org/r/20240222140422.393911-8-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | Author: Gang Li padata: dispatch works onGang Li Subject: padata: dispatch works on2024-03-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | different nodes Date: Thu, 22 Feb 2024 22:04:17 +0800 When a group of tasks that access different nodes are scheduled on the same node, they may encounter bandwidth bottlenecks and access latency. Thus, numa_aware flag is introduced here, allowing tasks to be distributed across different nodes to fully utilize the advantage of multi-node systems. Link: https://lkml.kernel.org/r/20240222140422.393911-5-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_allocGang Li2024-03-061-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With parallelization of hugetlb allocation across different threads, each thread works on a differnet node to allocate pages from, instead of all allocating from a common node h->next_nid_to_alloc. To address this, it's necessary to assign a separate next_nid_to_alloc for each thread. Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc have been modified to directly accept a *next_nid_to_alloc parameter, ensuring thread-specific allocation and avoiding concurrent access issues. Link: https://lkml.kernel.org/r/20240222140422.393911-4-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | hugetlb: split hugetlb_hstate_alloc_pagesGang Li2024-03-061-44/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1G and 2M huge pages have different allocation and initialization logic, which leads to subtle differences in parallelization. Therefore, it is appropriate to split hugetlb_hstate_alloc_pages into gigantic and non-gigantic. This patch has no functional changes. Link: https://lkml.kernel.org/r/20240222140422.393911-3-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | hugetlb: code clean for hugetlb_hstate_alloc_pagesGang Li2024-03-061-17/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "hugetlb: parallelize hugetlb page init on boot", v6. Introduction ------------ Hugetlb initialization during boot takes up a considerable amount of time. For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2 seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel host takes more than 1 minute[1]. This is a noteworthy figure. Inspired by [2] and [3], hugetlb initialization can also be accelerated through parallelization. Kernel already has infrastructure like padata_do_multithreaded, this patch uses it to achieve effective results by minimal modifications. [1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@google.com/ [2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@oracle.com/ [3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@bytedance.com/ [4] https://lore.kernel.org/all/76becfc1-e609-e3e8-2966-4053143170b6@google.com/ max_threads ----------- This patch use `padata_do_multithreaded` like this: ``` job.max_threads = num_node_state(N_MEMORY) * multiplier; padata_do_multithreaded(&job); ``` To fully utilize the CPU, the number of parallel threads needs to be carefully considered. `max_threads = num_node_state(N_MEMORY)` does not fully utilize the CPU, so we need to multiply it by a multiplier. Tests below indicate that a multiplier of 2 significantly improves performance, and although larger values also provide improvements, the gains are marginal. multiplier 1 2 3 4 5 ------------ ------- ------- ------- ------- ------- 256G 2node 358ms 215ms 157ms 134ms 126ms 2T 4node 979ms 679ms 543ms 489ms 481ms 50G 2node 71ms 44ms 37ms 30ms 31ms Therefore, choosing 2 as the multiplier strikes a good balance between enhancing parallel processing capabilities and maintaining efficient resource management. Test result ----------- test case no patch(ms) patched(ms) saved ------------------- -------------- ------------- -------- 256c2T(4 node) 1G 4745 2024 57.34% 128c1T(2 node) 1G 3358 1712 49.02% 12T 1G 77000 18300 76.23% 256c2T(4 node) 2M 3336 1051 68.52% 128c1T(2 node) 2M 1943 716 63.15% This patch (of 8): The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the code, its readability can be improved, facilitating future modifications. This patch extracts two functions to reduce the complexity of `hugetlb_hstate_alloc_pages` and has no functional changes. - hugetlb_hstate_alloc_pages_node_specific() to handle iterates through each online node and performs allocation if necessary. - hugetlb_hstate_alloc_pages_report() report error during allocation. And the value of h->max_huge_pages is updated accordingly. Link: https://lkml.kernel.org/r/20240222140422.393911-1-gang.li@linux.dev Link: https://lkml.kernel.org/r/20240222140422.393911-2-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/zsmalloc: don't need to reserve LSB in handleChengming Zhou2024-03-041-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We will save allocated tag in the object header to indicate that it's allocated. handle |= OBJ_ALLOCATED_TAG; So the object header needs to reserve LSB for this tag bit. But the handle itself doesn't need to reserve LSB to save tag, since it's only used to find the position of object, by (pfn + obj_idx). So remove LSB reserve from handle, one more bit can be used as obj_idx. Link: https://lkml.kernel.org/r/20240228023854.3511239-1-chengming.zhou@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/memory.c: do_numa_page(): remove a redundant page table readJohn Hubbard2024-03-041-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_numa_page() is reading from the same page table entry, twice, while holding the page table lock: once while checking that the pte hasn't changed, and again in order to modify the pte. Instead, just read the pte once, and save it in the same old_pte variable that already exists. This has no effect on behavior, other than to provide a tiny potential improvement to performance, by avoiding the redundant memory read (which the compiler cannot elide, due to READ_ONCE()). Also improve the associated comments nearby. Link: https://lkml.kernel.org/r/20240228034151.459370-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: add alloc_contig_migrate_range allocation statisticsRichard Chang2024-03-043-7/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | alloc_contig_migrate_range has every information to be able to understand big contiguous allocation latency. For example, how many pages are migrated, how many times they were needed to unmap from page tables. This patch adds the trace event to collect the allocation statistics. In the field, it was quite useful to understand CMA allocation latency. [akpm@linux-foundation.org: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled] Link: https://lkml.kernel.org/r/20240228051127.2859472-1-richardycc@google.com Signed-off-by: Richard Chang <richardycc@google.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org. Cc: Martin Liu <liumartin@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: use folio more widely in __split_huge_pageMatthew Wilcox (Oracle)2024-03-041-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already have a folio; use it instead of the head page where reasonable. Saves a couple of calls to compound_head() and elimimnates a few references to page->mapping. Link: https://lkml.kernel.org/r/20240228164326.1355045-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: convert free_swap_cache() to take a folioMatthew Wilcox (Oracle)2024-03-043-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All but one caller already has a folio, so convert free_page_and_swap_cache() to have a folio and remove the call to page_folio(). Link: https://lkml.kernel.org/r/20240227174254.710559-19-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: use a folio in __collapse_huge_page_copy_succeeded()Matthew Wilcox (Oracle)2024-03-041-16/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These pages are all chained together through the lru list, so we know they're folios. Use the folio APIs to save three hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20240227174254.710559-18-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: convert free_pages_and_swap_cache() to use folios_put()Matthew Wilcox (Oracle)2024-03-041-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Process the pages in batch-sized quantities instead of all-at-once. Link: https://lkml.kernel.org/r/20240227174254.710559-17-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: remove free_unref_page_list()Matthew Wilcox (Oracle)2024-03-042-19/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All callers now use free_unref_folios() so we can delete this function. Link: https://lkml.kernel.org/r/20240227174254.710559-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | memcg: remove mem_cgroup_uncharge_list()Matthew Wilcox (Oracle)2024-03-041-19/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All users have been converted to mem_cgroup_uncharge_folios() so we can remove this API. Link: https://lkml.kernel.org/r/20240227174254.710559-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: free folios directly in move_folios_to_lru()Matthew Wilcox (Oracle)2024-03-041-20/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The few folios which can't be moved to the LRU list (because their refcount dropped to zero) used to be returned to the caller to dispose of. Make this simpler to call by freeing the folios directly through free_unref_folios(). Link: https://lkml.kernel.org/r/20240227174254.710559-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: free folios in a batch in shrink_folio_list()Matthew Wilcox (Oracle)2024-03-041-11/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use free_unref_page_batch() to free the folios. This may increase the number of IPIs from calling try_to_unmap_flush() more often, but that's going to be very workload-dependent. It may even reduce the number of IPIs as we now batch-free large folios instead of freeing them one at a time. Link: https://lkml.kernel.org/r/20240227174254.710559-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: allow non-hugetlb large folios to be batch processedMatthew Wilcox (Oracle)2024-03-041-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hugetlb folios still get special treatment, but normal large folios can now be freed by free_unref_folios(). This should have a reasonable performance impact, TBD. Link: https://lkml.kernel.org/r/20240227174254.710559-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: handle large folios in free_unref_folios()Matthew Wilcox (Oracle)2024-03-041-8/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Call folio_undo_large_rmappable() if needed. free_unref_page_prepare() destroys the ability to call folio_order(), so stash the order in folio->private for the benefit of the second loop. Link: https://lkml.kernel.org/r/20240227174254.710559-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>