summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
* mm: remove PageAnonExclusive assertions in unuse_pte()Matthew Wilcox (Oracle)2023-12-291-4/+0
| | | | | | | | | | The page in question is either freshly allocated or known to be in the swap cache; these assertions are not particularly useful. Link: https://lkml.kernel.org/r/20231212164813.2540119-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: convert ksm_might_need_to_copy() to work on foliosMatthew Wilcox (Oracle)2023-12-293-17/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Finish two folio conversions". Most callers of page_add_new_anon_rmap() and lru_cache_add_inactive_or_unevictable() have been converted to their folio equivalents, but there are still a few stragglers. There's a bit of preparatory work in ksm and unuse_pte(), but after that it's pretty mechanical. This patch (of 9): Accept a folio as an argument and return a folio result. Removes a call to compound_head() in do_swap_page(), and prevents folio & page from getting out of sync in unuse_pte(). Reviewed-by: David Hildenbrand <david@redhat.com> [willy@infradead.org: fix smatch warning] Link: https://lkml.kernel.org/r/ZXnPtblC6A1IkyAB@casper.infradead.org [david@redhat.com: only adjust the page if the folio changed] Link: https://lkml.kernel.org/r/6a8f2110-fa91-4c10-9eae-88315309a6e3@redhat.com Link: https://lkml.kernel.org/r/20231211162214.2146080-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231211162214.2146080-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* userfaultfd: UFFDIO_MOVE uABIAndrea Arcangeli2023-12-294-0/+745
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the uABI of UFFDIO_MOVE ioctl. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [2]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [2]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. [1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/ [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/ Update for the ioctl_userfaultfd(2) manpage: UFFDIO_MOVE (Since Linux xxx) Move a continuous memory chunk into the userfault registered range and optionally wake up the blocked thread. The source and destination addresses and the number of bytes to move are specified by the src, dst, and len fields of the uffdio_move structure pointed to by argp: struct uffdio_move { __u64 dst; /* Destination of move */ __u64 src; /* Source of move */ __u64 len; /* Number of bytes to move */ __u64 mode; /* Flags controlling behavior of move */ __s64 move; /* Number of bytes moved, or negated error */ }; The following value may be bitwise ORed in mode to change the behavior of the UFFDIO_MOVE operation: UFFDIO_MOVE_MODE_DONTWAKE Do not wake up the thread that waits for page-fault resolution UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES Allow holes in the source virtual range that is being moved. When not specified, the holes will result in ENOENT error. When specified, the holes will be accounted as successfully moved memory. This is mostly useful to move hugepage aligned virtual regions without knowing if there are transparent hugepages in the regions or not, but preventing the risk of having to split the hugepage during the operation. The move field is used by the kernel to return the number of bytes that was actually moved, or an error (a negated errno- style value). If the value returned in move doesn't match the value that was specified in len, the operation fails with the error EAGAIN. The move field is output-only; it is not read by the UFFDIO_MOVE operation. The operation may fail for various reasons. Usually, remapping of pages that are not exclusive to the given process fail; once KSM might deduplicate pages or fork() COW-shares pages during fork() with child processes, they are no longer exclusive. Further, the kernel might only perform lightweight checks for detecting whether the pages are exclusive, and return -EBUSY in case that check fails. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source VMA before fork(). This ioctl(2) operation returns 0 on success. In this case, the entire area was moved. On error, -1 is returned and errno is set to indicate the error. Possible errors include: EAGAIN The number of bytes moved (i.e., the value returned in the move field) does not equal the value that was specified in the len field. EINVAL Either dst or len was not a multiple of the system page size, or the range specified by src and len or dst and len was invalid. EINVAL An invalid bit was specified in the mode field. ENOENT The source virtual memory range has unmapped holes and UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set. EEXIST The destination virtual memory range is fully or partially mapped. EBUSY The pages in the source virtual memory range are either pinned or not exclusive to the process. The kernel might only perform lightweight checks for detecting whether the pages are exclusive. To make the operation more likely to succeed, KSM should be disabled, fork() should be avoided or MADV_DONTFORK should be configured for the source virtual memory area before fork(). ENOMEM Allocating memory needed for the operation failed. ESRCH The target process has exited at the time of a UFFDIO_MOVE operation. Link: https://lkml.kernel.org/r/20231206103702.3873743-3-surenb@google.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/rmap: support move to different root anon_vma in folio_move_anon_rmap()Andrea Arcangeli2023-12-291-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "userfaultfd move option", v6. This patch series introduces UFFDIO_MOVE feature to userfaultfd, which has long been implemented and maintained by Andrea in his local tree [1], but was not upstreamed due to lack of use cases where this approach would be better than allocating a new page and copying the contents. Previous upstraming attempts could be found at [6] and [7]. UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are available (in userspace) for recycling, as is usually the case in heap compaction algorithms, then we can avoid the page allocation and memcpy (done by UFFDIO_COPY). Also, since the pages are recycled in the userspace, we avoid the need to release (via madvise) the pages back to the kernel [3]. We see over 40% reduction (on a Google pixel 6 device) in the compacting thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was measured using a benchmark that emulates a heap compaction implementation using userfaultfd (to allow concurrent accesses by application threads). More details of the usecase are explained in [3]. Furthermore, UFFDIO_MOVE enables moving swapped-out pages without touching them within the same vma. Today, it can only be done by mremap, however it forces splitting the vma. TODOs for follow-up improvements: - cross-mm support. Known differences from single-mm and missing pieces: - memcg recharging (might need to isolate pages in the process) - mm counters - cross-mm deposit table moves - cross-mm test - document the address space where src and dest reside in struct uffdio_move - TLB flush batching. Will require extensive changes to PTL locking in move_pages_pte(). OTOH that might let us reuse parts of mremap code. This patch (of 5): For now, folio_move_anon_rmap() was only used to move a folio to a different anon_vma after fork(), whereby the root anon_vma stayed unchanged. For that, it was sufficient to hold the folio lock when calling folio_move_anon_rmap(). However, we want to make use of folio_move_anon_rmap() to move folios between VMAs that have a different root anon_vma. As folio_referenced() performs an RMAP walk without holding the folio lock but only holding the anon_vma in read mode, holding the folio lock is insufficient. When moving to an anon_vma with a different root anon_vma, we'll have to hold both, the folio lock and the anon_vma lock in write mode. Consequently, whenever we succeeded in folio_lock_anon_vma_read() to read-lock the anon_vma, we have to re-check if the mapping was changed in the meantime. If that was the case, we have to retry. Note that folio_move_anon_rmap() must only be called if the anon page is exclusive to a process, and must not be called on KSM folios. This is a preparation for UFFDIO_MOVE, which will hold the folio lock, the anon_vma lock in write mode, and the mmap_lock in read mode. Link: https://lkml.kernel.org/r/20231206103702.3873743-1-surenb@google.com Link: https://lkml.kernel.org/r/20231206103702.3873743-2-surenb@google.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: kernel-team@android.com Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: page_alloc: simplify __free_pages_ok()Yajun Deng2023-12-201-8/+1
| | | | | | | | | | There is redundant code in __free_pages_ok(). Use free_one_page() simplify it. Link: https://lkml.kernel.org/r/20231216030503.2126130-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/memory: replace kmap() with kmap_local_page()Fabio M. De Francesco2023-12-201-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kmap() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap() with kmap_local_page() in mm/memory.c. There are two main problems with kmap(): (1) It comes with an overhead as the mapping space is restricted and protected by a global lock for synchronization and (2) it also requires global TLB invalidation when the kmap's pool wraps and it might block when the mapping space is fully utilized until a slot becomes available. With kmap_local_page() the mappings are per thread, CPU local, can take page-faults, and can be called from any context (including interrupts). It is faster than kmap() in kernels with HIGHMEM enabled. The tasks can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and still valid. Obviously, thread locality implies that the kernel virtual addresses returned by kmap_local_page() are only valid in the context of the callers (i.e., they cannot be handed to other threads). The use of kmap_local_page() in mm/memory.c does not break the above-mentioned assumption, so it is allowed and preferred. Link: https://lkml.kernel.org/r/20231215084417.2002370-1-fabio.maria.de.francesco@linux.intel.com Link: https://lkml.kernel.org/r/20231214081039.1919328-1-fabio.maria.de.francesco@linux.intel.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/damon/core-test: test max_nr_accesses overflow caused divide-by-zeroSeongJae Park2023-12-201-0/+11
| | | | | | | | | | | | Commit 35f5d94187a6 ("mm/damon: implement a function for max nr_accesses safe calculation") has fixed an overflow bug that could cause divide-by-zero. Add a kunit test for the bug to ensure similar bugs are not introduced again. Link: https://lkml.kernel.org/r/20231213190338.54146-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/damon: update email of SeongJaeSeongJae Park2023-12-207-7/+7
| | | | | | | | | | | | | | | | | | Patch series "mm/damon: misc updates for 6.8". Update comments, tests, and documents for DAMON. This patch (of 6): SeongJae is using his kernel.org account for DAMON development. Update the old email addresses on the comments of DAMON source files. Link: https://lkml.kernel.org/r/20231213190338.54146-1-sj@kernel.org Link: https://lkml.kernel.org/r/20231213190338.54146-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: ksm: remove unnecessary try_to_freeze()Kevin Hao2023-12-201-3/+1
| | | | | | | | | | | | | | A freezable kernel thread can enter frozen state during freezing by either calling try_to_freeze() or using wait_event_freezable() and its variants. However, there is no need to use both methods simultaneously. Link: https://lkml.kernel.org/r/20231213090906.1070985-1-haokexin@gmail.com Signed-off-by: Kevin Hao <haokexin@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: thp: support allocation of anonymous multi-size THPRyan Roberts2023-12-201-9/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce the logic to allow THP to be configured (through the new sysfs interface we just added) to allocate large folios to back anonymous memory, which are larger than the base page size but smaller than PMD-size. We call this new THP extension "multi-size THP" (mTHP). mTHP continues to be PTE-mapped, but in many cases can still provide similar benefits to traditional PMD-sized THP: Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on the configured order), but latency spikes are much less prominent because the size of each page isn't as huge as the PMD-sized variant and there is less memory to clear in each page fault. The number of per-page operations (e.g. ref counting, rmap management, lru list management) are also significantly reduced since those ops now become per-folio. Some architectures also employ TLB compression mechanisms to squeeze more entries in when a set of PTEs are virtually and physically contiguous and approporiately aligned. In this case, TLB misses will occur less often. The new behaviour is disabled by default, but can be enabled at runtime by writing to /sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled (see documentation in previous commit). The long term aim is to change the default to include suitable lower orders, but there are some risks around internal fragmentation that need to be better understood first. [ryan.roberts@arm.com: resolve some multi-size THP review nits] Link: https://lkml.kernel.org/r/20231214160251.3574571-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Itaru Kitayama <itaru.kitayama@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: thp: introduce multi-size THP sysfs interfaceRyan Roberts2023-12-204-39/+221
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for adding support for anonymous multi-size THP, introduce new sysfs structure that will be used to control the new behaviours. A new directory is added under transparent_hugepage for each supported THP size, and contains an `enabled` file, which can be set to "inherit" (to inherit the global setting), "always", "madvise" or "never". For now, the kernel still only supports PMD-sized anonymous THP, so only 1 directory is populated. The first half of the change converts transhuge_vma_suitable() and hugepage_vma_check() so that they take a bitfield of orders for which the user wants to determine support, and the functions filter out all the orders that can't be supported, given the current sysfs configuration and the VMA dimensions. The resulting functions are renamed to thp_vma_suitable_orders() and thp_vma_allowable_orders() respectively. Convenience functions that take a single, unencoded order and return a boolean are also defined as thp_vma_suitable_order() and thp_vma_allowable_order(). The second half of the change implements the new sysfs interface. It has been done so that each supported THP size has a `struct thpsize`, which describes the relevant metadata and is itself a kobject. This is pretty minimal for now, but should make it easy to add new per-thpsize files to the interface if needed in future (e.g. per-size defrag). Rather than keep the `enabled` state directly in the struct thpsize, I've elected to directly encode it into huge_anon_orders_[always|madvise|inherit] bitfields since this reduces the amount of work required in thp_vma_allowable_orders() which is called for every page fault. See Documentation/admin-guide/mm/transhuge.rst, as modified by this commit, for details of how the new sysfs interface works. [ryan.roberts@arm.com: fix build warning when CONFIG_SYSFS is disabled] Link: https://lkml.kernel.org/r/20231211125320.3997543-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20231207161211.2374093-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Itaru Kitayama <itaru.kitayama@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: non-pmd-mappable, large folios for folio_add_new_anon_rmap()Ryan Roberts2023-12-201-8/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for supporting anonymous multi-size THP, improve folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be passed to it. In this case, all contained pages are accounted using the order-0 folio (or base page) scheme. Link: https://lkml.kernel.org/r/20231207161211.2374093-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Itaru Kitayama <itaru.kitayama@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: allow deferred splitting of arbitrary anon large foliosRyan Roberts2023-12-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Multi-size THP for anonymous memory", v9. A series to implement multi-size THP (mTHP) for anonymous memory (previously called "small-sized THP" and "large anonymous folios"). The objective of this is to improve performance by allocating larger chunks of memory during anonymous page faults: 1) Since SW (the kernel) is dealing with larger chunks of memory than base pages, there are efficiency savings to be had; fewer page faults, batched PTE and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel overhead. This should benefit all architectures. 2) Since we are now mapping physically contiguous chunks of memory, we can take advantage of HW TLB compression techniques. A reduction in TLB pressure speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce TLB entries; "the contiguous bit" (architectural) and HPA (uarch). This version incorporates David's feedback on the core patches (#3, #4) and adds some RB and TB tags (see change log for details). By default, the existing behaviour (and performance) is maintained. The user must explicitly enable multi-size THP to see the performance benefit. This is done via a new sysfs interface (as recommended by David Hildenbrand - thanks to David for the suggestion)! This interface is inspired by the existing per-hugepage-size sysfs interface used by hugetlb, provides full backwards compatibility with the existing PMD-size THP interface, and provides a base for future extensibility. See [9] for detailed discussion of the interface. This series is based on mm-unstable (715b67adf4c8). Prerequisites ============= I'm removing this section on the basis that I don't believe what we were previously calling prerequisites are really prerequisites anymore. We originally defined them when mTHP was a compile-time feature. There is now a runtime control to opt-in to mTHP; when disabled, correctness and performance are as before. When enabled, the code is still correct/robust, but in the absence of the one remaining item (compaction) there may be a performance impact in some corners. See the old list in the v8 cover letter at [8]. And a longer explanation of my thinking here [10]. SUMMARY: I don't think we should hold this series up, waiting for the items on the prerequisites list. I believe this series should be ready now so hopefully can be added to mm-unstable for some testing, then fingers crossed for v6.8. Testing ======= The series includes patches for mm selftests to enlighten the cow and khugepaged tests to explicitly test with multi-size THP, in the same way that PMD-sized THP is tested. The new tests all pass, and no regressions are observed in the mm selftest suite. I've also run my usual kernel compilation and java script benchmarks without any issues. Refer to my performance numbers posted with v6 [6]. (These are for multi-size THP only - they do not include the arm64 contpte follow-on series). John Hubbard at Nvidia has indicated dramatic 10x performance improvements for some workloads at [11]. (Observed using v6 of this series as well as the arm64 contpte series). Kefeng Wang at Huawei has also indicated he sees improvements at [12] although there are some latency regressions also. I've also checked that there is no regression in the write fault path when mTHP is disabled using a microbenchmark. I ran it for a baseline kernel, as well as v8 and v9. I repeated on Ampere Altra (bare metal) and Apple M2 (VM): | | m2 vm | altra | |--------------|---------------------|---------------------| | kernel | mean | std_rel | mean | std_rel | |--------------|----------|----------|----------|----------| | baseline | 0.000% | 0.341% | 0.000% | 3.581% | | anonfolio-v8 | 0.005% | 0.272% | 5.068% | 1.128% | | anonfolio-v9 | -0.013% | 0.442% | 0.107% | 1.788% | There is no measurable difference on M2, but altra has a slow down in v8 which is fixed in v9 by moving the THP order check to be inline within thp_vma_allowable_orders(), as suggested by David. This patch (of 10): In preparation for the introduction of anonymous multi-size THP, we would like to be able to split them when they have unmapped subpages, in order to free those unused pages under memory pressure. So remove the artificial requirement that the large folio needed to be at least PMD-sized. Link: https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20231207161211.2374093-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Itaru Kitayama <itaru.kitayama@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: memcg: restore subtree stats flushingYosry Ahmed2023-12-204-34/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stats flushing for memcg currently follows the following rules: - Always flush the entire memcg hierarchy (i.e. flush the root). - Only one flusher is allowed at a time. If someone else tries to flush concurrently, they skip and return immediately. - A periodic flusher flushes all the stats every 2 seconds. The reason this approach is followed is because all flushes are serialized by a global rstat spinlock. On the memcg side, flushing is invoked from userspace reads as well as in-kernel flushers (e.g. reclaim, refault, etc). This approach aims to avoid serializing all flushers on the global lock, which can cause a significant performance hit under high concurrency. This approach has the following problems: - Occasionally a userspace read of the stats of a non-root cgroup will be too expensive as it has to flush the entire hierarchy [1]. - Sometimes the stats accuracy are compromised if there is an ongoing flush, and we skip and return before the subtree of interest is actually flushed, yielding stale stats (by up to 2s due to periodic flushing). This is more visible when reading stats from userspace, but can also affect in-kernel flushers. The latter problem is particulary a concern when userspace reads stats after an event occurs, but gets stats from before the event. Examples: - When memory usage / pressure spikes, a userspace OOM handler may look at the stats of different memcgs to select a victim based on various heuristics (e.g. how much private memory will be freed by killing this). Reading stale stats from before the usage spike in this case may cause a wrongful OOM kill. - A proactive reclaimer may read the stats after writing to memory.reclaim to measure the success of the reclaim operation. Stale stats from before reclaim may give a false negative. - Reading the stats of a parent and a child memcg may be inconsistent (child larger than parent), if the flush doesn't happen when the parent is read, but happens when the child is read. As for in-kernel flushers, they will occasionally get stale stats. No regressions are currently known from this, but if there are regressions, they would be very difficult to debug and link to the source of the problem. This patch aims to fix these problems by restoring subtree flushing, and removing the unified/coalesced flushing logic that skips flushing if there is an ongoing flush. This change would introduce a significant regression with global stats flushing thresholds. With per-memcg stats flushing thresholds, this seems to perform really well. The thresholds protect the underlying lock from unnecessary contention. This patch was tested in two ways to ensure the latency of flushing is up to par, on a machine with 384 cpus: - A synthetic test with 5000 concurrent workers in 500 cgroups doing allocations and reclaim, as well as 1000 readers for memory.stat (variation of [2]). No regressions were noticed in the total runtime. Note that significant regressions in this test are observed with global stats thresholds, but not with per-memcg thresholds. - A synthetic stress test for concurrently reading memcg stats while memory allocation/freeing workers are running in the background, provided by Wei Xu [3]. With 250k threads reading the stats every 100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01% of reads take more than 1ms, and no reads take more than 100ms. [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/ [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/ [akpm@linux-foundation.org: fix mm/zswap.c] [yosryahmed@google.com: remove stats flushing mutex] Link: https://lkml.kernel.org/r/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com Link: https://lkml.kernel.org/r/20231129032154.3710765-6-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: workingset: move the stats flush into workingset_test_recent()Yosry Ahmed2023-12-201-12/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The workingset code flushes the stats in workingset_refault() to get accurate stats of the eviction memcg. In preparation for more scoped flushed and passing the eviction memcg to the flush call, move the call to workingset_test_recent() where we have a pointer to the eviction memcg. The flush call is sleepable, and cannot be made in an rcu read section. Hence, minimize the rcu read section by also moving it into workingset_test_recent(). Furthermore, instead of holding the rcu read lock throughout workingset_test_recent(), only hold it briefly to get a ref on the eviction memcg. This allows us to make the flush call after we get the eviction memcg. As for workingset_refault(), nothing else there appears to be protected by rcu. The memcg of the faulted folio (which is not necessarily the same as the eviction memcg) is protected by the folio lock, which is held from all callsites. Add a VM_BUG_ON() to make sure this doesn't change from under us. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-5-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: memcg: make stats flushing threshold per-memcgYosry Ahmed2023-12-201-16/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A global counter for the magnitude of memcg stats update is maintained on the memcg side to avoid invoking rstat flushes when the pending updates are not significant. This avoids unnecessary flushes, which are not very cheap even if there isn't a lot of stats to flush. It also avoids unnecessary lock contention on the underlying global rstat lock. Make this threshold per-memcg. The scheme is followed where percpu (now also per-memcg) counters are incremented in the update path, and only propagated to per-memcg atomics when they exceed a certain threshold. This provides two benefits: (a) On large machines with a lot of memcgs, the global threshold can be reached relatively fast, so guarding the underlying lock becomes less effective. Making the threshold per-memcg avoids this. (b) Having a global threshold makes it hard to do subtree flushes, as we cannot reset the global counter except for a full flush. Per-memcg counters removes this as a blocker from doing subtree flushes, which helps avoid unnecessary work when the stats of a small subtree are needed. Nothing is free, of course. This comes at a cost: (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4 bytes. The extra memory usage is insigificant. (b) More work on the update side, although in the common case it will only be percpu counter updates. The amount of work scales with the number of ancestors (i.e. tree depth). This is not a new concept, adding a cgroup to the rstat tree involves a parent loop, so is charging. Testing results below show no significant regressions. (c) The error margin in the stats for the system as a whole increases from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH * NR_MEMCGS. This is probably fine because we have a similar per-memcg error in charges coming from percpu stocks, and we have a periodic flusher that makes sure we always flush all the stats every 2s anyway. This patch was tested to make sure no significant regressions are introduced on the update path as follows. The following benchmarks were ran in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/): (1) Running 22 instances of netperf on a 44 cpu machine with hyperthreading disabled. All instances are run in a level 2 cgroup, as well as netserver: # netserver -6 # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K Averaging 20 runs, the numbers are as follows: Base: 40198.0 mbps Patched: 38629.7 mbps (-3.9%) The regression is minimal, especially for 22 instances in the same cgroup sharing all ancestors (so updating the same atomics). (2) will-it-scale page_fault tests. These tests (specifically per_process_ops in page_fault3 test) detected a 25.9% regression before for a change in the stats update path [1]. These are the numbers from 10 runs (+ is good) on a machine with 256 cpus: LABEL | MEAN | MEDIAN | STDDEV | ------------------------------+-------------+-------------+------------- page_fault1_per_process_ops | | | | (A) base | 270249.164 | 265437.000 | 13451.836 | (B) patched | 261368.709 | 255725.000 | 13394.767 | | -3.29% | -3.66% | | page_fault1_per_thread_ops | | | | (A) base | 242111.345 | 239737.000 | 10026.031 | (B) patched | 237057.109 | 235305.000 | 9769.687 | | -2.09% | -1.85% | | page_fault1_scalability | | | (A) base | 0.034387 | 0.035168 | 0.0018283 | (B) patched | 0.033988 | 0.034573 | 0.0018056 | | -1.16% | -1.69% | | page_fault2_per_process_ops | | | (A) base | 203561.836 | 203301.000 | 2550.764 | (B) patched | 197195.945 | 197746.000 | 2264.263 | | -3.13% | -2.73% | | page_fault2_per_thread_ops | | | (A) base | 171046.473 | 170776.000 | 1509.679 | (B) patched | 166626.327 | 166406.000 | 768.753 | | -2.58% | -2.56% | | page_fault2_scalability | | | (A) base | 0.054026 | 0.053821 | 0.00062121 | (B) patched | 0.053329 | 0.05306 | 0.00048394 | | -1.29% | -1.41% | | page_fault3_per_process_ops | | | (A) base | 1295807.782 | 1297550.000 | 5907.585 | (B) patched | 1275579.873 | 1273359.000 | 8759.160 | | -1.56% | -1.86% | | page_fault3_per_thread_ops | | | (A) base | 391234.164 | 390860.000 | 1760.720 | (B) patched | 377231.273 | 376369.000 | 1874.971 | | -3.58% | -3.71% | | page_fault3_scalability | | | (A) base | 0.60369 | 0.60072 | 0.0083029 | (B) patched | 0.61733 | 0.61544 | 0.009855 | | +2.26% | +2.45% | | All regressions seem to be minimal, and within the normal variance for the benchmark. The fix for [1] assumes that 3% is noise -- and there were no further practical complaints), so hopefully this means that such variations in these microbenchmarks do not reflect on practical workloads. (3) I also ran stress-ng in a nested cgroup and did not observe any obvious regressions. [1]https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/ Link: https://lkml.kernel.org/r/20231129032154.3710765-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: memcg: move vmstats structs definition above flushing codeYosry Ahmed2023-12-201-74/+74
| | | | | | | | | | | | | | | | | | | | | | | | | The following patch will make use of those structs in the flushing code, so move their definitions (and a few other dependencies) a little bit up to reduce the diff noise in the following patch. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-3-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: memcg: change flush_next_time to flush_last_timeYosry Ahmed2023-12-201-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: memcg: subtree stats flushing and thresholds", v4. This series attempts to address shortages in today's approach for memcg stats flushing, namely occasionally stale or expensive stat reads. The series does so by changing the threshold that we use to decide whether to trigger a flush to be per memcg instead of global (patch 3), and then changing flushing to be per memcg (i.e. subtree flushes) instead of global (patch 5). This patch (of 5): flush_next_time is an inaccurate name. It's not the next time that periodic flushing will happen, it's rather the next time that ratelimited flushing can happen if the periodic flusher is late. Simplify its semantics by just storing the timestamp of the last flush instead, flush_last_time. Move the 2*FLUSH_TIME addition to mem_cgroup_flush_stats_ratelimited(), and add a comment explaining it. This way, all the ratelimiting semantics live in one place. No functional change intended. Link: https://lkml.kernel.org/r/20231129032154.3710765-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20231129032154.3710765-2-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Chris Li <chrisl@kernel.org> (Google) Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Ivan Babrou <ivan@cloudflare.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/list_lru.c: remove unused list_lru_from_kmem()Andrew Morton2023-12-201-31/+0
| | | | | | | | | | | Fixes: 0a97c01cd20bb ("list_lru: allow explicit memcg and NUMA node selection) Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312141318.q8b5yrAq-lkp@intel.com/ Cc: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* sync mm-stable with mm-hotfixes-stable to pick up depended-upon changesAndrew Morton2023-12-208-57/+137
|\
| * mm/memory-failure: cast index to loff_t before shifting itMatthew Wilcox (Oracle)2023-12-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | On 32-bit systems, we'll lose the top bits of index because arithmetic will be performed in unsigned long instead of unsigned long long. This affects files over 4GB in size. Link: https://lkml.kernel.org/r/20231218135837.3310403-4-willy@infradead.org Fixes: 6100e34b2526 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/memory-failure: check the mapcount of the precise pageMatthew Wilcox (Oracle)2023-12-201-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | A process may map only some of the pages in a folio, and might be missed if it maps the poisoned page but not the head page. Or it might be unnecessarily hit if it maps the head page, but not the poisoned page. Link: https://lkml.kernel.org/r/20231218135837.3310403-3-willy@infradead.org Fixes: 7af446a841a2 ("HWPOISON, hugetlb: enable error handling path for hugepage") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/memory-failure: pass the folio and the page to collect_procs()Matthew Wilcox (Oracle)2023-12-201-13/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Three memory-failure fixes". I've been looking at the memory-failure code and I believe I have found three bugs that need fixing -- one going all the way back to 2010! I'll have more patches later to use folios more extensively but didn't want these bugfixes to get caught up in that. This patch (of 3): Both collect_procs_anon() and collect_procs_file() iterate over the VMA interval trees looking for a single pgoff, so it is wrong to look for the pgoff of the head page as is currently done. However, it is also wrong to look at page->mapping of the precise page as this is invalid for tail pages. Clear up the confusion by passing both the folio and the precise page to collect_procs(). Link: https://lkml.kernel.org/r/20231218135837.3310403-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231218135837.3310403-2-willy@infradead.org Fixes: 415c64c1453a ("mm/memory-failure: split thp earlier in memory error handling") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: migrate high-order folios in swap cache correctlyCharan Teja Kalla2023-12-201-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Large folios occupy N consecutive entries in the swap cache instead of using multi-index entries like the page cache. However, if a large folio is re-added to the LRU list, it can be migrated. The migration code was not aware of the difference between the swap cache and the page cache and assumed that a single xas_store() would be sufficient. This leaves potentially many stale pointers to the now-migrated folio in the swap cache, which can lead to almost arbitrary data corruption in the future. This can also manifest as infinite loops with the RCU read lock held. [willy@infradead.org: modifications to the changelog & tweaked the fix] Fixes: 3417013e0d18 ("mm/migrate: Add folio_migrate_mapping()") Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/filemap: avoid buffered read/write race to read inconsistent dataBaokun Li2023-12-201-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following concurrency may cause the data read to be inconsistent with the data on disk: cpu1 cpu2 ------------------------------|------------------------------ // Buffered write 2048 from 0 ext4_buffered_write_iter generic_perform_write copy_page_from_iter_atomic ext4_da_write_end ext4_da_do_write_end block_write_end __block_commit_write folio_mark_uptodate // Buffered read 4096 from 0 smp_wmb() ext4_file_read_iter set_bit(PG_uptodate, folio_flags) generic_file_read_iter i_size_write // 2048 filemap_read unlock_page(page) filemap_get_pages filemap_get_read_batch folio_test_uptodate(folio) ret = test_bit(PG_uptodate, folio_flags) if (ret) smp_rmb(); // Ensure that the data in page 0-2048 is up-to-date. // New buffered write 2048 from 2048 ext4_buffered_write_iter generic_perform_write copy_page_from_iter_atomic ext4_da_write_end ext4_da_do_write_end block_write_end __block_commit_write folio_mark_uptodate smp_wmb() set_bit(PG_uptodate, folio_flags) i_size_write // 4096 unlock_page(page) isize = i_size_read(inode) // 4096 // Read the latest isize 4096, but without smp_rmb(), there may be // Load-Load disorder resulting in the data in the 2048-4096 range // in the page is not up-to-date. copy_page_to_iter // copyout 4096 In the concurrency above, we read the updated i_size, but there is no read barrier to ensure that the data in the page is the same as the i_size at this point, so we may copy the unsynchronized page out. Hence adding the missing read memory barrier to fix this. This is a Load-Load reordering issue, which only occurs on some weak mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong mem-ordering architectures (e.g. X86). And theoretically the problem doesn't only happen on ext4, filesystems that call filemap_read() but don't hold inode lock (e.g. btrfs, f2fs, ubifs ...) will have this problem, while filesystems with inode lock (e.g. xfs, nfs) won't have this problem. Link: https://lkml.kernel.org/r/20231213062324.739009-1-libaokun1@huawei.com Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: yangerkun <yangerkun@huawei.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * kunit: kasan_test: disable fortify string checker on kmalloc_oob_memsetNico Pache2023-12-201-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to commit 09c6304e38e4 ("kasan: test: fix compatibility with FORTIFY_SOURCE") the kernel is panicing in kmalloc_oob_memset_*. This is due to the `ptr` not being hidden from the optimizer which would disable the runtime fortify string checker. kernel BUG at lib/string_helpers.c:1048! Call Trace: [<00000000272502e2>] fortify_panic+0x2a/0x30 ([<00000000272502de>] fortify_panic+0x26/0x30) [<001bffff817045c4>] kmalloc_oob_memset_2+0x22c/0x230 [kasan_test] Hide the `ptr` variable from the optimizer to fix the kernel panic. Also define a memset_size variable and hide that as well. This cleans up the code and follows the same convention as other tests. [npache@redhat.com: address review comments from Andrey] Link: https://lkml.kernel.org/r/20231214164423.6202-1-npache@redhat.com Link: https://lkml.kernel.org/r/20231212232659.18839-1-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/mglru: reclaim offlined memcgs harderYu Zhao2023-12-121-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the effort to reduce zombie memcgs [1], it was discovered that the memcg LRU doesn't apply enough pressure on offlined memcgs. Specifically, instead of rotating them to the tail of the current generation (MEMCG_LRU_TAIL) for a second attempt, it moves them to the next generation (MEMCG_LRU_YOUNG) after the first attempt. Not applying enough pressure on offlined memcgs can cause them to build up, and this can be particularly harmful to memory-constrained systems. On Pixel 8 Pro, launching apps for 50 cycles: Before After Change Zombie memcgs 45 35 -22% [1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20231208061407.2125867-4-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: T.J. Mercier <tjmercier@google.com> Tested-by: T.J. Mercier <tjmercier@google.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/mglru: respect min_ttl_ms with memcgsYu Zhao2023-12-121-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru: try to stop at high watermarks"), it was discovered that the memcg LRU can breach the thrashing protection imposed by min_ttl_ms. Before the memcg LRU: kswapd() shrink_node_memcgs() mem_cgroup_iter() inc_max_seq() // always hit a different memcg lru_gen_age_node() mem_cgroup_iter() check the timestamp of the oldest generation After the memcg LRU: kswapd() shrink_many() restart: iterate the memcg LRU: inc_max_seq() // occasionally hit the same memcg if raced with lru_gen_rotate_memcg(): goto restart lru_gen_age_node() mem_cgroup_iter() check the timestamp of the oldest generation Specifically, when the restart happens in shrink_many(), it needs to stick with the (memcg LRU) generation it began with. In other words, it should neither re-read memcg_lru->seq nor age an lruvec of a different generation. Otherwise it can hit the same memcg multiple times without giving lru_gen_age_node() a chance to check the timestamp of that memcg's oldest generation (against min_ttl_ms). [1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20231208061407.2125867-3-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Tested-by: T.J. Mercier <tjmercier@google.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/mglru: try to stop at high watermarksYu Zhao2023-12-121-8/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The initial MGLRU patchset didn't include the memcg LRU support, and it relied on should_abort_scan(), added by commit f76c83378851 ("mm: multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid overshooting their aggregate reclaim target by too much". Later on when the memcg LRU was added, should_abort_scan() was deemed unnecessary, and the test results [1] showed no side effects after it was removed by commit a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard"). However, that test used memory.reclaim, which sets nr_to_reclaim to SWAP_CLUSTER_MAX. So it can overshoot only by SWAP_CLUSTER_MAX-1 pages, i.e., from nr_reclaimed=nr_to_reclaim-1 to nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore that test isn't able to reproduce the worst case scenario, i.e., kswapd overshooting GBs on large systems and "consuming 100% CPU" (see the Closes tag). Bring back a simplified version of should_abort_scan() on top of the memcg LRU, so that kswapd stops when all eligible zones are above their respective high watermarks plus a small delta to lower the chance of KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies to order-0 reclaim, meaning compaction-induced reclaim can still run wild (which is a different problem). On Android, launching 55 apps sequentially: Before After Change pgpgin 838377172 802955040 -4% pgpgout 38037080 34336300 -10% [1] https://lore.kernel.org/20221222041905.2431096-1-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20231208061407.2125867-2-yuzhao@google.com Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/ Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: T.J. Mercier <tjmercier@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/mglru: fix underprotected page cacheYu Zhao2023-12-122-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unmapped folios accessed through file descriptors can be underprotected. Those folios are added to the oldest generation based on: 1. The fact that they are less costly to reclaim (no need to walk the rmap and flush the TLB) and have less impact on performance (don't cause major PFs and can be non-blocking if needed again). 2. The observation that they are likely to be single-use. E.g., for client use cases like Android, its apps parse configuration files and store the data in heap (anon); for server use cases like MySQL, it reads from InnoDB files and holds the cached data for tables in buffer pools (anon). However, the oldest generation can be very short lived, and if so, it doesn't provide the PID controller with enough time to respond to a surge of refaults. (Note that the PID controller uses weighted refaults and those from evicted generations only take a half of the whole weight.) In other words, for a short lived generation, the moving average smooths out the spike quickly. To fix the problem: 1. For folios that are already on LRU, if they can be beyond the tracking range of tiers, i.e., five accesses through file descriptors, move them to the second oldest generation to give them more time to age. (Note that tiers are used by the PID controller to statistically determine whether folios accessed multiple times through file descriptors are worth protecting.) 2. When adding unmapped folios to LRU, adjust the placement of them so that they are not too close to the tail. The effect of this is similar to the above. On Android, launching 55 apps sequentially: Before After Change workingset_refault_anon 25641024 25598972 0% workingset_refault_file 115016834 106178438 -8% Link: https://lkml.kernel.org/r/20231208061407.2125867-1-yuzhao@google.com Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Tested-by: Kalesh Singh <kaleshsingh@google.com> Cc: T.J. Mercier <tjmercier@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/shmem: fix race in shmem_undo_range w/THPDavid Stevens2023-12-121-1/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split folios during the second loop of shmem_undo_range. It's not sufficient to only split folios when dealing with partial pages, since it's possible for a THP to be faulted in after that point. Calling truncate_inode_folio in that situation can result in throwing away data outside of the range being targeted. [akpm@linux-foundation.org: tidy up comment layout] Link: https://lkml.kernel.org/r/20230418084031.3439795-1-stevensd@google.com Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: David Stevens <stevensd@chromium.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/damon/core: make damon_start() waits until kdamond_fn() startsSeongJae Park2023-12-121-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cleanup tasks of kdamond threads including reset of corresponding DAMON context's ->kdamond field and decrease of global nr_running_ctxs counter is supposed to be executed by kdamond_fn(). However, commit 0f91d13366a4 ("mm/damon: simplify stop mechanism") made neither damon_start() nor damon_stop() ensure the corresponding kdamond has started the execution of kdamond_fn(). As a result, the cleanup can be skipped if damon_stop() is called fast enough after the previous damon_start(). Especially the skipped reset of ->kdamond could cause a use-after-free. Fix it by waiting for start of kdamond_fn() execution from damon_start(). Link: https://lkml.kernel.org/r/20231208175018.63880-1-sj@kernel.org Fixes: 0f91d13366a4 ("mm/damon: simplify stop mechanism") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Jakub Acs <acsjakub@amazon.de> Cc: Changbin Du <changbin.du@intel.com> Cc: Jakub Acs <acsjakub@amazon.de> Cc: <stable@vger.kernel.org> # 5.15.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: compaction: avoid fast_isolate_freepages blindly choose improper pageblockBarry Song2023-12-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Testing shows fast_isolate_freepages can blindly choose an unsuitable pageblock from time to time particularly while the min mark is used from XXX path: if (!page) { cc->fast_search_fail++; if (scan_start) { /* * Use the highest PFN found above min. If one was * not found, be pessimistic for direct compaction * and use the min mark. */ if (highest >= min_pfn) { page = pfn_to_page(highest); cc->free_pfn = highest; } else { if (cc->direct_compaction && pfn_valid(min_pfn)) { /* XXX */ page = pageblock_pfn_to_page(min_pfn, min(pageblock_end_pfn(min_pfn), zone_end_pfn(cc->zone)), cc->zone); cc->free_pfn = min_pfn; } } } } The reason is that no code is doing any check on the min_pfn min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1)); In contrast, slow path of isolate_freepages() is always skipping unsuitable pageblocks in a decent way. This issue doesn't happen quite often. When running 25 machines with 16GiB memory for one night, most of them can hit this unexpected code path. However the frequency isn't like many times per second. It might be one time in a couple of hours. Thus, it is very hard to measure the visible performance impact in my machines though the affection of choosing the unsuitable migration_target should be negative in theory. I feel it's still worth fixing this to at least make the code theoretically self-explanatory as it is quite odd an unsuitable migration_target can be still migration_target. Link: https://lkml.kernel.org/r/20231206110054.61617-1-v-songbaohua@oppo.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reported-by: Zhanyuan Hu <huzhanyuan@oppo.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: use vma_pages() for vma objectsChen Haonan2023-12-121-1/+1
| | | | | | | | | | | | | | | | | | vma_pages() is more readable and also better at avoiding error codes, so use vma_pages() instead of direct operations on vma Link: https://lkml.kernel.org/r/tencent_151850CF327EB055BBC83298A929BD06CD0A@qq.com Signed-off-by: Chen Haonan <chen.haonan2@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: cma: remove unnecessary initialization of retLi zeming2023-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | The ret variable can be defined without assigning a value, as it is assigned before use. Link: https://lkml.kernel.org/r/20231205021751.100459-1-zeming@nfschina.com Signed-off-by: Li zeming <zeming@nfschina.com> Reviewed-by: Andrew Morton <akpm@linux-foudation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: hugetlb_vmemmap: move mmap lock to vmemmap_remap_range()Muchun Song2023-12-121-13/+4
| | | | | | | | | | | | | | | | | | | | | | All the users of vmemmap_remap_range() will hold the mmap lock and release it once it returns, it is naturally to move the lock to vmemmap_remap_range() to simplify the code and the users. Link: https://lkml.kernel.org/r/20231205030853.3921-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: hugetlb_vmemmap: add check of CONFIG_MEMORY_HOTPLUG backMuchun Song2023-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | The compiler will optimize the code as much as possible if we add the check of CONFIG_MEMORY_HOTPLUG back. Link: https://lkml.kernel.org/r/20231205030530.3802-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: filemap: remove unnecessary iitialization of retLi zeming2023-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | The ret variable can be defined without assigning a value, as it is assigned before use. Link: https://lkml.kernel.org/r/20231205022954.101045-1-zeming@nfschina.com Signed-off-by: Li zeming <zeming@nfschina.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm/thp: add CONFIG_TRANSPARENT_HUGEPAGE_NEVER optionDmytro Maluka2023-12-121-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently enabling THP support (CONFIG_TRANSPARENT_HUGEPAGE) requires enabling either CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS or CONFIG_TRANSPARENT_HUGEPAGE_MADVISE, which both cause khugepaged starting by default at kernel bootup. Add the third choice CONFIG_TRANSPARENT_HUGEPAGE_NEVER, in line with the existing kernel command line setting transparent_hugepage=never, to disable THP by default (in particular, to prevent starting khugepaged by default) but still allow enabling it at runtime via sysfs. Rationale: khugepaged has its own non-negligible memory cost even if it is not used by any applications, since it bumps up vm.min_free_kbytes to its own required minimum in set_recommended_min_free_kbytes(). For example, on a machine with 4GB RAM, with 3 mm zones and pageblock_order == MAX_ORDER, starting khugepaged causes vm.min_free_kbytes increase from 8MB to 132MB. So if we use THP on machines with e.g. >=8GB of memory for better performance, but avoid using it on lower-memory machines to avoid its memory overhead, then for the same reason we also want to avoid even starting khugepaged on those <8GB machines. So with CONFIG_TRANSPARENT_HUGEPAGE_NEVER we can use the same kernel image on both >=8GB and <8GB machines, with THP support enabled but khugepaged not started by default. The userspace can then decide to enable THP via sysfs if needed, based on the total amount of memory. This could also be achieved with the existing transparent_hugepage=never setting in the kernel command line instead. But it seems cleaner to avoid tweaking the command line for such a basic setting. P.S. I see that CONFIG_TRANSPARENT_HUGEPAGE_NEVER was already proposed in the past [1] but without an explanation of the purpose. [1] https://lore.kernel.org/all/202211301651462590168@zte.com.cn/ Link: https://lkml.kernel.org/r/20231205170244.2746210-1-dmaluka@chromium.org Link: https://lore.kernel.org/all/20231204163254.2636289-1-dmaluka@chromium.org/ Signed-off-by: Dmytro Maluka <dmaluka@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: huge_memory: use more folio api in __split_huge_page_tail()Kefeng Wang2023-12-121-6/+6
| | | | | | | | | | | | | | | | | | | | Use more folio APIs to save six compound_head() calls in __split_huge_page_tail(). Link: https://lkml.kernel.org/r/20231110033324.2455523-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointersCatalin Marinas2023-12-121-81/+97
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On systems with large number of CPUs, the following soft lockup splat might sometimes happen: [ 2656.001617] watchdog: BUG: soft lockup - CPU#364 stuck for 21s! [ksoftirqd/364:2206] : [ 2656.141194] RIP: 0010:_raw_spin_unlock_irqrestore+0x3d/0x70 : 2656.241214] Call Trace: [ 2656.243971] <IRQ> [ 2656.246237] ? show_trace_log_lvl+0x1c4/0x2df [ 2656.251152] ? show_trace_log_lvl+0x1c4/0x2df [ 2656.256066] ? kmemleak_free_percpu+0x11f/0x1f0 [ 2656.261173] ? watchdog_timer_fn+0x379/0x470 [ 2656.265984] ? __pfx_watchdog_timer_fn+0x10/0x10 [ 2656.271179] ? __hrtimer_run_queues+0x5f3/0xd00 [ 2656.276283] ? __pfx___hrtimer_run_queues+0x10/0x10 [ 2656.281783] ? ktime_get_update_offsets_now+0x95/0x2c0 [ 2656.287573] ? ktime_get_update_offsets_now+0xdd/0x2c0 [ 2656.293380] ? hrtimer_interrupt+0x2e9/0x780 [ 2656.298221] ? __sysvec_apic_timer_interrupt+0x184/0x640 [ 2656.304211] ? sysvec_apic_timer_interrupt+0x8e/0xc0 [ 2656.309807] </IRQ> [ 2656.312169] <TASK> [ 2656.326110] kmemleak_free_percpu+0x11f/0x1f0 [ 2656.331015] free_percpu.part.0+0x1b/0xe70 [ 2656.335635] free_vfsmnt+0xb9/0x100 [ 2656.339567] rcu_do_batch+0x3c8/0xe30 [ 2656.363693] rcu_core+0x3de/0x5a0 [ 2656.367433] __do_softirq+0x2d0/0x9a8 [ 2656.381119] run_ksoftirqd+0x36/0x60 [ 2656.385145] smpboot_thread_fn+0x556/0x910 [ 2656.394971] kthread+0x2a4/0x350 [ 2656.402826] ret_from_fork+0x29/0x50 [ 2656.406861] </TASK> The issue is caused by kmemleak registering each per_cpu_ptr() corresponding to the __percpu pointer. This is unnecessary since such individual per-CPU pointers are not tracked anyway. Create a new object_percpu_tree_root rbtree that stores a single __percpu pointer together with an OBJECT_PERCPU flag for the kmemleak metadata. Scanning needs to be done for all per_cpu_ptr() pointers with a cond_resched() between each CPU iteration to avoid RCU stalls. [catalin.marinas@arm.com: update comment] Link: https://lkml.kernel.org/r/20231206114414.2085824-1-catalin.marinas@arm.com Link: https://lore.kernel.org/r/20231127194153.289626-1-longman@redhat.comLink: https://lkml.kernel.org/r/20231201190829.825856-1-catalin.marinas@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Waiman Long <longman@redhat.com> Closes: https://lore.kernel.org/r/20231127194153.289626-1-longman@redhat.com Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm/readahead: do not allow order-1 folioRyan Roberts2023-12-121-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The THP machinery does not support order-1 folios because it requires meta data spanning the first 3 `struct page`s. So order-2 is the smallest large folio that we can safely create. There was a theoretical bug whereby if ra->size was 2 or 3 pages (due to the device-specific bdi->ra_pages being set that way), we could end up with order = 1. Fix this by unconditionally checking if the preferred order is 1 and if so, set it to 0. Previously this was done in a few specific places, but with this refactoring it is done just once, unconditionally, at the end of the calculation. This is a theoretical bug found during review of the code; I have no evidence to suggest this manifests in the real world (I expect all device-specific ra_pages values are much bigger than 3). Link: https://lkml.kernel.org/r/20231201161045.3962614-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: memory: use folio_prealloc() in wp_page_copy()Kefeng Wang2023-12-121-15/+7
| | | | | | | | | | | | | | | | | | | | | | | | Use folio_prealloc() helper to simplify code a bit. Link: https://lkml.kernel.org/r/20231118023232.1409103-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: memory: use a folio in do_cow_fault()Kefeng Wang2023-12-121-10/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | Use folio_prealloc() helper and convert to use a folio in do_cow_fault(), which save five compound_head() calls. Link: https://lkml.kernel.org/r/20231118023232.1409103-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: memory: rename page_copy_prealloc() to folio_prealloc()Kefeng Wang2023-12-121-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let's rename page_copy_prealloc() to folio_prealloc(), which could be reused in more functons, as it maybe zero the new page, pass a new need_zero to it, and call the vma_alloc_zeroed_movable_folio() if need_zero is true. Link: https://lkml.kernel.org/r/20231118023232.1409103-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: memory: use a folio in validate_page_before_insert()Kefeng Wang2023-12-121-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | Use a folio in validate_page_before_insert() to save two compound_head() calls. Link: https://lkml.kernel.org/r/20231118023232.1409103-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: ksm: use more folio api in ksm_might_need_to_copy()Kefeng Wang2023-12-121-18/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: cleanup and use more folio in page fault", v3. Rename page_copy_prealloc() to folio_prealloc(), which is used by more functions, also do more folio conversion in page fault. This patch (of 5): Since ksm only support normal page, no swapout/in for ksm large folio too, add large folio check in ksm_might_need_to_copy(), also convert page->index to folio->index as page->index is going away. Then convert ksm_might_need_to_copy() to use more folio api to save nine compound_head() calls, short 'address' to reduce max-line-length. Link: https://lkml.kernel.org/r/20231118023232.1409103-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20231118023232.1409103-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm/damon/core-test: add a unit test for the feedback loop algorithmSeongJae Park2023-12-121-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement a simple kunit test for testing the behavior of the feedback loop algorithm for the aim-oriented feedback-friven DAMOS aggressiveness auto tuning. Link: https://lkml.kernel.org/r/20231130023652.50284-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm/damon/sysfs-schemes: implement a command for scheme quota goals only commitSeongJae Park2023-12-123-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To update DAMOS quota goals, users need to enter 'commit' command to the 'state' file of the kdamond, which applies not only the goals but entire inputs. It is inefficient. Implement yet another 'state' file input command for reading and committing only the scheme quota goals, namely 'commit_schemes_quota_goals'. Link: https://lkml.kernel.org/r/20231130023652.50284-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm/damon/sysfs-schemes: commit damos quota goals user input to DAMOSSeongJae Park2023-12-121-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make DAMON sysfs interface to read the user inputs for DAMOS quota goals and pass those to DAMOS, so that the users can use the quota auto-tuning feature. It uses the DAMON sysfs interface's user input commit mechanism, which applies all user inputs for initial starting of DAMON and online input updates, which can be done by writing 'on' and 'commit' to the kdamond's 'state' file, respectively. In other words, the user should periodically write appropriate value to 'current_value' files and 'commit' command to the 'state' file. 'target_value' files could also be similarly updated at any time. Note that the interface is supporting multiple goals while the core logic supports only one goal. DAMON sysfs interface passes only best feedback among the given inputs, to avoid making DAMOS too aggressive. Link: https://lkml.kernel.org/r/20231130023652.50284-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>