summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mm/shmem: turn shmem_should_replace_page into shmem_should_replace_folioMatthew Wilcox (Oracle)2022-05-131-4/+4
| | | | | | | | | This is a straightforward conversion. Link: https://lkml.kernel.org/r/20220504182857.4013401-20-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/shmem: convert shmem_add_to_page_cache to take a folioMatthew Wilcox (Oracle)2022-05-131-26/+31
| | | | | | | | | | Shrinks shmem_add_to_page_cache() by 16 bytes. All the callers grow, but this is temporary as they will all be converted to folios soon. Link: https://lkml.kernel.org/r/20220504182857.4013401-19-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/swap: add folio_throttle_swaprateMatthew Wilcox (Oracle)2022-05-131-0/+4
| | | | | | | | | | | The only use of the page argument to cgroup_throttle_swaprate() is to get the node ID, and this will be the same for all pages in the folio, so just pass in the first page of the folio. Link: https://lkml.kernel.org/r/20220504182857.4013401-18-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/shmem: use a folio in shmem_unused_huge_shrinkMatthew Wilcox (Oracle)2022-05-131-11/+12
| | | | | | | | | | | When calling split_huge_page() we usually have to find the precise page, but that's not necessary here because we only need to unlock and put the folio afterwards. Saves 231 bytes of text (20% of this function). Link: https://lkml.kernel.org/r/20220504182857.4013401-17-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* vmscan: remove remaining uses of page in shrink_page_listMatthew Wilcox (Oracle)2022-05-131-62/+60
| | | | | | | | | These are all straightforward conversions to the folio API. Link: https://lkml.kernel.org/r/20220504182857.4013401-16-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: allow can_split_folio() to be called when THP are disabledMatthew Wilcox (Oracle)2022-05-131-1/+0
| | | | | | | | | | | | | The call to can_split_folio() in vmscan is currently guarded by a test of PageTransHuge() so the BUILD_BUG() is eliminated if THP are disabled. The next patch replaces that test with folio_test_large() which may be true, even when THP are disabled. However, if THP are disabled, we cannot split, so an unconditional return of false is appropriate. Link: https://lkml.kernel.org/r/20220504182857.4013401-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* vmscan: convert the activate_locked portion of shrink_page_list to foliosMatthew Wilcox (Oracle)2022-05-131-8/+9
| | | | | | | | | This accounts the number of pages activated correctly for large folios. Link: https://lkml.kernel.org/r/20220504182857.4013401-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* vmscan: move initialisation of mapping downMatthew Wilcox (Oracle)2022-05-131-5/+2
| | | | | | | | | | | | Now that we don't interrogate the BDI for congestion, we can delay looking up the folio's mapping until we've got further through the function, reducing register pressure and saving a call to folio_mapping for folios we're adding to the swap cache. Link: https://lkml.kernel.org/r/20220504182857.4013401-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* vmscan: convert lazy freeing to foliosMatthew Wilcox (Oracle)2022-05-132-9/+23
| | | | | | | | | | | Remove a hidden call to compound_head(), and account nr_pages instead of a single page. This matches the code in lru_lazyfree_fn() that accounts nr_pages to PGLAZYFREE. Link: https://lkml.kernel.org/r/20220504182857.4013401-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* vmscan: convert page buffer handling to use foliosMatthew Wilcox (Oracle)2022-05-131-23/+25
| | | | | | | | | | This mostly just removes calls to compound_head() although nr_reclaimed should be incremented by the number of pages, not just 1. Link: https://lkml.kernel.org/r/20220504182857.4013401-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* vmscan: convert dirty page handling to foliosMatthew Wilcox (Oracle)2022-05-131-22/+26
| | | | | | | | | | Mostly this just eliminates calls to compound_head(), but NR_VMSCAN_IMMEDIATE was being incremented by 1 instead of by nr_pages. Link: https://lkml.kernel.org/r/20220504182857.4013401-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* swap: convert add_to_swap() to take a folioMatthew Wilcox (Oracle)2022-05-133-28/+31
| | | | | | | | | | The only caller already has a folio available, so this saves a conversion. Also convert the return type to boolean. Link: https://lkml.kernel.org/r/20220504182857.4013401-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* swap: turn get_swap_page() into folio_alloc_swap()Matthew Wilcox (Oracle)2022-05-136-31/+35
| | | | | | | | | | This removes an assumption that a large folio is HPAGE_PMD_NR pages in size. Link: https://lkml.kernel.org/r/20220504182857.4013401-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* vmscan: convert the writeback handling in shrink_page_list() to foliosMatthew Wilcox (Oracle)2022-05-131-36/+42
| | | | | | | | | Slightly more efficient due to fewer calls to compound_head(). Link: https://lkml.kernel.org/r/20220504182857.4013401-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* vmscan: use folio_mapped() in shrink_page_list()Matthew Wilcox (Oracle)2022-05-131-8/+8
| | | | | | | | | Remove some legacy function calls. Link: https://lkml.kernel.org/r/20220504182857.4013401-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: remove alloc_pages_vma()Matthew Wilcox (Oracle)2022-05-132-37/+32
| | | | | | | | | | All callers have now been converted to use vma_alloc_folio(), so convert the body of alloc_pages_vma() to allocate folios instead. Link: https://lkml.kernel.org/r/20220504182857.4013401-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* alpha: fix alloc_zeroed_user_highpage_movable()Matthew Wilcox (Oracle)2022-05-131-1/+1
| | | | | | | | | | | | Due to a typo, the final argument to alloc_page_vma() didn't refer to a real variable. This only affected CONFIG_NUMA, which was marked BROKEN in 2006 and removed from alpha in 2021. Found due to a refactoring patch. Link: https://lkml.kernel.org/r/20220504182857.4013401-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/huge_memory: convert do_huge_pmd_anonymous_page() to use vma_alloc_folio()Matthew Wilcox (Oracle)2022-05-131-5/+4
| | | | | | | | | | Remove the use of this old API, eliminating a call to prep_transhuge_page(). Link: https://lkml.kernel.org/r/20220504182857.4013401-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* shmem: convert shmem_alloc_hugepage() to use vma_alloc_folio()Matthew Wilcox (Oracle)2022-05-131-6/+4
| | | | | | | | | | | | | | | | | Patch series "Folio patches for 5.19", v2. This patch (of 26): For now, return the head page of the folio, but remove use of the old alloc_pages_vma() API. Link: https://lkml.kernel.org/r/20220504182857.4013401-1-willy@infradead.org Link: https://lkml.kernel.org/r/20220504182857.4013401-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/shmem: remove duplicate include in memory.cWan Jiabing2022-05-131-1/+0
| | | | | | | | | | | Fix following checkincludes.pl warning: mm/memory.c: linux/mm_inline.h is included more than once. The include is in line 44. Remove the duplicated here. Link: https://lkml.kernel.org/r/20220427064717.803019-1-wanjiabing@vivo.com Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/vmscan: don't use NUMA_NO_NODE as indicator of page on different nodeWei Yang2022-05-131-4/+3
| | | | | | | | | | | | Now we are sure there is at least one page on page_list, so it is safe to get the nid of it. This means it is not necessary to use NUMA_NO_NODE as an indicator for the beginning of iteration or a page on different node. Link: https://lkml.kernel.org/r/20220429014426.29223-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/vmscan: filter empty page_list at the beginningWei Yang2022-05-131-4/+6
| | | | | | | | | | | | | | node_page_list would always be !empty on finishing the loop, except page_list is empty. Let's handle empty page_list before doing any real work including touching PF_MEMALLOC flag. Link: https://lkml.kernel.org/r/20220429014426.29223-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/vmscan: use helper folio_is_file_lru()Miaohe Lin2022-05-131-2/+2
| | | | | | | | | | | | | | | Use helper folio_is_file_lru() to check whether folio is file lru. Minor readability improvement. [linmiaohe@huawei.com: use folio_is_file_lru()] Link: https://lkml.kernel.org/r/20220428105802.21389-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220425111232.23182-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Huang, Ying <ying.huang@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/vmscan: remove obsolete comment in kswapd_runMiaohe Lin2022-05-131-1/+0
| | | | | | | | | | | | | | Since commit 6b700b5b3c59 ("mm/vmscan.c: remove cpu online notification for now"), cpu online notification is removed. So kswapd won't move to proper cpus if cpus are hot-added. Remove this obsolete comment. Link: https://lkml.kernel.org/r/20220425111232.23182-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Huang, Ying <ying.huang@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/vmscan: take all base pages of THP into account when race with ↵Miaohe Lin2022-05-131-1/+1
| | | | | | | | | | | | | | | | | | | speculative reference If the page has buffers, shrink_page_list will try to free the buffer mappings associated with the page and try to free the page as well. In the rare race with speculative reference, the page will be freed shortly by speculative reference. But nr_reclaimed is not incremented correctly when we come across the THP. We need to account all the base pages in this case. Link: https://lkml.kernel.org/r/20220425111232.23182-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Huang, Ying <ying.huang@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/vmscan: introduce helper function reclaim_page_list()Miaohe Lin2022-05-131-25/+25
| | | | | | | | | | | | | | | Introduce helper function reclaim_page_list() to eliminate the duplicated code of doing shrink_page_list() and putback_lru_page. Also we can separate node reclaim from node page list operation this way. No functional change intended. Link: https://lkml.kernel.org/r/20220425111232.23182-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Huang, Ying <ying.huang@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/vmscan: add a comment about MADV_FREE pages check in ↵Miaohe Lin2022-05-131-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | folio_check_dirty_writeback Patch series "A few cleanup and fixup patches for vmscan This series contains a few patches to remove obsolete comment, introduce helper to remove duplicated code and so no. Also we take all base pages of THP into account in rare race condition. More details can be found in the respective changelogs. This patch (of 6): The MADV_FREE pages check in folio_check_dirty_writeback is a bit hard to follow. Add a comment to make the code clear. Link: https://lkml.kernel.org/r/20220425111232.23182-2-linmiaohe@huawei.com Suggested-by: Huang, Ying <ying.huang@intel.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/vmscan: not necessary to re-init the list for each iterationWei Yang2022-05-131-3/+1
| | | | | | | | | | | | node_page_list is defined with LIST_HEAD and be cleaned until list_empty. So it is not necessary to re-init it again. [akpm@linux-foundation.org: remove unneeded braces] Link: https://lkml.kernel.org/r/20220426021743.21007-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: convert sysfs input to bool using kstrtobool()Jagdish Gediya2022-05-132-12/+10
| | | | | | | | | | | | | | | | | | | Sysfs input conversion to corrosponding bool value e.g. "false" or "0" to false, "true" or "1" to true are currently handled through strncmp at multiple places. Use kstrtobool() to convert sysfs input to bool value. [akpm@linux-foundation.org: propagate kstrtobool() return value, per Andy] Link: https://lkml.kernel.org/r/20220426180203.70782-2-jvgediya@linux.ibm.com Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Richard Fitzgerald <rf@opensource.cirrus.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* lib/kstrtox.c: add "false"/"true" support to kstrtobool()Jagdish Gediya2022-05-131-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | At many places in kernel, It is necessary to convert sysfs input to corresponding bool value e.g. "false" or "0" need to be converted to bool false, "true" or "1" need to be converted to bool true, places where such conversion is needed currently check the input string manually, kstrtobool() can be utilized at such places but currently it doesn't have support to accept "false"/"true". Add support to accept "false"/"true" as valid string in kstrtobool(). [akpm@linux-foundation.org: undo s/iff/if/, per Matthew] Link: https://lkml.kernel.org/r/20220426180203.70782-1-jvgediya@linux.ibm.com Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Richard Fitzgerald <rf@opensource.cirrus.com> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/vmscan: take min_slab_pages into account when try to call shrink_nodeMiaohe Lin2022-05-131-1/+2
| | | | | | | | | | | | | | Since commit 6b4f7799c6a5 ("mm: vmscan: invoke slab shrinkers from shrink_zone()"), slab reclaim and lru page reclaim are done together in the shrink_node. So we should take min_slab_pages into account when try to call shrink_node. Link: https://lkml.kernel.org/r/20220425112118.20924-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* drivers: virtio_mem: use pageblock size as the minimum virtio_mem size.Zi Yan2022-05-131-3/+3
| | | | | | | | | | | | | | | | | | alloc_contig_range() now only needs to be aligned to pageblock_nr_pages, drop virtio_mem size requirement that it needs to be MAX_ORDER_NR_PAGES. Link: https://lkml.kernel.org/r/20220425143118.2850746-7-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Ren <renzhengeek@gmail.com> Cc: kernel test robot <lkp@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: cma: use pageblock_order as the single alignmentZi Yan2022-05-133-8/+5
| | | | | | | | | | | | | | | | | | | Now alloc_contig_range() works at pageblock granularity. Change CMA allocation, which uses alloc_contig_range(), to use pageblock_nr_pages alignment. Link: https://lkml.kernel.org/r/20220425143118.2850746-6-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Ren <renzhengeek@gmail.com> Cc: kernel test robot <lkp@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: page_isolation: enable arbitrary range page isolation.Zi Yan2022-05-132-31/+18
| | | | | | | | | | | | | | | | | | | | | Now start_isolate_page_range() is ready to handle arbitrary range isolation, so move the alignment check/adjustment into the function body. Do the same for its counterpart undo_isolate_page_range(). alloc_contig_range(), its caller, can pass an arbitrary range instead of a MAX_ORDER_NR_PAGES aligned one. Link: https://lkml.kernel.org/r/20220425143118.2850746-5-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Ren <renzhengeek@gmail.com> Cc: kernel test robot <lkp@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: make alloc_contig_range work at pageblock granularityZi Yan2022-05-135-18/+242
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | alloc_contig_range() worked at MAX_ORDER_NR_PAGES granularity to avoid merging pageblocks with different migratetypes. It might unnecessarily convert extra pageblocks at the beginning and at the end of the range. Change alloc_contig_range() to work at pageblock granularity. Special handling is needed for free pages and in-use pages across the boundaries of the range specified by alloc_contig_range(). Because these= Partially isolated pages causes free page accounting issues. The free pages will be split and freed into separate migratetype lists; the in-use= Pages will be migrated then the freed pages will be handled in the aforementioned way. [ziy@nvidia.com: fix deadlock/crash] Link: https://lkml.kernel.org/r/23A7297E-6C84-4138-A9FE-3598234004E6@nvidia.com Link: https://lkml.kernel.org/r/20220425143118.2850746-4-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Ren <renzhengeek@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: page_isolation: check specified range for unmovable pagesZi Yan2022-05-131-13/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable set_migratetype_isolate() to check specified range for unmovable pages during isolation to prepare arbitrary range page isolation. The functionality will take effect in upcoming commits by adjusting the callers of start_isolate_page_range(), which uses set_migratetype_isolate(). For example, alloc_contig_range(), which calls start_isolate_page_range(), accepts unaligned ranges, but because page isolation is currently done at MAX_ORDER_NR_PAEGS granularity, pages that are out of the specified range but withint MAX_ORDER_NR_PAEGS alignment might be attempted for isolation and the failure of isolating these unrelated pages fails the whole operation undesirably. Link: https://lkml.kernel.org/r/20220425143118.2850746-3-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Ren <renzhengeek@gmail.com> Cc: kernel test robot <lkp@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.cZi Yan2022-05-133-121/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Use pageblock_order for cma and alloc_contig_range alignment", v11. This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA and alloc_contig_range(). It prepares for my upcoming changes to make MAX_ORDER adjustable at boot time[1]. The MAX_ORDER - 1 alignment requirement comes from that alloc_contig_range() isolates pageblocks to remove free memory from buddy allocator but isolating only a subset of pageblocks within a page spanning across multiple pageblocks causes free page accounting issues. Isolated page might not be put into the right free list, since the code assumes the migratetype of the first pageblock as the whole free page migratetype. This is based on the discussion at [2]. To remove the requirement, this patchset: 1. isolates pages at pageblock granularity instead of max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages); 2. splits free pages across the specified range or migrates in-use pages across the specified range then splits the freed page to avoid free page accounting issues (it happens when multiple pageblocks within a single page have different migratetypes); 3. only checks unmovable pages within the range instead of MAX_ORDER - 1 aligned range during isolation to avoid alloc_contig_range() failure when pageblocks within a MAX_ORDER - 1 aligned range are allocated separately. 4. returns pages not in the range as it did before. One optimization might come later: 1. make MIGRATE_ISOLATE a separate bit to be able to restore the original migratetypes when isolation fails in the middle of the range. [1] https://lore.kernel.org/linux-mm/20210805190253.2795604-1-zi.yan@sent.com/ [2] https://lore.kernel.org/linux-mm/d19fb078-cb9b-f60f-e310-fdeea1b947d2@redhat.com/ This patch (of 6): has_unmovable_pages() is only used in mm/page_isolation.c. Move it from mm/page_alloc.c and make it static. Link: https://lkml.kernel.org/r/20220425143118.2850746-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Eric Ren <renzhengeek@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Minchan Kim <minchan@kernel.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* cgroup: fix racy check in alloc_pagecache_max_30M() helper functionDavid Vernet2022-05-131-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | alloc_pagecache_max_30M() in the cgroup memcg tests performs a 50MB pagecache allocation, which it expects to be capped at 30MB due to the calling process having a memory.high setting of 30MB. After the allocation, the function contains a check that verifies that MB(29) < memory.current <= MB(30). This check can actually fail non-deterministically. The testcases that use this function are test_memcg_high() and test_memcg_max(), which set memory.min and memory.max to 30MB respectively for the cgroup under test. The allocation can slightly exceed this number in both cases, and for memory.max, the process performing the allocation will not have the OOM killer invoked as it's performing a pagecache allocation. This patchset therefore updates the above check to instead use the verify_close() helper function. Link: https://lkml.kernel.org/r/20220423155619.3669555-6-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* cgroup: remove racy check in test_memcg_sock()David Vernet2022-05-131-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | test_memcg_sock() in the cgroup memcg tests, verifies expected memory accounting for sockets. The test forks a process which functions as a TCP server, and sends large buffers back and forth between itself (as the TCP client) and the forked TCP server. While doing so, it verifies that memory.current and memory.stat.sock look correct. There is currently a check in tcp_client() which asserts memory.current >= memory.stat.sock. This check is racy, as between memory.current and memory.stat.sock being queried, a packet could come in which causes mem_cgroup_charge_skmem() to be invoked. This could cause memory.stat.sock to exceed memory.current. Reversing the order of querying doesn't address the problem either, as memory may be reclaimed between the two calls. Instead, this patch just removes that assertion altogether, and instead relies on the values_close() check that follows to validate the expected accounting. Link: https://lkml.kernel.org/r/20220423155619.3669555-5-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* cgroup: account for memory_localevents in test_memcg_oom_group_leaf_events()David Vernet2022-05-131-5/+17
| | | | | | | | | | | | | | | | | | | | | | | The test_memcg_oom_group_leaf_events() testcase in the cgroup memcg tests validates that processes in a group that perform allocations exceeding memory.oom.group are killed. It also validates that the memory.events.oom_kill events are properly propagated in this case. Commit 06e11c907ea4 ("kselftests: memcg: update the oom group leaf events test") fixed test_memcg_oom_group_leaf_events() to account for the fact that the memory.events.oom_kill events in a child cgroup is propagated up to its parent. This behavior can actually be configured by the memory_localevents mount option, so this patch updates the testcase to properly account for the possible presence of this mount option. Link: https://lkml.kernel.org/r/20220423155619.3669555-4-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* cgroup: account for memory_recursiveprot in test_memcg_low()David Vernet2022-05-133-3/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The test_memcg_low() testcase in test_memcontrol.c verifies the expected behavior of groups using the memory.low knob. Part of the testcase verifies that a group with memory.low that experiences reclaim due to memory pressure elsewhere in the system, observes memory.events.low events as a result of that reclaim. In commit 8a931f801340 ("mm: memcontrol: recursive memory.low protection"), the memory controller was updated to propagate memory.low and memory.min protection from a parent group to its children via a configurable memory_recursiveprot mount option. This unfortunately broke the memcg tests, which asserts that a sibling that experienced reclaim but had a memory.low value of 0, would not observe any memory.low events. This patch updates test_memcg_low() to account for the new behavior introduced by memory_recursiveprot. So as to make the test resilient to multiple configurations, the patch also adds a new proc_mount_contains() helper that checks for a string in /proc/mounts, and is used to toggle behavior based on whether the default memory_recursiveprot was present. Link: https://lkml.kernel.org/r/20220423155619.3669555-3-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* cgroups: refactor children cgroups in memcg testsDavid Vernet2022-05-131-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Fix bugs in memcontroller cgroup tests", v2. tools/testing/selftests/cgroup/test_memcontrol.c contains a set of testcases which validate expected behavior of the cgroup memory controller. Roman Gushchin recently sent out a patchset that fixed a few issues in the test. This patchset continues that effort by fixing a few more issues that were causing non-deterministic failures in the suite. With this patchset, I'm unable to reproduce any more errors after running the tests in a continuous loop for many iterations. Before, I was able to reproduce at least one of the errors fixed in this patchset with just one or two runs. This patch (of 5): In test_memcg_min() and test_memcg_low(), there is an array of four sibling cgroups. All but one of these sibling groups does a 50MB allocation, and the group that does no allocation is the third of four in the array. This is not a problem per se, but makes it a bit tricky to do some assertions in test_memcg_low(), as we want to make assertions on the siblings based on whether or not they performed allocations. Having a static index before which all groups have performed an allocation makes this cleaner. This patch therefore reorders the sibling groups so that the group that performs no allocations is the last in the array. A follow-on patch will leverage this to fix a bug in the test that incorrectly asserts that a sibling group that had performed an allocation, but only had protection from its parent, will not observe any memory.events.low events during reclaim. Link: https://lkml.kernel.org/r/20220423155619.3669555-1-void@manifault.com Link: https://lkml.kernel.org/r/20220423155619.3669555-2-void@manifault.com Signed-off-by: David Vernet <void@manifault.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/uffd: move USERFAULTFD configs into mm/Peter Xu2022-05-132-17/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | We used to have USERFAULTFD configs stored in init/. It makes sense as a start because that's the default place for storing syscall related configs. However userfaultfd evolved a bit in the past few years and some more config options were added. They're no longer related to syscalls and start to be not suitable to be kept in the init/ directory anymore, because they're pure mm concepts. But it's not ideal either to keep the userfaultfd configs separate from each other. Hence this patch moves the userfaultfd configs under init/ to be under mm/ so that we'll start to group all userfaultfd configs together. We do have quite a few examples of syscall related configs that are not put under init/Kconfig: FTRACE_SYSCALLS, SWAP, FILE_LOCKING, MEMFD_CREATE.. They all reside in the dir where they're more suitable for the concept. So it seems there's no restriction to keep the role of having syscall related CONFIG_* under init/ only. Link: https://lkml.kernel.org/r/20220420144823.35277-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* userfaultfd/selftests: use swap() instead of open coding itGuo Zhengkui2022-05-131-7/+2
| | | | | | | | | | | | | | | | | | | | | | | Address the following coccicheck warning: tools/testing/selftests/vm/userfaultfd.c:1536:21-22: WARNING opportunity for swap(). tools/testing/selftests/vm/userfaultfd.c:1540:33-34: WARNING opportunity for swap(). by using swap() for the swapping of variable values and drop `tmp_area` that is not needed any more. `swap()` macro in userfaultfd.c is introduced in commit 681696862bc18 ("selftests: vm: remove dependecy from internal kernel macros") It has been tested with gcc (Debian 8.3.0-6) 8.3.0. Link: https://lkml.kernel.org/r/20220407123141.4998-1-guozhengkui@vivo.com Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* selftests/uffd: enable uffd-wp for shmem/hugetlbfsPeter Xu2022-05-131-3/+1
| | | | | | | | | | | | | | | | | | | | After we added support for shmem and hugetlbfs, we can turn uffd-wp test on always now. Link: https://lkml.kernel.org/r/20220405014932.15212-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: enable PTE markers by defaultPeter Xu2022-05-131-3/+5
| | | | | | | | | | | | | | | | | | | | | | Enable PTE markers by default. On x86_64 it means it'll auto-enable PTE_MARKER_UFFD_WP as well. [peterx@redhat.com: hide PTE_MARKER option] Link: https://lkml.kernel.org/r/20220419202531.27415-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220405014929.15158-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/uffd: enable write protection for shmem & hugetlbfsPeter Xu2022-05-134-26/+34
| | | | | | | | | | | | | | | | | | | | | | | | | We've had all the necessary changes ready for both shmem and hugetlbfs. Turn on all the shmem/hugetlbfs switches for userfaultfd-wp. We can expand UFFD_API_RANGE_IOCTLS_BASIC with _UFFDIO_WRITEPROTECT too because all existing types now support write protection mode. Since vma_can_userfault() will be used elsewhere, move into userfaultfd_k.h. Link: https://lkml.kernel.org/r/20220405014926.15101-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/pagemap: recognize uffd-wp bit for shmem/hugetlbfsPeter Xu2022-05-131-0/+7
| | | | | | | | | | | | | | | | | | | | | | This requires the pagemap code to be able to recognize the newly introduced swap special pte for uffd-wp, meanwhile the general case for hugetlb that we recently start to support. It should make pagemap uffd-wp support complete. Link: https://lkml.kernel.org/r/20220405014923.15047-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/khugepaged: don't recycle vma pgtable if uffd-wp registeredPeter Xu2022-05-131-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we're trying to collapse a 2M huge shmem page, don't retract pgtable pmd page if it's registered with uffd-wp, because that pgtable could have pte markers installed. Recycling of that pgtable means we'll lose the pte markers. That could cause data loss for an uffd-wp enabled application on shmem. Instead of disabling khugepaged on these files, simply skip retracting these special VMAs, then the page cache can still be merged into a huge thp, and other mm/vma can still map the range of file with a huge thp when proper. Note that checking VM_UFFD_WP needs to be done with mmap_sem held for write, that avoids race like: khugepaged user thread ========== =========== check VM_UFFD_WP, not set UFFDIO_REGISTER with uffd-wp on shmem wr-protect some pages (install markers) take mmap_sem write lock erase pmd and free pmd page --> pte markers are dropped unnoticed! Link: https://lkml.kernel.org/r/20220405014921.14994-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/hugetlb: handle uffd-wp during fork()Peter Xu2022-05-133-18/+35
| | | | | | | | | | | | | | | | | | | | | | | | | Firstly, we'll need to pass in dst_vma into copy_hugetlb_page_range() because for uffd-wp it's the dst vma that matters on deciding how we should treat uffd-wp protected ptes. We should recognize pte markers during fork and do the pte copy if needed. [lkp@intel.com: vma_needs_copy can be static] Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>