path: root/mm
Commit message | Author | Age | Files | Lines
...
* mm/mmap: separate writenotify and dirty tracking logic | Lorenzo Stoakes | 2023-06-09 | 1 | -12/+46

Patch series "mm/gup: disallow GUP writing to file-backed mappings by default", v9.

Writing to file-backed mappings which require folio dirty tracking using GUP is a fundamentally broken operation, as kernel write access to GUP mappings does not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not cause write notify to trigger, nor does it enforce that the caller marks the folio dirty. The problem arises when, after an initial write to the folio, writeback results in the folio being cleaned and then the caller, via the GUP interface, writes to the folio again.

As a result of the use of this secondary, direct, mapping to the folio no write notify will occur, and if the caller does mark the folio dirty, this will be done unexpectedly.

For example, consider the following scenario:

1. A folio is written to via GUP which write-faults the memory, notifying the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty (though it does not have to).

This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As pin_user_pages_fast_only() does not exist, we can rely on a slightly imperfect whitelisting in the PUP-fast case and fall back to the slow case should this fail.

This patch (of 3):

vma_wants_writenotify() is specifically intended for setting PTE page table flags, accounting for existing page table flag state and whether the underlying filesystem performs dirty tracking for a file-backed mapping.

Everything is predicated firstly on whether the mapping is shared writable, as this is the only instance where dirty tracking is pertinent - MAP_PRIVATE mappings will always be CoW'd and unshared, and read-only file-backed shared mappings cannot be written to, even with FOLL_FORCE.

All other checks are in line with existing logic, though now separated into checks explicitly for dirty tracking and those for determining how to set page table flags.

We make this change so we can perform checks in the GUP logic to determine which mappings might be problematic when written to.

Link: https://lkml.kernel.org/r/cover.1683235180.git.lstoakes@gmail.com
Link: https://lkml.kernel.org/r/0f218370bd49b4e6bbfbb499f7c7b92c26ba1ceb.1683235180.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

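As a rough illustration of the split described here, a minimal sketch of a standalone dirty-tracking predicate follows; the helper names and the exact set of checks are assumptions for illustration, not the literal patch:

    /* Sketch only: names and checks are illustrative assumptions. */
    static bool vma_is_shared_writable(struct vm_area_struct *vma)
    {
            return (vma->vm_flags & (VM_WRITE | VM_SHARED)) ==
                   (VM_WRITE | VM_SHARED);
    }

    static bool vma_needs_dirty_tracking(struct vm_area_struct *vma)
    {
            /* Only shared, writable file mappings are dirty-tracked. */
            if (!vma_is_shared_writable(vma))
                    return false;

            /* PFN/mixed maps bypass the page cache and track nothing. */
            if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
                    return false;

            /* The filesystem must ask for write notifications. */
            return vma->vm_ops && vma->vm_ops->page_mkwrite;
    }

The page-table-flag decisions stay in vma_wants_writenotify(), which can then reuse such a predicate for the dirty-tracking part of its logic.
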
* mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() | Liam Ni | 2023-06-09 | 1 | -5/+3

Reduce the number of invalid loop iterations in early_ioremap_setup() to improve the function's execution efficiency.

Link: https://lkml.kernel.org/r/CACZJ9cU6t5sLoDwE6_XOg+UJLpZt4+qHfjYN2bA0s+3y9y6pQQ@mail.gmail.com
Signed-off-by: LiamNi <zhiguangni01@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* memcg: use helper macro FLUSH_TIME | Miaohe Lin | 2023-06-09 | 1 | -1/+1

Use helper macro FLUSH_TIME to indicate the flush time to improve the readability a bit. No functional change intended.

Link: https://lkml.kernel.org/r/20230603072116.1101690-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: page_alloc: remove unneeded header files | Miaohe Lin | 2023-06-09 | 1 | -4/+0

Remove some unneeded header files. No functional change intended.

Link: https://lkml.kernel.org/r/20230603112558.213694-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: fix failure to unmap pte on highmem systems | Ryan Roberts | 2023-06-09 | 1 | -4/+2

The loser of a race to service a pte for a device private entry in the swap path previously unlocked the ptl, but failed to unmap the pte. This only affects highmem systems since unmapping a pte is a noop on non-highmem systems.

Link: https://lkml.kernel.org/r/20230602092949.545577-5-ryan.roberts@arm.com
Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/damon/ops-common: refactor to use {pte|pmd}p_clear_young_notify() | Ryan Roberts | 2023-06-09 | 1 | -20/+2

With the fix in place to atomically test and clear young on ptes and pmds, simplify the code to handle the clearing for both the primary mmu and the mmu notifier with a single API call.

Link: https://lkml.kernel.org/r/20230602092949.545577-4-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/damon/ops-common: atomically test and clear young on ptes and pmds | Ryan Roberts | 2023-06-09 | 4 | -16/+12

It is racy to non-atomically read a pte, then clear the young bit, then write it back as this could discard dirty information. Further, it is bad practice to directly set a pte entry within a table.

Instead clearing young must go through the arch-provided helper, ptep_test_and_clear_young(), to ensure it is modified atomically and to give the arch code visibility and allow it to check (and potentially modify) the operation.

Link: https://lkml.kernel.org/r/20230602092949.545577-3-ryan.roberts@arm.com
Fixes: 3f49584b262c ("mm/damon: implement primitives for the virtual memory address spaces")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

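As a rough sketch of the pattern being fixed (not the literal DAMON diff; vma, addr and pte are assumed from the surrounding page table walk):

    /* Before: non-atomic read/modify/write. A hardware update of the
     * access/dirty bits between the read and the write-back is lost,
     * and the PTE is written directly instead of via an arch helper. */
    *pte = pte_mkold(*pte);

    /* After: the arch-provided helper clears the young bit atomically
     * and gives the architecture visibility of the modification. */
    young = ptep_test_and_clear_young(vma, addr, pte);
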
* mm: vmalloc must set pte via arch code | Ryan Roberts | 2023-06-09 | 1 | -2/+8

Patch series "Fixes for pte encapsulation bypasses", v3.

A series to improve the encapsulation of pte entries by disallowing non-arch code from directly dereferencing pte_t pointers.

This patch (of 4):

It is bad practice to directly set pte entries within a pte table. Instead all modifications must go through arch-provided helpers such as set_pte_at() to give the arch code visibility and allow it to check (and potentially modify) the operation.

Link: https://lkml.kernel.org/r/20230602092949.545577-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230602092949.545577-2-ryan.roberts@arm.com
Fixes: 3e9a9e256b1e ("mm: add a vmap_pfn function")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

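A minimal sketch of the pattern (local names are assumptions): the vmap_pfn() apply callback should install the entry through the arch helper rather than assigning to the table slot directly.

    /* Bad practice: stores the PTE directly, bypassing the architecture. */
    *pte = pte_mkspecial(pfn_pte(pfn, prot));

    /* Preferred: the arch helper can observe and adjust the operation. */
    set_pte_at(&init_mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
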
* vmstat: allow_direct_reclaim should use zone_page_state_snapshot | Marcelo Tosatti | 2023-06-09 | 1 | -1/+1

A customer provided evidence indicating that a process was stalled in direct reclaim:

- The process was trapped in throttle_direct_reclaim(). The function wait_event_killable() was called to wait for the condition allow_direct_reclaim(pgdat) to become true for the current node. allow_direct_reclaim(pgdat) examined the number of free pages on the node via zone_page_state(), which just returns the value in zone->vm_stat[NR_FREE_PAGES].

- On node #1, zone->vm_stat[NR_FREE_PAGES] was 0. However, the freelist on this node was not empty.

- This inconsistency in the vmstat value was caused by the percpu vmstat on nohz_full cpus. Every increment/decrement of vmstat is performed on the percpu vmstat counter first, then the pooled diffs are cumulated to the zone's vmstat counter in a timely manner. However, on nohz_full cpus (in this customer's system, 48 of 52 cpus) these pooled diffs were not cumulated once the cpu had no event on it, so the cpu started sleeping infinitely. I checked the percpu vmstat and found a total of 69 counts not yet cumulated to the zone's vmstat counter.

- In this situation, kswapd did not help the trapped process. In pgdat_balanced(), zone_watermark_ok_safe() examined the number of free pages on the node via zone_page_state_snapshot(), which checks pending counts on the percpu vmstat. Therefore kswapd could correctly see the 69 free pages. Since zone->_watermark = {8, 20, 32}, kswapd did not work because 69 was greater than 32, the high watermark.

Change allow_direct_reclaim to use zone_page_state_snapshot, which allows a more precise version of the vmstat counters to be used.

allow_direct_reclaim will only be called from try_to_free_pages, which is not a hot path.

Testing: due to difficulties accessing the system, it has not been possible for the reproducer to test the patch (however it's clear from the available data and analysis that it should fix it).

Link: https://lkml.kernel.org/r/20230530145335.677325196@redhat.com
Reviewed-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

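A sketch of the resulting check in allow_direct_reclaim(), with the surrounding bookkeeping assumed and the loop abbreviated:

    for (i = 0; i <= ZONE_NORMAL; i++) {
            struct zone *zone = &pgdat->node_zones[i];

            if (!managed_zone(zone))
                    continue;

            pfmemalloc_reserve += min_wmark_pages(zone);
            /* was: zone_page_state(zone, NR_FREE_PAGES), which misses
             * per-CPU deltas that never get folded in on nohz_full CPUs */
            free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
    }
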
* fs: factor out a direct_write_fallback helper | Christoph Hellwig | 2023-06-09 | 1 | -51/+15

Add a helper dealing with handling the syncing of a buffered write fallback for direct I/O.

Link: https://lkml.kernel.org/r/20230601145904.1385409-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* filemap: add a kiocb_invalidate_post_direct_write helper | Christoph Hellwig | 2023-06-09 | 1 | -17/+20

Add a helper to invalidate page cache after a dio write.

Link: https://lkml.kernel.org/r/20230601145904.1385409-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* filemap: add a kiocb_invalidate_pages helper | Christoph Hellwig | 2023-06-09 | 1 | -20/+28

Factor out a helper that calls filemap_write_and_wait_range and invalidate_inode_pages2_range for the range covered by a write kiocb, or returns -EAGAIN if the kiocb is marked as nowait and there would be pages to write or invalidate.

Link: https://lkml.kernel.org/r/20230601145904.1385409-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* filemap: add a kiocb_write_and_wait helper | Christoph Hellwig | 2023-06-09 | 1 | -12/+18

Factor out a helper that does filemap_write_and_wait_range for the range covered by a read kiocb, or returns -EAGAIN if the kiocb is marked as nowait and there would be pages to write.

Link: https://lkml.kernel.org/r/20230601145904.1385409-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

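The shape of such a helper, as a sketch based on the description above (not copied from the patch):

    int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
    {
            struct address_space *mapping = iocb->ki_filp->f_mapping;
            loff_t pos = iocb->ki_pos;
            loff_t end = pos + count - 1;

            if (iocb->ki_flags & IOCB_NOWAIT) {
                    /* Nowait callers must not block on writeback. */
                    if (filemap_range_needs_writeback(mapping, pos, end))
                            return -EAGAIN;
                    return 0;
            }

            return filemap_write_and_wait_range(mapping, pos, end);
    }
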
* filemap: update ki_pos in generic_perform_write | Christoph Hellwig | 2023-06-09 | 1 | -4/+4

All callers of generic_perform_write need to update ki_pos, so move it into common code.

Link: https://lkml.kernel.org/r/20230601145904.1385409-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* backing_dev: remove current->backing_dev_info | Christoph Hellwig | 2023-06-09 | 1 | -3/+0

Patch series "cleanup the filemap / direct I/O interaction", v4.

This series cleans up some of the generic write helper calling conventions and the page cache writeback / invalidation for direct I/O. This is a spinoff from the no-bufferhead kernel project, for which we'll want to use an iomap-based buffered write path in the block layer.

This patch (of 12):

The last user of current->backing_dev_info disappeared in commit b9b1335e6403 ("remove bdi_congested() and wb_congested() and related functions"). Remove the field and all assignments to it.

Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: zswap: shrink until can accept | Domenico Cerasuolo | 2023-06-09 | 1 | -3/+14

This update addresses an issue with the zswap reclaim mechanism, which hinders the efficient offloading of cold pages to disk, thereby compromising the preservation of the LRU order and consequently diminishing, if not inverting, its performance benefits.

The functioning of the zswap shrink worker was found to be inadequate, as shown by a basic benchmark test. For the test, a kernel build was used as a reference, with its memory confined to 1G via a cgroup and a 5G swap file provided. The results below are averages of three runs without the use of zswap:

real 46m26s
user 35m4s
sys 7m37s

With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G system), the results changed to:

real 56m4s
user 35m13s
sys 8m43s

written_back_pages: 18
reject_reclaim_fail: 0
pool_limit_hit: 1478

Besides the evident regression, one thing to notice from this data is the extremely low number of written_back_pages and pool_limit_hit.

The pool_limit_hit counter, which is increased in zswap_frontswap_store when zswap is completely full, doesn't account for a particular scenario: once zswap hits its limit, zswap_pool_reached_full is set to true; with this flag on, zswap_frontswap_store rejects pages if zswap is still above the acceptance threshold. Once we include the rejections due to zswap_pool_reached_full && !zswap_can_accept(), the number goes from 1478 to a significant 21578266.

Zswap is stuck in an undesirable state where it rejects pages because it's above the acceptance threshold, yet fails to attempt memory reclamation. This happens because the shrink work is only queued when zswap_frontswap_store detects that it's full and the work itself only reclaims one page per run.

This state results in hot pages getting written directly to disk, while cold ones remain in memory, waiting only to be invalidated. The LRU order is completely broken and zswap ends up being just an overhead without providing any benefits.

This commit applies 2 changes: a) the shrink worker is set to reclaim pages until the acceptance threshold is met and b) the task is also enqueued when zswap is not full but still above the threshold.

Testing this suggested update showed much better numbers:

real 36m37s
user 35m8s
sys 9m32s

written_back_pages: 10459423
reject_reclaim_fail: 12896
pool_limit_hit: 75653

Link: https://lkml.kernel.org/r/20230526183227.793977-1-cerasuolodomenico@gmail.com
Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

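A sketch of the reworked shrink worker loop; the name of the single-page reclaim helper and the retry limit are assumptions for illustration, not taken from the patch:

    static void shrink_worker(struct work_struct *w)
    {
            struct zswap_pool *pool = container_of(w, typeof(*pool),
                                                   shrink_work);
            int ret, failures = 0;

            do {
                    ret = zswap_reclaim_entry(pool);  /* write back one page */
                    if (ret) {
                            zswap_reject_reclaim_fail++;
                            if (ret != -EAGAIN)
                                    break;
                            if (++failures == MAX_RECLAIM_RETRIES)
                                    break;
                    }
                    cond_resched();
            } while (!zswap_can_accept());  /* keep going until below the threshold */

            zswap_pool_put(pool);
    }
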
* mm/mm_init.c: move set_pageblock_order() to free_area_init() | Haifeng Xu | 2023-06-09 | 1 | -1/+2

pageblock_order only needs to be set once, there is no need to initialize it in every zone/node.

Link: https://lkml.kernel.org/r/20230601063536.26882-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: khugepaged: avoid pointless allocation for "struct mm_slot" | Xin Hao | 2023-06-09 | 1 | -7/+5

In __khugepaged_enter(), if the MMF_VM_HUGEPAGE bit in "mm->flags" is already set, the freshly allocated "mm_slot" is released and the function returns. So we can call mm_slot_alloc() after test_and_set_bit() and avoid the pointless allocation.

Link: https://lkml.kernel.org/r/20230531095817.11012-1-xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foudation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/page_alloc: don't wake kswapd from rmqueue() unless __GFP_KSWAPD_RECLAIM is specified | Tetsuo Handa | 2023-06-09 | 1 | -1/+2

Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue() using the ZONE_BOOSTED_WATERMARK flag.

Only allocation contexts that include ALLOC_KSWAPD (which corresponds to __GFP_KSWAPD_RECLAIM) should wake kswapd, for callers are supposed to remove __GFP_KSWAPD_RECLAIM if trying to hold pgdat->kswapd_wait has a risk of deadlock. But since zone->flags is a shared variable, a thread doing a !__GFP_KSWAPD_RECLAIM allocation request might observe this flag being set immediately after another thread doing a __GFP_KSWAPD_RECLAIM allocation request set this flag, causing a possibility of deadlock.

Link: https://lkml.kernel.org/r/c3c3dacf-dd3b-77c9-f96a-d0982b4b2a4f@I-love.SAKURA.ne.jp
Fixes: 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock held")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

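A sketch of the gated wakeup in rmqueue(); the exact placement and ordering of the tests are illustrative:

    if ((alloc_flags & ALLOC_KSWAPD) &&
        unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
            /* Only contexts allowed to wake kswapd clear the boost flag. */
            clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
            wakeup_kswapd(zone, 0, 0, zone_idx(zone));
    }
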
* mm/mm_init.c: remove free_area_init_memoryless_node() | Haifeng Xu | 2023-06-09 | 1 | -6/+1

free_area_init_memoryless_node() is just a wrapper of free_area_init_node(), remove it to clean up.

Link: https://lkml.kernel.org/r/20230528045720.4835-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* THP: avoid lock when check whether THP is in deferred list | Yin Fengwei | 2023-06-09 | 1 | -5/+12

free_transhuge_page() acquires the split queue lock, then checks whether the THP was added to the deferred list or not. This brings high deferred queue lock contention.

It's safe to check whether the THP is in the deferred list or not without holding the deferred queue lock in free_transhuge_page(), because when code hits free_transhuge_page() there is no one trying to add the folio to _deferred_list.

Running page_fault1 of will-it-scale + order 2 folio for anonymous mapping with 96 processes on an Ice Lake 48C/96T test box, we could see the 61% split_queue_lock contention:

- 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page
   - 63.01% free_transhuge_page
      + 62.91% _raw_spin_lock_irqsave

With this patch applied, the split_queue_lock contention is less than 1%.

Link: https://lkml.kernel.org/r/20230429082759.1600796-2-fengwei.yin@intel.com
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

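The fast path, sketched; details of the locals and accessors are assumptions:

    /* Never queued: no need to take the split queue lock at all. */
    if (list_empty(&folio->_deferred_list))
            return;

    spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
    if (!list_empty(&folio->_deferred_list)) {
            /* Recheck under the lock before removing it. */
            ds_queue->split_queue_len--;
            list_del_init(&folio->_deferred_list);
    }
    spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
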
* swap: comments get_swap_device() with usage rule | Huang Ying | 2023-06-09 | 1 | -3/+9

The general rule to use a swap entry is as follows.

When we get a swap entry, if there aren't some other ways to prevent swapoff, such as the folio in swap cache is locked, page table lock is held, etc., the swap entry may become invalid because of swapoff. Then, we need to enclose all swap related functions with get_swap_device() and put_swap_device(), unless the swap functions call get/put_swap_device() by themselves.

Add the rule as comments of get_swap_device().

Link: https://lkml.kernel.org/r/20230529061355.125791-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

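The rule, as a usage sketch (the surrounding context that produced the swap entry is assumed):

    struct swap_info_struct *si;

    si = get_swap_device(entry);
    if (!si)
            return;         /* raced with swapoff, the entry is stale */

    /* ... call swap functions that use the entry here ... */

    put_swap_device(si);
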
* swap: remove get/put_swap_device() in __swap_duplicate() | Huang Ying | 2023-06-09 | 1 | -4/+1

__swap_duplicate() is called by

- swap_shmem_alloc(): the folio in swap cache is locked.

- copy_nonpresent_pte() -> swap_duplicate() and try_to_unmap_one() -> swap_duplicate(): the page table lock is held.

- __read_swap_cache_async() -> swapcache_prepare(): enclosed with get/put_swap_device() in __read_swap_cache_async() already.

So, it's safe to remove get/put_swap_device() in __swap_duplicate().

Link: https://lkml.kernel.org/r/20230529061355.125791-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* swap: remove __swp_swapcount() | Huang Ying | 2023-06-09 | 2 | -20/+2

__swp_swapcount() just encloses the calling to swap_swapcount() with get/put_swap_device(). It is called in __read_swap_cache_async() only, which encloses the calling with get/put_swap_device() already. So, __read_swap_cache_async() can call swap_swapcount() directly.

Link: https://lkml.kernel.org/r/20230529061355.125791-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* swap, __read_swap_cache_async(): enlarge get/put_swap_device protection range | Huang Ying | 2023-06-09 | 1 | -10/+21

This makes the function a little easier to be understood because we don't need to consider swapoff. And this makes it possible to remove get/put_swap_device() calling in some functions called by __read_swap_cache_async().

Link: https://lkml.kernel.org/r/20230529061355.125791-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* swap: remove get/put_swap_device() in __swap_count() | Huang Ying | 2023-06-09 | 1 | -8/+2

Patch series "swap: cleanup get/put_swap_device() usage", v3.

The general rule to use a swap entry is as follows.

When we get a swap entry, if there aren't some other ways to prevent swapoff, such as the folio in swap cache is locked, page table lock is held, etc., the swap entry may become invalid because of swapoff. Then, we need to enclose all swap related functions with get_swap_device() and put_swap_device(), unless the swap functions call get/put_swap_device() by themselves.

Based on the above rule, all get/put_swap_device() usage are checked and cleaned up if necessary.

This patch (of 5):

get/put_swap_device() are added to __swap_count() in commit eb085574a752 ("mm, swap: fix race between swapoff and some swap operations"). Later, in commit 2799e77529c2 ("swap: fix do_swap_page() race with swapoff"), get/put_swap_device() are added to do_swap_page(). And they enclose the only call site of __swap_count(). So, it's safe to remove get/put_swap_device() in __swap_count() now.

Link: https://lkml.kernel.org/r/20230529061355.125791-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20230529061355.125791-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/mm_init.c: do not calculate zone_start_pfn/zone_end_pfn in zone_absent_pages_in_node() | Haifeng Xu | 2023-06-09 | 1 | -18/+12

In calculate_node_totalpages(), zone_start_pfn/zone_end_pfn are already calculated in zone_spanned_pages_in_node(), so use them as parameters instead of node_start_pfn/node_end_pfn and the duplicated calculation process can be dropped.

Link: https://lkml.kernel.org/r/20230526085251.1977-2-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Haifeng Xu <haifeng.xu@shopee.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/mm_init.c: introduce reset_memoryless_node_totalpages() | Haifeng Xu | 2023-06-09 | 1 | -9/+22

Currently, no matter whether a node actually has memory or not, calculate_node_totalpages() is used to account number of pages in zone/node. However, for a node without memory, these unnecessary calculations can be skipped. All the zone/node page counts can be set to 0 directly.

So introduce reset_memoryless_node_totalpages() to perform this action. Furthermore, calculate_node_totalpages() only gets called for the node with memory.

Link: https://lkml.kernel.org/r/20230526085251.1977-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* Multi-gen LRU: fix workingset accounting | Kalesh Singh | 2023-06-09 | 2 | -4/+7

On Android app cycle workloads, MGLRU showed a significant reduction in workingset refaults although pgpgin/pswpin remained relatively unchanged. This indicated MGLRU may be undercounting workingset refaults.

This has an impact on userspace programs, like Android's LMKD, that monitor workingset refault statistics to detect thrashing.

It was found that refaults were only accounted if the MGLRU shadow entry was for a recently evicted folio. However, recently evicted folios should be accounted as workingset activation, and refaults should be accounted regardless of recency.

Fix MGLRU's workingset refault and activation accounting to more closely match that of the conventional active/inactive LRU.

Link: https://lkml.kernel.org/r/20230523205922.3852731-1-kaleshsingh@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/memcontrol: export memcg.swap watermark via sysfs for v2 memcg | Lars R. Damerow | 2023-06-09 | 1 | -0/+13

This patch is similar to commit 8e20d4b33266 ("mm/memcontrol: export memcg->watermark via sysfs for v2 memcg"), but exports the swap counter's watermark.

We allocate jobs to our compute farm using heuristics determined by memory and swap usage from previous jobs. Tracking the peak swap usage for new jobs is important for determining when jobs are exceeding their expected bounds, or when our baseline metrics are getting outdated.

Our toolset was written to use the "memory.memsw.max_usage_in_bytes" file in cgroups v1, and altering it to poll cgroups v2's "memory.swap.current" would give less accurate results as well as add complication to the code. Having this watermark exposed in sysfs is much preferred.

Link: https://lkml.kernel.org/r/20230524181734.125696-1-lars@pixar.com
Signed-off-by: Lars R. Damerow <lars@pixar.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: shmem: fix UAF bug in shmem_show_options() | Tu Jinjiang | 2023-06-09 | 1 | -1/+4

shmem_show_options() uses sbinfo->mpol without adding its refcnt. This may lead to a race with replacement of the mpol by remount. The execution sequence is as follows.

    CPU0                                     CPU1
    shmem_show_options()                     shmem_reconfigure()
      shmem_show_mpol(seq, sbinfo->mpol)
                                               mpol = sbinfo->mpol
                                               mpol_put(mpol)
        mpol->mode

The KASAN report is as follows.

BUG: KASAN: slab-use-after-free in shmem_show_options+0x21b/0x340
Read of size 2 at addr ffff888124324004 by task mount/2388

CPU: 2 PID: 2388 Comm: mount Not tainted 6.4.0-rc3-00017-g9d646009f65d-dirty #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x37/0x50
 print_report+0xd0/0x620
 ? shmem_show_options+0x21b/0x340
 ? __virt_addr_valid+0xf4/0x180
 ? shmem_show_options+0x21b/0x340
 kasan_report+0xb8/0xe0
 ? shmem_show_options+0x21b/0x340
 shmem_show_options+0x21b/0x340
 ? __pfx_shmem_show_options+0x10/0x10
 ? strchr+0x2c/0x50
 ? strlen+0x23/0x40
 ? seq_puts+0x7d/0x90
 show_vfsmnt+0x1e6/0x260
 ? __pfx_show_vfsmnt+0x10/0x10
 ? __kasan_kmalloc+0x7f/0x90
 seq_read_iter+0x57a/0x740
 vfs_read+0x2e2/0x4a0
 ? __pfx_vfs_read+0x10/0x10
 ? down_write_killable+0xb8/0x140
 ? __pfx_down_write_killable+0x10/0x10
 ? __fget_light+0xa9/0x1e0
 ? up_write+0x3f/0x80
 ksys_read+0xb8/0x150
 ? __pfx_ksys_read+0x10/0x10
 ? fpregs_assert_state_consistent+0x55/0x60
 ? exit_to_user_mode_prepare+0x2d/0x120
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
 </TASK>

Allocated by task 2387:
 kasan_save_stack+0x22/0x50
 kasan_set_track+0x25/0x30
 __kasan_slab_alloc+0x59/0x70
 kmem_cache_alloc+0xdd/0x220
 mpol_new+0x83/0x150
 mpol_parse_str+0x280/0x4a0
 shmem_parse_one+0x364/0x520
 vfs_parse_fs_param+0xf8/0x1a0
 vfs_parse_fs_string+0xc9/0x130
 shmem_parse_options+0xb2/0x110
 path_mount+0x597/0xdf0
 do_mount+0xcd/0xf0
 __x64_sys_mount+0xbd/0x100
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 2389:
 kasan_save_stack+0x22/0x50
 kasan_set_track+0x25/0x30
 kasan_save_free_info+0x2e/0x50
 __kasan_slab_free+0x10e/0x1a0
 kmem_cache_free+0x9c/0x350
 shmem_reconfigure+0x278/0x370
 reconfigure_super+0x383/0x450
 path_mount+0xcc5/0xdf0
 do_mount+0xcd/0xf0
 __x64_sys_mount+0xbd/0x100
 do_syscall_64+0x3c/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

The buggy address belongs to the object at ffff888124324000 which belongs to the cache numa_policy of size 32
The buggy address is located 4 bytes inside of freed 32-byte region [ffff888124324000, ffff888124324020)
==================================================================

To fix the bug, shmem_get_sbmpol() / mpol_put() needs to be called before / after the shmem_show_mpol() call.

Link: https://lkml.kernel.org/r/20230525031640.593733-1-tujinjiang@huawei.com
Signed-off-by: Tu Jinjiang <tujinjiang@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

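The fix, sketched, follows directly from the last paragraph above: take a reference on the mempolicy before printing it and drop it afterwards.

    struct mempolicy *mpol;

    mpol = shmem_get_sbmpol(sbinfo);        /* takes a reference */
    shmem_show_mpol(seq, mpol);
    mpol_put(mpol);                         /* drop it after printing */
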
* mm: compaction: skip fast freepages isolation if enough freepages are isolated | Baolin Wang | 2023-06-09 | 1 | -0/+4

I've observed that fast isolation often isolates more pages than cc->migratepages, and the excess freepages will be released back to the buddy system. So skip fast freepages isolation if enough freepages are isolated to save some CPU cycles.

Link: https://lkml.kernel.org/r/f39c2c07f2dba2732fd9c0843572e5bef96f7f67.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: compaction: add trace event for fast freepages isolation | Baolin Wang | 2023-06-09 | 1 | -1/+5

The fast_isolate_freepages() can also isolate freepages, but we can not know the fast isolation efficiency to understand the fast isolation pressure. So add a trace event to show some numbers to help to understand the efficiency for fast freepages isolation.

Link: https://lkml.kernel.org/r/78d2932d0160d122c15372aceb3f2c45460a17fc.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: compaction: only set skip flag if cc->no_set_skip_hint is false | Baolin Wang | 2023-06-09 | 1 | -1/+1

To keep the same logic as test_and_set_skip(), only set the skip flag if cc->no_set_skip_hint is false, which makes code more reasonable.

Link: https://lkml.kernel.org/r/0eb2cd2407ffb259ae6e3071e10f70f2d41d0f3e.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: compaction: skip more fully scanned pageblock | Baolin Wang | 2023-06-09 | 1 | -1/+1

In fast_isolate_around(), it assumes the pageblock is fully scanned if cc->nr_freepages < cc->nr_migratepages after trying to isolate some free pages, and will set skip flag to avoid scanning in future. However this can miss setting the skip flag for a fully scanned pageblock (returned 'start_pfn' is equal to 'end_pfn') in the case where cc->nr_freepages is larger than cc->nr_migratepages.

So using the returned 'start_pfn' from isolate_freepages_block() and 'end_pfn' to decide if a pageblock is fully scanned makes more sense. It can also cover the case where cc->nr_freepages < cc->nr_migratepages, which means the 'start_pfn' is usually equal to 'end_pfn' except some uncommon fatal error occurs after non-strict mode isolation.

Link: https://lkml.kernel.org/r/f4efd2fa08735794a6d809da3249b6715ba6ad38.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: compaction: change fast_isolate_freepages() to void type | Baolin Wang | 2023-06-09 | 1 | -5/+3

No caller cares about the return value of fast_isolate_freepages(), void it.

Link: https://lkml.kernel.org/r/759fca20b22ebf4c81afa30496837b9e0fb2e53b.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: compaction: drop the redundant page validation in update_pageblock_skip() | Baolin Wang | 2023-06-09 | 1 | -3/+0

Patch series "Misc cleanups and improvements for compaction".

This series contains some cleanups and improvements for compaction.

This patch (of 6):

The caller has validated the page before calling update_pageblock_skip(), thus drop the redundant page validation in update_pageblock_skip().

Link: https://lkml.kernel.org/r/5142e15b9295fe8c447dbb39b7907a20177a1413.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/vmalloc: dont purge usable blocks unnecessarily | Thomas Gleixner | 2023-06-09 | 1 | -8/+20

Purging fragmented blocks is done unconditionally in several contexts:

1) From drain_vmap_area_work(), when the number of lazy to be freed vmap_areas reached the threshold

2) Reclaiming vmalloc address space from pcpu_get_vm_areas()

3) _vm_unmap_aliases()

#1 There is no reason to zap fragmented vmap blocks unconditionally, simply because reclaiming all lazy areas drains at least 32MB * fls(num_online_cpus()) per invocation which is plenty.

#2 Reclaiming when running out of space or due to memory pressure makes a lot of sense

#3 _unmap_aliases() requires to touch everything because the caller has no clue which vmap_area used a particular page last and the vmap_area lost that information too.

Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the vmap area first and then cares about the flush. That in turn requires a full walk of _all_ vmap areas including the one which was just added to the purge list.

But as this has to be flushed anyway this is an opportunity to combine outstanding TLB flushes and do the housekeeping of purging freed areas, but like #1 there is no real good reason to zap usable vmap blocks unconditionally.

Add a @force_purge argument to the newly split out block purge function and if not true only purge fragmented blocks which have less than 1/4 of their capacity left.

Rename purge_vmap_area_lazy() to reclaim_and_purge_vmap_areas() to make it clear what the function does.

[lstoakes@gmail.com: correct VMAP_PURGE_THRESHOLD check]
Link: https://lkml.kernel.org/r/3e92ef61-b910-4576-88e7-cf43211fd4e7@lucifer.local
Link: https://lkml.kernel.org/r/20230525124504.864005691@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/vmalloc: add missing READ/WRITE_ONCE() annotations | Thomas Gleixner | 2023-06-09 | 1 | -5/+8

purge_fragmented_blocks() accesses vmap_block::free and vmap_block::dirty lockless for a quick check. Add the missing READ/WRITE_ONCE() annotations.

Link: https://lkml.kernel.org/r/20230525124504.807356682@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/vmalloc: check free space in vmap_block lockless | Thomas Gleixner | 2023-06-09 | 1 | -1/+4

vb_alloc() unconditionally locks a vmap_block on the free list to check the free space.

This can be done locklessly because vmap_block::free never increases, it's only decreased on allocations. Check the free space lockless and only if that succeeds, recheck under the lock.

Link: https://lkml.kernel.org/r/20230525124504.750481992@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

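A sketch of the lockless pre-check in vb_alloc(); the surrounding free-list iteration is assumed:

    if (READ_ONCE(vb->free) < (1UL << order))
            continue;               /* cannot fit, skip the lock */

    spin_lock(&vb->lock);
    if (vb->free < (1UL << order)) {
            /* Recheck under the lock; vb->free only ever decreases. */
            spin_unlock(&vb->lock);
            continue;
    }
    /* ... carve the allocation out of this block ... */
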
* mm/vmalloc: prevent flushing dirty space over and over | Thomas Gleixner | 2023-06-09 | 1 | -2/+6

vmap blocks which have active mappings cannot be purged. Allocations which have been freed are accounted for in vmap_block::dirty_min/max, so that they can be detected in _vm_unmap_aliases() as potentially stale TLBs.

If there are several invocations of _vm_unmap_aliases() then each of them will flush the dirty range. That's pointless and just increases the probability of full TLB flushes.

Avoid that by resetting the flush range after accounting for it. That's safe versus other invocations of _vm_unmap_aliases() because this is all serialized with vmap_purge_lock.

Link: https://lkml.kernel.org/r/20230525124504.692056496@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/vmalloc: avoid iterating over per CPU vmap blocks twice | Thomas Gleixner | 2023-06-09 | 1 | -24/+46

_vm_unmap_aliases() walks the per CPU xarrays to find partially unmapped blocks and then walks the per cpu free lists to purge fragmented blocks. Arguably that's waste of CPU cycles and cache lines as the full xarray walk already touches every block.

Avoid this double iteration:

- Split out the code to purge one block and the code to free the local purge list into helper functions.

- Try to purge the fragmented blocks in the xarray walk before looking at their dirty space.

Link: https://lkml.kernel.org/r/20230525124504.633469722@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/vmalloc: prevent stale TLBs in fully utilized blocks | Thomas Gleixner | 2023-06-09 | 1 | -1/+2

Patch series "mm/vmalloc: Assorted fixes and improvements", v2.

This series addresses the following issues:

1) Prevent the stale TLB problem related to fully utilized vmap blocks

2) Avoid the double per CPU list walk in _vm_unmap_aliases()

3) Avoid flushing dirty space over and over

4) Add a lockless quickcheck in vb_alloc() and add missing READ/WRITE_ONCE() annotations

5) Prevent overeager purging of usable vmap_blocks if not under memory/address space pressure.

This patch (of 6):

_vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a page are left in the system. This is required due to the lazy TLB flush mechanism in vmalloc.

This is attempted by walking the per CPU free lists, but those do not contain fully utilized vmap blocks because they are removed from the free list once the block's free space became zero. When the block is not fully unmapped then it is not on the purge list either.

So neither the per CPU list iteration nor the purge list walk find the block, and if the page was mapped via such a block and the TLB has not yet been flushed, the guarantee of _vm_unmap_aliases() that there are no stale TLBs after returning is broken:

x = vb_alloc()
// Removes vmap_block from free list because vb->free became 0
vb_free(x)
// Unmaps page and marks in dirty_min/max range
// Block has still mappings and is not put on purge list

// Page is reused
vm_unmap_aliases()
// Can't find vmap block with the dirty space -> FAIL

So instead of walking the per CPU free lists, walk the per CPU xarrays which hold pointers to _all_ active blocks in the system including those removed from the free lists.

Link: https://lkml.kernel.org/r/20230525122342.109672430@linutronix.de
Link: https://lkml.kernel.org/r/20230525124504.573987880@linutronix.de
Fixes: db64fe02258f ("mm: rewrite vmap layer")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: multi-gen LRU: cleanup lru_gen_test_recent() | T.J. Alumbaugh | 2023-06-09 | 1 | -30/+16

Avoid passing memcg* and pglist_data* to lru_gen_test_recent() since we only use the lruvec anyway.

Link: https://lkml.kernel.org/r/20230522112058.2965866-4-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: multi-gen LRU: add helpers in page table walks | T.J. Alumbaugh | 2023-06-09 | 1 | -5/+15

Add helpers to page table walking code:

- Clarifies intent via name "should_walk_mmu" and "should_clear_pmd_young"

- Avoids repeating same logic in two places

Link: https://lkml.kernel.org/r/20230522112058.2965866-3-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: multi-gen LRU: cleanup lru_gen_soft_reclaim() | T.J. Alumbaugh | 2023-06-09 | 2 | -2/+4

lru_gen_soft_reclaim() gets the lruvec from the memcg and node ID to keep a cleaner interface on the caller side.

Link: https://lkml.kernel.org/r/20230522112058.2965866-2-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm: multi-gen LRU: use macro for bitmap | T.J. Alumbaugh | 2023-06-09 | 1 | -1/+1

Use DECLARE_BITMAP macro when possible.

Link: https://lkml.kernel.org/r/20230522112058.2965866-1-talumbau@google.com
Signed-off-by: T.J. Alumbaugh <talumbau@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/memcontrol: fix typo in comment | Haifeng Xu | 2023-06-09 | 1 | -1/+1

Replace 'then' with 'than'.

Link: https://lkml.kernel.org/r/20230522095233.4246-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/mlock: rename mlock_future_check() to mlock_future_ok() | Andrew Morton | 2023-06-09 | 4 | -7/+7

It is felt that the name mlock_future_check() is vague - it doesn't particularly convey the function's operation. mlock_future_ok() is a clearer name for a predicate function.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

* mm/mmap: refactor mlock_future_check() | Lorenzo Stoakes | 2023-06-09 | 4 | -20/+21

In all but one instance, mlock_future_check() is treated as a boolean function despite returning an error code. In one instance, this error code is ignored and replaced with -ENOMEM.

This is confusing, and the inversion of true -> failure, false -> success is not warranted. Convert the function to a bool, lightly refactor and return true if the check passes, false if not.

Link: https://lkml.kernel.org/r/20230522082412.56685-1-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

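A sketch of the resulting predicate, with the body abbreviated and the details assumed rather than copied from the patch:

    static bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
                                unsigned long bytes)
    {
            unsigned long locked_pages, limit_pages;

            /* Non-mlocked mappings and privileged callers always pass. */
            if (!(flags & VM_LOCKED) || capable(CAP_IPC_LOCK))
                    return true;

            locked_pages = bytes >> PAGE_SHIFT;
            locked_pages += mm->locked_vm;
            limit_pages = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

            return locked_pages <= limit_pages;
    }
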