summaryrefslogtreecommitdiffstats
path: root/include/linux/mm.h
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'x86_shstk_for_6.6-rc1' of ↵Linus Torvalds2023-08-311-8/+39
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 shadow stack support from Dave Hansen: "This is the long awaited x86 shadow stack support, part of Intel's Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace. The main use case for shadow stack is providing protection against return oriented programming attacks. It works by maintaining a secondary (shadow) stack using a special memory type that has protections against modification. When executing a CALL instruction, the processor pushes the return address to both the normal stack and to the special permission shadow stack. Upon RET, the processor pops the shadow stack copy and compares it to the normal stack copy. For more information, refer to the links below for the earlier versions of this patch set" Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/ Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/ * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) x86/shstk: Change order of __user in type x86/ibt: Convert IBT selftest to asm x86/shstk: Don't retry vm_munmap() on -EINTR x86/kbuild: Fix Documentation/ reference x86/shstk: Move arch detail comment out of core mm x86/shstk: Add ARCH_SHSTK_STATUS x86/shstk: Add ARCH_SHSTK_UNLOCK x86: Add PTRACE interface for shadow stack selftests/x86: Add shadow stack test x86/cpufeatures: Enable CET CR4 bit for shadow stack x86/shstk: Wire in shadow stack interface x86: Expose thread features in /proc/$PID/status x86/shstk: Support WRSS for userspace x86/shstk: Introduce map_shadow_stack syscall x86/shstk: Check that signal frame is shadow stack mem x86/shstk: Check that SSP is aligned on sigreturn x86/shstk: Handle signals for shadow stack x86/shstk: Introduce routines modifying shstk x86/shstk: Handle thread shadow stack x86/shstk: Add user-mode shadow stack support ...
| * x86/shstk: Move arch detail comment out of core mmRick Edgecombe2023-08-021-26/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The comment around VM_SHADOW_STACK in mm.h refers to a lot of x86 specific details that don't belong in a cross arch file. Remove these out of core mm, and just leave the non-arch details. Since the comment includes some useful details that would be good to retain in the source somewhere, put the arch specifics parts in arch/x86/shstk.c near alloc_shstk(), where memory of this type is allocated. Include a reference to the existence of the x86 details near the VM_SHADOW_STACK definition mm.h. Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/all/20230706233248.445713-1-rick.p.edgecombe%40intel.com
| * mm: Add guard pages around a shadow stack.Rick Edgecombe2023-07-111-6/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. The architecture of shadow stack constrains the ability of userspace to move the shadow stack pointer (SSP) in order to prevent corrupting or switching to other shadow stacks. The RSTORSSP instruction can move the SSP to different shadow stacks, but it requires a specially placed token in order to do this. However, the architecture does not prevent incrementing the stack pointer to wander onto an adjacent shadow stack. To prevent this in software, enforce guard pages at the beginning of shadow stack VMAs, such that there will always be a gap between adjacent shadow stacks. Make the gap big enough so that no userspace SSP changing operations (besides RSTORSSP), can move the SSP from one stack to the next. The SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and RET can move the SSP by a maximum of 8 bytes, at which point the shadow stack would be accessed. The INCSSP instruction can also increment the shadow stack pointer. It is the shadow stack analog of an instruction like: addq $0x80, %rsp However, there is one important difference between an ADD on %rsp and INCSSP. In addition to modifying SSP, INCSSP also reads from the memory of the first and last elements that were "popped". It can be thought of as acting like this: READ_ONCE(ssp); // read+discard top element on stack ssp += nr_to_pop * 8; // move the shadow stack READ_ONCE(ssp-8); // read+discard last popped stack element The maximum distance INCSSP can move the SSP is 2040 bytes, before it would read the memory. Therefore, a single page gap will be enough to prevent any operation from shifting the SSP to an adjacent stack, since it would have to land in the gap at least once, causing a fault. This could be accomplished by using VM_GROWSDOWN, but this has a downside. The behavior would allow shadow stacks to grow, which is unneeded and adds a strange difference to how most regular stacks work. In the maple tree code, there is some logic for retrying the unmapped area search if a guard gap is violated. This retry should happen for shadow stack guard gap violations as well. This logic currently only checks for VM_GROWSDOWN for start gaps. Since shadow stacks also have a start gap as well, create an new define VM_STARTGAP_FLAGS to hold all the VM flag bits that have start gaps, and make mmap use it. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-17-rick.p.edgecombe%40intel.com
| * mm: Introduce VM_SHADOW_STACK for shadow stack memoryYu-cheng Yu2023-07-111-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | New hardware extensions implement support for shadow stack memory, such as x86 Control-flow Enforcement Technology (CET). Add a new VM flag to identify these areas, for example, to be used to properly indicate shadow stack PTEs to the hardware. Shadow stack VMA creation will be tightly controlled and limited to anonymous memory to make the implementation simpler and since that is all that is required. The solution will rely on pte_mkwrite() to create the shadow stack PTEs, so it will not be required for vm_get_page_prot() to learn how to create shadow stack memory. For this reason document that VM_SHADOW_STACK should not be mixed with VM_SHARED. Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Mark Brown <broonie@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-15-rick.p.edgecombe%40intel.com
| * mm: Move VM_UFFD_MINOR_BIT from 37 to 38Yu-cheng Yu2023-07-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. Future patches will introduce a new VM flag VM_SHADOW_STACK that will be VM_HIGH_ARCH_BIT_5. VM_HIGH_ARCH_BIT_1 through VM_HIGH_ARCH_BIT_4 are bits 32-36, and bit 37 is the unrelated VM_UFFD_MINOR_BIT. For the sake of order, make all VM_HIGH_ARCH_BITs stay together by moving VM_UFFD_MINOR_BIT from 37 to 38. This will allow VM_SHADOW_STACK to be introduced as 37. Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Peter Xu <peterx@redhat.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-6-rick.p.edgecombe%40intel.com
| * mm: Re-introduce vm_flags to do_mmap()Yu-cheng Yu2023-07-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was no more caller passing vm_flags to do_mmap(), and vm_flags was removed from the function's input by: commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()"). There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to do_mmap(). Thus, re-introduce vm_flags to do_mmap(). Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/all/20230613001108.3040476-5-rick.p.edgecombe%40intel.com
| * mm: Make pte_mkwrite() take a VMARick Edgecombe2023-07-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The x86 Shadow stack feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. One of these unusual properties is that shadow stack memory is writable, but only in limited ways. These limits are applied via a specific PTE bit combination. Nevertheless, the memory is writable, and core mm code will need to apply the writable permissions in the typical paths that call pte_mkwrite(). Future patches will make pte_mkwrite() take a VMA, so that the x86 implementation of it can know whether to create regular writable or shadow stack mappings. But there are a couple of challenges to this. Modifying the signatures of each arch pte_mkwrite() implementation would be error prone because some are generated with macros and would need to be re-implemented. Also, some pte_mkwrite() callers operate on kernel memory without a VMA. So this can be done in a three step process. First pte_mkwrite() can be renamed to pte_mkwrite_novma() in each arch, with a generic pte_mkwrite() added that just calls pte_mkwrite_novma(). Next callers without a VMA can be moved to pte_mkwrite_novma(). And lastly, pte_mkwrite() and all callers can be changed to take/pass a VMA. Previous work pte_mkwrite() renamed pte_mkwrite_novma() and converted callers that don't have a VMA were to use pte_mkwrite_novma(). So now change pte_mkwrite() to take a VMA and change the remaining callers to pass a VMA. Apply the same changes for pmd_mkwrite(). No functional change. Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/all/20230613001108.3040476-4-rick.p.edgecombe%40intel.com
* | Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds2023-08-291-0/+18
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...
| * | fs: rename and move block_page_mkwrite_returnChristoph Hellwig2023-08-021-0/+18
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | block_page_mkwrite_return is neither block nor mkwrite specific, and should not be under CONFIG_BLOCK. Move it to mm.h and rename it to vmf_fs_error. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | mm: remove enum page_entry_sizeMatthew Wilcox (Oracle)2023-08-241-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order(). The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value. If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method. [willy@infradead.org: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: convert do_set_pte() to set_pte_range()Yin Fengwei2023-08-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | set_pte_range() allows to setup page table entries for a specific range. It takes advantage of batched rmap update for large folio. It now takes care of calling update_mmu_cache_range(). Link: https://lkml.kernel.org/r/20230802151406.3735276-37-willy@infradead.org Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: handle userfaults under VMA lockSuren Baghdasaryan2023-08-241-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable handle_userfault to operate under VMA lock by releasing VMA lock instead of mmap_lock and retrying. Note that FAULT_FLAG_RETRY_NOWAIT should never be used when handling faults under per-VMA lock protection because that would break the assumption that lock is dropped on retry. [surenb@google.com: fix a lockdep issue in vma_assert_write_locked] Link: https://lkml.kernel.org/r/20230712195652.969194-1-surenb@google.com Link: https://lkml.kernel.org/r/20230630211957.1341547-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: handle swap page faults under per-VMA lockSuren Baghdasaryan2023-08-241-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When page fault is handled under per-VMA lock protection, all swap page faults are retried with mmap_lock because folio_lock_or_retry has to drop and reacquire mmap_lock if folio could not be immediately locked. Follow the same pattern as mmap_lock to drop per-VMA lock when waiting for folio and retrying once folio is available. With this obstacle removed, enable do_swap_page to operate under per-VMA lock protection. Drivers implementing ops->migrate_to_ram might still rely on mmap_lock, therefore we have to fall back to mmap_lock in that particular case. Note that the only time do_swap_page calls synchronous swap_readpage is when SWP_SYNCHRONOUS_IO is set, which is only set for QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and pmem). Therefore we don't sleep in this path, and there's no need to drop the mmap or per-VMA lock. Link: https://lkml.kernel.org/r/20230630211957.1341547-6-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: free up a word in the first tail pageMatthew Wilcox (Oracle)2023-08-211-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Store the folio order in the low byte of the flags word in the first tail page. This frees up the word that was being used to store the order and dtor bytes previously. Link: https://lkml.kernel.org/r/20230816151201.3655946-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: add large_rmappable page flagMatthew Wilcox (Oracle)2023-08-211-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Stored in the first tail page's flags, this flag replaces the destructor. That removes the last of the destructors, so remove all references to folio_dtor and compound_dtor. Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: remove HUGETLB_PAGE_DTORMatthew Wilcox (Oracle)2023-08-211-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can use a bit in page[1].flags to indicate that this folio belongs to hugetlb instead of using a value in page[1].dtors. That lets folio_test_hugetlb() become an inline function like it should be. We can also get rid of NULL_COMPOUND_DTOR. Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: remove free_compound_page() and the compound_page_dtors arrayMatthew Wilcox (Oracle)2023-08-211-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | The only remaining destructor is free_compound_page(). Inline it into destroy_large_folio() and remove the array it used to live in. Link: https://lkml.kernel.org/r/20230816151201.3655946-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: convert free_transhuge_folio() to folio_undo_large_rmappable()Matthew Wilcox (Oracle)2023-08-211-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Indirect calls are expensive, thanks to Spectre. Test for TRANSHUGE_PAGE_DTOR and destroy the folio appropriately. Move the free_compound_page() call into destroy_large_folio() to simplify later patches. Link: https://lkml.kernel.org/r/20230816151201.3655946-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | merge mm-hotfixes-stable into mm-stable to pick up depended-upon changesAndrew Morton2023-08-211-6/+15
|\ \
| * | mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULTDavid Hildenbrand2023-08-211-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting or due to inaccessible (PROT_NONE) VMAs. As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from gup/gup_fast"): "Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages." liubo reported [1] that smaps_rollup results are imprecise, because they miss accounting of pages that are mapped PROT_NONE. Further, it's easy to reproduce that KSM no longer works on inaccessible VMAs on x86-64, because pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs, and follow_page() refuses to return such pages right now. As KVM really depends on these NUMA hinting faults, removing the pte_protnone()/pmd_protnone() handling in GUP code completely is not really an option. To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT to restore the original behavior for now and add better comments. Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in is_valid_gup_args(), to add that flag for all external GUP users. Note that there are three GUP-internal __get_user_pages() users that don't end up calling is_valid_gup_args() and consequently won't get FOLL_HONOR_NUMA_FAULT set. 1) get_dump_page(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE and wouldn't have honored NUMA hinting faults already. 2) populate_vma_page_range(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have honored NUMA hinting faults already. 3) faultin_vma_page_range(): we similarly don't want to handle NUMA hinting faults. To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in inaccessible VMAs properly, we have to perform VMA accessibility checks in gup_can_follow_protnone(). As GUP-fast should reject such pages either way in pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and arm64 that both implement pte_protnone() -- let's just always fallback to ordinary GUP when stumbling over pte_protnone()/pmd_protnone(). As Linus notes [2], honoring NUMA faults might only make sense for selected GUP users. So we should really see if we can instead let relevant GUP callers specify it manually, and not trigger NUMA hinting faults from GUP as default. Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and adding appropriate documenation. While at it, remove a stale comment from follow_trans_huge_pmd(): That comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP migration"), which noted: THP does not unmap pages due to a lack of support for migration entries at a PMD level. This allows races with get_user_pages Nowadays, we do have PMD migration entries, so the comment no longer applies. Let's drop it. [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: liubo <liubo254@huawei.com> Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com Reported-by: Peter Xu <peterx@redhat.com> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappersVishal Moola (Oracle)2023-08-211-20/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions are no longer necessary. Remove them and cleanup Documentation referencing them. Link: https://lkml.kernel.org/r/20230807230513.102486-32-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}Vishal Moola (Oracle)2023-08-211-14/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create pagetable_pte_ctor(), pagetable_pmd_ctor(), pagetable_pte_dtor(), and pagetable_pmd_dtor() and make the original pgtable constructor/destructors wrappers. Link: https://lkml.kernel.org/r/20230807230513.102486-12-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: convert ptlock_free() to use ptdescsVishal Moola (Oracle)2023-08-211-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/20230807230513.102486-11-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: convert pmd_ptlock_free() to use ptdescsVishal Moola (Oracle)2023-08-211-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/20230807230513.102486-10-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: convert ptlock_init() to use ptdescsVishal Moola (Oracle)2023-08-211-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/20230807230513.102486-9-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: convert pmd_ptlock_init() to use ptdescsVishal Moola (Oracle)2023-08-211-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/20230807230513.102486-8-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: convert ptlock_ptr() to use ptdescsVishal Moola (Oracle)2023-08-211-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/20230807230513.102486-7-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: convert ptlock_alloc() to use ptdescsVishal Moola (Oracle)2023-08-211-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/20230807230513.102486-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: convert pmd_pgtable_page() callers to use pmd_ptdesc()Vishal Moola (Oracle)2023-08-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Converts internal pmd_pgtable_page() callers to use pmd_ptdesc(). This removes some direct accesses to struct page, working towards splitting out struct ptdesc from struct page. Link: https://lkml.kernel.org/r/20230807230513.102486-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: add utility functions for ptdescVishal Moola (Oracle)2023-08-211-0/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce utility functions setting the foundation for ptdescs. These will also assist in the splitting out of ptdesc from struct page. Functions that focus on the descriptor are prefixed with ptdesc_* while functions that focus on the pagetable are prefixed with pagetable_*. pagetable_alloc() is defined to allocate new ptdesc pages as compound pages. This is to standardize ptdescs by allowing for one allocation and one free function, in contrast to 2 allocation and 2 free functions. Link: https://lkml.kernel.org/r/20230807230513.102486-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Guo Ren <guoren@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: move dummy_vm_ops out of a headerMateusz Guzik2023-08-211-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Otherwise the kernel ends up with multiple copies: $ nm vmlinux | grep dummy_vm_ops ffffffff81e4ea00 d dummy_vm_ops.2 ffffffff81e11760 d dummy_vm_ops.254 ffffffff81e406e0 d dummy_vm_ops.4 ffffffff81e3c780 d dummy_vm_ops.7 While here prefix it with vma_. Link: https://lkml.kernel.org/r/20230806231611.1395735-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_onceSuren Baghdasaryan2023-08-211-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is not obvious and makes it hard to understand where vma locking is happening. Also in some cases (like in dup_userfaultfd()) vma should be locked earlier than vma_flags modification. To make locking more visible, change these functions to assert that the vma write lock is taken and explicitly lock the vma beforehand. Fix userfaultfd functions which should lock the vma earlier. Link: https://lkml.kernel.org/r/20230804152724.3090321-5-surenb@google.com Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmapSuren Baghdasaryan2023-08-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_PER_VMA_LOCK=n, vma_assert_write_locked() should be equivalent to mmap_assert_write_locked(). Link: https://lkml.kernel.org/r/20230804152724.3090321-3-surenb@google.com Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: factor out VMA stack and heap checksKefeng Wang2023-08-211-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: convert to vma_is_initial_heap/stack()", v3. Add vma_is_initial_stack() and vma_is_initial_heap() helpers and use them to simplify code. This patch (of 4): Factor out VMA stack and heap checks and name them vma_is_initial_stack() and vma_is_initial_heap() for general use. Link: https://lkml.kernel.org/r/20230728050043.59880-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20230728050043.59880-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Christian Göttsche <cgzones@googlemail.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Christian Göttsche <cgzones@googlemail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Eric Paris <eparis@parisplace.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm/vmemmap optimization: split hugetlb and devdax vmemmap optimizationAneesh Kumar K.V2023-08-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap optimization includes an update of both the permissions (writeable to read-only) and the output address (pfn) of the vmemmap ptes. That is not supported without unmapping of pte(marking it invalid) by some architectures. With DAX vmemmap optimization we don't require such pte updates and architectures can enable DAX vmemmap optimization while having hugetlb vmemmap optimization disabled. Hence split DAX optimization support into a different config. s390, loongarch and riscv don't have devdax support. So the DAX config is not enabled for them. With this change, arm64 should be able to select DAX optimization [1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP") Link: https://lkml.kernel.org/r/20230724190759.483013-8-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm/vmemmap: improve vmemmap_can_optimize and allow architectures to overrideAneesh Kumar K.V2023-08-181-4/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dax vmemmap optimization requires a minimum of 2 PAGE_SIZE area within vmemmap such that tail page mapping can point to the second PAGE_SIZE area. Enforce that in vmemmap_can_optimize() function. Architectures like powerpc also want to enable vmemmap optimization conditionally (only with radix MMU translation). Hence allow architecture override. Link: https://lkml.kernel.org/r/20230724190759.483013-4-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: remove CONFIG_PER_VMA_LOCK ifdefsMatthew Wilcox (Oracle)2023-08-181-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Handle most file-backed faults under the VMA lock", v3. This patchset adds the ability to handle page faults on parts of files which are already in the page cache without taking the mmap lock. This patch (of 10): Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined to eliminate ifdefs in the users. Link: https://lkml.kernel.org/r/20230724185410.1124082-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230724185410.1124082-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: change do_vmi_align_munmap() tracking of VMAs to removeLiam R. Howlett2023-08-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The majority of the calls to munmap a vm range is within a single vma. The maple tree is able to store a single entry at 0, with a size of 1 as a pointer and avoid any allocations. Change do_vmi_align_munmap() to store the VMAs being munmap()'ed into a tree indexed by the count. This will leverage the ability to store the first entry without a node allocation. Storing the entries into a tree by the count and not the vma start and end means changing the functions which iterate over the entries. Update unmap_vmas() and free_pgtables() to take a maple state and a tree end address to support this functionality. Passing through the same maple state to unmap_vmas() and free_pgtables() means the state needs to be reset between calls. This happens in the static unmap_region() and exit_mmap(). Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: don't drop VMA locks in mm_drop_all_locks()Jann Horn2023-08-181-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Despite its name, mm_drop_all_locks() does not drop _all_ locks; the mmap lock is held write-locked by the caller, and the caller is responsible for dropping the mmap lock at a later point (which will also release the VMA locks). Calling vma_end_write_all() here is dangerous because the caller might have write-locked a VMA with the expectation that it will stay write-locked until the mmap_lock is released, as usual. This _almost_ becomes a problem in the following scenario: An anonymous VMA A and an SGX VMA B are mapped adjacent to each other. Userspace calls munmap() on a range starting at the start address of A and ending in the middle of B. Hypothetical call graph with additional notes in brackets: do_vmi_align_munmap [begin first for_each_vma_range loop] vma_start_write [on VMA A] vma_mark_detached [on VMA A] __split_vma [on VMA B] sgx_vma_open [== new->vm_ops->open] sgx_encl_mm_add __mmu_notifier_register [luckily THIS CAN'T ACTUALLY HAPPEN] mm_take_all_locks mm_drop_all_locks vma_end_write_all [drops VMA lock taken on VMA A before] vma_start_write [on VMA B] vma_mark_detached [on VMA B] [end first for_each_vma_range loop] vma_iter_clear_gfp [removes VMAs from maple tree] mmap_write_downgrade unmap_region mmap_read_unlock In this hypothetical scenario, while do_vmi_align_munmap() thinks it still holds a VMA write lock on VMA A, the VMA write lock has actually been invalidated inside __split_vma(). The call from sgx_encl_mm_add() to __mmu_notifier_register() can't actually happen here, as far as I understand, because we are duplicating an existing SGX VMA, but sgx_encl_mm_add() only calls __mmu_notifier_register() for the first SGX VMA created in a given process. So this could only happen in fork(), not on munmap(). But in my view it is just pure luck that this can't happen. Also, we wouldn't actually have any bad consequences from this in do_vmi_align_munmap(), because by the time the bug drops the lock on VMA A, we've already marked VMA A as detached, which makes it completely ineligible for any VMA-locked page faults. But again, that's just pure luck. So remove the vma_end_write_all(), so that VMA write locks are only ever released on mmap_write_unlock() or mmap_write_downgrade(). Also add comments to document the locking rules established by this patch. Link: https://lkml.kernel.org/r/20230720193436.454247-1-jannh@google.com Fixes: eeff9a5d47f8 ("mm/mmap: prevent pagefault handler from racing with mmu_notifier registration") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: fix obsolete function name above debug_pagealloc_enabled_static()Miaohe Lin2023-08-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 04013513cc84 ("mm, page_alloc: do not rely on the order of page_poison and init_on_alloc/free parameters"), init_debug_pagealloc() is converted to init_mem_debugging_and_hardening(). Later it's renamed to mem_debugging_and_hardening_init() via commit f2fc4b44ec2b ("mm: move init_mem_debugging_and_hardening() to mm/mm_init.c"). Link: https://lkml.kernel.org/r/20230720112806.3851893-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: move is_ioremap_addr() into new header fileBaoquan He2023-08-181-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now is_ioremap_addr() is only used in kernel/iomem.c and gonna be used in mm/ioremap.c. Move it into its own new header file linux/ioremap.h. Link: https://lkml.kernel.org/r/20230706154520.11257-17-bhe@redhat.com Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Rich Felker <dalias@libc.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: delete mmap_write_trylock() and vma_try_start_write()Hugh Dickins2023-08-181-17/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mmap_write_trylock() and vma_try_start_write() were added just for khugepaged, but now it has no use for them: delete. Link: https://lkml.kernel.org/r/4e6db3d-e8e-73fb-1f2a-8de2dab2a87c@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: make show_free_areas() staticKefeng Wang2023-08-181-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All callers of show_free_areas() pass 0 and NULL, so we can directly use show_mem() instead of show_free_areas(0, NULL), which could make show_free_areas() a static function. Link: https://lkml.kernel.org/r/20230630062253.189440-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: remove arguments of show_mem()Kefeng Wang2023-08-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All callers of show_mem() pass 0 and NULL, so we can remove the two arguments by directly calling __show_mem(0, NULL, MAX_NR_ZONES - 1) in show_mem(). Link: https://lkml.kernel.org/r/20230630062253.189440-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | mm: remove page_rmapping()ZhangPeng2023-08-181-1/+0
|/ / | | | | | | | | | | | | | | | | | | | | | | | | After converting the last user to folio_raw_mapping(), we can safely remove the function. Link: https://lkml.kernel.org/r/20230701032853.258697-3-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* / mm: fix memory ordering for mm_lock_seq and vm_lock_seqJann Horn2023-07-271-6/+23
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mm->mm_lock_seq effectively functions as a read/write lock; therefore it must be used with acquire/release semantics. A specific example is the interaction between userfaultfd_register() and lock_vma_under_rcu(). userfaultfd_register() does the following from the point where it changes a VMA's flags to the point where concurrent readers are permitted again (in a simple scenario where only a single private VMA is accessed and no merging/splitting is involved): userfaultfd_register userfaultfd_set_vm_flags vm_flags_reset vma_start_write down_write(&vma->vm_lock->lock) vma->vm_lock_seq = mm_lock_seq [marks VMA as busy] up_write(&vma->vm_lock->lock) vm_flags_init [sets VM_UFFD_* in __vm_flags] vma->vm_userfaultfd_ctx.ctx = ctx mmap_write_unlock vma_end_write_all WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1) [unlocks VMA] There are no memory barriers in between the __vm_flags update and the mm->mm_lock_seq update that unlocks the VMA, so the unlock can be reordered to above the `vm_flags_init()` call, which means from the perspective of a concurrent reader, a VMA can be marked as a userfaultfd VMA while it is not VMA-locked. That's bad, we definitely need a store-release for the unlock operation. The non-atomic write to vma->vm_lock_seq in vma_start_write() is mostly fine because all accesses to vma->vm_lock_seq that matter are always protected by the VMA lock. There is a racy read in vma_start_read() though that can tolerate false-positives, so we should be using WRITE_ONCE() to keep things tidy and data-race-free (including for KCSAN). On the other side, lock_vma_under_rcu() works as follows in the relevant region for locking and userfaultfd check: lock_vma_under_rcu vma_start_read vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [early bailout] down_read_trylock(&vma->vm_lock->lock) vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [main check] userfaultfd_armed checks vma->vm_flags & __VM_UFFD_FLAGS Here, the interesting aspect is how far down the mm->mm_lock_seq read can be reordered - if this read is reordered down below the vma->vm_flags access, this could cause lock_vma_under_rcu() to partly operate on information that was read while the VMA was supposed to be locked. To prevent this kind of downwards bleeding of the mm->mm_lock_seq read, we need to read it with a load-acquire. Some of the comment wording is based on suggestions by Suren. BACKPORT WARNING: One of the functions changed by this patch (which I've written against Linus' tree) is vma_try_start_write(), but this function no longer exists in mm/mm-everything. I don't know whether the merged version of this patch will be ordered before or after the patch that removes vma_try_start_write(). If you're backporting this patch to a tree with vma_try_start_write(), make sure this patch changes that function. Link: https://lkml.kernel.org/r/20230721225107.942336-1-jannh@google.com Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* execve: always mark stack as growing down during early stack setupLinus Torvalds2023-07-031-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While our user stacks can grow either down (all common architectures) or up (parisc and the ia64 register stack), the initial stack setup when we copy the argument and environment strings to the new stack at execve() time is always done by extending the stack downwards. But it turns out that in commit 8d7071af8907 ("mm: always expand the stack with the mmap write lock held"), as part of making the stack growing code more robust, 'expand_downwards()' was now made to actually check the vma flags: if (!(vma->vm_flags & VM_GROWSDOWN)) return -EFAULT; and that meant that this execve-time stack expansion started failing on parisc, because on that architecture, the stack flags do not contain the VM_GROWSDOWN bit. At the same time the new check in expand_downwards() is clearly correct, and simplified the callers, so let's not remove it. The solution is instead to just codify the fact that yes, during execve(), the stack grows down. This not only matches reality, it ends up being particularly simple: we already have special execve-time flags for the stack (VM_STACK_INCOMPLETE_SETUP) and use those flags to avoid page migration during this setup time (see vma_is_temporary_stack() and invalid_migration_vma()). So just add VM_GROWSDOWN to that set of temporary flags, and now our stack flags automatically match reality, and the parisc stack expansion works again. Note that the VM_STACK_INCOMPLETE_SETUP bits will be cleared when the stack is finalized, so we only add the extra VM_GROWSDOWN bit on CONFIG_STACK_GROWSUP architectures (ie parisc) rather than adding it in general. Link: https://lore.kernel.org/all/612eaa53-6904-6e16-67fc-394f4faa0e16@bell.net/ Link: https://lore.kernel.org/all/5fd98a09-4792-1433-752d-029ae3545168@gmx.de/ Fixes: 8d7071af8907 ("mm: always expand the stack with the mmap write lock held") Reported-by: John David Anglin <dave.anglin@bell.net> Reported-and-tested-by: Helge Deller <deller@gmx.de> Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'libnvdimm-for-6.5' of ↵Linus Torvalds2023-07-011-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull nvdimm and DAX updates from Vishal Verma: "This is mostly small cleanups and fixes, with the biggest change being the change to the DAX fault handler allowing it to return VM_FAULT_HWPOISON. Summary: - DAX fixes and cleanups including a use after free, extra references, and device unregistration, and a redundant variable. - Allow the DAX fault handler to return VM_FAULT_HWPOISON - A few libnvdimm cleanups such as making some functions and variables static where sufficient. - Add a few missing prototypes for wrapped functions in tools/testing/nvdimm" * tag 'libnvdimm-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: enable dax fault handler to report VM_FAULT_HWPOISON nvdimm: make security_show static nvdimm: make nd_class variable static dax/kmem: Pass valid argument to memory_group_register_static fsdax: remove redundant variable 'error' dax: Cleanup extra dax_region references dax: Introduce alloc_dev_dax_id() dax: Use device_unregister() in unregister_dax_mapping() dax: Fix dax_mapping_release() use after free tools/testing/nvdimm: Drop empty platform remove function libnvdimm: mark 'security_show' static again testing: nvdimm: add missing prototypes for wrapped functions dax: fix missing-prototype warnings
| * dax: enable dax fault handler to report VM_FAULT_HWPOISONJane Chu2023-06-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When multiple processes mmap() a dax file, then at some point, a process issues a 'load' and consumes a hwpoison, the process receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb set for the poison scope. Soon after, any other process issues a 'load' to the poisoned page (that is unmapped from the kernel side by memory_failure), it receives a SIGBUS with si_code = BUS_ADRERR and without valid si_lsb. This is confusing to user, and is different from page fault due to poison in RAM memory, also some helpful information is lost. Channel dax backend driver's poison detection to the filesystem such that instead of reporting VM_FAULT_SIGBUS, it could report VM_FAULT_HWPOISON. If user level block IO syscalls fail due to poison, the errno will be converted to EIO to maintain block API consistency. Signed-off-by: Jane Chu <jane.chu@oracle.com> Link: https://lore.kernel.org/r/20230615181325.1327259-2-jane.chu@oracle.com Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
* | mm: Update do_vmi_align_munmap() return semanticsLiam R. Howlett2023-07-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since do_vmi_align_munmap() will always honor the downgrade request on the success, the callers no longer have to deal with confusing return codes. Since all callers that request downgrade actually want the lock to be dropped, change the downgrade to an unlock request. Note that the lock still needs to be held in read mode during the page table clean up to avoid races with a map request. Update do_vmi_align_munmap() to return 0 for success. Clean up the callers and comments to always expect the unlock to be honored on the success path. The error path will always leave the lock untouched. As part of the cleanup, the wrapper function do_vmi_munmap() and callers to the wrapper are also updated. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/linux-mm/20230629191414.1215929-1-willy@infradead.org/ Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>