From c1cb20d43728aa9b5393bd8d489bc85c142949b2 Mon Sep 17 00:00:00 2001
From: Yu Zhao
Date: Fri, 30 Nov 2018 14:09:03 -0800
Subject: mm: use swp_offset as key in shmem_replace_page()

We changed the key of swap cache tree from swp_entry_t.val to
swp_offset.  We need to do so in shmem_replace_page() as well.

Hugh said:
 "shmem_replace_page() has been wrong since the day I wrote it: good
  enough to work on swap "type" 0, which is all most people ever use
  (especially those few who need shmem_replace_page() at all), but
  broken once there are any non-0 swp_type bits set in the higher order
  bits"

Link: http://lkml.kernel.org/r/20181121215442.138545-1-yuzhao@google.com
Fixes: f6ab1f7f6b2d ("mm, swap: use offset of swap entry as key of swap cache")
Signed-off-by: Yu Zhao
Reviewed-by: Matthew Wilcox
Acked-by: Hugh Dickins
Cc: [4.9+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/shmem.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'mm/shmem.c')

diff --git a/mm/shmem.c b/mm/shmem.c
index d44991ea5ed4..42b70978e814 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1509,11 +1509,13 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 {
 	struct page *oldpage, *newpage;
 	struct address_space *swap_mapping;
+	swp_entry_t entry;
 	pgoff_t swap_index;
 	int error;
 
 	oldpage = *pagep;
-	swap_index = page_private(oldpage);
+	entry.val = page_private(oldpage);
+	swap_index = swp_offset(entry);
 	swap_mapping = page_mapping(oldpage);
 
 	/*
@@ -1532,7 +1534,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 	__SetPageLocked(newpage);
 	__SetPageSwapBacked(newpage);
 	SetPageUptodate(newpage);
-	set_page_private(newpage, swap_index);
+	set_page_private(newpage, entry.val);
 	SetPageSwapCache(newpage);
 
 	/*
--
cgit v1.2.3

From 9e368259ad988356c4c95150fafd1a06af095d98 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Fri, 30 Nov 2018 14:09:25 -0800
Subject: userfaultfd: use ENOENT instead of EFAULT if the atomic copy user
 fails

Patch series "userfaultfd shmem updates".

Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
lack of the VM_MAYWRITE check and the lack of i_size checks.

Then looking into the above we also fixed the MAP_PRIVATE case.

Hugh by source review also found a data loss source if UFFDIO_COPY is
used on shmem MAP_SHARED PROT_READ mappings (the production usages
incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
happen in those production usages like with QEMU).

The whole patchset is marked for stable.

We verified QEMU postcopy live migration with guest running on shmem
MAP_PRIVATE run as well as before after the fix of shmem MAP_PRIVATE.
Regardless if it's shmem or hugetlbfs or MAP_PRIVATE or MAP_SHARED, QEMU
unconditionally invokes a punch hole if the guest mapping is filebacked
and a MADV_DONTNEED too (needed to get rid of the MAP_PRIVATE COWs and
for the anon backend).

This patch (of 5):

We internally used EFAULT to communicate with the caller, switch to
ENOENT, so EFAULT can be used as a non internal retval.

Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Andrea Arcangeli
Reviewed-by: Mike Rapoport
Reviewed-by: Hugh Dickins
Cc: Mike Kravetz
Cc: Jann Horn
Cc: Peter Xu
Cc: "Dr. David Alan Gilbert"
Cc:
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm/shmem.c')

diff --git a/mm/shmem.c b/mm/shmem.c
index 42b70978e814..6c54a6874e41 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2238,7 +2238,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 			*pagep = page;
 			shmem_inode_unacct_blocks(inode, 1);
 			/* don't free the page */
-			return -EFAULT;
+			return -ENOENT;
 		}
 	} else {		/* mfill_zeropage_atomic */
 		clear_highpage(page);
--
cgit v1.2.3

From e2a50c1f64145a04959df2442305d57307e5395a Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Fri, 30 Nov 2018 14:09:37 -0800
Subject: userfaultfd: shmem: add i_size checks

With MAP_SHARED: recheck the i_size after taking the PT lock, to
serialize against truncate with the PT lock.  Delete the page from the
pagecache if the i_size_read check fails.

With MAP_PRIVATE: check the i_size after the PT lock before mapping
anonymous memory or zeropages into the MAP_PRIVATE shmem mapping.

A mostly irrelevant cleanup: like we do the delete_from_page_cache()
pagecache removal after dropping the PT lock, the PT lock is a spinlock
so drop it before the sleepable page lock.

Link: http://lkml.kernel.org/r/20181126173452.26955-5-aarcange@redhat.com
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Andrea Arcangeli
Reviewed-by: Mike Rapoport
Reviewed-by: Hugh Dickins
Reported-by: Jann Horn
Cc:
Cc: "Dr. David Alan Gilbert"
Cc: Mike Kravetz
Cc: Peter Xu
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/shmem.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

(limited to 'mm/shmem.c')

diff --git a/mm/shmem.c b/mm/shmem.c
index 6c54a6874e41..99d5867daadb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2216,6 +2216,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	struct page *page;
 	pte_t _dst_pte, *dst_pte;
 	int ret;
+	pgoff_t offset, max_off;
 
 	ret = -ENOMEM;
 	if (!shmem_inode_acct_block(inode, 1))
@@ -2253,6 +2254,12 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	__SetPageSwapBacked(page);
 	__SetPageUptodate(page);
 
+	ret = -EFAULT;
+	offset = linear_page_index(dst_vma, dst_addr);
+	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	if (unlikely(offset >= max_off))
+		goto out_release;
+
 	ret = mem_cgroup_try_charge_delay(page, dst_mm, gfp, &memcg, false);
 	if (ret)
 		goto out_release;
@@ -2268,8 +2275,14 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	if (dst_vma->vm_flags & VM_WRITE)
 		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
 
-	ret = -EEXIST;
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+	ret = -EFAULT;
+	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	if (unlikely(offset >= max_off))
+		goto out_release_uncharge_unlock;
+
+	ret = -EEXIST;
 	if (!pte_none(*dst_pte))
 		goto out_release_uncharge_unlock;
 
@@ -2287,13 +2300,14 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
-	unlock_page(page);
 	pte_unmap_unlock(dst_pte, ptl);
+	unlock_page(page);
 	ret = 0;
 out:
 	return ret;
 out_release_uncharge_unlock:
 	pte_unmap_unlock(dst_pte, ptl);
+	delete_from_page_cache(page);
 out_release_uncharge:
 	mem_cgroup_cancel_charge(page, memcg, false);
 out_release:
--
cgit v1.2.3

From dcf7fe9d89763a28e0f43975b422ff141fe79e43 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Fri, 30 Nov 2018 14:09:43 -0800
Subject: userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is
 not set

Set the page dirty if VM_WRITE is not set because in such case the pte
won't be marked dirty and the page would be reclaimed without writepage
(i.e. swapout in the shmem case).

This was found by source review.  Most apps (certainly including QEMU)
only use UFFDIO_COPY on PROT_READ|PROT_WRITE mappings or the app can't
modify the memory in the first place.  This is for correctness and it
could help the non cooperative use case to avoid unexpected data loss.

Link: http://lkml.kernel.org/r/20181126173452.26955-6-aarcange@redhat.com
Reviewed-by: Hugh Dickins
Cc: stable@vger.kernel.org
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Reported-by: Hugh Dickins
Signed-off-by: Andrea Arcangeli
Cc: "Dr. David Alan Gilbert"
Cc: Jann Horn
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Peter Xu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/shmem.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

(limited to 'mm/shmem.c')

diff --git a/mm/shmem.c b/mm/shmem.c
index 99d5867daadb..16a3d7044c52 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2274,6 +2274,16 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
 	if (dst_vma->vm_flags & VM_WRITE)
 		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+	else {
+		/*
+		 * We don't set the pte dirty if the vma has no
+		 * VM_WRITE permission, so mark the page dirty or it
+		 * could be freed from under us. We could do it
+		 * unconditionally before unlock_page(), but doing it
+		 * only if VM_WRITE is not set is faster.
+		 */
+		set_page_dirty(page);
+	}
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
 
@@ -2307,6 +2317,7 @@ out:
 	return ret;
 out_release_uncharge_unlock:
 	pte_unmap_unlock(dst_pte, ptl);
+	ClearPageDirty(page);
 	delete_from_page_cache(page);
 out_release_uncharge:
 	mem_cgroup_cancel_charge(page, memcg, false);
--
cgit v1.2.3

From aaa52e340073b7f4593b3c4ddafcafa70cf838b5 Mon Sep 17 00:00:00 2001
From: Hugh Dickins
Date: Fri, 30 Nov 2018 14:10:29 -0800
Subject: mm/khugepaged: fix crashes due to misaccounted holes

Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
closed and unlinked.

khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
on the rollback path, after it had added but then needs to undo some
holes.

There is indeed an irritating asymmetry between shmem_charge(), whose
callers want it to increment nrpages after successfully accounting
blocks, and shmem_uncharge(), when __delete_from_page_cache() already
decremented nrpages itself: oh well, just add a comment on that to them
both.

And shmem_recalc_inode() is supposed to be called when the accounting is
expected to be in balance (so it can deduce from imbalance that reclaim
discarded some pages): so change shmem_charge() to update nrpages
earlier (though it's rare for the difference to matter at all).

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins
Acked-by: Kirill A. Shutemov
Cc: Jerome Glisse
Cc: Konstantin Khlebnikov
Cc: Matthew Wilcox
Cc: [4.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/shmem.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

(limited to 'mm/shmem.c')

diff --git a/mm/shmem.c b/mm/shmem.c
index 16a3d7044c52..cddc72ac44d8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -297,12 +297,14 @@ bool shmem_charge(struct inode *inode, long pages)
 	if (!shmem_inode_acct_block(inode, pages))
 		return false;
 
+	/* nrpages adjustment first, then shmem_recalc_inode() when balanced */
+	inode->i_mapping->nrpages += pages;
+
 	spin_lock_irqsave(&info->lock, flags);
 	info->alloced += pages;
 	inode->i_blocks += pages * BLOCKS_PER_PAGE;
 	shmem_recalc_inode(inode);
 	spin_unlock_irqrestore(&info->lock, flags);
-	inode->i_mapping->nrpages += pages;
 
 	return true;
 }
@@ -312,6 +314,8 @@ void shmem_uncharge(struct inode *inode, long pages)
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	unsigned long flags;
 
+	/* nrpages adjustment done by __delete_from_page_cache() or caller */
+
 	spin_lock_irqsave(&info->lock, flags);
 	info->alloced -= pages;
 	inode->i_blocks -= pages * BLOCKS_PER_PAGE;
--
cgit v1.2.3

From 356ff8a9a78fb35d6482584d260c3754dcbdf669 Mon Sep 17 00:00:00 2001
From: David Rientjes
Date: Fri, 7 Dec 2018 14:50:16 -0800
Subject: Revert "mm, thp: consolidate THP gfp handling into
 alloc_hugepage_direct_gfpmask"

This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.

This should have been done as part of 2f0799a0ffc0 ("mm, thp: restore
node-local hugepage allocations").  The movement of the thp allocation
policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
intended to only set __GFP_THISNODE for mempolicies that are not
MPOL_BIND whereas the revert could set this regardless of mempolicy.

While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
and alloc_pages_vma() was racy, that has since been removed since the
revert.  What is left is the possibility to use __GFP_THISNODE in
policy_node() when it is unexpected because the special handling for
hugepages in alloc_pages_vma() was removed as part of the consolidation.

Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat
different policy for hugepage allocations, which were allocated through
alloc_hugepage_vma().  For hugepage allocations, if the allocating
process's node is in the set of allowed nodes, allocate with
__GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
__GFP_THISNODE instead).  This was changed for shmem_alloc_hugepage() to
allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in
mm/mempolicy.c which is functionally different behavior and removes the
requirement to only allocate hugepages locally.

So this commit does a full revert of 89c83fb539f9 instead of the partial
revert that was done in 2f0799a0ffc0.  The result is the same thp
allocation policy for 4.20 that was in 4.19.

Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: David Rientjes
Acked-by: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: Michal Hocko
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'mm/shmem.c')

diff --git a/mm/shmem.c b/mm/shmem.c
index cddc72ac44d8..921f80488bb3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1439,7 +1439,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
 	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
+			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);
--
cgit v1.2.3
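
Note on the first patch in this series (shmem_replace_page()): the
following is a minimal userspace sketch, not kernel code, of why the raw
swp_entry_t.val and swp_offset() give the same key only for swap type 0.
The toy_* names, the 5-bit type field and its position in the top bits
are assumptions made for illustration; the kernel's real encoding is
defined by its own swapops helpers and is not reproduced here.

```c
#include <assert.h>

/*
 * Toy layout, NOT the kernel's: the top TOY_TYPE_BITS bits of the value
 * hold the swap "type", the remaining low bits hold the offset.
 */
#define TOY_TYPE_BITS	5
#define TOY_TYPE_SHIFT	(sizeof(unsigned long) * 8 - TOY_TYPE_BITS)
#define TOY_OFFSET_MASK	((1UL << TOY_TYPE_SHIFT) - 1)

typedef struct {
	unsigned long val;
} toy_swp_entry_t;

/* Pack a (type, offset) pair into one value. */
static toy_swp_entry_t toy_swp_entry(unsigned long type, unsigned long offset)
{
	toy_swp_entry_t e = {
		(type << TOY_TYPE_SHIFT) | (offset & TOY_OFFSET_MASK)
	};
	return e;
}

/* Extract the type bits from the top of the value. */
static unsigned long toy_swp_type(toy_swp_entry_t e)
{
	return e.val >> TOY_TYPE_SHIFT;
}

/* Extract the offset, which is what the lookup key should be. */
static unsigned long toy_swp_offset(toy_swp_entry_t e)
{
	return e.val & TOY_OFFSET_MASK;
}

/*
 * Returns 1 when using the raw value as the key would happen to match
 * the offset-based key, 0 when the two keys differ.
 */
static int toy_raw_key_matches(unsigned long type, unsigned long offset)
{
	toy_swp_entry_t e = toy_swp_entry(type, offset);

	assert(toy_swp_type(e) == type);
	return e.val == toy_swp_offset(e);
}
```

Under these assumptions toy_raw_key_matches(0, 42) holds while
toy_raw_key_matches(1, 42) does not, which mirrors the "good enough to
work on swap type 0" remark quoted in that commit message.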