path: root/mm
Commit message (Author, Age, Files, Lines)
* mm: bdi: initialize bdi_min_ratio when bdi is unregistered (Manjong Lee, 2021-12-14, 1 file, -0/+7)
  commit 3c376dfafbf7a8ea0dea212d095ddd83e93280bb upstream. Initialize min_ratio if it is set during bdi unregistration. This can prevent problems that may occur when a bdi is removed without resetting min_ratio. For example: 1) insert external sdcard 2) set external sdcard's min_ratio 70 3) remove external sdcard without setting min_ratio 0 4) insert external sdcard 5) set external sdcard's min_ratio 70 << error occurs (can't set) Because when an sdcard is removed, the present bdi_min_ratio value will remain. Currently, the only way to reset bdi_min_ratio is to reboot. [akpm@linux-foundation.org: tweak comment and coding style] Link: https://lkml.kernel.org/r/20211021161942.5983-1-mj0123.lee@samsung.com Signed-off-by: Manjong Lee <mj0123.lee@samsung.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Changheun Lee <nanich.lee@samsung.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: <seunghwan.hyun@samsung.com> Cc: <sookwan7.kim@samsung.com> Cc: <yt0928.kim@samsung.com> Cc: <junho89.kim@samsung.com> Cc: <jisoo2146.oh@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
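  The change is small enough to sketch: the elevated share has to be handed back when the bdi goes away. A minimal sketch, assuming the reset is done from bdi_unregister() via the existing bdi_set_min_ratio() helper; bdi_reset_min_ratio() below is an illustrative wrapper name, not the patch itself.

      /* Sketch: give back an elevated min_ratio before the bdi disappears. */
      static void bdi_reset_min_ratio(struct backing_dev_info *bdi)
      {
              /* without this, the removed device's share leaks until reboot */
              if (bdi->min_ratio)
                      bdi_set_min_ratio(bdi, 0);
      }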
* hugetlb: take PMD sharing into account when flushing tlb/caches (Mike Kravetz, 2021-12-08, 1 file, -10/+43)
  commit dff11abe280b47c21b804a8ace318e0638bb9a49 upstream. When fixing an issue with PMD sharing and migration, it was discovered via code inspection that other callers of huge_pmd_unshare potentially have an issue with cache and tlb flushing. Use the routine adjust_range_if_pmd_sharing_possible() to calculate worst case ranges for mmu notifiers. Ensure that this range is flushed if huge_pmd_unshare succeeds and unmaps a PUD_SIZE area. Link: http://lkml.kernel.org/r/20180823205917.16297-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hugetlbfs: flush TLBs correctly after huge_pmd_unshare (Nadav Amit, 2021-12-08, 2 files, -0/+29)
  commit a4a118f2eead1d6c49e00765de89878288d4b890 upstream. When calls to huge_pmd_unshare() from __unmap_hugepage_range() succeed, a TLB flush is missing. This TLB flush must be performed before releasing the i_mmap_rwsem, in order to prevent an unshared PMDs page from being released and reused before the TLB flush took place. Arguably, a comprehensive solution would use the mmu_gather interface to batch the TLB flushes and the PMDs page release; however, it is not an easy solution: (1) try_to_unmap_one() and try_to_migrate_one() also call huge_pmd_unshare() and they cannot use the mmu_gather interface; and (2) deferring the release of the page reference for the PMDs page until after i_mmap_rwsem is dropped can confuse huge_pmd_unshare() into thinking PMDs are shared when they are not. Fix __unmap_hugepage_range() by adding the missing TLB flush, and forcing a flush when unshare is successful. Fixes: 24669e58477e ("hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages") # 3.6 Signed-off-by: Nadav Amit <namit@vmware.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
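  Both hugetlb changelogs above describe the same pattern: a successful huge_pmd_unshare() effectively unmaps a whole PUD_SIZE-aligned region (the shared page-table page), so that entire range must be flushed before i_mmap_rwsem is dropped. A rough sketch of the idea, using the older three-argument huge_pmd_unshare() signature; force_flush is an illustrative local, not necessarily the variable the backport uses.

      if (huge_pmd_unshare(mm, &address, ptep)) {
              unsigned long start = address & PUD_MASK;
              unsigned long end = start + PUD_SIZE;

              /*
               * The unshared page table can be freed and reused as soon as
               * i_mmap_rwsem is released: flush the whole PUD_SIZE range
               * now, not just this one huge page.
               */
              flush_tlb_range(vma, start, end);
              force_flush = true;
      }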
* mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag (Rustam Kovhaev, 2021-11-26, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 34dbc3aaf5d9e89ba6cc5e24add9458c21ab1950 upstream. When kmemleak is enabled for SLOB, system does not boot and does not print anything to the console. At the very early stage in the boot process we hit infinite recursion from kmemleak_init() and eventually kernel crashes. kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not valid for SLOB. Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB Link: https://lkml.kernel.org/r/20211115020850.3154366-1-rkovhaev@gmail.com Fixes: d8843922fba4 ("slab: Ignore internal flags in cache creation") Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Glauber Costa <glommer@parallels.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm, oom: do not trigger out_of_memory from the #PF (Michal Hocko, 2021-11-26, 1 file, -14/+8)
  commit 60e2793d440a3ec95abb5d6d4fc034a4b480472d upstream. Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for two different reasons: a) memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling, or b) a normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has already been invoked there is no reason to invoke it again from the #PF path, especially when the OOM condition might be gone by that time and we have no way to find out other than by allocating. Moreover, if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restrictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation, print a warning that this is happening before we restart the #PF. [VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting was introduced for cgroup v1 (3.8). There was no kmem-specific reclaim for the separate limit so the only way to handle the kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks (Vasily Averin, 2021-11-26, 1 file, -0/+3)
  commit 0b28179a6138a5edd9d82ad2687c05b3773c387b upstream. Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3. Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It can be misused and allowed to trigger global OOM from inside a memcg-limited container. On the other hand, if memcg fails an allocation called from inside the #PF handler, it triggers global OOM from inside pagefault_out_of_memory(). To prevent these problems this patchset: (a) removes execution of out_of_memory() from pagefault_out_of_memory(), because nobody can explain why it is necessary, and (b) allows memcg to fail allocation of dying/killed tasks. This patch (of 3): Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory which in turn executes out_of_memory() and can kill a random task. An allocation might fail when the current task is the oom victim and there are no memory reserves left. The OOM killer is already handled at the page allocator level for the global OOM and at the charging level for the memcg one. Both have much more information about the scope of the allocation/charge request. This means that either the OOM killer has been invoked properly and didn't lead to the allocation success or it has been skipped because it couldn't have been invoked. In both cases triggering it from here is pointless and even harmful. It makes much more sense to let the killed task die rather than to wake up an eternally hungry oom-killer and send him to choose a fatter victim for breakfast. Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
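  Taken together, the two OOM patches above leave pagefault_out_of_memory() doing very little. Roughly, as a sketch of the upstream end state; the stable backport may differ in details such as the rate-limit state name:

      void pagefault_out_of_memory(void)
      {
              static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                            DEFAULT_RATELIMIT_BURST);

              /* memcg OOM is synchronized here instead of waking the global OOM killer */
              if (mem_cgroup_oom_synchronize(true))
                      return;

              /* a dying task should simply die, not pick yet another victim */
              if (fatal_signal_pending(current))
                      return;

              /* no global out_of_memory() any more: warn and let the #PF be retried */
              if (__ratelimit(&pfoom_rs))
                      pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
      }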
* mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration() (Miaohe Lin, 2021-11-26, 1 file, -3/+4)
  [ Upstream commit afe8605ca45424629fdddfd85984b442c763dc47 ] There is one possible race window between zs_pool_dec_isolated() and zs_unregister_migration() because wait_for_isolated_drain() checks the isolated count without holding class->lock and there is no ordering inside zs_pool_dec_isolated(). Thus the below race window could be possible:

      zs_pool_dec_isolated                zs_unregister_migration
        check pool->destroying != 0
                                            pool->destroying = true;
                                            smp_mb();
                                            wait_for_isolated_drain()
                                              wait for pool->isolated_pages == 0
        atomic_long_dec(&pool->isolated_pages);
        atomic_long_read(&pool->isolated_pages) == 0

  Since we observe pool->destroying (false) before atomic_long_dec() for pool->isolated_pages, waking pool->migration_wait up is missed. Fix this by ensuring that the check of pool->destroying happens after the atomic_long_dec(&pool->isolated_pages). Link: https://lkml.kernel.org/r/20210708115027.7557-1-linmiaohe@huawei.com Fixes: 701d678599d0 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Henry Burns <henryburns@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
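  A sketch of the ordering fix the changelog describes: the decrement of pool->isolated_pages has to be visible before pool->destroying is tested, so a barrier after the atomic operation (smp_mb__after_atomic() here) pairs with the smp_mb() in zs_unregister_migration(). Treat the exact barrier choice and placement as an assumption, not a quote of the backported diff.

      static void zs_pool_dec_isolated(struct zs_pool *pool)
      {
              VM_BUG_ON(atomic_long_read(&pool->isolated_pages) <= 0);
              atomic_long_dec(&pool->isolated_pages);

              /*
               * pool->destroying must be checked after the decrement above,
               * otherwise the waiter in zs_unregister_migration() can miss
               * its wake-up (the race shown in the changelog).
               */
              smp_mb__after_atomic();

              if (atomic_long_read(&pool->isolated_pages) == 0 && pool->destroying)
                      wake_up_all(&pool->migration_wait);
      }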
* mm/zsmalloc: Prepare to variable MAX_PHYSMEM_BITS (Kirill A. Shutemov, 2021-11-12, 1 file, -6/+7)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 02390b87a9459937cdb299e6b34ff33992512ec7 upstream With boot-time switching between paging mode we will have variable MAX_PHYSMEM_BITS. Let's use the maximum variable possible for CONFIG_X86_5LEVEL=y configuration to define zsmalloc data structures. The patch introduces MAX_POSSIBLE_PHYSMEM_BITS to cover such case. It also suits well to handle PAE special case. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Nitin Gupta <ngupta@vflare.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20180214111656.88514-3-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range() (David Hildenbrand, 2021-09-22, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7cf209ba8a86410939a24cb1aeb279479a7e0ca6 upstream. Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory" These are all cleanups and one fix previously sent as part of [1]: [PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory groups. These patches make sense even without the other series, therefore I pulled them out to make the other series easier to digest. [1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com This patch (of 4): Checkpatch complained on a follow-up patch that we are using "unsigned" here, which defaults to "unsigned int" and checkpatch is correct. As we will search for a fitting zone using the wrong pfn, we might end up onlining memory to one of the special kernel zones, such as ZONE_DMA, which can end badly as the onlined memory does not satisfy properties of these zones. Use "unsigned long" instead, just as we do in other places when handling PFNs. This can bite us once we have physical addresses in the range of multiple TB. Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: virtualization@lists.linux-foundation.org Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pierre Morel <pmorel@linux.ibm.com> Cc: "Rafael J. 
Wysocki" <rafael.j.wysocki@intel.com> Cc: Rich Felker <dalias@libc.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Sergei Trofimovich <slyfox@gentoo.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm/page_alloc: speed up the iteration of max_order (Muchun Song, 2021-09-22, 1 file, -4/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7ad69832f37e3cea8557db6df7c793905f1135e8 upstream. When we free a page whose order is very close to MAX_ORDER and greater than pageblock_order, it wastes some CPU cycles to increase max_order to MAX_ORDER one by one and check the pageblock migratetype of that page repeatedly especially when MAX_ORDER is much larger than pageblock_order. We also should not be checking migratetype of buddy when "order == MAX_ORDER - 1" as the buddy pfn may be invalid, so adjust the condition. With the new check, we don't need the max_order check anymore, so we replace it. Also adjust max_order initialization so that it's lower by one than previously, which makes the code hopefully more clear. Link: https://lkml.kernel.org/r/20201204155109.55451-1-songmuchun@bytedance.com Fixes: d9dddbf55667 ("mm/page_alloc: prevent merging between isolated and other pageblocks") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm/kmemleak.c: make cond_resched() rate-limiting more efficient (Andrew Morton, 2021-09-22, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | commit 13ab183d138f607d885e995d625e58d47678bf97 upstream. Commit bde5f6bc68db ("kmemleak: add scheduling point to kmemleak_scan()") tries to rate-limit the frequency of cond_resched() calls, but does it in a way which might incur an expensive division operation in the inner loop. Simplify this. Fixes: bde5f6bc68db5 ("kmemleak: add scheduling point to kmemleak_scan()") Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bdi: Do not use freezable workqueue (Mika Westerberg, 2021-07-20, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a2b90f11217790ec0964ba9c93a4abb369758c26 upstream. A removable block device, such as NVMe or SSD connected over Thunderbolt can be hot-removed any time including when the system is suspended. When device is hot-removed during suspend and the system gets resumed, kernel first resumes devices and then thaws the userspace including freezable workqueues. What happens in that case is that the NVMe driver notices that the device is unplugged and removes it from the system. This ends up calling bdi_unregister() for the gendisk which then schedules wb_workfn() to be run one more time. However, since the bdi_wq is still frozen flush_delayed_work() call in wb_shutdown() blocks forever halting system resume process. User sees this as hang as nothing is happening anymore. Triggering sysrq-w reveals this: Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme] Call Trace: ? __schedule+0x2c5/0x630 ? wait_for_completion+0xa4/0x120 schedule+0x3e/0xc0 schedule_timeout+0x1c9/0x320 ? resched_curr+0x1f/0xd0 ? wait_for_completion+0xa4/0x120 wait_for_completion+0xc3/0x120 ? wake_up_q+0x60/0x60 __flush_work+0x131/0x1e0 ? flush_workqueue_prep_pwqs+0x130/0x130 bdi_unregister+0xb9/0x130 del_gendisk+0x2d2/0x2e0 nvme_ns_remove+0xed/0x110 [nvme_core] nvme_remove_namespaces+0x96/0xd0 [nvme_core] nvme_remove+0x5b/0x160 [nvme] pci_device_remove+0x36/0x90 device_release_driver_internal+0xdf/0x1c0 nvme_remove_dead_ctrl_work+0x14/0x30 [nvme] process_one_work+0x1c2/0x3f0 worker_thread+0x48/0x3e0 kthread+0x100/0x140 ? current_work+0x30/0x30 ? kthread_park+0x80/0x80 ret_from_fork+0x35/0x40 This is not limited to NVMes so exactly same issue can be reproduced by hot-removing SSD (over Thunderbolt) while the system is suspended. Prevent this from happening by removing WQ_FREEZABLE from bdi_wq. Reported-by: AceLan Kao <acelan.kao@canonical.com> Link: https://marc.info/?l=linux-kernel&m=138695698516487 Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385 Link: https://lore.kernel.org/lkml/20191002122136.GD2819@lahna.fi.intel.com/#t Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Cc: Macpaul Lin <macpaul.lin@mediatek.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
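  The fix itself is a one-flag change where bdi_wq is created in mm/backing-dev.c; a sketch, with the rest of default_bdi_init() omitted:

      /* WQ_FREEZABLE dropped: a frozen bdi_wq stalled resume inside wb_shutdown() */
      bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_UNBOUND |
                               WQ_SYSFS, 0);
      if (!bdi_wq)
              return -ENOMEM;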
* mm/huge_memory.c: don't discard hugepage if other processes are mapping it (Miaohe Lin, 2021-07-20, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit babbbdd08af98a59089334eb3effbed5a7a0cf7f ] If other processes are mapping any other subpages of the hugepage, i.e. in pte-mapped thp case, page_mapcount() will return 1 incorrectly. Then we would discard the page while other processes are still mapping it. Fix it by using total_mapcount() which can tell whether other processes are still mapping it. Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") Reviewed-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
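  The check in the MADV_FREE THP path becomes a whole-THP question rather than a head-page one; a sketch of the changed test (surrounding context simplified):

      /*
       * page_mapcount(page) == 1 only says the head page has a single
       * mapping; a pte-mapped subpage in another process would be missed.
       * Only discard when nobody else maps any part of the THP.
       */
      if (total_mapcount(page) != 1)
              goto out;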
* mm, futex: fix shared futex pgoff on shmem huge page (Hugh Dickins, 2021-07-11, 1 file, -4/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit fe19bd3dae3d15d2fbfdb3de8839a6ea0fe94264 ] If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when generating futex_key"), the only shared huge pages came from hugetlbfs, and the code added to deal with its exceptional page->index was put into hugetlb source. Then that was missed when 4.8 added shmem huge pages. page_to_pgoff() is what others use for this nowadays: except that, as currently written, it gives the right answer on hugetlbfs head, but nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as on head. Yes, it's unconventional to declare hugetlb_basepage_index() there in pagemap.h, rather than in hugetlb.h; but I do not expect anything but page_to_pgoff() ever to need it. [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope] Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Reported-by: Neel Natu <neelnatu@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Zhang Yi <wetpzy@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Note on stable backport: leave redundant #include <linux/hugetlb.h> in kernel/futex.c, to avoid conflict over the header files included. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() (Hugh Dickins, 2021-07-11, 1 file, -0/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a7a69d8ba88d8dcee7ef00e91d413a4bd003a814 ] Aha! Shouldn't that quick scan over pte_none()s make sure that it holds ptlock in the PVMW_SYNC case? That too might have been responsible for BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though I've never seen any. Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/ Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Wang Yugui <wangyugui@e16-tech.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes (Hugh Dickins, 2021-07-11, 1 file, -9/+25)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a9a7504d9beaf395481faa91e70e2fd08f7a3dde ] Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory). Crash dumps showed two tail pages of a shmem huge page remained mapped by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end of a long unmapped range; and no page table had yet been allocated for the head of the huge page to be mapped into. Although designed to handle these odd misaligned huge-page-mapped-by-pte cases, page_vma_mapped_walk() falls short by returning false prematurely when !pmd_present or !pud_present or !p4d_present or !pgd_present: there are cases when a huge page may span the boundary, with ptes present in the next. Restructure page_vma_mapped_walk() as a loop to continue in these cases, while keeping its layout much as before. Add a step_forward() helper to advance pvmw->address across those boundaries: originally I tried to use mm's standard p?d_addr_end() macros, but hit the same crash 512 times less often: because of the way redundant levels are folded together, but folded differently in different configurations, it was just too difficult to use them correctly; and step_forward() is simpler anyway. Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm: page_vma_mapped_walk(): get vma_address_end() earlier (Hugh Dickins, 2021-07-11, 1 file, -4/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a765c417d876cc635f628365ec9aa6f09470069a ] page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the start, rather than later at next_pte. It's a little unnecessary overhead on the first call, but makes for a simpler loop in the following commit. Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm: page_vma_mapped_walk(): use goto instead of while (1) (Hugh Dickins, 2021-07-11, 1 file, -4/+3)
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 474466301dfd8b39a10c01db740645f3f7ae9a28 ] page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte, and use "goto this_pte", in place of the "while (1)" loop at the end. Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm: page_vma_mapped_walk(): add a level of indentation (Hugh Dickins, 2021-07-11, 1 file, -50/+55)
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit b3807a91aca7d21c05d5790612e49969117a72b9 ] page_vma_mapped_walk() cleanup: add a level of indentation to much of the body, making no functional change in this commit, but reducing the later diff when this is all converted to a loop. [hughd@google.com: : page_vma_mapped_walk(): add a level of indentation fix] Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm: page_vma_mapped_walk(): crossing page table boundary (Hugh Dickins, 2021-07-11, 1 file, -4/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 448282487483d6fa5b2eeeafaa0acc681e544a9c ] page_vma_mapped_walk() cleanup: adjust the test for crossing page table boundary - I believe pvmw->address is always page-aligned, but nothing else here assumed that; and remember to reset pvmw->pte to NULL after unmapping the page table, though I never saw any bug from that. Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block (Hugh Dickins, 2021-07-11, 1 file, -16/+14)
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e2e1d4076c77b3671cf8ce702535ae7dee3acf89 ] page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to follow the same "return not_found, return not_found, return true" pattern as the block above it (note: returning not_found there is never premature, since existence or prior existence of huge pmd guarantees good alignment). Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd (Hugh Dickins, 2021-07-11, 1 file, -5/+6)
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3306d3119ceacc43ea8b141a73e21fea68eec30c ] page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then use it in subsequent tests, instead of repeatedly dereferencing pointer. Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm: page_vma_mapped_walk(): settle PageHuge on entry (Hugh Dickins, 2021-07-11, 1 file, -4/+8)
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6d0fd5987657cb0c9756ce684e3a74c0f6351728 ] page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of the way at the start, so no need to worry about it later. Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm: page_vma_mapped_walk(): use page for pvmw->page (Hugh Dickins, 2021-07-11, 1 file, -5/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f003c03bd29e6f46fef1b9a8e8d636ac732286d5 ] Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes". I've marked all of these for stable: many are merely cleanups, but I think they are much better before the main fix than after. This patch (of 11): page_vma_mapped_walk() cleanup: sometimes the local copy of pvwm->page was used, sometimes pvmw->page itself: use the local copy "page" throughout. Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split (Yang Shi, 2021-07-11, 1 file, -17/+7)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 504e070dc08f757bccaed6d05c0f53ecbfac8a23 ] When debugging the bug reported by Wang Yugui [1], try_to_unmap() may fail, but the first VM_BUG_ON_PAGE() just checks page_mapcount() however it may miss the failure when head page is unmapped but other subpage is mapped. Then the second DEBUG_VM BUG() that check total mapcount would catch it. This may incur some confusion. As this is not a fatal issue, so consolidate the two DEBUG_VM checks into one VM_WARN_ON_ONCE_PAGE(). [1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/ Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Note on stable backport: fixed up variables, split_queue_lock, tree_lock in split_huge_page_to_list(), and conflict on ttu_flags in unmap_page(). Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm/thp: fix page_address_in_vma() on file THP tails (Jue Wang, 2021-07-11, 1 file, -4/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 31657170deaf1d8d2f6a1955fbc6fa9d228be036 ] Anon THP tails were already supported, but memory-failure may need to use page_address_in_vma() on file THP tails, which its page->mapping check did not permit: fix it. hughd adds: no current usage is known to hit the issue, but this does fix a subtle trap in a general helper: best fixed in stable sooner than later. Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Signed-off-by: Jue Wang <juew@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm/thp: fix vma_address() if virtual address below file offset (Hugh Dickins, 2021-07-11, 3 files, -31/+52)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 494334e43c16d63b878536a26505397fce6ff3a2 ] Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory). When that BUG() was changed to a WARN(), it would later crash on the VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap(). vma_address() is usually correct, but there's a wraparound case when the vm_start address is unusually low, but vm_pgoff not so low: vma_address() chooses max(start, vma->vm_start), but that decides on the wrong address, because start has become almost ULONG_MAX. Rewrite vma_address() to be more careful about vm_pgoff; move the VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can be safely used from page_mapped_in_vma() and page_address_in_vma() too. Add vma_address_end() to apply similar care to end address calculation, in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one(); though it raises a question of whether callers would do better to supply pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch. An irritation is that their apparent generality breaks down on KSM pages, which cannot be located by the page->index that page_to_pgoff() uses: as commit 4b0ece6fa016 ("mm: migrate: fix remove_migration_pte() for ksm pages") once discovered. I dithered over the best thing to do about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both vma_address() and vma_address_end(); though the only place in danger of using it on them was try_to_unmap_one(). Sidenote: vma_address() and vma_address_end() now use compound_nr() on a head page, instead of thp_size(): to make the right calculation on a hugetlbfs page, whether or not THPs are configured. try_to_unmap() is used on hugetlbfs pages, but perhaps the wrong calculation never mattered. Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Note on stable backport: fixed up conflicts on intervening thp_size(), and mmu_notifier_range initializations; substitute for compound_nr(). Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
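  A sketch of the rewritten vma_address() the changelog describes: compute from vm_pgoff first so that an unusually low vm_start cannot produce a wrapped max(start, vm_start), and return -EFAULT instead of tripping a VM_BUG_ON. compound_nr() stands in for whatever page-span helper the stable tree substitutes.

      static inline unsigned long
      vma_address(struct page *page, struct vm_area_struct *vma)
      {
              pgoff_t pgoff;
              unsigned long address;

              VM_BUG_ON_PAGE(PageKsm(page), page);    /* KSM page->index is not usable here */
              pgoff = page_to_pgoff(page);
              if (pgoff >= vma->vm_pgoff) {
                      address = vma->vm_start +
                              ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
                      /* page starts beyond this vma, or the sum wrapped */
                      if (address < vma->vm_start || address >= vma->vm_end)
                              address = -EFAULT;
              } else if (PageHead(page) &&
                         pgoff + compound_nr(page) - 1 >= vma->vm_pgoff) {
                      /* THP straddles vm_start: the walk begins at vm_start */
                      address = vma->vm_start;
              } else {
                      address = -EFAULT;
              }
              return address;
      }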
* mm/thp: try_to_unmap() use TTU_SYNC for safe splitting (Hugh Dickins, 2021-07-11, 3 files, -2/+28)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 732ed55823fc3ad998d43b86bf771887bcc5ec67 ] Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE (!unmap_success): with dump_page() showing mapcount:1, but then its raw struct page output showing _mapcount ffffffff i.e. mapcount 0. And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed, it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)), and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG(): all indicative of some mapcount difficulty in development here perhaps. But the !CONFIG_DEBUG_VM path handles the failures correctly and silently. I believe the problem is that once a racing unmap has cleared pte or pmd, try_to_unmap_one() may skip taking the page table lock, and emerge from try_to_unmap() before the racing task has reached decrementing mapcount. Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC to the options, and passing that from unmap_page(). When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same for both: the slight overhead added should rarely matter, except perhaps if splitting sparsely-populated multiply-mapped shmem. Once confident that bugs are fixed, TTU_SYNC here can be removed, and the race tolerated. Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Note on stable backport: upstream TTU_SYNC 0x10 takes the value which 5.11 commit 013339df116c ("mm/rmap: always do TTU_IGNORE_ACCESS") freed. It is very tempting to backport that commit (as 5.10 already did) and make no change here; but on reflection, good as that commit is, I'm reluctant to include any possible side-effect of it in this series. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
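  The mechanism is a small piece of flag plumbing: unmap_page() adds TTU_SYNC to its unmap flags, and try_to_unmap_one() translates that into PVMW_SYNC so page_vma_mapped_walk() takes the page table lock even when the entry already looks cleared. A sketch of the translation, with the surrounding code simplified:

      struct page_vma_mapped_walk pvmw = {
              .page = page,
              .vma = vma,
              .address = address,
      };

      /* serialize against a racing unmap before trusting the mapcount */
      if (flags & TTU_SYNC)
              pvmw.flags = PVMW_SYNC;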
* mm/rmap: use page_not_mapped in try_to_unmap() (Miaohe Lin, 2021-07-11, 1 file, -8/+3)
| | | | | | | | | | | | | | | | | [ Upstream commit b7e188ec98b1644ff70a6d3624ea16aadc39f5e0 ] page_mapcount_is_zero() calculates accurately how many mappings a hugepage has in order to check against 0 only. This is a waste of cpu time. We can do this via page_not_mapped() to save some possible atomic_read cycles. Remove the function page_mapcount_is_zero() as it's not used anymore and move page_not_mapped() above try_to_unmap() to avoid identifier undeclared compilation error. Link: https://lkml.kernel.org/r/20210130084904.35307-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
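  The replacement done-callback is just a cheap page_mapped() test; roughly how it slots into try_to_unmap()'s rmap_walk_control in kernels of that era:

      static int page_not_mapped(struct page *page)
      {
              return !page_mapped(page);
      }

      /* usage inside try_to_unmap(): stop walking once the page is unmapped */
      struct rmap_walk_control rwc = {
              .rmap_one = try_to_unmap_one,
              .arg = (void *)flags,
              .done = page_not_mapped,
              .anon_lock = page_lock_anon_vma_read,
      };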
* mm/rmap: remove unneeded semicolon in page_not_mapped() (Miaohe Lin, 2021-07-11, 1 file, -1/+1)
| | | | | | | | | | | | [ Upstream commit e0af87ff7afcde2660be44302836d2d5618185af ] Remove extra semicolon without any functional change intended. Link: https://lkml.kernel.org/r/20210127093425.39640-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm/slub.c: include swab.h (Andrew Morton, 2021-06-30, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | commit 1b3865d016815cbd69a1879ca1c8a8901fda1072 upstream. Fixes build with CONFIG_SLAB_FREELIST_HARDENED=y. Hopefully. But it's the right thing to do anwyay. Fixes: 1ad53d9fa3f61 ("slub: improve bit diffusion for freelist ptr obfuscation") Link: https://bugzilla.kernel.org/show_bug.cgi?id=213417 Reported-by: <vannguye@cisco.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm/memory-failure: make sure wait for page writeback in memory_failure (yangerkun, 2021-06-30, 1 file, -1/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e8675d291ac007e1c636870db880f837a9ea112a ] Our syzkaller trigger the "BUG_ON(!list_empty(&inode->i_wb_list))" in clear_inode: kernel BUG at fs/inode.c:519! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7) CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95 Hardware name: linux,dummy-virt (DT) pstate: 80000005 (Nzcv daif -PAN -UAO) pc : clear_inode+0x280/0x2a8 lr : clear_inode+0x280/0x2a8 Call trace: clear_inode+0x280/0x2a8 ext4_clear_inode+0x38/0xe8 ext4_free_inode+0x130/0xc68 ext4_evict_inode+0xb20/0xcb8 evict+0x1a8/0x3c0 iput+0x344/0x460 do_unlinkat+0x260/0x410 __arm64_sys_unlinkat+0x6c/0xc0 el0_svc_common+0xdc/0x3b0 el0_svc_handler+0xf8/0x160 el0_svc+0x10/0x218 Kernel panic - not syncing: Fatal exception A crash dump of this problem show that someone called __munlock_pagevec to clear page LRU without lock_page: do_mmap -> mmap_region -> do_munmap -> munlock_vma_pages_range -> __munlock_pagevec. As a result memory_failure will call identify_page_state without wait_on_page_writeback. And after truncate_error_page clear the mapping of this page. end_page_writeback won't call sb_clear_inode_writeback to clear inode->i_wb_list. That will trigger BUG_ON in clear_inode! Fix it by checking PageWriteback too to help determine should we skip wait_on_page_writeback. Link: https://lkml.kernel.org/r/20210604084705.3729204-1-yangerkun@huawei.com Fixes: 0bc1f8b0682c ("hwpoison: fix the handling path of the victimized page frame that belong to non-LRU") Signed-off-by: yangerkun <yangerkun@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* mm, hugetlb: fix simple resv_huge_pages underflow on UFFDIO_COPY (Mina Almasry, 2021-06-10, 1 file, -2/+12)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d84cf06e3dd8c5c5b547b5d8931015fc536678e5 ] The userfaultfd hugetlb tests cause a resv_huge_pages underflow. This happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on an index for which we already have a page in the cache. When this happens, we allocate a second page, double consuming the reservation, and then fail to insert the page into the cache and return -EEXIST. To fix this, we first check if there is a page in the cache which already consumed the reservation, and return -EEXIST immediately if so. There is still a rare condition where we fail to copy the page contents AND race with a call for hugetlb_no_page() for this index and again we will underflow resv_huge_pages. That is fixed in a more complicated patch not targeted for -stable. Test: Hacked the code locally such that resv_huge_pages underflows produce a warning, then: ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success ./tools/testing/selftests/vm/userfaultfd hugetlb 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success Both tests succeed and produce no warnings. After the test runs number of free/resv hugepages is correct. [mike.kravetz@oracle.com: changelog fixes] Link: https://lkml.kernel.org/r/20210528004649.85298-1-almasrymina@google.com Fixes: 8fb5debc5fcd ("userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
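  The core of the fix is an early exit before allocating a second page for an index that already has one in the page cache (whose reservation was therefore already consumed). A sketch only, assuming the hugetlbfs_pagecache_present() helper and the local names used in hugetlb_mcopy_atomic_pte(); the real backported diff differs in detail:

      if (!*pagep) {
              /* a page is already cached at this index: its reservation was consumed */
              if (vm_shared &&
                  hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
                      ret = -EEXIST;
                      goto out;
              }

              page = alloc_huge_page(dst_vma, dst_addr, 0);
              if (IS_ERR(page)) {
                      ret = -ENOMEM;
                      goto out;
              }
      }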
* hugetlbfs: hugetlb_fault_mutex_hash() cleanup (Mike Kravetz, 2021-06-03, 2 files, -7/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 552546366a30d88bd1d6f5efe848b2ab50fd57e5 upstream. A new clang diagnostic (-Wsizeof-array-div) warns about the calculation to determine the number of u32's in an array of unsigned longs. Suppress warning by adding parentheses. While looking at the above issue, noticed that the 'address' parameter to hugetlb_fault_mutex_hash is no longer used. So, remove it from the definition and all callers. No functional change. Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Ilie Halip <ilie.halip@gmail.com> Cc: David Bolvansky <david.bolvansky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm, vmstat: drop zone->lock in /proc/pagetypeinfo (Stephen Brennan, 2021-06-03, 1 file, -0/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | Commit 93b3a674485f6a4b8ffff85d1682d5e8b7c51560 upstream Commit 93b3a674485f ("mm,vmstat: reduce zone->lock holding time by /proc/pagetypeinfo") upstream caps the number of iterations over each free_list at 100,000, and also drops the zone->lock in between each migrate type. Capping the iteration count alters the file contents in some cases, which means this approach may not be suitable for stable backports. However, dropping zone->lock in between migrate types (and, as a result, page orders) will not change the /proc/pagetypeinfo file contents. It can significantly reduce the length of time spent with IRQs disabled, which can prevent missed interrupts or soft lockups which we have observed on systems with particularly large memory. Thus, this commit is a modified version of the upstream one which only drops the lock in between migrate types. Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo") Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
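  The three added lines drop and re-take zone->lock between migratetype rows in pagetypeinfo_showfree_print(), so IRQs are only disabled for one row's free-list walk at a time. A sketch of the loop with the addition at the end of each iteration; the exact placement in the backport may differ:

      for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
              seq_printf(m, "Node %4d, zone %8s, type %12s ",
                         pgdat->node_id, zone->name, migratetype_names[mtype]);
              for (order = 0; order < MAX_ORDER; ++order) {
                      unsigned long freecount = 0;
                      struct list_head *curr;

                      list_for_each(curr, &zone->free_area[order].free_list[mtype])
                              freecount++;
                      seq_printf(m, "%6lu ", freecount);
              }
              seq_putc(m, '\n');

              /* the added lines: don't keep IRQs off across every migratetype */
              spin_unlock_irq(&zone->lock);
              cond_resched();
              spin_lock_irq(&zone->lock);
      }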
* userfaultfd: release page in error path to avoid BUG_ONAxel Rasmussen2021-05-221-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7ed9d238c7dbb1fdb63ad96a6184985151b0171c upstream. Consider the following sequence of events: 1. Userspace issues a UFFD ioctl, which ends up calling into shmem_mfill_atomic_pte(). We successfully account the blocks, we shmem_alloc_page(), but then the copy_from_user() fails. We return -ENOENT. We don't release the page we allocated. 2. Our caller detects this error code, tries the copy_from_user() after dropping the mmap_lock, and retries, calling back into shmem_mfill_atomic_pte(). 3. Meanwhile, let's say another process filled up the tmpfs being used. 4. So shmem_mfill_atomic_pte() fails to account blocks this time, and immediately returns - without releasing the page. This triggers a BUG_ON in our caller, which asserts that the page should always be consumed, unless -ENOENT is returned. To fix this, detect if we have such a "dangling" page when accounting fails, and if so, release it before returning. Link: https://lkml.kernel.org/r/20210428230858.348400-1-axelrasmussen@google.com Fixes: cb658a453b93 ("userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY") Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reported-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
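[note: a sketch of the error path described above, close to the upstream patch; shmem_inode_acct_block() is the upstream helper name, and older stable trees spell the accounting differently.]

  /* shmem_mfill_atomic_pte(), block-accounting failure path: */
  if (!shmem_inode_acct_block(inode, 1)) {
          /*
           * We may have got a page from an earlier attempt that failed
           * copy_from_user() and returned -ENOENT for a retry outside
           * the mmap lock. Release it here, otherwise the caller's
           * BUG_ON(page) triggers.
           */
          if (unlikely(*pagep)) {
                  put_page(*pagep);
                  *pagep = NULL;
          }
          return -ENOMEM;
  }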
* ksm: fix potential missing rmap_item for stable_nodeMiaohe Lin2021-05-221-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c89a384e2551c692a9fe60d093fd7080f50afc51 ] When removing rmap_item from stable tree, STABLE_FLAG of rmap_item is cleared with head reserved. So the following scenario might happen: For ksm page with rmap_item1: cmp_and_merge_page stable_node->head = &migrate_nodes; remove_rmap_item_from_tree, but head still equal to stable_node; try_to_merge_with_ksm_page failed; return; For the same ksm page with rmap_item2, stable node migration succeed this time. The stable_node->head does not equal to migrate_nodes now. For ksm page with rmap_item1 again: cmp_and_merge_page stable_node->head != &migrate_nodes && rmap_item->head == stable_node return; We would miss the rmap_item for stable_node and might result in failed rmap_walk_ksm(). Fix this by set rmap_item->head to NULL when rmap_item is removed from stable tree. Link: https://lkml.kernel.org/r/20210330140228.45635-5-linmiaohe@huawei.com Fixes: 4146d2d673e8 ("ksm: make !merge_across_nodes migration safe") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
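[note: the fix itself is a single assignment in the stable-tree branch of remove_rmap_item_from_tree(); roughly.]

  put_anon_vma(rmap_item->anon_vma);
  rmap_item->head = NULL;          /* added: drop the stale stable_node pointer */
  rmap_item->address &= PAGE_MASK; /* clears STABLE_FLAG as before */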
* mm/hugetlb: handle the error case in hugetlb_fix_reserve_counts()Miaohe Lin2021-05-221-2/+9
| | | | | | | | | | | | | | | | | | | [ Upstream commit da56388c4397878a65b74f7fe97760f5aa7d316b ] A rare out of memory error would prevent removal of the reserve map region for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid dangling with incorrect counts. Unfortunately, hugepage_subpool_get_pages and hugetlb_acct_memory could possibly fail too. We should correctly handle these cases. Link: https://lkml.kernel.org/r/20210410072348.20437-5-linmiaohe@huawei.com Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Feilong Lin <linfeilong@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
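[note: a sketch of the hardened helper after this change, following the upstream patch; the stable tree may differ in detail.]

  void hugetlb_fix_reserve_counts(struct inode *inode)
  {
          struct hugepage_subpool *spool = subpool_inode(inode);
          long rsv_adjust;
          bool reserved = false;

          rsv_adjust = hugepage_subpool_get_pages(spool, 1);
          if (rsv_adjust > 0) {
                  struct hstate *h = hstate_inode(inode);

                  if (!hugetlb_acct_memory(h, 1))
                          reserved = true;
          } else if (!rsv_adjust) {
                  reserved = true;
          }

          if (!reserved)
                  pr_warn("hugetlb: Huge Page Reserved count may go negative.\n");
  }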
* khugepaged: fix wrong result value for trace_mm_collapse_huge_page_isolate()Miaohe Lin2021-05-221-9/+9
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 74e579bf231a337ab3786d59e64bc94f45ca7b3f ] In writable and !referenced case, the result value should be SCAN_LACK_REFERENCED_PAGE for trace_mm_collapse_huge_page_isolate() instead of default 0 (SCAN_FAIL) here. Link: https://lkml.kernel.org/r/20210306032947.35921-5-linmiaohe@huawei.com Fixes: 7d2eba0557c1 ("mm: add tracepoint for scanning pages") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
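[note: the gist of the restructured tail of __collapse_huge_page_isolate(), sketched from the upstream patch.]

  if (unlikely(!writable)) {
          result = SCAN_PAGE_RO;
  } else if (unlikely(!referenced)) {
          /* Previously this case fell through with result == SCAN_FAIL (0). */
          result = SCAN_LACK_REFERENCED_PAGE;
  } else {
          result = SCAN_SUCCEED;
          trace_mm_collapse_huge_page_isolate(page, none_or_zero,
                                              referenced, writable, result);
          return 1;
  }
  out:                    /* error paths above also land here */
          release_pte_pages(pte, _pte);
          trace_mm_collapse_huge_page_isolate(page, none_or_zero,
                                              referenced, writable, result);
          return 0;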
* gup: document and work around "COW can break either way" issueLinus Torvalds2021-04-282-10/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream. Doing a "get_user_pages()" on a copy-on-write page for reading can be ambiguous: the page can be COW'ed at any time afterwards, and the direction of a COW event isn't defined. Yes, whoever writes to it will generally do the COW, but if the thread that did the get_user_pages() unmapped the page before the write (and that could happen due to memory pressure in addition to any outright action), the writer could also just take over the old page instead. End result: the get_user_pages() call might result in a page pointer that is no longer associated with the original VM, and is associated with - and controlled by - another VM having taken it over instead. So when doing a get_user_pages() on a COW mapping, the only really safe thing to do would be to break the COW when getting the page, even when only getting it for reading. At the same time, some users simply don't even care. For example, the perf code wants to look up the page not because it cares about the page, but because the code simply wants to look up the physical address of the access for informational purposes, and doesn't really care about races when a page might be unmapped and remapped elsewhere. This adds logic to force a COW event by setting FOLL_WRITE on any copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page pointer as a result. The current semantics end up being: - __get_user_pages_fast(): no change. If you don't ask for a write, you won't break COW. You'd better know what you're doing. - get_user_pages_fast(): the fast-case "look it up in the page tables without anything getting mmap_sem" now refuses to follow a read-only page, since it might need COW breaking. Which happens in the slow path - the fast path doesn't know if the memory might be COW or not. - get_user_pages() (including the slow-path fallback for gup_fast()): for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with very similar semantics to FOLL_FORCE. If it turns out that we want finer granularity (ie "only break COW when it might actually matter" - things like the zero page are special and don't need to be broken) we might need to push these semantics deeper into the lookup fault path. So if people care enough, it's possible that we might end up adding a new internal FOLL_BREAK_COW flag to go with the internal FOLL_COW flag we already have for tracking "I had a COW". Alternatively, if it turns out that different callers might want to explicitly control the forced COW break behavior, we might even want to make such a flag visible to the users of get_user_pages() instead of using the above default semantics. But for now, this is mostly commentary on the issue (this commit message being a lot bigger than the patch, and that patch in turn is almost all comments), with that minimal "enable COW breaking early" logic using the existing FOLL_WRITE behavior. [ It might be worth noting that we've always had this ambiguity, and it could arguably be seen as a user-space issue. You only get private COW mappings that could break either way in situations where user space is doing cooperative things (ie fork() before an execve() etc), but it _is_ surprising and very subtle, and fork() is supposed to give you independent address spaces. 
So let's treat this as a kernel issue and make the semantics of get_user_pages() easier to understand. Note that obviously a true shared mapping will still get a page that can change under us, so this does _not_ mean that get_user_pages() somehow returns any "stable" page ] [surenb: backport notes Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_pgd_range. Removed FOLL_PIN usage in should_force_cow_break since it's missing in the earlier kernels.] Reported-by: Jann Horn <jannh@google.com> Tested-by: Christoph Hellwig <hch@lst.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kirill Shutemov <kirill@shutemov.name> Acked-by: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [surenb: backport to 4.14 kernel] Cc: stable@vger.kernel.org # 4.14.x Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
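[note: the core of the backported logic, sketched; helper names follow the upstream commit, and the FOLL_GET-only test reflects the backport note above about dropping FOLL_PIN.]

  /*
   * Only force a COW break on private COW mappings, and only when the
   * caller asks for a page pointer it will keep using (FOLL_GET).
   */
  static inline bool should_force_cow_break(struct vm_area_struct *vma,
                                            unsigned int flags)
  {
          return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
  }

  /* ... and in the slow-path GUP loop: */
  if (should_force_cow_break(vma, foll_flags))
          foll_flags |= FOLL_WRITE;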
* mm: writeback: use exact memcg dirty countsGreg Thelen2021-04-071-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream. Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") memcg dirty and writeback counters are managed as: 1) per-memcg per-cpu values in range of [-32..32] 2) per-memcg atomic counter When a per-cpu counter cannot fit in [-32..32] it's flushed to the atomic. Stat readers only check the atomic. Thus readers such as balance_dirty_pages() may see a nontrivial error margin: 32 pages per cpu. Assuming 100 cpus: 4k x86 page_size: 13 MiB error per memcg 64k ppc page_size: 200 MiB error per memcg Considering that dirty+writeback are used together for some decisions the errors double. This inaccuracy can lead to undeserved oom kills. One nasty case is when all per-cpu counters hold positive values offsetting an atomic negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32). balance_dirty_pages() only consults the atomic and does not consider throttling the next n_cpu*32 dirty pages. If the file_lru is in the 13..200 MiB range then there's absolutely no dirty throttling, which burdens vmscan with only dirty+writeback pages thus resorting to oom kill. It could be argued that tiny containers are not supported, but it's more subtle. It's the amount the space available for file lru that matters. If a container has memory.max-200MiB of non reclaimable memory, then it will also suffer such oom kills on a 100 cpu machine. The following test reliably ooms without this patch. This patch avoids oom kills. $ cat test mount -t cgroup2 none /dev/cgroup cd /dev/cgroup echo +io +memory > cgroup.subtree_control mkdir test cd test echo 10M > memory.max (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo) (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100) $ cat memcg-writeback-stress.c /* * Dirty pages from all but one cpu. * Clean pages from the non dirtying cpu. * This is to stress per cpu counter imbalance. * On a 100 cpu machine: * - per memcg per cpu dirty count is 32 pages for each of 99 cpus * - per memcg atomic is -99*32 pages * - thus the complete dirty limit: sum of all counters 0 * - balance_dirty_pages() only sees atomic count -99*32 pages, which * it max()s to 0. * - So a workload can dirty -99*32 pages before balance_dirty_pages() * cares. 
*/ #define _GNU_SOURCE #include <err.h> #include <fcntl.h> #include <sched.h> #include <stdlib.h> #include <stdio.h> #include <sys/stat.h> #include <sys/sysinfo.h> #include <sys/types.h> #include <unistd.h> static char *buf; static int bufSize; static void set_affinity(int cpu) { cpu_set_t affinity; CPU_ZERO(&affinity); CPU_SET(cpu, &affinity); if (sched_setaffinity(0, sizeof(affinity), &affinity)) err(1, "sched_setaffinity"); } static void dirty_on(int output_fd, int cpu) { int i, wrote; set_affinity(cpu); for (i = 0; i < 32; i++) { for (wrote = 0; wrote < bufSize; ) { int ret = write(output_fd, buf+wrote, bufSize-wrote); if (ret == -1) err(1, "write"); wrote += ret; } } } int main(int argc, char **argv) { int cpu, flush_cpu = 1, output_fd; const char *output; if (argc != 2) errx(1, "usage: output_file"); output = argv[1]; bufSize = getpagesize(); buf = malloc(getpagesize()); if (buf == NULL) errx(1, "malloc failed"); output_fd = open(output, O_CREAT|O_RDWR); if (output_fd == -1) err(1, "open(%s)", output); for (cpu = 0; cpu < get_nprocs(); cpu++) { if (cpu != flush_cpu) dirty_on(output_fd, cpu); } set_affinity(flush_cpu); if (fsync(output_fd)) err(1, "fsync(%s)", output); if (close(output_fd)) err(1, "close(%s)", output); free(buf); } Make balance_dirty_pages() and wb_over_bg_thresh() work harder to collect exact per memcg counters. This avoids the aforementioned oom kills. This does not affect the overhead of memory.stat, which still reads the single atomic counter. Why not use percpu_counter? memcg already handles cpus going offline, so no need for that overhead from percpu_counter. And the percpu_counter spinlocks are more heavyweight than is required. It probably also makes sense to use exact dirty and writeback counters in memcg oom reports. But that is saved for later. Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> [4.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
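[note: the heart of the fix is an "exact" reader that folds the per-cpu deltas back into the atomic total; roughly as upstream, with field names from the ~4.19 struct mem_cgroup, so the 4.14 backport may differ.]

  static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
  {
          long x = atomic_long_read(&memcg->stat[idx]);
  #ifdef CONFIG_SMP
          int cpu;

          for_each_online_cpu(cpu)
                  x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
          if (x < 0)
                  x = 0;
  #endif
          return x;
  }

  /*
   * mem_cgroup_wb_stats() then reports NR_FILE_DIRTY / NR_WRITEBACK via
   * this helper instead of the rounded memcg_page_state().
   */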
* mm: fix oom_kill event handlingRoman Gushchin2021-04-072-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit fe6bdfc8e1e131720abbe77a2eb990c94c9024cb upstream. Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate when waking pollers") converted most of memcg event counters to per-memcg atomics, which made them less confusing for a user. The "oom_kill" counter remained untouched, so now it behaves differently than other counters (including "oom"). This adds nothing but confusion. Let's fix this by adding the MEMCG_OOM_KILL event, and follow the MEMCG_OOM approach. This also removes a hack from count_memcg_event_mm(), introduced earlier specially for the OOM_KILL counter. [akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch] Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [fllinden@amazon.com: backport to 4.14, minor contextual changes] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
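[note: the shape of the new plumbing as in the upstream commit; the backport note above says only minor contextual changes were needed for 4.14.]

  /* memcontrol.h: resolve the memcg from the mm and record the event. */
  static inline void memcg_memory_event_mm(struct mm_struct *mm,
                                           enum memcg_memory_event event)
  {
          struct mem_cgroup *memcg;

          if (mem_cgroup_disabled())
                  return;

          rcu_read_lock();
          memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
          if (likely(memcg))
                  memcg_memory_event(memcg, event);
          rcu_read_unlock();
  }

  /* The OOM killer then reports the kill with: */
  memcg_memory_event_mm(mm, MEMCG_OOM_KILL);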
* mm: memcg: make sure memory.events is uptodate when waking pollersJohannes Weiner2021-04-072-13/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e27be240df53f1a20c659168e722b5d9f16cc7f4 upstream. Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") added per-cpu drift to all memory cgroup stats and events shown in memory.stat and memory.events. For memory.stat this is acceptable. But memory.events issues file notifications, and somebody polling the file for changes will be confused when the counters in it are unchanged after a wakeup. Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX, MEMCG_OOM - are sufficiently rare and high-level that we don't need per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most frequent, but they're counting invocations of reclaim, which is a complex operation that touches many shared cachelines. This splits memory.events from the generic VM events and tracks them in their own, unbuffered atomic counters. That's also cleaner, as it eliminates the ugly enum nesting of VM and cgroup events. [hannes@cmpxchg.org: "array subscript is above array bounds"] Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
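[note: after this change the memory.events counters are plain atomics with a poll notification; roughly as in the upstream commit.]

  /* struct mem_cgroup gains unbuffered counters of its own: */
  atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];

  static inline void memcg_memory_event(struct mem_cgroup *memcg,
                                        enum memcg_memory_event event)
  {
          atomic_long_inc(&memcg->memory_events[event]);
          cgroup_file_notify(&memcg->events_file);
  }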
* mm: fix race by making init_zero_pfn() early_initcallIlya Lipnitskiy2021-04-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e720e7d0e983bf05de80b231bccc39f1487f0f16 upstream. There are code paths that rely on zero_pfn to be fully initialized before core_initcall. For example, wq_sysfs_init() is a core_initcall function that eventually results in a call to kernel_execve, which causes a page fault with a subsequent mmput. If zero_pfn is not initialized by then it may not get cleaned up properly and result in an error: BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1 Here is an analysis of the race as seen on a MIPS device. On this particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until initialized, at which point it becomes PFN 5120: 1. wq_sysfs_init calls into kobject_uevent_env at core_initcall: kobject_uevent_env+0x7e4/0x7ec kset_register+0x68/0x88 bus_register+0xdc/0x34c subsys_virtual_register+0x34/0x78 wq_sysfs_init+0x1c/0x4c do_one_initcall+0x50/0x1a8 kernel_init_freeable+0x230/0x2c8 kernel_init+0x10/0x100 ret_from_kernel_thread+0x14/0x1c 2. kobject_uevent_env() calls call_usermodehelper_exec() which executes kernel_execve asynchronously. 3. Memory allocations in kernel_execve cause a page fault, bumping the MM reference counter: add_mm_counter_fast+0xb4/0xc0 handle_mm_fault+0x6e4/0xea0 __get_user_pages.part.78+0x190/0x37c __get_user_pages_remote+0x128/0x360 get_arg_page+0x34/0xa0 copy_string_kernel+0x194/0x2a4 kernel_execve+0x11c/0x298 call_usermodehelper_exec_async+0x114/0x194 4. In case zero_pfn has not been initialized yet, zap_pte_range does not decrement the MM_ANONPAGES RSS counter and the BUG message is triggered shortly afterwards when __mmdrop checks the ref counters: __mmdrop+0x98/0x1d0 free_bprm+0x44/0x118 kernel_execve+0x160/0x1d8 call_usermodehelper_exec_async+0x114/0x194 ret_from_kernel_thread+0x14/0x1c To avoid races such as described above, initialize init_zero_pfn at early_initcall level. Depending on the architecture, ZERO_PAGE is either constant or gets initialized even earlier, at paging_init, so there is no issue with initializing zero_pfn earlier. Link: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: stable@vger.kernel.org Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
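[note: the fix is literally a one-word change of the initcall level; shown with the surrounding function for context.]

  static int __init init_zero_pfn(void)
  {
          zero_pfn = page_to_pfn(ZERO_PAGE(0));
          return 0;
  }
  early_initcall(init_zero_pfn);    /* was: core_initcall(init_zero_pfn); */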
* Revert "mm, slub: consider rest of partial list if acquire_slab() fails"Linus Torvalds2021-03-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9b1ea29bc0d7b94d420f96a0f4121403efc3dd85 upstream. This reverts commit 8ff60eb052eeba95cfb3efe16b08c9199f8121cf. The kernel test robot reports a huge performance regression due to the commit, and the reason seems fairly straightforward: when there is contention on the page list (which is what causes acquire_slab() to fail), we do _not_ want to just loop and try again, because that will transfer the contention to the 'n->list_lock' spinlock we hold, and just make things even worse. This is admittedly likely a problem only on big machines - the kernel test robot report comes from a 96-thread dual socket Intel Xeon Gold 6252 setup, but the regression there really is quite noticeable: -47.9% regression of stress-ng.rawpkt.ops_per_sec and the commit that was marked as being fixed (7ced37197196: "slub: Acquire_slab() avoid loop") actually did the loop exit early very intentionally (the hint being that "avoid loop" part of that commit message), exactly to avoid this issue. The correct thing to do may be to pick some kind of reasonable middle ground: instead of breaking out of the loop on the very first sign of contention, or trying over and over and over again, the right thing may be to re-try _once_, and then give up on the second failure (or pick your favorite value for "once"..). Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/ Cc: Jann Horn <jannh@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
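[note: what the revert restores in the get_partial_node() scan, sketched; surrounding code abbreviated.]

  list_for_each_entry_safe(page, page2, &n->partial, lru) {
          void *t;

          if (!pfmemalloc_match(page, flags))
                  continue;

          t = acquire_slab(s, n, page, object == NULL, &objects);
          if (!t)
                  break;  /* restored: back off under contention instead of
                           * walking the rest of the list with n->list_lock held */

          /* ... track the acquired slab, stop once we have enough ... */
  }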
* swap: fix swapfile read/write offsetJens Axboe2021-03-072-9/+4
| | | | | | | | | | | | | | | | | commit caf6912f3f4af7232340d500a4a2008f81b93f14 upstream. We're not factoring in the start of the file for where to write and read the swapfile, which leads to very unfortunate side effects of writing where we should not be... [This issue only affects swapfiles on filesystems on top of blockdevs that implement rw_page ops (brd, zram, btt, pmem), and not on top of any other block devices, in contrast to the upstream commit fix.] Fixes: dd6bd0d9c7db ("swap: use bdev_read_page() / bdev_write_page()") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* zsmalloc: account the number of compacted pages correctlyRokudo Yan2021-03-071-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2395928158059b8f9858365fce7713ce7fef62e4 upstream. There exists multiple path may do zram compaction concurrently. 1. auto-compaction triggered during memory reclaim 2. userspace utils write zram<id>/compaction node So, multiple threads may call zs_shrinker_scan/zs_compact concurrently. But pages_compacted is a per zsmalloc pool variable and modification of the variable is not serialized(through under class->lock). There are two issues here: 1. the pages_compacted may not equal to total number of pages freed(due to concurrently add). 2. zs_shrinker_scan may not return the correct number of pages freed(issued by current shrinker). The fix is simple: 1. account the number of pages freed in zs_compact locally. 2. use actomic variable pages_compacted to accumulate total number. Link: https://lkml.kernel.org/r/20210202122235.26885-1-wu-yan@tcl.com Fixes: 860c707dca155a56 ("zsmalloc: account the number of compacted pages") Signed-off-by: Rokudo Yan <wu-yan@tcl.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
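[note: the accounting change in outline, following the upstream patch; constant and field names are from recent zsmalloc and the stable tree may spell them differently.]

  unsigned long zs_compact(struct zs_pool *pool)
  {
          int i;
          struct size_class *class;
          unsigned long pages_freed = 0;

          for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
                  class = pool->size_class[i];
                  if (!class || class->index != i)
                          continue;
                  /* __zs_compact() now returns pages freed by this call only. */
                  pages_freed += __zs_compact(pool, class);
          }
          /* Accumulate once, atomically, instead of racing on a shared counter. */
          atomic_long_add(pages_freed, &pool->stats.pages_compacted);

          return pages_freed;
  }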
* mm/hugetlb.c: fix unnecessary address expansion of pmd sharingLi Xinhai2021-03-071-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a1ba9da8f0f9a37d900ff7eff66482cf7de8015e upstream. The current code would unnecessarily expand the address range. Consider one example, (start, end) = (1G-2M, 3G+2M), and (vm_start, vm_end) = (1G-4M, 3G+4M), the expected adjustment should be keep (1G-2M, 3G+2M) without expand. But the current result will be (1G-4M, 3G+4M). Actually, the range (1G-4M, 1G) and (3G, 3G+4M) would never been involved in pmd sharing. After this patch, we will check that the vma span at least one PUD aligned size and the start,end range overlap the aligned range of vma. With above example, the aligned vma range is (1G, 3G), so if (start, end) range is within (1G-4M, 1G), or within (3G, 3G+4M), then no adjustment to both start and end. Otherwise, we will have chance to adjust start downwards or end upwards without exceeding (vm_start, vm_end). Mike: : The 'adjusted range' is used for calls to mmu notifiers and cache(tlb) : flushing. Since the current code unnecessarily expands the range in some : cases, more entries than necessary would be flushed. This would/could : result in performance degradation. However, this is highly dependent on : the user runtime. Is there a combination of vma layout and calls to : actually hit this issue? If the issue is hit, will those entries : unnecessarily flushed be used again and need to be unnecessarily reloaded? Link: https://lkml.kernel.org/r/20210104081631.2921415-1-lixinhai.lxh@gmail.com Fixes: 75802ca66354 ("mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible") Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
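[note: the corrected adjustment, essentially as in the upstream patch.]

  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
                                            unsigned long *start, unsigned long *end)
  {
          unsigned long v_start = ALIGN(vma->vm_start, PUD_SIZE),
                        v_end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);

          /*
           * The vma must span at least one PUD-aligned chunk and the
           * (start, end) range must overlap that aligned span, otherwise
           * no sharing is possible and nothing needs adjusting.
           */
          if (!(vma->vm_flags & VM_MAYSHARE) || !(v_end > v_start) ||
              (*end <= v_start) || (*start >= v_end))
                  return;

          /* Extend to PUD boundaries only where the worst case requires it. */
          if (*start > v_start)
                  *start = ALIGN_DOWN(*start, PUD_SIZE);

          if (*end < v_end)
                  *end = ALIGN(*end, PUD_SIZE);
  }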
* hugetlb: fix update_and_free_page contig page struct assumptionMike Kravetz2021-03-071-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit dbfee5aee7e54f83d96ceb8e3e80717fac62ad63 upstream. page structs are not guaranteed to be contiguous for gigantic pages. The routine update_and_free_page can encounter a gigantic page, yet it assumes page structs are contiguous when setting page flags in subpages. If update_and_free_page encounters non-contiguous page structs, we can see “BUG: Bad page state in process …” errors. Non-contiguous page structs are generally not an issue. However, they can exist with a specific kernel configuration and hotplug operations. For example: Configure the kernel with CONFIG_SPARSEMEM and !CONFIG_SPARSEMEM_VMEMMAP. Then, hotplug add memory for the area where the gigantic page will be allocated. Zi Yan outlined steps to reproduce here [1]. [1] https://lore.kernel.org/linux-mm/16F7C58B-4D79-41C5-9B64-A1A1628F4AF2@nvidia.com/ Link: https://lkml.kernel.org/r/20210217184926.33567-1-mike.kravetz@oracle.com Fixes: 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime") Signed-off-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
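[note: the gist of the fix, assuming the mem_map_next() helper the upstream patch uses to step page structs across potential section boundaries; flag list abbreviated.]

  /* update_and_free_page(), clearing flags in the subpages: */
  struct page *subpage = page;

  for (i = 0; i < pages_per_huge_page(h);
       i++, subpage = mem_map_next(subpage, page, i)) {
          /*
           * Previously this dereferenced page[i] directly, which is only
           * valid when the gigantic page's page structs are contiguous.
           */
          subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
                              1 << PG_referenced | 1 << PG_dirty |
                              1 << PG_active | 1 << PG_private |
                              1 << PG_writeback);
  }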
* mm: hugetlb: fix a race between freeing and dissolving the pageMuchun Song2021-03-031-0/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7ffddd499ba6122b1a07828f023d1d67629aa017 upstream There is a race condition between __free_huge_page() and dissolve_free_huge_page(). CPU0: CPU1: // page_count(page) == 1 put_page(page) __free_huge_page(page) dissolve_free_huge_page(page) spin_lock(&hugetlb_lock) // PageHuge(page) && !page_count(page) update_and_free_page(page) // page is freed to the buddy spin_unlock(&hugetlb_lock) spin_lock(&hugetlb_lock) clear_page_huge_active(page) enqueue_huge_page(page) // It is wrong, the page is already freed spin_unlock(&hugetlb_lock) The race window is between put_page() and dissolve_free_huge_page(). We should make sure that the page is already on the free list when it is dissolved. As a result __free_huge_page would corrupt page(s) already in the buddy allocator. Link: https://lkml.kernel.org/r/20210115124942.46403-4-songmuchun@bytedance.com Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [sudip: adjust context] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
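[note: a sketch of the retry added to dissolve_free_huge_page(); the PageHugeFreed() helper name ("already enqueued on the free list") is taken from the upstream patch, and the note above says the backport context was adjusted.]

  retry:
          spin_lock(&hugetlb_lock);
          if (PageHuge(page) && !page_count(page)) {
                  struct page *head = compound_head(page);

                  /*
                   * The page may still be inside __free_huge_page() on
                   * another CPU: refcount already zero, but not yet on the
                   * free list. Dissolving it now would hand it to the buddy
                   * allocator twice, so back off and retry.
                   */
                  if (unlikely(!PageHugeFreed(head))) {
                          spin_unlock(&hugetlb_lock);
                          cond_resched();
                          goto retry;
                  }

                  /* ... existing dissolve logic runs only after this check ... */
          }
          spin_unlock(&hugetlb_lock);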