path: root/mm
Commit message | Author | Age | Files | Lines
* mm/swap: fix swap_info_struct race between swapoff and get_swap_pages() (Rongwei Wang, 2023-04-05, 1 file changed, -1/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The si->lock must be held when deleting the si from the available list. Otherwise, another thread can re-add the si to the available list, which can lead to memory corruption. The only place we have found where this happens is in the swapoff path. This case can be described as below: core 0 core 1 swapoff del_from_avail_list(si) waiting try lock si->lock acquire swap_avail_lock and re-add si into swap_avail_head acquire si->lock but missing si already being added again, and continuing to clear SWP_WRITEOK, etc. It can be easily found that a massive warning messages can be triggered inside get_swap_pages() by some special cases, for example, we call madvise(MADV_PAGEOUT) on blocks of touched memory concurrently, meanwhile, run much swapon-swapoff operations (e.g. stress-ng-swap). However, in the worst case, panic can be caused by the above scene. In swapoff(), the memory used by si could be kept in swap_info[] after turning off a swap. This means memory corruption will not be caused immediately until allocated and reset for a new swap in the swapon path. A panic message caused: (with CONFIG_PLIST_DEBUG enabled) ------------[ cut here ]------------ top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70 Modules linked in: rfkill(E) crct10dif_ce(E)... CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+ Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) pc : plist_check_prev_next_node+0x50/0x70 lr : plist_check_prev_next_node+0x50/0x70 sp : ffff0018009d3c30 x29: ffff0018009d3c40 x28: ffff800011b32a98 x27: 0000000000000000 x26: ffff001803908000 x25: ffff8000128ea088 x24: ffff800011b32a48 x23: 0000000000000028 x22: ffff001800875c00 x21: ffff800010f9e520 x20: ffff001800875c00 x19: ffff001800fdc6e0 x18: 0000000000000030 x17: 0000000000000000 x16: 0000000000000000 x15: 0736076307640766 x14: 0730073007380731 x13: 0736076307640766 x12: 0730073007380731 x11: 000000000004058d x10: 0000000085a85b76 x9 : ffff8000101436e4 x8 : ffff800011c8ce08 x7 : 0000000000000000 x6 : 0000000000000001 x5 : ffff0017df9ed338 x4 : 0000000000000001 x3 : ffff8017ce62a000 x2 : ffff0017df9ed340 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: plist_check_prev_next_node+0x50/0x70 plist_check_head+0x80/0xf0 plist_add+0x28/0x140 add_to_avail_list+0x9c/0xf0 _enable_swap_info+0x78/0xb4 __do_sys_swapon+0x918/0xa10 __arm64_sys_swapon+0x20/0x30 el0_svc_common+0x8c/0x220 do_el0_svc+0x2c/0x90 el0_svc+0x1c/0x30 el0_sync_handler+0xa8/0xb0 el0_sync+0x148/0x180 irq event stamp: 2082270 Now, si->lock locked before calling 'del_from_avail_list()' to make sure other thread see the si had been deleted and SWP_WRITEOK cleared together, will not reinsert again. This problem exists in versions after stable 5.10.y. 
Link: https://lkml.kernel.org/r/20230404154716.23058-1-rongwei.wang@linux.alibaba.com
Fixes: a2468cc9bfdff ("swap: choose swap device according to numa node")
Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
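A condensed sketch of the locking order this fix establishes (illustrative only, not the literal patch; `si` is the swap_info_struct being switched off and the rest of the swapoff() path is elided):

  /* Sketch of the corrected ordering in swapoff() */
  spin_lock(&si->lock);           /* taken *before* leaving the avail list */
  del_from_avail_list(si);        /* other CPUs now observe the removal ... */
  si->flags &= ~SWP_WRITEOK;      /* ... and the cleared SWP_WRITEOK together */
  spin_unlock(&si->lock);

With si->lock held across both steps, a concurrent get_swap_pages() can no longer re-add the si to swap_avail_head in the window between the two operations.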
* mm: take a page reference when removing device exclusive entries (Alistair Popple, 2023-04-05, 1 file changed, -1/+15)
Device exclusive page table entries are used to prevent CPU access to a page whilst it is being accessed from a device. Typically this is used to implement atomic operations when the underlying bus does not support atomic access. When a CPU thread encounters a device exclusive entry it locks the page and restores the original entry after calling mmu notifiers to signal drivers that exclusive access is no longer available.

The device exclusive entry holds a reference to the page making it safe to access the struct page whilst the entry is present. However the fault handling code does not hold the PTL when taking the page lock. This means if there are multiple threads faulting concurrently on the device exclusive entry one will remove the entry whilst others will wait on the page lock without holding a reference.

This can lead to threads locking or waiting on a folio with a zero refcount. Whilst mmap_lock prevents the pages getting freed via munmap() they may still be freed by a migration. This leads to warnings such as PAGE_FLAGS_CHECK_AT_FREE due to the page being locked when the refcount drops to zero.

Fix this by trying to take a reference on the folio before locking it. The code already checks the PTE under the PTL and aborts if the entry is no longer there. It is also possible the folio has been unmapped, freed and re-allocated allowing a reference to be taken on an unrelated folio. This case is also detected by the PTE check and the folio is unlocked without further changes.

Link: https://lkml.kernel.org/r/20230330012519.804116-1-apopple@nvidia.com
Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
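A rough sketch of the fixed fault-path ordering (simplified; the real change lives in mm/memory.c and uses folio_lock_or_retry() and PTL re-validation that are only hinted at here):

  /* Sketch: take a reference before locking, then re-check the PTE. */
  static vm_fault_t remove_device_exclusive_sketch(struct vm_fault *vmf,
                                                   struct folio *folio)
  {
          /* A concurrent fault may remove the entry and drop its reference,
           * so the folio may only be locked after we hold our own ref. */
          if (!folio_try_get(folio))
                  return 0;               /* folio was freed; retry the fault */

          folio_lock(folio);
          /* ... re-validate the PTE under the PTL; if the entry is gone or
           * the folio was re-allocated, just unlock and drop the ref ... */
          folio_unlock(folio);
          folio_put(folio);
          return 0;
  }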
* mm: vmalloc: avoid warn_alloc noise caused by fatal signal (Yafang Shao, 2023-04-05, 1 file changed, -3/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There're some suspicious warn_alloc on my test serer, for example, [13366.518837] warn_alloc: 81 callbacks suppressed [13366.518841] test_verifier: vmalloc error: size 4096, page order 0, failed to allocate pages, mode:0x500dc2(GFP_HIGHUSER|__GFP_ZERO|__GFP_ACCOUNT), nodemask=(null),cpuset=/,mems_allowed=0-1 [13366.522240] CPU: 30 PID: 722463 Comm: test_verifier Kdump: loaded Tainted: G W O 6.2.0+ #638 [13366.524216] Call Trace: [13366.524702] <TASK> [13366.525148] dump_stack_lvl+0x6c/0x80 [13366.525712] dump_stack+0x10/0x20 [13366.526239] warn_alloc+0x119/0x190 [13366.526783] ? alloc_pages_bulk_array_mempolicy+0x9e/0x2a0 [13366.527470] __vmalloc_area_node+0x546/0x5b0 [13366.528066] __vmalloc_node_range+0xc2/0x210 [13366.528660] __vmalloc_node+0x42/0x50 [13366.529186] ? bpf_prog_realloc+0x53/0xc0 [13366.529743] __vmalloc+0x1e/0x30 [13366.530235] bpf_prog_realloc+0x53/0xc0 [13366.530771] bpf_patch_insn_single+0x80/0x1b0 [13366.531351] bpf_jit_blind_constants+0xe9/0x1c0 [13366.531932] ? __free_pages+0xee/0x100 [13366.532457] ? free_large_kmalloc+0x58/0xb0 [13366.533002] bpf_int_jit_compile+0x8c/0x5e0 [13366.533546] bpf_prog_select_runtime+0xb4/0x100 [13366.534108] bpf_prog_load+0x6b1/0xa50 [13366.534610] ? perf_event_task_tick+0x96/0xb0 [13366.535151] ? security_capable+0x3a/0x60 [13366.535663] __sys_bpf+0xb38/0x2190 [13366.536120] ? kvm_clock_get_cycles+0x9/0x10 [13366.536643] __x64_sys_bpf+0x1c/0x30 [13366.537094] do_syscall_64+0x38/0x90 [13366.537554] entry_SYSCALL_64_after_hwframe+0x72/0xdc [13366.538107] RIP: 0033:0x7f78310f8e29 [13366.538561] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 e0 2c 00 f7 d8 64 89 01 48 [13366.540286] RSP: 002b:00007ffe2a61fff8 EFLAGS: 00000206 ORIG_RAX: 0000000000000141 [13366.541031] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f78310f8e29 [13366.541749] RDX: 0000000000000080 RSI: 00007ffe2a6200b0 RDI: 0000000000000005 [13366.542470] RBP: 00007ffe2a620010 R08: 00007ffe2a6202a0 R09: 00007ffe2a6200b0 [13366.543183] R10: 00000000000f423e R11: 0000000000000206 R12: 0000000000407800 [13366.543900] R13: 00007ffe2a620540 R14: 0000000000000000 R15: 0000000000000000 [13366.544623] </TASK> [13366.545260] Mem-Info: [13366.546121] active_anon:81319 inactive_anon:20733 isolated_anon:0 active_file:69450 inactive_file:5624 isolated_file:0 unevictable:0 dirty:10 writeback:0 slab_reclaimable:69649 slab_unreclaimable:48930 mapped:27400 shmem:12868 pagetables:4929 sec_pagetables:0 bounce:0 kernel_misc_reclaimable:0 free:15870308 free_pcp:142935 free_cma:0 [13366.551886] Node 0 active_anon:224836kB inactive_anon:33528kB active_file:175692kB inactive_file:13752kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:59248kB dirty:32kB writeback:0kB shmem:18252kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:4616kB pagetables:10664kB sec_pagetables:0kB all_unreclaimable? 
no [13366.555184] Node 1 active_anon:100440kB inactive_anon:49404kB active_file:102108kB inactive_file:8744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:50352kB dirty:8kB writeback:0kB shmem:33220kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:3896kB pagetables:9052kB sec_pagetables:0kB all_unreclaimable? no [13366.558262] Node 0 DMA free:15360kB boost:0kB min:304kB low:380kB high:456kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [13366.560821] lowmem_reserve[]: 0 2735 31873 31873 31873 [13366.561981] Node 0 DMA32 free:2790904kB boost:0kB min:56028kB low:70032kB high:84036kB reserved_highatomic:0KB active_anon:1936kB inactive_anon:20kB active_file:396kB inactive_file:344kB unevictable:0kB writepending:0kB present:3129200kB managed:2801520kB mlocked:0kB bounce:0kB free_pcp:5188kB local_pcp:0kB free_cma:0kB [13366.565148] lowmem_reserve[]: 0 0 29137 29137 29137 [13366.566168] Node 0 Normal free:28533824kB boost:0kB min:596740kB low:745924kB high:895108kB reserved_highatomic:28672KB active_anon:222900kB inactive_anon:33508kB active_file:175296kB inactive_file:13408kB unevictable:0kB writepending:32kB present:30408704kB managed:29837172kB mlocked:0kB bounce:0kB free_pcp:295724kB local_pcp:0kB free_cma:0kB [13366.569485] lowmem_reserve[]: 0 0 0 0 0 [13366.570416] Node 1 Normal free:32141144kB boost:0kB min:660504kB low:825628kB high:990752kB reserved_highatomic:69632KB active_anon:100440kB inactive_anon:49404kB active_file:102108kB inactive_file:8744kB unevictable:0kB writepending:8kB present:33554432kB managed:33025372kB mlocked:0kB bounce:0kB free_pcp:270880kB local_pcp:46860kB free_cma:0kB [13366.573403] lowmem_reserve[]: 0 0 0 0 0 [13366.574015] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB [13366.575474] Node 0 DMA32: 782*4kB (UME) 756*8kB (UME) 736*16kB (UME) 745*32kB (UME) 694*64kB (UME) 653*128kB (UME) 595*256kB (UME) 552*512kB (UME) 454*1024kB (UME) 347*2048kB (UME) 246*4096kB (UME) = 2790904kB [13366.577442] Node 0 Normal: 33856*4kB (UMEH) 51815*8kB (UMEH) 42418*16kB (UMEH) 36272*32kB (UMEH) 22195*64kB (UMEH) 10296*128kB (UMEH) 7238*256kB (UMEH) 5638*512kB (UEH) 5337*1024kB (UMEH) 3506*2048kB (UMEH) 1470*4096kB (UME) = 28533784kB [13366.580460] Node 1 Normal: 15776*4kB (UMEH) 37485*8kB (UMEH) 29509*16kB (UMEH) 21420*32kB (UMEH) 14818*64kB (UMEH) 13051*128kB (UMEH) 9918*256kB (UMEH) 7374*512kB (UMEH) 5397*1024kB (UMEH) 3887*2048kB (UMEH) 2002*4096kB (UME) = 32141240kB [13366.583027] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [13366.584380] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [13366.585702] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [13366.587042] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [13366.588372] 87386 total pagecache pages [13366.589266] 0 pages in swap cache [13366.590327] Free swap = 0kB [13366.591227] Total swap = 0kB [13366.592142] 16777082 pages RAM [13366.593057] 0 pages HighMem/MovableOnly [13366.594037] 357226 pages reserved [13366.594979] 0 pages hwpoisoned This failure really confuse me as there're still lots of available pages. Finally I figured out it was caused by a fatal signal. 
When a process receives a fatal signal while allocating memory via vm_area_alloc_pages(), it breaks out of the allocation loop even though it has not yet allocated all of the requested pages. In that case we shouldn't emit warn_alloc, as it is just noise; the warning should only be shown when there really are not enough pages.

Link: https://lkml.kernel.org/r/20230330162625.13604-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
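The essence of the fix is a guard around the warning; a sketch of the shape of the check (a fragment of the __vmalloc_area_node()/vm_area_alloc_pages() path; nr_allocated, nr_pages and gfp_mask stand for the surrounding local variables and are assumed here):

  if (nr_allocated < nr_pages) {
          /* The allocation loop bails out early on fatal_signal_pending(),
           * so a shortfall is not necessarily a real allocation failure. */
          if (!fatal_signal_pending(current))
                  warn_alloc(gfp_mask, NULL,
                             "vmalloc error: size %lu, failed to allocate pages",
                             (nr_pages - nr_allocated) * PAGE_SIZE);
  }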
* mm/hugetlb: fix uffd wr-protection for CoW optimization path (Peter Xu, 2023-04-05, 1 file changed, -2/+12)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes an issue that a hugetlb uffd-wr-protected mapping can be writable even with uffd-wp bit set. It only happens with hugetlb private mappings, when someone firstly wr-protects a missing pte (which will install a pte marker), then a write to the same page without any prior access to the page. Userfaultfd-wp trap for hugetlb was implemented in hugetlb_fault() before reaching hugetlb_wp() to avoid taking more locks that userfault won't need. However there's one CoW optimization path that can trigger hugetlb_wp() inside hugetlb_no_page(), which will bypass the trap. This patch skips hugetlb_wp() for CoW and retries the fault if uffd-wp bit is detected. The new path will only trigger in the CoW optimization path because generic hugetlb_fault() (e.g. when a present pte was wr-protected) will resolve the uffd-wp bit already. Also make sure anonymous UNSHARE won't be affected and can still be resolved, IOW only skip CoW not CoR. This patch will be needed for v5.19+ hence copy stable. [peterx@redhat.com: v2] Link: https://lkml.kernel.org/r/ZBzOqwF2wrHgBVZb@x1n [peterx@redhat.com: v3] Link: https://lkml.kernel.org/r/20230324142620.2344140-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230321191840.1897940-1-peterx@redhat.com Fixes: 166f3ecc0daf ("mm/hugetlb: hook page faults for uffd write protection") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: enable maple tree RCU mode by default (Liam R. Howlett, 2023-04-05, 1 file changed, -1/+2)
Use the maple tree in RCU mode for VMA tracking.

The maple tree tracks the stack and is able to update the pivot (lower/upper boundary) in-place to allow the page fault handler to write to the tree while holding just the mmap read lock. This is safe as the writes to the stack have a guard VMA which ensures there will always be a NULL in the direction of the growth and thus will only update a pivot.

It is possible, but not recommended, to have VMAs that grow up/down without guard VMAs. syzbot has constructed a testcase which sets up a VMA to grow and consume the empty space. Overwriting the entire NULL entry causes the tree to be altered in a way that is not safe for concurrent readers; the readers may see a node being rewritten or one that does not match the maple state they are using.

Enabling RCU mode allows the concurrent readers to see a stable node and will return the expected result.

[Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU]
Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+8d95422d3537159ca390@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: kfence: fix handling discontiguous page (Muchun Song, 2023-03-28, 1 file changed, -2/+2)
The struct pages could be discontiguous when the kfence pool is allocated via alloc_contig_pages() with CONFIG_SPARSEMEM and !CONFIG_SPARSEMEM_VMEMMAP. This may result in setting PG_slab and memcg_data on an arbitrary address (one that may not even be used as a struct page), which in the worst case might corrupt the kernel. The iteration should therefore use nth_page().

Link: https://lkml.kernel.org/r/20230323025003.94447-1-songmuchun@bytedance.com
Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
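The underlying pitfall is that struct pages are only virtually contiguous with SPARSEMEM_VMEMMAP. A small sketch of the safe iteration pattern the fix switches to (simplified from mm/kfence/core.c; 'pool_size' is a stand-in for the real pool size constant):

  /* Walk physically contiguous pages without assuming their struct pages
   * are contiguous in memory. */
  struct page *pool_pages = virt_to_page(__kfence_pool);
  unsigned long i;

  for (i = 0; i < pool_size / PAGE_SIZE; i++) {
          struct page *page = nth_page(pool_pages, i);    /* not pool_pages + i */

          /* ... initialize per-page state for the kfence pool ... */
  }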
* mm: kfence: fix PG_slab and memcg_data clearing (Muchun Song, 2023-03-28, 1 file changed, -15/+15)
PG_slab and memcg_data are not reset when KFENCE fails to initialize its pool at runtime, which results in a "Bad page state" report when the kfence pool is freed back to the buddy allocator. The check for a compound head page is unnecessary, since this is already guaranteed when the kfence pool is allocated; remove the check to simplify the code.

Link: https://lkml.kernel.org/r/20230320030059.20189-1-songmuchun@bytedance.com
Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Merge tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2023-03-24, 7 files changed, -18/+45)

Pull misc fixes from Andrew Morton:
"21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder are for other subsystems"

* tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  mm: mmap: remove newline at the end of the trace
  mailmap: add entries for Richard Leitner
  kcsan: avoid passing -g for test
  kfence: avoid passing -g for test
  mm: kfence: fix using kfence_metadata without initialization in show_object()
  lib: dhry: fix unstable smp_processor_id(_) usage
  mailmap: add entry for Enric Balletbo i Serra
  mailmap: map Sai Prakash Ranjan's old address to his current one
  mailmap: map Rajendra Nayak's old address to his current one
  Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"
  mailmap: add entry for Tobias Klauser
  kasan, powerpc: don't rename memintrinsics if compiler adds prefixes
  mm/ksm: fix race with VMA iteration and mm_struct teardown
  kselftest: vm: fix unused variable warning
  mm: fix error handling for map_deny_write_exec
  mm: deduplicate error handling for map_deny_write_exec
  checksyscalls: ignore fstat to silence build warning on LoongArch
  nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy()
  test_maple_tree: add more testing for mas_empty_area()
  maple_tree: fix mas_skip_node() end slot detection
  ...
| * kfence: avoid passing -g for test (Marco Elver, 2023-03-23, 1 file changed, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nathan reported that when building with GNU as and a version of clang that defaults to DWARF5: $ make -skj"$(nproc)" ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- \ LLVM=1 LLVM_IAS=0 O=build \ mrproper allmodconfig mm/kfence/kfence_test.o /tmp/kfence_test-08a0a0.s: Assembler messages: /tmp/kfence_test-08a0a0.s:14627: Error: non-constant .uleb128 is not supported /tmp/kfence_test-08a0a0.s:14628: Error: non-constant .uleb128 is not supported /tmp/kfence_test-08a0a0.s:14632: Error: non-constant .uleb128 is not supported /tmp/kfence_test-08a0a0.s:14633: Error: non-constant .uleb128 is not supported /tmp/kfence_test-08a0a0.s:14639: Error: non-constant .uleb128 is not supported ... This is because `-g` defaults to the compiler debug info default. If the assembler does not support some of the directives used, the above errors occur. To fix, remove the explicit passing of `-g`. All the test wants is that stack traces print valid function names, and debug info is not required for that. (I currently cannot recall why I added the explicit `-g`.) Link: https://lkml.kernel.org/r/20230316224705.709984-1-elver@google.com Fixes: bc8fbc5f305a ("kfence: add test suite") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: kfence: fix using kfence_metadata without initialization in show_object() (Muchun Song, 2023-03-23, 1 file changed, -2/+8)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The variable kfence_metadata is initialized in kfence_init_pool(), then, it is not initialized if kfence is disabled after booting. In this case, kfence_metadata will be used (e.g. ->lock and ->state fields) without initialization when reading /sys/kernel/debug/kfence/objects. There will be a warning if you enable CONFIG_DEBUG_SPINLOCK. Fix it by creating debugfs files when necessary. Link: https://lkml.kernel.org/r/20230315034441.44321-1-songmuchun@bytedance.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Tested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jann Horn <jannh@google.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"Peter Collingbourne2023-03-231-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 487a32ec24be819e747af8c2ab0d5c515508086a. should_skip_kasan_poison() reads the PG_skip_kasan_poison flag from page->flags. However, this line of code in free_pages_prepare(): page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; clears most of page->flags, including PG_skip_kasan_poison, before calling should_skip_kasan_poison(), which meant that it would never return true as a result of the page flag being set. Therefore, fix the code to call should_skip_kasan_poison() before clearing the flags, as we were doing before the reverted patch. This fixes a measurable performance regression introduced in the reverted commit, where munmap() takes longer than intended if HW tags KASAN is supported and enabled at runtime. Without this patch, we see a single-digit percentage performance regression in a particular mmap()-heavy benchmark when enabling HW tags KASAN, and with the patch, there is no statistically significant performance impact when enabling HW tags KASAN. Link: https://lkml.kernel.org/r/20230310042914.3805818-2-pcc@google.com Fixes: 487a32ec24be ("kasan: drop skip_kasan_poison variable in free_pages_prepare") Link: https://linux-review.googlesource.com/id/Ic4f13affeebd20548758438bb9ed9ca40e312b79 Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Evgenii Stepanov <eugenis@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> [6.1] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
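The ordering problem is easiest to see in a heavily condensed sketch of free_pages_prepare() (the _sketch suffix and the simplified signature are placeholders; the real function does much more):

  static bool free_pages_prepare_sketch(struct page *page, fpi_t fpi_flags)
  {
          /* Must be read first: the very next line clears most of
           * page->flags, including PG_skip_kasan_poison. */
          bool skip_kasan_poison = should_skip_kasan_poison(page, fpi_flags);

          page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;

          /* ... later teardown poisons the memory only when
           * !skip_kasan_poison ... */
          return skip_kasan_poison;
  }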
| * mm/ksm: fix race with VMA iteration and mm_struct teardown (Liam R. Howlett, 2023-03-23, 1 file changed, -2/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | exit_mmap() will tear down the VMAs and maple tree with the mmap_lock held in write mode. Ensure that the maple tree is still valid by checking ksm_test_exit() after taking the mmap_lock in read mode, but before the for_each_vma() iterator dereferences a destroyed maple tree. Since the maple tree is destroyed, the flags telling lockdep to check an external lock has been cleared. Skip the for_each_vma() iterator to avoid dereferencing a maple tree without the external lock flag, which would create a lockdep warning. Link: https://lkml.kernel.org/r/20230308220310.3119196-1-Liam.Howlett@oracle.com Fixes: a5f18ba07276 ("mm/ksm: use vma iterators instead of vma linked list") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Link: https://lore.kernel.org/lkml/ZAdUUhSbaa6fHS36@xpf.sh.intel.com/ Reported-by: syzbot+2ee18845e89ae76342c5@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=64a3e95957cd3deab99df7cd7b5a9475af92c93e Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <heng.su@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: fix error handling for map_deny_write_exec (Joey Gouly, 2023-03-23, 1 file changed, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4a18419f71cd ("mm/mprotect: use mmu_gather") changed 'goto out;' to 'break' in the loop. This wasn't noticed while rebasing the MDWE patches, so fix it now. Link: https://lkml.kernel.org/r/20230308190423.46491-3-joey.gouly@arm.com Fixes: b507808ebce2 ("mm: implement memory-deny-write-execute as a prctl") Signed-off-by: Joey Gouly <joey.gouly@arm.com> Reported-by: Alexey Izbyshev <izbyshev@ispras.ru> Link: https://lore.kernel.org/linux-arm-kernel/8408d8901e9d7ee6b78db4c6cba04b78@ispras.ru/ Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: nd <nd@arm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: deduplicate error handling for map_deny_write_exec (Joey Gouly, 2023-03-23, 1 file changed, -6/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Fixes for MDWE prctl" These are four small fixes for the recent memory-write-deny-execute prctl patches [1]. Two reported by Alexey about error handling and two tooling fixes by Peter. This patch (of 4): Commit cc8d1b097de7 ("mmap: clean up mmap_region() unrolling") deduplicated the error handling, do the same for the return value of `map_deny_write_exec`. Link: https://lkml.kernel.org/r/20230308190423.46491-1-joey.gouly@arm.com Link: https://lkml.kernel.org/r/20230308190423.46491-2-joey.gouly@arm.com Link: https://lore.kernel.org/linux-arm-kernel/20230119160344.54358-1-joey.gouly@arm.com/ [1] Fixes: b507808ebce2 ("mm: implement memory-deny-write-execute as a prctl") Signed-off-by: Joey Gouly <joey.gouly@arm.com> Reported-by: Alexey Izbyshev <izbyshev@ispras.ru> Link: https://lore.kernel.org/linux-arm-kernel/8408d8901e9d7ee6b78db4c6cba04b78@ispras.ru/ Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: nd <nd@arm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm, vmalloc: fix high order __GFP_NOFAIL allocations (Michal Hocko, 2023-03-23, 1 file changed, -5/+23)
Gao Xiang has reported that the page allocator complains about high order __GFP_NOFAIL requests coming from the vmalloc core:

  __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5549
  alloc_pages+0x1aa/0x270 mm/mempolicy.c:2286
  vm_area_alloc_pages mm/vmalloc.c:2989 [inline]
  __vmalloc_area_node mm/vmalloc.c:3057 [inline]
  __vmalloc_node_range+0x978/0x13c0 mm/vmalloc.c:3227
  kvmalloc_node+0x156/0x1a0 mm/util.c:606
  kvmalloc include/linux/slab.h:737 [inline]
  kvmalloc_array include/linux/slab.h:755 [inline]
  kvcalloc include/linux/slab.h:760 [inline]

It seems that I have completely missed the case of high order allocations backing vmalloc areas when implementing __GFP_NOFAIL support. This means that [k]vmalloc et al. can issue higher order allocations with __GFP_NOFAIL, which can easily trigger the OOM killer for non-costly orders or cause a lot of reclaim/compaction activity if those requests cannot be satisfied.

Fix the issue by falling back to zero order allocations for __GFP_NOFAIL requests if the high order request fails.

Link: https://lkml.kernel.org/r/ZAXynvdNqcI0f6Us@dhcp22.suse.cz
Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL")
Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lkml.kernel.org/r/20230305053035.1911-1-hsiangkao@linux.alibaba.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
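A condensed sketch of the fallback described above (the real loop in vm_area_alloc_pages() also handles bulk allocation and NUMA placement, which are omitted here):

  /* A __GFP_NOFAIL request must not fail, but insisting on a high order
   * can trigger the OOM killer; retry at order 0 instead. */
  struct page *page = alloc_pages(gfp_mask, order);

  if (!page && order > 0 && (gfp_mask & __GFP_NOFAIL)) {
          /* Fall back to order-0 pages and map them individually. */
          order = 0;
          page = alloc_pages(gfp_mask, 0);
  }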
* | Merge tag 'slab-fix-for-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab (Linus Torvalds, 2023-03-24, 1 file changed, -1/+1)

Pull slab fix from Vlastimil Babka:
"A single build fix for a corner case configuration that is apparently possible to achieve on some arches, from Geert"

* tag 'slab-fix-for-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: Fix undefined init_cache_node_node() for NUMA and !SMP
| * mm/slab: Fix undefined init_cache_node_node() for NUMA and !SMP (Geert Uytterhoeven, 2023-03-22, 1 file changed, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sh/migor_defconfig: mm/slab.c: In function ‘slab_memory_callback’: mm/slab.c:1127:23: error: implicit declaration of function ‘init_cache_node_node’; did you mean ‘drain_cache_node_node’? [-Werror=implicit-function-declaration] 1127 | ret = init_cache_node_node(nid); | ^~~~~~~~~~~~~~~~~~~~ | drain_cache_node_node The #ifdef condition protecting the definition of init_cache_node_node() no longer matches the conditions protecting the (multiple) users. Fix this by syncing the conditions. Fixes: 76af6a054da40553 ("mm/migrate: add CPU hotplug to demotion #ifdef") Reported-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/b5bdea22-ed2f-3187-6efe-0c72330270a4@infradead.org Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
* | mm/damon/paddr: fix folio_nr_pages() after folio_put() in damon_pa_mark_accessed_or_deactivate() (SeongJae Park, 2023-03-07, 1 file changed, -1/+1)

damon_pa_mark_accessed_or_deactivate() is accessing a folio via folio_nr_pages() after folio_put() for the folio has been invoked. Fix it.

Link: https://lkml.kernel.org/r/20230304193949.296391-3-sj@kernel.org
Fixes: f70da5ee8fe1 ("mm/damon: convert damon_pa_mark_accessed_or_deactivate() to use folios")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
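The fix is purely an ordering one; a hedged sketch of the corrected pattern (the loop body is elided, and the same rule applies to the damon_pa_young()/folio_size() fix in the next entry):

  /* Read what you need from the folio while you still hold the
   * reference, then drop it. */
  struct folio *folio = damon_get_folio(PHYS_PFN(addr)); /* takes a ref */
  unsigned long nr;

  if (!folio)
          return;

  /* ... mark accessed / deactivate ... */
  nr = folio_nr_pages(folio);     /* must happen before folio_put() */
  folio_put(folio);
  /* 'nr' may be used afterwards, e.g. to advance the scan address */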
* | mm/damon/paddr: fix folio_size() call after folio_put() in damon_pa_young() (SeongJae Park, 2023-03-07, 1 file changed, -2/+1)
Patch series "mm/damon/paddr: Fix folio-use-after-put bugs".

There are two folio accesses after folio_put() in the mm/damon/paddr.c file. Fix those.

This patch (of 2):

damon_pa_young() is accessing a folio via folio_size() after folio_put() for the folio has been invoked. Fix it.

Link: https://lkml.kernel.org/r/20230304193949.296391-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20230304193949.296391-2-sj@kernel.org
Fixes: 397b0c3a584b ("mm/damon/paddr: remove folio_sz field from damon_pa_access_chk_result")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: <stable@vger.kernel.org> [6.2.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | migrate_pages: try migrate in batch asynchronously firstly (Huang Ying, 2023-03-07, 1 file changed, -18/+62)
When we have locked more than one folio, we cannot wait for a lock or bit (e.g., page lock, buffer head lock, writeback bit) synchronously, otherwise a deadlock may be triggered. This makes it hard to batch synchronous migration directly.

This patch re-enables batching for synchronous migration by first trying to migrate the batch asynchronously; any folios that fail to migrate asynchronously are then migrated synchronously one by one. Tests show that this effectively restores the TLB-flush batching performance of synchronous migration.

Link: https://lkml.kernel.org/r/20230303030155.160983-4-ying.huang@intel.com
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | migrate_pages: move split folios processing out of migrate_pages_batch() (Huang Ying, 2023-03-07, 1 file changed, -50/+28)
Simplify the code logic and reduce the number of lines.

Link: https://lkml.kernel.org/r/20230303030155.160983-3-ying.huang@intel.com
Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | migrate_pages: fix deadlock in batched migration (Huang Ying, 2023-03-07, 1 file changed, -43/+26)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "migrate_pages: fix deadlock in batched synchronous migration", v2. Two deadlock bugs were reported for the migrate_pages() batching series. Thanks Hugh and Pengfei. Analysis shows that if we have locked some other folios except the one we are migrating, it's not safe in general to wait synchronously, for example, to wait the writeback to complete or wait to lock the buffer head. So 1/3 fixes the deadlock in a simple way, where the batching support for the synchronous migration is disabled. The change is straightforward and easy to be understood. While 3/3 re-introduce the batching for synchronous migration via trying to migrate asynchronously in batch optimistically, then fall back to migrate synchronously one by one for fail-to-migrate folios. Test shows that this can restore the TLB flushing batching performance for synchronous migration effectively. This patch (of 3): Two deadlock bugs were reported for the migrate_pages() batching series. Thanks Hugh and Pengfei! For example, in the following deadlock trace snippet, INFO: task kworker/u4:0:9 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000 Workqueue: loop4 loop_rootcg_workfn Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? __pfx_wake_page_function+0x10/0x10 __filemap_get_folio+0x73d/0x770 shmem_get_folio_gfp+0x1fd/0xc80 shmem_write_begin+0x91/0x220 generic_perform_write+0x10e/0x2e0 __generic_file_write_iter+0x17e/0x290 ? generic_write_checks+0x12b/0x1a0 generic_file_write_iter+0x97/0x180 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_readv_writev+0x13c/0x210 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_write+0xf6/0x330 vfs_iter_write+0x46/0x70 loop_process_work+0x723/0xfe0 loop_rootcg_workfn+0x28/0x40 process_one_work+0x3cc/0x8d0 worker_thread+0x66/0x630 ? __pfx_worker_thread+0x10/0x10 kthread+0x153/0x190 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> INFO: task repro:1023 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? compaction_alloc+0x77/0x1150 ? __pfx_wake_page_function+0x10/0x10 folio_wait_bit+0x30/0x40 folio_wait_writeback+0x2e/0x1e0 migrate_pages_batch+0x555/0x1ac0 ? __pfx_compaction_alloc+0x10/0x10 ? __pfx_compaction_free+0x10/0x10 ? __this_cpu_preempt_check+0x17/0x20 ? lock_is_held_type+0xe6/0x140 migrate_pages+0x100e/0x1180 ? __pfx_compaction_free+0x10/0x10 ? __pfx_compaction_alloc+0x10/0x10 compact_zone+0xe10/0x1b50 ? lock_is_held_type+0xe6/0x140 ? check_preemption_disabled+0x80/0xf0 compact_node+0xa3/0x100 ? 
__sanitizer_cov_trace_const_cmp8+0x1c/0x30 ? _find_first_bit+0x7b/0x90 sysctl_compaction_handler+0x5d/0xb0 proc_sys_call_handler+0x29d/0x420 proc_sys_write+0x2b/0x40 vfs_write+0x3a3/0x780 ksys_write+0xb7/0x180 __x64_sys_write+0x26/0x30 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f3a2471f59d RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 </TASK> The page migration task has held the lock of the shmem folio A, and is waiting the writeback of the folio B of the file system on the loop block device to complete. While the loop worker task which writes back the folio B is waiting to lock the shmem folio A, because the folio A backs the folio B in the loop device. Thus deadlock is triggered. In general, if we have locked some other folios except the one we are migrating, it's not safe to wait synchronously, for example, to wait the writeback to complete or wait to lock the buffer head. To fix the deadlock, in this patch, we avoid to batch the page migration except for MIGRATE_ASYNC mode. In MIGRATE_ASYNC mode, synchronous waiting is avoided. The fix can be improved further. We will do that as soon as possible. Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/ Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/ Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/ Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm/userfaultfd: propagate uffd-wp bit when PTE-mapping the huge zeropage (David Hildenbrand, 2023-03-07, 1 file changed, -2/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we'd lose the userfaultfd-wp marker when PTE-mapping a huge zeropage, resulting in the next write faults in the PMD range not triggering uffd-wp events. Various actions (partial MADV_DONTNEED, partial mremap, partial munmap, partial mprotect) could trigger this. However, most importantly, un-protecting a single sub-page from the userfaultfd-wp handler when processing a uffd-wp event will PTE-map the shared huge zeropage and lose the uffd-wp bit for the remainder of the PMD. Let's properly propagate the uffd-wp bit to the PMDs. #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <stdbool.h> #include <inttypes.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <poll.h> #include <pthread.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/ioctl.h> #include <linux/userfaultfd.h> static size_t pagesize; static int uffd; static volatile bool uffd_triggered; #define barrier() __asm__ __volatile__("": : :"memory") static void uffd_wp_range(char *start, size_t size, bool wp) { struct uffdio_writeprotect uffd_writeprotect; uffd_writeprotect.range.start = (unsigned long) start; uffd_writeprotect.range.len = size; if (wp) { uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP; } else { uffd_writeprotect.mode = 0; } if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) { fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno); exit(1); } } static void *uffd_thread_fn(void *arg) { static struct uffd_msg msg; ssize_t nread; while (1) { struct pollfd pollfd; int nready; pollfd.fd = uffd; pollfd.events = POLLIN; nready = poll(&pollfd, 1, -1); if (nready == -1) { fprintf(stderr, "poll() failed: %d\n", errno); exit(1); } nread = read(uffd, &msg, sizeof(msg)); if (nread <= 0) continue; if (msg.event != UFFD_EVENT_PAGEFAULT || !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) { printf("FAIL: wrong uffd-wp event fired\n"); exit(1); } /* un-protect the single page. 
*/ uffd_triggered = true; uffd_wp_range((char *)(uintptr_t)msg.arg.pagefault.address, pagesize, false); } return arg; } static int setup_uffd(char *map, size_t size) { struct uffdio_api uffdio_api; struct uffdio_register uffdio_register; pthread_t thread; uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); if (uffd < 0) { fprintf(stderr, "syscall() failed: %d\n", errno); return -errno; } uffdio_api.api = UFFD_API; uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { fprintf(stderr, "UFFDIO_API failed: %d\n", errno); return -errno; } if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) { fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n"); return -ENOSYS; } uffdio_register.range.start = (unsigned long) map; uffdio_register.range.len = size; uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) { fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno); return -errno; } pthread_create(&thread, NULL, uffd_thread_fn, NULL); return 0; } int main(void) { const size_t size = 4 * 1024 * 1024ull; char *map, *cur; pagesize = getpagesize(); map = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed\n"); return -errno; } if (madvise(map, size, MADV_HUGEPAGE)) { fprintf(stderr, "MADV_HUGEPAGE failed\n"); return -errno; } if (setup_uffd(map, size)) return 1; /* Read the whole range, populating zeropages. */ madvise(map, size, MADV_POPULATE_READ); /* Write-protect the whole range. */ uffd_wp_range(map, size, true); /* Make sure uffd-wp triggers on each page. */ for (cur = map; cur < map + size; cur += pagesize) { uffd_triggered = false; barrier(); /* Trigger a write fault. */ *cur = 1; barrier(); if (!uffd_triggered) { printf("FAIL: uffd-wp did not trigger\n"); return 1; } } printf("PASS: uffd-wp triggered\n"); return 0; } Link: https://lkml.kernel.org/r/20230302175423.589164-1-david@redhat.com Fixes: e06f1e1dd499 ("userfaultfd: wp: enabled write protection in userfaultfd API") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: teach mincore_hugetlb about pte markers (James Houghton, 2023-03-07, 1 file changed, -1/+1)
By checking huge_pte_none(), we incorrectly classify PTE markers as "present". Instead, check huge_pte_none_mostly(), classifying PTE markers the same as if the PTE were completely blank.

PTE markers, unlike other kinds of swap entries, don't reference any physical page and don't indicate that a physical page was mapped previously. As such, treat them as non-present for the sake of mincore().

Link: https://lkml.kernel.org/r/20230302222404.175303-1-jthoughton@google.com
Fixes: 5c041f5d1f23 ("mm: teach core mm about pte markers")
Signed-off-by: James Houghton <jthoughton@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
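A condensed sketch of the check described above (the _sketch suffix, the 'vec' handling and the simplified signature are placeholders; the real walk callback in mm/mincore.c fills a whole range of the vector):

  /* PTE markers must count as "not present" for mincore(). */
  static int mincore_hugetlb_sketch(pte_t *pte, unsigned char *vec)
  {
          /* huge_pte_none() would treat a PTE marker as present;
           * huge_pte_none_mostly() treats markers like an empty PTE. */
          if (!pte || huge_pte_none_mostly(huge_ptep_get(pte)))
                  *vec = 0;       /* not resident */
          else
                  *vec = 1;       /* resident */
          return 0;
  }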
* | mm: avoid gcc complaint about pointer casting (Linus Torvalds, 2023-03-04, 1 file changed, -2/+8)
The migration code ends up temporarily stashing information of the wrong type in unused fields of the newly allocated destination folio. That all works fine, but gcc does complain about the pointer type mis-use:

  mm/migrate.c: In function ‘__migrate_folio_extract’:
  mm/migrate.c:1050:20: note: randstruct: casting between randomized structure pointer types (ssa): ‘struct anon_vma’ and ‘struct address_space’
   1050 |         *anon_vmap = (void *)dst->mapping;
        |         ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~

and gcc is actually right to complain since it really doesn't understand that this is a very temporary special case where this is ok.

This could be fixed in different ways by just obfuscating the assignment sufficiently that gcc doesn't see what is going on, but the truly "proper C" way to do this is by explicitly using a union.

Using unions for type conversions like this is normally hugely ugly and syntactically nasty, but this really is one of the few cases where we want to make it clear that we're not doing type conversion, we're really re-using the value bit-for-bit just using another type.

IOW, this should not become a common pattern, but in this one case using that odd union is probably the best way to document to the compiler what is conceptually going on here.

[ Side note: there are valid cases where we convert pointers to other pointer types, notably the whole "folio vs page" situation, where the types actually have fundamental commonalities. The fact that the gcc note is limited to just randomized structures means that we don't see equivalent warnings for those cases, but it might also mean that we miss other cases where we do play these kinds of dodgy games, and this kind of explicit conversion might be a good idea. ]

I verified that at least for an allmodconfig build on x86-64, this generates the exact same code, apart from line numbers and assembler comment changes.

Fixes: 64c8902ed441 ("migrate_pages: split unmap_and_move() to _unmap() and _move()")
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
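The union idea reads roughly like the following (a sketch of the concept rather than the exact hunk in mm/migrate.c; the helper name stash_anon_vma() is a placeholder):

  /* Stash an anon_vma pointer in a field typed as address_space without
   * a cast that randstruct-aware gcc will complain about. */
  union migration_ptr {
          struct anon_vma *anon_vma;
          struct address_space *mapping;
  };

  static void stash_anon_vma(struct folio *dst, struct anon_vma *anon_vma)
  {
          union migration_ptr ptr = { .anon_vma = anon_vma };

          dst->mapping = ptr.mapping;     /* bit-for-bit reuse, not a conversion */
  }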
* | Merge tag 'mm-hotfixes-stable-2023-03-04-13-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2023-03-04, 8 files changed, -12/+65)

Pull misc fixes from Andrew Morton:
"17 hotfixes. Eight are for MM and seven are for other parts of the kernel. Seven are cc:stable and eight address post-6.3 issues or were judged unsuitable for -stable backporting"

* tag 'mm-hotfixes-stable-2023-03-04-13-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: map Dikshita Agarwal's old address to his current one
  mailmap: map Vikash Garodia's old address to his current one
  fs/cramfs/inode.c: initialize file_ra_state
  fs: hfsplus: fix UAF issue in hfsplus_put_super
  panic: fix the panic_print NMI backtrace setting
  lib: parser: update documentation for match_NUMBER functions
  kasan, x86: don't rename memintrinsics in uninstrumented files
  kasan: test: fix test for new meminstrinsic instrumentation
  kasan: treat meminstrinsic as builtins in uninstrumented files
  kasan: emit different calls for instrumentable memintrinsics
  ocfs2: fix non-auto defrag path not working issue
  ocfs2: fix defrag path triggering jbd2 ASSERT
  mailmap: map Georgi Djakov's old Linaro address to his current one
  mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON
  lib/zlib: DFLTCC deflate does not write all available bits for Z_NO_FLUSH
  mm/damon/paddr: fix missing folio_put()
  mm/mremap: fix dup_anon_vma() in vma_merge() case 4
| * | kasan: test: fix test for new meminstrinsic instrumentation (Marco Elver, 2023-03-02, 2 files changed, -1/+37)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tests for memset/memmove have been failing since they haven't been instrumented in 69d4c0d32186. Fix the test to recognize when memintrinsics aren't instrumented, and skip test cases accordingly. We also need to conditionally pass -fno-builtin to the test, otherwise the instrumentation pass won't recognize memintrinsics and end up not instrumenting them either. Link: https://lkml.kernel.org/r/20230224085942.1791837-3-elver@google.com Fixes: 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | kasan: treat meminstrinsic as builtins in uninstrumented files (Marco Elver, 2023-03-02, 1 file changed, -1/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Where the compiler instruments meminstrinsics by generating calls to __asan/__hwasan_ prefixed functions, let the compiler consider memintrinsics as builtin again. To do so, never override memset/memmove/memcpy if the compiler does the correct instrumentation - even on !GENERIC_ENTRY architectures. [elver@google.com: powerpc: don't rename memintrinsics if compiler adds prefixes] Link: https://lore.kernel.org/all/20230224085942.1791837-1-elver@google.com/ [1] Link: https://lkml.kernel.org/r/20230227094726.3833247-1-elver@google.com Link: https://lkml.kernel.org/r/20230224085942.1791837-2-elver@google.com Fixes: 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | kasan: emit different calls for instrumentable memintrinsics (Marco Elver, 2023-03-02, 2 files changed, -0/+15)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clang 15 provides an option to prefix memcpy/memset/memmove calls with __asan_/__hwasan_ in instrumented functions: https://reviews.llvm.org/D122724 GCC will add support in future: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108777 Use it to regain KASAN instrumentation of memcpy/memset/memmove on architectures that require noinstr to be really free from instrumented mem*() functions (all GENERIC_ENTRY architectures). Link: https://lkml.kernel.org/r/20230224085942.1791837-1-elver@google.com Fixes: 69d4c0d32186 ("entry, kasan, x86: Disallow overriding mem*() functions") Signed-off-by: Marco Elver <elver@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: kasan-dev@googlegroups.com Cc: Kees Cook <keescook@chromium.org> Cc: Linux Kernel Functional Testing <lkft@linaro.org> Cc: Nathan Chancellor <nathan@kernel.org> # build only Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISONNaoya Horiguchi2023-02-272-5/+5
After a memory error happens on a clean folio, a process unexpectedly receives SIGBUS when it accesses the error page. This SIGBUS killing is pointless and simply degrades the level of RAS of the system, because the clean folio can be dropped without any data loss during memory error handling, as we do for clean pagecache.

When memory_failure() is called on a clean folio, try_to_unmap() is called twice (once from split_huge_page() and once from hwpoison_user_mappings()). The root cause of the issue is that the pte conversion to a hwpoisoned entry is now done in the first call of try_to_unmap(), because PageHWPoison is already set at this point, while it is actually expected to be done in the second call. This behavior disturbs error handling operations like removing the pagecache, which results in the malfunction described above.

So convert TTU_IGNORE_HWPOISON into TTU_HWPOISON and set TTU_HWPOISON only when we really intend to convert the pte to a hwpoison entry. This prevents other callers of try_to_unmap() from accidentally converting to hwpoison entries.

Link: https://lkml.kernel.org/r/20230221085905.1465385-1-naoya.horiguchi@linux.dev
Fixes: a42634a6c07d ("readahead: Use a folio in read_pages()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
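A simplified sketch of the intended flag usage after the rename (hedged: the condition and the surrounding logic in hwpoison_user_mappings() are more involved than shown, and the clean-folio test below is illustrative only):

	enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC | TTU_HWPOISON;

	/*
	 * A clean folio can simply be dropped: do not ask rmap to install
	 * hwpoison swap entries, so no process gets a needless SIGBUS on
	 * its next access.
	 */
	if (folio_is_clean_and_droppable)	/* illustrative condition */
		ttu &= ~TTU_HWPOISON;

	try_to_unmap(folio, ttu);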
| * | mm/damon/paddr: fix missing folio_put()andrew.yang2023-02-271-4/+3
damon_get_folio() always increases the folio _refcount, and folio_isolate_lru() increases it again if the folio's LRU flag is set. If an unevictable folio is isolated successfully, there are therefore two extra references. The one from folio_isolate_lru() is dropped in folio_putback_lru(), but the one from damon_get_folio() is left behind, leaving the folio pinned. Whatever the case, the _refcount taken by damon_get_folio() should be dropped.

Link: https://lkml.kernel.org/r/20230222064223.6735-1-andrew.yang@mediatek.com
Fixes: 57223ac29584 ("mm/damon/paddr: support the pageout scheme")
Signed-off-by: andrew.yang <andrew.yang@mediatek.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [5.16.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
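A simplified sketch of the fixed flow (close to, but not verbatim, the damon_pa_pageout() loop; it assumes folio_isolate_lru() returns true on success, per the conversion further down this log): the reference taken by damon_get_folio() is now dropped on every path, whether or not isolation succeeds:

	struct folio *folio = damon_get_folio(pfn);	/* takes a reference */

	if (!folio)
		return;

	folio_clear_referenced(folio);
	folio_test_clear_young(folio);
	if (folio_isolate_lru(folio)) {
		if (folio_test_unevictable(folio))
			folio_putback_lru(folio);	/* drops the isolation ref */
		else
			list_add(&folio->lru, &folio_list);
	}
	folio_put(folio);	/* always drop damon_get_folio()'s reference */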
| * | mm/mremap: fix dup_anon_vma() in vma_merge() case 4Vlastimil Babka2023-02-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case 4, we are shrinking 'prev' (PPPP in the comment) and expanding 'mid' (NNNN). So we need to make sure 'mid' clones the anon_vma from 'prev', if it doesn't have any. After commit 0503ea8f5ba7 ("mm/mmap: remove __vma_adjust()") we can fail to do that due to wrong parameters for dup_anon_vma(). The call is a no-op because res == next, adjust == mid and mid == next. Fix it. Link: https://lkml.kernel.org/r/ad91d62b-37eb-4b73-707a-3c45c9e16256@suse.cz Fixes: 0503ea8f5ba7 ("mm/mmap: remove __vma_adjust()") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
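Conceptually (identifiers simplified; this is not a verbatim excerpt from mm/mmap.c), the corrected call in case 4 clones prev's anon_vma into the VMA that is being expanded instead of degenerating into a self-copy:

	/*
	 * Case 4: 'prev' (PPPP) shrinks and 'mid' (NNNN) expands downwards
	 * over the freed range, so 'mid' must inherit prev's anon_vma when
	 * it has none of its own.
	 */
	err = dup_anon_vma(mid, prev);	/* dst = mid, src = prev (sketch) */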
* | | Merge tag 'memblock-v6.3-rc1' of ↵Linus Torvalds2023-02-271-14/+27
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:
 "Small optimizations:

  - fix off-by-one in the check whether memblock_add_range() should
    reallocate memory to accommodate newly inserted range

  - check only for relevant regions in memblock_merge_regions() rather
    than swipe over the entire array"

* tag 'memblock-v6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: Avoid useless checks in memblock_merge_regions().
  memblock: Make a boundary tighter in memblock_add_range().
| * | | memblock: Avoid useless checks in memblock_merge_regions().Peng Zhang2023-01-311-12/+25
memblock_merge_regions() is called after regions have been modified to merge the neighboring compatible regions. That checks all regions, but most of the checks are useless. Most of the time we only insert one or a few new regions, or modify one or a few regions, so we don't need to check all the regions; we only need to check the changed regions, because other unrelated regions cannot be merged.

Add two parameters to memblock_merge_regions() to indicate the lower and upper boundary to scan.

Debug code that counts the total number of iterations in memblock_merge_regions(), for instance:

	void memblock_merge_regions(struct memblock_type *type)
	{
		static int iteration_count = 0;
		static int max_nr_regions = 0;

		max_nr_regions = max(max_nr_regions, (int)type->cnt);
		...
		while () {
			iteration_count++;
			...
		}
		pr_info("iteration_count: %d max_nr_regions %d",
			iteration_count, max_nr_regions);
	}

produces the following numbers on a physical machine with 1T of memory:

	before: [2.472243] iteration_count: 45410 max_nr_regions 178
	after:  [2.470869] iteration_count: 923 max_nr_regions 176

The actual startup speed seems to change little, but it does reduce the scan overhead.

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Link: https://lore.kernel.org/r/20230129090034.12310-3-zhangpeng.00@bytedance.com
[rppt: massaged the changelog]
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
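A sketch of the narrowed scan, adapted from the description above (simplified; not guaranteed to match the upstream function line for line): the caller passes the index range it just touched, and only that window, plus the region immediately before it, is considered for merging:

	static void memblock_merge_regions(struct memblock_type *type,
					   unsigned long start_rgn,
					   unsigned long end_rgn)
	{
		int i = 0;

		if (start_rgn)
			i = start_rgn - 1;	/* region before the change may now merge */
		end_rgn = min(end_rgn, type->cnt - 1);

		while (i < end_rgn) {
			struct memblock_region *this = &type->regions[i];
			struct memblock_region *next = &type->regions[i + 1];

			if (this->base + this->size != next->base ||
			    memblock_get_region_node(this) !=
			    memblock_get_region_node(next) ||
			    this->flags != next->flags) {
				i++;
				continue;
			}

			/* merge 'next' into 'this' and close the gap */
			this->size += next->size;
			memmove(next, next + 1,
				(type->cnt - (i + 2)) * sizeof(*next));
			type->cnt--;
			end_rgn--;
		}
	}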
| * | | memblock: Make a boundary tighter in memblock_add_range().Peng Zhang2023-01-311-2/+2
When type->cnt * 2 + 1 is less than or equal to type->max, there are enough empty regions to insert into.

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Link: https://lore.kernel.org/r/20230129090034.12310-2-zhangpeng.00@bytedance.com
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
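A minimal sketch of the tightened fast-path check in memblock_add_range() (comment paraphrased from the reasoning above; the pre-patch comparison is recalled as strict, so treat that detail as an assumption):

	/*
	 * Worst case: the new range overlaps every existing region and needs
	 * type->cnt + 1 free slots, so cnt * 2 + 1 <= max already guarantees
	 * enough room and the counting pass can be skipped.
	 */
	if (type->cnt * 2 + 1 <= type->max)
		insert = true;	/* boundary now inclusive */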
* | | | mm/mprotect: Fix successful vma_merge() of next in do_mprotect_pkey()Liam R. Howlett2023-02-251-0/+1
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If mprotect_fixup() successfully calls vma_merge() and replaces vma and the next vma, then the tmp variable in the do_mprotect_pkey() is not updated to point to the new vma end. This results in the loop detecting a gap between VMAs that does not exist. Fix the faulty value of tmp by setting it to the end location of the vma iterator at the end of the loop. Link: https://lkml.kernel.org/r/20230224212055.1786100-1-Liam.Howlett@oracle.com Fixes: 2286a6914c77 ("mm: change mprotect_fixup to vma iterator") Link: https://lore.kernel.org/linux-mm/20230223120407.729110a6ecd1416ac59d9cb0@linux-foundation.org/ Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Bert Karwatzki <spasswolf@web.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217061 Tested-by: Bert Karwatzki <spasswolf@web.de> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/linux-mm/CAHk-=wjFmVL7NiuxL54qLkoabni_yD-oF9=dpDgETtdsiCbhUg@mail.gmail.com/ Tested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
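A minimal sketch of the loop tail after the fix, simplified from do_mprotect_pkey() and assuming a vma_iter_end()-style helper that reports the iterator's current end address (treat the helper name as an assumption):

	error = mprotect_fixup(&vmi, &tlb, vma, &prev, nstart, tmp, newflags);
	if (error)
		break;

	/* A merge may have swallowed the next VMA; resync 'tmp' with the
	 * iterator so the gap check does not see a stale end address. */
	tmp = vma_iter_end(&vmi);
	nstart = tmp;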
* | | Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds2023-02-2396-4007/+6113
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". 
- Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
| * | | mm/memory_hotplug: cleanup return value handing in do_migrate_range()SeongJae Park2023-02-201-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return value mechanism of do_migrate_range() is not very simple, while no caller of the function checks the return value. Make the function return nothing to be more simple, and cleanup related unnecessary code. Link: https://lkml.kernel.org/r/20230216170703.64574-1-sj@kernel.org Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | mm/uffd: fix comment in handling pte markersPeter Xu2023-02-201-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The comment is obsolete after f369b07c8614 ("mm/uffd: reset write protection when unregister with wp-mode", 2022-08-20). Remove it. Link: https://lkml.kernel.org/r/20230215205800.223549-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | mm: change to return bool for isolate_movable_page()Baolin Wang2023-02-204-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now the isolate_movable_page() can only return 0 or -EBUSY, and no users will care about the negative return value, thus we can convert the isolate_movable_page() to return a boolean value to make the code more clear when checking the movable page isolation state. No functional changes intended. [akpm@linux-foundation.org: remove unneeded comment, per Matthew] Link: https://lkml.kernel.org/r/cb877f73f4fff8d309611082ec740a7065b1ade0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
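Sketched below is the shape of the conversion and of a typical caller (illustrative, not the full diff; surrounding code and labels are simplified):

	/* After the conversion, the prototype reports success directly: */
	bool isolate_movable_page(struct page *page, isolate_mode_t mode);

	/* Caller side, e.g. in the compaction scanner (simplified): */
	if (unlikely(__PageMovable(page))) {
		if (isolate_movable_page(page, mode))	/* true == isolated */
			goto isolate_success;
		goto isolate_fail;
	}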
| * | | mm: hugetlb: change to return bool for isolate_hugetlb()Baolin Wang2023-02-204-11/+13
Now isolate_hugetlb() only returns 0 or -EBUSY, and most users do not care about the negative value, so we can convert isolate_hugetlb() to return a boolean value to make the code clearer when checking the hugetlb isolation state. Also convert the 2 users that do consider the negative value returned by isolate_hugetlb(). No functional changes intended.

[akpm@linux-foundation.org: shorten locked section, per SeongJae Park]
Link: https://lkml.kernel.org/r/12a287c5bebc13df304387087bbecc6421510849.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | mm: change to return bool for isolate_lru_page()Baolin Wang2023-02-208-22/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The isolate_lru_page() can only return 0 or -EBUSY, and most users did not care about the negative error of isolate_lru_page(), except one user in add_page_for_migration(). So we can convert the isolate_lru_page() to return a boolean value, which can help to make the code more clear when checking the return value of isolate_lru_page(). Also convert all users' logic of checking the isolation state. No functional changes intended. Link: https://lkml.kernel.org/r/3074c1ab628d9dbf139b33f248a8bc253a3f95f0.1676424378.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | mm: change to return bool for folio_isolate_lru()Baolin Wang2023-02-208-13/+19
Patch series "Change the return value for page isolation functions", v3.

Currently the page isolation functions do not return a boolean to indicate success or not; instead they return a negative error when they fail to isolate a page. So the code below, which is used in most places, looks like a boolean success/failure check, and can confuse people about whether the isolation succeeded:

	if (folio_isolate_lru(folio))
		continue;

Moreover, the page isolation functions only return 0 or -EBUSY, and most users do not care about the negative error except for a few users, so we can convert all page isolation functions to return a boolean value, which removes the confusion and makes the code clearer. No functional changes intended in this patch series.

This patch (of 4):

Currently folio_isolate_lru() does not return a boolean value to indicate isolation success or not, however the code below, which checks the return value, can make people think that it is a boolean success/failure thing, which makes it easy to make mistakes (see the fix patch[1]):

	if (folio_isolate_lru(folio))
		continue;

Thus it is better to explicitly check the negative error value returned by folio_isolate_lru(), which makes the code clearer per Linus's suggestion[2]. Moreover, Matthew suggested we can convert the isolation functions to return a boolean[3], since most users do not care about the negative error value, and this also removes the confusion when checking the return value.

So this patch converts folio_isolate_lru() to return a boolean value: 'true' means the folio was isolated successfully, and 'false' means the isolation failed. Meanwhile, change all users' logic of checking the isolation state. No functional changes intended.

[1] https://lore.kernel.org/all/20230131063206.28820-1-Kuan-Ying.Lee@mediatek.com/T/#u
[2] https://lore.kernel.org/all/CAHk-=wiBrY+O-4=2mrbVyxR+hOqfdJ=Do6xoucfJ9_5az01L4Q@mail.gmail.com/
[3] https://lore.kernel.org/all/Y+sTFqwMNAjDvxw3@casper.infradead.org/

Link: https://lkml.kernel.org/r/cover.1676424378.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/8a4e3679ed4196168efadf7ea36c038f2f7d5aa9.1676424378.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
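As a short illustration of the new convention (a sketch of the call sites' shape, not an exact diff from any one file), the same test now reads naturally:

	/* Before the series: 0 meant success, so this skips on *failure*
	 * even though it reads like a success check. */
	if (folio_isolate_lru(folio))
		continue;

	/* After the series: true means success, and the intent is explicit. */
	if (!folio_isolate_lru(folio))
		continue;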
| * | | kmsan: disable ftrace in kmsan core codeArnd Bergmann2023-02-201-1/+7
objtool warns about some suspicious code inside kmsan:

	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_load_n+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_store_n+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_load_1+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_store_1+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_load_2+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_store_2+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_load_4+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_store_4+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_load_8+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_metadata_ptr_for_store_8+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_instrument_asm_store+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_chain_origin+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_poison_alloca+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_warning+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: __msan_get_context_state+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: kmsan_copy_to_user+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: kmsan_unpoison_memory+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: kmsan_unpoison_entry_regs+0x4: call to __fentry__() with UACCESS enabled
	vmlinux.o: warning: objtool: kmsan_report+0x4: call to __fentry__() with UACCESS enabled

The Makefile contained a line to turn off ftrace for the entire directory, but this does not work. Replace it with individual lines, matching the approach in kasan.

Link: https://lkml.kernel.org/r/20230215130058.3836177-3-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core")
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | kasan: mark addr_has_metadata __always_inlineArnd Bergmann2023-02-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "objtool warning fixes", v2. These are three of the easier fixes for objtool warnings around kasan/kmsan/kcsan. I dropped one patch since Peter had come up with a better fix, and adjusted the changelog text based on feedback. This patch (of 3): When the compiler decides not to inline this function, objtool complains about incorrect UACCESS state: mm/kasan/generic.o: warning: objtool: __asan_load2+0x11: call to addr_has_metadata() with UACCESS enabled Link: https://lore.kernel.org/all/20230208164011.2287122-1-arnd@kernel.org/ Link: https://lkml.kernel.org/r/20230215130058.3836177-2-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
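A sketch of the change (the body shown is the generic-KASAN flavour, slightly simplified): forcing inlining keeps the helper from becoming an out-of-line call inside the UACCESS-enabled __asan_load*()/__asan_store*() paths that objtool checks:

	static __always_inline bool addr_has_metadata(const void *addr)
	{
		/* Only addresses above the start of the mapped shadow have metadata. */
		return (kasan_reset_tag(addr) >=
			kasan_shadow_to_mem((void *)KASAN_SHADOW_START));
	}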
| * | | mm: memcontrol: rename memcg_kmem_enabled()Roman Gushchin2023-02-165-15/+15
Currently there are two kmem-related helper functions with confusing semantics: memcg_kmem_enabled() and mem_cgroup_kmem_disabled(). The problem is that the obvious expectation that memcg_kmem_enabled() == !mem_cgroup_kmem_disabled() can be false.

mem_cgroup_kmem_disabled() is similar to mem_cgroup_disabled(): it returns true only if CONFIG_MEMCG_KMEM is not set or the kmem accounting is disabled using the boot-time kernel option "cgroup.memory=nokmem". Its value never changes dynamically.

memcg_kmem_enabled() is different: it always returns false until the first non-root memory cgroup comes online (assuming the kernel memory accounting is enabled). Its goal is to improve performance on systems without the cgroupfs mounted/memory controller enabled, or on systems with only the root memory cgroup.

To make things more obvious and avoid potential bugs, let's rename memcg_kmem_enabled() to memcg_kmem_online().

Link: https://lkml.kernel.org/r/20230213192922.1146370-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
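A minimal sketch of the renamed helper (the static-key name is an assumption made for illustration): it flips to true only once the first non-root memcg comes online with kernel memory accounting enabled, unlike mem_cgroup_kmem_disabled(), which is fixed at boot:

	static inline bool memcg_kmem_online(void)
	{
		return static_branch_likely(&memcg_kmem_online_key);
	}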
| * | | mm: percpu: fix incorrect size in pcpu_obj_full_size()Yafang Shao2023-02-161-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The extra space which is used to store the obj_cgroup membership is only valid when kmemcg is enabled. The kmemcg can be disabled via the kernel parameter "cgroup.memory=nokmem" at boot time. This helper is also used in non-memcg code, for example the tracepoint, so we should fix it. It was found by code review when I was implementing bpf memory usage[1]. No real issue happens in production environment. [1]. https://lwn.net/Articles/921991/ Link: https://lkml.kernel.org/r/20230214153549.12291-1-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Vasily Averin <vvs@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
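A sketch of the fixed helper, close to the behaviour described above (simplified; treat the exact identifiers as approximate): the obj_cgroup pointer array is only accounted when kmemcg is actually usable:

	static inline size_t pcpu_obj_full_size(size_t size)
	{
		size_t extra_size = 0;

	#ifdef CONFIG_MEMCG_KMEM
		if (!mem_cgroup_kmem_disabled())
			extra_size += size / PCPU_MIN_ALLOC_SIZE *
				      sizeof(struct obj_cgroup *);
	#endif

		return size * num_possible_cpus() + extra_size;
	}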
| * | | mm: page_alloc: call panic() when memoryless node allocation failsQi Zheng2023-02-161-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In free_area_init(), we will continue to run after allocation of memoryless node pgdat fails. However, in the subsequent process (such as when initializing zonelist), the case that NODE_DATA(nid) is NULL is not handled, which will cause panic. Instead of this, it's better to call panic() directly when the memory allocation fails during system boot. Link: https://lkml.kernel.org/r/20230212111027.95520-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
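A minimal sketch of the change in free_area_init() (the allocation helper name is illustrative): a NULL pgdat for a memoryless node is now fatal at boot instead of being tolerated and dereferenced later, e.g. while building zonelists:

	pgdat = arch_alloc_nodedata(nid);	/* helper name assumed for illustration */
	if (!pgdat)
		panic("Cannot allocate %zuB for node %d.\n",
		      sizeof(*pgdat), nid);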
| * | | mm: multi-gen LRU: avoid futile retriesYu Zhao2023-02-161-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recall that the per-node memcg LRU has two generations and they alternate when the last memcg (of a given node) is moved from one to the other. Each generation is also sharded into multiple bins to improve scalability. A reclaimer starts with a random bin (in the old generation) and, if it fails, it will retry, i.e., to try the rest of the bins. If a reclaimer fails with the last memcg, it should move this memcg to the young generation first, which causes the generations to alternate, and then retry. Otherwise, the retries will be futile because all other bins are empty. Link: https://lkml.kernel.org/r/20230213075322.1416966-1-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
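A purely conceptual sketch of the retry rule described above (the helpers marked below are hypothetical, and lru_gen_rotate_memcg()/MEMCG_LRU_YOUNG are recalled from the MGLRU code rather than quoted, so treat them as assumptions too): when the shrink attempt on the last memcg of the old generation fails, that memcg is rotated into the young generation before retrying, so the generations alternate and later retries see non-empty bins:

	if (!shrunk && last_memcg_in_old_gen(lruvec))	/* hypothetical helpers */
		lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);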
| * | | migrate_pages: move THP/hugetlb migration support check to simplify codeHuang Ying2023-02-161-47/+36
This is a code cleanup patch; no functional change is expected. After the change, the line count is reduced, especially in the long migrate_pages_batch().

Link: https://lkml.kernel.org/r/20230213123444.155149-10-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>