summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'sysctl-5.19-rc1' of ↵Linus Torvalds2022-05-262-13/+129
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "For two kernel releases now kernel/sysctl.c has been being cleaned up slowly, since the tables were grossly long, sprinkled with tons of #ifdefs and all this caused merge conflicts with one susbystem or another. This tree was put together to help try to avoid conflicts with these cleanups going on different trees at time. So nothing exciting on this pull request, just cleanups. Thanks a lot to the Uniontech and Huawei folks for doing some of this nasty work" * tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (28 commits) sched: Fix build warning without CONFIG_SYSCTL reboot: Fix build warning without CONFIG_SYSCTL kernel/kexec_core: move kexec_core sysctls into its own file sysctl: minor cleanup in new_dir() ftrace: fix building with SYSCTL=y but DYNAMIC_FTRACE=n fs/proc: Introduce list_for_each_table_entry for proc sysctl mm: fix unused variable kernel warning when SYSCTL=n latencytop: move sysctl to its own file ftrace: fix building with SYSCTL=n but DYNAMIC_FTRACE=y ftrace: Fix build warning ftrace: move sysctl_ftrace_enabled to ftrace.c kernel/do_mount_initrd: move real_root_dev sysctls to its own file kernel/delayacct: move delayacct sysctls to its own file kernel/acct: move acct sysctls to its own file kernel/panic: move panic sysctls to its own file kernel/lockdep: move lockdep sysctls to its own file mm: move page-writeback sysctls to their own file mm: move oom_kill sysctls to their own file kernel/reboot: move reboot sysctls to its own file sched: Move energy_aware sysctls to topology.c ...
| * mm: fix unused variable kernel warning when SYSCTL=nLuis Chamberlain2022-04-211-3/+4
| | | | | | | | | | | | | | | | | | | | | | When CONFIG_SYSCTL=n the variable dirty_bytes_min which is just used as a minimum to a proc handler is not used. So just move this under the ifdef for CONFIG_SYSCTL. Fixes: aa779e510219 ("mm: move page-writeback sysctls to their own file") Reported-by: kernel test robot <lkp@intel.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
| * mm: move page-writeback sysctls to their own filezhanglianjie2022-04-061-10/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain. To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic. So move the page-writeback sysctls to its own file. [akpm@linux-foundation.org: coding-style cleanups] akpm@linux-foundation.org: fix CONFIG_SYSCTL=n warnings] Link: https://lkml.kernel.org/r/20220129012955.26594-1-zhanglianjie@uniontech.com Signed-off-by: zhanglianjie <zhanglianjie@uniontech.com> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
| * mm: move oom_kill sysctls to their own filesujiaxun2022-04-061-3/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kernel/sysctl.c is a kitchen sink where everyone leaves their dirty dishes, this makes it very difficult to maintain. To help with this maintenance let's start by moving sysctls to places where they actually belong. The proc sysctl maintainers do not want to know what sysctl knobs you wish to add for your own piece of code, we just care about the core logic. So move the oom_kill sysctls to their own file, mm/oom_kill.c [sfr@canb.auug.org.au: null-terminate the array] Link: https://lkml.kernel.org/r/20220216193202.28838626@canb.auug.org.au Link: https://lkml.kernel.org/r/20220215093203.31032-1-sujiaxun@uniontech.com Signed-off-by: sujiaxun <sujiaxun@uniontech.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
* | Merge tag 'mm-stable-2022-05-25' of ↵Linus Torvalds2022-05-2676-2735/+5256
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Almost all of MM here. A few things are still getting finished off, reviewed, etc. - Yang Shi has improved the behaviour of khugepaged collapsing of readonly file-backed transparent hugepages. - Johannes Weiner has arranged for zswap memory use to be tracked and managed on a per-cgroup basis. - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime enablement of the recent huge page vmemmap optimization feature. - Baolin Wang contributes a series to fix some issues around hugetlb pagetable invalidation. - Zhenwei Pi has fixed some interactions between hwpoisoned pages and virtualization. - Tong Tiangen has enabled the use of the presently x86-only page_table_check debugging feature on arm64 and riscv. - David Vernet has done some fixup work on the memcg selftests. - Peter Xu has taught userfaultfd to handle write protection faults against shmem- and hugetlbfs-backed files. - More DAMON development from SeongJae Park - adding online tuning of the feature and support for monitoring of fixed virtual address ranges. Also easier discovery of which monitoring operations are available. - Nadav Amit has done some optimization of TLB flushing during mprotect(). - Neil Brown continues to labor away at improving our swap-over-NFS support. - David Hildenbrand has some fixes to anon page COWing versus get_user_pages(). - Peng Liu fixed some errors in the core hugetlb code. - Joao Martins has reduced the amount of memory consumed by device-dax's compound devmaps. - Some cleanups of the arch-specific pagemap code from Anshuman Khandual. - Muchun Song has found and fixed some errors in the TLB flushing of transparent hugepages. - Roman Gushchin has done more work on the memcg selftests. ... and, of course, many smaller fixes and cleanups. Notably, the customary million cleanup serieses from Miaohe Lin" * tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits) mm: kfence: use PAGE_ALIGNED helper selftests: vm: add the "settings" file with timeout variable selftests: vm: add "test_hmm.sh" to TEST_FILES selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests selftests: vm: add migration to the .gitignore selftests/vm/pkeys: fix typo in comment ksm: fix typo in comment selftests: vm: add process_mrelease tests Revert "mm/vmscan: never demote for memcg reclaim" mm/kfence: print disabling or re-enabling message include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace" include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion" mm: fix a potential infinite loop in start_isolate_page_range() MAINTAINERS: add Muchun as co-maintainer for HugeTLB zram: fix Kconfig dependency warning mm/shmem: fix shmem folio swapoff hang cgroup: fix an error handling path in alloc_pagecache_max_30M() mm: damon: use HPAGE_PMD_SIZE tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate nodemask.h: fix compilation error with GCC12 ...
| * | mm: kfence: use PAGE_ALIGNED helperKefeng Wang2022-05-251-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use PAGE_ALIGNED macro instead of IS_ALIGNED and passing PAGE_SIZE. Link: https://lkml.kernel.org/r/20220520021833.121405-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | ksm: fix typo in commentJulia Lawall2022-05-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Spelling mistake (triple letters) in comment. Detected with the help of Coccinelle. Link: https://lkml.kernel.org/r/20220521111145.81697-94-Julia.Lawall@inria.fr Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | Revert "mm/vmscan: never demote for memcg reclaim"Johannes Weiner2022-05-251-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a. Its premise was that cgroup reclaim cares about freeing memory inside the cgroup, and demotion just moves them around within the cgroup limit. Hence, pages from toptier nodes should be reclaimed directly. However, with NUMA balancing now doing tier promotions, demotion is part of the page aging process. Global reclaim demotes the coldest toptier pages to secondary memory, where their life continues and from which they have a chance to get promoted back. Essentially, tiered memory systems have an LRU order that spans multiple nodes. When cgroup reclaims pages coming off the toptier directly, there can be colder pages on lower tier nodes that were demoted by global reclaim. This is an aging inversion, not unlike if cgroups were to reclaim directly from the active lists while there are inactive pages. Proactive reclaim is another factor. The goal of that it is to offload colder pages from expensive RAM to cheaper storage. When lower tier memory is available as an intermediate layer, we want offloading to take advantage of it instead of bypassing to storage. Revert the patch so that cgroups respect the LRU order spanning the memory hierarchy. Of note is a specific undercommit scenario, where all cgroup limits in the system add up to <= available toptier memory. In that case, shuffling pages out to lower tiers first to reclaim them from there is inefficient. This is something could be optimized/short-circuited later on (although care must be taken not to accidentally recreate the aging inversion). Let's ensure correctness first. Link: https://lkml.kernel.org/r/20220518190911.82400-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/kfence: print disabling or re-enabling messageJackie Liu2022-05-251-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By printing information, we can friendly prompt the status change information of kfence by dmesg and record by syslog. Also, set kfence_enabled to false only when needed. Link: https://lkml.kernel.org/r/20220518073105.3160335-1-liu.yun@linux.dev Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Co-developed-by: Marco Elver <elver@google.com> Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: fix a potential infinite loop in start_isolate_page_range()Zi Yan2022-05-252-13/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In isolate_single_pageblock() called by start_isolate_page_range(), there are some pageblock isolation issues causing a potential infinite loop when isolating a page range. This is reported by Qian Cai. 1. the pageblock was isolated by just changing pageblock migratetype without checking unmovable pages. Calling set_migratetype_isolate() to isolate pageblock properly. 2. an off-by-one error caused migrating pages unnecessarily, since the page is not crossing pageblock boundary. 3. migrating a compound page across pageblock boundary then splitting the free page later has a small race window that the free page might be allocated again, so that the code will try again, causing an potential infinite loop. Temporarily set the to-be-migrated page's pageblock to MIGRATE_ISOLATE to prevent that and bail out early if no free page is found after page migration. An additional fix to split_free_page() aims to avoid crashing in __free_one_page(). When the free page is split at the specified split_pfn_offset, free_page_order should check both the first bit of free_page_pfn and the last bit of split_pfn_offset and use the smaller one. For example, if free_page_pfn=0x10000, split_pfn_offset=0xc000, free_page_order should first be 0x8000 then 0x4000, instead of 0x4000 then 0x8000, which the original algorithm did. [akpm@linux-foundation.org: suppress min() warning] Link: https://lkml.kernel.org/r/20220524194756.1698351-1-zi.yan@sent.com Fixes: b2c9e2fbba3253 ("mm: make alloc_contig_range work at pageblock granularity") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Qian Cai <quic_qiancai@quicinc.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Ren <renzhengeek@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/shmem: fix shmem folio swapoff hangHugh Dickins2022-05-251-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shmem swapoff makes no progress: the index to indices is not incremented. But "ret" is no longer a return value, so use folio_batch_count() instead. Link: https://lkml.kernel.org/r/c32bee8a-f0aa-245-f94e-24dd271924fa@google.com Fixes: da08e9b79323 ("mm/shmem: convert shmem_swapin_page() to shmem_swapin_folio()") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: damon: use HPAGE_PMD_SIZEKefeng Wang2022-05-193-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Use HPAGE_PMD_SIZE instead of open coding. Link: https://lkml.kernel.org/r/20220517145120.118523-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: fix missing handler for __GFP_NOWARNQi Zheng2022-05-193-8/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We expect no warnings to be issued when we specify __GFP_NOWARN, but currently in paths like alloc_pages() and kmalloc(), there are still some warnings printed, fix it. But for some warnings that report usage problems, we don't deal with them. If such warnings are printed, then we should fix the usage problems. Such as the following case: WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); [zhengqi.arch@bytedance.com: v2] Link: https://lkml.kernel.org/r/20220511061951.1114-1-zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/20220510113809.80626-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()Wonhyuk Yang2022-05-191-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, trace point mm_page_alloc_zone_locked() doesn't show correct information. First, when alloc_flag has ALLOC_HARDER/ALLOC_CMA, page can be allocated from MIGRATE_HIGHATOMIC/MIGRATE_CMA. Nevertheless, tracepoint use requested migration type not MIGRATE_HIGHATOMIC and MIGRATE_CMA. Second, after commit 44042b4498728 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") percpu-list can store high order pages. But trace point determine whether it is a refiil of percpu-list by comparing requested order and 0. To handle these problems, make mm_page_alloc_zone_locked() only be called by __rmqueue_smallest with correct migration type. With a new argument called percpu_refill, it can show roughly whether it is a refill of percpu-list. Link: https://lkml.kernel.org/r/20220512025307.57924-1-vvghjk1234@gmail.com Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Baik Song An <bsahn@etri.re.kr> Cc: Hong Yeon Kim <kimhy@etri.re.kr> Cc: Taeung Song <taeung@reallinux.co.kr> Cc: <linuxgeek@linuxgeek.io> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/page_owner.c: add missing __initdata attributeFanjun Kong2022-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes two issues: 1. Add __initdata attribute according to include/linux/init.h: For initialized data: You should insert __initdata between the variable name and equal sign followed by value 2. Fix below error reported by checkpatch.pl: ERROR: do not initialise statics to false Special thanks to Muchun Song :) Link: https://lkml.kernel.org/r/20220516030039.1487005-1-bh1scw@gmail.com Signed-off-by: Fanjun Kong <bh1scw@gmail.com> Suggested-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | tmpfs: fix undefined-behaviour in shmem_reconfigure()Luo Meng2022-05-191-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When shmem_reconfigure() calls __percpu_counter_compare(), the second parameter is unsigned long long. But in the definition of __percpu_counter_compare(), the second parameter is s64. So when __percpu_counter_compare() executes abs(count - rhs), UBSAN shows the following warning: ================================================================================ UBSAN: Undefined behaviour in lib/percpu_counter.c:209:6 signed integer overflow: 0 - -9223372036854775808 cannot be represented in type 'long long int' CPU: 1 PID: 9636 Comm: syz-executor.2 Tainted: G ---------r- - 4.18.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack home/install/linux-rh-3-10/lib/dump_stack.c:77 [inline] dump_stack+0x125/0x1ae home/install/linux-rh-3-10/lib/dump_stack.c:117 ubsan_epilogue+0xe/0x81 home/install/linux-rh-3-10/lib/ubsan.c:159 handle_overflow+0x19d/0x1ec home/install/linux-rh-3-10/lib/ubsan.c:190 __percpu_counter_compare+0x124/0x140 home/install/linux-rh-3-10/lib/percpu_counter.c:209 percpu_counter_compare home/install/linux-rh-3-10/./include/linux/percpu_counter.h:50 [inline] shmem_remount_fs+0x1ce/0x6b0 home/install/linux-rh-3-10/mm/shmem.c:3530 do_remount_sb+0x11b/0x530 home/install/linux-rh-3-10/fs/super.c:888 do_remount home/install/linux-rh-3-10/fs/namespace.c:2344 [inline] do_mount+0xf8d/0x26b0 home/install/linux-rh-3-10/fs/namespace.c:2844 ksys_mount+0xad/0x120 home/install/linux-rh-3-10/fs/namespace.c:3075 __do_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3089 [inline] __se_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3086 [inline] __x64_sys_mount+0xbf/0x160 home/install/linux-rh-3-10/fs/namespace.c:3086 do_syscall_64+0xca/0x5c0 home/install/linux-rh-3-10/arch/x86/entry/common.c:298 entry_SYSCALL_64_after_hwframe+0x6a/0xdf RIP: 0033:0x46b5e9 Code: 5d db fa ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b db fa ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f54d5f22c68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 000000000077bf60 RCX: 000000000046b5e9 RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000000 RBP: 000000000077bf60 R08: 0000000020000140 R09: 0000000000000000 R10: 00000000026740a4 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd1fb1592f R14: 00007f54d5f239c0 R15: 000000000077bf6c ================================================================================ [akpm@linux-foundation.org: tweak error message text] Link: https://lkml.kernel.org/r/20220513025225.2678727-1-luomeng12@huawei.com Signed-off-by: Luo Meng <luomeng12@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/mempolicy: fix uninit-value in mpol_rebind_policy()Wang Cheng2022-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mpol_set_nodemask()(mm/mempolicy.c) does not set up nodemask when pol->mode is MPOL_LOCAL. Check pol->mode before access pol->w.cpuset_mems_allowed in mpol_rebind_policy()(mm/mempolicy.c). BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline] BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 mpol_rebind_policy mm/mempolicy.c:352 [inline] mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline] cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278 cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515 cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline] cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804 __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520 cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539 cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852 kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write fs/read_write.c:503 [inline] vfs_write+0x1318/0x2030 fs/read_write.c:590 ksys_write+0x28b/0x510 fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __x64_sys_write+0xdb/0x120 fs/read_write.c:652 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was created at: slab_post_alloc_hook mm/slab.h:524 [inline] slab_alloc_node mm/slub.c:3251 [inline] slab_alloc mm/slub.c:3259 [inline] kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264 mpol_new mm/mempolicy.c:293 [inline] do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853 kernel_set_mempolicy mm/mempolicy.c:1504 [inline] __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline] __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507 __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae KMSAN: uninit-value in mpol_rebind_task (2) https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc This patch seems to fix below bug too. KMSAN: uninit-value in mpol_rebind_mm (2) https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy(). When syzkaller reproducer runs to the beginning of mpol_new(), mpol_new() mm/mempolicy.c do_mbind() mm/mempolicy.c kernel_mbind() mm/mempolicy.c `mode` is 1(MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags` is 0. Then mode = MPOL_LOCAL; ... policy->mode = mode; policy->flags = flags; will be executed. So in mpol_set_nodemask(), mpol_set_nodemask() mm/mempolicy.c do_mbind() kernel_mbind() pol->mode is 4 (MPOL_LOCAL), that `nodemask` in `pol` is not initialized, which will be accessed in mpol_rebind_policy(). Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain Signed-off-by: Wang Cheng <wanngchenng@gmail.com> Reported-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com> Tested-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: don't be stuck to rmap lock on reclaim pathMinchan Kim2022-05-195-18/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended under memory pressure if processes keep working on their vmas(e.g., fork, mmap, munmap). It makes reclaim path stuck. In our real workload traces, we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it makes other processes entering direct reclaim, which were also stuck on the lock. This patch makes lru aging path try_lock mode like shink_page_list so the reclaim context will keep working with next lru pages without being stuck. if it found the rmap lock contended, it rotates the page back to head of lru in both active/inactive lrus to make them consistent behavior, which is basic starting point rather than adding more heristic. Since this patch introduces a new "contended" field as out-param along with try_lock in-param in rmap_walk_control, it's not immutable any longer if the try_lock is set so remove const keywords on rmap related functions. Since rmap walking is already expensive operation, I doubt the const would help sizable benefit( And we didn't have it until 5.17). In a heavy app workload in Android, trace shows following statistics. It almost removes rmap lock contention from reclaim path. Martin Liu reported: Before: max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function 1632 0 1631 151.542173 31672 209 page_lock_anon_vma_read 601 0 601 145.544681 28817 198 rmap_walk_file After: max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function NaN NaN NaN NaN NaN 0.0 NaN 0 0 0 0.127645 1 12 rmap_walk_file [minchan@kernel.org: add comment, per Matthew] Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: John Dias <joaodias@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | zswap: memcg accountingJohannes Weiner2022-05-192-15/+227
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Applications can currently escape their cgroup memory containment when zswap is enabled. This patch adds per-cgroup tracking and limiting of zswap backend memory to rectify this. The existing cgroup2 memory.stat file is extended to show zswap statistics analogous to what's in meminfo and vmstat. Furthermore, two new control files, memory.zswap.current and memory.zswap.max, are added to allow tuning zswap usage on a per-workload basis. This is important since not all workloads benefit from zswap equally; some even suffer compared to disk swap when memory contents don't compress well. The optimal size of the zswap pool, and the threshold for writeback, also depends on the size of the workload's warm set. The implementation doesn't use a traditional page_counter transaction. zswap is unconventional as a memory consumer in that we only know the amount of memory to charge once expensive compression has occurred. If zwap is disabled or the limit is already exceeded we obviously don't want to compress page upon page only to reject them all. Instead, the limit is checked against current usage, then we compress and charge. This allows some limit overrun, but not enough to matter in practice. [hannes@cmpxchg.org: fix for CONFIG_SLOB builds] Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org [hannes@cmpxchg.org: opt out of cgroups v1] Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: zswap: add basic meminfo and vmstat coverageJohannes Weiner2022-05-192-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently it requires poking at debugfs to figure out the size and population of the zswap cache on a host. There are no counters for reads and writes against the cache. As a result, it's difficult to understand zswap behavior on production systems. Print zswap memory consumption and how many pages are zswapped out in /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat. Link: https://lkml.kernel.org/r/20220510152847.230957-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: Kconfig: simplify zswap configurationJohannes Weiner2022-05-191-30/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - CONFIG_ZRAM: Zram is a user-facing feature, whereas zsmalloc is not. Don't make the user chase down a technical dependency like that, just select it in automatically when zram is requested. The CONFIG_CRYPTO dependency is redundant due to more specific deps. - CONFIG_ZPOOL: This is not a user-facing feature. Hide the symbol and have it selected in as needed. - CONFIG_ZSWAP: Select CRYPTO instead of depend. Common pattern. - Make the ZSWAP suboptions and their descriptions (compression, allocation backend) a bit more straight-forward for the user. Link: https://lkml.kernel.org/r/20220510152847.230957-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: Kconfig: group swap, slab, hotplug and thp options into submenusJohannes Weiner2022-05-191-217/+230
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several clusters of related config options spread throughout the mostly flat MM submenu. Group them together and put specialization options into further subdirectories to make the MM submenu a bit more organized and easier to navigate. [hannes@cmpxchg.org: fix kbuild warnings] Link: https://lkml.kernel.org/r/YnvkSVivfnT57Vwh@cmpxchg.org [hannes@cmpxchg.org: fix more kbuild warnings] Link: https://lkml.kernel.org/r/Ynz8NusTdEGcCnJN@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: Kconfig: move swap and slab config options to the MM sectionJohannes Weiner2022-05-191-0/+123
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These are currently under General Setup. MM seems like a better fit. Link: https://lkml.kernel.org/r/20220510152847.230957-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: fix comment about swap extentMiaohe Lin2022-05-191-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 4efaceb1c5f8 ("mm, swap: use rbtree for swap_extent"), rbtree is used for swap extent. Also curr_swap_extent is removed at that time. Update the corresponding comment. Link: https://lkml.kernel.org/r/20220509131416.17553-16-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: fix the comment of get_kernel_pagesMiaohe Lin2022-05-191-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If no pages were pinned, 0 is returned in fact. Fix the corresponding comment. [akpm@linux-foundation.org: s/nr_pages/nr_segs/ also, per David, reflow comment] Link: https://lkml.kernel.org/r/20220509131416.17553-15-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: clean up the comment of find_next_to_unuseMiaohe Lin2022-05-191-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 10a9c496789f ("mm: simplify try_to_unuse"), frontswap parameter is removed. Update the corresponding comment. Link: https://lkml.kernel.org/r/20220509131416.17553-14-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: add helper swap_offset_available()Miaohe Lin2022-05-191-16/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add helper swap_offset_available() to remove some duplicated codes. Minor readability improvement. [akpm@linux-foundation.org: s/swap_offset_available/swap_offset_available_and_locked/, per Neil] Link: https://lkml.kernel.org/r/20220509131416.17553-12-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: avoid calling swp_swap_info when try to check SWP_STABLE_WRITESMiaohe Lin2022-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use flags of si directly to check SWP_STABLE_WRITES to avoid possible READ_ONCE and thus save some cpu cycles. [akpm@linux-foundation.org: use data_race() on si->flags, per Neil] Link: https://lkml.kernel.org/r/20220509131416.17553-10-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: make page_swapcount and __lru_add_drain_all staticMiaohe Lin2022-05-192-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make page_swapcount and __lru_add_drain_all static. They are only used within the file now. Link: https://lkml.kernel.org/r/20220509131416.17553-9-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: remove unneeded p != NULL check in __swap_duplicateMiaohe Lin2022-05-191-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If p is NULL, __swap_duplicate will already return -EINVAL. So if we reach here, p must be non-NULL. Link: https://lkml.kernel.org/r/20220509131416.17553-8-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: remove buggy cache->nr check in refill_swap_slots_cacheMiaohe Lin2022-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | refill_swap_slots_cache is always called when cache->nr is 0. So remove such buggy and confusing check. Link: https://lkml.kernel.org/r/20220509131416.17553-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: print bad swap offset entry in get_swap_deviceMiaohe Lin2022-05-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If offset exceeds the si->max, print bad swap offset entry to help debug the unexpected case. Link: https://lkml.kernel.org/r/20220509131416.17553-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: remove unneeded return value of free_swap_slotMiaohe Lin2022-05-191-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The return value of free_swap_slot is always 0 and also ignored now. Remove it to clean up the code. Link: https://lkml.kernel.org/r/20220509131416.17553-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: fold __swap_info_get() into its sole callerMiaohe Lin2022-05-191-18/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fold __swap_info_get() into its sole caller to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/20220509131416.17553-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: use helper macro __ATTR_RWMiaohe Lin2022-05-191-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use helper macro __ATTR_RW to define vma_ra_enabled_attr to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/20220509131416.17553-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/swap: use helper is_swap_pte() in swap_vma_readaheadMiaohe Lin2022-05-191-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "A few cleanup patches for swap". This series contains a few patches to fix the comment, remove unneeded return value, use some helpers and so on. More details can be found in the respective changelogs. This patch (of 14): Use helper is_swap_pte() to check whether pte is swap entry to make code more clear. Minor readability improvement. Link: https://lkml.kernel.org/r/20220509131416.17553-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220509131416.17553-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Howells <dhowells@redhat.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: mmap: register suitable readonly file vmas for khugepagedYang Shi2022-05-192-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The readonly FS THP relies on khugepaged to collapse THP for suitable vmas. But the behavior is inconsistent for "always" mode (https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/). The "always" mode means THP allocation should be tried all the time and khugepaged should try to collapse THP all the time. Of course the allocation and collapse may fail due to other factors and conditions. Currently file THP may not be collapsed by khugepaged even though all the conditions are met. That does break the semantics of "always" mode. So make sure readonly FS vmas are registered to khugepaged to fix the break. Register suitable vmas in common mmap path, that could cover both readonly FS vmas and shmem vmas, so remove the khugepaged calls in shmem.c. Still need to keep the khugepaged call in vma_merge() since vma_merge() is called in a lot of places, for example, madvise, mprotect, etc. Link: https://lkml.kernel.org/r/20220510203222.24246-9-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Song Liu <songliubraving@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Song Liu <song@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: khugepaged: introduce khugepaged_enter_vma() helperYang Shi2022-05-193-17/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The khugepaged_enter_vma_merge() actually does as the same thing as the khugepaged_enter() section called by shmem_mmap(), so consolidate them into one helper and rename it to khugepaged_enter_vma(). Link: https://lkml.kernel.org/r/20220510203222.24246-8-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <song@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: khugepaged: make hugepage_vma_check() non-staticYang Shi2022-05-191-16/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hugepage_vma_check() could be reused by khugepaged_enter() and khugepaged_enter_vma_merge(), but it is static in khugepaged.c. Make it non-static and declare it in khugepaged.h. Link: https://lkml.kernel.org/r/20220510203222.24246-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <song@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: khugepaged: make khugepaged_enter() void functionYang Shi2022-05-192-13/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The most callers of khugepaged_enter() don't care about the return value. Only dup_mmap(), anonymous THP page fault and MADV_HUGEPAGE handle the error by returning -ENOMEM. Actually it is not harmful for them to ignore the error case either. It also sounds overkilling to fail fork() and page fault early due to khugepaged_enter() error, and MADV_HUGEPAGE does set VM_HUGEPAGE flag regardless of the error. Link: https://lkml.kernel.org/r/20220510203222.24246-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: thp: only regular file could be THP eligibleYang Shi2022-05-192-16/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit a4aeaa06d45e ("mm: khugepaged: skip huge page collapse for special files"), khugepaged just collapses THP for regular file which is the intended usecase for readonly fs THP. Only show regular file as THP eligible accordingly. And make file_thp_enabled() available for khugepaged too in order to remove duplicate code. Link: https://lkml.kernel.org/r/20220510203222.24246-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: khugepaged: skip DAX vmaYang Shi2022-05-191-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The DAX vma may be seen by khugepaged when the mm has other khugepaged suitable vmas. So khugepaged may try to collapse THP for DAX vma, but it will fail due to page sanity check, for example, page is not on LRU. So it is not harmful, but it is definitely pointless to run khugepaged against DAX vma, so skip it in early check. Link: https://lkml.kernel.org/r/20220510203222.24246-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGEDYang Shi2022-05-191-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hugepage_vma_check() called by khugepaged_enter_vma_merge() does check VM_NO_KHUGEPAGED. Remove the check from caller and move the check in hugepage_vma_check() up. More checks may be run for VM_NO_KHUGEPAGED vmas, but MADV_HUGEPAGE is definitely not a hot path, so cleaner code does outweigh. Link: https://lkml.kernel.org/r/20220510203222.24246-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Vlastmil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <songliubraving@fb.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm, compaction: fast_find_migrateblock() should return pfn in the target zoneRei Yamamoto2022-05-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At present, pages not in the target zone are added to cc->migratepages list in isolate_migratepages_block(). As a result, pages may migrate between nodes unintentionally. This would be a serious problem for older kernels without commit a984226f457f849e ("mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec"), because it can corrupt the lru list by handling pages in list without holding proper lru_lock. Avoid returning a pfn outside the target zone in the case that it is not aligned with a pageblock boundary. Otherwise isolate_migratepages_block() will handle pages not in the target zone. Link: https://lkml.kernel.org/r/20220511044300.4069-1-yamamoto.rei@jp.fujitsu.com Fixes: 70b44595eafe ("mm, compaction: use free lists to quickly locate a migration source") Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Don Dutile <ddutile@redhat.com> Cc: Wonhyuk Yang <vvghjk1234@gmail.com> Cc: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/memcontrol: export memcg->watermark via sysfs for v2 memcgGanesan Rajagopal2022-05-131-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We run a lot of automated tests when building our software and run into OOM scenarios when the tests run unbounded. v1 memcg exports memcg->watermark as "memory.max_usage_in_bytes" in sysfs. We use this metric to heuristically limit the number of tests that can run in parallel based on per test historical data. This metric is currently not exported for v2 memcg and there is no other easy way of getting this information. getrusage() syscall returns "ru_maxrss" which can be used as an approximation but that's the max RSS of a single child process across all children instead of the aggregated max for all child processes. The only work around is to periodically poll "memory.current" but that's not practical for short-lived one-off cgroups. Hence, expose memcg->watermark as "memory.peak" for v2 memcg. Link: https://lkml.kernel.org/r/20220507050916.GA13577@us192.sjc.aristanetworks.com Signed-off-by: Ganesan Rajagopal <rganesan@arista.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctlMuchun Song2022-05-132-15/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and reboot the server to enable or disable the feature of optimizing vmemmap pages associated with HugeTLB pages. However, rebooting usually takes a long time. So add a sysctl to enable or disable the feature at runtime without rebooting. Why we need this? There are 3 use cases. 1) The feature of minimizing overhead of struct page associated with each HugeTLB is disabled by default without passing "hugetlb_free_vmemmap=on" to the boot cmdline. When we (ByteDance) deliver the servers to the users who want to enable this feature, they have to configure the grub (change boot cmdline) and reboot the servers, whereas rebooting usually takes a long time (we have thousands of servers). It's a very bad experience for the users. So we need a approach to enable this feature after rebooting. This is a use case in our practical environment. 2) Some use cases are that HugeTLB pages are allocated 'on the fly' instead of being pulled from the HugeTLB pool, those workloads would be affected with this feature enabled. Those workloads could be identified by the characteristics of they never explicitly allocating huge pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages' and then let the pages be allocated from the buddy allocator at fault time. We can confirm it is a real use case from the commit 099730d67417. For those workloads, the page fault time could be ~2x slower than before. We suspect those users want to disable this feature if the system has enabled this before and they don't think the memory savings benefit is enough to make up for the performance drop. 3) If the workload which wants vmemmap pages to be optimized and the workload which wants to set 'nr_overcommit_hugepages' and does not want the extera overhead at fault time when the overcommitted pages be allocated from the buddy allocator are deployed in the same server. The user could enable this feature and set 'nr_hugepages' and 'nr_overcommit_hugepages', then disable the feature. In this case, the overcommited HugeTLB pages will not encounter the extra overhead at fault time. Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: hugetlb_vmemmap: use kstrtobool for hugetlb_vmemmap param parsingMuchun Song2022-05-131-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use kstrtobool rather than open coding "on" and "off" parsing in mm/hugetlb_vmemmap.c, which is more powerful to handle all kinds of parameters like 'Yy1Nn0' or [oO][NnFf] for "on" and "off". Link: https://lkml.kernel.org/r/20220512041142.39501-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=onMuchun Song2022-05-131-6/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimizing HugeTLB vmemmap pages is not compatible with allocating memmap on hot added memory. If "hugetlb_free_vmemmap=on" and memory_hotplug.memmap_on_memory" are both passed on the kernel command line, optimizing hugetlb pages takes precedence. However, the global variable memmap_on_memory will still be set to 1, even though we will not try to allocate memmap on hot added memory. Also introduce mhp_memmap_on_memory() helper to move the definition of "memmap_on_memory" to the scope of CONFIG_MHP_MEMMAP_ON_MEMORY. In the next patch, mhp_memmap_on_memory() will also be exported to be used in hugetlb_vmemmap.c. Link: https://lkml.kernel.org/r/20220512041142.39501-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: hugetlb_vmemmap: disable hugetlb_optimize_vmemmap when struct page ↵Muchun Song2022-05-131-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | crosses page boundaries Patch series "add hugetlb_optimize_vmemmap sysctl", v11. This series aims to add hugetlb_optimize_vmemmap sysctl to enable or disable the feature of optimizing vmemmap pages associated with HugeTLB pages. This patch (of 4): If the size of "struct page" is not the power of two but with the feature of minimizing overhead of struct page associated with each HugeTLB is enabled, then the vmemmap pages of HugeTLB will be corrupted after remapping (panic is about to happen in theory). But this only exists when !CONFIG_MEMCG && !CONFIG_SLUB on x86_64. However, it is not a conventional configuration nowadays. So it is not a real word issue, just the result of a code review. But we cannot prevent anyone from configuring that combined configure. This hugetlb_optimize_vmemmap should be disable in this case to fix this issue. Link: https://lkml.kernel.org/r/20220512041142.39501-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220512041142.39501-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: rmap: fix CONT-PTE/PMD size hugetlb issue when unmappingBaolin Wang2022-05-131-17/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some architectures (like ARM64), it can support CONT-PTE/PMD size hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified. When unmapping a hugetlb page, we will get the relevant page table entry by huge_pte_offset() only once to nuke it. This is correct for PMD or PUD size hugetlb, since they always contain only one pmd entry or pud entry in the page table. However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since they can contain several continuous pte or pmd entry with same page table attributes, so we will nuke only one pte or pmd entry for this CONT-PTE/PMD size hugetlb page. And now try_to_unmap() is only passed a hugetlb page in the case where the hugetlb page is poisoned. Which means now we will unmap only one pte entry for a CONT-PTE or CONT-PMD size poisoned hugetlb page, and we can still access other subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page, which will cause serious issues possibly. So we should change to use huge_ptep_clear_flush() to nuke the hugetlb page table to fix this issue, which already considered CONT-PTE and CONT-PMD size hugetlb. We've already used set_huge_swap_pte_at() to set a poisoned swap entry for a poisoned hugetlb page. Meanwhile adding a VM_BUG_ON() to make sure the passed hugetlb page is poisoned in try_to_unmap(). Link: https://lkml.kernel.org/r/0a2e547238cad5bc153a85c3e9658cb9d55f9cac.1652270205.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/730ea4b6d292f32fb10b7a4e87dad49b0eb30474.1652147571.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Rich Felker <dalias@libc.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>