summaryrefslogtreecommitdiffstats
path: root/mm/page_owner.c
Commit message (Collapse)AuthorAgeFilesLines
* mm,page_owner: drop unnecessary checkOscar Salvador2024-03-061-2/+1
| | | | | | | | | | | | | | | | stackdepot only saves stack_records which size is greather than 0, so we cannot possibly have empty stack_records. Drop the check. Link: https://lkml.kernel.org/r/20240306123217.29774-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm,page_owner: check for null stack_record before bumping its refcountOscar Salvador2024-03-061-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "page_owner: Fixup and cleanup". This patchset consists of a fixup by an error that was reported by intel robot, where it seems to be that by the time page_owner gets initialized, stackdepot has already depleted its allocation space and returns 0-handles, turning that into null stack_records when trying to retrieve the stack_record. I was not able to reproduce that from the config because it booted fine for me, but when setting e.g: dummy_handle to 0 artificially, I could see the same error that was reported. The second patch is a cleanup that can also lead to a compilation warning. This patch (of 2): Although the retrieval of the stack_records for {dummy,failure}_handle happen when page_owner gets initialized, there seems to be some situations where stackdepot space has been already depleted by then, so we get 0-handles which make stack_records being NULL for those cases. Be careful to 1) only bump stack_records refcount and 2) only access stack_record fields if we actually have a non-null stack_record between hands. Link: https://lkml.kernel.org/r/20240306123217.29774-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20240306123217.29774-2-osalvador@suse.de Fixes: 4bedfb314bdd ("mm,page_owner: implement the tracking of the stacks count") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202403051032.e2f865a-lkp@intel.com Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: page_owner: add support for splitting to any order in split page_ownerZi Yan2024-03-041-4/+3
| | | | | | | | | | | | | | | | | | | | | It adds a new_order parameter to set new page order in page owner. It prepares for upcoming changes to support split huge page to any lower order. Link: https://lkml.kernel.org/r/20240226205534.1603748-7-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/page_owner: use order instead of nr in split_page_owner()Zi Yan2024-03-041-1/+2
| | | | | | | | | | | | | | | | | | | | We do not have non power of two pages, using nr is error prone if nr is not power-of-two. Use page order instead. Link: https://lkml.kernel.org/r/20240226205534.1603748-5-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm,page_owner: filter out stacks by a thresholdOscar Salvador2024-02-231-1/+22
| | | | | | | | | | | | | | | We want to be able to filter out the stacks based on a threshold we can can tune. By writing to 'count_threshold' file, we can adjust the threshold value. Link: https://lkml.kernel.org/r/20240215215907.20121-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm,page_owner: display all stacks and their countOscar Salvador2024-02-231-1/+92
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new directory called 'page_owner_stacks' under /sys/kernel/debug/, with a file called 'show_stacks' in it. Reading from that file will show all stacks that were added by page_owner followed by their counting, giving us a clear overview of stack <-> count relationship. E.g: prep_new_page+0xa9/0x120 get_page_from_freelist+0x801/0x2210 __alloc_pages+0x18b/0x350 alloc_pages_mpol+0x91/0x1f0 folio_alloc+0x14/0x50 filemap_alloc_folio+0xb2/0x100 __filemap_get_folio+0x14a/0x490 ext4_write_begin+0xbd/0x4b0 [ext4] generic_perform_write+0xc1/0x1e0 ext4_buffered_write_iter+0x68/0xe0 [ext4] ext4_file_write_iter+0x70/0x740 [ext4] vfs_write+0x33d/0x420 ksys_write+0xa5/0xe0 do_syscall_64+0x80/0x160 entry_SYSCALL_64_after_hwframe+0x6e/0x76 stack_count: 4578 The seq stack_{start,next} functions will iterate through the list stack_list in order to print all stacks. Link: https://lkml.kernel.org/r/20240215215907.20121-6-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm,page_owner: implement the tracking of the stacks countOscar Salvador2024-02-231-1/+72
| | | | | | | | | | | | | | | | | | | | Implement {inc,dec}_stack_record_count() which increments or decrements on respective allocation and free operations, via __reset_page_owner() (free operation) and __set_page_owner() (alloc operation). Newly allocated stack_record structs will be added to the list stack_list via add_stack_record_to_list(). Modifications on the list are protected via a spinlock with irqs disabled, since this code can also be reached from IRQ context. Link: https://lkml.kernel.org/r/20240215215907.20121-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm,page_owner: maintain own list of stack_records structsOscar Salvador2024-02-231-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | page_owner needs to increment a stack_record refcount when a new allocation occurs, and decrement it on a free operation. In order to do that, we need to have a way to get a stack_record from a handle. Implement __stack_depot_get_stack_record() which just does that, and make it public so page_owner can use it. Also, traversing all stackdepot buckets comes with its own complexity, plus we would have to implement a way to mark only those stack_records that were originated from page_owner, as those are the ones we are interested in. For that reason, page_owner maintains its own list of stack_records, because traversing that list is faster than traversing all buckets while keeping at the same time a low complexity. For now, add to stack_list only the stack_records of dummy_handle and failure_handle, and set their refcount of 1. Further patches will add code to increment or decrement stack_records count on allocation and free operation. Link: https://lkml.kernel.org/r/20240215215907.20121-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDERKirill A. Shutemov2024-01-081-3/+3
| | | | | | | | | | | | | | | commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: remove unnecessary ia64 code and commentKefeng Wang2023-12-291-1/+0
| | | | | | | | | | IA64 has gone with commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture"), remove unnecessary ia64 special mm code and comment too. Link: https://lkml.kernel.org/r/20231222070203.2966980-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/page_owner: record and dump free_pid and free_tgidBarry Song2023-12-101-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | While investigating some complex memory allocation and free bugs especially in multi-processes and multi-threads cases, from time to time, I feel the free stack isn't sufficient as a page can be freed by processes or threads other than the one allocating it. And other processes and threads which free the page often have the exactly same free stack with the one allocating the page. We can't know who free the page only through the free stack though the current page_owner does tell us the pid and tgid of the one allocating the page. This makes the bug investigation often hard. So this patch adds free pid and tgid in page_owner, so that we can easily figure out if the freeing is crossing processes or threads. Link: https://lkml.kernel.org/r/20231114034202.73098-1-v-songbaohua@oppo.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Cc: Audra Mitchell <audra@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kassey Li <quic_yingangl@quicinc.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/page_owner: remove free_ts from page_owner outputAudra Mitchell2023-10-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Fix page_owner's use of free timestamps". While page ower output is used to investigate memory utilization, typically the allocation pathway, the introduction of timestamps to the page owner records caused each record to become unique due to the granularity of the nanosecond timestamp (for example): Page allocated via order 0 ... ts 5206196026 ns, free_ts 5187156703 ns Page allocated via order 0 ... ts 5206198540 ns, free_ts 5187162702 ns Furthermore, the page_owner output only dumps the currently allocated records, so having the free timestamps is nonsensical for the typical use case. In addition, the introduction of timestamps was not properly handled in the page_owner_sort tool causing most use cases to be broken. This series is meant to remove the free timestamps from the page_owner output and fix the page_owner_sort tool so proper collation can occur. This patch (of 5): When printing page_owner data via the sysfs interface, no free pages will ever be dumped due to the series of checks in read_page_owner(): /* * Although we do have the info about past allocation of free * pages, it's not relevant for current memory usage. */ if (!test_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags)) The free_ts values are still used when dump_page_owner() is called, so keeping the field for other use cases but removing them for the typical page_owner case. Link: https://lkml.kernel.org/r/20231013190350.579407-1-audra@redhat.com Link: https://lkml.kernel.org/r/20231013190350.579407-2-audra@redhat.com Fixes: 866b48526217 ("mm/page_owner: record the timestamp of all pages during free") Signed-off-by: Audra Mitchell <audra@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Georgi Djakov <djakov@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/page_ext: use page_ext_data helper in page_ownerKemeng Shi2023-08-211-1/+1
| | | | | | | | | | Use page_ext_data helper in page_owner to avoid access offset directly. Link: https://lkml.kernel.org/r/20230718145812.1991717-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Andrew Morton <akpm@linux-foudation.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/page_owner/cma: show pfn in cma/page_owner with hex formatKassey Li2023-06-191-1/+1
| | | | | | | | | | | | | cma: display pfn as well as pfn_to_page(pfn) page_owner: display pfn in hex rather than decimal Link: https://lkml.kernel.org/r/20230613092533.15449-1-quic_yingangl@quicinc.com Signed-off-by: Kassey Li <quic_yingangl@quicinc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm, treewide: redefine MAX_ORDER sanelyKirill A. Shutemov2023-04-051-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | MAX_ORDER currently defined as number of orders page allocator supports: user can ask buddy allocator for page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and lead to number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders user can ask from buddy allocator is 0..MAX_ORDER now. [kirill@shutemov.name: fix min() warning] Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box [akpm@linux-foundation.org: fix another min_t warning] [kirill@shutemov.name: fixups per Zi Yan] Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name [akpm@linux-foundation.org: fix underlining in docs] Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* lib/stackdepot, mm: rename stack_depot_want_early_initAndrey Konovalov2023-02-161-1/+1
| | | | | | | | | | | | | | | | | Rename stack_depot_want_early_init to stack_depot_request_early_init. The old name is confusing, as it hints at returning some kind of intention of stack depot. The new name reflects that this function requests an action from stack depot instead. No functional changes. [akpm@linux-foundation.org: update mm/kmemleak.c] Link: https://lkml.kernel.org/r/359f31bf67429a06e630b4395816a967214ef753.1676063693.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/page_owner: record single timestamp value for high order allocationsHyeonggon Yoo2023-02-021-1/+2
| | | | | | | | | | | | | | | | | | | | | When allocating a high-order page, separate allocation timestamp is recorded for each sub-page resulting in different timestamp values between them. This behavior is not consistent with the behavior when recording free timestamp and caused confusion when analyzing memory dumps. Record single timestamp for the entire allocation, aligning with the behavior for free timestamps. Link: https://lkml.kernel.org/r/20230121165054.520507-1-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/page_ext: do not allocate space for page_ext->flags if not neededPasha Tatashin2023-02-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is 8 byte page_ext->flags field allocated per page whenever CONFIG_PAGE_EXTENSION is enabled. However, not every user of page_ext uses flags. Therefore, check whether flags is needed at least by one user and if so allocate space for it. For example when page_table_check is enabled, on a machine with 128G of memory before the fix: [ 2.244288] allocated 536870912 bytes of page_ext after the fix: [ 2.160154] allocated 268435456 bytes of page_ext Also, add a kernel-doc comment before page_ext_operations that describes the fields, and remove check if need() is set, as that is now a required field. [pasha.tatashin@soleen.com: address comments from Mike Rapoport] Link: https://lkml.kernel.org/r/20230117202103.1412449-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20230113154253.92480-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: reuse pageblock_start/end_pfn() macroKefeng Wang2022-10-031-2/+2
| | | | | | | | | | | | | | | Move pageblock_start_pfn/pageblock_end_pfn() into pageblock-flags.h, then they could be used somewhere else, not only in compaction, also use ALIGN_DOWN() instead of round_down() to be pair with ALIGN(), which should be same for pageblock usage. Link: https://lkml.kernel.org/r/20220907060844.126891-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/page_owner.c: remove redundant drain_all_pagesZhenhua Huang2022-10-031-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | Remove an expensive and unnecessary operation as PCP pages are safely skipped when reading page owner.PCP pages can be skipped because PAGE_EXT_OWNER_ALLOCATED is cleared. With draining PCP pages, these pages are moved to buddy list so they can be identified as buddy pages and skipped quickly. Although it improved efficiency of PFN walker, the drain is guaranteed expensive that is unlikely to be offset by a slight increase in efficiency when skipping free pages. PAGE_EXT_OWNER_ALLOCATED is cleared in the page owner reset path below: free_unref_page -> free_unref_page_prepare -> free_pcp_prepare -> free_pages_prepare which do page owner reset -> free_unref_page_commit which add pages into pcp list Link: https://lkml.kernel.org/r/1662704326-15899-1-git-send-email-quic_zhenhuah@quicinc.com Link: https://lkml.kernel.org/r/1662633204-10044-1-git-send-email-quic_zhenhuah@quicinc.com Link: https://lkml.kernel.org/r/1662537673-9392-1-git-send-email-quic_zhenhuah@quicinc.com Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/page_owner.c: add llseek for page_ownerKassey Li2022-09-111-3/+21
| | | | | | | | | | | | | | | | It is too slow to dump all the pages, in some usage we just want to dump a given start pfn, for example: a CMA range or a single page. To speed up and save time, this change allows specifying of a start pfn by adding llseek for page_owner. Link: https://lkml.kernel.org/r/20220818022425.31056-1-quic_yingangl@quicinc.com Signed-off-by: Kassey Li <quic_yingangl@quicinc.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: fix use-after free of page_ext after race with memory-offlineCharan Teja Kalla2022-09-111-20/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The below is one path where race between page_ext and offline of the respective memory blocks will cause use-after-free on the access of page_ext structure. process1 process2 --------- --------- a)doing /proc/page_owner doing memory offline through offline_pages. b) PageBuddy check is failed thus proceed to get the page_owner information through page_ext access. page_ext = lookup_page_ext(page); migrate_pages(); ................. Since all pages are successfully migrated as part of the offline operation,send MEM_OFFLINE notification where for page_ext it calls: offline_page_ext()--> __free_page_ext()--> free_page_ext()--> vfree(ms->page_ext) mem_section->page_ext = NULL c) Check for the PAGE_EXT flags in the page_ext->flags access results into the use-after-free (leading to the translation faults). As mentioned above, there is really no synchronization between page_ext access and its freeing in the memory_offline. The memory offline steps(roughly) on a memory block is as below: 1) Isolate all the pages 2) while(1) try free the pages to buddy.(->free_list[MIGRATE_ISOLATE]) 3) delete the pages from this buddy list. 4) Then free page_ext.(Note: The struct page is still alive as it is freed only during hot remove of the memory which frees the memmap, which steps the user might not perform). This design leads to the state where struct page is alive but the struct page_ext is freed, where the later is ideally part of the former which just representing the page_flags (check [3] for why this design is chosen). The abovementioned race is just one example __but the problem persists in the other paths too involving page_ext->flags access(eg: page_is_idle())__. Fix all the paths where offline races with page_ext access by maintaining synchronization with rcu lock and is achieved in 3 steps: 1) Invalidate all the page_ext's of the sections of a memory block by storing a flag in the LSB of mem_section->page_ext. 2) Wait until all the existing readers to finish working with the ->page_ext's with synchronize_rcu(). Any parallel process that starts after this call will not get page_ext, through lookup_page_ext(), for the block parallel offline operation is being performed. 3) Now safely free all sections ->page_ext's of the block on which offline operation is being performed. Note: If synchronize_rcu() takes time then optimizations can be done in this path through call_rcu()[2]. Thanks to David Hildenbrand for his views/suggestions on the initial discussion[1] and Pavan kondeti for various inputs on this patch. [1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/ [2] https://lore.kernel.org/all/a26ce299-aed1-b8ad-711e-a49e82bdd180@quicinc.com/T/#u [3] https://lore.kernel.org/all/6fa6b7aa-731e-891c-3efb-a03d6a700efa@redhat.com/ [quic_charante@quicinc.com: rename label `loop' to `ext_put_continue' per David] Link: https://lkml.kernel.org/r/1661496993-11473-1-git-send-email-quic_charante@quicinc.com Link: https://lkml.kernel.org/r/1660830600-9068-1-git-send-email-quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Fernand Sieber <sieberf@amazon.com> Cc: Minchan Kim <minchan@google.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pavan Kondeti <quic_pkondeti@quicinc.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Merge tag 'mm-stable-2022-05-25' of ↵Linus Torvalds2022-05-261-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Almost all of MM here. A few things are still getting finished off, reviewed, etc. - Yang Shi has improved the behaviour of khugepaged collapsing of readonly file-backed transparent hugepages. - Johannes Weiner has arranged for zswap memory use to be tracked and managed on a per-cgroup basis. - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime enablement of the recent huge page vmemmap optimization feature. - Baolin Wang contributes a series to fix some issues around hugetlb pagetable invalidation. - Zhenwei Pi has fixed some interactions between hwpoisoned pages and virtualization. - Tong Tiangen has enabled the use of the presently x86-only page_table_check debugging feature on arm64 and riscv. - David Vernet has done some fixup work on the memcg selftests. - Peter Xu has taught userfaultfd to handle write protection faults against shmem- and hugetlbfs-backed files. - More DAMON development from SeongJae Park - adding online tuning of the feature and support for monitoring of fixed virtual address ranges. Also easier discovery of which monitoring operations are available. - Nadav Amit has done some optimization of TLB flushing during mprotect(). - Neil Brown continues to labor away at improving our swap-over-NFS support. - David Hildenbrand has some fixes to anon page COWing versus get_user_pages(). - Peng Liu fixed some errors in the core hugetlb code. - Joao Martins has reduced the amount of memory consumed by device-dax's compound devmaps. - Some cleanups of the arch-specific pagemap code from Anshuman Khandual. - Muchun Song has found and fixed some errors in the TLB flushing of transparent hugepages. - Roman Gushchin has done more work on the memcg selftests. ... and, of course, many smaller fixes and cleanups. Notably, the customary million cleanup serieses from Miaohe Lin" * tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits) mm: kfence: use PAGE_ALIGNED helper selftests: vm: add the "settings" file with timeout variable selftests: vm: add "test_hmm.sh" to TEST_FILES selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests selftests: vm: add migration to the .gitignore selftests/vm/pkeys: fix typo in comment ksm: fix typo in comment selftests: vm: add process_mrelease tests Revert "mm/vmscan: never demote for memcg reclaim" mm/kfence: print disabling or re-enabling message include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace" include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion" mm: fix a potential infinite loop in start_isolate_page_range() MAINTAINERS: add Muchun as co-maintainer for HugeTLB zram: fix Kconfig dependency warning mm/shmem: fix shmem folio swapoff hang cgroup: fix an error handling path in alloc_pagecache_max_30M() mm: damon: use HPAGE_PMD_SIZE tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate nodemask.h: fix compilation error with GCC12 ...
| * mm/page_owner.c: add missing __initdata attributeFanjun Kong2022-05-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes two issues: 1. Add __initdata attribute according to include/linux/init.h: For initialized data: You should insert __initdata between the variable name and equal sign followed by value 2. Fix below error reported by checkpatch.pl: ERROR: do not initialise statics to false Special thanks to Muchun Song :) Link: https://lkml.kernel.org/r/20220516030039.1487005-1-bh1scw@gmail.com Signed-off-by: Fanjun Kong <bh1scw@gmail.com> Suggested-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/page_owner: use strscpy() instead of strlcpy()Eric Dumazet2022-05-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | current->comm[] is not a string (no guarantee for a zero byte in it). strlcpy(s1, s2, l) is calling strlen(s2), potentially causing out-of-bound access, as reported by syzbot: detected buffer overflow in __fortify_strlen ------------[ cut here ]------------ kernel BUG at lib/string_helpers.c:980! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 4087 Comm: dhcpcd-run-hooks Not tainted 5.18.0-rc3-syzkaller-01537-g20b87e7c29df #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:fortify_panic+0x18/0x1a lib/string_helpers.c:980 Code: 8c e8 c5 ba e1 fa e9 23 0f bf fa e8 0b 5d 8c f8 eb db 55 48 89 fd e8 e0 49 40 f8 48 89 ee 48 c7 c7 80 f5 26 8a e8 99 09 f1 ff <0f> 0b e8 ca 49 40 f8 48 8b 54 24 18 4c 89 f1 48 c7 c7 00 00 27 8a RSP: 0018:ffffc900000074a8 EFLAGS: 00010286 RAX: 000000000000002c RBX: ffff88801226b728 RCX: 0000000000000000 RDX: ffff8880198e0000 RSI: ffffffff81600458 RDI: fffff52000000e87 RBP: ffffffff89da2aa0 R08: 000000000000002c R09: 0000000000000000 R10: ffffffff815fae2e R11: 0000000000000000 R12: ffff88801226b700 R13: ffff8880198e0830 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f5876ad6ff8 CR3: 000000001a48c000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 Call Trace: <IRQ> __fortify_strlen include/linux/fortify-string.h:128 [inline] strlcpy include/linux/fortify-string.h:143 [inline] __set_page_owner_handle+0x2b1/0x3e0 mm/page_owner.c:171 __set_page_owner+0x3e/0x50 mm/page_owner.c:190 prep_new_page mm/page_alloc.c:2441 [inline] get_page_from_freelist+0xba2/0x3e00 mm/page_alloc.c:4182 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5408 alloc_pages+0x1aa/0x310 mm/mempolicy.c:2272 alloc_slab_page mm/slub.c:1799 [inline] allocate_slab+0x26c/0x3c0 mm/slub.c:1944 new_slab mm/slub.c:2004 [inline] ___slab_alloc+0x8df/0xf20 mm/slub.c:3005 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3092 slab_alloc_node mm/slub.c:3183 [inline] slab_alloc mm/slub.c:3225 [inline] __kmem_cache_alloc_lru mm/slub.c:3232 [inline] kmem_cache_alloc+0x360/0x3b0 mm/slub.c:3242 dst_alloc+0x146/0x1f0 net/core/dst.c:92 Link: https://lkml.kernel.org/r/20220509145949.265184-1-eric.dumazet@gmail.com Fixes: 865ed6a32786 ("mm/page_owner: record task command name") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | lib/stackdepot: allow requesting early initialization dynamicallyVlastimil Babka2022-04-061-3/+6
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a later patch we want to add stackdepot support for object owner tracking in slub caches, which is enabled by slub_debug boot parameter. This creates a bootstrap problem as some caches are created early in boot when slab_is_available() is false and thus stack_depot_init() tries to use memblock. But, as reported by Hyeonggon Yoo [1] we are already beyond memblock_free_all(). Ideally memblock allocation should fail, yet it succeeds, but later the system crashes, which is a separately handled issue. To resolve this boostrap issue in a robust way, this patch adds another way to request stack_depot_early_init(), which happens at a well-defined point of time. In addition to build-time CONFIG_STACKDEPOT_ALWAYS_INIT, code that's e.g. processing boot parameters (which happens early enough) can call a new function stack_depot_want_early_init(), which sets a flag that stack_depot_early_init() will check. In this patch we also convert page_owner to this approach. While it doesn't have the bootstrap issue as slub, it's also a functionality enabled by a boot param and can thus request stack_depot_early_init() with memblock allocation instead of later initialization with kvmalloc(). As suggested by Mike, make stack_depot_early_init() only attempt memblock allocation and stack_depot_init() only attempt kvmalloc(). Also change the latter to kvcalloc(). In both cases we can lose the explicit array zeroing, which the allocations do already. As suggested by Marco, provide empty implementations of the init functions for !CONFIG_STACKDEPOT builds to simplify the callers. [1] https://lore.kernel.org/all/YhnUcqyeMgCrWZbd@ip-172-31-19-208.ap-northeast-1.compute.internal/ Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Suggested-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Marco Elver <elver@google.com> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Rientjes <rientjes@google.com>
* mm/page_owner.c: record tgidYixuan Cao2022-03-241-6/+9
| | | | | | | | | | | | | | | | | | | | | | | In a single-threaded process, the pid in kernel task_struct is the same as the tgid, which can mark the process of page allocation. But in a multithreaded process, only the task_struct of the thread leader has the same pid as tgid, and the pids of other threads are different from tgid. Therefore, tgid is recorded to provide effective information for debugging and data statistics of multithreaded programs. This can also be achieved by observing the task name (executable file name) for a specific process. However, when the same program is started multiple times, the task name is the same and the tgid is different. Therefore, in the debugging of multi-threaded programs, combined with the task name and tgid, more accurate runtime information of a certain run of the program can be obtained. Link: https://lkml.kernel.org/r/20220219180450.2399-1-caoyixuan2019@email.szu.edu.cn Signed-off-by: Yixuan Cao <caoyixuan2019@email.szu.edu.cn> Cc: Waiman Long <longman@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_owner: record task command nameWaiman Long2022-03-241-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The page_owner information currently includes the pid of the calling task. That is useful as long as the task is still running. Otherwise, the number is meaningless. To have more information about the allocating tasks that had exited by the time the page_owner information is retrieved, we need to store the command name of the task. Add a new comm field into page_owner structure to store the command name and display it when the page_owner information is retrieved. Link: https://lkml.kernel.org/r/20220202203036.744010-5-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_owner: print memcg informationWaiman Long2022-03-241-0/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It was found that a number of offline memcgs were not freed because they were pinned by some charged pages that were present. Even "echo 1 > /proc/sys/vm/drop_caches" wasn't able to free those pages. These offline but not freed memcgs tend to increase in number over time with the side effect that percpu memory consumption as shown in /proc/meminfo also increases over time. In order to find out more information about those pages that pin offline memcgs, the page_owner feature is extended to print memory cgroup information especially whether the cgroup is offline or not. RCU read lock is taken when memcg is being accessed to make sure that it won't be freed. Link: https://lkml.kernel.org/r/20220202203036.744010-4-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_owner: use scnprintf() to avoid excessive buffer overrun checkWaiman Long2022-03-241-11/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The snprintf() function can return a length greater than the given input size. That will require a check for buffer overrun after each invocation of snprintf(). scnprintf(), on the other hand, will never return a greater length. By using scnprintf() in selected places, we can avoid some buffer overrun checks except after stack_depot_snprint() and after the last snprintf(). Link: https://lkml.kernel.org/r/20220202203036.744010-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()Vlastimil Babka2022-01-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, enabling CONFIG_STACKDEPOT means its stack_table will be allocated from memblock, even if stack depot ends up not actually used. The default size of stack_table is 4MB on 32-bit, 8MB on 64-bit. This is fine for use-cases such as KASAN which is also a config option and has overhead on its own. But it's an issue for functionality that has to be actually enabled on boot (page_owner) or depends on hardware (GPU drivers) and thus the memory might be wasted. This was raised as an issue [1] when attempting to add stackdepot support for SLUB's debug object tracking functionality. It's common to build kernels with CONFIG_SLUB_DEBUG and enable slub_debug on boot only when needed, or create only specific kmem caches with debugging for testing purposes. It would thus be more efficient if stackdepot's table was allocated only when actually going to be used. This patch thus makes the allocation (and whole stack_depot_init() call) optional: - Add a CONFIG_STACKDEPOT_ALWAYS_INIT flag to keep using the current well-defined point of allocation as part of mem_init(). Make CONFIG_KASAN select this flag. - Other users have to call stack_depot_init() as part of their own init when it's determined that stack depot will actually be used. This may depend on both config and runtime conditions. Convert current users which are page_owner and several in the DRM subsystem. Same will be done for SLUB later. - Because the init might now be called after the boot-time memblock allocation has given all memory to the buddy allocator, change stack_depot_init() to allocate stack_table with kvmalloc() when memblock is no longer available. Also handle allocation failure by disabling stackdepot (could have theoretically happened even with memblock allocation previously), and don't unnecessarily align the memblock allocation to its own size anymore. [1] https://lore.kernel.org/all/CAMuHMdW=eoVzM1Re5FVoEN87nKfiLmM2+Ah7eNu2KXEhCvbZyA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20211013073005.11351-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Marco Elver <elver@google.com> # stackdepot Cc: Marco Elver <elver@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Oliver Glitta <glittao@gmail.com> Cc: Imran Khan <imran.f.khan@oracle.com> From: Colin Ian King <colin.king@canonical.com> Subject: lib/stackdepot: fix spelling mistake and grammar in pr_err message There is a spelling mistake of the work allocation so fix this and re-phrase the message to make it easier to read. Link: https://lkml.kernel.org/r/20211015104159.11282-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Vlastimil Babka <vbabka@suse.cz> From: Vlastimil Babka <vbabka@suse.cz> Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup On FLATMEM, we call page_ext_init_flatmem_late() just before kmem_cache_init() which means stack_depot_init() (called by page owner init) will not recognize properly it should use kvmalloc() and not memblock_alloc(). memblock_alloc() will also not issue a warning and return a block memory that can be invalid and cause kernel page fault when saving stacks, as reported by the kernel test robot [1]. Fix this by moving page_ext_init_flatmem_late() below kmem_cache_init() so that slab_is_available() is true during stack_depot_init(). SPARSEMEM doesn't have this issue, as it doesn't do page_ext_init_flatmem_late(), but a different page_ext_init() even later in the boot process. Thanks to Mike Rapoport for pointing out the FLATMEM init ordering issue. While at it, also actually resolve a checkpatch warning in stack_depot_init() from DRM CI, which was supposed to be in the original patch already. [1] https://lore.kernel.org/all/20211014085450.GC18719@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/6abd9213-19a9-6d58-cedc-2414386d2d81@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: kernel test robot <oliver.sang@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> From: Vlastimil Babka <vbabka@suse.cz> Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup3 Due to cd06ab2fd48f ("drm/locking: add backtrace for locking contended locks without backoff") landing recently to -next adding a new stack depot user in drivers/gpu/drm/drm_modeset_lock.c we need to add an appropriate call to stack_depot_init() there as well. Link: https://lkml.kernel.org/r/2a692365-cfa1-64f2-34e0-8aa5674dce5e@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Marco Elver <elver@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Oliver Glitta <glittao@gmail.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> From: Vlastimil Babka <vbabka@suse.cz> Subject: lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() - fixup4 Due to 4e66934eaadc ("lib: add reference counting tracking infrastructure") landing recently to net-next adding a new stack depot user in lib/ref_tracker.c we need to add an appropriate call to stack_depot_init() there as well. Link: https://lkml.kernel.org/r/45c1b738-1a2f-5b5f-2f6d-86fab206d01c@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Jiri Slab <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: make some vars and functions static or __initTing Liu2022-01-151-2/+2
| | | | | | | | | | | | | | "page_idle_ops" as a global var, but its scope of use within this document. So it should be static. "page_ext_ops" is a var used in the kernel initial phase. And other functions are aslo used in the kernel initial phase. So they should be __init or __initdata to reclaim memory. Link: https://lkml.kernel.org/r/20211217095023.67293-1-liuting.0x7c00@bytedance.com Signed-off-by: Ting Liu <liuting.0x7c00@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_owner.c: modify the type of argument "order" in some functionsYixuan Cao2021-11-111-3/+3
| | | | | | | | | | | | | | | | | | | The type of "order" in struct page_owner is unsigned short. However, it is unsigned int in the following 3 functions: __reset_page_owner __set_page_owner_handle __set_page_owner_handle The type of "order" in argument list is unsigned int, which is inconsistent. [akpm@linux-foundation.org: update include/linux/page_owner.h] Link: https://lkml.kernel.org/r/20211020125945.47792-1-caoyixuan2019@email.szu.edu.cn Signed-off-by: Yixuan Cao <caoyixuan2019@email.szu.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'akpm' (patches from Andrew)Linus Torvalds2021-11-091-13/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge more updates from Andrew Morton: "87 patches. Subsystems affected by this patch series: mm (pagecache and hugetlb), procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs, init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork, sysvfs, kcov, gdb, resource, selftests, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits) ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL ipc: check checkpoint_restore_ns_capable() to modify C/R proc files selftests/kselftest/runner/run_one(): allow running non-executable files virtio-mem: disallow mapping virtio-mem memory via /dev/mem kernel/resource: disallow access to exclusive system RAM regions kernel/resource: clean up and optimize iomem_is_exclusive() scripts/gdb: handle split debug for vmlinux kcov: replace local_irq_save() with a local_lock_t kcov: avoid enable+disable interrupts if !in_task() kcov: allocate per-CPU memory on the relevant node Documentation/kcov: define `ip' in the example Documentation/kcov: include types.h in the example sysv: use BUILD_BUG_ON instead of runtime check kernel/fork.c: unshare(): use swap() to make code cleaner seq_file: fix passing wrong private data seq_file: move seq_escape() to a header signal: remove duplicate include in signal.h crash_dump: remove duplicate include in crash_dump.h crash_dump: fix boolreturn.cocci warning hfs/hfsplus: use WARN_ON for sanity check ...
| * lib, stackdepot: add helper to print stack entries into bufferImran Khan2021-11-091-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To print stack entries into a buffer, users of stackdepot, first get a list of stack entries using stack_depot_fetch and then print this list into a buffer using stack_trace_snprint. Provide a helper in stackdepot for this purpose. Also change above mentioned users to use this helper. [imran.f.khan@oracle.com: fix build error] Link: https://lkml.kernel.org/r/20210915175321.3472770-4-imran.f.khan@oracle.com [imran.f.khan@oracle.com: export stack_depot_snprint() to modules] Link: https://lkml.kernel.org/r/20210916133535.3592491-4-imran.f.khan@oracle.com Link: https://lkml.kernel.org/r/20210915014806.3206938-4-imran.f.khan@oracle.com Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Jani Nikula <jani.nikula@intel.com> [i915] Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * lib, stackdepot: add helper to print stack entriesImran Khan2021-11-091-9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To print a stack entries, users of stackdepot, first use stack_depot_fetch to get a list of stack entries and then use stack_trace_print to print this list. Provide a helper in stackdepot to print stack entries based on stackdepot handle. Also change above mentioned users to use this helper. Link: https://lkml.kernel.org/r/20210915014806.3206938-3-imran.f.khan@oracle.com Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'printk-for-5.16' of ↵Linus Torvalds2021-11-021-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Extend %pGp print format to print hex value of the page flags - Use kvmalloc instead of kmalloc to allocate devkmsg buffers - Misc cleanup and warning fixes * tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: vsprintf: Update %pGp documentation about that it prints hex value lib/vsprintf.c: Amend static asserts for format specifier flags vsprintf: Make %pGp print the hex value test_printf: Append strings more efficiently test_printf: Remove custom appending of '|' test_printf: Remove separate page_flags variable test_printf: Make pft array const ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK printk: use gnu_printf format attribute for printk_sprint() printk: avoid -Wsometimes-uninitialized warning printk: use kvmalloc instead of kmalloc for devkmsg_user
| * | vsprintf: Make %pGp print the hex valueMatthew Wilcox (Oracle)2021-10-271-2/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | All existing users of %pGp want the hex value as well as the decoded flag names. This looks awkward (passing the same parameter to printf twice), so move that functionality into the core. If we want, we can make that optional with flag arguments to %pGp in the future. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20211019142621.2810043-6-willy@infradead.org
* / mm/migrate: Add folio_migrate_flags()Matthew Wilcox (Oracle)2021-10-181-5/+5
|/ | | | | | | | | | | | | Turn migrate_page_states() into a wrapper around folio_migrate_flags(). Also convert two functions only called from folio_migrate_flags() to be folio-based. ksm_migrate_page() becomes folio_migrate_ksm() and copy_page_owner() becomes folio_copy_owner(). folio_migrate_flags() alone shrinks by two thirds -- 1967 bytes down to 642 bytes. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
* mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONEMike Rapoport2021-09-081-13/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE". After recent updates to freeing unused parts of the memory map, no architecture can have holes in the memory map within a pageblock. This makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration option redundant. The first patch removes them both in a mechanical way and the second patch simplifies memory_hotplug::test_pages_in_a_zone() that had pfn_valid_within() surrounded by more logic than simple if. This patch (of 2): After recent changes in freeing of the unused parts of the memory map and rework of pfn_valid() in arm and arm64 there are no architectures that can have holes in the memory map within a pageblock and so nothing can enable CONFIG_HOLES_IN_ZONE which guards non trivial implementation of pfn_valid_within(). With that, pfn_valid_within() is always hardwired to 1 and can be completely removed. Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE. Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_owner: constify dump_page_ownerMatthew Wilcox (Oracle)2021-06-291-1/+1
| | | | | | | | | | | | | dump_page_owner() only uses struct page to find the page_ext, and lookup_page_ext() already takes a const argument. Link: https://lkml.kernel.org/r/20210416231531.2521383-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: fix typos in commentsIngo Molnar2021-05-071-1/+1
| | | | | | | | | | | | | | Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_owner: detect page_owner recursion via task_structSergei Trofimovich2021-04-301-22/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before the change page_owner recursion was detected via fetching backtrace and inspecting it for current instruction pointer. It has a few problems: - it is slightly slow as it requires extra backtrace and a linear stack scan of the result - it is too late to check if backtrace fetching required memory allocation itself (ia64's unwinder requires it). To simplify recursion tracking let's use page_owner recursion flag in 'struct task_struct'. The change make page_owner=on work on ia64 by avoiding infinite recursion in: kmalloc() -> __set_page_owner() -> save_stack() -> unwind() [ia64-specific] -> build_script() -> kmalloc() -> __set_page_owner() [we short-circuit here] -> save_stack() -> unwind() [recursion] Link: https://lkml.kernel.org/r/20210402115342.1463781-1-slyfox@gentoo.org Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_owner: use kstrtobool() to parse bool optionSergei Trofimovich2021-04-301-7/+1
| | | | | | | | | | | | | | | | | | I tried to use page_owner=1 for a while noticed too late it had no effect as opposed to similar init_on_alloc=1 (these work). Let's make them consistent. The change decreses binary size slightly: text data bss dec hex filename 12408 321 17 12746 31ca mm/page_owner.o.before 12320 321 17 12658 3172 mm/page_owner.o.after Link: https://lkml.kernel.org/r/20210401210909.3532086-1-slyfox@gentoo.org Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: page_owner: fetch backtrace only for tracked pagesSergei Trofimovich2021-04-301-3/+3
| | | | | | | | | | | | Very minor optimization. Link: https://lkml.kernel.org/r/20210401212445.3534721-1-slyfox@gentoo.org Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, page_owner: remove unused parameter in __set_page_owner_handlezhongjiang-ali2021-04-301-5/+5
| | | | | | | | | | | Since commit 5556cfe8d994 ("mm, page_owner: fix off-by-one error in __set_page_owner_handle()") introduced, the parameter 'page' will not used, hence it need to be removed. Link: https://lkml.kernel.org/r/1616602022-43545-1-git-send-email-zhongjiang-ali@linux.alibaba.com Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_owner: record the timestamp of all pages during freeGeorgi Djakov2021-04-301-4/+8
| | | | | | | | | | | | | | | | | Collect the time when each allocation is freed, to help with memory analysis with kdump/ramdump. Add the timestamp also in the page_owner debugfs file and print it in dump_page(). Having another timestamp when we free the page helps for debugging page migration issues. For example both alloc and free timestamps being the same can gave hints that there is an issue with migrating memory, as opposed to a page just being dropped during migration. Link: https://lkml.kernel.org/r/20210203175905.12267-1-georgi.djakov@linaro.org Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_owner: use helper function zone_end_pfn() to get end_pfnMiaohe Lin2021-02-241-2/+2
| | | | | | | | | | | | | | Commit 108bcc96ef70 ("mm: add & use zone_end_pfn() and zone_spans_pfn()") introduced the helper zone_end_pfn() to calculate the zone end pfn. But pagetypeinfo_showmixedcount_print forgot to use it. And the initialization of local variable pfn is duplicated, remove one. Link: https://lkml.kernel.org/r/20210123070538.5861-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_owner: record timestamp and pidLiam Mark2020-12-151-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Collect the time for each allocation recorded in page owner so that allocation "surges" can be measured. Record the pid for each allocation recorded in page owner so that the source of allocation "surges" can be better identified. The above is very useful when doing memory analysis. On a crash for example, we can get this information from kdump (or ramdump) and parse it to figure out memory allocation problems. Please note that on x86_64 this increases the size of struct page_owner from 16 bytes to 32. Vlastimil: it's not a functionality intended for production, so unless somebody says they need to enable page_owner for debugging and this increase prevents them from fitting into available memory, let's not complicate things with making this optional. [lmark@codeaurora.org: v3] Link: https://lkml.kernel.org/r/20201210160357.27779-1-georgi.djakov@linaro.org Link: https://lkml.kernel.org/r/20201209125153.10533-1-georgi.djakov@linaro.org Signed-off-by: Liam Mark <lmark@codeaurora.org> Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: rename page_order() to buddy_order()Matthew Wilcox (Oracle)2020-10-161-3/+3
| | | | | | | | | | | The current page_order() can only be called on pages in the buddy allocator. For compound pages, you have to use compound_order(). This is confusing and led to a bug, so rename page_order() to buddy_order(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>