From da34a8484d162585e22ed8c1e4114aa2f60e3567 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 7 Dec 2022 14:00:39 +0100 Subject: mm: memcontrol: deprecate charge moving Charge moving mode in cgroup1 allows memory to follow tasks as they migrate between cgroups. This is, and always has been, a questionable thing to do - for several reasons. First, it's expensive. Pages need to be identified, locked and isolated from various MM operations, and reassigned, one by one. Second, it's unreliable. Once pages are charged to a cgroup, there isn't always a clear owner task anymore. Cache isn't moved at all, for example. Mapped memory is moved - but if trylocking or isolating a page fails, it's arbitrarily left behind. Frequent moving between domains may leave a task's memory scattered all over the place. Third, it isn't really needed. Launcher tasks can kick off workload tasks directly in their target cgroup. Using dedicated per-workload groups allows fine-grained policy adjustments - no need to move tasks and their physical pages between control domains. The feature was never forward-ported to cgroup2, and it hasn't been missed. Despite it being a niche usecase, the maintenance overhead of supporting it is enormous. Because pages are moved while they are live and subject to various MM operations, the synchronization rules are complicated. There are lock_page_memcg() in MM and FS code, which non-cgroup people don't understand. In some cases we've been able to shift code and cgroup API calls around such that we can rely on native locking as much as possible. But that's fragile, and sometimes we need to hold MM locks for longer than we otherwise would (pte lock e.g.). Mark the feature deprecated. Hopefully we can remove it soon. And backport into -stable kernels so that people who develop against earlier kernels are warned about this deprecation as early as possible. [akpm@linux-foundation.org: fix memory.rst underlining] Link: https://lkml.kernel.org/r/Y5COd+qXwk/S+n8N@cmpxchg.org Signed-off-by: Johannes Weiner Acked-by: Shakeel Butt Acked-by: Hugh Dickins Acked-by: Michal Hocko Cc: Muchun Song Cc: Roman Gushchin Cc: Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v1/memory.rst | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 60370f2c67b9..258e45cc3b2d 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -86,6 +86,8 @@ Brief summary of control files. memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) memory.move_charge_at_immigrate set/show controls of moving charges + This knob is deprecated and shouldn't be + used. memory.oom_control set/show oom controls. memory.numa_stat show the number of memory usage per numa node @@ -717,8 +719,15 @@ NOTE2: It is recommended to set the soft limit always below the hard limit, otherwise the hard limit will take precedence. -8. Move charges at task migration -================================= +8. Move charges at task migration (DEPRECATED!) +=============================================== + +THIS IS DEPRECATED! + +It's expensive and unreliable! It's better practice to launch workload +tasks directly from inside their target cgroup. Use dedicated workload +cgroups to allow fine-grained policy adjustments without having to +move physical pages between control domains. Users can move charges associated with a task along with task migration, that is, uncharge task's pages from the old cgroup and charge them to the new cgroup. -- cgit v1.2.3 From d56fe24237c3dec29b7de20ce93a4a53306d180b Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 5 Dec 2022 23:08:23 +0000 Subject: Docs/admin-guide/damon/reclaim: document 'skip_anon' parameter Document the newly added 'skip_anon' parameter of DAMON_RECLAIM, which can be used to avoid anonymous pages reclamation. Link: https://lkml.kernel.org/r/20221205230830.144349-5-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/reclaim.rst | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst index 4f1479a11e63..ff335e96e0d8 100644 --- a/Documentation/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -205,6 +205,15 @@ The end physical address of memory region that DAMON_RECLAIM will do work against. That is, DAMON_RECLAIM will find cold memory regions in this region and reclaims. By default, biggest System RAM is used as the region. +skip_anon +--------- + +Skip anonymous pages reclamation. + +If this parameter is set as ``Y``, DAMON_RECLAIM does not reclaim anonymous +pages. By default, ``N``. + + kdamond_pid ----------- -- cgit v1.2.3 From 9b7f9322a5300505fffba492b45621c897c0e0df Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 5 Dec 2022 23:08:29 +0000 Subject: Docs/admin-guide/mm/damon/usage: document DAMOS filters of sysfs Document about the newly added files for DAMOS filters on the DAMON usage document. Link: https://lkml.kernel.org/r/20221205230830.144349-11-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 48 ++++++++++++++++++++++++++-- 1 file changed, 46 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 1a5b6b71efa1..3d82ca6a17ff 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -87,6 +87,8 @@ comma (","). :: │ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low + │ │ │ │ │ │ │ filters/nr_filters + │ │ │ │ │ │ │ │ 0/type,matching,memcg_id │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds │ │ │ │ │ │ │ tried_regions/ │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age @@ -151,6 +153,8 @@ number (``N``) to the file creates the number of child directories named as moment, only one context per kdamond is supported, so only ``0`` or ``1`` can be written to the file. +.. _sysfs_contexts: + contexts// ------------- @@ -268,8 +272,8 @@ schemes// ------------ In each scheme directory, five directories (``access_pattern``, ``quotas``, -``watermarks``, ``stats``, and ``tried_regions``) and one file (``action``) -exist. +``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and one file +(``action``) exist. The ``action`` file is for setting and getting what action you want to apply to memory regions having specific access pattern of the interest. The keywords @@ -347,6 +351,46 @@ as below. The ``interval`` should written in microseconds unit. +schemes//filters/ +-------------------- + +Users could know something more than the kernel for specific types of memory. +In the case, users could do their own management for the memory and hence +doesn't want DAMOS bothers that. Users could limit DAMOS by setting the access +pattern of the scheme and/or the monitoring regions for the purpose, but that +can be inefficient in some cases. In such cases, users could set non-access +pattern driven filters using files in this directory. + +In the beginning, this directory has only one file, ``nr_filters``. Writing a +number (``N``) to the file creates the number of child directories named ``0`` +to ``N-1``. Each directory represents each filter. The filters are evaluated +in the numeric order. + +Each filter directory contains three files, namely ``type``, ``matcing``, and +``memcg_path``. You can write one of two special keywords, ``anon`` for +anonymous pages, or ``memcg`` for specific memory cgroup filtering. In case of +the memory cgroup filtering, you can specify the memory cgroup of the interest +by writing the path of the memory cgroup from the cgroups mount point to +``memcg_path`` file. You can write ``Y`` or ``N`` to ``matching`` file to +filter out pages that does or does not match to the type, respectively. Then, +the scheme's action will not be applied to the pages that specified to be +filtered out. + +For example, below restricts a DAMOS action to be applied to only non-anonymous +pages of all memory cgroups except ``/having_care_already``.:: + + # echo 2 > nr_filters + # # filter out anonymous pages + echo anon > 0/type + echo Y > 0/matching + # # further filter out all cgroups except one at '/having_care_already' + echo memcg > 1/type + echo /having_care_already > 1/memcg_path + echo N > 1/matching + +Note that filters could be ignored depend on the running DAMON operations set +`implementation `. + .. _sysfs_schemes_stats: schemes//stats/ -- cgit v1.2.3 From 497b099d9a166f819e667f4bf258e7a00ea2e78a Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 5 Dec 2022 23:08:30 +0000 Subject: Docs/ABI/damon: document scheme filters files Document newly added DAMON sysfs interface files for DAMOS filtering on the DAMON ABI document. Link: https://lkml.kernel.org/r/20221205230830.144349-12-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/ABI/testing/sysfs-kernel-mm-damon | 29 +++++++++++++++++++++++++ 1 file changed, 29 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon index 13397b853692..2744f21b5a6b 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-damon +++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon @@ -258,6 +258,35 @@ Contact: SeongJae Park Description: Writing to and reading from this file sets and gets the low watermark of the scheme in permil. +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//filters/nr_filters +Date: Dec 2022 +Contact: SeongJae Park +Description: Writing a number 'N' to this file creates the number of + directories for setting filters of the scheme named '0' to + 'N-1' under the filters/ directory. + +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//filters//type +Date: Dec 2022 +Contact: SeongJae Park +Description: Writing to and reading from this file sets and gets the type of + the memory of the interest. 'anon' for anonymous pages, or + 'memcg' for specific memory cgroup can be written and read. + +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//filters//memcg_path +Date: Dec 2022 +Contact: SeongJae Park +Description: If 'memcg' is written to the 'type' file, writing to and + reading from this file sets and gets the path to the memory + cgroup of the interest. + +What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//filters//matching +Date: Dec 2022 +Contact: SeongJae Park +Description: Writing 'Y' or 'N' to this file sets whether to filter out + pages that do or do not match to the 'type' and 'memcg_path', + respectively. Filter out means the action of the scheme will + not be applied to. + What: /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//stats/nr_tried Date: Mar 2022 Contact: SeongJae Park -- cgit v1.2.3 From 44383cef54c0ce1201f884d83cc2b367bc5aa4f7 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Mon, 19 Dec 2022 19:09:18 +0100 Subject: kasan: allow sampling page_alloc allocations for HW_TAGS As Hardware Tag-Based KASAN is intended to be used in production, its performance impact is crucial. As page_alloc allocations tend to be big, tagging and checking all such allocations can introduce a significant slowdown. Add two new boot parameters that allow to alleviate that slowdown: - kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only every Nth page_alloc allocation with the order configured by the second added parameter (default: tag every such allocation). - kasan.page_alloc.sample.order, which makes sampling enabled by the first parameter only affect page_alloc allocations with the order equal or greater than the specified value (default: 3, see below). The exact performance improvement caused by using the new parameters depends on their values and the applied workload. The chosen default value for kasan.page_alloc.sample.order is 3, which matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER. This is done for two reasons: 1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed costly to service", which corresponds to the idea that only large and thus costly allocations are supposed to sampled. 2. One of the workloads targeted by this patch is a benchmark that sends a large amount of data over a local loopback connection. Most multi-page data allocations in the networking subsystem have the order of SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER). When running a local loopback test on a testing MTE-enabled device in sync mode, enabling Hardware Tag-Based KASAN introduces a ~50% slowdown. Applying this patch and setting kasan.page_alloc.sampling to a value higher than 1 allows to lower the slowdown. The performance improvement saturates around the sampling interval value of 10 with the default sampling page order of 3. This lowers the slowdown to ~20%. The slowdown in real scenarios involving the network will likely be better. Enabling page_alloc sampling has a downside: KASAN misses bad accesses to a page_alloc allocation that has not been tagged. This lowers the value of KASAN as a security mitigation. However, based on measuring the number of page_alloc allocations of different orders during boot in a test build, sampling with the default kasan.page_alloc.sample.order value affects only ~7% of allocations. The rest ~93% of allocations are still checked deterministically. Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Marco Elver Cc: Alexander Potapenko Cc: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Evgenii Stepanov Cc: Jann Horn Cc: Mark Brand Cc: Peter Collingbourne Signed-off-by: Andrew Morton --- Documentation/dev-tools/kasan.rst | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'Documentation') diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst index 5c93ab915049..e66916a483cd 100644 --- a/Documentation/dev-tools/kasan.rst +++ b/Documentation/dev-tools/kasan.rst @@ -140,6 +140,23 @@ disabling KASAN altogether or controlling its features: - ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc allocations (default: ``on``). +- ``kasan.page_alloc.sample=`` makes KASAN tag only every + Nth page_alloc allocation with the order equal or greater than + ``kasan.page_alloc.sample.order``, where N is the value of the ``sample`` + parameter (default: ``1``, or tag every such allocation). + This parameter is intended to mitigate the performance overhead introduced + by KASAN. + Note that enabling this parameter makes Hardware Tag-Based KASAN skip checks + of allocations chosen by sampling and thus miss bad accesses to these + allocations. Use the default value for accurate bug detection. + +- ``kasan.page_alloc.sample.order=`` specifies the minimum + order of allocations that are affected by sampling (default: ``3``). + Only applies when ``kasan.page_alloc.sample`` is set to a value greater + than ``1``. + This parameter is intended to allow sampling only large page_alloc + allocations, which is the biggest source of the performance overhead. + Error reports ~~~~~~~~~~~~~ -- cgit v1.2.3 From 6df1b2212950aae2b2188c6645ea18e2a9e3fdd5 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Wed, 21 Dec 2022 21:19:00 -0700 Subject: mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[] lru_gen_folio will be chained into per-node lists by the coming lrugen->list. Link: https://lkml.kernel.org/r/20221222041905.2431096-3-yuzhao@google.com Signed-off-by: Yu Zhao Cc: Johannes Weiner Cc: Jonathan Corbet Cc: Michael Larabel Cc: Michal Hocko Cc: Mike Rapoport Cc: Roman Gushchin Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- Documentation/mm/multigen_lru.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst index d7062c6a8946..d8f721f98868 100644 --- a/Documentation/mm/multigen_lru.rst +++ b/Documentation/mm/multigen_lru.rst @@ -89,15 +89,15 @@ variables are monotonically increasing. Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` bits in order to fit into the gen counter in ``folio->flags``. Each -truncated generation number is an index to ``lrugen->lists[]``. The +truncated generation number is an index to ``lrugen->folios[]``. The sliding window technique is used to track at least ``MIN_NR_GENS`` and at most ``MAX_NR_GENS`` generations. The gen counter stores a value within ``[1, MAX_NR_GENS]`` while a page is on one of -``lrugen->lists[]``; otherwise it stores zero. +``lrugen->folios[]``; otherwise it stores zero. Each generation is divided into multiple tiers. A page accessed ``N`` times through file descriptors is in tier ``order_base_2(N)``. Unlike -generations, tiers do not have dedicated ``lrugen->lists[]``. In +generations, tiers do not have dedicated ``lrugen->folios[]``. In contrast to moving across generations, which requires the LRU lock, moving across tiers only involves atomic operations on ``folio->flags`` and therefore has a negligible cost. A feedback loop @@ -127,7 +127,7 @@ page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. Eviction -------- The eviction consumes old generations. Given an ``lruvec``, it -increments ``min_seq`` when ``lrugen->lists[]`` indexed by +increments ``min_seq`` when ``lrugen->folios[]`` indexed by ``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to evict from, it first compares ``min_seq[]`` to select the older type. If both types are equally old, it selects the one whose first tier has -- cgit v1.2.3 From 799fb82aa132fa3a3886b7872997a5a84e820062 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 3 Jan 2023 18:07:52 +0000 Subject: tools/vm: rename tools/vm to tools/mm Rename tools/vm to tools/mm for being more consistent with the code and documentation directories, and won't be confused with virtual machines. Link: https://lkml.kernel.org/r/20230103180754.129637-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/idle_page_tracking.rst | 2 +- Documentation/admin-guide/mm/pagemap.rst | 4 ++-- Documentation/mm/page_owner.rst | 2 +- Documentation/mm/slub.rst | 2 +- Documentation/translations/zh_CN/mm/page_owner.rst | 2 +- 5 files changed, 6 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst index df9394fb39c2..19492064278c 100644 --- a/Documentation/admin-guide/mm/idle_page_tracking.rst +++ b/Documentation/admin-guide/mm/idle_page_tracking.rst @@ -65,7 +65,7 @@ workload one should: are not reclaimable, he or she can filter them out using ``/proc/kpageflags``. -The page-types tool in the tools/vm directory can be used to assist in this. +The page-types tool in the tools/mm directory can be used to assist in this. If the tool is run initially with the appropriate option, it will mark all the queried pages as idle. Subsequent runs of the tool can then show which pages have their idle flag cleared in the interim. diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index 6e2e416af783..ceb5da3172ba 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -46,7 +46,7 @@ There are four components to pagemap: * ``/proc/kpagecount``. This file contains a 64-bit count of the number of times each page is mapped, indexed by PFN. -The page-types tool in the tools/vm directory can be used to query the +The page-types tool in the tools/mm directory can be used to query the number of times a page is mapped. * ``/proc/kpageflags``. This file contains a 64-bit set of flags for each @@ -173,7 +173,7 @@ LRU related page flags 14 - SWAPBACKED The page is backed by swap/RAM. -The page-types tool in the tools/vm directory can be used to query the +The page-types tool in the tools/mm directory can be used to query the above flags. Using pagemap to do something useful diff --git a/Documentation/mm/page_owner.rst b/Documentation/mm/page_owner.rst index 127514955a5e..5df26c0a0c1f 100644 --- a/Documentation/mm/page_owner.rst +++ b/Documentation/mm/page_owner.rst @@ -61,7 +61,7 @@ Usage 1) Build user-space helper:: - cd tools/vm + cd tools/mm make page_owner_sort 2) Enable page owner: add "page_owner=on" to boot cmdline. diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst index 7f652216dabe..3ffa7eded251 100644 --- a/Documentation/mm/slub.rst +++ b/Documentation/mm/slub.rst @@ -21,7 +21,7 @@ slabs that have data in them. See "slabinfo -h" for more options when running the command. ``slabinfo`` can be compiled with :: - gcc -o slabinfo tools/vm/slabinfo.c + gcc -o slabinfo tools/mm/slabinfo.c Some of the modes of operation of ``slabinfo`` require that slub debugging be enabled on the command line. F.e. no tracking information will be diff --git a/Documentation/translations/zh_CN/mm/page_owner.rst b/Documentation/translations/zh_CN/mm/page_owner.rst index 21a6a0837d42..4d3b2c33e4ef 100644 --- a/Documentation/translations/zh_CN/mm/page_owner.rst +++ b/Documentation/translations/zh_CN/mm/page_owner.rst @@ -62,7 +62,7 @@ page owner在默认情况下是禁用的。所以,如果你想使用它,你 1) 构建用户空间的帮助:: - cd tools/vm + cd tools/mm make page_owner_sort 2) 启用page owner: 添加 "page_owner=on" 到 boot cmdline. -- cgit v1.2.3 From baa489fabd01596d5426d6e112b34ba5fb59ab82 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 3 Jan 2023 18:07:53 +0000 Subject: selftests/vm: rename selftests/vm to selftests/mm Rename selftets/vm to selftests/mm for being more consistent with the code, documentation, and tools directories, and won't be confused with virtual machines. [sj@kernel.org: convert missing vm->mm changes] Link: https://lkml.kernel.org/r/20230107230643.252273-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230103180754.129637-5-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/hugetlbpage.rst | 6 +++--- Documentation/core-api/pin_user_pages.rst | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index 19f27c0d92e0..a969a2c742b2 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -461,13 +461,13 @@ Examples .. _map_hugetlb: ``map_hugetlb`` - see tools/testing/selftests/vm/map_hugetlb.c + see tools/testing/selftests/mm/map_hugetlb.c ``hugepage-shm`` - see tools/testing/selftests/vm/hugepage-shm.c + see tools/testing/selftests/mm/hugepage-shm.c ``hugepage-mmap`` - see tools/testing/selftests/vm/hugepage-mmap.c + see tools/testing/selftests/mm/hugepage-mmap.c The `libhugetlbfs`_ library provides a wide range of userspace tools to help with huge page usability, environment setup, and control. diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index b18416f4500f..facafbdecb95 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -221,7 +221,7 @@ Unit testing ============ This file:: - tools/testing/selftests/vm/gup_test.c + tools/testing/selftests/mm/gup_test.c has the following new calls to exercise the new pin*() wrapper functions: -- cgit v1.2.3 From 6c364edc194e104566f4de72f7a2af5b8fc17110 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 3 Jan 2023 18:07:54 +0000 Subject: Docs/admin-guide/mm/numaperf: increase depth of subsections Each section of numaperf.rst has zero depth, and therefore be exposed to the index of admin-guide/mm. Especially 'See Also' section on the index makes the document weird. Hide the sections from the index by giving the document a title and increasing the depth of each section. [sj@kernel.org: change title to fix duplicate label warning] Link: https://lkml.kernel.org/r/20230106194927.152663-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230103180754.129637-6-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/numaperf.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst index 166697325947..544a6d16c801 100644 --- a/Documentation/admin-guide/mm/numaperf.rst +++ b/Documentation/admin-guide/mm/numaperf.rst @@ -1,6 +1,9 @@ .. _numaperf: -============= +======================= +NUMA Memory Performance +======================= + NUMA Locality ============= @@ -61,7 +64,6 @@ that are CPUs and hence suitable for generic task scheduling, and IO initiators such as GPUs and NICs. Unlike access class 0, only nodes containing CPUs are considered. -================ NUMA Performance ================ @@ -96,7 +98,6 @@ for the platform. Access class 1 takes the same form but only includes values for CPU to memory activity. -========== NUMA Cache ========== @@ -170,7 +171,6 @@ The "size" is the number of bytes provided by this cache level. The "write_policy" will be 0 for write-back, and non-zero for write-through caching. -======== See Also ======== -- cgit v1.2.3 From 92b64bd01fe99325ba0f51125bcb991f1566eadc Mon Sep 17 00:00:00 2001 From: "Fabio M. De Francesco" Date: Wed, 7 Dec 2022 23:53:08 +0100 Subject: mm/highmem: add notes about conversions from kmap{,_atomic}() kmap() and kmap_atomic() have been deprecated. kmap_local_page() should always be used in new code and the call sites of the two deprecated functions should be converted. This latter task can lead to errors if it is not carried out with the necessary attention to the context around and between the maps and unmaps. Therefore, add further information to the Highmem's documentation for the purpose to make it clearer that (1) kmap() and kmap_atomic() must not any longer be called in new code and (2) developers doing conversions from kmap() amd kmap_atomic() are expected to take care of the context around and between the maps and unmaps, in order to not break the code. Relevant parts of this patch have been taken from messages exchanged privately with Ira Weiny (thanks!). [fmdefrancesco@gmail.com: merge two sentences into one, per Bagas] Link: https://lkml.kernel.org/r/20230119123945.10471-1-fmdefrancesco@gmail.com Link: https://lkml.kernel.org/r/20221207225308.8290-1-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco Cc: Ira Weiny Cc: Jonathan Corbet Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Sebastian Andrzej Siewior Cc: Thomas Gleixner Signed-off-by: Andrew Morton --- Documentation/mm/highmem.rst | 41 +++++++++++++++++++++++++++++++---------- 1 file changed, 31 insertions(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/highmem.rst b/Documentation/mm/highmem.rst index 0f731d9196b0..e691a06fb337 100644 --- a/Documentation/mm/highmem.rst +++ b/Documentation/mm/highmem.rst @@ -57,7 +57,8 @@ list shows them in order of preference of use. It can be invoked from any context (including interrupts) but the mappings can only be used in the context which acquired them. - This function should be preferred, where feasible, over all the others. + This function should always be used, whereas kmap_atomic() and kmap() have + been deprecated. These mappings are thread-local and CPU-local, meaning that the mapping can only be accessed from within this thread and the thread is bound to the @@ -100,10 +101,21 @@ list shows them in order of preference of use. (included in the "Functions" section) for details on how to manage nested mappings. -* kmap_atomic(). This permits a very short duration mapping of a single - page. Since the mapping is restricted to the CPU that issued it, it - performs well, but the issuing task is therefore required to stay on that - CPU until it has finished, lest some other task displace its mappings. +* kmap_atomic(). This function has been deprecated; use kmap_local_page(). + + NOTE: Conversions to kmap_local_page() must take care to follow the mapping + restrictions imposed on kmap_local_page(). Furthermore, the code between + calls to kmap_atomic() and kunmap_atomic() may implicitly depend on the side + effects of atomic mappings, i.e. disabling page faults or preemption, or both. + In that case, explicit calls to pagefault_disable() or preempt_disable() or + both must be made in conjunction with the use of kmap_local_page(). + + [Legacy documentation] + + This permits a very short duration mapping of a single page. Since the + mapping is restricted to the CPU that issued it, it performs well, but + the issuing task is therefore required to stay on that CPU until it has + finished, lest some other task displace its mappings. kmap_atomic() may also be used by interrupt contexts, since it does not sleep and the callers too may not sleep until after kunmap_atomic() is @@ -115,11 +127,20 @@ list shows them in order of preference of use. It is assumed that k[un]map_atomic() won't fail. -* kmap(). This should be used to make short duration mapping of a single - page with no restrictions on preemption or migration. It comes with an - overhead as mapping space is restricted and protected by a global lock - for synchronization. When mapping is no longer needed, the address that - the page was mapped to must be released with kunmap(). +* kmap(). This function has been deprecated; use kmap_local_page(). + + NOTE: Conversions to kmap_local_page() must take care to follow the mapping + restrictions imposed on kmap_local_page(). In particular, it is necessary to + make sure that the kernel virtual memory pointer is only valid in the thread + that obtained it. + + [Legacy documentation] + + This should be used to make short duration mapping of a single page with no + restrictions on preemption or migration. It comes with an overhead as mapping + space is restricted and protected by a global lock for synchronization. When + mapping is no longer needed, the address that the page was mapped to must be + released with kunmap(). Mapping changes must be propagated across all the CPUs. kmap() also requires global TLB invalidation when the kmap's pool wraps and it might -- cgit v1.2.3 From 86834644e3c9301ccd28df1293c37306a6332f3b Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 10 Jan 2023 19:03:55 +0000 Subject: Docs/mm/damon/index: mention DAMOS on the intro What DAMON aims to do is not only access monitoring but efficient and effective access-aware system operations. And DAMon-based Operation Schemes (DAMOS) is the important feature of DAMON for the goal. Make the intro of DAMON documentation to emphasize the goal and mention DAMOS. Link: https://lkml.kernel.org/r/20230110190400.119388-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/mm/damon/index.rst | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/damon/index.rst b/Documentation/mm/damon/index.rst index 48c0bbff98b2..2983699c12ea 100644 --- a/Documentation/mm/damon/index.rst +++ b/Documentation/mm/damon/index.rst @@ -4,8 +4,9 @@ DAMON: Data Access MONitor ========================== -DAMON is a data access monitoring framework subsystem for the Linux kernel. -The core mechanisms of DAMON (refer to :doc:`design` for the detail) make it +DAMON is a Linux kernel subsystem that provides a framework for data access +monitoring and the monitoring results based system operations. The core +monitoring mechanisms of DAMON (refer to :doc:`design` for the detail) make it - *accurate* (the monitoring output is useful enough for DRAM level memory management; It might not appropriate for CPU Cache levels, though), @@ -14,12 +15,16 @@ The core mechanisms of DAMON (refer to :doc:`design` for the detail) make it - *scalable* (the upper-bound of the overhead is in constant range regardless of the size of target workloads). -Using this framework, therefore, the kernel's memory management mechanisms can -make advanced decisions. Experimental memory management optimization works -that incurring high data accesses monitoring overhead could implemented again. -In user space, meanwhile, users who have some special workloads can write -personalized applications for better understanding and optimizations of their -workloads and systems. +Using this framework, therefore, the kernel can operate system in an +access-aware fashion. Because the features are also exposed to the user space, +users who have special information about their workloads can write personalized +applications for better understanding and optimizations of their workloads and +systems. + +For easier development of such systems, DAMON provides a feature called DAMOS +(DAMon-based Operation Schemes) in addition to the monitoring. Using the +feature, DAMON users in both kernel and user spaces can do access-aware system +operations with no code but simple configurations. .. toctree:: :maxdepth: 2 -- cgit v1.2.3 From 9a47c411ccddf843812dbbff707bd45e3bc83f40 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 10 Jan 2023 19:03:56 +0000 Subject: Docs/admin-guide/mm/damon/usage: update DAMOS actions/filters supports of each DAMON operations set Supports of each DAMOS action and filters are up to DAMON operations set implementation, but it's not mentioned in detail on the documentation. Update the information on the usage document. Link: https://lkml.kernel.org/r/20230110190400.119388-5-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 41 +++++++++++++++++++--------- 1 file changed, 28 insertions(+), 13 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 3d82ca6a17ff..9237d6a25897 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -279,14 +279,25 @@ The ``action`` file is for setting and getting what action you want to apply to memory regions having specific access pattern of the interest. The keywords that can be written to and read from the file and their meaning are as below. - - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED`` - - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD`` - - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT`` - - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE`` - - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE`` +Note that support of each action depends on the running DAMON operations set +`implementation `. + + - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``. + Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set. + - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. + Supported by ``vaddr`` and ``fvaddr`` operations set. + - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. + Supported by ``vaddr`` and ``fvaddr`` operations set. - ``lru_prio``: Prioritize the region on its LRU lists. + Supported by ``paddr`` operations set. - ``lru_deprio``: Deprioritize the region on its LRU lists. - - ``stat``: Do nothing but count the statistics + Supported by ``paddr`` operations set. + - ``stat``: Do nothing but count the statistics. + Supported by all operations sets. schemes//access_pattern/ --------------------------- @@ -388,8 +399,8 @@ pages of all memory cgroups except ``/having_care_already``.:: echo /having_care_already > 1/memcg_path echo N > 1/matching -Note that filters could be ignored depend on the running DAMON operations set -`implementation `. +Note that filters are currently supported only when ``paddr`` +`implementation ` is being used. .. _sysfs_schemes_stats: @@ -618,11 +629,15 @@ The ```` is a predefined integer for memory management actions, which DAMON will apply to the regions having the target access pattern. The supported numbers and their meanings are as below. - - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED`` - - 1: Call ``madvise()`` for the region with ``MADV_COLD`` - - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT`` - - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE`` - - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE`` + - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Ignored if + ``target`` is ``paddr``. + - 1: Call ``madvise()`` for the region with ``MADV_COLD``. Ignored if + ``target`` is ``paddr``. + - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``. + - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Ignored if + ``target`` is ``paddr``. + - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. Ignored if + ``target`` is ``paddr``. - 5: Do nothing but count the statistics Quota -- cgit v1.2.3 From e7366f3a2ed0f554354a1d7fe8bf5d98bab5247e Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Tue, 10 Jan 2023 19:03:57 +0000 Subject: Docs/mm/damon: add a maintainer-profile for DAMON Document the basic policies and expectations for DAMON development. Link: https://lkml.kernel.org/r/20230110190400.119388-6-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Cc: Shuah Khan Signed-off-by: Andrew Morton --- Documentation/mm/damon/index.rst | 1 + Documentation/mm/damon/maintainer-profile.rst | 62 +++++++++++++++++++++++++++ 2 files changed, 63 insertions(+) create mode 100644 Documentation/mm/damon/maintainer-profile.rst (limited to 'Documentation') diff --git a/Documentation/mm/damon/index.rst b/Documentation/mm/damon/index.rst index 2983699c12ea..5e0a50583500 100644 --- a/Documentation/mm/damon/index.rst +++ b/Documentation/mm/damon/index.rst @@ -32,3 +32,4 @@ operations with no code but simple configurations. faq design api + maintainer-profile diff --git a/Documentation/mm/damon/maintainer-profile.rst b/Documentation/mm/damon/maintainer-profile.rst new file mode 100644 index 000000000000..24a202f03de8 --- /dev/null +++ b/Documentation/mm/damon/maintainer-profile.rst @@ -0,0 +1,62 @@ +.. SPDX-License-Identifier: GPL-2.0 + +DAMON Maintainer Entry Profile +============================== + +The DAMON subsystem covers the files that listed in 'DATA ACCESS MONITOR' +section of 'MAINTAINERS' file. + +The mailing lists for the subsystem are damon@lists.linux.dev and +linux-mm@kvack.org. Patches should be made against the mm-unstable tree [1]_ +whenever possible and posted to the mailing lists. + +SCM Trees +--------- + +There are multiple Linux trees for DAMON development. Patches under +development or testing are queued in damon/next [2]_ by the DAMON maintainer. +Suffieicntly reviewed patches will be queued in mm-unstable [1]_ by the memory +management subsystem maintainer. After more sufficient tests, the patches will +be queued in mm-stable [3]_ , and finally pull-requested to the mainline by the +memory management subsystem maintainer. + +Note again the patches for review should be made against the mm-unstable +tree[1] whenever possible. damon/next is only for preview of others' works in +progress. + +Submit checklist addendum +------------------------- + +When making DAMON changes, you should do below. + +- Build changes related outputs including kernel and documents. +- Ensure the builds introduce no new errors or warnings. +- Run and ensure no new failures for DAMON selftests [4]_ and kunittests [5]_ . + +Further doing below and putting the results will be helpful. + +- Run damon-tests/corr [6]_ for normal changes. +- Run damon-tests/perf [7]_ for performance changes. + +Key cycle dates +--------------- + +Patches can be sent anytime. Key cycle dates of the mm-unstable[1] and +mm-stable[3] trees depend on the memory management subsystem maintainer. + +Review cadence +-------------- + +The DAMON maintainer does the work on the usual work hour (09:00 to 17:00, +Mon-Fri) in PST. The response to patches will occasionally be slow. Do not +hesitate to send a ping if you have not heard back within a week of sending a +patch. + + +.. [1] https://git.kernel.org/akpm/mm/h/mm-unstable +.. [2] https://git.kernel.org/sj/h/damon/next +.. [3] https://git.kernel.org/akpm/mm/h/mm-stable +.. [4] https://github.com/awslabs/damon-tests/blob/master/corr/run.sh#L49 +.. [5] https://github.com/awslabs/damon-tests/blob/master/corr/tests/kunit.sh +.. [6] https://github.com/awslabs/damon-tests/tree/master/corr +.. [7] https://github.com/awslabs/damon-tests/tree/master/perf -- cgit v1.2.3 From 94688e8eb453e616098cb930e5f6fed4a6ea2dfa Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 11 Jan 2023 14:28:47 +0000 Subject: mm: remove folio_pincount_ptr() and head_compound_pincount() We can use folio->_pincount directly, since all users are guarded by tests of compound/large. Link: https://lkml.kernel.org/r/20230111142915.1001531-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: John Hubbard Signed-off-by: Andrew Morton --- Documentation/core-api/pin_user_pages.rst | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index facafbdecb95..9fb0b1080d3b 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -55,18 +55,17 @@ flags the caller provides. The caller is required to pass in a non-null struct pages* array, and the function then pins pages by incrementing each by a special value: GUP_PIN_COUNTING_BIAS. -For compound pages, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead, -an exact form of pin counting is achieved, by using the 2nd struct page -in the compound page. A new struct page field, compound_pincount, has -been added in order to support this. - -This approach for compound pages avoids the counting upper limit problems that -are discussed below. Those limitations would have been aggravated severely by -huge pages, because each tail page adds a refcount to the head page. And in -fact, testing revealed that, without a separate compound_pincount field, -page overflows were seen in some huge page stress tests. - -This also means that huge pages and compound pages do not suffer +For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead, +the extra space available in the struct folio is used to store the +pincount directly. + +This approach for large folios avoids the counting upper limit problems +that are discussed below. Those limitations would have been aggravated +severely by huge pages, because each tail page adds a refcount to the +head page. And in fact, testing revealed that, without a separate pincount +field, refcount overflows were seen in some huge page stress tests. + +This also means that huge pages and large folios do not suffer from the false positives problem that is mentioned below.:: Function @@ -264,9 +263,9 @@ place.) Other diagnostics ================= -dump_page() has been enhanced slightly, to handle these new counting -fields, and to better report on compound pages in general. Specifically, -for compound pages, the exact (compound_pincount) pincount is reported. +dump_page() has been enhanced slightly to handle these new counting +fields, and to better report on large folios in general. Specifically, +for large folios, the exact pincount is reported. References ========== -- cgit v1.2.3 From 6eee1a0062298601dfc36bd34517affc4458c43d Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 11 Jan 2023 14:28:49 +0000 Subject: doc: clarify refcount section by referring to folios & pages Include the rename of subpages_mapcount to _nr_pages_mapped. Link: https://lkml.kernel.org/r/20230111142915.1001531-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/mm/transhuge.rst | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index ec3dc5b04226..03bbd0a19041 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -112,20 +112,20 @@ Refcounts and transparent huge pages Refcounting on THP is mostly consistent with refcounting on other compound pages: - - get_page()/put_page() and GUP operate on head page's ->_refcount. + - get_page()/put_page() and GUP operate on the folio->_refcount. - ->_refcount in tail pages is always zero: get_page_unless_zero() never succeeds on tail pages. - - map/unmap of PMD entry for the whole compound page increment/decrement - ->compound_mapcount, stored in the first tail page of the compound page; - and also increment/decrement ->subpages_mapcount (also in the first tail) - by COMPOUND_MAPPED when compound_mapcount goes from -1 to 0 or 0 to -1. + - map/unmap of a PMD entry for the whole THP increment/decrement + folio->_entire_mapcount and also increment/decrement + folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount + goes from -1 to 0 or 0 to -1. - - map/unmap of sub-pages with PTE entry increment/decrement ->_mapcount - on relevant sub-page of the compound page, and also increment/decrement - ->subpages_mapcount, stored in first tail page of the compound page, when - _mapcount goes from -1 to 0 or 0 to -1: counting sub-pages mapped by PTE. + - map/unmap of individual pages with PTE entry increment/decrement + page->_mapcount and also increment/decrement folio->_nr_pages_mapped + when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts + the number of pages mapped by PTE. split_huge_page internally has to distribute the refcounts in the head page to the tail pages before clearing all PG_head/tail bits from the page -- cgit v1.2.3 From f158ed6195ef949060811fd85086928470651944 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 11 Jan 2023 14:29:13 +0000 Subject: mm: convert deferred_split_huge_page() to deferred_split_folio() Now that both callers use a folio, pass the folio in and save a call to compound_head(). Link: https://lkml.kernel.org/r/20230111142915.1001531-28-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/mm/transhuge.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 03bbd0a19041..a9608fe51649 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -153,8 +153,8 @@ clear where references should go after split: it will stay on the head page. Note that split_huge_pmd() doesn't have any limitations on refcounting: pmd can be split at any point and never fails. -Partial unmap and deferred_split_huge_page() -============================================ +Partial unmap and deferred_split_folio() +======================================== Unmapping part of THP (with munmap() or other way) is not going to free memory immediately. Instead, we detect that a subpage of THP is not in use @@ -166,6 +166,6 @@ the place where we can detect partial unmap. It also might be counterproductive since in many cases partial unmap happens during exit(2) if a THP crosses a VMA boundary. -The function deferred_split_huge_page() is used to queue a page for splitting. +The function deferred_split_folio() is used to queue a folio for splitting. The splitting itself will happen when we get memory pressure via shrinker interface. -- cgit v1.2.3 From a8265cd917a63c0a1e13caf07e9a0a024480372b Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Thu, 12 Jan 2023 12:39:32 +0000 Subject: Documentation/mm: update references to __m[un]lock_page() to *_folio() We now pass folios to these functions, so update the documentation accordingly. Additionally, correct the outdated reference to __pagevec_lru_add_fn(), the referenced action occurs in __munlock_folio() directly now, replace reference to lru_cache_add_inactive_or_unevictable() with the modern folio equivalent folio_add_lru_vma() and reference folio flags by the flag name rather than accessor. Link: https://lkml.kernel.org/r/898c487169d98a7f09c1c1e57a7dfdc2b3f6bf0f.1673526881.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes Acked-by: Vlastimil Babka Cc: Christian Brauner Cc: Geert Uytterhoeven Cc: Hugh Dickins Cc: Joel Fernandes (Google) Cc: Jonathan Corbet Cc: Liam R. Howlett Cc: Matthew Wilcox Cc: Mike Rapoport (IBM) Cc: William Kucharski Signed-off-by: Andrew Morton --- Documentation/mm/unevictable-lru.rst | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 4a0e158aa9ce..2a90d0721dd9 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -308,22 +308,22 @@ do end up getting faulted into this VM_LOCKED VMA, they will be handled in the fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled. For each PTE (or PMD) being faulted into a VMA, the page add rmap function -calls mlock_vma_page(), which calls mlock_page() when the VMA is VM_LOCKED +calls mlock_vma_page(), which calls mlock_folio() when the VMA is VM_LOCKED (unless it is a PTE mapping of a part of a transparent huge page). Or when -it is a newly allocated anonymous page, lru_cache_add_inactive_or_unevictable() -calls mlock_new_page() instead: similar to mlock_page(), but can make better +it is a newly allocated anonymous page, folio_add_lru_vma() calls +mlock_new_folio() instead: similar to mlock_folio(), but can make better judgments, since this page is held exclusively and known not to be on LRU yet. -mlock_page() sets PageMlocked immediately, then places the page on the CPU's -mlock pagevec, to batch up the rest of the work to be done under lru_lock by -__mlock_page(). __mlock_page() sets PageUnevictable, initializes mlock_count +mlock_folio() sets PG_mlocked immediately, then places the page on the CPU's +mlock folio batch, to batch up the rest of the work to be done under lru_lock by +__mlock_folio(). __mlock_folio() sets PG_unevictable, initializes mlock_count and moves the page to unevictable state ("the unevictable LRU", but with -mlock_count in place of LRU threading). Or if the page was already PageLRU -and PageUnevictable and PageMlocked, it simply increments the mlock_count. +mlock_count in place of LRU threading). Or if the page was already PG_lru +and PG_unevictable and PG_mlocked, it simply increments the mlock_count. But in practice that may not work ideally: the page may not yet be on an LRU, or it may have been temporarily isolated from LRU. In such cases the mlock_count -field cannot be touched, but will be set to 0 later when __pagevec_lru_add_fn() +field cannot be touched, but will be set to 0 later when __munlock_folio() returns the page to "LRU". Races prohibit mlock_count from being set to 1 then: rather than risk stranding a page indefinitely as unevictable, always err with mlock_count on the low side, so that when munlocked the page will be rescued to @@ -377,8 +377,8 @@ that it is munlock() being performed. munlock_page() uses the mlock pagevec to batch up work to be done under lru_lock by __munlock_page(). __munlock_page() decrements the page's -mlock_count, and when that reaches 0 it clears PageMlocked and clears -PageUnevictable, moving the page from unevictable state to inactive LRU. +mlock_count, and when that reaches 0 it clears PG_mlocked and clears +PG_unevictable, moving the page from unevictable state to inactive LRU. But in practice that may not work ideally: the page may not yet have reached "the unevictable LRU", or it may have been temporarily isolated from it. In @@ -488,8 +488,8 @@ munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED munlock_page() uses the mlock pagevec to batch up work to be done under lru_lock by __munlock_page(). __munlock_page() decrements the page's -mlock_count, and when that reaches 0 it clears PageMlocked and clears -PageUnevictable, moving the page from unevictable state to inactive LRU. +mlock_count, and when that reaches 0 it clears PG_mlocked and clears +PG_unevictable, moving the page from unevictable state to inactive LRU. But in practice that may not work ideally: the page may not yet have reached "the unevictable LRU", or it may have been temporarily isolated from it. In @@ -515,7 +515,7 @@ munlocking by clearing VM_LOCKED from a VMA, before munlocking all the pages present, if one of those pages were unmapped by truncation or hole punch before mlock_pte_range() reached it, it would not be recognized as mlocked by this VMA, and would not be counted out of mlock_count. In this rare case, a page may -still appear as PageMlocked after it has been fully unmapped: and it is left to +still appear as PG_mlocked after it has been fully unmapped: and it is left to release_pages() (or __page_cache_release()) to clear it and update statistics before freeing (this event is counted in /proc/vmstat unevictable_pgs_cleared, which is usually 0). @@ -527,7 +527,7 @@ Page Reclaim in shrink_*_list() vmscan's shrink_active_list() culls any obviously unevictable pages - i.e. !page_evictable(page) pages - diverting those to the unevictable list. However, shrink_active_list() only sees unevictable pages that made it onto the -active/inactive LRU lists. Note that these pages do not have PageUnevictable +active/inactive LRU lists. Note that these pages do not have PG_unevictable set - otherwise they would be on the unevictable list and shrink_active_list() would never see them. -- cgit v1.2.3 From 2973d8229b78d3f148e0c45916a1e8b237dc6167 Mon Sep 17 00:00:00 2001 From: NeilBrown Date: Fri, 13 Jan 2023 11:12:17 +0000 Subject: mm: discard __GFP_ATOMIC __GFP_ATOMIC serves little purpose. Its main effect is to set ALLOC_HARDER which adds a few little boosts to increase the chance of an allocation succeeding, one of which is to lower the water-mark at which it will succeed. It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also adjusts this watermark. It is probable that other users of __GFP_HIGH should benefit from the other little bonuses that __GFP_ATOMIC gets. __GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM. There is little point to this. We already get a might_sleep() warning if __GFP_DIRECT_RECLAIM is set. __GFP_ATOMIC allows the "watermark_boost" to be side-stepped. It is probable that testing ALLOC_HARDER is a better fit here. __GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might sleep. This should test __GFP_DIRECT_RECLAIM instead. This patch: - removes __GFP_ATOMIC - allows __GFP_HIGH allocations to ignore watermark boosting as well as GFP_ATOMIC requests. - makes other adjustments as suggested by the above. The net result is not change to GFP_ATOMIC allocations. Other allocations that use __GFP_HIGH will benefit from a few different extra privileges. This affects: xen, dm, md, ntfs3 the vermillion frame buffer hibernation ksm swap all of which likely produce more benefit than cost if these selected allocation are more likely to succeed quickly. [mgorman: Minor adjustments to rework on top of a series] Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name Link: https://lkml.kernel.org/r/20230113111217.14134-7-mgorman@techsingularity.net Signed-off-by: NeilBrown Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Acked-by: Michal Hocko Cc: Matthew Wilcox Cc: Thierry Reding Signed-off-by: Andrew Morton --- Documentation/mm/balance.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/mm/balance.rst b/Documentation/mm/balance.rst index 6a1fadf3e173..e38e9d83c1c7 100644 --- a/Documentation/mm/balance.rst +++ b/Documentation/mm/balance.rst @@ -6,7 +6,7 @@ Memory Balancing Started Jan 2000 by Kanoj Sarcar -Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as +Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as well as for non __GFP_IO allocations. The first reason why a caller may avoid reclaim is that the caller can not -- cgit v1.2.3 From 90c9d13a47d45f2f16530c4d62af2fa4d74dfd16 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 16 Jan 2023 19:28:24 +0000 Subject: mm: remove page_evictable() Patch series "Remove leftover mlock/munlock page wrappers". We no longer need the various mlock page functions as all callers have folios. This patch (of 4): This function now has no users. Also update the unevictable-lru documentation to discuss folios instead of pages (mostly). [akpm@linux-foundation.org: fix Documentation/mm/unevictable-lru.rst underlining] Link: https://lkml.kernel.org/r/20230117145106.585b277b@canb.auug.org.au Link: https://lkml.kernel.org/r/20230116192827.2146732-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230116192827.2146732-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/mm/unevictable-lru.rst | 91 +++++++++++++++++++----------------- 1 file changed, 47 insertions(+), 44 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 2a90d0721dd9..552c63eef954 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -12,7 +12,7 @@ Introduction This document describes the Linux memory manager's "Unevictable LRU" infrastructure and the use of this to manage several types of "unevictable" -pages. +folios. The document attempts to provide the overall rationale behind this mechanism and the rationale for some of the design decisions that drove the @@ -27,8 +27,8 @@ The Unevictable LRU =================== The Unevictable LRU facility adds an additional LRU list to track unevictable -pages and to hide these pages from vmscan. This mechanism is based on a patch -by Larry Woodman of Red Hat to address several scalability problems with page +folios and to hide these folios from vmscan. This mechanism is based on a patch +by Larry Woodman of Red Hat to address several scalability problems with folio reclaim in Linux. The problems have been observed at customer sites on large memory x86_64 systems. @@ -52,40 +52,41 @@ The infrastructure may also be able to handle other conditions that make pages unevictable, either by definition or by circumstance, in the future. -The Unevictable LRU Page List ------------------------------ +The Unevictable LRU Folio List +------------------------------ -The Unevictable LRU page list is a lie. It was never an LRU-ordered list, but a -companion to the LRU-ordered anonymous and file, active and inactive page lists; -and now it is not even a page list. But following familiar convention, here in -this document and in the source, we often imagine it as a fifth LRU page list. +The Unevictable LRU folio list is a lie. It was never an LRU-ordered +list, but a companion to the LRU-ordered anonymous and file, active and +inactive folio lists; and now it is not even a folio list. But following +familiar convention, here in this document and in the source, we often +imagine it as a fifth LRU folio list. The Unevictable LRU infrastructure consists of an additional, per-node, LRU list -called the "unevictable" list and an associated page flag, PG_unevictable, to -indicate that the page is being managed on the unevictable list. +called the "unevictable" list and an associated folio flag, PG_unevictable, to +indicate that the folio is being managed on the unevictable list. The PG_unevictable flag is analogous to, and mutually exclusive with, the -PG_active flag in that it indicates on which LRU list a page resides when +PG_active flag in that it indicates on which LRU list a folio resides when PG_lru is set. -The Unevictable LRU infrastructure maintains unevictable pages as if they were +The Unevictable LRU infrastructure maintains unevictable folios as if they were on an additional LRU list for a few reasons: - (1) We get to "treat unevictable pages just like we treat other pages in the + (1) We get to "treat unevictable folios just like we treat other folios in the system - which means we get to use the same code to manipulate them, the same code to isolate them (for migrate, etc.), the same code to keep track of the statistics, etc..." [Rik van Riel] - (2) We want to be able to migrate unevictable pages between nodes for memory + (2) We want to be able to migrate unevictable folios between nodes for memory defragmentation, workload management and memory hotplug. The Linux kernel - can only migrate pages that it can successfully isolate from the LRU + can only migrate folios that it can successfully isolate from the LRU lists (or "Movable" pages: outside of consideration here). If we were to - maintain pages elsewhere than on an LRU-like list, where they can be - detected by isolate_lru_page(), we would prevent their migration. + maintain folios elsewhere than on an LRU-like list, where they can be + detected by folio_isolate_lru(), we would prevent their migration. -The unevictable list does not differentiate between file-backed and anonymous, -swap-backed pages. This differentiation is only important while the pages are, -in fact, evictable. +The unevictable list does not differentiate between file-backed and +anonymous, swap-backed folios. This differentiation is only important +while the folios are, in fact, evictable. The unevictable list benefits from the "arrayification" of the per-node LRU lists and statistics originally proposed and posted by Christoph Lameter. @@ -158,7 +159,7 @@ These are currently used in three places in the kernel: Detecting Unevictable Pages --------------------------- -The function page_evictable() in mm/internal.h determines whether a page is +The function folio_evictable() in mm/internal.h determines whether a folio is evictable or not using the query function outlined above [see section :ref:`Marking address spaces unevictable `] to check the AS_UNEVICTABLE flag. @@ -167,7 +168,7 @@ For address spaces that are so marked after being populated (as SHM regions might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate the page tables for the region as does, for example, mlock(), nor need it make any special effort to push any pages in the SHM_LOCK'd area to the unevictable -list. Instead, vmscan will do this if and when it encounters the pages during +list. Instead, vmscan will do this if and when it encounters the folios during a reclamation scan. On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must scan @@ -176,41 +177,43 @@ condition is keeping them unevictable. If an unevictable region is destroyed, the pages are also "rescued" from the unevictable list in the process of freeing them. -page_evictable() also checks for mlocked pages by testing an additional page -flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is -faulted into a VM_LOCKED VMA, or found in a VMA being VM_LOCKED. +folio_evictable() also checks for mlocked folios by calling +folio_test_mlocked(), which is set when a folio is faulted into a +VM_LOCKED VMA, or found in a VMA being VM_LOCKED. -Vmscan's Handling of Unevictable Pages --------------------------------------- +Vmscan's Handling of Unevictable Folios +--------------------------------------- -If unevictable pages are culled in the fault path, or moved to the unevictable -list at mlock() or mmap() time, vmscan will not encounter the pages until they +If unevictable folios are culled in the fault path, or moved to the unevictable +list at mlock() or mmap() time, vmscan will not encounter the folios until they have become evictable again (via munlock() for example) and have been "rescued" from the unevictable list. However, there may be situations where we decide, -for the sake of expediency, to leave an unevictable page on one of the regular +for the sake of expediency, to leave an unevictable folio on one of the regular active/inactive LRU lists for vmscan to deal with. vmscan checks for such -pages in all of the shrink_{active|inactive|page}_list() functions and will -"cull" such pages that it encounters: that is, it diverts those pages to the +folios in all of the shrink_{active|inactive|page}_list() functions and will +"cull" such folios that it encounters: that is, it diverts those folios to the unevictable list for the memory cgroup and node being scanned. -There may be situations where a page is mapped into a VM_LOCKED VMA, but the -page is not marked as PG_mlocked. Such pages will make it all the way to -shrink_active_list() or shrink_page_list() where they will be detected when -vmscan walks the reverse map in folio_referenced() or try_to_unmap(). The page -is culled to the unevictable list when it is released by the shrinker. +There may be situations where a folio is mapped into a VM_LOCKED VMA, +but the folio does not have the mlocked flag set. Such folios will make +it all the way to shrink_active_list() or shrink_page_list() where they +will be detected when vmscan walks the reverse map in folio_referenced() +or try_to_unmap(). The folio is culled to the unevictable list when it +is released by the shrinker. -To "cull" an unevictable page, vmscan simply puts the page back on the LRU list -using putback_lru_page() - the inverse operation to isolate_lru_page() - after -dropping the page lock. Because the condition which makes the page unevictable -may change once the page is unlocked, __pagevec_lru_add_fn() will recheck the -unevictable state of a page before placing it on the unevictable list. +To "cull" an unevictable folio, vmscan simply puts the folio back on +the LRU list using folio_putback_lru() - the inverse operation to +folio_isolate_lru() - after dropping the folio lock. Because the +condition which makes the folio unevictable may change once the folio +is unlocked, __pagevec_lru_add_fn() will recheck the unevictable state +of a folio before placing it on the unevictable list. MLOCKED Pages ============= -The unevictable page list is also useful for mlock(), in addition to ramfs and +The unevictable folio list is also useful for mlock(), in addition to ramfs and SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in NOMMU situations, all mappings are effectively mlocked. -- cgit v1.2.3 From 7efecffb8e7968c4a6c53177b0053ca4765fe233 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 16 Jan 2023 19:28:25 +0000 Subject: mm: remove mlock_vma_page() All callers now have a folio and can call mlock_vma_folio(). Update the documentation to refer to mlock_vma_folio(). Link: https://lkml.kernel.org/r/20230116192827.2146732-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/mm/unevictable-lru.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 552c63eef954..9257235fe904 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -311,7 +311,7 @@ do end up getting faulted into this VM_LOCKED VMA, they will be handled in the fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled. For each PTE (or PMD) being faulted into a VMA, the page add rmap function -calls mlock_vma_page(), which calls mlock_folio() when the VMA is VM_LOCKED +calls mlock_vma_folio(), which calls mlock_folio() when the VMA is VM_LOCKED (unless it is a PTE mapping of a part of a transparent huge page). Or when it is a newly allocated anonymous page, folio_add_lru_vma() calls mlock_new_folio() instead: similar to mlock_folio(), but can make better @@ -413,7 +413,7 @@ However, since mlock_vma_pages_range() starts by setting VM_LOCKED on a VMA, before mlocking any pages already present, if one of those pages were migrated before mlock_pte_range() reached it, it would get counted twice in mlock_count. To prevent that, mlock_vma_pages_range() temporarily marks the VMA as VM_IO, -so that mlock_vma_page() will skip it. +so that mlock_vma_folio() will skip it. To complete page migration, we place the old and new pages back onto the LRU afterwards. The "unneeded" page - old page on success, new page on failure - @@ -552,6 +552,6 @@ and node unevictable list. rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(), -check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_page() +check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_folio() to correct them. Such pages are culled to the unevictable list when released by the shrinker. -- cgit v1.2.3 From 672aa27d0bd241759376e62b78abb8aae1792479 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 16 Jan 2023 19:28:26 +0000 Subject: mm: remove munlock_vma_page() All callers now have a folio and can call munlock_vma_folio(). Update the documentation to refer to munlock_vma_folio(). Link: https://lkml.kernel.org/r/20230116192827.2146732-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/mm/unevictable-lru.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 9257235fe904..34b8b098c5bc 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -486,7 +486,7 @@ Before the unevictable/mlock changes, mlocking did not mark the pages in any way, so unmapping them required no processing. For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls -munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED +munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page). munlock_page() uses the mlock pagevec to batch up work to be done under @@ -510,7 +510,7 @@ which had been Copied-On-Write from the file pages now being truncated. Mlocked pages can be munlocked and deleted in this way: like with munmap(), for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls -munlock_vma_page(), which calls munlock_page() when the VMA is VM_LOCKED +munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page). However, if there is a racing munlock(), since mlock_vma_pages_range() starts -- cgit v1.2.3 From e0650a41f7d024b72669a2a2db846ef70281abd8 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 16 Jan 2023 19:28:27 +0000 Subject: mm: clean up mlock_page / munlock_page references in comments Change documentation and comments that refer to now-renamed functions. Link: https://lkml.kernel.org/r/20230116192827.2146732-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- Documentation/mm/unevictable-lru.rst | 30 ++++++++++++++++-------------- 1 file changed, 16 insertions(+), 14 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index 34b8b098c5bc..53e59433497a 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -298,7 +298,7 @@ treated as a no-op and mlock_fixup() simply returns. If the VMA passes some filtering as described in "Filtering Special VMAs" below, mlock_fixup() will attempt to merge the VMA with its neighbors or split off a subset of the VMA if the range does not cover the entire VMA. Any pages -already present in the VMA are then marked as mlocked by mlock_page() via +already present in the VMA are then marked as mlocked by mlock_folio() via mlock_pte_range() via walk_page_range() via mlock_vma_pages_range(). Before returning from the system call, do_mlock() or mlockall() will call @@ -373,20 +373,21 @@ Because of the VMA filtering discussed above, VM_LOCKED will not be set in any "special" VMAs. So, those VMAs will be ignored for munlock. If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the -specified range. All pages in the VMA are then munlocked by munlock_page() via +specified range. All pages in the VMA are then munlocked by munlock_folio() via mlock_pte_range() via walk_page_range() via mlock_vma_pages_range() - the same function used when mlocking a VMA range, with new flags for the VMA indicating that it is munlock() being performed. -munlock_page() uses the mlock pagevec to batch up work to be done under -lru_lock by __munlock_page(). __munlock_page() decrements the page's -mlock_count, and when that reaches 0 it clears PG_mlocked and clears -PG_unevictable, moving the page from unevictable state to inactive LRU. +munlock_folio() uses the mlock pagevec to batch up work to be done +under lru_lock by __munlock_folio(). __munlock_folio() decrements the +folio's mlock_count, and when that reaches 0 it clears the mlocked flag +and clears the unevictable flag, moving the folio from unevictable state +to the inactive LRU. -But in practice that may not work ideally: the page may not yet have reached +But in practice that may not work ideally: the folio may not yet have reached "the unevictable LRU", or it may have been temporarily isolated from it. In those cases its mlock_count field is unusable and must be assumed to be 0: so -that the page will be rescued to an evictable LRU, then perhaps be mlocked +that the folio will be rescued to an evictable LRU, then perhaps be mlocked again later if vmscan finds it in a VM_LOCKED VMA. @@ -489,15 +490,16 @@ For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page). -munlock_page() uses the mlock pagevec to batch up work to be done under -lru_lock by __munlock_page(). __munlock_page() decrements the page's -mlock_count, and when that reaches 0 it clears PG_mlocked and clears -PG_unevictable, moving the page from unevictable state to inactive LRU. +munlock_folio() uses the mlock pagevec to batch up work to be done +under lru_lock by __munlock_folio(). __munlock_folio() decrements the +folio's mlock_count, and when that reaches 0 it clears the mlocked flag +and clears the unevictable flag, moving the folio from unevictable state +to the inactive LRU. -But in practice that may not work ideally: the page may not yet have reached +But in practice that may not work ideally: the folio may not yet have reached "the unevictable LRU", or it may have been temporarily isolated from it. In those cases its mlock_count field is unusable and must be assumed to be 0: so -that the page will be rescued to an evictable LRU, then perhaps be mlocked +that the folio will be rescued to an evictable LRU, then perhaps be mlocked again later if vmscan finds it in a VM_LOCKED VMA. -- cgit v1.2.3 From 4ff93b292c0b91b9a7c9fb54643f122bbe59d8d0 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 18 Jan 2023 09:52:09 +0900 Subject: zsmalloc: make zspage chain size configurable Remove hard coded limit on the maximum number of physical pages per-zspage. This will allow tuning of zsmalloc pool as zspage chain size changes `pages per-zspage` and `objects per-zspage` characteristics of size classes which also affects size classes clustering (the way size classes are merged). Link: https://lkml.kernel.org/r/20230118005210.2814763-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky Acked-by: Minchan Kim Cc: Mike Kravetz Signed-off-by: Andrew Morton --- Documentation/mm/zsmalloc.rst | 168 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 168 insertions(+) (limited to 'Documentation') diff --git a/Documentation/mm/zsmalloc.rst b/Documentation/mm/zsmalloc.rst index 6e79893d6132..40323c9b39d8 100644 --- a/Documentation/mm/zsmalloc.rst +++ b/Documentation/mm/zsmalloc.rst @@ -80,3 +80,171 @@ Similarly, we assign zspage to: * ZS_ALMOST_FULL when n > N / f * ZS_EMPTY when n == 0 * ZS_FULL when n == N + + +Internals +========= + +zsmalloc has 255 size classes, each of which can hold a number of zspages. +Each zspage can contain up to ZSMALLOC_CHAIN_SIZE physical (0-order) pages. +The optimal zspage chain size for each size class is calculated during the +creation of the zsmalloc pool (see calculate_zspage_chain_size()). + +As an optimization, zsmalloc merges size classes that have similar +characteristics in terms of the number of pages per zspage and the number +of objects that each zspage can store. + +For instance, consider the following size classes::: + + class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable + ... + 94 1536 0 0 0 0 0 3 0 + 100 1632 0 0 0 0 0 2 0 + ... + + +Size classes #95-99 are merged with size class #100. This means that when we +need to store an object of size, say, 1568 bytes, we end up using size class +#100 instead of size class #96. Size class #100 is meant for objects of size +1632 bytes, so each object of size 1568 bytes wastes 1632-1568=64 bytes. + +Size class #100 consists of zspages with 2 physical pages each, which can +hold a total of 5 objects. If we need to store 13 objects of size 1568, we +end up allocating three zspages, or 6 physical pages. + +However, if we take a closer look at size class #96 (which is meant for +objects of size 1568 bytes) and trace `calculate_zspage_chain_size()`, we +find that the most optimal zspage configuration for this class is a chain +of 5 physical pages::: + + pages per zspage wasted bytes used% + 1 960 76 + 2 352 95 + 3 1312 89 + 4 704 95 + 5 96 99 + +This means that a class #96 configuration with 5 physical pages can store 13 +objects of size 1568 in a single zspage, using a total of 5 physical pages. +This is more efficient than the class #100 configuration, which would use 6 +physical pages to store the same number of objects. + +As the zspage chain size for class #96 increases, its key characteristics +such as pages per-zspage and objects per-zspage also change. This leads to +dewer class mergers, resulting in a more compact grouping of classes, which +reduces memory wastage. + +Let's take a closer look at the bottom of `/sys/kernel/debug/zsmalloc/zramX/classes`::: + + class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable + ... + 202 3264 0 0 0 0 0 4 0 + 254 4096 0 0 0 0 0 1 0 + ... + +Size class #202 stores objects of size 3264 bytes and has a maximum of 4 pages +per zspage. Any object larger than 3264 bytes is considered huge and belongs +to size class #254, which stores each object in its own physical page (objects +in huge classes do not share pages). + +Increasing the size of the chain of zspages also results in a higher watermark +for the huge size class and fewer huge classes overall. This allows for more +efficient storage of large objects. + +For zspage chain size of 8, huge class watermark becomes 3632 bytes::: + + class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable + ... + 202 3264 0 0 0 0 0 4 0 + 211 3408 0 0 0 0 0 5 0 + 217 3504 0 0 0 0 0 6 0 + 222 3584 0 0 0 0 0 7 0 + 225 3632 0 0 0 0 0 8 0 + 254 4096 0 0 0 0 0 1 0 + ... + +For zspage chain size of 16, huge class watermark becomes 3840 bytes::: + + class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable + ... + 202 3264 0 0 0 0 0 4 0 + 206 3328 0 0 0 0 0 13 0 + 207 3344 0 0 0 0 0 9 0 + 208 3360 0 0 0 0 0 14 0 + 211 3408 0 0 0 0 0 5 0 + 212 3424 0 0 0 0 0 16 0 + 214 3456 0 0 0 0 0 11 0 + 217 3504 0 0 0 0 0 6 0 + 219 3536 0 0 0 0 0 13 0 + 222 3584 0 0 0 0 0 7 0 + 223 3600 0 0 0 0 0 15 0 + 225 3632 0 0 0 0 0 8 0 + 228 3680 0 0 0 0 0 9 0 + 230 3712 0 0 0 0 0 10 0 + 232 3744 0 0 0 0 0 11 0 + 234 3776 0 0 0 0 0 12 0 + 235 3792 0 0 0 0 0 13 0 + 236 3808 0 0 0 0 0 14 0 + 238 3840 0 0 0 0 0 15 0 + 254 4096 0 0 0 0 0 1 0 + ... + +Overall the combined zspage chain size effect on zsmalloc pool configuration::: + + pages per zspage number of size classes (clusters) huge size class watermark + 4 69 3264 + 5 86 3408 + 6 93 3504 + 7 112 3584 + 8 123 3632 + 9 140 3680 + 10 143 3712 + 11 159 3744 + 12 164 3776 + 13 180 3792 + 14 183 3808 + 15 188 3840 + 16 191 3840 + + +A synthetic test +---------------- + +zram as a build artifacts storage (Linux kernel compilation). + +* `CONFIG_ZSMALLOC_CHAIN_SIZE=4` + + zsmalloc classes stats::: + + class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable + ... + Total 13 51 413836 412973 159955 3 + + zram mm_stat::: + + 1691783168 628083717 655175680 0 655175680 60 0 34048 34049 + + +* `CONFIG_ZSMALLOC_CHAIN_SIZE=8` + + zsmalloc classes stats::: + + class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable + ... + Total 18 87 414852 412978 156666 0 + + zram mm_stat::: + + 1691803648 627793930 641703936 0 641703936 60 0 33591 33591 + +Using larger zspage chains may result in using fewer physical pages, as seen +in the example where the number of physical pages used decreased from 159955 +to 156666, at the same time maximum zsmalloc pool memory usage went down from +655175680 to 641703936 bytes. + +However, this advantage may be offset by the potential for increased system +memory pressure (as some zspages have larger chain sizes) in cases where there +is heavy internal fragmentation and zspool compaction is unable to relocate +objects and release zspages. In these cases, it is recommended to decrease +the limit on the size of the zspage chains (as specified by the +CONFIG_ZSMALLOC_CHAIN_SIZE option). -- cgit v1.2.3 From d0634a622be35df2dfa80dc14ee1482ae1889cb2 Mon Sep 17 00:00:00 2001 From: Deming Wang Date: Tue, 17 Jan 2023 21:54:03 -0500 Subject: Documentation: mm: use `s/higmem/highmem/` fix typo for highmem We should use highmem replace higmem. Link: https://lkml.kernel.org/r/20230118025403.1531-1-wangdeming@inspur.com Signed-off-by: Deming Wang Reviewed-by: Ira Weiny Cc: "Fabio M. De Francesco" Cc: Jonathan Corbet Cc: Mike Rapoport (IBM) Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton --- Documentation/mm/highmem.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/mm/highmem.rst b/Documentation/mm/highmem.rst index e691a06fb337..4503868b0865 100644 --- a/Documentation/mm/highmem.rst +++ b/Documentation/mm/highmem.rst @@ -83,7 +83,7 @@ list shows them in order of preference of use. for pages which are known to not come from ZONE_HIGHMEM. However, it is always safe to use kmap_local_page() / kunmap_local(). - While it is significantly faster than kmap(), for the higmem case it + While it is significantly faster than kmap(), for the highmem case it comes with restrictions about the pointers validity. Contrary to kmap() mappings, the local mappings are only valid in the context of the caller and cannot be handed to other contexts. This implies that users must -- cgit v1.2.3 From 7b8144e63d84716f16a1b929e0c7e03ae5c4d5c1 Mon Sep 17 00:00:00 2001 From: "T.J. Alumbaugh" Date: Wed, 18 Jan 2023 00:18:21 +0000 Subject: mm: multi-gen LRU: section for working set protection Patch series "mm: multi-gen LRU: improve". This patch series improves a few MGLRU functions, collects related functions, and adds additional documentation. This patch (of 7): Add a section for working set protection in the code and the design doc. The admin doc already contains its usage. Link: https://lkml.kernel.org/r/20230118001827.1040870-1-talumbau@google.com Link: https://lkml.kernel.org/r/20230118001827.1040870-2-talumbau@google.com Signed-off-by: T.J. Alumbaugh Cc: Yu Zhao Signed-off-by: Andrew Morton --- Documentation/mm/multigen_lru.rst | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'Documentation') diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst index d8f721f98868..6e1483e70fdc 100644 --- a/Documentation/mm/multigen_lru.rst +++ b/Documentation/mm/multigen_lru.rst @@ -141,6 +141,21 @@ loop has detected outlying refaults from the tier this page is in. To this end, the feedback loop uses the first tier as the baseline, for the reason stated earlier. +Working set protection +---------------------- +Each generation is timestamped at birth. If ``lru_gen_min_ttl`` is +set, an ``lruvec`` is protected from the eviction when its oldest +generation was born within ``lru_gen_min_ttl`` milliseconds. In other +words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds +from getting evicted. The OOM killer is triggered if this working set +cannot be kept in memory. + +This time-based approach has the following advantages: + +1. It is easier to configure because it is agnostic to applications + and memory sizes. +2. It is more reliable because it is directly wired to the OOM killer. + Summary ------- The multi-gen LRU can be disassembled into the following parts: -- cgit v1.2.3 From db19a43d9b3a8876552f00f656008206ef9a5efa Mon Sep 17 00:00:00 2001 From: "T.J. Alumbaugh" Date: Wed, 18 Jan 2023 00:18:22 +0000 Subject: mm: multi-gen LRU: section for rmap/PT walk feedback Add a section for lru_gen_look_around() in the code and the design doc. Link: https://lkml.kernel.org/r/20230118001827.1040870-3-talumbau@google.com Signed-off-by: T.J. Alumbaugh Cc: Yu Zhao Signed-off-by: Andrew Morton --- Documentation/mm/multigen_lru.rst | 14 ++++++++++++++ 1 file changed, 14 insertions(+) (limited to 'Documentation') diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst index 6e1483e70fdc..bd988a142bc2 100644 --- a/Documentation/mm/multigen_lru.rst +++ b/Documentation/mm/multigen_lru.rst @@ -156,6 +156,20 @@ This time-based approach has the following advantages: and memory sizes. 2. It is more reliable because it is directly wired to the OOM killer. +Rmap/PT walk feedback +--------------------- +Searching the rmap for PTEs mapping each page on an LRU list (to test +and clear the accessed bit) can be expensive because pages from +different VMAs (PA space) are not cache friendly to the rmap (VA +space). For workloads mostly using mapped pages, searching the rmap +can incur the highest CPU cost in the reclaim path. + +``lru_gen_look_around()`` exploits spatial locality to reduce the +trips into the rmap. It scans the adjacent PTEs of a young PTE and +promotes hot pages. If the scan was done cacheline efficiently, it +adds the PMD entry pointing to the PTE table to the Bloom filter. This +forms a feedback loop between the eviction and the aging. + Summary ------- The multi-gen LRU can be disassembled into the following parts: -- cgit v1.2.3 From ccbbbb85945d8f0255aa9dbc1b617017e2294f2c Mon Sep 17 00:00:00 2001 From: "T.J. Alumbaugh" Date: Wed, 18 Jan 2023 00:18:23 +0000 Subject: mm: multi-gen LRU: section for Bloom filters Move Bloom filters code into a dedicated section. Improve the design doc to explain Bloom filter usage and connection between aging and eviction in their use. Link: https://lkml.kernel.org/r/20230118001827.1040870-4-talumbau@google.com Signed-off-by: T.J. Alumbaugh Cc: Yu Zhao Signed-off-by: Andrew Morton --- Documentation/mm/multigen_lru.rst | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'Documentation') diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst index bd988a142bc2..770b5d539856 100644 --- a/Documentation/mm/multigen_lru.rst +++ b/Documentation/mm/multigen_lru.rst @@ -170,6 +170,22 @@ promotes hot pages. If the scan was done cacheline efficiently, it adds the PMD entry pointing to the PTE table to the Bloom filter. This forms a feedback loop between the eviction and the aging. +Bloom Filters +------------- +Bloom filters are a space and memory efficient data structure for set +membership test, i.e., test if an element is not in the set or may be +in the set. + +In the eviction path, specifically, in ``lru_gen_look_around()``, if a +PMD has a sufficient number of hot pages, its address is placed in the +filter. In the aging path, set membership means that the PTE range +will be scanned for young pages. + +Note that Bloom filters are probabilistic on set membership. If a test +is false positive, the cost is an additional scan of a range of PTEs, +which may yield hot pages anyway. Parameters of the filter itself can +control the false positive rate in the limit. + Summary ------- The multi-gen LRU can be disassembled into the following parts: -- cgit v1.2.3 From 36c7b4db7c942ae9e1b111f0c6b468c8b2e33842 Mon Sep 17 00:00:00 2001 From: "T.J. Alumbaugh" Date: Wed, 18 Jan 2023 00:18:24 +0000 Subject: mm: multi-gen LRU: section for memcg LRU Move memcg LRU code into a dedicated section. Improve the design doc to outline its architecture. Link: https://lkml.kernel.org/r/20230118001827.1040870-5-talumbau@google.com Signed-off-by: T.J. Alumbaugh Cc: Yu Zhao Signed-off-by: Andrew Morton --- Documentation/mm/multigen_lru.rst | 33 ++++++++++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst index 770b5d539856..5f1f6ecbb79b 100644 --- a/Documentation/mm/multigen_lru.rst +++ b/Documentation/mm/multigen_lru.rst @@ -186,9 +186,40 @@ is false positive, the cost is an additional scan of a range of PTEs, which may yield hot pages anyway. Parameters of the filter itself can control the false positive rate in the limit. +Memcg LRU +--------- +An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs, +since each node and memcg combination has an LRU of folios (see +``mem_cgroup_lruvec()``). Its goal is to improve the scalability of +global reclaim, which is critical to system-wide memory overcommit in +data centers. Note that memcg LRU only applies to global reclaim. + +The basic structure of an memcg LRU can be understood by an analogy to +the active/inactive LRU (of folios): + +1. It has the young and the old (generations), i.e., the counterparts + to the active and the inactive; +2. The increment of ``max_seq`` triggers promotion, i.e., the + counterpart to activation; +3. Other events trigger similar operations, e.g., offlining an memcg + triggers demotion, i.e., the counterpart to deactivation. + +In terms of global reclaim, it has two distinct features: + +1. Sharding, which allows each thread to start at a random memcg (in + the old generation) and improves parallelism; +2. Eventual fairness, which allows direct reclaim to bail out at will + and reduces latency without affecting fairness over some time. + +In terms of traversing memcgs during global reclaim, it improves the +best-case complexity from O(n) to O(1) and does not affect the +worst-case complexity O(n). Therefore, on average, it has a sublinear +complexity. + Summary ------- -The multi-gen LRU can be disassembled into the following parts: +The multi-gen LRU (of folios) can be disassembled into the following +parts: * Generations * Rmap walks -- cgit v1.2.3 From 4180887f0625af739896aaafc44ee98103ab8f71 Mon Sep 17 00:00:00 2001 From: Jiaqi Yan Date: Fri, 20 Jan 2023 03:46:22 +0000 Subject: mm: memory-failure: document memory failure stats Add documentation for memory_failure's per NUMA node sysfs entries Link: https://lkml.kernel.org/r/20230120034622.2698268-4-jiaqiyan@google.com Signed-off-by: Jiaqi Yan Acked-by: Naoya Horiguchi Cc: David Rientjes Cc: Kefeng Wang Cc: Tony Luck Cc: Yang Shi Signed-off-by: Andrew Morton --- Documentation/ABI/stable/sysfs-devices-node | 39 +++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node index 8db67aa472f1..402af4b2b905 100644 --- a/Documentation/ABI/stable/sysfs-devices-node +++ b/Documentation/ABI/stable/sysfs-devices-node @@ -182,3 +182,42 @@ Date: November 2021 Contact: Jarkko Sakkinen Description: The total amount of SGX physical memory in bytes. + +What: /sys/devices/system/node/nodeX/memory_failure/total +Date: January 2023 +Contact: Jiaqi Yan +Description: + The total number of raw poisoned pages (pages containing + corrupted data due to memory errors) on a NUMA node. + +What: /sys/devices/system/node/nodeX/memory_failure/ignored +Date: January 2023 +Contact: Jiaqi Yan +Description: + Of the raw poisoned pages on a NUMA node, how many pages are + ignored by memory error recovery attempt, usually because + support for this type of pages is unavailable, and kernel + gives up the recovery. + +What: /sys/devices/system/node/nodeX/memory_failure/failed +Date: January 2023 +Contact: Jiaqi Yan +Description: + Of the raw poisoned pages on a NUMA node, how many pages are + failed by memory error recovery attempt. This usually means + a key recovery operation failed. + +What: /sys/devices/system/node/nodeX/memory_failure/delayed +Date: January 2023 +Contact: Jiaqi Yan +Description: + Of the raw poisoned pages on a NUMA node, how many pages are + delayed by memory error recovery attempt. Delayed poisoned + pages usually will be retried by kernel. + +What: /sys/devices/system/node/nodeX/memory_failure/recovered +Date: January 2023 +Contact: Jiaqi Yan +Description: + Of the raw poisoned pages on a NUMA node, how many pages are + recovered by memory error recovery attempt. -- cgit v1.2.3 From 192a50220342f82826812d0032a82fe441e924e2 Mon Sep 17 00:00:00 2001 From: Sidhartha Kumar Date: Wed, 25 Jan 2023 09:05:37 -0800 Subject: Documentation/mm: update hugetlbfs documentation to mention alloc_hugetlb_folio Link: https://lkml.kernel.org/r/20230125170537.96973-9-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar Cc: Gerald Schaefer Cc: John Hubbard Cc: Matthew Wilcox Cc: Mike Kravetz Cc: Muchun Song Signed-off-by: Andrew Morton --- Documentation/mm/hugetlbfs_reserv.rst | 21 +++++++++++---------- .../translations/zh_CN/mm/hugetlbfs_reserv.rst | 14 +++++++------- 2 files changed, 18 insertions(+), 17 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mm/hugetlbfs_reserv.rst b/Documentation/mm/hugetlbfs_reserv.rst index f143954e0d05..611728c49bff 100644 --- a/Documentation/mm/hugetlbfs_reserv.rst +++ b/Documentation/mm/hugetlbfs_reserv.rst @@ -181,14 +181,14 @@ Consuming Reservations/Allocating a Huge Page Reservations are consumed when huge pages associated with the reservations are allocated and instantiated in the corresponding mapping. The allocation -is performed within the routine alloc_huge_page():: +is performed within the routine alloc_hugetlb_folio():: - struct page *alloc_huge_page(struct vm_area_struct *vma, + struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) -alloc_huge_page is passed a VMA pointer and a virtual address, so it can +alloc_hugetlb_folio is passed a VMA pointer and a virtual address, so it can consult the reservation map to determine if a reservation exists. In addition, -alloc_huge_page takes the argument avoid_reserve which indicates reserves +alloc_hugetlb_folio takes the argument avoid_reserve which indicates reserves should not be used even if it appears they have been set aside for the specified address. The avoid_reserve argument is most often used in the case of Copy on Write and Page Migration where additional copies of an existing @@ -208,7 +208,8 @@ a reservation for the allocation. After determining whether a reservation exists and can be used for the allocation, the routine dequeue_huge_page_vma() is called. This routine takes two arguments related to reservations: -- avoid_reserve, this is the same value/argument passed to alloc_huge_page() +- avoid_reserve, this is the same value/argument passed to + alloc_hugetlb_folio(). - chg, even though this argument is of type long only the values 0 or 1 are passed to dequeue_huge_page_vma. If the value is 0, it indicates a reservation exists (see the section "Memory Policy and Reservations" for @@ -233,9 +234,9 @@ the scope reservations. Even if a surplus page is allocated, the same reservation based adjustments as above will be made: SetPagePrivate(page) and resv_huge_pages--. -After obtaining a new huge page, (page)->private is set to the value of -the subpool associated with the page if it exists. This will be used for -subpool accounting when the page is freed. +After obtaining a new hugetlb folio, (folio)->_hugetlb_subpool is set to the +value of the subpool associated with the page if it exists. This will be used +for subpool accounting when the folio is freed. The routine vma_commit_reservation() is then called to adjust the reserve map based on the consumption of the reservation. In general, this involves @@ -246,8 +247,8 @@ was no reservation in a shared mapping or this was a private mapping a new entry must be created. It is possible that the reserve map could have been changed between the call -to vma_needs_reservation() at the beginning of alloc_huge_page() and the -call to vma_commit_reservation() after the page was allocated. This would +to vma_needs_reservation() at the beginning of alloc_hugetlb_folio() and the +call to vma_commit_reservation() after the folio was allocated. This would be possible if hugetlb_reserve_pages was called for the same page in a shared mapping. In such cases, the reservation count and subpool free page count will be off by one. This rare condition can be identified by comparing the diff --git a/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst index 752e5696cd47..826a50c47389 100644 --- a/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst +++ b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst @@ -142,14 +142,14 @@ HPAGE_RESV_OWNER标志被设置,以表明该VMA拥有预留。 消耗预留/分配一个巨页 =========================== -当与预留相关的巨页在相应的映射中被分配和实例化时,预留就被消耗了。该分配是在函数alloc_huge_page() +当与预留相关的巨页在相应的映射中被分配和实例化时,预留就被消耗了。该分配是在函数alloc_hugetlb_folio() 中进行的:: - struct page *alloc_huge_page(struct vm_area_struct *vma, + struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) -alloc_huge_page被传递给一个VMA指针和一个虚拟地址,因此它可以查阅预留映射以确定是否存在预留。 -此外,alloc_huge_page需要一个参数avoid_reserve,该参数表示即使看起来已经为指定的地址预留了 +alloc_hugetlb_folio被传递给一个VMA指针和一个虚拟地址,因此它可以查阅预留映射以确定是否存在预留。 +此外,alloc_hugetlb_folio需要一个参数avoid_reserve,该参数表示即使看起来已经为指定的地址预留了 预留,也不应该使用预留。avoid_reserve参数最常被用于写时拷贝和页面迁移的情况下,即现有页面的额 外拷贝被分配。 @@ -162,7 +162,7 @@ vma_needs_reservation()返回的值通常为0或1。如果该地址存在预留 确定预留是否存在并可用于分配后,调用dequeue_huge_page_vma()函数。这个函数需要两个与预留有关 的参数: -- avoid_reserve,这是传递给alloc_huge_page()的同一个值/参数。 +- avoid_reserve,这是传递给alloc_hugetlb_folio()的同一个值/参数。 - chg,尽管这个参数的类型是long,但只有0或1的值被传递给dequeue_huge_page_vma。如果该值为0, 则表明存在预留(关于可能的问题,请参见 “预留和内存策略” 一节)。如果值 为1,则表示不存在预留,如果可能的话,必须从全局空闲池中取出该页。 @@ -179,7 +179,7 @@ free_huge_pages的值被递减。如果有一个与该页相关的预留,将 的剩余巨页和超额分配的问题。即使分配了一个多余的页面,也会进行与上面一样的基于预留的调整: SetPagePrivate(page) 和 resv_huge_pages--. -在获得一个新的巨页后,(page)->private被设置为与该页面相关的子池的值,如果它存在的话。当页 +在获得一个新的巨页后,(folio)->_hugetlb_subpool被设置为与该页面相关的子池的值,如果它存在的话。当页 面被释放时,这将被用于子池的计数。 然后调用函数vma_commit_reservation(),根据预留的消耗情况调整预留映射。一般来说,这涉及 @@ -199,7 +199,7 @@ SetPagePrivate(page)和resv_huge_pages-。 已经存在,所以不做任何改变。然而,如果共享映射中没有预留,或者这是一个私有映射,则必须创建 一个新的条目。 -在alloc_huge_page()开始调用vma_needs_reservation()和页面分配后调用 +在alloc_hugetlb_folio()开始调用vma_needs_reservation()和页面分配后调用 vma_commit_reservation()之间,预留映射有可能被改变。如果hugetlb_reserve_pages在共 享映射中为同一页面被调用,这将是可能的。在这种情况下,预留计数和子池空闲页计数会有一个偏差。 这种罕见的情况可以通过比较vma_needs_reservation和vma_commit_reservation的返回值来 -- cgit v1.2.3 From 5445fcbc4cda770cd07f49624704fabcc284d563 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Thu, 9 Feb 2023 19:20:07 +0000 Subject: Docs/admin-guide/mm/damon/usage: add DAMON debugfs interface deprecation notice Patch series "mm/damon: deprecate DAMON debugfs interface". DAMON debugfs interface has announced to be deprecated after >v5.15 LTS kernel is released. And v6.1.y has been announced to be an LTS[1]. Though the announcement was there for a while, some people might not have noticed that so far. Also, some users could depend on it and have problems at movng to the alternative (DAMON sysfs interface). For such cases, keep the code and documents with warning messages and contacts to ask helps for the deprecation. [1] https://git.kernel.org/pub/scm/docs/kernel/website.git/commit/?id=332e9121320bc7461b2d3a79665caf153e51732c This patch (of 3): DAMON debugfs interface has announced to be deprecated after >v5.15 LTS kernel is released. And, v6.1.y has announced to be an LTS[1]. Though the announcement was there for a while, some people might not noticed that so far. Also, some users could depend on it and have problems at movng to the alternative (DAMON sysfs interface). For such cases, note DAMON debugfs interface as deprecated, and contacts to ask helps on the document. [1] https://git.kernel.org/pub/scm/docs/kernel/website.git/commit/?id=332e9121320bc7461b2d3a79665caf153e51732c Link: https://lkml.kernel.org/r/20230209192009.7885-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230209192009.7885-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Jonathan Corbet Signed-off-by: Andrew Morton --- Documentation/admin-guide/mm/damon/usage.rst | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 9237d6a25897..9b823fec974d 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -25,10 +25,12 @@ DAMON provides below interfaces for different users. interface provides only simple :ref:`statistics ` for the monitoring results. For detailed monitoring results, DAMON provides a :ref:`tracepoint `. -- *debugfs interface.* +- *debugfs interface. (DEPRECATED!)* :ref:`This ` is almost identical to :ref:`sysfs interface - `. This will be removed after next LTS kernel is released, - so users should move to the :ref:`sysfs interface `. + `. This is deprecated, so users should move to the + :ref:`sysfs interface `. If you depend on this and cannot + move, please report your usecase to damon@lists.linux.dev and + linux-mm@kvack.org. - *Kernel Space Programming Interface.* :doc:`This ` is for kernel space programmers. Using this, users can utilize every feature of DAMON most flexibly and efficiently by @@ -487,13 +489,17 @@ the files as above. Above is only for an example. .. _debugfs_interface: -debugfs Interface -================= +debugfs Interface (DEPRECATED!) +=============================== .. note:: - DAMON debugfs interface will be removed after next LTS kernel is released, so - users should move to the :ref:`sysfs interface `. + THIS IS DEPRECATED! + + DAMON debugfs interface is deprecated, so users should move to the + :ref:`sysfs interface `. If you depend on this and cannot + move, please report your usecase to damon@lists.linux.dev and + linux-mm@kvack.org. DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``, ``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and -- cgit v1.2.3