summaryrefslogtreecommitdiffstats
path: root/Documentation/mm
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/mm')
-rw-r--r--Documentation/mm/active_mm.rst6
-rw-r--r--Documentation/mm/arch_pgtable_helpers.rst2
-rw-r--r--Documentation/mm/multigen_lru.rst44
-rw-r--r--Documentation/mm/physical_memory.rst21
-rw-r--r--Documentation/mm/unevictable-lru.rst2
-rw-r--r--Documentation/mm/zsmalloc.rst135
6 files changed, 144 insertions, 66 deletions
diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst
index 45d89f8fb3a8..d096fc091e23 100644
--- a/Documentation/mm/active_mm.rst
+++ b/Documentation/mm/active_mm.rst
@@ -2,6 +2,12 @@
Active MM
=========
+Note, the mm_count refcount may no longer include the "lazy" users
+(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
+with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
+references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers, which abstract this config option.
+
::
List: linux-kernel
diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index 30d9a09f01f4..af3891f895b0 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -214,7 +214,7 @@ HugeTLB Page Table Helpers
+---------------------------+--------------------------------------------------+
| pte_huge | Tests a HugeTLB |
+---------------------------+--------------------------------------------------+
-| pte_mkhuge | Creates a HugeTLB |
+| arch_make_huge_pte | Creates a HugeTLB |
+---------------------------+--------------------------------------------------+
| huge_pte_dirty | Tests a dirty HugeTLB |
+---------------------------+--------------------------------------------------+
diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst
index 5f1f6ecbb79b..52ed5092022f 100644
--- a/Documentation/mm/multigen_lru.rst
+++ b/Documentation/mm/multigen_lru.rst
@@ -103,7 +103,8 @@ moving across tiers only involves atomic operations on
``folio->flags`` and therefore has a negligible cost. A feedback loop
modeled after the PID controller monitors refaults over all the tiers
from anon and file types and decides which tiers from which types to
-evict or protect.
+evict or protect. The desired effect is to balance refault percentages
+between anon and file types proportional to the swappiness level.
There are two conceptually independent procedures: the aging and the
eviction. They form a closed-loop system, i.e., the page reclaim.
@@ -156,6 +157,27 @@ This time-based approach has the following advantages:
and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.
+``mm_struct`` list
+------------------
+An ``mm_struct`` list is maintained for each memcg, and an
+``mm_struct`` follows its owner task to the new memcg when this task
+is migrated.
+
+A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
+``walk_page_range()`` with each ``mm_struct`` on this list to scan
+PTEs. When multiple page table walkers iterate the same list, each of
+them gets a unique ``mm_struct``, and therefore they can run in
+parallel.
+
+Page table walkers ignore any misplaced pages, e.g., if an
+``mm_struct`` was migrated, pages left in the previous memcg will be
+ignored when the current memcg is under reclaim. Similarly, page table
+walkers will ignore pages from nodes other than the one under reclaim.
+
+This infrastructure also tracks the usage of ``mm_struct`` between
+context switches so that page table walkers can skip processes that
+have been sleeping since the last iteration.
+
Rmap/PT walk feedback
---------------------
Searching the rmap for PTEs mapping each page on an LRU list (to test
@@ -170,7 +192,7 @@ promotes hot pages. If the scan was done cacheline efficiently, it
adds the PMD entry pointing to the PTE table to the Bloom filter. This
forms a feedback loop between the eviction and the aging.
-Bloom Filters
+Bloom filters
-------------
Bloom filters are a space and memory efficient data structure for set
membership test, i.e., test if an element is not in the set or may be
@@ -186,6 +208,18 @@ is false positive, the cost is an additional scan of a range of PTEs,
which may yield hot pages anyway. Parameters of the filter itself can
control the false positive rate in the limit.
+PID controller
+--------------
+A feedback loop modeled after the Proportional-Integral-Derivative
+(PID) controller monitors refaults over anon and file types and
+decides which type to evict when both types are available from the
+same generation.
+
+The PID controller uses generations rather than the wall clock as the
+time domain because a CPU can scan pages at different rates under
+varying memory pressure. It calculates a moving average for each new
+generation to avoid being permanently locked in a suboptimal state.
+
Memcg LRU
---------
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
@@ -223,9 +257,9 @@ parts:
* Generations
* Rmap walks
-* Page table walks
-* Bloom filters
-* PID controller
+* Page table walks via ``mm_struct`` list
+* Bloom filters for rmap/PT walk feedback
+* PID controller for refault feedback
The aging and the eviction form a producer-consumer model;
specifically, the latter drives the former by the sliding window over
diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index 1bc888d36ea1..531e73b003dd 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -19,7 +19,7 @@ a bank of memory very suitable for DMA near peripheral devices.
Each bank is called a node and the concept is represented under Linux by a
``struct pglist_data`` even if the architecture is UMA. This structure is
-always referenced to by it's typedef ``pg_data_t``. ``A pg_data_t`` structure
+always referenced by its typedef ``pg_data_t``. A ``pg_data_t`` structure
for a particular node can be referenced by ``NODE_DATA(nid)`` macro where
``nid`` is the ID of that node.
@@ -114,6 +114,25 @@ RAM equally split between two nodes, there will be ``ZONE_DMA32``,
| DMA32 | NORMAL | MOVABLE | | NORMAL | MOVABLE |
+---------+----------+-----------+ +------------+-------------+
+
+Memory banks may belong to interleaving nodes. In the example below an x86
+machine has 16 Gbytes of RAM in 4 memory banks, even banks belong to node 0
+and odd banks belong to node 1::
+
+
+ 0 4G 8G 12G 16G
+ +-------------+ +-------------+ +-------------+ +-------------+
+ | node 0 | | node 1 | | node 0 | | node 1 |
+ +-------------+ +-------------+ +-------------+ +-------------+
+
+ 0 16M 4G
+ +-----+-------+ +-------------+ +-------------+ +-------------+
+ | DMA | DMA32 | | NORMAL | | NORMAL | | NORMAL |
+ +-----+-------+ +-------------+ +-------------+ +-------------+
+
+In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from
+4 to 16 Gbytes.
+
.. _nodes:
Nodes
diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst
index 92ac5dca420c..d5ac8511eb67 100644
--- a/Documentation/mm/unevictable-lru.rst
+++ b/Documentation/mm/unevictable-lru.rst
@@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages:
* Those owned by ramfs.
+ * Those owned by tmpfs with the noswap mount option.
+
* Those mapped into SHM_LOCK'd shared memory regions.
* Those mapped into VM_LOCKED [mlock()ed] VMAs.
diff --git a/Documentation/mm/zsmalloc.rst b/Documentation/mm/zsmalloc.rst
index 64d127bfc221..a3c26d587752 100644
--- a/Documentation/mm/zsmalloc.rst
+++ b/Documentation/mm/zsmalloc.rst
@@ -39,13 +39,12 @@ With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via
# cat /sys/kernel/debug/zsmalloc/zram0/classes
- class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage
+ class size 10% 20% 30% 40% 50% 60% 70% 80% 90% 99% 100% obj_allocated obj_used pages_used pages_per_zspage freeable
...
...
- 9 176 0 1 186 129 8 4
- 10 192 1 0 2880 2872 135 3
- 11 208 0 1 819 795 42 2
- 12 224 0 1 219 159 12 4
+ 30 512 0 12 4 1 0 1 0 0 1 0 414 3464 3346 433 1 14
+ 31 528 2 7 2 2 1 0 1 0 0 2 117 4154 3793 536 4 44
+ 32 544 6 3 4 1 2 1 0 0 0 1 260 4170 3965 556 2 26
...
...
@@ -54,10 +53,28 @@ class
index
size
object size zspage stores
-almost_empty
- the number of ZS_ALMOST_EMPTY zspages(see below)
-almost_full
- the number of ZS_ALMOST_FULL zspages(see below)
+10%
+ the number of zspages with usage ratio less than 10% (see below)
+20%
+ the number of zspages with usage ratio between 10% and 20%
+30%
+ the number of zspages with usage ratio between 20% and 30%
+40%
+ the number of zspages with usage ratio between 30% and 40%
+50%
+ the number of zspages with usage ratio between 40% and 50%
+60%
+ the number of zspages with usage ratio between 50% and 60%
+70%
+ the number of zspages with usage ratio between 60% and 70%
+80%
+ the number of zspages with usage ratio between 70% and 80%
+90%
+ the number of zspages with usage ratio between 80% and 90%
+99%
+ the number of zspages with usage ratio between 90% and 99%
+100%
+ the number of zspages with usage ratio 100%
obj_allocated
the number of objects allocated
obj_used
@@ -66,19 +83,14 @@ pages_used
the number of pages allocated for the class
pages_per_zspage
the number of 0-order pages to make a zspage
+freeable
+ the approximate number of pages class compaction can free
-We assign a zspage to ZS_ALMOST_EMPTY fullness group when n <= N / f, where
-
-* n = number of allocated objects
-* N = total number of objects zspage can store
-* f = fullness_threshold_frac(ie, 4 at the moment)
-
-Similarly, we assign zspage to:
-
-* ZS_ALMOST_FULL when n > N / f
-* ZS_EMPTY when n == 0
-* ZS_FULL when n == N
-
+Each zspage maintains inuse counter which keeps track of the number of
+objects stored in the zspage. The inuse counter determines the zspage's
+"fullness group" which is calculated as the ratio of the "inuse" objects to
+the total number of objects the zspage can hold (objs_per_zspage). The
+closer the inuse counter is to objs_per_zspage, the better.
Internals
=========
@@ -94,10 +106,10 @@ of objects that each zspage can store.
For instance, consider the following size classes:::
- class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
+ class size 10% .... 100% obj_allocated obj_used pages_used pages_per_zspage freeable
...
- 94 1536 0 0 0 0 0 3 0
- 100 1632 0 0 0 0 0 2 0
+ 94 1536 0 .... 0 0 0 0 3 0
+ 100 1632 0 .... 0 0 0 0 2 0
...
@@ -134,10 +146,11 @@ reduces memory wastage.
Let's take a closer look at the bottom of `/sys/kernel/debug/zsmalloc/zramX/classes`:::
- class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
+ class size 10% .... 100% obj_allocated obj_used pages_used pages_per_zspage freeable
+
...
- 202 3264 0 0 0 0 0 4 0
- 254 4096 0 0 0 0 0 1 0
+ 202 3264 0 .. 0 0 0 0 4 0
+ 254 4096 0 .. 0 0 0 0 1 0
...
Size class #202 stores objects of size 3264 bytes and has a maximum of 4 pages
@@ -151,40 +164,42 @@ efficient storage of large objects.
For zspage chain size of 8, huge class watermark becomes 3632 bytes:::
- class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
+ class size 10% .... 100% obj_allocated obj_used pages_used pages_per_zspage freeable
+
...
- 202 3264 0 0 0 0 0 4 0
- 211 3408 0 0 0 0 0 5 0
- 217 3504 0 0 0 0 0 6 0
- 222 3584 0 0 0 0 0 7 0
- 225 3632 0 0 0 0 0 8 0
- 254 4096 0 0 0 0 0 1 0
+ 202 3264 0 .. 0 0 0 0 4 0
+ 211 3408 0 .. 0 0 0 0 5 0
+ 217 3504 0 .. 0 0 0 0 6 0
+ 222 3584 0 .. 0 0 0 0 7 0
+ 225 3632 0 .. 0 0 0 0 8 0
+ 254 4096 0 .. 0 0 0 0 1 0
...
For zspage chain size of 16, huge class watermark becomes 3840 bytes:::
- class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
+ class size 10% .... 100% obj_allocated obj_used pages_used pages_per_zspage freeable
+
...
- 202 3264 0 0 0 0 0 4 0
- 206 3328 0 0 0 0 0 13 0
- 207 3344 0 0 0 0 0 9 0
- 208 3360 0 0 0 0 0 14 0
- 211 3408 0 0 0 0 0 5 0
- 212 3424 0 0 0 0 0 16 0
- 214 3456 0 0 0 0 0 11 0
- 217 3504 0 0 0 0 0 6 0
- 219 3536 0 0 0 0 0 13 0
- 222 3584 0 0 0 0 0 7 0
- 223 3600 0 0 0 0 0 15 0
- 225 3632 0 0 0 0 0 8 0
- 228 3680 0 0 0 0 0 9 0
- 230 3712 0 0 0 0 0 10 0
- 232 3744 0 0 0 0 0 11 0
- 234 3776 0 0 0 0 0 12 0
- 235 3792 0 0 0 0 0 13 0
- 236 3808 0 0 0 0 0 14 0
- 238 3840 0 0 0 0 0 15 0
- 254 4096 0 0 0 0 0 1 0
+ 202 3264 0 .. 0 0 0 0 4 0
+ 206 3328 0 .. 0 0 0 0 13 0
+ 207 3344 0 .. 0 0 0 0 9 0
+ 208 3360 0 .. 0 0 0 0 14 0
+ 211 3408 0 .. 0 0 0 0 5 0
+ 212 3424 0 .. 0 0 0 0 16 0
+ 214 3456 0 .. 0 0 0 0 11 0
+ 217 3504 0 .. 0 0 0 0 6 0
+ 219 3536 0 .. 0 0 0 0 13 0
+ 222 3584 0 .. 0 0 0 0 7 0
+ 223 3600 0 .. 0 0 0 0 15 0
+ 225 3632 0 .. 0 0 0 0 8 0
+ 228 3680 0 .. 0 0 0 0 9 0
+ 230 3712 0 .. 0 0 0 0 10 0
+ 232 3744 0 .. 0 0 0 0 11 0
+ 234 3776 0 .. 0 0 0 0 12 0
+ 235 3792 0 .. 0 0 0 0 13 0
+ 236 3808 0 .. 0 0 0 0 14 0
+ 238 3840 0 .. 0 0 0 0 15 0
+ 254 4096 0 .. 0 0 0 0 1 0
...
Overall the combined zspage chain size effect on zsmalloc pool configuration:::
@@ -214,9 +229,10 @@ zram as a build artifacts storage (Linux kernel compilation).
zsmalloc classes stats:::
- class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
+ class size 10% .... 100% obj_allocated obj_used pages_used pages_per_zspage freeable
+
...
- Total 13 51 413836 412973 159955 3
+ Total 13 .. 51 413836 412973 159955 3
zram mm_stat:::
@@ -227,9 +243,10 @@ zram as a build artifacts storage (Linux kernel compilation).
zsmalloc classes stats:::
- class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
+ class size 10% .... 100% obj_allocated obj_used pages_used pages_per_zspage freeable
+
...
- Total 18 87 414852 412978 156666 0
+ Total 18 .. 87 414852 412978 156666 0
zram mm_stat:::