Diffstat (limited to 'Documentation/mm')
-rw-r--r-- | Documentation/mm/active_mm.rst            |   6
-rw-r--r-- | Documentation/mm/arch_pgtable_helpers.rst |   2
-rw-r--r-- | Documentation/mm/multigen_lru.rst         |  44
-rw-r--r-- | Documentation/mm/physical_memory.rst      |  21
-rw-r--r-- | Documentation/mm/unevictable-lru.rst      |   2
-rw-r--r-- | Documentation/mm/zsmalloc.rst             | 135
6 files changed, 144 insertions, 66 deletions
diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst
index 45d89f8fb3a8..d096fc091e23 100644
--- a/Documentation/mm/active_mm.rst
+++ b/Documentation/mm/active_mm.rst
@@ -2,6 +2,12 @@
 Active MM
 =========
 
+Note, the mm_count refcount may no longer include the "lazy" users
+(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
+with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
+references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers, which abstract this config option.
+
 ::
 
  List:       linux-kernel
diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index 30d9a09f01f4..af3891f895b0 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -214,7 +214,7 @@ HugeTLB Page Table Helpers
 +---------------------------+--------------------------------------------------+
 | pte_huge                  | Tests a HugeTLB                                  |
 +---------------------------+--------------------------------------------------+
-| pte_mkhuge                | Creates a HugeTLB                                |
+| arch_make_huge_pte        | Creates a HugeTLB                                |
 +---------------------------+--------------------------------------------------+
 | huge_pte_dirty            | Tests a dirty HugeTLB                            |
 +---------------------------+--------------------------------------------------+
diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst
index 5f1f6ecbb79b..52ed5092022f 100644
--- a/Documentation/mm/multigen_lru.rst
+++ b/Documentation/mm/multigen_lru.rst
@@ -103,7 +103,8 @@ moving across tiers only involves atomic operations on
 ``folio->flags`` and therefore has a negligible cost. A feedback loop
 modeled after the PID controller monitors refaults over all the tiers
 from anon and file types and decides which tiers from which types to
-evict or protect.
+evict or protect. The desired effect is to balance refault percentages
+between anon and file types proportional to the swappiness level.
 
 There are two conceptually independent procedures: the aging and the
 eviction. They form a closed-loop system, i.e., the page reclaim.
@@ -156,6 +157,27 @@ This time-based approach has the following advantages:
    and memory sizes.
 2. It is more reliable because it is directly wired to the OOM killer.
 
+``mm_struct`` list
+------------------
+An ``mm_struct`` list is maintained for each memcg, and an
+``mm_struct`` follows its owner task to the new memcg when this task
+is migrated.
+
+A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
+``walk_page_range()`` with each ``mm_struct`` on this list to scan
+PTEs. When multiple page table walkers iterate the same list, each of
+them gets a unique ``mm_struct``, and therefore they can run in
+parallel.
+
+Page table walkers ignore any misplaced pages, e.g., if an
+``mm_struct`` was migrated, pages left in the previous memcg will be
+ignored when the current memcg is under reclaim. Similarly, page table
+walkers will ignore pages from nodes other than the one under reclaim.
+
+This infrastructure also tracks the usage of ``mm_struct`` between
+context switches so that page table walkers can skip processes that
+have been sleeping since the last iteration.
+
 Rmap/PT walk feedback
 ---------------------
 Searching the rmap for PTEs mapping each page on an LRU list (to test
@@ -170,7 +192,7 @@ promotes hot pages. If the scan was done cacheline efficiently, it
 adds the PMD entry pointing to the PTE table to the Bloom filter. This
 forms a feedback loop between the eviction and the aging.
 
-Bloom Filters
+Bloom filters
 -------------
 Bloom filters are a space and memory efficient data structure for set
 membership test, i.e., test if an element is not in the set or may be
@@ -186,6 +208,18 @@ is false positive, the cost is an additional scan of a range of PTEs,
 which may yield hot pages anyway. Parameters of the filter itself can
 control the false positive rate in the limit.
 
+PID controller
+--------------
+A feedback loop modeled after the Proportional-Integral-Derivative
+(PID) controller monitors refaults over anon and file types and
+decides which type to evict when both types are available from the
+same generation.
+
+The PID controller uses generations rather than the wall clock as the
+time domain because a CPU can scan pages at different rates under
+varying memory pressure. It calculates a moving average for each new
+generation to avoid being permanently locked in a suboptimal state.
+
 Memcg LRU
 ---------
 An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
@@ -223,9 +257,9 @@ parts:
 
 * Generations
 * Rmap walks
-* Page table walks
-* Bloom filters
-* PID controller
+* Page table walks via ``mm_struct`` list
+* Bloom filters for rmap/PT walk feedback
+* PID controller for refault feedback
 
 The aging and the eviction form a producer-consumer model;
 specifically, the latter drives the former by the sliding window over
diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index 1bc888d36ea1..531e73b003dd 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -19,7 +19,7 @@ a bank of memory very suitable for DMA near peripheral devices.
 
 Each bank is called a node and the concept is represented under Linux by a
 ``struct pglist_data`` even if the architecture is UMA. This structure is
-always referenced to by it's typedef ``pg_data_t``. ``A pg_data_t`` structure
+always referenced by its typedef ``pg_data_t``. A ``pg_data_t`` structure
 for a particular node can be referenced by ``NODE_DATA(nid)`` macro where
 ``nid`` is the ID of that node.
 
@@ -114,6 +114,25 @@ RAM equally split between two nodes, there will be ``ZONE_DMA32``,
   | DMA32   |  NORMAL  |  MOVABLE  |   |   NORMAL   |   MOVABLE   |
   +---------+----------+-----------+   +------------+-------------+
 
+
+Memory banks may belong to interleaving nodes. In the example below an x86
+machine has 16 Gbytes of RAM in 4 memory banks, even banks belong to node 0
+and odd banks belong to node 1::
+
+
+  0              4G              8G             12G            16G
+  +-------------+ +-------------+ +-------------+ +-------------+
+  |    node 0   | |    node 1   | |    node 0   | |    node 1   |
+  +-------------+ +-------------+ +-------------+ +-------------+
+
+  0   16M      4G
+  +-----+-------+ +-------------+ +-------------+ +-------------+
+  | DMA | DMA32 | |    NORMAL   | |    NORMAL   | |    NORMAL   |
+  +-----+-------+ +-------------+ +-------------+ +-------------+
+
+In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from
+4 to 16 Gbytes.
+
 .. _nodes:
 
 Nodes
diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst
index 92ac5dca420c..d5ac8511eb67 100644
--- a/Documentation/mm/unevictable-lru.rst
+++ b/Documentation/mm/unevictable-lru.rst
@@ -42,6 +42,8 @@ The unevictable list addresses the following classes of unevictable pages:
 
  * Those owned by ramfs.
 
+ * Those owned by tmpfs with the noswap mount option.
+
  * Those mapped into SHM_LOCK'd shared memory regions.
 
  * Those mapped into VM_LOCKED [mlock()ed] VMAs.
diff --git a/Documentation/mm/zsmalloc.rst b/Documentation/mm/zsmalloc.rst
index 64d127bfc221..a3c26d587752 100644
--- a/Documentation/mm/zsmalloc.rst
+++ b/Documentation/mm/zsmalloc.rst
@@ -39,13 +39,12 @@ With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via
 
  # cat /sys/kernel/debug/zsmalloc/zram0/classes
 
- class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage
+ class  size       10%       20%       30%       40%       50%       60%       70%       80%       90%       99%      100% obj_allocated   obj_used pages_used pages_per_zspage freeable
     ...
     ...
-     9   176           0            1           186        129          8                4
-    10   192           1            0          2880       2872        135                3
-    11   208           0            1           819        795         42                2
-    12   224           0            1           219        159         12                4
+    30   512         0        12         4         1         0         1         0         0         1         0       414          3464       3346        433                1       14
+    31   528         2         7         2         2         1         0         1         0         0         2       117          4154       3793        536                4       44
+    32   544         6         3         4         1         2         1         0         0         0         1       260          4170       3965        556                2       26
     ...
     ...
 
@@ -54,10 +53,28 @@ class
 	index
 size
 	object size zspage stores
-almost_empty
-	the number of ZS_ALMOST_EMPTY zspages(see below)
-almost_full
-	the number of ZS_ALMOST_FULL zspages(see below)
+10%
+	the number of zspages with usage ratio less than 10% (see below)
+20%
+	the number of zspages with usage ratio between 10% and 20%
+30%
+	the number of zspages with usage ratio between 20% and 30%
+40%
+	the number of zspages with usage ratio between 30% and 40%
+50%
+	the number of zspages with usage ratio between 40% and 50%
+60%
+	the number of zspages with usage ratio between 50% and 60%
+70%
+	the number of zspages with usage ratio between 60% and 70%
+80%
+	the number of zspages with usage ratio between 70% and 80%
+90%
+	the number of zspages with usage ratio between 80% and 90%
+99%
+	the number of zspages with usage ratio between 90% and 99%
+100%
+	the number of zspages with usage ratio 100%
 obj_allocated
 	the number of objects allocated
 obj_used
@@ -66,19 +83,14 @@ pages_used
 	the number of pages allocated for the class
 pages_per_zspage
 	the number of 0-order pages to make a zspage
+freeable
+	the approximate number of pages class compaction can free
 
-We assign a zspage to ZS_ALMOST_EMPTY fullness group when n <= N / f, where
-
-* n = number of allocated objects
-* N = total number of objects zspage can store
-* f = fullness_threshold_frac(ie, 4 at the moment)
-
-Similarly, we assign zspage to:
-
-* ZS_ALMOST_FULL when n > N / f
-* ZS_EMPTY when n == 0
-* ZS_FULL when n == N
-
+Each zspage maintains inuse counter which keeps track of the number of
+objects stored in the zspage. The inuse counter determines the zspage's
+"fullness group" which is calculated as the ratio of the "inuse" objects to
+the total number of objects the zspage can hold (objs_per_zspage). The
+closer the inuse counter is to objs_per_zspage, the better.
 
 Internals
 =========
@@ -94,10 +106,10 @@ of objects that each zspage can store.
 
 For instance, consider the following size classes:::
 
-  class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
+  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
      ...
-     94  1536           0            0             0          0          0                3        0
-    100  1632           0            0             0          0          0                2        0
+     94  1536         0   ....       0             0          0          0                3        0
+    100  1632         0   ....       0             0          0          0                2        0
      ...
 
 
@@ -134,10 +146,11 @@ reduces memory wastage.
 
 Let's take a closer look at the bottom of `/sys/kernel/debug/zsmalloc/zramX/classes`:::
 
-  class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
+  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
+
      ...
-    202  3264           0            0             0          0          0                4        0
-    254  4096           0            0             0          0          0                1        0
+    202  3264         0     ..       0             0          0          0                4        0
+    254  4096         0     ..       0             0          0          0                1        0
      ...
 
 Size class #202 stores objects of size 3264 bytes and has a maximum of 4 pages
@@ -151,40 +164,42 @@ efficient storage of large objects.
 
 For zspage chain size of 8, huge class watermark becomes 3632 bytes:::
 
-  class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
+  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
+
      ...
-    202  3264           0            0             0          0          0                4        0
-    211  3408           0            0             0          0          0                5        0
-    217  3504           0            0             0          0          0                6        0
-    222  3584           0            0             0          0          0                7        0
-    225  3632           0            0             0          0          0                8        0
-    254  4096           0            0             0          0          0                1        0
+    202  3264         0     ..       0             0          0          0                4        0
+    211  3408         0     ..       0             0          0          0                5        0
+    217  3504         0     ..       0             0          0          0                6        0
+    222  3584         0     ..       0             0          0          0                7        0
+    225  3632         0     ..       0             0          0          0                8        0
+    254  4096         0     ..       0             0          0          0                1        0
      ...
 
 For zspage chain size of 16, huge class watermark becomes 3840 bytes:::
 
-  class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
+  class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
+
      ...
-    202  3264           0            0             0          0          0                4        0
-    206  3328           0            0             0          0          0               13        0
-    207  3344           0            0             0          0          0                9        0
-    208  3360           0            0             0          0          0               14        0
-    211  3408           0            0             0          0          0                5        0
-    212  3424           0            0             0          0          0               16        0
-    214  3456           0            0             0          0          0               11        0
-    217  3504           0            0             0          0          0                6        0
-    219  3536           0            0             0          0          0               13        0
-    222  3584           0            0             0          0          0                7        0
-    223  3600           0            0             0          0          0               15        0
-    225  3632           0            0             0          0          0                8        0
-    228  3680           0            0             0          0          0                9        0
-    230  3712           0            0             0          0          0               10        0
-    232  3744           0            0             0          0          0               11        0
-    234  3776           0            0             0          0          0               12        0
-    235  3792           0            0             0          0          0               13        0
-    236  3808           0            0             0          0          0               14        0
-    238  3840           0            0             0          0          0               15        0
-    254  4096           0            0             0          0          0                1        0
+    202  3264         0     ..       0             0          0          0                4        0
+    206  3328         0     ..       0             0          0          0               13        0
+    207  3344         0     ..       0             0          0          0                9        0
+    208  3360         0     ..       0             0          0          0               14        0
+    211  3408         0     ..       0             0          0          0                5        0
+    212  3424         0     ..       0             0          0          0               16        0
+    214  3456         0     ..       0             0          0          0               11        0
+    217  3504         0     ..       0             0          0          0                6        0
+    219  3536         0     ..       0             0          0          0               13        0
+    222  3584         0     ..       0             0          0          0                7        0
+    223  3600         0     ..       0             0          0          0               15        0
+    225  3632         0     ..       0             0          0          0                8        0
+    228  3680         0     ..       0             0          0          0                9        0
+    230  3712         0     ..       0             0          0          0               10        0
+    232  3744         0     ..       0             0          0          0               11        0
+    234  3776         0     ..       0             0          0          0               12        0
+    235  3792         0     ..       0             0          0          0               13        0
+    236  3808         0     ..       0             0          0          0               14        0
+    238  3840         0     ..       0             0          0          0               15        0
+    254  4096         0     ..       0             0          0          0                1        0
      ...
 
 Overall the combined zspage chain size effect on zsmalloc pool configuration:::
@@ -214,9 +229,10 @@ zram as a build artifacts storage (Linux kernel compilation).
 
   zsmalloc classes stats:::
 
-    class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
+    class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
+
     ...
-    Total                13           51        413836     412973     159955                3
+    Total              13     ..      51        413836     412973     159955                3
 
   zram mm_stat:::
 
@@ -227,9 +243,10 @@ zram as a build artifacts storage (Linux kernel compilation).
 
   zsmalloc classes stats:::
 
-    class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
+    class  size       10%   ....    100% obj_allocated   obj_used pages_used pages_per_zspage freeable
+
     ...
-    Total                18           87        414852     412978     156666                0
+    Total              18     ..      87        414852     412978     156666                0
 
   zram mm_stat:::
 