1 files changed, 127 insertions, 0 deletions
diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
index 7840c1891751..be47b192a596 100644
--- a/Documentation/mm/page_tables.rst
+++ b/Documentation/mm/page_tables.rst
@@ -152,3 +152,130 @@ Page table handling code that wishes to be architecture-neutral, such as the
 virtual memory manager, will need to be written so that it traverses all of the
 currently five levels. This style should also be preferred for
 architecture-specific code, so as to be robust to future changes.
+
+
+MMU, TLB, and Page Faults
+=========================
+
+The `Memory Management Unit (MMU)` is a hardware component that handles virtual
+to physical address translations. It may use relatively small caches in hardware
+called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
+these translations.
+
+When CPU accesses a memory location, it provides a virtual address to the MMU,
+which checks if there is the existing translation in the TLB or in the Page
+Walk Caches (on architectures that support them). If no translation is found,
+MMU uses the page walks to determine the physical address and create the map.
+
+The dirty bit for a page is set (i.e., turned on) when the page is written to.
+Each page of memory has associated permission and dirty bits. The latter
+indicate that the page has been modified since it was loaded into memory.
+
+If nothing prevents it, eventually the physical memory can be accessed and the
+requested operation on the physical frame is performed.
+
+There are several reasons why the MMU can't find certain translations. It could
+happen because the CPU is trying to access memory that the current task is not
+permitted to, or because the data is not present into physical memory.
+
+When these conditions happen, the MMU triggers page faults, which are types of
+exceptions that signal the CPU to pause the current execution and run a special
+function to handle the mentioned exceptions.
+
+There are common and expected causes of page faults. These are triggered by
+process management optimization techniques called "Lazy Allocation" and
+"Copy-on-Write". Page faults may also happen when frames have been swapped out
+to persistent storage (swap partition or file) and evicted from their physical
+locations.
+
+These techniques improve memory efficiency, reduce latency, and minimize space
+occupation. This document won't go deeper into the details of "Lazy Allocation"
+and "Copy-on-Write" because these subjects are out of scope as they belong to
+Process Address Management.
+
+Swapping differentiates itself from the other mentioned techniques because it's
+undesirable since it's performed as a means to reduce memory under heavy
+pressure.
+
+Swapping can't work for memory mapped by kernel logical addresses. These are a
+subset of the kernel virtual space that directly maps a contiguous range of
+physical memory. Given any logical address, its physical address is determined
+with simple arithmetic on an offset. Accesses to logical addresses are fast
+because they avoid the need for complex page table lookups at the expenses of
+frames not being evictable and pageable out.
+
+If the kernel fails to make room for the data that must be present in the
+physical frames, the kernel invokes the out-of-memory (OOM) killer to make room
+by terminating lower priority processes until pressure reduces under a safe
+threshold.
+
+Additionally, page faults may be also caused by code bugs or by maliciously
+crafted addresses that the CPU is instructed to access. A thread of a process
+could use instructions to address (non-shared) memory which does not belong to
+its own address space, or could try to execute an instruction that want to write
+to a read-only location.
+
+If the above-mentioned conditions happen in user-space, the kernel sends a
+`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
+causes the termination of the thread and of the process it belongs to.
+
+This document is going to simplify and show an high altitude view of how the
+Linux kernel handles these page faults, creates tables and tables' entries,
+check if memory is present and, if not, requests to load data from persistent
+storage or from other devices, and updates the MMU and its caches.
+
+The first steps are architecture dependent. Most architectures jump to
+`do_page_fault()`, whereas the x86 interrupt handler is defined by the
+`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.
+
+Whatever the routes, all architectures end up to the invocation of
+`handle_mm_fault()` which, in turn, (likely) ends up calling
+`__handle_mm_fault()` to carry out the actual work of allocating the page
+tables.
+
+The unfortunate case of not being able to call `__handle_mm_fault()` means
+that the virtual address is pointing to areas of physical memory which are not
+permitted to be accessed (at least from the current context). This
+condition resolves to the kernel sending the above-mentioned SIGSEGV signal
+to the process and leads to the consequences already explained.
+
+`__handle_mm_fault()` carries out its work by calling several functions to
+find the entry's offsets of the upper layers of the page tables and allocate
+the tables that it may need.
+
+The functions that look for the offset have names like `*_offset()`, where the
+"*" is for pgd, p4d, pud, pmd, pte; instead the functions to allocate the
+corresponding tables, layer by layer, are called `*_alloc`, using the
+above-mentioned convention to name them after the corresponding types of tables
+in the hierarchy.
+
+The page table walk may end at one of the middle or upper layers (PMD, PUD).
+
+Linux supports larger page sizes than the usual 4KB (i.e., the so called
+`huge pages`). When using these kinds of larger pages, higher level pages can
+directly map them, with no need to use lower level page entries (PTE). Huge
+pages contain large contiguous physical regions that usually span from 2MB to
+1GB. They are respectively mapped by the PMD and PUD page entries.
+
+The huge pages bring with them several benefits like reduced TLB pressure,
+reduced page table overhead, memory allocation efficiency, and performance
+improvement for certain workloads. However, these benefits come with
+trade-offs, like wasted memory and allocation challenges.
+
+At the very end of the walk with allocations, if it didn't return errors,
+`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()`
+performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`.
+"read", "cow", "shared" give hints about the reasons and the kind of fault it's
+handling.
+
+The actual implementation of the workflow is very complex. Its design allows
+Linux to handle page faults in a way that is tailored to the specific
+characteristics of each architecture, while still sharing a common overall
+structure.
+
+To conclude this high altitude view of how Linux handles page faults, let's
+add that the page faults handler can be disabled and enabled respectively with
+`pagefault_disable()` and `pagefault_enable()`.
+
+Several code path make use of the latter two functions because they need to
+disable traps into the page faults handler, mostly to prevent deadlocks.