From 0116523cfffa62aeb5aa3b85ce7419f3dae0c1b8 Mon Sep 17 00:00:00 2001
From: Andrey Konovalov
Date: Fri, 28 Dec 2018 00:29:37 -0800
Subject: kasan, mm: change hooks signatures

Patch series "kasan: add software tag-based mode for arm64", v13.

This patchset adds a new software tag-based mode to KASAN [1].  (Initially
this mode was called KHWASAN, but it got renamed; see the naming rationale
at the end of this section.)

The plan is to implement HWASan [2] for the kernel, with the incentive that
it will have performance comparable to KASAN while consuming much less
memory, trading that off for somewhat imprecise bug detection and support
for arm64 only.

The underlying ideas of the approach used by software tag-based KASAN are:

1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
   pointer tags in the top byte of each kernel pointer.

2. Using shadow memory, we can store memory tags for each chunk of kernel
   memory.

3. On each memory allocation, we can generate a random tag, embed it into
   the returned pointer and set the memory tags that correspond to this
   chunk of memory to the same value.

4. By using compiler instrumentation, before each memory access we can add
   a check that the pointer tag matches the tag of the memory that is being
   accessed.

5. On a tag mismatch we report an error.

With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.

The new mode this patchset adds is called software tag-based KASAN.  The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
such pointers to be dereferenced.  The word "software" here means that
shadow memory manipulation and tag checking on pointer dereference are
done in software.  As it is the only tag-based implementation right now,
"software tag-based" KASAN is sometimes referred to as simply "tag-based"
in this patchset.

A potential expansion of this mode is a hardware tag-based mode, which
would use hardware memory tagging support (announced by Arm [3]) instead
of compiler instrumentation and manual shadow memory manipulation.

Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.

[1] https://www.kernel.org/doc/html/latest/dev-tools/kasan.html

[2] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

====== Rationale

On mobile devices generic KASAN's memory usage is a significant problem.
One of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.

Comment from Vishwath Mohan:

I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well.  This includes (a) low-memory form factors - Wear,
TV, Things, lower-tier phones like Go, and (b) connected components like
Pixel's visual core [1].  These are both places I'd love to have a low(er)
memory footprint option at my disposal.

Comment from Evgenii Stepanov:

Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% of available RAM (~350MB).
KASAN's overhead of 2x-3x on top of it is not insignificant.
Not having this overhead enables near-production use, e.g. running a
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in a test configuration.  These are the ones that often cost
the most engineering time to track down.  CPU overhead is bad, but
generally tolerable.  RAM is critical, in our experience.  Once it gets
low enough, the OOM-killer makes your life miserable.

[1] https://www.blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/

====== Technical details

Software tag-based KASAN mode is implemented in a very similar way to the
generic one.  This patchset essentially does the following:

1. TCR_TBI1 is set to enable Top Byte Ignore.

2. Shadow memory is used (with a different scale, 1:16, so each shadow
   byte corresponds to 16 bytes of kernel memory) to store memory tags.

3. All slab objects are aligned to the shadow scale, which is 16 bytes.

4. All pointers returned from the slab allocator are tagged with a random
   tag and the corresponding shadow memory is poisoned with the same value.

5. Compiler instrumentation is used to insert tag checks, either by
   calling callbacks or by inlining them (the CONFIG_KASAN_OUTLINE and
   CONFIG_KASAN_INLINE flags are reused).

6. When a tag mismatch is detected in callback instrumentation mode, KASAN
   simply prints a bug report.  In case of inline instrumentation, clang
   inserts a brk instruction, and KASAN has its own brk handler, which
   reports the bug.

7. The memory in between slab objects is marked with a reserved tag and
   acts as a redzone.

8. When a slab object is freed, it is marked with a reserved tag.

Bug detection is imprecise for two reasons:

1. We won't catch some small out-of-bounds accesses that fall into the
   same shadow cell as the last byte of a slab object.

2. We only have 1 byte to store tags, which means we have a 1/256
   probability of a tag match for an incorrect access (actually even
   slightly less due to reserved tag values).

Despite that, there is a particular type of bug that tag-based KASAN can
detect but generic KASAN cannot: a use-after-free that happens after the
object has been reallocated to someone else.

====== Testing

Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern, deliberate testing has been performed.

It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions have been instrumented in an LLVM
compiler pass, and a kernel module has been used that prints a bug report
whenever two pointers with different tags are compared/subtracted
(ignoring comparisons with NULL pointers and with pointers obtained by
casting an error code to a pointer type).  Then the kernel has been booted
in QEMU and on an Odroid C2 board, and syzkaller has been run.

This yielded the following results.

The two places that look interesting are:

is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c

Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is.  To make sure, debug checks
have been added to those two functions to verify that the result doesn't
change whether we operate on a pointer with or without untagging.
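A rough sketch of the kind of debug check meant here, assuming the
kasan_reset_tag() helper introduced later in this series (the wrapper
function and its name are purely illustrative, not the code that was
actually used for testing):

#include <linux/kasan.h>
#include <linux/mm.h>

/*
 * Debug-only sanity check: is_vmalloc_addr() must give the same answer
 * for a tagged pointer and for the same pointer with its tag reset.
 */
static void check_vmalloc_tag_invariance(const void *ptr)
{
	WARN_ON(is_vmalloc_addr(ptr) !=
		is_vmalloc_addr(kasan_reset_tag(ptr)));
}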
A few other cases that look less interesting:

Comparing pointers to achieve a unique sorting order of pointee objects
(e.g. sorting lock addresses before performing a double lock):

tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups in fs/notify/mark.c

Nothing needs to be done here, since the tags embedded into pointers don't
change, so the sorting order would still be unique.

Checks that a pointer belongs to some particular allocation:

is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h

Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.

Overall, since the kernel boots and works, there are no critical bugs.  As
for the rest, the traditional kernel testing way (use until it fails) is
the only one that looks feasible.

Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled.  Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.

====== Benchmarks

The following numbers were collected on an Odroid C2 board.  Both generic
and tag-based KASAN were used in inline instrumentation mode.

Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN

Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN

Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN

KASAN memory overhead consists of three main parts:

1. Increased slab memory usage due to redzones.

2. Shadow memory (reserved in full once during boot).

3. Quarantine (grows gradually up to some preset limit; the higher the
   limit, the higher the chance to detect a use-after-free).

Comparing tag-based vs generic KASAN for each of these points:

1. 20% vs 260% overhead.

2. 1/16th vs 1/8th of physical memory.

3. Tag-based KASAN doesn't require a quarantine.

[1] Time before the ext4 driver is initialized.

[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.

[3] Measured as `cat /proc/meminfo | grep Slab`.

====== Some notes

A few notes:

1. The patchset can be found here:
   https://github.com/xairy/kasan-prototype/tree/khwasan

2. Building requires a recent Clang version (7.0.0 or later).

3. Stack instrumentation is not supported yet and will be added later.

This patch (of 25):

Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc).  This patch
updates the KASAN hook signatures and their usage in the SLAB and SLUB
code to reflect that.
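As a minimal illustration of the resulting calling convention (simplified
from the SLAB/SLUB changes below; the wrapper itself is not kernel code),
callers now have to keep the pointer returned by the hook rather than the
one they passed in, since its top byte may have changed:

#include <linux/kasan.h>
#include <linux/slab.h>

static void *example_cache_alloc(struct kmem_cache *s, size_t size,
				 gfp_t flags)
{
	void *ret = kmem_cache_alloc(s, flags);

	/* kasan_kmalloc() may return a pointer with a new tag in its top byte. */
	ret = kasan_kmalloc(s, ret, size, flags);
	return ret;
}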
Link: http://lkml.kernel.org/r/aec2b5e3973781ff8a6bb6760f8543643202c451.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Cc: Christoph Lameter Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kasan.h | 43 +++++++++++++++++++++++++++++-------------- include/linux/slab.h | 4 ++-- 2 files changed, 31 insertions(+), 16 deletions(-) (limited to 'include') diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 46aae129917c..52c86a568a4e 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -51,16 +51,16 @@ void kasan_cache_shutdown(struct kmem_cache *cache); void kasan_poison_slab(struct page *page); void kasan_unpoison_object_data(struct kmem_cache *cache, void *object); void kasan_poison_object_data(struct kmem_cache *cache, void *object); -void kasan_init_slab_obj(struct kmem_cache *cache, const void *object); +void *kasan_init_slab_obj(struct kmem_cache *cache, const void *object); -void kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags); +void *kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags); void kasan_kfree_large(void *ptr, unsigned long ip); void kasan_poison_kfree(void *ptr, unsigned long ip); -void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size, +void *kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size, gfp_t flags); -void kasan_krealloc(const void *object, size_t new_size, gfp_t flags); +void *kasan_krealloc(const void *object, size_t new_size, gfp_t flags); -void kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags); +void *kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags); bool kasan_slab_free(struct kmem_cache *s, void *object, unsigned long ip); struct kasan_cache { @@ -105,19 +105,34 @@ static inline void kasan_unpoison_object_data(struct kmem_cache *cache, void *object) {} static inline void kasan_poison_object_data(struct kmem_cache *cache, void *object) {} -static inline void kasan_init_slab_obj(struct kmem_cache *cache, - const void *object) {} +static inline void *kasan_init_slab_obj(struct kmem_cache *cache, + const void *object) +{ + return (void *)object; +} -static inline void kasan_kmalloc_large(void *ptr, size_t size, gfp_t flags) {} +static inline void *kasan_kmalloc_large(void *ptr, size_t size, gfp_t flags) +{ + return ptr; +} static inline void kasan_kfree_large(void *ptr, unsigned long ip) {} static inline void kasan_poison_kfree(void *ptr, unsigned long ip) {} -static inline void kasan_kmalloc(struct kmem_cache *s, const void *object, - size_t size, gfp_t flags) {} -static inline void kasan_krealloc(const void *object, size_t new_size, - gfp_t flags) {} +static inline void *kasan_kmalloc(struct kmem_cache *s, const void *object, + size_t size, gfp_t flags) +{ + return (void *)object; +} +static inline void *kasan_krealloc(const void *object, size_t new_size, + gfp_t flags) +{ + return (void *)object; +} -static inline void kasan_slab_alloc(struct kmem_cache *s, void *object, - gfp_t flags) {} +static inline void *kasan_slab_alloc(struct kmem_cache *s, void *object, + gfp_t flags) +{ + return object; +} static inline bool kasan_slab_free(struct kmem_cache *s, void *object, unsigned long ip) { diff --git a/include/linux/slab.h b/include/linux/slab.h index 918f374e7156..351ac48dabc4 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -444,7 +444,7 @@ static __always_inline void 
*kmem_cache_alloc_trace(struct kmem_cache *s, { void *ret = kmem_cache_alloc(s, flags); - kasan_kmalloc(s, ret, size, flags); + ret = kasan_kmalloc(s, ret, size, flags); return ret; } @@ -455,7 +455,7 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s, { void *ret = kmem_cache_alloc_node(s, gfpflags, node); - kasan_kmalloc(s, ret, size, gfpflags); + ret = kasan_kmalloc(s, ret, size, gfpflags); return ret; } #endif /* CONFIG_TRACING */ -- cgit v1.2.3 From 2bd926b439b4cb6b9ed240a9781cd01958b53d85 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:29:53 -0800 Subject: kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS This commit splits the current CONFIG_KASAN config option into two: 1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one that exists now); 2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode. The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have another hardware tag-based KASAN mode, that will rely on hardware memory tagging support in arm64. With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to instrument kernel files with -fsantize=kernel-hwaddress (except the ones for which KASAN_SANITIZE := n is set). Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes. This commit also adds empty placeholder (for now) implementation of tag-based KASAN specific hooks inserted by the compiler and adjusts common hooks implementation. While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will enable once all the infrastracture code has been added. Link: http://lkml.kernel.org/r/b2550106eb8a68b10fefbabce820910b115aa853.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Cc: Christoph Lameter Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/compiler-clang.h | 6 +++++- include/linux/compiler-gcc.h | 6 ++++++ include/linux/compiler_attributes.h | 13 ------------- include/linux/kasan.h | 16 ++++++++++++---- 4 files changed, 23 insertions(+), 18 deletions(-) (limited to 'include') diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h index 3e7dafb3ea80..39f668d5066b 100644 --- a/include/linux/compiler-clang.h +++ b/include/linux/compiler-clang.h @@ -16,9 +16,13 @@ /* all clang versions usable with the kernel support KASAN ABI version 5 */ #define KASAN_ABI_VERSION 5 +#if __has_feature(address_sanitizer) || __has_feature(hwaddress_sanitizer) /* emulate gcc's __SANITIZE_ADDRESS__ flag */ -#if __has_feature(address_sanitizer) #define __SANITIZE_ADDRESS__ +#define __no_sanitize_address \ + __attribute__((no_sanitize("address", "hwaddress"))) +#else +#define __no_sanitize_address #endif /* diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h index 2010493e1040..5776da43da97 100644 --- a/include/linux/compiler-gcc.h +++ b/include/linux/compiler-gcc.h @@ -143,6 +143,12 @@ #define KASAN_ABI_VERSION 3 #endif +#if __has_attribute(__no_sanitize_address__) +#define __no_sanitize_address __attribute__((no_sanitize_address)) +#else +#define __no_sanitize_address +#endif + #if GCC_VERSION >= 50100 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1 #endif diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h index fe07b680dd4a..19f32b0c29af 100644 --- 
a/include/linux/compiler_attributes.h +++ b/include/linux/compiler_attributes.h @@ -199,19 +199,6 @@ */ #define __noreturn __attribute__((__noreturn__)) -/* - * Optional: only supported since gcc >= 4.8 - * Optional: not supported by icc - * - * gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-no_005fsanitize_005faddress-function-attribute - * clang: https://clang.llvm.org/docs/AttributeReference.html#no-sanitize-address-no-address-safety-analysis - */ -#if __has_attribute(__no_sanitize_address__) -# define __no_sanitize_address __attribute__((__no_sanitize_address__)) -#else -# define __no_sanitize_address -#endif - /* * gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html#index-packed-type-attribute * clang: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-packed-variable-attribute diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 52c86a568a4e..b66fdf5ea7ab 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -45,8 +45,6 @@ void kasan_free_pages(struct page *page, unsigned int order); void kasan_cache_create(struct kmem_cache *cache, unsigned int *size, slab_flags_t *flags); -void kasan_cache_shrink(struct kmem_cache *cache); -void kasan_cache_shutdown(struct kmem_cache *cache); void kasan_poison_slab(struct page *page); void kasan_unpoison_object_data(struct kmem_cache *cache, void *object); @@ -97,8 +95,6 @@ static inline void kasan_free_pages(struct page *page, unsigned int order) {} static inline void kasan_cache_create(struct kmem_cache *cache, unsigned int *size, slab_flags_t *flags) {} -static inline void kasan_cache_shrink(struct kmem_cache *cache) {} -static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} static inline void kasan_poison_slab(struct page *page) {} static inline void kasan_unpoison_object_data(struct kmem_cache *cache, @@ -155,4 +151,16 @@ static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; } #endif /* CONFIG_KASAN */ +#ifdef CONFIG_KASAN_GENERIC + +void kasan_cache_shrink(struct kmem_cache *cache); +void kasan_cache_shutdown(struct kmem_cache *cache); + +#else /* CONFIG_KASAN_GENERIC */ + +static inline void kasan_cache_shrink(struct kmem_cache *cache) {} +static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} + +#endif /* CONFIG_KASAN_GENERIC */ + #endif /* LINUX_KASAN_H */ -- cgit v1.2.3 From 9577dd7486487722ed8f0773243223f108e8089f Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:01 -0800 Subject: kasan: rename kasan_zero_page to kasan_early_shadow_page With tag based KASAN mode the early shadow value is 0xff and not 0x00, so this patch renames kasan_zero_(page|pte|pmd|pud|p4d) to kasan_early_shadow_(page|pte|pmd|pud|p4d) to avoid confusion. 
Link: http://lkml.kernel.org/r/3fed313280ebf4f88645f5b89ccbc066d320e177.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Suggested-by: Mark Rutland Cc: Andrey Ryabinin Cc: Christoph Lameter Cc: Dmitry Vyukov Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kasan.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/include/linux/kasan.h b/include/linux/kasan.h index b66fdf5ea7ab..ec22d548d0d7 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -14,13 +14,13 @@ struct task_struct; #include #include -extern unsigned char kasan_zero_page[PAGE_SIZE]; -extern pte_t kasan_zero_pte[PTRS_PER_PTE]; -extern pmd_t kasan_zero_pmd[PTRS_PER_PMD]; -extern pud_t kasan_zero_pud[PTRS_PER_PUD]; -extern p4d_t kasan_zero_p4d[MAX_PTRS_PER_P4D]; +extern unsigned char kasan_early_shadow_page[PAGE_SIZE]; +extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE]; +extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD]; +extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD]; +extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D]; -int kasan_populate_zero_shadow(const void *shadow_start, +int kasan_populate_early_shadow(const void *shadow_start, const void *shadow_end); static inline void *kasan_mem_to_shadow(const void *addr) -- cgit v1.2.3 From 080eb83f54cf5b96ae5b6ce3c1896e35c341aff9 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:09 -0800 Subject: kasan: initialize shadow to 0xff for tag-based mode A tag-based KASAN shadow memory cell contains a memory tag, that corresponds to the tag in the top byte of the pointer, that points to that memory. The native top byte value of kernel pointers is 0xff, so with tag-based KASAN we need to initialize shadow memory to 0xff. [cai@lca.pw: arm64: skip kmemleak for KASAN again\ Link: http://lkml.kernel.org/r/20181226020550.63712-1-cai@lca.pw Link: http://lkml.kernel.org/r/5cc1b789aad7c99cf4f3ec5b328b147ad53edb40.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Cc: Christoph Lameter Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kasan.h | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'include') diff --git a/include/linux/kasan.h b/include/linux/kasan.h index ec22d548d0d7..c56af24bd3e7 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -153,6 +153,8 @@ static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; } #ifdef CONFIG_KASAN_GENERIC +#define KASAN_SHADOW_INIT 0 + void kasan_cache_shrink(struct kmem_cache *cache); void kasan_cache_shutdown(struct kmem_cache *cache); @@ -163,4 +165,10 @@ static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} #endif /* CONFIG_KASAN_GENERIC */ +#ifdef CONFIG_KASAN_SW_TAGS + +#define KASAN_SHADOW_INIT 0xFF + +#endif /* CONFIG_KASAN_SW_TAGS */ + #endif /* LINUX_KASAN_H */ -- cgit v1.2.3 From 3c9e3aa11094e821aff4a8f6812a6e032293dbc0 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:16 -0800 Subject: kasan: add tag related helper functions This commit adds a few helper functions, that are meant to be used to work with tags embedded in the top byte of kernel pointers: to set, to get or to reset the top byte. 
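A minimal sketch of what such helpers can look like on a 64-bit
architecture with TBI; the names, the 56-bit shift and the 0xff native tag
below follow the description in this series but are written out here
purely for illustration, not taken from the kernel's implementation:

#include <linux/types.h>

#define EXAMPLE_TAG_SHIFT	56
#define EXAMPLE_TAG_MASK	(0xffUL << EXAMPLE_TAG_SHIFT)
#define EXAMPLE_TAG_KERNEL	0xff	/* native top byte of kernel pointers */

/* Embed a tag into the top byte of a pointer. */
static inline void *example_set_tag(const void *addr, u8 tag)
{
	return (void *)(((u64)addr & ~EXAMPLE_TAG_MASK) |
			((u64)tag << EXAMPLE_TAG_SHIFT));
}

/* Extract the tag from the top byte of a pointer. */
static inline u8 example_get_tag(const void *addr)
{
	return (u8)((u64)addr >> EXAMPLE_TAG_SHIFT);
}

/* Reset the top byte back to the native 0xff value. */
static inline void *example_reset_tag(const void *addr)
{
	return example_set_tag(addr, EXAMPLE_TAG_KERNEL);
}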
Link: http://lkml.kernel.org/r/f6c6437bb8e143bc44f42c3c259c62e734be7935.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Cc: Andrey Ryabinin Cc: Christoph Lameter Cc: Dmitry Vyukov Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kasan.h | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'include') diff --git a/include/linux/kasan.h b/include/linux/kasan.h index c56af24bd3e7..a477ce2abdc9 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -169,6 +169,19 @@ static inline void kasan_cache_shutdown(struct kmem_cache *cache) {} #define KASAN_SHADOW_INIT 0xFF +void kasan_init_tags(void); + +void *kasan_reset_tag(const void *addr); + +#else /* CONFIG_KASAN_SW_TAGS */ + +static inline void kasan_init_tags(void) { } + +static inline void *kasan_reset_tag(const void *addr) +{ + return (void *)addr; +} + #endif /* CONFIG_KASAN_SW_TAGS */ #endif /* LINUX_KASAN_H */ -- cgit v1.2.3 From 5b7c4148222d7acaa1612e5eec84fc66c88d54f3 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:46 -0800 Subject: mm: move obj_to_index to include/linux/slab_def.h While with SLUB we can actually preassign tags for caches with contructors and store them in pointers in the freelist, SLAB doesn't allow that since the freelist is stored as an array of indexes, so there are no pointers to store the tags. Instead we compute the tag twice, once when a slab is created before calling the constructor and then again each time when an object is allocated with kmalloc. Tag is computed simply by taking the lowest byte of the index that corresponds to the object. However in kasan_kmalloc we only have access to the objects pointer, so we need a way to find out which index this object corresponds to. This patch moves obj_to_index from slab.c to include/linux/slab_def.h to be reused by KASAN. Link: http://lkml.kernel.org/r/c02cd9e574cfd93858e43ac94b05e38f891fef64.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Acked-by: Christoph Lameter Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab_def.h | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'include') diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h index 3485c58cfd1c..9a5eafb7145b 100644 --- a/include/linux/slab_def.h +++ b/include/linux/slab_def.h @@ -104,4 +104,17 @@ static inline void *nearest_obj(struct kmem_cache *cache, struct page *page, return object; } +/* + * We want to avoid an expensive divide : (offset / cache->size) + * Using the fact that size is a constant for a particular cache, + * we can replace (offset / cache->size) by + * reciprocal_divide(offset, cache->reciprocal_buffer_size) + */ +static inline unsigned int obj_to_index(const struct kmem_cache *cache, + const struct page *page, void *obj) +{ + u32 offset = (obj - page->s_mem); + return reciprocal_divide(offset, cache->reciprocal_buffer_size); +} + #endif /* _LINUX_SLAB_DEF_H */ -- cgit v1.2.3 From 41eea9cd239c5b3fff726894f85c97f60e5799a3 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:54 -0800 Subject: kasan, arm64: add brk handler for inline instrumentation Tag-based KASAN inline instrumentation mode (which embeds checks of shadow memory into the generated code, instead of inserting a callback) generates a brk instruction when a tag mismatch is detected. 
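Roughly, the check that inline instrumentation embeds before each access
can be pictured as the C fragment below; the shadow offset value, the brk
immediate and the fact that this is a C function rather than generated
assembly are all simplifications for illustration:

#include <linux/types.h>

/* Placeholder shadow offset; the real value is architecture-specific. */
#define EXAMPLE_SHADOW_OFFSET	0xdfff000000000000UL

static inline void example_check_access(const void *addr)
{
	u8 ptr_tag = (u8)((u64)addr >> 56);
	u8 mem_tag = *(u8 *)(((u64)addr >> 4) + EXAMPLE_SHADOW_OFFSET);

	/* Pointers with the native 0xff tag are not checked. */
	if (ptr_tag != 0xff && ptr_tag != mem_tag)
		asm volatile("brk #0x900");	/* immediate value illustrative */
}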
This commit adds a tag-based KASAN specific brk handler, that decodes the immediate value passed to the brk instructions (to extract information about the memory access that triggered the mismatch), reads the register values (x0 contains the guilty address) and reports the bug. Link: http://lkml.kernel.org/r/c91fe7684070e34dc34b419e6b69498f4dcacc2d.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Acked-by: Will Deacon Cc: Christoph Lameter Cc: Mark Rutland Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kasan.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include') diff --git a/include/linux/kasan.h b/include/linux/kasan.h index a477ce2abdc9..8da7b7a4397a 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -173,6 +173,9 @@ void kasan_init_tags(void); void *kasan_reset_tag(const void *addr); +void kasan_report(unsigned long addr, size_t size, + bool is_write, unsigned long ip); + #else /* CONFIG_KASAN_SW_TAGS */ static inline void kasan_init_tags(void) { } -- cgit v1.2.3 From 2813b9c0296259fb11e75c839bab2d958ba4f96c Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:30:57 -0800 Subject: kasan, mm, arm64: tag non slab memory allocated via pagealloc Tag-based KASAN doesn't check memory accesses through pointers tagged with 0xff. When page_address is used to get pointer to memory that corresponds to some page, the tag of the resulting pointer gets set to 0xff, even though the allocated memory might have been tagged differently. For slab pages it's impossible to recover the correct tag to return from page_address, since the page might contain multiple slab objects tagged with different values, and we can't know in advance which one of them is going to get accessed. For non slab pages however, we can recover the tag in page_address, since the whole page was marked with the same tag. This patch adds tagging to non slab memory allocated with pagealloc. To set the tag of the pointer returned from page_address, the tag gets stored to page->flags when the memory gets allocated. Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Reviewed-by: Dmitry Vyukov Acked-by: Will Deacon Cc: Christoph Lameter Cc: Mark Rutland Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 29 +++++++++++++++++++++++++++++ include/linux/page-flags-layout.h | 10 ++++++++++ 2 files changed, 39 insertions(+) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 5411de93a363..b4d01969e700 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -804,6 +804,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH) #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH) #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH) +#define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH) /* * Define the bit shifts to access each section. 
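For illustration only (this is not the patch itself), recovering the
stored tag when computing a page's virtual address can be pictured as
follows, reusing the page_kasan_tag() accessor added in the hunk below and
the illustrative example_set_tag() helper sketched earlier:

#include <linux/mm.h>
#include <linux/page-flags.h>

static inline void *example_tagged_page_address(struct page *page)
{
	void *addr = page_address(page);

	/* Slab pages keep the native 0xff tag, as explained above. */
	if (PageSlab(page))
		return addr;

	return example_set_tag(addr, page_kasan_tag(page));
}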
For non-existent @@ -814,6 +815,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0)) #define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0)) #define LAST_CPUPID_PGSHIFT (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0)) +#define KASAN_TAG_PGSHIFT (KASAN_TAG_PGOFF * (KASAN_TAG_WIDTH != 0)) /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */ #ifdef NODE_NOT_IN_PAGE_FLAGS @@ -836,6 +838,7 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); #define NODES_MASK ((1UL << NODES_WIDTH) - 1) #define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1) #define LAST_CPUPID_MASK ((1UL << LAST_CPUPID_SHIFT) - 1) +#define KASAN_TAG_MASK ((1UL << KASAN_TAG_WIDTH) - 1) #define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1) static inline enum zone_type page_zonenum(const struct page *page) @@ -1101,6 +1104,32 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid) } #endif /* CONFIG_NUMA_BALANCING */ +#ifdef CONFIG_KASAN_SW_TAGS +static inline u8 page_kasan_tag(const struct page *page) +{ + return (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK; +} + +static inline void page_kasan_tag_set(struct page *page, u8 tag) +{ + page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT); + page->flags |= (tag & KASAN_TAG_MASK) << KASAN_TAG_PGSHIFT; +} + +static inline void page_kasan_tag_reset(struct page *page) +{ + page_kasan_tag_set(page, 0xff); +} +#else +static inline u8 page_kasan_tag(const struct page *page) +{ + return 0xff; +} + +static inline void page_kasan_tag_set(struct page *page, u8 tag) { } +static inline void page_kasan_tag_reset(struct page *page) { } +#endif + static inline struct zone *page_zone(const struct page *page) { return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]; diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h index 7ec86bf31ce4..1dda31825ec4 100644 --- a/include/linux/page-flags-layout.h +++ b/include/linux/page-flags-layout.h @@ -82,6 +82,16 @@ #define LAST_CPUPID_WIDTH 0 #endif +#ifdef CONFIG_KASAN_SW_TAGS +#define KASAN_TAG_WIDTH 8 +#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH+LAST_CPUPID_WIDTH+KASAN_TAG_WIDTH \ + > BITS_PER_LONG - NR_PAGEFLAGS +#error "KASAN: not enough bits in page flags for tag" +#endif +#else +#define KASAN_TAG_WIDTH 0 +#endif + /* * We are going to use the flags for the page to node mapping if its in * there. This includes the case where there is no node, so it is implicit. -- cgit v1.2.3 From 66afc7f1e07a1db74453be9167ac0d1205653854 Mon Sep 17 00:00:00 2001 From: Andrey Konovalov Date: Fri, 28 Dec 2018 00:31:01 -0800 Subject: kasan: add __must_check annotations to kasan hooks This patch adds __must_check annotations to kasan hooks that return a pointer to make sure that a tagged pointer always gets propagated. 
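As a small illustration (not part of the patch), with the annotation in
place a call site that discards the returned pointer is flagged at build
time, which is exactly the class of mistake the earlier patches had to fix
by hand:

#include <linux/kasan.h>
#include <linux/slab.h>

static void *example_ignores_hook_result(struct kmem_cache *s, void *obj,
					 size_t size, gfp_t flags)
{
	/*
	 * The compiler now warns that the return value of kasan_kmalloc()
	 * is ignored: the possibly retagged pointer is lost and 'obj'
	 * keeps its stale tag.
	 */
	kasan_kmalloc(s, obj, size, flags);
	return obj;
}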
Link: http://lkml.kernel.org/r/03b269c5e453945f724bfca3159d4e1333a8fb1c.1544099024.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov Suggested-by: Andrey Ryabinin Cc: Christoph Lameter Cc: Dmitry Vyukov Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kasan.h | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) (limited to 'include') diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 8da7b7a4397a..b40ea104dd36 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -49,16 +49,20 @@ void kasan_cache_create(struct kmem_cache *cache, unsigned int *size, void kasan_poison_slab(struct page *page); void kasan_unpoison_object_data(struct kmem_cache *cache, void *object); void kasan_poison_object_data(struct kmem_cache *cache, void *object); -void *kasan_init_slab_obj(struct kmem_cache *cache, const void *object); +void * __must_check kasan_init_slab_obj(struct kmem_cache *cache, + const void *object); -void *kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags); +void * __must_check kasan_kmalloc_large(const void *ptr, size_t size, + gfp_t flags); void kasan_kfree_large(void *ptr, unsigned long ip); void kasan_poison_kfree(void *ptr, unsigned long ip); -void *kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size, - gfp_t flags); -void *kasan_krealloc(const void *object, size_t new_size, gfp_t flags); +void * __must_check kasan_kmalloc(struct kmem_cache *s, const void *object, + size_t size, gfp_t flags); +void * __must_check kasan_krealloc(const void *object, size_t new_size, + gfp_t flags); -void *kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags); +void * __must_check kasan_slab_alloc(struct kmem_cache *s, void *object, + gfp_t flags); bool kasan_slab_free(struct kmem_cache *s, void *object, unsigned long ip); struct kasan_cache { -- cgit v1.2.3 From 4e45f712d82c6b7a37e02faf388173ad12ab464d Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Fri, 28 Dec 2018 00:33:17 -0800 Subject: include/linux/slab.h: fix sparse warning in kmalloc_type() Multiple people have reported the following sparse warning: ./include/linux/slab.h:332:43: warning: dubious: x & !y The minimal fix would be to change the logical & to boolean &&, which emits the same code, but Andrew has suggested that the branch-avoiding tricks are maybe not worthwile. David Laight provided a nice comparison of disassembly of multiple variants, which shows that the current version produces a 4 deep dependency chain, and fixing the sparse warning by changing logical and to multiplication emits an IMUL, making it even more expensive. The code as rewritten by this patch yielded the best disassembly, with a single predictable branch for the most common case, and a ternary operator for the rest, which gcc seems to compile without a branch or cmov by itself. The result should be more readable, without a sparse warning and probably also faster for the common case. Link: http://lkml.kernel.org/r/80340595-d7c5-97b9-4f6c-23fa893a91e9@suse.cz Fixes: 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable caches") Reviewed-by: Andrew Morton Signed-off-by: Vlastimil Babka Reported-by: Bart Van Assche Reported-by: Darryl T. 
Agostinelli Reported-by: Masahiro Yamada Suggested-by: Andrew Morton Suggested-by: David Laight Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) (limited to 'include') diff --git a/include/linux/slab.h b/include/linux/slab.h index 351ac48dabc4..6d9bd6fc0c57 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -314,22 +314,22 @@ kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]; static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags) { - int is_dma = 0; - int type_dma = 0; - int is_reclaimable; - #ifdef CONFIG_ZONE_DMA - is_dma = !!(flags & __GFP_DMA); - type_dma = is_dma * KMALLOC_DMA; -#endif - - is_reclaimable = !!(flags & __GFP_RECLAIMABLE); + /* + * The most common case is KMALLOC_NORMAL, so test for it + * with a single branch for both flags. + */ + if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0)) + return KMALLOC_NORMAL; /* - * If an allocation is both __GFP_DMA and __GFP_RECLAIMABLE, return - * KMALLOC_DMA and effectively ignore __GFP_RECLAIMABLE + * At least one of the flags has to be set. If both are, __GFP_DMA + * is more important. */ - return type_dma + (is_reclaimable & !is_dma) * KMALLOC_RECLAIM; + return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM; +#else + return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL; +#endif } /* -- cgit v1.2.3 From 6a90a83f1d1957647581ca48caa1f7cc4fa44f8d Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Fri, 28 Dec 2018 00:33:28 -0800 Subject: mm/mmu_notifier.c: remove mmu_notifier_synchronize() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Contrary to its name, mmu_notifier_synchronize() does not synchronize the notifier's SRCU instance, but rather waits for RCU callbacks to finish. i.e. it invokes rcu_barrier(). The RCU documentation is quite clear on this matter, explicitly calling out that rcu_barrier() does not imply synchronize_rcu(). As there are no callers of mmu_notifier_synchronize() and it's unclear whether any user of mmu_notifier_call_srcu() will ever want to barrier on their callbacks, simply remove the function. Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com Signed-off-by: Sean Christopherson Reviewed-by: Andrew Morton Cc: Peter Zijlstra Cc: Jérôme Glisse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmu_notifier.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 9893a6432adf..913c3c13e36e 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -420,7 +420,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) extern void mmu_notifier_call_srcu(struct rcu_head *rcu, void (*func)(struct rcu_head *rcu)); -extern void mmu_notifier_synchronize(void); #else /* CONFIG_MMU_NOTIFIER */ -- cgit v1.2.3 From 368686a95e55fd66b88542b5b23d802a4886b1aa Mon Sep 17 00:00:00 2001 From: Anders Roxell Date: Fri, 28 Dec 2018 00:33:31 -0800 Subject: writeback: don't decrement wb->refcnt if !wb->bdi This happened while running in qemu-system-aarch64, the AMBA PL011 UART driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE. arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init), devtmpfs' handle_remove() crashes because the reference count is a NULL pointer only because wb->bdi hasn't been initialized yet. 
Rework so that wb_put have an extra check if wb->bdi before decrement wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again in other drivers. Link: http://lkml.kernel.org/r/20181030113545.30999-2-anders.roxell@linaro.org Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Signed-off-by: Arnd Bergmann Signed-off-by: Anders Roxell Co-developed-by: Arnd Bergmann Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/backing-dev-defs.h | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'include') diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 9a6bc0951cfa..c31157135598 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -258,6 +258,14 @@ static inline void wb_get(struct bdi_writeback *wb) */ static inline void wb_put(struct bdi_writeback *wb) { + if (WARN_ON_ONCE(!wb->bdi)) { + /* + * A driver bug might cause a file to be removed before bdi was + * initialized. + */ + return; + } + if (wb != &wb->bdi->wb) percpu_ref_put(&wb->refcnt); } -- cgit v1.2.3 From d381c54760dcfad23743da40516e7e003d73952a Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 28 Dec 2018 00:33:56 -0800 Subject: mm: only report isolation failures when offlining memory Heiko has complained that his log is swamped by warnings from has_unmovable_pages [ 20.536664] page dumped because: has_unmovable_pages [ 20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0 [ 20.536794] flags: 0x3fffe0000010200(slab|head) [ 20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600 [ 20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000 [ 20.536797] page dumped because: has_unmovable_pages [ 20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0 [ 20.536815] flags: 0x7fffe0000000000() [ 20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000 [ 20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000 which are not triggered by the memory hotplug but rather CMA allocator. The original idea behind dumping the page state for all call paths was that these messages will be helpful debugging failures. From the above it seems that this is not the case for the CMA path because we are lacking much more context. E.g the second reported page might be a CMA allocated page. It is still interesting to see a slab page in the CMA area but it is hard to tell whether this is bug from the above output alone. Address this issue by dumping the page state only on request. Both start_isolate_page_range and has_unmovable_pages already have an argument to ignore hwpoison pages so make this argument more generic and turn it into flags and allow callers to combine non-default modes into a mask. While we are at it, has_unmovable_pages call from is_pageblock_removable_nolock (sysfs removable file) is questionable to report the failure so drop it from there as well. 
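A hypothetical call site combining the new flags (the function and flag
names come from the hunk that follows; the wrapper itself is only an
illustration):

#include <linux/mmzone.h>
#include <linux/page-isolation.h>

static int example_isolate_range(unsigned long start_pfn,
				 unsigned long end_pfn)
{
	/* Ignore hwpoisoned pages and ask for details if isolation fails. */
	return start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
					SKIP_HWPOISON | REPORT_FAILURE);
}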
Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org Signed-off-by: Michal Hocko Reported-by: Heiko Carstens Reviewed-by: Oscar Salvador Cc: Anshuman Khandual Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/page-isolation.h | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h index 4ae347cbc36d..4eb26d278046 100644 --- a/include/linux/page-isolation.h +++ b/include/linux/page-isolation.h @@ -30,8 +30,11 @@ static inline bool is_migrate_isolate(int migratetype) } #endif +#define SKIP_HWPOISON 0x1 +#define REPORT_FAILURE 0x2 + bool has_unmovable_pages(struct zone *zone, struct page *page, int count, - int migratetype, bool skip_hwpoisoned_pages); + int migratetype, int flags); void set_pageblock_migratetype(struct page *page, int migratetype); int move_freepages_block(struct zone *zone, struct page *page, int migratetype, int *num_movable); @@ -44,10 +47,14 @@ int move_freepages_block(struct zone *zone, struct page *page, * For isolating all pages in the range finally, the caller have to * free all pages in the range. test_page_isolated() can be used for * test it. + * + * The following flags are allowed (they can be combined in a bit mask) + * SKIP_HWPOISON - ignore hwpoison pages + * REPORT_FAILURE - report details about the failure to isolate the range */ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn, - unsigned migratetype, bool skip_hwpoisoned_pages); + unsigned migratetype, int flags); /* * Changes MIGRATE_ISOLATE to MIGRATE_MOVABLE. -- cgit v1.2.3 From 0b9df58b79fa283fbedc0fb6a8e248599444bacc Mon Sep 17 00:00:00 2001 From: Timofey Titovets Date: Fri, 28 Dec 2018 00:34:00 -0800 Subject: xxHash: create arch dependent 32/64-bit xxhash() Patch series "Currently used jhash are slow enough and replace it allow as to make KSM", v8. Apeed (in kernel): ksm: crc32c hash() 12081 MB/s ksm: xxh64 hash() 8770 MB/s ksm: xxh32 hash() 4529 MB/s ksm: jhash2 hash() 1569 MB/s Sioh Lee's testing (copy from other mail): Test platform: openstack cloud platform (NEWTON version) Experiment node: openstack based cloud compute node (CPU: xeon E5-2620 v3, memory 64gb) VM: (2 VCPU, RAM 4GB, DISK 20GB) * 4 Linux kernel: 4.14 (latest version) KSM setup - sleep_millisecs: 200ms, pages_to_scan: 200 Experiment process: Firstly, we turn off KSM and launch 4 VMs. Then we turn on the KSM and measure the checksum computation time until full_scans become two. The experimental results (the experimental value is the average of the measured values) crc32c_intel: 1084.10ns crc32c (no hardware acceleration): 7012.51ns xxhash32: 2227.75ns xxhash64: 1413.16ns jhash2: 5128.30ns In summary, the result shows that crc32c_intel has advantages over all of the hash function used in the experiment. (decreased by 84.54% compared to crc32c, 78.86% compared to jhash2, 51.33% xxhash32, 23.28% compared to xxhash64) the results are similar to those of Timofey. But, use only xxhash for now, because for using crc32c, cryptoapi must be initialized first - that require some tricky solution to work good in all situations. So: - First patch implement compile time pickup of fastest implementation of xxhash for target platform. - The second patch replaces jhash2 with xxhash This patch (of 2): xxh32() - fast on both 32/64-bit platforms xxh64() - fast only on 64-bit platform Create xxhash() which will pick up the fastest version at compile time. 
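An illustrative caller (the helper below is made up) that only needs a
fast word-sized hash and does not care about the value being stable across
machines with different word sizes:

#include <linux/types.h>
#include <linux/xxhash.h>

static unsigned long example_hash_buffer(const void *data, size_t len)
{
	/* Seed 0 is fine here; the seed only perturbs the result predictably. */
	return xxhash(data, len, 0);
}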
Link: http://lkml.kernel.org/r/20181023182554.23464-2-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets Reviewed-by: Pavel Tatashin Reviewed-by: Mike Rapoport Reviewed-by: Andrew Morton Cc: Andrea Arcangeli Cc: leesioh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/xxhash.h | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) (limited to 'include') diff --git a/include/linux/xxhash.h b/include/linux/xxhash.h index 9e1f42cb57e9..52b073fea17f 100644 --- a/include/linux/xxhash.h +++ b/include/linux/xxhash.h @@ -107,6 +107,29 @@ uint32_t xxh32(const void *input, size_t length, uint32_t seed); */ uint64_t xxh64(const void *input, size_t length, uint64_t seed); +/** + * xxhash() - calculate wordsize hash of the input with a given seed + * @input: The data to hash. + * @length: The length of the data to hash. + * @seed: The seed can be used to alter the result predictably. + * + * If the hash does not need to be comparable between machines with + * different word sizes, this function will call whichever of xxh32() + * or xxh64() is faster. + * + * Return: wordsize hash of the data. + */ + +static inline unsigned long xxhash(const void *input, size_t length, + uint64_t seed) +{ +#if BITS_PER_LONG == 64 + return xxh64(input, length, seed); +#else + return xxh32(input, length, seed); +#endif +} + /*-**************************** * Streaming Hash Functions *****************************/ -- cgit v1.2.3 From 9705bea5f833f4fc21d5bef5fce7348427f76ea4 Mon Sep 17 00:00:00 2001 From: Arun KS Date: Fri, 28 Dec 2018 00:34:24 -0800 Subject: mm: convert zone->managed_pages to atomic variable totalram_pages, zone->managed_pages and totalhigh_pages updates are protected by managed_page_count_lock, but readers never care about it. Convert these variables to atomic to avoid readers potentially seeing a store tear. This patch converts zone->managed_pages. Subsequent patches will convert totalram_panges, totalhigh_pages and eventually managed_page_count_lock will be removed. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS Suggested-by: Michal Hocko Suggested-by: Vlastimil Babka Reviewed-by: Konstantin Khlebnikov Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Vlastimil Babka Reviewed-by: Pavel Tatashin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 077d797d1f60..a23e34e21178 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -435,7 +435,7 @@ struct zone { * adjust_managed_page_count() should be used instead of directly * touching zone->managed_pages and totalram_pages. 
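An illustrative reader after the conversion (the loop below is not part of
the patch): all accesses go through the zone_managed_pages() helper added
in the hunk that follows, so a torn store can never be observed:

#include <linux/mmzone.h>

static unsigned long example_total_managed_pages(void)
{
	struct zone *zone;
	unsigned long total = 0;

	for_each_populated_zone(zone)
		total += zone_managed_pages(zone);

	return total;
}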
*/ - unsigned long managed_pages; + atomic_long_t managed_pages; unsigned long spanned_pages; unsigned long present_pages; @@ -524,6 +524,11 @@ enum pgdat_flags { PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */ }; +static inline unsigned long zone_managed_pages(struct zone *zone) +{ + return (unsigned long)atomic_long_read(&zone->managed_pages); +} + static inline unsigned long zone_end_pfn(const struct zone *zone) { return zone->zone_start_pfn + zone->spanned_pages; @@ -820,7 +825,7 @@ static inline bool is_dev_zone(const struct zone *zone) */ static inline bool managed_zone(struct zone *zone) { - return zone->managed_pages; + return zone_managed_pages(zone); } /* Returns true if a zone has memory */ -- cgit v1.2.3 From ca79b0c211af63fa3276f0e3fd7dd9ada2439839 Mon Sep 17 00:00:00 2001 From: Arun KS Date: Fri, 28 Dec 2018 00:34:29 -0800 Subject: mm: convert totalram_pages and totalhigh_pages variables to atomic totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS Suggested-by: Michal Hocko Suggested-by: Vlastimil Babka Reviewed-by: Konstantin Khlebnikov Reviewed-by: Pavel Tatashin Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/highmem.h | 28 ++++++++++++++++++++++++++-- include/linux/mm.h | 27 ++++++++++++++++++++++++++- include/linux/swap.h | 1 - 3 files changed, 52 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 0690679832d4..ea5cdbd8c2c3 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -36,7 +36,31 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); -extern unsigned long totalhigh_pages; +extern atomic_long_t _totalhigh_pages; +static inline unsigned long totalhigh_pages(void) +{ + return (unsigned long)atomic_long_read(&_totalhigh_pages); +} + +static inline void totalhigh_pages_inc(void) +{ + atomic_long_inc(&_totalhigh_pages); +} + +static inline void totalhigh_pages_dec(void) +{ + atomic_long_dec(&_totalhigh_pages); +} + +static inline void totalhigh_pages_add(long count) +{ + atomic_long_add(count, &_totalhigh_pages); +} + +static inline void totalhigh_pages_set(long val) +{ + atomic_long_set(&_totalhigh_pages, val); +} void kmap_flush_unused(void); @@ -51,7 +75,7 @@ static inline struct page *kmap_to_page(void *addr) return virt_to_page(addr); } -#define totalhigh_pages 0UL +static inline unsigned long totalhigh_pages(void) { return 0UL; } #ifndef ARCH_HAS_KMAP static inline void *kmap(struct page *page) diff --git a/include/linux/mm.h b/include/linux/mm.h index b4d01969e700..1d2be4c2d34a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -48,7 +48,32 @@ static inline void set_max_mapnr(unsigned long limit) static inline void set_max_mapnr(unsigned long limit) { } #endif -extern unsigned long totalram_pages; +extern atomic_long_t _totalram_pages; +static inline unsigned long totalram_pages(void) +{ + 
return (unsigned long)atomic_long_read(&_totalram_pages); +} + +static inline void totalram_pages_inc(void) +{ + atomic_long_inc(&_totalram_pages); +} + +static inline void totalram_pages_dec(void) +{ + atomic_long_dec(&_totalram_pages); +} + +static inline void totalram_pages_add(long count) +{ + atomic_long_add(count, &_totalram_pages); +} + +static inline void totalram_pages_set(long val) +{ + atomic_long_set(&_totalram_pages, val); +} + extern void * high_memory; extern int page_cluster; diff --git a/include/linux/swap.h b/include/linux/swap.h index a8f6d5d89524..77459d695010 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -310,7 +310,6 @@ void workingset_update_node(struct xa_node *node); } while (0) /* linux/mm/page_alloc.c */ -extern unsigned long totalram_pages; extern unsigned long totalreserve_pages; extern unsigned long nr_free_buffer_pages(void); extern unsigned long nr_free_pagecache_pages(void); -- cgit v1.2.3 From 476567e8735a0d06225f3873a86dfa0efd95f3a5 Mon Sep 17 00:00:00 2001 From: Arun KS Date: Fri, 28 Dec 2018 00:34:32 -0800 Subject: mm: remove managed_page_count_lock spinlock Now that totalram_pages and managed_pages are atomic varibles, no need of managed_page_count spinlock. The lock had really a weak consistency guarantee. It hasn't been used for anything but the update but no reader actually cares about all the values being updated to be in sync. Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS Reviewed-by: Konstantin Khlebnikov Acked-by: Michal Hocko Acked-by: Vlastimil Babka Cc: David Hildenbrand Reviewed-by: Pavel Tatashin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a23e34e21178..bc0990c1f1c3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -428,12 +428,6 @@ struct zone { * Write access to present_pages at runtime should be protected by * mem_hotplug_begin/end(). Any reader who can't tolerant drift of * present_pages should get_online_mems() to get a stable value. - * - * Read access to managed_pages should be safe because it's unsigned - * long. Write access to zone->managed_pages and totalram_pages are - * protected by managed_page_count_lock at runtime. Idealy only - * adjust_managed_page_count() should be used instead of directly - * touching zone->managed_pages and totalram_pages. */ atomic_long_t managed_pages; unsigned long spanned_pages; -- cgit v1.2.3 From 8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:34:36 -0800 Subject: vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n Commit fa5e084e43eb ("vmscan: do not unconditionally treat zones that fail zone_reclaim() as full") changed the return value of node_reclaim(). The original return value 0 means NODE_RECLAIM_SOME after this commit. While the return value of node_reclaim() when CONFIG_NUMA is n is not changed. This will leads to call zone_watermark_ok() again. This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN. Since node_reclaim() is only called in page_alloc.c, move it to mm/internal.h. 
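A sketch of the corrected !CONFIG_NUMA behaviour described above, assuming
the NODE_RECLAIM_* constants from mm/internal.h (the actual stub lives
there and is not part of this hunk):

#include <linux/gfp.h>
#include <linux/mmzone.h>

#ifndef CONFIG_NUMA
static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
			       unsigned int order)
{
	/* Nothing was scanned, so the caller must not re-check watermarks. */
	return NODE_RECLAIM_NOSCAN;
}
#endif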
Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Acked-by: Michal Hocko Reviewed-by: Matthew Wilcox Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index 77459d695010..f9e576a2c188 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -359,14 +359,8 @@ extern unsigned long vm_total_pages; extern int node_reclaim_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; -extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int); #else #define node_reclaim_mode 0 -static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask, - unsigned int order) -{ - return 0; -} #endif extern int page_evictable(struct page *page); -- cgit v1.2.3 From 66f71da9dd38af17dc17209cdde7987d4679a699 Mon Sep 17 00:00:00 2001 From: Aaron Lu Date: Fri, 28 Dec 2018 00:34:39 -0800 Subject: mm/swap: use nr_node_ids for avail_lists in swap_info_struct Since a2468cc9bfdf ("swap: choose swap device according to numa node"), avail_lists field of swap_info_struct is changed to an array with MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB and needs an order-4 page to hold it. This is not optimal in that: 1 Most systems have way less than MAX_NUMNODES(1024) nodes so it is a waste of memory; 2 It could cause swapon failure if the swap device is swapped on after system has been running for a while, due to no order-4 page is available as pointed out by Vasily Averin. Solve the above two issues by using nr_node_ids(which is the actual possible node number the running system has) for avail_lists instead of MAX_NUMNODES. nr_node_ids is unknown at compile time so can't be directly used when declaring this array. What I did here is to declare avail_lists as zero element array and allocate space for it when allocating space for swap_info_struct. The reason why keep using array but not pointer is plist_for_each_entry needs the field to be part of the struct, so pointer will not work. This patch is on top of Vasily Averin's fix commit. I think the use of kvzalloc for swap_info_struct is still needed in case nr_node_ids is really big on some systems. Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com Signed-off-by: Aaron Lu Reviewed-by: Andrew Morton Acked-by: Michal Hocko Cc: Vasily Averin Cc: Huang Ying Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/swap.h b/include/linux/swap.h index f9e576a2c188..622025ac1461 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -235,7 +235,6 @@ struct swap_info_struct { unsigned long flags; /* SWP_USED etc: see above */ signed short prio; /* swap priority of this type */ struct plist_node list; /* entry in swap_active_head */ - struct plist_node avail_lists[MAX_NUMNODES];/* entry in swap_avail_heads */ signed char type; /* strange name for an index */ unsigned int max; /* extent of the swap_map */ unsigned char *swap_map; /* vmalloc'ed array of usage counts */ @@ -276,6 +275,16 @@ struct swap_info_struct { */ struct work_struct discard_work; /* discard worker */ struct swap_cluster_list discard_clusters; /* discard clusters list */ + struct plist_node avail_lists[0]; /* + * entries in swap_avail_heads, one + * entry per node. 
+ * Must be last as the number of the + * array is nr_node_ids, which is not + * a fixed value so have to allocate + * dynamically. + * And it has to be an array so that + * plist_for_each_* can work. + */ }; #ifdef CONFIG_64BIT -- cgit v1.2.3 From a95c90f1e2c253b280385ecf3d4ebfe476926b28 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:34:57 -0800 Subject: mm, devm_memremap_pages: fix shutdown handling MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The last step before devm_memremap_pages() returns success is to allocate a release action, devm_memremap_pages_release(), to tear the entire setup down. However, the result from devm_add_action() is not checked. Checking the error from devm_add_action() is not enough. The api currently relies on the fact that the percpu_ref it is using is killed by the time the devm_memremap_pages_release() is run. Rather than continue this awkward situation, offload the responsibility of killing the percpu_ref to devm_memremap_pages_release() directly. This allows devm_memremap_pages() to do the right thing relative to init failures and shutdown. Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny as small memory allocations almost always succeed. However, the impact of the failure is large given any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever as subsequent calls to devm_memremap_pages() will fail to setup the pgmap_radix since there will be stale entries for the physical address range. An argument could be made to require that the ->kill() operation be set in the @pgmap arg rather than passed in separately. However, it helps code readability, tracking the lifetime of a given instance, to be able to grep the kill routine directly at the devm_memremap_pages() call site. 
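As a rough, driver-side illustration of the new field: the helper names below are hypothetical, and the devm_memremap_pages(dev, pgmap) calling convention is assumed from this series rather than shown in the hunk that follows.

#include <linux/device.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>

/* Hypothetical teardown hook: transition the percpu_ref that pins the
 * mapping to the dead state when the release action (or init failure
 * path) invokes it. */
static void my_pgmap_kill(struct percpu_ref *ref)
{
	percpu_ref_kill(ref);
}

static void *my_map_device_pages(struct device *dev, struct dev_pagemap *pgmap,
				 struct percpu_ref *ref)
{
	pgmap->ref = ref;
	pgmap->kill = my_pgmap_kill;	/* the member added by this patch */
	return devm_memremap_pages(dev, pgmap);
}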
Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...") Reviewed-by: "Jérôme Glisse" Reported-by: Logan Gunthorpe Reviewed-by: Logan Gunthorpe Reviewed-by: Christoph Hellwig Cc: Balbir Singh Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memremap.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 0ac69ddf5fc4..55db66b3716f 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -111,6 +111,7 @@ typedef void (*dev_page_free_t)(struct page *page, void *data); * @altmap: pre-allocated/reserved memory for vmemmap allocations * @res: physical address range covered by @ref * @ref: reference count that pins the devm_memremap_pages() mapping + * @kill: callback to transition @ref to the dead state * @dev: host device of the mapping for debug * @data: private data pointer for page_free() * @type: memory type: see MEMORY_* in memory_hotplug.h @@ -122,6 +123,7 @@ struct dev_pagemap { bool altmap_valid; struct resource res; struct percpu_ref *ref; + void (*kill)(struct percpu_ref *ref); struct device *dev; void *data; enum memory_type type; -- cgit v1.2.3 From 58ef15b765af0d2cbe6799ec564f1dc485010ab8 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:35:07 -0800 Subject: mm, hmm: use devm semantics for hmm_devmem_{add, remove} MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit devm semantics arrange for resources to be torn down when device-driver-probe fails or when device-driver-release completes. Similar to devm_memremap_pages() there is no need to support an explicit remove operation when the users properly adhere to devm semantics. Note that devm_kzalloc() automatically handles allocating node-local memory. Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Reviewed-by: Christoph Hellwig Reviewed-by: Jérôme Glisse Cc: "Jérôme Glisse" Cc: Logan Gunthorpe Cc: Balbir Singh Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/hmm.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'include') diff --git a/include/linux/hmm.h b/include/linux/hmm.h index c6fb869a81c0..ed89fbc525d2 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -512,8 +512,7 @@ struct hmm_devmem { * enough and allocate struct page for it. * * The device driver can wrap the hmm_devmem struct inside a private device - * driver struct. The device driver must call hmm_devmem_remove() before the - * device goes away and before freeing the hmm_devmem struct memory. + * driver struct. 
*/ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, struct device *device, @@ -521,7 +520,6 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, struct device *device, struct resource *res); -void hmm_devmem_remove(struct hmm_devmem *devmem); /* * hmm_devmem_page_set_drvdata - set per-page driver data field -- cgit v1.2.3 From 4d72868c8f7c293fc8408a54db4e0a12dc031152 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Fri, 28 Dec 2018 00:35:29 -0800 Subject: memblock: replace usage of __memblock_free_early() with memblock_free() __memblock_free_early() is only used by the convenience wrappers, so essentially we wrap a call to memblock_free() twice. Replace calls of __memblock_free_early() with calls to memblock_free() and drop the former. Link: http://lkml.kernel.org/r/20181125102940.GE28634@rapoport-lnx Signed-off-by: Mike Rapoport Reviewed-by: Andrew Morton Cc: Wentao Wang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memblock.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'include') diff --git a/include/linux/memblock.h b/include/linux/memblock.h index aee299a6aa76..5f74ba623dbd 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -154,7 +154,6 @@ void __next_mem_range_rev(u64 *idx, int nid, enum memblock_flags flags, void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start, phys_addr_t *out_end); -void __memblock_free_early(phys_addr_t base, phys_addr_t size); void __memblock_free_late(phys_addr_t base, phys_addr_t size); /** @@ -414,13 +413,13 @@ static inline void * __init memblock_alloc_node_nopanic(phys_addr_t size, static inline void __init memblock_free_early(phys_addr_t base, phys_addr_t size) { - __memblock_free_early(base, size); + memblock_free(base, size); } static inline void __init memblock_free_early_nid(phys_addr_t base, phys_addr_t size, int nid) { - __memblock_free_early(base, size); + memblock_free(base, size); } static inline void __init memblock_free_late(phys_addr_t base, phys_addr_t size) -- cgit v1.2.3 From f29d8e9c0191a2a02500945db505e5c89159c3f4 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 28 Dec 2018 00:35:36 -0800 Subject: mm/memory_hotplug: drop "online" parameter from add_memory_resource() Userspace should always be in charge of how to online memory and if memory should be onlined automatically in the kernel. Let's drop the parameter to overwrite this - XEN passes memhp_auto_online, just like add_memory(), so we can directly use that instead internally. 
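A small, hypothetical caller sketch of the simplified interface (my_hotplug_add is made up; the prototype matches the hunk shown below):

#include <linux/ioport.h>
#include <linux/memory_hotplug.h>

/* Whether the new memory is onlined is no longer chosen per call site;
 * add_memory_resource() follows the global memhp_auto_online policy. */
static int my_hotplug_add(int nid, struct resource *res)
{
	return add_memory_resource(nid, res);	/* "online" argument dropped */
}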
Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Michal Hocko Reviewed-by: Oscar Salvador Acked-by: Juergen Gross Cc: Boris Ostrovsky Cc: Stefano Stabellini Cc: Dan Williams Cc: Pavel Tatashin Cc: David Hildenbrand Cc: Joonsoo Kim Cc: Arun KS Cc: Mathieu Malaterre Cc: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memory_hotplug.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index ffd9cd10fcf3..7383a7a76d69 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -326,7 +326,7 @@ extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn, void *arg, int (*func)(struct memory_block *, void *)); extern int __add_memory(int nid, u64 start, u64 size); extern int add_memory(int nid, u64 start, u64 size); -extern int add_memory_resource(int nid, struct resource *resource, bool online); +extern int add_memory_resource(int nid, struct resource *resource); extern int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, bool want_memblock); extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, -- cgit v1.2.3 From a921444382b49cc7fdeca3fba3e278bc09484a27 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 28 Dec 2018 00:35:44 -0800 Subject: mm: move zone watermark accesses behind an accessor This is a preparation patch only, no functional change. Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: David Rientjes Cc: Michal Hocko Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index bc0990c1f1c3..dcf1b66a96ab 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -269,9 +269,10 @@ enum zone_watermarks { NR_WMARK }; -#define min_wmark_pages(z) (z->watermark[WMARK_MIN]) -#define low_wmark_pages(z) (z->watermark[WMARK_LOW]) -#define high_wmark_pages(z) (z->watermark[WMARK_HIGH]) +#define min_wmark_pages(z) (z->_watermark[WMARK_MIN]) +#define low_wmark_pages(z) (z->_watermark[WMARK_LOW]) +#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH]) +#define wmark_pages(z, i) (z->_watermark[i]) struct per_cpu_pages { int count; /* number of pages in the list */ @@ -362,7 +363,7 @@ struct zone { /* Read-mostly fields */ /* zone watermarks, access with *_wmark_pages(zone) macros */ - unsigned long watermark[NR_WMARK]; + unsigned long _watermark[NR_WMARK]; unsigned long nr_reserved_highatomic; -- cgit v1.2.3 From 1c30844d2dfe272d58c8fc000960b835d13aa2ac Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 28 Dec 2018 00:35:52 -0800 Subject: mm: reclaim small amounts of memory when an external fragmentation event occurs An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. 
The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 
2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: David Rientjes Cc: Michal Hocko Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 1 + include/linux/mmzone.h | 11 +++++++---- 2 files changed, 8 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 1d2be4c2d34a..031b2ce983f9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2256,6 +2256,7 @@ extern void zone_pcp_reset(struct zone *zone); /* page_alloc.c */ extern int min_free_kbytes; +extern int watermark_boost_factor; extern int watermark_scale_factor; /* nommu.c */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index dcf1b66a96ab..5b4bfb90fb94 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -269,10 +269,10 @@ enum zone_watermarks { NR_WMARK }; -#define min_wmark_pages(z) (z->_watermark[WMARK_MIN]) -#define low_wmark_pages(z) (z->_watermark[WMARK_LOW]) -#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH]) -#define wmark_pages(z, i) (z->_watermark[i]) +#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) +#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost) +#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost) +#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost) struct per_cpu_pages { int count; /* number of pages in the list */ @@ -364,6 +364,7 @@ struct zone { /* zone watermarks, access with *_wmark_pages(zone) macros */ unsigned long _watermark[NR_WMARK]; + unsigned long watermark_boost; unsigned long nr_reserved_highatomic; @@ -890,6 +891,8 @@ static inline int is_highmem(struct zone *zone) struct ctl_table; int min_free_kbytes_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int 
watermark_boost_factor_sysctl_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; -- cgit v1.2.3 From c999fbd3dcc6535b1e298b016665ec23ac2b0a9a Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Fri, 28 Dec 2018 00:35:55 -0800 Subject: mm/mmzone.c: make "migratetype_names" const char * Those strings are immutable in fact. Link: http://lkml.kernel.org/r/20181124090327.GA10877@avx2 Signed-off-by: Alexey Dobriyan Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5b4bfb90fb94..e0c3bc2edbbd 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -65,7 +65,7 @@ enum migratetype { }; /* In mm/page_alloc.c; keep in sync also with show_migration_types() there */ -extern char * const migratetype_names[MIGRATE_TYPES]; +extern const char * const migratetype_names[MIGRATE_TYPES]; #ifdef CONFIG_CMA # define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA) -- cgit v1.2.3 From 9a2f45ff320287d49a3cd90ce68cb58a6da6f5e1 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Fri, 28 Dec 2018 00:35:59 -0800 Subject: mm/debug.c: make "migrate_reason_names[]" const char * Those strings are immutable as well. Link: http://lkml.kernel.org/r/20181124090508.GB10877@avx2 Signed-off-by: Alexey Dobriyan Reviewed-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/migrate.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/migrate.h b/include/linux/migrate.h index f2b4abbca55e..617615fa11ce 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -29,7 +29,7 @@ enum migrate_reason { }; /* In mm/debug.c; also keep sync with include/trace/events/migrate.h */ -extern char *migrate_reason_names[MR_TYPES]; +extern const char *migrate_reason_names[MR_TYPES]; static inline struct page *new_page_nodemask(struct page *page, int preferred_nid, nodemask_t *nodemask) -- cgit v1.2.3 From e5cb113f2dbc8125f31005faebab161a2a84ebe6 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Fri, 28 Dec 2018 00:36:03 -0800 Subject: mm: make free_reserved_area() return "const char *" and propagate through down the call stack. Link: http://lkml.kernel.org/r/20181124091411.GC10969@avx2 Signed-off-by: Alexey Dobriyan Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 031b2ce983f9..9963f77f1101 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2108,7 +2108,7 @@ extern void free_initmem(void); * Return pages freed into the buddy system. */ extern unsigned long free_reserved_area(void *start, void *end, - int poison, char *s); + int poison, const char *s); #ifdef CONFIG_HIGHMEM /* -- cgit v1.2.3 From ef8444ea01d7442652f8e1b8a8b94278cb57eafd Mon Sep 17 00:00:00 2001 From: yuzhoujian Date: Fri, 28 Dec 2018 00:36:07 -0800 Subject: mm, oom: reorganize the oom report in dump_header OOM report contains several sections. The first one is the allocation context that has triggered the OOM. 
Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian Acked-by: Michal Hocko Cc: Andrea Arcangeli Cc: David Rientjes Cc: "Kirill A . 
Shutemov" Cc: Roman Gushchin Cc: Tetsuo Handa Cc: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/oom.h | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'include') diff --git a/include/linux/oom.h b/include/linux/oom.h index 69864a547663..d07992009265 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -15,6 +15,13 @@ struct notifier_block; struct mem_cgroup; struct task_struct; +enum oom_constraint { + CONSTRAINT_NONE, + CONSTRAINT_CPUSET, + CONSTRAINT_MEMORY_POLICY, + CONSTRAINT_MEMCG, +}; + /* * Details of the page allocation that triggered the oom killer that are used to * determine what should be killed. @@ -42,6 +49,9 @@ struct oom_control { unsigned long totalpages; struct task_struct *chosen; unsigned long chosen_points; + + /* Used to print the constraint info. */ + enum oom_constraint constraint; }; extern struct mutex oom_lock; -- cgit v1.2.3 From f0c867d9588d9efc10d6a55009c9560336673369 Mon Sep 17 00:00:00 2001 From: yuzhoujian Date: Fri, 28 Dec 2018 00:36:10 -0800 Subject: mm, oom: add oom victim's memcg to the oom context information The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,global_oom,task_memcg=,task=,pid=,uid= - memcg oom context information: oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,oom_memcg=,task_memcg=,task=,pid=,uid= [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian Signed-off-by: Tetsuo Handa Acked-by: Michal Hocko Cc: David Rientjes Cc: "Kirill A . 
Shutemov" Cc: Andrea Arcangeli Cc: Tetsuo Handa Cc: Roman Gushchin Cc: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 7ab2120155a4..83ae11cbd12c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -526,9 +526,11 @@ void mem_cgroup_handle_over_high(void); unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg); -void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, +void mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p); +void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg); + static inline void mem_cgroup_enter_user_fault(void) { WARN_ON(current->in_user_fault); @@ -970,7 +972,12 @@ static inline unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg) } static inline void -mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) +mem_cgroup_print_oom_context(struct mem_cgroup *memcg, struct task_struct *p) +{ +} + +static inline void +mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) { } -- cgit v1.2.3 From 9a1ea439b16b92002e0a6fceebc5d1794906e297 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Fri, 28 Dec 2018 00:36:14 -0800 Subject: mm: put_and_wait_on_page_locked() while page is migrated Waiting on a page migration entry has used wait_on_page_locked() all along since 2006: but you cannot safely wait_on_page_locked() without holding a reference to the page, and that extra reference is enough to make migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults on the entry before migrate_page_move_mapping() gets there. And that failure is retried nine times, amplifying the pain when trying to migrate a popular page. With a single persistent faulter, migration sometimes succeeds; with two or three concurrent faulters, success becomes much less likely (and the more the page was mapped, the worse the overhead of unmapping and remapping it on each try). This is especially a problem for memory offlining, where the outer level retries forever (or until terminated from userspace), because a heavy refault workload can trigger an endless loop of migration failures. wait_on_page_locked() is the wrong tool for the job. David Herrmann (but was he the first?) noticed this issue in 2014: https://marc.info/?l=linux-mm&m=140110465608116&w=2 Tim Chen started a thread in August 2017 which appears relevant: https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went on to implicate __migration_entry_wait(): https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in wake_up_page_bit") Baoquan He reported "Memory hotplug softlock issue" 14 November 2018: https://marc.info/?l=linux-mm&m=154217936431300&w=2 We have all assumed that it is essential to hold a page reference while waiting on a page lock: partly to guarantee that there is still a struct page when MEMORY_HOTREMOVE is configured, but also to protect against reuse of the struct page going to someone who then holds the page locked indefinitely, when the waiter can reasonably expect timely unlocking. 
But in fact, so long as wait_on_page_bit_common() does the put_page(), and is careful not to rely on struct page contents thereafter, there is no need to hold a reference to the page while waiting on it. That does mean that this case cannot go back through the loop: but that's fine for the page migration case, and even if used more widely, is limited by the "Stop walking if it's locked" optimization in wake_page_function(). Add interface put_and_wait_on_page_locked() to do this, using "behavior" enum in place of "lock" arg to wait_on_page_bit_common() to implement it. No interruptible or killable variant needed yet, but they might follow: I have a vague notion that reporting -EINTR should take precedence over return from wait_on_page_bit_common() without knowing the page state, so arrange it accordingly - but that may be nothing but pedantic. __migration_entry_wait() still has to take a brief reference to the page, prior to calling put_and_wait_on_page_locked(): but now that it is dropped before waiting, the chance of impeding page migration is very much reduced. Should we perhaps disable preemption across this? shrink_page_list()'s __ClearPageLocked(): that was a surprise! This survived a lot of testing before that showed up. PageWaiters may have been set by wait_on_page_bit_common(), and the reference dropped, just before shrink_page_list() succeeds in freezing its last page reference: in such a case, unlock_page() must be used. Follow the suggestion from Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now: that optimization predates PageWaiters, and won't buy much these days; but we can reinstate it for the !PageWaiters case if anyone notices. It does raise the question: should vmscan.c's is_page_cache_freeable() and __remove_mapping() now treat a PageWaiters page as if an extra reference were held? Perhaps, but I don't think it matters much, since shrink_page_list() already had to win its trylock_page(), so waiters are not very common there: I noticed no difference when trying the bigger change, and it's surely not needed while put_and_wait_on_page_locked() is only used for page migration. 
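A simplified sketch of the calling pattern described above; the wrapper name is made up, and the real change lives inside __migration_entry_wait().

#include <linux/mm.h>
#include <linux/pagemap.h>

/* Take only a brief reference, then let put_and_wait_on_page_locked()
 * drop it before sleeping, so the waiter no longer impedes migration. */
static void wait_for_migrating_page(struct page *page)
{
	if (!get_page_unless_zero(page))
		return;			/* page already gone */
	put_and_wait_on_page_locked(page);
}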
[willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils Signed-off-by: Hugh Dickins Reported-by: Baoquan He Tested-by: Baoquan He Reviewed-by: Andrea Arcangeli Acked-by: Michal Hocko Acked-by: Linus Torvalds Acked-by: Vlastimil Babka Cc: Matthew Wilcox Cc: Baoquan He Cc: David Hildenbrand Cc: Mel Gorman Cc: David Herrmann Cc: Tim Chen Cc: Kan Liang Cc: Andi Kleen Cc: Davidlohr Bueso Cc: Peter Zijlstra Cc: Christoph Lameter Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pagemap.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 226f96f0dee0..e2d7039af6a3 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -537,6 +537,8 @@ static inline int wait_on_page_locked_killable(struct page *page) return wait_on_page_bit_killable(compound_head(page), PG_locked); } +extern void put_and_wait_on_page_locked(struct page *page); + /* * Wait for a page to complete writeback */ -- cgit v1.2.3 From 23b68cfaae0ea40a9509fad37b756a6916dec54e Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:36:18 -0800 Subject: mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init() When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of each node's highest zone is initialized before defer stage. static_init_pgcnt is used to store the number of pages like this: pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION, pgdat->node_spanned_pages); because we don't want to overflow zone's range. But this is not necessary, since defer_init() is called like this: memmap_init_zone() for pfn in [start_pfn, end_pfn) defer_init(pfn, end_pfn) In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would stop before calling defer_init(). BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct, since nr_initialised is zone based instead of node based. Even node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone would have pages less than PAGES_PER_SECTION. Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: Alexander Duyck Cc: Pavel Tatashin Cc: Oscar Salvador Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e0c3bc2edbbd..a6e300732ec7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -692,8 +692,6 @@ typedef struct pglist_data { * is the first PFN that needs to be initialised. */ unsigned long first_deferred_pfn; - /* Number of non-deferred pages */ - unsigned long static_init_pgcnt; #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ #ifdef CONFIG_TRANSPARENT_HUGEPAGE -- cgit v1.2.3 From 2c2a5af6fed20cf74401c9d64319c76c5ff81309 Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Fri, 28 Dec 2018 00:36:22 -0800 Subject: mm, memory_hotplug: add nid parameter to arch_remove_memory Patch series "Do not touch pages in hot-remove path", v2. This patchset aims for two things: 1) A better definition about offline and hot-remove stage 2) Solving bugs where we can access non-initialized pages during hot-remove operations [2] [3]. This is achieved by moving all page/zone handling to the offline stage, so we do not need to access pages when hot-removing memory. 
[1] https://patchwork.kernel.org/cover/10691415/ [2] https://patchwork.kernel.org/patch/10547445/ [3] https://www.spinics.net/lists/linux-mm/msg161316.html This patch (of 5): This is a preparation for the following-up patches. The idea of passing the nid is that it will allow us to get rid of the zone parameter afterwards. Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Reviewed-by: Pavel Tatashin Cc: Michal Hocko Cc: Dan Williams Cc: Jerome Glisse Cc: Jonathan Cameron Cc: "Rafael J. Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memory_hotplug.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 7383a7a76d69..9e4d9b9b93ea 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -107,8 +107,8 @@ static inline bool movable_node_is_enabled(void) } #ifdef CONFIG_MEMORY_HOTREMOVE -extern int arch_remove_memory(u64 start, u64 size, - struct vmem_altmap *altmap); +extern int arch_remove_memory(int nid, u64 start, u64 size, + struct vmem_altmap *altmap); extern int __remove_pages(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, struct vmem_altmap *altmap); #endif /* CONFIG_MEMORY_HOTREMOVE */ -- cgit v1.2.3 From fed84c78527009d4f799a3ed9a566502fa026d82 Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Fri, 28 Dec 2018 00:36:29 -0800 Subject: mm/memblock.c: skip kmemleak for kasan_init() Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and Huawei TaiShan 2280 aarch64 servers). After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early log buffer went from something like 280 to 260000 which caused kmemleak disabled and crash dump memory reservation failed. The multitude of kmemleak_alloc() calls is from nested loops while KASAN is setting up full memory mappings, so let early kmemleak allocations skip those memblock_alloc_internal() calls came from kasan_init() given that those early KASAN memory mappings should not reference to other memory. Hence, no kmemleak false positives. 
kasan_init kasan_map_populate [1] kasan_pgd_populate [2] kasan_pud_populate [3] kasan_pmd_populate [4] kasan_pte_populate [5] kasan_alloc_zeroed_page memblock_alloc_try_nid memblock_alloc_internal kmemleak_alloc [1] for_each_memblock(memory, reg) [2] while (pgdp++, addr = next, addr != end) [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp))) [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp))) [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep))) Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us Signed-off-by: Qian Cai Acked-by: Catalin Marinas Cc: Michal Hocko Cc: Mike Rapoport Cc: Alexander Potapenko Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memblock.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include') diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 5f74ba623dbd..64c41cf45590 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -319,6 +319,7 @@ static inline int memblock_get_region_node(const struct memblock_region *r) /* Flags for memblock allocation APIs */ #define MEMBLOCK_ALLOC_ANYWHERE (~(phys_addr_t)0) #define MEMBLOCK_ALLOC_ACCESSIBLE 0 +#define MEMBLOCK_ALLOC_KASAN 1 /* We are using top down, so it is safe to use 0 here */ #define MEMBLOCK_LOW_LIMIT 0 -- cgit v1.2.3 From 9e247bab0668a5893b3efa131cec5b5859467834 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Fri, 28 Dec 2018 00:36:58 -0800 Subject: mm: remove pte_lock_deinit() Pagetable page doesn't touch page->mapping or have any used field that overlaps with it. No need to clear mapping in dtor. In fact, doing so might mask problems that otherwise would be detected by bad_page(). Link: http://lkml.kernel.org/r/20181128235525.58780-1-yuzhao@google.com Signed-off-by: Yu Zhao Reviewed-by: Matthew Wilcox Acked-by: Michal Hocko Cc: Hugh Dickins Cc: "Kirill A . Shutemov" Cc: Dan Williams Cc: Pavel Tatashin Cc: Souptick Joarder Cc: Logan Gunthorpe Cc: Keith Busch Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 9963f77f1101..3c39b9dc7a90 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1954,13 +1954,6 @@ static inline bool ptlock_init(struct page *page) return true; } -/* Reset page->mapping so free_pages_check won't complain. */ -static inline void pte_lock_deinit(struct page *page) -{ - page->mapping = NULL; - ptlock_free(page); -} - #else /* !USE_SPLIT_PTE_PTLOCKS */ /* * We use mm->page_table_lock to guard all pagetable pages of the mm. 
@@ -1971,7 +1964,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd) } static inline void ptlock_cache_init(void) {} static inline bool ptlock_init(struct page *page) { return true; } -static inline void pte_lock_deinit(struct page *page) {} +static inline void ptlock_free(struct page *page) {} #endif /* USE_SPLIT_PTE_PTLOCKS */ static inline void pgtable_init(void) @@ -1991,7 +1984,7 @@ static inline bool pgtable_page_ctor(struct page *page) static inline void pgtable_page_dtor(struct page *page) { - pte_lock_deinit(page); + ptlock_free(page); __ClearPageTable(page); dec_zone_page_state(page, NR_PAGETABLE); } -- cgit v1.2.3 From 83af658898cb292a32d8b6cd9b51266d7cfc4b6a Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:37:02 -0800 Subject: mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section() pgdat_resize_lock is used to protect pgdat's memory region information like: node_start_pfn, node_present_pages, etc. While in function sparse_add/remove_one_section(), pgdat_resize_lock is used to protect initialization/release of one mem_section. This looks not proper. These code paths are currently protected by mem_hotplug_lock currently but should there ever be any reason for locking at the sparse layer a dedicated lock should be introduced. Following is the current call trace of sparse_add/remove_one_section() mem_hotplug_begin() arch_add_memory() add_pages() __add_pages() __add_section() sparse_add_one_section() mem_hotplug_done() mem_hotplug_begin() arch_remove_memory() __remove_pages() __remove_section() sparse_remove_one_section() mem_hotplug_done() The comment above the pgdat_resize_lock also mentions "Holding this will also guarantee that any pfn_valid() stays that way.", which is true with the current implementation and false after this patch. But current implementation doesn't meet this comment. There isn't any pfn walkers to take the lock so this looks like a relict from the past. This patch also removes this comment. [richard.weiyang@gmail.com: v4] Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com [mhocko@suse.com: changelog suggestion] Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Cc: Dave Hansen Cc: Oscar Salvador Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a6e300732ec7..fc4b5cdb6c2d 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -637,8 +637,7 @@ typedef struct pglist_data { #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT) /* * Must be held any time you expect node_start_pfn, node_present_pages - * or node_spanned_pages stay constant. Holding this will also - * guarantee that any pfn_valid() stays that way. + * or node_spanned_pages stay constant. * * pgdat_resize_lock() and pgdat_resize_unlock() are provided to * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG -- cgit v1.2.3 From 4e0d2e7ef14d9e1c900dac909db45263822b824f Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:37:06 -0800 Subject: mm, sparse: pass nid instead of pgdat to sparse_add_one_section() Since the information needed in sparse_add_one_section() is node id to allocate proper memory, it is not necessary to pass its pgdat. 
This patch changes the prototype of sparse_add_one_section() to pass the node id directly. This is intended to avoid the misleading impression that sparse_add_one_section() would touch the pgdat. Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Cc: Dave Hansen Cc: Oscar Salvador Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memory_hotplug.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 9e4d9b9b93ea..8ed6e09a5c0c 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -333,8 +333,8 @@ extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, struct vmem_altmap *altmap); extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); extern bool is_memblock_offlined(struct memory_block *mem); -extern int sparse_add_one_section(struct pglist_data *pgdat, - unsigned long start_pfn, struct vmem_altmap *altmap); +extern int sparse_add_one_section(int nid, unsigned long start_pfn, + struct vmem_altmap *altmap); extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms, unsigned long map_offset, struct vmem_altmap *altmap); extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, -- cgit v1.2.3 From fa004ab7365ffa1e17e6b267d64798afccb94946 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:37:10 -0800 Subject: mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection During the online_pages phase, pgdat->nr_zones will be updated in case the zone is empty. Currently the online_pages phase is protected by the global locks (device_hotplug_lock and mem_hotplug_lock), which ensures there is no contention during the update of nr_zones. These global locks introduce scalability issues (especially the second one), slowing down code that relies on get_online_mems(). This is also a preparation for not having to rely on get_online_mems() but instead on some more fine-grained locks. The patch moves init_currently_empty_zone under both zone_span_writelock and pgdat_resize_lock because both the pgdat state (nr_zones) and the zone's start_pfn are changed. Also this patch changes the documentation of node_size_lock to include the protection of nr_zones. Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Acked-by: Michal Hocko Reviewed-by: Oscar Salvador Cc: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fc4b5cdb6c2d..cc4a507d7ca4 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -636,8 +636,8 @@ typedef struct pglist_data { #endif #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT) /* - * Must be held any time you expect node_start_pfn, node_present_pages - * or node_spanned_pages stay constant. + * Must be held any time you expect node_start_pfn, + * node_present_pages, node_spanned_pages or nr_zones to stay constant.
* * pgdat_resize_lock() and pgdat_resize_unlock() are provided to * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG -- cgit v1.2.3 From 144552ff8995dd34d049a203d636b259ab751137 Mon Sep 17 00:00:00 2001 From: Anthony Yznaga Date: Fri, 28 Dec 2018 00:37:31 -0800 Subject: /proc/kpagecount: return 0 for special pages that are never mapped Certain pages that are never mapped to userspace have a type indicated in the page_type field of their struct pages (e.g. PG_buddy). page_type overlaps with _mapcount so set the count to 0 and avoid calling page_mapcount() for these pages. [anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox] Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga Reviewed-by: Andrew Morton Acked-by: Matthew Wilcox Reviewed-by: Naoya Horiguchi Cc: Vlastimil Babka Cc: David Rientjes Cc: Alexey Dobriyan Cc: Kirill A. Shutemov Cc: Mike Rapoport Cc: Michal Hocko Cc: Alexander Duyck Cc: Johannes Weiner Cc: Miles Chen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/page-flags.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'include') diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 50ce1bddaf56..39b4494e29f1 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -669,6 +669,7 @@ PAGEFLAG_FALSE(DoubleMap) #define PAGE_TYPE_BASE 0xf0000000 /* Reserve 0x0000007f to catch underflows of page_mapcount */ +#define PAGE_MAPCOUNT_RESERVE -128 #define PG_buddy 0x00000080 #define PG_balloon 0x00000100 #define PG_kmemcg 0x00000200 @@ -677,6 +678,11 @@ PAGEFLAG_FALSE(DoubleMap) #define PageType(page, flag) \ ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) +static inline int page_has_type(struct page *page) +{ + return (int)page->page_type < PAGE_MAPCOUNT_RESERVE; +} + #define PAGE_TYPE_OPS(uname, lname) \ static __always_inline int Page##uname(struct page *page) \ { \ -- cgit v1.2.3 From 8e2d43405b22e98cf5f3730c1829ec1fdbe17ae7 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Fri, 28 Dec 2018 00:37:53 -0800 Subject: lib/ioremap: ensure break-before-make is used for huge p4d mappings Whilst no architectures actually enable support for huge p4d mappings in the vmap area, the code that is implemented should be using break-before-make, as we do for pud and pmd huge entries. Link: http://lkml.kernel.org/r/1544120495-17438-6-git-send-email-will.deacon@arm.com Signed-off-by: Will Deacon Reviewed-by: Toshi Kani Cc: Chintan Pandya Cc: Toshi Kani Cc: Thomas Gleixner Cc: Michal Hocko Cc: "H. 
Peter Anvin" Cc: Ingo Molnar Cc: Sean Christopherson Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/asm-generic/pgtable.h | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'include') diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index a9cac82e9a7a..05e61e6c843f 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -1057,6 +1057,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot); int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot); int pud_clear_huge(pud_t *pud); int pmd_clear_huge(pmd_t *pmd); +int p4d_free_pud_page(p4d_t *p4d, unsigned long addr); int pud_free_pmd_page(pud_t *pud, unsigned long addr); int pmd_free_pte_page(pmd_t *pmd, unsigned long addr); #else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ @@ -1084,6 +1085,10 @@ static inline int pmd_clear_huge(pmd_t *pmd) { return 0; } +static inline int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) +{ + return 0; +} static inline int pud_free_pmd_page(pud_t *pud, unsigned long addr) { return 0; -- cgit v1.2.3 From 5d6527a784f7a6d247961e046e830de8d71b47d1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= Date: Fri, 28 Dec 2018 00:38:05 -0800 Subject: mm/mmu_notifier: use structure for invalidate_range_start/end callback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mmu notifier contextual informations", v2. This patchset adds contextual information, why an invalidation is happening, to mmu notifier callback. This is necessary for user of mmu notifier that wish to maintains their own data structure without having to add new fields to struct vm_area_struct (vma). For instance device can have they own page table that mirror the process address space. When a vma is unmap (munmap() syscall) the device driver can free the device page table for the range. Today we do not have any information on why a mmu notifier call back is happening and thus device driver have to assume that it is always an munmap(). This is inefficient at it means that it needs to re-allocate device page table on next page fault and rebuild the whole device driver data structure for the range. Other use case beside munmap() also exist, for instance it is pointless for device driver to invalidate the device page table when the invalidation is for the soft dirtyness tracking. Or device driver can optimize away mprotect() that change the page table permission access for the range. This patchset enables all this optimizations for device drivers. I do not include any of those in this series but another patchset I am posting will leverage this. The patchset is pretty simple from a code point of view. The first two patches consolidate all mmu notifier arguments into a struct so that it is easier to add/change arguments. The last patch adds the contextual information (munmap, protection, soft dirty, clear, ...). This patch (of 3): To avoid having to change many callback definition everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end callback. No functional changes with this patch. 
[akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc] Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com Signed-off-by: Jérôme Glisse Acked-by: Jan Kara Acked-by: Jason Gunthorpe [infiniband] Cc: Matthew Wilcox Cc: Ross Zwisler Cc: Dan Williams Cc: Paolo Bonzini Cc: Radim Krcmar Cc: Michal Hocko Cc: Christian Koenig Cc: Felix Kuehling Cc: Ralph Campbell Cc: John Hubbard Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmu_notifier.h | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) (limited to 'include') diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 913c3c13e36e..3d377805b29c 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -25,6 +25,13 @@ struct mmu_notifier_mm { spinlock_t lock; }; +struct mmu_notifier_range { + struct mm_struct *mm; + unsigned long start; + unsigned long end; + bool blockable; +}; + struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -146,12 +153,9 @@ struct mmu_notifier_ops { * */ int (*invalidate_range_start)(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, unsigned long end, - bool blockable); + const struct mmu_notifier_range *range); void (*invalidate_range_end)(struct mmu_notifier *mn, - struct mm_struct *mm, - unsigned long start, unsigned long end); + const struct mmu_notifier_range *range); /* * invalidate_range() is either called between -- cgit v1.2.3 From ac46d4f3c43241ffa23d5bf36153a0830c0e02cc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= Date: Fri, 28 Dec 2018 00:38:09 -0800 Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. 
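A minimal sketch of the caller-side pattern after this conversion (my_unmap_range is a made-up wrapper; the init/start/end helpers are the ones changed in the hunk below):

#include <linux/mm_types.h>
#include <linux/mmu_notifier.h>

static void my_unmap_range(struct mm_struct *mm,
			   unsigned long start, unsigned long end)
{
	struct mmu_notifier_range range;

	/* All former (mm, start, end) arguments now travel in one structure. */
	mmu_notifier_range_init(&range, mm, start, end);
	mmu_notifier_invalidate_range_start(&range);

	/* ... tear down the page table entries for [start, end) ... */

	mmu_notifier_invalidate_range_end(&range);
}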
[akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse Acked-by: Christian König Acked-by: Jan Kara Cc: Matthew Wilcox Cc: Ross Zwisler Cc: Dan Williams Cc: Paolo Bonzini Cc: Radim Krcmar Cc: Michal Hocko Cc: Felix Kuehling Cc: Ralph Campbell Cc: John Hubbard From: Jérôme Glisse Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 6 ++- include/linux/mmu_notifier.h | 87 +++++++++++++++++++++++++++++--------------- 2 files changed, 62 insertions(+), 31 deletions(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 3c39b9dc7a90..ea1f12d15365 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1451,6 +1451,8 @@ struct mm_walk { void *private; }; +struct mmu_notifier_range; + int walk_page_range(unsigned long addr, unsigned long end, struct mm_walk *walk); int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk); @@ -1459,8 +1461,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long addr, int copy_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma); int follow_pte_pmd(struct mm_struct *mm, unsigned long address, - unsigned long *start, unsigned long *end, - pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); + struct mmu_notifier_range *range, + pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp); int follow_pfn(struct vm_area_struct *vma, unsigned long address, unsigned long *pfn); int follow_phys(struct vm_area_struct *vma, unsigned long address, diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 3d377805b29c..4050ec1c3b45 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -220,11 +220,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm, unsigned long address); extern void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, pte_t pte); -extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end, - bool blockable); -extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end, +extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r); +extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r, bool only_end); extern void __mmu_notifier_invalidate_range(struct mm_struct *mm, unsigned long start, unsigned long end); @@ -268,33 +265,37 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm, __mmu_notifier_change_pte(mm, address, pte); } -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_start(mm, start, end, true); + if (mm_has_notifiers(range->mm)) { + range->blockable = true; + __mmu_notifier_invalidate_range_start(range); + } } -static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline int +mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - return 
__mmu_notifier_invalidate_range_start(mm, start, end, false); + if (mm_has_notifiers(range->mm)) { + range->blockable = false; + return __mmu_notifier_invalidate_range_start(range); + } return 0; } -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end, false); + if (mm_has_notifiers(range->mm)) + __mmu_notifier_invalidate_range_end(range, false); } -static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range) { - if (mm_has_notifiers(mm)) - __mmu_notifier_invalidate_range_end(mm, start, end, true); + if (mm_has_notifiers(range->mm)) + __mmu_notifier_invalidate_range_end(range, true); } static inline void mmu_notifier_invalidate_range(struct mm_struct *mm, @@ -315,6 +316,17 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) __mmu_notifier_mm_destroy(mm); } + +static inline void mmu_notifier_range_init(struct mmu_notifier_range *range, + struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + range->mm = mm; + range->start = start; + range->end = end; +} + #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ ({ \ int __young; \ @@ -427,6 +439,23 @@ extern void mmu_notifier_call_srcu(struct rcu_head *rcu, #else /* CONFIG_MMU_NOTIFIER */ +struct mmu_notifier_range { + unsigned long start; + unsigned long end; +}; + +static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range, + unsigned long start, + unsigned long end) +{ + range->start = start; + range->end = end; +} + +#define mmu_notifier_range_init(range, mm, start, end) \ + _mmu_notifier_range_init(range, start, end) + + static inline int mm_has_notifiers(struct mm_struct *mm) { return 0; @@ -454,24 +483,24 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm, { } -static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range) { } -static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline int +mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range) { return 0; } -static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline +void mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range) { } -static inline void mmu_notifier_invalidate_range_only_end(struct mm_struct *mm, - unsigned long start, unsigned long end) +static inline void +mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range) { } -- cgit v1.2.3 From 0614ce9776b037b6a08a9adcbfcc382c0053b178 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:38:13 -0800 Subject: include/linux/memory_hotplug.h: remove duplicate declaration of offline_pages() offline_pages() is already declared in this file. Just remove the duplicated one. 
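For context (not part of the patch): repeated compatible declarations are legal C, which is why the duplicate prototype never produced a warning; a trivial illustration:

/* Both lines compile cleanly because the prototypes match exactly,
 * so the redundant declaration in memory_hotplug.h went unnoticed. */
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);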
Link: http://lkml.kernel.org/r/20181205031357.24769-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memory_hotplug.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 8ed6e09a5c0c..07da5c6c5ba0 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -331,7 +331,6 @@ extern int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap, bool want_memblock); extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages, struct vmem_altmap *altmap); -extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); extern bool is_memblock_offlined(struct memory_block *mem); extern int sparse_add_one_section(int nid, unsigned long start_pfn, struct vmem_altmap *altmap); -- cgit v1.2.3 From 7635d9cbe8327e131a1d3d8517dc186c2796ce2e Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Fri, 28 Dec 2018 00:38:21 -0800 Subject: mm, thp, proc: report THP eligibility for each vma Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are use cases that would like to know that; see http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for the per-VMA hg resp. nh flags and compare that state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover, the eligibility of the vma depends on the type of VMA as well. In the past we supported only anonymous memory VMAs, but things have changed: shmem-based vmas are supported as well these days, and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP, so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just uses the new transparent_hugepage_enabled, which also checks the vma support status (please note that this one has to be out of line due to include dependency issues).
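A rough sketch of how the new out-of-line wrapper can layer the VMA-type checks on top of the inline helper kept in huge_mm.h; the real implementation lives in mm/huge_memory.c and may differ in detail, so treat this as an outline, not the patch itself.

/* Sketch only: dispatch on VMA type, then reuse the inline checks. */
bool transparent_hugepage_enabled(struct vm_area_struct *vma)
{
	if (vma_is_anonymous(vma))
		return __transparent_hugepage_enabled(vma);

	/* shmem mounts carry their own huge= policy. */
	if (vma_is_shmem(vma) && shmem_huge_enabled(vma))
		return __transparent_hugepage_enabled(vma);

	return false;
}

The result is what smaps can then report as a per-VMA eligibility flag, so userspace no longer has to reconstruct the decision from hg/nh flags plus global knobs.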
[mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Dan Williams Cc: David Rientjes Cc: Jan Kara Cc: Mike Rapoport Cc: Paul Oppenheimer Cc: William Kucharski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 4663ee96cf59..381e872bfde0 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -93,7 +93,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma); extern unsigned long transparent_hugepage_flags; -static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma) +/* + * to be used on vmas which are known to support THP. + * Use transparent_hugepage_enabled otherwise + */ +static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma) { if (vma->vm_flags & VM_NOHUGEPAGE) return false; @@ -117,6 +121,8 @@ static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma) return false; } +bool transparent_hugepage_enabled(struct vm_area_struct *vma); + #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG)) -- cgit v1.2.3 From: Pingfan Liu Date: Fri, 28 Dec 2018 00:38:43 -0800 Subject: mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated in code. If someone adds an extra migrate type, they may forget to enlarge NR_PAGEBLOCK_BITS, so some way of enforcing the relation is needed. NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, but these macros are spread across two different header files with a reverse include dependency, which makes it hard to refer to MIGRATE_TYPES in pageblock-flags.h. This patch makes the relation checked at compile time. Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu Reviewed-by: Andrew Morton Acked-by: Vlastimil Babka Cc: Michal Hocko Cc: Pavel Tatashin Cc: Oscar Salvador Cc: Mike Rapoport Cc: Joonsoo Kim Cc: Alexander Duyck Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/pageblock-flags.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index 9132c5cb41f1..06a66327333d 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -25,10 +25,11 @@ #include +#define PB_migratetype_bits 3 /* Bit indices that affect a whole block of pages */ enum pageblock_bits { PB_migrate, - PB_migrate_end = PB_migrate + 3 - 1, + PB_migrate_end = PB_migrate + PB_migratetype_bits - 1, /* 3 bits required for migrate types */ PB_migrate_skip,/* If set the block is skipped by compaction */ -- cgit v1.2.3 From 89cb0888ca1483ad72648844ddd1b801863a8949 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Fri, 28 Dec 2018 00:39:12 -0800 Subject: mm: migrate: provide buffer_migrate_page_norefs() Provide a variant of buffer_migrate_page() that also checks whether there are no unexpected references to buffer heads. This function will then be safe to use for block device pages.
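A minimal sketch of how an address_space that backs block device pages can opt into the stricter variant; the aops table, its name, and the other callbacks are illustrative only, and the actual hookup for block devices is done in a separate patch of this series.

/* Illustrative aops table, not taken from the series itself. */
static const struct address_space_operations example_blk_aops = {
	/* ... readpage/writepage callbacks for the backing store ... */

	/* Refuse migration if anyone else holds buffer_head references;
	 * the #else branch in fs.h turns this into NULL when buffer
	 * migration is not configured, so no ifdef is needed here. */
	.migratepage	= buffer_migrate_page_norefs,
};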
[akpm@linux-foundation.org: remove EXPORT_SYMBOL(buffer_migrate_page_norefs)] Link: http://lkml.kernel.org/r/20181211172143.7358-5-jack@suse.cz Signed-off-by: Jan Kara Acked-by: Mel Gorman Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/fs.h | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'include') diff --git a/include/linux/fs.h b/include/linux/fs.h index 26a8607b3c3c..1cda6648a41f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3269,8 +3269,12 @@ extern int generic_check_addressable(unsigned, u64); extern int buffer_migrate_page(struct address_space *, struct page *, struct page *, enum migrate_mode); +extern int buffer_migrate_page_norefs(struct address_space *, + struct page *, struct page *, + enum migrate_mode); #else #define buffer_migrate_page NULL +#define buffer_migrate_page_norefs NULL #endif extern int setattr_prepare(struct dentry *, struct iattr *); -- cgit v1.2.3 From ab41ee6879981b3d3a16a1079a33fa6fd043eb3c Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Fri, 28 Dec 2018 00:39:20 -0800 Subject: mm: migrate: drop unused argument of migrate_page_move_mapping() All callers of migrate_page_move_mapping() now pass NULL for the 'head' argument. Drop it. Link: http://lkml.kernel.org/r/20181211172143.7358-7-jack@suse.cz Signed-off-by: Jan Kara Acked-by: Mel Gorman Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/migrate.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'include') diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 617615fa11ce..e13d9bf2f9a5 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -77,8 +77,7 @@ extern void migrate_page_copy(struct page *newpage, struct page *page); extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, - struct page *newpage, struct page *page, - struct buffer_head *head, enum migrate_mode mode, + struct page *newpage, struct page *page, enum migrate_mode mode, int extra_count); #else -- cgit v1.2.3 From af3b854492f351d1ff3b4744a83bf5ff7eed4920 Mon Sep 17 00:00:00 2001 From: Benjamin Poirier Date: Fri, 28 Dec 2018 00:39:23 -0800 Subject: mm/page_alloc.c: allow error injection Model the call chain after should_failslab(). Likewise, we can now use a kprobe to override the return value of should_fail_alloc_page() and inject allocation failures into alloc_page*(). This will allow injecting allocation failures using the BCC tools even without building the kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with a fail_page_alloc= parameter, which incurs some overhead even when failures are not being injected. On the other hand, this patch adds an unconditional call to should_fail_alloc_page() from the page allocation hotpath. That overhead should be rather negligible with CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.
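A hedged sketch of the annotation pattern that the new EI_ETYPE_TRUE type enables; the function body here is simplified, and the real should_fail_alloc_page() lives in mm/page_alloc.c, where it consults the fail_page_alloc attributes when CONFIG_FAIL_PAGE_ALLOC=y.

#include <linux/error-injection.h>
#include <linux/gfp.h>

/* Simplified: normally decides whether to fake an allocation failure. */
noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
{
	return false;
}
/* Marks the function so a kprobe/BPF program may override its return
 * value with 'true', injecting a failure into alloc_page*(). */
ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);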
[vbabka@suse.cz: changelog addition] Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com Signed-off-by: Benjamin Poirier Acked-by: Vlastimil Babka Cc: Arnd Bergmann Cc: Michal Hocko Cc: Pavel Tatashin Cc: Oscar Salvador Cc: Mike Rapoport Cc: Joonsoo Kim Cc: Alexander Duyck Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/asm-generic/error-injection.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include') diff --git a/include/asm-generic/error-injection.h b/include/asm-generic/error-injection.h index 296c65442f00..95a159a4137f 100644 --- a/include/asm-generic/error-injection.h +++ b/include/asm-generic/error-injection.h @@ -8,6 +8,7 @@ enum { EI_ETYPE_NULL, /* Return NULL if failure */ EI_ETYPE_ERRNO, /* Return -ERRNO if failure */ EI_ETYPE_ERRNO_NULL, /* Return -ERRNO or NULL if failure */ + EI_ETYPE_TRUE, /* Return true if failure */ }; struct error_injection_entry { -- cgit v1.2.3 From 4918e7625ffa82f388ea70538f0e1df20ea35a54 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Fri, 28 Dec 2018 00:39:27 -0800 Subject: include/linux/vmstat.h: remove unused page state adjustment macro These four macros are not used anymore. Just remove them. Link: http://lkml.kernel.org/r/20181214063211.2290-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/vmstat.h | 5 ----- 1 file changed, 5 deletions(-) (limited to 'include') diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index f25cef84b41d..2db8d60981fe 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -239,11 +239,6 @@ extern unsigned long node_page_state(struct pglist_data *pgdat, #define node_page_state(node, item) global_node_page_state(item) #endif /* CONFIG_NUMA */ -#define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d) -#define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d)) -#define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d) -#define sub_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, -(__d)) - #ifdef CONFIG_SMP void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long); void __inc_zone_page_state(struct page *, enum zone_stat_item); -- cgit v1.2.3 From 063a7d1d3623db31ca5d2309cab6030ebf93b72f Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 28 Dec 2018 00:39:46 -0800 Subject: mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The kbuild robot reported the following on a development branch that used memremap.h in a new path: In file included from arch/m68k/include/asm/pgtable_mm.h:148:0, from arch/m68k/include/asm/pgtable.h:5, from include/linux/memremap.h:7, from drivers//dax/bus.c:3: arch/m68k/include/asm/motorola_pgtable.h: In function 'pgd_offset': >> arch/m68k/include/asm/motorola_pgtable.h:199:11: error: dereferencing pointer to incomplete type 'const struct mm_struct' return mm->pgd + pgd_index(address); ^~ The ->page_fault() callback is specific to HMM. Move it to 'struct hmm_devmem' where the unusual asm/pgtable.h dependency can be contained in include/linux/hmm.h. Longer term, refactoring this dependency out of HMM is recommended, but in the meantime memremap.h remains generic.
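For orientation, a skeletal driver callback matching the dev_page_fault_t signature that this patch moves into hmm.h; the function name is illustrative, the body is a placeholder for driver-specific migration code, and the return-value convention follows the contract described in the comment block below.

#include <linux/hmm.h>
#include <linux/mm.h>

/* Skeleton of a driver's CPU-fault handler for unaddressable device
 * memory: its job is to migrate the page back to system RAM. */
static int example_devmem_fault(struct vm_area_struct *vma,
				unsigned long addr,
				const struct page *page,
				unsigned int flags,
				pmd_t *pmdp)
{
	/*
	 * Migrate the device page backing @addr to system memory here
	 * (driver-specific). Per the contract, return VM_FAULT_SIGBUS on
	 * unrecoverable device errors and VM_FAULT_OOM if allocating the
	 * replacement system page fails.
	 */
	return 0;
}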
Link: http://lkml.kernel.org/r/154534090899.3120190.6652620807617715272.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE memory...") Signed-off-by: Dan Williams Reviewed-by: "Jérôme Glisse" Cc: Logan Gunthorpe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/hmm.h | 24 ++++++++++++++++++++++++ include/linux/memremap.h | 32 -------------------------------- 2 files changed, 24 insertions(+), 32 deletions(-) (limited to 'include') diff --git a/include/linux/hmm.h b/include/linux/hmm.h index ed89fbc525d2..66f9ebbb1df3 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -69,6 +69,7 @@ #define LINUX_HMM_H #include +#include #if IS_ENABLED(CONFIG_HMM) @@ -486,6 +487,7 @@ struct hmm_devmem_ops { * @device: device to bind resource to * @ops: memory operations callback * @ref: per CPU refcount + * @page_fault: callback when CPU fault on an unaddressable device page * * This an helper structure for device drivers that do not wish to implement * the gory details related to hotplugging new memoy and allocating struct @@ -493,7 +495,28 @@ struct hmm_devmem_ops { * * Device drivers can directly use ZONE_DEVICE memory on their own if they * wish to do so. + * + * The page_fault() callback must migrate page back, from device memory to + * system memory, so that the CPU can access it. This might fail for various + * reasons (device issues, device have been unplugged, ...). When such error + * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and + * set the CPU page table entry to "poisoned". + * + * Note that because memory cgroup charges are transferred to the device memory, + * this should never fail due to memory restrictions. However, allocation + * of a regular system page might still fail because we are out of memory. If + * that happens, the page_fault() callback must return VM_FAULT_OOM. + * + * The page_fault() callback can also try to migrate back multiple pages in one + * chunk, as an optimization. It must, however, prioritize the faulting address + * over all the others. */ +typedef int (*dev_page_fault_t)(struct vm_area_struct *vma, + unsigned long addr, + const struct page *page, + unsigned int flags, + pmd_t *pmdp); + struct hmm_devmem { struct completion completion; unsigned long pfn_first; @@ -503,6 +526,7 @@ struct hmm_devmem { struct dev_pagemap pagemap; const struct hmm_devmem_ops *ops; struct percpu_ref ref; + dev_page_fault_t page_fault; }; /* diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 55db66b3716f..f0628660d541 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -4,8 +4,6 @@ #include #include -#include - struct resource; struct device; @@ -66,47 +64,18 @@ enum memory_type { }; /* - * For MEMORY_DEVICE_PRIVATE we use ZONE_DEVICE and extend it with two - * callbacks: - * page_fault() - * page_free() - * * Additional notes about MEMORY_DEVICE_PRIVATE may be found in * include/linux/hmm.h and Documentation/vm/hmm.rst. There is also a brief * explanation in include/linux/memory_hotplug.h. * - * The page_fault() callback must migrate page back, from device memory to - * system memory, so that the CPU can access it. This might fail for various - * reasons (device issues, device have been unplugged, ...). When such error - * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and - * set the CPU page table entry to "poisoned". 
- * - * Note that because memory cgroup charges are transferred to the device memory, - * this should never fail due to memory restrictions. However, allocation - * of a regular system page might still fail because we are out of memory. If - * that happens, the page_fault() callback must return VM_FAULT_OOM. - * - * The page_fault() callback can also try to migrate back multiple pages in one - * chunk, as an optimization. It must, however, prioritize the faulting address - * over all the others. - * - * * The page_free() callback is called once the page refcount reaches 1 * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug. * This allows the device driver to implement its own memory management.) - * - * For MEMORY_DEVICE_PUBLIC only the page_free() callback matter. */ -typedef int (*dev_page_fault_t)(struct vm_area_struct *vma, - unsigned long addr, - const struct page *page, - unsigned int flags, - pmd_t *pmdp); typedef void (*dev_page_free_t)(struct page *page, void *data); /** * struct dev_pagemap - metadata for ZONE_DEVICE mappings - * @page_fault: callback when CPU fault on an unaddressable device page * @page_free: free page callback when page refcount reaches 1 * @altmap: pre-allocated/reserved memory for vmemmap allocations * @res: physical address range covered by @ref @@ -117,7 +86,6 @@ typedef void (*dev_page_free_t)(struct page *page, void *data); * @type: memory type: see MEMORY_* in memory_hotplug.h */ struct dev_pagemap { - dev_page_fault_t page_fault; dev_page_free_t page_free; struct vmem_altmap altmap; bool altmap_valid; -- cgit v1.2.3 From 70c6066e19c15749b579dde7d5722c7d7fb05d57 Mon Sep 17 00:00:00 2001 From: Kyle Spiers Date: Fri, 28 Dec 2018 00:39:49 -0800 Subject: include/linux/gfp.h: fix typo Fix misspelled "satisfied" Link: http://lkml.kernel.org/r/20181227232354.64562-1-ksspiers@google.com Signed-off-by: Kyle Spiers Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 0705164f928c..5f5e25fd6149 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -81,7 +81,7 @@ struct vm_area_struct; * * %__GFP_HARDWALL enforces the cpuset memory allocation policy. * - * %__GFP_THISNODE forces the allocation to be satisified from the requested + * %__GFP_THISNODE forces the allocation to be satisfied from the requested * node with no fallbacks or placement policy enforcements. * * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg. -- cgit v1.2.3