From 91173c6e18ab410fac12667656ab7cc3363687cc Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 31 May 2019 22:29:57 -0700 Subject: mm: fix Documentation/vm/hmm.rst Sphinx warnings MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix Sphinx warnings in Documentation/vm/hmm.rst by using "::" notation and inserting a blank line. Also add a missing ';'. Documentation/vm/hmm.rst:292: WARNING: Unexpected indentation. Documentation/vm/hmm.rst:300: WARNING: Unexpected indentation. Link: http://lkml.kernel.org/r/c5995359-7c82-4e47-c7be-b58a4dda0953@infradead.org Fixes: 023a019a9b4e ("mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays") Signed-off-by: Randy Dunlap Reviewed-by: Jérôme Glisse Reviewed-by: Mike Rapoport Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hmm.rst | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst index ec1efa32af3c..7cdf7282e022 100644 --- a/Documentation/vm/hmm.rst +++ b/Documentation/vm/hmm.rst @@ -288,15 +288,17 @@ For instance if the device flags for device entries are: WRITE (1 << 62) Now let say that device driver wants to fault with at least read a range then -it does set: - range->default_flags = (1 << 63) +it does set:: + + range->default_flags = (1 << 63); range->pfn_flags_mask = 0; and calls hmm_range_fault() as described above. This will fill fault all page in the range with at least read permission. Now let say driver wants to do the same except for one page in the range for -which its want to have write. Now driver set: +which its want to have write. Now driver set:: + range->default_flags = (1 << 63); range->pfn_flags_mask = (1 << 62); range->pfns[index_of_write] = (1 << 62); -- cgit v1.2.3 From aa52619ccbe056999d7c7231c8a1a11cedfccc6a Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 31 May 2019 22:30:00 -0700 Subject: lib/sort.c: fix kernel-doc notation warnings Fix kernel-doc notation in lib/sort.c by using correct function parameter names. lib/sort.c:59: warning: Excess function parameter 'size' description in 'swap_words_32' lib/sort.c:83: warning: Excess function parameter 'size' description in 'swap_words_64' lib/sort.c:110: warning: Excess function parameter 'size' description in 'swap_bytes' Link: http://lkml.kernel.org/r/60e25d3d-68d1-bde2-3b39-e4baa0b14907@infradead.org Fixes: 37d0ec34d111a ("lib/sort: make swap functions more generic") Signed-off-by: Randy Dunlap Cc: George Spelvin Cc: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/sort.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/lib/sort.c b/lib/sort.c index 50855ea8c262..cf408aec3733 100644 --- a/lib/sort.c +++ b/lib/sort.c @@ -43,8 +43,9 @@ static bool is_aligned(const void *base, size_t size, unsigned char align) /** * swap_words_32 - swap two elements in 32-bit chunks - * @a, @b: pointers to the elements - * @size: element size (must be a multiple of 4) + * @a: pointer to the first element to swap + * @b: pointer to the second element to swap + * @n: element size (must be a multiple of 4) * * Exchange the two objects in memory. This exploits base+index addressing, * which basically all CPUs have, to minimize loop overhead computations. 
@@ -65,8 +66,9 @@ static void swap_words_32(void *a, void *b, size_t n) /** * swap_words_64 - swap two elements in 64-bit chunks - * @a, @b: pointers to the elements - * @size: element size (must be a multiple of 8) + * @a: pointer to the first element to swap + * @b: pointer to the second element to swap + * @n: element size (must be a multiple of 8) * * Exchange the two objects in memory. This exploits base+index * addressing, which basically all CPUs have, to minimize loop overhead @@ -100,8 +102,9 @@ static void swap_words_64(void *a, void *b, size_t n) /** * swap_bytes - swap two elements a byte at a time - * @a, @b: pointers to the elements - * @size: element size + * @a: pointer to the first element to swap + * @b: pointer to the second element to swap + * @n: element size * * This is the fallback if alignment doesn't allow using larger chunks. */ -- cgit v1.2.3 From 3806b04144e5e030aa17835ac1bb42473af4b957 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 31 May 2019 22:30:03 -0700 Subject: mm/vmalloc.c: fix typo in comment Reported-by: Nicholas Joll Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 233af6936c93..7350a124524b 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -815,7 +815,7 @@ find_vmap_lowest_match(unsigned long size, } /* - * OK. We roll back and find the fist right sub-tree, + * OK. We roll back and find the first right sub-tree, * that will satisfy the search criteria. It can happen * only once due to "vstart" restriction. */ -- cgit v1.2.3 From 461071b09e29160c3d179def1b01f49e14df52de Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Fri, 31 May 2019 22:30:06 -0700 Subject: arch/parisc/configs/c8000_defconfig: remove obsoleted CONFIG_DEBUG_SLAB_LEAK CONFIG_DEBUG_SLAB_LEAK has been removed, so remove it from defconfig. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1905201015460.96074@chino.kir.corp.google.com Fixes: 7878c231dae0 ("slab: remove /proc/slab_allocators") Signed-off-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/parisc/configs/c8000_defconfig | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/parisc/configs/c8000_defconfig b/arch/parisc/configs/c8000_defconfig index 088ab948a5ca..900b00084953 100644 --- a/arch/parisc/configs/c8000_defconfig +++ b/arch/parisc/configs/c8000_defconfig @@ -225,7 +225,6 @@ CONFIG_UNUSED_SYMBOLS=y CONFIG_DEBUG_FS=y CONFIG_MAGIC_SYSRQ=y CONFIG_DEBUG_SLAB=y -CONFIG_DEBUG_SLAB_LEAK=y CONFIG_DEBUG_MEMORY_INIT=y CONFIG_DEBUG_STACKOVERFLOW=y CONFIG_PANIC_ON_OOPS=y -- cgit v1.2.3 From fb092eb63d3aba5b876a51cbe743c1c8a8b37d5b Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Fri, 31 May 2019 22:30:09 -0700 Subject: arch/arm/boot/compressed/decompress.c: fix build error due to lz4 changes include/linux/cpumask.h: In function 'cpumask_parse': include/linux/cpumask.h:636:21: error: implicit declaration of function 'strchrnul'; did you mean 'strchr'? [-Werror=implicit-function-declaration] Because arch/arm/boot/compressed/decompress.c does #define _LINUX_STRING_H_ preventing linux/string.h from providing strchrnul. It also #includes asm/string.h, which for arm has a declaration of strchr(), explaining why this didn't use to fail. 
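For reference, strchrnul() behaves like strchr() except that when the character is not found it returns a pointer to the string's terminating NUL rather than NULL, which is what lets cpumask_parse() compute the length in a single pass. A minimal sketch of those semantics (illustrative only, not the kernel's lib/string.c implementation):

	char *strchrnul(const char *s, int c)
	{
		while (*s && *s != (char)c)
			s++;
		return (char *)s;
	}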
Link: http://lkml.kernel.org/r/20190528115346.f5a7kn3hdnuf5rts@linutronix.de Fixes: 3713a4e1fdb8d ("include/linux/cpumask.h: fix double string traverse in cpumask_parse") Suggested-by: Rasmus Villemoes Cc: Yury Norov Cc: Thomas Gleixner Cc: Russell King Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm/boot/compressed/decompress.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm/boot/compressed/decompress.c b/arch/arm/boot/compressed/decompress.c index c16c1829a5e4..aa075d8372ea 100644 --- a/arch/arm/boot/compressed/decompress.c +++ b/arch/arm/boot/compressed/decompress.c @@ -32,6 +32,7 @@ extern char * strstr(const char * s1, const char *s2); extern size_t strlen(const char *s); extern int memcmp(const void *cs, const void *ct, size_t count); +extern char * strchrnul(const char *, int); #ifdef CONFIG_KERNEL_GZIP #include "../../../../lib/decompress_inflate.c" -- cgit v1.2.3 From 8856ae4df3e9b5295ea2da7ad3b00796386454ec Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Fri, 31 May 2019 22:30:12 -0700 Subject: kernel/fork.c: make max_threads symbol static Fix build warning, kernel/fork.c:125:5: warning: symbol 'max_threads' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20190516015118.140561-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Reported-by: Hulk Robot Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/fork.c b/kernel/fork.c index b2b87d450b80..75675b9bf6df 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -123,7 +123,7 @@ unsigned long total_forks; /* Handle normal Linux uptimes. */ int nr_threads; /* The idle threads do not count.. */ -int max_threads; /* tunable limit on nr_threads */ +static int max_threads; /* tunable limit on nr_threads */ DEFINE_PER_CPU(unsigned long, process_counts) = 0; -- cgit v1.2.3 From 11bbd8b416f8abf40900dc5041152892f873d915 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Michal=20Koutn=C3=BD?= Date: Fri, 31 May 2019 22:30:16 -0700 Subject: prctl_set_mm: refactor checks from validate_prctl_map MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Despite comment of validate_prctl_map claims there are no capability checks, it is not completely true since commit 4d28df6152aa ("prctl: Allow local CAP_SYS_ADMIN changing exe_file"). Extract the check out of the function and make the function perform purely arithmetic checks. This patch should not change any behavior, it is mere refactoring for following patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190502125203.24014-2-mkoutny@suse.com Signed-off-by: Michal Koutný Reviewed-by: Kirill Tkhai Reviewed-by: Cyrill Gorcunov Cc: Kirill Tkhai Cc: Laurent Dufour Cc: Mateusz Guzik Cc: Michal Hocko Cc: Yang Shi Cc: Konstantin Khlebnikov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 51 +++++++++++++++++++++++++-------------------------- 1 file changed, 25 insertions(+), 26 deletions(-) diff --git a/kernel/sys.c b/kernel/sys.c index bdbfe8d37418..775bf8d18d03 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1882,13 +1882,14 @@ exit_err: } /* + * Check arithmetic relations of passed addresses. + * * WARNING: we don't require any capability here so be very careful * in what is allowed for modification from userspace. 
*/ -static int validate_prctl_map(struct prctl_mm_map *prctl_map) +static int validate_prctl_map_addr(struct prctl_mm_map *prctl_map) { unsigned long mmap_max_addr = TASK_SIZE; - struct mm_struct *mm = current->mm; int error = -EINVAL, i; static const unsigned char offsets[] = { @@ -1949,24 +1950,6 @@ static int validate_prctl_map(struct prctl_mm_map *prctl_map) prctl_map->start_data)) goto out; - /* - * Someone is trying to cheat the auxv vector. - */ - if (prctl_map->auxv_size) { - if (!prctl_map->auxv || prctl_map->auxv_size > sizeof(mm->saved_auxv)) - goto out; - } - - /* - * Finally, make sure the caller has the rights to - * change /proc/pid/exe link: only local sys admin should - * be allowed to. - */ - if (prctl_map->exe_fd != (u32)-1) { - if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN)) - goto out; - } - error = 0; out: return error; @@ -1993,11 +1976,18 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data if (copy_from_user(&prctl_map, addr, sizeof(prctl_map))) return -EFAULT; - error = validate_prctl_map(&prctl_map); + error = validate_prctl_map_addr(&prctl_map); if (error) return error; if (prctl_map.auxv_size) { + /* + * Someone is trying to cheat the auxv vector. + */ + if (!prctl_map.auxv || + prctl_map.auxv_size > sizeof(mm->saved_auxv)) + return -EINVAL; + memset(user_auxv, 0, sizeof(user_auxv)); if (copy_from_user(user_auxv, (const void __user *)prctl_map.auxv, @@ -2010,6 +2000,14 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data } if (prctl_map.exe_fd != (u32)-1) { + /* + * Make sure the caller has the rights to + * change /proc/pid/exe link: only local sys admin should + * be allowed to. + */ + if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN)) + return -EINVAL; + error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd); if (error) return error; @@ -2097,7 +2095,11 @@ static int prctl_set_mm(int opt, unsigned long addr, unsigned long arg4, unsigned long arg5) { struct mm_struct *mm = current->mm; - struct prctl_mm_map prctl_map; + struct prctl_mm_map prctl_map = { + .auxv = NULL, + .auxv_size = 0, + .exe_fd = -1, + }; struct vm_area_struct *vma; int error; @@ -2139,9 +2141,6 @@ static int prctl_set_mm(int opt, unsigned long addr, prctl_map.arg_end = mm->arg_end; prctl_map.env_start = mm->env_start; prctl_map.env_end = mm->env_end; - prctl_map.auxv = NULL; - prctl_map.auxv_size = 0; - prctl_map.exe_fd = -1; switch (opt) { case PR_SET_MM_START_CODE: @@ -2181,7 +2180,7 @@ static int prctl_set_mm(int opt, unsigned long addr, goto out; } - error = validate_prctl_map(&prctl_map); + error = validate_prctl_map_addr(&prctl_map); if (error) goto out; -- cgit v1.2.3 From bc81426f5beef7da863d3365bc9d45e820448745 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Michal=20Koutn=C3=BD?= Date: Fri, 31 May 2019 22:30:19 -0700 Subject: prctl_set_mm: downgrade mmap_sem to read lock MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap semaphore taken.") added synchronization of reading argument/environment boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided the coarse use of mmap_sem in similar situations. But there still remained two places that (mis)use mmap_sem. get_cmdline should also use arg_lock instead of mmap_sem when it reads the boundaries. The second place that should use arg_lock is in prctl_set_mm. 
By protecting the boundaries fields with the arg_lock, we can downgrade mmap_sem to reader lock (analogous to what we already do in prctl_set_mm_map). [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct") Signed-off-by: Michal Koutný Signed-off-by: Laurent Dufour Co-developed-by: Laurent Dufour Reviewed-by: Cyrill Gorcunov Acked-by: Michal Hocko Cc: Yang Shi Cc: Mateusz Guzik Cc: Kirill Tkhai Cc: Konstantin Khlebnikov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sys.c | 11 +++++++++-- mm/util.c | 4 ++-- 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/kernel/sys.c b/kernel/sys.c index 775bf8d18d03..2969304c29fe 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2127,9 +2127,15 @@ static int prctl_set_mm(int opt, unsigned long addr, error = -EINVAL; - down_write(&mm->mmap_sem); + /* + * arg_lock protects concurent updates of arg boundaries, we need + * mmap_sem for a) concurrent sys_brk, b) finding VMA for addr + * validation. + */ + down_read(&mm->mmap_sem); vma = find_vma(mm, addr); + spin_lock(&mm->arg_lock); prctl_map.start_code = mm->start_code; prctl_map.end_code = mm->end_code; prctl_map.start_data = mm->start_data; @@ -2217,7 +2223,8 @@ static int prctl_set_mm(int opt, unsigned long addr, error = 0; out: - up_write(&mm->mmap_sem); + spin_unlock(&mm->arg_lock); + up_read(&mm->mmap_sem); return error; } diff --git a/mm/util.c b/mm/util.c index 91682a2090ee..9834c4ab7d8e 100644 --- a/mm/util.c +++ b/mm/util.c @@ -718,12 +718,12 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen) if (!mm->arg_end) goto out_mm; /* Shh! No looking before we're done */ - down_read(&mm->mmap_sem); + spin_lock(&mm->arg_lock); arg_start = mm->arg_start; arg_end = mm->arg_end; env_start = mm->env_start; env_end = mm->env_end; - up_read(&mm->mmap_sem); + spin_unlock(&mm->arg_lock); len = arg_end - arg_start; -- cgit v1.2.3 From 9852ae3fe5293264f01c49f2571ef7688f7823ce Mon Sep 17 00:00:00 2001 From: Chris Down Date: Fri, 31 May 2019 22:30:22 -0700 Subject: mm, memcg: consider subtrees in memory.events memory.stat and other files already consider subtrees in their output, and we should too in order to not present an inconsistent interface. The current situation is fairly confusing, because people interacting with cgroups expect hierarchical behaviour in the vein of memory.stat, cgroup.events, and other files. For example, this causes confusion when debugging reclaim events under low, as currently these always read "0" at non-leaf memcg nodes, which frequently causes people to misdiagnose breach behaviour. The same confusion applies to other counters in this file when debugging issues. Aggregation is done at write time instead of at read-time since these counters aren't hot (unlike memory.stat which is per-page, so it does it at read time), and it makes sense to bundle this with the file notifications. After this patch, events are propagated up the hierarchy: [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events low 0 high 0 max 0 oom 0 oom_kill 0 [root@ktst ~]# systemd-run -p MemoryMax=1 true Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events low 0 high 0 max 7 oom 1 oom_kill 1 As this is a change in behaviour, this can be reverted to the old behaviour by mounting with the `memory_localevents' flag set. 
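For example (illustrative usage only; the mount point is arbitrary), the legacy local behaviour can be selected with

    # mount -t cgroup2 -o memory_localevents none /sys/fs/cgroup

or via the corresponding remount from the init namespace.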
However, we use the new behaviour by default as there's a lack of evidence that there are any current users of memory.events that would find this change undesirable. akpm: this is a behaviour change, so Cc:stable. THis is so that forthcoming distros which use cgroup v2 are more likely to pick up the revised behaviour. Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name Signed-off-by: Chris Down Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Cc: Michal Hocko Cc: Tejun Heo Cc: Roman Gushchin Cc: Dennis Zhou Cc: Suren Baghdasaryan Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v2.rst | 9 +++++++++ include/linux/cgroup-defs.h | 5 +++++ include/linux/memcontrol.h | 10 ++++++++-- kernel/cgroup/cgroup.c | 16 ++++++++++++++-- 4 files changed, 36 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 88e746074252..cf88c1f98270 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -177,6 +177,15 @@ cgroup v2 currently supports the following mount options. ignored on non-init namespace mounts. Please refer to the Delegation section for details. + memory_localevents + + Only populate memory.events with data for the current cgroup, + and not any subtrees. This is legacy behaviour, the default + behaviour without this option is to include subtree counts. + This option is system wide and can only be set on mount or + modified through remount from the init namespace. The mount + option is ignored on non-init namespace mounts. + Organizing Processes and Threads -------------------------------- diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 77258d276f93..11e215d7937e 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -89,6 +89,11 @@ enum { * Enable cpuset controller in v1 cgroup to use v2 behavior. */ CGRP_ROOT_CPUSET_V2_MODE = (1 << 4), + + /* + * Enable legacy local memory.events. 
+ */ + CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5), }; /* cftype->flags */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 73fe0a700911..edf9e8f32d70 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -737,8 +737,14 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, static inline void memcg_memory_event(struct mem_cgroup *memcg, enum memcg_memory_event event) { - atomic_long_inc(&memcg->memory_events[event]); - cgroup_file_notify(&memcg->events_file); + do { + atomic_long_inc(&memcg->memory_events[event]); + cgroup_file_notify(&memcg->events_file); + + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS) + break; + } while ((memcg = parent_mem_cgroup(memcg)) && + !mem_cgroup_is_root(memcg)); } static inline void memcg_memory_event_mm(struct mm_struct *mm, diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 217cec4e22c6..426a0026225c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1810,11 +1810,13 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node, enum cgroup2_param { Opt_nsdelegate, + Opt_memory_localevents, nr__cgroup2_params }; static const struct fs_parameter_spec cgroup2_param_specs[] = { - fsparam_flag ("nsdelegate", Opt_nsdelegate), + fsparam_flag("nsdelegate", Opt_nsdelegate), + fsparam_flag("memory_localevents", Opt_memory_localevents), {} }; @@ -1837,6 +1839,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param case Opt_nsdelegate: ctx->flags |= CGRP_ROOT_NS_DELEGATE; return 0; + case Opt_memory_localevents: + ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS; + return 0; } return -EINVAL; } @@ -1848,6 +1853,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE; else cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE; + + if (root_flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS) + cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS; + else + cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS; } } @@ -1855,6 +1865,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root { if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE) seq_puts(seq, ",nsdelegate"); + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS) + seq_puts(seq, ",memory_localevents"); return 0; } @@ -6325,7 +6337,7 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate); static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return snprintf(buf, PAGE_SIZE, "nsdelegate\n"); + return snprintf(buf, PAGE_SIZE, "nsdelegate\nmemory_localevents\n"); } static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features); -- cgit v1.2.3 From 3e8589963773a5c23e2f1fe4bcad0e9a90b7f471 Mon Sep 17 00:00:00 2001 From: Jiri Slaby Date: Fri, 31 May 2019 22:30:26 -0700 Subject: memcg: make it work on sparse non-0-node systems We have a single node system with node 0 disabled: Scanning NUMA topology in Northbridge 24 Number of physical nodes 2 Skipping disabled node 0 Node 1 MemBase 0000000000000000 Limit 00000000fbff0000 NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff] This causes crashes in memcg when system boots: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 #PF error: [normal kernel read fault] ... RIP: 0010:list_lru_add+0x94/0x170 ... Call Trace: d_lru_add+0x44/0x50 dput.part.34+0xfc/0x110 __fput+0x108/0x230 task_work_run+0x9f/0xc0 exit_to_usermode_loop+0xf5/0x100 It is reproducible as far as 4.12. 
I did not try older kernels. You have to have a new enough systemd, e.g. 241 (the reason is unknown -- was not investigated). Cannot be reproduced with systemd 234. The system crashes because the size of lru array is never updated in memcg_update_all_list_lrus and the reads are past the zero-sized array, causing dereferences of random memory. The root cause are list_lru_memcg_aware checks in the list_lru code. The test in list_lru_memcg_aware is broken: it assumes node 0 is always present, but it is not true on some systems as can be seen above. So fix this by avoiding checks on node 0. Remember the memcg-awareness by a bool flag in struct list_lru. Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists") Signed-off-by: Jiri Slaby Acked-by: Michal Hocko Suggested-by: Vladimir Davydov Acked-by: Vladimir Davydov Reviewed-by: Shakeel Butt Cc: Johannes Weiner Cc: Raghavendra K T Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/list_lru.h | 1 + mm/list_lru.c | 8 +++----- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index aa5efd9351eb..d5ceb2839a2d 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -54,6 +54,7 @@ struct list_lru { #ifdef CONFIG_MEMCG_KMEM struct list_head list; int shrinker_id; + bool memcg_aware; #endif }; diff --git a/mm/list_lru.c b/mm/list_lru.c index 0bdf3152735e..e4709fdaa8e6 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -38,11 +38,7 @@ static int lru_shrinker_id(struct list_lru *lru) static inline bool list_lru_memcg_aware(struct list_lru *lru) { - /* - * This needs node 0 to be always present, even - * in the systems supporting sparse numa ids. - */ - return !!lru->node[0].memcg_lrus; + return lru->memcg_aware; } static inline struct list_lru_one * @@ -452,6 +448,8 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; + lru->memcg_aware = memcg_aware; + if (!memcg_aware) return 0; -- cgit v1.2.3 From b9fba67b3806e21b98bd5a98dc3921a8e9b42d61 Mon Sep 17 00:00:00 2001 From: "Tobin C. Harding" Date: Fri, 31 May 2019 22:30:29 -0700 Subject: ocfs2: fix error path kobject memory leak If a call to kobject_init_and_add() fails we should call kobject_put() otherwise we leak memory. Add call to kobject_put() in the error path of call to kobject_init_and_add(). Please note, this has the side effect that the release method is called if kobject_init_and_add() fails. Link: http://lkml.kernel.org/r/20190513033458.2824-1-tobin@kernel.org Signed-off-by: Tobin C. 
Harding Reviewed-by: Greg Kroah-Hartman Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Gang He Cc: Jun Piao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ocfs2/filecheck.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/ocfs2/filecheck.c b/fs/ocfs2/filecheck.c index f65f2b2f594d..1906cc962c4d 100644 --- a/fs/ocfs2/filecheck.c +++ b/fs/ocfs2/filecheck.c @@ -193,6 +193,7 @@ int ocfs2_filecheck_create_sysfs(struct ocfs2_super *osb) ret = kobject_init_and_add(&entry->fs_kobj, &ocfs2_ktype_filecheck, NULL, "filecheck"); if (ret) { + kobject_put(&entry->fs_kobj); kfree(fcheck); return ret; } -- cgit v1.2.3 From df17277b2a85c00f5710e33ce238ba4114687a28 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Fri, 31 May 2019 22:30:33 -0700 Subject: mm/gup: continue VM_FAULT_RETRY processing even for pre-faults When get_user_pages*() is called with pages = NULL, the processing of VM_FAULT_RETRY terminates early without actually retrying to fault-in all the pages. If the pages in the requested range belong to a VMA that has userfaultfd registered, handle_userfault() returns VM_FAULT_RETRY *after* user space has populated the page, but for the gup pre-fault case there's no actual retry and the caller will get no pages although they are present. This issue was uncovered when running post-copy memory restore in CRIU after d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails"). After this change, the copying of FPU state to the sigframe switched from copy_to_user() variants which caused a real page fault to get_user_pages() with pages parameter set to NULL. In post-copy mode of CRIU, the destination memory is managed with userfaultfd and lack of the retry for pre-fault case in get_user_pages() causes a crash of the restored process. Making the pre-fault behavior of get_user_pages() the same as the "normal" one fixes the issue. Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails") Signed-off-by: Mike Rapoport Tested-by: Andrei Vagin [https://travis-ci.org/avagin/linux/builds/533184940] Tested-by: Hugh Dickins Cc: Andrea Arcangeli Cc: Sebastian Andrzej Siewior Cc: Borislav Petkov Cc: Pavel Machek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/gup.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index f173fcbaf1b2..ddde097cf9e4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1042,10 +1042,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, BUG_ON(ret >= nr_pages); } - if (!pages) - /* If it's a prefault don't insist harder */ - return ret; - if (ret > 0) { nr_pages -= ret; pages_done += ret; @@ -1061,8 +1057,12 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, pages_done = ret; break; } - /* VM_FAULT_RETRY triggered, so seek to the faulting offset */ - pages += ret; + /* + * VM_FAULT_RETRY triggered, so seek to the faulting offset. + * For the prefault case (!pages) we only update counts. 
+ */ + if (likely(pages)) + pages += ret; start += ret << PAGE_SHIFT; /* @@ -1085,7 +1085,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, pages_done++; if (!nr_pages) break; - pages++; + if (likely(pages)) + pages++; start += PAGE_SIZE; } if (lock_dropped && *locked) { -- cgit v1.2.3 From ef7a77c6de2f98c25ca97541f111f14bb74fc13d Mon Sep 17 00:00:00 2001 From: Fabiano Rosas Date: Fri, 31 May 2019 22:30:36 -0700 Subject: scripts/gdb: fix invocation when CONFIG_COMMON_CLK is not set CLK_GET_RATE_NOCACHE depends on CONFIG_COMMON_CLK. Importing constants.py when CONFIG_COMMON_CLK is not defined causes: (gdb) lx-symbols (...) File "scripts/gdb/linux/proc.py", line 15, in from linux import constants File "scripts/gdb/linux/constants.py", line 2, in LX_CLK_GET_RATE_NOCACHE = gdb.parse_and_eval("CLK_GET_RATE_NOCACHE") gdb.error: No symbol "CLK_GET_RATE_NOCACHE" in current context. Link: http://lkml.kernel.org/r/20190523195313.24701-1-farosas@linux.ibm.com Fixes: e7e6f462c1be ("scripts/gdb: print cached rate in lx-clk-summary") Signed-off-by: Fabiano Rosas Reviewed-by: Stephen Boyd Cc: Jan Kiszka Cc: Kieran Bingham Cc: Leonard Crestez Cc: Jackie Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- scripts/gdb/linux/constants.py.in | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/scripts/gdb/linux/constants.py.in b/scripts/gdb/linux/constants.py.in index 1d73083da6cb..2efbec6b6b8d 100644 --- a/scripts/gdb/linux/constants.py.in +++ b/scripts/gdb/linux/constants.py.in @@ -40,7 +40,8 @@ import gdb /* linux/clk-provider.h */ -LX_GDBPARSED(CLK_GET_RATE_NOCACHE) +if IS_BUILTIN(CONFIG_COMMON_CLK): + LX_GDBPARSED(CLK_GET_RATE_NOCACHE) /* linux/fs.h */ LX_VALUE(SB_RDONLY) -- cgit v1.2.3 From bb9f6f63f32da40ca34a921e377ad3181a4f9023 Mon Sep 17 00:00:00 2001 From: Vitaly Wool Date: Fri, 31 May 2019 22:30:39 -0700 Subject: z3fold: fix sheduling while atomic kmem_cache_alloc() may be called from z3fold_alloc() in atomic context, so we need to pass correct gfp flags to avoid "scheduling while atomic" bug. 
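As an illustration of the constraint (a simplified sketch, not the z3fold code itself; alloc_metadata() is a made-up helper name): a GFP_KERNEL allocation may sleep, so a helper that can be reached with a spinlock held must honour the gfp mask its caller passed in rather than hard-coding GFP_KERNEL:

	/* sketch: metadata allocation that respects the caller's context */
	static void *alloc_metadata(struct kmem_cache *cache, gfp_t gfp)
	{
		return kmem_cache_alloc(cache, gfp);	/* gfp may be GFP_ATOMIC */
	}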
Link: http://lkml.kernel.org/r/20190523153245.119dfeed55927e8755250ddd@gmail.com Fixes: 7c2b8baa61fe5 ("mm/z3fold.c: add structure for buddy handles") Signed-off-by: Vitaly Wool Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/z3fold.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/mm/z3fold.c b/mm/z3fold.c index 99be52c5ca45..985732c8b025 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -190,10 +190,11 @@ static int size_to_chunks(size_t size) static void compact_page_work(struct work_struct *w); -static inline struct z3fold_buddy_slots *alloc_slots(struct z3fold_pool *pool) +static inline struct z3fold_buddy_slots *alloc_slots(struct z3fold_pool *pool, + gfp_t gfp) { struct z3fold_buddy_slots *slots = kmem_cache_alloc(pool->c_handle, - GFP_KERNEL); + gfp); if (slots) { memset(slots->slot, 0, sizeof(slots->slot)); @@ -295,10 +296,10 @@ static void z3fold_unregister_migration(struct z3fold_pool *pool) /* Initializes the z3fold header of a newly allocated z3fold page */ static struct z3fold_header *init_z3fold_page(struct page *page, - struct z3fold_pool *pool) + struct z3fold_pool *pool, gfp_t gfp) { struct z3fold_header *zhdr = page_address(page); - struct z3fold_buddy_slots *slots = alloc_slots(pool); + struct z3fold_buddy_slots *slots = alloc_slots(pool, gfp); if (!slots) return NULL; @@ -912,7 +913,7 @@ retry: if (!page) return -ENOMEM; - zhdr = init_z3fold_page(page, pool); + zhdr = init_z3fold_page(page, pool, gfp); if (!zhdr) { __free_page(page); return -ENOMEM; -- cgit v1.2.3 From 0600597c854e53d2f9b7a6a718c1da2b8b4cb4db Mon Sep 17 00:00:00 2001 From: Nathan Chancellor Date: Fri, 31 May 2019 22:30:42 -0700 Subject: kasan: initialize tag to 0xff in __kasan_kmalloc When building with -Wuninitialized and CONFIG_KASAN_SW_TAGS unset, Clang warns: mm/kasan/common.c:484:40: warning: variable 'tag' is uninitialized when used here [-Wuninitialized] kasan_unpoison_shadow(set_tag(object, tag), size); ^~~ set_tag ignores tag in this configuration but clang doesn't realize it at this point in its pipeline, as it points to arch_kasan_set_tag as being the point where it is used, which will later be expanded to (void *)(object) without a use of tag. Initialize tag to 0xff, as it removes this warning and doesn't change the meaning of the code. 
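The pattern clang is flagging, as a paraphrased sketch (simplified from the macro expansion described above, not verbatim kernel source): with CONFIG_KASAN_SW_TAGS=n the set_tag() macro discards its tag argument, but the compiler reports the use before that is visible to it, and 0xff is the native (untagged) kernel pointer tag, so the initializer is harmless in either configuration:

	#ifdef CONFIG_KASAN_SW_TAGS
	#define set_tag(addr, tag)	((void *)arch_kasan_set_tag((addr), (tag)))
	#else
	#define set_tag(addr, tag)	((void *)(addr))	/* tag is never read */
	#endif

		u8 tag = 0xff;	/* 0xff == untagged/native pointer tag */
		kasan_unpoison_shadow(set_tag(object, tag), size);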
Link: https://github.com/ClangBuiltLinux/linux/issues/465 Link: http://lkml.kernel.org/r/20190502163057.6603-1-natechancellor@gmail.com Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode") Signed-off-by: Nathan Chancellor Reviewed-by: Andrey Konovalov Reviewed-by: Andrey Ryabinin Cc: Alexander Potapenko Cc: Dmitry Vyukov Cc: Nick Desaulniers Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kasan/common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/kasan/common.c b/mm/kasan/common.c index 36afcf64e016..242fdc01aaa9 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -464,7 +464,7 @@ static void *__kasan_kmalloc(struct kmem_cache *cache, const void *object, { unsigned long redzone_start; unsigned long redzone_end; - u8 tag; + u8 tag = 0xff; if (gfpflags_allow_blocking(flags)) quarantine_reduce(); -- cgit v1.2.3 From 8d7a7abfc6b42621adc070c7c29b013d7727ed6f Mon Sep 17 00:00:00 2001 From: Vincenzo Frascino Date: Fri, 31 May 2019 22:30:45 -0700 Subject: spdxcheck.py: fix directory structures The LICENSE directory has recently changed structure and this makes spdxcheck fails as per below: FAIL: "Blob or Tree named 'other' not found" Traceback (most recent call last): File "scripts/spdxcheck.py", line 240, in spdx = read_spdxdata(repo) File "scripts/spdxcheck.py", line 41, in read_spdxdata for el in lictree[d].traverse(): [...] KeyError: "Blob or Tree named 'other' not found" Fix the script to restore the correctness on checkpatch License checking. References: 62be257e986d ("LICENSES: Rename other to deprecated") References: 8ea8814fcdcb ("LICENSES: Clearly mark dual license only licenses") Link: http://lkml.kernel.org/r/20190523084755.56739-1-vincenzo.frascino@arm.com Signed-off-by: Vincenzo Frascino Cc: Joe Perches Cc: Christoph Hellwig Cc: Thomas Gleixner Cc: Jeremy Cline Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- scripts/spdxcheck.py | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/scripts/spdxcheck.py b/scripts/spdxcheck.py index 33df646618e2..6374e078a5f2 100755 --- a/scripts/spdxcheck.py +++ b/scripts/spdxcheck.py @@ -32,7 +32,8 @@ class SPDXdata(object): def read_spdxdata(repo): # The subdirectories of LICENSES in the kernel source - license_dirs = [ "preferred", "deprecated", "exceptions", "dual" ] + # Note: exceptions needs to be parsed as last directory. 
+ license_dirs = [ "preferred", "dual", "deprecated", "exceptions" ] lictree = repo.head.commit.tree['LICENSES'] spdx = SPDXdata() @@ -58,13 +59,13 @@ def read_spdxdata(repo): elif l.startswith('SPDX-Licenses:'): for lic in l.split(':')[1].upper().strip().replace(' ', '').replace('\t', '').split(','): if not lic in spdx.licenses: - raise SPDXException(None, 'Exception %s missing license %s' %(ex, lic)) + raise SPDXException(None, 'Exception %s missing license %s' %(exception, lic)) spdx.exceptions[exception].append(lic) elif l.startswith("License-Text:"): if exception: if not len(spdx.exceptions[exception]): - raise SPDXException(el, 'Exception %s is missing SPDX-Licenses' %excid) + raise SPDXException(el, 'Exception %s is missing SPDX-Licenses' %exception) spdx.exception_files += 1 else: spdx.license_files += 1 -- cgit v1.2.3 From d3ed71e5cc50e0df8362af295cbc906acef75558 Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Fri, 31 May 2019 22:30:49 -0700 Subject: drivers/iommu/intel-iommu.c: fix variable 'iommu' set but not used Commit cf04eee8bf0e ("iommu/vt-d: Include ACPI devices in iommu=pt") added for_each_active_iommu() in iommu_prepare_static_identity_mapping() but never used the each element, i.e, "drhd->iommu". drivers/iommu/intel-iommu.c: In function 'iommu_prepare_static_identity_mapping': drivers/iommu/intel-iommu.c:3037:22: warning: variable 'iommu' set but not used [-Wunused-but-set-variable] struct intel_iommu *iommu; Fixed the warning by appending a compiler attribute __maybe_unused for it. Link: http://lkml.kernel.org/r/20190523013314.2732-1-cai@lca.pw Signed-off-by: Qian Cai Suggested-by: Andrew Morton Cc: Joerg Roedel Cc: David Woodhouse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/iommu/intel-iommu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index a209199f3af6..09b8ff0d856a 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -3034,7 +3034,8 @@ static int __init iommu_prepare_static_identity_mapping(int hw) { struct pci_dev *pdev = NULL; struct dmar_drhd_unit *drhd; - struct intel_iommu *iommu; + /* To avoid a -Wunused-but-set-variable warning. */ + struct intel_iommu *iommu __maybe_unused; struct device *dev; int i; int ret = 0; -- cgit v1.2.3 From 98af37d624ed8c83f1953b1b6b2f6866011fc064 Mon Sep 17 00:00:00 2001 From: Zhenliang Wei Date: Fri, 31 May 2019 22:30:52 -0700 Subject: kernel/signal.c: trace_signal_deliver when signal_group_exit In the fixes commit, removing SIGKILL from each thread signal mask and executing "goto fatal" directly will skip the call to "trace_signal_deliver". At this point, the delivery tracking of the SIGKILL signal will be inaccurate. Therefore, we need to add trace_signal_deliver before "goto fatal" after executing sigdelset. Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info. Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT") Signed-off-by: Zhenliang Wei Reviewed-by: Christian Brauner Reviewed-by: Oleg Nesterov Cc: Eric W. 
Biederman Cc: Ivan Delalande Cc: Arnd Bergmann Cc: Thomas Gleixner Cc: Deepa Dinamani Cc: Greg Kroah-Hartman Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/signal.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/signal.c b/kernel/signal.c index d7b9d14ac80d..328a01e1a2f0 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2485,6 +2485,8 @@ relock: if (signal_group_exit(signal)) { ksig->info.si_signo = signr = SIGKILL; sigdelset(¤t->pending.signal, SIGKILL); + trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO, + &sighand->action[SIGKILL - 1]); recalc_sigpending(); goto fatal; } -- cgit v1.2.3 From 590ba22ba0aa0680a41fb7e51ec5395a4e2c4a85 Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Fri, 31 May 2019 22:30:55 -0700 Subject: include/linux/generic-radix-tree.h: fix kerneldoc comment The DOC comment block section in include/linux/generic-radix-tree.h contained a spurious colon, causing this warning in the documentation build: include/linux/generic-radix-tree.h:1: warning: no structured comments found Remove the colon and make the docs build happy. Link: http://lkml.kernel.org/r/20190524141933.74ae9050@lwn.net Signed-off-by: Jonathan Corbet Reviewed-by: Andrew Morton Cc: Kent Overstreet Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/generic-radix-tree.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h index 3a91130a4fbd..02393c0c98f9 100644 --- a/include/linux/generic-radix-tree.h +++ b/include/linux/generic-radix-tree.h @@ -2,7 +2,7 @@ #define _LINUX_GENERIC_RADIX_TREE_H /** - * DOC: Generic radix trees/sparse arrays: + * DOC: Generic radix trees/sparse arrays * * Very simple and minimalistic, supporting arbitrary size entries up to * PAGE_SIZE. -- cgit v1.2.3 From e577c8b64d58fe307ea4d5149d31615df2d90861 Mon Sep 17 00:00:00 2001 From: Suzuki K Poulose Date: Fri, 31 May 2019 22:30:59 -0700 Subject: mm, compaction: make sure we isolate a valid PFN When we have holes in a normal memory zone, we could endup having cached_migrate_pfns which may not necessarily be valid, under heavy memory pressure with swapping enabled ( via __reset_isolation_suitable(), triggered by kswapd). Later if we fail to find a page via fast_isolate_freepages(), we may end up using the migrate_pfn we started the search with, as valid page. This could lead to accessing NULL pointer derefernces like below, due to an invalid mem_section pointer. Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [47/1825] Mem abort info: ESR = 0x96000004 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9 [0000000000000008] pgd=0000000000000000 Internal error: Oops: 96000004 [#1] SMP ... CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6 Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018 pstate: 60000005 (nZCv daif -PAN -UAO) pc : set_pfnblock_flags_mask+0x58/0xe8 lr : compaction_alloc+0x300/0x950 [...] 
Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5) Call trace: set_pfnblock_flags_mask+0x58/0xe8 compaction_alloc+0x300/0x950 migrate_pages+0x1a4/0xbb0 compact_zone+0x750/0xde8 compact_zone_order+0xd8/0x118 try_to_compact_pages+0xb4/0x290 __alloc_pages_direct_compact+0x84/0x1e0 __alloc_pages_nodemask+0x5e0/0xe18 alloc_pages_vma+0x1cc/0x210 do_huge_pmd_anonymous_page+0x108/0x7c8 __handle_mm_fault+0xdd4/0x1190 handle_mm_fault+0x114/0x1c0 __get_user_pages+0x198/0x3c0 get_user_pages_unlocked+0xb4/0x1d8 __gfn_to_pfn_memslot+0x12c/0x3b8 gfn_to_pfn_prot+0x4c/0x60 kvm_handle_guest_abort+0x4b0/0xcd8 handle_exit+0x140/0x1b8 kvm_arch_vcpu_ioctl_run+0x260/0x768 kvm_vcpu_ioctl+0x490/0x898 do_vfs_ioctl+0xc4/0x898 ksys_ioctl+0x8c/0xa0 __arm64_sys_ioctl+0x28/0x38 el0_svc_common+0x74/0x118 el0_svc_handler+0x38/0x78 el0_svc+0x8/0xc Code: f8607840 f100001f 8b011401 9a801020 (f9400400) ---[ end trace af6a35219325a9b6 ]--- The issue was reported on an arm64 server with 128GB with holes in the zone (e.g, [32GB@4GB, 96GB@544GB]), with a swap device enabled, while running 100 KVM guest instances. This patch fixes the issue by ensuring that the page belongs to a valid PFN when we fallback to using the lower limit of the scan range upon failure in fast_isolate_freepages(). Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target") Signed-off-by: Suzuki K Poulose Reported-by: Marc Zyngier Reviewed-by: Mel Gorman Reviewed-by: Anshuman Khandual Cc: Michal Hocko Cc: Qian Cai Cc: Marc Zyngier Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/compaction.c b/mm/compaction.c index 9febc8cc84e7..9e1b9acb116b 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1399,7 +1399,7 @@ fast_isolate_freepages(struct compact_control *cc) page = pfn_to_page(highest); cc->free_pfn = highest; } else { - if (cc->direct_compaction) { + if (cc->direct_compaction && pfn_valid(min_pfn)) { page = pfn_to_page(min_pfn); cc->free_pfn = min_pfn; } -- cgit v1.2.3
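The general rule the fix restores, as an illustrative sketch (not the compaction code itself): on sparsemem systems a zone may contain holes, and a pfn derived from cached scan limits can land in a hole where no mem_section (and hence no struct page) exists, so it must be validated before use:

	/* sketch: never trust a derived pfn without pfn_valid() */
	if (pfn_valid(pfn))
		page = pfn_to_page(pfn);
	else
		page = NULL;	/* skip / fall back instead of dereferencing */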