From a8311f647e419675f6ecba9f4284080fd38a0a37 Mon Sep 17 00:00:00 2001
From: Jarrett Farnitano
Date: Thu, 14 Jun 2018 15:26:31 -0700
Subject: kexec: yield to scheduler when loading kimage segments

Without yielding while loading kimage segments, a large initrd will
block all other work on the CPU performing the load until it is
completed.  For example, loading an initrd of 200MB on a low-power
single-core system will lock up the system for a few seconds.

To increase system responsiveness to other tasks at that time, call
cond_resched() in both the crash kernel and normal kernel segment
loading loops.

I did run into a practical problem.  Hardware watchdogs on embedded
systems can have short timers on the order of seconds.  If the system
is locked up for a few seconds with only a single core available, the
watchdog may not be petted in a timely fashion.  If this happens, the
hardware watchdog will fire and reset the system.

This really only becomes a problem when you are working with a single
core, a decently sized initrd, and a constrained hardware watchdog.

Link: http://lkml.kernel.org/r/1528738546-3328-1-git-send-email-jmf@amazon.com
Signed-off-by: Jarrett Farnitano
Reviewed-by: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/kexec_core.c | 4 ++++
 1 file changed, 4 insertions(+)

(limited to 'kernel')

diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 20fef1a38602..23a83a4da38a 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -829,6 +829,8 @@ static int kimage_load_normal_segment(struct kimage *image,
 		else
 			buf += mchunk;
 		mbytes -= mchunk;
+
+		cond_resched();
 	}
 out:
 	return result;
@@ -893,6 +895,8 @@ static int kimage_load_crash_segment(struct kimage *image,
 		else
 			buf += mchunk;
 		mbytes -= mchunk;
+
+		cond_resched();
 	}
 out:
 	return result;
-- cgit v1.2.3
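
The pattern generalizes to any long-running, chunk-at-a-time loop in
process context.  A minimal sketch outside of kexec (copy_chunks() and
its parameters are hypothetical, for illustration only):

    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <linux/string.h>

    /* Copy a large buffer chunk by chunk, yielding between chunks. */
    static void copy_chunks(void *dst, const void *src, size_t total,
                            size_t chunk)
    {
            size_t done = 0;

            while (done < total) {
                    size_t n = min(chunk, total - done);

                    memcpy(dst + done, src + done, n);
                    done += n;

                    /*
                     * Without CONFIG_PREEMPT this is the only point in
                     * the loop where another task (such as a watchdog
                     * daemon) can be scheduled on this CPU.
                     */
                    cond_resched();
            }
    }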
From 655c79bb40a0870adcd0871057d01de11625882b Mon Sep 17 00:00:00 2001
From: Tetsuo Handa
Date: Thu, 14 Jun 2018 15:26:34 -0700
Subject: mm: check for SIGKILL inside dup_mmap() loop

As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas
can loop while potentially allocating memory, with mm->mmap_sem held
for write by the current thread.  This is bad if the current thread
was selected as an OOM victim, for the current thread will continue
allocations using memory reserves while the OOM reaper is unable to
reclaim memory.

As an actually observable problem, it is not difficult to make the OOM
reaper unable to reclaim memory if the OOM victim is blocked at
i_mmap_lock_write() in this loop.  Unfortunately, since nobody can
explain whether it is safe to use a killable wait there, let's check
for SIGKILL before trying to allocate memory.  Even without an OOM
event, there is no point in continuing the loop from the beginning if
the current thread is killed.

I tested with a debug printk().  This patch should be safe because we
already fail if security_vm_enough_memory_mm() or
kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it.

  ***** Aborting dup_mmap() due to SIGKILL *****
  ***** Aborting dup_mmap() due to SIGKILL *****
  ***** Aborting dup_mmap() due to SIGKILL *****
  ***** Aborting dup_mmap() due to SIGKILL *****
  ***** Aborting exit_mmap() due to NULL mmap *****

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa
Cc: Alexander Viro
Cc: Rik van Riel
Cc: Michal Hocko
Cc: Kirill A. Shutemov
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/fork.c | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'kernel')

diff --git a/kernel/fork.c b/kernel/fork.c
index 92870be50bba..9440d61b925c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -440,6 +440,14 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			continue;
 		}
 		charge = 0;
+		/*
+		 * Don't duplicate many vmas if we've been oom-killed (for
+		 * example)
+		 */
+		if (fatal_signal_pending(current)) {
+			retval = -EINTR;
+			goto out;
+		}
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
-- cgit v1.2.3

From 3fb3894b84c2e0f83cb1e4f4e960243742e6b3a6 Mon Sep 17 00:00:00 2001
From: Souptick Joarder
Date: Thu, 14 Jun 2018 15:27:31 -0700
Subject: kernel/relay.c: change return type to vm_fault_t

Use the new return type vm_fault_t for the fault handler.  For now,
this is just documenting that the function returns a VM_FAULT value
rather than an errno.  Once all instances are converted, vm_fault_t
will become a distinct type.

See commit 1c8f422059ae ("mm: change return type to vm_fault_t").

Link: http://lkml.kernel.org/r/20180510140335.GA25363@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder
Reviewed-by: Matthew Wilcox
Reviewed-by: Andrew Morton
Cc: Eric Biggers
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/relay.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/relay.c b/kernel/relay.c
index 9f5326e8a036..04f248644e06 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -39,7 +39,7 @@ static void relay_file_mmap_close(struct vm_area_struct *vma)
 /*
  * fault() vm_op implementation for relay file mapping.
  */
-static int relay_buf_fault(struct vm_fault *vmf)
+static vm_fault_t relay_buf_fault(struct vm_fault *vmf)
 {
 	struct page *page;
 	struct rchan_buf *buf = vmf->vma->vm_private_data;
-- cgit v1.2.3
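
The same conversion applies mechanically to any fault() handler.  A
sketch of the resulting shape for a hypothetical driver (example_fault()
and lookup_backing_page() are illustrative stand-ins, not relay code):

    #include <linux/mm.h>

    /* Hypothetical: resolve a file offset to the driver's backing page. */
    struct page *lookup_backing_page(void *priv, pgoff_t pgoff);

    static vm_fault_t example_fault(struct vm_fault *vmf)
    {
            struct page *page;

            page = lookup_backing_page(vmf->vma->vm_private_data, vmf->pgoff);
            if (!page)
                    return VM_FAULT_SIGBUS; /* a VM_FAULT_* code, not -EFAULT */

            get_page(page);
            vmf->page = page;
            return 0;       /* success; the fault core maps vmf->page */
    }

Once every handler in the tree returns vm_fault_t, the typedef can be
made a distinct type, so mixing up VM_FAULT codes and errno returns
becomes a compile-time error.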
From c9484b986ef03492357fddd50afbdd02929cfa72 Mon Sep 17 00:00:00 2001
From: Mark Rutland
Date: Thu, 14 Jun 2018 15:27:34 -0700
Subject: kcov: ensure irq code sees a valid area

Patch series "kcov: fix unexpected faults".

These patches fix a few issues where KCOV code could trigger recursive
faults, discovered while debugging a patch enabling KCOV for arch/arm:

* On CONFIG_PREEMPT kernels, there's a small race window where
  __sanitizer_cov_trace_pc() can see a bogus kcov_area.

* Lazy faulting of the vmalloc area can cause mutual recursion between
  fault handling code and __sanitizer_cov_trace_pc().

* During the context switch, switching the mm can cause the kcov_area
  to be transiently unmapped.

These are prerequisites for enabling KCOV on arm, but the issues
themselves are generic -- we just happen to avoid them by chance
rather than by design on x86-64 and arm64.

This patch (of 3):

For kernels built with CONFIG_PREEMPT, some C code may execute before
or after the interrupt handler, while the hardirq count is zero.  In
these cases, in_task() can return true.

A task can be interrupted in the middle of a KCOV_DISABLE ioctl while
it resets the task's kcov data via kcov_task_init().  Instrumented
code executed during this period will call __sanitizer_cov_trace_pc(),
and as in_task() returns true, will inspect t->kcov_mode before trying
to write to t->kcov_area.

In kcov_task_init() we update t->kcov_{mode,area,size} with plain
stores, which may be re-ordered, torn, etc.  Thus
__sanitizer_cov_trace_pc() may see bogus values for any of these
fields, and may attempt to write to memory which is not mapped.

Let's avoid this by using WRITE_ONCE() to set t->kcov_mode, with a
barrier() to ensure this is ordered before we clear
t->kcov_{area,size}.  This ensures that any code executed while
kcov_task_init() is preempted will either see valid values for
t->kcov_{area,size}, or will see that t->kcov_mode is
KCOV_MODE_DISABLED, and bail out without touching t->kcov_area.

Link: http://lkml.kernel.org/r/20180504135535.53744-2-mark.rutland@arm.com
Signed-off-by: Mark Rutland
Acked-by: Andrey Ryabinin
Cc: Dmitry Vyukov
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/kcov.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/kcov.c b/kernel/kcov.c
index 2c16f1ab5e10..5be9a60a959f 100644
--- a/kernel/kcov.c
+++ b/kernel/kcov.c
@@ -241,7 +241,8 @@ static void kcov_put(struct kcov *kcov)
 
 void kcov_task_init(struct task_struct *t)
 {
-	t->kcov_mode = KCOV_MODE_DISABLED;
+	WRITE_ONCE(t->kcov_mode, KCOV_MODE_DISABLED);
+	barrier();
 	t->kcov_size = 0;
 	t->kcov_area = NULL;
 	t->kcov = NULL;
-- cgit v1.2.3
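
For the ordering to hold end to end, the reader must load t->kcov_mode
before it loads t->kcov_{area,size}.  A simplified sketch of that check
(modelled on check_kcov_mode() in kernel/kcov.c; the helper name here
is illustrative):

    #include <linux/compiler.h>
    #include <linux/kcov.h>
    #include <linux/preempt.h>
    #include <linux/sched.h>

    static bool kcov_area_usable(struct task_struct *t)
    {
            unsigned int mode;

            if (!in_task())
                    return false;   /* coverage is only traced in task context */

            mode = READ_ONCE(t->kcov_mode);
            /*
             * Pairs with the WRITE_ONCE()/barrier() in kcov_task_init():
             * order the mode load before any t->kcov_{area,size} loads.
             * A compiler barrier suffices, as the race is with preemption
             * on the same CPU.
             */
            barrier();

            return mode == KCOV_MODE_TRACE_PC;
    }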
From dc55daff9040a90adce97208e776ee0bf515ab12 Mon Sep 17 00:00:00 2001
From: Mark Rutland
Date: Thu, 14 Jun 2018 15:27:37 -0700
Subject: kcov: prefault the kcov_area

On many architectures the vmalloc area is lazily faulted in upon first
access.  This is problematic for KCOV, as __sanitizer_cov_trace_pc()
accesses the (vmalloc'd) kcov_area, and fault handling code may be
instrumented.  If an access to the kcov_area faults, this will result
in mutual recursion through the fault handling code and
__sanitizer_cov_trace_pc(), eventually leading to stack corruption
and/or overflow.

We can avoid this by faulting in the kcov_area before
__sanitizer_cov_trace_pc() is permitted to access it.  Once it has
been faulted in, it will remain present in the process page tables,
and will not fault again.

[akpm@linux-foundation.org: code cleanup]
[akpm@linux-foundation.org: add comment explaining kcov_fault_in_area()]
[akpm@linux-foundation.org: fancier code comment from Mark]
Link: http://lkml.kernel.org/r/20180504135535.53744-3-mark.rutland@arm.com
Signed-off-by: Mark Rutland
Acked-by: Andrey Ryabinin
Cc: Dmitry Vyukov
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/kcov.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

(limited to 'kernel')

diff --git a/kernel/kcov.c b/kernel/kcov.c
index 5be9a60a959f..cf250392c55c 100644
--- a/kernel/kcov.c
+++ b/kernel/kcov.c
@@ -324,6 +324,21 @@ static int kcov_close(struct inode *inode, struct file *filep)
 	return 0;
 }
 
+/*
+ * Fault in a lazily-faulted vmalloc area before it can be used by
+ * __sanitizer_cov_trace_pc(), to avoid recursion issues if any code on the
+ * vmalloc fault handling path is instrumented.
+ */
+static void kcov_fault_in_area(struct kcov *kcov)
+{
+	unsigned long stride = PAGE_SIZE / sizeof(unsigned long);
+	unsigned long *area = kcov->area;
+	unsigned long offset;
+
+	for (offset = 0; offset < kcov->size; offset += stride)
+		READ_ONCE(area[offset]);
+}
+
 static int kcov_ioctl_locked(struct kcov *kcov, unsigned int cmd,
 			     unsigned long arg)
 {
@@ -372,6 +387,7 @@ static int kcov_ioctl_locked(struct kcov *kcov, unsigned int cmd,
 #endif
 		else
 			return -EINVAL;
+		kcov_fault_in_area(kcov);
 		/* Cache in task struct for performance. */
 		t->kcov_size = kcov->size;
 		t->kcov_area = kcov->area;
-- cgit v1.2.3

From 0ed557aa813922f6f32adec69e266532091c895b Mon Sep 17 00:00:00 2001
From: Mark Rutland
Date: Thu, 14 Jun 2018 15:27:41 -0700
Subject: sched/core / kcov: avoid kcov_area during task switch

During a context switch, we first switch_mm() to the next task's mm,
then switch_to() that new task.  This means that vmalloc'd regions
which had previously been faulted in can transiently disappear in the
context of the prev task.  Functions instrumented by KCOV may try to
access a vmalloc'd kcov_area during this window, and as the fault
handling code is instrumented, this results in a recursive fault.

We must avoid accessing any kcov_area during this window.  We can do
so with a new flag in kcov_mode, set prior to switching the mm, and
cleared once the new task is live.  Since task_struct::kcov_mode isn't
always a specific enum kcov_mode value, this is made an unsigned int.

The manipulation is hidden behind kcov_{prepare,finish}_switch()
helpers, which are empty for !CONFIG_KCOV kernels.

The code uses macros because we can't use static inline functions
without a circular include dependency between <linux/kcov.h> and
<linux/sched.h>, since the definition of task_struct uses things
defined in <linux/kcov.h>.

Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com
Signed-off-by: Mark Rutland
Acked-by: Andrey Ryabinin
Cc: Dmitry Vyukov
Cc: Ingo Molnar
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/kcov.c       | 2 +-
 kernel/sched/core.c | 4 ++++
 2 files changed, 5 insertions(+), 1 deletion(-)

(limited to 'kernel')

diff --git a/kernel/kcov.c b/kernel/kcov.c
index cf250392c55c..3ebd09efe72a 100644
--- a/kernel/kcov.c
+++ b/kernel/kcov.c
@@ -58,7 +58,7 @@ struct kcov {
 
 static bool check_kcov_mode(enum kcov_mode needed_mode, struct task_struct *t)
 {
-	enum kcov_mode mode;
+	unsigned int mode;
 
 	/*
 	 * We are interested in code coverage as a function of a syscall inputs,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a98d54cd5535..78d8facba456 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10,6 +10,8 @@
 #include <linux/kthread.h>
 #include <linux/nospec.h>
 
+#include <linux/kcov.h>
+
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -2633,6 +2635,7 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
+	kcov_prepare_switch(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	rseq_preempt(prev);
@@ -2702,6 +2705,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	finish_task(prev);
 	finish_lock_switch(rq);
 	finish_arch_post_lock_switch();
+	kcov_finish_switch(current);
 
 	fire_sched_in_preempt_notifiers(current);
 	/*
-- cgit v1.2.3
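
The helpers themselves are small.  A sketch of what the CONFIG_KCOV
definitions in <linux/kcov.h> amount to, reconstructed from the
description above (the exact bit value of KCOV_IN_CTXSW is an
assumption here, not quoted from the patch):

    /*
     * KCOV_IN_CTXSW is a flag bit well above the enum kcov_mode values,
     * so while it is set, t->kcov_mode compares unequal to every
     * needed_mode in check_kcov_mode() and instrumentation bails out
     * before touching t->kcov_area.
     */
    #define KCOV_IN_CTXSW	(1 << 30)

    #define kcov_prepare_switch(t)			\
    do {						\
            (t)->kcov_mode |= KCOV_IN_CTXSW;	\
    } while (0)

    #define kcov_finish_switch(t)			\
    do {						\
            (t)->kcov_mode &= ~KCOV_IN_CTXSW;	\
    } while (0)

For !CONFIG_KCOV kernels both helpers are empty, so the scheduler's
context-switch path pays nothing when KCOV is disabled.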