From 94b1b03b519b81c494900cb112aa00ed205cc2d9 Mon Sep 17 00:00:00 2001 From: Andy Lutomirski Date: Thu, 29 Jun 2017 08:53:17 -0700 Subject: x86/mm: Rework lazy TLB mode and TLB freshness tracking MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit x86's lazy TLB mode used to be fairly weak -- it would switch to init_mm the first time it tried to flush a lazy TLB. This meant an unnecessary CR3 write and, if the flush was remote, an unnecessary IPI. Rewrite it entirely. When we enter lazy mode, we simply remove the CPU from mm_cpumask. This means that we need a way to figure out whether we've missed a flush when we switch back out of lazy mode. I use the tlb_gen machinery to track whether a context is up to date. Note to reviewers: this patch, my itself, looks a bit odd. I'm using an array of length 1 containing (ctx_id, tlb_gen) rather than just storing tlb_gen, and making it at array isn't necessary yet. I'm doing this because the next few patches add PCID support, and, with PCID, we need ctx_id, and the array will end up with a length greater than 1. Making it an array now means that there will be less churn and therefore less stress on your eyeballs. NB: This is dubious but, AFAICT, still correct on Xen and UV. xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this patch changes the way that mm_cpumask() works. This should be okay, since Xen *also* iterates all online CPUs to find all the CPUs it needs to twiddle. The UV tlbflush code is rather dated and should be changed. Here are some benchmark results, done on a Skylake laptop at 2.3 GHz (turbo off, intel_pstate requesting max performance) under KVM with the guest using idle=poll (to avoid artifacts when bouncing between CPUs). I haven't done any real statistics here -- I just ran them in a loop and picked the fastest results that didn't look like outliers. Unpatched means commit a4eb8b993554, so all the bookkeeping overhead is gone. MADV_DONTNEED; touch the page; switch CPUs using sched_setaffinity. In an unpatched kernel, MADV_DONTNEED will send an IPI to the previous CPU. This is intended to be a nearly worst-case test. patched: 13.4µs unpatched: 21.6µs Vitaly's pthread_mmap microbenchmark with 8 threads (on four cores), nrounds = 100, 256M data patched: 1.1 seconds or so unpatched: 1.9 seconds or so The sleepup on Vitaly's test appearss to be because it spends a lot of time blocked on mmap_sem, and this patch avoids sending IPIs to blocked CPUs. Signed-off-by: Andy Lutomirski Reviewed-by: Nadav Amit Reviewed-by: Thomas Gleixner Cc: Andrew Banman Cc: Andrew Morton Cc: Arjan van de Ven Cc: Boris Ostrovsky Cc: Borislav Petkov Cc: Dave Hansen Cc: Dimitri Sivanich Cc: Juergen Gross Cc: Linus Torvalds Cc: Mel Gorman Cc: Mike Travis Cc: Peter Zijlstra Cc: Rik van Riel Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/ddf2c92962339f4ba39d8fc41b853936ec0b44f1.1498751203.git.luto@kernel.org Signed-off-by: Ingo Molnar --- arch/x86/mm/init.c | 1 - 1 file changed, 1 deletion(-) (limited to 'arch/x86/mm/init.c') diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 673541eb3b3f..4d353efb2838 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -812,7 +812,6 @@ void __init zone_sizes_init(void) DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { .loaded_mm = &init_mm, - .state = 0, .cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */ }; EXPORT_SYMBOL_GPL(cpu_tlbstate); -- cgit v1.2.3