From 754c4050a00e802e122690112fc2c3a6abafa7e2 Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Wed, 27 Oct 2021 12:37:29 -0700 Subject: ARM: dts: BCM5301X: Fix I2C controller interrupt The I2C interrupt controller line is off by 32 because the datasheet describes interrupt inputs into the GIC which are for Shared Peripheral Interrupts and are starting at offset 32. The ARM GIC binding expects the SPI interrupts to be numbered from 0 relative to the SPI base. Fixes: bb097e3e0045 ("ARM: dts: BCM5301X: Add I2C support to the DT") Tested-by: Christian Lamparter Signed-off-by: Florian Fainelli --- arch/arm/boot/dts/bcm5301x.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/arm/boot/dts/bcm5301x.dtsi b/arch/arm/boot/dts/bcm5301x.dtsi index d4f355015e3c..437a2b0f68de 100644 --- a/arch/arm/boot/dts/bcm5301x.dtsi +++ b/arch/arm/boot/dts/bcm5301x.dtsi @@ -408,7 +408,7 @@ i2c0: i2c@18009000 { compatible = "brcm,iproc-i2c"; reg = <0x18009000 0x50>; - interrupts = ; + interrupts = ; #address-cells = <1>; #size-cells = <0>; clock-frequency = <100000>; -- cgit v1.2.3 From 40f7342f0587639e5ad625adaa15efdd3cffb18f Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Thu, 28 Oct 2021 09:46:53 -0700 Subject: ARM: dts: BCM5301X: Add interrupt properties to GPIO node The GPIO controller is also an interrupt controller provider and is currently missing the appropriate 'interrupt-controller' and '#interrupt-cells' properties to denote that. Fixes: fb026d3de33b ("ARM: BCM5301X: Add Broadcom's bus-axi to the DTS file") Signed-off-by: Florian Fainelli --- arch/arm/boot/dts/bcm5301x.dtsi | 2 ++ 1 file changed, 2 insertions(+) (limited to 'arch') diff --git a/arch/arm/boot/dts/bcm5301x.dtsi b/arch/arm/boot/dts/bcm5301x.dtsi index 437a2b0f68de..f69d2af3c1fa 100644 --- a/arch/arm/boot/dts/bcm5301x.dtsi +++ b/arch/arm/boot/dts/bcm5301x.dtsi @@ -242,6 +242,8 @@ gpio-controller; #gpio-cells = <2>; + interrupt-controller; + #interrupt-cells = <2>; }; pcie0: pcie@12000 { -- cgit v1.2.3 From 98481f3d72fb88cb5b973153434061015f094925 Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Fri, 29 Oct 2021 14:09:26 -0700 Subject: ARM: dts: bcm2711: Fix PCIe interrupts The PCIe host bridge has two interrupt lines, one that goes towards it PCIE_INTR2 second level interrupt controller and one for its MSI second level interrupt controller. The first interrupt line is not currently managed by the driver, which is why it was not a functional problem. The interrupt-map property was also only listing the PCI_INTA interrupts when there are also the INTB, C and D. Reported-by: Jim Quinlan Fixes: d5c8dc0d4c88 ("ARM: dts: bcm2711: Enable PCIe controller") Signed-off-by: Florian Fainelli --- arch/arm/boot/dts/bcm2711.dtsi | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/arm/boot/dts/bcm2711.dtsi b/arch/arm/boot/dts/bcm2711.dtsi index 3b60297af7f6..9e01dbca4a01 100644 --- a/arch/arm/boot/dts/bcm2711.dtsi +++ b/arch/arm/boot/dts/bcm2711.dtsi @@ -506,11 +506,17 @@ #address-cells = <3>; #interrupt-cells = <1>; #size-cells = <2>; - interrupts = , + interrupts = , ; interrupt-names = "pcie", "msi"; interrupt-map-mask = <0x0 0x0 0x0 0x7>; interrupt-map = <0 0 0 1 &gicv2 GIC_SPI 143 + IRQ_TYPE_LEVEL_HIGH>, + <0 0 0 2 &gicv2 GIC_SPI 144 + IRQ_TYPE_LEVEL_HIGH>, + <0 0 0 3 &gicv2 GIC_SPI 145 + IRQ_TYPE_LEVEL_HIGH>, + <0 0 0 4 &gicv2 GIC_SPI 146 IRQ_TYPE_LEVEL_HIGH>; msi-controller; msi-parent = <&pcie0>; -- cgit v1.2.3 From c6d3cd32fd0064af7611d00877a67e6993bf220b Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Fri, 29 Oct 2021 17:22:45 +0100 Subject: arm64: ftrace: use HAVE_FUNCTION_GRAPH_RET_ADDR_PTR When CONFIG_FUNCTION_GRAPH_TRACER is selected and the function graph tracer is in use, unwind_frame() may erroneously associate a traced function with an incorrect return address. This can happen when starting an unwind from a pt_regs, or when unwinding across an exception boundary. This can be seen when recording with perf while the function graph tracer is in use. For example: | # echo function_graph > /sys/kernel/debug/tracing/current_tracer | # perf record -g -e raw_syscalls:sys_enter:k /bin/true | # perf report ... reports the callchain erroneously as: | el0t_64_sync | el0t_64_sync_handler | el0_svc_common.constprop.0 | perf_callchain | get_perf_callchain | syscall_trace_enter | syscall_trace_enter ... whereas when the function graph tracer is not in use, it reports: | el0t_64_sync | el0t_64_sync_handler | el0_svc | do_el0_svc | el0_svc_common.constprop.0 | syscall_trace_enter | syscall_trace_enter The underlying problem is that ftrace_graph_get_ret_stack() takes an index offset from the most recent entry added to the fgraph return stack. We start an unwind at offset 0, and increment the offset each time we encounter a rewritten return address (i.e. when we see `return_to_handler`). This is broken in two cases: 1) Between creating a pt_regs and starting the unwind, function calls may place entries on the stack, leaving an arbitrary offset which we can only determine by performing a full unwind from the caller of the unwind code (and relying on none of the unwind code being instrumented). This can result in erroneous entries being reported in a backtrace recorded by perf or kfence when the function graph tracer is in use. Currently show_regs() is unaffected as dump_backtrace() performs an initial unwind. 2) When unwinding across an exception boundary (whether continuing an unwind or starting a new unwind from regs), we currently always skip the LR of the interrupted context. Where this was live and contained a rewritten address, we won't consume the corresponding fgraph ret stack entry, leaving subsequent entries off-by-one. This can result in erroneous entries being reported in a backtrace performed by any in-kernel unwinder when that backtrace crosses an exception boundary, with entries after the boundary being reported incorrectly. This includes perf, kfence, show_regs(), panic(), etc. To fix this, we need to be able to uniquely identify each rewritten return address such that we can map this back to the original return address. We can use HAVE_FUNCTION_GRAPH_RET_ADDR_PTR to associate each rewritten return address with a unique location on the stack. As the return address is passed in the LR (and so is not guaranteed a unique location in memory), we use the FP upon entry to the function (i.e. the address of the caller's frame record) as the return address pointer. Any nested call will have a different FP value as the caller must create its own frame record and update FP to point to this. Since ftrace_graph_ret_addr() requires the return address with the PAC stripped, the stripping of the PAC is moved before the fixup of the rewritten address. As we would unconditionally strip the PAC, moving this earlier is not harmful, and we can avoid a redundant strip in the return address fixup code. I've tested this with the perf case above, the ftrace selftests, and a number of ad-hoc unwinder tests. The tests all pass, and I have seen no unexpected behaviour as a result of this change. I've tested with pointer authentication under QEMU TCG where magic-sysrq+l correctly recovers the original return addresses. Note that this doesn't fix the issue of skipping a live LR at an exception boundary, which is a more general problem and requires more substantial rework. Were we to consume the LR in all cases this would result in warnings where the interrupted context's LR contains `return_to_handler`, but the FP has been altered, e.g. | func: | <--- ftrace entry ---> // logs FP & LR, rewrites LR | STP FP, LR, [SP, #-16]! | MOV FP, SP | <--- INTERRUPT ---> ... as ftrace_graph_get_ret_stack() fill not find a matching entry, triggering the WARN_ON_ONCE() in unwind_frame(). Link: https://lore.kernel.org/r/20211025164925.GB2001@C02TD0UTHF1T.local Link: https://lore.kernel.org/r/20211027132529.30027-1-mark.rutland@arm.com Signed-off-by: Mark Rutland Cc: Catalin Marinas Cc: Madhavan T. Venkataraman Cc: Mark Brown Cc: Steven Rostedt Cc: Will Deacon Reviewed-by: Mark Brown Link: https://lore.kernel.org/r/20211029162245.39761-1-mark.rutland@arm.com Signed-off-by: Will Deacon --- arch/arm64/include/asm/ftrace.h | 11 +++++++++++ arch/arm64/include/asm/stacktrace.h | 6 ------ arch/arm64/kernel/ftrace.c | 6 +++--- arch/arm64/kernel/stacktrace.c | 18 ++++++++---------- 4 files changed, 22 insertions(+), 19 deletions(-) (limited to 'arch') diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h index 347b0cc68f07..1494cfa8639b 100644 --- a/arch/arm64/include/asm/ftrace.h +++ b/arch/arm64/include/asm/ftrace.h @@ -12,6 +12,17 @@ #define HAVE_FUNCTION_GRAPH_FP_TEST +/* + * HAVE_FUNCTION_GRAPH_RET_ADDR_PTR means that the architecture can provide a + * "return address pointer" which can be used to uniquely identify a return + * address which has been overwritten. + * + * On arm64 we use the address of the caller's frame record, which remains the + * same for the lifetime of the instrumented function, unlike the return + * address in the LR. + */ +#define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR + #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS #define ARCH_SUPPORTS_FTRACE_OPS 1 #else diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h index a4e046ef4568..6564a01cc085 100644 --- a/arch/arm64/include/asm/stacktrace.h +++ b/arch/arm64/include/asm/stacktrace.h @@ -47,9 +47,6 @@ struct stack_info { * @prev_type: The type of stack this frame record was on, or a synthetic * value of STACK_TYPE_UNKNOWN. This is used to detect a * transition from one stack to another. - * - * @graph: When FUNCTION_GRAPH_TRACER is selected, holds the index of a - * replacement lr value in the ftrace graph stack. */ struct stackframe { unsigned long fp; @@ -57,9 +54,6 @@ struct stackframe { DECLARE_BITMAP(stacks_done, __NR_STACK_TYPES); unsigned long prev_fp; enum stack_type prev_type; -#ifdef CONFIG_FUNCTION_GRAPH_TRACER - int graph; -#endif #ifdef CONFIG_KRETPROBES struct llist_node *kr_cur; #endif diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c index fc62dfe73f93..4506c4a90ac1 100644 --- a/arch/arm64/kernel/ftrace.c +++ b/arch/arm64/kernel/ftrace.c @@ -244,8 +244,6 @@ void arch_ftrace_update_code(int command) * on the way back to parent. For this purpose, this function is called * in _mcount() or ftrace_caller() to replace return address (*parent) on * the call stack to return_to_handler. - * - * Note that @frame_pointer is used only for sanity check later. */ void prepare_ftrace_return(unsigned long self_addr, unsigned long *parent, unsigned long frame_pointer) @@ -263,8 +261,10 @@ void prepare_ftrace_return(unsigned long self_addr, unsigned long *parent, */ old = *parent; - if (!function_graph_enter(old, self_addr, frame_pointer, NULL)) + if (!function_graph_enter(old, self_addr, frame_pointer, + (void *)frame_pointer)) { *parent = return_hooker; + } } #ifdef CONFIG_DYNAMIC_FTRACE diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c index c30624fff6ac..94f83cd44e50 100644 --- a/arch/arm64/kernel/stacktrace.c +++ b/arch/arm64/kernel/stacktrace.c @@ -38,9 +38,6 @@ void start_backtrace(struct stackframe *frame, unsigned long fp, { frame->fp = fp; frame->pc = pc; -#ifdef CONFIG_FUNCTION_GRAPH_TRACER - frame->graph = 0; -#endif #ifdef CONFIG_KRETPROBES frame->kr_cur = NULL; #endif @@ -116,20 +113,23 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame) frame->prev_fp = fp; frame->prev_type = info.type; + frame->pc = ptrauth_strip_insn_pac(frame->pc); + #ifdef CONFIG_FUNCTION_GRAPH_TRACER if (tsk->ret_stack && - (ptrauth_strip_insn_pac(frame->pc) == (unsigned long)return_to_handler)) { - struct ftrace_ret_stack *ret_stack; + (frame->pc == (unsigned long)return_to_handler)) { + unsigned long orig_pc; /* * This is a case where function graph tracer has * modified a return address (LR) in a stack frame * to hook a function return. * So replace it to an original value. */ - ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++); - if (WARN_ON_ONCE(!ret_stack)) + orig_pc = ftrace_graph_ret_addr(tsk, NULL, frame->pc, + (void *)frame->fp); + if (WARN_ON_ONCE(frame->pc == orig_pc)) return -EINVAL; - frame->pc = ret_stack->ret; + frame->pc = orig_pc; } #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ #ifdef CONFIG_KRETPROBES @@ -137,8 +137,6 @@ int notrace unwind_frame(struct task_struct *tsk, struct stackframe *frame) frame->pc = kretprobe_find_ret_addr(tsk, (void *)frame->fp, &frame->kr_cur); #endif - frame->pc = ptrauth_strip_insn_pac(frame->pc); - return 0; } NOKPROBE_SYMBOL(unwind_frame); -- cgit v1.2.3 From d3eb70ead6474ec16f976fcacf10a7a890a95bd3 Mon Sep 17 00:00:00 2001 From: Pingfan Liu Date: Fri, 12 Nov 2021 13:22:14 +0800 Subject: arm64: mm: Fix VM_BUG_ON(mm != &init_mm) for trans_pgd trans_pgd_create_copy() can hit "VM_BUG_ON(mm != &init_mm)" in the function pmd_populate_kernel(). This is the combined consequence of commit 5de59884ac0e ("arm64: trans_pgd: pass NULL instead of init_mm to *_populate functions"), which replaced &init_mm with NULL and commit 59511cfd08f3 ("arm64: mm: use XN table mapping attributes for user/kernel mappings"), which introduced the VM_BUG_ON. Since the former sounds reasonable, it is better to work on the later. From the perspective of trans_pgd, two groups of functions are considered in the later one: pmd_populate_kernel() mm == NULL should be fixed, else it hits VM_BUG_ON() p?d_populate() mm == NULL means PXN, that is OK, since trans_pgd only copies a linear map, no execution will happen on the map. So it is good enough to just relax VM_BUG_ON() to disregard mm == NULL Fixes: 59511cfd08f3 ("arm64: mm: use XN table mapping attributes for user/kernel mappings") Signed-off-by: Pingfan Liu Cc: # 5.13.x Cc: Ard Biesheuvel Cc: James Morse Cc: Matthias Brugger Reviewed-by: Catalin Marinas Reviewed-by: Pasha Tatashin Link: https://lore.kernel.org/r/20211112052214.9086-1-kernelfans@gmail.com Signed-off-by: Will Deacon --- arch/arm64/include/asm/pgalloc.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h index 8433a2058eb1..237224484d0f 100644 --- a/arch/arm64/include/asm/pgalloc.h +++ b/arch/arm64/include/asm/pgalloc.h @@ -76,7 +76,7 @@ static inline void __pmd_populate(pmd_t *pmdp, phys_addr_t ptep, static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp, pte_t *ptep) { - VM_BUG_ON(mm != &init_mm); + VM_BUG_ON(mm && mm != &init_mm); __pmd_populate(pmdp, __pa(ptep), PMD_TYPE_TABLE | PMD_TABLE_UXN); } -- cgit v1.2.3 From 522a0032af005502507f5f81ae64fdcc82b5d068 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sat, 6 Nov 2021 17:13:35 -0400 Subject: Add linux/cacheflush.h Many architectures do not include asm-generic/cacheflush.h, so turn the includes on their head and add linux/cacheflush.h which includes asm/cacheflush.h. Move the flush_dcache_folio() declaration from asm-generic/cacheflush.h to linux/cacheflush.h and change linux/highmem.h to include linux/cacheflush.h instead of asm/cacheflush.h so that all necessary places will see flush_dcache_folio(). More functions should have their default implementations moved in the future, but those are for follow-on patches. This fixes csky, sparc and sparc64 which were missed in the commit which added flush_dcache_folio(). Fixes: 08b0b0059bf1 ("mm: Add flush_dcache_folio()") Suggested-by: Christoph Hellwig Signed-off-by: Matthew Wilcox (Oracle) Acked-by: Geert Uytterhoeven --- arch/arc/include/asm/cacheflush.h | 1 - arch/arm/include/asm/cacheflush.h | 1 - arch/m68k/include/asm/cacheflush_mm.h | 1 - arch/mips/include/asm/cacheflush.h | 2 -- arch/nds32/include/asm/cacheflush.h | 1 - arch/nios2/include/asm/cacheflush.h | 1 - arch/parisc/include/asm/cacheflush.h | 1 - arch/sh/include/asm/cacheflush.h | 1 - arch/xtensa/include/asm/cacheflush.h | 3 --- 9 files changed, 12 deletions(-) (limited to 'arch') diff --git a/arch/arc/include/asm/cacheflush.h b/arch/arc/include/asm/cacheflush.h index e8c2c7469e10..e201b4b1655a 100644 --- a/arch/arc/include/asm/cacheflush.h +++ b/arch/arc/include/asm/cacheflush.h @@ -36,7 +36,6 @@ void __flush_dcache_page(phys_addr_t paddr, unsigned long vaddr); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 void flush_dcache_page(struct page *page); -void flush_dcache_folio(struct folio *folio); void dma_cache_wback_inv(phys_addr_t start, unsigned long sz); void dma_cache_inv(phys_addr_t start, unsigned long sz); diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h index e68fb879e4f9..5e56288e343b 100644 --- a/arch/arm/include/asm/cacheflush.h +++ b/arch/arm/include/asm/cacheflush.h @@ -290,7 +290,6 @@ extern void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr */ #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 extern void flush_dcache_page(struct page *); -void flush_dcache_folio(struct folio *folio); #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1 static inline void flush_kernel_vmap_range(void *addr, int size) diff --git a/arch/m68k/include/asm/cacheflush_mm.h b/arch/m68k/include/asm/cacheflush_mm.h index 8ab46625ddd3..1ac55e7b47f0 100644 --- a/arch/m68k/include/asm/cacheflush_mm.h +++ b/arch/m68k/include/asm/cacheflush_mm.h @@ -250,7 +250,6 @@ static inline void __flush_page_to_ram(void *vaddr) #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 #define flush_dcache_page(page) __flush_page_to_ram(page_address(page)) -void flush_dcache_folio(struct folio *folio); #define flush_dcache_mmap_lock(mapping) do { } while (0) #define flush_dcache_mmap_unlock(mapping) do { } while (0) #define flush_icache_page(vma, page) __flush_page_to_ram(page_address(page)) diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h index f207388541d5..b3dc9c589442 100644 --- a/arch/mips/include/asm/cacheflush.h +++ b/arch/mips/include/asm/cacheflush.h @@ -61,8 +61,6 @@ static inline void flush_dcache_page(struct page *page) SetPageDcacheDirty(page); } -void flush_dcache_folio(struct folio *folio); - #define flush_dcache_mmap_lock(mapping) do { } while (0) #define flush_dcache_mmap_unlock(mapping) do { } while (0) diff --git a/arch/nds32/include/asm/cacheflush.h b/arch/nds32/include/asm/cacheflush.h index 3fc0bb7d6487..c2a222ebfa2a 100644 --- a/arch/nds32/include/asm/cacheflush.h +++ b/arch/nds32/include/asm/cacheflush.h @@ -27,7 +27,6 @@ void flush_cache_vunmap(unsigned long start, unsigned long end); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 void flush_dcache_page(struct page *page); -void flush_dcache_folio(struct folio *folio); void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long vaddr, void *dst, void *src, int len); void copy_from_user_page(struct vm_area_struct *vma, struct page *page, diff --git a/arch/nios2/include/asm/cacheflush.h b/arch/nios2/include/asm/cacheflush.h index 1999561b22aa..d0b71dd71287 100644 --- a/arch/nios2/include/asm/cacheflush.h +++ b/arch/nios2/include/asm/cacheflush.h @@ -29,7 +29,6 @@ extern void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 void flush_dcache_page(struct page *page); -void flush_dcache_folio(struct folio *folio); extern void flush_icache_range(unsigned long start, unsigned long end); extern void flush_icache_page(struct vm_area_struct *vma, struct page *page); diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h index da0cd4b3a28f..859b8a34adcf 100644 --- a/arch/parisc/include/asm/cacheflush.h +++ b/arch/parisc/include/asm/cacheflush.h @@ -50,7 +50,6 @@ void invalidate_kernel_vmap_range(void *vaddr, int size); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 void flush_dcache_page(struct page *page); -void flush_dcache_folio(struct folio *folio); #define flush_dcache_mmap_lock(mapping) xa_lock_irq(&mapping->i_pages) #define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&mapping->i_pages) diff --git a/arch/sh/include/asm/cacheflush.h b/arch/sh/include/asm/cacheflush.h index c7a97f32432f..481a664287e2 100644 --- a/arch/sh/include/asm/cacheflush.h +++ b/arch/sh/include/asm/cacheflush.h @@ -43,7 +43,6 @@ extern void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 void flush_dcache_page(struct page *page); -void flush_dcache_folio(struct folio *folio); extern void flush_icache_range(unsigned long start, unsigned long end); #define flush_icache_user_range flush_icache_range extern void flush_icache_page(struct vm_area_struct *vma, diff --git a/arch/xtensa/include/asm/cacheflush.h b/arch/xtensa/include/asm/cacheflush.h index a8a041609c5d..7b4359312c25 100644 --- a/arch/xtensa/include/asm/cacheflush.h +++ b/arch/xtensa/include/asm/cacheflush.h @@ -121,7 +121,6 @@ void flush_cache_page(struct vm_area_struct*, #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 void flush_dcache_page(struct page *); -void flush_dcache_folio(struct folio *); void local_flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); @@ -138,9 +137,7 @@ void local_flush_cache_page(struct vm_area_struct *vma, #define flush_cache_vunmap(start,end) do { } while (0) #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 0 -#define ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO #define flush_dcache_page(page) do { } while (0) -static inline void flush_dcache_folio(struct folio *folio) { } #define flush_icache_range local_flush_icache_range #define flush_cache_page(vma, addr, pfn) do { } while (0) -- cgit v1.2.3 From 574c3c55e969096cea770eda3375ff35ccf91702 Mon Sep 17 00:00:00 2001 From: Ben Gardon Date: Mon, 15 Nov 2021 13:17:04 -0800 Subject: KVM: x86/mmu: Fix TLB flush range when handling disconnected pt When recursively clearing out disconnected pts, the range based TLB flush in handle_removed_tdp_mmu_page uses the wrong starting GFN, resulting in the flush mostly missing the affected range. Fix this by using base_gfn for the flush. In response to feedback from David Matlack on the RFC version of this patch, also move a few definitions into the for loop in the function to prevent unintended references to them in the future. Fixes: a066e61f13cf ("KVM: x86/mmu: Factor out handling of removed page tables") CC: stable@vger.kernel.org Signed-off-by: Ben Gardon Message-Id: <20211115211704.2621644-1-bgardon@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/tdp_mmu.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index a54c3491af42..377a96718a2e 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -317,9 +317,6 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt, struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt)); int level = sp->role.level; gfn_t base_gfn = sp->gfn; - u64 old_child_spte; - u64 *sptep; - gfn_t gfn; int i; trace_kvm_mmu_prepare_zap_page(sp); @@ -327,8 +324,9 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt, tdp_mmu_unlink_page(kvm, sp, shared); for (i = 0; i < PT64_ENT_PER_PAGE; i++) { - sptep = rcu_dereference(pt) + i; - gfn = base_gfn + i * KVM_PAGES_PER_HPAGE(level); + u64 *sptep = rcu_dereference(pt) + i; + gfn_t gfn = base_gfn + i * KVM_PAGES_PER_HPAGE(level); + u64 old_child_spte; if (shared) { /* @@ -374,7 +372,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt, shared); } - kvm_flush_remote_tlbs_with_address(kvm, gfn, + kvm_flush_remote_tlbs_with_address(kvm, base_gfn, KVM_PAGES_PER_HPAGE(level + 1)); call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback); -- cgit v1.2.3 From 9dba4d24cbb5524dd39ab1e08886373b17f07ff2 Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Wed, 17 Nov 2021 08:16:17 +0100 Subject: x86/kvm: remove unused ack_notifier callbacks Commit f52447261bc8c2 ("KVM: irq ack notification") introduced an ack_notifier() callback in struct kvm_pic and in struct kvm_ioapic without using them anywhere. Remove those callbacks again. Signed-off-by: Juergen Gross Message-Id: <20211117071617.19504-1-jgross@suse.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/ioapic.h | 1 - arch/x86/kvm/irq.h | 1 - 2 files changed, 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h index e66e620c3bed..539333ac4b38 100644 --- a/arch/x86/kvm/ioapic.h +++ b/arch/x86/kvm/ioapic.h @@ -81,7 +81,6 @@ struct kvm_ioapic { unsigned long irq_states[IOAPIC_NUM_PINS]; struct kvm_io_device dev; struct kvm *kvm; - void (*ack_notifier)(void *opaque, int irq); spinlock_t lock; struct rtc_status rtc_status; struct delayed_work eoi_inject; diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h index 650642b18d15..c2d7cfe82d00 100644 --- a/arch/x86/kvm/irq.h +++ b/arch/x86/kvm/irq.h @@ -56,7 +56,6 @@ struct kvm_pic { struct kvm_io_device dev_master; struct kvm_io_device dev_slave; struct kvm_io_device dev_elcr; - void (*ack_notifier)(void *opaque, int irq); unsigned long irq_states[PIC_NUM_PINS]; }; -- cgit v1.2.3 From c7785d85b6c6cc9f3d0f1a8cab128f4062b30abb Mon Sep 17 00:00:00 2001 From: Hou Wenlong Date: Wed, 17 Nov 2021 17:20:39 +0800 Subject: KVM: x86/mmu: Skip tlb flush if it has been done in zap_gfn_range() If the parameter flush is set, zap_gfn_range() would flush remote tlb when yield, then tlb flush is not needed outside. So use the return value of zap_gfn_range() directly instead of OR on it in kvm_unmap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range(). Fixes: 3039bcc744980 ("KVM: Move x86's MMU notifier memslot walkers to generic code") Signed-off-by: Hou Wenlong Message-Id: <5e16546e228877a4d974f8c0e448a93d52c7a5a9.1637140154.git.houwenlong93@linux.alibaba.com> Reviewed-by: Sean Christopherson Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 2 +- arch/x86/kvm/mmu/tdp_mmu.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 3be9beea838d..0a8436ea0090 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1582,7 +1582,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) flush = kvm_handle_gfn_range(kvm, range, kvm_unmap_rmapp); if (is_tdp_mmu_enabled(kvm)) - flush |= kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush); + flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush); return flush; } diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 377a96718a2e..1f8c9f783b78 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1032,8 +1032,8 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, struct kvm_mmu_page *root; for_each_tdp_mmu_root(kvm, root, range->slot->as_id) - flush |= zap_gfn_range(kvm, root, range->start, range->end, - range->may_block, flush, false); + flush = zap_gfn_range(kvm, root, range->start, range->end, + range->may_block, flush, false); return flush; } -- cgit v1.2.3 From 8ed716ca7dc91f058be0ba644a3048667a20db13 Mon Sep 17 00:00:00 2001 From: Hou Wenlong Date: Wed, 17 Nov 2021 17:20:40 +0800 Subject: KVM: x86/mmu: Pass parameter flush as false in kvm_tdp_mmu_zap_collapsible_sptes() Since tlb flush has been done for legacy MMU before kvm_tdp_mmu_zap_collapsible_sptes(), so the parameter flush should be false for kvm_tdp_mmu_zap_collapsible_sptes(). Fixes: e2209710ccc5d ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Hou Wenlong Message-Id: <21453a1d2533afb6e59fb6c729af89e771ff2e76.1637140154.git.houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0a8436ea0090..0c839ee1282c 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5854,7 +5854,7 @@ restart: void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *slot) { - bool flush = false; + bool flush; if (kvm_memslots_have_rmaps(kvm)) { write_lock(&kvm->mmu_lock); @@ -5871,7 +5871,7 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, if (is_tdp_mmu_enabled(kvm)) { read_lock(&kvm->mmu_lock); - flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, flush); + flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, false); if (flush) kvm_arch_flush_remote_tlbs_memslot(kvm, slot); read_unlock(&kvm->mmu_lock); -- cgit v1.2.3 From 187bea472600dcc8d2eb714335053264dd437172 Mon Sep 17 00:00:00 2001 From: Takashi Iwai Date: Thu, 18 Nov 2021 15:25:08 +0100 Subject: ARM: socfpga: Fix crash with CONFIG_FORTIRY_SOURCE When CONFIG_FORTIFY_SOURCE is set, memcpy() checks the potential buffer overflow and panics. The code in sofcpga bootstrapping contains the memcpy() calls are mistakenly translated as the shorter size, hence it triggers a panic as if it were overflowing. This patch changes the secondary_trampoline and *_end definitions to arrays for avoiding the false-positive crash above. Fixes: 9c4566a117a6 ("ARM: socfpga: Enable SMP for socfpga") Suggested-by: Kees Cook Buglink: https://bugzilla.suse.com/show_bug.cgi?id=1192473 Link: https://lore.kernel.org/r/20211117193244.31162-1-tiwai@suse.de Signed-off-by: Takashi Iwai Signed-off-by: Dinh Nguyen --- arch/arm/mach-socfpga/core.h | 2 +- arch/arm/mach-socfpga/platsmp.c | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) (limited to 'arch') diff --git a/arch/arm/mach-socfpga/core.h b/arch/arm/mach-socfpga/core.h index fc2608b18a0d..18f01190dcfd 100644 --- a/arch/arm/mach-socfpga/core.h +++ b/arch/arm/mach-socfpga/core.h @@ -33,7 +33,7 @@ extern void __iomem *sdr_ctl_base_addr; u32 socfpga_sdram_self_refresh(u32 sdr_base); extern unsigned int socfpga_sdram_self_refresh_sz; -extern char secondary_trampoline, secondary_trampoline_end; +extern char secondary_trampoline[], secondary_trampoline_end[]; extern unsigned long socfpga_cpu1start_addr; diff --git a/arch/arm/mach-socfpga/platsmp.c b/arch/arm/mach-socfpga/platsmp.c index fbb80b883e5d..201191cf68f3 100644 --- a/arch/arm/mach-socfpga/platsmp.c +++ b/arch/arm/mach-socfpga/platsmp.c @@ -20,14 +20,14 @@ static int socfpga_boot_secondary(unsigned int cpu, struct task_struct *idle) { - int trampoline_size = &secondary_trampoline_end - &secondary_trampoline; + int trampoline_size = secondary_trampoline_end - secondary_trampoline; if (socfpga_cpu1start_addr) { /* This will put CPU #1 into reset. */ writel(RSTMGR_MPUMODRST_CPU1, rst_manager_base_addr + SOCFPGA_RSTMGR_MODMPURST); - memcpy(phys_to_virt(0), &secondary_trampoline, trampoline_size); + memcpy(phys_to_virt(0), secondary_trampoline, trampoline_size); writel(__pa_symbol(secondary_startup), sys_manager_base_addr + (socfpga_cpu1start_addr & 0x000000ff)); @@ -45,12 +45,12 @@ static int socfpga_boot_secondary(unsigned int cpu, struct task_struct *idle) static int socfpga_a10_boot_secondary(unsigned int cpu, struct task_struct *idle) { - int trampoline_size = &secondary_trampoline_end - &secondary_trampoline; + int trampoline_size = secondary_trampoline_end - secondary_trampoline; if (socfpga_cpu1start_addr) { writel(RSTMGR_MPUMODRST_CPU1, rst_manager_base_addr + SOCFPGA_A10_RSTMGR_MODMPURST); - memcpy(phys_to_virt(0), &secondary_trampoline, trampoline_size); + memcpy(phys_to_virt(0), secondary_trampoline, trampoline_size); writel(__pa_symbol(secondary_startup), sys_manager_base_addr + (socfpga_cpu1start_addr & 0x00000fff)); -- cgit v1.2.3 From 2a0991929aba0a3dd6fe51d1daba06a93a96a021 Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Fri, 19 Nov 2021 16:39:13 +0100 Subject: xen/pvh: add missing prototype to header The prototype of mem_map_via_hcall() is missing in its header, so add it. Reported-by: kernel test robot Fixes: a43fb7da53007e67ad ("xen/pvh: Move Xen code for getting mem map via hcall out of common file") Signed-off-by: Juergen Gross Link: https://lore.kernel.org/r/20211119153913.21678-1-jgross@suse.com Reviewed-by: Boris Ostrovsky Signed-off-by: Boris Ostrovsky --- arch/x86/include/asm/xen/hypervisor.h | 1 + 1 file changed, 1 insertion(+) (limited to 'arch') diff --git a/arch/x86/include/asm/xen/hypervisor.h b/arch/x86/include/asm/xen/hypervisor.h index 4957f59deb40..5adab895127e 100644 --- a/arch/x86/include/asm/xen/hypervisor.h +++ b/arch/x86/include/asm/xen/hypervisor.h @@ -64,6 +64,7 @@ void xen_arch_unregister_cpu(int num); #ifdef CONFIG_PVH void __init xen_pvh_init(struct boot_params *boot_params); +void __init mem_map_via_hcall(struct boot_params *boot_params_p); #endif #endif /* _ASM_X86_XEN_HYPERVISOR_H */ -- cgit v1.2.3 From 756e1fc16505c31c9f86b602fcb8e2bc55c4b7e5 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 4 Nov 2021 16:41:06 +0000 Subject: KVM: RISC-V: Unmap stage2 mapping when deleting/moving a memslot Unmap stage2 page tables when a memslot is being deleted or moved. It's the architectures' responsibility to ensure existing mappings are removed when kvm_arch_flush_shadow_memslot() returns. Fixes: 9d05c1fee837 ("RISC-V: KVM: Implement stage2 page table programming") Signed-off-by: Sean Christopherson Signed-off-by: Anup Patel --- arch/riscv/kvm/mmu.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'arch') diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c index d81bae8eb55e..fc058ff5f4b6 100644 --- a/arch/riscv/kvm/mmu.c +++ b/arch/riscv/kvm/mmu.c @@ -453,6 +453,12 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm) void kvm_arch_flush_shadow_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { + gpa_t gpa = slot->base_gfn << PAGE_SHIFT; + phys_addr_t size = slot->npages << PAGE_SHIFT; + + spin_lock(&kvm->mmu_lock); + stage2_unmap_range(kvm, gpa, size, false); + spin_unlock(&kvm->mmu_lock); } void kvm_arch_commit_memory_region(struct kvm *kvm, -- cgit v1.2.3 From 74c2e97b01846eb237b7819a3e2944455cfdb26a Mon Sep 17 00:00:00 2001 From: Anup Patel Date: Wed, 17 Nov 2021 10:30:29 +0530 Subject: RISC-V: KVM: Fix incorrect KVM_MAX_VCPUS value The KVM_MAX_VCPUS value is supposed to be aligned with number of VMID bits in the hgatp CSR but the current KVM_MAX_VCPUS value is aligned with number of ASID bits in the satp CSR. Fixes: 99cdc6c18c2d ("RISC-V: Add initial skeletal KVM support") Signed-off-by: Anup Patel Reviewed-by: Atish Patra --- arch/riscv/include/asm/kvm_host.h | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) (limited to 'arch') diff --git a/arch/riscv/include/asm/kvm_host.h b/arch/riscv/include/asm/kvm_host.h index 25ba21f98504..2639b9ee48f9 100644 --- a/arch/riscv/include/asm/kvm_host.h +++ b/arch/riscv/include/asm/kvm_host.h @@ -12,14 +12,12 @@ #include #include #include +#include #include #include -#ifdef CONFIG_64BIT -#define KVM_MAX_VCPUS (1U << 16) -#else -#define KVM_MAX_VCPUS (1U << 9) -#endif +#define KVM_MAX_VCPUS \ + ((HGATP_VMID_MASK >> HGATP_VMID_SHIFT) + 1) #define KVM_HALT_POLL_NS_DEFAULT 500000 -- cgit v1.2.3 From 169d1a4a2adb2c246396c56aa2f9eec3868546f1 Mon Sep 17 00:00:00 2001 From: Helge Deller Date: Fri, 19 Nov 2021 22:16:37 +0100 Subject: parisc: Provide an extru_safe() macro to extract unsigned bits The extru instruction leaves the most significant 32 bits of the target register in an undefined state on PA 2.0 systems. Provide a macro to safely use extru on 32- and 64-bit machines. Suggested-by: John David Anglin Signed-off-by: Helge Deller --- arch/parisc/include/asm/assembly.h | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'arch') diff --git a/arch/parisc/include/asm/assembly.h b/arch/parisc/include/asm/assembly.h index 39e7985086f9..6d13ae236fcb 100644 --- a/arch/parisc/include/asm/assembly.h +++ b/arch/parisc/include/asm/assembly.h @@ -147,6 +147,17 @@ extrd,u \r, 63-(\sa), 64-(\sa), \t .endm + /* Extract unsigned for 32- and 64-bit + * The extru instruction leaves the most significant 32 bits of the + * target register in an undefined state on PA 2.0 systems. */ + .macro extru_safe r, p, len, t +#ifdef CONFIG_64BIT + extrd,u \r, 32+(\p), \len, \t +#else + extru \r, \p, \len, \t +#endif + .endm + /* load 32-bit 'value' into 'reg' compensating for the ldil * sign-extension when running in wide mode. * WARNING!! neither 'value' nor 'reg' can be expressions -- cgit v1.2.3 From df2ffeda6370a77011902e7c9d7a1eb1cbffed4f Mon Sep 17 00:00:00 2001 From: John David Anglin Date: Fri, 19 Nov 2021 22:18:47 +0100 Subject: parisc: Fix extraction of hash lock bits in syscall.S The extru instruction leaves the most significant 32 bits of the target register in an undefined state on PA 2.0 systems. If any of these bits are nonzero, this will break the calculation of the lock pointer. Fix by using extrd,u instruction via extru_safe macro on 64-bit kernels. Signed-off-by: John David Anglin Signed-off-by: Helge Deller --- arch/parisc/kernel/syscall.S | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/parisc/kernel/syscall.S b/arch/parisc/kernel/syscall.S index 4fb3b6a993bf..d2497b339d13 100644 --- a/arch/parisc/kernel/syscall.S +++ b/arch/parisc/kernel/syscall.S @@ -566,7 +566,7 @@ lws_compare_and_swap: ldo R%lws_lock_start(%r20), %r28 /* Extract eight bits from r26 and hash lock (Bits 3-11) */ - extru %r26, 28, 8, %r20 + extru_safe %r26, 28, 8, %r20 /* Find lock to use, the hash is either one of 0 to 15, multiplied by 16 (keep it 16-byte aligned) @@ -751,7 +751,7 @@ cas2_lock_start: ldo R%lws_lock_start(%r20), %r28 /* Extract eight bits from r26 and hash lock (Bits 3-11) */ - extru %r26, 28, 8, %r20 + extru_safe %r26, 28, 8, %r20 /* Find lock to use, the hash is either one of 0 to 15, multiplied by 16 (keep it 16-byte aligned) -- cgit v1.2.3 From 3fbdc121bd051d9f1b3b2e232ad734c44b47d32c Mon Sep 17 00:00:00 2001 From: Helge Deller Date: Fri, 19 Nov 2021 22:20:14 +0100 Subject: parisc: Convert PTE lookup to use extru_safe() macro Convert the PTE lookup functions to use the safer extru_safe macro. Signed-off-by: Helge Deller --- arch/parisc/kernel/entry.S | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-) (limited to 'arch') diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S index 88c188a965d8..6e9cdb269862 100644 --- a/arch/parisc/kernel/entry.S +++ b/arch/parisc/kernel/entry.S @@ -366,17 +366,9 @@ */ .macro L2_ptep pmd,pte,index,va,fault #if CONFIG_PGTABLE_LEVELS == 3 - extru \va,31-ASM_PMD_SHIFT,ASM_BITS_PER_PMD,\index + extru_safe \va,31-ASM_PMD_SHIFT,ASM_BITS_PER_PMD,\index #else -# if defined(CONFIG_64BIT) - extrd,u \va,63-ASM_PGDIR_SHIFT,ASM_BITS_PER_PGD,\index - #else - # if PAGE_SIZE > 4096 - extru \va,31-ASM_PGDIR_SHIFT,32-ASM_PGDIR_SHIFT,\index - # else - extru \va,31-ASM_PGDIR_SHIFT,ASM_BITS_PER_PGD,\index - # endif -# endif + extru_safe \va,31-ASM_PGDIR_SHIFT,ASM_BITS_PER_PGD,\index #endif dep %r0,31,PAGE_SHIFT,\pmd /* clear offset */ #if CONFIG_PGTABLE_LEVELS < 3 @@ -386,7 +378,7 @@ bb,>=,n \pmd,_PxD_PRESENT_BIT,\fault dep %r0,31,PxD_FLAG_SHIFT,\pmd /* clear flags */ SHLREG \pmd,PxD_VALUE_SHIFT,\pmd - extru \va,31-PAGE_SHIFT,ASM_BITS_PER_PTE,\index + extru_safe \va,31-PAGE_SHIFT,ASM_BITS_PER_PTE,\index dep %r0,31,PAGE_SHIFT,\pmd /* clear offset */ shladd \index,BITS_PER_PTE_ENTRY,\pmd,\pmd /* pmd is now pte */ .endm -- cgit v1.2.3 From 98400ad75e95860e9a10ec78b0b90ab66184a2ce Mon Sep 17 00:00:00 2001 From: Helge Deller Date: Sun, 21 Nov 2021 11:10:55 +0100 Subject: Revert "parisc: Fix backtrace to always include init funtion names" This reverts commit 279917e27edc293eb645a25428c6ab3f3bca3f86. With the CONFIG_HARDENED_USERCOPY option enabled, this patch triggers kernel bugs at runtime: usercopy: Kernel memory overwrite attempt detected to kernel text (offset 2084839, size 6)! kernel BUG at mm/usercopy.c:99! Backtrace: IAOQ[0]: usercopy_abort+0xc4/0xe8 [<00000000406ed1c8>] __check_object_size+0x174/0x238 [<00000000407086d4>] copy_strings.isra.0+0x3e8/0x708 [<0000000040709a20>] do_execveat_common.isra.0+0x1bc/0x328 [<000000004070b760>] compat_sys_execve+0x7c/0xb8 [<0000000040303eb8>] syscall_exit+0x0/0x14 The problem is, that we have an init section of at least 2MB size which starts at _stext and is freed after bootup. If then later some kernel data is (temporarily) stored in this free memory, check_kernel_text_object() will trigger a bug since the data appears to be inside the kernel text (>=_stext) area: if (overlaps(ptr, len, _stext, _etext)) usercopy_abort("kernel text"); Signed-off-by: Helge Deller Cc: stable@kernel.org # 5.4+ --- arch/parisc/kernel/vmlinux.lds.S | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/parisc/kernel/vmlinux.lds.S b/arch/parisc/kernel/vmlinux.lds.S index 3d208afd15bc..2769eb991f58 100644 --- a/arch/parisc/kernel/vmlinux.lds.S +++ b/arch/parisc/kernel/vmlinux.lds.S @@ -57,8 +57,6 @@ SECTIONS { . = KERNEL_BINARY_TEXT_START; - _stext = .; /* start of kernel text, includes init code & data */ - __init_begin = .; HEAD_TEXT_SECTION MLONGCALL_DISCARD(INIT_TEXT_SECTION(8)) @@ -82,6 +80,7 @@ SECTIONS /* freed after init ends here */ _text = .; /* Text and read-only data */ + _stext = .; MLONGCALL_KEEP(INIT_TEXT_SECTION(8)) .text ALIGN(PAGE_SIZE) : { TEXT_TEXT -- cgit v1.2.3 From 94902d849e85093aafcdbea2be8e2beff47233e6 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Mon, 22 Nov 2021 12:58:20 +0000 Subject: arm64: uaccess: avoid blocking within critical sections As Vincent reports in: https://lore.kernel.org/r/20211118163417.21617-1-vincent.whitchurch@axis.com The put_user() in schedule_tail() can get stuck in a livelock, similar to a problem recently fixed on riscv in commit: 285a76bb2cf51b0c ("riscv: evaluate put_user() arg before enabling user access") In __raw_put_user() we have a critical section between uaccess_ttbr0_enable() and uaccess_ttbr0_disable() where we cannot safely call into the scheduler without having taken an exception, as schedule() and other scheduling functions will not save/restore the TTBR0 state. If either of the `x` or `ptr` arguments to __raw_put_user() contain a blocking call, we may call into the scheduler within the critical section. This can result in two problems: 1) The access within the critical section will occur without the required TTBR0 tables installed. This will fault, and where the required tables permit access, the access will be retried without the required tables, resulting in a livelock. 2) When TTBR0 SW PAN is in use, check_and_switch_context() does not modify TTBR0, leaving a stale value installed. The mappings of the blocked task will erroneously be accessible to regular accesses in the context of the new task. Additionally, if the tables are subsequently freed, local TLB maintenance required to reuse the ASID may be lost, potentially resulting in TLB corruption (e.g. in the presence of CnP). The same issue exists for __raw_get_user() in the critical section between uaccess_ttbr0_enable() and uaccess_ttbr0_disable(). A similar issue exists for __get_kernel_nofault() and __put_kernel_nofault() for the critical section between __uaccess_enable_tco_async() and __uaccess_disable_tco_async(), as the TCO state is not context-switched by direct calls into the scheduler. Here the TCO state may be lost from the context of the current task, resulting in unexpected asynchronous tag check faults. It may also be leaked to another task, suppressing expected tag check faults. To fix all of these cases, we must ensure that we do not directly call into the scheduler in their respective critical sections. This patch reworks __raw_put_user(), __raw_get_user(), __get_kernel_nofault(), and __put_kernel_nofault(), ensuring that parameters are evaluated outside of the critical sections. To make this requirement clear, comments are added describing the problem, and line spaces added to separate the critical sections from other portions of the macros. For __raw_get_user() and __raw_put_user() the `err` parameter is conditionally assigned to, and we must currently evaluate this in the critical section. This behaviour is relied upon by the signal code, which uses chains of put_user_error() and get_user_error(), checking the return value at the end. In all cases, the `err` parameter is a plain int rather than a more complex expression with a blocking call, so this is safe. In future we should try to clean up the `err` usage to remove the potential for this to be a problem. Aside from the changes to time of evaluation, there should be no functional change as a result of this patch. Reported-by: Vincent Whitchurch Link: https://lore.kernel.org/r/20211118163417.21617-1-vincent.whitchurch@axis.com Fixes: f253d827f33c ("arm64: uaccess: refactor __{get,put}_user") Signed-off-by: Mark Rutland Cc: Will Deacon Cc: Catalin Marinas Link: https://lore.kernel.org/r/20211122125820.55286-1-mark.rutland@arm.com Signed-off-by: Will Deacon --- arch/arm64/include/asm/uaccess.h | 48 ++++++++++++++++++++++++++++++++++------ 1 file changed, 41 insertions(+), 7 deletions(-) (limited to 'arch') diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 6e2e0b7031ab..3a5ff5e20586 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -281,12 +281,22 @@ do { \ (x) = (__force __typeof__(*(ptr)))__gu_val; \ } while (0) +/* + * We must not call into the scheduler between uaccess_ttbr0_enable() and + * uaccess_ttbr0_disable(). As `x` and `ptr` could contain blocking functions, + * we must evaluate these outside of the critical section. + */ #define __raw_get_user(x, ptr, err) \ do { \ + __typeof__(*(ptr)) __user *__rgu_ptr = (ptr); \ + __typeof__(x) __rgu_val; \ __chk_user_ptr(ptr); \ + \ uaccess_ttbr0_enable(); \ - __raw_get_mem("ldtr", x, ptr, err); \ + __raw_get_mem("ldtr", __rgu_val, __rgu_ptr, err); \ uaccess_ttbr0_disable(); \ + \ + (x) = __rgu_val; \ } while (0) #define __get_user_error(x, ptr, err) \ @@ -310,14 +320,22 @@ do { \ #define get_user __get_user +/* + * We must not call into the scheduler between __uaccess_enable_tco_async() and + * __uaccess_disable_tco_async(). As `dst` and `src` may contain blocking + * functions, we must evaluate these outside of the critical section. + */ #define __get_kernel_nofault(dst, src, type, err_label) \ do { \ + __typeof__(dst) __gkn_dst = (dst); \ + __typeof__(src) __gkn_src = (src); \ int __gkn_err = 0; \ \ __uaccess_enable_tco_async(); \ - __raw_get_mem("ldr", *((type *)(dst)), \ - (__force type *)(src), __gkn_err); \ + __raw_get_mem("ldr", *((type *)(__gkn_dst)), \ + (__force type *)(__gkn_src), __gkn_err); \ __uaccess_disable_tco_async(); \ + \ if (unlikely(__gkn_err)) \ goto err_label; \ } while (0) @@ -351,11 +369,19 @@ do { \ } \ } while (0) +/* + * We must not call into the scheduler between uaccess_ttbr0_enable() and + * uaccess_ttbr0_disable(). As `x` and `ptr` could contain blocking functions, + * we must evaluate these outside of the critical section. + */ #define __raw_put_user(x, ptr, err) \ do { \ - __chk_user_ptr(ptr); \ + __typeof__(*(ptr)) __user *__rpu_ptr = (ptr); \ + __typeof__(*(ptr)) __rpu_val = (x); \ + __chk_user_ptr(__rpu_ptr); \ + \ uaccess_ttbr0_enable(); \ - __raw_put_mem("sttr", x, ptr, err); \ + __raw_put_mem("sttr", __rpu_val, __rpu_ptr, err); \ uaccess_ttbr0_disable(); \ } while (0) @@ -380,14 +406,22 @@ do { \ #define put_user __put_user +/* + * We must not call into the scheduler between __uaccess_enable_tco_async() and + * __uaccess_disable_tco_async(). As `dst` and `src` may contain blocking + * functions, we must evaluate these outside of the critical section. + */ #define __put_kernel_nofault(dst, src, type, err_label) \ do { \ + __typeof__(dst) __pkn_dst = (dst); \ + __typeof__(src) __pkn_src = (src); \ int __pkn_err = 0; \ \ __uaccess_enable_tco_async(); \ - __raw_put_mem("str", *((type *)(src)), \ - (__force type *)(dst), __pkn_err); \ + __raw_put_mem("str", *((type *)(__pkn_src)), \ + (__force type *)(__pkn_dst), __pkn_err); \ __uaccess_disable_tco_async(); \ + \ if (unlikely(__pkn_err)) \ goto err_label; \ } while(0) -- cgit v1.2.3 From cf0b0e3712f7af90006f8317ff27278094c2c128 Mon Sep 17 00:00:00 2001 From: Nicholas Piggin Date: Fri, 19 Nov 2021 13:16:27 +1000 Subject: KVM: PPC: Book3S HV: Prevent POWER7/8 TLB flush flushing SLB The POWER9 ERAT flush instruction is a SLBIA with IH=7, which is a reserved value on POWER7/8. On POWER8 this invalidates the SLB entries above index 0, similarly to SLBIA IH=0. If the SLB entries are invalidated, and then the guest is bypassed, the host SLB does not get re-loaded, so the bolted entries above 0 will be lost. This can result in kernel stack access causing a SLB fault. Kernel stack access causing a SLB fault was responsible for the infamous mega bug (search "Fix SLB reload bug"). Although since commit 48e7b7695745 ("powerpc/64s/hash: Convert SLB miss handlers to C") that starts using the kernel stack in the SLB miss handler, it might only result in an infinite loop of SLB faults. In any case it's a bug. Fix this by only executing the instruction on >= POWER9 where IH=7 is defined not to invalidate the SLB. POWER7/8 don't require this ERAT flush. Fixes: 500871125920 ("KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Nicholas Piggin Reviewed-by: Fabiano Rosas Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20211119031627.577853-1-npiggin@gmail.com --- arch/powerpc/kvm/book3s_hv_builtin.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index fcf4760a3a0e..70b7a8f97153 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -695,6 +695,7 @@ static void flush_guest_tlb(struct kvm *kvm) "r" (0) : "memory"); } asm volatile("ptesync": : :"memory"); + // POWER9 congruence-class TLBIEL leaves ERAT. Flush it now. asm volatile(PPC_RADIX_INVALIDATE_ERAT_GUEST : : :"memory"); } else { for (set = 0; set < kvm->arch.tlb_sets; ++set) { @@ -705,7 +706,9 @@ static void flush_guest_tlb(struct kvm *kvm) rb += PPC_BIT(51); /* increment set number */ } asm volatile("ptesync": : :"memory"); - asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT : : :"memory"); + // POWER9 congruence-class TLBIEL leaves ERAT. Flush it now. + if (cpu_has_feature(CPU_FTR_ARCH_300)) + asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT : : :"memory"); } } -- cgit v1.2.3 From 5bb60ea611db1e04814426ed4bd1c95d1487678e Mon Sep 17 00:00:00 2001 From: Christophe Leroy Date: Thu, 18 Nov 2021 10:39:53 +0100 Subject: powerpc/32: Fix hardlockup on vmap stack overflow Since the commit c118c7303ad5 ("powerpc/32: Fix vmap stack - Do not activate MMU before reading task struct") a vmap stack overflow results in a hard lockup. This is because emergency_ctx is still addressed with its virtual address allthough data MMU is not active anymore at that time. Fix it by using a physical address instead. Fixes: c118c7303ad5 ("powerpc/32: Fix vmap stack - Do not activate MMU before reading task struct") Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Christophe Leroy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/ce30364fb7ccda489272af4a1612b6aa147e1d23.1637227521.git.christophe.leroy@csgroup.eu --- arch/powerpc/kernel/head_32.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'arch') diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h index 6b1ec9e3541b..349c4a820231 100644 --- a/arch/powerpc/kernel/head_32.h +++ b/arch/powerpc/kernel/head_32.h @@ -202,11 +202,11 @@ vmap_stack_overflow: mfspr r1, SPRN_SPRG_THREAD lwz r1, TASK_CPU - THREAD(r1) slwi r1, r1, 3 - addis r1, r1, emergency_ctx@ha + addis r1, r1, emergency_ctx-PAGE_OFFSET@ha #else - lis r1, emergency_ctx@ha + lis r1, emergency_ctx-PAGE_OFFSET@ha #endif - lwz r1, emergency_ctx@l(r1) + lwz r1, emergency_ctx-PAGE_OFFSET@l(r1) addi r1, r1, THREAD_SIZE - INT_FRAME_SIZE EXCEPTION_PROLOG_2 0 vmap_stack_overflow prepare_transfer_to_handler -- cgit v1.2.3 From c0f2077baa4113f38f008b8e912b9fb3ff8d43df Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Tue, 23 Nov 2021 08:04:34 +0100 Subject: x86/boot: Mark prepare_command_line() __init Fix: WARNING: modpost: vmlinux.o(.text.unlikely+0x64d0): Section mismatch in reference \ from the function prepare_command_line() to the variable .init.data:command_line The function prepare_command_line() references the variable __initdata command_line. This is often because prepare_command_line lacks a __initdata annotation or the annotation of command_line is wrong. Apparently some toolchains do different inlining decisions. Reported-by: Stephen Rothwell Signed-off-by: Borislav Petkov Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/YZySgpmBcNNM2qca@zn.tnic --- arch/x86/kernel/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index c410be738ae7..6a190c7f4d71 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -742,7 +742,7 @@ dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p) return 0; } -static char *prepare_command_line(void) +static char * __init prepare_command_line(void) { #ifdef CONFIG_CMDLINE_BOOL #ifdef CONFIG_CMDLINE_OVERRIDE -- cgit v1.2.3 From 83bb2c1a01d7127d5adc7d69d7aaa3f7072de2b4 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 16 Nov 2021 10:20:06 +0000 Subject: KVM: arm64: Save PSTATE early on exit In order to be able to use primitives such as vcpu_mode_is_32bit(), we need to synchronize the guest PSTATE. However, this is currently done deep into the bowels of the world-switch code, and we do have helpers evaluating this much earlier (__vgic_v3_perform_cpuif_access and handle_aarch32_guest, for example). Move the saving of the guest pstate into the early fixups, which cures the first issue. The second one will be addressed separately. Tested-by: Fuad Tabba Reviewed-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/hyp/switch.h | 6 ++++++ arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 7 ++++++- 2 files changed, 12 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h index 7a0af1d39303..d79fd101615f 100644 --- a/arch/arm64/kvm/hyp/include/hyp/switch.h +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h @@ -429,6 +429,12 @@ static inline bool kvm_hyp_handle_exit(struct kvm_vcpu *vcpu, u64 *exit_code) */ static inline bool fixup_guest_exit(struct kvm_vcpu *vcpu, u64 *exit_code) { + /* + * Save PSTATE early so that we can evaluate the vcpu mode + * early on. + */ + vcpu->arch.ctxt.regs.pstate = read_sysreg_el2(SYS_SPSR); + if (ARM_EXCEPTION_CODE(*exit_code) != ARM_EXCEPTION_IRQ) vcpu->arch.fault.esr_el2 = read_sysreg_el2(SYS_ESR); diff --git a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h index de7e14c862e6..7ecca8b07851 100644 --- a/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h +++ b/arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h @@ -70,7 +70,12 @@ static inline void __sysreg_save_el1_state(struct kvm_cpu_context *ctxt) static inline void __sysreg_save_el2_return_state(struct kvm_cpu_context *ctxt) { ctxt->regs.pc = read_sysreg_el2(SYS_ELR); - ctxt->regs.pstate = read_sysreg_el2(SYS_SPSR); + /* + * Guest PSTATE gets saved at guest fixup time in all + * cases. We still need to handle the nVHE host side here. + */ + if (!has_vhe() && ctxt->__hyp_running_vcpu) + ctxt->regs.pstate = read_sysreg_el2(SYS_SPSR); if (cpus_have_final_cap(ARM64_HAS_RAS_EXTN)) ctxt_sys_reg(ctxt, DISR_EL1) = read_sysreg_s(SYS_VDISR_EL2); -- cgit v1.2.3 From 7183b2b5ae6b8d77a37069566d77cf2a74060f7e Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Tue, 16 Nov 2021 12:39:35 +0000 Subject: KVM: arm64: Move pkvm's special 32bit handling into a generic infrastructure Protected KVM is trying to turn AArch32 exceptions into an illegal exception entry. Unfortunately, it does that in a way that is a bit abrupt, and too early for PSTATE to be available. Instead, move it to the fixup code, which is a more reasonable place for it. This will also be useful for the NV code. Reviewed-by: Fuad Tabba Signed-off-by: Marc Zyngier --- arch/arm64/kvm/hyp/include/hyp/switch.h | 8 ++++++++ arch/arm64/kvm/hyp/nvhe/switch.c | 8 +------- arch/arm64/kvm/hyp/vhe/switch.c | 4 ++++ 3 files changed, 13 insertions(+), 7 deletions(-) (limited to 'arch') diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h index d79fd101615f..96c5f3fb7838 100644 --- a/arch/arm64/kvm/hyp/include/hyp/switch.h +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h @@ -403,6 +403,8 @@ typedef bool (*exit_handler_fn)(struct kvm_vcpu *, u64 *); static const exit_handler_fn *kvm_get_exit_handler_array(struct kvm_vcpu *vcpu); +static void early_exit_filter(struct kvm_vcpu *vcpu, u64 *exit_code); + /* * Allow the hypervisor to handle the exit with an exit handler if it has one. * @@ -435,6 +437,12 @@ static inline bool fixup_guest_exit(struct kvm_vcpu *vcpu, u64 *exit_code) */ vcpu->arch.ctxt.regs.pstate = read_sysreg_el2(SYS_SPSR); + /* + * Check whether we want to repaint the state one way or + * another. + */ + early_exit_filter(vcpu, exit_code); + if (ARM_EXCEPTION_CODE(*exit_code) != ARM_EXCEPTION_IRQ) vcpu->arch.fault.esr_el2 = read_sysreg_el2(SYS_ESR); diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c index c0e3fed26d93..d13115a12434 100644 --- a/arch/arm64/kvm/hyp/nvhe/switch.c +++ b/arch/arm64/kvm/hyp/nvhe/switch.c @@ -233,7 +233,7 @@ static const exit_handler_fn *kvm_get_exit_handler_array(struct kvm_vcpu *vcpu) * Returns false if the guest ran in AArch32 when it shouldn't have, and * thus should exit to the host, or true if a the guest run loop can continue. */ -static bool handle_aarch32_guest(struct kvm_vcpu *vcpu, u64 *exit_code) +static void early_exit_filter(struct kvm_vcpu *vcpu, u64 *exit_code) { struct kvm *kvm = kern_hyp_va(vcpu->kvm); @@ -248,10 +248,7 @@ static bool handle_aarch32_guest(struct kvm_vcpu *vcpu, u64 *exit_code) vcpu->arch.target = -1; *exit_code &= BIT(ARM_EXIT_WITH_SERROR_BIT); *exit_code |= ARM_EXCEPTION_IL; - return false; } - - return true; } /* Switch to the guest for legacy non-VHE systems */ @@ -316,9 +313,6 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu) /* Jump in the fire! */ exit_code = __guest_enter(vcpu); - if (unlikely(!handle_aarch32_guest(vcpu, &exit_code))) - break; - /* And we're baaack! */ } while (fixup_guest_exit(vcpu, &exit_code)); diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c index 5a2cb5d9bc4b..fbb26b93c347 100644 --- a/arch/arm64/kvm/hyp/vhe/switch.c +++ b/arch/arm64/kvm/hyp/vhe/switch.c @@ -112,6 +112,10 @@ static const exit_handler_fn *kvm_get_exit_handler_array(struct kvm_vcpu *vcpu) return hyp_exit_handlers; } +static void early_exit_filter(struct kvm_vcpu *vcpu, u64 *exit_code) +{ +} + /* Switch to the guest for VHE systems running in EL2 */ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu) { -- cgit v1.2.3 From fbf3bce458214bb971d3d571515b3b129eac290b Mon Sep 17 00:00:00 2001 From: Paul Cercueil Date: Fri, 19 Nov 2021 17:50:52 +0000 Subject: MIPS: boot/compressed/: add __ashldi3 to target for ZSTD compression Just like before with __bswapdi2(), for MIPS pre-boot when CONFIG_KERNEL_ZSTD=y the decompressor function will use __ashldi3(), so the object file should be added to the target object file. Fixes these build errors: mipsel-linux-ld: arch/mips/boot/compressed/decompress.o: in function `FSE_buildDTable_internal': decompress.c:(.text.FSE_buildDTable_internal+0x48): undefined reference to `__ashldi3' mipsel-linux-ld: arch/mips/boot/compressed/decompress.o: in function `FSE_decompress_wksp_body_default': decompress.c:(.text.FSE_decompress_wksp_body_default+0xa8): undefined reference to `__ashldi3' mipsel-linux-ld: arch/mips/boot/compressed/decompress.o: in function `ZSTD_getFrameHeader_advanced': decompress.c:(.text.ZSTD_getFrameHeader_advanced+0x134): undefined reference to `__ashldi3' Signed-off-by: Paul Cercueil Reviewed-by: Randy Dunlap Signed-off-by: Thomas Bogendoerfer --- arch/mips/boot/compressed/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/mips/boot/compressed/Makefile b/arch/mips/boot/compressed/Makefile index 2861a05c2e0c..f27cf31b4140 100644 --- a/arch/mips/boot/compressed/Makefile +++ b/arch/mips/boot/compressed/Makefile @@ -52,7 +52,7 @@ endif vmlinuzobjs-$(CONFIG_KERNEL_XZ) += $(obj)/ashldi3.o -vmlinuzobjs-$(CONFIG_KERNEL_ZSTD) += $(obj)/bswapdi.o +vmlinuzobjs-$(CONFIG_KERNEL_ZSTD) += $(obj)/bswapdi.o $(obj)/ashldi3.o targets := $(notdir $(vmlinuzobjs-y)) -- cgit v1.2.3 From 53ae7230918154d1f4281d7aa3aae9650436eadf Mon Sep 17 00:00:00 2001 From: Ilie Halip Date: Wed, 17 Nov 2021 19:48:21 +0200 Subject: s390/test_unwind: use raw opcode instead of invalid instruction Building with clang & LLVM_IAS=1 leads to an error: arch/s390/lib/test_unwind.c:179:4: error: invalid register pair " mvcl %%r1,%%r1\n" ^ The test creates an invalid instruction that would trap at runtime, but the LLVM inline assembler tries to validate it at compile time too. Use the raw instruction opcode instead. Reported-by: Nick Desaulniers Signed-off-by: Ilie Halip Reviewed-by: Nick Desaulniers Suggested-by: Ulrich Weigand Link: https://github.com/ClangBuiltLinux/linux/issues/1421 Link: https://lore.kernel.org/r/20211117174822.3632412-1-ilie.halip@gmail.com Reviewed-by: Christian Borntraeger Signed-off-by: Christian Borntraeger [hca@linux.ibm.com: use illegal opcode, and update comment] Signed-off-by: Heiko Carstens --- arch/s390/lib/test_unwind.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/s390/lib/test_unwind.c b/arch/s390/lib/test_unwind.c index cfc5f5557c06..bc7973359ae2 100644 --- a/arch/s390/lib/test_unwind.c +++ b/arch/s390/lib/test_unwind.c @@ -173,10 +173,11 @@ static noinline int unwindme_func4(struct unwindme *u) } /* - * trigger specification exception + * Trigger operation exception; use insn notation to bypass + * llvm's integrated assembler sanity checks. */ asm volatile( - " mvcl %%r1,%%r1\n" + " .insn e,0x0000\n" /* illegal opcode */ "0: nopr %%r7\n" EX_TABLE(0b, 0b) :); -- cgit v1.2.3 From a0eb2da92b715d0c97b96b09979689ea09faefe6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Andr=C3=A9=20Almeida?= Date: Wed, 24 Nov 2021 10:21:12 -0300 Subject: futex: Wireup futex_waitv syscall MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Wireup futex_waitv syscall for all remaining archs. Signed-off-by: André Almeida Acked-by: Max Filippov Acked-by: Geert Uytterhoeven Tested-by: Michael Ellerman (powerpc) Signed-off-by: Arnd Bergmann --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + 8 files changed, 8 insertions(+) (limited to 'arch') diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index e4a041cd5715..ca5a32228cd6 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -488,3 +488,4 @@ 556 common landlock_restrict_self sys_landlock_restrict_self # 557 reserved for memfd_secret 558 common process_mrelease sys_process_mrelease +559 common futex_waitv sys_futex_waitv diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index 6fea1844fb95..707ae121f6d3 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -369,3 +369,4 @@ 446 common landlock_restrict_self sys_landlock_restrict_self # 447 reserved for memfd_secret 448 common process_mrelease sys_process_mrelease +449 common futex_waitv sys_futex_waitv diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 7976dff8f879..45bc32a41b90 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -448,3 +448,4 @@ 446 common landlock_restrict_self sys_landlock_restrict_self # 447 reserved for memfd_secret 448 common process_mrelease sys_process_mrelease +449 common futex_waitv sys_futex_waitv diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 6b0e11362bd2..2204bde3ce4a 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -454,3 +454,4 @@ 446 common landlock_restrict_self sys_landlock_restrict_self # 447 reserved for memfd_secret 448 common process_mrelease sys_process_mrelease +449 common futex_waitv sys_futex_waitv diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 7bef917cc84e..15109af9d075 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -528,3 +528,4 @@ 446 common landlock_restrict_self sys_landlock_restrict_self # 447 reserved for memfd_secret 448 common process_mrelease sys_process_mrelease +449 common futex_waitv sys_futex_waitv diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index 208f131659c5..d9539d28bdaa 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -451,3 +451,4 @@ 446 common landlock_restrict_self sys_landlock_restrict_self # 447 reserved for memfd_secret 448 common process_mrelease sys_process_mrelease +449 common futex_waitv sys_futex_waitv diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index c37764dc764d..46adabcb1720 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -494,3 +494,4 @@ 446 common landlock_restrict_self sys_landlock_restrict_self # 447 reserved for memfd_secret 448 common process_mrelease sys_process_mrelease +449 common futex_waitv sys_futex_waitv diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 104b327f8ac9..3e3e1a506bed 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -419,3 +419,4 @@ 446 common landlock_restrict_self sys_landlock_restrict_self # 447 reserved for memfd_secret 448 common process_mrelease sys_process_mrelease +449 common futex_waitv sys_futex_waitv -- cgit v1.2.3 From 5fe762515bc9dd0476ed1de06377d7186565da99 Mon Sep 17 00:00:00 2001 From: Chanho Park Date: Wed, 24 Nov 2021 09:50:41 +0100 Subject: arm64: dts: exynos: drop samsung,ufs-shareability-reg-offset in ExynosAutov9 samsung,ufs-shareability-reg-offset is not necessary anymore since it was integrated into the second argument of samsung,sysreg. Fixes: 31bbac5263aa ("arm64: dts: exynos: add initial support for exynosautov9 SoC") Signed-off-by: Chanho Park Signed-off-by: Krzysztof Kozlowski Link: https://lore.kernel.org/r/20211102064826.15796-1-chanho61.park@samsung.com Link: https://lore.kernel.org/r/20211124085042.9649-2-krzysztof.kozlowski@canonical.com' Signed-off-by: Arnd Bergmann --- arch/arm64/boot/dts/exynos/exynosautov9.dtsi | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/arm64/boot/dts/exynos/exynosautov9.dtsi b/arch/arm64/boot/dts/exynos/exynosautov9.dtsi index 3e4727344b4a..a960c0bc2dba 100644 --- a/arch/arm64/boot/dts/exynos/exynosautov9.dtsi +++ b/arch/arm64/boot/dts/exynos/exynosautov9.dtsi @@ -296,8 +296,7 @@ pinctrl-0 = <&ufs_rst_n &ufs_refclk_out>; phys = <&ufs_0_phy>; phy-names = "ufs-phy"; - samsung,sysreg = <&syscon_fsys2>; - samsung,ufs-shareability-reg-offset = <0x710>; + samsung,sysreg = <&syscon_fsys2 0x710>; status = "disabled"; }; }; -- cgit v1.2.3 From b1c45ad53efbad779aa6cdb588de0b8ea1ed54bb Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Thu, 25 Nov 2021 10:20:55 +0100 Subject: xen: make HYPERVISOR_get_debugreg() always_inline HYPERVISOR_get_debugreg() is being called from noinstr code, so it should be attributed "always_inline". Fixes: f4afb713e5c3a4419ba ("x86/xen: Make get_debugreg() noinstr") Reported-by: kernel test robot Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky Link: https://lore.kernel.org/r/20211125092056.24758-2-jgross@suse.com Signed-off-by: Boris Ostrovsky --- arch/x86/include/asm/xen/hypercall.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h index 0575f5863b7f..28ca1119606b 100644 --- a/arch/x86/include/asm/xen/hypercall.h +++ b/arch/x86/include/asm/xen/hypercall.h @@ -287,7 +287,7 @@ HYPERVISOR_set_debugreg(int reg, unsigned long value) return _hypercall2(int, set_debugreg, reg, value); } -static inline unsigned long +static __always_inline unsigned long HYPERVISOR_get_debugreg(int reg) { return _hypercall1(unsigned long, get_debugreg, reg); -- cgit v1.2.3 From 00db58cf21188f4b99bc5f15fcc2995e30e4a9fe Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Thu, 25 Nov 2021 10:20:56 +0100 Subject: xen: make HYPERVISOR_set_debugreg() always_inline HYPERVISOR_set_debugreg() is being called from noinstr code, so it should be attributed "always_inline". Fixes: 7361fac0465ba96ec8f ("x86/xen: Make set_debugreg() noinstr") Reported-by: kernel test robot Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky Link: https://lore.kernel.org/r/20211125092056.24758-3-jgross@suse.com Signed-off-by: Boris Ostrovsky --- arch/x86/include/asm/xen/hypercall.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h index 28ca1119606b..e5e0fe10c692 100644 --- a/arch/x86/include/asm/xen/hypercall.h +++ b/arch/x86/include/asm/xen/hypercall.h @@ -281,7 +281,7 @@ HYPERVISOR_callback_op(int cmd, void *arg) return _hypercall2(int, callback_op, cmd, arg); } -static inline int +static __always_inline int HYPERVISOR_set_debugreg(int reg, unsigned long value) { return _hypercall2(int, set_debugreg, reg, value); -- cgit v1.2.3 From 1cab5bd69eb1f995ced2d7576cb15f8a8941fd85 Mon Sep 17 00:00:00 2001 From: Tiezhu Yang Date: Thu, 25 Nov 2021 19:39:32 +0800 Subject: MIPS: Fix using smp_processor_id() in preemptible in show_cpuinfo() There exists the following issue under DEBUG_PREEMPT: BUG: using smp_processor_id() in preemptible [00000000] code: systemd/1 caller is show_cpuinfo+0x460/0xea0 ... Call Trace: [] show_stack+0x94/0x128 [] dump_stack_lvl+0x94/0xd8 [] check_preemption_disabled+0x104/0x110 [] show_cpuinfo+0x460/0xea0 [] seq_read_iter+0xfc/0x4f8 [] new_sync_read+0x110/0x1b8 [] vfs_read+0x1b4/0x1d0 [] ksys_read+0xd0/0x110 [] syscall_common+0x34/0x58 We can see the following call trace: show_cpuinfo() cpu_has_fpu current_cpu_data smp_processor_id() $ addr2line -f -e vmlinux 0xffffffff802209c8 show_cpuinfo arch/mips/kernel/proc.c:188 $ head -188 arch/mips/kernel/proc.c | tail -1 if (cpu_has_fpu) arch/mips/include/asm/cpu-features.h # define cpu_has_fpu (current_cpu_data.options & MIPS_CPU_FPU) arch/mips/include/asm/cpu-info.h #define current_cpu_data cpu_data[smp_processor_id()] Based on the above analysis, fix the issue by using raw_cpu_has_fpu which calls raw_smp_processor_id() in show_cpuinfo(). Fixes: 626bfa037299 ("MIPS: kernel: proc: add CPU option reporting") Signed-off-by: Tiezhu Yang Signed-off-by: Thomas Bogendoerfer --- arch/mips/kernel/proc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/mips/kernel/proc.c b/arch/mips/kernel/proc.c index 376a6e2676e9..9f47a889b047 100644 --- a/arch/mips/kernel/proc.c +++ b/arch/mips/kernel/proc.c @@ -185,7 +185,7 @@ static int show_cpuinfo(struct seq_file *m, void *v) seq_puts(m, " tx39_cache"); if (cpu_has_octeon_cache) seq_puts(m, " octeon_cache"); - if (cpu_has_fpu) + if (raw_cpu_has_fpu) seq_puts(m, " fpu"); if (cpu_has_32fpr) seq_puts(m, " 32fpr"); -- cgit v1.2.3 From 7db5e9e9e5e6c10d7d26f8df7f8fd8841cb15ee7 Mon Sep 17 00:00:00 2001 From: Huang Pei Date: Thu, 25 Nov 2021 18:59:49 +0800 Subject: MIPS: loongson64: fix FTLB configuration It turns out that 'decode_configs' -> 'set_ftlb_enable' is called under c->cputype unset, which leaves FTLB disabled on BOTH 3A2000 and 3A3000 Fix it by calling "decode_configs" after c->cputype is initialized Fixes: da1bd29742b1 ("MIPS: Loongson64: Probe CPU features via CPUCFG") Signed-off-by: Huang Pei Signed-off-by: Thomas Bogendoerfer --- arch/mips/kernel/cpu-probe.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c index ac0e2cfc6d57..24a529c6c4be 100644 --- a/arch/mips/kernel/cpu-probe.c +++ b/arch/mips/kernel/cpu-probe.c @@ -1734,8 +1734,6 @@ static inline void decode_cpucfg(struct cpuinfo_mips *c) static inline void cpu_probe_loongson(struct cpuinfo_mips *c, unsigned int cpu) { - decode_configs(c); - /* All Loongson processors covered here define ExcCode 16 as GSExc. */ c->options |= MIPS_CPU_GSEXCEX; @@ -1796,6 +1794,8 @@ static inline void cpu_probe_loongson(struct cpuinfo_mips *c, unsigned int cpu) panic("Unknown Loongson Processor ID!"); break; } + + decode_configs(c); } #else static inline void cpu_probe_loongson(struct cpuinfo_mips *c, unsigned int cpu) { } -- cgit v1.2.3 From 1f80d15020d7f130194821feb1432b67648c632d Mon Sep 17 00:00:00 2001 From: Catalin Marinas Date: Thu, 25 Nov 2021 15:20:14 +0000 Subject: KVM: arm64: Avoid setting the upper 32 bits of TCR_EL2 and CPTR_EL2 to 1 Having a signed (1 << 31) constant for TCR_EL2_RES1 and CPTR_EL2_TCPAC causes the upper 32-bit to be set to 1 when assigning them to a 64-bit variable. Bit 32 in TCR_EL2 is no longer RES0 in ARMv8.7: with FEAT_LPA2 it changes the meaning of bits 49:48 and 9:8 in the stage 1 EL2 page table entries. As a result of the sign-extension, a non-VHE kernel can no longer boot on a model with ARMv8.7 enabled. CPTR_EL2 still has the top 32 bits RES0 but we should preempt any future problems Make these top bit constants unsigned as per commit df655b75c43f ("arm64: KVM: Avoid setting the upper 32 bits of VTCR_EL2 to 1"). Signed-off-by: Catalin Marinas Reported-by: Chris January Cc: Cc: Will Deacon Cc: Marc Zyngier Signed-off-by: Marc Zyngier Link: https://lore.kernel.org/r/20211125152014.2806582-1-catalin.marinas@arm.com --- arch/arm64/include/asm/kvm_arm.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index a39fcf318c77..01d47c5886dc 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -91,7 +91,7 @@ #define HCR_HOST_VHE_FLAGS (HCR_RW | HCR_TGE | HCR_E2H) /* TCR_EL2 Registers bits */ -#define TCR_EL2_RES1 ((1 << 31) | (1 << 23)) +#define TCR_EL2_RES1 ((1U << 31) | (1 << 23)) #define TCR_EL2_TBI (1 << 20) #define TCR_EL2_PS_SHIFT 16 #define TCR_EL2_PS_MASK (7 << TCR_EL2_PS_SHIFT) @@ -276,7 +276,7 @@ #define CPTR_EL2_TFP_SHIFT 10 /* Hyp Coprocessor Trap Register */ -#define CPTR_EL2_TCPAC (1 << 31) +#define CPTR_EL2_TCPAC (1U << 31) #define CPTR_EL2_TAM (1 << 30) #define CPTR_EL2_TTA (1 << 20) #define CPTR_EL2_TFP (1 << CPTR_EL2_TFP_SHIFT) -- cgit v1.2.3 From 41ce097f714401e6ad8f3f5eb30d7f91b0b5e495 Mon Sep 17 00:00:00 2001 From: Huang Pei Date: Thu, 25 Nov 2021 18:59:48 +0800 Subject: MIPS: use 3-level pgtable for 64KB page size on MIPS_VA_BITS_48 It hangup when booting Loongson 3A1000 with BOTH CONFIG_PAGE_SIZE_64KB and CONFIG_MIPS_VA_BITS_48, that it turn out to use 2-level pgtable instead of 3-level. 64KB page size with 2-level pgtable only cover 42 bits VA, use 3-level pgtable to cover all 48 bits VA(55 bits) Fixes: 1e321fa917fb ("MIPS64: Support of at least 48 bits of SEGBITS) Signed-off-by: Huang Pei Signed-off-by: Thomas Bogendoerfer --- arch/mips/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index de60ad190057..0215dc1529e9 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -3097,7 +3097,7 @@ config STACKTRACE_SUPPORT config PGTABLE_LEVELS int default 4 if PAGE_SIZE_4KB && MIPS_VA_BITS_48 - default 3 if 64BIT && !PAGE_SIZE_64KB + default 3 if 64BIT && (!PAGE_SIZE_64KB || MIPS_VA_BITS_48) default 2 config MIPS_AUTO_PFN_OFFSET -- cgit v1.2.3 From 8503fea6761de32b72585001ac94e5f81ce8ca44 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 18:20:16 -0500 Subject: KVM: VMX: do not use uninitialized gfn_to_hva_cache An uninitialized gfn_to_hva_cache has ghc->len == 0, which causes the accessors to croak very loudly. While a BUG_ON is definitely _too_ loud and a bug on its own, there is indeed an issue of using the caches in such a way that they could not have been initialized, because ghc->gpa == 0 might match and thus kvm_gfn_to_hva_cache_init would not be called. For the vmcs12_cache, the solution is simply to invoke kvm_gfn_to_hva_cache_init unconditionally: we already know that the cache does not match the current VMCS pointer. For the shadow_vmcs12_cache, there is no similar condition that checks the VMCS link pointer, so invalidate the cache on VMXON. Fixes: cee66664dcd6 ("KVM: nVMX: Use a gfn_to_hva_cache for vmptrld") Acked-by: David Woodhouse Reported-by: syzbot+7b7db8bb4db6fd5e157b@syzkaller.appspotmail.com Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 1e2f66951566..315fa456d368 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -4857,6 +4857,7 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu) if (!vmx->nested.cached_vmcs12) goto out_cached_vmcs12; + vmx->nested.shadow_vmcs12_cache.gpa = INVALID_GPA; vmx->nested.cached_shadow_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL_ACCOUNT); if (!vmx->nested.cached_shadow_vmcs12) goto out_cached_shadow_vmcs12; @@ -5289,8 +5290,7 @@ static int handle_vmptrld(struct kvm_vcpu *vcpu) struct gfn_to_hva_cache *ghc = &vmx->nested.vmcs12_cache; struct vmcs_hdr hdr; - if (ghc->gpa != vmptr && - kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, vmptr, VMCS12_SIZE)) { + if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, vmptr, VMCS12_SIZE)) { /* * Reads from an unbacked page return all 1s, * which means that the 32 bits located at the -- cgit v1.2.3 From 78311a514099932cd8434d5d2194aa94e56ab67c Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Wed, 17 Nov 2021 07:35:44 -0500 Subject: KVM: x86: ignore APICv if LAPIC is not enabled Synchronize the two calls to kvm_x86_sync_pir_to_irr. The one in the reenter-guest fast path invoked the callback unconditionally even if LAPIC is present but disabled. In this case, there are no interrupts to deliver, and therefore posted interrupts can be ignored. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5a403d92833f..441f4769173e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -9849,7 +9849,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST)) break; - if (vcpu->arch.apicv_active) + if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) static_call(kvm_x86_sync_pir_to_irr)(vcpu); if (unlikely(kvm_vcpu_exit_request(vcpu))) { -- cgit v1.2.3 From 30d7c5d60a886e3c89633ccf0ea4865276a759fe Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Thu, 18 Nov 2021 04:41:34 -0500 Subject: KVM: SEV: expose KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM capability The capability, albeit present, was never exposed via KVM_CHECK_EXTENSION. Fixes: b56639318bb2 ("KVM: SEV: Add support for SEV intra host migration") Cc: Peter Gonda Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 1 + 1 file changed, 1 insertion(+) (limited to 'arch') diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 441f4769173e..30c4d72bf717 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4133,6 +4133,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_SGX_ATTRIBUTE: #endif case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM: + case KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM: case KVM_CAP_SREGS2: case KVM_CAP_EXIT_ON_EMULATION_FAILURE: case KVM_CAP_VCPU_ATTRIBUTES: -- cgit v1.2.3 From 2b4a5a5d56881ece3c66b9a9a8943a6f41bd7349 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 25 Nov 2021 01:49:43 +0000 Subject: KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST Flush the current VPID when handling KVM_REQ_TLB_FLUSH_GUEST instead of always flushing vpid01. Any TLB flush that is triggered when L2 is active is scoped to L2's VPID (if it has one), e.g. if L2 toggles CR4.PGE and L1 doesn't intercept PGE writes, then KVM's emulation of the TLB flush needs to be applied to L2's VPID. Reported-by: Lai Jiangshan Fixes: 07ffaf343e34 ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson Message-Id: <20211125014944.536398-2-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index ba66c171d951..18971cfadd4f 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2918,6 +2918,13 @@ static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu) } } +static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu) +{ + if (is_guest_mode(vcpu)) + return nested_get_vpid02(vcpu); + return to_vmx(vcpu)->vpid; +} + static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu) { struct kvm_mmu *mmu = vcpu->arch.mmu; @@ -2930,31 +2937,29 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu) if (enable_ept) ept_sync_context(construct_eptp(vcpu, root_hpa, mmu->shadow_root_level)); - else if (!is_guest_mode(vcpu)) - vpid_sync_context(to_vmx(vcpu)->vpid); else - vpid_sync_context(nested_get_vpid02(vcpu)); + vpid_sync_context(vmx_get_current_vpid(vcpu)); } static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr) { /* - * vpid_sync_vcpu_addr() is a nop if vmx->vpid==0, see the comment in + * vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in * vmx_flush_tlb_guest() for an explanation of why this is ok. */ - vpid_sync_vcpu_addr(to_vmx(vcpu)->vpid, addr); + vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr); } static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu) { /* - * vpid_sync_context() is a nop if vmx->vpid==0, e.g. if enable_vpid==0 - * or a vpid couldn't be allocated for this vCPU. VM-Enter and VM-Exit - * are required to flush GVA->{G,H}PA mappings from the TLB if vpid is + * vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a + * vpid couldn't be allocated for this vCPU. VM-Enter and VM-Exit are + * required to flush GVA->{G,H}PA mappings from the TLB if vpid is * disabled (VM-Enter with vpid enabled and vpid==0 is disallowed), * i.e. no explicit INVVPID is necessary. */ - vpid_sync_context(to_vmx(vcpu)->vpid); + vpid_sync_context(vmx_get_current_vpid(vcpu)); } void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu) -- cgit v1.2.3 From 40e5f9080472b614eeedcc5ba678289cd98d70df Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 25 Nov 2021 01:49:43 +0000 Subject: KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at nested transitions, as KVM doesn't track requests for L1 vs L2. E.g. if there's a pending flush when a nested VM-Exit occurs, then the flush was requested in the context of L2 and needs to be handled before switching to L1, otherwise the flush for L2 would effectiely be lost. Opportunistically add a helper to handle CURRENT and GUEST as a pair, the logic for when they need to be serviced is identical as both requests are tied to L1 vs. L2, the only difference is the scope of the flush. Reported-by: Lai Jiangshan Fixes: 07ffaf343e34 ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson Message-Id: <20211125014944.536398-2-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 8 +++----- arch/x86/kvm/x86.c | 28 ++++++++++++++++++++++++---- arch/x86/kvm/x86.h | 7 +------ 3 files changed, 28 insertions(+), 15 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 315fa456d368..8e55aaef33ee 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -3344,8 +3344,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu, }; u32 failed_index; - if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu)) - kvm_vcpu_flush_tlb_current(vcpu); + kvm_service_local_tlb_flush_requests(vcpu); evaluate_pending_interrupts = exec_controls_get(vmx) & (CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING); @@ -4502,9 +4501,8 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason, (void)nested_get_evmcs_page(vcpu); } - /* Service the TLB flush request for L2 before switching to L1. */ - if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu)) - kvm_vcpu_flush_tlb_current(vcpu); + /* Service pending TLB flush requests for L2 before switching to L1. */ + kvm_service_local_tlb_flush_requests(vcpu); /* * VCPU_EXREG_PDPTR will be clobbered in arch/x86/kvm/vmx/vmx.h between diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 30c4d72bf717..028151c309c9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3258,6 +3258,29 @@ static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu) static_call(kvm_x86_tlb_flush_guest)(vcpu); } + +static inline void kvm_vcpu_flush_tlb_current(struct kvm_vcpu *vcpu) +{ + ++vcpu->stat.tlb_flush; + static_call(kvm_x86_tlb_flush_current)(vcpu); +} + +/* + * Service "local" TLB flush requests, which are specific to the current MMU + * context. In addition to the generic event handling in vcpu_enter_guest(), + * TLB flushes that are targeted at an MMU context also need to be serviced + * prior before nested VM-Enter/VM-Exit. + */ +void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu) +{ + if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu)) + kvm_vcpu_flush_tlb_current(vcpu); + + if (kvm_check_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu)) + kvm_vcpu_flush_tlb_guest(vcpu); +} +EXPORT_SYMBOL_GPL(kvm_service_local_tlb_flush_requests); + static void record_steal_time(struct kvm_vcpu *vcpu) { struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache; @@ -9649,10 +9672,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) /* Flushing all ASIDs flushes the current ASID... */ kvm_clear_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); } - if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu)) - kvm_vcpu_flush_tlb_current(vcpu); - if (kvm_check_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu)) - kvm_vcpu_flush_tlb_guest(vcpu); + kvm_service_local_tlb_flush_requests(vcpu); if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) { vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS; diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 997669ae9caa..4abcd8d9836d 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -103,6 +103,7 @@ static inline unsigned int __shrink_ple_window(unsigned int val, #define MSR_IA32_CR_PAT_DEFAULT 0x0007040600070406ULL +void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu); int kvm_check_nested_events(struct kvm_vcpu *vcpu); static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu) @@ -185,12 +186,6 @@ static inline bool mmu_is_nested(struct kvm_vcpu *vcpu) return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu; } -static inline void kvm_vcpu_flush_tlb_current(struct kvm_vcpu *vcpu) -{ - ++vcpu->stat.tlb_flush; - static_call(kvm_x86_tlb_flush_current)(vcpu); -} - static inline int is_pae(struct kvm_vcpu *vcpu) { return kvm_read_cr4_bits(vcpu, X86_CR4_PAE); -- cgit v1.2.3 From 712494de96f35f3e146b36b752c2afe0fdc0f0cc Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Thu, 25 Nov 2021 01:49:44 +0000 Subject: KVM: nVMX: Emulate guest TLB flush on nested VM-Enter with new vpid12 Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12, i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's TLB entries for vpid02. From L1's perspective, changing L2's VPID is effectively a TLB flush unless "hardware" has previously cached entries for the new vpid12. Because KVM tracks only a single vpid12, KVM doesn't know if the new vpid12 has been used in the past and so must treat it as a brand new, never been used VPID, i.e. must assume that the new vpid12 represents a TLB flush from L1's perspective. For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12 and thus can't have cached entries in the TLB for vpid12. Reported-by: Lai Jiangshan Fixes: 5c614b3583e7 ("KVM: nVMX: nested VPID emulation") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson Message-Id: <20211125014944.536398-3-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 37 +++++++++++++++++-------------------- 1 file changed, 17 insertions(+), 20 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 8e55aaef33ee..64f2828035c2 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -1162,29 +1162,26 @@ static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu, WARN_ON(!enable_vpid); /* - * If VPID is enabled and used by vmc12, but L2 does not have a unique - * TLB tag (ASID), i.e. EPT is disabled and KVM was unable to allocate - * a VPID for L2, flush the current context as the effective ASID is - * common to both L1 and L2. - * - * Defer the flush so that it runs after vmcs02.EPTP has been set by - * KVM_REQ_LOAD_MMU_PGD (if nested EPT is enabled) and to avoid - * redundant flushes further down the nested pipeline. - * - * If a TLB flush isn't required due to any of the above, and vpid12 is - * changing then the new "virtual" VPID (vpid12) will reuse the same - * "real" VPID (vpid02), and so needs to be flushed. There's no direct - * mapping between vpid02 and vpid12, vpid02 is per-vCPU and reused for - * all nested vCPUs. Remember, a flush on VM-Enter does not invalidate - * guest-physical mappings, so there is no need to sync the nEPT MMU. + * VPID is enabled and in use by vmcs12. If vpid12 is changing, then + * emulate a guest TLB flush as KVM does not track vpid12 history nor + * is the VPID incorporated into the MMU context. I.e. KVM must assume + * that the new vpid12 has never been used and thus represents a new + * guest ASID that cannot have entries in the TLB. */ - if (!nested_has_guest_tlb_tag(vcpu)) { - kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); - } else if (is_vmenter && - vmcs12->virtual_processor_id != vmx->nested.last_vpid) { + if (is_vmenter && vmcs12->virtual_processor_id != vmx->nested.last_vpid) { vmx->nested.last_vpid = vmcs12->virtual_processor_id; - vpid_sync_context(nested_get_vpid02(vcpu)); + kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu); + return; } + + /* + * If VPID is enabled, used by vmc12, and vpid12 is not changing but + * does not have a unique TLB tag (ASID), i.e. EPT is disabled and + * KVM was unable to allocate a VPID for L2, flush the current context + * as the effective ASID is common to both L1 and L2. + */ + if (!nested_has_guest_tlb_tag(vcpu)) + kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); } static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask) -- cgit v1.2.3 From feb627e8d6f69c9a319fe279710959efb3eba873 Mon Sep 17 00:00:00 2001 From: Vitaly Kuznetsov Date: Mon, 22 Nov 2021 18:58:18 +0100 Subject: KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN Commit 63f5a1909f9e ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken") officially deprecated KVM_SET_CPUID{,2} ioctls after first successful KVM_RUN and promissed to make this sequence forbiden in 5.16. It's time to fulfil the promise. Signed-off-by: Vitaly Kuznetsov Message-Id: <20211122175818.608220-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 28 +++++++++++----------------- arch/x86/kvm/x86.c | 19 +++++++++++++++++++ 2 files changed, 30 insertions(+), 17 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0c839ee1282c..0c44581721b0 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5025,6 +5025,14 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu) /* * Invalidate all MMU roles to force them to reinitialize as CPUID * information is factored into reserved bit calculations. + * + * Correctly handling multiple vCPU models with respect to paging and + * physical address properties) in a single VM would require tracking + * all relevant CPUID information in kvm_mmu_page_role. That is very + * undesirable as it would increase the memory requirements for + * gfn_track (see struct kvm_mmu_page_role comments). For now that + * problem is swept under the rug; KVM's CPUID API is horrific and + * it's all but impossible to solve it without introducing a new API. */ vcpu->arch.root_mmu.mmu_role.ext.valid = 0; vcpu->arch.guest_mmu.mmu_role.ext.valid = 0; @@ -5032,24 +5040,10 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu) kvm_mmu_reset_context(vcpu); /* - * KVM does not correctly handle changing guest CPUID after KVM_RUN, as - * MAXPHYADDR, GBPAGES support, AMD reserved bit behavior, etc.. aren't - * tracked in kvm_mmu_page_role. As a result, KVM may miss guest page - * faults due to reusing SPs/SPTEs. Alert userspace, but otherwise - * sweep the problem under the rug. - * - * KVM's horrific CPUID ABI makes the problem all but impossible to - * solve, as correctly handling multiple vCPU models (with respect to - * paging and physical address properties) in a single VM would require - * tracking all relevant CPUID information in kvm_mmu_page_role. That - * is very undesirable as it would double the memory requirements for - * gfn_track (see struct kvm_mmu_page_role comments), and in practice - * no sane VMM mucks with the core vCPU model on the fly. + * Changing guest CPUID after KVM_RUN is forbidden, see the comment in + * kvm_arch_vcpu_ioctl(). */ - if (vcpu->arch.last_vmentry_cpu != -1) { - pr_warn_ratelimited("KVM: KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability\n"); - pr_warn_ratelimited("KVM: KVM_SET_CPUID{,2} will fail after KVM_RUN starting with Linux 5.16\n"); - } + KVM_BUG_ON(vcpu->arch.last_vmentry_cpu != -1, vcpu->kvm); } void kvm_mmu_reset_context(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 028151c309c9..817898eab7c3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5148,6 +5148,17 @@ long kvm_arch_vcpu_ioctl(struct file *filp, struct kvm_cpuid __user *cpuid_arg = argp; struct kvm_cpuid cpuid; + /* + * KVM does not correctly handle changing guest CPUID after KVM_RUN, as + * MAXPHYADDR, GBPAGES support, AMD reserved bit behavior, etc.. aren't + * tracked in kvm_mmu_page_role. As a result, KVM may miss guest page + * faults due to reusing SPs/SPTEs. In practice no sane VMM mucks with + * the core vCPU model on the fly, so fail. + */ + r = -EINVAL; + if (vcpu->arch.last_vmentry_cpu != -1) + goto out; + r = -EFAULT; if (copy_from_user(&cpuid, cpuid_arg, sizeof(cpuid))) goto out; @@ -5158,6 +5169,14 @@ long kvm_arch_vcpu_ioctl(struct file *filp, struct kvm_cpuid2 __user *cpuid_arg = argp; struct kvm_cpuid2 cpuid; + /* + * KVM_SET_CPUID{,2} after KVM_RUN is forbidded, see the comment in + * KVM_SET_CPUID case above. + */ + r = -EINVAL; + if (vcpu->arch.last_vmentry_cpu != -1) + goto out; + r = -EFAULT; if (copy_from_user(&cpuid, cpuid_arg, sizeof(cpuid))) goto out; -- cgit v1.2.3 From 12ec33a705749e18d9588b0a0e69e02821371156 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Wed, 24 Nov 2021 20:20:43 +0800 Subject: KVM: X86: Fix when shadow_root_level=5 && guest root_level<4 If the is an L1 with nNPT in 32bit, the shadow walk starts with pae_root. Fixes: a717a780fc4e ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host) Signed-off-by: Lai Jiangshan Message-Id: <20211124122055.64424-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0c44581721b0..d7ae369ec8c2 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2173,10 +2173,10 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato iterator->shadow_addr = root; iterator->level = vcpu->arch.mmu->shadow_root_level; - if (iterator->level == PT64_ROOT_4LEVEL && + if (iterator->level >= PT64_ROOT_4LEVEL && vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL && !vcpu->arch.mmu->direct_map) - --iterator->level; + iterator->level = PT32E_ROOT_LEVEL; if (iterator->level == PT32E_ROOT_LEVEL) { /* -- cgit v1.2.3 From 05b29633c7a956d5675f5fbba70db0d26aa5e73e Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Wed, 24 Nov 2021 20:20:46 +0800 Subject: KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg() INVLPG operates on guest virtual address, which are represented by vcpu->arch.walk_mmu. In nested virtualization scenarios, kvm_mmu_invlpg() was using the wrong MMU structure; if L2's invlpg were emulated by L0 (in practice, it hardly happen) when nested two-dimensional paging is enabled, the call to ->tlb_flush_gva() would be skipped and the hardware TLB entry would not be invalidated. Signed-off-by: Lai Jiangshan Message-Id: <20211124122055.64424-5-jiangshanlai@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index d7ae369ec8c2..5942e9c6dd6e 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5363,7 +5363,7 @@ void kvm_mmu_invalidate_gva(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva) { - kvm_mmu_invalidate_gva(vcpu, vcpu->arch.mmu, gva, INVALID_PAGE); + kvm_mmu_invalidate_gva(vcpu, vcpu->arch.walk_mmu, gva, INVALID_PAGE); ++vcpu->stat.invlpg; } EXPORT_SYMBOL_GPL(kvm_mmu_invlpg); -- cgit v1.2.3 From 21e96a2035db43fc72f7023c4577a63ca606de86 Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Tue, 23 Nov 2021 11:55:06 +0100 Subject: iommu/vt-d: Remove unused PASID_DISABLED The macro is unused after commit 00ecd5401349a so it can be removed. Reported-by: Linus Torvalds Fixes: 00ecd5401349a ("iommu/vt-d: Clean up unused PASID updating functions") Signed-off-by: Joerg Roedel Reviewed-by: Lu Baolu Link: https://lore.kernel.org/r/20211123105507.7654-2-joro@8bytes.org --- arch/x86/include/asm/fpu/api.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'arch') diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index 6053674f9132..c2767a6a387e 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -102,12 +102,6 @@ extern void switch_fpu_return(void); */ extern int cpu_has_xfeatures(u64 xfeatures_mask, const char **feature_name); -/* - * Tasks that are not using SVA have mm->pasid set to zero to note that they - * will not have the valid bit set in MSR_IA32_PASID while they are running. - */ -#define PASID_DISABLED 0 - /* Trap handling */ extern int fpu__exception_code(struct fpu *fpu, int trap_nr); extern void fpu_sync_fpstate(struct fpu *fpu); -- cgit v1.2.3 From 1f0e290cc5fd818d002e0a83b0ea8eceb8f2c515 Mon Sep 17 00:00:00 2001 From: Guenter Roeck Date: Sat, 27 Nov 2021 07:44:40 -0800 Subject: arch: Add generic Kconfig option indicating page size smaller than 64k NTFS_RW and VMXNET3 require a page size smaller than 64kB. Add generic Kconfig option for use outside architecture code to avoid architecture specific Kconfig options in that code. Suggested-by: Michael Ellerman Signed-off-by: Guenter Roeck Cc: Anton Altaparmakov Signed-off-by: Linus Torvalds --- arch/Kconfig | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'arch') diff --git a/arch/Kconfig b/arch/Kconfig index 26b8ed11639d..d3c4ab249e9c 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -991,6 +991,16 @@ config HAVE_ARCH_COMPAT_MMAP_BASES and vice-versa 32-bit applications to call 64-bit mmap(). Required for applications doing different bitness syscalls. +config PAGE_SIZE_LESS_THAN_64KB + def_bool y + depends on !ARM64_64K_PAGES + depends on !IA64_PAGE_SIZE_64KB + depends on !PAGE_SIZE_64KB + depends on !PARISC_PAGE_SIZE_64KB + depends on !PPC_64K_PAGES + depends on !PPC_256K_PAGES + depends on !PAGE_SIZE_256KB + # This allows to use a set of generic functions to determine mmap base # address by giving priority to top-down scheme only if the process # is not in legacy mode (compat task, unlimited stack size or -- cgit v1.2.3 From 52d04d408185b7aa47628d2339c28ec70074e0ae Mon Sep 17 00:00:00 2001 From: Niklas Schnelle Date: Thu, 4 Nov 2021 15:04:10 +0100 Subject: s390/pci: move pseudo-MMIO to prevent MIO overlap When running without MIO support, with pci=nomio or for devices which are not MIO-capable the zPCI subsystem generates pseudo-MMIO addresses to allow access to PCI BARs via MMIO based Linux APIs even though the platform uses function handles and BAR numbers. This is done by stashing an index into our global IOMAP array which contains the function handle in the 16 most significant bits of the addresses returned by ioremap() always setting the most significant bit. On the other hand the MIO addresses assigned by the platform for use, while requiring special instructions, allow PCI access with virtually mapped physical addresses. Now the problem is that these MIO addresses and our own pseudo-MMIO addresses may overlap, while functionally this would not be a problem by itself this overlap is detected by common code as both address types are added as resources in the iomem_resource tree. This leads to the overlapping resource claim of either the MIO capable or non-MIO capable devices with being rejected. Since PCI is tightly coupled to the use of the iomem_resource tree, see for example the code for request_mem_region(), we can't reasonably get rid of the overlap being detected by keeping our pseudo-MMIO addresses out of the iomem_resource tree. Instead let's move the range used by our own pseudo-MMIO addresses by starting at (1UL << 62) and only using addresses below (1UL << 63) thus avoiding the range currently used for MIO addresses. Fixes: c7ff0e918a7c ("s390/pci: deal with devices that have no support for MIO instructions") Cc: stable@vger.kernel.org # 5.3+ Reviewed-by: Pierre Morel Signed-off-by: Niklas Schnelle Signed-off-by: Heiko Carstens --- arch/s390/include/asm/pci_io.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'arch') diff --git a/arch/s390/include/asm/pci_io.h b/arch/s390/include/asm/pci_io.h index e4dc64cc9c55..287bb88f7698 100644 --- a/arch/s390/include/asm/pci_io.h +++ b/arch/s390/include/asm/pci_io.h @@ -14,12 +14,13 @@ /* I/O Map */ #define ZPCI_IOMAP_SHIFT 48 -#define ZPCI_IOMAP_ADDR_BASE 0x8000000000000000UL +#define ZPCI_IOMAP_ADDR_SHIFT 62 +#define ZPCI_IOMAP_ADDR_BASE (1UL << ZPCI_IOMAP_ADDR_SHIFT) #define ZPCI_IOMAP_ADDR_OFF_MASK ((1UL << ZPCI_IOMAP_SHIFT) - 1) #define ZPCI_IOMAP_MAX_ENTRIES \ - ((ULONG_MAX - ZPCI_IOMAP_ADDR_BASE + 1) / (1UL << ZPCI_IOMAP_SHIFT)) + (1UL << (ZPCI_IOMAP_ADDR_SHIFT - ZPCI_IOMAP_SHIFT)) #define ZPCI_IOMAP_ADDR_IDX_MASK \ - (~ZPCI_IOMAP_ADDR_OFF_MASK - ZPCI_IOMAP_ADDR_BASE) + ((ZPCI_IOMAP_ADDR_BASE - 1) & ~ZPCI_IOMAP_ADDR_OFF_MASK) struct zpci_iomap_entry { u32 fh; -- cgit v1.2.3 From 7533377215b6ee432c06c5855f6be5d66e694e46 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 20 Nov 2021 01:50:08 +0000 Subject: KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping Use the yield-safe variant of the TDP MMU iterator when handling an unmapping event from the MMU notifier, as most occurences of the event allow yielding. Fixes: e1eed5847b09 ("KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson Message-Id: <20211120015008.3780032-1-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/tdp_mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 1f8c9f783b78..4cd6bf7e73f0 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1031,7 +1031,7 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range, { struct kvm_mmu_page *root; - for_each_tdp_mmu_root(kvm, root, range->slot->as_id) + for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false) flush = zap_gfn_range(kvm, root, range->start, range->end, range->may_block, flush, false); -- cgit v1.2.3 From 4b85c921cd393764d22c0cdab6d7d5d120aa0980 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 20 Nov 2021 04:50:21 +0000 Subject: KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path Drop the "flush" param and return values to/from the TDP MMU's helper for zapping collapsible SPTEs. Because the helper runs with mmu_lock held for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic zap handles the necessary remote TLB flush. Similarly, because mmu_lock is dropped and re-acquired between zapping legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must handle remote TLB flushes from the legacy MMU before calling into the TDP MMU. Fixes: e2209710ccc5d ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Sean Christopherson Message-Id: <20211120045046.3940942-4-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 9 ++------- arch/x86/kvm/mmu/tdp_mmu.c | 22 +++++++--------------- arch/x86/kvm/mmu/tdp_mmu.h | 5 ++--- 3 files changed, 11 insertions(+), 25 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 5942e9c6dd6e..1b3a7cc9d595 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -5848,8 +5848,6 @@ restart: void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, const struct kvm_memory_slot *slot) { - bool flush; - if (kvm_memslots_have_rmaps(kvm)) { write_lock(&kvm->mmu_lock); /* @@ -5857,17 +5855,14 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm, * logging at a 4k granularity and never creates collapsible * 2m SPTEs during dirty logging. */ - flush = slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true); - if (flush) + if (slot_handle_level_4k(kvm, slot, kvm_mmu_zap_collapsible_spte, true)) kvm_arch_flush_remote_tlbs_memslot(kvm, slot); write_unlock(&kvm->mmu_lock); } if (is_tdp_mmu_enabled(kvm)) { read_lock(&kvm->mmu_lock); - flush = kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot, false); - if (flush) - kvm_arch_flush_remote_tlbs_memslot(kvm, slot); + kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot); read_unlock(&kvm->mmu_lock); } } diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 4cd6bf7e73f0..1db8496259ad 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1362,10 +1362,9 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm, * Clear leaf entries which could be replaced by large mappings, for * GFNs within the slot. */ -static bool zap_collapsible_spte_range(struct kvm *kvm, +static void zap_collapsible_spte_range(struct kvm *kvm, struct kvm_mmu_page *root, - const struct kvm_memory_slot *slot, - bool flush) + const struct kvm_memory_slot *slot) { gfn_t start = slot->base_gfn; gfn_t end = start + slot->npages; @@ -1376,10 +1375,8 @@ static bool zap_collapsible_spte_range(struct kvm *kvm, tdp_root_for_each_pte(iter, root, start, end) { retry: - if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) { - flush = false; + if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) continue; - } if (!is_shadow_present_pte(iter.old_spte) || !is_last_spte(iter.old_spte, iter.level)) @@ -1391,6 +1388,7 @@ retry: pfn, PG_LEVEL_NUM)) continue; + /* Note, a successful atomic zap also does a remote TLB flush. */ if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) { /* * The iter must explicitly re-read the SPTE because @@ -1399,30 +1397,24 @@ retry: iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep)); goto retry; } - flush = true; } rcu_read_unlock(); - - return flush; } /* * Clear non-leaf entries (and free associated page tables) which could * be replaced by large mappings, for GFNs within the slot. */ -bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm, - const struct kvm_memory_slot *slot, - bool flush) +void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm, + const struct kvm_memory_slot *slot) { struct kvm_mmu_page *root; lockdep_assert_held_read(&kvm->mmu_lock); for_each_tdp_mmu_root_yield_safe(kvm, root, slot->as_id, true) - flush = zap_collapsible_spte_range(kvm, root, slot, flush); - - return flush; + zap_collapsible_spte_range(kvm, root, slot); } /* diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 476b133544dd..3899004a5d91 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -64,9 +64,8 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, unsigned long mask, bool wrprot); -bool kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm, - const struct kvm_memory_slot *slot, - bool flush); +void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm, + const struct kvm_memory_slot *slot); bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, -- cgit v1.2.3 From 28f091bc2f8c23b7eac2402956b692621be7f9f4 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 13:01:37 -0500 Subject: KVM: MMU: shadow nested paging does not have PKU Initialize the mask for PKU permissions as if CR4.PKE=0, avoiding incorrect interpretations of the nested hypervisor's page tables. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 1b3a7cc9d595..0e017a3b7c27 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4855,7 +4855,7 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0, struct kvm_mmu *context = &vcpu->arch.guest_mmu; struct kvm_mmu_role_regs regs = { .cr0 = cr0, - .cr4 = cr4, + .cr4 = cr4 & ~X86_CR4_PKE, .efer = efer, }; union kvm_mmu_role new_role; @@ -4919,7 +4919,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, context->direct_map = false; update_permission_bitmask(context, true); - update_pkru_bitmask(context); + context->pkru_mask = 0; reset_rsvds_bits_mask_ept(vcpu, context, execonly); reset_ept_shadow_zero_bits_mask(vcpu, context, execonly); } -- cgit v1.2.3 From f47491d7f30b8ab084d0b1596697a7ea4561a894 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 20 Nov 2021 01:57:06 +0000 Subject: KVM: x86/mmu: Handle "default" period when selectively waking kthread Account for the '0' being a default, "let KVM choose" period, when determining whether or not the recovery worker needs to be awakened in response to userspace reducing the period. Failure to do so results in the worker not being awakened properly, e.g. when changing the period from '0' to any small-ish value. Fixes: 4dfe4f40d845 ("kvm: x86: mmu: Make NX huge page recovery period configurable") Cc: stable@vger.kernel.org Cc: Junaid Shahid Signed-off-by: Sean Christopherson Message-Id: <20211120015706.3830341-1-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 48 +++++++++++++++++++++++++++++++++--------------- 1 file changed, 33 insertions(+), 15 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0e017a3b7c27..6354297e92ae 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6171,23 +6171,46 @@ void kvm_mmu_module_exit(void) mmu_audit_disable(); } +/* + * Calculate the effective recovery period, accounting for '0' meaning "let KVM + * select a halving time of 1 hour". Returns true if recovery is enabled. + */ +static bool calc_nx_huge_pages_recovery_period(uint *period) +{ + /* + * Use READ_ONCE to get the params, this may be called outside of the + * param setters, e.g. by the kthread to compute its next timeout. + */ + bool enabled = READ_ONCE(nx_huge_pages); + uint ratio = READ_ONCE(nx_huge_pages_recovery_ratio); + + if (!enabled || !ratio) + return false; + + *period = READ_ONCE(nx_huge_pages_recovery_period_ms); + if (!*period) { + /* Make sure the period is not less than one second. */ + ratio = min(ratio, 3600u); + *period = 60 * 60 * 1000 / ratio; + } + return true; +} + static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp) { bool was_recovery_enabled, is_recovery_enabled; uint old_period, new_period; int err; - was_recovery_enabled = nx_huge_pages_recovery_ratio; - old_period = nx_huge_pages_recovery_period_ms; + was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period); err = param_set_uint(val, kp); if (err) return err; - is_recovery_enabled = nx_huge_pages_recovery_ratio; - new_period = nx_huge_pages_recovery_period_ms; + is_recovery_enabled = calc_nx_huge_pages_recovery_period(&new_period); - if (READ_ONCE(nx_huge_pages) && is_recovery_enabled && + if (is_recovery_enabled && (!was_recovery_enabled || old_period > new_period)) { struct kvm *kvm; @@ -6251,18 +6274,13 @@ static void kvm_recover_nx_lpages(struct kvm *kvm) static long get_nx_lpage_recovery_timeout(u64 start_time) { - uint ratio = READ_ONCE(nx_huge_pages_recovery_ratio); - uint period = READ_ONCE(nx_huge_pages_recovery_period_ms); + bool enabled; + uint period; - if (!period && ratio) { - /* Make sure the period is not less than one second. */ - ratio = min(ratio, 3600u); - period = 60 * 60 * 1000 / ratio; - } + enabled = calc_nx_huge_pages_recovery_period(&period); - return READ_ONCE(nx_huge_pages) && ratio - ? start_time + msecs_to_jiffies(period) - get_jiffies_64() - : MAX_SCHEDULE_TIMEOUT; + return enabled ? start_time + msecs_to_jiffies(period) - get_jiffies_64() + : MAX_SCHEDULE_TIMEOUT; } static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data) -- cgit v1.2.3 From 7e1901f6c86c896acff6609e0176f93f756d8b2a Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:43:09 -0500 Subject: KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled If APICv is disabled for this vCPU, assigned devices may still attempt to post interrupts. In that case, we need to cancel the vmentry and deliver the interrupt with KVM_REQ_EVENT. Extend the existing code that handles injection of L1 interrupts into L2 to cover this case as well. vmx_hwapic_irr_update is only called when APICv is active so it would be confusing to add a check for vcpu->arch.apicv_active in there. Instead, just use vmx_set_rvi directly in vmx_sync_pir_to_irr. Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky Reviewed-by: David Matlack Reviewed-by: Sean Christopherson Message-Id: <20211123004311.2954158-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 39 +++++++++++++++++++++++++-------------- 1 file changed, 25 insertions(+), 14 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 18971cfadd4f..1fadec8cbf96 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6267,9 +6267,9 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); int max_irr; - bool max_irr_updated; + bool got_posted_interrupt; - if (KVM_BUG_ON(!vcpu->arch.apicv_active, vcpu->kvm)) + if (KVM_BUG_ON(!enable_apicv, vcpu->kvm)) return -EIO; if (pi_test_on(&vmx->pi_desc)) { @@ -6279,22 +6279,33 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu) * But on x86 this is just a compiler barrier anyway. */ smp_mb__after_atomic(); - max_irr_updated = + got_posted_interrupt = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir, &max_irr); - - /* - * If we are running L2 and L1 has a new pending interrupt - * which can be injected, this may cause a vmexit or it may - * be injected into L2. Either way, this interrupt will be - * processed via KVM_REQ_EVENT, not RVI, because we do not use - * virtual interrupt delivery to inject L1 interrupts into L2. - */ - if (is_guest_mode(vcpu) && max_irr_updated) - kvm_make_request(KVM_REQ_EVENT, vcpu); } else { max_irr = kvm_lapic_find_highest_irr(vcpu); + got_posted_interrupt = false; } - vmx_hwapic_irr_update(vcpu, max_irr); + + /* + * Newly recognized interrupts are injected via either virtual interrupt + * delivery (RVI) or KVM_REQ_EVENT. Virtual interrupt delivery is + * disabled in two cases: + * + * 1) If L2 is running and the vCPU has a new pending interrupt. If L1 + * wants to exit on interrupts, KVM_REQ_EVENT is needed to synthesize a + * VM-Exit to L1. If L1 doesn't want to exit, the interrupt is injected + * into L2, but KVM doesn't use virtual interrupt delivery to inject + * interrupts into L2, and so KVM_REQ_EVENT is again needed. + * + * 2) If APICv is disabled for this vCPU, assigned devices may still + * attempt to post interrupts. The posted interrupt vector will cause + * a VM-Exit and the subsequent entry will call sync_pir_to_irr. + */ + if (!is_guest_mode(vcpu) && kvm_vcpu_apicv_active(vcpu)) + vmx_set_rvi(max_irr); + else if (got_posted_interrupt) + kvm_make_request(KVM_REQ_EVENT, vcpu); + return max_irr; } -- cgit v1.2.3 From 37c4dbf337c5c2cdb24365ffae6ed70ac1e74d7a Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:43:10 -0500 Subject: KVM: x86: check PIR even for vCPUs with disabled APICv The IRTE for an assigned device can trigger a POSTED_INTR_VECTOR even if APICv is disabled on the vCPU that receives it. In that case, the interrupt will just cause a vmexit and leave the ON bit set together with the PIR bit corresponding to the interrupt. Right now, the interrupt would not be delivered until APICv is re-enabled. However, fixing this is just a matter of always doing the PIR->IRR synchronization, even if the vCPU has temporarily disabled APICv. This is not a problem for performance, or if anything it is an improvement. First, in the common case where vcpu->arch.apicv_active is true, one fewer check has to be performed. Second, static_call_cond will elide the function call if APICv is not present or disabled. Finally, in the case for AMD hardware we can remove the sync_pir_to_irr callback: it is only needed for apic_has_interrupt_for_ppr, and that function already has a fallback for !APICv. Cc: stable@vger.kernel.org Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Reviewed-by: Maxim Levitsky Reviewed-by: David Matlack Message-Id: <20211123004311.2954158-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/lapic.c | 2 +- arch/x86/kvm/svm/svm.c | 1 - arch/x86/kvm/x86.c | 18 +++++++++--------- 3 files changed, 10 insertions(+), 11 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 759952dd1222..f206fc35deff 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -707,7 +707,7 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu) static int apic_has_interrupt_for_ppr(struct kvm_lapic *apic, u32 ppr) { int highest_irr; - if (apic->vcpu->arch.apicv_active) + if (kvm_x86_ops.sync_pir_to_irr) highest_irr = static_call(kvm_x86_sync_pir_to_irr)(apic->vcpu); else highest_irr = apic_find_highest_irr(apic); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 5630c241d5f6..d0f68d11ec70 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4651,7 +4651,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .load_eoi_exitmap = svm_load_eoi_exitmap, .hwapic_irr_update = svm_hwapic_irr_update, .hwapic_isr_update = svm_hwapic_isr_update, - .sync_pir_to_irr = kvm_lapic_find_highest_irr, .apicv_post_state_restore = avic_post_state_restore, .set_tss_addr = svm_set_tss_addr, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 817898eab7c3..0ee1a039b490 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4472,8 +4472,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s) { - if (vcpu->arch.apicv_active) - static_call(kvm_x86_sync_pir_to_irr)(vcpu); + static_call_cond(kvm_x86_sync_pir_to_irr)(vcpu); return kvm_apic_get_state(vcpu, s); } @@ -9571,8 +9570,7 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) if (irqchip_split(vcpu->kvm)) kvm_scan_ioapic_routes(vcpu, vcpu->arch.ioapic_handled_vectors); else { - if (vcpu->arch.apicv_active) - static_call(kvm_x86_sync_pir_to_irr)(vcpu); + static_call_cond(kvm_x86_sync_pir_to_irr)(vcpu); if (ioapic_in_kernel(vcpu->kvm)) kvm_ioapic_scan_entry(vcpu, vcpu->arch.ioapic_handled_vectors); } @@ -9842,10 +9840,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) /* * This handles the case where a posted interrupt was - * notified with kvm_vcpu_kick. + * notified with kvm_vcpu_kick. Assigned devices can + * use the POSTED_INTR_VECTOR even if APICv is disabled, + * so do it even if APICv is disabled on this vCPU. */ - if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) - static_call(kvm_x86_sync_pir_to_irr)(vcpu); + if (kvm_lapic_enabled(vcpu)) + static_call_cond(kvm_x86_sync_pir_to_irr)(vcpu); if (kvm_vcpu_exit_request(vcpu)) { vcpu->mode = OUTSIDE_GUEST_MODE; @@ -9889,8 +9889,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST)) break; - if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) - static_call(kvm_x86_sync_pir_to_irr)(vcpu); + if (kvm_lapic_enabled(vcpu)) + static_call_cond(kvm_x86_sync_pir_to_irr)(vcpu); if (unlikely(kvm_vcpu_exit_request(vcpu))) { exit_fastpath = EXIT_FASTPATH_EXIT_HANDLED; -- cgit v1.2.3 From 53b7ca1a359389276c76fbc9e1009d8626a17e40 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:43:11 -0500 Subject: KVM: x86: Use a stable condition around all VT-d PI paths Currently, checks for whether VT-d PI can be used refer to the current status of the feature in the current vCPU; or they more or less pick vCPU 0 in case a specific vCPU is not available. However, these checks do not attempt to synchronize with changes to the IRTE. In particular, there is no path that updates the IRTE when APICv is re-activated on vCPU 0; and there is no path to wakeup a CPU that has APICv disabled, if the wakeup occurs because of an IRTE that points to a posted interrupt. To fix this, always go through the VT-d PI path as long as there are assigned devices and APICv is available on both the host and the VM side. Since the relevant condition was copied over three times, take the hint and factor it into a separate function. Suggested-by: Sean Christopherson Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson Reviewed-by: Maxim Levitsky Reviewed-by: David Matlack Message-Id: <20211123004311.2954158-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/posted_intr.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c index 5f81ef092bd4..1c94783b5a54 100644 --- a/arch/x86/kvm/vmx/posted_intr.c +++ b/arch/x86/kvm/vmx/posted_intr.c @@ -5,6 +5,7 @@ #include #include "lapic.h" +#include "irq.h" #include "posted_intr.h" #include "trace.h" #include "vmx.h" @@ -77,13 +78,18 @@ after_clear_sn: pi_set_on(pi_desc); } +static bool vmx_can_use_vtd_pi(struct kvm *kvm) +{ + return irqchip_in_kernel(kvm) && enable_apicv && + kvm_arch_has_assigned_device(kvm) && + irq_remapping_cap(IRQ_POSTING_CAP); +} + void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu) { struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); - if (!kvm_arch_has_assigned_device(vcpu->kvm) || - !irq_remapping_cap(IRQ_POSTING_CAP) || - !kvm_vcpu_apicv_active(vcpu)) + if (!vmx_can_use_vtd_pi(vcpu->kvm)) return; /* Set SN when the vCPU is preempted */ @@ -141,9 +147,7 @@ int pi_pre_block(struct kvm_vcpu *vcpu) struct pi_desc old, new; struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu); - if (!kvm_arch_has_assigned_device(vcpu->kvm) || - !irq_remapping_cap(IRQ_POSTING_CAP) || - !kvm_vcpu_apicv_active(vcpu)) + if (!vmx_can_use_vtd_pi(vcpu->kvm)) return 0; WARN_ON(irqs_disabled()); @@ -270,9 +274,7 @@ int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq, struct vcpu_data vcpu_info; int idx, ret = 0; - if (!kvm_arch_has_assigned_device(kvm) || - !irq_remapping_cap(IRQ_POSTING_CAP) || - !kvm_vcpu_apicv_active(kvm->vcpus[0])) + if (!vmx_can_use_vtd_pi(kvm)) return 0; idx = srcu_read_lock(&kvm->irq_srcu); -- cgit v1.2.3 From 4674164f0ac5fd553c38b2b8c49fe13297fed38b Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:50:28 -0500 Subject: KVM: SEV: do not use list_replace_init on an empty list list_replace_init cannot be used if the source is an empty list, because "new->next->prev = new" will overwrite "old->next": new old prev = new, next = new prev = old, next = old new->next = old->next prev = new, next = old prev = old, next = old new->next->prev = new prev = new, next = old prev = old, next = new new->prev = old->prev prev = old, next = old prev = old, next = old new->next->prev = new prev = old, next = old prev = new, next = new The desired outcome instead would be to leave both old and new the same as they were (two empty circular lists). Use list_cut_before, which already has the necessary check and is documented to discard the previous contents of the list that will hold the result. Fixes: b56639318bb2 ("KVM: SEV: Add support for SEV intra host migration") Reviewed-by: Sean Christopherson Message-Id: <20211123005036.2954379-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 21ac0a5de4e0..75955beb3770 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1613,8 +1613,7 @@ static void sev_migrate_from(struct kvm_sev_info *dst, src->handle = 0; src->pages_locked = 0; - INIT_LIST_HEAD(&dst->regions_list); - list_replace_init(&src->regions_list, &dst->regions_list); + list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list); } static int sev_es_migrate_from(struct kvm *dst, struct kvm *src) -- cgit v1.2.3 From 501b580c02339a83917cf3b44c445f2419b15dcb Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:50:29 -0500 Subject: KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM Encapsulate the handling of the migration_in_progress flag for both VMs in two functions sev_lock_two_vms and sev_unlock_two_vms. It does not matter if KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM locks the destination struct kvm a bit later, and this change 1) keeps the cleanup chain of labels smaller 2) makes it possible for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM to reuse the logic. Cc: Peter Gonda Cc: Sean Christopherson Message-Id: <20211123005036.2954379-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 53 +++++++++++++++++++++++++------------------------- 1 file changed, 27 insertions(+), 26 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 75955beb3770..8902b018fc18 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1543,28 +1543,40 @@ static bool is_cmd_allowed_from_mirror(u32 cmd_id) return false; } -static int sev_lock_for_migration(struct kvm *kvm) +static int sev_lock_two_vms(struct kvm *dst_kvm, struct kvm *src_kvm) { - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct kvm_sev_info *dst_sev = &to_kvm_svm(dst_kvm)->sev_info; + struct kvm_sev_info *src_sev = &to_kvm_svm(src_kvm)->sev_info; + + if (dst_kvm == src_kvm) + return -EINVAL; /* - * Bail if this VM is already involved in a migration to avoid deadlock - * between two VMs trying to migrate to/from each other. + * Bail if these VMs are already involved in a migration to avoid + * deadlock between two VMs trying to migrate to/from each other. */ - if (atomic_cmpxchg_acquire(&sev->migration_in_progress, 0, 1)) + if (atomic_cmpxchg_acquire(&dst_sev->migration_in_progress, 0, 1)) return -EBUSY; - mutex_lock(&kvm->lock); + if (atomic_cmpxchg_acquire(&src_sev->migration_in_progress, 0, 1)) { + atomic_set_release(&dst_sev->migration_in_progress, 0); + return -EBUSY; + } + mutex_lock(&dst_kvm->lock); + mutex_lock(&src_kvm->lock); return 0; } -static void sev_unlock_after_migration(struct kvm *kvm) +static void sev_unlock_two_vms(struct kvm *dst_kvm, struct kvm *src_kvm) { - struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct kvm_sev_info *dst_sev = &to_kvm_svm(dst_kvm)->sev_info; + struct kvm_sev_info *src_sev = &to_kvm_svm(src_kvm)->sev_info; - mutex_unlock(&kvm->lock); - atomic_set_release(&sev->migration_in_progress, 0); + mutex_unlock(&dst_kvm->lock); + mutex_unlock(&src_kvm->lock); + atomic_set_release(&dst_sev->migration_in_progress, 0); + atomic_set_release(&src_sev->migration_in_progress, 0); } @@ -1665,15 +1677,6 @@ int svm_vm_migrate_from(struct kvm *kvm, unsigned int source_fd) bool charged = false; int ret; - ret = sev_lock_for_migration(kvm); - if (ret) - return ret; - - if (sev_guest(kvm)) { - ret = -EINVAL; - goto out_unlock; - } - source_kvm_file = fget(source_fd); if (!file_is_kvm(source_kvm_file)) { ret = -EBADF; @@ -1681,13 +1684,13 @@ int svm_vm_migrate_from(struct kvm *kvm, unsigned int source_fd) } source_kvm = source_kvm_file->private_data; - ret = sev_lock_for_migration(source_kvm); + ret = sev_lock_two_vms(kvm, source_kvm); if (ret) goto out_fput; - if (!sev_guest(source_kvm)) { + if (sev_guest(kvm) || !sev_guest(source_kvm)) { ret = -EINVAL; - goto out_source; + goto out_unlock; } src_sev = &to_kvm_svm(source_kvm)->sev_info; @@ -1727,13 +1730,11 @@ out_dst_cgroup: sev_misc_cg_uncharge(cg_cleanup_sev); put_misc_cg(cg_cleanup_sev->misc_cg); cg_cleanup_sev->misc_cg = NULL; -out_source: - sev_unlock_after_migration(source_kvm); +out_unlock: + sev_unlock_two_vms(kvm, source_kvm); out_fput: if (source_kvm_file) fput(source_kvm_file); -out_unlock: - sev_unlock_after_migration(kvm); return ret; } -- cgit v1.2.3 From 2b347a387811cb4aa7bcdb96e9203c5019a6fb41 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:50:30 -0500 Subject: KVM: SEV: initialize regions_list of a mirror VM This was broken before the introduction of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM, but technically harmless because the region list was unused for a mirror VM. However, it is untidy and it now causes a NULL pointer access when attempting to move the encryption context of a mirror VM. Fixes: 54526d1fd593 ("KVM: x86: Support KVM VMs sharing SEV context") Message-Id: <20211123005036.2954379-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 1 + 1 file changed, 1 insertion(+) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 8902b018fc18..8daabc3dc079 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2007,6 +2007,7 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd) mirror_sev->fd = source_sev.fd; mirror_sev->es_active = source_sev.es_active; mirror_sev->handle = source_sev.handle; + INIT_LIST_HEAD(&mirror_sev->regions_list); /* * Do not copy ap_jump_table. Since the mirror does not share the same * KVM contexts as the original, and they may have different -- cgit v1.2.3 From 642525e3bd474dc50b7d0e8ee9c966b97e4be3ac Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:50:31 -0500 Subject: KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM Allow intra-host migration of a mirror VM; the destination VM will be a mirror of the same ASID as the source. Fixes: b56639318bb2 ("KVM: SEV: Add support for SEV intra host migration") Reviewed-by: Sean Christopherson Message-Id: <20211123005036.2954379-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 8daabc3dc079..74b6459b5fb2 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1619,11 +1619,13 @@ static void sev_migrate_from(struct kvm_sev_info *dst, dst->asid = src->asid; dst->handle = src->handle; dst->pages_locked = src->pages_locked; + dst->enc_context_owner = src->enc_context_owner; src->asid = 0; src->active = false; src->handle = 0; src->pages_locked = 0; + src->enc_context_owner = NULL; list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list); } -- cgit v1.2.3 From bf42b02b19e27d6849852a41dd734af4c05e73c6 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:50:33 -0500 Subject: KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked Now that we have a facility to lock two VMs with deadlock protection, use it for the creation of mirror VMs as well. One of COPY_ENC_CONTEXT_FROM(dst, src) and COPY_ENC_CONTEXT_FROM(src, dst) would always fail, so the combination is nonsensical and it is okay to return -EBUSY if it is attempted. This sidesteps the question of what happens if a VM is MOVE_ENC_CONTEXT_FROM'd at the same time as it is COPY_ENC_CONTEXT_FROM'd: the locking prevents that from happening. Cc: Peter Gonda Cc: Sean Christopherson Reviewed-by: Sean Christopherson Message-Id: <20211123005036.2954379-10-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 66 ++++++++++++++++++-------------------------------- 1 file changed, 24 insertions(+), 42 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 74b6459b5fb2..025d9731b66c 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1955,77 +1955,59 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd) { struct file *source_kvm_file; struct kvm *source_kvm; - struct kvm_sev_info source_sev, *mirror_sev; + struct kvm_sev_info *source_sev, *mirror_sev; int ret; source_kvm_file = fget(source_fd); if (!file_is_kvm(source_kvm_file)) { ret = -EBADF; - goto e_source_put; + goto e_source_fput; } source_kvm = source_kvm_file->private_data; - mutex_lock(&source_kvm->lock); - - if (!sev_guest(source_kvm)) { - ret = -EINVAL; - goto e_source_unlock; - } + ret = sev_lock_two_vms(kvm, source_kvm); + if (ret) + goto e_source_fput; - /* Mirrors of mirrors should work, but let's not get silly */ - if (is_mirroring_enc_context(source_kvm) || source_kvm == kvm) { + /* + * Mirrors of mirrors should work, but let's not get silly. Also + * disallow out-of-band SEV/SEV-ES init if the target is already an + * SEV guest, or if vCPUs have been created. KVM relies on vCPUs being + * created after SEV/SEV-ES initialization, e.g. to init intercepts. + */ + if (sev_guest(kvm) || !sev_guest(source_kvm) || + is_mirroring_enc_context(source_kvm) || kvm->created_vcpus) { ret = -EINVAL; - goto e_source_unlock; + goto e_unlock; } - memcpy(&source_sev, &to_kvm_svm(source_kvm)->sev_info, - sizeof(source_sev)); - /* * The mirror kvm holds an enc_context_owner ref so its asid can't * disappear until we're done with it */ + source_sev = &to_kvm_svm(source_kvm)->sev_info; kvm_get_kvm(source_kvm); - fput(source_kvm_file); - mutex_unlock(&source_kvm->lock); - mutex_lock(&kvm->lock); - - /* - * Disallow out-of-band SEV/SEV-ES init if the target is already an - * SEV guest, or if vCPUs have been created. KVM relies on vCPUs being - * created after SEV/SEV-ES initialization, e.g. to init intercepts. - */ - if (sev_guest(kvm) || kvm->created_vcpus) { - ret = -EINVAL; - goto e_mirror_unlock; - } - /* Set enc_context_owner and copy its encryption context over */ mirror_sev = &to_kvm_svm(kvm)->sev_info; mirror_sev->enc_context_owner = source_kvm; mirror_sev->active = true; - mirror_sev->asid = source_sev.asid; - mirror_sev->fd = source_sev.fd; - mirror_sev->es_active = source_sev.es_active; - mirror_sev->handle = source_sev.handle; + mirror_sev->asid = source_sev->asid; + mirror_sev->fd = source_sev->fd; + mirror_sev->es_active = source_sev->es_active; + mirror_sev->handle = source_sev->handle; INIT_LIST_HEAD(&mirror_sev->regions_list); + ret = 0; + /* * Do not copy ap_jump_table. Since the mirror does not share the same * KVM contexts as the original, and they may have different * memory-views. */ - mutex_unlock(&kvm->lock); - return 0; - -e_mirror_unlock: - mutex_unlock(&kvm->lock); - kvm_put_kvm(source_kvm); - return ret; -e_source_unlock: - mutex_unlock(&source_kvm->lock); -e_source_put: +e_unlock: + sev_unlock_two_vms(kvm, source_kvm); +e_source_fput: if (source_kvm_file) fput(source_kvm_file); return ret; -- cgit v1.2.3 From 17d44a96f000fe1040d4ba1c34e458c63be6b7ce Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:50:34 -0500 Subject: KVM: SEV: Prohibit migration of a VM that has mirrors VMs that mirror an encryption context rely on the owner to keep the ASID allocated. Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM would cause a dangling ASID: 1. copy context from A to B (gets ref to A) 2. move context from A to L (moves ASID from A to L) 3. close L (releases ASID from L, B still references it) The right way to do the handoff instead is to create a fresh mirror VM on the destination first: 1. copy context from A to B (gets ref to A) [later] 2. close B (releases ref to A) 3. move context from A to L (moves ASID from A to L) 4. copy context from L to M So, catch the situation by adding a count of how many VMs are mirroring this one's encryption context. Fixes: 0b020f5af092 ("KVM: SEV: Add support for SEV-ES intra host migration") Message-Id: <20211123005036.2954379-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 22 +++++++++++++++++++++- arch/x86/kvm/svm/svm.h | 1 + 2 files changed, 22 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 025d9731b66c..89a716290fac 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1696,6 +1696,16 @@ int svm_vm_migrate_from(struct kvm *kvm, unsigned int source_fd) } src_sev = &to_kvm_svm(source_kvm)->sev_info; + + /* + * VMs mirroring src's encryption context rely on it to keep the + * ASID allocated, but below we are clearing src_sev->asid. + */ + if (src_sev->num_mirrored_vms) { + ret = -EBUSY; + goto out_unlock; + } + dst_sev->misc_cg = get_current_misc_cg(); cg_cleanup_sev = dst_sev; if (dst_sev->misc_cg != src_sev->misc_cg) { @@ -1987,6 +1997,7 @@ int svm_vm_copy_asid_from(struct kvm *kvm, unsigned int source_fd) */ source_sev = &to_kvm_svm(source_kvm)->sev_info; kvm_get_kvm(source_kvm); + source_sev->num_mirrored_vms++; /* Set enc_context_owner and copy its encryption context over */ mirror_sev = &to_kvm_svm(kvm)->sev_info; @@ -2019,12 +2030,21 @@ void sev_vm_destroy(struct kvm *kvm) struct list_head *head = &sev->regions_list; struct list_head *pos, *q; + WARN_ON(sev->num_mirrored_vms); + if (!sev_guest(kvm)) return; /* If this is a mirror_kvm release the enc_context_owner and skip sev cleanup */ if (is_mirroring_enc_context(kvm)) { - kvm_put_kvm(sev->enc_context_owner); + struct kvm *owner_kvm = sev->enc_context_owner; + struct kvm_sev_info *owner_sev = &to_kvm_svm(owner_kvm)->sev_info; + + mutex_lock(&owner_kvm->lock); + if (!WARN_ON(!owner_sev->num_mirrored_vms)) + owner_sev->num_mirrored_vms--; + mutex_unlock(&owner_kvm->lock); + kvm_put_kvm(owner_kvm); return; } diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 5faad3dc10e2..1c7306c370fa 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -79,6 +79,7 @@ struct kvm_sev_info { struct list_head regions_list; /* List of registered regions */ u64 ap_jump_table; /* SEV-ES AP Jump Table address */ struct kvm *enc_context_owner; /* Owner of copied encryption context */ + unsigned long num_mirrored_vms; /* Number of VMs sharing this ASID */ struct misc_cg *misc_cg; /* For misc cgroup accounting */ atomic_t migration_in_progress; }; -- cgit v1.2.3 From 10a37929efeb4c51a0069afdd537c4fa3831f6e5 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:50:35 -0500 Subject: KVM: SEV: do not take kvm->lock when destroying Taking the lock is useless since there are no other references, and there are already accesses (e.g. to sev->enc_context_owner) that do not take it. So get rid of it. Reviewed-by: Sean Christopherson Message-Id: <20211123005036.2954379-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 89a716290fac..bbbf980c7e40 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2048,8 +2048,6 @@ void sev_vm_destroy(struct kvm *kvm) return; } - mutex_lock(&kvm->lock); - /* * Ensure that all guest tagged cache entries are flushed before * releasing the pages back to the system for use. CLFLUSH will @@ -2069,8 +2067,6 @@ void sev_vm_destroy(struct kvm *kvm) } } - mutex_unlock(&kvm->lock); - sev_unbind_asid(kvm, sev->handle); sev_asid_free(sev); } -- cgit v1.2.3 From c9d61dcb0bc26a761dc84a87bd8a0d3b3c432f10 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Mon, 22 Nov 2021 19:50:36 -0500 Subject: KVM: SEV: accept signals in sev_lock_two_vms Generally, kvm->lock is not taken for a long time, but sev_lock_two_vms is different: it takes vCPU locks inside, so userspace can hold it back just by calling a vCPU ioctl. Play it safe and use mutex_lock_killable. Message-Id: <20211123005036.2954379-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index bbbf980c7e40..59727a966f90 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -1547,6 +1547,7 @@ static int sev_lock_two_vms(struct kvm *dst_kvm, struct kvm *src_kvm) { struct kvm_sev_info *dst_sev = &to_kvm_svm(dst_kvm)->sev_info; struct kvm_sev_info *src_sev = &to_kvm_svm(src_kvm)->sev_info; + int r = -EBUSY; if (dst_kvm == src_kvm) return -EINVAL; @@ -1558,14 +1559,23 @@ static int sev_lock_two_vms(struct kvm *dst_kvm, struct kvm *src_kvm) if (atomic_cmpxchg_acquire(&dst_sev->migration_in_progress, 0, 1)) return -EBUSY; - if (atomic_cmpxchg_acquire(&src_sev->migration_in_progress, 0, 1)) { - atomic_set_release(&dst_sev->migration_in_progress, 0); - return -EBUSY; - } + if (atomic_cmpxchg_acquire(&src_sev->migration_in_progress, 0, 1)) + goto release_dst; - mutex_lock(&dst_kvm->lock); - mutex_lock(&src_kvm->lock); + r = -EINTR; + if (mutex_lock_killable(&dst_kvm->lock)) + goto release_src; + if (mutex_lock_killable(&src_kvm->lock)) + goto unlock_dst; return 0; + +unlock_dst: + mutex_unlock(&dst_kvm->lock); +release_src: + atomic_set_release(&src_sev->migration_in_progress, 0); +release_dst: + atomic_set_release(&dst_sev->migration_in_progress, 0); + return r; } static void sev_unlock_two_vms(struct kvm *dst_kvm, struct kvm *src_kvm) -- cgit v1.2.3 From e90e51d5f01d2baae5dcce280866bbb96816e978 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 30 Nov 2021 07:36:41 -0500 Subject: KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled There is nothing to synchronize if APICv is disabled, since neither other vCPUs nor assigned devices can set PIR.ON. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/vmx.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 1fadec8cbf96..f90448809690 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7777,10 +7777,10 @@ static __init int hardware_setup(void) ple_window_shrink = 0; } - if (!cpu_has_vmx_apicv()) { + if (!cpu_has_vmx_apicv()) enable_apicv = 0; + if (!enable_apicv) vmx_x86_ops.sync_pir_to_irr = NULL; - } if (cpu_has_vmx_tsc_scaling()) { kvm_has_tsc_control = true; -- cgit v1.2.3 From 7cfc5c653b07782e7059527df8dc1e3143a7591e Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 30 Nov 2021 03:46:07 -0500 Subject: KVM: fix avic_set_running for preemptable kernels avic_set_running() passes the current CPU to avic_vcpu_load(), albeit via vcpu->cpu rather than smp_processor_id(). If the thread is migrated while avic_set_running runs, the call to avic_vcpu_load() can use a stale value for the processor id. Avoid this by blocking preemption over the entire execution of avic_set_running(). Reported-by: Sean Christopherson Fixes: 8221c1370056 ("svm: Manage vcpu load/unload when enable AVIC") Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/avic.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index affc0ea98d30..9d6066eb7c10 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -989,16 +989,18 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu) static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run) { struct vcpu_svm *svm = to_svm(vcpu); + int cpu = get_cpu(); + WARN_ON(cpu != vcpu->cpu); svm->avic_is_running = is_run; - if (!kvm_vcpu_apicv_active(vcpu)) - return; - - if (is_run) - avic_vcpu_load(vcpu, vcpu->cpu); - else - avic_vcpu_put(vcpu); + if (kvm_vcpu_apicv_active(vcpu)) { + if (is_run) + avic_vcpu_load(vcpu, cpu); + else + avic_vcpu_put(vcpu); + } + put_cpu(); } void svm_vcpu_blocking(struct kvm_vcpu *vcpu) -- cgit v1.2.3 From 1d7c29b77725d05faff6754d2f5e7c147aedcf93 Mon Sep 17 00:00:00 2001 From: Helge Deller Date: Fri, 26 Nov 2021 22:35:45 +0100 Subject: parisc: Fix KBUILD_IMAGE for self-extracting kernel Default KBUILD_IMAGE to $(boot)/bzImage if a self-extracting (CONFIG_PARISC_SELF_EXTRACT=y) kernel is to be built. This fixes the bindeb-pkg make target. Signed-off-by: Helge Deller Cc: # v4.14+ --- arch/parisc/Makefile | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'arch') diff --git a/arch/parisc/Makefile b/arch/parisc/Makefile index 8db4af4879d0..82d77f4b0d08 100644 --- a/arch/parisc/Makefile +++ b/arch/parisc/Makefile @@ -15,7 +15,12 @@ # Mike Shaver, Helge Deller and Martin K. Petersen # +ifdef CONFIG_PARISC_SELF_EXTRACT +boot := arch/parisc/boot +KBUILD_IMAGE := $(boot)/bzImage +else KBUILD_IMAGE := vmlinuz +endif NM = sh $(srctree)/arch/parisc/nm CHECKFLAGS += -D__hppa__=1 -- cgit v1.2.3 From 7e8aeb9d466e1b14d0580b12a0b9f7c702c42f79 Mon Sep 17 00:00:00 2001 From: Helge Deller Date: Sat, 27 Nov 2021 12:00:36 +0100 Subject: parisc: Enable sata sil, audit and usb support on 64-bit defconfig Add some more config options which reflect what's needed to boot our 64-bit debian buildds out of the box. Signed-off-by: Helge Deller --- arch/parisc/configs/generic-64bit_defconfig | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/parisc/configs/generic-64bit_defconfig b/arch/parisc/configs/generic-64bit_defconfig index d2daeac2b217..1b8fd80cbe7f 100644 --- a/arch/parisc/configs/generic-64bit_defconfig +++ b/arch/parisc/configs/generic-64bit_defconfig @@ -1,7 +1,9 @@ CONFIG_LOCALVERSION="-64bit" # CONFIG_LOCALVERSION_AUTO is not set +CONFIG_KERNEL_LZ4=y CONFIG_SYSVIPC=y CONFIG_POSIX_MQUEUE=y +CONFIG_AUDIT=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y CONFIG_TASKSTATS=y @@ -35,6 +37,7 @@ CONFIG_MODVERSIONS=y CONFIG_BLK_DEV_INTEGRITY=y CONFIG_BINFMT_MISC=m # CONFIG_COMPACTION is not set +CONFIG_MEMORY_FAILURE=y CONFIG_NET=y CONFIG_PACKET=y CONFIG_UNIX=y @@ -65,12 +68,15 @@ CONFIG_SCSI_ISCSI_ATTRS=y CONFIG_SCSI_SRP_ATTRS=y CONFIG_ISCSI_BOOT_SYSFS=y CONFIG_SCSI_MPT2SAS=y -CONFIG_SCSI_LASI700=m +CONFIG_SCSI_LASI700=y CONFIG_SCSI_SYM53C8XX_2=y CONFIG_SCSI_ZALON=y CONFIG_SCSI_QLA_ISCSI=m CONFIG_SCSI_DH=y CONFIG_ATA=y +CONFIG_SATA_SIL=y +CONFIG_SATA_SIS=y +CONFIG_SATA_VIA=y CONFIG_PATA_NS87415=y CONFIG_PATA_SIL680=y CONFIG_ATA_GENERIC=y @@ -79,6 +85,7 @@ CONFIG_MD_LINEAR=m CONFIG_BLK_DEV_DM=m CONFIG_DM_RAID=m CONFIG_DM_UEVENT=y +CONFIG_DM_AUDIT=y CONFIG_FUSION=y CONFIG_FUSION_SPI=y CONFIG_FUSION_SAS=y @@ -196,10 +203,15 @@ CONFIG_FB_MATROX_G=y CONFIG_FB_MATROX_I2C=y CONFIG_FB_MATROX_MAVEN=y CONFIG_FB_RADEON=y +CONFIG_LOGO=y +# CONFIG_LOGO_LINUX_CLUT224 is not set CONFIG_HIDRAW=y CONFIG_HID_PID=y CONFIG_USB_HIDDEV=y CONFIG_USB=y +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_OHCI_HCD=y +CONFIG_USB_OHCI_HCD_PLATFORM=y CONFIG_UIO=y CONFIG_UIO_PDRV_GENIRQ=m CONFIG_UIO_AEC=m -- cgit v1.2.3 From 7d697f0d5737768fa1039b8953b67c08d8d406d1 Mon Sep 17 00:00:00 2001 From: Tony Luck Date: Fri, 19 Nov 2021 09:08:32 -0800 Subject: x86/cpu: Drop spurious underscore from RAPTOR_LAKE #define Convention for all the other "lake" CPUs is all one word. So s/RAPTOR_LAKE/RAPTORLAKE/ Fixes: fbdb5e8f2926 ("x86/cpu: Add Raptor Lake to Intel family") Reported-by: Rui Zhang Signed-off-by: Tony Luck Signed-off-by: Dave Hansen Link: https://lkml.kernel.org/r/20211119170832.1034220-1-tony.luck@intel.com --- arch/x86/include/asm/intel-family.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/include/asm/intel-family.h b/arch/x86/include/asm/intel-family.h index 5a0bcf8b78d7..048b6d5aff50 100644 --- a/arch/x86/include/asm/intel-family.h +++ b/arch/x86/include/asm/intel-family.h @@ -108,7 +108,7 @@ #define INTEL_FAM6_ALDERLAKE 0x97 /* Golden Cove / Gracemont */ #define INTEL_FAM6_ALDERLAKE_L 0x9A /* Golden Cove / Gracemont */ -#define INTEL_FAM6_RAPTOR_LAKE 0xB7 +#define INTEL_FAM6_RAPTORLAKE 0xB7 /* "Small Core" Processors (Atom) */ -- cgit v1.2.3 From 52d0b8b18776f184c53632c5e0068201491cdb61 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Fri, 26 Nov 2021 13:47:46 +0100 Subject: x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog() save_sw_bytes() did not fully initialize sw_bytes, which caused KMSAN to report an infoleak (see below). Initialize sw_bytes explicitly to avoid this. KMSAN report follows: ===================================================== BUG: KMSAN: kernel-infoleak in instrument_copy_to_user ./include/linux/instrumented.h:121 BUG: KMSAN: kernel-infoleak in __copy_to_user ./include/linux/uaccess.h:154 BUG: KMSAN: kernel-infoleak in save_xstate_epilog+0x2df/0x510 arch/x86/kernel/fpu/signal.c:127 instrument_copy_to_user ./include/linux/instrumented.h:121 __copy_to_user ./include/linux/uaccess.h:154 save_xstate_epilog+0x2df/0x510 arch/x86/kernel/fpu/signal.c:127 copy_fpstate_to_sigframe+0x861/0xb60 arch/x86/kernel/fpu/signal.c:245 get_sigframe+0x656/0x7e0 arch/x86/kernel/signal.c:296 __setup_rt_frame+0x14d/0x2a60 arch/x86/kernel/signal.c:471 setup_rt_frame arch/x86/kernel/signal.c:781 handle_signal arch/x86/kernel/signal.c:825 arch_do_signal_or_restart+0x417/0xdd0 arch/x86/kernel/signal.c:870 handle_signal_work kernel/entry/common.c:149 exit_to_user_mode_loop+0x1f6/0x490 kernel/entry/common.c:173 exit_to_user_mode_prepare kernel/entry/common.c:208 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 syscall_exit_to_user_mode+0x7e/0xc0 kernel/entry/common.c:302 do_syscall_64+0x60/0xd0 arch/x86/entry/common.c:88 entry_SYSCALL_64_after_hwframe+0x44/0xae ??:? Local variable sw_bytes created at: save_xstate_epilog+0x80/0x510 arch/x86/kernel/fpu/signal.c:121 copy_fpstate_to_sigframe+0x861/0xb60 arch/x86/kernel/fpu/signal.c:245 Bytes 20-47 of 48 are uninitialized Memory access of size 48 starts at ffff8880801d3a18 Data copied to user address 00007ffd90e2ef50 ===================================================== Link: https://lore.kernel.org/all/CAG_fn=V9T6OKPonSjsi9PmWB0hMHFC=yawozdft8i1-MSxrv=w@mail.gmail.com/ Fixes: 53599b4d54b9b8dd ("x86/fpu/signal: Prepare for variable sigframe length") Reported-by: Alexander Potapenko Signed-off-by: Marco Elver Signed-off-by: Alexander Potapenko Signed-off-by: Dave Hansen Tested-by: Alexander Potapenko Link: https://lkml.kernel.org/r/20211126124746.761278-1-glider@google.com --- arch/x86/kernel/fpu/signal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c index d5958278eba6..91d4b6de58ab 100644 --- a/arch/x86/kernel/fpu/signal.c +++ b/arch/x86/kernel/fpu/signal.c @@ -118,7 +118,7 @@ static inline bool save_xstate_epilog(void __user *buf, int ia32_frame, struct fpstate *fpstate) { struct xregs_state __user *x = buf; - struct _fpx_sw_bytes sw_bytes; + struct _fpx_sw_bytes sw_bytes = {}; u32 xfeatures; int err; -- cgit v1.2.3 From c7719e79347803b8e3b6b50da8c6db410a3012b5 Mon Sep 17 00:00:00 2001 From: Feng Tang Date: Wed, 17 Nov 2021 10:37:50 +0800 Subject: x86/tsc: Add a timer to make sure TSC_adjust is always checked The TSC_ADJUST register is checked every time a CPU enters idle state, but Thomas Gleixner mentioned there is still a caveat that a system won't enter idle [1], either because it's too busy or configured purposely to not enter idle. Setup a periodic timer (every 10 minutes) to make sure the check is happening on a regular base. [1] https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/ Fixes: 6e3cd95234dc ("x86/hpet: Use another crystalball to evaluate HPET usability") Requested-by: Thomas Gleixner Signed-off-by: Feng Tang Signed-off-by: Thomas Gleixner Cc: "Paul E. McKenney" Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211117023751.24190-1-feng.tang@intel.com --- arch/x86/kernel/tsc_sync.c | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) (limited to 'arch') diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c index 50a4515fe0ad..9452dc9664b5 100644 --- a/arch/x86/kernel/tsc_sync.c +++ b/arch/x86/kernel/tsc_sync.c @@ -30,6 +30,7 @@ struct tsc_adjust { }; static DEFINE_PER_CPU(struct tsc_adjust, tsc_adjust); +static struct timer_list tsc_sync_check_timer; /* * TSC's on different sockets may be reset asynchronously. @@ -77,6 +78,46 @@ void tsc_verify_tsc_adjust(bool resume) } } +/* + * Normally the tsc_sync will be checked every time system enters idle + * state, but there is still caveat that a system won't enter idle, + * either because it's too busy or configured purposely to not enter + * idle. + * + * So setup a periodic timer (every 10 minutes) to make sure the check + * is always on. + */ + +#define SYNC_CHECK_INTERVAL (HZ * 600) + +static void tsc_sync_check_timer_fn(struct timer_list *unused) +{ + int next_cpu; + + tsc_verify_tsc_adjust(false); + + /* Run the check for all onlined CPUs in turn */ + next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask); + if (next_cpu >= nr_cpu_ids) + next_cpu = cpumask_first(cpu_online_mask); + + tsc_sync_check_timer.expires += SYNC_CHECK_INTERVAL; + add_timer_on(&tsc_sync_check_timer, next_cpu); +} + +static int __init start_sync_check_timer(void) +{ + if (!cpu_feature_enabled(X86_FEATURE_TSC_ADJUST) || tsc_clocksource_reliable) + return 0; + + timer_setup(&tsc_sync_check_timer, tsc_sync_check_timer_fn, 0); + tsc_sync_check_timer.expires = jiffies + SYNC_CHECK_INTERVAL; + add_timer(&tsc_sync_check_timer); + + return 0; +} +late_initcall(start_sync_check_timer); + static void tsc_sanitize_first_cpu(struct tsc_adjust *cur, s64 bootval, unsigned int cpu, bool bootcpu) { -- cgit v1.2.3 From b50db7095fe002fa3e16605546cba66bf1b68a3e Mon Sep 17 00:00:00 2001 From: Feng Tang Date: Wed, 17 Nov 2021 10:37:51 +0800 Subject: x86/tsc: Disable clocksource watchdog for TSC on qualified platorms There are cases that the TSC clocksource is wrongly judged as unstable by the clocksource watchdog mechanism which tries to validate the TSC against HPET, PM_TIMER or jiffies. While there is hardly a general reliable way to check the validity of a watchdog, Thomas Gleixner proposed [1]: "I'm inclined to lift that requirement when the CPU has: 1) X86_FEATURE_CONSTANT_TSC 2) X86_FEATURE_NONSTOP_TSC 3) X86_FEATURE_NONSTOP_TSC_S3 4) X86_FEATURE_TSC_ADJUST 5) At max. 4 sockets After two decades of horrors we're finally at a point where TSC seems to be halfway reliable and less abused by BIOS tinkerers. TSC_ADJUST was really key as we can now detect even small modifications reliably and the important point is that we can cure them as well (not pretty but better than all other options)." As feature #3 X86_FEATURE_NONSTOP_TSC_S3 only exists on several generations of Atom processorz, and is always coupled with X86_FEATURE_CONSTANT_TSC and X86_FEATURE_NONSTOP_TSC, skip checking it, and also be more defensive to use maximal 2 sockets. The check is done inside tsc_init() before registering 'tsc-early' and 'tsc' clocksources, as there were cases that both of them had been wrongly judged as unreliable. For more background of tsc/watchdog, there is a good summary in [2] [tglx} Update vs. jiffies: On systems where the only remaining clocksource aside of TSC is jiffies there is no way to make this work because that creates a circular dependency. Jiffies accuracy depends on not missing a periodic timer interrupt, which is not guaranteed. That could be detected by TSC, but as TSC is not trusted this cannot be compensated. The consequence is a circulus vitiosus which results in shutting down TSC and falling back to the jiffies clocksource which is even more unreliable. [1]. https://lore.kernel.org/lkml/87eekfk8bd.fsf@nanos.tec.linutronix.de/ [2]. https://lore.kernel.org/lkml/87a6pimt1f.ffs@nanos.tec.linutronix.de/ [ tglx: Refine comment and amend changelog ] Fixes: 6e3cd95234dc ("x86/hpet: Use another crystalball to evaluate HPET usability") Suggested-by: Thomas Gleixner Signed-off-by: Feng Tang Signed-off-by: Thomas Gleixner Cc: "Paul E. McKenney" Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211117023751.24190-2-feng.tang@intel.com --- arch/x86/kernel/tsc.c | 28 ++++++++++++++++++++++++---- 1 file changed, 24 insertions(+), 4 deletions(-) (limited to 'arch') diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 2e076a459a0c..a698196377be 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1180,6 +1180,12 @@ void mark_tsc_unstable(char *reason) EXPORT_SYMBOL_GPL(mark_tsc_unstable); +static void __init tsc_disable_clocksource_watchdog(void) +{ + clocksource_tsc_early.flags &= ~CLOCK_SOURCE_MUST_VERIFY; + clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY; +} + static void __init check_system_tsc_reliable(void) { #if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC) @@ -1196,6 +1202,23 @@ static void __init check_system_tsc_reliable(void) #endif if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) tsc_clocksource_reliable = 1; + + /* + * Disable the clocksource watchdog when the system has: + * - TSC running at constant frequency + * - TSC which does not stop in C-States + * - the TSC_ADJUST register which allows to detect even minimal + * modifications + * - not more than two sockets. As the number of sockets cannot be + * evaluated at the early boot stage where this has to be + * invoked, check the number of online memory nodes as a + * fallback solution which is an reasonable estimate. + */ + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && + boot_cpu_has(X86_FEATURE_TSC_ADJUST) && + nr_online_nodes <= 2) + tsc_disable_clocksource_watchdog(); } /* @@ -1387,9 +1410,6 @@ static int __init init_tsc_clocksource(void) if (tsc_unstable) goto unreg; - if (tsc_clocksource_reliable || no_tsc_watchdog) - clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY; - if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3)) clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP; @@ -1527,7 +1547,7 @@ void __init tsc_init(void) } if (tsc_clocksource_reliable || no_tsc_watchdog) - clocksource_tsc_early.flags &= ~CLOCK_SOURCE_MUST_VERIFY; + tsc_disable_clocksource_watchdog(); clocksource_register_khz(&clocksource_tsc_early, tsc_khz); detect_art(); -- cgit v1.2.3 From cb1d220da0faa5ca0deb93449aff953f0c2cce6d Mon Sep 17 00:00:00 2001 From: Like Xu Date: Thu, 18 Nov 2021 21:03:20 +0800 Subject: KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register If we run the following perf command in an AMD Milan guest: perf stat \ -e cpu/event=0x1d0/ \ -e cpu/event=0x1c7/ \ -e cpu/umask=0x1f,event=0x18e/ \ -e cpu/umask=0x7,event=0x18e/ \ -e cpu/umask=0x18,event=0x18e/ \ ./workload dmesg will report a #GP warning from an unchecked MSR access error on MSR_F15H_PERF_CTLx. This is because according to APM (Revision: 4.03) Figure 13-7, the bits [35:32] of AMD PerfEvtSeln register is a part of the event select encoding, which extends the EVENT_SELECT field from 8 bits to 12 bits. Opportunistically update pmu->reserved_bits for reserved bit 19. Reported-by: Jim Mattson Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM") Signed-off-by: Like Xu Message-Id: <20211118130320.95997-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/pmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c index 871c426ec389..b4095dfeeee6 100644 --- a/arch/x86/kvm/svm/pmu.c +++ b/arch/x86/kvm/svm/pmu.c @@ -281,7 +281,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu) pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS; pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1; - pmu->reserved_bits = 0xffffffff00200000ull; + pmu->reserved_bits = 0xfffffff000280000ull; pmu->version = 1; /* not applicable to AMD; but clean them to prevent any fall out */ pmu->counter_bitmask[KVM_PMC_FIXED] = 0; -- cgit v1.2.3 From ef8b4b7203682cc9adb37c8336d3f0f3b80bc382 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 30 Nov 2021 07:37:45 -0500 Subject: KVM: ensure APICv is considered inactive if there is no APIC kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel local APIC, however kvm_apicv_activated might still be true if there are no reasons to disable APICv; in fact it is quite likely that there is none because APICv is inhibited by specific configurations of the local APIC and those configurations cannot be programmed. This triggers a WARN: WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu)); To avoid this, introduce another cause for APICv inhibition, namely the absence of an in-kernel local APIC. This cause is enabled by default, and is dropped by either KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. Reported-by: Ignat Korchagin Fixes: ee49a8932971 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22) Reviewed-by: Maxim Levitsky Reviewed-by: Sean Christopherson Tested-by: Ignat Korchagin Message-Id: <20211130123746.293379-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/svm/avic.c | 1 + arch/x86/kvm/vmx/vmx.c | 1 + arch/x86/kvm/x86.c | 9 +++++---- 4 files changed, 8 insertions(+), 4 deletions(-) (limited to 'arch') diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 6ac61f85e07b..860ed500580c 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1036,6 +1036,7 @@ struct kvm_x86_msr_filter { #define APICV_INHIBIT_REASON_PIT_REINJ 4 #define APICV_INHIBIT_REASON_X2APIC 5 #define APICV_INHIBIT_REASON_BLOCKIRQ 6 +#define APICV_INHIBIT_REASON_ABSENT 7 struct kvm_arch { unsigned long n_used_mmu_pages; diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index 9d6066eb7c10..8f9af7b7dbbe 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -900,6 +900,7 @@ out: bool svm_check_apicv_inhibit_reasons(ulong bit) { ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | + BIT(APICV_INHIBIT_REASON_ABSENT) | BIT(APICV_INHIBIT_REASON_HYPERV) | BIT(APICV_INHIBIT_REASON_NESTED) | BIT(APICV_INHIBIT_REASON_IRQWIN) | diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index f90448809690..9453743ce0c4 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7525,6 +7525,7 @@ static void hardware_unsetup(void) static bool vmx_check_apicv_inhibit_reasons(ulong bit) { ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) | + BIT(APICV_INHIBIT_REASON_ABSENT) | BIT(APICV_INHIBIT_REASON_HYPERV) | BIT(APICV_INHIBIT_REASON_BLOCKIRQ); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0ee1a039b490..e0aa4dd53c7f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5740,6 +5740,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, smp_wmb(); kvm->arch.irqchip_mode = KVM_IRQCHIP_SPLIT; kvm->arch.nr_reserved_ioapic_pins = cap->args[0]; + kvm_request_apicv_update(kvm, true, APICV_INHIBIT_REASON_ABSENT); r = 0; split_irqchip_unlock: mutex_unlock(&kvm->lock); @@ -6120,6 +6121,7 @@ set_identity_unlock: /* Write kvm->irq_routing before enabling irqchip_in_kernel. */ smp_wmb(); kvm->arch.irqchip_mode = KVM_IRQCHIP_KERNEL; + kvm_request_apicv_update(kvm, true, APICV_INHIBIT_REASON_ABSENT); create_irqchip_unlock: mutex_unlock(&kvm->lock); break; @@ -8818,10 +8820,9 @@ static void kvm_apicv_init(struct kvm *kvm) { init_rwsem(&kvm->arch.apicv_update_lock); - if (enable_apicv) - clear_bit(APICV_INHIBIT_REASON_DISABLE, - &kvm->arch.apicv_inhibit_reasons); - else + set_bit(APICV_INHIBIT_REASON_ABSENT, + &kvm->arch.apicv_inhibit_reasons); + if (!enable_apicv) set_bit(APICV_INHIBIT_REASON_DISABLE, &kvm->arch.apicv_inhibit_reasons); } -- cgit v1.2.3 From bfbb307c628676929c2d329da0daf9d22afa8ad2 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Tue, 30 Nov 2021 15:53:37 +0300 Subject: KVM: VMX: Set failure code in prepare_vmcs02() The error paths in the prepare_vmcs02() function are supposed to set *entry_failure_code but this path does not. It leads to using an uninitialized variable in the caller. Fixes: 71f7347025bf ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry") Signed-off-by: Dan Carpenter Message-Id: <20211130125337.GB24578@kili> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx/nested.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 64f2828035c2..9c941535f78c 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2591,8 +2591,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) && WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, - vmcs12->guest_ia32_perf_global_ctrl))) + vmcs12->guest_ia32_perf_global_ctrl))) { + *entry_failure_code = ENTRY_FAIL_DEFAULT; return -EINVAL; + } kvm_rsp_write(vcpu, vmcs12->guest_rsp); kvm_rip_write(vcpu, vmcs12->guest_rip); -- cgit v1.2.3 From a955cad84cdaffa282b3cf8f5ce69e9e5655e585 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Sat, 20 Nov 2021 04:50:22 +0000 Subject: KVM: x86/mmu: Retry page fault if root is invalidated by memslot update Bail from the page fault handler if the root shadow page was obsoleted by a memslot update. Do the check _after_ acuiring mmu_lock, as the TDP MMU doesn't rely on the memslot/MMU generation, and instead relies on the root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes mmu_lock for write. For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has moved past the gfn associated with the SP. For other MMUs, the resulting behavior is far more convoluted, though unlikely to be truly problematic. Installing SPs/SPTEs into the obsolete root isn't directly problematic, as the obsolete root will be unloaded and dropped before the vCPU re-enters the guest. But because the legacy MMU tracks shadow pages by their role, any SP created by the fault can can be reused in the new post-reload root. Again, that _shouldn't_ be problematic as any leaf child SPTEs will be created for the current/valid memslot generation, and kvm_mmu_get_page() will not reuse child SPs from the old generation as they will be flagged as obsolete. But, given that continuing with the fault is pointess (the root will be unloaded), apply the check to all MMUs. Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU") Cc: stable@vger.kernel.org Cc: Ben Gardon Signed-off-by: Sean Christopherson Message-Id: <20211120045046.3940942-5-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu/mmu.c | 23 +++++++++++++++++++++-- arch/x86/kvm/mmu/paging_tmpl.h | 3 ++- 2 files changed, 23 insertions(+), 3 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 6354297e92ae..e2e1d012df22 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1936,7 +1936,11 @@ static void mmu_audit_disable(void) { } static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) { - return sp->role.invalid || + if (sp->role.invalid) + return true; + + /* TDP MMU pages due not use the MMU generation. */ + return !sp->tdp_mmu_page && unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); } @@ -3976,6 +3980,20 @@ out_retry: return true; } +/* + * Returns true if the page fault is stale and needs to be retried, i.e. if the + * root was invalidated by a memslot update or a relevant mmu_notifier fired. + */ +static bool is_page_fault_stale(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault, int mmu_seq) +{ + if (is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root_hpa))) + return true; + + return fault->slot && + mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva); +} + static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu); @@ -4013,8 +4031,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault else write_lock(&vcpu->kvm->mmu_lock); - if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva)) + if (is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; + r = make_mmu_pages_available(vcpu); if (r) goto out_unlock; diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index f87d36898c44..708a5d297fe1 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -911,7 +911,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault r = RET_PF_RETRY; write_lock(&vcpu->kvm->mmu_lock); - if (fault->slot && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva)) + + if (is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT); -- cgit v1.2.3 From 2f2183243f52a8ee77eecba4796316606701d101 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Tue, 30 Nov 2021 12:18:49 +0000 Subject: arm64: kexec: use __pa_symbol(empty_zero_page) In machine_kexec_post_load() we use __pa() on `empty_zero_page`, so that we can use the physical address during arm64_relocate_new_kernel() to switch TTBR1 to a new set of tables. While `empty_zero_page` is part of the old kernel, we won't clobber it until after this switch, so using it is benign. However, `empty_zero_page` is part of the kernel image rather than a linear map address, so it is not correct to use __pa(x), and we should instead use __pa_symbol(x) or __pa(lm_alias(x)). Otherwise, when the kernel is built with DEBUG_VIRTUAL, we'll encounter splats as below, as I've seen when fuzzing v5.16-rc3 with Syzkaller: | ------------[ cut here ]------------ | virt_to_phys used for non-linear address: 000000008492561a (empty_zero_page+0x0/0x1000) | WARNING: CPU: 3 PID: 11492 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x120/0x1c0 arch/arm64/mm/physaddr.c:12 | CPU: 3 PID: 11492 Comm: syz-executor.0 Not tainted 5.16.0-rc3-00001-g48bd452a045c #1 | Hardware name: linux,dummy-virt (DT) | pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : __virt_to_phys+0x120/0x1c0 arch/arm64/mm/physaddr.c:12 | lr : __virt_to_phys+0x120/0x1c0 arch/arm64/mm/physaddr.c:12 | sp : ffff80001af17bb0 | x29: ffff80001af17bb0 x28: ffff1cc65207b400 x27: ffffb7828730b120 | x26: 0000000000000e11 x25: 0000000000000000 x24: 0000000000000001 | x23: ffffb7828963e000 x22: ffffb78289644000 x21: 0000600000000000 | x20: 000000000000002d x19: 0000b78289644000 x18: 0000000000000000 | x17: 74706d6528206131 x16: 3635323934383030 x15: 303030303030203a | x14: 1ffff000035e2eb8 x13: ffff6398d53f4f0f x12: 1fffe398d53f4f0e | x11: 1fffe398d53f4f0e x10: ffff6398d53f4f0e x9 : ffffb7827c6f76dc | x8 : ffff1cc6a9fa7877 x7 : 0000000000000001 x6 : ffff6398d53f4f0f | x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff1cc66f2a99c0 | x2 : 0000000000040000 x1 : d7ce7775b09b5d00 x0 : 0000000000000000 | Call trace: | __virt_to_phys+0x120/0x1c0 arch/arm64/mm/physaddr.c:12 | machine_kexec_post_load+0x284/0x670 arch/arm64/kernel/machine_kexec.c:150 | do_kexec_load+0x570/0x670 kernel/kexec.c:155 | __do_sys_kexec_load kernel/kexec.c:250 [inline] | __se_sys_kexec_load kernel/kexec.c:231 [inline] | __arm64_sys_kexec_load+0x1d8/0x268 kernel/kexec.c:231 | __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] | invoke_syscall+0x90/0x2e0 arch/arm64/kernel/syscall.c:52 | el0_svc_common.constprop.2+0x1e4/0x2f8 arch/arm64/kernel/syscall.c:142 | do_el0_svc+0xf8/0x150 arch/arm64/kernel/syscall.c:181 | el0_svc+0x60/0x248 arch/arm64/kernel/entry-common.c:603 | el0t_64_sync_handler+0x90/0xb8 arch/arm64/kernel/entry-common.c:621 | el0t_64_sync+0x180/0x184 arch/arm64/kernel/entry.S:572 | irq event stamp: 2428 | hardirqs last enabled at (2427): [] __up_console_sem+0xf0/0x118 kernel/printk/printk.c:255 | hardirqs last disabled at (2428): [] el1_dbg+0x28/0x80 arch/arm64/kernel/entry-common.c:375 | softirqs last enabled at (2424): [] softirq_handle_end kernel/softirq.c:401 [inline] | softirqs last enabled at (2424): [] __do_softirq+0xa28/0x11e4 kernel/softirq.c:587 | softirqs last disabled at (2417): [] do_softirq_own_stack include/asm-generic/softirq_stack.h:10 [inline] | softirqs last disabled at (2417): [] invoke_softirq kernel/softirq.c:439 [inline] | softirqs last disabled at (2417): [] __irq_exit_rcu kernel/softirq.c:636 [inline] | softirqs last disabled at (2417): [] irq_exit_rcu+0x53c/0x688 kernel/softirq.c:648 | ---[ end trace 0ca578534e7ca938 ]--- With or without DEBUG_VIRTUAL __pa() will fall back to __kimg_to_phys() for non-linear addresses, and will happen to do the right thing in this case, even with the warning. But we should not depend upon this, and to keep the warning useful we should fix this case. Fix this issue by using __pa_symbol(), which handles kernel image addresses (and checks its input is a kernel image address). This matches what we do elsewhere, e.g. in arch/arm64/include/asm/pgtable.h: | #define ZERO_PAGE(vaddr) phys_to_page(__pa_symbol(empty_zero_page)) Fixes: 3744b5280e67 ("arm64: kexec: install a copy of the linear-map") Signed-off-by: Mark Rutland Cc: Catalin Marinas Cc: James Morse Cc: Pasha Tatashin Cc: Will Deacon Reviewed-by: Pasha Tatashin Link: https://lore.kernel.org/r/20211130121849.3319010-1-mark.rutland@arm.com Signed-off-by: Will Deacon --- arch/arm64/kernel/machine_kexec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c index 1038494135c8..6fb31c117ebe 100644 --- a/arch/arm64/kernel/machine_kexec.c +++ b/arch/arm64/kernel/machine_kexec.c @@ -147,7 +147,7 @@ int machine_kexec_post_load(struct kimage *kimage) if (rc) return rc; kimage->arch.ttbr1 = __pa(trans_pgd); - kimage->arch.zero_page = __pa(empty_zero_page); + kimage->arch.zero_page = __pa_symbol(empty_zero_page); reloc_size = __relocate_new_kernel_end - __relocate_new_kernel_start; memcpy(reloc_code, __relocate_new_kernel_start, reloc_size); -- cgit v1.2.3 From 35b6b28e69985eafb20b3b2c7bd6eca452b56b53 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Mon, 29 Nov 2021 13:57:09 +0000 Subject: arm64: ftrace: add missing BTIs When branch target identifiers are in use, code reachable via an indirect branch requires a BTI landing pad at the branch target site. When building FTRACE_WITH_REGS atop patchable-function-entry, we miss BTIs at the start start of the `ftrace_caller` and `ftrace_regs_caller` trampolines, and when these are called from a module via a PLT (which will use a `BR X16`), we will encounter a BTI failure, e.g. | # insmod lkdtm.ko | lkdtm: No crash points registered, enable through debugfs | # echo function_graph > /sys/kernel/debug/tracing/current_tracer | # cat /sys/kernel/debug/provoke-crash/DIRECT | Unhandled 64-bit el1h sync exception on CPU0, ESR 0x34000001 -- BTI | CPU: 0 PID: 174 Comm: cat Not tainted 5.16.0-rc2-dirty #3 | Hardware name: linux,dummy-virt (DT) | pstate: 60400405 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=jc) | pc : ftrace_caller+0x0/0x3c | lr : lkdtm_debugfs_open+0xc/0x20 [lkdtm] | sp : ffff800012e43b00 | x29: ffff800012e43b00 x28: 0000000000000000 x27: ffff800012e43c88 | x26: 0000000000000000 x25: 0000000000000000 x24: ffff0000c171f200 | x23: ffff0000c27b1e00 x22: ffff0000c2265240 x21: ffff0000c23c8c30 | x20: ffff8000090ba380 x19: 0000000000000000 x18: 0000000000000000 | x17: 0000000000000000 x16: ffff80001002bb4c x15: 0000000000000000 | x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000900ff0 | x11: ffff0000c4166310 x10: ffff800012e43b00 x9 : ffff8000104f2384 | x8 : 0000000000000001 x7 : 0000000000000000 x6 : 000000000000003f | x5 : 0000000000000040 x4 : ffff800012e43af0 x3 : 0000000000000001 | x2 : ffff8000090b0000 x1 : ffff0000c171f200 x0 : ffff0000c23c8c30 | Kernel panic - not syncing: Unhandled exception | CPU: 0 PID: 174 Comm: cat Not tainted 5.16.0-rc2-dirty #3 | Hardware name: linux,dummy-virt (DT) | Call trace: | dump_backtrace+0x0/0x1a4 | show_stack+0x24/0x30 | dump_stack_lvl+0x68/0x84 | dump_stack+0x1c/0x38 | panic+0x168/0x360 | arm64_exit_nmi.isra.0+0x0/0x80 | el1h_64_sync_handler+0x68/0xd4 | el1h_64_sync+0x78/0x7c | ftrace_caller+0x0/0x3c | do_dentry_open+0x134/0x3b0 | vfs_open+0x38/0x44 | path_openat+0x89c/0xe40 | do_filp_open+0x8c/0x13c | do_sys_openat2+0xbc/0x174 | __arm64_sys_openat+0x6c/0xbc | invoke_syscall+0x50/0x120 | el0_svc_common.constprop.0+0xdc/0x100 | do_el0_svc+0x84/0xa0 | el0_svc+0x28/0x80 | el0t_64_sync_handler+0xa8/0x130 | el0t_64_sync+0x1a0/0x1a4 | SMP: stopping secondary CPUs | Kernel Offset: disabled | CPU features: 0x0,00000f42,da660c5f | Memory Limit: none | ---[ end Kernel panic - not syncing: Unhandled exception ]--- Fix this by adding the required `BTI C`, as we only require these to be reachable via BL for direct calls or BR X16/X17 for PLTs. For now, these are open-coded in the function prologue, matching the style of the `__hwasan_tag_mismatch` trampoline. In future we may wish to consider adding a new SYM_CODE_START_*() variant which has an implicit BTI. When ftrace is built atop mcount, the trampolines are marked with SYM_FUNC_START(), and so get an implicit BTI. We may need to change these over to SYM_CODE_START() in future for RELIABLE_STACKTRACE, in case we need to apply special care aroud the return address being rewritten. Fixes: 97fed779f2a6 ("arm64: bti: Provide Kconfig for kernel mode BTI") Signed-off-by: Mark Rutland Cc: Catalin Marinas Cc: Mark Brown Cc: Will Deacon Reviewed-by: Mark Brown Link: https://lore.kernel.org/r/20211129135709.2274019-1-mark.rutland@arm.com Signed-off-by: Will Deacon --- arch/arm64/kernel/entry-ftrace.S | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'arch') diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S index b3e4f9a088b1..8cf970d219f5 100644 --- a/arch/arm64/kernel/entry-ftrace.S +++ b/arch/arm64/kernel/entry-ftrace.S @@ -77,11 +77,17 @@ .endm SYM_CODE_START(ftrace_regs_caller) +#ifdef BTI_C + BTI_C +#endif ftrace_regs_entry 1 b ftrace_common SYM_CODE_END(ftrace_regs_caller) SYM_CODE_START(ftrace_caller) +#ifdef BTI_C + BTI_C +#endif ftrace_regs_entry 0 b ftrace_common SYM_CODE_END(ftrace_caller) -- cgit v1.2.3 From 3c088b1e82cfb7c889823d39846d32079f190f3f Mon Sep 17 00:00:00 2001 From: Heiko Carstens Date: Fri, 26 Nov 2021 15:16:31 +0100 Subject: s390: update defconfigs Signed-off-by: Heiko Carstens --- arch/s390/configs/debug_defconfig | 10 ++++++++-- arch/s390/configs/defconfig | 7 ++++++- arch/s390/configs/zfcpdump_defconfig | 2 ++ 3 files changed, 16 insertions(+), 3 deletions(-) (limited to 'arch') diff --git a/arch/s390/configs/debug_defconfig b/arch/s390/configs/debug_defconfig index fd825097cf04..b626bc6e0eaf 100644 --- a/arch/s390/configs/debug_defconfig +++ b/arch/s390/configs/debug_defconfig @@ -403,7 +403,6 @@ CONFIG_DEVTMPFS=y CONFIG_CONNECTOR=y CONFIG_ZRAM=y CONFIG_BLK_DEV_LOOP=m -CONFIG_BLK_DEV_CRYPTOLOOP=m CONFIG_BLK_DEV_DRBD=m CONFIG_BLK_DEV_NBD=m CONFIG_BLK_DEV_RAM=y @@ -476,6 +475,7 @@ CONFIG_MACVLAN=m CONFIG_MACVTAP=m CONFIG_VXLAN=m CONFIG_BAREUDP=m +CONFIG_AMT=m CONFIG_TUN=m CONFIG_VETH=m CONFIG_VIRTIO_NET=m @@ -489,6 +489,7 @@ CONFIG_NLMON=m # CONFIG_NET_VENDOR_AMD is not set # CONFIG_NET_VENDOR_AQUANTIA is not set # CONFIG_NET_VENDOR_ARC is not set +# CONFIG_NET_VENDOR_ASIX is not set # CONFIG_NET_VENDOR_ATHEROS is not set # CONFIG_NET_VENDOR_BROADCOM is not set # CONFIG_NET_VENDOR_BROCADE is not set @@ -571,6 +572,7 @@ CONFIG_WATCHDOG=y CONFIG_WATCHDOG_NOWAYOUT=y CONFIG_SOFT_WATCHDOG=m CONFIG_DIAG288_WATCHDOG=m +# CONFIG_DRM_DEBUG_MODESET_LOCK is not set CONFIG_FB=y CONFIG_FRAMEBUFFER_CONSOLE=y CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y @@ -775,12 +777,14 @@ CONFIG_CRC4=m CONFIG_CRC7=m CONFIG_CRC8=m CONFIG_RANDOM32_SELFTEST=y +CONFIG_XZ_DEC_MICROLZMA=y CONFIG_DMA_CMA=y CONFIG_CMA_SIZE_MBYTES=0 CONFIG_PRINTK_TIME=y CONFIG_DYNAMIC_DEBUG=y CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_DWARF4=y +CONFIG_DEBUG_INFO_BTF=y CONFIG_GDB_SCRIPTS=y CONFIG_HEADERS_INSTALL=y CONFIG_DEBUG_SECTION_MISMATCH=y @@ -807,6 +811,7 @@ CONFIG_DEBUG_MEMORY_INIT=y CONFIG_MEMORY_NOTIFIER_ERROR_INJECT=m CONFIG_DEBUG_PER_CPU_MAPS=y CONFIG_KFENCE=y +CONFIG_KFENCE_STATIC_KEYS=y CONFIG_DEBUG_SHIRQ=y CONFIG_PANIC_ON_OOPS=y CONFIG_DETECT_HUNG_TASK=y @@ -842,6 +847,7 @@ CONFIG_FTRACE_STARTUP_TEST=y CONFIG_SAMPLES=y CONFIG_SAMPLE_TRACE_PRINTK=m CONFIG_SAMPLE_FTRACE_DIRECT=m +CONFIG_SAMPLE_FTRACE_DIRECT_MULTI=m CONFIG_DEBUG_ENTRY=y CONFIG_CIO_INJECT=y CONFIG_KUNIT=m @@ -860,7 +866,7 @@ CONFIG_FAIL_FUNCTION=y CONFIG_FAULT_INJECTION_STACKTRACE_FILTER=y CONFIG_LKDTM=m CONFIG_TEST_MIN_HEAP=y -CONFIG_KPROBES_SANITY_TEST=y +CONFIG_KPROBES_SANITY_TEST=m CONFIG_RBTREE_TEST=y CONFIG_INTERVAL_TREE_TEST=m CONFIG_PERCPU_TEST=m diff --git a/arch/s390/configs/defconfig b/arch/s390/configs/defconfig index c9c3cedff2d8..0056cab27372 100644 --- a/arch/s390/configs/defconfig +++ b/arch/s390/configs/defconfig @@ -394,7 +394,6 @@ CONFIG_DEVTMPFS=y CONFIG_CONNECTOR=y CONFIG_ZRAM=y CONFIG_BLK_DEV_LOOP=m -CONFIG_BLK_DEV_CRYPTOLOOP=m CONFIG_BLK_DEV_DRBD=m CONFIG_BLK_DEV_NBD=m CONFIG_BLK_DEV_RAM=y @@ -467,6 +466,7 @@ CONFIG_MACVLAN=m CONFIG_MACVTAP=m CONFIG_VXLAN=m CONFIG_BAREUDP=m +CONFIG_AMT=m CONFIG_TUN=m CONFIG_VETH=m CONFIG_VIRTIO_NET=m @@ -480,6 +480,7 @@ CONFIG_NLMON=m # CONFIG_NET_VENDOR_AMD is not set # CONFIG_NET_VENDOR_AQUANTIA is not set # CONFIG_NET_VENDOR_ARC is not set +# CONFIG_NET_VENDOR_ASIX is not set # CONFIG_NET_VENDOR_ATHEROS is not set # CONFIG_NET_VENDOR_BROADCOM is not set # CONFIG_NET_VENDOR_BROCADE is not set @@ -762,12 +763,14 @@ CONFIG_PRIME_NUMBERS=m CONFIG_CRC4=m CONFIG_CRC7=m CONFIG_CRC8=m +CONFIG_XZ_DEC_MICROLZMA=y CONFIG_DMA_CMA=y CONFIG_CMA_SIZE_MBYTES=0 CONFIG_PRINTK_TIME=y CONFIG_DYNAMIC_DEBUG=y CONFIG_DEBUG_INFO=y CONFIG_DEBUG_INFO_DWARF4=y +CONFIG_DEBUG_INFO_BTF=y CONFIG_GDB_SCRIPTS=y CONFIG_DEBUG_SECTION_MISMATCH=y CONFIG_MAGIC_SYSRQ=y @@ -792,9 +795,11 @@ CONFIG_HIST_TRIGGERS=y CONFIG_SAMPLES=y CONFIG_SAMPLE_TRACE_PRINTK=m CONFIG_SAMPLE_FTRACE_DIRECT=m +CONFIG_SAMPLE_FTRACE_DIRECT_MULTI=m CONFIG_KUNIT=m CONFIG_KUNIT_DEBUGFS=y CONFIG_LKDTM=m +CONFIG_KPROBES_SANITY_TEST=m CONFIG_PERCPU_TEST=m CONFIG_ATOMIC64_SELFTEST=y CONFIG_TEST_BPF=m diff --git a/arch/s390/configs/zfcpdump_defconfig b/arch/s390/configs/zfcpdump_defconfig index aceccf3b9a88..eed3b9acfa71 100644 --- a/arch/s390/configs/zfcpdump_defconfig +++ b/arch/s390/configs/zfcpdump_defconfig @@ -65,9 +65,11 @@ CONFIG_ZFCP=y # CONFIG_NETWORK_FILESYSTEMS is not set CONFIG_LSM="yama,loadpin,safesetid,integrity" # CONFIG_ZLIB_DFLTCC is not set +CONFIG_XZ_DEC_MICROLZMA=y CONFIG_PRINTK_TIME=y # CONFIG_SYMBOLIC_ERRNAME is not set CONFIG_DEBUG_INFO=y +CONFIG_DEBUG_INFO_BTF=y CONFIG_DEBUG_FS=y CONFIG_DEBUG_KERNEL=y CONFIG_PANIC_ON_OOPS=y -- cgit v1.2.3 From 51523ed1c26758de1af7e58730a656875f72f783 Mon Sep 17 00:00:00 2001 From: Joerg Roedel Date: Thu, 2 Dec 2021 16:32:26 +0100 Subject: x86/64/mm: Map all kernel memory into trampoline_pgd The trampoline_pgd only maps the 0xfffffff000000000-0xffffffffffffffff range of kernel memory (with 4-level paging). This range contains the kernel's text+data+bss mappings and the module mapping space but not the direct mapping and the vmalloc area. This is enough to get the application processors out of real-mode, but for code that switches back to real-mode the trampoline_pgd is missing important parts of the address space. For example, consider this code from arch/x86/kernel/reboot.c, function machine_real_restart() for a 64-bit kernel: #ifdef CONFIG_X86_32 load_cr3(initial_page_table); #else write_cr3(real_mode_header->trampoline_pgd); /* Exiting long mode will fail if CR4.PCIDE is set. */ if (boot_cpu_has(X86_FEATURE_PCID)) cr4_clear_bits(X86_CR4_PCIDE); #endif /* Jump to the identity-mapped low memory code */ #ifdef CONFIG_X86_32 asm volatile("jmpl *%0" : : "rm" (real_mode_header->machine_real_restart_asm), "a" (type)); #else asm volatile("ljmpl *%0" : : "m" (real_mode_header->machine_real_restart_asm), "D" (type)); #endif The code switches to the trampoline_pgd, which unmaps the direct mapping and also the kernel stack. The call to cr4_clear_bits() will find no stack and crash the machine. The real_mode_header pointer below points into the direct mapping, and dereferencing it also causes a crash. The reason this does not crash always is only that kernel mappings are global and the CR3 switch does not flush those mappings. But if theses mappings are not in the TLB already, the above code will crash before it can jump to the real-mode stub. Extend the trampoline_pgd to contain all kernel mappings to prevent these crashes and to make code which runs on this page-table more robust. Signed-off-by: Joerg Roedel Signed-off-by: Borislav Petkov Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20211202153226.22946-5-joro@8bytes.org --- arch/x86/realmode/init.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) (limited to 'arch') diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c index 4a3da7592b99..38d24d2ab38b 100644 --- a/arch/x86/realmode/init.c +++ b/arch/x86/realmode/init.c @@ -72,6 +72,7 @@ static void __init setup_real_mode(void) #ifdef CONFIG_X86_64 u64 *trampoline_pgd; u64 efer; + int i; #endif base = (unsigned char *)real_mode_header; @@ -128,8 +129,17 @@ static void __init setup_real_mode(void) trampoline_header->flags = 0; trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd); + + /* Map the real mode stub as virtual == physical */ trampoline_pgd[0] = trampoline_pgd_entry.pgd; - trampoline_pgd[511] = init_top_pgt[511].pgd; + + /* + * Include the entirety of the kernel mapping into the trampoline + * PGD. This way, all mappings present in the normal kernel page + * tables are usable while running on trampoline_pgd. + */ + for (i = pgd_index(__PAGE_OFFSET); i < PTRS_PER_PGD; i++) + trampoline_pgd[i] = init_top_pgt[i].pgd; #endif sme_sev_setup_real_mode(trampoline_header); -- cgit v1.2.3 From 1d5379d0475419085d3575bd9155f2e558e96390 Mon Sep 17 00:00:00 2001 From: Michael Sterritt Date: Fri, 19 Nov 2021 15:27:57 -0800 Subject: x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword Properly type the operands being passed to __put_user()/__get_user(). Otherwise, these routines truncate data for dependent instructions (e.g., INSW) and only read/write one byte. This has been tested by sending a string with REP OUTSW to a port and then reading it back in with REP INSW on the same port. Previous behavior was to only send and receive the first char of the size. For example, word operations for "abcd" would only read/write "ac". With change, the full string is now written and read back. Fixes: f980f9c31a923 (x86/sev-es: Compile early handler code into kernel image) Signed-off-by: Michael Sterritt Signed-off-by: Borislav Petkov Reviewed-by: Paolo Bonzini Reviewed-by: Marc Orr Reviewed-by: Peter Gonda Reviewed-by: Joerg Roedel Link: https://lkml.kernel.org/r/20211119232757.176201-1-sterritt@google.com --- arch/x86/kernel/sev.c | 57 +++++++++++++++++++++++++++++++++++---------------- 1 file changed, 39 insertions(+), 18 deletions(-) (limited to 'arch') diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 74f0ec955384..a9fc2ac7a8bd 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -294,11 +294,6 @@ static enum es_result vc_write_mem(struct es_em_ctxt *ctxt, char *dst, char *buf, size_t size) { unsigned long error_code = X86_PF_PROT | X86_PF_WRITE; - char __user *target = (char __user *)dst; - u64 d8; - u32 d4; - u16 d2; - u8 d1; /* * This function uses __put_user() independent of whether kernel or user @@ -320,26 +315,42 @@ static enum es_result vc_write_mem(struct es_em_ctxt *ctxt, * instructions here would cause infinite nesting. */ switch (size) { - case 1: + case 1: { + u8 d1; + u8 __user *target = (u8 __user *)dst; + memcpy(&d1, buf, 1); if (__put_user(d1, target)) goto fault; break; - case 2: + } + case 2: { + u16 d2; + u16 __user *target = (u16 __user *)dst; + memcpy(&d2, buf, 2); if (__put_user(d2, target)) goto fault; break; - case 4: + } + case 4: { + u32 d4; + u32 __user *target = (u32 __user *)dst; + memcpy(&d4, buf, 4); if (__put_user(d4, target)) goto fault; break; - case 8: + } + case 8: { + u64 d8; + u64 __user *target = (u64 __user *)dst; + memcpy(&d8, buf, 8); if (__put_user(d8, target)) goto fault; break; + } default: WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size); return ES_UNSUPPORTED; @@ -362,11 +373,6 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt, char *src, char *buf, size_t size) { unsigned long error_code = X86_PF_PROT; - char __user *s = (char __user *)src; - u64 d8; - u32 d4; - u16 d2; - u8 d1; /* * This function uses __get_user() independent of whether kernel or user @@ -388,26 +394,41 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt, * instructions here would cause infinite nesting. */ switch (size) { - case 1: + case 1: { + u8 d1; + u8 __user *s = (u8 __user *)src; + if (__get_user(d1, s)) goto fault; memcpy(buf, &d1, 1); break; - case 2: + } + case 2: { + u16 d2; + u16 __user *s = (u16 __user *)src; + if (__get_user(d2, s)) goto fault; memcpy(buf, &d2, 2); break; - case 4: + } + case 4: { + u32 d4; + u32 __user *s = (u32 __user *)src; + if (__get_user(d4, s)) goto fault; memcpy(buf, &d4, 4); break; - case 8: + } + case 8: { + u64 d8; + u64 __user *s = (u64 __user *)src; if (__get_user(d8, s)) goto fault; memcpy(buf, &d8, 8); break; + } default: WARN_ONCE(1, "%s: Invalid size: %zu\n", __func__, size); return ES_UNSUPPORTED; -- cgit v1.2.3 From c07e45553da1808aa802e9f0ffa8108cfeaf7a17 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Fri, 26 Nov 2021 18:11:21 +0800 Subject: x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry() Commit 18ec54fdd6d18 ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations") added FENCE_SWAPGS_{KERNEL|USER}_ENTRY for conditional SWAPGS. In paranoid_entry(), it uses only FENCE_SWAPGS_KERNEL_ENTRY for both branches. This is because the fence is required for both cases since the CR3 write is conditional even when PTI is enabled. But 96b2371413e8f ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry") changed the order of SWAPGS and the CR3 write. And it missed the needed FENCE_SWAPGS_KERNEL_ENTRY for the user gsbase case. Add it back by changing the branches so that FENCE_SWAPGS_KERNEL_ENTRY can cover both branches. [ bp: Massage, fix typos, remove obsolete comment while at it. ] Fixes: 96b2371413e8f ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry") Signed-off-by: Lai Jiangshan Signed-off-by: Borislav Petkov Link: https://lkml.kernel.org/r/20211126101209.8613-2-jiangshanlai@gmail.com --- arch/x86/entry/entry_64.S | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) (limited to 'arch') diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index e38a4cf795d9..f1a8b5b2af96 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -890,6 +890,7 @@ SYM_CODE_START_LOCAL(paranoid_entry) .Lparanoid_entry_checkgs: /* EBX = 1 -> kernel GSBASE active, no restore required */ movl $1, %ebx + /* * The kernel-enforced convention is a negative GSBASE indicates * a kernel value. No SWAPGS needed on entry and exit. @@ -897,21 +898,14 @@ SYM_CODE_START_LOCAL(paranoid_entry) movl $MSR_GS_BASE, %ecx rdmsr testl %edx, %edx - jns .Lparanoid_entry_swapgs - ret + js .Lparanoid_kernel_gsbase -.Lparanoid_entry_swapgs: + /* EBX = 0 -> SWAPGS required on exit */ + xorl %ebx, %ebx swapgs +.Lparanoid_kernel_gsbase: - /* - * The above SAVE_AND_SWITCH_TO_KERNEL_CR3 macro doesn't do an - * unconditional CR3 write, even in the PTI case. So do an lfence - * to prevent GS speculation, regardless of whether PTI is enabled. - */ FENCE_SWAPGS_KERNEL_ENTRY - - /* EBX = 0 -> SWAPGS required on exit */ - xorl %ebx, %ebx ret SYM_CODE_END(paranoid_entry) -- cgit v1.2.3 From 1367afaa2ee90d1c956dfc224e199fcb3ff3f8cc Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Fri, 26 Nov 2021 18:11:22 +0800 Subject: x86/entry: Use the correct fence macro after swapgs in kernel CR3 The commit c75890700455 ("x86/entry/64: Remove unneeded kernel CR3 switching") removed a CR3 write in the faulting path of load_gs_index(). But the path's FENCE_SWAPGS_USER_ENTRY has no fence operation if PTI is enabled, see spectre_v1_select_mitigation(). Rather, it depended on the serializing CR3 write of SWITCH_TO_KERNEL_CR3 and since it got removed, add a FENCE_SWAPGS_KERNEL_ENTRY call to make sure speculation is blocked. [ bp: Massage commit message and comment. ] Fixes: c75890700455 ("x86/entry/64: Remove unneeded kernel CR3 switching") Signed-off-by: Lai Jiangshan Signed-off-by: Borislav Petkov Acked-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20211126101209.8613-3-jiangshanlai@gmail.com --- arch/x86/entry/entry_64.S | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) (limited to 'arch') diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index f1a8b5b2af96..f9e1c06a1c32 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -987,11 +987,6 @@ SYM_CODE_START_LOCAL(error_entry) pushq %r12 ret -.Lerror_entry_done_lfence: - FENCE_SWAPGS_KERNEL_ENTRY -.Lerror_entry_done: - ret - /* * There are two places in the kernel that can potentially fault with * usergs. Handle them here. B stepping K8s sometimes report a @@ -1014,8 +1009,14 @@ SYM_CODE_START_LOCAL(error_entry) * .Lgs_change's error handler with kernel gsbase. */ SWAPGS - FENCE_SWAPGS_USER_ENTRY - jmp .Lerror_entry_done + + /* + * Issue an LFENCE to prevent GS speculation, regardless of whether it is a + * kernel or user gsbase. + */ +.Lerror_entry_done_lfence: + FENCE_SWAPGS_KERNEL_ENTRY + ret .Lbstep_iret: /* Fix truncated RIP */ -- cgit v1.2.3 From 5c8f6a2e316efebb3ba93d8c1af258155dcf5632 Mon Sep 17 00:00:00 2001 From: Lai Jiangshan Date: Fri, 26 Nov 2021 18:11:23 +0800 Subject: x86/xen: Add xenpv_restore_regs_and_return_to_usermode() In the native case, PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is the trampoline stack. But XEN pv doesn't use trampoline stack, so PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is also the kernel stack. In that case, source and destination stacks are identical, which means that reusing swapgs_restore_regs_and_return_to_usermode() in XEN pv would cause %rsp to move up to the top of the kernel stack and leave the IRET frame below %rsp. This is dangerous as it can be corrupted if #NMI / #MC hit as either of these events occurring in the middle of the stack pushing would clobber data on the (original) stack. And, with XEN pv, swapgs_restore_regs_and_return_to_usermode() pushing the IRET frame on to the original address is useless and error-prone when there is any future attempt to modify the code. [ bp: Massage commit message. ] Fixes: 7f2590a110b8 ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries") Signed-off-by: Lai Jiangshan Signed-off-by: Borislav Petkov Reviewed-by: Boris Ostrovsky Link: https://lkml.kernel.org/r/20211126101209.8613-4-jiangshanlai@gmail.com --- arch/x86/entry/entry_64.S | 4 ++++ arch/x86/xen/xen-asm.S | 20 ++++++++++++++++++++ 2 files changed, 24 insertions(+) (limited to 'arch') diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index f9e1c06a1c32..97b1f84bb53f 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -574,6 +574,10 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL) ud2 1: #endif +#ifdef CONFIG_XEN_PV + ALTERNATIVE "", "jmp xenpv_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV +#endif + POP_REGS pop_rdi=0 /* diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S index 220dd9678494..444d824775f6 100644 --- a/arch/x86/xen/xen-asm.S +++ b/arch/x86/xen/xen-asm.S @@ -20,6 +20,7 @@ #include #include +#include <../entry/calling.h> .pushsection .noinstr.text, "ax" /* @@ -192,6 +193,25 @@ SYM_CODE_START(xen_iret) jmp hypercall_iret SYM_CODE_END(xen_iret) +/* + * XEN pv doesn't use trampoline stack, PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is + * also the kernel stack. Reusing swapgs_restore_regs_and_return_to_usermode() + * in XEN pv would cause %rsp to move up to the top of the kernel stack and + * leave the IRET frame below %rsp, which is dangerous to be corrupted if #NMI + * interrupts. And swapgs_restore_regs_and_return_to_usermode() pushing the IRET + * frame at the same address is useless. + */ +SYM_CODE_START(xenpv_restore_regs_and_return_to_usermode) + UNWIND_HINT_REGS + POP_REGS + + /* stackleak_erase() can work safely on the kernel stack. */ + STACKLEAK_ERASE_NOCLOBBER + + addq $8, %rsp /* skip regs->orig_ax */ + jmp xen_iret +SYM_CODE_END(xenpv_restore_regs_and_return_to_usermode) + /* * Xen handles syscall callbacks much like ordinary exceptions, which * means we have: -- cgit v1.2.3 From 0f9fee4cdebfbe695c297e5b603a275e2557c1cc Mon Sep 17 00:00:00 2001 From: Helge Deller Date: Sat, 4 Dec 2021 21:14:40 +0100 Subject: parisc: Fix "make install" on newer debian releases On newer debian releases the debian-provided "installkernel" script is installed in /usr/sbin. Fix the kernel install.sh script to look for the script in this directory as well. Signed-off-by: Helge Deller Cc: # v3.13+ --- arch/parisc/install.sh | 1 + 1 file changed, 1 insertion(+) (limited to 'arch') diff --git a/arch/parisc/install.sh b/arch/parisc/install.sh index 056d588befdd..70d3cffb0251 100644 --- a/arch/parisc/install.sh +++ b/arch/parisc/install.sh @@ -39,6 +39,7 @@ verify "$3" if [ -n "${INSTALLKERNEL}" ]; then if [ -x ~/bin/${INSTALLKERNEL} ]; then exec ~/bin/${INSTALLKERNEL} "$@"; fi if [ -x /sbin/${INSTALLKERNEL} ]; then exec /sbin/${INSTALLKERNEL} "$@"; fi + if [ -x /usr/sbin/${INSTALLKERNEL} ]; then exec /usr/sbin/${INSTALLKERNEL} "$@"; fi fi # Default install -- cgit v1.2.3 From afdb4a5b1d340e4afffc65daa21cc71890d7d589 Mon Sep 17 00:00:00 2001 From: Helge Deller Date: Sat, 4 Dec 2021 21:21:46 +0100 Subject: parisc: Mark cr16 CPU clocksource unstable on all SMP machines In commit c8c3735997a3 ("parisc: Enhance detection of synchronous cr16 clocksources") I assumed that CPUs on the same physical core are syncronous. While booting up the kernel on two different C8000 machines, one with a dual-core PA8800 and one with a dual-core PA8900 CPU, this turned out to be wrong. The symptom was that I saw a jump in the internal clocks printed to the syslog and strange overall behaviour. On machines which have 4 cores (2 dual-cores) the problem isn't visible, because the current logic already marked the cr16 clocksource unstable in this case. This patch now marks the cr16 interval timers unstable if we have more than one CPU in the system, and it fixes this issue. Fixes: c8c3735997a3 ("parisc: Enhance detection of synchronous cr16 clocksources") Signed-off-by: Helge Deller Cc: # v5.15+ --- arch/parisc/kernel/time.c | 30 ++++++++---------------------- 1 file changed, 8 insertions(+), 22 deletions(-) (limited to 'arch') diff --git a/arch/parisc/kernel/time.c b/arch/parisc/kernel/time.c index 9fb1e794831b..061119a56fbe 100644 --- a/arch/parisc/kernel/time.c +++ b/arch/parisc/kernel/time.c @@ -249,30 +249,16 @@ void __init time_init(void) static int __init init_cr16_clocksource(void) { /* - * The cr16 interval timers are not syncronized across CPUs on - * different sockets, so mark them unstable and lower rating on - * multi-socket SMP systems. + * The cr16 interval timers are not syncronized across CPUs, even if + * they share the same socket. */ if (num_online_cpus() > 1 && !running_on_qemu) { - int cpu; - unsigned long cpu0_loc; - cpu0_loc = per_cpu(cpu_data, 0).cpu_loc; - - for_each_online_cpu(cpu) { - if (cpu == 0) - continue; - if ((cpu0_loc != 0) && - (cpu0_loc == per_cpu(cpu_data, cpu).cpu_loc)) - continue; - - /* mark sched_clock unstable */ - clear_sched_clock_stable(); - - clocksource_cr16.name = "cr16_unstable"; - clocksource_cr16.flags = CLOCK_SOURCE_UNSTABLE; - clocksource_cr16.rating = 0; - break; - } + /* mark sched_clock unstable */ + clear_sched_clock_stable(); + + clocksource_cr16.name = "cr16_unstable"; + clocksource_cr16.flags = CLOCK_SOURCE_UNSTABLE; + clocksource_cr16.rating = 0; } /* register at clocksource framework */ -- cgit v1.2.3 From 75236f5f2299b502e4b9b267c1ce3bc14a222ceb Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Nov 2021 22:23:49 +0000 Subject: KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails Return appropriate error codes if setting up the GHCB scratch area for an SEV-ES guest fails. In particular, returning -EINVAL instead of -ENOMEM when allocating the kernel buffer could be confusing as userspace would likely suspect a guest issue. Fixes: 8f423a80d299 ("KVM: SVM: Support MMIO for an SEV-ES guest") Cc: Tom Lendacky Signed-off-by: Sean Christopherson Message-Id: <20211109222350.2266045-2-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 30 +++++++++++++++++------------- 1 file changed, 17 insertions(+), 13 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 59727a966f90..f35f59bfdb41 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2530,7 +2530,7 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu) } #define GHCB_SCRATCH_AREA_LIMIT (16ULL * PAGE_SIZE) -static bool setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) +static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) { struct vmcb_control_area *control = &svm->vmcb->control; struct ghcb *ghcb = svm->sev_es.ghcb; @@ -2541,14 +2541,14 @@ static bool setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) scratch_gpa_beg = ghcb_get_sw_scratch(ghcb); if (!scratch_gpa_beg) { pr_err("vmgexit: scratch gpa not provided\n"); - return false; + return -EINVAL; } scratch_gpa_end = scratch_gpa_beg + len; if (scratch_gpa_end < scratch_gpa_beg) { pr_err("vmgexit: scratch length (%#llx) not valid for scratch address (%#llx)\n", len, scratch_gpa_beg); - return false; + return -EINVAL; } if ((scratch_gpa_beg & PAGE_MASK) == control->ghcb_gpa) { @@ -2566,7 +2566,7 @@ static bool setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) scratch_gpa_end > ghcb_scratch_end) { pr_err("vmgexit: scratch area is outside of GHCB shared buffer area (%#llx - %#llx)\n", scratch_gpa_beg, scratch_gpa_end); - return false; + return -EINVAL; } scratch_va = (void *)svm->sev_es.ghcb; @@ -2579,18 +2579,18 @@ static bool setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) if (len > GHCB_SCRATCH_AREA_LIMIT) { pr_err("vmgexit: scratch area exceeds KVM limits (%#llx requested, %#llx limit)\n", len, GHCB_SCRATCH_AREA_LIMIT); - return false; + return -EINVAL; } scratch_va = kzalloc(len, GFP_KERNEL_ACCOUNT); if (!scratch_va) - return false; + return -ENOMEM; if (kvm_read_guest(svm->vcpu.kvm, scratch_gpa_beg, scratch_va, len)) { /* Unable to copy scratch area from guest */ pr_err("vmgexit: kvm_read_guest for scratch area failed\n"); kfree(scratch_va); - return false; + return -EFAULT; } /* @@ -2606,7 +2606,7 @@ static bool setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) svm->sev_es.ghcb_sa = scratch_va; svm->sev_es.ghcb_sa_len = len; - return true; + return 0; } static void set_ghcb_msr_bits(struct vcpu_svm *svm, u64 value, u64 mask, @@ -2745,10 +2745,10 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) ghcb_set_sw_exit_info_1(ghcb, 0); ghcb_set_sw_exit_info_2(ghcb, 0); - ret = -EINVAL; switch (exit_code) { case SVM_VMGEXIT_MMIO_READ: - if (!setup_vmgexit_scratch(svm, true, control->exit_info_2)) + ret = setup_vmgexit_scratch(svm, true, control->exit_info_2); + if (ret) break; ret = kvm_sev_es_mmio_read(vcpu, @@ -2757,7 +2757,8 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) svm->sev_es.ghcb_sa); break; case SVM_VMGEXIT_MMIO_WRITE: - if (!setup_vmgexit_scratch(svm, false, control->exit_info_2)) + ret = setup_vmgexit_scratch(svm, false, control->exit_info_2); + if (ret) break; ret = kvm_sev_es_mmio_write(vcpu, @@ -2800,6 +2801,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) vcpu_unimpl(vcpu, "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n", control->exit_info_1, control->exit_info_2); + ret = -EINVAL; break; default: ret = svm_invoke_exit_handler(vcpu, exit_code); @@ -2812,6 +2814,7 @@ int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in) { int count; int bytes; + int r; if (svm->vmcb->control.exit_info_2 > INT_MAX) return -EINVAL; @@ -2820,8 +2823,9 @@ int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in) if (unlikely(check_mul_overflow(count, size, &bytes))) return -EINVAL; - if (!setup_vmgexit_scratch(svm, in, bytes)) - return -EINVAL; + r = setup_vmgexit_scratch(svm, in, bytes); + if (r) + return r; return kvm_sev_es_string_io(&svm->vcpu, size, port, svm->sev_es.ghcb_sa, count, in); -- cgit v1.2.3 From a655276a594978a4887520c1241cf6ac49d6230b Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Tue, 9 Nov 2021 22:23:50 +0000 Subject: KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so that KVM falls back to __vmalloc() if physically contiguous memory isn't available. The buffer is purely a KVM software construct, i.e. there's no need for it to be physically contiguous. Cc: Tom Lendacky Signed-off-by: Sean Christopherson Message-Id: <20211109222350.2266045-3-seanjc@google.com> Signed-off-by: Paolo Bonzini --- arch/x86/kvm/svm/sev.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'arch') diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index f35f59bfdb41..94bde57df72e 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2260,7 +2260,7 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu) __free_page(virt_to_page(svm->sev_es.vmsa)); if (svm->sev_es.ghcb_sa_free) - kfree(svm->sev_es.ghcb_sa); + kvfree(svm->sev_es.ghcb_sa); } static void dump_ghcb(struct vcpu_svm *svm) @@ -2493,7 +2493,7 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm) svm->sev_es.ghcb_sa_sync = false; } - kfree(svm->sev_es.ghcb_sa); + kvfree(svm->sev_es.ghcb_sa); svm->sev_es.ghcb_sa = NULL; svm->sev_es.ghcb_sa_free = false; } @@ -2581,7 +2581,7 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) len, GHCB_SCRATCH_AREA_LIMIT); return -EINVAL; } - scratch_va = kzalloc(len, GFP_KERNEL_ACCOUNT); + scratch_va = kvzalloc(len, GFP_KERNEL_ACCOUNT); if (!scratch_va) return -ENOMEM; @@ -2589,7 +2589,7 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) /* Unable to copy scratch area from guest */ pr_err("vmgexit: kvm_read_guest for scratch area failed\n"); - kfree(scratch_va); + kvfree(scratch_va); return -EFAULT; } -- cgit v1.2.3 From ad5b353240c8837109d1bcc6c3a9a501d7f6a960 Mon Sep 17 00:00:00 2001 From: Tom Lendacky Date: Thu, 2 Dec 2021 12:52:05 -0600 Subject: KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT exit code or exit parameters fails. The VMGEXIT instruction can be issued from userspace, even though userspace (likely) can't update the GHCB. To prevent userspace from being able to kill the guest, return an error through the GHCB when validation fails rather than terminating the guest. For cases where the GHCB can't be updated (e.g. the GHCB can't be mapped, etc.), just return back to the guest. The new error codes are documented in the lasest update to the GHCB specification. Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Signed-off-by: Tom Lendacky Message-Id: Signed-off-by: Paolo Bonzini --- arch/x86/include/asm/sev-common.h | 11 ++++ arch/x86/kvm/svm/sev.c | 106 +++++++++++++++++++++----------------- 2 files changed, 71 insertions(+), 46 deletions(-) (limited to 'arch') diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h index 2cef6c5a52c2..6acaf5af0a3d 100644 --- a/arch/x86/include/asm/sev-common.h +++ b/arch/x86/include/asm/sev-common.h @@ -73,4 +73,15 @@ #define GHCB_RESP_CODE(v) ((v) & GHCB_MSR_INFO_MASK) +/* + * Error codes related to GHCB input that can be communicated back to the guest + * by setting the lower 32-bits of the GHCB SW_EXITINFO1 field to 2. + */ +#define GHCB_ERR_NOT_REGISTERED 1 +#define GHCB_ERR_INVALID_USAGE 2 +#define GHCB_ERR_INVALID_SCRATCH_AREA 3 +#define GHCB_ERR_MISSING_INPUT 4 +#define GHCB_ERR_INVALID_INPUT 5 +#define GHCB_ERR_INVALID_EVENT 6 + #endif diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 94bde57df72e..7656a2c5662a 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -2352,24 +2352,29 @@ static void sev_es_sync_from_ghcb(struct vcpu_svm *svm) memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap)); } -static int sev_es_validate_vmgexit(struct vcpu_svm *svm) +static bool sev_es_validate_vmgexit(struct vcpu_svm *svm) { struct kvm_vcpu *vcpu; struct ghcb *ghcb; - u64 exit_code = 0; + u64 exit_code; + u64 reason; ghcb = svm->sev_es.ghcb; - /* Only GHCB Usage code 0 is supported */ - if (ghcb->ghcb_usage) - goto vmgexit_err; - /* - * Retrieve the exit code now even though is may not be marked valid + * Retrieve the exit code now even though it may not be marked valid * as it could help with debugging. */ exit_code = ghcb_get_sw_exit_code(ghcb); + /* Only GHCB Usage code 0 is supported */ + if (ghcb->ghcb_usage) { + reason = GHCB_ERR_INVALID_USAGE; + goto vmgexit_err; + } + + reason = GHCB_ERR_MISSING_INPUT; + if (!ghcb_sw_exit_code_is_valid(ghcb) || !ghcb_sw_exit_info_1_is_valid(ghcb) || !ghcb_sw_exit_info_2_is_valid(ghcb)) @@ -2448,30 +2453,34 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm) case SVM_VMGEXIT_UNSUPPORTED_EVENT: break; default: + reason = GHCB_ERR_INVALID_EVENT; goto vmgexit_err; } - return 0; + return true; vmgexit_err: vcpu = &svm->vcpu; - if (ghcb->ghcb_usage) { + if (reason == GHCB_ERR_INVALID_USAGE) { vcpu_unimpl(vcpu, "vmgexit: ghcb usage %#x is not valid\n", ghcb->ghcb_usage); + } else if (reason == GHCB_ERR_INVALID_EVENT) { + vcpu_unimpl(vcpu, "vmgexit: exit code %#llx is not valid\n", + exit_code); } else { - vcpu_unimpl(vcpu, "vmgexit: exit reason %#llx is not valid\n", + vcpu_unimpl(vcpu, "vmgexit: exit code %#llx input is not valid\n", exit_code); dump_ghcb(svm); } - vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; - vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON; - vcpu->run->internal.ndata = 2; - vcpu->run->internal.data[0] = exit_code; - vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu; + /* Clear the valid entries fields */ + memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap)); + + ghcb_set_sw_exit_info_1(ghcb, 2); + ghcb_set_sw_exit_info_2(ghcb, reason); - return -EINVAL; + return false; } void sev_es_unmap_ghcb(struct vcpu_svm *svm) @@ -2530,7 +2539,7 @@ void pre_sev_run(struct vcpu_svm *svm, int cpu) } #define GHCB_SCRATCH_AREA_LIMIT (16ULL * PAGE_SIZE) -static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) +static bool setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) { struct vmcb_control_area *control = &svm->vmcb->control; struct ghcb *ghcb = svm->sev_es.ghcb; @@ -2541,14 +2550,14 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) scratch_gpa_beg = ghcb_get_sw_scratch(ghcb); if (!scratch_gpa_beg) { pr_err("vmgexit: scratch gpa not provided\n"); - return -EINVAL; + goto e_scratch; } scratch_gpa_end = scratch_gpa_beg + len; if (scratch_gpa_end < scratch_gpa_beg) { pr_err("vmgexit: scratch length (%#llx) not valid for scratch address (%#llx)\n", len, scratch_gpa_beg); - return -EINVAL; + goto e_scratch; } if ((scratch_gpa_beg & PAGE_MASK) == control->ghcb_gpa) { @@ -2566,7 +2575,7 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) scratch_gpa_end > ghcb_scratch_end) { pr_err("vmgexit: scratch area is outside of GHCB shared buffer area (%#llx - %#llx)\n", scratch_gpa_beg, scratch_gpa_end); - return -EINVAL; + goto e_scratch; } scratch_va = (void *)svm->sev_es.ghcb; @@ -2579,18 +2588,18 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) if (len > GHCB_SCRATCH_AREA_LIMIT) { pr_err("vmgexit: scratch area exceeds KVM limits (%#llx requested, %#llx limit)\n", len, GHCB_SCRATCH_AREA_LIMIT); - return -EINVAL; + goto e_scratch; } scratch_va = kvzalloc(len, GFP_KERNEL_ACCOUNT); if (!scratch_va) - return -ENOMEM; + goto e_scratch; if (kvm_read_guest(svm->vcpu.kvm, scratch_gpa_beg, scratch_va, len)) { /* Unable to copy scratch area from guest */ pr_err("vmgexit: kvm_read_guest for scratch area failed\n"); kvfree(scratch_va); - return -EFAULT; + goto e_scratch; } /* @@ -2606,7 +2615,13 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len) svm->sev_es.ghcb_sa = scratch_va; svm->sev_es.ghcb_sa_len = len; - return 0; + return true; + +e_scratch: + ghcb_set_sw_exit_info_1(ghcb, 2); + ghcb_set_sw_exit_info_2(ghcb, GHCB_ERR_INVALID_SCRATCH_AREA); + + return false; } static void set_ghcb_msr_bits(struct vcpu_svm *svm, u64 value, u64 mask, @@ -2657,7 +2672,7 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_CPUID); if (!ret) { - ret = -EINVAL; + /* Error, keep GHCB MSR value as-is */ break; } @@ -2693,10 +2708,13 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) GHCB_MSR_TERM_REASON_POS); pr_info("SEV-ES guest requested termination: %#llx:%#llx\n", reason_set, reason_code); - fallthrough; + + ret = -EINVAL; + break; } default: - ret = -EINVAL; + /* Error, keep GHCB MSR value as-is */ + break; } trace_kvm_vmgexit_msr_protocol_exit(svm->vcpu.vcpu_id, @@ -2720,14 +2738,18 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) if (!ghcb_gpa) { vcpu_unimpl(vcpu, "vmgexit: GHCB gpa is not set\n"); - return -EINVAL; + + /* Without a GHCB, just return right back to the guest */ + return 1; } if (kvm_vcpu_map(vcpu, ghcb_gpa >> PAGE_SHIFT, &svm->sev_es.ghcb_map)) { /* Unable to map GHCB from guest */ vcpu_unimpl(vcpu, "vmgexit: error mapping GHCB [%#llx] from guest\n", ghcb_gpa); - return -EINVAL; + + /* Without a GHCB, just return right back to the guest */ + return 1; } svm->sev_es.ghcb = svm->sev_es.ghcb_map.hva; @@ -2737,18 +2759,17 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) exit_code = ghcb_get_sw_exit_code(ghcb); - ret = sev_es_validate_vmgexit(svm); - if (ret) - return ret; + if (!sev_es_validate_vmgexit(svm)) + return 1; sev_es_sync_from_ghcb(svm); ghcb_set_sw_exit_info_1(ghcb, 0); ghcb_set_sw_exit_info_2(ghcb, 0); + ret = 1; switch (exit_code) { case SVM_VMGEXIT_MMIO_READ: - ret = setup_vmgexit_scratch(svm, true, control->exit_info_2); - if (ret) + if (!setup_vmgexit_scratch(svm, true, control->exit_info_2)) break; ret = kvm_sev_es_mmio_read(vcpu, @@ -2757,8 +2778,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) svm->sev_es.ghcb_sa); break; case SVM_VMGEXIT_MMIO_WRITE: - ret = setup_vmgexit_scratch(svm, false, control->exit_info_2); - if (ret) + if (!setup_vmgexit_scratch(svm, false, control->exit_info_2)) break; ret = kvm_sev_es_mmio_write(vcpu, @@ -2787,14 +2807,10 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) default: pr_err("svm: vmgexit: unsupported AP jump table request - exit_info_1=%#llx\n", control->exit_info_1); - ghcb_set_sw_exit_info_1(ghcb, 1); - ghcb_set_sw_exit_info_2(ghcb, - X86_TRAP_UD | - SVM_EVTINJ_TYPE_EXEPT | - SVM_EVTINJ_VALID); + ghcb_set_sw_exit_info_1(ghcb, 2); + ghcb_set_sw_exit_info_2(ghcb, GHCB_ERR_INVALID_INPUT); } - ret = 1; break; } case SVM_VMGEXIT_UNSUPPORTED_EVENT: @@ -2814,7 +2830,6 @@ int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in) { int count; int bytes; - int r; if (svm->vmcb->control.exit_info_2 > INT_MAX) return -EINVAL; @@ -2823,9 +2838,8 @@ int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in) if (unlikely(check_mul_overflow(count, size, &bytes))) return -EINVAL; - r = setup_vmgexit_scratch(svm, in, bytes); - if (r) - return r; + if (!setup_vmgexit_scratch(svm, in, bytes)) + return 1; return kvm_sev_es_string_io(&svm->vcpu, size, port, svm->sev_es.ghcb_sa, count, in); -- cgit v1.2.3