summaryrefslogtreecommitdiffstats
path: root/arch/x86
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'x86_entry_for_v5.11_rc6' of ↵Linus Torvalds2021-01-311-10/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: "A single fix for objtool to generate proper unwind info for newer toolchains which do not generate section symbols anymore. And a cleanup ontop. This was originally going to go during the next merge window but people can already trigger a build error with binutils-2.36 which doesn't emit section symbols - something which objtool relies on - so let's expedite it" * tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry: Remove put_ret_addr_in_rdi THUNK macro argument x86/entry: Emit a symbol for register restoring thunk
| * x86/entry: Remove put_ret_addr_in_rdi THUNK macro argumentBorislav Petkov2021-01-191-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | That logic is unused since 320100a5ffe5 ("x86/entry: Remove the TRACE_IRQS cruft") Remove it. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/YAAszZJ2GcIYZmB5@hirez.programming.kicks-ass.net
| * x86/entry: Emit a symbol for register restoring thunkNick Desaulniers2021-01-141-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Arnd found a randconfig that produces the warning: arch/x86/entry/thunk_64.o: warning: objtool: missing symbol for insn at offset 0x3e when building with LLVM_IAS=1 (Clang's integrated assembler). Josh notes: With the LLVM assembler not generating section symbols, objtool has no way to reference this code when it generates ORC unwinder entries, because this code is outside of any ELF function. The limitation now being imposed by objtool is that all code must be contained in an ELF symbol. And .L symbols don't create such symbols. So basically, you can use an .L symbol *inside* a function or a code segment, you just can't use the .L symbol to contain the code using a SYM_*_START/END annotation pair. Fangrui notes that this optimization is helpful for reducing image size when compiling with -ffunction-sections and -fdata-sections. I have observed on the order of tens of thousands of symbols for the kernel images built with those flags. A patch has been authored against GNU binutils to match this behavior of not generating unused section symbols ([1]), so this will also become a problem for users of GNU binutils once they upgrade to 2.36. Omit the .L prefix on a label so that the assembler will emit an entry into the symbol table for the label, with STB_LOCAL binding. This enables objtool to generate proper unwind info here with LLVM_IAS=1 or GNU binutils 2.36+. [ bp: Massage commit message. ] Reported-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Suggested-by: Borislav Petkov <bp@alien8.de> Suggested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20210112194625.4181814-1-ndesaulniers@google.com Link: https://github.com/ClangBuiltLinux/linux/issues/1209 Link: https://reviews.llvm.org/D93783 Link: https://sourceware.org/binutils/docs/as/Symbol-Names.html Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1408485ce69f844dcd7ded093a8 [1]
* | Merge tag 'for-linus-5.11-rc6-tag' of ↵Linus Torvalds2021-01-283-1/+16
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - A fix for a regression introduced in 5.11 resulting in Xen dom0 having problems to correctly initialize Xenstore. - A fix for avoiding WARN splats when booting as Xen dom0 with CONFIG_AMD_MEM_ENCRYPT enabled due to a missing trap handler for the #VC exception (even if the handler should never be called). - A fix for the Xen bklfront driver adapting to the correct but unexpected behavior of new qemu. * tag 'for-linus-5.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled xen: Fix XenStore initialisation for XS_LOCAL xen-blkfront: allow discard-* nodes to be optional
| * | x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabledJuergen Gross2021-01-273-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When booting a kernel which has been built with CONFIG_AMD_MEM_ENCRYPT enabled as a Xen pv guest a warning is issued for each processor: [ 5.964347] ------------[ cut here ]------------ [ 5.968314] WARNING: CPU: 0 PID: 1 at /home/gross/linux/head/arch/x86/xen/enlighten_pv.c:660 get_trap_addr+0x59/0x90 [ 5.972321] Modules linked in: [ 5.976313] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 5.11.0-rc5-default #75 [ 5.980313] Hardware name: Dell Inc. OptiPlex 9020/0PC5F7, BIOS A05 12/05/2013 [ 5.984313] RIP: e030:get_trap_addr+0x59/0x90 [ 5.988313] Code: 42 10 83 f0 01 85 f6 74 04 84 c0 75 1d b8 01 00 00 00 c3 48 3d 00 80 83 82 72 08 48 3d 20 81 83 82 72 0c b8 01 00 00 00 eb db <0f> 0b 31 c0 c3 48 2d 00 80 83 82 48 ba 72 1c c7 71 1c c7 71 1c 48 [ 5.992313] RSP: e02b:ffffc90040033d38 EFLAGS: 00010202 [ 5.996313] RAX: 0000000000000001 RBX: ffffffff82a141d0 RCX: ffffffff8222ec38 [ 6.000312] RDX: ffffffff8222ec38 RSI: 0000000000000005 RDI: ffffc90040033d40 [ 6.004313] RBP: ffff8881003984a0 R08: 0000000000000007 R09: ffff888100398000 [ 6.008312] R10: 0000000000000007 R11: ffffc90040246000 R12: ffff8884082182a8 [ 6.012313] R13: 0000000000000100 R14: 000000000000001d R15: ffff8881003982d0 [ 6.016316] FS: 0000000000000000(0000) GS:ffff888408200000(0000) knlGS:0000000000000000 [ 6.020313] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6.024313] CR2: ffffc900020ef000 CR3: 000000000220a000 CR4: 0000000000050660 [ 6.028314] Call Trace: [ 6.032313] cvt_gate_to_trap.part.7+0x3f/0x90 [ 6.036313] ? asm_exc_double_fault+0x30/0x30 [ 6.040313] xen_convert_trap_info+0x87/0xd0 [ 6.044313] xen_pv_cpu_up+0x17a/0x450 [ 6.048313] bringup_cpu+0x2b/0xc0 [ 6.052313] ? cpus_read_trylock+0x50/0x50 [ 6.056313] cpuhp_invoke_callback+0x80/0x4c0 [ 6.060313] _cpu_up+0xa7/0x140 [ 6.064313] cpu_up+0x98/0xd0 [ 6.068313] bringup_nonboot_cpus+0x4f/0x60 [ 6.072313] smp_init+0x26/0x79 [ 6.076313] kernel_init_freeable+0x103/0x258 [ 6.080313] ? rest_init+0xd0/0xd0 [ 6.084313] kernel_init+0xa/0x110 [ 6.088313] ret_from_fork+0x1f/0x30 [ 6.092313] ---[ end trace be9ecf17dceeb4f3 ]--- Reason is that there is no Xen pv trap entry for X86_TRAP_VC. Fix that by adding a generic trap handler for unknown traps and wire all unknown bare metal handlers to this generic handler, which will just crash the system in case such a trap will ever happen. Fixes: 0786138c78e793 ("x86/sev-es: Add a Runtime #VC Exception Handler") Cc: <stable@vger.kernel.org> # v5.10 Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
* | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2021-01-269-53/+90
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: - x86 bugfixes - Documentation fixes - Avoid performance regression due to SEV-ES patches - ARM: - Don't allow tagged pointers to point to memslots - Filter out ARMv8.1+ PMU events on v8.0 hardware - Hide PMU registers from userspace when no PMU is configured - More PMU cleanups - Don't try to handle broken PSCI firmware - More sys_reg() to reg_to_encoding() conversions * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX KVM: x86: Revert "KVM: x86: Mark GPRs dirty when written" KVM: SVM: Unconditionally sync GPRs to GHCB on VMRUN of SEV-ES guest KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration kvm: tracing: Fix unmatched kvm_entry and kvm_exit events KVM: Documentation: Update description of KVM_{GET,CLEAR}_DIRTY_LOG KVM: x86: get smi pending status correctly KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[] KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh() KVM: x86: Add more protection against undefined behavior in rsvd_bits() KVM: Documentation: Fix spec for KVM_CAP_ENABLE_CAP_VM KVM: Forbid the use of tagged userspace addresses for memslots KVM: arm64: Filter out v8.1+ events on v8.0 HW KVM: arm64: Compute TPIDR_EL2 ignoring MTE tag KVM: arm64: Use the reg_to_encoding() macro instead of sys_reg() KVM: arm64: Allow PSCI SYSTEM_OFF/RESET to return KVM: arm64: Simplify handling of absent PMU system registers KVM: arm64: Hide PMU registers from userspace when not available
| * | | KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMXPaolo Bonzini2021-01-253-9/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS, which may need to be loaded outside guest mode. Therefore we cannot WARN in that case. However, that part of nested_get_vmcs12_pages is _not_ needed at vmentry time. Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling, so that both vmentry and migration (and in the latter case, independent of is_guest_mode) do the parts that are needed. Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3ba: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES Cc: <stable@vger.kernel.org> # 5.10.x Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: Revert "KVM: x86: Mark GPRs dirty when written"Sean Christopherson2021-01-251-26/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert the dirty/available tracking of GPRs now that KVM copies the GPRs to the GHCB on any post-VMGEXIT VMRUN, even if a GPR is not dirty. Per commit de3cd117ed2f ("KVM: x86: Omit caching logic for always-available GPRs"), tracking for GPRs noticeably impacts KVM's code footprint. This reverts commit 1c04d8c986567c27c56c05205dceadc92efb14ff. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210122235049.3107620-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: SVM: Unconditionally sync GPRs to GHCB on VMRUN of SEV-ES guestSean Christopherson2021-01-251-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop the per-GPR dirty checks when synchronizing GPRs to the GHCB, the GRPs' dirty bits are set from time zero and never cleared, i.e. will always be seen as dirty. The obvious alternative would be to clear the dirty bits when appropriate, but removing the dirty checks is desirable as it allows reverting GPR dirty+available tracking, which adds overhead to all flavors of x86 VMs. Note, unconditionally writing the GPRs in the GHCB is tacitly allowed by the GHCB spec, which allows the hypervisor (or guest) to provide unnecessary info; it's the guest's responsibility to consume only what it needs (the hypervisor is untrusted after all). The guest and hypervisor can supply additional state if desired but must not rely on that additional state being provided. Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210122235049.3107620-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migrationMaxim Levitsky2021-01-251-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even when we are outside the nested guest, some vmcs02 fields may not be in sync vs vmcs12. This is intentional, even across nested VM-exit, because the sync can be delayed until the nested hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those rarely accessed fields. However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to be able to restore it. To fix that, call copy_vmcs02_to_vmcs12_rare() before the vmcs12 contents are copied to userspace. Fixes: 7952d769c29ca ("KVM: nVMX: Sync rarely accessed guest fields only when needed") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | kvm: tracing: Fix unmatched kvm_entry and kvm_exit eventsLorenzo Brescia2021-01-253-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On VMX, if we exit and then re-enter immediately without leaving the vmx_vcpu_run() function, the kvm_entry event is not logged. That means we will see one (or more) kvm_exit, without its (their) corresponding kvm_entry, as shown here: CPU-1979 [002] 89.871187: kvm_entry: vcpu 1 CPU-1979 [002] 89.871218: kvm_exit: reason MSR_WRITE CPU-1979 [002] 89.871259: kvm_exit: reason MSR_WRITE It also seems possible for a kvm_entry event to be logged, but then we leave vmx_vcpu_run() right away (if vmx->emulation_required is true). In this case, we will have a spurious kvm_entry event in the trace. Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run() (where trace_kvm_exit() already is). A trace obtained with this patch applied looks like this: CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0 CPU-14295 [000] 8388.395392: kvm_exit: reason MSR_WRITE CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0 CPU-14295 [000] 8388.395503: kvm_exit: reason EXTERNAL_INTERRUPT Of course, not calling trace_kvm_entry() in common x86 code any longer means that we need to adjust the SVM side of things too. Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it> Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: get smi pending status correctlyJay Zhou2021-01-251-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The injection process of smi has two steps: Qemu KVM Step1: cpu->interrupt_request &= \ ~CPU_INTERRUPT_SMI; kvm_vcpu_ioctl(cpu, KVM_SMI) call kvm_vcpu_ioctl_smi() and kvm_make_request(KVM_REQ_SMI, vcpu); Step2: kvm_vcpu_ioctl(cpu, KVM_RUN, 0) call process_smi() if kvm_check_request(KVM_REQ_SMI, vcpu) is true, mark vcpu->arch.smi_pending = true; The vcpu->arch.smi_pending will be set true in step2, unfortunately if vcpu paused between step1 and step2, the kvm_run->immediate_exit will be set and vcpu has to exit to Qemu immediately during step2 before mark vcpu->arch.smi_pending true. During VM migration, Qemu will get the smi pending status from KVM using KVM_GET_VCPU_EVENTS ioctl at the downtime, then the smi pending status will be lost. Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com> Signed-off-by: Shengen Zhuang <zhuangshengen@huawei.com> Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]Like Xu2021-01-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The HW_REF_CPU_CYCLES event on the fixed counter 2 is pseudo-encoded as 0x0300 in the intel_perfmon_event_map[]. Correct its usage. Fixes: 62079d8a4312 ("KVM: PMU: add proper support for fixed counter 2") Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20201230081916.63417-1-like.xu@linux.intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()Like Xu2021-01-251-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we know vPMU will not work properly when (1) the guest bit_width(s) of the [gp|fixed] counters are greater than the host ones, or (2) guest requested architectural events exceeds the range supported by the host, so we can setup a smaller left shift value and refresh the guest cpuid entry, thus fixing the following UBSAN shift-out-of-bounds warning: shift exponent 197 is too large for 64-bit type 'long long unsigned int' Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 intel_pmu_refresh.cold+0x75/0x99 arch/x86/kvm/vmx/pmu_intel.c:348 kvm_vcpu_after_set_cpuid+0x65a/0xf80 arch/x86/kvm/cpuid.c:177 kvm_vcpu_ioctl_set_cpuid2+0x160/0x440 arch/x86/kvm/cpuid.c:308 kvm_arch_vcpu_ioctl+0x11b6/0x2d70 arch/x86/kvm/x86.c:4709 kvm_vcpu_ioctl+0x7b9/0xdb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3386 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+ae488dc136a4cc6ba32b@syzkaller.appspotmail.com Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20210118025800.34620-1-like.xu@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: Add more protection against undefined behavior in rsvd_bits()Sean Christopherson2021-01-251-1/+8
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add compile-time asserts in rsvd_bits() to guard against KVM passing in garbage hardcoded values, and cap the upper bound at '63' for dynamic values to prevent generating a mask that would overflow a u64. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210113204515.3473079-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2021-01-241-11/+9
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc fixes from Andrew Morton: "18 patches. Subsystems affected by this patch series: mm (pagealloc, memcg, kasan, memory-failure, and highmem), ubsan, proc, and MAINTAINERS" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: MAINTAINERS: add a couple more files to the Clang/LLVM section proc_sysctl: fix oops caused by incorrect command parameters powerpc/mm/highmem: use __set_pte_at() for kmap_local() mips/mm/highmem: use set_pte() for kmap_local() mm/highmem: prepare for overriding set_pte_at() sparc/mm/highmem: flush cache and TLB mm: fix page reference leak in soft_offline_page() ubsan: disable unsigned-overflow check for i386 kasan, mm: fix resetting page_alloc tags for HW_TAGS kasan, mm: fix conflicts with init_on_alloc/free kasan: fix HW_TAGS boot parameters kasan: fix incorrect arguments passing in kasan_add_zero_shadow kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow mm: fix numa stats for thp migration mm: memcg: fix memcg file_dirty numa stat mm: memcg/slab: optimize objcg stock draining mm: fix initialization of struct page for holes in memory layout x86/setup: don't remove E820_TYPE_RAM for pfn 0
| * | | x86/setup: don't remove E820_TYPE_RAM for pfn 0Mike Rapoport2021-01-241-11/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: fix initialization of struct page for holes in memory layout", v3. Commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") exposed several issues with the memory map initialization and these patches fix those issues. Initially there were crashes during compaction that Qian Cai reported back in April [1]. It seemed back then that the problem was fixed, but a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an additional discussion at [3]. [1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw [2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com [3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org This patch (of 2): The first 4Kb of memory is a BIOS owned area and to avoid its allocation for the kernel it was not listed in e820 tables as memory. As the result, pfn 0 was never recognised by the generic memory management and it is not a part of neither node 0 nor ZONE_DMA. If set_pfnblock_flags_mask() would be ever called for the pageblock corresponding to the first 2Mbytes of memory, having pfn 0 outside of ZONE_DMA would trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); Along with reserving the first 4Kb in e820 tables, several first pages are reserved with memblock in several places during setup_arch(). These reservations are enough to ensure the kernel does not touch the BIOS area and it is not necessary to remove E820_TYPE_RAM for pfn 0. Remove the update of e820 table that changes the type of pfn 0 and move the comment describing why it was done to trim_low_memory_range() that reserves the beginning of the memory. Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge tag 'sched_urgent_for_v5.11_rc5' of ↵Linus Torvalds2021-01-241-0/+19
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Correct the marking of kthreads which are supposed to run on a specific, single CPU vs such which are affine to only one CPU, mark per-cpu workqueue threads as such and make sure that marking "survives" CPU hotplug. Fix CPU hotplug issues with such kthreads. - A fix to not push away tasks on CPUs coming online. - Have workqueue CPU hotplug code use cpu_possible_mask when breaking affinity on CPU offlining so that pending workers can finish on newly arrived onlined CPUs too. - Dump tasks which haven't vacated a CPU which is currently being unplugged. - Register a special scale invariance callback which gets called on resume from RAM to read out APERF/MPERF after resume and thus make the schedutil scaling governor more precise. * tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Relax the set_cpus_allowed_ptr() semantics sched: Fix CPU hotplug / tighten is_per_cpu_kthread() sched: Prepare to use balance_push in ttwu() workqueue: Restrict affinity change to rescuer workqueue: Tag bound workers with KTHREAD_IS_PER_CPU kthread: Extract KTHREAD_IS_PER_CPU sched: Don't run cpu-online with balance_push() enabled workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity sched/core: Print out straggler tasks in sched_cpu_dying() x86: PM: Register syscore_ops for scale invariance
| * | | | x86: PM: Register syscore_ops for scale invarianceRafael J. Wysocki2021-01-191-0/+19
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On x86 scale invariace tends to be disabled during resume from suspend-to-RAM, because the MPERF or APERF MSR values are not as expected then due to updates taking place after the platform firmware has been invoked to complete the suspend transition. That, of course, is not desirable, especially if the schedutil scaling governor is in use, because the lack of scale invariance causes it to be less reliable. To counter that effect, modify init_freq_invariance() to register a syscore_ops object for scale invariance with the ->resume callback pointing to init_counter_refs() which will run on the CPU starting the resume transition (the other CPUs will be taken care of the "online" operations taking place later). Fixes: e2b0d619b400 ("x86, sched: check for counters overflow in frequency invariant accounting") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz> Link: https://lkml.kernel.org/r/1803209.Mvru99baaF@kreacher
* | | | Merge tag 'x86_urgent_for_v5.11_rc5' of ↵Linus Torvalds2021-01-2411-25/+65
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Add a new Intel model number for Alder Lake - Differentiate which aspects of the FPU state get saved/restored when the FPU is used in-kernel and fix a boot crash on K7 due to early MXCSR access before CR4.OSFXSR is even set. - A couple of noinstr annotation fixes - Correct die ID setting on AMD for users of topology information which need the correct die ID - A SEV-ES fix to handle string port IO to/from kernel memory properly * tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add another Alder Lake CPU to the Intel family x86/mmx: Use KFPU_387 for MMX string operations x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state x86/topology: Make __max_die_per_package available unconditionally x86: __always_inline __{rd,wr}msr() x86/mce: Remove explicit/superfluous tracing locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP locking/lockdep: Cure noinstr fail x86/sev: Fix nonistr violation x86/entry: Fix noinstr fail x86/cpu/amd: Set __max_die_per_package on AMD x86/sev-es: Handle string port IO to kernel memory properly
| * | | x86/cpu: Add another Alder Lake CPU to the Intel familyGayatri Kammela2021-01-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add Alder Lake mobile CPU model number to Intel family. Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210121215004.11618-1-tony.luck@intel.com
| * | | x86/mmx: Use KFPU_387 for MMX string operationsAndy Lutomirski2021-01-211-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The default kernel_fpu_begin() doesn't work on systems that support XMM but haven't yet enabled CR4.OSFXSR. This causes crashes when _mmx_memcpy() is called too early because LDMXCSR generates #UD when the aforementioned bit is clear. Fix it by using kernel_fpu_begin_mask(KFPU_387) explicitly. Fixes: 7ad816762f9b ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()") Reported-by: Krzysztof Mazur <krzysiek@podlesie.net> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Krzysztof Piotr Olędzki <ole@ans.pl> Tested-by: Krzysztof Mazur <krzysiek@podlesie.net> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/e7bf21855fe99e5f3baa27446e32623358f69e8d.1611205691.git.luto@kernel.org
| * | | x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize stateAndy Lutomirski2021-01-212-6/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, requesting kernel FPU access doesn't distinguish which parts of the extended ("FPU") state are needed. This is nice for simplicity, but there are a few cases in which it's suboptimal: - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do not use legacy 387 state. These users want MXCSR initialized but don't care about the FPU control word. Skipping FNINIT would save time. (Empirically, FNINIT is several times slower than LDMXCSR.) - Code that wants MMX doesn't want or need MXCSR initialized. _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and initializing MXCSR will fail because LDMXCSR generates an #UD when the aforementioned CR4 bit is not set. - Any future in-kernel users of XFD (eXtended Feature Disable)-capable dynamic states will need special handling. Add a more specific API that allows callers to specify exactly what they want. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Krzysztof Piotr Olędzki <ole@ans.pl> Link: https://lkml.kernel.org/r/aff1cac8b8fc7ee900cf73e8f2369966621b053f.1611205691.git.luto@kernel.org
| * | | x86/topology: Make __max_die_per_package available unconditionallyBorislav Petkov2021-01-142-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move it outside of CONFIG_SMP in order to avoid ifdeffery at the usage sites. Fixes: 76e2fc63ca40 ("x86/cpu/amd: Set __max_die_per_package on AMD") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210114111814.5346-1-bp@alien8.de
| * | | x86: __always_inline __{rd,wr}msr()Peter Zijlstra2021-01-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the compiler choses to not inline the trivial MSR helpers: vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0xce: call to __wrmsr.constprop.14() leaves .noinstr.text section Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Link: https://lore.kernel.org/r/X/bf3gV+BW7kGEsB@hirez.programming.kicks-ass.net
| * | | x86/mce: Remove explicit/superfluous tracingPeter Zijlstra2021-01-121-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's some explicit tracing left in exc_machine_check_kernel(), remove it, as it's already implied by irqentry_nmi_enter(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210106144017.719310466@infradead.org
| * | | x86/sev: Fix nonistr violationPeter Zijlstra2021-01-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the compiler fails to inline, it violates nonisntr: vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0xc7: call to sev_es_wr_ghcb_msr() leaves .noinstr.text section Fixes: 4ca68e023b11 ("x86/sev-es: Handle NMI State") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210106144017.532902065@infradead.org
| * | | x86/entry: Fix noinstr failPeter Zijlstra2021-01-121-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vmlinux.o: warning: objtool: __do_fast_syscall_32()+0x47: call to syscall_enter_from_user_mode_work() leaves .noinstr.text section Fixes: 4facb95b7ada ("x86/entry: Unbreak 32bit fast syscall") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210106144017.472696632@infradead.org
| * | | x86/cpu/amd: Set __max_die_per_package on AMDYazen Ghannam2021-01-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Set the maximum DIE per package variable on AMD using the NodesPerProcessor topology value. This will be used by RAPL, among others, to determine the maximum number of DIEs on the system in order to do per-DIE manipulations. [ bp: Productize into a proper patch. ] Fixes: 028c221ed190 ("x86/CPU/AMD: Save AMD NodeId as cpu_die_id") Reported-by: Johnathan Smithinovic <johnathan.smithinovic@gmx.at> Reported-by: Rafael Kitover <rkitover@gmail.com> Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Johnathan Smithinovic <johnathan.smithinovic@gmx.at> Tested-by: Rafael Kitover <rkitover@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=210939 Link: https://lkml.kernel.org/r/20210106112106.GE5729@zn.tnic Link: https://lkml.kernel.org/r/20210111101455.1194-1-bp@alien8.de
| * | | x86/sev-es: Handle string port IO to kernel memory properlyHyunwook (Wooky) Baek2021-01-111-0/+12
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't assume dest/source buffers are userspace addresses when manually copying data for string I/O or MOVS MMIO, as {get,put}_user() will fail if handed a kernel address and ultimately lead to a kernel panic. When invoking INSB/OUTSB instructions in kernel space in a SEV-ES-enabled VM, the kernel crashes with the following message: "SEV-ES: Unsupported exception in #VC instruction emulation - can't continue" Handle that case properly. [ bp: Massage commit message. ] Fixes: f980f9c31a92 ("x86/sev-es: Compile early handler code into kernel image") Signed-off-by: Hyunwook (Wooky) Baek <baekhw@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: David Rientjes <rientjes@google.com> Link: https://lkml.kernel.org/r/20210110071102.2576186-1-baekhw@google.com
* | | Merge tag 'for-linus-5.11-rc5-tag' of ↵Linus Torvalds2021-01-201-0/+2
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fix from Juergen Gross: "A fix for build failure showing up in some configurations" * tag 'for-linus-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: fix 'nopvspin' build error
| * | x86/xen: fix 'nopvspin' build errorRandy Dunlap2021-01-181-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix build error in x86/xen/ when PARAVIRT_SPINLOCKS is not enabled. Fixes this build error: ../arch/x86/xen/smp_hvm.c: In function ‘xen_hvm_smp_init’: ../arch/x86/xen/smp_hvm.c:77:3: error: ‘nopvspin’ undeclared (first use in this function) nopvspin = true; Fixes: 3d7746bea925 ("x86/xen: Fix xen_hvm_smp_init() when vector callback not available") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20210115191123.27572-1-rdunlap@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com>
* | | Merge tag 'hyperv-fixes-signed-20210119' of ↵Linus Torvalds2021-01-191-3/+26
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fix from Wei Liu: "One patch from Dexuan to fix clockevent initialization" * tag 'hyperv-fixes-signed-20210119' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Initialize clockevents after LAPIC is initialized
| * | | x86/hyperv: Initialize clockevents after LAPIC is initializedDexuan Cui2021-01-171-3/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit 4df4cb9e99f8, the Hyper-V direct-mode STIMER is actually initialized before LAPIC is initialized: see apic_intr_mode_init() x86_platform.apic_post_init() hyperv_init() hv_stimer_alloc() apic_bsp_setup() setup_local_APIC() setup_local_APIC() temporarily disables LAPIC, initializes it and re-eanble it. The direct-mode STIMER depends on LAPIC, and when it's registered, it can be programmed immediately and the timer can fire very soon: hv_stimer_init clockevents_config_and_register clockevents_register_device tick_check_new_device tick_setup_device tick_setup_periodic(), tick_setup_oneshot() clockevents_program_event When the timer fires in the hypervisor, if the LAPIC is in the disabled state, new versions of Hyper-V ignore the event and don't inject the timer interrupt into the VM, and hence the VM hangs when it boots. Note: when the VM starts/reboots, the LAPIC is pre-enabled by the firmware, so the window of LAPIC being temporarily disabled is pretty small, and the issue can only happen once out of 100~200 reboots for a 40-vCPU VM on one dev host, and on another host the issue doesn't reproduce after 2000 reboots. The issue is more noticeable for kdump/kexec, because the LAPIC is disabled by the first kernel, and stays disabled until the kdump/kexec kernel enables it. This is especially an issue to a Generation-2 VM (for which Hyper-V doesn't emulate the PIT timer) when CONFIG_HZ=1000 (rather than CONFIG_HZ=250) is used. Fix the issue by moving hv_stimer_alloc() to a later place where the LAPIC timer is initialized. Fixes: 4df4cb9e99f8 ("x86/hyperv: Initialize clockevents earlier in CPU onlining") Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20210116223136.13892-1-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
* | | | Merge tag 'for-linus-5.11-rc4-tag' of ↵Linus Torvalds2021-01-152-13/+29
|\ \ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - A series to fix a regression when running as a fully virtualized guest on an old Xen hypervisor not supporting PV interrupt callbacks for HVM guests. - A patch to add support to query Xen resource sizes (setting was possible already) from user mode. * tag 'for-linus-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: Fix xen_hvm_smp_init() when vector callback not available x86/xen: Don't register Xen IPIs when they aren't going to be used x86/xen: Add xen_no_vector_callback option to test PCI INTX delivery xen: Set platform PCI device INTX affinity to CPU0 xen: Fix event channel callback via INTX/GSI xen/privcmd: allow fetching resource sizes
| * | | x86/xen: Fix xen_hvm_smp_init() when vector callback not availableDavid Woodhouse2021-01-131-10/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only the IPI-related functions in the smp_ops should be conditional on the vector callback being available. The rest should still happen: • xen_hvm_smp_prepare_boot_cpu() This function does two things, both of which should still happen if there is no vector callback support. The call to xen_vcpu_setup() for vCPU0 should still happen as it just sets up the vcpu_info for CPU0. That does happen for the secondary vCPUs too, from xen_cpu_up_prepare_hvm(). The second thing it does is call xen_init_spinlocks(), which perhaps counter-intuitively should *also* still be happening in the case without vector callbacks, so that it can clear its local xen_pvspin flag and disable the virt_spin_lock_key accordingly. Checking xen_have_vector_callback in xen_init_spinlocks() itself would affect PV guests, so set the global nopvspin flag in xen_hvm_smp_init() instead, when vector callbacks aren't available. • xen_hvm_smp_prepare_cpus() This does some IPI-related setup by calling xen_smp_intr_init() and xen_init_lock_cpu(), which can be made conditional. And it sets the xen_vcpu_id to XEN_VCPU_ID_INVALID for all possible CPUS, which does need to happen. • xen_smp_cpus_done() This offlines any vCPUs which doesn't fit in the global shared_info page, if separate vcpu_info placement isn't available. That part also needs to happen regardless of vector callback support. • xen_hvm_cpu_die() This doesn't actually do anything other than commin_cpu_die() right right now in the !vector_callback case; all three teardown functions it calls should be no-ops. But to guard against future regressions it's useful to call it anyway, and for it to explicitly check for xen_have_vector_callback before calling those additional functions. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20210106153958.584169-6-dwmw2@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com>
| * | | x86/xen: Don't register Xen IPIs when they aren't going to be usedDavid Woodhouse2021-01-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the case where xen_have_vector_callback is false, we still register the IPI vectors in xen_smp_intr_init() for the secondary CPUs even though they aren't going to be used. Stop doing that. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20210106153958.584169-5-dwmw2@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com>
| * | | x86/xen: Add xen_no_vector_callback option to test PCI INTX deliveryDavid Woodhouse2021-01-131-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's useful to be able to test non-vector event channel delivery, to make sure Linux will work properly on older Xen which doesn't have it. It's also useful for those working on Xen and Xen-compatible hypervisors, because there are guest kernels still in active use which use PCI INTX even when vector delivery is available. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20210106153958.584169-4-dwmw2@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com>
* | | | Merge tag 'hyperv-fixes-signed-20210111' of ↵Linus Torvalds2021-01-114-3/+33
|\ \ \ \ | | |/ / | |/| / | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - fix kexec panic/hang (Dexuan Cui) - fix occasional crashes when flushing TLB (Wei Liu) * tag 'hyperv-fixes-signed-20210111' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: check cpu mask after interrupt has been disabled x86/hyperv: Fix kexec panic/hang issues
| * | x86/hyperv: check cpu mask after interrupt has been disabledWei Liu2021-01-061-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've observed crashes due to an empty cpu mask in hyperv_flush_tlb_others. Obviously the cpu mask in question is changed between the cpumask_empty call at the beginning of the function and when it is actually used later. One theory is that an interrupt comes in between and a code path ends up changing the mask. Move the check after interrupt has been disabled to see if it fixes the issue. Signed-off-by: Wei Liu <wei.liu@kernel.org> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20210105175043.28325-1-wei.liu@kernel.org Reviewed-by: Michael Kelley <mikelley@microsoft.com>
| * | x86/hyperv: Fix kexec panic/hang issuesDexuan Cui2021-01-053-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the kexec kernel can panic or hang due to 2 causes: 1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the old VP Assist Pages when the kexec kernel runs. The same issue is fixed for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the VP assist page for hibernation"). Now fix it for kexec. 2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so between hv_kexec_handler() and native_machine_shutdown(), the other CPUs can still try to access the hypercall page and cause panic. The workaround "hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliabe. Move hyperv_cleanup() to a better place. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20201222065541.24312-1-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
* | | Merge tag 'x86_urgent_for_v5.11_rc3' of ↵Linus Torvalds2021-01-105-73/+57
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "As expected, fixes started trickling in after the holidays so here is the accumulated pile of x86 fixes for 5.11: - A fix for fanotify_mark() missing the conversion of x86_32 native syscalls which take 64-bit arguments to the compat handlers due to former having a general compat handler. (Brian Gerst) - Add a forgotten pmd page destructor call to pud_free_pmd_page() where a pmd page is freed. (Dan Williams) - Make IN/OUT insns with an u8 immediate port operand handling for SEV-ES guests more precise by using only the single port byte and not the whole s32 value of the insn decoder. (Peter Gonda) - Correct a straddling end range check before returning the proper MTRR type, when the end address is the same as top of memory. (Ying-Tsun Huang) - Change PQR_ASSOC MSR update scheme when moving a task to a resctrl resource group to avoid significant performance overhead with some resctrl workloads. (Fenghua Yu) - Avoid the actual task move overhead when the task is already in the resource group. (Fenghua Yu)" * tag 'x86_urgent_for_v5.11_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Don't move a task to the same resource group x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR x86/mtrr: Correct the range check before performing MTRR type lookups x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handling x86/mm: Fix leak of pmd ptlock fanotify: Fix sys_fanotify_mark() on native x86-32
| * | | x86/resctrl: Don't move a task to the same resource groupFenghua Yu2021-01-081-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Shakeel Butt reported in [1] that a user can request a task to be moved to a resource group even if the task is already in the group. It just wastes time to do the move operation which could be costly to send IPI to a different CPU. Add a sanity check to ensure that the move operation only happens when the task is not already in the resource group. [1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/ Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files") Reported-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/962ede65d8e95be793cb61102cca37f7bb018e66.1608243147.git.reinette.chatre@intel.com
| * | | x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSRFenghua Yu2021-01-081-69/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when moving a task to a resource group the PQR_ASSOC MSR is updated with the new closid and rmid in an added task callback. If the task is running, the work is run as soon as possible. If the task is not running, the work is executed later in the kernel exit path when the kernel returns to the task again. Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task is running is the right thing to do. Queueing work for a task that is not running is unnecessary (the PQR_ASSOC MSR is already updated when the task is scheduled in) and causing system resource waste with the way in which it is implemented: Work to update the PQR_ASSOC register is queued every time the user writes a task id to the "tasks" file, even if the task already belongs to the resource group. This could result in multiple pending work items associated with a single task even if they are all identical and even though only a single update with most recent values is needed. Specifically, even if a task is moved between different resource groups while it is sleeping then it is only the last move that is relevant but yet a work item is queued during each move. This unnecessary queueing of work items could result in significant system resource waste, especially on tasks sleeping for a long time. For example, as demonstrated by Shakeel Butt in [1] writing the same task id to the "tasks" file can quickly consume significant memory. The same problem (wasted system resources) occurs when moving a task between different resource groups. As pointed out by Valentin Schneider in [2] there is an additional issue with the way in which the queueing of work is done in that the task_struct update is currently done after the work is queued, resulting in a race with the register update possibly done before the data needed by the update is available. To solve these issues, update the PQR_ASSOC MSR in a synchronous way right after the new closid and rmid are ready during the task movement, only if the task is running. If a moved task is not running nothing is done since the PQR_ASSOC MSR will be updated next time the task is scheduled. This is the same way used to update the register when tasks are moved as part of resource group removal. [1] https://lore.kernel.org/lkml/CALvZod7E9zzHwenzf7objzGKsdBmVwTgEJ0nPgs0LUFU3SN5Pw@mail.gmail.com/ [2] https://lore.kernel.org/lkml/20201123022433.17905-1-valentin.schneider@arm.com [ bp: Massage commit message and drop the two update_task_closid_rmid() variants. ] Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files") Reported-by: Shakeel Butt <shakeelb@google.com> Reported-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/17aa2fb38fc12ce7bb710106b3e7c7b45acb9e94.1608243147.git.reinette.chatre@intel.com
| * | | x86/mtrr: Correct the range check before performing MTRR type lookupsYing-Tsun Huang2021-01-061-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In mtrr_type_lookup(), if the input memory address region is not in the MTRR, over 4GB, and not over the top of memory, a write-back attribute is returned. These condition checks are for ensuring the input memory address region is actually mapped to the physical memory. However, if the end address is just aligned with the top of memory, the condition check treats the address is over the top of memory, and write-back attribute is not returned. And this hits in a real use case with NVDIMM: the nd_pmem module tries to map NVDIMMs as cacheable memories when NVDIMMs are connected. If a NVDIMM is the last of the DIMMs, the performance of this NVDIMM becomes very low since it is aligned with the top of memory and its memory type is uncached-minus. Move the input end address change to inclusive up into mtrr_type_lookup(), before checking for the top of memory in either mtrr_type_lookup_{variable,fixed}() helpers. [ bp: Massage commit message. ] Fixes: 0cc705f56e40 ("x86/mm/mtrr: Clean up mtrr_type_lookup()") Signed-off-by: Ying-Tsun Huang <ying-tsun.huang@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201215070721.4349-1-ying-tsun.huang@amd.com
| * | | x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handlingPeter Gonda2021-01-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The IN and OUT instructions with port address as an immediate operand only use an 8-bit immediate (imm8). The current VC handler uses the entire 32-bit immediate value but these instructions only set the first bytes. Cast the operand to an u8 for that. [ bp: Massage commit message. ] Fixes: 25189d08e5168 ("x86/sev-es: Add support for handling IOIO exceptions") Signed-off-by: Peter Gonda <pgonda@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: David Rientjes <rientjes@google.com> Link: https://lkml.kernel.org/r/20210105163311.221490-1-pgonda@google.com
| * | | x86/mm: Fix leak of pmd ptlockDan Williams2021-01-051-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces") introduced a new location where a pmd was released, but neglected to run the pmd page destructor. In fact, this happened previously for a different pmd release path and was fixed by commit: c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables"). This issue was hidden until recently because the failure mode is silent, but commit: b2b29d6d0119 ("mm: account PMD tables like PTE tables") turns the failure mode into this signature: BUG: Bad page state in process lt-pmem-ns pfn:15943d page:000000007262ed7b refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x15943d flags: 0xaffff800000000() raw: 00affff800000000 dead000000000100 0000000000000000 0000000000000000 raw: 0000000000000000 ffff913a029bcc08 00000000fffffbff 0000000000000000 page dumped because: nonzero mapcount [..] dump_stack+0x8b/0xb0 bad_page.cold+0x63/0x94 free_pcp_prepare+0x224/0x270 free_unref_page+0x18/0xd0 pud_free_pmd_page+0x146/0x160 ioremap_pud_range+0xe3/0x350 ioremap_page_range+0x108/0x160 __ioremap_caller.constprop.0+0x174/0x2b0 ? memremap+0x7a/0x110 memremap+0x7a/0x110 devm_memremap+0x53/0xa0 pmem_attach_disk+0x4ed/0x530 [nd_pmem] ? __devm_release_region+0x52/0x80 nvdimm_bus_probe+0x85/0x210 [libnvdimm] Given this is a repeat occurrence it seemed prudent to look for other places where this destructor might be missing and whether a better helper is needed. try_to_free_pmd_page() looks like a candidate, but testing with setting up and tearing down pmd mappings via the dax unit tests is thus far not triggering the failure. As for a better helper pmd_free() is close, but it is a messy fit due to requiring an @mm arg. Also, ___pmd_free_tlb() wants to call paravirt_tlb_remove_table() instead of free_page(), so open-coded pgtable_pmd_page_dtor() seems the best way forward for now. Debugged together with Matthew Wilcox <willy@infradead.org>. Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yi Zhang <yi.zhang@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/160697689204.605323.17629854984697045602.stgit@dwillia2-desk3.amr.corp.intel.com
| * | | fanotify: Fix sys_fanotify_mark() on native x86-32Brian Gerst2020-12-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments") converted native x86-32 which take 64-bit arguments to use the compat handlers to allow conversion to passing args via pt_regs. sys_fanotify_mark() was however missed, as it has a general compat handler. Add a config option that will use the syscall wrapper that takes the split args for native 32-bit. [ bp: Fix typo in Kconfig help text. ] Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments") Reported-by: Paweł Jasiak <pawel@jasiak.xyz> Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20201130223059.101286-1-brgerst@gmail.com
* | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2021-01-0813-97/+178
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: "x86: - Fixes for the new scalable MMU - Fixes for migration of nested hypervisors on AMD - Fix for clang integrated assembler - Fix for left shift by 64 (UBSAN) - Small cleanups - Straggler SEV-ES patch ARM: - VM init cleanups - PSCI relay cleanups - Kill CONFIG_KVM_ARM_PMU - Fixup __init annotations - Fixup reg_to_encoding() - Fix spurious PMCR_EL0 access Misc: - selftests cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (38 commits) KVM: x86: __kvm_vcpu_halt can be static KVM: SVM: Add support for booting APs in an SEV-ES guest KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit KVM: nSVM: mark vmcb as dirty when forcingly leaving the guest mode KVM: nSVM: correctly restore nested_run_pending on migration KVM: x86/mmu: Clarify TDP MMU page list invariants KVM: x86/mmu: Ensure TDP MMU roots are freed after yield kvm: check tlbs_dirty directly KVM: x86: change in pv_eoi_get_pending() to make code more readable MAINTAINERS: Really update email address for Sean Christopherson KVM: x86: fix shift out of bounds reported by UBSAN KVM: selftests: Implement perf_test_util more conventionally KVM: selftests: Use vm_create_with_vcpus in create_vm KVM: selftests: Factor out guest mode code KVM/SVM: Remove leftover __svm_vcpu_run prototype from svm.c KVM: SVM: Add register operand to vmsave call in sev_es_vcpu_load KVM: x86/mmu: Optimize not-present/MMIO SPTE check in get_mmio_spte() KVM: x86/mmu: Use raw level to index into MMIO walks' sptes array KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte() ...
| * | | KVM: x86: __kvm_vcpu_halt can be staticPaolo Bonzini2021-01-081-1/+1
| | | | | | | | | | | | | | | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>