path: root/arch
Each entry: commit subject (Author, Date; files changed, -lines/+lines), followed by the commit message.
* Merge branch 'gpc-fixes' of git://git.infradead.org/users/dwmw2/linux into HEAD (Paolo Bonzini, 2022-12-02; 3 files, -66/+85)

    Pull Xen-for-KVM changes from David Woodhouse:

    * add support for 32-bit guests in SCHEDOP_poll
    * the rest of the gfn-to-pfn cache API cleanup

    "I still haven't reinstated the last of those patches to make gpc->len
    immutable."

    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpers (Sean Christopherson, 2022-11-30; 2 files, -15/+13)

    Drop the @gpa param from the exported check()+refresh() helpers and limit
    changing the cache's GPA to the activate path. All external users just
    feed in gpc->gpa, i.e. this is a fancy nop.

    Allowing users to change the GPA at check()+refresh() is dangerous as
    those helpers explicitly allow concurrent calls, e.g. KVM could get into
    a livelock scenario. It's also unclear as to what the expected behavior
    should be if multiple tasks attempt to refresh with different GPAs.

    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

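Taken together with the two kvm-caching commits below, the exported gpc surface plausibly ends up shaped as follows. This is a sketch for orientation only; the exact prototypes in include/linux/kvm_host.h are an assumption, not a quote:

```c
/* The cache's GPA is fixed at activation time... */
int  kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len);

/* ...while check() and refresh() take neither @kvm nor @gpa anymore. */
bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len);
int  kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len);
```
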
* KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh() (Michal Luczaj, 2022-11-30; 2 files, -7/+5)

    Make kvm_gpc_refresh() use the kvm instance cached in gfn_to_pfn_cache.

    No functional change intended.

    Suggested-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Michal Luczaj <mhal@rbox.co>
    [sean: leave kvm_gpc_unmap() as-is]
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

* KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check() (Michal Luczaj, 2022-11-30; 2 files, -10/+8)

    Make kvm_gpc_check() use the kvm instance cached in gfn_to_pfn_cache.

    Suggested-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Michal Luczaj <mhal@rbox.co>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

* KVM: Store immutable gfn_to_pfn_cache properties (Michal Luczaj, 2022-11-30; 2 files, -43/+36)

    Move the assignment of immutable properties @kvm, @vcpu, and @usage to
    the initializer. Make _activate() and _deactivate() use stored values.

    Note, @len is also effectively immutable for most cases, but not in the
    case of the Xen runstate cache, which may be split across two pages and
    the length of the first segment will depend on its address.

    Suggested-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Michal Luczaj <mhal@rbox.co>
    [sean: handle @len in a separate patch]
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    [dwmw2: acknowledge that @len can actually change for some use cases]
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

* KVM: x86/xen: add support for 32-bit guests in SCHEDOP_poll (Metin Kaya, 2022-11-30; 2 files, -4/+36)

    This patch introduces a compat version of struct sched_poll for the
    SCHEDOP_poll sub-operation of the sched_op hypercall, reads the correct
    amount of data (16 bytes in the 32-bit case, 24 bytes otherwise) by using
    the new compat_sched_poll struct, copies it to sched_poll properly, and
    lets the rest of the code run as is.

    Signed-off-by: Metin Kaya <metikaya@amazon.com>
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Reviewed-by: Paul Durrant <paul@xen.org>

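The 16-byte vs. 24-byte difference comes from the width of the guest handle and the resulting alignment of the 64-bit timeout. A sketch of the two layouts; the compat struct name comes from the commit, and the field layout follows Xen's public sched.h:

```c
struct sched_poll {             /* 64-bit guest: 24 bytes */
	uint64_t ports;         /* guest handle (pointer) to the event-channel ports */
	unsigned int nr_ports;  /* followed by 4 bytes of padding before timeout */
	uint64_t timeout;
};

struct compat_sched_poll {      /* 32-bit guest: 16 bytes */
	uint32_t ports;         /* 32-bit guest virtual address */
	unsigned int nr_ports;
	uint64_t timeout;
};
```
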
* KVM: x86: Advertise that the SMM_CTL MSR is not supported (Jim Mattson, 2022-12-02; 1 file, -0/+4)

    CPUID.80000021H:EAX[bit 9] indicates that the SMM_CTL MSR (0xc0010116) is
    not supported. This defeature can be advertised by
    KVM_GET_SUPPORTED_CPUID regardless of whether or not the host enumerates
    it; currently it will be included only if the host enumerates at least
    leaf 8000001DH, due to a preexisting bug in QEMU that KVM has to work
    around (commit f751d8eac176, "KVM: x86: work around QEMU issue with
    synthetic CPUID leaves", 2022-04-29).

    Signed-off-by: Jim Mattson <jmattson@google.com>
    Message-Id: <20221007221644.138355-1-jmattson@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: x86: remove unnecessary exports (Paolo Bonzini, 2022-12-02; 4 files, -14/+0)

    Several symbols are not used by vendor modules but still exported.
    Removing them ensures that new coupling between kvm.ko and kvm-*.ko is
    noticed and reviewed.

    Co-developed-by: Sean Christopherson <seanjc@google.com>
    Co-developed-by: Like Xu <like.xu.linux@gmail.com>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Like Xu <like.xu.linux@gmail.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself (Yuan ZhaoXiong, 2022-12-02; 1 file, -2/+3)

    When a VM reboots itself, the reset process will result in an
    ioctl(KVM_SET_LAPIC, ...) to disable x2APIC mode and set the xAPIC id of
    the vCPU to its default value, which is the vCPU id. That will be handled
    in KVM as follows:

      kvm_vcpu_ioctl_set_lapic
        kvm_apic_set_state
          kvm_lapic_set_base  =>  disable X2APIC mode
          kvm_apic_state_fixup
            kvm_lapic_xapic_id_updated
              kvm_xapic_id(apic) != apic->vcpu->vcpu_id
              kvm_set_apicv_inhibit(APICV_INHIBIT_REASON_APIC_ID_MODIFIED)
          memcpy(vcpu->arch.apic->regs, s->regs, sizeof(*s))  =>  update APIC_ID

    When kvm_apic_set_state invokes kvm_lapic_set_base to disable x2APIC
    mode, the old 32-bit x2APIC id is still present rather than the 8-bit
    xAPIC id. kvm_lapic_xapic_id_updated will set the
    APICV_INHIBIT_REASON_APIC_ID_MODIFIED bit and disable APICv/x2AVIC.

    Instead, kvm_lapic_xapic_id_updated must be called after APIC_ID is
    changed. In fact, this fixes another small issue in the code in that
    potential changes to a vCPU's xAPIC ID need not be tracked for
    KVM_GET_LAPIC.

    Fixes: 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base")
    Signed-off-by: Yuan ZhaoXiong <yuanzhaoxiong@baidu.com>
    Message-Id: <1669984574-32692-1-git-send-email-yuanzhaoxiong@baidu.com>
    Cc: stable@vger.kernel.org
    Reported-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* Merge tag 'kvm-x86-fixes-6.2-1' of https://github.com/kvm-x86/linux into HEAD (Paolo Bonzini, 2022-12-02; 10 files, -51/+161)

    Misc KVM x86 fixes and cleanups for 6.2:

    - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).

    - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
      years back when eliminating unnecessary barriers when switching
      between vmcs01 and vmcs02.

    - Clean up the MSR filter docs.

    - Clean up vmread_error_trampoline() to make it more obvious that params
      must be passed on the stack, even for x86-64.

    - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
      of the current guest CPUID.

    - Fudge around a race with TSC refinement that results in KVM
      incorrectly thinking a guest needs TSC scaling when running on a CPU
      with a constant TSC, but no hardware-enumerated TSC frequency.

* KVM: x86: Use current rather than snapshotted TSC frequency if it is constant (Anton Romanov, 2022-11-30; 1 file, -11/+33)

    Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the host TSC is
    constant, in which case the actual TSC frequency will never change and
    thus capturing TSC during initialization is unnecessary; KVM can simply
    use tsc_khz. This value is snapshotted from
    kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL).

    On CPUs with a constant TSC, but without a hardware-specified TSC
    frequency, snapshotting cpu_tsc_khz and using that to set a VM's target
    TSC frequency can lead the VM to think its TSC frequency is not what it
    actually is if refining the TSC completes after KVM snapshots tsc_khz.
    The actual frequency never changes, only the kernel's calculation of
    what that frequency is changes.

    Ideally, KVM would not be able to race with TSC refinement, or would
    have a hook into tsc_refine_calibration_work() to get an alert when
    refinement is complete. Avoiding the race altogether isn't practical as
    refinement takes a relative eternity; it's deliberately put on a work
    queue outside of the normal boot sequence to avoid unnecessarily
    delaying boot. Adding a hook is doable, but somewhat gross due to KVM's
    ability to be built as a module. And if the TSC is constant, which is
    likely the case for every VMX/SVM-capable CPU produced in the last
    decade, the race can be hit if and only if userspace is able to create a
    VM before TSC refinement completes; refinement is slow, but not that
    slow.

    For now, punt on a proper fix, as not taking a snapshot can help some
    use cases and not taking a snapshot is arguably correct irrespective of
    the race with refinement.

    Signed-off-by: Anton Romanov <romanton@google.com>
    Reviewed-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20220608183525.1143682-1-romanton@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

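A minimal sketch of the resulting lookup, assuming a helper along these lines (the helper name is illustrative; the real code lives in arch/x86/kvm/x86.c):

```c
static unsigned long get_cpu_tsc_khz(void)
{
	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
		return tsc_khz;                  /* live value, updated by TSC refinement */

	return __this_cpu_read(cpu_tsc_khz);     /* per-CPU snapshot, for dynamic-TSC hosts */
}
```
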
* KVM: VMX: Move MSR_IA32_FEAT_CTL.LOCKED check into "is valid" helper (Sean Christopherson, 2022-11-30; 1 file, -5/+8)

    Move the check on IA32_FEATURE_CONTROL being locked, i.e. read-only from
    the guest, into the helper to check the overall validity of the incoming
    value. Opportunistically rename the helper to make it clear that it
    returns a bool.

    No functional change intended.

    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20220607232353.3375324-3-seanjc@google.com

* KVM: VMX: Allow userspace to set all supported FEATURE_CONTROL bits (Sean Christopherson, 2022-11-30; 1 file, -5/+31)

    Allow userspace to set all supported bits in MSR IA32_FEATURE_CONTROL
    irrespective of the guest CPUID model, e.g. via KVM_SET_MSRS. KVM's ABI
    is that userspace is allowed to set MSRs before CPUID, i.e. can set MSRs
    to values that would fault according to the guest CPUID model.

    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20220607232353.3375324-2-seanjc@google.com

* KVM: VMX: Make vmread_error_trampoline() uncallable from C code (Sean Christopherson, 2022-11-30; 2 files, -2/+18)

    Declare vmread_error_trampoline() as an opaque symbol so that it cannot
    be called from C code, at least not without some serious fudging.

    The trampoline always passes parameters on the stack so that the inline
    VMREAD sequence doesn't need to clobber registers. regparm(0) was
    originally added to document the stack behavior, but it ended up being
    confusing because regparm(0) is a nop for 64-bit targets.

    Opportunistically wrap the trampoline and its declaration in #ifdeffery
    to make it even harder to invoke incorrectly, to document why it exists,
    and so that it's not left behind if/when CONFIG_CC_HAS_ASM_GOTO_OUTPUT
    is true for all supported toolchains.

    No functional change intended.

    Cc: Uros Bizjak <ubizjak@gmail.com>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20220928232015.745948-1-seanjc@google.com

* KVM: nVMX: Reword comments about generating nested CR0/4 read shadows (Sean Christopherson, 2022-11-30; 2 files, -9/+7)

    Reword the comments that (attempt to) document nVMX's overrides of the
    CR0/4 read shadows for L2 after calling vmx_set_cr0/4(). The important
    behavior that needs to be documented is that KVM needs to override the
    shadows to account for L1's masks even though the shadows are set by the
    common helpers (and that setting the shadows first would result in the
    correct shadows being clobbered).

    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Reviewed-by: Jim Mattson <jmattson@google.com>
    Link: https://lore.kernel.org/r/20220831000721.4066617-1-seanjc@google.com

* KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS (Jim Mattson, 2022-11-30; 2 files, -2/+15)

    According to Intel's document on Indirect Branch Restricted Speculation,
    "Enabling IBRS does not prevent software from controlling the predicted
    targets of indirect branches of unrelated software executed later at the
    same predictor mode (for example, between two different user
    applications, or two different virtual machines). Such isolation can be
    ensured through use of the Indirect Branch Predictor Barrier (IBPB)
    command." This applies to both basic and enhanced IBRS.

    Since L1 and L2 VMs share hardware predictor modes (guest-user and
    guest-kernel), hardware IBRS is not sufficient to virtualize IBRS. (The
    way that basic IBRS is implemented on pre-eIBRS parts, hardware IBRS is
    actually sufficient in practice, even though it isn't sufficient
    architecturally.)

    For virtual CPUs that support IBRS, add an indirect branch prediction
    barrier on emulated VM-exit, to ensure that the predicted targets of
    indirect branches executed in L1 cannot be controlled by software that
    was executed in L2.

    Since we typically don't intercept guest writes to IA32_SPEC_CTRL,
    perform the IBPB at emulated VM-exit regardless of the current
    IA32_SPEC_CTRL.IBRS value, even though the IBPB could technically be
    deferred until L1 sets IA32_SPEC_CTRL.IBRS, if IA32_SPEC_CTRL.IBRS is
    clear at emulated VM-exit.

    This is CVE-2022-2196.

    Fixes: 5c911beff20a ("KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02")
    Cc: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Jim Mattson <jmattson@google.com>
    Reviewed-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20221019213620.1953281-3-jmattson@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

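A sketch of the barrier itself, under the assumption that it is issued from KVM's emulated VM-exit path. guest_cpuid_has() and indirect_branch_prediction_barrier() are existing kernel helpers; the wrapper name here is illustrative:

```c
static void nested_ibpb_on_emulated_vmexit(struct kvm_vcpu *vcpu)
{
	/*
	 * Flush the indirect branch predictors on L2 -> L1 transitions if
	 * the vCPU advertises IBRS, regardless of the current value of
	 * IA32_SPEC_CTRL.IBRS, since writes to that MSR typically aren't
	 * intercepted.
	 */
	if (guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
		indirect_branch_prediction_barrier();
}
```
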
* KVM: VMX: Guest usage of IA32_SPEC_CTRL is likely (Jim Mattson, 2022-11-30; 1 file, -1/+1)

    At this point in time, most guests (in the default, out-of-the-box
    configuration) are likely to use IA32_SPEC_CTRL. Therefore, drop the
    compiler hint that it is unlikely for KVM to be intercepting WRMSR of
    IA32_SPEC_CTRL.

    Signed-off-by: Jim Mattson <jmattson@google.com>
    Reviewed-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20221019213620.1953281-2-jmattson@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

* KVM: nVMX: Inject #GP, not #UD, if "generic" VMXON CR0/CR4 check fails (Sean Christopherson, 2022-11-30; 1 file, -11/+33)

    Inject #GP if VMXON is attempted with a CR0/CR4 that fails the generic
    "is CRx valid" check but passes the CR4.VMXE check, and do the generic
    checks _after_ handling the post-VMXON VM-Fail.

    The CR4.VMXE check, and all other #UD cases, are special pre-conditions
    that are enforced prior to pivoting on the current VMX mode, i.e. occur
    before interception if VMXON is attempted in VMX non-root mode.

    All other CR0/CR4 checks generate #GP and effectively have lower
    priority than the post-VMXON check.

    Per the SDM:

        IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
            THEN #UD;
        ELSIF not in VMX operation
            THEN
                IF (CPL > 0) or (in A20M mode) or
                (the values of CR0 and CR4 are not supported in VMX operation)
                    THEN #GP(0);
        ELSIF in VMX non-root operation
            THEN VMexit;
        ELSIF CPL > 0
            THEN #GP(0);
        ELSE VMfail("VMXON executed in VMX root operation");
        FI;

    which, if re-written without ELSIF, yields:

        IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
            THEN #UD

        IF in VMX non-root operation
            THEN VMexit;

        IF CPL > 0
            THEN #GP(0)

        IF in VMX operation
            THEN VMfail("VMXON executed in VMX root operation");

        IF (in A20M mode) or
        (the values of CR0 and CR4 are not supported in VMX operation)
            THEN #GP(0);

    Note, KVM unconditionally forwards VMXON VM-Exits that occur in L2 to
    L1, i.e. there is no need to check the vCPU is not in VMX non-root mode.
    Add a comment to explain why unconditionally forwarding such exits is
    functionally correct.

    Reported-by: Eric Li <ercli@ucdavis.edu>
    Fixes: c7d855c2aff2 ("KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20221006001956.329314-1-seanjc@google.com

* KVM: SVM: Replace kmap_atomic() with kmap_local_page() (Zhao Liu, 2022-11-30; 1 file, -2/+2)

    The use of kmap_atomic() is being deprecated in favor of
    kmap_local_page() [1]. The main difference between atomic and local
    mappings is that local mappings don't disable page faults or preemption.

    There are two reasons we can use kmap_local_page() here:

    1. SEV is 64-bit only and kmap_local_page() doesn't disable migration in
       this case, but here the function clflush_cache_range() uses the
       CLFLUSHOPT instruction to flush, and on x86 CLFLUSHOPT is not
       CPU-local and flushes the page out of the entire cache hierarchy on
       all CPUs (APM volume 3, chapter 3, CLFLUSHOPT). So there's no need to
       disable preemption to ensure CPU-locality.

    2. clflush_cache_range() doesn't need to disable page faults, and the
       mapping is still valid even if the task sleeps. This is also true for
       sched out/in when preempted.

    In addition, though kmap_local_page() is a thin wrapper around
    page_address() on 64-bit, kmap_local_page() should still be used here in
    preference to page_address(), since page_address() isn't suitable to be
    used in a generic function (like sev_clflush_pages()) where the page
    passed in is not easy to determine the source of allocation. Keeping the
    kmap* API in place means it can be used for things other than highmem
    mappings [2].

    Therefore, sev_clflush_pages() is a function that should use
    kmap_local_page() in place of kmap_atomic(). Convert the calls of
    kmap_atomic() / kunmap_atomic() to kmap_local_page() / kunmap_local().

    [1]: https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com
    [2]: https://lore.kernel.org/lkml/5d667258-b58b-3d28-3609-e7914c99b31b@intel.com/

    Suggested-by: Dave Hansen <dave.hansen@intel.com>
    Suggested-by: Ira Weiny <ira.weiny@intel.com>
    Suggested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
    Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
    Reviewed-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20220928092748.463631-1-zhao1.liu@linux.intel.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

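The conversion itself is mechanical; a sketch of the sev_clflush_pages() loop after the change, simplified from what arch/x86/kvm/svm/sev.c plausibly looks like:

```c
static void sev_clflush_pages(struct page *pages[], unsigned long npages)
{
	uint8_t *page_virtual;
	unsigned long i;

	if (npages == 0 || pages == NULL)
		return;

	for (i = 0; i < npages; i++) {
		page_virtual = kmap_local_page(pages[i]);   /* was: kmap_atomic() */
		clflush_cache_range(page_virtual, PAGE_SIZE);
		kunmap_local(page_virtual);                 /* was: kunmap_atomic() */
	}
}
```
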
* KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid (Sean Christopherson, 2022-11-30; 1 file, -2/+8)

    Skip the WRMSR fastpath in SVM's VM-Exit handler if the next RIP isn't
    valid, e.g. because KVM is running with nrips=false. SVM must decode and
    emulate to skip the WRMSR if the CPU doesn't provide the next RIP.
    Getting the instruction bytes to decode the WRMSR requires reading guest
    memory, which in turn means dereferencing memslots, and that isn't safe
    because KVM doesn't hold SRCU when the fastpath runs.

    Don't bother trying to enable the fastpath for this case, e.g. by doing
    only the WRMSR and leaving the "skip" until later. NRIPS is supported on
    all modern CPUs (KVM has considered making it mandatory), and the next
    RIP will be valid the vast, vast majority of the time.

      =============================
      WARNING: suspicious RCU usage
      6.0.0-smp--4e557fcd3d80-skip #13 Tainted: G O
      -----------------------------
      include/linux/kvm_host.h:954 suspicious rcu_dereference_check() usage!

      other info that might help us debug this:

      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by stable/206475:
       #0: ffff9d9dfebcc0f0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x8b/0x620 [kvm]

      stack backtrace:
      CPU: 152 PID: 206475 Comm: stable Tainted: G O 6.0.0-smp--4e557fcd3d80-skip #13
      Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022
      Call Trace:
       <TASK>
       dump_stack_lvl+0x69/0xaa
       dump_stack+0x10/0x12
       lockdep_rcu_suspicious+0x11e/0x130
       kvm_vcpu_gfn_to_memslot+0x155/0x190 [kvm]
       kvm_vcpu_gfn_to_hva_prot+0x18/0x80 [kvm]
       paging64_walk_addr_generic+0x183/0x450 [kvm]
       paging64_gva_to_gpa+0x63/0xd0 [kvm]
       kvm_fetch_guest_virt+0x53/0xc0 [kvm]
       __do_insn_fetch_bytes+0x18b/0x1c0 [kvm]
       x86_decode_insn+0xf0/0xef0 [kvm]
       x86_emulate_instruction+0xba/0x790 [kvm]
       kvm_emulate_instruction+0x17/0x20 [kvm]
       __svm_skip_emulated_instruction+0x85/0x100 [kvm_amd]
       svm_skip_emulated_instruction+0x13/0x20 [kvm_amd]
       handle_fastpath_set_msr_irqoff+0xae/0x180 [kvm]
       svm_vcpu_run+0x4b8/0x5a0 [kvm_amd]
       vcpu_enter_guest+0x16ca/0x22f0 [kvm]
       kvm_arch_vcpu_ioctl_run+0x39d/0x900 [kvm]
       kvm_vcpu_ioctl+0x538/0x620 [kvm]
       __se_sys_ioctl+0x77/0xc0
       __x64_sys_ioctl+0x1d/0x20
       do_syscall_64+0x3d/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Fixes: 404d5d7bff0d ("KVM: X86: Introduce more exit_fastpath_completion enum values")
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20220930234031.1732249-1-seanjc@google.com

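A sketch of the resulting guard in SVM's fastpath dispatch; the field and helper names follow KVM's SVM code, but treat the exact shape as an assumption:

```c
static fastpath_t svm_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
{
	struct vmcb_control_area *control = &to_svm(vcpu)->vmcb->control;

	/*
	 * Only take the WRMSR fastpath if the CPU provided the next RIP;
	 * decoding the instruction to skip it would require reading guest
	 * memory, i.e. dereferencing memslots without holding SRCU.
	 */
	if (control->exit_code == SVM_EXIT_MSR && control->exit_info_1 &&
	    nrips && control->next_rip)
		return handle_fastpath_set_msr_irqoff(vcpu);

	return EXIT_FASTPATH_NONE;
}
```
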
* KVM: x86: Fail emulation during EMULTYPE_SKIP on any exception (Sean Christopherson, 2022-11-30; 1 file, -1/+3)

    Treat any exception during instruction decode for EMULTYPE_SKIP as a
    "full" emulation failure, i.e. signal failure instead of queuing the
    exception. When decoding purely to skip an instruction, KVM and/or the
    CPU has already done some amount of emulation that cannot be unwound,
    e.g. on an EPT misconfig VM-Exit KVM has already processed the emulated
    MMIO.

    KVM already does this if a #UD is encountered, but not for other
    exceptions, e.g. if a #PF is encountered during fetch.

    In SVM's soft-injection use case, queueing the exception is particularly
    problematic as queueing exceptions while injecting events can put KVM
    into an infinite loop due to bailing from VM-Enter to service the newly
    pending exception. E.g. multiple warnings to detect such behavior fire:

      ------------[ cut here ]------------
      WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9873 kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
      Modules linked in: kvm_amd ccp kvm irqbypass
      CPU: 3 PID: 1017 Comm: svm_nested_soft Not tainted 6.0.0-rc1+ #220
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
      Call Trace:
       kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
       __x64_sys_ioctl+0x85/0xc0
       do_syscall_64+0x2b/0x50
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      ---[ end trace 0000000000000000 ]---
      ------------[ cut here ]------------
      WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9987 kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
      Modules linked in: kvm_amd ccp kvm irqbypass
      CPU: 3 PID: 1017 Comm: svm_nested_soft Tainted: G W 6.0.0-rc1+ #220
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
      Call Trace:
       kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
       __x64_sys_ioctl+0x85/0xc0
       do_syscall_64+0x2b/0x50
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      ---[ end trace 0000000000000000 ]---

    Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn")
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/20220930233632.1725475-1-seanjc@google.com

* KVM: x86: Keep the lock order consistent between SRCU and gpc spinlock (Peng Hao, 2022-11-30; 1 file, -2/+2)

    Acquire SRCU before taking the gpc spinlock in wait_pending_event() so
    as to be consistent with all other functions that acquire both locks.
    It's not illegal to acquire SRCU inside a spinlock, nor is there
    deadlock potential, but in general it's preferable to order locks from
    least restrictive to most restrictive, e.g. if wait_pending_event()
    needed to sleep for whatever reason, it could do so while holding SRCU,
    but would need to drop the spinlock.

    Signed-off-by: Peng Hao <flyingpeng@tencent.com>
    Reviewed-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/r/CAPm50a++Cb=QfnjMZ2EnCj-Sb9Y4UM-=uOEtHAcjnNLCAAf-dQ@mail.gmail.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>

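The resulting ordering, sketched; the lock names come from the commit, but treat this as a shape rather than a quote of wait_pending_event():

```c
/* Outer lock: SRCU, which may be held across sleeps. */
idx = srcu_read_lock(&kvm->srcu);

/* Inner lock: the non-sleeping gpc spinlock. */
read_lock_irqsave(&gpc->lock, flags);

/* ... consult the cached shared-info / event-channel state ... */

read_unlock_irqrestore(&gpc->lock, flags);
srcu_read_unlock(&kvm->srcu, idx);
```
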
* KVM: VMX: Resume guest immediately when injecting #GP on ECREATE (Sean Christopherson, 2022-11-30; 1 file, -1/+3)

    Resume the guest immediately when injecting a #GP on ECREATE due to an
    invalid enclave size, i.e. don't attempt ECREATE in the host. The #GP is
    a terminal fault, e.g. skipping the instruction if ECREATE is successful
    would result in KVM injecting #GP on the instruction following ECREATE.

    Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
    Cc: stable@vger.kernel.org
    Cc: Kai Huang <kai.huang@intel.com>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Reviewed-by: Kai Huang <kai.huang@intel.com>
    Link: https://lore.kernel.org/r/20220930233132.1723330-1-seanjc@google.com

* KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl (Javier Martinez Canillas, 2022-12-02; 1 file, -8/+0)

    The documentation says that the ioctl has been deprecated, but it has
    actually been removed and the remaining references are just leftovers.

    Suggested-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
    Message-Id: <20221202105011.185147-3-javierm@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULT (Paolo Bonzini, 2022-11-30; 1 file, -1/+1)

    If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by
    turning it into a vmexit), there is no need to leave vcpu_enter_guest().
    Any vcpu->requests will be caught later before the actual vmentry, and
    in fact vcpu_enter_guest() was not initializing the "r" variable.
    Depending on the compiler's whims, this could cause the
    x86_64/triple_fault_event_test test to fail.

    Cc: Maxim Levitsky <mlevitsk@redhat.com>
    Fixes: 92e7d5c83aff ("KVM: x86: allow L1 to not intercept triple fault")
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: Shorten gfn_to_pfn_cache function names (Michal Luczaj, 2022-11-30; 2 files, -19/+19)

    Formalize "gpc" as the acronym and use it in function names.

    No functional change intended.

    Suggested-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: Michal Luczaj <mhal@rbox.co>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary (David Woodhouse, 2022-11-30; 1 file, -0/+2)

    Torture test the cases where the runstate crosses a page boundary, and
    especially the case where it's configured in 32-bit mode and doesn't,
    but then switching to 64-bit mode makes it go onto the second page.

    To simplify this, make the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST ioctl
    also update the guest runstate area. It already did so if the actual
    runstate changed, as a side-effect of kvm_xen_update_runstate(). So
    doing it in the plain adjustment case is making it more consistent, as
    well as giving us a nice way to trigger the update without actually
    running the vCPU again and changing the values.

    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Reviewed-by: Paul Durrant <paul@xen.org>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: x86/xen: Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured (David Woodhouse, 2022-11-30; 3 files, -14/+47)

    Closer inspection of the Xen code shows that we aren't supposed to be
    using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be
    explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall.
    If we randomly set the top bit of ->state_entry_time for a guest that
    hasn't asked for it and doesn't expect it, that could make the runtimes
    fail to add up and confuse the guest. Without the flag it's perfectly
    safe for a vCPU to read its own vcpu_runstate_info; just not for one
    vCPU to read *another's*.

    I briefly pondered adding a word for the whole set of VMASST_TYPE_*
    flags but the only one we care about for HVM guests is this, so it
    seemed a bit pointless.

    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Message-Id: <20221127122210.248427-3-dwmw2@infradead.org>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: x86/xen: Compatibility fixes for shared runstate area (David Woodhouse, 2022-11-30; 3 files, -107/+270)

    The guest runstate area can be arbitrarily byte-aligned. In fact, even
    when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
    fields in the structure end up being unaligned due to the fact that the
    32-bit ABI only aligns them to 32 bits.

    So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
    is buggy, because if it's unaligned then we can't update the whole field
    atomically; the low bytes might be observable before the _UPDATE bit is.
    Xen actually updates the *byte* containing that top bit, on its own. KVM
    should do the same.

    In addition, we cannot assume that the runstate area fits within a
    single page. One option might be to make the gfn_to_pfn cache cope with
    regions that cross a page — but getting a contiguous virtual kernel
    mapping of a discontiguous set of IOMEM pages is a distinctly
    non-trivial exercise, and it seems this is the *only* current use case
    for the GPC which would benefit from it.

    An earlier version of the runstate code did use a gfn_to_hva cache for
    this purpose, but it still had the single-page restriction because it
    used the uhva directly — because it needs to be able to do so atomically
    when the vCPU is being scheduled out, so it used pagefault_disable()
    around the accesses and didn't just use kvm_write_guest_cached() which
    has a fallback path.

    So... use a pair of GPCs for the first and potential second page
    covering the runstate area. We can get away with locking both at once
    because nothing else takes more than one GPC lock at a time so we can
    invent a trivial ordering rule.

    The common case where it's all in the same page is kept as a fast path,
    but in both cases, the actual guest structure (compat or not) is built
    up from the fields in @vx, following preset pointers to the state and
    times fields. The only difference is whether those pointers point to the
    kernel stack (in the split case) or to guest memory directly via the
    GPC. The fast path is also fixed to use a byte access for the
    XEN_RUNSTATE_UPDATE bit, then the only real difference is the dual
    memcpy.

    Finally, Xen also does write the runstate area immediately when it's
    configured. Flip the kvm_xen_update_runstate() and …_guest() functions
    and call the latter directly when the runstate area is set. This means
    that other ioctls which modify the runstate also write it immediately to
    the guest when they do so, which is also intended.

    Update the xen_shinfo_test to exercise the pathological case where the
    XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is
    actually in a different page to the rest of the 64-bit word.

    Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

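A sketch of the byte-granular flag update described above. On little-endian x86 the XEN_RUNSTATE_UPDATE flag (bit 63 of state_entry_time) lives in the field's last byte; the variable names and exact barrier placement here are illustrative, not the literal KVM code:

```c
u8 *update_bit = ((u8 *)&rs->state_entry_time) + 7;

/* Set only the byte holding the flag, so readers see it before any data changes. */
WRITE_ONCE(*update_bit, *update_bit | (XEN_RUNSTATE_UPDATE >> 56));
smp_wmb();

/* ... write the state and time fields, possibly across two GPC mappings ... */

smp_wmb();
WRITE_ONCE(*update_bit, *update_bit & ~(XEN_RUNSTATE_UPDATE >> 56));
```
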
* Merge tag 'kvm-s390-next-6.2-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD (Paolo Bonzini, 2022-11-28; 16 files, -157/+562)

    - Second batch of the lazy destroy patches
    - First batch of KVM changes for kernel virtual != physical address support
    - Removal of an unused function

* KVM: s390: remove unused gisa_clear_ipm_gisc() function (Heiko Carstens, 2022-11-23; 1 file, -5/+0)

    clang warns about an unused function:

      arch/s390/kvm/interrupt.c:317:20: error: unused function
      'gisa_clear_ipm_gisc' [-Werror,-Wunused-function]
      static inline void gisa_clear_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc)

    Remove gisa_clear_ipm_gisc(), since it is unused, and get rid of this
    warning.

    Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
    Reviewed-by: Thomas Huth <thuth@redhat.com>
    Link: https://lore.kernel.org/r/20221118151133.2974602-1-hca@linux.ibm.com
    Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* KVM: s390: pv: module parameter to fence asynchronous destroy (Claudio Imbrenda, 2022-11-23; 1 file, -1/+7)

    Add the module parameter "async_destroy", to allow the asynchronous
    destroy mechanism to be switched off. This might be useful for debugging
    purposes.

    The parameter is enabled by default since the feature is opt-in anyway.

    Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
    Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
    Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221111170632.77622-7-imbrenda@linux.ibm.com
    Message-Id: <20221111170632.77622-7-imbrenda@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

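A minimal sketch of the knob, assuming it sits alongside KVM's other s390 module parameters; the parameter name comes from the commit, the type and default are assumptions consistent with "enabled by default":

```c
static bool async_destroy = true;  /* on by default; the feature is opt-in anyway */
module_param(async_destroy, bool, 0444);
MODULE_PARM_DESC(async_destroy, "Use asynchronous destroy for protected guests");
```
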
* KVM: s390: pv: support for Destroy fast UVC (Claudio Imbrenda, 2022-11-23; 2 files, -8/+63)

    Add support for the Destroy Secure Configuration Fast Ultravisor call,
    and take advantage of it for asynchronous destroy.

    When supported, the protected guest is destroyed immediately using the
    new UVC, leaving only the memory to be cleaned up asynchronously.

    Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
    Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221111170632.77622-6-imbrenda@linux.ibm.com
    Message-Id: <20221111170632.77622-6-imbrenda@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* KVM: s390: pv: avoid export before import if possible (Claudio Imbrenda, 2022-11-23; 1 file, -0/+7)

    If the appropriate UV feature bit is set, there is no need to perform an
    export before import.

    The misc feature indicates, among other things, that importing a shared
    page from a different protected VM will automatically also transfer its
    ownership.

    Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
    Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221111170632.77622-5-imbrenda@linux.ibm.com
    Message-Id: <20221111170632.77622-5-imbrenda@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* KVM: s390: pv: add KVM_CAP_S390_PROTECTED_ASYNC_DISABLE (Claudio Imbrenda, 2022-11-23; 1 file, -0/+3)

    Add KVM_CAP_S390_PROTECTED_ASYNC_DISABLE to signal that the
    KVM_PV_ASYNC_CLEANUP_PREPARE and KVM_PV_ASYNC_CLEANUP_PERFORM commands
    for the KVM_S390_PV_COMMAND ioctl are available.

    Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
    Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221111170632.77622-4-imbrenda@linux.ibm.com
    Message-Id: <20221111170632.77622-4-imbrenda@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* KVM: s390: pv: asynchronous destroy for reboot (Claudio Imbrenda, 2022-11-23; 4 files, -18/+331)

    Until now, destroying a protected guest was an entirely synchronous
    operation that could potentially take a very long time, depending on the
    size of the guest, due to the time needed to clean up the address space
    from protected pages.

    This patch implements an asynchronous destroy mechanism, that allows a
    protected guest to reboot significantly faster than previously.

    This is achieved by clearing the pages of the old guest in background.
    In case of reboot, the new guest will be able to run in the same address
    space almost immediately.

    The old protected guest is then only destroyed when all of its memory
    has been destroyed or otherwise made non protected.

    Two new PV commands are added for the KVM_S390_PV_COMMAND ioctl:

    KVM_PV_ASYNC_CLEANUP_PREPARE: set aside the current protected VM for
    later asynchronous teardown. The current KVM VM will then continue
    immediately as non-protected. If a protected VM had already been set
    aside for asynchronous teardown, but without starting the teardown
    process, this call will fail. There can be at most one VM set aside at
    any time. Once it is set aside, the protected VM only exists in the
    context of the Ultravisor, it is not associated with the KVM VM anymore.
    Its protected CPUs have already been destroyed, but not its memory. This
    command can be issued again immediately after starting
    KVM_PV_ASYNC_CLEANUP_PERFORM, without having to wait for completion.

    KVM_PV_ASYNC_CLEANUP_PERFORM: tears down the protected VM previously set
    aside using KVM_PV_ASYNC_CLEANUP_PREPARE. Ideally the
    KVM_PV_ASYNC_CLEANUP_PERFORM PV command should be issued by userspace
    from a separate thread. If a fatal signal is received (or if the process
    terminates naturally), the command will terminate immediately without
    completing. All protected VMs whose teardown was interrupted will be put
    in the need_cleanup list. The rest of the normal KVM teardown process
    will take care of properly cleaning up all remaining protected VMs,
    including the ones on the need_cleanup list.

    Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
    Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221111170632.77622-2-imbrenda@linux.ibm.com
    Message-Id: <20221111170632.77622-2-imbrenda@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* s390/mm: fix virtual-physical address confusion for swiotlb (Nico Boehr, 2022-11-07; 2 files, -8/+8)

    swiotlb passes virtual addresses to set_memory_encrypted() and
    set_memory_decrypted(), but uv_remove_shared() and uv_set_shared()
    expect physical addresses. This currently works, because virtual and
    physical addresses are the same.

    Add virt_to_phys() to resolve the virtual-physical confusion.

    Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
    Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221107121221.156274-2-nrb@linux.ibm.com
    Message-Id: <20221107121221.156274-2-nrb@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

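A sketch of the fix on the set_memory_decrypted() side; the page-by-page loop is an assumption about the s390 implementation, simplified for illustration:

```c
int set_memory_decrypted(unsigned long addr, int numpages)
{
	int i;

	/* Make the pages shared with the host; the UV call wants a physical address. */
	for (i = 0; i < numpages; i++) {
		uv_remove_shared(virt_to_phys((void *)addr));  /* was: uv_remove_shared(addr) */
		addr += PAGE_SIZE;
	}
	return 0;
}
```
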
* KVM: s390: VSIE: sort out virtual/physical address in pin_guest_page (Nico Boehr, 2022-10-26; 1 file, -2/+2)

    pin_guest_page() used page_to_virt() to calculate the hpa of the pinned
    page. This currently works, because virtual and physical addresses are
    the same. Use page_to_phys() instead to resolve the virtual-real address
    confusion.

    One caller of pin_guest_page() actually expected the hpa to be a hva, so
    add the missing phys_to_virt() conversion here.

    Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221025082039.117372-2-nrb@linux.ibm.com
    Message-Id: <20221025082039.117372-2-nrb@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* KVM: s390: pv: sort out physical vs virtual pointers usage (Nico Boehr, 2022-10-26; 1 file, -4/+5)

    Fix virtual vs physical address confusion (which currently are the
    same).

    Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221020143159.294605-6-nrb@linux.ibm.com
    Message-Id: <20221020143159.294605-6-nrb@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* KVM: s390: sida: sort out physical vs virtual pointers usage (Nico Boehr, 2022-10-26; 5 files, -15/+15)

    All callers of the sida_origin() macro actually expected a virtual
    address, so rename it to sida_addr() and hand out a virtual address. At
    some places, the macro wasn't used, potentially creating problems if the
    sida size ever becomes nonzero (not currently the case), so let's start
    using it everywhere now while at it.

    Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221020143159.294605-5-nrb@linux.ibm.com
    Message-Id: <20221020143159.294605-5-nrb@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* KVM: s390: sort out physical vs virtual pointers usage (Nico Boehr, 2022-10-26; 4 files, -22/+30)

    Fix virtual vs physical address confusion (which currently are the
    same).

    Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221020143159.294605-4-nrb@linux.ibm.com
    Message-Id: <20221020143159.294605-4-nrb@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* s390/entry: sort out physical vs virtual pointers usage in sie64a (Nico Boehr, 2022-10-26; 4 files, -12/+24)

    Fix virtual vs physical address confusion (which currently are the
    same).

    sie_block is accessed in entry.S and passed to hardware, which is why
    both its physical and virtual address are needed. To avoid every caller
    having to do the virtual-physical conversion, add a new function
    sie64a() which converts the virtual address to physical.

    Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221020143159.294605-3-nrb@linux.ibm.com
    Message-Id: <20221020143159.294605-3-nrb@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

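Per the commit text, the new wrapper does the conversion once so callers keep passing the virtual sie_block; a sketch, where the __sie64a signature is an assumption:

```c
extern int __sie64a(phys_addr_t sie_block_phys,
		    struct kvm_s390_sie_block *sie_block, u64 *rsa);

static inline int sie64a(struct kvm_s390_sie_block *sie_block, u64 *rsa)
{
	/* Hardware needs the physical address; entry.S also uses the virtual one. */
	return __sie64a(virt_to_phys(sie_block), sie_block, rsa);
}
```
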
* s390/mm: gmap: sort out physical vs virtual pointers usage (Nico Boehr, 2022-10-26; 1 file, -71/+76)

    Fix virtual vs physical address confusion (which currently are the
    same).

    Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
    Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
    Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
    Link: https://lore.kernel.org/r/20221020143159.294605-2-nrb@linux.ibm.com
    Message-Id: <20221020143159.294605-2-nrb@linux.ibm.com>
    Signed-off-by: Janosch Frank <frankja@linux.ibm.com>

* KVM: x86: Advertise PREFETCHIT0/1 CPUID to user space (Jiaxi Chen, 2022-11-28; 2 files, -1/+2)

    The latest Intel platform Granite Rapids has introduced a new
    instruction - PREFETCHIT0/1 - which moves code to memory (cache) closer
    to the processor depending on specific hints.

    The bit definition: CPUID.(EAX=7,ECX=1):EDX[bit 14]

    PREFETCHIT0/1 is on a KVM-only subleaf. Add an x86_FEATURE definition
    for this feature bit to direct it to the KVM entry.

    Advertise PREFETCHIT0/1 to KVM userspace. This is safe because there are
    no new VMX controls or additional host enabling required for guests to
    use this feature.

    Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
    Message-Id: <20221125125845.1182922-9-jiaxi.chen@linux.intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: x86: Advertise AVX-NE-CONVERT CPUID to user space (Jiaxi Chen, 2022-11-28; 2 files, -1/+2)

    AVX-NE-CONVERT is a new set of instructions which can convert low
    precision floating point formats like BF16/FP16 to high precision
    floating point FP32, and can also convert FP32 elements to BF16. This
    instruction allows the platform to have improved AI capabilities and
    better compatibility.

    The bit definition: CPUID.(EAX=7,ECX=1):EDX[bit 5]

    AVX-NE-CONVERT is on a KVM-only subleaf. Add an x86_FEATURE definition
    for this feature bit to direct it to the KVM entry.

    Advertise AVX-NE-CONVERT to KVM userspace. This is safe because there
    are no new VMX controls or additional host enabling required for guests
    to use this feature.

    Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
    Message-Id: <20221125125845.1182922-8-jiaxi.chen@linux.intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: x86: Advertise AVX-VNNI-INT8 CPUID to user space (Jiaxi Chen, 2022-11-28; 2 files, -1/+10)

    AVX-VNNI-INT8 is a new set of instructions in the latest Intel platform
    Sierra Forest, aiming to give the platform superior AI capabilities.
    These instructions multiply the individual bytes of two signed or
    unsigned source operands, then add and accumulate the results into the
    destination dword element size operand.

    The bit definition: CPUID.(EAX=7,ECX=1):EDX[bit 4]

    AVX-VNNI-INT8 is on a new and sparse CPUID leaf and all bits on this
    leaf have no truly kernel use case for now. Given that, and to save
    space for kernel feature bits, move this new leaf to a KVM-only subleaf
    and add an x86_FEATURE definition for AVX-VNNI-INT8 to direct it to the
    KVM entry.

    Advertise AVX-VNNI-INT8 to KVM userspace. This is safe because there are
    no new VMX controls or additional host enabling required for guests to
    use this feature.

    Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
    Message-Id: <20221125125845.1182922-7-jiaxi.chen@linux.intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* x86: KVM: Advertise AVX-IFMA CPUID to user space (Jiaxi Chen, 2022-11-28; 2 files, -1/+3)

    AVX-IFMA is a new instruction in the latest Intel platform Sierra
    Forest. This instruction performs packed multiplication of unsigned
    52-bit integers and adds the low/high 52-bit products to quadword
    accumulators.

    The bit definition: CPUID.(EAX=7,ECX=1):EAX[bit 23]

    AVX-IFMA is on an expected-dense CPUID leaf and some other bits on this
    leaf have kernel usages. Given that, define this feature bit like
    X86_FEATURE_<name> in the kernel. Considering AVX-IFMA itself has no
    truly kernel usages and /proc/cpuinfo has too many unreadable flags,
    hide this one in /proc/cpuinfo.

    Advertise AVX-IFMA to KVM userspace. This is safe because there are no
    new VMX controls or additional host enabling required for guests to use
    this feature.

    Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
    Acked-by: Borislav Petkov <bp@suse.de>
    Message-Id: <20221125125845.1182922-6-jiaxi.chen@linux.intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* x86: KVM: Advertise AMX-FP16 CPUID to user space (Chang S. Bae, 2022-11-28; 2 files, -1/+2)

    The latest Intel platform Granite Rapids has introduced a new
    instruction - AMX-FP16 - which performs dot-products of two FP16 tiles
    and accumulates the results into a packed single precision tile.
    AMX-FP16 adds FP16 capability and also allows an FP16 GPU-trained model
    to run faster without loss of accuracy or added SW overhead.

    The bit definition: CPUID.(EAX=7,ECX=1):EAX[bit 21]

    AMX-FP16 is on an expected-dense CPUID leaf and some other bits on this
    leaf have kernel usages. Given that, define this feature bit like
    X86_FEATURE_<name> in the kernel. Considering AMX-FP16 itself has no
    truly kernel usages and /proc/cpuinfo has too many unreadable flags,
    hide this one in /proc/cpuinfo.

    Advertise AMX-FP16 to KVM userspace. This is safe because there are no
    new VMX controls or additional host enabling required for guests to use
    this feature.

    Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
    Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
    Acked-by: Borislav Petkov <bp@suse.de>
    Message-Id: <20221125125845.1182922-5-jiaxi.chen@linux.intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* x86: KVM: Advertise CMPccXADD CPUID to user space (Jiaxi Chen, 2022-11-28; 2 files, -1/+2)

    CMPccXADD is a new set of instructions in the latest Intel platform
    Sierra Forest. This new instruction set includes a semaphore operation
    that can compare and add the operands if the condition is met, which can
    improve database performance.

    The bit definition: CPUID.(EAX=7,ECX=1):EAX[bit 7]

    CMPccXADD is on an expected-dense CPUID leaf and some other bits on this
    leaf have kernel usages. Given that, define this feature bit like
    X86_FEATURE_<name> in the kernel. Considering CMPccXADD itself has no
    truly kernel usages and /proc/cpuinfo has too many unreadable flags,
    hide this one in /proc/cpuinfo.

    Advertise CMPCCXADD to KVM userspace. This is safe because there are no
    new VMX controls or additional host enabling required for guests to use
    this feature.

    Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
    Acked-by: Borislav Petkov <bp@suse.de>
    Message-Id: <20221125125845.1182922-4-jiaxi.chen@linux.intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

* KVM: x86: Update KVM-only leaf handling to allow for 100% KVM-only leafs (Sean Christopherson, 2022-11-28; 2 files, -7/+19)

    Rename kvm_cpu_cap_init_scattered() to kvm_cpu_cap_init_kvm_defined() in
    anticipation of adding KVM-only CPUID leafs that aren't recognized by
    the kernel and thus not scattered, i.e. for leafs that are 100%
    KVM-defined.

    Adjust/add comments to kvm_only_cpuid_leafs and KVM_X86_FEATURE to
    document how to create new kvm_only_cpuid_leafs entries for scattered
    features as well as features that are entirely unknown to the kernel.

    No functional change intended.

    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20221125125845.1182922-3-jiaxi.chen@linux.intel.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

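A sketch of the pattern those comments document, condensed from what arch/x86/kvm/reverse_cpuid.h plausibly looks like once this prep patch and the feature patches above land; exact contents are assumed:

```c
enum kvm_only_cpuid_leafs {
	CPUID_12_EAX = NCAPINTS,  /* scattered: bits the kernel knows, repacked by KVM */
	CPUID_7_1_EDX,            /* 100% KVM-defined: a word the kernel doesn't track */
	NR_KVM_CPU_CAPS,
};

#define KVM_X86_FEATURE(w, f)	((w) * 32 + (f))

/* A feature on a KVM-only word gets its own KVM_X86_FEATURE_* definition. */
#define KVM_X86_FEATURE_AVX_VNNI_INT8	KVM_X86_FEATURE(CPUID_7_1_EDX, 4)
```
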