path: root/arch/x86/kvm
Commit message | Author | Age | Files | Lines
...
* KVM: x86: Add helpers to test/mark reg availability and dirtiness | Sean Christopherson | 2019-10-22 | 4 | -32/+49

  Add helpers to prettify code that tests and/or marks whether or not a register is available and/or dirty.

  Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
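The idea of "available" and "dirty" register tracking can be illustrated with two bitmasks. The following is only a minimal sketch: `struct reg_cache` and the `reg_*` helper names are hypothetical stand-ins, not the helpers added by the commit, and the rule that marking a register dirty also marks it available is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU register cache state. */
struct reg_cache {
    uint32_t regs_avail; /* bit n set: register n is cached and valid */
    uint32_t regs_dirty; /* bit n set: cached value must be written back */
};

static inline bool reg_is_available(const struct reg_cache *c, unsigned int reg)
{
    assert(reg < 32);
    return (c->regs_avail & (1u << reg)) != 0;
}

static inline bool reg_is_dirty(const struct reg_cache *c, unsigned int reg)
{
    assert(reg < 32);
    return (c->regs_dirty & (1u << reg)) != 0;
}

static inline void reg_mark_available(struct reg_cache *c, unsigned int reg)
{
    assert(reg < 32);
    c->regs_avail |= 1u << reg;
}

/* Assumption of this sketch: a dirty register is also available. */
static inline void reg_mark_dirty(struct reg_cache *c, unsigned int reg)
{
    assert(reg < 32);
    c->regs_avail |= 1u << reg;
    c->regs_dirty |= 1u << reg;
}
```

Callers then read as `if (!reg_is_available(c, reg)) { /* fetch from hardware */ }` rather than open-coding the bit tests.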
* KVM: x86: Fold 'enum kvm_ex_reg' definitions into 'enum kvm_reg' | Sean Christopherson | 2019-10-22 | 1 | -1/+1

  Now that indexing into arch.regs is either protected by WARN_ON_ONCE or done with hardcoded enums, combine all definitions for registers that are tracked by regs_avail and regs_dirty into 'enum kvm_reg'. Having a single enum type will simplify additional cleanup related to regs_avail and regs_dirty.

  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: Add WARNs to detect out-of-bounds register indices | Sean Christopherson | 2019-10-22 | 2 | -8/+10

  Add WARN_ON_ONCE() checks in kvm_register_{read,write}() to detect reg values that would cause KVM to overflow vcpu->arch.regs. Change the reg param to an 'int' to make it clear that the reg index is unverified.

  Regarding the overhead of WARN_ON_ONCE(), now that all fixed GPR reads and writes use dedicated accessors, e.g. kvm_rax_read(), the overhead is limited to flows where the reg index is generated at runtime. And there is at least one historical bug where KVM has generated an out-of-bounds access to arch.regs (see commit b68f3cc7d9789, "KVM: x86: Always use 32-bit SMRAM save state for 32-bit kernels").

  Adding the WARN_ON_ONCE() protection paves the way for additional cleanup related to kvm_reg and kvm_reg_ex.

  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
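A userspace sketch of the bounds-check pattern described above follows. It is not the kernel code: `NR_REGS`, `struct vcpu_regs`, and the `register_*` functions are hypothetical, and a one-shot `fprintf` stands in for WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdio.h>

#define NR_REGS 17 /* hypothetical size of the register array */

struct vcpu_regs {
    unsigned long regs[NR_REGS];
};

/* Returns 1 (and warns on the first occurrence only) if reg is out of range. */
static int reg_index_invalid(int reg)
{
    static int warned;

    if (reg >= 0 && reg < NR_REGS)
        return 0;
    if (!warned) {
        warned = 1;
        fprintf(stderr, "register index %d out of bounds\n", reg);
    }
    return 1;
}

static unsigned long register_read(const struct vcpu_regs *v, int reg)
{
    if (reg_index_invalid(reg))
        return 0; /* never read past the end of the array */
    return v->regs[reg];
}

static void register_write(struct vcpu_regs *v, int reg, unsigned long val)
{
    if (reg_index_invalid(reg))
        return; /* drop the write instead of corrupting memory */
    v->regs[reg] = val;
}

/* Example use: in-range accesses round-trip, out-of-range reads yield 0. */
static void register_demo(void)
{
    struct vcpu_regs v = {{0}};

    register_write(&v, 2, 42);
    assert(register_read(&v, 2) == 42);
    register_write(&v, NR_REGS, 7);
    assert(register_read(&v, NR_REGS) == 0);
}
```

Taking `reg` as a plain `int` mirrors the commit's point that the index is unverified until checked.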
* KVM: VMX: Optimize vmx_set_rflags() for unrestricted guest | Sean Christopherson | 2019-10-22 | 1 | -2/+9

  Rework vmx_set_rflags() to avoid the extra code needed to handle emulation of real mode and invalid state when unrestricted guest is disabled. The primary reason for doing so is to avoid the call to vmx_get_rflags(), which will incur a VMREAD when RFLAGS is not already available. When running nested VMs, the majority of calls to vmx_set_rflags() will occur without an associated vmx_get_rflags(), i.e. when stuffing GUEST_RFLAGS during transitions between vmcs01 and vmcs02.

  Note, vmx_get_rflags() guarantees RFLAGS is marked available.

  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  [Replace "else" with early "return" in the unrestricted guest branch. - Paolo]
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: VMX: Consolidate to_vmx() usage in RFLAGS accessors | Sean Christopherson | 2019-10-22 | 1 | -9/+11

  Capture struct vcpu_vmx in a local variable to improve the readability of vmx_{g,s}et_rflags().

  No functional change intended.

  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: VMX: Skip GUEST_CR3 VMREAD+VMWRITE if the VMCS is up-to-date | Sean Christopherson | 2019-10-22 | 1 | -3/+5

  Skip the VMWRITE to update GUEST_CR3 if CR3 is not available, i.e. has not been read from the VMCS since the last VM-Enter. If vcpu->arch.cr3 is stale, kvm_read_cr3(vcpu) will refresh vcpu->arch.cr3 from the VMCS, meaning KVM will do a VMREAD and then VMWRITE the value it just pulled from the VMCS.

  Note, this is a purely theoretical change, no instances of skipping the VMREAD+VMWRITE have been observed with this change.

  Tested-by: Reto Buerki <reet@codelabs.ch>
  Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: SVM: Reduce WBINVD/DF_FLUSH invocations | Tom Lendacky | 2019-10-22 | 1 | -15/+66

  WBINVD and DF_FLUSH are expensive operations. Currently, a WBINVD/DF_FLUSH is performed every time an SEV guest terminates. However, the WBINVD/DF_FLUSH is only required when an ASID is being re-allocated to a new SEV guest. Also, a single WBINVD/DF_FLUSH can enable all ASIDs that have been disassociated from guests through DEACTIVATE.

  To reduce the number of WBINVD/DF_FLUSH invocations, introduce a new ASID bitmap to track ASIDs that need to be reclaimed. When an SEV guest is terminated, add its ASID to the reclaim bitmap instead of clearing its bit in the existing SEV ASID bitmap. This delays the need to perform a WBINVD/DF_FLUSH invocation when an SEV guest terminates until all of the available SEV ASIDs have been used. At that point, the WBINVD/DF_FLUSH invocation can be performed and all ASIDs in the reclaim bitmap moved to the available ASIDs bitmap.

  The semaphore around DEACTIVATE can be changed to a read semaphore with the semaphore taken in write mode before performing the WBINVD/DF_FLUSH.

  Tested-by: David Rientjes <rientjes@google.com>
  Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
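The deferred-reclaim scheme can be sketched in userspace as below. This is an illustration under stated assumptions, not the kernel implementation: `struct asid_pool`, `MAX_ASID`, and the function names are hypothetical, locking is omitted, and a counter stands in for the actual WBINVD/DF_FLUSH.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ASID 4 /* hypothetical number of SEV ASIDs */

struct asid_pool {
    uint32_t in_use;      /* bit n set: ASID n+1 is in use or awaiting a flush */
    uint32_t reclaim;     /* bit n set: ASID n+1 was freed but not yet flushed */
    unsigned int flushes; /* counts simulated WBINVD/DF_FLUSH invocations */
};

static int find_free(uint32_t in_use)
{
    for (int i = 0; i < MAX_ASID; i++)
        if (!(in_use & (1u << i)))
            return i;
    return -1;
}

/* Returns an ASID in 1..MAX_ASID, or -1 if none can be made available. */
static int asid_alloc(struct asid_pool *p)
{
    int i = find_free(p->in_use);

    if (i < 0 && p->reclaim) {
        p->flushes++;             /* one flush covers every reclaimed ASID */
        p->in_use &= ~p->reclaim; /* move reclaimed ASIDs back to the free set */
        p->reclaim = 0;
        i = find_free(p->in_use);
    }
    if (i < 0)
        return -1;
    p->in_use |= 1u << i;
    return i + 1;
}

/* Guest teardown: defer the flush by only recording the ASID for reclaim. */
static void asid_free(struct asid_pool *p, int asid)
{
    assert(asid >= 1 && asid <= MAX_ASID);
    p->reclaim |= 1u << (asid - 1);
}
```

With this shape, freeing any number of ASIDs costs no flush; a single flush is paid only when allocation finds the free set exhausted.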
* KVM: SVM: Remove unneeded WBINVD and DF_FLUSH when starting SEV guests | Tom Lendacky | 2019-10-22 | 1 | -15/+0

  WBINVD and DF_FLUSH are expensive operations. The SEV support currently performs this WBINVD/DF_FLUSH combination when an SEV guest is terminated, so there is no need for it to be done before LAUNCH.

  However, when the SEV firmware transitions the platform from UNINIT state to INIT state, all ASIDs will be marked invalid across all threads. Therefore, as part of transitioning the platform to INIT state, perform a WBINVD/DF_FLUSH after a successful INIT in the PSP/SEV device driver. Since the PSP/SEV device driver is x86 only, it can reference and use the WBINVD related functions directly.

  Cc: Gary Hook <gary.hook@amd.com>
  Cc: Herbert Xu <herbert@gondor.apana.org.au>
  Cc: "David S. Miller" <davem@davemloft.net>
  Tested-by: David Rientjes <rientjes@google.com>
  Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter | Sean Christopherson | 2019-10-22 | 2 | -3/+17

  Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter instead of deferring the VMWRITE until vmx_set_cr3(). If the VMWRITE is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated VM-Exit occurs without actually entering L2, e.g. if the nested run is squashed because nested VM-Enter (from L1) is putting L2 into HLT.

  Note, the above scenario can occur regardless of whether L1 is intercepting HLT, e.g. L1 can intercept HLT and then re-enter L2 with vmcs.GUEST_ACTIVITY_STATE=HALTED. But practically speaking, a VMM will likely put a guest into HALTED if and only if it's not intercepting HLT.

  In an ideal world where EPT *requires* unrestricted guest (and vice versa), VMX could handle CR3 similar to how it handles RSP and RIP, e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run(). But the unrestricted guest silliness complicates the dirty tracking logic to the point that explicitly handling vmcs02.GUEST_CR3 during nested VM-Enter is a simpler overall implementation.

  Cc: stable@vger.kernel.org
  Reported-and-tested-by: Reto Buerki <reet@codelabs.ch>
  Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Reviewed-by: Liran Alon <liran.alon@oracle.com>
  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Reviewed-by: Jim Mattson <jmattson@google.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: SVM: Guard against DEACTIVATE when performing WBINVD/DF_FLUSH | Tom Lendacky | 2019-10-22 | 1 | -0/+20

  The SEV firmware DEACTIVATE command disassociates an SEV guest from an ASID, clears the WBINVD indicator on all threads and indicates that the SEV firmware DF_FLUSH command must be issued before the ASID can be re-used. The SEV firmware DF_FLUSH command will return an error if a WBINVD has not been performed on every thread before it has been invoked.

  A window exists between the WBINVD and the invocation of the DF_FLUSH command where an SEV firmware DEACTIVATE command could be invoked on another thread, clearing the WBINVD indicator. This will cause the subsequent SEV firmware DF_FLUSH command to fail which, in turn, results in the SEV firmware ACTIVATE command failing for the reclaimed ASID. This results in the SEV guest failing to start.

  Use a mutex to close the WBINVD/DF_FLUSH window by obtaining the mutex before the DEACTIVATE and releasing it after the DF_FLUSH. This ensures that any DEACTIVATE cannot run before a DF_FLUSH has completed.

  Fixes: 59414c989220 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command")
  Tested-by: David Rientjes <rientjes@google.com>
  Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: SVM: Serialize access to the SEV ASID bitmap | Tom Lendacky | 2019-10-22 | 1 | -12/+17

  The SEV ASID bitmap currently is not protected against parallel SEV guest startups. This can result in an SEV guest failing to start because another SEV guest could have been assigned the same ASID value. Use a mutex to serialize access to the SEV ASID bitmap.

  Fixes: 1654efcbc431 ("KVM: SVM: Add KVM_SEV_INIT command")
  Tested-by: David Rientjes <rientjes@google.com>
  Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: clear kvmclock MSR on reset | Paolo Bonzini | 2019-10-22 | 1 | -5/+3

  After resetting the vCPU, the kvmclock MSR keeps the previous value but it is not enabled. This can be confusing, so fix it.

  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: fix bugon.cocci warnings | kbuild test robot | 2019-10-22 | 1 | -2/+1

  Use BUG_ON instead of an if condition followed by BUG.

  Generated by: scripts/coccinelle/misc/bugon.cocci

  Fixes: 4b526de50e39 ("KVM: x86: Check kvm_rebooting in kvm_spurious_fault()")
  CC: Sean Christopherson <sean.j.christopherson@intel.com>
  Signed-off-by: kbuild test robot <lkp@intel.com>
  Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: VMX: Remove specialized handling of unexpected exit-reasons | Liran Alon | 2019-10-22 | 1 | -12/+0

  Commit bf653b78f960 ("KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit") introduced specialized handling of specific exit-reasons that should not be raised by the CPU because KVM configures the VMCS such that they should never be raised.

  However, since commit 7396d337cfad ("KVM: x86: Return to userspace with internal error on unexpected exit reason"), VMX & SVM exit handlers were modified to generically handle all unexpected exit-reasons by returning to userspace with internal error.

  Therefore, there is no need for specialized handling of specific unexpected exit-reasons. (This specialized handling also introduced an inconsistency: these exit-reasons silently skipped the guest instruction instead of returning to userspace with an internal error.)

  Fixes: bf653b78f960 ("KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit")
  Signed-off-by: Liran Alon <liran.alon@oracle.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID | Jim Mattson | 2019-10-22 | 1 | -1/+1

  When the RDPID instruction is supported on the host, enumerate it in KVM_GET_SUPPORTED_CPUID.

  Signed-off-by: Jim Mattson <jmattson@google.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: omit "impossible" pmu MSRs from MSR list | Paolo Bonzini | 2019-10-04 | 1 | -16/+2

  INTEL_PMC_MAX_GENERIC is currently 32, which exceeds the 18 contiguous MSR indices reserved by Intel for event selectors. Since some machines actually have MSRs past the reserved range, filtering them against x86_pmu.num_counters_gp may have false positives. Cut the list to 18 entries to avoid this.

  Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Cc: Jim Mattson <jamttson@google.com>
  Fixes: e2ada66ec418 ("kvm: x86: Add Intel PMU MSRs to msrs_to_save[]", 2019-08-21)
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: nVMX: Fix consistency check on injected exception error code | Sean Christopherson | 2019-10-03 | 1 | -1/+1

  Current versions of Intel's SDM incorrectly state that "bits 31:15 of the VM-Entry exception error-code field" must be zero. In reality, bits 31:16 must be zero, i.e. error codes are 16-bit values.

  The bogus error code check manifests as an unexpected VM-Entry failure due to an invalid code field (error number 7) in L1, e.g. when injecting a #GP with error_code=0x9f00.

  Nadav previously reported the bug[*], both to KVM and Intel, and fixed the associated kvm-unit-test.

  [*] https://patchwork.kernel.org/patch/11124749/

  Reported-by: Nadav Amit <namit@vmware.com>
  Cc: stable@vger.kernel.org
  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Reviewed-by: Jim Mattson <jmattson@google.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
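The difference between the two masks is easy to see with the error code from the message. In this small sketch, the function names are hypothetical and the masks are derived directly from the bit ranges quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Check following the incorrect SDM wording: bits 31:15 must be zero. */
static inline bool error_code_ok_bits_31_15(uint32_t error_code)
{
    return (error_code & 0xFFFF8000u) == 0;
}

/* Check following the commit message: bits 31:16 must be zero. */
static inline bool error_code_ok_bits_31_16(uint32_t error_code)
{
    return (error_code & 0xFFFF0000u) == 0;
}

/* 0x9f00 has bit 15 set, so only the 31:15 variant rejects it. */
static void error_code_demo(void)
{
    assert(!error_code_ok_bits_31_15(0x9f00));
    assert(error_code_ok_bits_31_16(0x9f00));
    assert(!error_code_ok_bits_31_16(0x10000));
}
```

Any 16-bit error code with bit 15 set, such as 0x9f00, passes only the second check.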
* KVM: x86: omit absent pmu MSRs from MSR list | Paolo Bonzini | 2019-10-03 | 1 | -2/+12

  INTEL_PMC_MAX_GENERIC is currently 32, which exceeds the 18 contiguous MSR indices reserved by Intel for event selectors. Since some machines actually have MSRs past the reserved range, these may survive the filtering of msrs_to_save array and would be rejected by KVM_GET/SET_MSR. To avoid this, cut the list to whatever CPUID reports for the host's architectural PMU.

  Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Cc: Jim Mattson <jmattson@google.com>
  Fixes: e2ada66ec418 ("kvm: x86: Add Intel PMU MSRs to msrs_to_save[]", 2019-08-21)
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: vmx: Limit guest PMCs to those supported on the host | Jim Mattson | 2019-10-01 | 1 | -2/+5

  KVM can only virtualize as many PMCs as the host supports.

  Limit the number of generic counters and fixed counters to the number of corresponding counters supported on the host, rather than to INTEL_PMC_MAX_GENERIC and INTEL_PMC_MAX_FIXED, respectively.

  Note that INTEL_PMC_MAX_GENERIC is currently 32, which exceeds the 18 contiguous MSR indices reserved by Intel for event selectors. Since the existing code relies on a contiguous range of MSR indices for event selectors, it can't possibly work for more than 18 general purpose counters.

  Fixes: f5132b01386b5a ("KVM: Expose a version 2 architectural PMU to a guests")
  Signed-off-by: Jim Mattson <jmattson@google.com>
  Reviewed-by: Marc Orr <marcorr@google.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86, powerpc: do not allow clearing largepages debugfs entry | Paolo Bonzini | 2019-09-30 | 1 | -3/+3

  The largepages debugfs entry is incremented/decremented as shadow pages are created or destroyed. Clearing it will result in an underflow, which is harmless to KVM but ugly (and could be misinterpreted by tools that use debugfs information), so make this particular statistic read-only.

  Cc: kvm-ppc@vger.kernel.org
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: VMX: Set VMENTER_L1D_FLUSH_NOT_REQUIRED if !X86_BUG_L1TF | Waiman Long | 2019-09-27 | 1 | -6/+9

  The l1tf_vmx_mitigation is only set to VMENTER_L1D_FLUSH_NOT_REQUIRED when the ARCH_CAPABILITIES MSR indicates that L1D flush is not required. However, if the CPU is not affected by L1TF, l1tf_vmx_mitigation will still be set to VMENTER_L1D_FLUSH_AUTO. This is certainly not the best option for a !X86_BUG_L1TF CPU.

  So force l1tf_vmx_mitigation to VMENTER_L1D_FLUSH_NOT_REQUIRED to make it more explicit in case users are checking the vmentry_l1d_flush parameter.

  Signed-off-by: Waiman Long <longman@redhat.com>
  [Patch rewritten according to Borislav Petkov's suggestion. - Paolo]
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: fix nested guest live migration with PML | Paolo Bonzini | 2019-09-27 | 1 | -7/+32

  Shadow paging is fundamentally incompatible with the page-modification log, because the GPAs in the log come from the wrong memory map. In particular, for the EPT page-modification log, the GPAs in the log come from L2 rather than L1. (If there was a non-EPT page-modification log, we couldn't use it for shadow paging because it would log GVAs rather than GPAs).

  Therefore, we need to rely on write protection to record dirty pages. This has the side effect of bypassing PML, since writes now result in an EPT violation vmexit.

  This is relatively easy to add to KVM, because pretty much the only place that needs changing is spte_clear_dirty. The first access to the page already goes through the page fault path and records the correct GPA; it's only subsequent accesses that are wrong. Therefore, we can equip set_spte (where the first access happens) to record that the SPTE will have to be write protected, and then spte_clear_dirty will use this information to do the right thing.

  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: assign two bits to track SPTE kinds | Paolo Bonzini | 2019-09-27 | 1 | -10/+18

  Currently, we are overloading SPTE_SPECIAL_MASK to mean both "A/D bits unavailable" and MMIO, where the difference between the two is determined by mmio_mask and mmio_value.

  However, the next patch will need two bits to distinguish availability of A/D bits from write protection. So, while at it give MMIO its own bit pattern, and move the two bits from bit 62 to bits 52..53 since Intel is allocating EPT page table bits from the top.

  Reviewed-by: Junaid Shahid <junaids@google.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: Expose XSAVEERPTR to the guest | Sebastian Andrzej Siewior | 2019-09-26 | 1 | -3/+3

  I was surprised to see that the guest reported `fxsave_leak' while the host did not. After digging deeper I noticed that the bits are simply masked out during enumeration.

  The XSAVEERPTR feature is actually a bug fix on AMD which means the kernel can disable a workaround.

  Pass XSAVEERPTR to the guest if available on the host.

  Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86: Enumerate support for CLZERO instruction | Jim Mattson | 2019-09-26 | 1 | -2/+3

  CLZERO is available to the guest if it is supported on the host. Therefore, enumerate support for the instruction in KVM_GET_SUPPORTED_CPUID whenever it is supported on the host.

  Signed-off-by: Jim Mattson <jmattson@google.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86: Use AMD CPUID semantics for AMD vCPUs | Jim Mattson | 2019-09-26 | 1 | -2/+4

  When the guest CPUID information represents an AMD vCPU, return all zeroes for queries of undefined CPUID leaves, whether or not they are in range.

  Signed-off-by: Jim Mattson <jmattson@google.com>
  Fixes: bd22f5cfcfe8f6 ("KVM: move and fix substitue search for missing CPUID entries")
  Reviewed-by: Marc Orr <marcorr@google.com>
  Reviewed-by: Peter Shier <pshier@google.com>
  Reviewed-by: Jacob Xu <jacobhxu@google.com>
  Cc: Sean Christopherson <sean.j.christopherson@intel.com>
  Cc: Paolo Bonzini <pbonzini@redhat.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86: Improve emulation of CPUID leaves 0BH and 1FH | Jim Mattson | 2019-09-26 | 1 | -36/+47

  For these CPUID leaves, the EDX output is not dependent on the ECX input (i.e. the SIGNIFCANT_INDEX flag doesn't apply to EDX). Furthermore, the low byte of the ECX output is always identical to the low byte of the ECX input. KVM does not produce the correct ECX and EDX outputs for any undefined subleaves beyond the first.

  Special-case these CPUID leaves in kvm_cpuid, so that the ECX and EDX outputs are properly generated for all undefined subleaves.

  Fixes: 0771671749b59a ("KVM: Enhance guest cpuid management")
  Fixes: a87f2d3a6eadab ("KVM: x86: Add Intel CPUID.1F cpuid emulation support")
  Signed-off-by: Jim Mattson <jmattson@google.com>
  Reviewed-by: Marc Orr <marcorr@google.com>
  Reviewed-by: Peter Shier <pshier@google.com>
  Reviewed-by: Jacob Xu <jacobhxu@google.com>
  Cc: Sean Christopherson <sean.j.christopherson@intel.com>
  Cc: Paolo Bonzini <pbonzini@redhat.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
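The two properties stated above (EDX independent of the ECX input, low byte of ECX echoed back) suggest the following shape for an undefined subleaf. This is a sketch only: `struct cpuid_regs`, the function name, and the `defined_edx` parameter are hypothetical, and returning zero for EAX, EBX, and the upper ECX bits is an assumption of this sketch rather than something the message states.

```c
#include <assert.h>
#include <stdint.h>

struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/*
 * Output for a subleaf of leaf 0BH/1FH that has no defined entry.
 * defined_edx is the EDX value reported by a defined subleaf of the same
 * leaf; per the message above, EDX does not depend on the ECX input.
 */
static struct cpuid_regs topology_undefined_subleaf(uint32_t index,
                                                    uint32_t defined_edx)
{
    struct cpuid_regs out = {0, 0, 0, 0}; /* assumption: EAX = EBX = 0 */

    out.ecx = index & 0xff; /* low byte of ECX echoes the input */
    out.edx = defined_edx;  /* EDX is the same for every subleaf */
    return out;
}

static void topology_demo(void)
{
    struct cpuid_regs r = topology_undefined_subleaf(0x105, 7);

    assert(r.ecx == 0x05);
    assert(r.edx == 7);
}
```

The point of the special case is that a generic "return the last defined entry" fallback would report the wrong ECX for such subleaves.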
* KVM: X86: Fix userspace set invalid CR4 | Wanpeng Li | 2019-09-26 | 1 | -17/+21

  Reported by syzkaller:

    WARNING: CPU: 0 PID: 6544 at /home/kernel/data/kvm/arch/x86/kvm//vmx/vmx.c:4689 handle_desc+0x37/0x40 [kvm_intel]
    CPU: 0 PID: 6544 Comm: a.out Tainted: G OE 5.3.0-rc4+ #4
    RIP: 0010:handle_desc+0x37/0x40 [kvm_intel]
    Call Trace:
     vmx_handle_exit+0xbe/0x6b0 [kvm_intel]
     vcpu_enter_guest+0x4dc/0x18d0 [kvm]
     kvm_arch_vcpu_ioctl_run+0x407/0x660 [kvm]
     kvm_vcpu_ioctl+0x3ad/0x690 [kvm]
     do_vfs_ioctl+0xa2/0x690
     ksys_ioctl+0x6d/0x80
     __x64_sys_ioctl+0x1a/0x20
     do_syscall_64+0x74/0x720
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

  When CR4.UMIP is set, the guest should have the UMIP cpuid flag. The current kvm set_sregs function doesn't have such a check when userspace inputs sregs values. SECONDARY_EXEC_DESC is enabled on writes to CR4.UMIP in vmx_set_cr4 even though the guest doesn't have the UMIP cpuid flag. The testcase triggers the handle_desc warning when executing the ltr instruction since the guest architectural CR4 doesn't set UMIP. This patch fixes it by adding valid CR4 and CPUID combination checking in __set_sregs.

  syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=138efb99600000

  Reported-by: syzbot+0f1819555fbdce992df9@syzkaller.appspotmail.com
  Cc: stable@vger.kernel.org
  Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
  Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86: Fix a spurious -E2BIG in __do_cpuid_func | Jim Mattson | 2019-09-26 | 1 | -6/+10

  Don't return -E2BIG from __do_cpuid_func when processing function 0BH or 1FH and the last interesting subleaf occupies the last allocated entry in the result array.

  Cc: Paolo Bonzini <pbonzini@redhat.com>
  Fixes: 831bf664e9c1fc ("KVM: Refactor and simplify kvm_dev_ioctl_get_supported_cpuid")
  Signed-off-by: Jim Mattson <jmattson@google.com>
  Reviewed-by: Peter Shier <pshier@google.com>
  Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: LAPIC: Loosen filter for adaptive tuning of lapic_timer_advance_ns | Wanpeng Li | 2019-09-26 | 1 | -6/+7

  A delta of 5000 guest cycles is easy to encounter on desktop, so the per-vCPU lapic_timer_advance_ns always stays at its 1000ns initial value. Loosen the filter a bit to let adaptive tuning make progress.

  Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: nVMX: cleanup and fix host 64-bit mode checks | Paolo Bonzini | 2019-09-25 | 1 | -31/+19

  KVM was incorrectly checking vmcs12->host_ia32_efer even if the "load IA32_EFER" exit control was reset. Also, some checks were not using the new CC macro for tracing.

  Clean up everything so that the vCPU's 64-bit mode is determined directly from EFER_LMA and the VMCS checks are based on that, which matches section 26.2.4 of the SDM.

  Cc: Sean Christopherson <sean.j.christopherson@intel.com>
  Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
  Fixes: 5845038c111db27902bc220a4f70070fe945871c
  Reviewed-by: Jim Mattson <jmattson@google.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: vmx: fix build warnings in hv_enable_direct_tlbflush() on i386 | Vitaly Kuznetsov | 2019-09-25 | 1 | -9/+5

  The following was reported on i386:

    arch/x86/kvm/vmx/vmx.c: In function 'hv_enable_direct_tlbflush':
    arch/x86/kvm/vmx/vmx.c:503:10: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

  pr_debugs() in this function are more or less useless, let's just remove them. evmcs->hv_vm_id can use 'unsigned long' instead of 'u64'.

  Also, simplify the code a little bit.

  Reported-by: kbuild test robot <lkp@intel.com>
  Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: VMX: Add error handling to VMREAD helper | Sean Christopherson | 2019-09-25 | 2 | -4/+25

  Now that VMREAD flows require a taken branch, courtesy of commit 3901336ed9887 ("x86/kvm: Don't call kvm_spurious_fault() from .fixup"), bite the bullet and add full error handling to VMREAD, i.e. replace the JMP added by __ex()/____kvm_handle_fault_on_reboot() with a hinted Jcc.

  To minimize the code footprint, add a helper function, vmread_error(), to handle both faults and failures so that the inline flow has a single CALL.

  Acked-by: Paolo Bonzini <pbonzini@redhat.com>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: VMX: Optimize VMX instruction error and fault handling | Sean Christopherson | 2019-09-25 | 2 | -32/+74

  Rework the VMX instruction helpers using asm-goto to branch directly to error/fault "handlers" in lieu of using __ex(), i.e. the generic ____kvm_handle_fault_on_reboot(). Branching directly to fault handling code during fixup avoids the extra JMP that is inserted after every VMX instruction when using the generic "fault on reboot" (see commit 3901336ed9887, "x86/kvm: Don't call kvm_spurious_fault() from .fixup").

  Opportunistically clean up the helpers so that they all have consistent error handling and messages.

  Leave the usage of ____kvm_handle_fault_on_reboot() (via __ex()) in kvm_cpu_vmxoff() and nested_vmx_check_vmentry_hw() as is. The VMXOFF case is not a fast path, i.e. the cleanliness of __ex() is worth the JMP, and the extra JMP in nested_vmx_check_vmentry_hw() is unavoidable.

  Note, VMREAD cannot get the asm-goto treatment as output operands aren't compatible with GCC's asm-goto due to internal compiler restrictions.

  Acked-by: Paolo Bonzini <pbonzini@redhat.com>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: Check kvm_rebooting in kvm_spurious_fault() | Sean Christopherson | 2019-09-25 | 1 | -1/+2

  Explicitly check kvm_rebooting in kvm_spurious_fault() prior to invoking BUG(), as opposed to assuming the caller has already done so. Letting kvm_spurious_fault() be called "directly" will allow VMX to better optimize its low level assembly flows.

  As a happy side effect, kvm_spurious_fault() no longer needs to be marked as a dead end since it doesn't unconditionally BUG().

  Acked-by: Paolo Bonzini <pbonzini@redhat.com>
  Cc: Josh Poimboeuf <jpoimboe@redhat.com>
  Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: nvmx: limit atomic switch MSRs | Marc Orr | 2019-09-24 | 1 | -11/+32

  Allowing an unlimited number of MSRs to be specified via the VMX load/store MSR lists (e.g., vm-entry MSR load list) is bad for two reasons. First, a guest can specify an unreasonable number of MSRs, forcing KVM to process all of them in software. Second, the SDM bounds the number of MSRs allowed to be packed into the atomic switch MSR lists. Quoting the "Miscellaneous Data" section in the "VMX Capability Reporting Facility" appendix:

  "Bits 27:25 is used to compute the recommended maximum number of MSRs that should appear in the VM-exit MSR-store list, the VM-exit MSR-load list, or the VM-entry MSR-load list. Specifically, if the value bits 27:25 of IA32_VMX_MISC is N, then 512 * (N + 1) is the recommended maximum number of MSRs to be included in each list. If the limit is exceeded, undefined processor behavior may result (including a machine check during the VMX transition)."

  Because KVM needs to protect itself and can't model "undefined processor behavior", arbitrarily force a VM-entry to fail due to MSR loading when the MSR load list is too large. Similarly, trigger an abort during a VM exit that encounters an MSR load list or MSR store list that is too large.

  The MSR list size is intentionally not pre-checked so as to maintain compatibility with hardware inasmuch as possible.

  Test these new checks with the kvm-unit-test "x86: nvmx: test max atomic switch MSRs".

  Suggested-by: Jim Mattson <jmattson@google.com>
  Reviewed-by: Jim Mattson <jmattson@google.com>
  Reviewed-by: Peter Shier <pshier@google.com>
  Signed-off-by: Marc Orr <marcorr@google.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
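The limit quoted from the SDM is a one-line computation. In the sketch below, the function name is hypothetical; the bit positions and the 512 * (N + 1) formula come from the quoted text.

```c
#include <assert.h>
#include <stdint.h>

/* Recommended maximum MSR list length, given the IA32_VMX_MISC value. */
static inline uint64_t max_atomic_switch_msrs(uint64_t vmx_misc)
{
    uint64_t n = (vmx_misc >> 25) & 0x7; /* bits 27:25 */

    return 512 * (n + 1);
}

static void msr_limit_demo(void)
{
    assert(max_atomic_switch_msrs(0) == 512);                   /* N = 0 */
    assert(max_atomic_switch_msrs((uint64_t)7 << 25) == 4096);  /* N = 7 */
}
```

Since N is a 3-bit field, the recommended maximum ranges from 512 to 4096 entries per list.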
* kvm: svm: Intercept RDPRU | Jim Mattson | 2019-09-24 | 1 | -0/+8

  The RDPRU instruction gives the guest read access to the IA32_APERF MSR and the IA32_MPERF MSR. According to volume 3 of the APM, "When virtualization is enabled, this instruction can be intercepted by the Hypervisor. The intercept bit is at VMCB byte offset 10h, bit 14." Since we don't enumerate the instruction in KVM_SUPPORTED_CPUID, intercept it and synthesize #UD.

  Signed-off-by: Jim Mattson <jmattson@google.com>
  Reviewed-by: Drew Schmitt <dasch@google.com>
  Reviewed-by: Jacob Xu <jacobhxu@google.com>
  Reviewed-by: Peter Shier <pshier@google.com>
  Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
  Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86: Add "significant index" flag to a few CPUID leavesJim Mattson2019-09-241-0/+6
| | | | | | | | | | | | | | | | According to the Intel SDM, volume 2, "CPUID," the index is significant (or partially significant) for CPUID leaves 0FH, 10H, 12H, 17H, 18H, and 1FH. Add the corresponding flag to these CPUID leaves in do_host_cpuid(). Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Steve Rutherford <srutherford@google.com> Fixes: a87f2d3a6eadab ("KVM: x86: Add Intel CPUID.1F cpuid emulation support") Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
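As a rough sketch, the set of leaves named in this message can be written as a simple predicate. The function name leaf_named_in_commit() is hypothetical and covers only the leaves listed above; it does not reproduce do_host_cpuid() or enumerate leaves that were flagged before this change.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True for the CPUID leaves this commit message names as having a
 * significant (or partially significant) index.
 * (Hypothetical helper, for illustration only.)
 */
static bool leaf_named_in_commit(uint32_t function)
{
	switch (function) {
	case 0x0f:
	case 0x10:
	case 0x12:
	case 0x17:
	case 0x18:
	case 0x1f:
		return true;
	default:
		return false;
	}
}
```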
* KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zeroSean Christopherson2019-09-241-4/+5
| | | | | | | | | | | Do not skip invalid shadow pages when zapping obsolete pages if the pages' root_count has reached zero, in which case the page can be immediately zapped and freed. Update the comment accordingly. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86/mmu: Explicitly track only a single invalid mmu generationSean Christopherson2019-09-242-10/+20
| | | | | | | | | | | | | | | | | | | | | Toggle mmu_valid_gen between '0' and '1' instead of blindly incrementing the generation. Because slots_lock is held for the entire duration of zapping obsolete pages, it's impossible for there to be multiple invalid generations associated with shadow pages at any given time. Toggling between the two generations (valid vs. invalid) allows changing mmu_valid_gen from an unsigned long to a u8, which reduces the size of struct kvm_mmu_page from 160 to 152 bytes on 64-bit KVM, i.e. reduces KVM's memory footprint by 8 bytes per shadow page. Set sp->mmu_valid_gen before it is added to active_mmu_pages. Functionally this has no effect as kvm_mmu_alloc_page() has a single caller that sets sp->mmu_valid_gen soon thereafter, but visually it is jarring to see a shadow page being added to the list without its mmu_valid_gen first being set. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
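The toggling scheme described above might look roughly like the following. The struct and function names are hypothetical stand-ins, not KVM's definitions; the sketch assumes, as the message states, that all obsolete pages are zapped before the generation can be toggled again.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-VM and per-page state. */
struct sketch_mmu {
	uint8_t mmu_valid_gen;	/* only ever 0 or 1 */
};

struct sketch_page {
	uint8_t mmu_valid_gen;	/* generation recorded at allocation */
};

/* Flip the valid generation; every existing page becomes obsolete. */
static void sketch_invalidate_all(struct sketch_mmu *mmu)
{
	mmu->mmu_valid_gen = mmu->mmu_valid_gen ? 0 : 1;
}

/* A page is obsolete if it was created under the other generation. */
static bool sketch_is_obsolete(const struct sketch_mmu *mmu,
			       const struct sketch_page *sp)
{
	return sp->mmu_valid_gen != mmu->mmu_valid_gen;
}
```

Two toggles without an intervening zap would make an old page look valid again, which is why the scheme depends on slots_lock being held for the whole zap.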
* KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call"Sean Christopherson2019-09-242-6/+24
| | | | | | | | | | | | | | | | | | | Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrasing the original changelog (commit 5ff0568374ed2 was itself a partial revert): Don't force reloading the remote mmu when zapping an obsolete page, as a MMU_RELOAD request has already been issued by kvm_mmu_zap_all_fast() immediately after incrementing mmu_valid_gen, i.e. after marking pages obsolete. This reverts commit 5ff0568374ed2e585376a3832857ade5daccd381. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap all ↵Sean Christopherson2019-09-241-3/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | pages"" Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrasing the original changelog: Reload the mmu on all vCPUs after updating the generation number so that obsolete pages are not used by any vCPUs. This allows collapsing all TLB flushes during obsolete page zapping into a single flush, as there is no need to flush when dropping mmu_lock (to reschedule). Note: a remote TLB flush is still needed before freeing the pages as other vCPUs may be doing a lockless shadow page walk. Opportunistically improve the comments restored by the revert (the code itself is a true revert). This reverts commit f34d251d66ba263c077ed9d2bbd1874339a4c887. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""Sean Christopherson2019-09-241-26/+9
| | | | | | | | | | | | | | | | | | | | | Now that the fast invalidate mechanism has been reintroduced, restore the performance tweaks for fast invalidation that existed prior to its removal. Paraphrasing the original changelog: Zap at least 10 shadow pages before releasing mmu_lock to reduce the overhead associated with re-acquiring the lock. Note: "10" is an arbitrary number, speculated to be high enough so that a vCPU isn't stuck zapping obsolete pages for an extended period, but small enough so that other vCPUs aren't starved waiting for mmu_lock. This reverts commit 43d2b14b105fb00b8864c7b0ee7043cc1cc4a969. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
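The batching idea can be illustrated with a small model that counts how often a lock would be dropped. This is a simplified sketch, not the kernel loop: SKETCH_BATCH_MIN and sketch_lock_drops() are hypothetical names, and the model assumes the worst case where the lock is contended at every check.

```c
#include <stddef.h>

#define SKETCH_BATCH_MIN 10	/* hypothetical name for the "10" above */

/*
 * Count how many times the lock would be dropped while zapping
 * 'nr_pages' pages, assuming the lock is contended at every check.
 * Yielding is only considered once a full batch has been zapped.
 */
static size_t sketch_lock_drops(size_t nr_pages)
{
	size_t batch = 0, drops = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		if (batch >= SKETCH_BATCH_MIN) {
			drops++;	/* drop and re-acquire the lock */
			batch = 0;
		}
		batch++;		/* zap page i */
	}
	return drops;
}
```

Under this model, zapping 25 pages drops the lock twice, whereas checking after every page could drop it up to 24 times.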
* KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for ↵Sean Christopherson2019-09-242-0/+22
| | | | | | | | | | | | | | | kvm_mmu_invalidate_all_pages"" Now that the fast invalidate mechanism has been reintroduced, restore the tracepoint associated with said mechanism. Note, the name of the tracepoint deviates from the original tracepoint so as to match KVM's current nomenclature. This reverts commit 42560fb1f3c6c7f730897b7fa7a478bc37e0be50. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow page ↵Sean Christopherson2019-09-241-9/+12
| | | | | | | | | | | | related tracepoints"" Now that the fast invalidate mechanism has been reintroduced, restore tracing of the generation number in shadow page tracepoints. This reverts commit b59c4830ca185ba0e9f9e046fb1cd10a4a92627a. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86/mmu: Use fast invalidate mechanism to zap MMIO sptesSean Christopherson2019-09-241-14/+3
| | | | | | | | | | | | | | | | | | | | | | Use the fast invalidate mechanism to zap MMIO sptes on an MMIO generation wrap. The fast invalidate flow was reintroduced to fix a livelock bug in kvm_mmu_zap_all() that can occur if kvm_mmu_zap_all() is invoked when the guest has live vCPUs. I.e. using kvm_mmu_zap_all() to handle the MMIO generation wrap is theoretically susceptible to the livelock bug. This effectively reverts commit 4771450c345dc ("Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes""), i.e. restores the behavior of commit a8eca9dcc656a ("KVM: MMU: drop kvm_mmu_zap_mmio_sptes"). Note, this actually fixes commit 571c5af06e303 ("KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes"), but there is no need to incrementally revert back to using fast invalidate, e.g. doing so doesn't provide any bisection or stability benefits. Fixes: 571c5af06e303 ("KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86/mmu: Treat invalid shadow pages as obsoleteSean Christopherson2019-09-241-2/+3
| | | | | | | | | | | | | | | | | | | | | | | Treat invalid shadow pages as obsolete to fix a bug where an obsolete and invalid page with a non-zero root count could become non-obsolete due to mmu_valid_gen wrapping. The bug is largely theoretical with the current code base, as an unsigned long will effectively never wrap on 64-bit KVM, and userspace would have to deliberately stall a vCPU in order to keep an obsolete invalid page on the active list while simultaneously modifying memslots billions of times to trigger a wrap. The obvious alternative is to use a 64-bit value for mmu_valid_gen, but it's actually desirable to go in the opposite direction, i.e. using a smaller 8-bit value to reduce KVM's memory footprint by 8 bytes per shadow page, and relying on proper treatment of invalid pages instead of preventing the generation from wrapping. Note, "Fixes" points at a commit that was at one point reverted, but has since been restored. Fixes: 5304b8d37c2a5 ("KVM: MMU: fast invalidate all pages") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
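The wrap hazard described above can be demonstrated with a small model that uses an 8-bit incrementing generation. The struct and function names are hypothetical and this is not KVM's code; it only contrasts a generation-only check with one that also treats invalid pages as obsolete.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a shadow page's relevant state. */
struct demo_page {
	uint8_t gen;		/* generation recorded at creation */
	bool	invalid;	/* zapped, but still referenced as a root */
};

/* Generation-only check: fooled if the counter wraps back around. */
static bool demo_obsolete_gen_only(const struct demo_page *p, uint8_t cur_gen)
{
	return p->gen != cur_gen;
}

/* Check in the spirit of this commit: invalid implies obsolete. */
static bool demo_obsolete(const struct demo_page *p, uint8_t cur_gen)
{
	return p->invalid || p->gen != cur_gen;
}
```

After 256 increments an 8-bit generation returns to its starting value, so the generation-only check would report an old invalid page as current, while the second check still reports it as obsolete.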
* KVM: LAPIC: Tune lapic_timer_advance_ns smoothlyWanpeng Li2019-09-242-15/+14
| | | | | | | | | Filter out drastic and random fluctuations, and remove timer_advance_adjust_done altogether so that the adjustment is continuous. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexitTao Xu2019-09-242-16/+16
| | | | | | | | | | | | | | | | | | According to the latest Intel 64 and IA-32 Architectures Software Developer's Manual, the UMWAIT and TPAUSE instructions cause a VM exit if the "RDTSC exiting" and "enable user wait and pause" VM-execution controls are both 1. Because KVM never enables RDTSC exiting, a VM exit for UMWAIT or TPAUSE should never happen. EXIT_REASON_XSAVES and EXIT_REASON_XRSTORS are likewise unexpected VM exits for KVM. Introduce a common exit helper, handle_unexpected_vmexit(), to handle these unexpected VM exits. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROLTao Xu2019-09-243-0/+47
| | | | | | | | | | | | | | | | The UMWAIT and TPAUSE instructions use the 32-bit IA32_UMWAIT_CONTROL MSR at index E1H to determine the maximum time, in TSC quanta, that the processor can reside in either C0.1 or C0.2. This patch emulates the IA32_UMWAIT_CONTROL MSR for the guest and keeps the host and guest values of IA32_UMWAIT_CONTROL distinct. The variable mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value, so this patch uses it to avoid frequent RDMSRs of IA32_UMWAIT_CONTROL. Co-developed-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Jingqi Liu <jingqi.liu@intel.com> Signed-off-by: Tao Xu <tao3.xu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>