summaryrefslogtreecommitdiffstats
path: root/arch/x86/kvm/x86.c
Commit message (Collapse)AuthorAgeFilesLines
...
* KVM: x86: Invert emulation re-execute behavior to make it opt-inSean Christopherson2018-08-301-1/+1
| | | | | | | | | | | Re-execution of an instruction after emulation decode failure is intended to be used only when emulating shadow page accesses. Invert the flag to make allowing re-execution opt-in since that behavior is by far in the minority. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-08-221-3/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull second set of KVM updates from Paolo Bonzini: "ARM: - Support for Group0 interrupts in guests - Cache management optimizations for ARMv8.4 systems - Userspace interface for RAS - Fault path optimization - Emulated physical timer fixes - Random cleanups x86: - fixes for L1TF - a new test case - non-support for SGX (inject the right exception in the guest) - fix lockdep false positive" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits) KVM: VMX: fixes for vmentry_l1d_flush module parameter kvm: selftest: add dirty logging test kvm: selftest: pass in extra memory when create vm kvm: selftest: include the tools headers kvm: selftest: unify the guest port macros tools: introduce test_and_clear_bit KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled KVM: vmx: Inject #UD for SGX ENCLS instruction in guest KVM: vmx: Add defines for SGX ENCLS exiting x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush() x86: kvm: avoid unused variable warning KVM: Documentation: rename the capability of KVM_CAP_ARM_SET_SERROR_ESR KVM: arm/arm64: Skip updating PTE entry if no change KVM: arm/arm64: Skip updating PMD entry if no change KVM: arm: Use true and false for boolean values KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h KVM: arm: vgic-v3: Add support for ICC_SGI0R and ICC_ASGI1R accesses KVM: arm64: vgic-v3: Add support for ICC_SGI0R_EL1 and ICC_ASGI1R_EL1 accesses KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs ...
| * x86: kvm: avoid unused variable warningArnd Bergmann2018-08-221-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Removing one of the two accesses of the maxphyaddr variable led to a harmless warning: arch/x86/kvm/x86.c: In function 'kvm_set_mmio_spte_mask': arch/x86/kvm/x86.c:6563:6: error: unused variable 'maxphyaddr' [-Werror=unused-variable] Removing the #ifdef seems to be the nicest workaround, as it makes the code look cleaner than adding another #ifdef. Fixes: 28a1f3ac1d0c ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: stable@vger.kernel.org # L1TF Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | mm, oom: distinguish blockable mode for mmu notifiersMichal Hocko2018-08-221-2/+5
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-08-191-17/+91
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull first set of KVM updates from Paolo Bonzini: "PPC: - minor code cleanups x86: - PCID emulation and CR3 caching for shadow page tables - nested VMX live migration - nested VMCS shadowing - optimized IPI hypercall - some optimizations ARM will come next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits) kvm: x86: Set highest physical address bits in non-present/reserved SPTEs KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c KVM: X86: Implement PV IPIs in linux guest KVM: X86: Add kvm hypervisor init time platform setup callback KVM: X86: Implement "send IPI" hypercall KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs() KVM: x86: Skip pae_root shadow allocation if tdp enabled KVM/MMU: Combine flushing remote tlb in mmu_set_spte() KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup KVM: vmx: move struct host_state usage to struct loaded_vmcs KVM: vmx: compute need to reload FS/GS/LDT on demand KVM: nVMX: remove a misleading comment regarding vmcs02 fields KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state() KVM: vmx: add dedicated utility to access guest's kernel_gs_base KVM: vmx: track host_state.loaded using a loaded_vmcs pointer KVM: vmx: refactor segmentation code in vmx_save_host_state() kvm: nVMX: Fix fault priority for VMX operations kvm: nVMX: Fix fault vector for VMX operation at CPL > 0 ...
| * kvm: x86: Set highest physical address bits in non-present/reserved SPTEsJunaid Shahid2018-08-141-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Always set the 5 upper-most supported physical address bits to 1 for SPTEs that are marked as non-present or reserved, to make them unusable for L1TF attacks from the guest. Currently, this just applies to MMIO SPTEs. (We do not need to mark PTEs that are completely 0 as physical page 0 is already reserved.) This allows mitigation of L1TF without disabling hyper-threading by using shadow paging mode instead of EPT. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: X86: Implement "send IPI" hypercallWanpeng Li2018-08-061-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using hypercall to send IPIs by one vmexit instead of one by one for xAPIC/x2APIC physical mode and one vmexit per-cluster for x2APIC cluster mode. Intel guest can enter x2apic cluster mode when interrupt remmaping is enabled in qemu, however, latest AMD EPYC still just supports xapic mode which can get great improvement by Exit-less IPIs. This patchset lets a guest send multicast IPIs, with at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode. Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141): x2apic cluster mode, vanilla Dry-run: 0, 2392199 ns Self-IPI: 6907514, 15027589 ns Normal IPI: 223910476, 251301666 ns Broadcast IPI: 0, 9282161150 ns Broadcast lock: 0, 8812934104 ns x2apic cluster mode, pv-ipi Dry-run: 0, 2449341 ns Self-IPI: 6720360, 15028732 ns Normal IPI: 228643307, 255708477 ns Broadcast IPI: 0, 7572293590 ns => 22% performance boost Broadcast lock: 0, 8316124651 ns x2apic physical mode, vanilla Dry-run: 0, 3135933 ns Self-IPI: 8572670, 17901757 ns Normal IPI: 226444334, 255421709 ns Broadcast IPI: 0, 19845070887 ns Broadcast lock: 0, 19827383656 ns x2apic physical mode, pv-ipi Dry-run: 0, 2446381 ns Self-IPI: 6788217, 15021056 ns Normal IPI: 219454441, 249583458 ns Broadcast IPI: 0, 7806540019 ns => 154% performance boost Broadcast lock: 0, 9143618799 ns Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()Tianyu Lan2018-08-061-4/+4
| | | | | | | | | | | | | | | | X86_CR4_OSXSAVE check belongs to sregs check and so move into kvm_valid_sregs(). Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Remove CR3_PCID_INVD flagJunaid Shahid2018-08-061-2/+2
| | | | | | | | | | | | | | It is a duplicate of X86_CR3_PCID_NOFLUSH. So just use that instead. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Skip shadow page resync on CR3 switch when indicated by guestJunaid Shahid2018-08-061-3/+3
| | | | | | | | | | | | | | | | | | When the guest indicates that the TLB doesn't need to be flushed in a CR3 switch, we can also skip resyncing the shadow page tables since an out-of-sync shadow page table is equivalent to an out-of-sync TLB. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Skip TLB flush on fast CR3 switch when indicated by guestJunaid Shahid2018-08-061-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3 instruction indicates that the TLB doesn't need to be flushed. This change enables this optimization for MOV-to-CR3s in the guest that have been intercepted by KVM for shadow paging and are handled within the fast CR3 switch path. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Introduce KVM_REQ_LOAD_CR3Junaid Shahid2018-08-061-0/+2
| | | | | | | | | | | | | | | | The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the current root_hpa. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Add fast CR3 switch code pathJunaid Shahid2018-08-061-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using shadow paging, a CR3 switch in the guest results in a VM Exit. In the common case, that VM exit doesn't require much processing by KVM. However, it does acquire the MMU lock, which can start showing signs of contention under some workloads even on a 2 VCPU VM when the guest is using KPTI. Therefore, we add a fast path that avoids acquiring the MMU lock in the most common cases e.g. when switching back and forth between the kernel and user mode CR3s used by KPTI with no guest page table changes in between. For now, this fast path is implemented only for 64-bit guests and hosts to avoid the handling of PDPTEs, but it can be extended later to 32-bit guests and/or hosts as well. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: nVMX: Introduce KVM_CAP_NESTED_STATEJim Mattson2018-08-061-0/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For nested virtualization L0 KVM is managing a bit of state for L2 guests, this state can not be captured through the currently available IOCTLs. In fact the state captured through all of these IOCTLs is usually a mix of L1 and L2 state. It is also dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM that is in VMX operation. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jim Mattson <jmattson@google.com> [karahmed@ - rename structs and functions and make them ready for AMD and address previous comments. - handle nested.smm state. - rebase & a bit of refactoring. - Merge 7/8 and 8/8 into one patch. ] Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: do not load vmcs12 pages while still in SMMPaolo Bonzini2018-08-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If the vCPU enters system management mode while running a nested guest, RSM starts processing the vmentry while still in SMM. In that case, however, the pages pointed to by the vmcs12 might be incorrectly loaded from SMRAM. To avoid this, delay the handling of the pages until just before the next vmentry. This is done with a new request and a new entry in kvm_x86_ops, which we will be able to reuse for nested VMX state migration. Extracted from a patch by Jim Mattson and KarimAllah Ahmed. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: ensure all MSRs can always be KVM_GET/SET_MSR'dPaolo Bonzini2018-08-061-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some of the MSRs returned by GET_MSR_INDEX_LIST currently cannot be sent back to KVM_GET_MSR and/or KVM_SET_MSR; either they can never be sent back, or you they are only accepted under special conditions. This makes the API a pain to use. To avoid this pain, this patch makes it so that the result of the get-list ioctl can always be used for host-initiated get and set. Since we don't have a separate way to check for read-only MSRs, this means some Hyper-V MSRs are ignored when written. Arguably they should not even be in the result of GET_MSR_INDEX_LIST, but I am leaving there in case userspace is using the outcome of GET_MSR_INDEX_LIST to derive the support for the corresponding Hyper-V feature. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentryPaolo Bonzini2018-08-051-1/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | When nested virtualization is in use, VMENTER operations from the nested hypervisor into the nested guest will always be processed by the bare metal hypervisor, and KVM's "conditional cache flushes" mode in particular does a flush on nested vmentry. Therefore, include the "skip L1D flush on vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting. Add the relevant Documentation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | Merge 4.18-rc7 into master to pick up the KVM dependcyThomas Gleixner2018-08-051-1/+3
|\| | | | | | | Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-07-181-1/+3
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: "Miscellaneous bugfixes, plus a small patchlet related to Spectre v2" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvmclock: fix TSC calibration for nested guests KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel. x86/kvmclock: set pvti_cpu0_va after enabling kvmclock x86/kvm/Kconfig: Ensure CRYPTO_DEV_CCP_DD state at minimum matches KVM_AMD kvm: nVMX: Restore exit qual for VM-entry failure due to MSR loading x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR
| | * KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSRPaolo Bonzini2018-07-151-1/+3
| | | | | | | | | | | | | | | | | | | | | This lets userspace read the MSR_IA32_ARCH_CAPABILITIES and check that all requested features are available on the host. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | x86/KVM/VMX: Add L1D flush logicPaolo Bonzini2018-07-041-0/+8
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the logic for flushing L1D on VMENTER. The flush depends on the static key being enabled and the new l1tf_flush_l1d flag being set. The flags is set: - Always, if the flush module parameter is 'always' - Conditionally at: - Entry to vcpu_run(), i.e. after executing user space - From the sched_in notifier, i.e. when switching to a vCPU thread. - From vmexit handlers which are considered unsafe, i.e. where sensitive data can be brought into L1D: - The emulator, which could be a good target for other speculative execution-based threats, - The MMU, which can bring host page tables in the L1 cache. - External interrupts - Nested operations that require the MMU (see above). That is vmptrld, vmptrst, vmclear,vmwrite,vmread. - When handling invept,invvpid [ tglx: Split out from combo patch and reduced to a single flag ] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | KVM: x86: fix typo at kvm_arch_hardware_setup commentMarcelo Tosatti2018-06-141-1/+1
| | | | | | | | | | | | | | Fix typo in sentence about min value calculation. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | Merge tag 'overflow-v4.18-rc1-part2' of ↵Linus Torvalds2018-06-121-2/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull more overflow updates from Kees Cook: "The rest of the overflow changes for v4.18-rc1. This includes the explicit overflow fixes from Silvio, further struct_size() conversions from Matthew, and a bug fix from Dan. But the bulk of it is the treewide conversions to use either the 2-factor argument allocators (e.g. kmalloc(a * b, ...) into kmalloc_array(a, b, ...) or the array_size() macros (e.g. vmalloc(a * b) into vmalloc(array_size(a, b)). Coccinelle was fighting me on several fronts, so I've done a bunch of manual whitespace updates in the patches as well. Summary: - Error path bug fix for overflow tests (Dan) - Additional struct_size() conversions (Matthew, Kees) - Explicitly reported overflow fixes (Silvio, Kees) - Add missing kvcalloc() function (Kees) - Treewide conversions of allocators to use either 2-factor argument variant when available, or array_size() and array3_size() as needed (Kees)" * tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits) treewide: Use array_size in f2fs_kvzalloc() treewide: Use array_size() in f2fs_kzalloc() treewide: Use array_size() in f2fs_kmalloc() treewide: Use array_size() in sock_kmalloc() treewide: Use array_size() in kvzalloc_node() treewide: Use array_size() in vzalloc_node() treewide: Use array_size() in vzalloc() treewide: Use array_size() in vmalloc() treewide: devm_kzalloc() -> devm_kcalloc() treewide: devm_kmalloc() -> devm_kmalloc_array() treewide: kvzalloc() -> kvcalloc() treewide: kvmalloc() -> kvmalloc_array() treewide: kzalloc_node() -> kcalloc_node() treewide: kzalloc() -> kcalloc() treewide: kmalloc() -> kmalloc_array() mm: Introduce kvcalloc() video: uvesafb: Fix integer overflow in allocation UBIFS: Fix potential integer overflow in allocation leds: Use struct_size() in allocation Convert intel uncore to struct_size ...
| * | treewide: kvzalloc() -> kvcalloc()Kees Cook2018-06-121-2/+3
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kvzalloc() function has a 2-factor argument form, kvcalloc(). This patch replaces cases of: kvzalloc(a * b, gfp) with: kvcalloc(a * b, gfp) as well as handling cases of: kvzalloc(a * b * c, gfp) with: kvzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kvcalloc(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kvzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kvzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kvzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kvzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kvzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kvzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kvzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kvzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kvzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kvzalloc( - sizeof(char) * COUNT + COUNT , ...) | kvzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kvzalloc + kvcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kvzalloc + kvcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kvzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kvzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kvzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kvzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kvzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kvzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kvzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kvzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kvzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kvzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kvzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kvzalloc(C1 * C2 * C3, ...) | kvzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kvzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kvzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kvzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kvzalloc(sizeof(THING) * C2, ...) | kvzalloc(sizeof(TYPE) * C2, ...) | kvzalloc(C1 * C2 * C3, ...) | kvzalloc(C1 * C2, ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - (E1) * E2 + E1, E2 , ...) | - kvzalloc + kvcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kvzalloc + kvcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-06-121-34/+63
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "Small update for KVM: ARM: - lazy context-switching of FPSIMD registers on arm64 - "split" regions for vGIC redistributor s390: - cleanups for nested - clock handling - crypto - storage keys - control register bits x86: - many bugfixes - implement more Hyper-V super powers - implement lapic_timer_advance_ns even when the LAPIC timer is emulated using the processor's VMX preemption timer. - two security-related bugfixes at the top of the branch" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (79 commits) kvm: fix typo in flag name kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system KVM: x86: introduce linear_{read,write}_system kvm: nVMX: Enforce cpl=0 for VMX instructions kvm: nVMX: Add support for "VMWRITE to any supported field" kvm: nVMX: Restrict VMX capability MSR changes KVM: VMX: Optimize tscdeadline timer latency KVM: docs: nVMX: Remove known limitations as they do not exist now KVM: docs: mmu: KVM support exposing SLAT to guests kvm: no need to check return value of debugfs_create functions kvm: Make VM ioctl do valloc for some archs kvm: Change return type to vm_fault_t KVM: docs: mmu: Fix link to NPT presentation from KVM Forum 2008 kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation KVM: introduce kvm_make_vcpus_request_mask() API KVM: x86: hyperv: do rep check for each hypercall separately ...
| * kvm: fix typo in flag nameMichael S. Tsirkin2018-06-121-2/+2
| | | | | | | | | | | | | | | | | | | | KVM_X86_DISABLE_EXITS_HTL really refers to exit on halt. Obviously a typo: should be named KVM_X86_DISABLE_EXITS_HLT. Fixes: caa057a2cad ("KVM: X86: Provide a capability to disable HLT intercepts") Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor accessPaolo Bonzini2018-06-121-6/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The functions that were used in the emulation of fxrstor, fxsave, sgdt and sidt were originally meant for task switching, and as such they did not check privilege levels. This is very bad when the same functions are used in the emulation of unprivileged instructions. This is CVE-2018-10853. The obvious fix is to add a new argument to ops->read_std and ops->write_std, which decides whether the access is a "system" access or should use the processor's CPL. Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_systemPaolo Bonzini2018-06-121-13/+26
| | | | | | | | | | | | | | | | | | | | | | Int the next patch the emulator's .read_std and .write_std callbacks will grow another argument, which is not needed in kvm_read_guest_virt and kvm_write_guest_virt_system's callers. Since we have to make separate functions, let's give the currently existing names a nicer interface, too. Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12) Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: VMX: Optimize tscdeadline timer latencyWanpeng Li2018-06-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'Commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer expiration")' advances the tscdeadline (the timer is emulated by hrtimer) expiration in order that the latency which is incurred by hypervisor (apic_timer_fn -> vmentry) can be avoided. This patch adds the advance tscdeadline expiration support to which the tscdeadline timer is emulated by VMX preemption timer to reduce the hypervisor lantency (handle_preemption_timer -> vmentry). The guest can also set an expiration that is very small (for example in Linux if an hrtimer feeds a expiration in the past); in that case we set delta_tsc to 0, leading to an immediately vmexit when delta_tsc is not bigger than advance ns. This patch can reduce ~63% latency (~4450 cycles to ~1660 cycles on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: Change return type to vm_fault_tSouptick Joarder2018-06-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f422059ae ("mm: change return type to vm_fault_t") Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capabilityVitaly Kuznetsov2018-05-261-0/+1
| | | | | | | | | | | | | | | | | | | | We need a new capability to indicate support for the newly added HvFlushVirtualAddress{List,Space}{,Ex} hypercalls. Upon seeing this capability, userspace is supposed to announce PV TLB flush features by setting the appropriate CPUID bits (if needed). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * Merge branch 'x86/hyperv' of ↵Radim Krčmář2018-05-261-1/+1
| |\ | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip To resolve conflicts with the PV TLB flush series.
| * | KVM: x86: use timespec64 for KVM_HC_CLOCK_PAIRINGArnd Bergmann2018-05-231-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The hypercall was added using a struct timespec based implementation, but we should not use timespec in new code. This changes it to timespec64. There is no functional change here since the implementation is only used in 64-bit kernels that use the same definition for timespec and timespec64. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | kvm: vmx: Introduce lapic_mode enumerationJim Mattson2018-05-141-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The local APIC can be in one of three modes: disabled, xAPIC or x2APIC. (A fourth mode, "invalid," is included for completeness.) Using the new enumeration can make some of the APIC mode logic easier to read. In kvm_set_apic_base, for instance, it is clear that one cannot transition directly from x2APIC mode to xAPIC mode or directly from APIC disabled to x2APIC mode. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> [Check invalid bits even if msr_info->host_initiated. Reported by Wanpeng Li. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: X86: Fix reserved bits check for MOV to CR3Wanpeng Li2018-05-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MSB of CR3 is a reserved bit if the PCIDE bit is not set in CR4. It should be checked when PCIDE bit is not set, however commit 'd1cd3ce900441 ("KVM: MMU: check guest CR3 reserved bits based on its physical address width")' removes the bit 63 checking unconditionally. This patch fixes it by checking bit 63 of CR3 when PCIDE bit is not set in CR4. Fixes: d1cd3ce900441 (KVM: MMU: check guest CR3 reserved bits based on its physical address width) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Reviewed-by: Junaid Shahid <junaids@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-05-261-9/+8
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Radim Krčmář: "PPC: - Close a hole which could possibly lead to the host timebase getting out of sync. - Three fixes relating to PTEs and TLB entries for radix guests. - Fix a bug which could lead to an interrupt never getting delivered to the guest, if it is pending for a guest vCPU when the vCPU gets offlined. s390: - Fix false negatives in VSIE validity check (Cc stable) x86: - Fix time drift of VMX preemption timer when a guest uses LAPIC timer in periodic mode (Cc stable) - Unconditionally expose CPUID.IA32_ARCH_CAPABILITIES to allow migration from hosts that don't need retpoline mitigation (Cc stable) - Fix guest crashes on reboot by properly coupling CR4.OSXSAVE and CPUID.OSXSAVE (Cc stable) - Report correct RIP after Hyper-V hypercall #UD (introduced in -rc6)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: fix #UD address of failed Hyper-V hypercalls kvm: x86: IA32_ARCH_CAPABILITIES is always supported KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed x86/kvm: fix LAPIC timer drift when guest uses periodic mode KVM: s390: vsie: fix < 8k check for the itdba KVM: PPC: Book 3S HV: Do ptesync in radix guest exit path KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority change KVM: PPC: Book3S HV: Make radix clear pte when unmapping KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in kvmppc_radix_tlbie_page KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry
| * | | KVM: x86: fix #UD address of failed Hyper-V hypercallsRadim Krčmář2018-05-251-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the hypercall was called from userspace or real mode, KVM injects #UD and then advances RIP, so it looks like #UD was caused by the following instruction. This probably won't cause more than confusion, but could give an unexpected access to guest OS' instruction emulator. Also, refactor the code to count hv hypercalls that were handled by the virt userspace. Fixes: 6356ee0c9602 ("x86: Delay skip of emulated hypercall instruction") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changedWei Huang2018-05-241-1/+4
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The CPUID bits of OSXSAVE (function=0x1) and OSPKE (func=0x7, leaf=0x0) allows user apps to detect if OS has set CR4.OSXSAVE or CR4.PKE. KVM is supposed to update these CPUID bits when CR4 is updated. Current KVM code doesn't handle some special cases when updates come from emulator. Here is one example: Step 1: guest boots Step 2: guest OS enables XSAVE ==> CR4.OSXSAVE=1 and CPUID.OSXSAVE=1 Step 3: guest hot reboot ==> QEMU reset CR4 to 0, but CPUID.OSXAVE==1 Step 4: guest os checks CPUID.OSXAVE, detects 1, then executes xgetbv Step 4 above will cause an #UD and guest crash because guest OS hasn't turned on OSXAVE yet. This patch solves the problem by comparing the the old_cr4 with cr4. If the related bits have been changed, kvm_update_cpuid() needs to be called. Signed-off-by: Wei Huang <wei@redhat.com> Reviewed-by: Bandan Das <bsd@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* | | Merge branch 'speck-v20' of ↵Linus Torvalds2018-05-211-9/+4
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Merge speculative store buffer bypass fixes from Thomas Gleixner: - rework of the SPEC_CTRL MSR management to accomodate the new fancy SSBD (Speculative Store Bypass Disable) bit handling. - the CPU bug and sysfs infrastructure for the exciting new Speculative Store Bypass 'feature'. - support for disabling SSB via LS_CFG MSR on AMD CPUs including Hyperthread synchronization on ZEN. - PRCTL support for dynamic runtime control of SSB - SECCOMP integration to automatically disable SSB for sandboxed processes with a filter flag for opt-out. - KVM integration to allow guests fiddling with SSBD including the new software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on AMD. - BPF protection against SSB .. this is just the core and x86 side, other architecture support will come separately. * 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits) bpf: Prevent memory disambiguation attack x86/bugs: Rename SSBD_NO to SSB_NO KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG x86/bugs: Rework spec_ctrl base and mask logic x86/bugs: Remove x86_spec_ctrl_set() x86/bugs: Expose x86_spec_ctrl_base directly x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host} x86/speculation: Rework speculative_store_bypass_update() x86/speculation: Add virtualized speculative store bypass disable support x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL x86/speculation: Handle HT correctly on AMD x86/cpufeatures: Add FEATURE_ZEN x86/cpufeatures: Disentangle SSBD enumeration x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP KVM: SVM: Move spec control call after restore of GS x86/cpu: Make alternative_msr_write work for 32-bit code x86/bugs: Fix the parameters alignment and missing void x86/bugs: Make cpu_show_common() static ...
| * | KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBDTom Lendacky2018-05-171-9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Expose the new virtualized architectural mechanism, VIRT_SSBD, for using speculative store bypass disable (SSBD) under SVM. This will allow guests to use SSBD on hardware that uses non-architectural mechanisms for enabling SSBD. [ tglx: Folded the migration fixup from Paolo Bonzini ] Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | KVM: X86: Lower the default timer frequency limit to 200usWanpeng Li2018-05-151-1/+1
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Anthoine reported: The period used by Windows change over time but it can be 1 milliseconds or less. I saw the limit_periodic_timer_frequency print so 500 microseconds is sometimes reached. As suggested by Paolo, lower the default timer frequency limit to a smaller interval of 200 us (5000 Hz) to leave some headroom. This is required due to Windows 10 changing the scheduler tick limit from 1024 Hz to 2048 Hz. Reported-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com> Cc: Darren Kenny <darren.kenny@oracle.com> Cc: Jan Kiszka <jan.kiszka@web.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabledJunaid Shahid2018-05-111-1/+4
| | | | | | | | | | | | | | | | | | If the PCIDE bit is not set in CR4, then the MSb of CR3 is a reserved bit. If the guest tries to set it, that should cause a #GP fault. So mask out the bit only when the PCIDE bit is set. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | x86: Delay skip of emulated hypercall instructionMarian Rotariu2018-05-111-8/+11
|/ | | | | | | | | | | | | | The IP increment should be done after the hypercall emulation, after calling the various handlers. In this way, these handlers can accurately identify the the IP of the VMCALL if they need it. This patch keeps the same functionality for the Hyper-V handler which does not use the return code of the standard kvm_skip_emulated_instruction() call. Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com> [Hyper-V hypercalls also need kvm_skip_emulated_instruction() - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86: move MSR_IA32_TSC handling to x86.cPaolo Bonzini2018-04-161-0/+6
| | | | | | | This is not specific to Intel/AMD anymore. The TSC offset is available in vcpu->arch.tsc_offset. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* X86/KVM: Properly update 'tsc_offset' to represent the running guestKarimAllah Ahmed2018-04-161-2/+4
| | | | | | | | | | | | | | | | Update 'tsc_offset' on vmentry/vmexit of L2 guests to ensure that it always captures the TSC_OFFSET of the running guest whether it is the L1 or L2 guest. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> [AMD changes, fix update_ia32_tsc_adjust_msr. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* X86/KVM: Do not allow DISABLE_EXITS_MWAIT when LAPIC ARAT is not availableKarimAllah Ahmed2018-04-111-1/+2
| | | | | | | | | | | | | | | | | | | | | If the processor does not have an "Always Running APIC Timer" (aka ARAT), we should not give guests direct access to MWAIT. The LAPIC timer would stop ticking in deep C-states, so any host deadlines would not wakeup the host kernel. The host kernel intel_idle driver handles this by switching to broadcast mode when ARAT is not available and MWAIT is issued with a deep C-state that would stop the LAPIC timer. When MWAIT is passed through, we can not tell when MWAIT is issued. So just disable this capability when LAPIC ARAT is not available. I am not even sure if there are any CPUs with VMX support but no LAPIC ARAT or not. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Reported-by: Wanpeng Li <kernellwp@gmail.com> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: x86: fix a compile warningPeng Hao2018-04-041-1/+1
| | | | | | | | fix a "warning: no previous prototype". Cc: stable@vger.kernel.org Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"Wanpeng Li2018-04-041-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no easy way to force KVM to run an instruction through the emulator (by design as that will expose the x86 emulator as a significant attack-surface). However, we do wish to expose the x86 emulator in case we are testing it (e.g. via kvm-unit-tests). Therefore, this patch adds a "force emulation prefix" that is designed to raise #UD which KVM will trap and it's #UD exit-handler will match "force emulation prefix" to run instruction after prefix by the x86 emulator. To not expose the x86 emulator by default, we add a module parameter that should be off by default. A simple testcase here: #include <stdio.h> #include <string.h> #define HYPERVISOR_INFO 0x40000000 #define CPUID(idx, eax, ebx, ecx, edx) \ asm volatile (\ "ud2a; .ascii \"kvm\"; cpuid" \ :"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \ :"0"(idx) ); void main() { unsigned int eax, ebx, ecx, edx; char string[13]; CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx); *(unsigned int *)(string + 0) = ebx; *(unsigned int *)(string + 4) = ecx; *(unsigned int *)(string + 8) = edx; string[12] = 0; if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0) printf("kvm guest\n"); else printf("bare hardware\n"); } Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Correctly handle usermode exits. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: X86: Introduce handle_ud()Wanpeng Li2018-04-041-0/+13
| | | | | | | | | | | | | | | Introduce handle_ud() to handle invalid opcode, this function will be used by later patches. Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim KrÄmář <rkrcmar@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event ↵Liran Alon2018-03-281-13/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pending In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with IDT-vectoring-info which vmx_complete_interrupts() makes sure to reinject before next resume of L2. While handling the VMExit in L0, an IPI could be sent by another L1 vCPU to the L1 vCPU which currently runs L2 and exited to L0. When L0 will reach vcpu_enter_guest() and call inject_pending_event(), it will note that a previous event was re-injected to L2 (by IDT-vectoring-info) and therefore won't check if there are pending L1 events which require exit from L2 to L1. Thus, L0 enters L2 without immediate VMExit even though there are pending L1 events! This commit fixes the issue by making sure to check for L1 pending events even if a previous event was reinjected to L2 and bailing out from inject_pending_event() before evaluating a new pending event in case an event was already reinjected. The bug was observed by the following setup: * L0 is a 64CPU machine which runs KVM. * L1 is a 16CPU machine which runs KVM. * L0 & L1 runs with APICv disabled. (Also reproduced with APICv enabled but easier to analyze below info with APICv disabled) * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. During L2 boot, L1 hangs completely and analyzing the hang reveals that one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI that he has sent for another L1 vCPU. And all other L1 vCPUs are currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck forever (as L1 runs with kernel-preemption disabled). Observing /sys/kernel/debug/tracing/trace_pipe reveals the following series of events: (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f vec 252 (Fixed|edge) (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION rip 0xffffe00069202690 info 83 0 (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 Which can be analyzed as follows: (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject a pending-interrupt of 0xd2. (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which notes that interrupt 0xd2 was reinjected and therefore calls vmx_inject_irq() and returns. Without checking for pending L1 events! Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() before calling inject_pending_event(). (4) L0 resumes L2 without immediate-exit even though there is a pending L1 event (The IPI pending in LAPIC's IRR). We have already reached the buggy scenario but events could be furthered analyzed: (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during event-delivery. (7) L0 decides to forward the VMExit to L1 for further handling. (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the LAPIC's IRR is not examined and therefore the IPI is still not delivered into L1! Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>