| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
| |
Re-execution of an instruction after emulation decode failure is
intended to be used only when emulating shadow page accesses. Invert
the flag to make allowing re-execution opt-in since that behavior is
by far in the minority.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull second set of KVM updates from Paolo Bonzini:
"ARM:
- Support for Group0 interrupts in guests
- Cache management optimizations for ARMv8.4 systems
- Userspace interface for RAS
- Fault path optimization
- Emulated physical timer fixes
- Random cleanups
x86:
- fixes for L1TF
- a new test case
- non-support for SGX (inject the right exception in the guest)
- fix lockdep false positive"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits)
KVM: VMX: fixes for vmentry_l1d_flush module parameter
kvm: selftest: add dirty logging test
kvm: selftest: pass in extra memory when create vm
kvm: selftest: include the tools headers
kvm: selftest: unify the guest port macros
tools: introduce test_and_clear_bit
KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled
KVM: vmx: Inject #UD for SGX ENCLS instruction in guest
KVM: vmx: Add defines for SGX ENCLS exiting
x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush()
x86: kvm: avoid unused variable warning
KVM: Documentation: rename the capability of KVM_CAP_ARM_SET_SERROR_ESR
KVM: arm/arm64: Skip updating PTE entry if no change
KVM: arm/arm64: Skip updating PMD entry if no change
KVM: arm: Use true and false for boolean values
KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled
KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h
KVM: arm: vgic-v3: Add support for ICC_SGI0R and ICC_ASGI1R accesses
KVM: arm64: vgic-v3: Add support for ICC_SGI0R_EL1 and ICC_ASGI1R_EL1 accesses
KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Removing one of the two accesses of the maxphyaddr variable led to
a harmless warning:
arch/x86/kvm/x86.c: In function 'kvm_set_mmio_spte_mask':
arch/x86/kvm/x86.c:6563:6: error: unused variable 'maxphyaddr' [-Werror=unused-variable]
Removing the #ifdef seems to be the nicest workaround, as it
makes the code look cleaner than adding another #ifdef.
Fixes: 28a1f3ac1d0c ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: stable@vger.kernel.org # L1TF
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.
Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.
We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held. Moreover
majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range. Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.
This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false. This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.
I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that. The first
part can be done without a sleeping lock in most cases AFAICS.
The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.
The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom. This can be done e.g. after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small. Then we are looking for a proper process tear down.
[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull first set of KVM updates from Paolo Bonzini:
"PPC:
- minor code cleanups
x86:
- PCID emulation and CR3 caching for shadow page tables
- nested VMX live migration
- nested VMCS shadowing
- optimized IPI hypercall
- some optimizations
ARM will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
KVM: X86: Implement PV IPIs in linux guest
KVM: X86: Add kvm hypervisor init time platform setup callback
KVM: X86: Implement "send IPI" hypercall
KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
KVM: x86: Skip pae_root shadow allocation if tdp enabled
KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
KVM: vmx: move struct host_state usage to struct loaded_vmcs
KVM: vmx: compute need to reload FS/GS/LDT on demand
KVM: nVMX: remove a misleading comment regarding vmcs02 fields
KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
KVM: vmx: add dedicated utility to access guest's kernel_gs_base
KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
KVM: vmx: refactor segmentation code in vmx_save_host_state()
kvm: nVMX: Fix fault priority for VMX operations
kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Always set the 5 upper-most supported physical address bits to 1 for SPTEs
that are marked as non-present or reserved, to make them unusable for
L1TF attacks from the guest. Currently, this just applies to MMIO SPTEs.
(We do not need to mark PTEs that are completely 0 as physical page 0
is already reserved.)
This allows mitigation of L1TF without disabling hyper-threading by using
shadow paging mode instead of EPT.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Using hypercall to send IPIs by one vmexit instead of one by one for
xAPIC/x2APIC physical mode and one vmexit per-cluster for x2APIC cluster
mode. Intel guest can enter x2apic cluster mode when interrupt remmaping
is enabled in qemu, however, latest AMD EPYC still just supports xapic
mode which can get great improvement by Exit-less IPIs. This patchset
lets a guest send multicast IPIs, with at most 128 destinations per
hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode.
Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM
is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141):
x2apic cluster mode, vanilla
Dry-run: 0, 2392199 ns
Self-IPI: 6907514, 15027589 ns
Normal IPI: 223910476, 251301666 ns
Broadcast IPI: 0, 9282161150 ns
Broadcast lock: 0, 8812934104 ns
x2apic cluster mode, pv-ipi
Dry-run: 0, 2449341 ns
Self-IPI: 6720360, 15028732 ns
Normal IPI: 228643307, 255708477 ns
Broadcast IPI: 0, 7572293590 ns => 22% performance boost
Broadcast lock: 0, 8316124651 ns
x2apic physical mode, vanilla
Dry-run: 0, 3135933 ns
Self-IPI: 8572670, 17901757 ns
Normal IPI: 226444334, 255421709 ns
Broadcast IPI: 0, 19845070887 ns
Broadcast lock: 0, 19827383656 ns
x2apic physical mode, pv-ipi
Dry-run: 0, 2446381 ns
Self-IPI: 6788217, 15021056 ns
Normal IPI: 219454441, 249583458 ns
Broadcast IPI: 0, 7806540019 ns => 154% performance boost
Broadcast lock: 0, 9143618799 ns
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
X86_CR4_OSXSAVE check belongs to sregs check and so move into
kvm_valid_sregs().
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| | |
It is a duplicate of X86_CR3_PCID_NOFLUSH. So just use that instead.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When the guest indicates that the TLB doesn't need to be flushed in a
CR3 switch, we can also skip resyncing the shadow page tables since an
out-of-sync shadow page table is equivalent to an out-of-sync TLB.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3
instruction indicates that the TLB doesn't need to be flushed.
This change enables this optimization for MOV-to-CR3s in the guest
that have been intercepted by KVM for shadow paging and are handled
within the fast CR3 switch path.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the
current root_hpa.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When using shadow paging, a CR3 switch in the guest results in a VM Exit.
In the common case, that VM exit doesn't require much processing by KVM.
However, it does acquire the MMU lock, which can start showing signs of
contention under some workloads even on a 2 VCPU VM when the guest is
using KPTI. Therefore, we add a fast path that avoids acquiring the MMU
lock in the most common cases e.g. when switching back and forth between
the kernel and user mode CR3s used by KPTI with no guest page table
changes in between.
For now, this fast path is implemented only for 64-bit guests and hosts
to avoid the handling of PDPTEs, but it can be extended later to 32-bit
guests and/or hosts as well.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
For nested virtualization L0 KVM is managing a bit of state for L2 guests,
this state can not be captured through the currently available IOCTLs. In
fact the state captured through all of these IOCTLs is usually a mix of L1
and L2 state. It is also dependent on whether the L2 guest was running at
the moment when the process was interrupted to save its state.
With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
that is in VMX operation.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
[karahmed@ - rename structs and functions and make them ready for AMD and
address previous comments.
- handle nested.smm state.
- rebase & a bit of refactoring.
- Merge 7/8 and 8/8 into one patch. ]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If the vCPU enters system management mode while running a nested guest,
RSM starts processing the vmentry while still in SMM. In that case,
however, the pages pointed to by the vmcs12 might be incorrectly
loaded from SMRAM. To avoid this, delay the handling of the pages
until just before the next vmentry. This is done with a new request
and a new entry in kvm_x86_ops, which we will be able to reuse for
nested VMX state migration.
Extracted from a patch by Jim Mattson and KarimAllah Ahmed.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Some of the MSRs returned by GET_MSR_INDEX_LIST currently cannot be sent back
to KVM_GET_MSR and/or KVM_SET_MSR; either they can never be sent back, or you
they are only accepted under special conditions. This makes the API a pain to
use.
To avoid this pain, this patch makes it so that the result of the get-list
ioctl can always be used for host-initiated get and set. Since we don't have
a separate way to check for read-only MSRs, this means some Hyper-V MSRs are
ignored when written. Arguably they should not even be in the result of
GET_MSR_INDEX_LIST, but I am leaving there in case userspace is using the
outcome of GET_MSR_INDEX_LIST to derive the support for the corresponding
Hyper-V feature.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When nested virtualization is in use, VMENTER operations from the nested
hypervisor into the nested guest will always be processed by the bare metal
hypervisor, and KVM's "conditional cache flushes" mode in particular does a
flush on nested vmentry. Therefore, include the "skip L1D flush on
vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
Add the relevant Documentation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|\|
| |
| |
| | |
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull kvm fixes from Paolo Bonzini:
"Miscellaneous bugfixes, plus a small patchlet related to Spectre v2"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvmclock: fix TSC calibration for nested guests
KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled
KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer
KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.
x86/kvmclock: set pvti_cpu0_va after enabling kvmclock
x86/kvm/Kconfig: Ensure CRYPTO_DEV_CCP_DD state at minimum matches KVM_AMD
kvm: nVMX: Restore exit qual for VM-entry failure due to MSR loading
x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks
KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This lets userspace read the MSR_IA32_ARCH_CAPABILITIES and check that all
requested features are available on the host.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.
The flags is set:
- Always, if the flush module parameter is 'always'
- Conditionally at:
- Entry to vcpu_run(), i.e. after executing user space
- From the sched_in notifier, i.e. when switching to a vCPU thread.
- From vmexit handlers which are considered unsafe, i.e. where
sensitive data can be brought into L1D:
- The emulator, which could be a good target for other speculative
execution-based threats,
- The MMU, which can bring host page tables in the L1 cache.
- External interrupts
- Nested operations that require the MMU (see above). That is
vmptrld, vmptrst, vmclear,vmwrite,vmread.
- When handling invept,invvpid
[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
| |
| |
| |
| |
| |
| |
| | |
Fix typo in sentence about min value calculation.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull more overflow updates from Kees Cook:
"The rest of the overflow changes for v4.18-rc1.
This includes the explicit overflow fixes from Silvio, further
struct_size() conversions from Matthew, and a bug fix from Dan.
But the bulk of it is the treewide conversions to use either the
2-factor argument allocators (e.g. kmalloc(a * b, ...) into
kmalloc_array(a, b, ...) or the array_size() macros (e.g. vmalloc(a *
b) into vmalloc(array_size(a, b)).
Coccinelle was fighting me on several fronts, so I've done a bunch of
manual whitespace updates in the patches as well.
Summary:
- Error path bug fix for overflow tests (Dan)
- Additional struct_size() conversions (Matthew, Kees)
- Explicitly reported overflow fixes (Silvio, Kees)
- Add missing kvcalloc() function (Kees)
- Treewide conversions of allocators to use either 2-factor argument
variant when available, or array_size() and array3_size() as needed
(Kees)"
* tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
treewide: Use array_size in f2fs_kvzalloc()
treewide: Use array_size() in f2fs_kzalloc()
treewide: Use array_size() in f2fs_kmalloc()
treewide: Use array_size() in sock_kmalloc()
treewide: Use array_size() in kvzalloc_node()
treewide: Use array_size() in vzalloc_node()
treewide: Use array_size() in vzalloc()
treewide: Use array_size() in vmalloc()
treewide: devm_kzalloc() -> devm_kcalloc()
treewide: devm_kmalloc() -> devm_kmalloc_array()
treewide: kvzalloc() -> kvcalloc()
treewide: kvmalloc() -> kvmalloc_array()
treewide: kzalloc_node() -> kcalloc_node()
treewide: kzalloc() -> kcalloc()
treewide: kmalloc() -> kmalloc_array()
mm: Introduce kvcalloc()
video: uvesafb: Fix integer overflow in allocation
UBIFS: Fix potential integer overflow in allocation
leds: Use struct_size() in allocation
Convert intel uncore to struct_size
...
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
patch replaces cases of:
kvzalloc(a * b, gfp)
with:
kvcalloc(a * b, gfp)
as well as handling cases of:
kvzalloc(a * b * c, gfp)
with:
kvzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kvcalloc(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kvzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kvzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kvzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kvzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kvzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kvzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kvzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kvzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kvzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kvzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kvzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kvzalloc
+ kvcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kvzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kvzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kvzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kvzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kvzalloc(C1 * C2 * C3, ...)
|
kvzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kvzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kvzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kvzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kvzalloc(sizeof(THING) * C2, ...)
|
kvzalloc(sizeof(TYPE) * C2, ...)
|
kvzalloc(C1 * C2 * C3, ...)
|
kvzalloc(C1 * C2, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kvzalloc
+ kvcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kvzalloc
+ kvcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|\ \
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull KVM updates from Paolo Bonzini:
"Small update for KVM:
ARM:
- lazy context-switching of FPSIMD registers on arm64
- "split" regions for vGIC redistributor
s390:
- cleanups for nested
- clock handling
- crypto
- storage keys
- control register bits
x86:
- many bugfixes
- implement more Hyper-V super powers
- implement lapic_timer_advance_ns even when the LAPIC timer is
emulated using the processor's VMX preemption timer.
- two security-related bugfixes at the top of the branch"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (79 commits)
kvm: fix typo in flag name
kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access
KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system
KVM: x86: introduce linear_{read,write}_system
kvm: nVMX: Enforce cpl=0 for VMX instructions
kvm: nVMX: Add support for "VMWRITE to any supported field"
kvm: nVMX: Restrict VMX capability MSR changes
KVM: VMX: Optimize tscdeadline timer latency
KVM: docs: nVMX: Remove known limitations as they do not exist now
KVM: docs: mmu: KVM support exposing SLAT to guests
kvm: no need to check return value of debugfs_create functions
kvm: Make VM ioctl do valloc for some archs
kvm: Change return type to vm_fault_t
KVM: docs: mmu: Fix link to NPT presentation from KVM Forum 2008
kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation
KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation
KVM: introduce kvm_make_vcpus_request_mask() API
KVM: x86: hyperv: do rep check for each hypercall separately
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
KVM_X86_DISABLE_EXITS_HTL really refers to exit on halt.
Obviously a typo: should be named KVM_X86_DISABLE_EXITS_HLT.
Fixes: caa057a2cad ("KVM: X86: Provide a capability to disable HLT intercepts")
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The functions that were used in the emulation of fxrstor, fxsave, sgdt and
sidt were originally meant for task switching, and as such they did not
check privilege levels. This is very bad when the same functions are used
in the emulation of unprivileged instructions. This is CVE-2018-10853.
The obvious fix is to add a new argument to ops->read_std and ops->write_std,
which decides whether the access is a "system" access or should use the
processor's CPL.
Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Int the next patch the emulator's .read_std and .write_std callbacks will
grow another argument, which is not needed in kvm_read_guest_virt and
kvm_write_guest_virt_system's callers. Since we have to make separate
functions, let's give the currently existing names a nicer interface, too.
Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
'Commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline
hrtimer expiration")' advances the tscdeadline (the timer is emulated
by hrtimer) expiration in order that the latency which is incurred
by hypervisor (apic_timer_fn -> vmentry) can be avoided. This patch
adds the advance tscdeadline expiration support to which the tscdeadline
timer is emulated by VMX preemption timer to reduce the hypervisor
lantency (handle_preemption_timer -> vmentry). The guest can also
set an expiration that is very small (for example in Linux if an
hrtimer feeds a expiration in the past); in that case we set delta_tsc
to 0, leading to an immediately vmexit when delta_tsc is not bigger than
advance ns.
This patch can reduce ~63% latency (~4450 cycles to ~1660 cycles on
a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
busy waits.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Use new return type vm_fault_t for fault handler. For
now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.
commit 1c8f422059ae ("mm: change return type to vm_fault_t")
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We need a new capability to indicate support for the newly added
HvFlushVirtualAddress{List,Space}{,Ex} hypercalls. Upon seeing this
capability, userspace is supposed to announce PV TLB flush features
by setting the appropriate CPUID bits (if needed).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
| |\
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
To resolve conflicts with the PV TLB flush series.
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The hypercall was added using a struct timespec based implementation,
but we should not use timespec in new code.
This changes it to timespec64. There is no functional change
here since the implementation is only used in 64-bit kernels
that use the same definition for timespec and timespec64.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The local APIC can be in one of three modes: disabled, xAPIC or
x2APIC. (A fourth mode, "invalid," is included for completeness.)
Using the new enumeration can make some of the APIC mode logic easier
to read. In kvm_set_apic_base, for instance, it is clear that one
cannot transition directly from x2APIC mode to xAPIC mode or directly
from APIC disabled to x2APIC mode.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
[Check invalid bits even if msr_info->host_initiated. Reported by
Wanpeng Li. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
MSB of CR3 is a reserved bit if the PCIDE bit is not set in CR4.
It should be checked when PCIDE bit is not set, however commit
'd1cd3ce900441 ("KVM: MMU: check guest CR3 reserved bits based on
its physical address width")' removes the bit 63 checking
unconditionally. This patch fixes it by checking bit 63 of CR3
when PCIDE bit is not set in CR4.
Fixes: d1cd3ce900441 (KVM: MMU: check guest CR3 reserved bits based on its physical address width)
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull KVM fixes from Radim Krčmář:
"PPC:
- Close a hole which could possibly lead to the host timebase getting
out of sync.
- Three fixes relating to PTEs and TLB entries for radix guests.
- Fix a bug which could lead to an interrupt never getting delivered
to the guest, if it is pending for a guest vCPU when the vCPU gets
offlined.
s390:
- Fix false negatives in VSIE validity check (Cc stable)
x86:
- Fix time drift of VMX preemption timer when a guest uses LAPIC
timer in periodic mode (Cc stable)
- Unconditionally expose CPUID.IA32_ARCH_CAPABILITIES to allow
migration from hosts that don't need retpoline mitigation (Cc
stable)
- Fix guest crashes on reboot by properly coupling CR4.OSXSAVE and
CPUID.OSXSAVE (Cc stable)
- Report correct RIP after Hyper-V hypercall #UD (introduced in
-rc6)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: fix #UD address of failed Hyper-V hypercalls
kvm: x86: IA32_ARCH_CAPABILITIES is always supported
KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed
x86/kvm: fix LAPIC timer drift when guest uses periodic mode
KVM: s390: vsie: fix < 8k check for the itdba
KVM: PPC: Book 3S HV: Do ptesync in radix guest exit path
KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority change
KVM: PPC: Book3S HV: Make radix clear pte when unmapping
KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in kvmppc_radix_tlbie_page
KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If the hypercall was called from userspace or real mode, KVM injects #UD
and then advances RIP, so it looks like #UD was caused by the following
instruction. This probably won't cause more than confusion, but could
give an unexpected access to guest OS' instruction emulator.
Also, refactor the code to count hv hypercalls that were handled by the
virt userspace.
Fixes: 6356ee0c9602 ("x86: Delay skip of emulated hypercall instruction")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
| | |/
| |/|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The CPUID bits of OSXSAVE (function=0x1) and OSPKE (func=0x7, leaf=0x0)
allows user apps to detect if OS has set CR4.OSXSAVE or CR4.PKE. KVM is
supposed to update these CPUID bits when CR4 is updated. Current KVM
code doesn't handle some special cases when updates come from emulator.
Here is one example:
Step 1: guest boots
Step 2: guest OS enables XSAVE ==> CR4.OSXSAVE=1 and CPUID.OSXSAVE=1
Step 3: guest hot reboot ==> QEMU reset CR4 to 0, but CPUID.OSXAVE==1
Step 4: guest os checks CPUID.OSXAVE, detects 1, then executes xgetbv
Step 4 above will cause an #UD and guest crash because guest OS hasn't
turned on OSXAVE yet. This patch solves the problem by comparing the the
old_cr4 with cr4. If the related bits have been changed,
kvm_update_cpuid() needs to be called.
Signed-off-by: Wei Huang <wei@redhat.com>
Reviewed-by: Bandan Das <bsd@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|\ \ \
| |/ /
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Merge speculative store buffer bypass fixes from Thomas Gleixner:
- rework of the SPEC_CTRL MSR management to accomodate the new fancy
SSBD (Speculative Store Bypass Disable) bit handling.
- the CPU bug and sysfs infrastructure for the exciting new Speculative
Store Bypass 'feature'.
- support for disabling SSB via LS_CFG MSR on AMD CPUs including
Hyperthread synchronization on ZEN.
- PRCTL support for dynamic runtime control of SSB
- SECCOMP integration to automatically disable SSB for sandboxed
processes with a filter flag for opt-out.
- KVM integration to allow guests fiddling with SSBD including the new
software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on
AMD.
- BPF protection against SSB
.. this is just the core and x86 side, other architecture support will
come separately.
* 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
bpf: Prevent memory disambiguation attack
x86/bugs: Rename SSBD_NO to SSB_NO
KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
x86/bugs: Rework spec_ctrl base and mask logic
x86/bugs: Remove x86_spec_ctrl_set()
x86/bugs: Expose x86_spec_ctrl_base directly
x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
x86/speculation: Rework speculative_store_bypass_update()
x86/speculation: Add virtualized speculative store bypass disable support
x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
x86/speculation: Handle HT correctly on AMD
x86/cpufeatures: Add FEATURE_ZEN
x86/cpufeatures: Disentangle SSBD enumeration
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
KVM: SVM: Move spec control call after restore of GS
x86/cpu: Make alternative_msr_write work for 32-bit code
x86/bugs: Fix the parameters alignment and missing void
x86/bugs: Make cpu_show_common() static
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Expose the new virtualized architectural mechanism, VIRT_SSBD, for using
speculative store bypass disable (SSBD) under SVM. This will allow guests
to use SSBD on hardware that uses non-architectural mechanisms for enabling
SSBD.
[ tglx: Folded the migration fixup from Paolo Bonzini ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Anthoine reported:
The period used by Windows change over time but it can be 1
milliseconds or less. I saw the limit_periodic_timer_frequency
print so 500 microseconds is sometimes reached.
As suggested by Paolo, lower the default timer frequency limit to a
smaller interval of 200 us (5000 Hz) to leave some headroom. This
is required due to Windows 10 changing the scheduler tick limit
from 1024 Hz to 2048 Hz.
Reported-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com>
Cc: Darren Kenny <darren.kenny@oracle.com>
Cc: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If the PCIDE bit is not set in CR4, then the MSb of CR3 is a reserved
bit. If the guest tries to set it, that should cause a #GP fault. So
mask out the bit only when the PCIDE bit is set.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
| |
The IP increment should be done after the hypercall emulation, after
calling the various handlers. In this way, these handlers can accurately
identify the the IP of the VMCALL if they need it.
This patch keeps the same functionality for the Hyper-V handler which does
not use the return code of the standard kvm_skip_emulated_instruction()
call.
Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com>
[Hyper-V hypercalls also need kvm_skip_emulated_instruction() - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
|
|
|
| |
This is not specific to Intel/AMD anymore. The TSC offset is available
in vcpu->arch.tsc_offset.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Update 'tsc_offset' on vmentry/vmexit of L2 guests to ensure that it always
captures the TSC_OFFSET of the running guest whether it is the L1 or L2
guest.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
[AMD changes, fix update_ia32_tsc_adjust_msr. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the processor does not have an "Always Running APIC Timer" (aka ARAT),
we should not give guests direct access to MWAIT. The LAPIC timer would
stop ticking in deep C-states, so any host deadlines would not wakeup the
host kernel.
The host kernel intel_idle driver handles this by switching to broadcast
mode when ARAT is not available and MWAIT is issued with a deep C-state
that would stop the LAPIC timer. When MWAIT is passed through, we can not
tell when MWAIT is issued.
So just disable this capability when LAPIC ARAT is not available. I am not
even sure if there are any CPUs with VMX support but no LAPIC ARAT or not.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Reported-by: Wanpeng Li <kernellwp@gmail.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
|
|
|
|
| |
fix a "warning: no previous prototype".
Cc: stable@vger.kernel.org
Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There is no easy way to force KVM to run an instruction through the emulator
(by design as that will expose the x86 emulator as a significant attack-surface).
However, we do wish to expose the x86 emulator in case we are testing it
(e.g. via kvm-unit-tests). Therefore, this patch adds a "force emulation prefix"
that is designed to raise #UD which KVM will trap and it's #UD exit-handler will
match "force emulation prefix" to run instruction after prefix by the x86 emulator.
To not expose the x86 emulator by default, we add a module parameter that should
be off by default.
A simple testcase here:
#include <stdio.h>
#include <string.h>
#define HYPERVISOR_INFO 0x40000000
#define CPUID(idx, eax, ebx, ecx, edx) \
asm volatile (\
"ud2a; .ascii \"kvm\"; cpuid" \
:"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \
:"0"(idx) );
void main()
{
unsigned int eax, ebx, ecx, edx;
char string[13];
CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx);
*(unsigned int *)(string + 0) = ebx;
*(unsigned int *)(string + 4) = ecx;
*(unsigned int *)(string + 8) = edx;
string[12] = 0;
if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0)
printf("kvm guest\n");
else
printf("bare hardware\n");
}
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Correctly handle usermode exits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Introduce handle_ud() to handle invalid opcode, this function will be
used by later patches.
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim KrÄmář <rkrcmar@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
pending
In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with
IDT-vectoring-info which vmx_complete_interrupts() makes sure to
reinject before next resume of L2.
While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
to the L1 vCPU which currently runs L2 and exited to L0.
When L0 will reach vcpu_enter_guest() and call inject_pending_event(),
it will note that a previous event was re-injected to L2 (by
IDT-vectoring-info) and therefore won't check if there are pending L1
events which require exit from L2 to L1. Thus, L0 enters L2 without
immediate VMExit even though there are pending L1 events!
This commit fixes the issue by making sure to check for L1 pending
events even if a previous event was reinjected to L2 and bailing out
from inject_pending_event() before evaluating a new pending event in
case an event was already reinjected.
The bug was observed by the following setup:
* L0 is a 64CPU machine which runs KVM.
* L1 is a 16CPU machine which runs KVM.
* L0 & L1 runs with APICv disabled.
(Also reproduced with APICv enabled but easier to analyze below info
with APICv disabled)
* L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
During L2 boot, L1 hangs completely and analyzing the hang reveals that
one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
that he has sent for another L1 vCPU. And all other L1 vCPUs are
currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
forever (as L1 runs with kernel-preemption disabled).
Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
series of events:
(1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
(2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
vec 252 (Fixed|edge)
(3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
(4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
(5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
rip 0xffffe00069202690 info 83 0
(6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
(7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
ext_int: 0x00000000 ext_int_err: 0x00000000
(8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
Which can be analyzed as follows:
(1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
a pending-interrupt of 0xd2.
(2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
(3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which
notes that interrupt 0xd2 was reinjected and therefore calls
vmx_inject_irq() and returns. Without checking for pending L1 events!
Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
before calling inject_pending_event().
(4) L0 resumes L2 without immediate-exit even though there is a pending
L1 event (The IPI pending in LAPIC's IRR).
We have already reached the buggy scenario but events could be
furthered analyzed:
(5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during
event-delivery.
(7) L0 decides to forward the VMExit to L1 for further handling.
(8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
LAPIC's IRR is not examined and therefore the IPI is still not delivered
into L1!
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|