summaryrefslogtreecommitdiffstats
path: root/virt
Commit message (Collapse)AuthorAgeFilesLines
...
* kvm: Convert kvm_lock to a mutexJunaid Shahid2019-11-121-15/+15
| | | | | | | | | | | | | commit 0d9ce162cf46c99628cc5da9510b959c7976735b upstream. It doesn't seem as if there is any particular need for kvm_lock to be a spinlock, so convert the lock to a mutex so that sleepable functions (in particular cond_resched()) can be called while holding it. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: x86, powerpc: do not allow clearing largepages debugfs entryPaolo Bonzini2019-11-121-3/+7
| | | | | | | | | | | | | | | commit 833b45de69a6016c4b0cebe6765d526a31a81580 upstream. The largepages debugfs entry is incremented/decremented as shadow pages are created or destroyed. Clearing it will result in an underflow, which is harmless to KVM but ugly (and could be misinterpreted by tools that use debugfs information), so make this particular statistic read-only. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: kvm-ppc@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: coalesced_mmio: add bounds checkingMatt Delco2019-09-211-7/+10
| | | | | | | | | | | | | | | | | | | | | commit b60fe990c6b07ef6d4df67bc0530c7c90a62623a upstream. The first/last indexes are typically shared with a user app. The app can change the 'last' index that the kernel uses to store the next result. This change sanity checks the index before using it for writing to a potentially arbitrary address. This fixes CVE-2019-14821. Cc: stable@vger.kernel.org Fixes: 5f94c1741bdc ("KVM: Add coalesced MMIO support (common part)") Signed-off-by: Matt Delco <delco@chromium.org> Signed-off-by: Jim Mattson <jmattson@google.com> Reported-by: syzbot+983c866c3dd6efa3662a@syzkaller.appspotmail.com [Use READ_ONCE. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: Check irqchip mode before assign irqfdPeter Xu2019-09-161-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 654f1f13ea56b92bacade8ce2725aea0457f91c0 ] When assigning kvm irqfd we didn't check the irqchip mode but we allow KVM_IRQFD to succeed with all the irqchip modes. However it does not make much sense to create irqfd even without the kernel chips. Let's provide a arch-dependent helper to check whether a specific irqfd is allowed by the arch. At least for x86, it should make sense to check: - when irqchip mode is NONE, all irqfds should be disallowed, and, - when irqchip mode is SPLIT, irqfds that are with resamplefd should be disallowed. For either of the case, previously we'll silently ignore the irq or the irq ack event if the irqchip mode is incorrect. However that can cause misterious guest behaviors and it can be hard to triage. Let's fail KVM_IRQFD even earlier to detect these incorrect configurations. CC: Paolo Bonzini <pbonzini@redhat.com> CC: Radim Krčmář <rkrcmar@redhat.com> CC: Alex Williamson <alex.williamson@redhat.com> CC: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: VGIC: Properly initialise private IRQ affinityAndre Przywara2019-09-101-10/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2e16f3e926ed48373c98edea85c6ad0ef69425d1 ] At the moment we initialise the target *mask* of a virtual IRQ to the VCPU it belongs to, even though this mask is only defined for GICv2 and quickly runs out of bits for many GICv3 guests. This behaviour triggers an UBSAN complaint for more than 32 VCPUs: ------ [ 5659.462377] UBSAN: Undefined behaviour in virt/kvm/arm/vgic/vgic-init.c:223:21 [ 5659.471689] shift exponent 32 is too large for 32-bit type 'unsigned int' ------ Also for GICv3 guests the reporting of TARGET in the "vgic-state" debugfs dump is wrong, due to this very same problem. Because there is no requirement to create the VGIC device before the VCPUs (and QEMU actually does it the other way round), we can't safely initialise mpidr or targets in kvm_vgic_vcpu_init(). But since we touch every private IRQ for each VCPU anyway later (in vgic_init()), we can just move the initialisation of those fields into there, where we definitely know the VGIC type. On the way make sure we really have either a VGICv2 or a VGICv3 device, since the existing code is just checking for "VGICv3 or not", silently ignoring the uninitialised case. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reported-by: Dave Martin <dave.martin@arm.com> Tested-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: Only skip MMIO insn onceAndrew Jones2019-09-101-0/+7
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2113c5f62b7423e4a72b890bd479704aa85c81ba ] If after an MMIO exit to userspace a VCPU is immediately run with an immediate_exit request, such as when a signal is delivered or an MMIO emulation completion is needed, then the VCPU completes the MMIO emulation and immediately returns to userspace. As the exit_reason does not get changed from KVM_EXIT_MMIO in these cases we have to be careful not to complete the MMIO emulation again, when the VCPU is eventually run again, because the emulation does an instruction skip (and doing too many skips would be a waste of guest code :-) We need to use additional VCPU state to track if the emulation is complete. As luck would have it, we already have 'mmio_needed', which even appears to be used in this way by other architectures already. Fixes: 0d640732dbeb ("arm64: KVM: Skip MMIO insn after emulation") Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: vgic-v2: Handle SGI bits in GICD_I{S,C}PENDR0 as WIMarc Zyngier2019-09-063-2/+26
| | | | | | | | | | | | | | | | | [ Upstream commit 82e40f558de566fdee214bec68096bbd5e64a6a4 ] A guest is not allowed to inject a SGI (or clear its pending state) by writing to GICD_ISPENDR0 (resp. GICD_ICPENDR0), as these bits are defined as WI (as per ARM IHI 0048B 4.3.7 and 4.3.8). Make sure we correctly emulate the architecture. Fixes: 96b298000db4 ("KVM: arm/arm64: vgic-new: Add PENDING registers handlers") Cc: stable@vger.kernel.org # 4.7+ Reported-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: vgic: Fix potential deadlock when ap_list is longHeyi Guo2019-09-061-0/+7
| | | | | | | | | | | | | | | | | | | [ Upstream commit d4a8061a7c5f7c27a2dc002ee4cb89b3e6637e44 ] If the ap_list is longer than 256 entries, merge_final() in list_sort() will call the comparison callback with the same element twice, causing a deadlock in vgic_irq_cmp(). Fix it by returning early when irqa == irqb. Cc: stable@vger.kernel.org # 4.7+ Fixes: 8e4447457965 ("KVM: arm/arm64: vgic-new: Add IRQ sorting") Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Heyi Guo <guoheyi@huawei.com> [maz: massaged commit log and patch, added Fixes and Cc-stable] Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to blockMarc Zyngier2019-08-255-2/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5eeaf10eec394b28fad2c58f1f5c3a5da0e87d1c upstream. Since commit commit 328e56647944 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put"), we leave ICH_VMCR_EL2 (or its GICv2 equivalent) loaded as long as we can, only syncing it back when we're scheduled out. There is a small snag with that though: kvm_vgic_vcpu_pending_irq(), which is indirectly called from kvm_vcpu_check_block(), needs to evaluate the guest's view of ICC_PMR_EL1. At the point were we call kvm_vcpu_check_block(), the vcpu is still loaded, and whatever changes to PMR is not visible in memory until we do a vcpu_put(). Things go really south if the guest does the following: mov x0, #0 // or any small value masking interrupts msr ICC_PMR_EL1, x0 [vcpu preempted, then rescheduled, VMCR sampled] mov x0, #ff // allow all interrupts msr ICC_PMR_EL1, x0 wfi // traps to EL2, so samping of VMCR [interrupt arrives just after WFI] Here, the hypervisor's view of PMR is zero, while the guest has enabled its interrupts. kvm_vgic_vcpu_pending_irq() will then say that no interrupts are pending (despite an interrupt being received) and we'll block for no reason. If the guest doesn't have a periodic interrupt firing once it has blocked, it will stay there forever. To avoid this unfortuante situation, let's resync VMCR from kvm_arch_vcpu_blocking(), ensuring that a following kvm_vcpu_check_block() will observe the latest value of PMR. This has been found by booting an arm64 Linux guest with the pseudo NMI feature, and thus using interrupt priorities to mask interrupts instead of the usual PSTATE masking. Cc: stable@vger.kernel.org # 4.12 Fixes: 328e56647944 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put") Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: Fix leak vCPU's VMCS value into other pCPUWanpeng Li2019-08-161-1/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 17e433b54393a6269acbcb792da97791fe1592d8 upstream. After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting in the VMs after stress testing: INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073) Call Trace: flush_tlb_mm_range+0x68/0x140 tlb_flush_mmu.part.75+0x37/0xe0 tlb_finish_mmu+0x55/0x60 zap_page_range+0x142/0x190 SyS_madvise+0x3cd/0x9c0 system_call_fastpath+0x1c/0x21 swait_active() sustains to be true before finish_swait() is called in kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin() loop greatly increases the probability condition kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv is enabled the yield-candidate vCPU's VMCS RVI field leaks(by vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current VMCS. This patch fixes it by checking conservatively a subset of events. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Marc Zyngier <Marc.Zyngier@arm.com> Cc: stable@vger.kernel.org Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: vgic: Fix kvm_device leak in vgic_its_destroyDave Martin2019-07-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4729ec8c1e1145234aeeebad5d96d77f4ccbb00a ] kvm_device->destroy() seems to be supposed to free its kvm_device struct, but vgic_its_destroy() is not currently doing this, resulting in a memory leak, resulting in kmemleak reports such as the following: unreferenced object 0xffff800aeddfe280 (size 128): comm "qemu-system-aar", pid 13799, jiffies 4299827317 (age 1569.844s) [...] backtrace: [<00000000a08b80e2>] kmem_cache_alloc+0x178/0x208 [<00000000dcad2bd3>] kvm_vm_ioctl+0x350/0xbc0 Fix it. Cc: Andre Przywara <andre.przywara@arm.com> Fixes: 1085fdc68c60 ("KVM: arm64: vgic-its: Introduce new KVM ITS device") Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: Move cc/it checks under hyp's Makefile to avoid instrumentationJames Morse2019-06-192-121/+136
| | | | | | | | | | | | | | | | | | [ Upstream commit 623e1528d4090bd1abaf93ec46f047dee9a6fb32 ] KVM has helpers to handle the condition codes of trapped aarch32 instructions. These are marked __hyp_text and used from HYP, but they aren't built by the 'hyp' Makefile, which has all the runes to avoid ASAN and KCOV instrumentation. Move this code to a new hyp/aarch32.c to avoid a hyp-panic when starting an aarch32 guest on a host built with the ASAN/KCOV debug options. Fixes: 021234ef3752f ("KVM: arm64: Make kvm_condition_valid32() accessible from EL2") Fixes: 8cebe750c4d9a ("arm64: KVM: Make kvm_skip_instr32 available to HYP") Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_IDThomas Huth2019-06-092-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a86cb413f4bf273a9d341a3ab2c2ca44e12eb317 upstream. KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all architectures. However, on s390x, the amount of usable CPUs is determined during runtime - it is depending on the features of the machine the code is running on. Since we are using the vcpu_id as an index into the SCA structures that are defined by the hardware (see e.g. the sca_add_vcpu() function), it is not only the amount of CPUs that is limited by the hard- ware, but also the range of IDs that we can use. Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too. So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common code into the architecture specific code, and on s390x we have to return the same value here as for KVM_CAP_MAX_VCPUS. This problem has been discovered with the kvm_create_max_vcpus selftest. With this change applied, the selftest now passes on s390x, too. Reviewed-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20190523164309.13345-9-thuth@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: Ensure vcpu target is unset on reset failureAndrew Jones2019-05-251-3/+8
| | | | | | | | | | | | | | | | [ Upstream commit 811328fc3222f7b55846de0cd0404339e2e1e6d7 ] A failed KVM_ARM_VCPU_INIT should not set the vcpu target, as the vcpu target is used by kvm_vcpu_initialized() to determine if other vcpu ioctls may proceed. We need to set the target before calling kvm_reset_vcpu(), but if that call fails, we should then unset it and clear the feature bitmap while we're at it. Signed-off-by: Andrew Jones <drjones@redhat.com> [maz: Simplified patch, completed commit message] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: fix spectrev1 gadgetsPaolo Bonzini2019-05-162-4/+7
| | | | | | | | | [ Upstream commit 1d487e9bf8ba66a7174c56a0029c54b1eca8f99c ] These were found with smatch, and then generalized when applicable. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: vgic-its: Take the srcu lock when parsing the memslotsMarc Zyngier2019-05-041-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7494cec6cb3ba7385a6a223b81906384f15aae34 ] Calling kvm_is_visible_gfn() implies that we're parsing the memslots, and doing this without the srcu lock is frown upon: [12704.164532] ============================= [12704.164544] WARNING: suspicious RCU usage [12704.164560] 5.1.0-rc1-00008-g600025238f51-dirty #16 Tainted: G W [12704.164573] ----------------------------- [12704.164589] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage! [12704.164602] other info that might help us debug this: [12704.164616] rcu_scheduler_active = 2, debug_locks = 1 [12704.164631] 6 locks held by qemu-system-aar/13968: [12704.164644] #0: 000000007ebdae4f (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0 [12704.164691] #1: 000000007d751022 (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0 [12704.164726] #2: 00000000219d2706 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164761] #3: 00000000a760aecd (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164794] #4: 000000000ef8e31d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164827] #5: 000000007a872093 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164861] stack backtrace: [12704.164878] CPU: 2 PID: 13968 Comm: qemu-system-aar Tainted: G W 5.1.0-rc1-00008-g600025238f51-dirty #16 [12704.164887] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019 [12704.164896] Call trace: [12704.164910] dump_backtrace+0x0/0x138 [12704.164920] show_stack+0x24/0x30 [12704.164934] dump_stack+0xbc/0x104 [12704.164946] lockdep_rcu_suspicious+0xcc/0x110 [12704.164958] gfn_to_memslot+0x174/0x190 [12704.164969] kvm_is_visible_gfn+0x28/0x70 [12704.164980] vgic_its_check_id.isra.0+0xec/0x1e8 [12704.164991] vgic_its_save_tables_v0+0x1ac/0x330 [12704.165001] vgic_its_set_attr+0x298/0x3a0 [12704.165012] kvm_device_ioctl_attr+0x9c/0xd8 [12704.165022] kvm_device_ioctl+0x8c/0xf8 [12704.165035] do_vfs_ioctl+0xc8/0x960 [12704.165045] ksys_ioctl+0x8c/0xa0 [12704.165055] __arm64_sys_ioctl+0x28/0x38 [12704.165067] el0_svc_common+0xd8/0x138 [12704.165078] el0_svc_handler+0x38/0x78 [12704.165089] el0_svc+0x8/0xc Make sure the lock is taken when doing this. Fixes: bf308242ab98 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock") Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
* KVM: arm/arm64: vgic-its: Take the srcu lock when writing to guest memoryMarc Zyngier2019-05-042-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a6ecfb11bf37743c1ac49b266595582b107b61d4 ] When halting a guest, QEMU flushes the virtual ITS caches, which amounts to writing to the various tables that the guest has allocated. When doing this, we fail to take the srcu lock, and the kernel shouts loudly if running a lockdep kernel: [ 69.680416] ============================= [ 69.680819] WARNING: suspicious RCU usage [ 69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted [ 69.682096] ----------------------------- [ 69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage! [ 69.683225] [ 69.683225] other info that might help us debug this: [ 69.683225] [ 69.683975] [ 69.683975] rcu_scheduler_active = 2, debug_locks = 1 [ 69.684598] 6 locks held by qemu-system-aar/4097: [ 69.685059] #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0 [ 69.686087] #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0 [ 69.686919] #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.687698] #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.688475] #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.689978] #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.690729] [ 69.690729] stack backtrace: [ 69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18 [ 69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019 [ 69.692831] Call trace: [ 69.694072] lockdep_rcu_suspicious+0xcc/0x110 [ 69.694490] gfn_to_memslot+0x174/0x190 [ 69.694853] kvm_write_guest+0x50/0xb0 [ 69.695209] vgic_its_save_tables_v0+0x248/0x330 [ 69.695639] vgic_its_set_attr+0x298/0x3a0 [ 69.696024] kvm_device_ioctl_attr+0x9c/0xd8 [ 69.696424] kvm_device_ioctl+0x8c/0xf8 [ 69.696788] do_vfs_ioctl+0xc8/0x960 [ 69.697128] ksys_ioctl+0x8c/0xa0 [ 69.697445] __arm64_sys_ioctl+0x28/0x38 [ 69.697817] el0_svc_common+0xd8/0x138 [ 69.698173] el0_svc_handler+0x38/0x78 [ 69.698528] el0_svc+0x8/0xc The fix is to obviously take the srcu lock, just like we do on the read side of things since bf308242ab98. One wonders why this wasn't fixed at the same time, but hey... Fixes: bf308242ab98 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock") Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin (Microsoft) <sashal@kernel.org>
* KVM: Reject device ioctls from processes other than the VM's creatorSean Christopherson2019-04-031-0/+3
| | | | | | | | | | | | | | | | | | commit ddba91801aeb5c160b660caed1800eb3aef403f8 upstream. KVM's API requires thats ioctls must be issued from the same process that created the VM. In other words, userspace can play games with a VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the creator can do anything useful. Explicitly reject device ioctls that are issued by a process other than the VM's creator, and update KVM's API documentation to extend its requirements to device ioctls. Fixes: 852b6d57dc7f ("kvm: add device control API") Cc: <stable@vger.kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: Call kvm_arch_memslots_updated() before updating memslotsSean Christopherson2019-03-232-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream. kvm_arch_memslots_updated() is at this point in time an x86-specific hook for handling MMIO generation wraparound. x86 stashes 19 bits of the memslots generation number in its MMIO sptes in order to avoid full page fault walks for repeat faults on emulated MMIO addresses. Because only 19 bits are used, wrapping the MMIO generation number is possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that the generation has changed so that it can invalidate all MMIO sptes in case the effective MMIO generation has wrapped so as to avoid using a stale spte, e.g. a (very) old spte that was created with generation==0. Given that the purpose of kvm_arch_memslots_updated() is to prevent consuming stale entries, it needs to be called before the new generation is propagated to memslots. Invalidating the MMIO sptes after updating memslots means that there is a window where a vCPU could dereference the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO spte that was created with (pre-wrap) generation==0. Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()") Cc: <stable@vger.kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: vgic: Always initialize the group of private IRQsChristoffer Dall2019-03-231-8/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ab2d5eb03dbb7b37a1c6356686fb48626ab0c93e ] We currently initialize the group of private IRQs during kvm_vgic_vcpu_init, and the value of the group depends on the GIC model we are emulating. However, CPUs created before creating (and initializing) the VGIC might end up with the wrong group if the VGIC is created as GICv3 later. Since we have no enforced ordering of creating the VGIC and creating VCPUs, we can end up with part the VCPUs being properly intialized and the remaining incorrectly initialized. That also means that we have no single place to do the per-cpu data structure initialization which depends on knowing the emulated GIC model (which is only the group field). This patch removes the incorrect comment from kvm_vgic_vcpu_init and initializes the group of all previously created VCPUs's private interrupts in vgic_init in addition to the existing initialization in kvm_vgic_vcpu_init. Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* arm/arm64: KVM: Allow a VCPU to fully reset itselfMarc Zyngier2019-03-232-20/+26
| | | | | | | | | | | | | | | | | | | [ Upstream commit 358b28f09f0ab074d781df72b8a671edb1547789 ] The current kvm_psci_vcpu_on implementation will directly try to manipulate the state of the VCPU to reset it. However, since this is not done on the thread that runs the VCPU, we can end up in a strangely corrupted state when the source and target VCPUs are running at the same time. Fix this by factoring out all reset logic from the PSCI implementation and forwarding the required information along with a request to the target VCPU. Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: vgic: Make vgic_dist->lpi_list_lock a raw_spinlockJulien Thierry2019-03-233-10/+10
| | | | | | | | | | | | | | | | [ Upstream commit fc3bc475231e12e9c0142f60100cf84d077c79e1 ] vgic_dist->lpi_list_lock must always be taken with interrupts disabled as it is used in interrupt context. For configurations such as PREEMPT_RT_FULL, this means that it should be a raw_spinlock since RT spinlocks are interruptible. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Acked-by: Christoffer Dall <christoffer.dall@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)Jann Horn2019-02-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cfa39381173d5f969daf43582c95ad679189cbc9 upstream. kvm_ioctl_create_device() does the following: 1. creates a device that holds a reference to the VM object (with a borrowed reference, the VM's refcount has not been bumped yet) 2. initializes the device 3. transfers the reference to the device to the caller's file descriptor table 4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real reference The ownership transfer in step 3 must not happen before the reference to the VM becomes a proper, non-borrowed reference, which only happens in step 4. After step 3, an attacker can close the file descriptor and drop the borrowed reference, which can cause the refcount of the kvm object to drop to zero. This means that we need to grab a reference for the device before anon_inode_getfd(), otherwise the VM can disappear from under us. Fixes: 852b6d57dc7f ("kvm: add device control API") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: Change offset in kvm_write_guest_offset_cached to unsignedJim Mattson2019-02-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7a86dab8cf2f0fdf508f3555dddfc236623bff60 ] Since the offset is added directly to the hva from the gfn_to_hva_cache, a negative offset could result in an out of bounds write. The existing BUG_ON only checks for addresses beyond the end of the gfn_to_hva_cache, not for addresses before the start of the gfn_to_hva_cache. Note that all current call sites have non-negative offsets. Fixes: 4ec6e8636256 ("kvm: Introduce kvm_write_guest_offset_cached()") Reported-by: Cfir Cohen <cfir@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Cfir Cohen <cfir@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* arm64: KVM: Skip MMIO insn after emulationMark Rutland2019-02-121-5/+6
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 0d640732dbebed0f10f18526de21652931f0b2f2 ] When we emulate an MMIO instruction, we advance the CPU state within decode_hsr(), before emulating the instruction effects. Having this logic in decode_hsr() is opaque, and advancing the state before emulation is problematic. It gets in the way of applying consistent single-step logic, and it prevents us from being able to fail an MMIO instruction with a synchronous exception. Clean this up by only advancing the CPU state *after* the effects of the instruction are emulated. Cc: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: Fix VMID alloc race by reverting to lock-lessChristoffer Dall2019-01-161-12/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit fb544d1ca65a89f7a3895f7531221ceeed74ada7 upstream. We recently addressed a VMID generation race by introducing a read/write lock around accesses and updates to the vmid generation values. However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but does so without taking the read lock. As far as I can tell, this can lead to the same kind of race: VM 0, VCPU 0 VM 0, VCPU 1 ------------ ------------ update_vttbr (vmid 254) update_vttbr (vmid 1) // roll over read_lock(kvm_vmid_lock); force_vm_exit() local_irq_disable need_new_vmid_gen == false //because vmid gen matches enter_guest (vmid 254) kvm_arch.vttbr = <PGD>:<VMID 1> read_unlock(kvm_vmid_lock); enter_guest (vmid 1) Which results in running two VCPUs in the same VM with different VMIDs and (even worse) other VCPUs from other VMs could now allocate clashing VMID 254 from the new generation as long as VCPU 0 is not exiting. Attempt to solve this by making sure vttbr is updated before another CPU can observe the updated VMID generation. Cc: stable@vger.kernel.org Fixes: f0cf47d939d0 "KVM: arm/arm64: Close VMID generation race" Reviewed-by: Julien Thierry <julien.thierry@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: vgic: Fix off-by-one bug in vgic_get_irq()Gustavo A. R. Silva2019-01-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c23b2e6fc4ca346018618266bcabd335c0a8a49e upstream. When using the nospec API, it should be taken into account that: "...if the CPU speculates past the bounds check then * array_index_nospec() will clamp the index within the range of [0, * size)." The above is part of the header for macro array_index_nospec() in linux/nospec.h Now, in this particular case, if intid evaluates to exactly VGIC_MAX_SPI or to exaclty VGIC_MAX_PRIVATE, the array_index_nospec() macro ends up returning VGIC_MAX_SPI - 1 or VGIC_MAX_PRIVATE - 1 respectively, instead of VGIC_MAX_SPI or VGIC_MAX_PRIVATE, which, based on the original logic: /* SGIs and PPIs */ if (intid <= VGIC_MAX_PRIVATE) return &vcpu->arch.vgic_cpu.private_irqs[intid]; /* SPIs */ if (intid <= VGIC_MAX_SPI) return &kvm->arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS]; are valid values for intid. Fix this by calling array_index_nospec() macro with VGIC_MAX_PRIVATE + 1 and VGIC_MAX_SPI + 1 as arguments for its parameter size. Fixes: 41b87599c743 ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()") Cc: stable@vger.kernel.org Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> [dropped the SPI part which was fixed separately] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: vgic-v2: Set active_source to 0 when restoring stateChristoffer Dall2019-01-091-1/+16
| | | | | | | | | | | | | | | | | | commit 60c3ab30d8c2ff3a52606df03f05af2aae07dc6b upstream. When restoring the active state from userspace, we don't know which CPU was the source for the active state, and this is not architecturally exposed in any of the register state. Set the active_source to 0 in this case. In the future, we can expand on this and exposse the information as additional information to userspace for GICv2 if anyone cares. Cc: stable@vger.kernel.org Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: vgic: Cap SPIs to the VM-defined maximumMarc Zyngier2019-01-091-2/+2
| | | | | | | | | | | | commit bea2ef803ade3359026d5d357348842bca9edcf1 upstream. SPIs should be checked against the VMs specific configuration, and not the architectural maximum. Cc: stable@vger.kernel.org Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: vgic: Do not cond_resched_lock() with IRQs disabledJulien Thierry2019-01-091-21/+0
| | | | | | | | | | | | | | | | | | | | commit 2e2f6c3c0b08eed3fcf7de3c7684c940451bdeb1 upstream. To change the active state of an MMIO, halt is requested for all vcpus of the affected guest before modifying the IRQ state. This is done by calling cond_resched_lock() in vgic_mmio_change_active(). However interrupts are disabled at this point and we cannot reschedule a vcpu. We actually don't need any of this, as kvm_arm_halt_guest ensures that all the other vcpus are out of the guest. Let's just drop that useless code. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Suggested-by: Christoffer Dall <christoffer.dall@arm.com> Cc: stable@vger.kernel.org Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* arm/arm64: KVM: vgic: Force VM halt when changing the active state of GICv3 ↵Marc Zyngier2019-01-091-2/+4
| | | | | | | | | | | | | | | | | | | PPIs/SGIs commit 107352a24900fb458152b92a4e72fbdc83fd5510 upstream. We currently only halt the guest when a vCPU messes with the active state of an SPI. This is perfectly fine for GICv2, but isn't enough for GICv3, where all vCPUs can access the state of any other vCPU. Let's broaden the condition to include any GICv3 interrupt that has an active state (i.e. all but LPIs). Cc: stable@vger.kernel.org Reviewed-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: Fix caching of host MDCR_EL2 valueMark Rutland2018-11-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit da5a3ce66b8bb51b0ea8a89f42aac153903f90fb upstream. At boot time, KVM stashes the host MDCR_EL2 value, but only does this when the kernel is not running in hyp mode (i.e. is non-VHE). In these cases, the stashed value of MDCR_EL2.HPMN happens to be zero, which can lead to CONSTRAINED UNPREDICTABLE behaviour. Since we use this value to derive the MDCR_EL2 value when switching to/from a guest, after a guest have been run, the performance counters do not behave as expected. This has been observed to result in accesses via PMXEVTYPER_EL0 and PMXEVCNTR_EL0 not affecting the relevant counters, resulting in events not being counted. In these cases, only the fixed-purpose cycle counter appears to work as expected. Fix this by always stashing the host MDCR_EL2 value, regardless of VHE. Cc: Christopher Dall <christoffer.dall@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: stable@vger.kernel.org Fixes: 1e947bad0b63b351 ("arm64: KVM: Skip HYP setup when already running in HYP") Tested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: Ensure only THP is candidate for adjustmentPunit Agrawal2018-11-131-1/+7
| | | | | | | | | | | | | | | | | | | | | | commit fd2ef358282c849c193aa36dadbf4f07f7dcd29b upstream. PageTransCompoundMap() returns true for hugetlbfs and THP hugepages. This behaviour incorrectly leads to stage 2 faults for unsupported hugepage sizes (e.g., 64K hugepage with 4K pages) to be treated as THP faults. Tighten the check to filter out hugetlbfs pages. This also leads to consistently mapping all unsupported hugepage sizes as PTE level entries at stage 2. Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Reviewed-by: Suzuki Poulose <suzuki.poulose@arm.com> Cc: Christoffer Dall <christoffer.dall@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: stable@vger.kernel.org # v4.13+ Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: Remove obsolete kvm_unmap_hva notifier backendMarc Zyngier2018-09-072-27/+0
| | | | | | | | | | | kvm_unmap_hva is long gone, and we only have kvm_unmap_hva_range to deal with. Drop the now obsolete code. Fixes: fb1522e099f0 ("KVM: update to new mmu_notifier semantic v2") Cc: James Hogan <jhogan@kernel.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
* KVM: arm/arm64: Clean dcache to PoC when changing PTE due to CoWMarc Zyngier2018-09-071-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | When triggering a CoW, we unmap the RO page via an MMU notifier (invalidate_range_start), and then populate the new PTE using another one (change_pte). In the meantime, we'll have copied the old page into the new one. The problem is that the data for the new page is sitting in the cache, and should the guest have an uncached mapping to that page (or its MMU off), following accesses will bypass the cache. In a way, this is similar to what happens on a translation fault: We need to clean the page to the PoC before mapping it. So let's just do that. This fixes a KVM unit test regression observed on a HiSilicon platform, and subsequently reproduced on Seattle. Fixes: a9c0e12ebee5 ("KVM: arm/arm64: Only clean the dcache on translation fault") Cc: stable@vger.kernel.org # v4.16+ Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-08-2214-109/+413
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull second set of KVM updates from Paolo Bonzini: "ARM: - Support for Group0 interrupts in guests - Cache management optimizations for ARMv8.4 systems - Userspace interface for RAS - Fault path optimization - Emulated physical timer fixes - Random cleanups x86: - fixes for L1TF - a new test case - non-support for SGX (inject the right exception in the guest) - fix lockdep false positive" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits) KVM: VMX: fixes for vmentry_l1d_flush module parameter kvm: selftest: add dirty logging test kvm: selftest: pass in extra memory when create vm kvm: selftest: include the tools headers kvm: selftest: unify the guest port macros tools: introduce test_and_clear_bit KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled KVM: vmx: Inject #UD for SGX ENCLS instruction in guest KVM: vmx: Add defines for SGX ENCLS exiting x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush() x86: kvm: avoid unused variable warning KVM: Documentation: rename the capability of KVM_CAP_ARM_SET_SERROR_ESR KVM: arm/arm64: Skip updating PTE entry if no change KVM: arm/arm64: Skip updating PMD entry if no change KVM: arm: Use true and false for boolean values KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h KVM: arm: vgic-v3: Add support for ICC_SGI0R and ICC_ASGI1R accesses KVM: arm64: vgic-v3: Add support for ICC_SGI0R_EL1 and ICC_ASGI1R_EL1 accesses KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs ...
| * Merge tag 'kvmarm-for-v4.19' of ↵Paolo Bonzini2018-08-2214-109/+413
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm updates for 4.19 - Support for Group0 interrupts in guests - Cache management optimizations for ARMv8.4 systems - Userspace interface for RAS, allowing error retrival and injection - Fault path optimization - Emulated physical timer fixes - Random cleanups
| | * KVM: arm/arm64: Skip updating PTE entry if no changePunit Agrawal2018-08-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When there is contention on faulting in a particular page table entry at stage 2, the break-before-make requirement of the architecture can lead to additional refaulting due to TLB invalidation. Avoid this by skipping a page table update if the new value of the PTE matches the previous value. Cc: stable@vger.kernel.org Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") Reviewed-by: Suzuki Poulose <suzuki.poulose@arm.com> Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: Skip updating PMD entry if no changePunit Agrawal2018-08-131-11/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Contention on updating a PMD entry by a large number of vcpus can lead to duplicate work when handling stage 2 page faults. As the page table update follows the break-before-make requirement of the architecture, it can lead to repeated refaults due to clearing the entry and flushing the tlbs. This problem is more likely when - * there are large number of vcpus * the mapping is large block mapping such as when using PMD hugepages (512MB) with 64k pages. Fix this by skipping the page table update if there is no change in the entry being updated. Cc: stable@vger.kernel.org Fixes: ad361f093c1e ("KVM: ARM: Support hugetlbfs backed huge pages") Reviewed-by: Suzuki Poulose <suzuki.poulose@arm.com> Acked-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabledJia He2018-08-123-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_vgic_sync_hwstate is only called with IRQ being disabled. There is thus no need to call spin_lock_irqsave/restore in vgic_fold_lr_state and vgic_prune_ap_list. This patch replace them with the non irq-safe version. Signed-off-by: Jia He <jia.he@hxt-semitech.com> Acked-by: Christoffer Dall <christoffer.dall@arm.com> [maz: commit message tidy-up] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.hJia He2018-08-122-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | DEBUG_SPINLOCK_BUG_ON can be used with both vgic-v2 and vgic-v3, so let's move it to vgic.h Signed-off-by: Jia He <jia.he@hxt-semitech.com> [maz: commit message tidy-up] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIsMarc Zyngier2018-08-121-4/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although vgic-v3 now supports Group0 interrupts, it still doesn't deal with Group0 SGIs. As usually with the GIC, nothing is simple: - ICC_SGI1R can signal SGIs of both groups, since GICD_CTLR.DS==1 with KVM (as per 8.1.10, Non-secure EL1 access) - ICC_SGI0R can only generate Group0 SGIs - ICC_ASGI1R sees its scope refocussed to generate only Group0 SGIs (as per the note at the bottom of Table 8-14) We only support Group1 SGIs so far, so no material change. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: Fix lost IRQs from emulated physcial timer when blockedChristoffer Dall2018-07-311-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the VCPU is blocked (for example from WFI) we don't inject the physical timer interrupt if it should fire while the CPU is blocked, but instead we just wake up the VCPU and expect kvm_timer_vcpu_load to take care of injecting the interrupt. Unfortunately, kvm_timer_vcpu_load() doesn't actually do that, it only has support to schedule a soft timer if the emulated phys timer is expected to fire in the future. Follow the same pattern as kvm_timer_update_state() and update the irq state after potentially scheduling a soft timer. Reported-by: Andre Przywara <andre.przywara@arm.com> Cc: Stable <stable@vger.kernel.org> # 4.15+ Fixes: bbdd52cfcba29 ("KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exit") Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: Fix potential loss of ptimer interruptsChristoffer Dall2018-07-311-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_timer_update_state() is called when changing the phys timer configuration registers, either via vcpu reset, as a result of a trap from the guest, or when userspace programs the registers. phys_timer_emulate() is in turn called by kvm_timer_update_state() to either cancel an existing software timer, or program a new software timer, to emulate the behavior of a real phys timer, based on the change in configuration registers. Unfortunately, the interaction between these two functions left a small race; if the conceptual emulated phys timer should actually fire, but the soft timer hasn't executed its callback yet, we cancel the timer in phys_timer_emulate without injecting an irq. This only happens if the check in kvm_timer_update_state is called before the timer should fire, which is relatively unlikely, but possible. The solution is to update the state of the phys timer after calling phys_timer_emulate, which will pick up the pending timer state and update the interrupt value. Note that this leaves the opportunity of raising the interrupt twice, once in the just-programmed soft timer, and once in kvm_timer_update_state. Since this always happens synchronously with the VCPU execution, there is no harm in this, and the guest ever only sees a single timer interrupt. Cc: Stable <stable@vger.kernel.org> # 4.15+ Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic: Fix possible spectre-v1 write in vgic_mmio_write_apr()Mark Rutland2018-07-241-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible for userspace to control n. Sanitize n when using it as an array index, to inhibit the potential spectre-v1 write gadget. Note that while it appears that n must be bound to the interval [0,3] due to the way it is extracted from addr, we cannot guarantee that compiler transformations (and/or future refactoring) will ensure this is the case, and given this is a slow path it's better to always perform the masking. Found by smatch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Christoffer Dall <christoffer.dall@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: kvmarm@lists.cs.columbia.edu Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm: Add 32bit get/set events supportJames Morse2018-07-211-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | arm64's new use of KVMs get_events/set_events API calls isn't just or RAS, it allows an SError that has been made pending by KVM as part of its device emulation to be migrated. Wire this up for 32bit too. We only need to read/write the HCR_VA bit, and check that no esr has been provided, as we don't yet support VDFSR. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Dongjiu Geng <gengdongjiu@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm64: Share the parts of get/set events useful to 32bitJames Morse2018-07-211-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The get/set events helpers to do some work to check reserved and padding fields are zero. This is useful on 32bit too. Move this code into virt/kvm/arm/arm.c, and give the arch code some underscores. This is temporarily hidden behind __KVM_HAVE_VCPU_EVENTS until 32bit is wired up. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Dongjiu Geng <gengdongjiu@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTSDongjiu Geng2018-07-211-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the migrating VMs, user space may need to know the exception state. For example, in the machine A, KVM make an SError pending, when migrate to B, KVM also needs to pend an SError. This new IOCTL exports user-invisible states related to SError. Together with appropriate user space changes, user space can get/set the SError exception state to do migrate/snapshot/suspend. Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com> Reviewed-by: James Morse <james.morse@arm.com> [expanded documentation wording] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic: Let userspace opt-in to writable v2 IGROUPRChristoffer Dall2018-07-211-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simply letting IGROUPR be writable from userspace would break migration from old kernels to newer kernels, because old kernels incorrectly report interrupt groups as group 1. This would not be a big problem if userspace wrote GICD_IIDR as read from the kernel, because we could detect the incompatibility and return an error to userspace. Unfortunately, this is not the case with current userspace implementations and simply letting IGROUPR be writable from userspace for an emulated GICv2 silently breaks migration and causes the destination VM to no longer run after migration. We now encourage userspace to write the read and expected value of GICD_IIDR as the first part of a GIC register restore, and if we observe a write to GICD_IIDR we know that userspace has been updated and has had a chance to cope with older kernels (VGICv2 IIDR.Revision == 0) incorrectly reporting interrupts as group 1, and therefore we now allow groups to be user writable. Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic: Allow configuration of interrupt groupsChristoffer Dall2018-07-215-4/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the required MMIO accessors for GICv2 and GICv3 for the IGROUPR distributor and redistributor registers. This can allow guests to change behavior compared to running on previous versions of KVM, but only to align with the architecture and hardware implementations. This also allows userspace to configure the interrupts groups for GICv3. We don't allow userspace to write the groups on GICv2 just yet, because that would result in GICv2 guests not receiving interrupts after migrating from an older kernel that exposes GICv2 interrupts as group 1. Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>