summaryrefslogtreecommitdiffstats
path: root/virt/kvm
Commit message (Collapse)AuthorAgeFilesLines
* KVM: arm64: vgic: Fix exit condition in scan_its_table()Eric Ren2022-11-031-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c000a2607145d28b06c697f968491372ea56c23a upstream. With some PCIe topologies, restoring a guest fails while parsing the ITS device tables. Reproducer hints: 1. Create ARM virt VM with pxb-pcie bus which adds extra host bridges, with qemu command like: ``` -device pxb-pcie,bus_nr=8,id=pci.x,numa_node=0,bus=pcie.0 \ -device pcie-root-port,..,bus=pci.x \ ... -device pxb-pcie,bus_nr=37,id=pci.y,numa_node=1,bus=pcie.0 \ -device pcie-root-port,..,bus=pci.y \ ... ``` 2. Ensure the guest uses 2-level device table 3. Perform VM migration which calls save/restore device tables In that setup, we get a big "offset" between 2 device_ids, which makes unsigned "len" round up a big positive number, causing the scan loop to continue with a bad GPA. For example: 1. L1 table has 2 entries; 2. and we are now scanning at L2 table entry index 2075 (pointed to by L1 first entry) 3. if next device id is 9472, we will get a big offset: 7397; 4. with unsigned 'len', 'len -= offset * esz', len will underflow to a positive number, mistakenly into next iteration with a bad GPA; (It should break out of the current L2 table scanning, and jump into the next L1 table entry) 5. that bad GPA fails the guest read. Fix it by stopping the L2 table scan when the next device id is outside of the current table, allowing the scan to continue from the next L1 table entry. Thanks to Eric Auger for the fix suggestion. Fixes: 920a7a8fa92a ("KVM: arm64: vgic-its: Add infrastructure for tableookup") Suggested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Eric Ren <renzhengeek@gmail.com> [maz: commit message tidy-up] Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/d9c3a564af9e2c5bf63f48a7dcbf08cd593c5c0b.1665802985.git.renzhengeek@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: Add infrastructure and macro to mark VM as buggedSean Christopherson2022-08-251-5/+5
| | | | | | | | | | | | | commit 0b8f11737cffc1a406d1134b58687abc29d76b52 upstream Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <3a0998645c328bf0895f1290e61821b70f048549.1625186503.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [SG: Adjusted context for kernel version 4.19] Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: Prevent module exit until all VMs are freedDavid Matlack2022-04-151-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5f6de5cbebee925a612856fce6f9182bb3eee0db upstream. Tie the lifetime the KVM module to the lifetime of each VM via kvm.users_count. This way anything that grabs a reference to the VM via kvm_get_kvm() cannot accidentally outlive the KVM module. Prior to this commit, the lifetime of the KVM module was tied to the lifetime of /dev/kvm file descriptors, VM file descriptors, and vCPU file descriptors by their respective file_operations "owner" field. This approach is insufficient because references grabbed via kvm_get_kvm() do not prevent closing any of the aforementioned file descriptors. This fixes a long standing theoretical bug in KVM that at least affects async page faults. kvm_setup_async_pf() grabs a reference via kvm_get_kvm(), and drops it in an asynchronous work callback. Nothing prevents the VM file descriptor from being closed and the KVM module from being unloaded before this callback runs. Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out") Fixes: 3d3aab1b973b ("KVM: set owner of cpu and vm file operations") Cc: stable@vger.kernel.org Suggested-by: Ben Gardon <bgardon@google.com> [ Based on a patch from Ben implemented for Google's kernel. ] Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220303183328.1499189-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: Allow SMCCC_ARCH_WORKAROUND_3 to be discovered and migratedJames Morse2022-03-231-0/+12
| | | | | | | | | | | | | | | | commit a5905d6af492ee6a4a2205f0d550b3f931b03d03 upstream. KVM allows the guest to discover whether the ARCH_WORKAROUND SMCCC are implemented, and to preserve that state during migration through its firmware register interface. Add the necessary boiler plate for SMCCC_ARCH_WORKAROUND_3. Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [ kvm code moved to virt/kvm/arm, removed fw regs ABI. Added 32bit stub ] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: remember position in kvm->vcpus arrayRadim Krčmář2021-09-261-2/+3
| | | | | | | | | | | | | | | | | commit 8750e72a79dda2f665ce17b62049f4d62130d991 upstream. Fetching an index for any vcpu in kvm->vcpus array by traversing the entire array everytime is costly. This patch remembers the position of each vcpu in kvm->vcpus array by storing it in vcpus_idx under kvm_vcpu structure. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [borntraeger@de.ibm.com]: backport to 4.19 (also fits for 5.4) Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: Handle PSCI resets before userspace touches vCPU stateOliver Upton2021-09-221-0/+8
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6826c6849b46aaa91300201213701eb861af4ba0 ] The CPU_ON PSCI call takes a payload that KVM uses to configure a destination vCPU to run. This payload is non-architectural state and not exposed through any existing UAPI. Effectively, we have a race between CPU_ON and userspace saving/restoring a guest: if the target vCPU isn't ran again before the VMM saves its state, the requested PC and context ID are lost. When restored, the target vCPU will be runnable and start executing at its old PC. We can avoid this race by making sure the reset payload is serviced before userspace can access a vCPU's state. Fixes: 358b28f09f0a ("arm/arm64: KVM: Allow a VCPU to fully reset itself") Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210818202133.1106786-3-oupton@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped()Sean Christopherson2021-07-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a9545779ee9e9e103648f6f2552e73cfe808d0f4 upstream. Use kvm_pfn_t, a.k.a. u64, for the local 'pfn' variable when retrieving a so called "remapped" hva/pfn pair. In theory, the hva could resolve to a pfn in high memory on a 32-bit kernel. This bug was inadvertantly exposed by commit bd2fae8da794 ("KVM: do not assume PTE is writable after follow_pfn"), which added an error PFN value to the mix, causing gcc to comlain about overflowing the unsigned long. arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function ‘hva_to_pfn_remapped’: include/linux/kvm_host.h:89:30: error: conversion from ‘long long unsigned int’ to ‘long unsigned int’ changes value from ‘9218868437227405314’ to ‘2’ [-Werror=overflow] 89 | #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) | ^ virt/kvm/kvm_main.c:1935:9: note: in expansion of macro ‘KVM_PFN_ERR_RO_FAULT’ Cc: stable@vger.kernel.org Fixes: add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210208201940.1258328-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: do not allow mapping valid but non-reference-counted pagesNicholas Piggin2021-07-281-2/+17
| | | | | | | | | | | | | | | | | | | | | | | commit f8be156be163a052a067306417cd0ff679068c97 upstream. It's possible to create a region which maps valid but non-refcounted pages (e.g., tail pages of non-compound higher order allocations). These host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family of APIs, which take a reference to the page, which takes it from 0 to 1. When the reference is dropped, this will free the page incorrectly. Fix this by only taking a reference on valid pages if it was non-zero, which indicates it is participating in normal refcounting (and can be released with put_page). This addresses CVE-2021-22543. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: do not assume PTE is writable after follow_pfnPaolo Bonzini2021-07-281-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit bd2fae8da794b55bf2ac02632da3a151b10e664c upstream. In order to convert an HVA to a PFN, KVM usually tries to use the get_user_pages family of functinso. This however is not possible for VM_IO vmas; in that case, KVM instead uses follow_pfn. In doing this however KVM loses the information on whether the PFN is writable. That is usually not a problem because the main use of VM_IO vmas with KVM is for BARs in PCI device assignment, however it is a bug. To fix it, use follow_pte and check pte_write while under the protection of the PTE lock. The information can be used to fail hva_to_pfn_remapped or passed back to the caller via *writable. Usage of follow_pfn was introduced in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05); however, even older version have the same issue, all the way back to commit 2e2e3738af33 ("KVM: Handle vma regions with no backing page", 2008-07-20), as they also did not check whether the PFN was writable. Fixes: 2e2e3738af33 ("KVM: Handle vma regions with no backing page") Reported-by: David Stevens <stevensd@google.com> Cc: 3pvd@google.com Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [OP: backport to 4.19, adjust follow_pte() -> follow_pte_pmd()] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: Fix KVM_VGIC_V3_ADDR_TYPE_REDIST readEric Auger2021-06-301-2/+2
| | | | | | | | | | | | | | | | | | | | | commit 94ac0835391efc1a30feda6fc908913ec012951e upstream. When reading the base address of the a REDIST region through KVM_VGIC_V3_ADDR_TYPE_REDIST we expect the redistributor region list to be populated with a single element. However list_first_entry() expects the list to be non empty. Instead we should use list_first_entry_or_null which effectively returns NULL if the list is empty. Fixes: dbd9733ab674 ("KVM: arm/arm64: Replace the single rdist region by a list") Cc: <Stable@vger.kernel.org> # v4.18+ Signed-off-by: Eric Auger <eric.auger@redhat.com> Reported-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210412150034.29185-1-eric.auger@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: Initialize VCPU mdcr_el2 before loading itAlexandru Elisei2021-05-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 263d6287da1433aba11c5b4046388f2cdf49675c upstream. When a VCPU is created, the kvm_vcpu struct is initialized to zero in kvm_vm_ioctl_create_vcpu(). On VHE systems, the first time vcpu.arch.mdcr_el2 is loaded on hardware is in vcpu_load(), before it is set to a sensible value in kvm_arm_setup_debug() later in the run loop. The result is that KVM executes for a short time with MDCR_EL2 set to zero. This has several unintended consequences: * Setting MDCR_EL2.HPMN to 0 is constrained unpredictable according to ARM DDI 0487G.a, page D13-3820. The behavior specified by the architecture in this case is for the PE to behave as if MDCR_EL2.HPMN is set to a value less than or equal to PMCR_EL0.N, which means that an unknown number of counters are now disabled by MDCR_EL2.HPME, which is zero. * The host configuration for the other debug features controlled by MDCR_EL2 is temporarily lost. This has been harmless so far, as Linux doesn't use the other fields, but that might change in the future. Let's avoid both issues by initializing the VCPU's mdcr_el2 field in kvm_vcpu_vcpu_first_run_init(), thus making sure that the MDCR_EL2 register has a consistent value after each vcpu_load(). Fixes: d5a21bcc2995 ("KVM: arm64: Move common VHE/non-VHE trap config in separate functions") Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210407144857.199746-3-alexandru.elisei@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: Fix exclusive limit for IPA sizeMarc Zyngier2021-03-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Commit 262b003d059c6671601a19057e9fe1a5e7f23722 upstream. When registering a memslot, we check the size and location of that memslot against the IPA size to ensure that we can provide guest access to the whole of the memory. Unfortunately, this check rejects memslot that end-up at the exact limit of the addressing capability for a given IPA size. For example, it refuses the creation of a 2GB memslot at 0x8000000 with a 32bit IPA space. Fix it by relaxing the check to accept a memslot reaching the limit of the IPA space. Fixes: c3058d5da222 ("arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE") Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org # 4.4, 4.9, 4.14, 4.19 Reviewed-by: Andrew Jones <drjones@redhat.com> Link: https://lore.kernel.org/r/20210311100016.3830038-3-maz@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: check tlbs_dirty directlyLai Jiangshan2021-02-231-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 88bf56d04bc3564542049ec4ec168a8b60d0b48c upstream In kvm_mmu_notifier_invalidate_range_start(), tlbs_dirty is used as: need_tlb_flush |= kvm->tlbs_dirty; with need_tlb_flush's type being int and tlbs_dirty's type being long. It means that tlbs_dirty is always used as int and the higher 32 bits is useless. We need to check tlbs_dirty in a correct way and this change checks it directly without propagating it to need_tlb_flush. Note: it's _extremely_ unlikely this neglecting of higher 32 bits can cause problems in practice. It would require encountering tlbs_dirty on a 4 billion count boundary, and KVM would need to be using shadow paging or be running a nested guest. Cc: stable@vger.kernel.org Fixes: a4ee1ca4a36e ("KVM: MMU: delay flush all tlbs on sync_page path") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20201217154118.16497-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [sudip: adjust context] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: vgic-v3: Drop the reporting of GICR_TYPER.Last for userspaceZenghui Yu2020-12-021-2/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 23bde34771f1ea92fb5e6682c0d8c04304d34b3b upstream. It was recently reported that if GICR_TYPER is accessed before the RD base address is set, we'll suffer from the unset @rdreg dereferencing. Oops... gpa_t last_rdist_typer = rdreg->base + GICR_TYPER + (rdreg->free_index - 1) * KVM_VGIC_V3_REDIST_SIZE; It's "expected" that users will access registers in the redistributor if the RD has been properly configured (e.g., the RD base address is set). But it hasn't yet been covered by the existing documentation. Per discussion on the list [1], the reporting of the GICR_TYPER.Last bit for userspace never actually worked. And it's difficult for us to emulate it correctly given that userspace has the flexibility to access it any time. Let's just drop the reporting of the Last bit for userspace for now (userspace should have full knowledge about it anyway) and it at least prevents kernel from panic ;-) [1] https://lore.kernel.org/kvmarm/c20865a267e44d1e2c0d52ce4e012263@kernel.org/ Fixes: ba7b3f1275fd ("KVM: arm/arm64: Revisit Redistributor TYPER last bit computation") Reported-by: Keqian Zhu <zhukeqian1@huawei.com> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20201117151629.1738-1-yuzenghui@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetchMarc Zyngier2020-10-012-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c4ad98e4b72cb5be30ea282fce935248f2300e62 upstream. KVM currently assumes that an instruction abort can never be a write. This is in general true, except when the abort is triggered by a S1PTW on instruction fetch that tries to update the S1 page tables (to set AF, for example). This can happen if the page tables have been paged out and brought back in without seeing a direct write to them (they are thus marked read only), and the fault handling code will make the PT executable(!) instead of writable. The guest gets stuck forever. In these conditions, the permission fault must be considered as a write so that the Stage-1 update can take place. This is essentially the I-side equivalent of the problem fixed by 60e21a0ef54c ("arm64: KVM: Take S1 walks into account when determining S2 write faults"). Update kvm_is_write_fault() to return true on IABT+S1PTW, and introduce kvm_vcpu_trap_is_exec_fault() that only return true when no faulting on a S1 fault. Additionally, kvm_vcpu_dabt_iss1tw() is renamed to kvm_vcpu_abt_iss1tw(), as the above makes it plain that it isn't specific to data abort. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Will Deacon <will@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200915104218.1284701-2-maz@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: vgic-its: Fix memory leak on the error path of vgic_add_lpi()Zenghui Yu2020-10-011-2/+9
| | | | | | | | | | | | | [ Upstream commit 57bdb436ce869a45881d8aa4bc5dac8e072dd2b6 ] If we're going to fail out the vgic_add_lpi(), let's make sure the allocated vgic_irq memory is also freed. Though it seems that both cases are unlikely to fail. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200414030349.625-3-yuzenghui@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: fix overflow of zero page refcount with ksm runningZhuang Yanying2020-10-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7df003c85218b5f5b10a7f6418208f31e813f38f ] We are testing Virtual Machine with KSM on v5.4-rc2 kernel, and found the zero_page refcount overflow. The cause of refcount overflow is increased in try_async_pf (get_user_page) without being decreased in mmu_set_spte() while handling ept violation. In kvm_release_pfn_clean(), only unreserved page will call put_page. However, zero page is reserved. So, as well as creating and destroy vm, the refcount of zero page will continue to increase until it overflows. step1: echo 10000 > /sys/kernel/pages_to_scan/pages_to_scan echo 1 > /sys/kernel/pages_to_scan/run echo 1 > /sys/kernel/pages_to_scan/use_zero_pages step2: just create several normal qemu kvm vms. And destroy it after 10s. Repeat this action all the time. After a long period of time, all domains hang because of the refcount of zero page overflow. Qemu print error log as follow: … error: kvm run failed Bad address EAX=00006cdc EBX=00000008 ECX=80202001 EDX=078bfbfd ESI=ffffffff EDI=00000000 EBP=00000008 ESP=00006cc4 EIP=000efd75 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA] SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA] LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy GDT= 000f7070 00000037 IDT= 000f70ae 00000000 CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 DR6=00000000ffff0ff0 DR7=0000000000000400 EFER=0000000000000000 Code=00 01 00 00 00 e9 e8 00 00 00 c7 05 4c 55 0f 00 01 00 00 00 <8b> 35 00 00 01 00 8b 3d 04 00 01 00 b8 d8 d3 00 00 c1 e0 08 0c ea a3 00 00 01 00 c7 05 04 … Meanwhile, a kernel warning is departed. [40914.836375] WARNING: CPU: 3 PID: 82067 at ./include/linux/mm.h:987 try_get_page+0x1f/0x30 [40914.836412] CPU: 3 PID: 82067 Comm: CPU 0/KVM Kdump: loaded Tainted: G OE 5.2.0-rc2 #5 [40914.836415] RIP: 0010:try_get_page+0x1f/0x30 [40914.836417] Code: 40 00 c3 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8 01 75 11 8b 47 34 85 c0 7e 10 f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb e9 <0f> 0b 31 c0 c3 66 90 66 2e 0f 1f 84 00 0 0 00 00 00 48 8b 47 08 a8 [40914.836418] RSP: 0018:ffffb4144e523988 EFLAGS: 00010286 [40914.836419] RAX: 0000000080000000 RBX: 0000000000000326 RCX: 0000000000000000 [40914.836420] RDX: 0000000000000000 RSI: 00004ffdeba10000 RDI: ffffdf07093f6440 [40914.836421] RBP: ffffdf07093f6440 R08: 800000424fd91225 R09: 0000000000000000 [40914.836421] R10: ffff9eb41bfeebb8 R11: 0000000000000000 R12: ffffdf06bbd1e8a8 [40914.836422] R13: 0000000000000080 R14: 800000424fd91225 R15: ffffdf07093f6440 [40914.836423] FS: 00007fb60ffff700(0000) GS:ffff9eb4802c0000(0000) knlGS:0000000000000000 [40914.836425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [40914.836426] CR2: 0000000000000000 CR3: 0000002f220e6002 CR4: 00000000003626e0 [40914.836427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [40914.836427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [40914.836428] Call Trace: [40914.836433] follow_page_pte+0x302/0x47b [40914.836437] __get_user_pages+0xf1/0x7d0 [40914.836441] ? irq_work_queue+0x9/0x70 [40914.836443] get_user_pages_unlocked+0x13f/0x1e0 [40914.836469] __gfn_to_pfn_memslot+0x10e/0x400 [kvm] [40914.836486] try_async_pf+0x87/0x240 [kvm] [40914.836503] tdp_page_fault+0x139/0x270 [kvm] [40914.836523] kvm_mmu_page_fault+0x76/0x5e0 [kvm] [40914.836588] vcpu_enter_guest+0xb45/0x1570 [kvm] [40914.836632] kvm_arch_vcpu_ioctl_run+0x35d/0x580 [kvm] [40914.836645] kvm_vcpu_ioctl+0x26e/0x5d0 [kvm] [40914.836650] do_vfs_ioctl+0xa9/0x620 [40914.836653] ksys_ioctl+0x60/0x90 [40914.836654] __x64_sys_ioctl+0x16/0x20 [40914.836658] do_syscall_64+0x5b/0x180 [40914.836664] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [40914.836666] RIP: 0033:0x7fb61cb6bfc7 Signed-off-by: LinFeng <linfeng23@huawei.com> Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm/arm64: vgic: Fix potential double free dist->spis in ↵Miaohe Lin2020-10-011-0/+1
| | | | | | | | | | | | | | | | | | | | __kvm_vgic_destroy() [ Upstream commit 0bda9498dd45280e334bfe88b815ebf519602cc3 ] In kvm_vgic_dist_init() called from kvm_vgic_map_resources(), if dist->vgic_model is invalid, dist->spis will be freed without set dist->spis = NULL. And in vgicv2 resources clean up path, __kvm_vgic_destroy() will be called to free allocated resources. And dist->spis will be freed again in clean up chain because we forget to set dist->spis = NULL in kvm_vgic_dist_init() failed path. So double free would happen. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/1574923128-19956-1-git-send-email-linmiaohe@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: fix memory leak in kvm_io_bus_unregister_dev()Rustam Kovhaev2020-09-261-9/+12
| | | | | | | | | | | | | | | | | | [ Upstream commit f65886606c2d3b562716de030706dfe1bea4ed5e ] when kmalloc() fails in kvm_io_bus_unregister_dev(), before removing the bus, we should iterate over all other devices linked to it and call kvm_iodevice_destructor() for them Fixes: 90db10434b16 ("KVM: kvm_io_bus_unregister_dev() should never fail") Cc: stable@vger.kernel.org Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707 Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200907185535.233114-1-rkovhaev@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not setWill Deacon2020-08-261-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b5331379bc62611d1026173a09c73573384201d9 upstream. When an MMU notifier call results in unmapping a range that spans multiple PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary, since this avoids running into RCU stalls during VM teardown. Unfortunately, if the VM is destroyed as a result of OOM, then blocking is not permitted and the call to the scheduler triggers the following BUG(): | BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394 | in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper | INFO: lockdep is turned off. | CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1 | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 | Call trace: | dump_backtrace+0x0/0x284 | show_stack+0x1c/0x28 | dump_stack+0xf0/0x1a4 | ___might_sleep+0x2bc/0x2cc | unmap_stage2_range+0x160/0x1ac | kvm_unmap_hva_range+0x1a0/0x1c8 | kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8 | __mmu_notifier_invalidate_range_start+0x218/0x31c | mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0 | __oom_reap_task_mm+0x128/0x268 | oom_reap_task+0xac/0x298 | oom_reaper+0x178/0x17c | kthread+0x1e4/0x1fc | ret_from_fork+0x10/0x30 Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier flags. Cc: <stable@vger.kernel.org> Fixes: 8b3405e345b5 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd") Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20200811102725.7121-3-will@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [will: Backport to 4.19; use 'blockable' instead of non-existent MMU_NOTIFIER_RANGE_BLOCKABLE flag] Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()Will Deacon2020-08-262-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | commit fdfe7cbd58806522e799e2a50a15aee7f2cbb7b6 upstream. The 'flags' field of 'struct mmu_notifier_range' is used to indicate whether invalidate_range_{start,end}() are permitted to block. In the case of kvm_mmu_notifier_invalidate_range_start(), this field is not forwarded on to the architecture-specific implementation of kvm_unmap_hva_range() and therefore the backend cannot sensibly decide whether or not to block. Add an extra 'flags' parameter to kvm_unmap_hva_range() so that architectures are aware as to whether or not they are permitted to block. Cc: <stable@vger.kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20200811102725.7121-2-will@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [will: Backport to 4.19; use 'blockable' instead of non-existent range flags] Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: Synchronize sysreg state on injecting an AArch32 exceptionMarc Zyngier2020-06-221-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | commit 0370964dd3ff7d3d406f292cb443a927952cbd05 upstream. On a VHE system, the EL1 state is left in the CPU most of the time, and only syncronized back to memory when vcpu_put() is called (most of the time on preemption). Which means that when injecting an exception, we'd better have a way to either: (1) write directly to the EL1 sysregs (2) synchronize the state back to memory, and do the changes there For an AArch64, we already do (1), so we are safe. Unfortunately, doing the same thing for AArch32 would be pretty invasive. Instead, we can easily implement (2) by calling the put/load architectural backends, and keep preemption disabled. We can then reload the state back into EL1. Cc: stable@vger.kernel.org Reported-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: Fix APIC page invalidation raceEiichi Tsukata2020-06-221-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e649b3f0188f8fd34dd0dde8d43fd3312b902fb2 ] Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried to fix inappropriate APIC page invalidation by re-introducing arch specific kvm_arch_mmu_notifier_invalidate_range() and calling it from kvm_mmu_notifier_invalidate_range_start. However, the patch left a possible race where the VMCS APIC address cache is updated *before* it is unmapped: (Invalidator) kvm_mmu_notifier_invalidate_range_start() (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD) (KVM VCPU) vcpu_enter_guest() (KVM VCPU) kvm_vcpu_reload_apic_access_page() (Invalidator) actually unmap page Because of the above race, there can be a mismatch between the host physical address stored in the APIC_ACCESS_PAGE VMCS field and the host physical address stored in the EPT entry for the APIC GPA (0xfee0000). When this happens, the processor will not trap APIC accesses, and will instead show the raw contents of the APIC-access page. Because Windows OS periodically checks for unexpected modifications to the LAPIC register, this will show up as a BSOD crash with BugCheck CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in https://bugzilla.redhat.com/show_bug.cgi?id=1751017. The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range() cannot guarantee that no additional references are taken to the pages in the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately, this case is supported by the MMU notifier API, as documented in include/linux/mmu_notifier.h: * If the subsystem * can't guarantee that no additional references are taken to * the pages in the range, it has to implement the * invalidate_range() notifier to remove any references taken * after invalidate_range_start(). The fix therefore is to reload the APIC-access page field in the VMCS from kvm_mmu_notifier_invalidate_range() instead of ..._range_start(). Cc: stable@vger.kernel.org Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation") Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951 Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm64: Fix 32bit PC wrap-aroundMarc Zyngier2020-05-141-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 0225fd5e0a6a32af7af0aefac45c8ebf19dc5183 upstream. In the unlikely event that a 32bit vcpu traps into the hypervisor on an instruction that is located right at the end of the 32bit range, the emulation of that instruction is going to increment PC past the 32bit range. This isn't great, as userspace can then observe this value and get a bit confused. Conversly, userspace can do things like (in the context of a 64bit guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0, set PC to a 64bit value, change PSTATE to AArch32-USR, and observe that PC hasn't been truncated. More confusion. Fix both by: - truncating PC increments for 32bit guests - sanitizing all 32bit regs every time a core reg is changed by userspace, and that PSTATE indicates a 32bit mode. Cc: stable@vger.kernel.org Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm: vgic: Fix limit condition when writing to GICD_I[CS]ACTIVERMarc Zyngier2020-05-141-2/+2
| | | | | | | | | | | | | | | | | | commit 1c32ca5dc6d00012f0c964e5fdd7042fcc71efb1 upstream. When deciding whether a guest has to be stopped we check whether this is a private interrupt or not. Unfortunately, there's an off-by-one bug here, and we fail to recognize a whole range of interrupts as being global (GICv2 SPIs 32-63). Fix the condition from > to be >=. Cc: stable@vger.kernel.org Fixes: abd7229626b93 ("KVM: arm/arm64: Simplify active_change_prepare and plug race") Reported-by: André Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/kvm: Cache gfn to pfn translationBoris Ostrovsky2020-04-291-19/+79
| | | | | | | | | | | | | | | | | | commit 917248144db5d7320655dbb41d3af0b8a0f3d589 upstream. __kvm_map_gfn()'s call to gfn_to_pfn_memslot() is * relatively expensive * in certain cases (such as when done from atomic context) cannot be called Stashing gfn-to-pfn mapping should help with both cases. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/kvm: Introduce kvm_(un)map_gfn()Boris Ostrovsky2020-04-291-5/+24
| | | | | | | | | | | | | | | | | commit 1eff70a9abd46f175defafd29bc17ad456f398a7 upstream. kvm_vcpu_(un)map operates on gfns from any current address space. In certain cases we want to make sure we are not mapping SMRAM and for that we can use kvm_(un)map_gfn() that we are introducing in this patch. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: Properly check if "page" is valid in kvm_vcpu_unmapKarimAllah Ahmed2020-04-291-1/+1
| | | | | | | | | | | | | | commit b614c6027896ff9ad6757122e84760d938cab15e upstream. The field "page" is initialized to KVM_UNMAPPED_PAGE when it is not used (i.e. when the memory lives outside kernel control). So this check will always end up using kunmap even for memremap regions. Fixes: e45adf665a53 ("KVM: Introduce a new guest mapping API") Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* kvm: fix compile on s390 part 2Christian Borntraeger2020-04-291-0/+2
| | | | | | | | | | | | | | commit eb1f2f387db8c0d084581fb26e7faffde700bc8e upstream. We also need to fence the memunmap part. Fixes: e45adf665a53 ("KVM: Introduce a new guest mapping API") Fixes: d30b214d1d0a (kvm: fix compilation on s390) Cc: Michal Kubecek <mkubecek@suse.cz> Cc: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* kvm: fix compilation on s390Paolo Bonzini2020-04-291-0/+2
| | | | | | | | | | | commit d30b214d1d0addb7b2c9c78178d1501cd39a01fb upstream. s390 does not have memremap, even though in this particular case it would be useful. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* kvm: fix compilation on aarch64Paolo Bonzini2020-04-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c011d23ba046826ccf8c4a4a6c1d01c9ccaa1403 upstream. Commit e45adf665a53 ("KVM: Introduce a new guest mapping API", 2019-01-31) introduced a build failure on aarch64 defconfig: $ make -j$(nproc) ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- O=out defconfig \ Image.gz ... ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_map_gfn': ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:9: error: implicit declaration of function 'memremap'; did you mean 'memset_p'? ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1763:46: error: 'MEMREMAP_WB' undeclared (first use in this function) ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_vcpu_unmap': ../arch/arm64/kvm/../../../virt/kvm/kvm_main.c:1795:3: error: implicit declaration of function 'memunmap'; did you mean 'vm_munmap'? because these functions are declared in <linux/io.h> rather than <asm/io.h>, and the former was being pulled in already on x86 but not on aarch64. Reported-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 4.19: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: Introduce a new guest mapping APIKarimAllah Ahmed2020-04-291-0/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e45adf665a53df0db37f784ed87c6b57ddd81885 upstream. In KVM, specially for nested guests, there is a dominant pattern of: => map guest memory -> do_something -> unmap guest memory In addition to all this unnecessarily noise in the code due to boiler plate code, most of the time the mapping function does not properly handle memory that is not backed by "struct page". This new guest mapping API encapsulate most of this boiler plate code and also handles guest memory that is not backed by "struct page". The current implementation of this API is using memremap for memory that is not backed by a "struct page" which would lead to a huge slow-down if it was used for high-frequency mapping operations. The API does not have any effect on current setups where guest memory is backed by a "struct page". Further patches are going to also introduce a pfn-cache which would significantly improve the performance of the memremap case. Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [bwh: Backported to 4.19 as dependency of commit 1eff70a9abd4 "x86/kvm: Introduce kvm_(un)map_gfn()"] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: Check for a bad hva before dropping into the ghc slow pathSean Christopherson2020-03-051-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit fcfbc617547fc6d9552cb6c1c563b6a90ee98085 upstream. When reading/writing using the guest/host cache, check for a bad hva before checking for a NULL memslot, which triggers the slow path for handing cross-page accesses. Because the memslot is nullified on error by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after crossing into a new page, then the kvm_{read,write}_guest() slow path could potentially write/access the first chunk prior to detecting the bad hva. Arguably, performing a partial access is semantically correct from an architectural perspective, but that behavior is certainly not intended. In the original implementation, memslot was not explicitly nullified and therefore the partial access behavior varied based on whether the memslot itself was null, or if the hva was simply bad. The current behavior was introduced as a seemingly unintentional side effect in commit f1b9dd5eb86c ("kvm: Disallow wraparound in kvm_gfn_to_hva_cache_init"), which justified the change with "since some callers don't check the return code from this function, it sit seems prudent to clear ghc->memslot in the event of an error". Regardless of intent, the partial access is dependent on _not_ checking the result of the cache initialization, which is arguably a bug in its own right, at best simply weird. Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.") Cc: Jim Mattson <jmattson@google.com> Cc: Andrew Honig <ahonig@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm64: pmu: Don't increment SW_INCR if PMCR.E is unsetEric Auger2020-02-141-0/+3
| | | | | | | | | | | | | | | | | | | | commit 3837407c1aa1101ed5e214c7d6041e7a23335c6e upstream. The specification says PMSWINC increments PMEVCNTR<n>_EL1 by 1 if PMEVCNTR<n>_EL0 is enabled and configured to count SW_INCR. For PMEVCNTR<n>_EL0 to be enabled, we need both PMCNTENSET to be set for the corresponding event counter but we also need the PMCR.E bit to be set. Fixes: 7a0adc7064b8 ("arm64: KVM: Add access handler for PMSWINC register") Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Andrew Murray <andrew.murray@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200124142535.29386-2-eric.auger@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm: Make inject_abt32() inject an external abort insteadJames Morse2020-02-141-3/+7
| | | | | | | | | | | | | | | | | | | | | | commit 21aecdbd7f3ab02c9b82597dc733ee759fb8b274 upstream. KVM's inject_abt64() injects an external-abort into an aarch64 guest. The KVM_CAP_ARM_INJECT_EXT_DABT is intended to do exactly this, but for an aarch32 guest inject_abt32() injects an implementation-defined exception, 'Lockdown fault'. Change this to external abort. For non-LPAE we now get the documented: | Unhandled fault: external abort on non-linefetch (0x008) at 0x9c800f00 and for LPAE: | Unhandled fault: synchronous external abort (0x210) at 0x9c800f00 Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection") Reported-by: Beata Michalska <beata.michalska@linaro.org> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200121123356.203000-3-james.morse@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm: Fix DFSR setting for non-LPAE aarch32 guestsJames Morse2020-02-141-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 018f22f95e8a6c3e27188b7317ef2c70a34cb2cd upstream. Beata reports that KVM_SET_VCPU_EVENTS doesn't inject the expected exception to a non-LPAE aarch32 guest. The host intends to inject DFSR.FS=0x14 "IMPLEMENTATION DEFINED fault (Lockdown fault)", but the guest receives DFSR.FS=0x04 "Fault on instruction cache maintenance". This fault is hooked by do_translation_fault() since ARMv6, which goes on to silently 'handle' the exception, and restart the faulting instruction. It turns out, when TTBCR.EAE is clear DFSR is split, and FS[4] has to shuffle up to DFSR[10]. As KVM only does this in one place, fix up the static values. We now get the expected: | Unhandled fault: lock abort (0x404) at 0x9c800f00 Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection") Reported-by: Beata Michalska <beata.michalska@linaro.org> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200121123356.203000-2-james.morse@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: Fix young bit from mmu notifierGavin Shan2020-02-141-1/+2
| | | | | | | | | | | | | | | | | | | | commit cf2d23e0bac9f6b5cd1cba8898f5f05ead40e530 upstream. kvm_test_age_hva() is called upon mmu_notifier_test_young(), but wrong address range has been passed to handle_hva_to_gpa(). With the wrong address range, no young bits will be checked in handle_hva_to_gpa(). It means zero is always returned from mmu_notifier_test_young(). This fixes the issue by passing correct address range to the underly function handle_hva_to_gpa(), so that the hardware young (access) bit will be visited. Fixes: 35307b9a5f7e ("arm/arm64: KVM: Implement Stage-2 page aging") Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200121055659.19560-1-gshan@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: vgic-its: Fix restoration of unmapped collectionsEric Auger2020-02-141-1/+2
| | | | | | | | | | | | | | | | | | commit 8c58be34494b7f1b2adb446e2d8beeb90e5de65b upstream. Saving/restoring an unmapped collection is a valid scenario. For example this happens if a MAPTI command was sent, featuring an unmapped collection. At the moment the CTE fails to be restored. Only compare against the number of online vcpus if the rdist base is set. Fixes: ea1ad53e1e31a ("KVM: arm64: vgic-its: Collection table save/restore") Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Zenghui Yu <yuzenghui@huawei.com> Link: https://lore.kernel.org/r/20191213094237.19627-1-eric.auger@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: Play nice with read-only memslots when querying host page sizeSean Christopherson2020-02-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 42cde48b2d39772dba47e680781a32a6c4b7dc33 ] Avoid the "writable" check in __gfn_to_hva_many(), which will always fail on read-only memslots due to gfn_to_hva() assuming writes. Functionally, this allows x86 to create large mappings for read-only memslots that are backed by HugeTLB mappings. Note, the changelog for commit 05da45583de9 ("KVM: MMU: large page support") states "If the largepage contains write-protected pages, a large pte is not used.", but "write-protected" refers to pages that are temporarily read-only, e.g. read-only memslots didn't even exist at the time. Fixes: 4d8b81abc47b ("KVM: introduce readonly memslot") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> [Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: Use vcpu-specific gva->hva translation when querying host page sizeSean Christopherson2020-02-111-2/+2
| | | | | | | | | | | | | [ Upstream commit f9b84e19221efc5f493156ee0329df3142085f28 ] Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the correct set of memslots is used when handling x86 page faults in SMM. Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVMSean Christopherson2020-02-111-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 736c291c9f36b07f8889c61764c28edce20e715d ] Convert a plethora of parameters and variables in the MMU and page fault flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM. Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical addresses. When TDP is enabled, the fault address is a guest physical address and thus can be a 64-bit value, even when both KVM and its guest are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a 64-bit field, not a natural width field. Using a gva_t for the fault address means KVM will incorrectly drop the upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to translate L2 GPAs to L1 GPAs. Opportunistically rename variables and parameters to better reflect the dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain "addr" instead of "vaddr" when the address may be either a GVA or an L2 GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid a confusing "gpa_t gva" declaration; this also sets the stage for a future patch to combing nonpaging_page_fault() and tdp_page_fault() with minimal churn. Sprinkle in a few comments to document flows where an address is known to be a GVA and thus can be safely truncated to a 32-bit value. Add WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help document such cases and detect bugs. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: arm64: Only sign-extend MMIO up to register widthChristoffer Dall2020-02-111-0/+6
| | | | | | | | | | | | | | | | | | | | | | commit b6ae256afd32f96bec0117175b329d0dd617655e upstream. On AArch64 you can do a sign-extended load to either a 32-bit or 64-bit register, and we should only sign extend the register up to the width of the register as specified in the operation (by using the 32-bit Wn or 64-bit Xn register specifier). As it turns out, the architecture provides this decoding information in the SF ("Sixty-Four" -- how cute...) bit. Let's take advantage of this with the usual 32-bit/64-bit header file dance and do the right thing on AArch64 hosts. Signed-off-by: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20191212195055.5541-1-christoffer.dall@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: Correct AArch32 SPSR on exception entryMark Rutland2020-02-111-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1cfbb484de158e378e8971ac40f3082e53ecca55 upstream. Confusingly, there are three SPSR layouts that a kernel may need to deal with: (1) An AArch64 SPSR_ELx view of an AArch64 pstate (2) An AArch64 SPSR_ELx view of an AArch32 pstate (3) An AArch32 SPSR_* view of an AArch32 pstate When the KVM AArch32 support code deals with SPSR_{EL2,HYP}, it's either dealing with #2 or #3 consistently. On arm64 the PSR_AA32_* definitions match the AArch64 SPSR_ELx view, and on arm the PSR_AA32_* definitions match the AArch32 SPSR_* view. However, when we inject an exception into an AArch32 guest, we have to synthesize the AArch32 SPSR_* that the guest will see. Thus, an AArch64 host needs to synthesize layout #3 from layout #2. This patch adds a new host_spsr_to_spsr32() helper for this, and makes use of it in the KVM AArch32 support code. For arm64 we need to shuffle the DIT bit around, and remove the SS bit, while for arm we can use the value as-is. I've open-coded the bit manipulation for now to avoid having to rework the existing PSR_* definitions into PSR64_AA32_* and PSR32_AA32_* definitions. I hope to perform a more thorough refactoring in future so that we can handle pstate view manipulation more consistently across the kernel tree. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200108134324.46500-4-mark.rutland@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: Correct CPSR on exception entryMark Rutland2020-02-111-10/+101
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3c2483f15499b877ccb53250d88addb8c91da147 upstream. When KVM injects an exception into a guest, it generates the CPSR value from scratch, configuring CPSR.{M,A,I,T,E}, and setting all other bits to zero. This isn't correct, as the architecture specifies that some CPSR bits are (conditionally) cleared or set upon an exception, and others are unchanged from the original context. This patch adds logic to match the architectural behaviour. To make this simple to follow/audit/extend, documentation references are provided, and bits are configured in order of their layout in SPSR_EL2. This layout can be seen in the diagram on ARM DDI 0487E.a page C5-426. Note that this code is used by both arm and arm64, and is intended to fuction with the SPSR_EL2 and SPSR_HYP layouts. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200108134324.46500-3-mark.rutland@arm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: arm/arm64: vgic: Don't rely on the wrong pending tableZenghui Yu2019-12-131-3/+3
| | | | | | | | | | | | | | | | | | | | commit ca185b260951d3b55108c0b95e188682d8a507b7 upstream. It's possible that two LPIs locate in the same "byte_offset" but target two different vcpus, where their pending status are indicated by two different pending tables. In such a scenario, using last_byte_offset optimization will lead KVM relying on the wrong pending table entry. Let us use last_ptr instead, which can be treated as a byte index into a pending table and also, can be vcpu specific. Fixes: 280771252c1b ("KVM: arm64: vgic-v3: KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES") Cc: stable@vger.kernel.org Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Acked-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20191029071919.177-4-yuzenghui@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: properly check debugfs dentry before using itGreg Kroah-Hartman2019-12-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8ed0579c12b2fe56a1fac2f712f58fc26c1dc49b ] debugfs can now report an error code if something went wrong instead of just NULL. So if the return value is to be used as a "real" dentry, it needs to be checked if it is an error before dereferencing it. This is now happening because of ff9fb72bc077 ("debugfs: return error values, not NULL"). syzbot has found a way to trigger multiple debugfs files attempting to be created, which fails, and then the error code gets passed to dentry_path_raw() which obviously does not like it. Reported-by: Eric Biggers <ebiggers@kernel.org> Reported-and-tested-by: syzbot+7857962b4d45e602b8ad@syzkaller.appspotmail.com Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: MMU: Do not treat ZONE_DEVICE pages as being reservedSean Christopherson2019-12-011-3/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a78986aae9b2988f8493f9f65a587ee433e83bc3 upstream. Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and instead manually handle ZONE_DEVICE on a case-by-case basis. For things like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal pages, e.g. put pages grabbed via gup(). But for flows such as setting A/D bits or shifting refcounts for transparent huge pages, KVM needs to to avoid processing ZONE_DEVICE pages as the flows in question lack the underlying machinery for proper handling of ZONE_DEVICE pages. This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup() when running a KVM guest backed with /dev/dax memory, as KVM straight up doesn't put any references to ZONE_DEVICE pages acquired by gup(). Note, Dan Williams proposed an alternative solution of doing put_page() on ZONE_DEVICE pages immediately after gup() in order to simplify the auditing needed to ensure is_zone_device_page() is called if and only if the backing device is pinned (via gup()). But that approach would break kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til unmap() when accessing guest memory, unlike KVM's secondary MMU, which coordinates with mmu_notifier invalidations to avoid creating stale page references, i.e. doesn't rely on pages being pinned. [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl Reported-by: Adam Borowski <kilobyte@angband.pl> Analyzed-by: David Hildenbrand <david@redhat.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: stable@vger.kernel.org Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [sean: backport to 4.x; resolve conflict in mmu.c] Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page tableSuzuki K Poulose2019-11-241-1/+2
| | | | | | | | | | | | | | | | [ Upstream commit d2db7773ba864df6b4e19643dfc54838550d8049 ] So far we have only supported 3 level page table with fixed IPA of 40bits, where PUD is folded. With 4 level page tables, we need to check if the PUD entry is valid or not. Fix stage2_flush_memslot() to do this check, before walking down the table. Acked-by: Christoffer Dall <cdall@kernel.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* kvm: x86: mmu: Recovery of shattered NX large pagesJunaid Shahid2019-11-121-1/+29
| | | | | | | | | | | | | | | | | | | commit 1aa9b9572b10529c2e64e2b8f44025d86e124308 upstream. The page table pages corresponding to broken down large pages are zapped in FIFO order, so that the large page can potentially be recovered, if it is not longer being used for execution. This removes the performance penalty for walking deeper EPT page tables. By default, one large page will last about one hour once the guest reaches a steady state. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: Add helper function for creating VM worker threadsJunaid Shahid2019-11-121-0/+84
| | | | | | | | | | | | | | | commit c57c80467f90e5504c8df9ad3555d2c78800bf94 upstream. Add a function to create a kernel thread associated with a given VM. In particular, it ensures that the worker thread inherits the priority and cgroups of the calling thread. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>