summaryrefslogtreecommitdiffstats
path: root/virt/kvm/kvm_main.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2014-06-121-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more scheduler updates from Ingo Molnar: "Second round of scheduler changes: - try-to-wakeup and IPI reduction speedups, from Andy Lutomirski - continued power scheduling cleanups and refactorings, from Nicolas Pitre - misc fixes and enhancements" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Delete extraneous extern for to_ratio() sched/idle: Optimize try-to-wake-up IPI sched/idle: Simplify wake_up_idle_cpu() sched/idle: Clear polling before descheduling the idle thread sched, trace: Add a tracepoint for IPI-less remote wakeups cpuidle: Set polling in poll_idle sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...) sched: Rename capacity related flags sched: Final power vs. capacity cleanups sched: Remove remaining dubious usage of "power" sched: Let 'struct sched_group_power' care about CPU capacity sched/fair: Disambiguate existing/remaining "capacity" usage sched/fair: Change "has_capacity" to "has_free_capacity" sched/fair: Remove "power" from 'struct numa_stats' sched: Fix signedness bug in yield_to() sched/fair: Use time_after() in record_wakee() sched/balancing: Reduce the rate of needless idle load balancing sched/fair: Fix unlocked reads of some cfs_b->quota/period
| * sched: Fix signedness bug in yield_to()Dan Carpenter2014-06-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | yield_to() is supposed to return -ESRCH if there is no task to yield to, but because the type is bool that is the same as returning true. The only place I see which cares is kvm_vcpu_on_spin(). Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Raghavendra <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | KVM: add missing cleanup_srcu_structPaolo Bonzini2014-06-031-0/+1
| | | | | | | | | | Reported-by: hrg <hrgstephen@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | kvm/irqchip: Speed up KVM_SET_GSI_ROUTINGChristian Borntraeger2014-05-051-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When starting lots of dataplane devices the bootup takes very long on Christian's s390 with irqfd patches. With larger setups he is even able to trigger some timeouts in some components. Turns out that the KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec) when having multiple CPUs. This is caused by the synchronize_rcu and the HZ=100 of s390. By changing the code to use a private srcu we can speed things up. This patch reduces the boot time till mounting root from 8 to 2 seconds on my s390 guest with 100 disks. Uses of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu are fine because they do not have lockdep checks (hlist_for_each_entry_rcu uses rcu_dereference_raw rather than rcu_dereference, and write-sides do not do rcu lockdep at all). Note that we're hardly relying on the "sleepable" part of srcu. We just want SRCU's faster detection of grace periods. Testing was done by Andrew Theurer using netperf tests STREAM, MAERTS and RR. The difference between results "before" and "after" the patch has mean -0.2% and standard deviation 0.6%. Using a paired t-test on the data points says that there is a 2.5% probability that the patch is the cause of the performance difference (rather than a random fluctuation). (Restricting the t-test to RR, which is the most likely to be affected, changes the numbers to respectively -0.3% mean, 0.7% stdev, and 8% probability that the numbers actually say something about the patch. The probability increases mostly because there are fewer data points). Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390 Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | Revert "KVM: Simplify kvm->tlbs_dirty handling"Xiao Guangrong2014-04-231-1/+4
| | | | | | | | | | | | | | | | | | | | This reverts commit 5befdc385ddb2d5ae8995ad89004529a3acf58fc. Since we will allow flush tlb out of mmu-lock in the later patch Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* | Merge tag 'kvm-s390-20140422' of ↵Marcelo Tosatti2014-04-221-2/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into queue Lazy storage key handling ------------------------- Linux does not use the ACC and F bits of the storage key. Newer Linux versions also do not use the storage keys for dirty and reference tracking. We can optimize the guest handling for those guests for faults as well as page-in and page-out by simply not caring about the guest visible storage key. We trap guest storage key instruction to enable those keys only on demand. Migration bitmap Until now s390 never provided a proper dirty bitmap. Let's provide a proper migration bitmap for s390. We also change the user dirty tracking to a fault based mechanism. This makes the host completely independent from the storage keys. Long term this will allow us to back guest memory with large pages. per-VM device attributes ------------------------ To avoid the introduction of new ioctls, let's provide the attribute semanantic also on the VM-"device". Userspace controlled CMMA ------------------------- The CMMA assist is changed from "always on" to "on if requested" via per-VM device attributes. In addition a callback to reset all usage states is provided. Proper guest DAT handling for intercepts ---------------------------------------- While instructions handled by SIE take care of all addressing aspects, KVM/s390 currently does not care about guest address translation of intercepts. This worked out fine, because - the s390 Linux kernel has a 1:1 mapping between kernel virtual<->real for all pages up to memory size - intercepts happen only for a small amount of cases - all of these intercepts happen to be in the kernel text for current distros Of course we need to be better for other intercepts, kernel modules etc. We provide the infrastructure and rework all in-kernel intercepts to work on logical addresses (paging etc) instead of real ones. The code has been running internally for several months now, so it is time for going public. GDB support ----------- We provide breakpoints, single stepping and watchpoints. Fixes/Cleanups -------------- - Improve program check delivery - Factor out the handling of transactional memory on program checks - Use the existing define __LC_PGM_TDB - Several cleanups in the lowcore structure - Documentation NOTES ----- - All patches touching base s390 are either ACKed or written by the s390 maintainers - One base KVM patch "KVM: add kvm_is_error_gpa() helper" - One patch introduces the notion of VM device attributes Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Conflicts: include/uapi/linux/kvm.h
| * | KVM: s390: Add proper dirty bitmap support to S390 kvm.Jason J. Herne2014-04-221-2/+0
| |/ | | | | | | | | | | | | | | | | | | | | Replace the kvm_s390_sync_dirty_log() stub with code to construct the KVM dirty_bitmap from S390 memory change bits. Also add code to properly clear the dirty_bitmap size when clearing the bitmap. Signed-off-by: Jason J. Herne <jjherne@us.ibm.com> CC: Dominik Dingel <dingel@linux.vnet.ibm.com> [Dominik Dingel: use gmap_test_and_clear_dirty, locking fixes] Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
* / KVM: VMX: speed up wildcard MMIO EVENTFDMichael S. Tsirkin2014-04-171-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With KVM, MMIO is much slower than PIO, due to the need to do page walk and emulation. But with EPT, it does not have to be: we know the address from the VMCS so if the address is unique, we can look up the eventfd directly, bypassing emulation. Unfortunately, this only works if userspace does not need to match on access length and data. The implementation adds a separate FAST_MMIO bus internally. This serves two purposes: - minimize overhead for old userspace that does not use eventfd with lengtth = 0 - minimize disruption in other code (since we don't know the length, devices on the MMIO bus only get a valid address in write, this way we don't need to touch all devices to teach them to handle an invalid length) At the moment, this optimization only has effect for EPT on x86. It will be possible to speed up MMIO for NPT and MMU using the same idea in the future. With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO. I was unable to detect any measureable slowdown to non-eventfd MMIO. Making MMIO faster is important for the upcoming virtio 1.0 which includes an MMIO signalling capability. The idea was suggested by Peter Anvin. Lots of thanks to Gleb for pre-review and suggestions. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* Merge tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2014-04-021-5/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Paolo Bonzini: "PPC and ARM do not have much going on this time. Most of the cool stuff, instead, is in s390 and (after a few releases) x86. ARM has some caching fixes and PPC has transactional memory support in guests. MIPS has some fixes, with more probably coming in 3.16 as QEMU will soon get support for MIPS KVM. For x86 there are optimizations for debug registers, which trigger on some Windows games, and other important fixes for Windows guests. We now expose to the guest Broadwell instruction set extensions and also Intel MPX. There's also a fix/workaround for OS X guests, nested virtualization features (preemption timer), and a couple kvmclock refinements. For s390, the main news is asynchronous page faults, together with improvements to IRQs (floating irqs and adapter irqs) that speed up virtio devices" * tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (96 commits) KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8 KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode KVM: PPC: Book3S HV: Return ENODEV error rather than EIO KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state KVM: PPC: Book3S HV: Add transactional memory support KVM: Specify byte order for KVM_EXIT_MMIO KVM: vmx: fix MPX detection KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write KVM: s390: clear local interrupts at cpu initial reset KVM: s390: Fix possible memory leak in SIGP functions KVM: s390: fix calculation of idle_mask array size KVM: s390: randomize sca address KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP KVM: Bump KVM_MAX_IRQ_ROUTES for s390 KVM: s390: irq routing for adapter interrupts. KVM: s390: adapter interrupt sources ...
| * KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loopMichael Mueller2014-02-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | Use the arch specific function kvm_arch_vcpu_runnable() to add a further criterium to identify a suitable vcpu to yield to during undirected yield processing. Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: Simplify kvm->tlbs_dirty handlingTakuya Yoshikawa2014-02-181-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | When this was introduced, kvm_flush_remote_tlbs() could be called without holding mmu_lock. It is now acknowledged that the function must be called before releasing mmu_lock, and all callers have already been changed to do so. There is no need to use smp_mb() and cmpxchg() any more. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: s390: add floating irq controllerJens Freimann2014-01-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | This patch adds a floating irq controller as a kvm_device. It will be necessary for migration of floating interrupts as well as for hardening the reset code by allowing user space to explicitly remove all pending floating interrupts. Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
* | asmlinkage, kvm: Make kvm_rebooting visibleAndi Kleen2014-02-131-1/+1
|/ | | | | | | | | | | kvm_rebooting is referenced from assembler code, thus needs to be visible. Cc: Gleb Natapov <gleb@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Link: http://lkml.kernel.org/r/1391845930-28580-1-git-send-email-ak@linux.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2014-01-221-54/+27
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "First round of KVM updates for 3.14; PPC parts will come next week. Nothing major here, just bugfixes all over the place. The most interesting part is the ARM guys' virtualized interrupt controller overhaul, which lets userspace get/set the state and thus enables migration of ARM VMs" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits) kvm: make KVM_MMU_AUDIT help text more readable KVM: s390: Fix memory access error detection KVM: nVMX: Update guest activity state field on L2 exits KVM: nVMX: Fix nested_run_pending on activity state HLT KVM: nVMX: Clean up handling of VMX-related MSRs KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit KVM: nVMX: Leave VMX mode on clearing of feature control MSR KVM: VMX: Fix DR6 update on #DB exception KVM: SVM: Fix reading of DR6 KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS add support for Hyper-V reference time counter KVM: remove useless write to vcpu->hv_clock.tsc_timestamp KVM: x86: fix tsc catchup issue with tsc scaling KVM: x86: limit PIT timer frequency KVM: x86: handle invalid root_hpa everywhere kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub kvm: vfio: silence GCC warning KVM: ARM: Remove duplicate include arm/arm64: KVM: relax the requirements of VMA alignment for THP ...
| * kvm: Provide kvm_vcpu_eligible_for_directed_yield() stubScott Wood2014-01-151-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7940876e1330671708186ac3386aa521ffb5c182 ("kvm: make local functions static") broke KVM PPC builds due to removing (rather than moving) the stub version of kvm_vcpu_eligible_for_directed_yield(). This patch reintroduces it. Signed-off-by: Scott Wood <scottwood@freescale.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Alexander Graf <agraf@suse.de> [Move the #ifdef inside the function. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: remove dead codeStephen Hemminger2014-01-081-27/+0
| | | | | | | | | | | | | | | | The function kvm_io_bus_read_cookie is defined but never used in current in-tree code. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * kvm: make local functions staticStephen Hemminger2014-01-081-17/+18
| | | | | | | | | | | | | | | | Running 'make namespacecheck' found lots of functions that should be declared static, since only used in one file. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * KVM: arm-vgic: Support KVM_CREATE_DEVICE for VGICChristoffer Dall2013-12-211-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support creating the ARM VGIC device through the KVM_CREATE_DEVICE ioctl, which can then later be leveraged to use the KVM_{GET/SET}_DEVICE_ATTR, which is useful both for setting addresses in a more generic API than the ARM-specific one and is useful for save/restore of VGIC state. Adds KVM_CAP_DEVICE_CTRL to ARM capabilities. Note that we change the check for creating a VGIC from bailing out if any VCPUs were created, to bailing out if any VCPUs were ever run. This is an important distinction that shouldn't break anything, but allows creating the VGIC after the VCPUs have been created. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
| * KVM: Use cond_resched() directly and remove useless kvm_resched()Takuya Yoshikawa2013-12-131-8/+0
| | | | | | | | | | | | | | | | | | | | Since the commit 15ad7146 ("KVM: Use the scheduler preemption notifiers to make kvm preemptible"), the remaining stuff in this function is a simple cond_resched() call with an extra need_resched() check which was there to avoid dropping VCPUs unnecessarily. Now it is meaningless. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: Improve create VCPU parameter (CVE-2013-4587)Andy Honig2013-12-121-0/+3
|/ | | | | | | | | | | | | | In multiple functions the vcpu_id is used as an offset into a bitfield. Ag malicious user could specify a vcpu_id greater than 255 in order to set or clear bits in kernel memory. This could be used to elevate priveges in the kernel. This patch verifies that the vcpu_id provided is less than 255. The api documentation already specifies that the vcpu_id must be less than max_vcpus, but this is currently not checked. Reported-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: kvm_clear_guest_page(): fix empty_zero_page usageHeiko Carstens2013-11-211-2/+3
| | | | | | | | | | | | | | Using the address of 'empty_zero_page' as source address in order to clear a page is wrong. On some architectures empty_zero_page is only the pointer to the struct page of the empty_zero_page. Therefore the clear page operation would copy the contents of a couple of struct pages instead of clearing a page. For kvm only arm/arm64 are affected by this bug. To fix this use the ZERO_PAGE macro instead which will return the struct page address of the empty_zero_page on all architectures. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2013-11-151-79/+55
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM changes from Paolo Bonzini: "Here are the 3.13 KVM changes. There was a lot of work on the PPC side: the HV and emulation flavors can now coexist in a single kernel is probably the most interesting change from a user point of view. On the x86 side there are nested virtualization improvements and a few bugfixes. ARM got transparent huge page support, improved overcommit, and support for big endian guests. Finally, there is a new interface to connect KVM with VFIO. This helps with devices that use NoSnoop PCI transactions, letting the driver in the guest execute WBINVD instructions. This includes some nVidia cards on Windows, that fail to start without these patches and the corresponding userspace changes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits) kvm, vmx: Fix lazy FPU on nested guest arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu arm/arm64: KVM: MMIO support for BE guest kvm, cpuid: Fix sparse warning kvm: Delete prototype for non-existent function kvm_check_iopl kvm: Delete prototype for non-existent function complete_pio hung_task: add method to reset detector pvclock: detect watchdog reset at pvclock read kvm: optimize out smp_mb after srcu_read_unlock srcu: API for barrier after srcu read unlock KVM: remove vm mmap method KVM: IOMMU: hva align mapping page size KVM: x86: trace cpuid emulation when called from emulator KVM: emulator: cleanup decode_register_operand() a bit KVM: emulator: check rex prefix inside decode_register() KVM: x86: fix emulation of "movzbl %bpl, %eax" kvm_host: typo fix KVM: x86: emulate SAHF instruction MAINTAINERS: add tree for kvm.git Documentation/kvm: add a 00-INDEX file ...
| * KVM: remove vm mmap methodGleb Natapov2013-11-061-32/+0
| | | | | | | | | | | | | | | | It was used in conjunction with KVM_SET_MEMORY_REGION ioctl which was removed by b74a07beed0 in 2010, QEMU stopped using it in 2008, so it is time to remove the code finally. Signed-off-by: Gleb Natapov <gleb@redhat.com>
| * Merge branch 'kvm-ppc-queue' of git://github.com/agraf/linux-2.6 into queueGleb Natapov2013-11-041-6/+10
| |\ | | | | | | | | | | | | Conflicts: arch/powerpc/include/asm/processor.h
| | * kvm: Add struct kvm arg to memslot APIsAneesh Kumar K.V2013-10-171-6/+6
| | | | | | | | | | | | | | | | | | | | | We will use that in the later patch to find the kvm ops handler Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * kvm: powerpc: book3s: Support building HV and PR KVM as moduleAneesh Kumar K.V2013-10-171-0/+4
| | | | | | | | | | | | | | | | | | Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: squash in compile fix] Signed-off-by: Alexander Graf <agraf@suse.de>
| * | kvm: Add VFIO deviceAlex Williamson2013-10-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far we've succeeded at making KVM and VFIO mostly unaware of each other, but areas are cropping up where a connection beyond eventfds and irqfds needs to be made. This patch introduces a KVM-VFIO device that is meant to be a gateway for such interaction. The user creates the device and can add and remove VFIO groups to it via file descriptors. When a group is added, KVM verifies the group is valid and gets a reference to it via the VFIO external user interface. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: Mapping IOMMU pages after updating memslotYang Zhang2013-10-281-15/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In kvm_iommu_map_pages(), we need to know the page size via call kvm_host_page_size(). And it will check whether the target slot is valid before return the right page size. Currently, we will map the iommu pages when creating a new slot. But we call kvm_iommu_map_pages() during preparing the new slot. At that time, the new slot is not visible by domain(still in preparing). So we cannot get the right page size from kvm_host_page_size() and this will break the IOMMU super page logic. The solution is to map the iommu pages after we insert the new slot into domain. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Tested-by: Patrick Lu <patrick.lu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | Powerpc KVM work is based on a commit after rc4.Gleb Natapov2013-10-171-2/+4
| |\ \ | | |/ | |/| | | | | | | | | | | | | Merging master into next to satisfy the dependencies. Conflicts: arch/arm/kvm/reset.c
| * | KVM: Convert kvm_lock back to non-raw spinlockPaolo Bonzini2013-09-301-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"), the kvm_lock was made a raw lock. However, the kvm mmu_shrink() function tries to grab the (non-raw) mmu_lock within the scope of the raw locked kvm_lock being held. This leads to the following: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0 Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm] Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt Call Trace: [<ffffffff8106f2ad>] __might_sleep+0xfd/0x160 [<ffffffff817d8d64>] rt_spin_lock+0x24/0x50 [<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm] [<ffffffff8111455d>] shrink_slab+0x17d/0x3a0 [<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260 [<ffffffff8111824a>] balance_pgdat+0x54a/0x730 [<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0 [<ffffffff811185bf>] kswapd+0x18f/0x490 [<ffffffff81070961>] ? get_parent_ip+0x11/0x50 [<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50 [<ffffffff81118430>] ? balance_pgdat+0x730/0x730 [<ffffffff81060d2b>] kthread+0xdb/0xe0 [<ffffffff8106e122>] ? finish_task_switch+0x52/0x100 [<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10 [<ffffffff81060c50>] ? __init_kthread_worker+0x After the previous patch, kvm_lock need not be a raw spinlock anymore, so change it back. Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: kvm@vger.kernel.org Cc: gleb@redhat.com Cc: jan.kiszka@siemens.com Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: protect kvm_usage_count with its own spinlockPaolo Bonzini2013-09-301-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The VM list need not be protected by a raw spinlock. Separate the two so that kvm_lock can be made non-raw. Cc: kvm@vger.kernel.org Cc: gleb@redhat.com Cc: jan.kiszka@siemens.com Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: cleanup (physical) CPU hotplugPaolo Bonzini2013-09-301-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the useless argument, and do not do anything if there are no VMs running at the time of the hotplug. Cc: kvm@vger.kernel.org Cc: gleb@redhat.com Cc: jan.kiszka@siemens.com Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | KVM: use a more sensible error number when debugfs directory creation failsPaolo Bonzini2013-10-301-1/+1
| |/ |/| | | | | | | | | | | | | | | | | | | I don't know if this was due to cut and paste, or somebody was really using a D20 to pick the error code for kvm_init_debugfs as suggested by Linus (EFAULT is 14, so the possibility cannot be entirely ruled out). In any case, this patch fixes it. Reported-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | Fix NULL dereference in gfn_to_hva_prot()Gleb Natapov2013-10-031-2/+4
|/ | | | | | | | gfn_to_memslot() can return NULL or invalid slot. We need to check slot validity before accessing it. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: mmu: allow page tables to be in read-only slotsPaolo Bonzini2013-09-171-5/+9
| | | | | | | | | | | | | Page tables in a read-only memory slot will currently cause a triple fault because the page walker uses gfn_to_hva and it fails on such a slot. OVMF uses such a page table; however, real hardware seems to be fine with that as long as the accessed/dirty bits are set. Save whether the slot is readonly, and later check it when updating the accessed and dirty bits. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmpPaolo Bonzini2013-08-281-8/+8
| | | | | | | | This is the type-safe comparison function, so the double-underscore is not related. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* kvm: optimize away THP checks in kvm_is_mmio_pfn()Andrea Arcangeli2013-08-271-22/+2
| | | | | | | | | | | | | | | | | | | | | | | | | The checks on PG_reserved in the page structure on head and tail pages aren't necessary because split_huge_page wouldn't transfer the PG_reserved bit from head to tail anyway. This was a forward-thinking check done in the case PageReserved was set by a driver-owned page mapped in userland with something like remap_pfn_range in a VM_PFNMAP region, but using hugepmds (not possible right now). It was meant to be very safe, but it's overkill as it's unlikely split_huge_page could ever run without the driver noticing and tearing down the hugepage itself. And if a driver in the future will really want to map a reserved hugepage in userland using an huge pmd it should simply take care of marking all subpages reserved too to keep KVM safe. This of course would require such a hypothetical driver to tear down the huge pmd itself and splitting the hugepage itself, instead of relaying on split_huge_page, but that sounds very reasonable, especially considering split_huge_page wouldn't currently transfer the reserved bit anyway. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* kvm: use anon_inode_getfd() with O_CLOEXEC flagYann Droneaud2013-08-261-3/+3
| | | | | | | | | | | | | | | | | | KVM uses anon_inode_get() to allocate file descriptors as part of some of its ioctls. But those ioctls are lacking a flag argument allowing userspace to choose options for the newly opened file descriptor. In such case it's advised to use O_CLOEXEC by default so that userspace is allowed to choose, without race, if the file descriptor is going to be inherited across exec(). This patch set O_CLOEXEC flag on all file descriptors created with anon_inode_getfd() to not leak file descriptors across exec(). Signed-off-by: Yann Droneaud <ydroneaud@opteya.com> Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.com Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: introduce __kvm_io_bus_sort_cmpPaolo Bonzini2013-07-291-9/+12
| | | | | | | | | kvm_io_bus_sort_cmp is used also directly, not just as a callback for sort and bsearch. In these cases, it is handy to have a type-safe variant. This patch introduces such a variant, __kvm_io_bus_sort_cmp, and uses it throughout kvm_main.c. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: Introduce kvm_arch_memslots_updated()Takuya Yoshikawa2013-07-181-1/+4
| | | | | | | | | | | | | | | | This is called right after the memslots is updated, i.e. when the result of update_memslots() gets installed in install_new_memslots(). Since the memslots needs to be updated twice when we delete or move a memslot, kvm_arch_commit_memory_region() does not correspond to this exactly. In the following patch, x86 will use this new API to check if the mmio generation has reached its maximum value, in which case mmio sptes need to be flushed out. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Acked-by: Alexander Graf <agraf@suse.de> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: kvm-io: support cookiesCornelia Huck2013-07-181-16/+92
| | | | | | | | | | Add new functions kvm_io_bus_{read,write}_cookie() that allows users of the kvm io infrastructure to use a cookie value to speed up lookup of a device on an io bus. Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* kvm: exclude ioeventfd from counting kvm_io_range limitAmos Kong2013-06-041-1/+2
| | | | | | | | | | | | | | | | | | | | We can easily reach the 1000 limit by start VM with a couple hundred I/O devices (multifunction=on). The hardcode limit already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000). In userspace, we already have maximum file descriptor to limit ioeventfd count. But kvm_io_bus devices also are used for pit, pic, ioapic, coalesced_mmio. They couldn't be limited by maximum file descriptor. Currently only ioeventfds take too much kvm_io_bus devices, so just exclude it from counting kvm_io_range limit. Also fixed one indent issue in kvm_host.h Signed-off-by: Amos Kong <akong@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* KVM: add missing misc_deregister() on error in kvm_init()Wei Yongjun2013-05-121-0/+1
| | | | | | | | Add the missing misc_deregister() before return from kvm_init() in the debugfs init error handling case. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* Merge tag 'kvm-3.10-2' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2013-05-101-5/+13
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Gleb Natapov: "Most of the fixes are in the emulator since now we emulate more than we did before for correctness sake we see more bugs there, but there is also an OOPS fixed and corruption of xcr0 register." * tag 'kvm-3.10-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: emulator: emulate SALC KVM: emulator: emulate XLAT KVM: emulator: emulate AAM KVM: VMX: fix halt emulation while emulating invalid guest sate KVM: Fix kvm_irqfd_init initialization KVM: x86: fix maintenance of guest/host xcr0 state
| * KVM: Fix kvm_irqfd_init initializationAsias He2013-05-081-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit a0f155e96 'KVM: Initialize irqfd from kvm_init()', when kvm_init() is called the second time (e.g kvm-amd.ko and kvm-intel.ko), kvm_arch_init() will fail with -EEXIST, then kvm_irqfd_exit() will be called on the error handling path. This way, the kvm_irqfd system will not be ready. This patch fix the following: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81c0721e>] _raw_spin_lock+0xe/0x30 PGD 0 Oops: 0002 [#1] SMP Modules linked in: vhost_net CPU 6 Pid: 4257, comm: qemu-system-x86 Not tainted 3.9.0-rc3+ #757 Dell Inc. OptiPlex 790/0V5HMK RIP: 0010:[<ffffffff81c0721e>] [<ffffffff81c0721e>] _raw_spin_lock+0xe/0x30 RSP: 0018:ffff880221721cc8 EFLAGS: 00010046 RAX: 0000000000000100 RBX: ffff88022dcc003f RCX: ffff880221734950 RDX: ffff8802208f6ca8 RSI: 000000007fffffff RDI: 0000000000000000 RBP: ffff880221721cc8 R08: 0000000000000002 R09: 0000000000000002 R10: 00007f7fd01087e0 R11: 0000000000000246 R12: ffff8802208f6ca8 R13: 0000000000000080 R14: ffff880223e2a900 R15: 0000000000000000 FS: 00007f7fd38488e0(0000) GS:ffff88022dcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000022309f000 CR4: 00000000000427e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process qemu-system-x86 (pid: 4257, threadinfo ffff880221720000, task ffff880222bd5640) Stack: ffff880221721d08 ffffffff810ac5c5 ffff88022431dc00 0000000000000086 0000000000000080 ffff880223e2a900 ffff8802208f6ca8 0000000000000000 ffff880221721d48 ffffffff810ac8fe 0000000000000000 ffff880221734000 Call Trace: [<ffffffff810ac5c5>] __queue_work+0x45/0x2d0 [<ffffffff810ac8fe>] queue_work_on+0x8e/0xa0 [<ffffffff810ac949>] queue_work+0x19/0x20 [<ffffffff81009b6b>] irqfd_deactivate+0x4b/0x60 [<ffffffff8100a69d>] kvm_irqfd+0x39d/0x580 [<ffffffff81007a27>] kvm_vm_ioctl+0x207/0x5b0 [<ffffffff810c9545>] ? update_curr+0xf5/0x180 [<ffffffff811b66e8>] do_vfs_ioctl+0x98/0x550 [<ffffffff810c1f5e>] ? finish_task_switch+0x4e/0xe0 [<ffffffff81c054aa>] ? __schedule+0x2ea/0x710 [<ffffffff811b6bf7>] sys_ioctl+0x57/0x90 [<ffffffff8140ae9e>] ? trace_hardirqs_on_thunk+0x3a/0x3c [<ffffffff81c0f602>] system_call_fastpath+0x16/0x1b Code: c1 ea 08 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f b6 03 38 c2 75 f7 48 83 c4 08 5b c9 c3 55 48 89 e5 66 66 66 66 90 b8 00 01 00 00 <f0> 66 0f c1 07 89 c2 66 c1 ea 08 38 c2 74 0c 0f 1f 00 f3 90 0f RIP [<ffffffff81c0721e>] _raw_spin_lock+0xe/0x30 RSP <ffff880221721cc8> CR2: 0000000000000000 ---[ end trace 13fb1e4b6e5ab21f ]--- Signed-off-by: Asias He <asias@redhat.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
* | Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linusLinus Torvalds2013-05-101-1/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull MIPS updates from Ralf Baechle: - More work on DT support for various platforms - Various fixes that were to late to make it straight into 3.9 - Improved platform support, in particular the Netlogic XLR and BCM63xx, and the SEAD3 and Malta eval boards. - Support for several Ralink SOC families. - Complete support for the microMIPS ASE which basically reencodes the existing MIPS32/MIPS64 ISA to use non-constant size instructions. - Some fallout from LTO work which remove old cruft and will generally make the MIPS kernel easier to maintain and resistant to compiler optimization, even in absence of LTO. - KVM support. While MIPS has announced hardware virtualization extensions this KVM extension uses trap and emulate mode for virtualization of MIPS32. More KVM work to add support for VZ hardware virtualizaiton extensions and MIPS64 will probably already be merged for 3.11. Most of this has been sitting in -next for a long time. All defconfigs have been build or run time tested except three for which fixes are being sent by other maintainers. Semantic conflict with kvm updates done as per Ralf * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (118 commits) MIPS: Add new GIC clockevent driver. MIPS: Formatting clean-ups for clocksources. MIPS: Refactor GIC clocksource code. MIPS: Move 'gic_frequency' to common location. MIPS: Move 'gic_present' to common location. MIPS: MIPS16e: Add unaligned access support. MIPS: MIPS16e: Support handling of delay slots. MIPS: MIPS16e: Add instruction formats. MIPS: microMIPS: Optimise 'strnlen' core library function. MIPS: microMIPS: Optimise 'strlen' core library function. MIPS: microMIPS: Optimise 'strncpy' core library function. MIPS: microMIPS: Optimise 'memset' core library function. MIPS: microMIPS: Add configuration option for microMIPS kernel. MIPS: microMIPS: Disable LL/SC and fix linker bug. MIPS: microMIPS: Add vdso support. MIPS: microMIPS: Add unaligned access support. MIPS: microMIPS: Support handling of delay slots. MIPS: microMIPS: Add support for exception handling. MIPS: microMIPS: Floating point support. MIPS: microMIPS: Fix macro naming in micro-assembler. ...
| * Merge branch 'next/kvm' into mips-for-linux-nextRalf Baechle2013-05-091-1/+1
| |\
| | * KVM/MIPS32: Do not call vcpu_load when injecting interrupts.Sanjay Lal2013-05-091-1/+1
| | | | | | | | | | | | | | | | | | | | | Signed-off-by: Sanjay Lal <sanjayl@kymasys.com> Cc: kvm@vger.kernel.org Cc: linux-mips@linux-mips.org Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
* | | Merge tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2013-05-051-47/+211
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Gleb Natapov: "Highlights of the updates are: general: - new emulated device API - legacy device assignment is now optional - irqfd interface is more generic and can be shared between arches x86: - VMCS shadow support and other nested VMX improvements - APIC virtualization and Posted Interrupt hardware support - Optimize mmio spte zapping ppc: - BookE: in-kernel MPIC emulation with irqfd support - Book3S: in-kernel XICS emulation (incomplete) - Book3S: HV: migration fixes - BookE: more debug support preparation - BookE: e6500 support ARM: - reworking of Hyp idmaps s390: - ioeventfd for virtio-ccw And many other bug fixes, cleanups and improvements" * tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits) kvm: Add compat_ioctl for device control API KVM: x86: Account for failing enable_irq_window for NMI window request KVM: PPC: Book3S: Add API for in-kernel XICS emulation kvm/ppc/mpic: fix missing unlock in set_base_addr() kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write kvm/ppc/mpic: remove users kvm/ppc/mpic: fix mmio region lists when multiple guests used kvm/ppc/mpic: remove default routes from documentation kvm: KVM_CAP_IOMMU only available with device assignment ARM: KVM: iterate over all CPUs for CPU compatibility check KVM: ARM: Fix spelling in error message ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally KVM: ARM: Fix API documentation for ONE_REG encoding ARM: KVM: promote vfp_host pointer to generic host cpu context ARM: KVM: add architecture specific hook for capabilities ARM: KVM: perform HYP initilization for hotplugged CPUs ARM: KVM: switch to a dual-step HYP init code ARM: KVM: rework HYP page table freeing ARM: KVM: enforce maximum size for identity mapped code ARM: KVM: move to a KVM provided HYP idmap ...
| * | kvm: Add compat_ioctl for device control APIScott Wood2013-05-051-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | This API shouldn't have 32/64-bit issues, but VFS assumes it does unless told otherwise. Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>