summaryrefslogtreecommitdiffstats
path: root/arch
Commit message (Collapse)AuthorAgeFilesLines
* x86_64, asm: Work around AMD SYSRET SS descriptor attribute issueAndy Lutomirski2015-04-265-0/+48
| | | | | | | | | | | | | | | | | | | AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET with SS == 0 results in an invalid usermode state in which SS is apparently equal to __USER_DS but causes #SS if used. Work around the issue by setting SS to __KERNEL_DS __switch_to, thus ensuring that SYSRET never happens with SS set to NULL. This was exposed by a recent vDSO cleanup. Fixes: e7d6eefaaa44 x86/vdso32/syscall.S: Do not load __USER32_DS to %ss Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Brian Gerst <brgerst@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-04-264-22/+22
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
| * VFS: assorted d_backing_inode() annotationsDavid Howells2015-04-151-1/+1
| | | | | | | | | | Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * VFS: assorted weird filesystems: d_inode() annotationsDavid Howells2015-04-153-21/+21
| | | | | | | | | | Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds2015-04-261-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull crypto fixes from Herbert Xu: "This push fixes a build problem with img-hash under non-standard configurations and a serious regression with sha512_ssse3 which can lead to boot failures" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: img-hash - CRYPTO_DEV_IMGTEC_HASH should depend on HAS_DMA crypto: x86/sha512_ssse3 - fixup for asm function prototype change
| * | crypto: x86/sha512_ssse3 - fixup for asm function prototype changeArd Biesheuvel2015-04-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch e68410ebf626 ("crypto: x86/sha512_ssse3 - move SHA-384/512 SSSE3 implementation to base layer") changed the prototypes of the core asm SHA-512 implementations so that they are compatible with the prototype used by the base layer. However, in one instance, the register that was used for passing the input buffer was reused as a scratch register later on in the code, and since the input buffer param changed places with the digest param -which needs to be written back before the function returns- this resulted in the scratch register to be dereferenced in a memory write operation, causing a GPF. Fix this by changing the scratch register to use the same register as the input buffer param again. Fixes: e68410ebf626 ("crypto: x86/sha512_ssse3 - move SHA-384/512 SSSE3 implementation to base layer") Reported-By: Bobby Powers <bobbypowers@gmail.com> Tested-By: Bobby Powers <bobbypowers@gmail.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | | Merge tag 'cris-for-4.1' of ↵Linus Torvalds2015-04-2647-1145/+285
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jesper/cris Pull arch/cris updates from Jesper Nilsson: "Some much needed love for the CRIS-port. There's a bunch of changes this time, giving the CRISv32 port a bit of modern makeover with device-tree, irq domain and gpiolib support, and more switchover to generic frameworks. Some small fixes and removal of the theoretical SMP support brings up the rear" * tag 'cris-for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jesper/cris: cris: fix integer overflow in ELF_ET_DYN_BASE CRISv32: use GENERIC_SCHED_CLOCK CRISv32: use MMIO clocksource CRISv32: use generic clockevents CRIS: use generic headers via Kbuild CRIS: use generic cmpxchg.h CRIS: use generic atomic.h CRIS: use generic atomic bitops CRISv10: remove redundant macros from system.h CRIS: remove SMP code CRISv32: don't enable irqs in INIT_THREAD CRISv32: handle multiple signals CRISv32: prevent bogus restarts on sigreturn CRISv32: don't attempt syscall restart on irq exit Add binding documentation for CRIS CRIS: add Axis 88 board device tree CRISv32: add device tree support CRISv32: add irq domains support CRIS: enable GPIOLIB
| * | | cris: fix integer overflow in ELF_ET_DYN_BASEAndrey Ryabinin2015-03-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Almost all arches define ELF_ET_DYN_BASE as 2/3 of TASK_SIZE. Though it seems that some architectures do this in a wrong way. The problem is that 2*TASK_SIZE may overflow 32-bits so the real ELF_ET_DYN_BASE becomes wrong. Fix this overflow by dividing TASK_SIZE prior to multiplying: (TASK_SIZE / 3 * 2) Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRISv32: use GENERIC_SCHED_CLOCKRabin Vincent2015-03-253-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a fast sched clock using the free-running timer and the generic sched_clock infrastructure. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRISv32: use MMIO clocksourceRabin Vincent2015-03-252-21/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a generic MMIO clocksource and get rid of some lines of code. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRISv32: use generic clockeventsRabin Vincent2015-03-252-60/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement a oneshot-capable clockevents device so we get support for things like hrtimers and NOHZ. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRIS: use generic headers via KbuildRabin Vincent2015-03-2513-53/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Delete headers which do nothing but include the asm-generic versions and use Kbuild magic instead. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRIS: use generic cmpxchg.hRabin Vincent2015-03-252-51/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | CRIS can use asm-generic's cmpxchg.h Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRIS: use generic atomic.hRabin Vincent2015-03-254-165/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | CRIS can use asm-generic's atomic.h. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRIS: use generic atomic bitopsRabin Vincent2015-03-251-110/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The generic atomic bitops are the same as the CRIS-specific ones. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRISv10: remove redundant macros from system.hRabin Vincent2015-03-251-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All of these are either unused or already provided by other headers, so they can be removed. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRIS: remove SMP codeRabin Vincent2015-03-2517-638/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The CRIS SMP code cannot be built since there is no (and appears to never have been) a CONFIG_SMP Kconfig option in arch/cris/. Remove it. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRISv32: don't enable irqs in INIT_THREADRabin Vincent2015-03-251-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | INIT_THREAD enables interrupts in the thread_struct's saved flags. This means that interrupts get enabled in the middle of context_switch() while switching to new tasks that get forked off the init task during boot. Don't do this. Fixes the following splat on boot with spinlock debugging on: BUG: spinlock cpu recursion on CPU#0, swapper/2 lock: runqueues+0x0/0x47c, .magic: dead4ead, .owner: swapper/0, .owner_cpu: 0 CPU: 0 PID: 2 Comm: swapper Not tainted 3.19.0-08796-ga747b55 #285 Call Trace: [<c0032b80>] spin_bug+0x2a/0x36 [<c0032c98>] do_raw_spin_lock+0xa2/0x126 [<c01964b0>] _raw_spin_lock+0x20/0x2a [<c00286c8>] scheduler_tick+0x22/0x76 [<c003db2c>] update_process_times+0x5e/0x72 [<c0007a94>] timer_interrupt+0x4e/0x6a [<c00378d6>] handle_irq_event_percpu+0x54/0xf2 [<c00379c4>] handle_irq_event+0x50/0x74 [<c003988e>] handle_simple_irq+0x6c/0xbe [<c0037270>] generic_handle_irq+0x2a/0x36 [<c0004c40>] do_IRQ+0x38/0x84 [<c000662e>] crisv32_do_IRQ+0x54/0x60 [<c0006204>] IRQ0x4b_interrupt+0x34/0x3c [<c0192baa>] __schedule+0x24a/0x532 [<c00056b4>] ret_from_kernel_thread+0x0/0x14 Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRISv32: handle multiple signalsRabin Vincent2015-03-252-32/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Al Viro noted that CRIS fails to handle multiple signals. This fixes the problem for CRISv32 by making it use a C work_pending handling loop similar to the ARM implementation in 0a267fa6a15d41c ("ARM: 7472/1: pull all work_pending logics into C function"). This also happens to fixes the warnings which currently trigger on CRISv32 due to do_signal() being called with interrupts disabled. Test case (should die of the SIGSEGV which gets raised when setting up the stack for SIGALRM, but instead reaches and executes the _exit(1)): #include <unistd.h> #include <signal.h> #include <sys/time.h> #include <err.h> static void handler(int sig) { } int main(int argc, char *argv[]) { int ret; struct itimerval t1 = { .it_value = {1} }; stack_t ss = { .ss_sp = NULL, .ss_size = SIGSTKSZ, }; struct sigaction action = { .sa_handler = handler, .sa_flags = SA_ONSTACK, }; ret = sigaltstack(&ss, NULL); if (ret < 0) err(1, "sigaltstack"); sigaction(SIGALRM, &action, NULL); setitimer(ITIMER_REAL, &t1, NULL); pause(); _exit(1); return 0; } Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Link: http://lkml.kernel.org/r/20121208074429.GC4939@ZenIV.linux.org.uk Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRISv32: prevent bogus restarts on sigreturnRabin Vincent2015-03-252-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Al Viro noted that CRIS is vulnerable to bogus restarts on sigreturn. The fixes CRISv32 by using regs->exs as an additional indicator to whether we should attempt to restart the syscall or not. EXS is only used in the sigtrap handling, and in that path we already have r9 (the other indicator, which indicates if we're in a syscall or not) cleared. Test case, a port of Al's ARM version from 653d48b22166db2d8 ("arm: fix really nasty sigreturn bug"): #include <unistd.h> #include <signal.h> #include <stdlib.h> #include <sys/time.h> #include <errno.h> void f(int n) { register int r10 asm ("r10") = n; __asm__ __volatile__( "ba 1f \n" "nop \n" "break 8 \n" "1: ba . \n" "nop \n" : : "r" (r10) : "memory"); } void handler1(int sig) { } void handler2(int sig) { raise(1); } void handler3(int sig) { exit(0); } int main(int argc, char *argv[]) { struct sigaction s = {.sa_handler = handler2}; struct itimerval t1 = { .it_value = {1} }; struct itimerval t2 = { .it_value = {2} }; signal(1, handler1); sigemptyset(&s.sa_mask); sigaddset(&s.sa_mask, 1); sigaction(SIGALRM, &s, NULL); signal(SIGVTALRM, handler3); setitimer(ITIMER_REAL, &t1, NULL); setitimer(ITIMER_VIRTUAL, &t2, NULL); f(-513); /* -ERESTARTNOINTR */ return 0; } Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Link: http://lkml.kernel.org/r/20121208074429.GC4939@ZenIV.linux.org.uk Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRISv32: don't attempt syscall restart on irq exitRabin Vincent2015-03-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | r9 is used to determine whether syscall restarting must be performed or not. Unfortunately, r9 is never set to zero in the non-syscall path, and r9 is on top of that a callee-saved register which can be set to non-zero by the C functions that are called during IRQ handling. This means that if r10 (used for the syscall return value) is one of the -ERESTART* values when a hardware interrupt occurs which leads to a signal being delivered to the process, the kernel will "restart" a syscall which never occurred. This will lead to the PC being moved back by 2 on return to user space. Fix the problem by setting r9 to zero in the interrupt path. Test case (should loop forever but ends up executing the break 8 trap instruction): #include <signal.h> #include <stdlib.h> #include <sys/time.h> void f(int n) { register int r9 asm ("r9") = 1; register int r10 asm ("r10") = n; __asm__ __volatile__( "ba 1f \n" "nop \n" "break 8 \n" "1: ba . \n" "nop \n" : : "r" (r9), "r" (r10) : "memory"); } void handler1(int sig) { } int main(int argc, char *argv[]) { struct itimerval t1 = { .it_value = {1} }; signal(SIGALRM, handler1); setitimer(ITIMER_REAL, &t1, NULL); f(-513); /* -ERESTARTNOINTR */ return 0; } Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRIS: add Axis 88 board device treeRabin Vincent2015-03-252-0/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a minimal device tree for the ETRAX FS SoC and the Axis 88 developer board. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jespern@axis.com>
| * | | CRISv32: add device tree supportRabin Vincent2015-03-256-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for booting CRISv32 with a built-in device tree. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
| * | | CRISv32: add irq domains supportRabin Vincent2015-03-252-3/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for IRQ domains to the CRISv32 interrupt controller. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
| * | | CRIS: enable GPIOLIBRabin Vincent2015-03-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable GPIOLIB on CRIS so that we can use the generic GPIO APIs. Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
* | | | Merge tag 'powerpc-4.1-2' of ↵Linus Torvalds2015-04-2611-117/+137
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc fixes from Michael Ellerman: - fix for mm_dec_nr_pmds() from Scott. - fixes for oopses seen with KVM + THP from Aneesh. - build fixes from Aneesh & Shreyas. * tag 'powerpc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: powerpc/mm: Fix build error with CONFIG_PPC_TRANSACTIONAL_MEM disabled powerpc/kvm: Fix ppc64_defconfig + PPC_POWERNV=n build error powerpc/mm/thp: Return pte address if we find trans_splitting. powerpc/mm/thp: Make page table walk safe against thp split/collapse KVM: PPC: Remove page table walk helpers KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer powerpc/hugetlb: Call mm_dec_nr_pmds() in hugetlb_free_pmd_range()
| * | | | powerpc/mm: Fix build error with CONFIG_PPC_TRANSACTIONAL_MEM disabledAneesh Kumar K.V2015-04-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fix the below build error arch/powerpc/mm/hash_utils_64.c: In function ‘flush_hash_hugepage’: arch/powerpc/mm/hash_utils_64.c:1381:1: error: label at end of compound statement tm_abort: ^ make[1]: *** [arch/powerpc/mm/hash_utils_64.o] Error 1 Reported-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | powerpc/kvm: Fix ppc64_defconfig + PPC_POWERNV=n build errorShreyas B. Prabhu2015-04-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_no_guest() calls power7_wakeup_loss() to put the thread into the deepest supported idle state. power7_wakeup_loss() is defined in arch/powerpc/kernel/idle_power7.S, which is compiled only when PPC_P7_NAP=y. And PPC_P7_NAP is selected when PPC_POWERNV=y. Hence in cases where PPC_POWERNV=n and KVM_BOOK3S_64_HV=y we see the following error: arch/powerpc/kvm/built-in.o: In function `kvm_no_guest': arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x42c): undefined reference to `power7_wakeup_loss' Fix this by adding PPC_POWERNV as a dependency for KVM_BOOK3S_64_HV. Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | powerpc/mm/thp: Return pte address if we find trans_splitting.Aneesh Kumar K.V2015-04-174-22/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For THP that is marked trans splitting, we return the pte. This require the callers to handle the pmd_trans_splitting scenario, if they care. All the current callers are either looking at pfn or write_ok, hence we don't need to update them. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | powerpc/mm/thp: Make page table walk safe against thp split/collapseAneesh Kumar K.V2015-04-179-36/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can disable a THP split or a hugepage collapse by disabling irq. We do send IPI to all the cpus in the early part of split/collapse, and disabling local irq ensure we don't make progress with split/collapse. If the THP is getting split we return NULL from find_linux_pte_or_hugepte(). For all the current callers it should be ok. We need to be careful if we want to use returned pte_t pointer outside the irq disabled region. W.r.t to THP split, the pfn remains the same, but then a hugepage collapse will result in a pfn change. There are few steps we can take to avoid a hugepage collapse.One way is to take page reference inside the irq disable region. Other option is to take mmap_sem so that a parallel collapse will not happen. We can also disable collapse by taking pmd_lock. Another method used by kvm subsystem is to check whether we had a mmu_notifer update in between using mmu_notifier_retry(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | KVM: PPC: Remove page table walk helpersAneesh Kumar K.V2015-04-173-57/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch remove helpers which we had used only once in the code. Limiting page table walk variants help in ensuring that we won't end up with code walking page table with wrong assumptions. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | KVM: PPC: Use READ_ONCE when dereferencing pte_t pointerAneesh Kumar K.V2015-04-172-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pte can get updated from other CPUs as part of multiple activities like THP split, huge page collapse, unmap. We need to make sure we don't reload the pte value again and again for different checks. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * | | | Merge branch 'master' of ↵Michael Ellerman2015-04-171-0/+1
| |\ \ \ \ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into fixes
| | * | | | powerpc/hugetlb: Call mm_dec_nr_pmds() in hugetlb_free_pmd_range()Scott Wood2015-04-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit dc6c9a35b66b5 ("mm: account pmd page tables to the process") added a counter that is incremented whenever a PMD is allocated and decremented whenever a PMD is freed. For hugepages on PPC, common code is used to allocated PMDs, but arch-specific code is used to free PMDs. This results in kernel output such as "BUG: non-zero nr_pmds on freeing mm: 1" when using hugepages. Update the PPC hugepage PMD freeing code to decrement the count, just as the above commit did for free_pmd_range(). Fixes: dc6c9a35b66b5 ("mm: account pmd page tables to the process") Signed-off-by: Scott Wood <scottwood@freescale.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: stable@vger.kernel.org # 4.0.x
* | | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2015-04-2629-391/+1607
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull second batch of KVM changes from Paolo Bonzini: "This mostly includes the PPC changes for 4.1, which this time cover Book3S HV only (debugging aids, minor performance improvements and some cleanups). But there are also bug fixes and small cleanups for ARM, x86 and s390. The task_migration_notifier revert and real fix is still pending review, but I'll send it as soon as possible after -rc1" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (29 commits) KVM: arm/arm64: check IRQ number on userland injection KVM: arm: irqfd: fix value returned by kvm_irq_map_gsi KVM: VMX: Preserve host CR4.MCE value while in guest mode. KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8 KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to C KVM: PPC: Book3S HV: Streamline guest entry and exit KVM: PPC: Book3S HV: Use bitmap of active threads rather than count KVM: PPC: Book3S HV: Use decrementer to wake napping threads KVM: PPC: Book3S HV: Don't wake thread with no vcpu on guest IPI KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu KVM: PPC: Book3S HV: Minor cleanups KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update KVM: PPC: Book3S HV: Accumulate timing information for real-mode code KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT KVM: PPC: Book3S HV: Add ICP real mode counters KVM: PPC: Book3S HV: Move virtual mode ICP functions to real-mode KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock KVM: PPC: Book3S HV: Add guest->host real mode completion counters KVM: PPC: Book3S HV: Add helpers for lock/unlock hpte ...
| * \ \ \ \ \ Merge tag 'kvm-arm-for-4.1-take2' of ↵Paolo Bonzini2015-04-223-4/+15
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/ARM changes for v4.1, take #2: Rather small this time: - a fix for a nasty bug with virtual IRQ injection - a fix for irqfd
| | * | | | | | KVM: arm/arm64: check IRQ number on userland injectionAndre Przywara2015-04-223-4/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When userland injects a SPI via the KVM_IRQ_LINE ioctl we currently only check it against a fixed limit, which historically is set to 127. With the new dynamic IRQ allocation the effective limit may actually be smaller (64). So when now a malicious or buggy userland injects a SPI in that range, we spill over on our VGIC bitmaps and bytemaps memory. I could trigger a host kernel NULL pointer dereference with current mainline by injecting some bogus IRQ number from a hacked kvmtool: ----------------- .... DEBUG: kvm_vgic_inject_irq(kvm, cpu=0, irq=114, level=1) DEBUG: vgic_update_irq_pending(kvm, cpu=0, irq=114, level=1) DEBUG: IRQ #114 still in the game, writing to bytemap now... Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = ffffffc07652e000 [00000000] *pgd=00000000f658b003, *pud=00000000f658b003, *pmd=0000000000000000 Internal error: Oops: 96000006 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 1053 Comm: lkvm-msi-irqinj Not tainted 4.0.0-rc7+ #3027 Hardware name: FVP Base (DT) task: ffffffc0774e9680 ti: ffffffc0765a8000 task.ti: ffffffc0765a8000 PC is at kvm_vgic_inject_irq+0x234/0x310 LR is at kvm_vgic_inject_irq+0x30c/0x310 pc : [<ffffffc0000ae0a8>] lr : [<ffffffc0000ae180>] pstate: 80000145 ..... So this patch fixes this by checking the SPI number against the actual limit. Also we remove the former legacy hard limit of 127 in the ioctl code. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> CC: <stable@vger.kernel.org> # 4.0, 3.19, 3.18 [maz: wrap KVM_ARM_IRQ_GIC_MAX with #ifndef __KERNEL__, as suggested by Christopher Covington] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * | | | | | | Merge tag 'signed-kvm-ppc-queue' of git://github.com/agraf/linux-2.6 into ↵Paolo Bonzini2015-04-2122-364/+1561
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm-master Patch queue for ppc - 2015-04-21 This is the latest queue for KVM on PowerPC changes. Highlights this time around: - Book3S HV: Debugging aids - Book3S HV: Minor performance improvements - Book3S HV: Cleanups
| | * | | | | | | KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8Paul Mackerras2015-04-214-22/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This uses msgsnd where possible for signalling other threads within the same core on POWER8 systems, rather than IPIs through the XICS interrupt controller. This includes waking secondary threads to run the guest, the interrupts generated by the virtual XICS, and the interrupts to bring the other threads out of the guest when exiting. Aggregated statistics from debugfs across vcpus for a guest with 32 vcpus, 8 threads/vcore, running on a POWER8, show this before the change: rm_entry: 3387.6ns (228 - 86600, 1008969 samples) rm_exit: 4561.5ns (12 - 3477452, 1009402 samples) rm_intr: 1660.0ns (12 - 553050, 3600051 samples) and this after the change: rm_entry: 3060.1ns (212 - 65138, 953873 samples) rm_exit: 4244.1ns (12 - 9693408, 954331 samples) rm_intr: 1342.3ns (12 - 1104718, 3405326 samples) for a test of booting Fedora 20 big-endian to the login prompt. The time taken for a H_PROD hcall (which is handled in the host kernel) went down from about 35 microseconds to about 16 microseconds with this change. The noinline added to kvmppc_run_core turned out to be necessary for good performance, at least with gcc 4.9.2 as packaged with Fedora 21 and a little-endian POWER8 host. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to CPaul Mackerras2015-04-214-68/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces the assembler code for kvmhv_commence_exit() with C code in book3s_hv_builtin.c. It also moves the IPI sending code that was in book3s_hv_rm_xics.c into a new kvmhv_rm_send_ipi() function so it can be used by kvmhv_commence_exit() as well as icp_rm_set_vcpu_irq(). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Streamline guest entry and exitPaul Mackerras2015-04-211-86/+126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On entry to the guest, secondary threads now wait for the primary to switch the MMU after loading up most of their state, rather than before. This means that the secondary threads get into the guest sooner, in the common case where the secondary threads get to kvmppc_hv_entry before the primary thread. On exit, the first thread out increments the exit count and interrupts the other threads (to get them out of the guest) before saving most of its state, rather than after. That means that the other threads exit sooner and means that the first thread doesn't spend so much time waiting for the other threads at the point where the MMU gets switched back to the host. This pulls out the code that increments the exit count and interrupts other threads into a separate function, kvmhv_commence_exit(). This also makes sure that r12 and vcpu->arch.trap are set correctly in some corner cases. Statistics from /sys/kernel/debug/kvm/vm*/vcpu*/timings show the improvement. Aggregating across vcpus for a guest with 32 vcpus, 8 threads/vcore, running on a POWER8, gives this before the change: rm_entry: avg 4537.3ns (222 - 48444, 1068878 samples) rm_exit: avg 4787.6ns (152 - 165490, 1010717 samples) rm_intr: avg 1673.6ns (12 - 341304, 3818691 samples) and this after the change: rm_entry: avg 3427.7ns (232 - 68150, 1118921 samples) rm_exit: avg 4716.0ns (12 - 150720, 1119477 samples) rm_intr: avg 1614.8ns (12 - 522436, 3850432 samples) showing a substantial reduction in the time spent per guest entry in the real-mode guest entry code, and smaller reductions in the real mode guest exit and interrupt handling times. (The test was to start the guest and boot Fedora 20 big-endian to the login prompt.) Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Use bitmap of active threads rather than countPaul Mackerras2015-04-215-49/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the entry_exit_count field in the kvmppc_vcore struct contains two 8-bit counts, one of the threads that have started entering the guest, and one of the threads that have started exiting the guest. This changes it to an entry_exit_map field which contains two bitmaps of 8 bits each. The advantage of doing this is that it gives us a bitmap of which threads need to be signalled when exiting the guest. That means that we no longer need to use the trick of setting the HDEC to 0 to pull the other threads out of the guest, which led in some cases to a spurious HDEC interrupt on the next guest entry. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Use decrementer to wake napping threadsPaul Mackerras2015-04-211-2/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This arranges for threads that are napping due to their vcpu having ceded or due to not having a vcpu to wake up at the end of the guest's timeslice without having to be poked with an IPI. We do that by arranging for the decrementer to contain a value no greater than the number of timebase ticks remaining until the end of the timeslice. In the case of a thread with no vcpu, this number is in the hypervisor decrementer already. In the case of a ceded vcpu, we use the smaller of the HDEC value and the DEC value. Using the DEC like this when ceded means we need to save and restore the guest decrementer value around the nap. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Don't wake thread with no vcpu on guest IPIPaul Mackerras2015-04-211-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running a multi-threaded guest and vcpu 0 in a virtual core is not running in the guest (i.e. it is busy elsewhere in the host), thread 0 of the physical core will switch the MMU to the guest and then go to nap mode in the code at kvm_do_nap. If the guest sends an IPI to thread 0 using the msgsndp instruction, that will wake up thread 0 and cause all the threads in the guest to exit to the host unnecessarily. To avoid the unnecessary exit, this arranges for the PECEDP bit to be cleared in this situation. When napping due to a H_CEDE from the guest, we still set PECEDP so that the thread will wake up on an IPI sent using msgsndp. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_wokenPaul Mackerras2015-04-214-35/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can tell when a secondary thread has finished running a guest by the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there is no real need for the nap_count field in the kvmppc_vcore struct. This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu pointers of the secondary threads rather than polling vc->nap_count. Besides reducing the size of the kvmppc_vcore struct by 8 bytes, this also means that we can tell which secondary threads have got stuck and thus print a more informative error message. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpuPaul Mackerras2015-04-212-42/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than calling cond_resched() in kvmppc_run_core() before doing the post-processing for the vcpus that we have just run (that is, calling kvmppc_handle_exit_hv(), kvmppc_set_timer(), etc.), we now do that post-processing before calling cond_resched(), and that post- processing is moved out into its own function, post_guest_process(). The reschedule point is now in kvmppc_run_vcpu() and we define a new vcore state, VCORE_PREEMPT, to indicate that that the vcore's runner task is runnable but not running. (Doing the reschedule with the vcore in VCORE_INACTIVE state would be bad because there are potentially other vcpus waiting for the runner in kvmppc_wait_for_exec() which then wouldn't get woken up.) Also, we make use of the handy cond_resched_lock() function, which unlocks and relocks vc->lock for us around the reschedule. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Minor cleanupsPaul Mackerras2015-04-213-28/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Remove unused kvmppc_vcore::n_busy field. * Remove setting of RMOR, since it was only used on PPC970 and the PPC970 KVM support has been removed. * Don't use r1 or r2 in setting the runlatch since they are conventionally reserved for other things; use r0 instead. * Streamline the code a little and remove the ext_interrupt_to_host label. * Add some comments about register usage. * hcall_try_real_mode doesn't need to be global, and can't be called from C code anyway. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA updatePaul Mackerras2015-04-212-29/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, if kvmppc_run_core() was running a VCPU that needed a VPA update (i.e. one of its 3 virtual processor areas needed to be pinned in memory so the host real mode code can update it on guest entry and exit), we would drop the vcore lock and do the update there and then. Future changes will make it inconvenient to drop the lock, so instead we now remove it from the list of runnable VCPUs and wake up its VCPU task. This will have the effect that the VCPU task will exit kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call to kvmppc_update_vpas() and then rejoin the vcore. The one complication is that the runner VCPU (whose VCPU task is the current task) might be one of the ones that gets removed from the runnable list. In that case we just return from kvmppc_run_core() and let the code in kvmppc_run_vcpu() wake up another VCPU task to be the runner if necessary. This all means that the VCORE_STARTING state is no longer used, so we remove it. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Accumulate timing information for real-mode codePaul Mackerras2015-04-217-2/+346
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reads the timebase at various points in the real-mode guest entry/exit code and uses that to accumulate total, minimum and maximum time spent in those parts of the code. Currently these times are accumulated per vcpu in 5 parts of the code: * rm_entry - time taken from the start of kvmppc_hv_entry() until just before entering the guest. * rm_intr - time from when we take a hypervisor interrupt in the guest until we either re-enter the guest or decide to exit to the host. This includes time spent handling hcalls in real mode. * rm_exit - time from when we decide to exit the guest until the return from kvmppc_hv_entry(). * guest - time spend in the guest * cede - time spent napping in real mode due to an H_CEDE hcall while other threads in the same vcore are active. These times are exposed in debugfs in a directory per vcpu that contains a file called "timings". This file contains one line for each of the 5 timings above, with the name followed by a colon and 4 numbers, which are the count (number of times the code has been executed), the total time, the minimum time, and the maximum time, all in nanoseconds. The overhead of the extra code amounts to about 30ns for an hcall that is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since production environments may not wish to incur this overhead, the new code is conditional on a new config symbol, CONFIG_KVM_BOOK3S_HV_EXIT_TIMING. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
| | * | | | | | | KVM: PPC: Book3S HV: Create debugfs file for each guest's HPTPaul Mackerras2015-04-214-0/+152
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This creates a debugfs directory for each HV guest (assuming debugfs is enabled in the kernel config), and within that directory, a file by which the contents of the guest's HPT (hashed page table) can be read. The directory is named vmnnnn, where nnnn is the PID of the process that created the guest. The file is named "htab". This is intended to help in debugging problems in the host's management of guest memory. The contents of the file consist of a series of lines like this: 3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196 The first field is the index of the entry in the HPT, the second and third are the HPT entry, so the third entry contains the real page number that is mapped by the entry if the entry's valid bit is set. The fourth field is the guest's view of the second doubleword of the entry, so it contains the guest physical address. (The format of the second through fourth fields are described in the Power ISA and also in arch/powerpc/include/asm/mmu-hash64.h.) Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>