summaryrefslogtreecommitdiffstats
path: root/arch/powerpc
Commit message (Collapse)AuthorAgeFilesLines
* powerpc/MSI: Fix race condition in tearing down MSI interruptsPaul Mackerras2015-10-055-10/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e297c939b745e420ef0b9dc989cb87bda617b399 upstream. This fixes a race which can result in the same virtual IRQ number being assigned to two different MSI interrupts. The most visible consequence of that is usually a warning and stack trace from the sysfs code about an attempt to create a duplicate entry in sysfs. The race happens when one CPU (say CPU 0) is disposing of an MSI while another CPU (say CPU 1) is setting up an MSI. CPU 0 calls (for example) pnv_teardown_msi_irqs(), which calls msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its hardware IRQ number) is no longer in use. Then, before CPU 0 gets to calling irq_dispose_mapping() to free up the virtal IRQ number, CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an MSI, and gets the same hardware IRQ number that CPU 0 just freed. CPU 1 then calls irq_create_mapping() to get a virtual IRQ number, which sees that there is currently a mapping for that hardware IRQ number and returns the corresponding virtual IRQ number (which is the same virtual IRQ number that CPU 0 was using). CPU 0 then calls irq_dispose_mapping() and frees that virtual IRQ number. Now, if another CPU comes along and calls irq_create_mapping(), it is likely to get the virtual IRQ number that was just freed, resulting in the same virtual IRQ number apparently being used for two different hardware interrupts. To fix this race, we just move the call to msi_bitmap_free_hwirqs() to after the call to irq_dispose_mapping(). Since virq_to_hw() doesn't work for the virtual IRQ number after irq_dispose_mapping() has been called, we need to call it before irq_dispose_mapping() and remember the result for the msi_bitmap_free_hwirqs() call. The pattern of calling msi_bitmap_free_hwirqs() before irq_dispose_mapping() appears in 5 places under arch/powerpc, and appears to have originated in commit 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend") from 2007. Fixes: 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend") Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/mm: Recompute hash value after a failed updateAneesh Kumar K.V2015-09-301-1/+2
| | | | | | | | | | | | | | | | | | | | commit 36b35d5d807b7e57aff7d08e63de8b17731ee211 upstream. If we had secondary hash flag set, we ended up modifying hash value in the updatepp code path. Hence with a failed updatepp we will be using a wrong hash value for the following hash insert. Fix this by recomputing hash before insert. Without this patch we can end up with using wrong slot number in linux pte. That can result in us missing an hash pte update or invalidate which can cause memory corruption or even machine check. Fixes: 6d492ecc6489 ("powerpc/THP: Add code to handle HPTE faults for hugepages") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/rtas: Introduce rtas_get_sensor_fast() for IRQ handlersThomas Huth2015-09-303-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1c2cb594441d02815d304cccec9742ff5c707495 upstream. The EPOW interrupt handler uses rtas_get_sensor(), which in turn uses rtas_busy_delay() to wait for RTAS becoming ready in case it is necessary. But rtas_busy_delay() is annotated with might_sleep() and thus may not be used by interrupts handlers like the EPOW handler! This leads to the following BUG when CONFIG_DEBUG_ATOMIC_SLEEP is enabled: BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:496 in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #6 Call Trace: [c00000007ffe7b90] [c000000000807670] dump_stack+0xa0/0xdc (unreliable) [c00000007ffe7bc0] [c0000000000e1f14] ___might_sleep+0x134/0x180 [c00000007ffe7c20] [c00000000002aec0] rtas_busy_delay+0x30/0xd0 [c00000007ffe7c50] [c00000000002bde4] rtas_get_sensor+0x74/0xe0 [c00000007ffe7ce0] [c000000000083264] ras_epow_interrupt+0x44/0x450 [c00000007ffe7d90] [c000000000120260] handle_irq_event_percpu+0xa0/0x300 [c00000007ffe7e70] [c000000000120524] handle_irq_event+0x64/0xc0 [c00000007ffe7eb0] [c000000000124dbc] handle_fasteoi_irq+0xec/0x260 [c00000007ffe7ef0] [c00000000011f4f0] generic_handle_irq+0x50/0x80 [c00000007ffe7f20] [c000000000010f3c] __do_irq+0x8c/0x200 [c00000007ffe7f90] [c0000000000236cc] call_do_irq+0x14/0x24 [c00000007e6f39e0] [c000000000011144] do_IRQ+0x94/0x110 [c00000007e6f3a30] [c000000000002594] hardware_interrupt_common+0x114/0x180 Fix this issue by introducing a new rtas_get_sensor_fast() function that does not use rtas_busy_delay() - and thus can only be used for sensors that do not cause a BUSY condition - known as "fast" sensors. The EPOW sensor is defined to be "fast" in sPAPR - mpe. Fixes: 587f83e8dd50 ("powerpc/pseries: Use rtas_get_sensor in RAS code") Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/mm: Fix pte_pagesize_index() crash on 4K w/64K hashMichael Ellerman2015-09-301-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 74b5037baa2011a2799e2c43adde7d171b072f9e upstream. The powerpc kernel can be built to have either a 4K PAGE_SIZE or a 64K PAGE_SIZE. However when built with a 4K PAGE_SIZE there is an additional config option which can be enabled, PPC_HAS_HASH_64K, which means the kernel also knows how to hash a 64K page even though the base PAGE_SIZE is 4K. This is used in one obscure configuration, to support 64K pages for SPU local store on the Cell processor when the rest of the kernel is using 4K pages. In this configuration, pte_pagesize_index() is defined to just pass through its arguments to get_slice_psize(). However pte_pagesize_index() is called for both user and kernel addresses, whereas get_slice_psize() only knows how to handle user addresses. This has been broken forever, however until recently it happened to work. That was because in get_slice_psize() the large kernel address would cause the right shift of the slice mask to return zero. However in commit 7aa0727f3302 ("powerpc/mm: Increase the slice range to 64TB"), the get_slice_psize() code was changed so that instead of a right shift we do an array lookup based on the address. When passed a kernel address this means we index way off the end of the slice array and return random junk. That is only fatal if we happen to hit something non-zero, but when we do return a non-zero value we confuse the MMU code and eventually cause a check stop. This fix is ugly, but simple. When we're called for a kernel address we return 4K, which is always correct in this configuration, otherwise we use the slice mask. Fixes: 7aa0727f3302 ("powerpc/mm: Increase the slice range to 64TB") Reported-by: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* signal: fix information leak in copy_siginfo_from_user32Amanieu d'Antras2015-08-251-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3c00cb5e68dc719f2fc73a33b1b230aadfcb1309 upstream. This function can leak kernel stack data when the user siginfo_t has a positive si_code value. The top 16 bits of si_code descibe which fields in the siginfo_t union are active, but they are treated inconsistently between copy_siginfo_from_user32, copy_siginfo_to_user32 and copy_siginfo_to_user. copy_siginfo_from_user32 is called from rt_sigqueueinfo and rt_tgsigqueueinfo in which the user has full control overthe top 16 bits of si_code. This fixes the following information leaks: x86: 8 bytes leaked when sending a signal from a 32-bit process to itself. This leak grows to 16 bytes if the process uses x32. (si_code = __SI_CHLD) x86: 100 bytes leaked when sending a signal from a 32-bit process to a 64-bit process. (si_code = -1) sparc: 4 bytes leaked when sending a signal from a 32-bit process to a 64-bit process. (si_code = any) parsic and s390 have similar bugs, but they are not vulnerable because rt_[tg]sigqueueinfo have checks that prevent sending a positive si_code to a different process. These bugs are also fixed for consistency. Signed-off-by: Amanieu d'Antras <amanieu@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* arch: Introduce smp_load_acquire(), smp_store_release()Peter Zijlstra2015-08-251-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 47933ad41a86a4a9b50bed7c9b9bd2ba242aac63 upstream. A number of situations currently require the heavyweight smp_mb(), even though there is no need to order prior stores against later loads. Many architectures have much cheaper ways to handle these situations, but the Linux kernel currently has no portable way to make use of them. This commit therefore supplies smp_load_acquire() and smp_store_release() to remedy this situation. The new smp_load_acquire() primitive orders the specified load against any subsequent reads or writes, while the new smp_store_release() primitive orders the specifed store against any prior reads or writes. These primitives allow array-based circular FIFOs to be implemented without an smp_mb(), and also allow a theoretical hole in rcu_assign_pointer() to be closed at no additional expense on most architectures. In addition, the RCU experience transitioning from explicit smp_read_barrier_depends() and smp_wmb() to rcu_dereference() and rcu_assign_pointer(), respectively resulted in substantial improvements in readability. It therefore seems likely that replacing other explicit barriers with smp_load_acquire() and smp_store_release() will provide similar benefits. It appears that roughly half of the explicit barriers in core kernel code might be so replaced. [Changelog by PaulMck] Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Michael Ellerman <michael@ellerman.id.au> Cc: Michael Neuling <mikey@neuling.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Victor Kaplansky <VICTORK@il.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20131213150640.908486364@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/perf: Fix book3s kernel to userspace backtracesAnton Blanchard2015-07-301-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 72e349f1124a114435e599479c9b8d14bfd1ebcd upstream. When we take a PMU exception or a software event we call perf_read_regs(). This overloads regs->result with a boolean that describes if we should use the sampled instruction address register (SIAR) or the regs. If the exception is in kernel, we start with the kernel regs and backtrace through the kernel stack. At this point we switch to the userspace regs and backtrace the user stack with perf_callchain_user(). Unfortunately these regs have not got the perf_read_regs() treatment, so regs->result could be anything. If it is non zero, perf_instruction_pointer() decides to use the SIAR, and we get issues like this: 0.11% qemu-system-ppc [kernel.kallsyms] [k] _raw_spin_lock_irqsave | ---_raw_spin_lock_irqsave | |--52.35%-- 0 | | | |--46.39%-- __hrtimer_start_range_ns | | kvmppc_run_core | | kvmppc_vcpu_run_hv | | kvmppc_vcpu_run | | kvm_arch_vcpu_ioctl_run | | kvm_vcpu_ioctl | | do_vfs_ioctl | | sys_ioctl | | system_call | | | | | |--67.08%-- _raw_spin_lock_irqsave <--- hi mum | | | | | | | --100.00%-- 0x7e714 | | | 0x7e714 Notice the bogus _raw_spin_irqsave when we transition from kernel (system_call) to userspace (0x7e714). We inserted what was in the SIAR. Add a check in regs_use_siar() to check that the regs in question are from a PMU exception. With this fix the backtrace makes sense: 0.47% qemu-system-ppc [kernel.vmlinux] [k] _raw_spin_lock_irqsave | ---_raw_spin_lock_irqsave | |--53.83%-- 0 | | | |--44.73%-- hrtimer_try_to_cancel | | kvmppc_start_thread | | kvmppc_run_core | | kvmppc_vcpu_run_hv | | kvmppc_vcpu_run | | kvm_arch_vcpu_ioctl_run | | kvm_vcpu_ioctl | | do_vfs_ioctl | | sys_ioctl | | system_call | | __ioctl | | 0x7e714 | | 0x7e714 Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: Align TOC to 256 bytesAnton Blanchard2015-06-031-0/+1
| | | | | | | | | | | | | | | | commit 5e95235ccd5442d4a4fe11ec4eb99ba1b7959368 upstream. Recent toolchains force the TOC to be 256 byte aligned. We need to enforce this alignment in our linker script, otherwise pointers to our TOC variables (__toc_start, __prom_init_toc_start) could be incorrect. If they are bad, we die a few hundred instructions into boot. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/mm: Fix mmap errno when MAP_FIXED is set and mapping exceeds the ↵jmarchan@redhat.com2015-06-021-1/+1
| | | | | | | | | | | | | | | | | allowed address space commit 19751c07b3728748c1253627ce94e6906fa5e273 upstream. According to Posix, if MAP_FIXED is specified mmap shall set ENOMEM if the requested mapping exceeds the allowed range for address space of the process. The generic code set it right, but the specific powerpc slice_get_unmapped_area() function currently returns -EINVAL in that case. This patch corrects it. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: Fix missing L2 cache size in /sys/devices/system/cpuDave Olson2015-05-041-10/+34
| | | | | | | | | | | | | | | | | | | | | commit f7e9e358362557c3aa2c1ec47490f29fe880a09e upstream. This problem appears to have been introduced in 2.6.29 by commit 93197a36a9c1 "Rewrite sysfs processor cache info code". This caused lscpu to error out on at least e500v2 devices, eg: error: cannot open /sys/devices/system/cpu/cpu0/cache/index2/size: No such file or directory Some embedded powerpc systems use cache-size in DTS for the unified L2 cache size, not d-cache-size, so we need to allow for both DTS names. Added a new CACHE_TYPE_UNIFIED_D cache_type_info structure to handle this. Fixes: 93197a36a9c1 ("powerpc: Rewrite sysfs processor cache info code") Signed-off-by: Dave Olson <olson@cumulusnetworks.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/perf: Cap 64bit userspace backtraces to PERF_MAX_STACK_DEPTHAnton Blanchard2015-05-041-1/+1
| | | | | | | | | | | commit 9a5cbce421a283e6aea3c4007f141735bf9da8c3 upstream. We cap 32bit userspace backtraces to PERF_MAX_STACK_DEPTH (currently 127), but we forgot to do the same for 64bit backtraces. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* nosave: consolidate __nosave_{begin,end} in <asm/sections.h>Geert Uytterhoeven2015-05-041-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7f8998c7aef3ac9c5f3f2943e083dfa6302e90d0 upstream. The different architectures used their own (and different) declarations: extern __visible const void __nosave_begin, __nosave_end; extern const void __nosave_begin, __nosave_end; extern long __nosave_begin, __nosave_end; Consolidate them using the first variant in <asm/sections.h>. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz> [js -- port to 3.12: arm does not have hibernation yet]
* powerpc/pseries: Little endian fixes for post mobility device tree updateTyrel Datwyler2015-04-211-21/+23
| | | | | | | | | | | | | | | | | | | | | commit f6ff04149637723261aa4738958b0098b929ee9e upstream. We currently use the device tree update code in the kernel after resuming from a suspend operation to re-sync the kernels view of the device tree with that of the hypervisor. The code as it stands is not endian safe as it relies on parsing buffers returned by RTAS calls that thusly contains data in big endian format. This patch annotates variables and structure members with __be types as well as performing necessary byte swaps to cpu endian for data that needs to be parsed. Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Cyril Bur <cyrilbur@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: Fix sys_call_table declaration to enable syscall tracingRomeo Cane2015-04-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1028ccf560b97adbf272381a61a67e17d44d1054 upstream. Declaring sys_call_table as a pointer causes the compiler to generate the wrong lookup code in arch_syscall_addr(). <arch_syscall_addr>: lis r9,-16384 rlwinm r3,r3,2,0,29 - lwz r11,30640(r9) - lwzx r3,r11,r3 + addi r9,r9,30640 + lwzx r3,r9,r3 blr The actual sys_call_table symbol, declared in assembler, is an array. If we lie about that to the compiler we get the wrong code generated, as above. This definition seems only to be used by the syscall tracing code in kernel/trace/trace_syscalls.c. With this patch I can successfully use the syscall tracepoints: bash-3815 [002] .... 333.239082: sys_write -> 0x2 bash-3815 [002] .... 333.239087: sys_dup2(oldfd: a, newfd: 1) bash-3815 [002] .... 333.239088: sys_dup2 -> 0x1 bash-3815 [002] .... 333.239092: sys_fcntl(fd: a, cmd: 1, arg: 0) bash-3815 [002] .... 333.239093: sys_fcntl -> 0x1 bash-3815 [002] .... 333.239094: sys_close(fd: a) bash-3815 [002] .... 333.239094: sys_close -> 0x0 Signed-off-by: Romeo Cane <romeo.cane.ext@coriant.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/smp: Wait until secondaries are active & onlineMichael Ellerman2015-04-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 875ebe940d77a41682c367ad799b4f39f128d3fa upstream. Anton has a busy ppc64le KVM box where guests sometimes hit the infamous "kernel BUG at kernel/smpboot.c:134!" issue during boot: BUG_ON(td->cpu != smp_processor_id()); Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops output confirms it: CPU: 0 Comm: watchdog/130 The problem is that we aren't ensuring the CPU active bit is set for the secondary before allowing the master to continue on. The master unparks the secondary CPU's kthreads and the scheduler looks for a CPU to run on. It calls select_task_rq() and realises the suggested CPU is not in the cpus_allowed mask. It then ends up in select_fallback_rq(), and since the active bit isnt't set we choose some other CPU to run on. This seems to have been introduced by 6acbfb96976f "sched: Fix hotplug vs. set_cpus_allowed_ptr()", which changed from setting active before online to setting active after online. However that was in turn fixing a bug where other code assumed an active CPU was also online, so we can't just revert that fix. The simplest fix is just to spin waiting for both active & online to be set. We already have a barrier prior to set_cpu_online() (which also sets active), to ensure all other setup is completed before online & active are set. Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/perf: Fix ABIv2 kernel backtracesAnton Blanchard2015-04-092-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 85101af13bb854a6572fa540df7c7201958624b9 upstream. ABIv2 kernels are failing to backtrace through the kernel. An example: 39.30% readseek2_proce [kernel.kallsyms] [k] find_get_entry | --- find_get_entry __GI___libc_read The problem is in valid_next_sp() where we check that the new stack pointer is at least STACK_FRAME_OVERHEAD below the previous one. ABIv1 has a minimum stack frame size of 112 bytes consisting of 48 bytes and 64 bytes of parameter save area. ABIv2 changes that to 32 bytes with no paramter save area. STACK_FRAME_OVERHEAD is in theory the minimum stack frame size, but we over 240 uses of it, some of which assume that it includes space for the parameter area. We need to work through all our stack defines and rationalise them but let's fix perf now by creating STACK_FRAME_MIN_SIZE and using in valid_next_sp(). This fixes the issue: 30.64% readseek2_proce [kernel.kallsyms] [k] find_get_entry | --- find_get_entry pagecache_get_page generic_file_read_iter new_sync_read vfs_read sys_read syscall_exit __GI___libc_read Cc: stable@vger.kernel.org # 3.16+ Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* power, sched: stop updating inside arch_update_cpu_topology() when nothing ↵Michael Wang2015-04-091-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | to be update commit 9a0133613e4412b2caaaf0d9dd81f213bcededf1 upstream. Since v1: Edited the comment according to Srivatsa's suggestion. During the testing, we encounter below WARN followed by Oops: WARNING: at kernel/sched/core.c:6218 ... NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200 LR [c000000000101358] .build_sched_domains+0xec8/0x1200 PACATMSCRATCH [800000000000f032] Call Trace: [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200 [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510 [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0 [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30 ... Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0 LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200 PACATMSCRATCH [8000000000029032] Call Trace: [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0 [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200 [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510 [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0 [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30 ... This was caused by that 'sd->groups == NULL' after building groups, which was caused by the empty 'sd->span'. The cpu's domain contained nothing because the cpu was assigned to a wrong node, due to the following unfortunate sequence of events: 1. The hypervisor sent a topology update to the guest OS, to notify changes to the cpu-node mapping. However, the update was actually redundant - i.e., the "new" mapping was exactly the same as the old one. 2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting the 'for-loop' in arch_update_cpu_topology(). 3. So we ended up calling stop-machine() with an empty cpumask list, which made stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as the cpu to run the payload (the update_cpu_topology() function). 4. This causes update_cpu_topology() to be run by CPU0. And since 'updates' is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology() finds update->cpu as well as update->new_nid to be 0. In other words, we end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly. Along with the following wrong updating, it causes the sched-domain rebuild code to break and crash the system. Fix this by skipping the topology update in cases where we find that the topology has not actually changed in reality (ie., spurious updates). CC: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: Paul Mackerras <paulus@samba.org> CC: Nathan Fontenot <nfont@linux.vnet.ibm.com> CC: Stephen Rothwell <sfr@canb.auug.org.au> CC: Andrew Morton <akpm@linux-foundation.org> CC: Robert Jennings <rcj@linux.vnet.ibm.com> CC: Jesse Larrew <jlarrew@linux.vnet.ibm.com> CC: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> CC: Alistair Popple <alistair@popple.id.au> Suggested-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/mpc85xx: Add ranges to etsec2 nodesScott Wood2015-04-093-0/+3
| | | | | | | | | | | | | | commit bb344ca5b90df62b1a3b7a35c6a9d00b306a170d upstream. Commit 746c9e9f92dd "of/base: Fix PowerPC address parsing hack" limited the applicability of the workaround whereby a missing ranges is treated as an empty ranges. This workaround was hiding a bug in the etsec2 device tree nodes, which have children with reg, but did not have ranges. Signed-off-by: Scott Wood <scottwood@freescale.com> Reported-by: Alexander Graf <agraf@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* mm/hugetlb: reduce arch dependent code around follow_huge_*Naoya Horiguchi2015-03-161-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 61f77eda9bbf0d2e922197ed2dcf88638a639ce5 upstream. Currently we have many duplicates in definitions around follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this patch tries to remove the m. The basic idea is to put the default implementation for these functions in mm/hugetlb.c as weak symbols (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETL B), and to implement arch-specific code only when the arch needs it. For follow_huge_addr(), only powerpc and ia64 have their own implementation, and in all other architectures this function just returns ERR_PTR(-EINVAL). So this patch sets returning ERR_PTR(-EINVAL) as default. As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to always return 0 in your architecture (like in ia64 or sparc,) it's never called (the callsite is optimized away) no matter how implemented it is. So in such architectures, we don't need arch-specific implementation. In some architecture (like mips, s390 and tile,) their current arch-specific follow_huge_(pmd|pud)() are effectively identical with the common code, so this patch lets these architecture use the common code. One exception is metag, where pmd_huge() could return non-zero but it expects follow_huge_pmd() to always return NULL. This means that we need arch-specific implementation which returns NULL. This behavior looks strange to me (because non-zero pmd_huge() implies that the architecture supports PMD-based hugepage, so follow_huge_pmd() can/should return some relevant value,) but that's beyond this cleanup patch, so let's keep it. Justification of non-trivial changes: - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE is true when follow_huge_pmd() can be called (note that pmd_huge() has the same check and always returns 0 for !MACHINE_HAS_HPAGE.) - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common code. This patch forces these archs use PMD_MASK, but it's OK because they are identical in both archs. In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20. In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but PTE_ORDER is always 0, so these are identical. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* vm: add VM_FAULT_SIGSEGV handling supportLinus Torvalds2015-03-122-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream. The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* axonram: Fix bug in direct_accessMatthew Wilcox2015-03-011-1/+1
| | | | | | | | | | | | | | commit 91117a20245b59f70b563523edbf998a62fc6383 upstream. The 'pfn' returned by axonram was completely bogus, and has been since 2008. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/xmon: Fix another endiannes issue in RTAS call from xmonLaurent Dufour2015-02-081-0/+1
| | | | | | | | | | | | | | | | | | | | commit e6eb2eba494d6f99e69ca3c3748cd37a2544ab38 upstream. The commit 3b8a3c010969 ("powerpc/pseries: Fix endiannes issue in RTAS call from xmon") was fixing an endianness issue in the call made from xmon to RTAS. However, as Michael Ellerman noticed, this fix was not complete, the token value was not byte swapped. This lead to call an unexpected and most of the time unexisting RTAS function, which is silently ignored by RTAS. This fix addresses this hole. Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* move d_rcu from overlapping d_child to overlapping d_aliasAl Viro2015-01-291-1/+1
| | | | | | | | commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* crypto: add missing crypto module aliasesMathias Krause2015-01-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | commit 3e14dcf7cb80b34a1f38b55bc96f02d23fdaaaaf upstream. Commit 5d26a105b5a7 ("crypto: prefix module autoloading with "crypto-"") changed the automatic module loading when requesting crypto algorithms to prefix all module requests with "crypto-". This requires all crypto modules to have a crypto specific module alias even if their file name would otherwise match the requested crypto algorithm. Even though commit 5d26a105b5a7 added those aliases for a vast amount of modules, it was missing a few. Add the required MODULE_ALIAS_CRYPTO annotations to those files to make them get loaded automatically, again. This fixes, e.g., requesting 'ecb(blowfish-generic)', which used to work with kernels v3.18 and below. Also change MODULE_ALIAS() lines to MODULE_ALIAS_CRYPTO(). The former won't work for crypto modules any more. Fixes: 5d26a105b5a7 ("crypto: prefix module autoloading with "crypto-"") Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* crypto: prefix module autoloading with "crypto-"Kees Cook2015-01-291-1/+1
| | | | | | | | | | | | | | commit 5d26a105b5a73e5635eae0629b42fa0a90e07b7b upstream. This prefixes all crypto module loading with "crypto-" so we never run the risk of exposing module auto-loading to userspace via a crypto API, as demonstrated by Mathias Krause: https://lkml.org/lkml/2013/3/4/70 Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle modePaul Mackerras2015-01-262-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 8117ac6a6c2fa0f847ff6a21a1f32c8d2c8501d0 upstream. Currently, when going idle, we set the flag indicating that we are in nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap (or sleep or rvwinkle) instruction, all with the MMU on. This is bad for two reasons: (a) the architecture specifies that those instructions must be executed with the MMU off, and in fact with only the SF, HV, ME and possibly RI bits set, and (b) this introduces a race, because as soon as we set the flag, another thread can switch the MMU to a guest context. If the race is lost, this thread will typically start looping on relocation-on ISIs at 0xc...4400. This fixes it by setting the MSR as required by the architecture before setting the flag or executing the nap/sleep/rvwinkle instruction. [ shreyas@linux.vnet.ibm.com: Edited to handle LE ] Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: Fix bad NULL pointer check in udbg_uart_getc_poll()Anton Blanchard2015-01-261-1/+5
| | | | | | | | | | | | | | | commit cd32e2dcc9de6c27ecbbfc0e2079fb64b42bad5f upstream. We have some code in udbg_uart_getc_poll() that tries to protect against a NULL udbg_uart_in, but gets it all wrong. Found with the LLVM static analyzer (scan-build). Fixes: 309257484cc1 ("powerpc: Cleanup udbg_16550 and add support for LPC PIO-only UARTs") Signed-off-by: Anton Blanchard <anton@samba.org> [mpe: Add some newlines for readability while we're here] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: 32 bit getcpu VDSO function uses 64 bit instructionsAnton Blanchard2015-01-071-2/+2
| | | | | | | | | | | | commit 152d44a853e42952f6c8a504fb1f8eefd21fd5fd upstream. I used some 64 bit instructions when adding the 32 bit getcpu VDSO function. Fix it. Fixes: 18ad51dd342a ("powerpc: Add VDSO version of getcpu") Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/powernv: Honor the generic "no_64bit_msi" flagBenjamin Herrenschmidt2014-12-062-4/+3
| | | | | | | | | | | commit 360743814c4082515581aa23ab1d8e699e1fbe88 upstream. Instead of the arch specific quirk which we are deprecating and that drivers don't understand. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/pseries: Fix endiannes issue in RTAS call from xmonLaurent Dufour2014-12-061-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3b8a3c01096925a824ed3272601082289d9c23a5 upstream. On pseries system (LPAR) xmon failed to enter when running in LE mode, system is hunging. Inititating xmon will lead to such an output on the console: SysRq : Entering xmon cpu 0x15: Vector: 0 at [c0000003f39ffb10] pc: c00000000007ed7c: sysrq_handle_xmon+0x5c/0x70 lr: c00000000007ed7c: sysrq_handle_xmon+0x5c/0x70 sp: c0000003f39ffc70 msr: 8000000000009033 current = 0xc0000003fafa7180 paca = 0xc000000007d75e80 softe: 0 irq_happened: 0x01 pid = 14617, comm = bash Bad kernel stack pointer fafb4b0 at eca7cc4 cpu 0x15: Vector: 300 (Data Access) at [c000000007f07d40] pc: 000000000eca7cc4 lr: 000000000eca7c44 sp: fafb4b0 msr: 8000000000001000 dar: 10000000 dsisr: 42000000 current = 0xc0000003fafa7180 paca = 0xc000000007d75e80 softe: 0 irq_happened: 0x01 pid = 14617, comm = bash cpu 0x15: Exception 300 (Data Access) in xmon, returning to main loop xmon: WARNING: bad recursive fault on cpu 0x15 The root cause is that xmon is calling RTAS to turn off the surveillance when entering xmon, and RTAS is requiring big endian parameters. This patch is byte swapping the RTAS arguments when running in LE mode. Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/pseries: Honor the generic "no_64bit_msi" flagBenjamin Herrenschmidt2014-12-061-1/+1
| | | | | | | | | commit 415072a041bf50dbd6d56934ffc0cbbe14c97be8 upstream. Instead of the arch specific quirk which we are deprecating Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: use device_online/offline() instead of cpu_up/down()Dan Streetman2014-11-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | commit 10ccaf178b2b961d8bca252d647ed7ed8aae2a20 upstream. In powerpc pseries platform dlpar operations, use device_online() and device_offline() instead of cpu_up() and cpu_down(). Calling cpu_up/down() directly does not update the cpu device offline field, which is used to online/offline a cpu from sysfs. Calling device_online/offline() instead keeps the sysfs cpu online value correct. The hotplug lock, which is required to be held when calling device_online/offline(), is already held when dlpar_online/offline_cpu() are called, since they are called only from cpu_probe|release_store(). This patch fixes errors on phyp (PowerVM) systems that have cpu(s) added/removed using dlpar operations; without this patch, the /sys/devices/system/cpu/cpuN/online nodes do not correctly show the online state of added/removed cpus. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Fixes: 0902a9044fa5 ("Driver core: Use generic offline/online for CPU offline/online") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: Add smp_mb()s to arch_spin_unlock_wait()Michael Ellerman2014-10-311-0/+4
| | | | | | | | | | | | | | | | | | | | commit 78e05b1421fa41ae8457701140933baa5e7d9479 upstream. Similar to the previous commit which described why we need to add a barrier to arch_spin_is_locked(), we have a similar problem with spin_unlock_wait(). We need a barrier on entry to ensure any spinlock we have previously taken is visibly locked prior to the load of lock->slock. It's also not clear if spin_unlock_wait() is intended to have ACQUIRE semantics. For now be conservative and add a barrier on exit to give it ACQUIRE semantics. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: Add smp_mb() to arch_spin_is_locked()Michael Ellerman2014-10-311-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 51d7d5205d3389a32859f9939f1093f267409929 upstream. The kernel defines the function spin_is_locked(), which can be used to check if a spinlock is currently locked. Using spin_is_locked() on a lock you don't hold is obviously racy. That is, even though you may observe that the lock is unlocked, it may become locked at any time. There is (at least) one exception to that, which is if two locks are used as a pair, and the holder of each checks the status of the other before doing any update. Assuming *A and *B are two locks, and *COUNTER is a shared non-atomic value: The first CPU does: spin_lock(*A) if spin_is_locked(*B) # nothing else smp_mb() LOAD r = *COUNTER r++ STORE *COUNTER = r spin_unlock(*A) And the second CPU does: spin_lock(*B) if spin_is_locked(*A) # nothing else smp_mb() LOAD r = *COUNTER r++ STORE *COUNTER = r spin_unlock(*B) Although this is a strange locking construct, it should work. It seems to be understood, but not documented, that spin_is_locked() is not a memory barrier, so in the examples above and below the caller inserts its own memory barrier before acting on the result of spin_is_locked(). For now we assume spin_is_locked() is implemented as below, and we break it out in our examples: bool spin_is_locked(*LOCK) { LOAD l = *LOCK return l.locked } Our intuition is that there should be no problem even if the two code sequences run simultaneously such as: CPU 0 CPU 1 ================================================== spin_lock(*A) spin_lock(*B) LOAD b = *B LOAD a = *A if b.locked # true if a.locked # true # nothing # nothing spin_unlock(*A) spin_unlock(*B) If one CPU gets the lock before the other then it will do the update and the other CPU will back off: CPU 0 CPU 1 ================================================== spin_lock(*A) LOAD b = *B spin_lock(*B) if b.locked # false LOAD a = *A else if a.locked # true smp_mb() # nothing LOAD r1 = *COUNTER spin_unlock(*B) r1++ STORE *COUNTER = r1 spin_unlock(*A) However in reality spin_lock() itself is not indivisible. On powerpc we implement it as a load-and-reserve and store-conditional. Ignoring the retry logic for the lost reservation case, it boils down to: spin_lock(*LOCK) { LOAD l = *LOCK l.locked = true STORE *LOCK = l ACQUIRE_BARRIER } The ACQUIRE_BARRIER is required to give spin_lock() ACQUIRE semantics as defined in memory-barriers.txt: This acts as a one-way permeable barrier. It guarantees that all memory operations after the ACQUIRE operation will appear to happen after the ACQUIRE operation with respect to the other components of the system. On modern powerpc systems we use lwsync for ACQUIRE_BARRIER. lwsync is also know as "lightweight sync", or "sync 1". As described in Power ISA v2.07 section B.2.1.1, in this scenario the lwsync is not the barrier itself. It instead causes the LOAD of *LOCK to act as the barrier, preventing any loads or stores in the locked region from occurring prior to the load of *LOCK. Whether this behaviour is in accordance with the definition of ACQUIRE semantics in memory-barriers.txt is open to discussion, we may switch to a different barrier in future. What this means in practice is that the following can occur: CPU 0 CPU 1 ================================================== LOAD a = *A LOAD b = *B a.locked = true b.locked = true LOAD b = *B LOAD a = *A STORE *A = a STORE *B = b if b.locked # false if a.locked # false else else smp_mb() smp_mb() LOAD r1 = *COUNTER LOAD r2 = *COUNTER r1++ r2++ STORE *COUNTER = r1 STORE *COUNTER = r2 # Lost update spin_unlock(*A) spin_unlock(*B) That is, the load of *B can occur prior to the store that makes *A visibly locked. And similarly for CPU 1. The result is both CPUs hold their lock and believe the other lock is unlocked. The easiest fix for this is to add a full memory barrier to the start of spin_is_locked(), so adding to our previous definition would give us: bool spin_is_locked(*LOCK) { smp_mb() LOAD l = *LOCK return l.locked } The new barrier orders the store to the lock we are locking vs the load of the other lock: CPU 0 CPU 1 ================================================== LOAD a = *A LOAD b = *B a.locked = true b.locked = true STORE *A = a STORE *B = b smp_mb() smp_mb() LOAD b = *B LOAD a = *A if b.locked # true if a.locked # true # nothing # nothing spin_unlock(*A) spin_unlock(*B) Although the above example is theoretical, there is code similar to this example in sem_lock() in ipc/sem.c. This commit in addition to the next commit appears to be a fix for crashes we are seeing in that code where we believe this race happens in practice. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/thp: Use ACCESS_ONCE when loading pmdpAneesh Kumar K.V2014-09-171-1/+3
| | | | | | | | | | | | commit 7e467245bf5226db34c4b12d3cbacfa2f7a15a8b upstream. We would get wrong results in compiler recomputed old_pmd. Avoid that by using ACCESS_ONCE Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/thp: Invalidate with vpn in loopAneesh Kumar K.V2014-09-171-16/+7
| | | | | | | | | | | | | | | | | commit 969b7b208f7408712a3526856e4ae60ad13f6928 upstream. As per ISA, for 4k base page size we compare 14..65 bits of VA specified with the entry_VA in tlb. That implies we need to make sure we do a tlbie with all the possible 4k va we used to access the 16MB hugepage. With 64k base page size we compare 14..57 bits of VA. Hence we cannot ignore the lower 24 bits of va while tlbie .We also cannot tlb invalidate a 16MB entry with just one tlbie instruction because we don't track which va was used to instantiate the tlb entry. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/thp: Handle combo pages in invalidateAneesh Kumar K.V2014-09-173-5/+13
| | | | | | | | | | | | | | | | | | | | | | | commit fc0479557572375100ef16c71170b29a98e0d69a upstream. If we changed base page size of the segment, either via sub_page_protect or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash table entries. We do a lazy hash page table flush for all mapped pages in the demoted segment. This happens when we handle hash page fault for these pages. We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte, that implies that we could possibly have older 64K hash pte entries in the hash page table and we need to invalidate those entries. Use _PAGE_COMBO to determine the page size with which we should invalidate the hash table entries on unmap. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/thp: Invalidate old 64K based hash page mapping before insert of 4k pteAneesh Kumar K.V2014-09-171-9/+70
| | | | | | | | | | | | | | | | | | | | | | commit 629149fae478f0ac6bf705a535708b192e9c6b59 upstream. If we changed base page size of the segment, either via sub_page_protect or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash table entries. We do a lazy hash page table flush for all mapped pages in the demoted segment. This happens when we handle hash page fault for these pages. We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte, that implies that we could possibly have older 64K hash pte entries in the hash page table and we need to invalidate those entries. Handle this correctly for 16M pages Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/thp: Don't recompute vsid and ssize in loop on invalidateAneesh Kumar K.V2014-09-174-43/+26
| | | | | | | | | | | | | commit fa1f8ae80f8bb996594167ff4750a0b0a5a5bb5d upstream. The segment identifier and segment size will remain the same in the loop, So we can compute it outside. We also change the hugepage_invalidate interface so that we can use it the later patch Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/thp: Add write barrier after updating the valid bitAneesh Kumar K.V2014-09-171-1/+4
| | | | | | | | | | | | | | commit b0aa44a3dfae3d8f45bd1264349aa87f87b7774f upstream. With hugepages, we store the hpte valid information in the pte page whose address is stored in the second half of the PMD. Use a write barrier to make sure clearing pmd busy bit and updating hpte valid info are ordered properly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/pseries: Avoid deadlock on removing ddwGavin Shan2014-09-171-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5efbabe09d986f25c02d19954660238fcd7f008a upstream. Function remove_ddw() could be called in of_reconfig_notifier and we potentially remove the dynamic DMA window property, which invokes of_reconfig_notifier again. Eventually, it leads to the deadlock as following backtrace shows. The patch fixes the above issue by deferring releasing the dynamic DMA window property while releasing the device node. ============================================= [ INFO: possible recursive locking detected ] 3.16.0+ #428 Tainted: G W --------------------------------------------- drmgr/2273 is trying to acquire lock: ((of_reconfig_chain).rwsem){.+.+..}, at: [<c000000000091890>] \ .__blocking_notifier_call_chain+0x40/0x78 but task is already holding lock: ((of_reconfig_chain).rwsem){.+.+..}, at: [<c000000000091890>] \ .__blocking_notifier_call_chain+0x40/0x78 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock((of_reconfig_chain).rwsem); lock((of_reconfig_chain).rwsem); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by drmgr/2273: #0: (sb_writers#4){.+.+.+}, at: [<c0000000001cbe70>] \ .vfs_write+0xb0/0x1f8 #1: ((of_reconfig_chain).rwsem){.+.+..}, at: [<c000000000091890>] \ .__blocking_notifier_call_chain+0x40/0x78 stack backtrace: CPU: 17 PID: 2273 Comm: drmgr Tainted: G W 3.16.0+ #428 Call Trace: [c0000000137e7000] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable) [c0000000137e70b0] [c00000000083cd34] .dump_stack+0x7c/0x9c [c0000000137e7130] [c0000000000b8afc] .__lock_acquire+0x128c/0x1c68 [c0000000137e7280] [c0000000000b9a4c] .lock_acquire+0xe8/0x104 [c0000000137e7350] [c00000000083588c] .down_read+0x4c/0x90 [c0000000137e73e0] [c000000000091890] .__blocking_notifier_call_chain+0x40/0x78 [c0000000137e7490] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48 [c0000000137e7520] [c000000000682a28] .of_reconfig_notify+0x34/0x5c [c0000000137e75b0] [c000000000682a9c] .of_property_notify+0x4c/0x54 [c0000000137e7650] [c000000000682bf0] .of_remove_property+0x30/0xd4 [c0000000137e76f0] [c000000000052a44] .remove_ddw+0x144/0x168 [c0000000137e7790] [c000000000053204] .iommu_reconfig_notifier+0x30/0xe0 [c0000000137e7820] [c00000000009137c] .notifier_call_chain+0x6c/0xb4 [c0000000137e78c0] [c0000000000918ac] .__blocking_notifier_call_chain+0x5c/0x78 [c0000000137e7970] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48 [c0000000137e7a00] [c000000000682a28] .of_reconfig_notify+0x34/0x5c [c0000000137e7a90] [c000000000682e14] .of_detach_node+0x44/0x1fc [c0000000137e7b40] [c0000000000518e4] .ofdt_write+0x3ac/0x688 [c0000000137e7c20] [c000000000238430] .proc_reg_write+0xb8/0xd4 [c0000000137e7cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8 [c0000000137e7d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0 [c0000000137e7e30] [c00000000000a064] syscall_exit+0x0/0x98 Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/pseries: Failure on removing device nodeGavin Shan2014-09-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f1b3929c232784580e5d8ee324b6bc634e709575 upstream. While running command "drmgr -c phb -r -s 'PHB 528'", following backtrace jumped out because the target device node isn't marked with OF_DETACHED by of_detach_node(), which caused by error returned from memory hotplug related reconfig notifier when disabling CONFIG_MEMORY_HOTREMOVE. The patch fixes it. ERROR: Bad of_node_put() on /pci@800000020000210/ethernet@0 CPU: 14 PID: 2252 Comm: drmgr Tainted: G W 3.16.0+ #427 Call Trace: [c000000012a776a0] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable) [c000000012a77750] [c00000000083cd34] .dump_stack+0x7c/0x9c [c000000012a777d0] [c0000000006807c4] .of_node_release+0x58/0xe0 [c000000012a77860] [c00000000038a7d0] .kobject_release+0x174/0x1b8 [c000000012a77900] [c00000000038a884] .kobject_put+0x70/0x78 [c000000012a77980] [c000000000681680] .of_node_put+0x28/0x34 [c000000012a77a00] [c000000000681ea8] .__of_get_next_child+0x64/0x70 [c000000012a77a90] [c000000000682138] .of_find_node_by_path+0x1b8/0x20c [c000000012a77b40] [c000000000051840] .ofdt_write+0x308/0x688 [c000000012a77c20] [c000000000238430] .proc_reg_write+0xb8/0xd4 [c000000012a77cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8 [c000000012a77d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0 [c000000012a77e30] [c00000000000a064] syscall_exit+0x0/0x98 Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/mm: Use read barrier when creating real_pteAneesh Kumar K.V2014-09-171-5/+25
| | | | | | | | | | | | | | | | | commit 85c1fafd7262e68ad821ee1808686b1392b1167d upstream. On ppc64 we support 4K hash pte with 64K page size. That requires us to track the hash pte slot information on a per 4k basis. We do that by storing the slot details in the second half of pte page. The pte bit _PAGE_COMBO is used to indicate whether the second half need to be looked while building real_pte. We need to use read memory barrier while doing that so that load of hidx is not reordered w.r.t _PAGE_COMBO check. On the store side we already do a lwsync in __hash_page_4K Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/mm/numa: Fix break placementAndrey Utkin2014-09-171-1/+1
| | | | | | | | | | | commit b00fc6ec1f24f9d7af9b8988b6a198186eb3408c upstream. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81631 Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* locking/mutex: Disable optimistic spinning on some architecturesPeter Zijlstra2014-07-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream. The optimistic spin code assumes regular stores and cmpxchg() play nice; this is found to not be true for at least: parisc, sparc32, tile32, metag-lock1, arc-!llsc and hexagon. There is further wreckage, but this in particular seemed easy to trigger, so blacklist this. Opt in for known good archs. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Jason Low <jason.low2@hp.com> Cc: Waiman Long <waiman.long@hp.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: John David Anglin <dave.anglin@bell.net> Cc: James Hogan <james.hogan@imgtec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: Disable RELOCATABLE for COMPILE_TEST with PPC64Guenter Roeck2014-07-181-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | commit fb43e8477ed9006c4f397f904c691a120503038c upstream. powerpc:allmodconfig has been failing for some time with the following error. arch/powerpc/kernel/exceptions-64s.S: Assembler messages: arch/powerpc/kernel/exceptions-64s.S:1312: Error: attempt to move .org backwards make[1]: *** [arch/powerpc/kernel/head_64.o] Error 1 A number of attempts to fix the problem by moving around code have been unsuccessful and resulted in failed builds for some configurations and the discovery of toolchain bugs. Fix the problem by disabling RELOCATABLE for COMPILE_TEST builds instead. While this is less than perfect, it avoids substantial code changes which would otherwise be necessary just to make COMPILE_TEST builds happy and might have undesired side effects. Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/perf: Clear MMCR2 when enabling PMUJoel Stanley2014-07-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b50a6c584bb47b370f84bfd746770c0bbe7129b7 upstream. On POWER8 when switching to a KVM guest we set bits in MMCR2 to freeze the PMU counters. Aside from on boot they are then never reset, resulting in stuck perf counters for any user in the guest or host. We now set MMCR2 to 0 whenever enabling the PMU, which provides a sane state for perf to use the PMU counters under either the guest or the host. This was manifesting as a bug with ppc64_cpu --frequency: $ sudo ppc64_cpu --frequency WARNING: couldn't run on cpu 0 WARNING: couldn't run on cpu 8 ... WARNING: couldn't run on cpu 144 WARNING: couldn't run on cpu 152 min: 18446744073.710 GHz (cpu -1) max: 0.000 GHz (cpu -1) avg: 0.000 GHz The command uses a perf counter to measure CPU cycles over a fixed amount of time, in order to approximate the frequency of the machine. The counters were returning zero once a guest was started, regardless of weather it was still running or had been shut down. By dumping the value of MMCR2, it was observed that once a guest is running MMCR2 is set to 1s - which stops counters from running: $ sudo sh -c 'echo p > /proc/sysrq-trigger' CPU: 0 PMU registers, ppmu = POWER8 n_counters = 6 PMC1: 5b635e38 PMC2: 00000000 PMC3: 00000000 PMC4: 00000000 PMC5: 1bf5a646 PMC6: 5793d378 PMC7: deadbeef PMC8: deadbeef MMCR0: 0000000080000000 MMCR1: 000000001e000000 MMCRA: 0000040000000000 MMCR2: fffffffffffffc00 EBBHR: 0000000000000000 EBBRR: 0000000000000000 BESCR: 0000000000000000 SIAR: 00000000000a51cc SDAR: c00000000fc40000 SIER: 0000000001000000 This is done unconditionally in book3s_hv_interrupts.S upon entering the guest, and the original value is only save/restored if the host has indicated it was using the PMU. This is okay, however the user of the PMU needs to ensure that it is in a defined state when it starts using it. Fixes: e05b9b9e5c10 ("powerpc/perf: Power8 PMU support") Signed-off-by: Joel Stanley <joel@jms.id.au> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/perf: Add PPMU_ARCH_207S defineJoel Stanley2014-07-183-5/+4
| | | | | | | | | | | | | | commit 4d9690dd56b0d18f2af8a9d4a279cb205aae3345 upstream. Instead of separate bits for every POWER8 PMU feature, have a single one for v2.07 of the architecture. This saves us adding a MMCR2 define for a future patch. Signed-off-by: Joel Stanley <joel@jms.id.au> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc/perf: Never program book3s PMCs with values >= 0x80000000Anton Blanchard2014-07-181-1/+16
| | | | | | | | | | | | | | | | | | | | | | commit f56029410a13cae3652d1f34788045c40a13ffc7 upstream. We are seeing a lot of PMU warnings on POWER8: Can't find PMC that caused IRQ Looking closer, the active PMC is 0 at this point and we took a PMU exception on the transition from negative to 0. Some versions of POWER8 have an issue where they edge detect and not level detect PMC overflows. A number of places program the PMC with (0x80000000 - period_left), where period_left can be negative. We can either fix all of these or just ensure that period_left is always >= 1. This patch takes the second option. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
* powerpc: Don't skip ePAPR spin-table CPUsScott Wood2014-07-171-1/+9
| | | | | | | | | | | | | | | | | | | commit 6663a4fa6711050036562ddfd2086edf735fae21 upstream. Commit 59a53afe70fd530040bdc69581f03d880157f15a "powerpc: Don't setup CPUs with bad status" broke ePAPR SMP booting. ePAPR says that CPUs that aren't presently running shall have status of disabled, with enable-method being used to determine whether the CPU can be enabled. Fix by checking for spin-table, which is currently the only supported enable-method. Signed-off-by: Scott Wood <scottwood@freescale.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Emil Medve <Emilian.Medve@Freescale.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>