summaryrefslogtreecommitdiffstats
path: root/arch
Commit message (Collapse)AuthorAgeFilesLines
* powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt()Gautham R. Shenoy2019-10-112-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c784be435d5dae28d3b03db31753dd7a18733f0c ] The calls to arch_add_memory()/arch_remove_memory() are always made with the read-side cpu_hotplug_lock acquired via memory_hotplug_begin(). On pSeries, arch_add_memory()/arch_remove_memory() eventually call resize_hpt() which in turn calls stop_machine() which acquires the read-side cpu_hotplug_lock again, thereby resulting in the recursive acquisition of this lock. In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system lockup during a memory hotplug operation because cpus_read_lock() is a per-cpu rwsem read, which, in the fast-path (in the absence of the writer, which in our case is a CPU-hotplug operation) simply increments the read_count on the semaphore. Thus a recursive read in the fast-path doesn't cause any problems. However, we can hit this problem in practice if there is a concurrent CPU-Hotplug operation in progress which is waiting to acquire the write-side of the lock. This will cause the second recursive read to block until the writer finishes. While the writer is blocked since the first read holds the lock. Thus both the reader as well as the writers fail to make any progress thereby blocking both CPU-Hotplug as well as Memory Hotplug operations. Memory-Hotplug CPU-Hotplug CPU 0 CPU 1 ------ ------ 1. down_read(cpu_hotplug_lock.rw_sem) [memory_hotplug_begin] 2. down_write(cpu_hotplug_lock.rw_sem) [cpu_up/cpu_down] 3. down_read(cpu_hotplug_lock.rw_sem) [stop_machine()] Lockdep complains as follows in these code-paths. swapper/0/1 is trying to acquire lock: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60 but task is already holding lock: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(cpu_hotplug_lock.rw_sem); lock(cpu_hotplug_lock.rw_sem); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by swapper/0/1: #0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0 #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50 #2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0 stack backtrace: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166 Call Trace: dump_stack+0xe8/0x164 (unreliable) __lock_acquire+0x1110/0x1c70 lock_acquire+0x240/0x290 cpus_read_lock+0x64/0xf0 stop_machine+0x2c/0x60 pseries_lpar_resize_hpt+0x19c/0x2c0 resize_hpt_for_hotplug+0x70/0xd0 arch_add_memory+0x58/0xfc devm_memremap_pages+0x5e8/0x8f0 pmem_attach_disk+0x764/0x830 nvdimm_bus_probe+0x118/0x240 really_probe+0x230/0x4b0 driver_probe_device+0x16c/0x1e0 __driver_attach+0x148/0x1b0 bus_for_each_dev+0x90/0x130 driver_attach+0x34/0x50 bus_add_driver+0x1a8/0x360 driver_register+0x108/0x170 __nd_driver_register+0xd0/0xf0 nd_pmem_driver_init+0x34/0x48 do_one_initcall+0x1e0/0x45c kernel_init_freeable+0x540/0x64c kernel_init+0x2c/0x160 ret_from_kernel_thread+0x5c/0x68 Fix this issue by 1) Requiring all the calls to pseries_lpar_resize_hpt() be made with cpu_hotplug_lock held. 2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked() as a consequence of 1) 3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt() with cpu_hotplug_lock held. Fixes: dbcf929c0062 ("powerpc/pseries: Add support for hash table resizing") Cc: stable@vger.kernel.org # v4.11+ Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VPCédric Le Goater2019-10-111-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 237aed48c642328ff0ab19b63423634340224a06 ] When a vCPU is brought done, the XIVE VP (Virtual Processor) is first disabled and then the event notification queues are freed. When freeing the queues, we check for possible escalation interrupts and free them also. But when a XIVE VP is disabled, the underlying XIVE ENDs also are disabled in OPAL. When an END (Event Notification Descriptor) is disabled, its ESB pages (ESn and ESe) are disabled and loads return all 1s. Which means that any access on the ESB page of the escalation interrupt will return invalid values. When an interrupt is freed, the shutdown handler computes a 'saved_p' field from the value returned by a load in xive_do_source_set_mask(). This value is incorrect for escalation interrupts for the reason described above. This has no impact on Linux/KVM today because we don't make use of it but we will introduce in future changes a xive_get_irqchip_state() handler. This handler will use the 'saved_p' field to return the state of an interrupt and 'saved_p' being incorrect, softlockup will occur. Fix the vCPU cleanup sequence by first freeing the escalation interrupts if any, then disable the XIVE VP and last free the queues. Fixes: 90c73795afa2 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode") Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org Signed-off-by: Sasha Levin <sashal@kernel.org>
* powerpc/powernv: Restrict OPAL symbol map to only be readable by rootAndrew Donnellan2019-10-111-4/+7
| | | | | | | | | | | | | | | | | | commit e7de4f7b64c23e503a8c42af98d56f2a7462bd6d upstream. Currently the OPAL symbol map is globally readable, which seems bad as it contains physical addresses. Restrict it to root. Fixes: c8742f85125d ("powerpc/powernv: Expose OPAL firmware symbol map") Cc: stable@vger.kernel.org # v3.19+ Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190503075253.22798-1-ajd@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: nVMX: handle page fault in vmread fixJack Wang2019-10-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During backport f7eea636c3d5 ("KVM: nVMX: handle page fault in vmread"), there was a mistake the exception reference should be passed to function kvm_write_guest_virt_system, instead of NULL, other wise, we will get NULL pointer deref, eg kvm-unit-test triggered a NULL pointer deref below: [ 948.518437] kvm [24114]: vcpu0, guest rIP: 0x407ef9 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x3, nop [ 949.106464] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 949.106707] PGD 0 P4D 0 [ 949.106872] Oops: 0002 [#1] SMP [ 949.107038] CPU: 2 PID: 24126 Comm: qemu-2.7 Not tainted 4.19.77-pserver #4.19.77-1+feature+daily+update+20191005.1625+a4168bb~deb9 [ 949.107283] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 2.7.3 01/31/2018 [ 949.107549] RIP: 0010:kvm_write_guest_virt_system+0x12/0x40 [kvm] [ 949.107719] Code: c0 5d 41 5c 41 5d 41 5e 83 f8 03 41 0f 94 c0 41 c1 e0 02 e9 b0 ed ff ff 0f 1f 44 00 00 48 89 f0 c6 87 59 56 00 00 01 48 89 d6 <49> c7 00 00 00 00 00 89 ca 49 c7 40 08 00 00 00 00 49 c7 40 10 00 [ 949.108044] RSP: 0018:ffffb31b0a953cb0 EFLAGS: 00010202 [ 949.108216] RAX: 000000000046b4d8 RBX: ffff9e9f415b0000 RCX: 0000000000000008 [ 949.108389] RDX: ffffb31b0a953cc0 RSI: ffffb31b0a953cc0 RDI: ffff9e9f415b0000 [ 949.108562] RBP: 00000000d2e14928 R08: 0000000000000000 R09: 0000000000000000 [ 949.108733] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffc8 [ 949.108907] R13: 0000000000000002 R14: ffff9e9f4f26f2e8 R15: 0000000000000000 [ 949.109079] FS: 00007eff8694c700(0000) GS:ffff9e9f51a80000(0000) knlGS:0000000031415928 [ 949.109318] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 949.109495] CR2: 0000000000000000 CR3: 00000003be53b002 CR4: 00000000003626e0 [ 949.109671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 949.109845] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 949.110017] Call Trace: [ 949.110186] handle_vmread+0x22b/0x2f0 [kvm_intel] [ 949.110356] ? vmexit_fill_RSB+0xc/0x30 [kvm_intel] [ 949.110549] kvm_arch_vcpu_ioctl_run+0xa98/0x1b30 [kvm] [ 949.110725] ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm] [ 949.110901] kvm_vcpu_ioctl+0x388/0x5d0 [kvm] [ 949.111072] do_vfs_ioctl+0xa2/0x620 Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9Paul Mackerras2019-10-111-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | commit ff42df49e75f053a8a6b4c2533100cdcc23afe69 upstream. On POWER9, when userspace reads the value of the DPDES register on a vCPU, it is possible for 0 to be returned although there is a doorbell interrupt pending for the vCPU. This can lead to a doorbell interrupt being lost across migration. If the guest kernel uses doorbell interrupts for IPIs, then it could malfunction because of the lost interrupt. This happens because a newly-generated doorbell interrupt is signalled by setting vcpu->arch.doorbell_request to 1; the DPDES value in vcpu->arch.vcore->dpdes is not updated, because it can only be updated when holding the vcpu mutex, in order to avoid races. To fix this, we OR in vcpu->arch.doorbell_request when reading the DPDES value. Cc: stable@vger.kernel.org # v4.13+ Fixes: 579006944e0d ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* s390/topology: avoid firing events before kobjs are createdVasily Gorbik2019-10-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f3122a79a1b0a113d3aea748e0ec26f2cb2889de upstream. arch_update_cpu_topology is first called from: kernel_init_freeable->sched_init_smp->sched_init_domains even before cpus has been registered in: kernel_init_freeable->do_one_initcall->s390_smp_init Do not trigger kobject_uevent change events until cpu devices are actually created. Fixes the following kasan findings: BUG: KASAN: global-out-of-bounds in kobject_uevent_env+0xb40/0xee0 Read of size 8 at addr 0000000000000020 by task swapper/0/1 BUG: KASAN: global-out-of-bounds in kobject_uevent_env+0xb36/0xee0 Read of size 8 at addr 0000000000000018 by task swapper/0/1 CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B Hardware name: IBM 3906 M04 704 (LPAR) Call Trace: ([<0000000143c6db7e>] show_stack+0x14e/0x1a8) [<0000000145956498>] dump_stack+0x1d0/0x218 [<000000014429fb4c>] print_address_description+0x64/0x380 [<000000014429f630>] __kasan_report+0x138/0x168 [<0000000145960b96>] kobject_uevent_env+0xb36/0xee0 [<0000000143c7c47c>] arch_update_cpu_topology+0x104/0x108 [<0000000143df9e22>] sched_init_domains+0x62/0xe8 [<000000014644c94a>] sched_init_smp+0x3a/0xc0 [<0000000146433a20>] kernel_init_freeable+0x558/0x958 [<000000014599002a>] kernel_init+0x22/0x160 [<00000001459a71d4>] ret_from_fork+0x28/0x30 [<00000001459a71dc>] kernel_thread_starter+0x0/0x10 Cc: stable@vger.kernel.org Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: s390: Test for bad access register and size at the start of S390_MEM_OPThomas Huth2019-10-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a13b03bbb4575b350b46090af4dfd30e735aaed1 upstream. If the KVM_S390_MEM_OP ioctl is called with an access register >= 16, then there is certainly a bug in the calling userspace application. We check for wrong access registers, but only if the vCPU was already in the access register mode before (i.e. the SIE block has recorded it). The check is also buried somewhere deep in the calling chain (in the function ar_translation()), so this is somewhat hard to find. It's better to always report an error to the userspace in case this field is set wrong, and it's safer in the KVM code if we block wrong values here early instead of relying on a check somewhere deep down the calling chain, so let's add another check to kvm_s390_guest_mem_op() directly. We also should check that the "size" is non-zero here (thanks to Janosch Frank for the hint!). If we do not check the size, we could call vmalloc() with this 0 value, and this will cause a kernel warning. Signed-off-by: Thomas Huth <thuth@redhat.com> Link: https://lkml.kernel.org/r/20190829122517.31042-1-thuth@redhat.com Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* s390/process: avoid potential reading of freed stackVasily Gorbik2019-10-111-6/+16
| | | | | | | | | | | | | | | commit 8769f610fe6d473e5e8e221709c3ac402037da6c upstream. With THREAD_INFO_IN_TASK (which is selected on s390) task's stack usage is refcounted and should always be protected by get/put when touching other task's stack to avoid race conditions with task's destruction code. Fixes: d5c352cdd022 ("s390: move thread_info into task_struct") Cc: stable@vger.kernel.org # v4.10+ Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hypfs: Fix error number left in struct pointer memberDavid Howells2019-10-071-4/+5
| | | | | | | | | | | | | | | | | | | [ Upstream commit b54c64f7adeb241423cd46598f458b5486b0375e ] In hypfs_fill_super(), if hypfs_create_update_file() fails, sbi->update_file is left holding an error number. This is passed to hypfs_kill_super() which doesn't check for this. Fix this by not setting sbi->update_value until after we've checked for error. Fixes: 24bbb1faf3f0 ("[PATCH] s390_hypfs filesystem") Signed-off-by: David Howells <dhowells@redhat.com> cc: Martin Schwidefsky <schwidefsky@de.ibm.com> cc: Heiko Carstens <heiko.carstens@de.ibm.com> cc: linux-s390@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: 8903/1: ensure that usable memory in bank 0 starts from a PMD-aligned ↵Mike Rapoport2019-10-071-0/+16
| | | | | | | | | | | | | | | | | | | | | address [ Upstream commit 00d2ec1e6bd82c0538e6dd3e4a4040de93ba4fef ] The calculation of memblock_limit in adjust_lowmem_bounds() assumes that bank 0 starts from a PMD-aligned address. However, the beginning of the first bank may be NOMAP memory and the start of usable memory will be not aligned to PMD boundary. In such case the memblock_limit will be set to the end of the NOMAP region, which will prevent any memblock allocations. Mark the region between the end of the NOMAP area and the next PMD-aligned address as NOMAP as well, so that the usable memory will start at PMD-aligned address. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: 8898/1: mm: Don't treat faults reported from cache maintenance as writesWill Deacon2019-10-072-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 834020366da9ab3fb87d1eb9a3160eb22dbed63a ] Translation faults arising from cache maintenance instructions are rather unhelpfully reported with an FSR value where the WnR field is set to 1, indicating that the faulting access was a write. Since cache maintenance instructions on 32-bit ARM do not require any particular permissions, this can cause our private 'cacheflush' system call to fail spuriously if a translation fault is generated due to page aging when targetting a read-only VMA. In this situation, we will return -EFAULT to userspace, although this is unfortunately suppressed by the popular '__builtin___clear_cache()' intrinsic provided by GCC, which returns void. Although it's tempting to write this off as a userspace issue, we can actually do a little bit better on CPUs that support LPAE, even if the short-descriptor format is in use. On these CPUs, cache maintenance faults additionally set the CM field in the FSR, which we can use to suppress the write permission checks in the page fault handler and succeed in performing cache maintenance to read-only areas even in the presence of a translation fault. Reported-by: Orion Hodson <oth@google.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* MIPS: tlbex: Explicitly cast _PAGE_NO_EXEC to a booleanNathan Chancellor2019-10-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c59ae0a1055127dd3828a88e111a0db59b254104 ] clang warns: arch/mips/mm/tlbex.c:634:19: error: use of logical '&&' with constant operand [-Werror,-Wconstant-logical-operand] if (cpu_has_rixi && _PAGE_NO_EXEC) { ^ ~~~~~~~~~~~~~ arch/mips/mm/tlbex.c:634:19: note: use '&' for a bitwise operation if (cpu_has_rixi && _PAGE_NO_EXEC) { ^~ & arch/mips/mm/tlbex.c:634:19: note: remove constant to silence this warning if (cpu_has_rixi && _PAGE_NO_EXEC) { ~^~~~~~~~~~~~~~~~ 1 error generated. Explicitly cast this value to a boolean so that clang understands we intend for this to be a non-zero value. Fixes: 00bf1c691d08 ("MIPS: tlbex: Avoid placing software PTE bits in Entry* PFN fields") Link: https://github.com/ClangBuiltLinux/linux/issues/609 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Paul Burton <paul.burton@mips.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <jhogan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: linux-mips@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: clang-built-linux@googlegroups.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* arm64: fix unreachable code issue with cmpxchgArnd Bergmann2019-10-071-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 920fdab7b3ce98c14c840261e364f490f3679a62 ] On arm64 build with clang, sometimes the __cmpxchg_mb is not inlined when CONFIG_OPTIMIZE_INLINING is set. Clang then fails a compile-time assertion, because it cannot tell at compile time what the size of the argument is: mm/memcontrol.o: In function `__cmpxchg_mb': memcontrol.c:(.text+0x1a4c): undefined reference to `__compiletime_assert_175' memcontrol.c:(.text+0x1a4c): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `__compiletime_assert_175' Mark all of the cmpxchg() style functions as __always_inline to ensure that the compiler can see the result. Acked-by: Nick Desaulniers <ndesaulniers@google.com> Reported-by: Nathan Chancellor <natechancellor@gmail.com> Link: https://github.com/ClangBuiltLinux/linux/issues/648 Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Andrew Murray <andrew.murray@arm.com> Tested-by: Andrew Murray <andrew.murray@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* powerpc/pseries: correctly track irq state in default idleNathan Lynch2019-10-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 92c94dfb69e350471473fd3075c74bc68150879e ] prep_irq_for_idle() is intended to be called before entering H_CEDE (and it is used by the pseries cpuidle driver). However the default pseries idle routine does not call it, leading to mismanaged lazy irq state when the cpuidle driver isn't in use. Manifestations of this include: * Dropped IPIs in the time immediately after a cpu comes online (before it has installed the cpuidle handler), making the online operation block indefinitely waiting for the new cpu to respond. * Hitting this WARN_ON in arch_local_irq_restore(): /* * We should already be hard disabled here. We had bugs * where that wasn't the case so let's dbl check it and * warn if we are wrong. Only do that when IRQ tracing * is enabled as mfmsr() can be costly. */ if (WARN_ON_ONCE(mfmsr() & MSR_EE)) __hard_irq_disable(); Call prep_irq_for_idle() from pseries_lpar_idle() and honor its result. Fixes: 363edbe2614a ("powerpc: Default arch idle could cede processor on pseries") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190910225244.25056-1-nathanl@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* powerpc/64s/exception: machine check use correct cfar for late handlerNicholas Piggin2019-10-071-0/+4
| | | | | | | | | | | | | | | | | | | [ Upstream commit 0b66370c61fcf5fcc1d6901013e110284da6e2bb ] Bare metal machine checks run an "early" handler in real mode before running the main handler which reports the event. The main handler runs exactly as a normal interrupt handler, after the "windup" which sets registers back as they were at interrupt entry. CFAR does not get restored by the windup code, so that will be wrong when the handler is run. Restore the CFAR to the saved value before running the late handler. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190802105709.27696-8-npiggin@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* powerpc/pseries/mobility: use cond_resched when updating device treeNathan Lynch2019-10-071-0/+9
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit ccfb5bd71d3d1228090a8633800ae7cdf42a94ac ] After a partition migration, pseries_devicetree_update() processes changes to the device tree communicated from the platform to Linux. This is a relatively heavyweight operation, with multiple device tree searches, memory allocations, and conversations with partition firmware. There's a few levels of nested loops which are bounded only by decisions made by the platform, outside of Linux's control, and indeed we have seen RCU stalls on large systems while executing this call graph. Use cond_resched() in these loops so that the cpu is yielded when needed. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190802192926.19277-4-nathanl@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* powerpc/futex: Fix warning: 'oldval' may be used uninitialized in this functionChristophe Leroy2019-10-071-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 38a0d0cdb46d3f91534e5b9839ec2d67be14c59d ] We see warnings such as: kernel/futex.c: In function 'do_futex': kernel/futex.c:1676:17: warning: 'oldval' may be used uninitialized in this function [-Wmaybe-uninitialized] return oldval == cmparg; ^ kernel/futex.c:1651:6: note: 'oldval' was declared here int oldval, ret; ^ This is because arch_futex_atomic_op_inuser() only sets *oval if ret is 0 and GCC doesn't see that it will only use it when ret is 0. Anyway, the non-zero ret path is an error path that won't suffer from setting *oval, and as *oval is a local var in futex_atomic_op_inuser() it will have no impact. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> [mpe: reword change log slightly] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/86b72f0c134367b214910b27b9a6dd3321af93bb.1565774657.git.christophe.leroy@c-s.fr Signed-off-by: Sasha Levin <sashal@kernel.org>
* powerpc/rtas: use device model APIs and serialization during LPMNathan Lynch2019-10-071-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a6717c01ddc259f6f73364779df058e2c67309f8 ] The LPAR migration implementation and userspace-initiated cpu hotplug can interleave their executions like so: 1. Set cpu 7 offline via sysfs. 2. Begin a partition migration, whose implementation requires the OS to ensure all present cpus are online; cpu 7 is onlined: rtas_ibm_suspend_me -> rtas_online_cpus_mask -> cpu_up This sets cpu 7 online in all respects except for the cpu's corresponding struct device; dev->offline remains true. 3. Set cpu 7 online via sysfs. _cpu_up() determines that cpu 7 is already online and returns success. The driver core (device_online) sets dev->offline = false. 4. The migration completes and restores cpu 7 to offline state: rtas_ibm_suspend_me -> rtas_offline_cpus_mask -> cpu_down This leaves cpu7 in a state where the driver core considers the cpu device online, but in all other respects it is offline and unused. Attempts to online the cpu via sysfs appear to succeed but the driver core actually does not pass the request to the lower-level cpuhp support code. This makes the cpu unusable until the cpu device is manually set offline and then online again via sysfs. Instead of directly calling cpu_up/cpu_down, the migration code should use the higher-level device core APIs to maintain consistent state and serialize operations. Fixes: 120496ac2d2d ("powerpc: Bring all threads online prior to migration/hibernation") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190802192926.19277-2-nathanl@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* powerpc/xmon: Check for HV mode when dumping XIVE info from OPALCédric Le Goater2019-10-071-6/+9
| | | | | | | | | | | | | | | [ Upstream commit c3e0dbd7f780a58c4695f1cd8fc8afde80376737 ] Currently, the xmon 'dx' command calls OPAL to dump the XIVE state in the OPAL logs and also outputs some of the fields of the internal XIVE structures in Linux. The OPAL calls can only be done on baremetal (PowerNV) and they crash a pseries machine. Fix by checking the hypervisor feature of the CPU. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190814154754.23682-2-clg@kaod.org Signed-off-by: Sasha Levin <sashal@kernel.org>
* arm64: dts: rockchip: limit clock rate of MMC controllers for RK3328Shawn Lin2019-10-051-0/+3
| | | | | | | | | | | | | | | | | | | commit 03e61929c0d227ed3e1c322fc3804216ea298b7e upstream. 150MHz is a fundamental limitation of RK3328 Soc, w/o this limitation, eMMC, for instance, will run into 200MHz clock rate in HS200 mode, which makes the RK3328 boards not always boot properly. By adding it in rk3328.dtsi would also obviate the worry of missing it when adding new boards. Fixes: 52e02d377a72 ("arm64: dts: rockchip: add core dtsi file for RK3328 SoCs") Cc: stable@vger.kernel.org Cc: Robin Murphy <robin.murphy@arm.com> Cc: Liang Chen <cl@rock-chips.com> Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com> Signed-off-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ARM: zynq: Use memcpy_toio instead of memcpy on smp bring-upLuis Araneda2019-10-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b7005d4ef4f3aa2dc24019ffba03a322557ac43d upstream. This fixes a kernel panic on memcpy when FORTIFY_SOURCE is enabled. The initial smp implementation on commit aa7eb2bb4e4a ("arm: zynq: Add smp support") used memcpy, which worked fine until commit ee333554fed5 ("ARM: 8749/1: Kconfig: Add ARCH_HAS_FORTIFY_SOURCE") enabled overflow checks at runtime, producing a read overflow panic. The computed size of memcpy args are: - p_size (dst): 4294967295 = (size_t) -1 - q_size (src): 1 - size (len): 8 Additionally, the memory is marked as __iomem, so one of the memcpy_* functions should be used for read/write. Fixes: aa7eb2bb4e4a ("arm: zynq: Add smp support") Signed-off-by: Luis Araneda <luaraneda@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ARM: samsung: Fix system restart on S3C6410Lihua Yao2019-10-051-0/+1
| | | | | | | | | | | | | commit 16986074035cc0205472882a00d404ed9d213313 upstream. S3C6410 system restart is triggered by watchdog reset. Cc: <stable@vger.kernel.org> Fixes: 9f55342cc2de ("ARM: dts: s3c64xx: Fix infinite interrupt in soft mode") Signed-off-by: Lihua Yao <ylhuajnu@outlook.com> Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: Manually calculate reserved bits when loading PDPTRSSean Christopherson2019-10-051-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 16cfacc8085782dab8e365979356ce1ca87fd6cc upstream. Manually generate the PDPTR reserved bit mask when explicitly loading PDPTRs. The reserved bits that are being tracked by the MMU reflect the current paging mode, which is unlikely to be PAE paging in the vast majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation, __set_sregs(), etc... This can cause KVM to incorrectly signal a bad PDPTR, or more likely, miss a reserved bit check and subsequently fail a VM-Enter due to a bad VMCS.GUEST_PDPTR. Add a one off helper to generate the reserved bits instead of sharing code across the MMU's calculations and the PDPTR emulation. The PDPTR reserved bits are basically set in stone, and pushing a helper into the MMU's calculation adds unnecessary complexity without improving readability. Oppurtunistically fix/update the comment for load_pdptrs(). Note, the buggy commit also introduced a deliberate functional change, "Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was effectively (and correctly) reverted by commit cd9ae5fe47df ("KVM: x86: Fix page-tables reserved bits"). A bit of SDM archaeology shows that the SDM from late 2008 had a bug (likely a copy+paste error) where it listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved for 2mb entries. I.e. the SDM contradicted itself, and bits 6:5 are and always have been reserved. Fixes: 20c466b56168d ("KVM: Use rsvd_bits_mask in load_pdptrs()") Cc: stable@vger.kernel.org Cc: Nadav Amit <nadav.amit@gmail.com> Reported-by: Doug Reiland <doug.reiland@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: set ctxt->have_exception in x86_decode_insn()Jan Dakinevich2019-10-052-0/+8
| | | | | | | | | | | | | | | | | | commit c8848cee74ff05638e913582a476bde879c968ad upstream. x86_emulate_instruction() takes into account ctxt->have_exception flag during instruction decoding, but in practice this flag is never set in x86_decode_insn(). Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: always stop emulation on page faultJan Dakinevich2019-10-051-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8530a79c5a9f4e29e6ffb35ec1a79d81f4968ec8 upstream. inject_emulated_exception() returns true if and only if nested page fault happens. However, page fault can come from guest page tables walk, either nested or not nested. In both cases we should stop an attempt to read under RIP and give guest to step over its own page fault handler. This is also visible when an emulated instruction causes a #GP fault and the VMware backdoor is enabled. To handle the VMware backdoor, KVM intercepts #GP faults; with only the next patch applied, x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL instead of EMULATE_DONE. EMULATE_FAIL causes handle_exception_nmi() (or gp_interception() for SVM) to re-inject the original #GP because it thinks emulation failed due to a non-VMware opcode. This patch prevents the issue as x86_emulate_instruction() will return EMULATE_DONE after injecting the #GP. Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/retpolines: Fix up backport of a9d57ef15cbeNathan Chancellor2019-10-051-1/+1
| | | | | | | | | | | | | | | | | | | | Commit a9d57ef15cbe ("x86/retpolines: Disable switch jump tables when retpolines are enabled") added -fno-jump-tables to workaround a GCC issue while deliberately avoiding adding this flag when CONFIG_CC_IS_CLANG is set, which is defined by the kconfig system when CC=clang is provided. However, this symbol was added in 4.18 in commit 469cb7376c06 ("kconfig: add CC_IS_CLANG and CLANG_VERSION") so it is always undefined in 4.14, meaning -fno-jump-tables gets added when using Clang. Fix this up by using the equivalent $(cc-name) comparison, which matches what upstream did until commit 076f421da5d4 ("kbuild: replace cc-name test with CONFIG_CC_IS_CLANG"). Fixes: e28951100515 ("x86/retpolines: Disable switch jump tables when retpolines are enabled") Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* s390/crypto: xts-aes-s390 fix extra run-time crypto self tests findingHarald Freudenberger2019-10-051-0/+6
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 9e323d45ba94262620a073a3f9945ca927c07c71 ] With 'extra run-time crypto self tests' enabled, the selftest for s390-xts fails with alg: skcipher: xts-aes-s390 encryption unexpectedly succeeded on test vector "random: len=0 klen=64"; expected_error=-22, cfg="random: inplace use_digest nosimd src_divs=[2.61%@+4006, 84.44%@+21, 1.55%@+13, 4.50%@+344, 4.26%@+21, 2.64%@+27]" This special case with nbytes=0 is not handled correctly and this fix now makes sure that -EINVAL is returned when there is en/decrypt called with 0 bytes to en/decrypt. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: dts: exynos: Mark LDO10 as always-on on Peach Pit/Pi ChromebooksMarek Szyprowski2019-10-052-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5b0eeeaa37615df37a9a30929b73e9defe61ca84 ] Commit aff138bf8e37 ("ARM: dts: exynos: Add TMU nodes regulator supply for Peach boards") assigned LDO10 to Exynos Thermal Measurement Unit, but it turned out that it supplies also some other critical parts and board freezes/crashes when it is turned off. The mentioned commit made Exynos TMU a consumer of that regulator and in typical case Exynos TMU driver keeps it enabled from early boot. However there are such configurations (example is multi_v7_defconfig), in which some of the regulators are compiled as modules and are not available from early boot. In such case it may happen that LDO10 is turned off by regulator core, because it has no consumers yet (in this case consumer drivers cannot get it, because the supply regulators for it are not yet available). This in turn causes the board to crash. This patch restores 'always-on' property for the LDO10 regulator. Fixes: aff138bf8e37 ("ARM: dts: exynos: Add TMU nodes regulator supply for Peach boards") Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* arm64: kpti: ensure patched kernel text is fetched from PoUMark Rutland2019-10-051-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f32c7a8e45105bd0af76872bf6eef0438ff12fb2 ] While the MMUs is disabled, I-cache speculation can result in instructions being fetched from the PoC. During boot we may patch instructions (e.g. for alternatives and jump labels), and these may be dirty at the PoU (and stale at the PoC). Thus, while the MMU is disabled in the KPTI pagetable fixup code we may load stale instructions into the I-cache, potentially leading to subsequent crashes when executing regions of code which have been modified at runtime. Similarly to commit: 8ec41987436d566f ("arm64: mm: ensure patched kernel text is fetched from PoU") ... we can invalidate the I-cache after enabling the MMU to prevent such issues. The KPTI pagetable fixup code itself should be clean to the PoC per the boot protocol, so no maintenance is required for this code. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: dts: imx7d: cl-som-imx7: make ethernet work againAndré Draszik2019-10-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9846a4524ac90b63496580b7ad50674b40d92a8f ] Recent changes to the atheros at803x driver caused ethernet to stop working on this board. In particular commit 6d4cd041f0af ("net: phy: at803x: disable delay only for RGMII mode") and commit cd28d1d6e52e ("net: phy: at803x: Disable phy delay for RGMII mode") fix the AR8031 driver to configure the phy's (RX/TX) delays as per the 'phy-mode' in the device tree. This now prevents ethernet from working on this board. It used to work before those commits, because the AR8031 comes out of reset with RX delay enabled, and the at803x driver didn't touch the delay configuration at all when "rgmii" mode was selected, and because arch/arm/mach-imx/mach-imx7d.c:ar8031_phy_fixup() unconditionally enables TX delay. Since above commits ar8031_phy_fixup() also has no effect anymore, and the end-result is that all delays are disabled in the phy, no ethernet. Update the device tree to restore functionality. Signed-off-by: André Draszik <git@andred.net> CC: Ilya Ledvich <ilya@compulab.co.il> CC: Igor Grinberg <grinberg@compulab.co.il> CC: Rob Herring <robh+dt@kernel.org> CC: Mark Rutland <mark.rutland@arm.com> CC: Shawn Guo <shawnguo@kernel.org> CC: Sascha Hauer <s.hauer@pengutronix.de> CC: Pengutronix Kernel Team <kernel@pengutronix.de> CC: Fabio Estevam <festevam@gmail.com> CC: NXP Linux Team <linux-imx@nxp.com> CC: devicetree@vger.kernel.org CC: linux-arm-kernel@lists.infradead.org Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ia64:unwind: fix double free for mod->arch.init_unw_tablechenzefeng2019-10-051-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c5e5c48c16422521d363c33cfb0dcf58f88c119b ] The function free_module in file kernel/module.c as follow: void free_module(struct module *mod) { ...... module_arch_cleanup(mod); ...... module_arch_freeing_init(mod); ...... } Both module_arch_cleanup and module_arch_freeing_init function would free the mod->arch.init_unw_table, which cause double free. Here, set mod->arch.init_unw_table = NULL after remove the unwind table to avoid double free. Signed-off-by: chenzefeng <chenzefeng2@huawei.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/apic: Soft disable APIC before initializing itThomas Gleixner2019-10-051-0/+8
| | | | | | | | | | | | | | | | | [ Upstream commit 2640da4cccf5cc613bf26f0998b9e340f4b5f69c ] If the APIC was already enabled on entry of setup_local_APIC() then disabling it soft via the SPIV register makes a lot of sense. That masks all LVT entries and brings it into a well defined state. Otherwise previously enabled LVTs which are not touched in the setup function stay unmasked and might surprise the just booting kernel. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190722105219.068290579@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/reboot: Always use NMI fallback when shutdown via reboot vector IPI failsGrzegorz Halat2019-10-051-19/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 747d5a1bf293dcb33af755a6d285d41b8c1ea010 ] A reboot request sends an IPI via the reboot vector and waits for all other CPUs to stop. If one or more CPUs are in critical regions with interrupts disabled then the IPI is not handled on those CPUs and the shutdown hangs if native_stop_other_cpus() is called with the wait argument set. Such a situation can happen when one CPU was stopped within a lock held section and another CPU is trying to acquire that lock with interrupts disabled. There are other scenarios which can cause such a lockup as well. In theory the shutdown should be attempted by an NMI IPI after the timeout period elapsed. Though the wait loop after sending the reboot vector IPI prevents this. It checks the wait request argument and the timeout. If wait is set, which is true for sys_reboot() then it won't fall through to the NMI shutdown method after the timeout period has finished. This was an oversight when the NMI shutdown mechanism was added to handle the 'reboot IPI is not working' situation. The mechanism was added to deal with stuck panic shutdowns, which do not have the wait request set, so the 'wait request' case was probably not considered. Remove the wait check from the post reboot vector IPI wait loop and enforce that the wait loop in the NMI fallback path is invoked even if NMI IPIs are disabled or the registration of the NMI handler fails. That second wait loop will then hang if not all CPUs shutdown and the wait argument is set. [ tglx: Avoid the hard to parse line break in the NMI fallback path, add comments and massage the changelog ] Fixes: 7d007d21e539 ("x86/reboot: Use NMI to assist in shutting down if IRQ fails") Signed-off-by: Grzegorz Halat <ghalat@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Don Zickus <dzickus@redhat.com> Link: https://lkml.kernel.org/r/20190628122813.15500-1-ghalat@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 fieldWill Deacon2019-10-051-0/+5
| | | | | | | | | | | | | | | | commit 2a355ec25729053bb9a1a89b6c1d1cdd6c3b3fb1 upstream. While the CSV3 field of the ID_AA64_PFR0 CPU ID register can be checked to see if a CPU is susceptible to Meltdown and therefore requires kpti to be enabled, existing CPUs do not implement this field. We therefore whitelist all unaffected Cortex-A CPUs that do not implement the CSV3 field. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Niklas Cassel <niklas.cassel@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* powerpc/xive: Fix bogus error code returned by OPALGreg Kurz2019-10-053-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6ccb4ac2bf8a35c694ead92f8ac5530a16e8f2c8 upstream. There's a bug in skiboot that causes the OPAL_XIVE_ALLOCATE_IRQ call to return the 32-bit value 0xffffffff when OPAL has run out of IRQs. Unfortunatelty, OPAL return values are signed 64-bit entities and errors are supposed to be negative. If that happens, the linux code confusingly treats 0xffffffff as a valid IRQ number and panics at some point. A fix was recently merged in skiboot: e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()") but we need a workaround anyway to support older skiboots already in the field. Internally convert 0xffffffff to OPAL_RESOURCE which is the usual error returned upon resource exhaustion. Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821713818.1985334.14123187368108582810.stgit@bahia.lan (groug: fix arch/powerpc/platforms/powernv/opal-wrappers.S instead of non-existing arch/powerpc/platforms/powernv/opal-call.c) Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/hyper-v: Fix overflow bug in fill_gva_list()Tianyu Lan2019-09-211-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4030b4c585c41eeefec7bd20ce3d0e100a0f2e4d ] When the 'start' parameter is >= 0xFF000000 on 32-bit systems, or >= 0xFFFFFFFF'FF000000 on 64-bit systems, fill_gva_list() gets into an infinite loop. With such inputs, 'cur' overflows after adding HV_TLB_FLUSH_UNIT and always compares as less than end. Memory is filled with guest virtual addresses until the system crashes. Fix this by never incrementing 'cur' to be larger than 'end'. Reported-by: Jong Hyun Park <park.jonghyun@yonsei.ac.kr> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 2ffd9e33ce4a ("x86/hyper-v: Use hypercall for remote TLB flush") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/uaccess: Don't leak the AC flags into __get_user() argument evaluationPeter Zijlstra2019-09-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9b8bd476e78e89c9ea26c3b435ad0201c3d7dbf5 ] Identical to __put_user(); the __get_user() argument evalution will too leak UBSAN crud into the __uaccess_begin() / __uaccess_end() region. While uncommon this was observed to happen for: drivers/xen/gntdev.c: if (__get_user(old_status, batch->status[i])) where UBSAN added array bound checking. This complements commit: 6ae865615fc4 ("x86/uaccess: Dont leak the AC flag into __put_user() argument evaluation") Tested-by Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: broonie@kernel.org Cc: sfr@canb.auug.org.au Cc: akpm@linux-foundation.org Cc: Randy Dunlap <rdunlap@infradead.org> Cc: mhocko@suse.cz Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20190829082445.GM2369@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/amd/ibs: Fix sample bias for dispatched micro-opsKim Phillips2019-09-212-7/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 0f4cd769c410e2285a4e9873a684d90423f03090 ] When counting dispatched micro-ops with cnt_ctl=1, in order to prevent sample bias, IBS hardware preloads the least significant 7 bits of current count (IbsOpCurCnt) with random values, such that, after the interrupt is handled and counting resumes, the next sample taken will be slightly perturbed. The current count bitfield is in the IBS execution control h/w register, alongside the maximum count field. Currently, the IBS driver writes that register with the maximum count, leaving zeroes to fill the current count field, thereby overwriting the random bits the hardware preloaded for itself. Fix the driver to actually retain and carry those random bits from the read of the IBS control register, through to its write, instead of overwriting the lower current count bits with zeroes. Tested with: perf record -c 100001 -e ibs_op/cnt_ctl=1/pp -a -C 0 taskset -c 0 <workload> 'perf annotate' output before: 15.70 65: addsd %xmm0,%xmm1 17.30 add $0x1,%rax 15.88 cmp %rdx,%rax je 82 17.32 72: test $0x1,%al jne 7c 7.52 movapd %xmm1,%xmm0 5.90 jmp 65 8.23 7c: sqrtsd %xmm1,%xmm0 12.15 jmp 65 'perf annotate' output after: 16.63 65: addsd %xmm0,%xmm1 16.82 add $0x1,%rax 16.81 cmp %rdx,%rax je 82 16.69 72: test $0x1,%al jne 7c 8.30 movapd %xmm1,%xmm0 8.13 jmp 65 8.24 7c: sqrtsd %xmm1,%xmm0 8.39 jmp 65 Tested on Family 15h and 17h machines. Machines prior to family 10h Rev. C don't have the RDWROPCNT capability, and have the IbsOpCurCnt bitfield reserved, so this patch shouldn't affect their operation. It is unknown why commit db98c5faf8cb ("perf/x86: Implement 64-bit counter support for IBS") ignored the lower 4 bits of the IbsOpCurCnt field; the number of preloaded random bits has always been 7, AFAICT. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Arnaldo Carvalho de Melo" <acme@kernel.org> Cc: <x86@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Borislav Petkov" <bp@alien8.de> Cc: Stephane Eranian <eranian@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: "Namhyung Kim" <namhyung@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20190826195730.30614-1-kim.phillips@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel: Restrict period on NehalemJosh Hunt2019-09-211-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 44d3bbb6f5e501b873218142fe08cdf62a4ac1f3 ] We see our Nehalem machines reporting 'perfevents: irq loop stuck!' in some cases when using perf: perfevents: irq loop stuck! WARNING: CPU: 0 PID: 3485 at arch/x86/events/intel/core.c:2282 intel_pmu_handle_irq+0x37b/0x530 ... RIP: 0010:intel_pmu_handle_irq+0x37b/0x530 ... Call Trace: <NMI> ? perf_event_nmi_handler+0x2e/0x50 ? intel_pmu_save_and_restart+0x50/0x50 perf_event_nmi_handler+0x2e/0x50 nmi_handle+0x6e/0x120 default_do_nmi+0x3e/0x100 do_nmi+0x102/0x160 end_repeat_nmi+0x16/0x50 ... ? native_write_msr+0x6/0x20 ? native_write_msr+0x6/0x20 </NMI> intel_pmu_enable_event+0x1ce/0x1f0 x86_pmu_start+0x78/0xa0 x86_pmu_enable+0x252/0x310 __perf_event_task_sched_in+0x181/0x190 ? __switch_to_asm+0x41/0x70 ? __switch_to_asm+0x35/0x70 ? __switch_to_asm+0x41/0x70 ? __switch_to_asm+0x35/0x70 finish_task_switch+0x158/0x260 __schedule+0x2f6/0x840 ? hrtimer_start_range_ns+0x153/0x210 schedule+0x32/0x80 schedule_hrtimeout_range_clock+0x8a/0x100 ? hrtimer_init+0x120/0x120 ep_poll+0x2f7/0x3a0 ? wake_up_q+0x60/0x60 do_epoll_wait+0xa9/0xc0 __x64_sys_epoll_wait+0x1a/0x20 do_syscall_64+0x4e/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fdeb1e96c03 ... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: acme@kernel.org Cc: Josh Hunt <johunt@akamai.com> Cc: bpuranda@akamai.com Cc: mingo@redhat.com Cc: jolsa@redhat.com Cc: tglx@linutronix.de Cc: namhyung@kernel.org Cc: alexander.shishkin@linux.intel.com Link: https://lkml.kernel.org/r/1566256411-18820-1-git-send-email-johunt@akamai.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: 8901/1: add a criteria for pfn_valid of armzhaoyang2019-09-211-0/+5
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5b3efa4f1479c91cb8361acef55f9c6662feba57 ] pfn_valid can be wrong when parsing a invalid pfn whose phys address exceeds BITS_PER_LONG as the MSB will be trimed when shifted. The issue originally arise from bellowing call stack, which corresponding to an access of the /proc/kpageflags from userspace with a invalid pfn parameter and leads to kernel panic. [46886.723249] c7 [<c031ff98>] (stable_page_flags) from [<c03203f8>] [46886.723264] c7 [<c0320368>] (kpageflags_read) from [<c0312030>] [46886.723280] c7 [<c0311fb0>] (proc_reg_read) from [<c02a6e6c>] [46886.723290] c7 [<c02a6e24>] (__vfs_read) from [<c02a7018>] [46886.723301] c7 [<c02a6f74>] (vfs_read) from [<c02a778c>] [46886.723315] c7 [<c02a770c>] (SyS_pread64) from [<c0108620>] (ret_fast_syscall+0x0/0x28) Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/apic: Fix arch_dynirq_lower_bound() bug for DT enabled machinesThomas Gleixner2019-09-211-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3e5bedc2c258341702ddffbd7688c5e6eb01eafa ] Rahul Tanwar reported the following bug on DT systems: > 'ioapic_dynirq_base' contains the virtual IRQ base number. Presently, it is > updated to the end of hardware IRQ numbers but this is done only when IOAPIC > configuration type is IOAPIC_DOMAIN_LEGACY or IOAPIC_DOMAIN_STRICT. There is > a third type IOAPIC_DOMAIN_DYNAMIC which applies when IOAPIC configuration > comes from devicetree. > > See dtb_add_ioapic() in arch/x86/kernel/devicetree.c > > In case of IOAPIC_DOMAIN_DYNAMIC (DT/OF based system), 'ioapic_dynirq_base' > remains to zero initialized value. This means that for OF based systems, > virtual IRQ base will get set to zero. Such systems will very likely not even boot. For DT enabled machines ioapic_dynirq_base is irrelevant and not updated, so simply map the IRQ base 1:1 instead. Reported-by: Rahul Tanwar <rahul.tanwar@linux.intel.com> Tested-by: Rahul Tanwar <rahul.tanwar@linux.intel.com> Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: alan@linux.intel.com Cc: bp@alien8.de Cc: cheol.yong.kim@intel.com Cc: qi-ming.wu@intel.com Cc: rahul.tanwar@intel.com Cc: rppt@linux.ibm.com Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/20190821081330.1187-1-rahul.tanwar@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: 8874/1: mm: only adjust sections of valid mm structuresDoug Berger2019-09-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c51bc12d06b3a5494fbfcbd788a8e307932a06e9 ] A timing hazard exists when an early fork/exec thread begins exiting and sets its mm pointer to NULL while a separate core tries to update the section information. This commit ensures that the mm pointer is not NULL before setting its section parameters. The arguments provided by commit 11ce4b33aedc ("ARM: 8672/1: mm: remove tasklist locking from update_sections_early()") are equally valid for not requiring grabbing the task_lock around this check. Fixes: 08925c2f124f ("ARM: 8464/1: Update all mm structures with section adjustments") Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Laura Abbott <labbott@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org> Cc: Peng Fan <peng.fan@nxp.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* s390/bpf: use 32-bit index for tail callsIlya Leoshkevich2019-09-211-4/+6
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 91b4db5313a2c793aabc2143efb8ed0cf0fdd097 ] "p runtime/jit: pass > 32bit index to tail_call" fails when bpf_jit_enable=1, because the tail call is not executed. This in turn is because the generated code assumes index is 64-bit, while it must be 32-bit, and as a result prog array bounds check fails, while it should pass. Even if bounds check would have passed, the code that follows uses 64-bit index to compute prog array offset. Fix by using clrj instead of clgrj for comparing index with array size, and also by using llgfr for truncating index to 32 bits before using it to compute prog array offset. Fixes: 6651ee070b31 ("s390/bpf: implement bpf_tail_call() helper") Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: dts: dra74x: Fix iodelay configuration for mmc3Faiz Abbas2019-09-211-25/+25
| | | | | | | | | | | | | | | | [ Upstream commit 07f9a8be66a9bd86f9eaedf8f8aeb416195adab8 ] According to the latest am572x[1] and dra74x[2] data manuals, mmc3 default, hs, sdr12 and sdr25 modes use iodelay values given in MMC3_MANUAL1. Set the MODE_SELECT bit for these so that manual mode is selected and correct iodelay values can be configured. [1] http://www.ti.com/lit/ds/symlink/am5728.pdf [2] http://www.ti.com/lit/ds/symlink/dra746.pdf Signed-off-by: Faiz Abbas <faiz_abbas@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: OMAP2+: Fix omap4 errata warning on other SoCsTony Lindgren2019-09-211-0/+3
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 45da5e09dd32fa98c32eaafe2513db6bd75e2f4f ] We have errata i688 workaround produce warnings on SoCs other than omap4 and omap5: omap4_sram_init:Unable to allocate sram needed to handle errata I688 omap4_sram_init:Unable to get sram pool needed to handle errata I688 This is happening because there is no ti,omap4-mpu node, or no SRAM to configure for the other SoCs, so let's remove the warning based on the SoC revision checks. As nobody has complained it seems that the other SoC variants do not need this workaround. Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* s390/bpf: fix lcgr instruction encodingIlya Leoshkevich2019-09-211-1/+1
| | | | | | | | | | | | | | | | | | | [ Upstream commit bb2d267c448f4bc3a3389d97c56391cb779178ae ] "masking, test in bounds 3" fails on s390, because BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0) ignores the top 32 bits of BPF_REG_2. The reason is that JIT emits lcgfr instead of lcgr. The associated comment indicates that the code was intended to emit lcgr in the first place, it's just that the wrong opcode was used. Fix by using the correct opcode. Fixes: 054623105728 ("s390/bpf: Add s390x eBPF JIT compiler backend") Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ARM: OMAP2+: Fix missing SYSC_HAS_RESET_STATUS for dra7 epwmssTony Lindgren2019-09-211-1/+2
| | | | | | | | | | | | | | | [ Upstream commit afd58b162e48076e3fe66d08a69eefbd6fe71643 ] TRM says PWMSS_SYSCONFIG bit for SOFTRESET changes to zero when reset is completed. Let's configure it as otherwise we get warnings on boot when we check the data against dts provided data. Eventually the legacy platform data will be just dropped, but let's fix the warning first. Reviewed-by: Suman Anna <s-anna@ti.com> Tested-by: Keerthy <j-keerthy@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* powerpc/mm/radix: Use the right page size for vmemmap mappingAneesh Kumar K.V2019-09-211-9/+7
| | | | | | | | | | | | | | | | | | commit 89a3496e0664577043666791ec07fb731d57c950 upstream. We use mmu_vmemmap_psize to find the page size for mapping the vmmemap area. With radix translation, we are suboptimally setting this value to PAGE_SIZE. We do check for 2M page size support and update mmu_vmemap_psize to use hugepage size but we suboptimally reset the value to PAGE_SIZE in radix__early_init_mmu(). This resulted in always mapping vmemmap area with 64K page size. Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/build: Add -Wnoaddress-of-packed-member to REALMODE_CFLAGS, to silence ↵Linus Torvalds2019-09-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GCC9 build warning commit 42e0e95474fc6076b5cd68cab8fa0340a1797a72 upstream. One of the very few warnings I have in the current build comes from arch/x86/boot/edd.c, where I get the following with a gcc9 build: arch/x86/boot/edd.c: In function ‘query_edd’: arch/x86/boot/edd.c:148:11: warning: taking address of packed member of ‘struct boot_params’ may result in an unaligned pointer value [-Waddress-of-packed-member] 148 | mbrptr = boot_params.edd_mbr_sig_buffer; | ^~~~~~~~~~~ This warning triggers because we throw away all the CFLAGS and then make a new set for REALMODE_CFLAGS, so the -Wno-address-of-packed-member we added in the following commit is not present: 6f303d60534c ("gcc-9: silence 'address-of-packed-member' warning") The simplest solution for now is to adjust the warning for this version of CFLAGS as well, but it would definitely make sense to examine whether REALMODE_CFLAGS could be derived from CFLAGS, so that it picks up changes in the compiler flags environment automatically. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* powerpc: Add barrier_nospec to raw_copy_in_user()Suraj Jitindar Singh2019-09-191-0/+1
| | | | | | | | | | | | | | | | | | | commit 6fbcdd59094ade30db63f32316e9502425d7b256 upstream. Commit ddf35cf3764b ("powerpc: Use barrier_nospec in copy_from_user()") Added barrier_nospec before loading from user-controlled pointers. The intention was to order the load from the potentially user-controlled pointer vs a previous branch based on an access_ok() check or similar. In order to achieve the same result, add a barrier_nospec to the raw_copy_in_user() function before loading from such a user-controlled pointer. Fixes: ddf35cf3764b ("powerpc: Use barrier_nospec in copy_from_user()") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>