summaryrefslogtreecommitdiffstats
path: root/arch/x86
Commit message (Collapse)AuthorAgeFilesLines
* perf/x86/amd: Change/fix NMI latency mitigation to use a timestampTom Lendacky2019-11-061-13/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit df4d29732fdad43a51284f826bec3e6ded177540 ] It turns out that the NMI latency workaround from commit: 6d3edaae16c6 ("x86/perf/amd: Resolve NMI latency issues for active PMCs") ends up being too conservative and results in the perf NMI handler claiming NMIs too easily on AMD hardware when the NMI watchdog is active. This has an impact, for example, on the hpwdt (HPE watchdog timer) module. This module can produce an NMI that is used to reset the system. It registers an NMI handler for the NMI_UNKNOWN type and relies on the fact that nothing has claimed an NMI so that its handler will be invoked when the watchdog device produces an NMI. After the referenced commit, the hpwdt module is unable to process its generated NMI if the NMI watchdog is active, because the current NMI latency mitigation results in the NMI being claimed by the perf NMI handler. Update the AMD perf NMI latency mitigation workaround to, instead, use a window of time. Whenever a PMC is handled in the perf NMI handler, set a timestamp which will act as a perf NMI window. Any NMIs arriving within that window will be claimed by perf. Anything outside that window will not be claimed by perf. The value for the NMI window is set to 100 msecs. This is a conservative value that easily covers any NMI latency in the hardware. While this still results in a window in which the hpwdt module will not receive its NMI, the window is now much, much smaller. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jerry Hoemann <jerry.hoemann@hpe.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 6d3edaae16c6 ("x86/perf/amd: Resolve NMI latency issues for active PMCs") Link: https://lkml.kernel.org/r/Message-ID: Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/xen: Return from panic notifierBoris Ostrovsky2019-11-061-3/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c6875f3aacf2a5a913205accddabf0bfb75cac76 ] Currently execution of panic() continues until Xen's panic notifier (xen_panic_event()) is called at which point we make a hypercall that never returns. This means that any notifier that is supposed to be called later as well as significant part of panic() code (such as pstore writes from kmsg_dump()) is never executed. There is no reason for xen_panic_event() to be this last point in execution since panic()'s emergency_restart() will call into xen_emergency_restart() from where we can perform our hypercall. Nevertheless, we will provide xen_legacy_crash boot option that will preserve original behavior during crash. This option could be used, for example, if running kernel dumper (which happens after panic notifiers) is undesirable. Reported-by: James Dingwall <james@dingwall.me.uk> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* efi/x86: Do not clean dummy variable in kexec pathDave Young2019-11-061-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2ecb7402cfc7f22764e7bbc80790e66eadb20560 ] kexec reboot fails randomly in UEFI based KVM guest. The firmware just resets while calling efi_delete_dummy_variable(); Unfortunately I don't know how to debug the firmware, it is also possible a potential problem on real hardware as well although nobody reproduced it. The intention of the efi_delete_dummy_variable is to trigger garbage collection when entering virtual mode. But SetVirtualAddressMap can only run once for each physical reboot, thus kexec_enter_virtual_mode() is not necessarily a good place to clean a dummy object. Drop the efi_delete_dummy_variable so that kexec reboot can work. Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Matthew Garrett <mjg59@google.com> Cc: Ben Dooks <ben.dooks@codethink.co.uk> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lukas Wunner <lukas@wunner.de> Cc: Lyude Paul <lyude@redhat.com> Cc: Octavian Purdila <octavian.purdila@intel.com> Cc: Peter Jones <pjones@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott Talbert <swt@techie.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Cc: linux-integrity@vger.kernel.org Link: https://lkml.kernel.org/r/20191002165904.8819-8-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/cpu: Add Atom Tremont (Jacobsville)Kan Liang2019-11-061-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 00ae831dfe4474ef6029558f5eb3ef0332d80043 ] Add the Atom Tremont model number to the Intel family list. [ Tony: Also update comment at head of file to say "_X" suffix is also used for microserver parts. ] Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Aristeu Rozanski <aris@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com> Cc: Megha Dey <megha.dey@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190125195902.17109-4-tony.luck@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* kvm: vmx: Basic APIC virtualization controls have three settingsJim Mattson2019-10-294-28/+38
| | | | | | | | | | | | | | | | | commit 8d860bbeedef97fe981d28fa7b71d77f3b29563f upstream. Previously, we toggled between SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE and SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES, depending on whether or not the EXTD bit was set in MSR_IA32_APICBASE. However, if the local APIC is disabled, we should not set either of these APIC virtualization control bits. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: "Jitindar SIngh, Suraj" <surajjs@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: apic: Flush TLB after APIC mode/address change if VPIDs are in useJunaid Shahid2019-10-291-10/+4
| | | | | | | | | | | | | | | | | | | commit a468f2dbf921d02f5107378501693137a812999b upstream. Currently, KVM flushes the TLB after a change to the APIC access page address or the APIC mode when EPT mode is enabled. However, even in shadow paging mode, a TLB flush is needed if VPIDs are being used, as specified in the Intel SDM Section 29.4.5. So replace vmx_flush_tlb_ept_only() with vmx_flush_tlb(), which will flush if either EPT or VPIDs are in use. Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Cc: "Jitindar SIngh, Suraj" <surajjs@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: vmx: Introduce lapic_mode enumerationJim Mattson2019-10-292-11/+29
| | | | | | | | | | | | | | | | | | | | | commit 588716494258899389206fa50426e78cc9df89b9 upstream. The local APIC can be in one of three modes: disabled, xAPIC or x2APIC. (A fourth mode, "invalid," is included for completeness.) Using the new enumeration can make some of the APIC mode logic easier to read. In kvm_set_apic_base, for instance, it is clear that one cannot transition directly from x2APIC mode to xAPIC mode or directly from APIC disabled to x2APIC mode. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> [Check invalid bits even if msr_info->host_initiated. Reported by Wanpeng Li. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Cc: "Jitindar SIngh, Suraj" <surajjs@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: X86: introduce invalidate_gpa argument to tlb flushWanpeng Li2019-10-294-21/+22
| | | | | | | | | | | | | | | | | | | | commit c2ba05ccfde2f069a66c0462e5b5ef8a517dcc9c upstream. Introduce a new bool invalidate_gpa argument to kvm_x86_ops->tlb_flush, it will be used by later patches to just flush guest tlb. For VMX, this will use INVVPID instead of INVEPT, which will invalidate combined mappings while keeping guest-physical mappings. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Jitindar SIngh, Suraj" <surajjs@amazon.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel areaSteve Wahl2019-10-291-2/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2aa85f246c181b1fa89f27e8e20c5636426be624 upstream. Our hardware (UV aka Superdome Flex) has address ranges marked reserved by the BIOS. Access to these ranges is caught as an error, causing the BIOS to halt the system. Initial page tables mapped a large range of physical addresses that were not checked against the list of BIOS reserved addresses, and sometimes included reserved addresses in part of the mapped range. Including the reserved range in the map allowed processor speculative accesses to the reserved range, triggering a BIOS halt. Used early in booting, the page table level2_kernel_pgt addresses 1 GiB divided into 2 MiB pages, and it was set up to linearly map a full 1 GiB of physical addresses that included the physical address range of the kernel image, as chosen by KASLR. But this also included a large range of unused addresses on either side of the kernel image. And unlike the kernel image's physical address range, this extra mapped space was not checked against the BIOS tables of usable RAM addresses. So there were times when the addresses chosen by KASLR would result in processor accessible mappings of BIOS reserved physical addresses. The kernel code did not directly access any of this extra mapped space, but having it mapped allowed the processor to issue speculative accesses into reserved memory, causing system halts. This was encountered somewhat rarely on a normal system boot, and much more often when starting the crash kernel if "crashkernel=512M,high" was specified on the command line (this heavily restricts the physical address of the crash kernel, in our case usually within 1 GiB of reserved space). The solution is to invalidate the pages of this table outside the kernel image's space before the page table is activated. It fixes this problem on our hardware. [ bp: Touchups. ] Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: dimitri.sivanich@hpe.com Cc: Feng Tang <feng.tang@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jordan Borgner <mail@jordan-borgner.de> Cc: Juergen Gross <jgross@suse.com> Cc: mike.travis@hpe.com Cc: russ.anderson@hpe.com Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com> Link: https://lkml.kernel.org/r/9c011ee51b081534a7a15065b1681d200298b530.1569358539.git.steve.wahl@hpe.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen/efi: Set nonblocking callbacksRoss Lagerwall2019-10-291-0/+2
| | | | | | | | | | | | | | [ Upstream commit df359f0d09dc029829b66322707a2f558cb720f7 ] Other parts of the kernel expect these nonblocking EFI callbacks to exist and crash when running under Xen. Since the implementations of xen_efi_set_variable() and xen_efi_query_variable_info() do not take any locks, use them for the nonblocking callbacks too. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/asm: Fix MWAITX C-state hint valueJanakarajan Natarajan2019-10-172-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 454de1e7d970d6bc567686052329e4814842867c upstream. As per "AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions", MWAITX EAX[7:4]+1 specifies the optional hint of the optimized C-state. For C0 state, EAX[7:4] should be set to 0xf. Currently, a value of 0xf is set for EAX[3:0] instead of EAX[7:4]. Fix this by changing MWAITX_DISABLE_CSTATES from 0xf to 0xf0. This hasn't had any implications so far because setting reserved bits in EAX is simply ignored by the CPU. [ bp: Fixup comment in delay_mwaitx() and massage. ] Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191007190011.4859-1-Janakarajan.Natarajan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: nVMX: handle page fault in vmread fixJack Wang2019-10-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During backport f7eea636c3d5 ("KVM: nVMX: handle page fault in vmread"), there was a mistake the exception reference should be passed to function kvm_write_guest_virt_system, instead of NULL, other wise, we will get NULL pointer deref, eg kvm-unit-test triggered a NULL pointer deref below: [ 948.518437] kvm [24114]: vcpu0, guest rIP: 0x407ef9 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x3, nop [ 949.106464] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 949.106707] PGD 0 P4D 0 [ 949.106872] Oops: 0002 [#1] SMP [ 949.107038] CPU: 2 PID: 24126 Comm: qemu-2.7 Not tainted 4.19.77-pserver #4.19.77-1+feature+daily+update+20191005.1625+a4168bb~deb9 [ 949.107283] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 2.7.3 01/31/2018 [ 949.107549] RIP: 0010:kvm_write_guest_virt_system+0x12/0x40 [kvm] [ 949.107719] Code: c0 5d 41 5c 41 5d 41 5e 83 f8 03 41 0f 94 c0 41 c1 e0 02 e9 b0 ed ff ff 0f 1f 44 00 00 48 89 f0 c6 87 59 56 00 00 01 48 89 d6 <49> c7 00 00 00 00 00 89 ca 49 c7 40 08 00 00 00 00 49 c7 40 10 00 [ 949.108044] RSP: 0018:ffffb31b0a953cb0 EFLAGS: 00010202 [ 949.108216] RAX: 000000000046b4d8 RBX: ffff9e9f415b0000 RCX: 0000000000000008 [ 949.108389] RDX: ffffb31b0a953cc0 RSI: ffffb31b0a953cc0 RDI: ffff9e9f415b0000 [ 949.108562] RBP: 00000000d2e14928 R08: 0000000000000000 R09: 0000000000000000 [ 949.108733] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffc8 [ 949.108907] R13: 0000000000000002 R14: ffff9e9f4f26f2e8 R15: 0000000000000000 [ 949.109079] FS: 00007eff8694c700(0000) GS:ffff9e9f51a80000(0000) knlGS:0000000031415928 [ 949.109318] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 949.109495] CR2: 0000000000000000 CR3: 00000003be53b002 CR4: 00000000003626e0 [ 949.109671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 949.109845] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 949.110017] Call Trace: [ 949.110186] handle_vmread+0x22b/0x2f0 [kvm_intel] [ 949.110356] ? vmexit_fill_RSB+0xc/0x30 [kvm_intel] [ 949.110549] kvm_arch_vcpu_ioctl_run+0xa98/0x1b30 [kvm] [ 949.110725] ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm] [ 949.110901] kvm_vcpu_ioctl+0x388/0x5d0 [kvm] [ 949.111072] do_vfs_ioctl+0xa2/0x620 Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: Manually calculate reserved bits when loading PDPTRSSean Christopherson2019-10-051-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 16cfacc8085782dab8e365979356ce1ca87fd6cc upstream. Manually generate the PDPTR reserved bit mask when explicitly loading PDPTRs. The reserved bits that are being tracked by the MMU reflect the current paging mode, which is unlikely to be PAE paging in the vast majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation, __set_sregs(), etc... This can cause KVM to incorrectly signal a bad PDPTR, or more likely, miss a reserved bit check and subsequently fail a VM-Enter due to a bad VMCS.GUEST_PDPTR. Add a one off helper to generate the reserved bits instead of sharing code across the MMU's calculations and the PDPTR emulation. The PDPTR reserved bits are basically set in stone, and pushing a helper into the MMU's calculation adds unnecessary complexity without improving readability. Oppurtunistically fix/update the comment for load_pdptrs(). Note, the buggy commit also introduced a deliberate functional change, "Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was effectively (and correctly) reverted by commit cd9ae5fe47df ("KVM: x86: Fix page-tables reserved bits"). A bit of SDM archaeology shows that the SDM from late 2008 had a bug (likely a copy+paste error) where it listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved for 2mb entries. I.e. the SDM contradicted itself, and bits 6:5 are and always have been reserved. Fixes: 20c466b56168d ("KVM: Use rsvd_bits_mask in load_pdptrs()") Cc: stable@vger.kernel.org Cc: Nadav Amit <nadav.amit@gmail.com> Reported-by: Doug Reiland <doug.reiland@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: set ctxt->have_exception in x86_decode_insn()Jan Dakinevich2019-10-052-0/+8
| | | | | | | | | | | | | | | | | | commit c8848cee74ff05638e913582a476bde879c968ad upstream. x86_emulate_instruction() takes into account ctxt->have_exception flag during instruction decoding, but in practice this flag is never set in x86_decode_insn(). Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: always stop emulation on page faultJan Dakinevich2019-10-051-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8530a79c5a9f4e29e6ffb35ec1a79d81f4968ec8 upstream. inject_emulated_exception() returns true if and only if nested page fault happens. However, page fault can come from guest page tables walk, either nested or not nested. In both cases we should stop an attempt to read under RIP and give guest to step over its own page fault handler. This is also visible when an emulated instruction causes a #GP fault and the VMware backdoor is enabled. To handle the VMware backdoor, KVM intercepts #GP faults; with only the next patch applied, x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL instead of EMULATE_DONE. EMULATE_FAIL causes handle_exception_nmi() (or gp_interception() for SVM) to re-inject the original #GP because it thinks emulation failed due to a non-VMware opcode. This patch prevents the issue as x86_emulate_instruction() will return EMULATE_DONE after injecting the #GP. Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/retpolines: Fix up backport of a9d57ef15cbeNathan Chancellor2019-10-051-1/+1
| | | | | | | | | | | | | | | | | | | | Commit a9d57ef15cbe ("x86/retpolines: Disable switch jump tables when retpolines are enabled") added -fno-jump-tables to workaround a GCC issue while deliberately avoiding adding this flag when CONFIG_CC_IS_CLANG is set, which is defined by the kconfig system when CC=clang is provided. However, this symbol was added in 4.18 in commit 469cb7376c06 ("kconfig: add CC_IS_CLANG and CLANG_VERSION") so it is always undefined in 4.14, meaning -fno-jump-tables gets added when using Clang. Fix this up by using the equivalent $(cc-name) comparison, which matches what upstream did until commit 076f421da5d4 ("kbuild: replace cc-name test with CONFIG_CC_IS_CLANG"). Fixes: e28951100515 ("x86/retpolines: Disable switch jump tables when retpolines are enabled") Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/apic: Soft disable APIC before initializing itThomas Gleixner2019-10-051-0/+8
| | | | | | | | | | | | | | | | | [ Upstream commit 2640da4cccf5cc613bf26f0998b9e340f4b5f69c ] If the APIC was already enabled on entry of setup_local_APIC() then disabling it soft via the SPIV register makes a lot of sense. That masks all LVT entries and brings it into a well defined state. Otherwise previously enabled LVTs which are not touched in the setup function stay unmasked and might surprise the just booting kernel. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190722105219.068290579@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/reboot: Always use NMI fallback when shutdown via reboot vector IPI failsGrzegorz Halat2019-10-051-19/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 747d5a1bf293dcb33af755a6d285d41b8c1ea010 ] A reboot request sends an IPI via the reboot vector and waits for all other CPUs to stop. If one or more CPUs are in critical regions with interrupts disabled then the IPI is not handled on those CPUs and the shutdown hangs if native_stop_other_cpus() is called with the wait argument set. Such a situation can happen when one CPU was stopped within a lock held section and another CPU is trying to acquire that lock with interrupts disabled. There are other scenarios which can cause such a lockup as well. In theory the shutdown should be attempted by an NMI IPI after the timeout period elapsed. Though the wait loop after sending the reboot vector IPI prevents this. It checks the wait request argument and the timeout. If wait is set, which is true for sys_reboot() then it won't fall through to the NMI shutdown method after the timeout period has finished. This was an oversight when the NMI shutdown mechanism was added to handle the 'reboot IPI is not working' situation. The mechanism was added to deal with stuck panic shutdowns, which do not have the wait request set, so the 'wait request' case was probably not considered. Remove the wait check from the post reboot vector IPI wait loop and enforce that the wait loop in the NMI fallback path is invoked even if NMI IPIs are disabled or the registration of the NMI handler fails. That second wait loop will then hang if not all CPUs shutdown and the wait argument is set. [ tglx: Avoid the hard to parse line break in the NMI fallback path, add comments and massage the changelog ] Fixes: 7d007d21e539 ("x86/reboot: Use NMI to assist in shutting down if IRQ fails") Signed-off-by: Grzegorz Halat <ghalat@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Don Zickus <dzickus@redhat.com> Link: https://lkml.kernel.org/r/20190628122813.15500-1-ghalat@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/hyper-v: Fix overflow bug in fill_gva_list()Tianyu Lan2019-09-211-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4030b4c585c41eeefec7bd20ce3d0e100a0f2e4d ] When the 'start' parameter is >= 0xFF000000 on 32-bit systems, or >= 0xFFFFFFFF'FF000000 on 64-bit systems, fill_gva_list() gets into an infinite loop. With such inputs, 'cur' overflows after adding HV_TLB_FLUSH_UNIT and always compares as less than end. Memory is filled with guest virtual addresses until the system crashes. Fix this by never incrementing 'cur' to be larger than 'end'. Reported-by: Jong Hyun Park <park.jonghyun@yonsei.ac.kr> Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 2ffd9e33ce4a ("x86/hyper-v: Use hypercall for remote TLB flush") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/uaccess: Don't leak the AC flags into __get_user() argument evaluationPeter Zijlstra2019-09-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9b8bd476e78e89c9ea26c3b435ad0201c3d7dbf5 ] Identical to __put_user(); the __get_user() argument evalution will too leak UBSAN crud into the __uaccess_begin() / __uaccess_end() region. While uncommon this was observed to happen for: drivers/xen/gntdev.c: if (__get_user(old_status, batch->status[i])) where UBSAN added array bound checking. This complements commit: 6ae865615fc4 ("x86/uaccess: Dont leak the AC flag into __put_user() argument evaluation") Tested-by Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: broonie@kernel.org Cc: sfr@canb.auug.org.au Cc: akpm@linux-foundation.org Cc: Randy Dunlap <rdunlap@infradead.org> Cc: mhocko@suse.cz Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20190829082445.GM2369@hirez.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/amd/ibs: Fix sample bias for dispatched micro-opsKim Phillips2019-09-212-7/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 0f4cd769c410e2285a4e9873a684d90423f03090 ] When counting dispatched micro-ops with cnt_ctl=1, in order to prevent sample bias, IBS hardware preloads the least significant 7 bits of current count (IbsOpCurCnt) with random values, such that, after the interrupt is handled and counting resumes, the next sample taken will be slightly perturbed. The current count bitfield is in the IBS execution control h/w register, alongside the maximum count field. Currently, the IBS driver writes that register with the maximum count, leaving zeroes to fill the current count field, thereby overwriting the random bits the hardware preloaded for itself. Fix the driver to actually retain and carry those random bits from the read of the IBS control register, through to its write, instead of overwriting the lower current count bits with zeroes. Tested with: perf record -c 100001 -e ibs_op/cnt_ctl=1/pp -a -C 0 taskset -c 0 <workload> 'perf annotate' output before: 15.70 65: addsd %xmm0,%xmm1 17.30 add $0x1,%rax 15.88 cmp %rdx,%rax je 82 17.32 72: test $0x1,%al jne 7c 7.52 movapd %xmm1,%xmm0 5.90 jmp 65 8.23 7c: sqrtsd %xmm1,%xmm0 12.15 jmp 65 'perf annotate' output after: 16.63 65: addsd %xmm0,%xmm1 16.82 add $0x1,%rax 16.81 cmp %rdx,%rax je 82 16.69 72: test $0x1,%al jne 7c 8.30 movapd %xmm1,%xmm0 8.13 jmp 65 8.24 7c: sqrtsd %xmm1,%xmm0 8.39 jmp 65 Tested on Family 15h and 17h machines. Machines prior to family 10h Rev. C don't have the RDWROPCNT capability, and have the IbsOpCurCnt bitfield reserved, so this patch shouldn't affect their operation. It is unknown why commit db98c5faf8cb ("perf/x86: Implement 64-bit counter support for IBS") ignored the lower 4 bits of the IbsOpCurCnt field; the number of preloaded random bits has always been 7, AFAICT. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Arnaldo Carvalho de Melo" <acme@kernel.org> Cc: <x86@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Borislav Petkov" <bp@alien8.de> Cc: Stephane Eranian <eranian@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: "Namhyung Kim" <namhyung@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20190826195730.30614-1-kim.phillips@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel: Restrict period on NehalemJosh Hunt2019-09-211-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 44d3bbb6f5e501b873218142fe08cdf62a4ac1f3 ] We see our Nehalem machines reporting 'perfevents: irq loop stuck!' in some cases when using perf: perfevents: irq loop stuck! WARNING: CPU: 0 PID: 3485 at arch/x86/events/intel/core.c:2282 intel_pmu_handle_irq+0x37b/0x530 ... RIP: 0010:intel_pmu_handle_irq+0x37b/0x530 ... Call Trace: <NMI> ? perf_event_nmi_handler+0x2e/0x50 ? intel_pmu_save_and_restart+0x50/0x50 perf_event_nmi_handler+0x2e/0x50 nmi_handle+0x6e/0x120 default_do_nmi+0x3e/0x100 do_nmi+0x102/0x160 end_repeat_nmi+0x16/0x50 ... ? native_write_msr+0x6/0x20 ? native_write_msr+0x6/0x20 </NMI> intel_pmu_enable_event+0x1ce/0x1f0 x86_pmu_start+0x78/0xa0 x86_pmu_enable+0x252/0x310 __perf_event_task_sched_in+0x181/0x190 ? __switch_to_asm+0x41/0x70 ? __switch_to_asm+0x35/0x70 ? __switch_to_asm+0x41/0x70 ? __switch_to_asm+0x35/0x70 finish_task_switch+0x158/0x260 __schedule+0x2f6/0x840 ? hrtimer_start_range_ns+0x153/0x210 schedule+0x32/0x80 schedule_hrtimeout_range_clock+0x8a/0x100 ? hrtimer_init+0x120/0x120 ep_poll+0x2f7/0x3a0 ? wake_up_q+0x60/0x60 do_epoll_wait+0xa9/0xc0 __x64_sys_epoll_wait+0x1a/0x20 do_syscall_64+0x4e/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fdeb1e96c03 ... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: acme@kernel.org Cc: Josh Hunt <johunt@akamai.com> Cc: bpuranda@akamai.com Cc: mingo@redhat.com Cc: jolsa@redhat.com Cc: tglx@linutronix.de Cc: namhyung@kernel.org Cc: alexander.shishkin@linux.intel.com Link: https://lkml.kernel.org/r/1566256411-18820-1-git-send-email-johunt@akamai.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/apic: Fix arch_dynirq_lower_bound() bug for DT enabled machinesThomas Gleixner2019-09-211-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3e5bedc2c258341702ddffbd7688c5e6eb01eafa ] Rahul Tanwar reported the following bug on DT systems: > 'ioapic_dynirq_base' contains the virtual IRQ base number. Presently, it is > updated to the end of hardware IRQ numbers but this is done only when IOAPIC > configuration type is IOAPIC_DOMAIN_LEGACY or IOAPIC_DOMAIN_STRICT. There is > a third type IOAPIC_DOMAIN_DYNAMIC which applies when IOAPIC configuration > comes from devicetree. > > See dtb_add_ioapic() in arch/x86/kernel/devicetree.c > > In case of IOAPIC_DOMAIN_DYNAMIC (DT/OF based system), 'ioapic_dynirq_base' > remains to zero initialized value. This means that for OF based systems, > virtual IRQ base will get set to zero. Such systems will very likely not even boot. For DT enabled machines ioapic_dynirq_base is irrelevant and not updated, so simply map the IRQ base 1:1 instead. Reported-by: Rahul Tanwar <rahul.tanwar@linux.intel.com> Tested-by: Rahul Tanwar <rahul.tanwar@linux.intel.com> Tested-by: Andy Shevchenko <andriy.shevchenko@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: alan@linux.intel.com Cc: bp@alien8.de Cc: cheol.yong.kim@intel.com Cc: qi-ming.wu@intel.com Cc: rahul.tanwar@intel.com Cc: rppt@linux.ibm.com Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/20190821081330.1187-1-rahul.tanwar@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/build: Add -Wnoaddress-of-packed-member to REALMODE_CFLAGS, to silence ↵Linus Torvalds2019-09-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GCC9 build warning commit 42e0e95474fc6076b5cd68cab8fa0340a1797a72 upstream. One of the very few warnings I have in the current build comes from arch/x86/boot/edd.c, where I get the following with a gcc9 build: arch/x86/boot/edd.c: In function ‘query_edd’: arch/x86/boot/edd.c:148:11: warning: taking address of packed member of ‘struct boot_params’ may result in an unaligned pointer value [-Waddress-of-packed-member] 148 | mbrptr = boot_params.edd_mbr_sig_buffer; | ^~~~~~~~~~~ This warning triggers because we throw away all the CFLAGS and then make a new set for REALMODE_CFLAGS, so the -Wno-address-of-packed-member we added in the following commit is not present: 6f303d60534c ("gcc-9: silence 'address-of-packed-member' warning") The simplest solution for now is to adjust the warning for this version of CFLAGS as well, but it would definitely make sense to examine whether REALMODE_CFLAGS could be derived from CFLAGS, so that it picks up changes in the compiler flags environment automatically. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: nVMX: handle page fault in vmreadPaolo Bonzini2019-09-191-2/+5
| | | | | | | | | | | | commit f7eea636c3d505fe6f1d1066234f1aaf7171b681 upstream. The implementation of vmread to memory is still incomplete, as it lacks the ability to do vmread to I/O memory just like vmptrst. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: work around leak of uninitialized stack contentsFuqian Huang2019-09-191-0/+7
| | | | | | | | | | | | | | | | | | | | commit 541ab2aeb28251bf7135c7961f3a6080eebcc705 upstream. Emulation of VMPTRST can incorrectly inject a page fault when passed an operand that points to an MMIO address. The page fault will use uninitialized kernel stack memory as the CR2 and error code. The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR exit to userspace; however, it is not an easy fix, so for now just ensure that the error code and CR2 are zero. Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Cc: stable@vger.kernel.org [add comment] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/boot: Preserve boot_params.secure_boot from sanitizingJohn S. Gruber2019-09-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 29d9a0b50736768f042752070e5cdf4e4d4c00df upstream. Commit a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else") now zeroes the secure boot setting information (enabled/disabled/...) passed by the boot loader or by the kernel's EFI handover mechanism. The problem manifests itself with signed kernels using the EFI handoff protocol with grub and the kernel loses the information whether secure boot is enabled in the firmware, i.e., the log message "Secure boot enabled" becomes "Secure boot could not be determined". efi_main() arch/x86/boot/compressed/eboot.c sets this field early but it is subsequently zeroed by the above referenced commit. Include boot_params.secure_boot in the preserve field list. [ bp: restructure commit message and massage. ] Fixes: a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else") Signed-off-by: John S. Gruber <JohnSGruber@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Mark Brown <broonie@kernel.org> Cc: stable <stable@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/CAPotdmSPExAuQcy9iAHqX3js_fc4mMLQOTr5RBGvizyCOPcTQQ@mail.gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "x86/apic: Include the LDR when clearing out APIC registers"Linus Torvalds2019-09-101-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 950b07c14e8c59444e2359f15fd70ed5112e11a0 ] This reverts commit 558682b5291937a70748d36fd9ba757fb25b99ae. Chris Wilson reports that it breaks his CPU hotplug test scripts. In particular, it breaks offlining and then re-onlining the boot CPU, which we treat specially (and the BIOS does too). The symptoms are that we can offline the CPU, but it then does not come back online again: smpboot: CPU 0 is now offline smpboot: Booting Node 0 Processor 0 APIC 0x0 smpboot: do_boot_cpu failed(-1) to wakeup CPU#0 Thomas says he knows why it's broken (my personal suspicion: our magic handling of the "cpu0_logical_apicid" thing), but for 5.3 the right fix is to just revert it, since we've never touched the LDR bits before, and it's not worth the risk to do anything else at this stage. [ Hotpluging of the boot CPU is special anyway, and should be off by default. See the "BOOTPARAM_HOTPLUG_CPU0" config option and the cpu0_hotplug kernel parameter. In general you should not do it, and it has various known limitations (hibernate and suspend require the boot CPU, for example). But it should work, even if the boot CPU is special and needs careful treatment - Linus ] Link: https://lore.kernel.org/lkml/156785100521.13300.14461504732265570003@skylake-alporthouse-com/ Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Bandan Das <bsd@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/ptrace: fix up botched merge of spectrev1 fixGreg Kroah-Hartman2019-09-061-2/+1
| | | | | | | | | | | | | | | | | I incorrectly merged commit 31a2fbb390fe ("x86/ptrace: Fix possible spectre-v1 in ptrace_get_debugreg()") when backporting it, as was graciously pointed out at https://grsecurity.net/teardown_of_a_failed_linux_lts_spectre_fix.php Resolve the upstream difference with the stable kernel merge to properly protect things. Reported-by: Brad Spengler <spender@grsecurity.net> Cc: Dianzhang Chen <dianzhangchen0@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <bp@alien8.de> Cc: <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* uprobes/x86: Fix detection of 32-bit user modeSebastian Mayr2019-09-061-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9212ec7d8357ea630031e89d0d399c761421c83b ] 32-bit processes running on a 64-bit kernel are not always detected correctly, causing the process to crash when uretprobes are installed. The reason for the crash is that in_ia32_syscall() is used to determine the process's mode, which only works correctly when called from a syscall. In the case of uretprobes, however, the function is called from a exception and always returns 'false' on a 64-bit kernel. In consequence this leads to corruption of the process's return address. Fix this by using user_64bit_mode() instead of in_ia32_syscall(), which is correct in any situation. [ tglx: Add a comment and the following historical info ] This should have been detected by the rename which happened in commit abfb9498ee13 ("x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()") which states in the changelog: The is_ia32_task()/is_x32_task() function names are a big misnomer: they suggests that the compat-ness of a system call is a task property, which is not true, the compatness of a system call purely depends on how it was invoked through the system call layer. ..... and then it went and blindly renamed every call site. Sadly enough this was already mentioned here: 8faaed1b9f50 ("uprobes/x86: Introduce sizeof_long(), cleanup adjust_ret_addr() and arch_uretprobe_hijack_return_addr()") where the changelog says: TODO: is_ia32_task() is not what we actually want, TS_COMPAT does not necessarily mean 32bit. Fortunately syscall-like insns can't be probed so it actually works, but it would be better to rename and use is_ia32_frame(). and goes all the way back to: 0326f5a94dde ("uprobes/core: Handle breakpoint and singlestep exceptions") Oh well. 7+ years until someone actually tried a uretprobe on a 32bit process on a 64bit kernel.... Fixes: 0326f5a94dde ("uprobes/core: Handle breakpoint and singlestep exceptions") Signed-off-by: Sebastian Mayr <me@sam.st> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190728152617.7308-1-me@sam.st Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/apic: Include the LDR when clearing out APIC registersBandan Das2019-09-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 558682b5291937a70748d36fd9ba757fb25b99ae upstream. Although APIC initialization will typically clear out the LDR before setting it, the APIC cleanup code should reset the LDR. This was discovered with a 32-bit KVM guest jumping into a kdump kernel. The stale bits in the LDR triggered a bug in the KVM APIC implementation which caused the destination mapping for VCPUs to be corrupted. Note that this isn't intended to paper over the KVM APIC bug. The kernel has to clear the LDR when resetting the APIC registers except when X2APIC is enabled. This lacks a Fixes tag because missing to clear LDR goes way back into pre git history. [ tglx: Made x2apic_enabled a function call as required ] Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190826101513.5080-3-bsd@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/apic: Do not initialize LDR and DFR for bigsmpBandan Das2019-09-061-22/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit bae3a8d3308ee69a7dbdf145911b18dfda8ade0d upstream. Legacy apic init uses bigsmp for smp systems with 8 and more CPUs. The bigsmp APIC implementation uses physical destination mode, but it nevertheless initializes LDR and DFR. The LDR even ends up incorrectly with multiple bit being set. This does not cause a functional problem because LDR and DFR are ignored when physical destination mode is active, but it triggered a problem on a 32-bit KVM guest which jumps into a kdump kernel. The multiple bits set unearthed a bug in the KVM APIC implementation. The code which creates the logical destination map for VCPUs ignores the disabled state of the APIC and ends up overwriting an existing valid entry and as a result, APIC calibration hangs in the guest during kdump initialization. Remove the bogus LDR/DFR initialization. This is not intended to work around the KVM APIC bug. The LDR/DFR ininitalization is wrong on its own. The issue goes back into the pre git history. The fixes tag is the commit in the bitkeeper import which introduced bigsmp support in 2003. git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Fixes: db7b9e9f26b8 ("[PATCH] Clustered APIC setup for >8 CPU systems") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190826101513.5080-2-bsd@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: Don't update RIP or do single-step on faulting emulationSean Christopherson2019-09-061-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 75ee23b30dc712d80d2421a9a547e7ab6e379b44 upstream. Don't advance RIP or inject a single-step #DB if emulation signals a fault. This logic applies to all state updates that are conditional on clean retirement of the emulation instruction, e.g. updating RFLAGS was previously handled by commit 38827dbd3fb85 ("KVM: x86: Do not update EFLAGS on faulting emulation"). Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with ctxt->_eip until emulation "retires" anyways. Skipping #DB injection fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to invalid state with EFLAGS.TF=1 would loop indefinitely due to emulation overwriting the #UD with #DB and thus restarting the bad SYSCALL over and over. Cc: Nadav Amit <nadav.amit@gmail.com> Cc: stable@vger.kernel.org Reported-by: Andy Lutomirski <luto@kernel.org> Fixes: 663f4c61b803 ("KVM: x86: handle singlestep during emulation") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: x86: skip populating logical dest map if apic is not sw enabledRadim Krcmar2019-09-061-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | commit b14c876b994f208b6b95c222056e1deb0a45de0e upstream. recalculate_apic_map does not santize ldr and it's possible that multiple bits are set. In that case, a previous valid entry can potentially be overwritten by an invalid one. This condition is hit when booting a 32 bit, >8 CPU, RHEL6 guest and then triggering a crash to boot a kdump kernel. This is the sequence of events: 1. Linux boots in bigsmp mode and enables PhysFlat, however, it still writes to the LDR which probably will never be used. 2. However, when booting into kdump, the stale LDR values remain as they are not cleared by the guest and there isn't a apic reset. 3. kdump boots with 1 cpu, and uses Logical Destination Mode but the logical map has been overwritten and points to an inactive vcpu. Signed-off-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/boot: Fix boot regression caused by bootparam sanitizingJohn Hubbard2019-08-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | commit 7846f58fba964af7cb8cf77d4d13c33254725211 upstream. commit a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else") had two errors: * It preserved boot_params.acpi_rsdp_addr, and * It failed to preserve boot_params.hdr Therefore, zero out acpi_rsdp_addr, and preserve hdr. Fixes: a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else") Reported-by: Neil MacLeod <neil@nmacleod.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neil MacLeod <neil@nmacleod.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190821192513.20126-1-jhubbard@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/boot: Save fields explicitly, zero out everything elseJohn Hubbard2019-08-291-13/+47
| | | | | | | | | | | | | | | | | | | | | | commit a90118c445cc7f07781de26a9684d4ec58bfcfd1 upstream. Recent gcc compilers (gcc 9.1) generate warnings about an out of bounds memset, if the memset goes accross several fields of a struct. This generated a couple of warnings on x86_64 builds in sanitize_boot_params(). Fix this by explicitly saving the fields in struct boot_params that are intended to be preserved, and zeroing all the rest. [ tglx: Tagged for stable as it breaks the warning free build there as well ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190731054627.5627-2-jhubbard@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16hTom Lendacky2019-08-293-13/+140
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24 upstream. There have been reports of RDRAND issues after resuming from suspend on some AMD family 15h and family 16h systems. This issue stems from a BIOS not performing the proper steps during resume to ensure RDRAND continues to function properly. RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND support using CPUID, including the kernel, will believe that RDRAND is not supported. Update the CPU initialization to clear the RDRAND CPUID bit for any family 15h and 16h processor that supports RDRAND. If it is known that the family 15h or family 16h system does not have an RDRAND resume issue or that the system will not be placed in suspend, the "rdrand=force" kernel parameter can be used to stop the clearing of the RDRAND CPUID bit. Additionally, update the suspend and resume path to save and restore the MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in place after resuming from suspend. Note, that clearing the RDRAND CPUID bit does not prevent a processor that normally supports the RDRAND instruction from executing it. So any code that determined the support based on family and model won't #UD. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chen Yu <yu.c.chen@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org> Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: <stable@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/apic: Handle missing global clockevent gracefullyThomas Gleixner2019-08-291-15/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f897e60a12f0b9146357780d317879bce2a877dc upstream. Some newer machines do not advertise legacy timers. The kernel can handle that situation if the TSC and the CPU frequency are enumerated by CPUID or MSRs and the CPU supports TSC deadline timer. If the CPU does not support TSC deadline timer the local APIC timer frequency has to be known as well. Some Ryzens machines do not advertize legacy timers, but there is no reliable way to determine the bus frequency which feeds the local APIC timer when the machine allows overclocking of that frequency. As there is no legacy timer the local APIC timer calibration crashes due to a NULL pointer dereference when accessing the not installed global clock event device. Switch the calibration loop to a non interrupt based one, which polls either TSC (if frequency is known) or jiffies. The latter requires a global clockevent. As the machines which do not have a global clockevent installed have a known TSC frequency this is a non issue. For older machines where TSC frequency is not known, there is no known case where the legacy timers do not exist as that would have been reported long ago. Reported-by: Daniel Drake <drake@endlessm.com> Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Daniel Drake <drake@endlessm.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908091443030.21433@nanos.tec.linutronix.de Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1142926#c12 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386Sean Christopherson2019-08-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b63f20a778c88b6a04458ed6ffc69da953d3a109 upstream. Use 'lea' instead of 'add' when adjusting %rsp in CALL_NOSPEC so as to avoid clobbering flags. KVM's emulator makes indirect calls into a jump table of sorts, where the destination of the CALL_NOSPEC is a small blob of code that performs fast emulation by executing the target instruction with fixed operands. adcb_al_dl: 0x000339f8 <+0>: adc %dl,%al 0x000339fa <+2>: ret A major motiviation for doing fast emulation is to leverage the CPU to handle consumption and manipulation of arithmetic flags, i.e. RFLAGS is both an input and output to the target of CALL_NOSPEC. Clobbering flags results in all sorts of incorrect emulation, e.g. Jcc instructions often take the wrong path. Sans the nops... asm("push %[flags]; popf; " CALL_NOSPEC " ; pushf; pop %[flags]\n" 0x0003595a <+58>: mov 0xc0(%ebx),%eax 0x00035960 <+64>: mov 0x60(%ebx),%edx 0x00035963 <+67>: mov 0x90(%ebx),%ecx 0x00035969 <+73>: push %edi 0x0003596a <+74>: popf 0x0003596b <+75>: call *%esi 0x000359a0 <+128>: pushf 0x000359a1 <+129>: pop %edi 0x000359a2 <+130>: mov %eax,0xc0(%ebx) 0x000359b1 <+145>: mov %edx,0x60(%ebx) ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK); 0x000359a8 <+136>: mov -0x10(%ebp),%eax 0x000359ab <+139>: and $0x8d5,%edi 0x000359b4 <+148>: and $0xfffff72a,%eax 0x000359b9 <+153>: or %eax,%edi 0x000359bd <+157>: mov %edi,0x4(%ebx) For the most part this has gone unnoticed as emulation of guest code that can trigger fast emulation is effectively limited to MMIO when running on modern hardware, and MMIO is rarely, if ever, accessed by instructions that affect or consume flags. Breakage is almost instantaneous when running with unrestricted guest disabled, in which case KVM must emulate all instructions when the guest has invalid state, e.g. when the guest is in Big Real Mode during early BIOS. Fixes: 776b043848fd2 ("x86/retpoline: Add initial retpoline support") Fixes: 1a29b5b7f347a ("KVM: x86: Make indirect calls in emulator speculation safe") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190822211122.27579-1-sean.j.christopherson@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/lib/cpu: Address missing prototypes warningValdis Klētnieks2019-08-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 04f5bda84b0712d6f172556a7e8dca9ded5e73b9 ] When building with W=1, warnings about missing prototypes are emitted: CC arch/x86/lib/cpu.o arch/x86/lib/cpu.c:5:14: warning: no previous prototype for 'x86_family' [-Wmissing-prototypes] 5 | unsigned int x86_family(unsigned int sig) | ^~~~~~~~~~ arch/x86/lib/cpu.c:18:14: warning: no previous prototype for 'x86_model' [-Wmissing-prototypes] 18 | unsigned int x86_model(unsigned int sig) | ^~~~~~~~~ arch/x86/lib/cpu.c:33:14: warning: no previous prototype for 'x86_stepping' [-Wmissing-prototypes] 33 | unsigned int x86_stepping(unsigned int sig) | ^~~~~~~~~~~~ Add the proper include file so the prototypes are there. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/42513.1565234837@turing-police Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/mm: Use WRITE_ONCE() when setting PTEsNadav Amit2019-08-252-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9bc4f28af75a91aea0ae383f50b0a430c4509303 upstream. When page-table entries are set, the compiler might optimize their assignment by using multiple instructions to set the PTE. This might turn into a security hazard if the user somehow manages to use the interim PTE. L1TF does not make our lives easier, making even an interim non-present PTE a security hazard. Using WRITE_ONCE() to set PTEs and friends should prevent this potential security hazard. I skimmed the differences in the binary with and without this patch. The differences are (obviously) greater when CONFIG_PARAVIRT=n as more code optimizations are possible. For better and worse, the impact on the binary with this patch is pretty small. Skimming the code did not cause anything to jump out as a security hazard, but it seems that at least move_soft_dirty_pte() caused set_pte_at() to use multiple writes. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20180902181451.80520-1-namit@vmware.com [bwh: Backported to 4.14: - Drop changes in pmdp_establish() - 5-level paging is a compile-time option - Update both cases in native_set_pgd() - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bpf: get rid of pure_initcall dependency to enable jitsDaniel Borkmann2019-08-251-2/+0
| | | | | | | | | | | | | | | | | | | | | | commit fa9dd599b4dae841924b022768354cfde9affecb upstream. Having a pure_initcall() callback just to permanently enable BPF JITs under CONFIG_BPF_JIT_ALWAYS_ON is unnecessary and could leave a small race window in future where JIT is still disabled on boot. Since we know about the setting at compilation time anyway, just initialize it properly there. Also consolidate all the individual bpf_jit_enable variables into a single one and move them under one location. Moreover, don't allow for setting unspecified garbage values on them. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> [bwh: Backported to 4.14 as dependency of commit 2e4a30983b0f "bpf: restrict access to core bpf sysctls": - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: Fix leak vCPU's VMCS value into other pCPUWanpeng Li2019-08-164-0/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 17e433b54393a6269acbcb792da97791fe1592d8 upstream. After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting in the VMs after stress testing: INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073) Call Trace: flush_tlb_mm_range+0x68/0x140 tlb_flush_mmu.part.75+0x37/0xe0 tlb_finish_mmu+0x55/0x60 zap_page_range+0x142/0x190 SyS_madvise+0x3cd/0x9c0 system_call_fastpath+0x1c/0x21 swait_active() sustains to be true before finish_swait() is called in kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin() loop greatly increases the probability condition kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv is enabled the yield-candidate vCPU's VMCS RVI field leaks(by vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current VMCS. This patch fixes it by checking conservatively a subset of events. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Marc Zyngier <Marc.Zyngier@arm.com> Cc: stable@vger.kernel.org Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/mm: Sync also unmappings in vmalloc_sync_all()Joerg Roedel2019-08-161-8/+5
| | | | | | | | | | | | | | | | | | | | commit 8e998fc24de47c55b47a887f6c95ab91acd4a720 upstream. With huge-page ioremap areas the unmappings also need to be synced between all page-tables. Otherwise it can cause data corruption when a region is unmapped and later re-used. Make the vmalloc_sync_one() function ready to sync unmappings and make sure vmalloc_sync_all() iterates over all page-tables even when an unmapped PMD is found. Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20190719184652.11391-3-joro@8bytes.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/mm: Check for pfn instead of page in vmalloc_sync_one()Joerg Roedel2019-08-161-1/+1
| | | | | | | | | | | | | | | | commit 51b75b5b563a2637f9d8dc5bd02a31b2ff9e5ea0 upstream. Do not require a struct page for the mapped memory location because it might not exist. This can happen when an ioremapped region is mapped with 2MB pages. Fixes: 5d72b4fba40ef ('x86, mm: support huge I/O mapping capability I/F') Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20190719184652.11391-2-joro@8bytes.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGSThomas Gleixner2019-08-063-29/+32
| | | | | | | | | | | | | | | | | | | | | | | | commit f36cf386e3fec258a341d446915862eded3e13d8 upstream Intel provided the following information: On all current Atom processors, instructions that use a segment register value (e.g. a load or store) will not speculatively execute before the last writer of that segment retires. Thus they will not use a speculatively written segment value. That means on ATOMs there is no speculation through SWAPGS, so the SWAPGS entry paths can be excluded from the extra LFENCE if PTI is disabled. Create a separate bug flag for the through SWAPGS speculation and mark all out-of-order ATOMs and AMD/HYGON CPUs as not affected. The in-order ATOMs are excluded from the whole mitigation mess anyway. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/speculation: Enable Spectre v1 swapgs mitigationsJosh Poimboeuf2019-08-061-9/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a2059825986a1c8143fd6698774fa9d83733bb11 upstream The previous commit added macro calls in the entry code which mitigate the Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are enabled. Enable those features where applicable. The mitigations may be disabled with "nospectre_v1" or "mitigations=off". There are different features which can affect the risk of attack: - When FSGSBASE is enabled, unprivileged users are able to place any value in GS, using the wrgsbase instruction. This means they can write a GS value which points to any value in kernel space, which can be useful with the following gadget in an interrupt/exception/NMI handler: if (coming from user space) swapgs mov %gs:<percpu_offset>, %reg1 // dependent load or store based on the value of %reg // for example: mov %(reg1), %reg2 If an interrupt is coming from user space, and the entry code speculatively skips the swapgs (due to user branch mistraining), it may speculatively execute the GS-based load and a subsequent dependent load or store, exposing the kernel data to an L1 side channel leak. Note that, on Intel, a similar attack exists in the above gadget when coming from kernel space, if the swapgs gets speculatively executed to switch back to the user GS. On AMD, this variant isn't possible because swapgs is serializing with respect to future GS-based accesses. NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case doesn't exist quite yet. - When FSGSBASE is disabled, the issue is mitigated somewhat because unprivileged users must use prctl(ARCH_SET_GS) to set GS, which restricts GS values to user space addresses only. That means the gadget would need an additional step, since the target kernel address needs to be read from user space first. Something like: if (coming from user space) swapgs mov %gs:<percpu_offset>, %reg1 mov (%reg1), %reg2 // dependent load or store based on the value of %reg2 // for example: mov %(reg2), %reg3 It's difficult to audit for this gadget in all the handlers, so while there are no known instances of it, it's entirely possible that it exists somewhere (or could be introduced in the future). Without tooling to analyze all such code paths, consider it vulnerable. Effects of SMAP on the !FSGSBASE case: - If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not susceptible to Meltdown), the kernel is prevented from speculatively reading user space memory, even L1 cached values. This effectively disables the !FSGSBASE attack vector. - If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP still prevents the kernel from speculatively reading user space memory. But it does *not* prevent the kernel from reading the user value from L1, if it has already been cached. This is probably only a small hurdle for an attacker to overcome. Thanks to Dave Hansen for contributing the speculative_smap() function. Thanks to Andrew Cooper for providing the inside scoop on whether swapgs is serializing on AMD. [ tglx: Fixed the USER fence decision and polished the comment as suggested by Dave Hansen ] Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/speculation: Prepare entry code for Spectre v1 swapgs mitigationsJosh Poimboeuf2019-08-063-2/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 18ec54fdd6d18d92025af097cd042a75cf0ea24c upstream Spectre v1 isn't only about array bounds checks. It can affect any conditional checks. The kernel entry code interrupt, exception, and NMI handlers all have conditional swapgs checks. Those may be problematic in the context of Spectre v1, as kernel code can speculatively run with a user GS. For example: if (coming from user space) swapgs mov %gs:<percpu_offset>, %reg mov (%reg), %reg1 When coming from user space, the CPU can speculatively skip the swapgs, and then do a speculative percpu load using the user GS value. So the user can speculatively force a read of any kernel value. If a gadget exists which uses the percpu value as an address in another load/store, then the contents of the kernel value may become visible via an L1 side channel attack. A similar attack exists when coming from kernel space. The CPU can speculatively do the swapgs, causing the user GS to get used for the rest of the speculative window. The mitigation is similar to a traditional Spectre v1 mitigation, except: a) index masking isn't possible; because the index (percpu offset) isn't user-controlled; and b) an lfence is needed in both the "from user" swapgs path and the "from kernel" non-swapgs path (because of the two attacks described above). The user entry swapgs paths already have SWITCH_TO_KERNEL_CR3, which has a CR3 write when PTI is enabled. Since CR3 writes are serializing, the lfences can be skipped in those cases. On the other hand, the kernel entry swapgs paths don't depend on PTI. To avoid unnecessary lfences for the user entry case, create two separate features for alternative patching: X86_FEATURE_FENCE_SWAPGS_USER X86_FEATURE_FENCE_SWAPGS_KERNEL Use these features in entry code to patch in lfences where needed. The features aren't enabled yet, so there's no functional change. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/cpufeatures: Combine word 11 and 12 into a new scattered features wordFenghua Yu2019-08-066-34/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit acec0ce081de0c36459eea91647faf99296445a3 upstream It's a waste for the four X86_FEATURE_CQM_* feature bits to occupy two whole feature bits words. To better utilize feature words, re-define word 11 to host scattered features and move the four X86_FEATURE_CQM_* features into Linux defined word 11. More scattered features can be added in word 11 in the future. Rename leaf 11 in cpuid_leafs to CPUID_LNX_4 to reflect it's a Linux-defined leaf. Rename leaf 12 as CPUID_DUMMY which will be replaced by a meaningful name in the next patch when CPUID.7.1:EAX occupies world 12. Maximum number of RMID and cache occupancy scale are retrieved from CPUID.0xf.1 after scattered CQM features are enumerated. Carve out the code into a separate function. KVM doesn't support resctrl now. So it's safe to move the X86_FEATURE_CQM_* features to scattered features word 11 for KVM. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Aaron Lewis <aaronlewis@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Babu Moger <babu.moger@amd.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: "Sean J Christopherson" <sean.j.christopherson@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Ravi V Shankar <ravi.v.shankar@intel.com> Cc: Sherry Hurwitz <sherry.hurwitz@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: x86 <x86@kernel.org> Link: https://lkml.kernel.org/r/1560794416-217638-2-git-send-email-fenghua.yu@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/cpufeatures: Carve out CQM features retrievalBorislav Petkov2019-08-061-27/+33
| | | | | | | | | | | | | | | | commit 45fc56e629caa451467e7664fbd4c797c434a6c4 upstream ... into a separate function for better readability. Split out from a patch from Fenghua Yu <fenghua.yu@intel.com> to keep the mechanical, sole code movement separate for easy review. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: x86@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>