summaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel/nmi.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2023-02-251-4/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Paolo Bonzini: "ARM: - Provide a virtual cache topology to the guest to avoid inconsistencies with migration on heterogenous systems. Non secure software has no practical need to traverse the caches by set/way in the first place - Add support for taking stage-2 access faults in parallel. This was an accidental omission in the original parallel faults implementation, but should provide a marginal improvement to machines w/o FEAT_HAFDBS (such as hardware from the fruit company) - A preamble to adding support for nested virtualization to KVM, including vEL2 register state, rudimentary nested exception handling and masking unsupported features for nested guests - Fixes to the PSCI relay that avoid an unexpected host SVE trap when resuming a CPU when running pKVM - VGIC maintenance interrupt support for the AIC - Improvements to the arch timer emulation, primarily aimed at reducing the trap overhead of running nested - Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the interest of CI systems - Avoid VM-wide stop-the-world operations when a vCPU accesses its own redistributor - Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions in the host - Aesthetic and comment/kerneldoc fixes - Drop the vestiges of the old Columbia mailing list and add [Oliver] as co-maintainer RISC-V: - Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE - Correctly place the guest in S-mode after redirecting a trap to the guest - Redirect illegal instruction traps to guest - SBI PMU support for guest s390: - Sort out confusion between virtual and physical addresses, which currently are the same on s390 - A new ioctl that performs cmpxchg on guest memory - A few fixes x86: - Change tdp_mmu to a read-only parameter - Separate TDP and shadow MMU page fault paths - Enable Hyper-V invariant TSC control - Fix a variety of APICv and AVIC bugs, some of them real-world, some of them affecting architecurally legal but unlikely to happen in practice - Mark APIC timer as expired if its in one-shot mode and the count underflows while the vCPU task was being migrated - Advertise support for Intel's new fast REP string features - Fix a double-shootdown issue in the emergency reboot code - Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give SVM similar treatment to VMX - Update Xen's TSC info CPUID sub-leaves as appropriate - Add support for Hyper-V's extended hypercalls, where "support" at this point is just forwarding the hypercalls to userspace - Clean up the kvm->lock vs. kvm->srcu sequences when updating the PMU and MSR filters - One-off fixes and cleanups - Fix and cleanup the range-based TLB flushing code, used when KVM is running on Hyper-V - Add support for filtering PMU events using a mask. If userspace wants to restrict heavily what events the guest can use, it can now do so without needing an absurd number of filter entries - Clean up KVM's handling of "PMU MSRs to save", especially when vPMU support is disabled - Add PEBS support for Intel Sapphire Rapids - Fix a mostly benign overflow bug in SEV's send|receive_update_data() - Move several SVM-specific flags into vcpu_svm x86 Intel: - Handle NMI VM-Exits before leaving the noinstr region - A few trivial cleanups in the VM-Enter flows - Stop enabling VMFUNC for L1 purely to document that KVM doesn't support EPTP switching (or any other VM function) for L1 - Fix a crash when using eVMCS's enlighted MSR bitmaps Generic: - Clean up the hardware enable and initialization flow, which was scattered around multiple arch-specific hooks. Instead, just let the arch code call into generic code. Both x86 and ARM should benefit from not having to fight common KVM code's notion of how to do initialization - Account allocations in generic kvm_arch_alloc_vm() - Fix a memory leak if coalesced MMIO unregistration fails selftests: - On x86, cache the CPU vendor (AMD vs. Intel) and use the info to emit the correct hypercall instruction instead of relying on KVM to patch in VMMCALL - Use TAP interface for kvm_binary_stats_test and tsc_msrs_test" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (325 commits) KVM: SVM: hyper-v: placate modpost section mismatch error KVM: x86/mmu: Make tdp_mmu_allowed static KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes KVM: arm64: nv: Filter out unsupported features from ID regs KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2 KVM: arm64: nv: Allow a sysreg to be hidden from userspace only KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2 KVM: arm64: nv: Handle SMCs taken from virtual EL2 KVM: arm64: nv: Handle trapped ERET from virtual EL2 KVM: arm64: nv: Inject HVC exceptions to the virtual EL2 KVM: arm64: nv: Support virtual EL2 exceptions KVM: arm64: nv: Handle HCR_EL2.NV system register traps KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state KVM: arm64: nv: Add EL2 system registers to vcpu context KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set KVM: arm64: nv: Introduce nested virtualization VCPU feature KVM: arm64: Use the S2 MMU context to iterate over S2 table ...
| * x86/entry: KVM: Use dedicated VMX NMI entry for 32-bit kernels tooSean Christopherson2023-01-241-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a dedicated entry for invoking the NMI handler from KVM VMX's VM-Exit path for 32-bit even though using a dedicated entry for 32-bit isn't strictly necessary. Exposing a single symbol will allow KVM to reference the entry point in assembly code without having to resort to more #ifdefs (or #defines). identry.h is intended to be included from asm files only once, and so simply including idtentry.h in KVM assembly isn't an option. Bypassing the ESP fixup and CR3 switching in the standard NMI entry code is safe as KVM always handles NMIs that occur in the guest on a kernel stack, with a kernel CR3. Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221213060912.654668-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
* | x86/nmi: Print reasons why backtrace NMIs are ignoredPaul E. McKenney2023-01-191-0/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instrument nmi_trigger_cpumask_backtrace() to dump out diagnostics based on evidence accumulated by exc_nmi(). These diagnostics are dumped for CPUs that ignored an NMI backtrace request for more than 10 seconds. [ paulmck: Apply Ingo Molnar feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Reviewed-by: Ingo Molnar <mingo@kernel.org>
* | x86/nmi: Accumulate NMI-progress evidence in exc_nmi()Paul E. McKenney2023-01-191-1/+34
|/ | | | | | | | | | | | | | | | | | CPUs ignoring NMIs is often a sign of those CPUs going bad, but there are quite a few other reasons why a CPU might ignore NMIs. Therefore, accumulate evidence within exc_nmi() as to what might be preventing a given CPU from responding to an NMI. [ paulmck: Apply Peter Zijlstra feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Reviewed-by: Ingo Molnar <mingo@kernel.org>
* x86/nmi: Make register_nmi_handler() more robustThomas Gleixner2022-05-171-4/+8
| | | | | | | | | | | | | | | | register_nmi_handler() has no sanity check whether a handler has been registered already. Such an unintended double-add leads to list corruption and hard to diagnose problems during the next NMI handling. Init the list head in the static NMI action struct and check it for being empty in register_nmi_handler(). [ bp: Fixups. ] Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/lkml/20220511234332.3654455-1-seanjc@google.com
* x86/nmi: Remove the 'strange power saving mode' hint from unknown NMI handlerJiri Kosina2022-03-161-1/+0
| | | | | | | | | | | | | | | | | | | | The Do you have a strange power saving mode enabled? hint when unknown NMI happens dates back to i386 stone age, and isn't currently really helpful. Unknown NMIs are coming for many different reasons (broken firmware, faulty hardware, ...) and rarely have anything to do with 'strange power saving mode' (whatever that even is). Just remove it as it's largerly misleading. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/nycvar.YFH.7.76.2203140924120.24795@cbobk.fhfr.pm
* x86/sev-es: Rename sev-es.{ch} to sev.{ch}Brijesh Singh2021-05-101-1/+1
| | | | | | | | | | | | SEV-SNP builds upon the SEV-ES functionality while adding new hardware protection. Version 2 of the GHCB specification adds new NAE events that are SEV-SNP specific. Rename the sev-es.{ch} to sev.{ch} so that all SEV* functionality can be consolidated in one place. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20210427111636.1207-2-brijesh.singh@amd.com
* KVM/VMX: Invoke NMI non-IST entry instead of IST entryLai Jiangshan2021-05-051-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In VMX, the host NMI handler needs to be invoked after NMI VM-Exit. Before commit 1a5488ef0dcf6 ("KVM: VMX: Invoke NMI handler via indirect call instead of INTn"), this was done by INTn ("int $2"). But INTn microcode is relatively expensive, so the commit reworked NMI VM-Exit handling to invoke the kernel handler by function call. But this missed a detail. The NMI entry point for direct invocation is fetched from the IDT table and called on the kernel stack. But on 64-bit the NMI entry installed in the IDT expects to be invoked on the IST stack. It relies on the "NMI executing" variable on the IST stack to work correctly, which is at a fixed position in the IST stack. When the entry point is unexpectedly called on the kernel stack, the RSP-addressed "NMI executing" variable is obviously also on the kernel stack and is "uninitialized" and can cause the NMI entry code to run in the wrong way. Provide a non-ist entry point for VMX which shares the C-function with the regular NMI entry and invoke the new asm entry point instead. On 32-bit this just maps to the regular NMI entry point as 32-bit has no ISTs and is not affected. [ tglx: Made it independent for backporting, massaged changelog ] Fixes: 1a5488ef0dcf6 ("KVM: VMX: Invoke NMI handler via indirect call instead of INTn") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Lai Jiangshan <laijs@linux.alibaba.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87r1imi8i1.ffs@nanos.tec.linutronix.de
* x86/entry: Move nmi entry/exit into common codeThomas Gleixner2020-11-041-3/+3
| | | | | | | | | | | | | | | | | Lockdep state handling on NMI enter and exit is nothing specific to X86. It's not any different on other architectures. Also the extra state type is not necessary, irqentry_state_t can carry the necessary information as well. Move it to common code and extend irqentry_state_t to carry lockdep state. [ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive between the IRQ and NMI exceptions, and add kernel documentation for struct irqentry_state_t ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20201102205320.1458656-7-ira.weiny@intel.com
* Merge tag 'x86_seves_for_v5.10' of ↵Linus Torvalds2020-10-141-0/+15
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV-ES support from Borislav Petkov: "SEV-ES enhances the current guest memory encryption support called SEV by also encrypting the guest register state, making the registers inaccessible to the hypervisor by en-/decrypting them on world switches. Thus, it adds additional protection to Linux guests against exfiltration, control flow and rollback attacks. With SEV-ES, the guest is in full control of what registers the hypervisor can access. This is provided by a guest-host exchange mechanism based on a new exception vector called VMM Communication Exception (#VC), a new instruction called VMGEXIT and a shared Guest-Host Communication Block which is a decrypted page shared between the guest and the hypervisor. Intercepts to the hypervisor become #VC exceptions in an SEV-ES guest so in order for that exception mechanism to work, the early x86 init code needed to be made able to handle exceptions, which, in itself, brings a bunch of very nice cleanups and improvements to the early boot code like an early page fault handler, allowing for on-demand building of the identity mapping. With that, !KASLR configurations do not use the EFI page table anymore but switch to a kernel-controlled one. The main part of this series adds the support for that new exchange mechanism. The goal has been to keep this as much as possibly separate from the core x86 code by concentrating the machinery in two SEV-ES-specific files: arch/x86/kernel/sev-es-shared.c arch/x86/kernel/sev-es.c Other interaction with core x86 code has been kept at minimum and behind static keys to minimize the performance impact on !SEV-ES setups. Work by Joerg Roedel and Thomas Lendacky and others" * tag 'x86_seves_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits) x86/sev-es: Use GHCB accessor for setting the MMIO scratch buffer x86/sev-es: Check required CPU features for SEV-ES x86/efi: Add GHCB mappings when SEV-ES is active x86/sev-es: Handle NMI State x86/sev-es: Support CPU offline/online x86/head/64: Don't call verify_cpu() on starting APs x86/smpboot: Load TSS and getcpu GDT entry before loading IDT x86/realmode: Setup AP jump table x86/realmode: Add SEV-ES specific trampoline entry point x86/vmware: Add VMware-specific handling for VMMCALL under SEV-ES x86/kvm: Add KVM-specific VMMCALL handling under SEV-ES x86/paravirt: Allow hypervisor-specific VMMCALL handling under SEV-ES x86/sev-es: Handle #DB Events x86/sev-es: Handle #AC Events x86/sev-es: Handle VMMCALL Events x86/sev-es: Handle MWAIT/MWAITX Events x86/sev-es: Handle MONITOR/MONITORX Events x86/sev-es: Handle INVD Events x86/sev-es: Handle RDPMC Events x86/sev-es: Handle RDTSC(P) Events ...
| * x86/sev-es: Handle NMI StateJoerg Roedel2020-09-091-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running under SEV-ES, the kernel has to tell the hypervisor when to open the NMI window again after an NMI was injected. This is done with an NMI-complete message to the hypervisor. Add code to the kernel's NMI handler to send this message right at the beginning of do_nmi(). This always allows nesting NMIs. [ bp: Mark __sev_es_nmi_complete() noinstr: vmlinux.o: warning: objtool: exc_nmi()+0x17: call to __sev_es_nmi_complete() leaves .noinstr.text section While at it, use __pa_nodebug() for the same reason due to CONFIG_DEBUG_VIRTUAL=y: vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0xd9: call to __phys_addr() leaves .noinstr.text section ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200907131613.12703-71-joro@8bytes.org
| * x86/sev-es: Adjust #VC IST Stack on entering NMI handlerJoerg Roedel2020-09-091-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an NMI hits in the #VC handler entry code before it has switched to another stack, any subsequent #VC exception in the NMI code-path will overwrite the interrupted #VC handler's stack. Make sure this doesn't happen by explicitly adjusting the #VC IST entry in the NMI handler for the time it can cause #VC exceptions. [ bp: Touchups, spelling fixes. ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200907131613.12703-44-joro@8bytes.org
* | x86/nmi: Fix nmi_handle() duration miscalculationLibing Zhou2020-10-011-3/+2
|/ | | | | | | | | | | | | | | | When nmi_check_duration() is checking the time an NMI handler took to execute, the whole_msecs value used should be read from the @duration argument, not from the ->max_duration, the latter being used to store the current maximal duration. [ bp: Rewrite commit message. ] Fixes: 248ed51048c4 ("x86/nmi: Remove irq_work from the long duration NMI handler") Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Changbin Du <changbin.du@gmail.com> Link: https://lkml.kernel.org/r/20200820025641.44075-1-libing.zhou@nokia-sbell.com
* x86/entry: Fix NMI vs IRQ state trackingPeter Zijlstra2020-07-101-5/+4
| | | | | | | | | | | | | While the nmi_enter() users did trace_hardirqs_{off_prepare,on_finish}() there was no matching lockdep_hardirqs_*() calls to complete the picture. Introduce idtentry_{enter,exit}_nmi() to enable proper IRQ state tracking across the NMIs. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200623083721.216740948@infradead.org
* x86/entry, cpumask: Provide non-instrumented variant of cpu_is_offline()Peter Zijlstra2020-06-151-1/+1
| | | | | | | | | | | | vmlinux.o: warning: objtool: exc_nmi()+0x12: call to cpumask_test_cpu.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: mce_check_crashing_cpu()+0x12: call to cpumask_test_cpu.constprop.0()leaves .noinstr.text section cpumask_test_cpu() test_bit() instrument_atomic_read() arch_test_bit() Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* x86/entry: Make NMI use IDTENTRY_RAWThomas Gleixner2020-06-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | For no reason other than beginning brainmelt, IDTENTRY_NMI was mapped to IDTENTRY_IST. This is not a problem on 64bit because the IST default entry point maps to IDTENTRY_RAW which does not any entry handling. The surplus function declaration for the noist C entry point is unused and as there is no ASM code emitted for NMI this went unnoticed. On 32bit IDTENTRY_IST maps to a regular IDTENTRY which does the normal entry handling. That is clearly the wrong thing to do for NMI. Map it to IDTENTRY_RAW to unbreak it. The IDTENTRY_NMI mapping needs to stay to avoid emitting ASM code. Fixes: 6271fef00b34 ("x86/entry: Convert NMI to IDTENTRY_NMI") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Debugged-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/CA+G9fYvF3cyrY+-iw_SZtpN-i2qA2BruHg4M=QYECU2-dNdsMw@mail.gmail.com
* x86/entry: Rename trace_hardirqs_off_prepare()Peter Zijlstra2020-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The typical pattern for trace_hardirqs_off_prepare() is: ENTRY lockdep_hardirqs_off(); // because hardware ... do entry magic instrumentation_begin(); trace_hardirqs_off_prepare(); ... do actual work trace_hardirqs_on_prepare(); lockdep_hardirqs_on_prepare(); instrumentation_end(); ... do exit magic lockdep_hardirqs_on(); which shows that it's named wrong, rename it to trace_hardirqs_off_finish(), as it concludes the hardirq_off transition. Also, given that the above is the only correct order, make the traditional all-in-one trace_hardirqs_off() follow suit. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200529213321.415774872@infradead.org
* x86/entry, nmi: Disable #DBPeter Zijlstra2020-06-111-52/+3
| | | | | | | | | | | | | | | | | Instead of playing stupid games with IST stacks, fully disallow #DB during NMIs. There is absolutely no reason to allow them, and killing this saves a heap of trouble. #DB is already forbidden on noinstr and CEA, so there can't be a #DB before this. Disabling it right after nmi_enter() ensures that the full NMI code is protected. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200529213321.069223695@infradead.org
* x86/entry: Move paranoid irq tracing out of ASM codeThomas Gleixner2020-06-111-0/+3
| | | | | | | | | | | | The last step to remove the irq tracing cruft from ASM. Ignore #DF as the maschine is going to die anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20200521202120.414043330@linutronix.de
* x86/nmi: Protect NMI entry against instrumentationThomas Gleixner2020-06-111-6/+9
| | | | | | | | | | | | | | | | | | Mark all functions in the fragile code parts noinstr or force inlining so they can't be instrumented. Also make the hardware latency tracer invocation explicit outside of non-instrumentable section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20200505135314.716186134@linutronix.de
* x86/entry: Convert NMI to IDTENTRY_NMIThomas Gleixner2020-06-111-3/+1
| | | | | | | | | | | | | | | | | | | Convert #NMI to IDTENTRY_NMI: - Implement the C entry point with DEFINE_IDTENTRY_NMI - Fixup the XEN/PV code - Remove the old prototypes No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20200505135314.609932306@linutronix.de
* x86/nmi: Remove edac.h include leftoverBorislav Petkov2020-05-161-4/+0
| | | | | | | | | | | | | ... which db47d5f85646 ("x86/nmi, EDAC: Get rid of DRAM error reporting thru PCI SERR NMI") forgot to remove. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200515182246.3553-1-bp@alien8.de
* x86: Fix a handful of typosMartin Molnar2020-02-161-2/+2
| | | | | | | | | | | Fix a couple of typos in code comments. [ bp: While at it: s/IRQ's/IRQs/. ] Signed-off-by: Martin Molnar <martin.molnar.programming@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lkml.kernel.org/r/0819a044-c360-44a4-f0b6-3f5bafe2d35c@gmail.com
* x86/nmi: Remove irq_work from the long duration NMI handlerChangbin Du2020-01-111-11/+9
| | | | | | | | | | | | | | | | | | | | First, printk() is NMI-context safe now since the safe printk() has been implemented and it already has an irq_work to make NMI-context safe. Second, this NMI irq_work actually does not work if a NMI handler causes panic by watchdog timeout. It has no chance to run in such case, while the safe printk() will flush its per-cpu buffers before panicking. While at it, repurpose the irq_work callback into a function which concentrates the NMI duration checking and makes the code easier to follow. [ bp: Massage. ] Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200111125427.15662-1-changbin.du@gmail.com
* x86/hotplug: Silence APIC and NMI when CPU is deadThomas Gleixner2019-07-251-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to support IPI/NMI broadcasting via the shorthand mechanism side effects of shorthands need to be mitigated: Shorthand IPIs and NMIs hit all CPUs including unplugged CPUs Neither of those can be handled on unplugged CPUs for obvious reasons. It would be trivial to just fully disable the APIC via the enable bit in MSR_APICBASE. But that's not possible because clearing that bit on systems based on the 3 wire APIC bus would require a hardware reset to bring it back as the APIC would lose track of bus arbitration. On systems with FSB delivery APICBASE could be disabled, but it has to be guaranteed that no interrupt is sent to the APIC while in that state and it's not clear from the SDM whether it still responds to INIT/SIPI messages. Therefore stay on the safe side and switch the APIC into soft disabled mode so it won't deliver any regular vector to the CPU. NMIs are still propagated to the 'dead' CPUs. To mitigate that add a check for the CPU being offline on early nmi entry and if so bail. Note, this cannot use the stop/restart_nmi() magic which is used in the alternatives code. A dead CPU cannot invoke nmi_enter() or anything else due to RCU and other reasons. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907241723290.1791@nanos.tec.linutronix.de
* treewide: Add SPDX license identifier for missed filesThomas Gleixner2019-05-211-0/+1
| | | | | | | | | | | | | | | | | Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge branch 'x86-mds-for-linus' of ↵Linus Torvalds2019-05-141-0/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MDS mitigations from Thomas Gleixner: "Microarchitectural Data Sampling (MDS) is a hardware vulnerability which allows unprivileged speculative access to data which is available in various CPU internal buffers. This new set of misfeatures has the following CVEs assigned: CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory MDS attacks target microarchitectural buffers which speculatively forward data under certain conditions. Disclosure gadgets can expose this data via cache side channels. Contrary to other speculation based vulnerabilities the MDS vulnerability does not allow the attacker to control the memory target address. As a consequence the attacks are purely sampling based, but as demonstrated with the TLBleed attack samples can be postprocessed successfully. The mitigation is to flush the microarchitectural buffers on return to user space and before entering a VM. It's bolted on the VERW instruction and requires a microcode update. As some of the attacks exploit data structures shared between hyperthreads, full protection requires to disable hyperthreading. The kernel does not do that by default to avoid breaking unattended updates. The mitigation set comes with documentation for administrators and a deeper technical view" * 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/speculation/mds: Fix documentation typo Documentation: Correct the possible MDS sysfs values x86/mds: Add MDSUM variant to the MDS documentation x86/speculation/mds: Add 'mitigations=' support for MDS x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off x86/speculation/mds: Fix comment x86/speculation/mds: Add SMT warning message x86/speculation: Move arch_smt_update() call to after mitigation decisions x86/speculation/mds: Add mds=full,nosmt cmdline option Documentation: Add MDS vulnerability documentation Documentation: Move L1TF to separate directory x86/speculation/mds: Add mitigation mode VMWERV x86/speculation/mds: Add sysfs reporting for MDS x86/speculation/mds: Add mitigation control for MDS x86/speculation/mds: Conditionally clear CPU buffers on idle entry x86/kvm/vmx: Add MDS protection when L1D Flush is not active x86/speculation/mds: Clear CPU buffers on exit to user x86/speculation/mds: Add mds_clear_cpu_buffers() x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests x86/speculation/mds: Add BUG_MSBDS_ONLY ...
| * x86/speculation/mds: Clear CPU buffers on exit to userThomas Gleixner2019-03-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a static key which controls the invocation of the CPU buffer clear mechanism on exit to user space and add the call into prepare_exit_to_usermode() and do_nmi() right before actually returning. Add documentation which kernel to user space transition this covers and explain why some corner cases are not mitigated. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>
* | x86/exceptions: Split debug IST stackThomas Gleixner2019-04-171-1/+19
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The debug IST stack is actually two separate debug stacks to handle #DB recursion. This is required because the CPU starts always at top of stack on exception entry, which means on #DB recursion the second #DB would overwrite the stack of the first. The low level entry code therefore adjusts the top of stack on entry so a secondary #DB starts from a different stack page. But the stack pages are adjacent without a guard page between them. Split the debug stack into 3 stacks which are separated by guard pages. The 3rd stack is never mapped into the cpu_entry_area and is only there to catch triple #DB nesting: --- top of DB_stack <- Initial stack --- end of DB_stack guard page --- top of DB1_stack <- Top of stack after entering first #DB --- end of DB1_stack guard page --- top of DB2_stack <- Top of stack after entering second #DB --- end of DB2_stack guard page If DB2 would not act as the final guard hole, a second #DB would point the top of #DB stack to the stack below #DB1 which would be valid and not catch the not so desired triple nesting. The backing store does not allocate any memory for DB2 and its guard page as it is not going to be mapped into the cpu_entry_area. - Adjust the low level entry code so it adjusts top of #DB with the offset between the stacks instead of exception stack size. - Make the dumpstack code aware of the new stacks. - Adjust the in_debug_stack() implementation and move it into the NMI code where it belongs. As this is NMI hotpath code, it just checks the full area between top of DB_stack and bottom of DB1_stack without checking for the guard page. That's correct because the NMI cannot hit a stackpointer pointing to the guard page between DB and DB1 stack. Even if it would, then the NMI operation still is unaffected, but the resume of the debug exception on the topmost DB stack will crash by touching the guard page. [ bp: Make exception_stack_names static const char * const ] Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: linux-doc@vger.kernel.org Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190414160145.439944544@linutronix.de
* locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland2017-10-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/nmi: Use raw lockScott Wood2017-08-161-9/+9
| | | | | | | | | | | | register_nmi_handler() can be called from PREEMPT_RT atomic context (e.g. wakeup_cpu_via_init_nmi() or native_stop_other_cpus()), and thus ordinary spinlocks cannot be used. Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Don Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/r/20170724213242.27598-1-swood@redhat.com
* Merge tag 'edac_for_4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bpLinus Torvalds2017-05-011-11/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull EDAC updates from Borislav Petkov: - an EDAC driver for Cavium ThunderX RAS IP (Sergey Temerkhanov) - removal of DRAM error reporting through PCI SERR NMI (Borislav Petkov) - misc small fixes (Jan Glauber, Thor Thayer) * tag 'edac_for_4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: EDAC, ghes: Do not enable it by default EDAC: Rename report status accessors EDAC: Delete edac_stub.c EDAC: Update Kconfig help text EDAC: Remove EDAC_MM_EDAC EDAC: Issue tracepoint only when it is defined ACPI/extlog: Add EDAC dependency EDAC: Move edac_op_state to edac_mc.c EDAC: Remove edac_err_assert EDAC: Get rid of edac_handlers x86/nmi, EDAC: Get rid of DRAM error reporting thru PCI SERR NMI EDAC, highbank: Align Makefile directives EDAC, thunderx: Remove unused code EDAC, thunderx: Change LMC index calculation EDAC, altera: Fix peripheral warnings for Cyclone5 EDAC, thunderx: Fix L2C MCI interrupt disable EDAC, thunderx: Add Cavium ThunderX EDAC driver
| * x86/nmi, EDAC: Get rid of DRAM error reporting thru PCI SERR NMIBorislav Petkov2017-04-101-11/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Apparently, some machines used to report DRAM errors through a PCI SERR NMI. This is why we have a call into EDAC in the NMI handler. See c0d121720220 ("drivers/edac: add new nmi rescan"). From looking at the patch above, that's two drivers: e752x_edac.c and e7xxx_edac.c. Now, I wanna say those are old machines which are probably decommissioned already. Tony says that "[t]the newest CPU supported by either of those drivers is the Xeon E7520 (a.k.a. "Nehalem") released in Q1'2010. Possibly some folks are still using these ... but people that hold onto h/w for 7 years generally cling to old s/w too ... so I'd guess it unlikely that we will get complaints for breaking these in upstream." So even if there is a small number still in use, we did load EDAC with edac_op_state == EDAC_OPSTATE_POLL by default (we still do, in fact) which means a default EDAC setup without any parameters supplied on the command line or otherwise would never even log the error in the NMI handler because we're polling by default: inline int edac_handler_set(void) { if (edac_op_state == EDAC_OPSTATE_POLL) return 0; return atomic_read(&edac_handlers); } So, long story short, I'd like to get rid of that nastiness called edac_stub.c and confine all the EDAC drivers solely to drivers/edac/. If we ever have to do stuff like that again, it should be notifiers we're using and not some insanity like this one. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com>
* | x86/platform: Remove warning message for duplicate NMI handlersMike Travis2017-03-131-4/+2
|/ | | | | | | | | | | | | | | | | | | | | | | Remove the WARNING message associated with multiple NMI handlers as there are at least two that are legitimate. These are the KGDB and the UV handlers and both want to be called if the NMI has not been claimed by any other NMI handler. Use of the UNKNOWN NMI call chain dramatically lowers the NMI call rate when high frequency NMI tools are in use, notably the perf tools. It is required on systems that cannot sustain a high NMI call rate without adversely affecting the system operation. Signed-off-by: Mike Travis <mike.travis@hpe.com> Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Russ Anderson <russ.anderson@hpe.com> Cc: Frank Ramsay <frank.ramsay@hpe.com> Cc: Tony Ernst <tony.ernst@hpe.com> Link: http://lkml.kernel.org/r/20170307210841.730959611@asylum.americas.sgi.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar2017-03-021-0/+1
| | | | | | | | | | | | | | | | | | | | <linux/sched/debug.h> We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/debug.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar2017-03-021-0/+1
| | | | | | | | | | | | | | | | | | | | <linux/sched/clock.h> We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which will have to be picked up from other headers and .c files. Create a trivial placeholder <linux/sched/clock.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86: include linux/ratelimit.h in nmi.cArnd Bergmann2016-06-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When building random configurations, we now occasionally get a new build error: In file included from include/linux/kernel.h:13:0, from include/linux/list.h:8, from include/linux/preempt.h:10, from include/linux/spinlock.h:50, from arch/x86/kernel/nmi.c:13: arch/x86/kernel/nmi.c: In function 'nmi_max_handler': include/linux/printk.h:375:9: error: type defaults to 'int' in declaration of 'DEFINE_RATELIMIT_STATE' [-Werror=implicit-int] static DEFINE_RATELIMIT_STATE(_rs, \ ^ arch/x86/kernel/nmi.c:110:2: note: in expansion of macro 'printk_ratelimited' printk_ratelimited(KERN_INFO ^~~~~~~~~~~~~~~~~~ This was working before the rtc rework series because linux/ratelimit.h was included implictly through asm/mach_traps.h -> asm/mc146818rtc.h -> linux/mc146818rtc.h -> linux/rtc.h -> linux/device.h. We clearly shouldn't rely on this indirect inclusion, so this adds an explicit #include in the file that needs it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: kbuild test robot <fengguang.wu@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Fixes: 5ab788d73832 ("rtc: cmos: move mc146818rtc code out of asm-generic/rtc.h") Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
* x86/nmi: Mark 'ignore_nmis' as __read_mostlyKostenzer Felix2016-03-081-1/+2
| | | | | | | | | | | | | | | | ignore_nmis is used in two distinct places: 1. modified through {stop,restart}_nmi by alternative_instructions 2. read by do_nmi to determine if default_do_nmi should be called or not thus the access pattern conforms to __read_mostly and do_nmi() is a fastpath. Signed-off-by: Kostenzer Felix <fkostenzer@live.at> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/nmi: Save regs in crash dump on external NMIHidehiro Kawai2015-12-191-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now, multiple CPUs can receive an external NMI simultaneously by specifying the "apic_extnmi=all" command line parameter. When we take a crash dump by using external NMI with this option, we fail to save registers into the crash dump. This happens as follows: CPU 0 CPU 1 ================================ ============================= receive an external NMI default_do_nmi() receive an external NMI spin_lock(&nmi_reason_lock) default_do_nmi() io_check_error() spin_lock(&nmi_reason_lock) panic() busy loop ... kdump_nmi_shootdown_cpus() issue NMI IPI -----------> blocked until IRET busy loop... Here, since CPU 1 is in NMI context, an additional NMI from CPU 0 remains unhandled until CPU 1 IRETs. However, CPU 1 will never execute IRET so the NMI is not handled and the callback function to save registers is never called. To solve this issue, we check if the IPI for crash dumping was issued while waiting for nmi_reason_lock to be released, and if so, call its callback function directly. If the IPI is not issued (e.g. kdump is disabled), the actual behavior doesn't change. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/20151210065245.4587.39316.stgit@softrs Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* panic, x86: Allow CPUs to save registers even if looping in NMI contextHidehiro Kawai2015-12-191-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(), sends an NMI IPI to CPUs which haven't called panic() to stop them, save their register information and do some cleanups for crash dumping. However, if such a CPU is infinitely looping in NMI context, we fail to save its register information into the crash dump. For example, this can happen when unknown NMIs are broadcast to all CPUs as follows: CPU 0 CPU 1 =========================== ========================== receive an unknown NMI unknown_nmi_error() panic() receive an unknown NMI spin_trylock(&panic_lock) unknown_nmi_error() crash_kexec() panic() spin_trylock(&panic_lock) panic_smp_self_stop() infinite loop kdump_nmi_shootdown_cpus() issue NMI IPI -----------> blocked until IRET infinite loop... Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET, so the NMI is not handled and the callback function to save registers is never called. In practice, this can happen on some servers which broadcast NMIs to all CPUs when the NMI button is pushed. To save registers in this case, we need to: a) Return from NMI handler instead of looping infinitely or b) Call the callback function directly from the infinite loop Inherently, a) is risky because NMI is also used to prevent corrupted data from being propagated to devices. So, we chose b). This patch does the following: 1. Move the infinite looping of CPUs which haven't called panic() in NMI context (actually done by panic_smp_self_stop()) outside of panic() to enable us to refer pt_regs. Please note that panic_smp_self_stop() is still used for normal context. 2. Call a callback of kdump_nmi_shootdown_cpus() directly to save registers and do some cleanups after setting waiting_for_crash_ipi which is used for counting down the number of CPUs which handled the callback Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Javi Merino <javi.merino@arm.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: lkml <linux-kernel@vger.kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Seth Jennings <sjenning@redhat.com> Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs [ Cleanup comments, fixup formatting. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* panic, x86: Fix re-entrance problem due to panic on NMIHidehiro Kawai2015-12-191-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If panic on NMI happens just after panic() on the same CPU, panic() is recursively called. Kernel stalls, as a result, after failing to acquire panic_lock. To avoid this problem, don't call panic() in NMI context if we've already entered panic(). For that, introduce nmi_panic() macro to reduce code duplication. In the case of panic on NMI, don't return from NMI handlers if another CPU already panicked. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Javi Merino <javi.merino@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: lkml <linux-kernel@vger.kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seth Jennings <sjenning@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs [ Cleanup comments, fixup formatting. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Merge branch 'x86/urgent' into x86/asm, before applying dependent patchesIngo Molnar2015-07-311-71/+52
|\ | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * x86/nmi/64: Improve nested NMI commentsAndy Lutomirski2015-07-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | I found the nested NMI documentation to be difficult to follow. Improve the comments. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * x86/nmi: Enable nested do_nmi() handling for 64-bit kernelsAndy Lutomirski2015-07-171-71/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 32-bit kernels handle nested NMIs in C. Enable the exact same handling on 64-bit kernels as well. This isn't currently necessary, but it will become necessary once the asm code starts allowing limited nesting. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | x86/nmi: Remove the 'b2b' parameter from nmi_handle()Andy Lutomirski2015-07-211-5/+5
|/ | | | | | | | | | | | | | | | | It has never had any effect. Remove it for comprehensibility. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/c91fa38507760d9e54a4b8737fa6409bde896b33.1437418322.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotationMasami Hiramatsu2014-04-241-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use NOKPROBE_SYMBOL macro for protecting functions from kprobes instead of __kprobes annotation under arch/x86. This applies nokprobe_inline annotation for some cases, because NOKPROBE_SYMBOL() will inhibit inlining by referring the symbol address. This just folds a bunch of previous NOKPROBE_SYMBOL() cleanup patches for x86 to one patch. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20140417081814.26341.51656.stgit@ltc230.yrl.intra.hitachi.co.jp Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp> Cc: Gleb Natapov <gleb@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Lebon <jlebon@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/nmi: Push duration printk() to irq contextPeter Zijlstra2014-02-091-13/+24
| | | | | | | | | | | | | | | Calling printk() from NMI context is bad (TM), so move it to IRQ context. In doing so we slightly change (probably wreck) the debugfs nmi_longest_ns thingy, in that it doesn't update to reflect the longest, nor does writing to it reset the count. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Don Zickus <dzickus@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/n/tip-rdw0au56a5ymis1u8p48c12d@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* perf/x86: Fix NMI measurementsPeter Zijlstra2013-10-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | OK, so what I'm actually seeing on my WSM is that sched/clock.c is 'broken' for the purpose we're using it for. What triggered it is that my WSM-EP is broken :-( [ 0.001000] tsc: Fast TSC calibration using PIT [ 0.002000] tsc: Detected 2533.715 MHz processor [ 0.500180] TSC synchronization [CPU#0 -> CPU#6]: [ 0.505197] Measured 3 cycles TSC warp between CPUs, turning off TSC clock. [ 0.004000] tsc: Marking TSC unstable due to check_tsc_sync_source failed For some reason it consistently detects TSC skew, even though NHM+ should have a single clock domain for 'reasonable' systems. This marks sched_clock_stable=0, which means that we do fancy stuff to try and get a 'sane' clock. Part of this fancy stuff relies on the tick, clearly that's gone when NOHZ=y. So for idle cpus time gets stuck, until it either wakes up or gets kicked by another cpu. While this is perfectly fine for the scheduler -- it only cares about actually running stuff, and when we're running stuff we're obviously not idle. This does somewhat break down for perf which can trigger events just fine on an otherwise idle cpu. So I've got NMIs get get 'measured' as taking ~1ms, which actually don't last nearly that long: <idle>-0 [013] d.h. 886.311970: rcu_nmi_enter <-do_nmi ... <idle>-0 [013] d.h. 886.311997: perf_sample_event_took: HERE!!! : 1040990 So ftrace (which uses sched_clock(), not the fancy bits) only sees ~27us, but we measure ~1ms !! Now since all this measurement stuff lives in x86 code, we can actually fix it. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: mingo@kernel.org Cc: dave.hansen@linux.intel.com Cc: eranian@google.com Cc: Don Zickus <dzickus@redhat.com> Cc: jmario@redhat.com Cc: acme@infradead.org Link: http://lkml.kernel.org/r/20131017133350.GG3364@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* perf/x86: Fix incorrect use of do_div() in NMI warningDave Hansen2013-07-121-3/+4
| | | | | | | | | | | | | | | | | | | | I completely botched understanding the calling conventions of do_div(). I assumed that do_div() returned the result instead of realizing that it modifies its argument and returns a remainder. The side-effect from this would be bogus numbers for the "msecs" value in the warning messages: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 0.114 msecs Note, there was a second fix posted by Stephane Eranian for a separate patch which I also botched: http://lkml.kernel.org/r/20130704223010.GA30625@quad Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20130708214404.B0B6EA66@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86: Add NMI duration tracepointsDave Hansen2013-06-231-2/+7
| | | | | | | | | | | | | | | | | This patch has been invaluable in my adventures finding issues in the perf NMI handler. I'm as big a fan of printk() as anybody is, but using printk() in NMIs is deadly when they're happening frequently. Even hacking in trace_printk() ended up eating enough CPU to throw off some of the measurements I was making. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org Cc: acme@ghostprotocols.net Cc: Dave Hansen <dave@sr71.net> Signed-off-by: Ingo Molnar <mingo@kernel.org>