path: root/arch/x86
Commit message | Author | Age | Files | Lines
...
* KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS (Sean Christopherson, 2024-09-12, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4bcdd831d9d01e0fb64faea50732b59b2ee88da1 upstream. Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX reads guest memory. Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN via sync_regs(), which already holds SRCU. I.e. trying to precisely use kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause problems. Acquiring SRCU isn't all that expensive, so for simplicity, grab it unconditionally for KVM_SET_VCPU_EVENTS. ============================= WARNING: suspicious RCU usage 6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted ----------------------------- include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by repro/1071: #0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm] stack backtrace: CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x13f/0x1a0 kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm] kvm_vcpu_read_guest+0x3e/0x90 [kvm] nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel] load_vmcs12_host_state+0x432/0xb40 [kvm_intel] vmx_leave_nested+0x30/0x40 [kvm_intel] kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm] kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm] ? mark_held_locks+0x49/0x70 ? kvm_vcpu_ioctl+0x7d/0x970 [kvm] ? kvm_vcpu_ioctl+0x497/0x970 [kvm] kvm_vcpu_ioctl+0x497/0x970 [kvm] ? lock_acquire+0xba/0x2d0 ? find_held_lock+0x2b/0x80 ? do_user_addr_fault+0x40c/0x6f0 ? lock_release+0xb7/0x270 __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x6c/0x170 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7ff11eb1b539 </TASK> Fixes: f7e570780efc ("KVM: x86: Forcibly leave nested virt when SMM state is toggled") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240723232055.3643811-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
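    The shape of the fix, as described above, is simply to bracket the handler with SRCU. A simplified sketch of what the KVM_SET_VCPU_EVENTS case in kvm_arch_vcpu_ioctl() ends up looking like, based on the commit description rather than the verbatim upstream hunk:

        case KVM_SET_VCPU_EVENTS: {
                struct kvm_vcpu_events events;

                r = -EFAULT;
                if (copy_from_user(&events, argp, sizeof(events)))
                        break;

                /*
                 * Leaving nested VMX/SVM on an SMM toggle reads guest memory,
                 * which requires holding kvm->srcu; grab it unconditionally.
                 */
                kvm_vcpu_srcu_read_lock(vcpu);
                r = kvm_vcpu_ioctl_x86_set_vcpu_events(vcpu, &events);
                kvm_vcpu_srcu_read_unlock(vcpu);
                break;
        }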
* x86/CPU/AMD: Add models 0x60-0x6f to the Zen5 range (Perry Yuan, 2024-09-08, 1 file, -1/+1)

    [ Upstream commit bf5641eccf71bcd13a849930e190563c3a19815d ]

    Add some new Zen5 models for the 0x1A family.

    [ bp: Merge the 0x60 and 0x70 ranges. ]

    Signed-off-by: Perry Yuan <perry.yuan@amd.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Link: https://lore.kernel.org/r/20240729064626.24297-1-bp@kernel.org
    Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/mtrr: Check if fixed MTRRs exist before saving them (Andi Kleen, 2024-08-14, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | commit 919f18f961c03d6694aa726c514184f2311a4614 upstream. MTRRs have an obsolete fixed variant for fine grained caching control of the 640K-1MB region that uses separate MSRs. This fixed variant has a separate capability bit in the MTRR capability MSR. So far all x86 CPUs which support MTRR have this separate bit set, so it went unnoticed that mtrr_save_state() does not check the capability bit before accessing the fixed MTRR MSRs. Though on a CPU that does not support the fixed MTRR capability this results in a #GP. The #GP itself is harmless because the RDMSR fault is handled gracefully, but results in a WARN_ON(). Add the missing capability check to prevent this. Fixes: 2b1f6278d77c ("[PATCH] x86: Save the MTRRs of the BSP before booting an AP") Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240808000244.946864-1-ak@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/paravirt: Fix incorrect virt spinlock setting on bare metal (Chen Yu, 2024-08-14, 2 files, -9/+10)

    commit e639222a51196c69c70b49b67098ce2f9919ed08 upstream.

    The kernel can change spinlock behavior when running as a guest. But this
    guest-friendly behavior causes performance problems on bare metal. The
    kernel uses a static key to switch between the two modes.

    In theory, the static key is enabled by default (run in guest mode) and
    should be disabled for bare metal (and in some guests that want native
    behavior or paravirt spinlock).

    A performance drop is reported when running encode/decode workload and
    BenchSEE cache sub-workload. Bisect points to commit ce0a1b608bfc
    ("x86/paravirt: Silence unused native_pv_lock_init() function warning").
    When CONFIG_PARAVIRT_SPINLOCKS is disabled the virt_spin_lock_key is
    incorrectly set to true on bare metal. The qspinlock degenerates to
    test-and-set spinlock, which decreases the performance on bare metal.

    Set the default value of virt_spin_lock_key to false. If booting in a VM,
    enable this key. Later during the VM initialization, if other
    high-efficient spinlock is preferred (e.g. paravirt-spinlock), or the user
    wants the native qspinlock (via nopvspin boot commandline), the
    virt_spin_lock_key is disabled accordingly.

    This results in the following decision matrix:

        X86_FEATURE_HYPERVISOR      Y     Y     Y     N
        CONFIG_PARAVIRT_SPINLOCKS   Y     Y     N     Y/N
        PV spinlock                 Y     N     N     Y/N
        virt_spin_lock_key          N     Y/N   Y     N

    Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
    Reported-by: Prem Nath Dey <prem.nath.dey@intel.com>
    Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
    Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
    Suggested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
    Suggested-by: Nikolay Borisov <nik.borisov@suse.com>
    Signed-off-by: Chen Yu <yu.c.chen@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/all/20240806112207.29792-1-yu.c.chen@intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
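    As a rough illustration of the mechanism (not the exact upstream diff), the fix amounts to flipping the static key's default to false and enabling it only when a hypervisor is detected; the names below follow the commit message:

        /* Default off: bare metal keeps the native qspinlock path. */
        DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);

        void __init native_pv_lock_init(void)
        {
                /* Only guests opt in to the test-and-set fallback. */
                if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
                        static_branch_enable(&virt_spin_lock_key);
        }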
* x86/mm: Fix pti_clone_entry_text() for i386 (Peter Zijlstra, 2024-08-14, 1 file, -1/+1)

    [ Upstream commit 3db03fb4995ef85fc41e86262ead7b4852f4bcf0 ]

    While x86_64 has PMD aligned text sections, i386 does not have this
    luxury. Notably ALIGN_ENTRY_TEXT_END is empty and _etext has PAGE
    alignment.

    This means that text on i386 can be page granular at the tail end, which
    in turn means that the PTI text clones should consistently account for
    this.

    Make pti_clone_entry_text() consistent with pti_clone_kernel_text().

    Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/mm: Fix pti_clone_pgtable() alignment assumption (Peter Zijlstra, 2024-08-14, 1 file, -3/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 41e71dbb0e0a0fe214545fe64af031303a08524c ] Guenter reported dodgy crashes on an i386-nosmp build using GCC-11 that had the form of endless traps until entry stack exhaust and then #DF from the stack guard. It turned out that pti_clone_pgtable() had alignment assumptions on the start address, notably it hard assumes start is PMD aligned. This is true on x86_64, but very much not true on i386. These assumptions can cause the end condition to malfunction, leading to a 'short' clone. Guess what happens when the user mapping has a short copy of the entry text? Use the correct increment form for addr to avoid alignment assumptions. Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit") Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240731163105.GG33588@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86: Fix smp_processor_id()-in-preemptible warnings (Li Huafei, 2024-08-14, 1 file, -10/+12)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f73cefa3b72eaa90abfc43bf6d68137ba059d4b1 ] The following bug was triggered on a system built with CONFIG_DEBUG_PREEMPT=y: # echo p > /proc/sysrq-trigger BUG: using smp_processor_id() in preemptible [00000000] code: sh/117 caller is perf_event_print_debug+0x1a/0x4c0 CPU: 3 UID: 0 PID: 117 Comm: sh Not tainted 6.11.0-rc1 #109 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4f/0x60 check_preemption_disabled+0xc8/0xd0 perf_event_print_debug+0x1a/0x4c0 __handle_sysrq+0x140/0x180 write_sysrq_trigger+0x61/0x70 proc_reg_write+0x4e/0x70 vfs_write+0xd0/0x430 ? handle_mm_fault+0xc8/0x240 ksys_write+0x9c/0xd0 do_syscall_64+0x96/0x190 entry_SYSCALL_64_after_hwframe+0x4b/0x53 This is because the commit d4b294bf84db ("perf/x86: Hybrid PMU support for counters") took smp_processor_id() outside the irq critical section. If a preemption occurs in perf_event_print_debug() and the task is migrated to another cpu, we may get incorrect pmu debug information. Move smp_processor_id() back inside the irq critical section to fix this issue. Fixes: d4b294bf84db ("perf/x86: Hybrid PMU support for counters") Signed-off-by: Li Huafei <lihuafei1@huawei.com> Reviewed-and-tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20240729220928.325449-1-lihuafei1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86: Support counter mask (Kan Liang, 2024-08-14, 9 files, -179/+199)
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 722e42e45c2f1c6d1adec7813651dba5139f52f4 ] The current perf assumes that both GP and fixed counters are contiguous. But it's not guaranteed on newer Intel platforms or in a virtualization environment. Use the counter mask to replace the number of counters for both GP and the fixed counters. For the other ARCHs or old platforms which don't support a counter mask, using GENMASK_ULL(num_counter - 1, 0) to replace. There is no functional change for them. The interface to KVM is not changed. The number of counters still be passed to KVM. It can be updated later separately. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-3-kan.liang@linux.intel.com Stable-dep-of: f73cefa3b72e ("perf/x86: Fix smp_processor_id()-in-preemptible warnings") Signed-off-by: Sasha Levin <sashal@kernel.org>
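    For architectures or older platforms that still enumerate only a number of counters, the message above describes synthesizing a contiguous mask; a hedged sketch of that conversion (variable names illustrative):

        /* Old scheme: a count of contiguous counters starting at 0.
         * New scheme: an explicit bitmask, possibly with holes. */
        u64 cntr_mask = GENMASK_ULL(num_counters - 1, 0);

        /* The number of counters is then the population count of the mask. */
        int nr_counters = hweight64(cntr_mask);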
* perf/x86/intel: Support the PEBS event mask (Kan Liang, 2024-08-14, 4 files, -13/+26)
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit a23eb2fc1d818cdac9b31f032842d55483a6a040 ] The current perf assumes that the counters that support PEBS are contiguous. But it's not guaranteed with the new leaf 0x23 introduced. The counters are enumerated with a counter mask. There may be holes in the counter mask for future platforms or in a virtualization environment. Store the PEBS event mask rather than the maximum number of PEBS counters in the x86 PMU structures. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lkml.kernel.org/r/20240626143545.480761-2-kan.liang@linux.intel.com Stable-dep-of: f73cefa3b72e ("perf/x86: Fix smp_processor_id()-in-preemptible warnings") Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/amd: Use try_cmpxchg() in events/amd/{un,}core.c (Uros Bizjak, 2024-08-14, 2 files, -3/+9)

    [ Upstream commit cd84351c8c1baec86342d784feb884ace007d51c ]

    Replace this pattern in events/amd/{un,}core.c:

        cmpxchg(*ptr, old, new) == old

    ... with the simpler and faster:

        try_cmpxchg(*ptr, &old, new)

    The x86 CMPXCHG instruction returns success in the ZF flag, so this change
    saves a compare after the CMPXCHG.

    No functional change intended.

    Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Link: https://lore.kernel.org/r/20240425101708.5025-1-ubizjak@gmail.com
    Stable-dep-of: f73cefa3b72e ("perf/x86: Fix smp_processor_id()-in-preemptible warnings")
    Signed-off-by: Sasha Levin <sashal@kernel.org>
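    Outside of the perf code, the same transformation looks like this (a generic sketch, not the exact hunk from events/amd/core.c; the function and variable names are illustrative):

        static bool claim_slot(unsigned int *slot, unsigned int old, unsigned int new)
        {
                /* Before: cmpxchg() returns the previous value, so a separate
                 * compare against 'old' is needed to detect success:
                 *
                 *      return cmpxchg(slot, old, new) == old;
                 */

                /* After: try_cmpxchg() reports success directly (taken from the
                 * CMPXCHG ZF flag) and updates 'old' with the current value on
                 * failure. */
                return try_cmpxchg(slot, &old, new);
        }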
* perf/x86/intel/cstate: Add pkg C2 residency counter for Sierra Forest (Zhenyu Wang, 2024-08-14, 1 file, -2/+3)

    [ Upstream commit b1d0e15c8725d21a73c22c099418a63940261041 ]

    The package C2 residency counter is also available on Sierra Forest, so
    add support for it in srf_cstates.

    Fixes: 3877d55a0db2 ("perf/x86/intel/cstate: Add Sierra Forest support")
    Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
    Tested-by: Wendy Wang <wendy.wang@intel.com>
    Link: https://lore.kernel.org/r/20240717031609.74513-1-zhenyuw@linux.intel.com
    Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel/cstate: Add Lunarlake support (Zhang Rui, 2024-08-14, 1 file, -7/+19)
| | | | | | | | | | | | | | [ Upstream commit 26579860fbd5129e18de9d6fa0751a48420b26b7 ] Compared with previous client platforms, PC8 is removed from Lunarlake. It supports CC1/CC6/CC7 and PC2/PC3/PC6/PC10 residency counters. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20240628031758.43103-4-rui.zhang@intel.com Stable-dep-of: b1d0e15c8725 ("perf/x86/intel/cstate: Add pkg C2 residency counter for Sierra Forest") Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel/cstate: Add Arrowlake support (Zhang Rui, 2024-08-14, 1 file, -8/+12)
| | | | | | | | | | | | | [ Upstream commit a31000753d41305d2fb7faa8cc80a8edaeb7b56b ] Like Alderlake, Arrowlake supports CC1/CC6/CC7 and PC2/PC3/PC6/PC8/PC10. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20240628031758.43103-3-rui.zhang@intel.com Stable-dep-of: b1d0e15c8725 ("perf/x86/intel/cstate: Add pkg C2 residency counter for Sierra Forest") Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel: Add a distinct name for Granite Rapids (Kan Liang, 2024-08-11, 1 file, -5/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit fa0c1c9d283b37fdb7fc1dcccbb88fc8f48a4aa4 ] Currently, the Sapphire Rapids and Granite Rapids share the same PMU name, sapphire_rapids. Because from the kernel’s perspective, GNR is similar to SPR. The only key difference is that they support different extra MSRs. The code path and the PMU name are shared. However, from end users' perspective, they are quite different. Besides the extra MSRs, GNR has a newer PEBS format, supports Retire Latency, supports new CPUID enumeration architecture, doesn't required the load-latency AUX event, has additional TMA Level 1 Architectural Events, etc. The differences can be enumerated by CPUID or the PERF_CAPABILITIES MSR. They weren't reflected in the model-specific kernel setup. But it is worth to have a distinct PMU name for GNR. Fixes: a6742cb90b56 ("perf/x86/intel: Fix the FRONTEND encoding on GNR and MTL") Suggested-by: Ahmad Yasin <ahmad.yasin@intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20240708193336.1192217-3-kan.liang@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel: Switch to new Intel CPU model defines (Tony Luck, 2024-08-11, 1 file, -74/+74)
| | | | | | | | | | | | [ Upstream commit d142df13f3574237688c7a20e0019cccc7ae39eb ] New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240520224620.9480-32-tony.luck%40intel.com Stable-dep-of: fa0c1c9d283b ("perf/x86/intel: Add a distinct name for Granite Rapids") Signed-off-by: Sasha Levin <sashal@kernel.org>
* arch: um: rust: Use the generated target.json again (David Gow, 2024-08-03, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9a2123b397bbe0da5e853273369d63779ac97c8c ] The Rust compiler can take a target config from 'target.json', which is generated by scripts/generate_rust_target.rs. It used to be that all Linux architectures used this to generate a target.json, but now architectures must opt-in to this, or they will default to the Rust compiler's built-in target definition. This is mostly okay for (64-bit) x86 and UML, except that it can generate SSE instructions, which we can't use in the kernel. So re-instate the custom target.json, which disables SSE (and generally enables the 'soft-float' feature). This fixes the following compile error: error: <unknown>:0:0: in function _RNvMNtCs5QSdWC790r4_4core3f32f7next_up float (float): SSE register return with SSE disabled Fixes: f82811e22b48 ("rust: Refactor the build target to allow the use of builtin targets") Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Tested-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Link: https://patch.msgid.link/20240529093336.4075206-1-davidgow@google.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel/pt: Fix a topa_entry base address calculation (Adrian Hunter, 2024-08-03, 1 file, -1/+1)

    commit ad97196379d0b8cb24ef3d5006978a6554e6467f upstream.

    topa_entry->base is a bit-field. Bit-fields are not promoted to a 64-bit
    type, even if the underlying type is 64-bit, and so, if necessary, must
    be cast to a larger type when calculations are done.

    Fix a topa_entry->base address calculation by adding a cast.

    Without the cast, the address was limited to 36-bits i.e. 64GiB.

    The address calculation is used on systems that do not support Multiple
    Entry ToPA (only Broadwell), and affects physical addresses on or above
    64GiB. Instead of writing to the correct address, the address comprising
    the first 36 bits would be written to.

    Intel PT snapshot and sampling modes are not affected.

    Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver")
    Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20240624201101.60186-3-adrian.hunter@intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
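    A minimal sketch of the class of bug being fixed (the struct layout and field widths here are illustrative, not the driver's actual definitions):

        struct topa_like_entry {
                u64 base : 36;          /* pfn-like bit-field */
                u64 size : 4;
        };

        static u64 entry_phys_addr(struct topa_like_entry *e)
        {
                /* Buggy form: the bit-field value is not guaranteed to be
                 * widened to 64 bits before the shift, so high address bits
                 * can be lost:
                 *
                 *      return e->base << PAGE_SHIFT;
                 *
                 * Fixed form: cast to a 64-bit type before shifting. */
                return (u64)e->base << PAGE_SHIFT;
        }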
* perf/x86/intel/pt: Fix topa_entry base length (Marco Cavenati, 2024-08-03, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | commit 5638bd722a44bbe97c1a7b3fae5b9efddb3e70ff upstream. topa_entry->base needs to store a pfn. It obviously needs to be large enough to store the largest possible x86 pfn which is MAXPHYADDR-PAGE_SIZE (52-12). So it is 4 bits too small. Increase the size of topa_entry->base from 36 bits to 40 bits. Note, systems where physical addresses can be 256TiB or more are affected. [ Adrian: Amend commit message as suggested by Dave Hansen ] Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver") Signed-off-by: Marco Cavenati <cavenati.marco@gmail.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240624201101.60186-2-adrian.hunter@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* perf/x86/intel/ds: Fix non 0 retire latency on Raptorlake (Kan Liang, 2024-08-03, 1 file, -2/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e5f32ad56b22ebe384a6e7ddad6e9520c5495563 upstream. A non-0 retire latency can be observed on a Raptorlake which doesn't support the retire latency feature. By design, the retire latency shares the PERF_SAMPLE_WEIGHT_STRUCT sample type with other types of latency. That could avoid adding too many different sample types to support all kinds of latency. For the machine which doesn't support some kind of latency, 0 should be returned. Perf doesn’t clear/init all the fields of a sample data for the sake of performance. It expects the later perf_{prepare,output}_sample() to update the uninitialized field. However, the current implementation doesn't touch the field of the retire latency if the feature is not supported. The memory garbage is dumped into the perf data. Clear the retire latency if the feature is not supported. Fixes: c87a31093c70 ("perf/x86: Support Retire Latency") Reported-by: "Bayduraev, Alexey V" <alexey.v.bayduraev@intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Bayduraev, Alexey V" <alexey.v.bayduraev@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20240708193336.1192217-4-kan.liang@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* perf/x86/intel/uncore: Fix the bits of the CHA extended umask for SPR (Kan Liang, 2024-08-03, 1 file, -2/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | commit a5a6ff3d639d088d4af7e2935e1ee0d8b4e817d4 upstream. The perf stat errors out with UNC_CHA_TOR_INSERTS.IA_HIT_CXL_ACC_LOCAL event. $perf stat -e uncore_cha_55/event=0x35,umask=0x10c0008101/ -a -- ls event syntax error: '..0x35,umask=0x10c0008101/' \___ Bad event or PMU The definition of the CHA umask is config:8-15,32-55, which is 32bit. However, the umask of the event is bigger than 32bit. This is an error in the original uncore spec. Add a new umask_ext5 for the new CHA umask range. Fixes: 949b11381f81 ("perf/x86/intel/uncore: Add Sapphire Rapids server CHA support") Closes: https://lore.kernel.org/linux-perf-users/alpine.LRH.2.20.2401300733310.11354@Diego/ Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20240708185524.1185505-1-kan.liang@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: nVMX: Fold requested virtual interrupt check into has_nested_events() (Sean Christopherson, 2024-08-03, 7 files, -33/+5)
| | | | | | | | | | | | | | | | | | | | commit 321ef62b0c5f6f57bb8500a2ca5986052675abbf upstream. Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is pending delivery, in vmx_has_nested_events() and drop the one-off kvm_x86_ops.guest_apic_has_interrupt() hook. In addition to dropping a superfluous hook, this fixes a bug where KVM would incorrectly treat virtual interrupts _for L2_ as always enabled due to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(), treating IRQs as enabled if L2 is active and vmcs12 is configured to exit on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake event based on L1's IRQ blocking status. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: nVMX: Check for pending posted interrupts when looking for nested events (Sean Christopherson, 2024-08-03, 1 file, -2/+34)
| | | | | | | | | | | | | | | | | | | | | | commit 27c4fa42b11af780d49ce704f7fa67b3c2544df4 upstream. Check for pending (and notified!) posted interrupts when checking if L2 has a pending wake event, as fully posted/notified virtual interrupt is a valid wake event for HLT. Note that KVM must check vmx->nested.pi_pending to avoid prematurely waking L2, e.g. even if KVM sees a non-zero PID.PIR and PID.0N=1, the virtual interrupt won't actually be recognized until a notification IRQ is received by the vCPU or the vCPU does (nested) VM-Enter. Fixes: 26844fee6ade ("KVM: x86: never write to memory from kvm_vcpu_check_block()") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/all/20231207010302.2240506-1-jmattson@google.com Link: https://lore.kernel.org/r/20240607172609.3205077-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: nVMX: Request immediate exit iff pending nested event needs injection (Sean Christopherson, 2024-08-03, 3 files, -4/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 32f55e475ce2c4b8b124d335fcfaf1152ba977a1 upstream. When requesting an immediate exit from L2 in order to inject a pending event, do so only if the pending event actually requires manual injection, i.e. if and only if KVM actually needs to regain control in order to deliver the event. Avoiding the "immediate exit" isn't simply an optimization, it's necessary to make forward progress, as the "already expired" VMX preemption timer trick that KVM uses to force a VM-Exit has higher priority than events that aren't directly injected. At present time, this is a glorified nop as all events processed by vmx_has_nested_events() require injection, but that will not hold true in the future, e.g. if there's a pending virtual interrupt in vmcs02.RVI. I.e. if KVM is trying to deliver a virtual interrupt to L2, the expired VMX preemption timer will trigger VM-Exit before the virtual interrupt is delivered, and KVM will effectively hang the vCPU in an endless loop of forced immediate VM-Exits (because the pending virtual interrupt never goes away). Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: nVMX: Add a helper to get highest pending from Posted Interrupt vector (Sean Christopherson, 2024-08-03, 2 files, -2/+13)
| | | | | | | | | | | | | | | | | | | commit d83c36d822be44db4bad0c43bea99c8908f54117 upstream. Add a helper to retrieve the highest pending vector given a Posted Interrupt descriptor. While the actual operation is straightforward, it's surprisingly easy to mess up, e.g. if one tries to reuse lapic.c's find_highest_vector(), which doesn't work with PID.PIR due to the APIC's IRR and ISR component registers being physically discontiguous (they're 4-byte registers aligned at 16-byte intervals). To make PIR handling more consistent with respect to IRR and ISR handling, return -1 to indicate "no interrupt pending". Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: VMX: Split out the non-virtualization part of vmx_interrupt_blocked() (Sean Christopherson, 2024-08-03, 2 files, -3/+9)
| | | | | | | | | | | | | | | | commit 322a569c4b4188a0da2812f9e952780ce09b74ba upstream. Move the non-VMX chunk of the "interrupt blocked" checks to a separate helper so that KVM can reuse the code to detect if interrupts are blocked for L2, e.g. to determine if a virtual interrupt _for L2_ is a valid wake event. If L1 disables HLT-exiting for L2, nested APICv is enabled, and L2 HLTs, then L2 virtual interrupts are valid wake events, but if and only if interrupts are unblocked for L2. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240607172609.3205077-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/shstk: Make return uprobe work with shadow stack (Jiri Olsa, 2024-08-03, 3 files, -1/+19)
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 1713b63a07a28a475de94664f783b4cfd2e4fa90 ] Currently the application with enabled shadow stack will crash if it sets up return uprobe. The reason is the uretprobe kernel code changes the user space task's stack, but does not update shadow stack accordingly. Adding new functions to update values on shadow stack and using them in uprobe code to keep shadow stack in sync with uretprobe changes to user stack. Link: https://lore.kernel.org/all/20240611112158.40795-2-jolsa@kernel.org/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Fixes: 488af8ea7131 ("x86/shstk: Wire in shadow stack interface") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/amd/uncore: Fix DF and UMC domain identification (Sandipan Das, 2024-08-03, 1 file, -3/+3)
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 57e11990f45f89bc29d0f84dd7b13a4e4263eeb2 ] For uncore PMUs, a single context is shared across all CPUs in a domain. The domain can be a CCX, like in the case of the L3 PMU, or a socket, like in the case of DF and UMC PMUs. This information is available via the PMU's cpumask. For contexts shared across a socket, the domain is currently determined from topology_die_id() which is incorrect after the introduction of commit 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf") as it now returns a CCX identifier on Zen 4 and later systems which support CPUID leaf 0x80000026. Use topology_logical_package_id() instead as it always returns a socket identifier irrespective of the availability of CPUID leaf 0x80000026. Fixes: 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf") Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240626074942.1044818-1-sandipan.das@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/amd/uncore: Avoid PMU registration if counters are unavailable (Sandipan Das, 2024-08-03, 1 file, -8/+14)
| | | | | | | | | | | | | | | [ Upstream commit f997e208b6c96858a2f6c0855debfbdb9b52f131 ] X86_FEATURE_PERFCTR_NB and X86_FEATURE_PERFCTR_LLC are derived from CPUID leaf 0x80000001 ECX bits 24 and 28 respectively and denote the availability of DF and L3 counters. When these bits are not set, the corresponding PMUs have no counters and hence, should not be registered. Fixes: 07888daa056e ("perf/x86/amd/uncore: Move discovery and registration") Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240626074404.1044230-1-sandipan.das@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel/cstate: Fix Alderlake/Raptorlake/Meteorlake (Zhang Rui, 2024-08-03, 1 file, -5/+2)
| | | | | | | | | | | | | | | | | | | [ Upstream commit 2c3aedd9db6295619d21e50ad29efda614023bf1 ] For Alderlake, the spec changes after the patch submitted and PC7/PC9 are removed. Raptorlake and Meteorlake, which copy the Alderlake cstate PMU, also don't have PC7/PC9. Remove PC7/PC9 support for Alderlake/Raptorlake/Meteorlake. Fixes: d0ca946bcf84 ("perf/x86/cstate: Add Alder Lake CPU support") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20240628031758.43103-2-rui.zhang@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel/pt: Fix pt_topa_entry_for_page() address calculation (Adrian Hunter, 2024-08-03, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | [ Upstream commit 3520b251dcae2b4a27b95cd6f745c54fd658bda5 ] Currently, perf allocates an array of page pointers which is limited in size by MAX_PAGE_ORDER. That in turn limits the maximum Intel PT buffer size to 2GiB. Should that limitation be lifted, the Intel PT driver can support larger sizes, except for one calculation in pt_topa_entry_for_page(), which is limited to 32-bits. Fix pt_topa_entry_for_page() address calculation by adding a cast. Fixes: 39152ee51b77 ("perf/x86/intel/pt: Get rid of reverse lookup table for ToPA") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240624201101.60186-4-adrian.hunter@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86: Serialize set_attr_rdpmc() (Thomas Gleixner, 2024-08-03, 1 file, -0/+3)

    [ Upstream commit bb9bb45f746b0f9457de9c3fc4da143a6351bdc9 ]

    Yue and Xingwei reported a jump label failure. It's caused by the lack of
    serialization in set_attr_rdpmc():

        CPU0                                    CPU1

                                                Assume: x86_pmu.attr_rdpmc == 0

        if (val != x86_pmu.attr_rdpmc) {
          if (val == 0)
            ...
          else if (x86_pmu.attr_rdpmc == 0)
            static_branch_dec(&rdpmc_never_available_key);

                                                if (val != x86_pmu.attr_rdpmc) {
                                                  if (val == 0)
                                                    ...
                                                  else if (x86_pmu.attr_rdpmc == 0)
        FAIL, due to imbalance --->                 static_branch_dec(&rdpmc_never_available_key);

    The reported BUG() is a consequence of the above and of another bug in the
    jump label core code. The core code needs a separate fix, but that cannot
    prevent the imbalance problem caused by set_attr_rdpmc().

    Prevent this by serializing set_attr_rdpmc() locally.

    Fixes: a66734297f78 ("perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks")
    Closes: https://lore.kernel.org/r/CAEkJfYNzfW1vG=ZTMdz_Weoo=RXY1NDunbxnDaLyj8R4kEoE_w@mail.gmail.com
    Reported-by: Yue Sun <samsun1006219@gmail.com>
    Reported-by: Xingwei Lee <xrivendell7@gmail.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lkml.kernel.org/r/20240610124406.359476013@linutronix.de
    Signed-off-by: Sasha Levin <sashal@kernel.org>
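    A rough sketch of the shape of that local serialization (the lock name and elided body are illustrative, not the exact upstream hunk):

        static DEFINE_MUTEX(rdpmc_mutex);

        static ssize_t set_attr_rdpmc(struct device *cdev,
                                      struct device_attribute *attr,
                                      const char *buf, size_t count)
        {
                /* Serialize the compare-and-toggle of x86_pmu.attr_rdpmc so
                 * the static-branch inc/dec can never run twice for a single
                 * transition. */
                guard(mutex)(&rdpmc_mutex);

                /* ... existing parsing and rdpmc_never_available_key /
                 * rdpmc_always_available_key updates run under the lock ... */
                return count;
        }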
* x86/sev: Do RMP memory coverage check after max_pfn has been set (Tom Lendacky, 2024-08-03, 1 file, -22/+22)
| | | | | | | | | | | | | | | | | | | [ Upstream commit 0440feb090790c6243bca85d6a794824e71ff26c ] The RMP table is probed early in the boot process before max_pfn has been set, so the logic to check if the RMP covers all of system memory is not valid. Move the RMP memory coverage check from snp_probe_rmptable_info() into snp_rmptable_init(), which is well after max_pfn has been set. Also, fix the calculation to use PFN_UP instead of PHYS_PFN, in order to compute the required RMP size properly. Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/bec4364c7e34358cc576f01bb197a7796a109169.1718984524.git.thomas.lendacky@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/xen: Convert comma to semicolon (Chen Ni, 2024-08-03, 1 file, -2/+2)

    [ Upstream commit 349d271416c61f82b853336509b1d0dc04c1fcbb ]

    Replace a comma between expression statements by a semicolon.

    Fixes: 8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
    Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
    Reviewed-by: Juergen Gross <jgross@suse.com>
    Link: https://lore.kernel.org/r/20240702031010.1411875-1-nichen@iscas.ac.cn
    Signed-off-by: Juergen Gross <jgross@suse.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/syscall: Mark exit[_group] syscall handlers __noreturn (Josh Poimboeuf, 2024-08-03, 7 files, -23/+36)
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9142be9e6443fd641ca37f820efe00d9cd890eb1 ] The direct-call syscall dispatch function doesn't know that the exit() and exit_group() syscall handlers don't return, so the call sites aren't optimized accordingly. Fix that by marking the exit syscall declarations __noreturn. Fixes the following warnings: vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation Fixes: 1e3ad78334a6 ("x86/syscall: Don't force use of indirect calls for system calls") Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/5d8882bc077d8eadcc7fd1740b56dfb781f12288.1719381528.git.jpoimboe@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/platform/iosf_mbi: Convert PCIBIOS_* return codes to errnos (Ilpo Järvinen, 2024-08-03, 1 file, -2/+2)

    [ Upstream commit 7821fa101eab529521aa4b724bf708149d70820c ]

    iosf_mbi_pci_{read,write}_mdr() use pci_{read,write}_config_dword() that
    return PCIBIOS_* codes, but the functions also return -ENODEV, which is
    not a compatible error code. As neither of the functions is related to
    the PCI read/write functions, they should return normal errnos.

    Convert the PCIBIOS_* return code into a normal errno using
    pcibios_err_to_errno() before returning it.

    Fixes: 46184415368a ("arch: x86: New MailBox support driver for Intel SOC's")
    Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Link: https://lore.kernel.org/r/20240527125538.13620-4-ilpo.jarvinen@linux.intel.com
    Signed-off-by: Sasha Levin <sashal@kernel.org>
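    The same conversion pattern recurs in the next few entries; roughly, it looks like this (a generic sketch, not the exact iosf_mbi hunk; the function name and offset parameter are illustrative):

        static int read_mdr(struct pci_dev *pdev, int offset, u32 *mdr)
        {
                int ret;

                /* pci_read_config_dword() returns PCIBIOS_* codes, not errnos;
                 * translate before handing the result to generic callers. */
                ret = pci_read_config_dword(pdev, offset, mdr);
                if (ret)
                        return pcibios_err_to_errno(ret);

                return 0;
        }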
* x86/pci/xen: Fix PCIBIOS_* return code handling (Ilpo Järvinen, 2024-08-03, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e9d7b435dfaec58432f4106aaa632bf39f52ce9f ] xen_pcifront_enable_irq() uses pci_read_config_byte() that returns PCIBIOS_* codes. The error handling, however, assumes the codes are normal errnos because it checks for < 0. xen_pcifront_enable_irq() also returns the PCIBIOS_* code back to the caller but the function is used as the (*pcibios_enable_irq) function which should return normal errnos. Convert the error check to plain non-zero check which works for PCIBIOS_* return codes and convert the PCIBIOS_* return code using pcibios_err_to_errno() into normal errno before returning it. Fixes: 3f2a230caf21 ("xen: handled remapped IRQs when enabling a pcifront PCI device.") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240527125538.13620-3-ilpo.jarvinen@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/pci/intel_mid_pci: Fix PCIBIOS_* return code handling (Ilpo Järvinen, 2024-08-03, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 724852059e97c48557151b3aa4af424614819752 ] intel_mid_pci_irq_enable() uses pci_read_config_byte() that returns PCIBIOS_* codes. The error handling, however, assumes the codes are normal errnos because it checks for < 0. intel_mid_pci_irq_enable() also returns the PCIBIOS_* code back to the caller but the function is used as the (*pcibios_enable_irq) function which should return normal errnos. Convert the error check to plain non-zero check which works for PCIBIOS_* return codes and convert the PCIBIOS_* return code using pcibios_err_to_errno() into normal errno before returning it. Fixes: 5b395e2be6c4 ("x86/platform/intel-mid: Make IRQ allocation a bit more flexible") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240527125538.13620-2-ilpo.jarvinen@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/of: Return consistent error type from x86_of_pci_irq_enable() (Ilpo Järvinen, 2024-08-03, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | [ Upstream commit ec0b4c4d45cf7cf9a6c9626a494a89cb1ae7c645 ] x86_of_pci_irq_enable() returns PCIBIOS_* code received from pci_read_config_byte() directly and also -EINVAL which are not compatible error types. x86_of_pci_irq_enable() is used as (*pcibios_enable_irq) function which should not return PCIBIOS_* codes. Convert the PCIBIOS_* return code from pci_read_config_byte() into normal errno using pcibios_err_to_errno(). Fixes: 96e0a0797eba ("x86: dtb: Add support for PCI devices backed by dtb nodes") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240527125538.13620-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/kconfig: Add as-instr64 macro to properly evaluate AS_WRUSS (Masahiro Yamada, 2024-08-03, 1 file, -1/+1)

    [ Upstream commit 469169803d52a5d8f0dc781090638e851a7d22b1 ]

    Some instructions are only available on the 64-bit architecture. Bi-arch
    compilers that default to -m32 need the explicit -m64 option to evaluate
    them properly.

    Fixes: 18e66b695e78 ("x86/shstk: Add Kconfig option for shadow stack")
    Closes: https://lore.kernel.org/all/20240612-as-instr-opt-wrussq-v2-1-bd950f7eead7@gmail.com/
    Reported-by: Dmitry Safonov <0x7f454c46@gmail.com>
    Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Tested-by: Dmitry Safonov <0x7f454c46@gmail.com>
    Link: https://lore.kernel.org/r/20240612050257.3670768-1-masahiroy@kernel.org
    Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/bhi: Avoid warning in #DB handler due to BHI mitigation (Alexandre Chartre, 2024-07-03, 1 file, -4/+10)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When BHI mitigation is enabled, if SYSENTER is invoked with the TF flag set then entry_SYSENTER_compat() uses CLEAR_BRANCH_HISTORY and calls the clear_bhb_loop() before the TF flag is cleared. This causes the #DB handler (exc_debug_kernel()) to issue a warning because single-step is used outside the entry_SYSENTER_compat() function. To address this issue, entry_SYSENTER_compat() should use CLEAR_BRANCH_HISTORY after making sure the TF flag is cleared. The problem can be reproduced with the following sequence: $ cat sysenter_step.c int main() { asm("pushf; pop %ax; bts $8,%ax; push %ax; popf; sysenter"); } $ gcc -o sysenter_step sysenter_step.c $ ./sysenter_step Segmentation fault (core dumped) The program is expected to crash, and the #DB handler will issue a warning. Kernel log: WARNING: CPU: 27 PID: 7000 at arch/x86/kernel/traps.c:1009 exc_debug_kernel+0xd2/0x160 ... RIP: 0010:exc_debug_kernel+0xd2/0x160 ... Call Trace: <#DB> ? show_regs+0x68/0x80 ? __warn+0x8c/0x140 ? exc_debug_kernel+0xd2/0x160 ? report_bug+0x175/0x1a0 ? handle_bug+0x44/0x90 ? exc_invalid_op+0x1c/0x70 ? asm_exc_invalid_op+0x1f/0x30 ? exc_debug_kernel+0xd2/0x160 exc_debug+0x43/0x50 asm_exc_debug+0x1e/0x40 RIP: 0010:clear_bhb_loop+0x0/0xb0 ... </#DB> <TASK> ? entry_SYSENTER_compat_after_hwframe+0x6e/0x8d </TASK> [ bp: Massage commit message. ] Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry") Reported-by: Suman Maity <suman.m.maity@oracle.com> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20240524070459.3674025-1-alexandre.chartre@oracle.com
* x86-32: fix cmpxchg8b_emu build error with clang (Linus Torvalds, 2024-06-30, 1 file, -7/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel test robot reported that clang no longer compiles the 32-bit x86 kernel in some configurations due to commit 95ece48165c1 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions"). The build fails with arch/x86/include/asm/cmpxchg_32.h:149:9: error: inline assembly requires more registers than available and the reason seems to be that not only does the cmpxchg8b instruction need four fixed registers (EDX:EAX and ECX:EBX), with the emulation fallback the inline asm also wants a fifth fixed register for the address (it uses %esi for that, but that's just a software convention with cmpxchg8b_emu). Avoiding using another pointer input to the asm (and just forcing it to use the "0(%esi)" addressing that we end up requiring for the sw fallback) seems to fix the issue. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202406230912.F6XFIyA6-lkp@intel.com/ Fixes: 95ece48165c1 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions") Link: https://lore.kernel.org/all/202406230912.F6XFIyA6-lkp@intel.com/ Suggested-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-and-Tested-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'hardening-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux (Linus Torvalds, 2024-06-28, 1 file, -9/+6)

    Pull hardening fixes from Kees Cook:

     - Remove invalid tty __counted_by annotation (Nathan Chancellor)

     - Add missing MODULE_DESCRIPTION()s for KUnit string tests (Jeff Johnson)

     - Remove non-functional per-arch kstack entropy filtering

    * tag 'hardening-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
      tty: mxser: Remove __counted_by from mxser_board.ports[]
      randomize_kstack: Remove non-functional per-arch entropy filtering
      string: kunit: add missing MODULE_DESCRIPTION() macros
  * randomize_kstack: Remove non-functional per-arch entropy filtering (Kees Cook, 2024-06-28, 1 file, -9/+6)

    An unintended consequence of commit 9c573cd31343 ("randomize_kstack:
    Improve entropy diffusion") was that the per-architecture entropy size
    filtering reduced how many bits were being added to the mix, rather than
    how many bits were being used during the offsetting. All architectures
    fell back to the existing default of 0x3FF (10 bits), which will consume
    at most 1KiB of stack space. It seems that this is working just fine, so
    let's avoid the confusion and update everything to use the default.

    The prior intent of the per-architecture limits were:

      arm64:   capped at 0x1FF (9 bits), 5 bits effective
      powerpc: uncapped (10 bits), 6 or 7 bits effective
      riscv:   uncapped (10 bits), 6 bits effective
      x86:     capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
      s390:    capped at 0xFF (8 bits), undocumented effective entropy

    Current discussion has led to just dropping the original per-architecture
    filters. The additional entropy appears to be safe for arm64, x86, and
    s390. Quoting Arnd, "There is no point pretending that 15.75KB is somehow
    safe to use while 15.00KB is not."

    Co-developed-by: Yuntao Liu <liuyuntao12@huawei.com>
    Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
    Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion")
    Link: https://lore.kernel.org/r/20240617133721.377540-1-liuyuntao12@huawei.com
    Reviewed-by: Arnd Bergmann <arnd@arndb.de>
    Acked-by: Mark Rutland <mark.rutland@arm.com>
    Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
    Link: https://lore.kernel.org/r/20240619214711.work.953-kees@kernel.org
    Signed-off-by: Kees Cook <kees@kernel.org>
* x86: stop playing stack games in profile_pc() (Linus Torvalds, 2024-06-28, 1 file, -19/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'profile_pc()' function is used for timer-based profiling, which isn't really all that relevant any more to begin with, but it also ends up making assumptions based on the stack layout that aren't necessarily valid. Basically, the code tries to account the time spent in spinlocks to the caller rather than the spinlock, and while I support that as a concept, it's not worth the code complexity or the KASAN warnings when no serious profiling is done using timers anyway these days. And the code really does depend on stack layout that is only true in the simplest of cases. We've lost the comment at some point (I think when the 32-bit and 64-bit code was unified), but it used to say: Assume the lock function has either no stack frame or a copy of eflags from PUSHF. which explains why it just blindly loads a word or two straight off the stack pointer and then takes a minimal look at the values to just check if they might be eflags or the return pc: Eflags always has bits 22 and up cleared unlike kernel addresses but that basic stack layout assumption assumes that there isn't any lock debugging etc going on that would complicate the code and cause a stack frame. It causes KASAN unhappiness reported for years by syzkaller [1] and others [2]. With no real practical reason for this any more, just remove the code. Just for historical interest, here's some background commits relating to this code from 2006: 0cb91a229364 ("i386: Account spinlocks to the caller during profiling for !FP kernels") 31679f38d886 ("Simplify profile_pc on x86-64") and a code unification from 2009: ef4512882dbe ("x86: time_32/64.c unify profile_pc") but the basics of this thing actually goes back to before the git tree. Link: https://syzkaller.appspot.com/bug?extid=84fe685c02cd112a2ac3 [1] Link: https://lore.kernel.org/all/CAK55_s7Xyq=nh97=K=G1sxueOFrJDAvPOJAL4TPTCAYvmxO9_A@mail.gmail.com/ [2] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* syscalls: fix compat_sys_io_pgetevents_time64 usage (Arnd Bergmann, 2024-06-25, 1 file, -1/+1)

    Using sys_io_pgetevents() as the entry point for compat mode tasks works
    almost correctly, but misses the sign extension for the min_nr and nr
    arguments.

    This was addressed on parisc by switching to
    compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc:
    io_pgetevents_time64() needs compat syscall in 32-bit compat mode"), as
    well as by using more sophisticated system call wrappers on x86 and s390.
    However, arm64, mips, powerpc, sparc and riscv still have the same bug.

    Change all of them over to use compat_sys_io_pgetevents_time64() like
    parisc already does. This was clearly the intention when the function was
    originally added, but it got hooked up incorrectly in the tables.

    Cc: stable@vger.kernel.org
    Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit architectures")
    Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
* Merge tag 'x86_urgent_for_v6.10_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds, 2024-06-23, 1 file, -1/+2)

    Pull x86 fixes from Borislav Petkov:

     - An ARM-relevant fix to not free default RMIDs of a resource control group

     - A randconfig build fix for the VMware virtual GPU driver

    * tag 'x86_urgent_for_v6.10_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      x86/resctrl: Don't try to free nonexistent RMIDs
      drm/vmwgfx: Fix missing HYPERVISOR_GUEST dependency
  * x86/resctrl: Don't try to free nonexistent RMIDs (Dave Martin, 2024-06-19, 1 file, -1/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index") adds logic to map individual monitoring groups into a global index space used for tracking allocated RMIDs. Attempts to free the default RMID are ignored in free_rmid(), and this works fine on x86. With arm64 MPAM, there is a latent bug here however: on platforms with no monitors exposed through resctrl, each control group still gets a different monitoring group ID as seen by the hardware, since the CLOSID always forms part of the monitoring group ID. This means that when removing a control group, the code may try to free this group's default monitoring group RMID for real. If there are no monitors however, the RMID tracking table rmid_ptrs[] would be a waste of memory and is never allocated, leading to a splat when free_rmid() tries to dereference the table. One option would be to treat RMID 0 as special for every CLOSID, but this would be ugly since bookkeeping still needs to be done for these monitoring group IDs when there are monitors present in the hardware. Instead, add a gating check of resctrl_arch_mon_capable() in free_rmid(), and just do nothing if the hardware doesn't have monitors. This fix mirrors the gating checks already present in mkdir_rdt_prepare_rmid_alloc() and elsewhere. No functional change on x86. [ bp: Massage commit message. ] Fixes: 6791e0ea3071 ("x86/resctrl: Access per-rmid structures by index") Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20240618140152.83154-1-Dave.Martin@arm.com
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm (Linus Torvalds, 2024-06-22, 2 files, -7/+6)
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: "ARM: - Fix dangling references to a redistributor region if the vgic was prematurely destroyed. - Properly mark FFA buffers as released, ensuring that both parties can make forward progress. x86: - Allow getting/setting MSRs for SEV-ES guests, if they're using the pre-6.9 KVM_SEV_ES_INIT API. - Always sync pending posted interrupts to the IRR prior to IOAPIC route updates, so that EOIs are intercepted properly if the old routing table requested that. Generic: - Avoid __fls(0) - Fix reference leak on hwpoisoned page - Fix a race in kvm_vcpu_on_spin() by ensuring loads and stores are atomic. - Fix bug in __kvm_handle_hva_range() where KVM calls a function pointer that was intended to be a marker only (nothing bad happens but kind of a mine and also technically undefined behavior) - Do not bother accounting allocations that are small and freed before getting back to userspace. Selftests: - Fix compilation for RISC-V. - Fix a "shift too big" goof in the KVM_SEV_INIT2 selftest. - Compute the max mappable gfn for KVM selftests on x86 using GuestMaxPhyAddr from KVM's supported CPUID (if it's available)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guests KVM: Discard zero mask with function kvm_dirty_ring_reset virt: guest_memfd: fix reference leak on hwpoisoned page kvm: do not account temporary allocations to kmem MAINTAINERS: Drop Wanpeng Li as a Reviewer for KVM Paravirt support KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes KVM: Stop processing *all* memslots when "null" mmu_notifier handler is found KVM: arm64: FFA: Release hyp rx buffer KVM: selftests: Fix RISC-V compilation KVM: arm64: Disassociate vcpus from redistributor region on teardown KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin() KVM: selftests: x86: Prioritize getting max_gfn from GuestPhysBits KVM: selftests: Fix shift of 32 bit unsigned int more than 32 bits
  * KVM: SEV-ES: Fix svm_get_msr()/svm_set_msr() for KVM_SEV_ES_INIT guests (Michael Roth, 2024-06-21, 1 file, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption"), older VMMs like QEMU 9.0 and older will fail when booting SEV-ES guests with something like the following error: qemu-system-x86_64: error: failed to get MSR 0x174 qemu-system-x86_64: ../qemu.git/target/i386/kvm/kvm.c:3950: kvm_get_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed. This is because older VMMs that might still call svm_get_msr()/svm_set_msr() for SEV-ES guests after guest boot even if those interfaces were essentially just noops because of the vCPU state being encrypted and stored separately in the VMSA. Now those VMMs will get an -EINVAL and generally crash. Newer VMMs that are aware of KVM_SEV_INIT2 however are already aware of the stricter limitations of what vCPU state can be sync'd during guest run-time, so newer QEMU for instance will work both for legacy KVM_SEV_ES_INIT interface as well as KVM_SEV_INIT2. So when using KVM_SEV_INIT2 it's okay to assume userspace can deal with -EINVAL, whereas for legacy KVM_SEV_ES_INIT the kernel might be dealing with either an older VMM and so it needs to assume that returning -EINVAL might break the VMM. Address this by only returning -EINVAL if the guest was started with KVM_SEV_INIT2. Otherwise, just silently return. Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Nikunj A Dadhania <nikunj@amd.com> Reported-by: Srikanth Aithal <sraithal@amd.com> Closes: https://lore.kernel.org/lkml/37usuu4yu4ok7be2hqexhmcyopluuiqj3k266z4gajc2rcj4yo@eujb23qc3zcm/ Fixes: 27bd5fdc24c0 ("KVM: SEV-ES: Prevent MSR access post VMSA encryption") Signed-off-by: Michael Roth <michael.roth@amd.com> Message-ID: <20240604233510.764949-1-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  * KVM: x86: Always sync PIR to IRR prior to scanning I/O APIC routes (Sean Christopherson, 2024-06-20, 1 file, -5/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sync pending posted interrupts to the IRR prior to re-scanning I/O APIC routes, irrespective of whether the I/O APIC is emulated by userspace or by KVM. If a level-triggered interrupt routed through the I/O APIC is pending or in-service for a vCPU, KVM needs to intercept EOIs on said vCPU even if the vCPU isn't the destination for the new routing, e.g. if servicing an interrupt using the old routing races with I/O APIC reconfiguration. Commit fceb3a36c29a ("KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure race") fixed the common cases, but kvm_apic_pending_eoi() only checks if an interrupt is in the local APIC's IRR or ISR, i.e. misses the uncommon case where an interrupt is pending in the PIR. Failure to intercept EOI can manifest as guest hangs with Windows 11 if the guest uses the RTC as its timekeeping source, e.g. if the VMM doesn't expose a more modern form of time to the guest. Cc: stable@vger.kernel.org Cc: Adamos Ttofari <attofari@amazon.de> Cc: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240611014845.82795-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>