path: root/arch/x86
Commit message (Author, Age, Files changed, Lines -/+)
* perf,x86: avoid missing caller address in stack traces captured in uprobe (Andrii Nakryiko, 2024-10-10, 1 file changed, -0/+63)
[ Upstream commit cfa7f3d2c526c224a6271cc78a4a27a0de06f4f0 ]

When tracing user functions with uprobe functionality, it's common to install the probe (e.g., a BPF program) at the first instruction of the function. This is often going to be `push %rbp` instruction in function preamble, which means that within that function frame pointer hasn't been established yet. This leads to consistently missing an actual caller of the traced function, because perf_callchain_user() only records current IP (capturing traced function) and then following frame pointer chain (which would be caller's frame, containing the address of caller's caller).

So when we have target_1 -> target_2 -> target_3 call chain and we are tracing an entry to target_3, captured stack trace will report target_1 -> target_3 call chain, which is wrong and confusing.

This patch proposes a x86-64-specific heuristic to detect `push %rbp` (`push %ebp` on 32-bit architecture) instruction being traced. Given entire kernel implementation of user space stack trace capturing works under assumption that user space code was compiled with frame pointer register (%rbp/%ebp) preservation, it seems pretty reasonable to use this instruction as a strong indicator that this is the entry to the function. In that case, return address is still pointed to by %rsp/%esp, so we fetch it and add to stack trace before proceeding to unwind the rest using frame pointer-based logic.

We also check for `endbr64` (for 64-bit modes) as another common pattern for function entry, as suggested by Josh Poimboeuf. Even if we get this wrong sometimes for uprobes attached not at the function entry, it's OK because stack trace will still be overall meaningful, just with one extra bogus entry. If we don't detect this, we end up with guaranteed to be missing caller function entry in the stack trace, which is worse overall.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240729175223.23914-1-andrii@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
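As a rough illustration of the heuristic described in the commit message above (a standalone sketch, not the upstream diff; the function name is invented here), detecting a function entry amounts to comparing the first bytes at the probed instruction pointer:

  #include <stdbool.h>
  #include <stdint.h>

  /* Illustrative sketch only: 0x55 encodes "push %rbp" ("push %ebp" on
   * 32-bit), and f3 0f 1e fa encodes "endbr64". If the probed instruction
   * matches, the caller's return address is still on top of the stack and
   * can be recorded before frame-pointer unwinding continues.
   */
  static bool looks_like_function_entry(const uint8_t insn[4])
  {
          if (insn[0] == 0x55)
                  return true;
          return insn[0] == 0xf3 && insn[1] == 0x0f &&
                 insn[2] == 0x1e && insn[3] == 0xfa;
  }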
* x86/syscall: Avoid memcpy() for ia32 syscall_get_arguments() (Kees Cook, 2024-10-10, 1 file changed, -1/+6)
[ Upstream commit d19d638b1e6cf746263ef60b7d0dee0204d8216a ]

Modern (fortified) memcpy() prefers to avoid writing (or reading) beyond the end of the addressed destination (or source) struct member:

  In function ‘fortify_memcpy_chk’,
      inlined from ‘syscall_get_arguments’ at ./arch/x86/include/asm/syscall.h:85:2,
      inlined from ‘populate_seccomp_data’ at kernel/seccomp.c:258:2,
      inlined from ‘__seccomp_filter’ at kernel/seccomp.c:1231:3:
  ./include/linux/fortify-string.h:580:25: error: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
    580 |                         __read_overflow2_field(q_size_field, size);
        |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As already done for x86_64 and compat mode, do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Binary output differences are negligible, and actually ends up using less stack space:

  -	sub    $0x84,%esp
  +	sub    $0x6c,%esp

and less text size:

     text	   data	    bss	    dec	    hex	filename
    10794	    252	      0	  11046	   2b26	gcc-32b/kernel/seccomp.o.stock
    10714	    252	      0	  10966	   2ad6	gcc-32b/kernel/seccomp.o.after

Closes: https://lore.kernel.org/lkml/9b69fb14-df89-4677-9c82-056ea9e706f5@gmail.com/
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Link: https://lore.kernel.org/all/20240708202202.work.477-kees%40kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
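A minimal sketch of the "direct assignments" approach the message describes (illustrative only; the struct shown here is trimmed to the relevant fields and the helper name is not from the patch). Each of the six ia32 syscall arguments is copied out of its own register slot instead of memcpy()ing across adjacent struct members:

  /* i386 syscall ABI passes arguments in ebx, ecx, edx, esi, edi, ebp. */
  struct regs_sketch { unsigned long bx, cx, dx, si, di, bp; };

  static void get_ia32_args(const struct regs_sketch *regs,
                            unsigned long args[6])
  {
          args[0] = regs->bx;     /* one member at a time: nothing for a   */
          args[1] = regs->cx;     /* fortified memcpy() to flag as a read  */
          args[2] = regs->dx;     /* beyond the addressed field            */
          args[3] = regs->si;
          args[4] = regs->di;
          args[5] = regs->bp;
  }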
* x86/mm/ident_map: Use gbpages only where full GB page should be mapped. (Steve Wahl, 2024-10-10, 1 file changed, -5/+18)
[ Upstream commit cc31744a294584a36bf764a0ffa3255a8e69f036 ]

When ident_pud_init() uses only GB pages to create identity maps, large ranges of addresses not actually requested can be included in the resulting table; a 4K request will map a full GB. This can include a lot of extra address space past that requested, including areas marked reserved by the BIOS. That allows processor speculation into reserved regions, that on UV systems can cause system halts.

Only use GB pages when map creation requests include the full GB page of space. Fall back to using smaller 2M pages when only portions of a GB page are included in the request.

No attempt is made to coalesce mapping requests. If a request requires a map entry at the 2M (pmd) level, subsequent mapping requests within the same 1G region will also be at the pmd level, even if adjacent or overlapping such requests could have been combined to map a full GB page. Existing usage starts with larger regions and then adds smaller regions, so this should not have any great consequence.

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavin Joseph <me@pavinjoseph.com>
Tested-by: Sarah Brofeldt <srhb@dbc.dk>
Tested-by: Eric Hagberg <ehagberg@gmail.com>
Link: https://lore.kernel.org/all/20240717213121.3064030-3-steve.wahl@hpe.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
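The rule stated above can be pictured with a small check like the following (an illustrative sketch, not the upstream code; plain 1 GiB alignment stands in for the kernel's PUD_SIZE/PUD_MASK arithmetic):

  #include <stdbool.h>

  #define GB (1UL << 30)

  /* Use a gbpage only if the request covers the whole 1 GiB-aligned region
   * that the gbpage would map; otherwise fall back to 2 MiB pages so no
   * unrequested (possibly reserved) memory ends up mapped.
   */
  static bool can_use_gbpage(unsigned long vaddr, unsigned long vaddr_end)
  {
          unsigned long gb_start = vaddr & ~(GB - 1);

          return vaddr == gb_start && vaddr_end >= gb_start + GB;
  }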
* x86/kexec: Add EFI config table identity mapping for kexec kernel (Tao Liu, 2024-10-10, 1 file changed, -0/+27)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5760929f6545c651682de3c2c6c6786816b17bb1 ] A kexec kernel boot failure is sometimes observed on AMD CPUs due to an unmapped EFI config table array. This can be seen when "nogbpages" is on the kernel command line, and has been observed as a full BIOS reboot rather than a successful kexec. This was also the cause of reported regressions attributed to Commit 7143c5f4cf20 ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.") which was subsequently reverted. To avoid this page fault, explicitly include the EFI config table array in the kexec identity map. Further explanation: The following 2 commits caused the EFI config table array to be accessed when enabling sev at kernel startup. commit ec1c66af3a30 ("x86/compressed/64: Detect/setup SEV/SME features earlier during boot") commit c01fce9cef84 ("x86/compressed: Add SEV-SNP feature detection/setup") This is in the code that examines whether SEV should be enabled or not, so it can even affect systems that are not SEV capable. This may result in a page fault if the EFI config table array's address is unmapped. Since the page fault occurs before the new kernel establishes its own identity map and page fault routines, it is unrecoverable and kexec fails. Most often, this problem is not seen because the EFI config table array gets included in the map by the luck of being placed at a memory address close enough to other memory areas that *are* included in the map created by kexec. Both the "nogbpages" command line option and the "use gpbages only where full GB page should be mapped" change greatly reduce the chance of being included in the map by luck, which is why the problem appears. Signed-off-by: Tao Liu <ltao@redhat.com> Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavin Joseph <me@pavinjoseph.com> Tested-by: Sarah Brofeldt <srhb@dbc.dk> Tested-by: Eric Hagberg <ehagberg@gmail.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/all/20240717213121.3064030-2-steve.wahl@hpe.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/pkeys: Restore altstack access in sigreturn() (Aruna Ramakrishna, 2024-10-10, 1 file changed, -3/+3)
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit d10b554919d4cc8fa8fe2e95b57ad2624728c8e4 ] A process can disable access to the alternate signal stack by not enabling the altstack's PKEY in the PKRU register. Nevertheless, the kernel updates the PKRU temporarily for signal handling. However, in sigreturn(), restore_sigcontext() will restore the PKRU to the user-defined PKRU value. This will cause restore_altstack() to fail with a SIGSEGV as it needs read access to the altstack which is prohibited by the user-defined PKRU value. Fix this by restoring altstack before restoring PKRU. Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240802061318.2140081-5-aruna.ramakrishna@oracle.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/pkeys: Add PKRU as a parameter in signal handling functions (Aruna Ramakrishna, 2024-10-10, 3 files changed, -5/+6)
[ Upstream commit 24cf2bc982ffe02aeffb4a3885c71751a2c7023b ]

Assume there's a multithreaded application that runs untrusted user code. Each thread has its stack/code protected by a non-zero PKEY, and the PKRU register is set up such that only that particular non-zero PKEY is enabled. Each thread also sets up an alternate signal stack to handle signals, which is protected by PKEY zero. The PKEYs man page documents that the PKRU will be reset to init_pkru when the signal handler is invoked, which means that PKEY zero access will be enabled. But this reset happens after the kernel attempts to push fpu state to the alternate stack, which is not (yet) accessible by the kernel, which leads to a new SIGSEGV being sent to the application, terminating it.

Enabling both the non-zero PKEY (for the thread) and PKEY zero in userspace will not work for this use case. It cannot have the alt stack writeable by all - the rationale here is that the code running in that thread (using a non-zero PKEY) is untrusted and should not have access to the alternate signal stack (that uses PKEY zero), to prevent the return address of a function from being changed.

The expectation is that kernel should be able to set up the alternate signal stack and deliver the signal to the application even if PKEY zero is explicitly disabled by the application. The signal handler accessibility should not be dictated by whatever PKRU value the thread sets up.

The PKRU register is managed by XSAVE, which means the sigframe contents must match the register contents - which is not the case here. It's required that the signal frame contains the user-defined PKRU value (so that it is restored correctly from sigcontext) but the actual register must be reset to init_pkru so that the alt stack is accessible and the signal can be delivered to the application. It seems that the proper fix here would be to remove PKRU from the XSAVE framework and manage it separately, which is quite complicated. As a workaround, do this:

  orig_pkru = rdpkru();
  wrpkru(orig_pkru & init_pkru_value);
  xsave_to_user_sigframe();
  put_user(pkru_sigframe_addr, orig_pkru)

In preparation for writing PKRU to sigframe, pass PKRU as an additional parameter down the call chain from get_sigframe(). No functional change.

Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802061318.2140081-2-aruna.ramakrishna@oracle.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/apic: Remove logical destination mode for 64-bit (Thomas Gleixner, 2024-10-10, 2 files changed, -120/+7)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 838ba7733e4e3a94a928e8d0a058de1811a58621 ] Logical destination mode of the local APIC is used for systems with up to 8 CPUs. It has an advantage over physical destination mode as it allows to target multiple CPUs at once with IPIs. That advantage was definitely worth it when systems with up to 8 CPUs were state of the art for servers and workstations, but that's history. Aside of that there are systems which fail to work with logical destination mode as the ACPI/DMI quirks show and there are AMD Zen1 systems out there which fail when interrupt remapping is enabled as reported by Rob and Christian. The latter problem can be cured by firmware updates, but not all OEMs distribute the required changes. Physical destination mode is guaranteed to work because it is the only way to get a CPU up and running via the INIT/INIT/STARTUP sequence. As the number of CPUs keeps increasing, logical destination mode becomes a less used code path so there is no real good reason to keep it around. Therefore remove logical destination mode support for 64-bit and default to physical destination mode. Reported-by: Rob Newcater <rob@durendal.co.uk> Reported-by: Christian Heusel <christian@heusel.eu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Rob Newcater <rob@durendal.co.uk> Link: https://lore.kernel.org/all/877cd5u671.ffs@tglx Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/ioapic: Handle allocation failures gracefully (Thomas Gleixner, 2024-10-10, 1 file changed, -24/+22)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 830802a0fea8fb39d3dc9fb7d6b5581e1343eb1f ] Breno observed panics when using failslab under certain conditions during runtime: can not alloc irq_pin_list (-1,0,20) Kernel panic - not syncing: IO-APIC: failed to add irq-pin. Can not proceed panic+0x4e9/0x590 mp_irqdomain_alloc+0x9ab/0xa80 irq_domain_alloc_irqs_locked+0x25d/0x8d0 __irq_domain_alloc_irqs+0x80/0x110 mp_map_pin_to_irq+0x645/0x890 acpi_register_gsi_ioapic+0xe6/0x150 hpet_open+0x313/0x480 That's a pointless panic which is a leftover of the historic IO/APIC code which panic'ed during early boot when the interrupt allocation failed. The only place which might justify panic is the PIT/HPET timer_check() code which tries to figure out whether the timer interrupt is delivered through the IO/APIC. But that code does not require to handle interrupt allocation failures. If the interrupt cannot be allocated then timer delivery fails and it either panics due to that or falls back to legacy mode. Cure this by removing the panic wrapper around __add_pin_to_irq_node() and making mp_irqdomain_alloc() aware of the failure condition and handle it as any other failure in this function gracefully. Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Breno Leitao <leitao@debian.org> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/all/ZqfJmUF8sXIyuSHN@gmail.com Link: https://lore.kernel.org/all/20240802155440.275200843@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/bugs: Fix handling when SRSO mitigation is disabled (David Kaplan, 2024-10-10, 1 file changed, -9/+5)
| | | | | | | | | | | | | | | | | | | [ Upstream commit 1dbb6b1495d472806fef1f4c94f5b3e4c89a3c1d ] When the SRSO mitigation is disabled, either via mitigations=off or spec_rstack_overflow=off, the warning about the lack of IBPB-enhancing microcode is printed anyway. This is unnecessary since the user has turned off the mitigation. [ bp: Massage, drop SBPB rationale as it doesn't matter because when mitigations are disabled x86_pred_cmd is not being used anyway. ] Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20240904150711.193022-1-david.kaplan@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/bugs: Add missing NO_SSB flag (Daniel Sneddon, 2024-10-10, 1 file changed, -2/+2)
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 23e12b54acf621f4f03381dca91cc5f1334f21fd ] The Moorefield and Lightning Mountain Atom processors are missing the NO_SSB flag in the vulnerabilities whitelist. This will cause unaffected parts to incorrectly be reported as vulnerable. Add the missing flag. These parts are currently out of service and were verified internally with archived documentation that they need the NO_SSB flag. Closes: https://lore.kernel.org/lkml/CAEJ9NQdhh+4GxrtG1DuYgqYhvc0hi-sKZh-2niukJ-MyFLntAA@mail.gmail.com/ Reported-by: Shanavas.K.S <shanavasks@gmail.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240829192437.4074196-1-daniel.sneddon@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* crypto: x86/sha256 - Add parentheses around macros' single arguments (Fangrui Song, 2024-10-10, 1 file changed, -8/+8)
[ Upstream commit 3363c460ef726ba693704dbcd73b7e7214ccc788 ]

The macros FOUR_ROUNDS_AND_SCHED and DO_4ROUNDS rely on an unexpected/undocumented behavior of the GNU assembler, which might change in the future (https://sourceware.org/bugzilla/show_bug.cgi?id=32073).

  M (1) (2) // 1 arg !? Future: 2 args
  M 1 + 2   // 1 arg !? Future: 3 args
  M 1 2     // 2 args

Add parentheses around the single arguments to support future GNU assembler and LLVM integrated assembler (when the IsOperator hack from the following link is dropped).

Link: https://github.com/llvm/llvm-project/commit/055006475e22014b28a070db1bff41ca15f322f0
Signed-off-by: Fangrui Song <maskray@google.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/tdx: Fix "in-kernel MMIO" check (Alexey Gladkov (Intel), 2024-10-04, 1 file changed, -0/+6)
[ Upstream commit d4fc4d01471528da8a9797a065982e05090e1d81 ]

TDX only supports kernel-initiated MMIO operations. The handle_mmio() function checks if the #VE exception occurred in the kernel and rejects the operation if it did not.

However, userspace can deceive the kernel into performing MMIO on its behalf. For example, if userspace can point a syscall to an MMIO address, syscall does get_user() or put_user() on it, triggering MMIO #VE. The kernel will treat the #VE as in-kernel MMIO.

Ensure that the target MMIO address is within the kernel before decoding instruction.

Fixes: 31d58c4e557d ("x86/tdx: Handle in-kernel MMIO")
Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/565a804b80387970460a4ebc67c88d1380f61ad1.1726237595.git.legion%40kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
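Conceptually, the added check amounts to refusing the #VE fixup when the accessed virtual address is not a kernel address. The sketch below is illustrative only; the helper name and the cutoff constant are placeholders, not the patch's actual code:

  #include <stdbool.h>
  #include <stdint.h>

  /* Illustrative only: on x86-64, kernel virtual addresses live in the
   * upper half of the canonical address space. An MMIO #VE whose target
   * address falls below that range was induced through a userspace
   * pointer and must not be emulated as "in-kernel MMIO".
   */
  static bool mmio_target_is_kernel_address(uint64_t vaddr)
  {
          return vaddr >= 0xffff800000000000ULL;
  }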
* x86/tdx: Convert shared memory back to private on kexec (Kirill A. Shutemov, 2024-10-04, 4 files changed, -3/+141)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 859e63b789d6b17b3c64e51a0aabdc58752a0254 ] TDX guests allocate shared buffers to perform I/O. It is done by allocating pages normally from the buddy allocator and converting them to shared with set_memory_decrypted(). The second, kexec-ed kernel has no idea what memory is converted this way. It only sees E820_TYPE_RAM. Accessing shared memory via private mapping is fatal. It leads to unrecoverable TD exit. On kexec, walk direct mapping and convert all shared memory back to private. It makes all RAM private again and second kernel may use it normally. The conversion occurs in two steps: stopping new conversions and unsharing all memory. In the case of normal kexec, the stopping of conversions takes place while scheduling is still functioning. This allows for waiting until any ongoing conversions are finished. The second step is carried out when all CPUs except one are inactive and interrupts are disabled. This prevents any conflicts with code that may access shared memory. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Tao Liu <ltao@redhat.com> Link: https://lore.kernel.org/r/20240614095904.1345461-12-kirill.shutemov@linux.intel.com Stable-dep-of: d4fc4d014715 ("x86/tdx: Fix "in-kernel MMIO" check") Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/mm: Add callbacks to prepare encrypted memory for kexec (Kirill A. Shutemov, 2024-10-04, 4 files changed, -0/+38)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 22daa42294b419a0d8060a3870285e7a72aa63e4 ] AMD SEV and Intel TDX guests allocate shared buffers for performing I/O. This is done by allocating pages normally from the buddy allocator and then converting them to shared using set_memory_decrypted(). On kexec, the second kernel is unaware of which memory has been converted in this manner. It only sees E820_TYPE_RAM. Accessing shared memory as private is fatal. Therefore, the memory state must be reset to its original state before starting the new kernel with kexec. The process of converting shared memory back to private occurs in two steps: - enc_kexec_begin() stops new conversions. - enc_kexec_finish() unshares all existing shared memory, reverting it back to private. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Tao Liu <ltao@redhat.com> Link: https://lore.kernel.org/r/20240614095904.1345461-11-kirill.shutemov@linux.intel.com Stable-dep-of: d4fc4d014715 ("x86/tdx: Fix "in-kernel MMIO" check") Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/tdx: Account shared memory (Kirill A. Shutemov, 2024-10-04, 1 file changed, -0/+7)
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c3abbf1376874f0d6eb22859a8655831644efa42 ] The kernel will convert all shared memory back to private during kexec. The direct mapping page tables will provide information on which memory is shared. It is extremely important to convert all shared memory. If a page is missed, it will cause the second kernel to crash when it accesses it. Keep track of the number of shared pages. This will allow for cross-checking against the shared information in the direct mapping and reporting if the shared bit is lost. Memory conversion is slow and does not happen often. Global atomic is not going to be a bottleneck. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Tao Liu <ltao@redhat.com> Link: https://lore.kernel.org/r/20240614095904.1345461-10-kirill.shutemov@linux.intel.com Stable-dep-of: d4fc4d014715 ("x86/tdx: Fix "in-kernel MMIO" check") Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/mm: Make x86_platform.guest.enc_status_change_*() return an error (Kirill A. Shutemov, 2024-10-04, 6 files changed, -34/+36)
| | | | | | | | | | | | | | | | | | | [ Upstream commit 99c5c4c60e0db1d2ff58b8a61c93b6851146469f ] TDX is going to have more than one reason to fail enc_status_change_prepare(). Change the callback to return errno instead of assuming -EIO. Change enc_status_change_finish() too to keep the interface symmetric. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Tao Liu <ltao@redhat.com> Link: https://lore.kernel.org/r/20240614095904.1345461-8-kirill.shutemov@linux.intel.com Stable-dep-of: d4fc4d014715 ("x86/tdx: Fix "in-kernel MMIO" check") Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC) (Sean Christopherson, 2024-10-04, 4 files changed, -12/+36)
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 73b42dc69be8564d4951a14d00f827929fe5ef79 ] Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's IPI virtualization support, but only for AMD. While not stated anywhere in the APM, despite stating the ICR is a single 64-bit register, AMD CPUs store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled, KVM needs to match CPU behavior as some ICR ICR writes will be handled by the CPU, not by KVM. Add a kvm_x86_ops knob to control the underlying format used by the CPU to store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage format for x2APIC mode doesn't matter, and having the behavior follow AMD versus Intel will provide better test coverage and ease debugging. Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: x86: Make x2APIC ID 100% readonly (Sean Christopherson, 2024-10-04, 1 file changed, -7/+15)
[ Upstream commit 4b7c3f6d04bd53f2e5b228b6821fb8f5d1ba3071 ]

Ignore the userspace provided x2APIC ID when fixing up APIC state for KVM_SET_LAPIC, i.e. make the x2APIC fully readonly in KVM. Commit a92e2543d6a8 ("KVM: x86: use hardware-compatible format for APIC ID register"), which added the fixup, didn't intend to allow userspace to modify the x2APIC ID. In fact, that commit is when KVM first started treating the x2APIC ID as readonly, apparently to fix some race:

   static inline u32 kvm_apic_id(struct kvm_lapic *apic)
   {
  -	return (kvm_lapic_get_reg(apic, APIC_ID) >> 24) & 0xff;
  +	/* To avoid a race between apic_base and following APIC_ID update when
  +	 * switching to x2apic_mode, the x2apic mode returns initial x2apic id.
  +	 */
  +	if (apic_x2apic_mode(apic))
  +		return apic->vcpu->vcpu_id;
  +
  +	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
   }

Furthermore, KVM doesn't support delivering interrupts to vCPUs with a modified x2APIC ID, but KVM *does* return the modified value on a guest RDMSR and for KVM_GET_LAPIC. I.e. no remotely sane setup can actually work with a modified x2APIC ID.

Making the x2APIC ID fully readonly fixes a WARN in KVM's optimized map calculation, which expects the LDR to align with the x2APIC ID.

  WARNING: CPU: 2 PID: 958 at arch/x86/kvm/lapic.c:331 kvm_recalculate_apic_map+0x609/0xa00 [kvm]
  CPU: 2 PID: 958 Comm: recalc_apic_map Not tainted 6.4.0-rc3-vanilla+ #35
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014
  RIP: 0010:kvm_recalculate_apic_map+0x609/0xa00 [kvm]
  Call Trace:
   <TASK>
   kvm_apic_set_state+0x1cf/0x5b0 [kvm]
   kvm_arch_vcpu_ioctl+0x1806/0x2100 [kvm]
   kvm_vcpu_ioctl+0x663/0x8a0 [kvm]
   __x64_sys_ioctl+0xb8/0xf0
   do_syscall_64+0x56/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0
  RIP: 0033:0x7fade8b9dd6f

Unfortunately, the WARN can still trigger for other CPUs than the current one by racing against KVM_SET_LAPIC, so remove it completely.

Reported-by: Michal Luczaj <mhal@rbox.co>
Closes: https://lore.kernel.org/all/814baa0c-1eaa-4503-129f-059917365e80@rbox.co
Reported-by: Haoyu Wu <haoyuwu254@gmail.com>
Closes: https://lore.kernel.org/all/20240126161633.62529-1-haoyuwu254@gmail.com
Reported-by: syzbot+545f1326f405db4e1c3e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000c2a6b9061cbca3c3@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240802202941.344889-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Stable-dep-of: 73b42dc69be8 ("KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)")
Signed-off-by: Sasha Levin <sashal@kernel.org>
* KVM: x86: Drop unused check_apicv_inhibit_reasons() callback definition (Hou Wenlong, 2024-10-04, 2 files changed, -2/+0)
| | | | | | | | | | | | | | | | | | [ Upstream commit c7d4c5f01961cdc4f1d29525e2b0d71f62c5bc33 ] The check_apicv_inhibit_reasons() callback implementation was dropped in the commit b3f257a84696 ("KVM: x86: Track required APICv inhibits with variable, not callback"), but the definition removal was missed in the final version patch (it was removed in the v4). Therefore, it should be dropped, and the vmx_check_apicv_inhibit_reasons() function declaration should also be removed. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/54abd1d0ccaba4d532f81df61259b9c0e021fbde.1714977229.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com> Stable-dep-of: 73b42dc69be8 ("KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)") Signed-off-by: Sasha Levin <sashal@kernel.org>
* perf/x86/intel/pt: Fix sampling synchronization (Adrian Hunter, 2024-10-04, 1 file changed, -8/+7)
| | | | | | | | | | | | | | | | | | | | | | | commit d92792a4b26e50b96ab734cbe203d8a4c932a7a9 upstream. pt_event_snapshot_aux() uses pt->handle_nmi to determine if tracing needs to be stopped, however tracing can still be going because pt->handle_nmi is set to zero before tracing is stopped in pt_event_stop, whereas pt_event_snapshot_aux() requires that tracing must be stopped in order to copy a sample of trace from the buffer. Instead call pt_config_stop() always, which anyway checks config for RTIT_CTL_TRACEEN and does nothing if it is already clear. Note pt_event_snapshot_aux() can continue to use pt->handle_nmi to determine if the trace needs to be restarted afterwards. Fixes: 25e8920b301c ("perf/x86/intel/pt: Add sampling support") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20240715160712.127117-2-adrian.hunter@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* perf/x86/intel: Allow to setup LBR for counting event for BPF (Kan Liang, 2024-10-04, 1 file changed, -2/+6)
commit ef493f4b122d6b14a6de111d1acac1eab1d673b0 upstream.

The BPF subsystem may capture LBR data on a counting event. However, the current implementation assumes that LBR can/should only be used with sampling events.

For instance, retsnoop tool ([0]) makes an extensive use of this functionality and sets up perf event as follows:

  struct perf_event_attr attr;

  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_HARDWARE;
  attr.config = PERF_COUNT_HW_CPU_CYCLES;
  attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
  attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL;

To limit the LBR for a sampling event is to avoid unnecessary branch stack setup for a counting event in the sample read. Because LBR is only read in the sampling event's overflow.

Although in most cases LBR is used in sampling, there is no HW limit to bind LBR to the sampling mode. Allow an LBR setup for a counting event unless in the sample read mode.

Fixes: 85846b27072d ("perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag")
Closes: https://lore.kernel.org/lkml/20240905180055.1221620-1-andrii@kernel.org/
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240909155848.326640-1-kan.liang@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/entry: Remove unwanted instrumentation in common_interrupt() (Dmitry Vyukov, 2024-10-04, 2 files changed, -5/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 477d81a1c47a1b79b9c08fc92b5dea3c5143800b upstream. common_interrupt() and related variants call kvm_set_cpu_l1tf_flush_l1d(), which is neither marked noinstr nor __always_inline. So compiler puts it out of line and adds instrumentation to it. Since the call is inside of instrumentation_begin/end(), objtool does not warn about it. The manifestation is that KCOV produces spurious coverage in kvm_set_cpu_l1tf_flush_l1d() in random places because the call happens when preempt count is not yet updated to say that the kernel is in an interrupt. Mark kvm_set_cpu_l1tf_flush_l1d() as __always_inline and move it out of the instrumentation_begin/end() section. It only calls __this_cpu_write() which is already safe to call in noinstr contexts. Fixes: 6368558c3710 ("x86/entry: Provide IDTENTRY_SYSVEC") Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexander Potapenko <glider@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/3f9a1de9e415fcb53d07dc9e19fa8481bb021b1b.1718092070.git.dvyukov@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: Move x2APIC ICR helper above kvm_apic_write_nodecode() (Sean Christopherson, 2024-10-04, 1 file changed, -23/+23)
| | | | | | | | | | | | | | | commit d33234342f8b468e719e05649fd26549fb37ef8a upstream. Hoist kvm_x2apic_icr_write() above kvm_apic_write_nodecode() so that a local helper to _read_ the x2APIC ICR can be added and used in the nodecode path without needing a forward declaration. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: x86: Enforce x2APIC's must-be-zero reserved ICR bits (Sean Christopherson, 2024-10-04, 1 file changed, -1/+14)
| | | | | | | | | | | | | | | | | | | | | commit 71bf395a276f0578d19e0ae137a7d1d816d08e0e upstream. Inject a #GP on a WRMSR(ICR) that attempts to set any reserved bits that are must-be-zero on both Intel and AMD, i.e. any reserved bits other than the BUSY bit, which Intel ignores and basically says is undefined. KVM's xapic_state_test selftest has been fudging the bug since commit 4b88b1a518b3 ("KVM: selftests: Enhance handling WRMSR ICR register in x2APIC mode"), which essentially removed the testcase instead of fixing the bug. WARN if the nodecode path triggers a #GP, as the CPU is supposed to check reserved bits for ICR when it's partially virtualized. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240719235107.3023592-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen: allow mapping ACPI data using a different physical address (Juergen Gross, 2024-10-04, 8 files changed, -1/+59)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9221222c717dbddac1e3c49906525475d87a3a44 upstream. When running as a Xen PV dom0 the system needs to map ACPI data of the host using host physical addresses, while those addresses can conflict with the guest physical addresses of the loaded linux kernel. The same problem might apply in case a PV guest is configured to use the host memory map. This conflict can be solved by mapping the ACPI data to a different guest physical address, but mapping the data via acpi_os_ioremap() must still be possible using the host physical address, as this address might be generated by AML when referencing some of the ACPI data. When configured to support running as a Xen PV domain, have an implementation of acpi_os_ioremap() being aware of the possibility to need above mentioned translation of a host physical address to the guest physical address. This modification requires to #include linux/acpi.h in some sources which need to include asm/acpi.h directly. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xen: move checks for e820 conflicts further up (Juergen Gross, 2024-10-04, 1 file changed, -22/+22)
| | | | | | | | | | | | | | | commit c4498ae316da5b5786ccd448fc555f3339b8e4ca upstream. Move the checks for e820 memory map conflicts using the xen_chk_is_e820_usable() helper further up in order to prepare resolving some of the possible conflicts by doing some e820 map modifications, which must happen before evaluating the RAM layout. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/PCI: Check pcie_find_root_port() return for NULL (Samasth Norway Ananda, 2024-10-04, 1 file changed, -2/+2)
| | | | | | | | | | | | | | | | | | [ Upstream commit dbc3171194403d0d40e4bdeae666f6e76e428b53 ] If pcie_find_root_port() is unable to locate a Root Port, it will return NULL. Check the pointer for NULL before dereferencing it. This particular case is in a quirk for devices that are always below a Root Port, so this won't avoid a problem and doesn't need to be backported, but check as a matter of style and to prevent copy/paste mistakes. Link: https://lore.kernel.org/r/20240812202659.1649121-1-samasth.norway.ananda@oracle.com Signed-off-by: Samasth Norway Ananda <samasth.norway.ananda@oracle.com> [bhelgaas: drop Fixes: and explain why there's no problem in this case] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* bpf, x64: Fix tailcall hierarchy (Leon Hwang, 2024-10-04, 1 file changed, -28/+79)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 116e04ba1459fc08f80cf27b8c9f9f188be0fcb2 ] This patch fixes a tailcall issue caused by abusing the tailcall in bpf2bpf feature. As we know, tail_call_cnt propagates by rax from caller to callee when to call subprog in tailcall context. But, like the following example, MAX_TAIL_CALL_CNT won't work because of missing tail_call_cnt back-propagation from callee to caller. \#include <linux/bpf.h> \#include <bpf/bpf_helpers.h> \#include "bpf_legacy.h" struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 1); __uint(key_size, sizeof(__u32)); __uint(value_size, sizeof(__u32)); } jmp_table SEC(".maps"); int count = 0; static __noinline int subprog_tail1(struct __sk_buff *skb) { bpf_tail_call_static(skb, &jmp_table, 0); return 0; } static __noinline int subprog_tail2(struct __sk_buff *skb) { bpf_tail_call_static(skb, &jmp_table, 0); return 0; } SEC("tc") int entry(struct __sk_buff *skb) { volatile int ret = 1; count++; subprog_tail1(skb); subprog_tail2(skb); return ret; } char __license[] SEC("license") = "GPL"; At run time, the tail_call_cnt in entry() will be propagated to subprog_tail1() and subprog_tail2(). But, when the tail_call_cnt in subprog_tail1() updates when bpf_tail_call_static(), the tail_call_cnt in entry() won't be updated at the same time. As a result, in entry(), when tail_call_cnt in entry() is less than MAX_TAIL_CALL_CNT and subprog_tail1() returns because of MAX_TAIL_CALL_CNT limit, bpf_tail_call_static() in suprog_tail2() is able to run because the tail_call_cnt in subprog_tail2() propagated from entry() is less than MAX_TAIL_CALL_CNT. So, how many tailcalls are there for this case if no error happens? From top-down view, does it look like hierarchy layer and layer? With this view, there will be 2+4+8+...+2^33 = 2^34 - 2 = 17,179,869,182 tailcalls for this case. How about there are N subprog_tail() in entry()? There will be almost N^34 tailcalls. Then, in this patch, it resolves this case on x86_64. In stead of propagating tail_call_cnt from caller to callee, it propagates its pointer, tail_call_cnt_ptr, tcc_ptr for short. However, where does it store tail_call_cnt? It stores tail_call_cnt on the stack of main prog. When tail call happens in subprog, it increments tail_call_cnt by tcc_ptr. Meanwhile, it stores tail_call_cnt_ptr on the stack of main prog, too. And, before jump to tail callee, it has to pop tail_call_cnt and tail_call_cnt_ptr. Then, at the prologue of subprog, it must not make rax as tail_call_cnt_ptr again. It has to reuse tail_call_cnt_ptr from caller. As a result, at run time, it has to recognize rax is tail_call_cnt or tail_call_cnt_ptr at prologue by: 1. rax is tail_call_cnt if rax is <= MAX_TAIL_CALL_CNT. 2. rax is tail_call_cnt_ptr if rax is > MAX_TAIL_CALL_CNT, because a pointer won't be <= MAX_TAIL_CALL_CNT. Here's an example to dump JITed. 
struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 1); __uint(key_size, sizeof(__u32)); __uint(value_size, sizeof(__u32)); } jmp_table SEC(".maps"); int count = 0; static __noinline int subprog_tail(struct __sk_buff *skb) { bpf_tail_call_static(skb, &jmp_table, 0); return 0; } SEC("tc") int entry(struct __sk_buff *skb) { int ret = 1; count++; subprog_tail(skb); subprog_tail(skb); return ret; } When bpftool p d j id 42: int entry(struct __sk_buff * skb): bpf_prog_0c0f4c2413ef19b1_entry: ; int entry(struct __sk_buff *skb) 0: endbr64 4: nopl (%rax,%rax) 9: xorq %rax, %rax ;; rax = 0 (tail_call_cnt) c: pushq %rbp d: movq %rsp, %rbp 10: endbr64 14: cmpq $33, %rax ;; if rax > 33, rax = tcc_ptr 18: ja 0x20 ;; if rax > 33 goto 0x20 ---+ 1a: pushq %rax ;; [rbp - 8] = rax = 0 | 1b: movq %rsp, %rax ;; rax = rbp - 8 | 1e: jmp 0x21 ;; ---------+ | 20: pushq %rax ;; <--------|---------------+ 21: pushq %rax ;; <--------+ [rbp - 16] = rax 22: pushq %rbx ;; callee saved 23: movq %rdi, %rbx ;; rbx = skb (callee saved) ; count++; 26: movabsq $-82417199407104, %rdi 30: movl (%rdi), %esi 33: addl $1, %esi 36: movl %esi, (%rdi) ; subprog_tail(skb); 39: movq %rbx, %rdi ;; rdi = skb 3c: movq -16(%rbp), %rax ;; rax = tcc_ptr 43: callq 0x80 ;; call subprog_tail() ; subprog_tail(skb); 48: movq %rbx, %rdi ;; rdi = skb 4b: movq -16(%rbp), %rax ;; rax = tcc_ptr 52: callq 0x80 ;; call subprog_tail() ; return ret; 57: movl $1, %eax 5c: popq %rbx 5d: leave 5e: retq int subprog_tail(struct __sk_buff * skb): bpf_prog_3a140cef239a4b4f_subprog_tail: ; int subprog_tail(struct __sk_buff *skb) 0: endbr64 4: nopl (%rax,%rax) 9: nopl (%rax) ;; do not touch tail_call_cnt c: pushq %rbp d: movq %rsp, %rbp 10: endbr64 14: pushq %rax ;; [rbp - 8] = rax (tcc_ptr) 15: pushq %rax ;; [rbp - 16] = rax (tcc_ptr) 16: pushq %rbx ;; callee saved 17: pushq %r13 ;; callee saved 19: movq %rdi, %rbx ;; rbx = skb ; asm volatile("r1 = %[ctx]\n\t" 1c: movabsq $-105487587488768, %r13 ;; r13 = jmp_table 26: movq %rbx, %rdi ;; 1st arg, skb 29: movq %r13, %rsi ;; 2nd arg, jmp_table 2c: xorl %edx, %edx ;; 3rd arg, index = 0 2e: movq -16(%rbp), %rax ;; rax = [rbp - 16] (tcc_ptr) 35: cmpq $33, (%rax) 39: jae 0x4e ;; if *tcc_ptr >= 33 goto 0x4e --------+ 3b: jmp 0x4e ;; jmp bypass, toggled by poking | 40: addq $1, (%rax) ;; (*tcc_ptr)++ | 44: popq %r13 ;; callee saved | 46: popq %rbx ;; callee saved | 47: popq %rax ;; undo rbp-16 push | 48: popq %rax ;; undo rbp-8 push | 49: nopl (%rax,%rax) ;; tail call target, toggled by poking | ; return 0; ;; | 4e: popq %r13 ;; restore callee saved <--------------+ 50: popq %rbx ;; restore callee saved 51: leave 52: retq Furthermore, when trampoline is the caller of bpf prog, which is tail_call_reachable, it is required to propagate rax through trampoline. Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20240714123902.32305-2-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xen: tolerate ACPI NVS memory overlapping with Xen allocated memory (Juergen Gross, 2024-10-04, 1 file changed, -1/+91)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit be35d91c8880650404f3bf813573222dfb106935 ] In order to minimize required special handling for running as Xen PV dom0, the memory layout is modified to match that of the host. This requires to have only RAM at the locations where Xen allocated memory is living. Unfortunately there seem to be some machines, where ACPI NVS is located at 64 MB, resulting in a conflict with the loaded kernel or the initial page tables built by Xen. Avoid this conflict by swapping the ACPI NVS area in the memory map with unused RAM. This is possible via modification of the dom0 P2M map. Accesses to the ACPI NVS area are done either for saving and restoring it across suspend operations (this will work the same way as before), or by ACPI code when NVS memory is referenced from other ACPI tables. The latter case is handled by a Xen specific indirection of acpi_os_ioremap(). While the E820 map can (and should) be modified right away, the P2M map can be updated only after memory allocation is working, as the P2M map might need to be extended. Fixes: 808fdb71936c ("xen: check for kernel memory conflicting with memory layout") Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xen: add capability to remap non-RAM pages to different PFNs (Juergen Gross, 2024-10-04, 2 files changed, -0/+66)
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d05208cf7f05420ad10cc7f9550f91d485523659 ] When running as a Xen PV dom0 it can happen that the kernel is being loaded to a guest physical address conflicting with the host memory map. In order to be able to resolve this conflict, add the capability to remap non-RAM areas to different guest PFNs. A function to use this remapping information for other purposes than doing the remap will be added when needed. As the number of conflicts should be rather low (currently only machines with max. 1 conflict are known), save the remap data in a small statically allocated array. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Stable-dep-of: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory") Signed-off-by: Sasha Levin <sashal@kernel.org>
* xen: move max_pfn in xen_memory_setup() out of function scope (Juergen Gross, 2024-10-04, 1 file changed, -26/+26)
| | | | | | | | | | | | | | | | | | [ Upstream commit 43dc2a0f479b9cd30f6674986d7a40517e999d31 ] Instead of having max_pfn as a local variable of xen_memory_setup(), make it a static variable in setup.c instead. This avoids having to pass it to subfunctions, which will be needed in more cases in future. Rename it to ini_nr_pages, as the value denotes the currently usable number of memory pages as passed from the hypervisor at boot time. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Stable-dep-of: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory") Signed-off-by: Sasha Levin <sashal@kernel.org>
* xen: introduce generic helper checking for memory map conflicts (Juergen Gross, 2024-10-04, 3 files changed, -11/+31)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ba88829706e2c5b7238638fc2b0713edf596495e ] When booting as a Xen PV dom0 the memory layout of the dom0 is modified to match that of the host, as this requires less changes in the kernel for supporting Xen. There are some cases, though, which are problematic, as it is the Xen hypervisor selecting the kernel's load address plus some other data, which might conflict with the host's memory map. These conflicts are detected at boot time and result in a boot error. In order to support handling at least some of these conflicts in future, introduce a generic helper function which will later gain the ability to adapt the memory layout when possible. Add the missing check for the xen_start_info area. Note that possible p2m map and initrd memory conflicts are handled already by copying the data to memory areas not conflicting with the memory map. The initial stack allocated by Xen doesn't need to be checked, as early boot code is switching to the statically allocated initial kernel stack. Initial page tables and the kernel itself will be handled later. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Stable-dep-of: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory") Signed-off-by: Sasha Levin <sashal@kernel.org>
* minmax: avoid overly complex min()/max() macro arguments in xen (Linus Torvalds, 2024-10-04, 1 file changed, -2/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e8432ac802a028eaee6b1e86383d7cd8e9fb8431 ] We have some very fancy min/max macros that have tons of sanity checking to warn about mixed signedness etc. This is all things that a sane compiler should warn about, but there are no sane compiler interfaces for this, and '-Wsign-compare' is broken [1] and not useful. So then we compensate (some would say over-compensate) by doing the checks manually with some truly horrid macro games. And no, we can't just use __builtin_types_compatible_p(), because the whole question of "does it make sense to compare these two values" is a lot more complicated than that. For example, it makes a ton of sense to compare unsigned values with simple constants like "5", even if that is indeed a signed type. So we have these very strange macros to try to make sensible type checking decisions on the arguments to 'min()' and 'max()'. But that can cause enormous code expansion if the min()/max() macros are used with complicated expressions, and particularly if you nest these things so that you get the first big expansion then expanded again. The xen setup.c file ended up ballooning to over 50MB of preprocessed noise that takes 15s to compile (obviously depending on the build host), largely due to one single line. So let's split that one single line to just be simpler. I think it ends up being more legible to humans too at the same time. Now that single file compiles in under a second. Reported-and-reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/all/c83c17bb-be75-4c67-979d-54eee38774c6@lucifer.local/ Link: https://staticthinking.wordpress.com/2023/07/25/wsign-compare-is-garbage/ [1] Cc: David Laight <David.Laight@aculab.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Stable-dep-of: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory") Signed-off-by: Sasha Levin <sashal@kernel.org>
* xen: use correct end address of kernel for conflict checking (Juergen Gross, 2024-10-04, 1 file changed, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit fac1bceeeb04886fc2ee952672e6e6c85ce41dca ] When running as a Xen PV dom0 the kernel is loaded by the hypervisor using a different memory map than that of the host. In order to minimize the required changes in the kernel, the kernel adapts its memory map to that of the host. In order to do that it is checking for conflicts of its load address with the host memory map. Unfortunately the tested memory range does not include the .brk area, which might result in crashes or memory corruption when this area does conflict with the memory map of the host. Fix the test by using the _end label instead of __bss_stop. Fixes: 808fdb71936c ("xen: check for kernel memory conflicting with memory layout") Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/boot/64: Strip percpu address space when setting up GDT descriptors (Uros Bizjak, 2024-10-04, 1 file changed, -1/+2)
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit b51207dc02ec3aeaa849e419f79055d7334845b6 ] init_per_cpu_var() returns a pointer in the percpu address space while rip_rel_ptr() expects a pointer in the generic address space. When strict address space checks are enabled, GCC's named address space checks fail: asm.h:124:63: error: passing argument 1 of 'rip_rel_ptr' from pointer to non-enclosed address space Add a explicit cast to remove address space of the returned pointer. Fixes: 11e36b0f7c21 ("x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240819083334.148536-1-ubizjak@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* x86/mm: Use IPIs to synchronize LAM enablement (Yosry Ahmed, 2024-10-04, 2 files changed, -7/+29)
[ Upstream commit 3b299b99556c1753923f8d9bbd9304bcd139282f ]

LAM can only be enabled when a process is single-threaded. But _kernel_ threads can temporarily use a single-threaded process's mm.

If LAM is enabled by a userspace process while a kthread is using its mm, the kthread will not observe LAM enablement (i.e. LAM will be disabled in CR3). This could be fine for the kthread itself, as LAM only affects userspace addresses. However, if the kthread context switches to a thread in the same userspace process, CR3 may or may not be updated because the mm_struct doesn't change (based on pending TLB flushes). If CR3 is not updated, the userspace thread will run incorrectly with LAM disabled, which may cause page faults when using tagged addresses.

Example scenario:

  CPU 1                                 CPU 2
  /* kthread */
  kthread_use_mm()
                                        /* user thread */
                                        prctl_enable_tagged_addr()
                                        /* LAM enabled on CPU 2 */
  /* LAM disabled on CPU 1 */
  context_switch() /* to CPU 1 */
  /* Switching to user thread */
  switch_mm_irqs_off()
  /* CR3 not updated */
  /* LAM is still disabled on CPU 1 */

Synchronize LAM enablement by sending an IPI to all CPUs running with the mm_struct to enable LAM. This makes sure LAM is enabled on CPU 1 in the above scenario before prctl_enable_tagged_addr() returns and userspace starts using tagged addresses, and before it's possible to run the userspace process on CPU 1.

In switch_mm_irqs_off(), move reading the LAM mask until after mm_cpumask() is updated. This ensures that if an outdated LAM mask is written to CR3, an IPI is received to update it right after IRQs are re-enabled.

[ dhansen: Add a LAM enabling helper and comment it ]

Fixes: 82721d8b25d7 ("x86/mm: Handle LAM on context switch")
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20240702132139.3332013-2-yosryahmed%40google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
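In outline, the synchronization described above looks like the following kernel-style sketch (not the verbatim patch; enable_lam_func() and mm_enable_lam_everywhere() are stand-in names, and the per-CPU callback body is left as a comment):

  /* After setting the new LAM mask in the mm, interrupt every CPU that is
   * currently running this mm and have it reload CR3, then wait for all of
   * them to finish before the prctl() returns and tagged pointers start
   * being used.
   */
  static void enable_lam_func(void *info)
  {
          struct mm_struct *mm = info;

          /* reload CR3 with mm's LAM bits on this CPU */
  }

  static void mm_enable_lam_everywhere(struct mm_struct *mm)
  {
          on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
  }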
* x86/sgx: Fix deadlock in SGX NUMA node search (Aaron Lu, 2024-10-04, 1 file changed, -13/+14)
[ Upstream commit 9c936844010466535bd46ea4ce4656ef17653644 ]

When the current node doesn't have an EPC section configured by firmware and all other EPC sections are used up, CPU can get stuck inside the while loop that looks for an available EPC page from remote nodes indefinitely, leading to a soft lockup. Note how nid_of_current will never be equal to nid in that while loop because nid_of_current is not set in sgx_numa_mask.

Also worth mentioning is that it's perfectly fine for the firmware not to setup an EPC section on a node. While setting up an EPC section on each node can enhance performance, it is not a requirement for functionality.

Rework the loop to start and end on *a* node that has SGX memory. This avoids the deadlock looking for the current SGX-lacking node to show up in the loop when it never will.

Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Reported-by: "Molina Sabido, Gerardo" <gerardo.molina.sabido@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Zhimin Luo <zhimin.luo@intel.com>
Link: https://lore.kernel.org/all/20240905080855.1699814-2-aaron.lu%40intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
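The reworked search can be pictured like this (a kernel-style sketch under the assumptions noted in the comments, not the upstream diff): start from a node that is actually present in sgx_numa_mask and stop when the walk wraps back to it, so a node without an EPC section can never keep the loop spinning.

  /* Illustrative sketch only: try_alloc_from_node() stands in for the real
   * per-node EPC allocation.
   */
  static struct sgx_epc_page *alloc_epc_page_from_some_node(void)
  {
          int start, nid;

          /* Begin on a node that actually has EPC memory. */
          start = node_isset(numa_node_id(), sgx_numa_mask) ?
                  numa_node_id() :
                  next_node_in(numa_node_id(), sgx_numa_mask);

          nid = start;
          do {
                  struct sgx_epc_page *page = try_alloc_from_node(nid);

                  if (page)
                          return page;
                  nid = next_node_in(nid, sgx_numa_mask);
          } while (nid != start);

          return NULL;    /* every node with EPC was tried exactly once */
  }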
* x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequencyMichael Kelley2024-09-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8fcc514809de41153b43ccbe1a0cdf7f72b78e7e ] A Linux guest on Hyper-V gets the TSC frequency from a synthetic MSR, if available. In this case, set X86_FEATURE_TSC_KNOWN_FREQ so that Linux doesn't unnecessarily do refined TSC calibration when setting up the TSC clocksource. With this change, a message such as this is no longer output during boot when the TSC is used as the clocksource: [ 1.115141] tsc: Refined TSC clocksource calibration: 2918.408 MHz Furthermore, the guest and host will have exactly the same view of the TSC frequency, which is important for features such as the TSC deadline timer that are emulated by the Hyper-V host. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Roman Kisel <romank@linux.microsoft.com> Link: https://lore.kernel.org/r/20240606025559.1631-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240606025559.1631-1-mhklinux@outlook.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
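A sketch of where the one-line change lands, condensed from the Hyper-V platform-init path described above (the guard bits and helper names are paraphrased from memory):

  if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
      ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
      x86_platform.calibrate_tsc = hv_get_tsc_khz;
      x86_platform.calibrate_cpu = hv_get_tsc_khz;
      /* The hypervisor reports the exact frequency, so skip the
       * refined TSC calibration later in boot. */
      setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
  }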
* x86/hyperv: fix kexec crash due to VP assist page corruptionAnirudh Rayabharam (Microsoft)2024-09-183-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b9af6418279c4cf73ca073f8ea024992b38be8ab upstream. commit 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline") introduces a new cpuhp state for hyperv initialization. cpuhp_setup_state() returns the state number if state is CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN and 0 for all other states. For the hyperv case, since a new cpuhp state was introduced it would return 0. However, in hv_machine_shutdown(), the cpuhp_remove_state() call is conditioned upon "hyperv_init_cpuhp > 0". This will never be true and so hv_cpu_die() won't be called on all CPUs. This means the VP assist page won't be reset. When the kexec kernel tries to setup the VP assist page again, the hypervisor corrupts the memory region of the old VP assist page causing a panic in case the kexec kernel is using that memory elsewhere. This was originally fixed in commit dfe94d4086e4 ("x86/hyperv: Fix kexec panic/hang issues"). Get rid of hyperv_init_cpuhp entirely since we are no longer using a dynamic cpuhp state and use CPUHP_AP_HYPERV_ONLINE directly with cpuhp_remove_state(). Cc: stable@vger.kernel.org Fixes: 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline") Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/20240828112158.3538342-1-anirudh@anirudhrb.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240828112158.3538342-1-anirudh@anirudhrb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
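A sketch of the resulting shutdown path (schematic; only the cpuhp call is the point, the rest of the function is elided):

  static void hv_machine_shutdown(void)
  {
      if (kexec_in_progress)
          /* Runs hv_cpu_die() on each CPU so the VP assist pages
           * are cleanly torn down before kexec. */
          cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);

      /* ... remainder of the existing shutdown handling ... */
      native_machine_shutdown();
  }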
* clocksource: hyper-v: Use lapic timer in a TDX VM without paravisorDexuan Cui2024-09-181-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | commit 7f828d5fff7d24752e1ecf6bebb6617a81f97b93 upstream. In a TDX VM without paravisor, currently the default timer is the Hyper-V timer, which depends on the slow VM Reference Counter MSR: the Hyper-V TSC page is not enabled in such a VM because the VM uses Invariant TSC as a better clocksource and it's challenging to mark the Hyper-V TSC page shared in very early boot. Lower the rating of the Hyper-V timer so the local APIC timer becomes the default timer in such a VM, and print a warning in case Invariant TSC is unavailable in such a VM. This change should cause no perceivable performance difference. Cc: stable@vger.kernel.org # 6.6+ Reviewed-by: Roman Kisel <romank@linux.microsoft.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/20240621061614.8339-1-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240621061614.8339-1-decui@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
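A sketch of the rating adjustment (the predicate helper is hypothetical; the point is only that the Hyper-V clockevent drops below the local APIC timer's rating of 100):

  /* The Hyper-V stimer-based clockevent normally outranks the lapic timer. */
  ce->rating = 250;

  /*
   * In a TDX VM without a paravisor the VM Reference Counter MSR is slow,
   * so drop below the local APIC timer's rating (100) and let it win.
   */
  if (tdx_guest_without_paravisor())    /* hypothetical predicate */
      ce->rating = 90;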
* x86/mm: Fix PTI for i386 some moreThomas Gleixner2024-09-121-16/+29
| | | | | | | | | | | | | | | | | | | | | | commit c48b5a4cf3125adb679e28ef093f66ff81368d05 upstream. So it turns out that we have to do two passes of pti_clone_entry_text(), once before initcalls, such that device and late initcalls can use user-mode-helper / modprobe, and once after free_initmem() / mark_readonly(). Now obviously mark_readonly() can cause PMD splits, and pti_clone_pgtable() doesn't like that much. Allow the late clone to split PMDs so that pagetables stay in sync. [peterz: Changelog and comments] Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lkml.kernel.org/r/20240806184843.GX37996@noisy.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
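Conceptually this becomes two clone passes, with only the late one allowed to split PMDs; a rough sketch in comments rather than the literal patch:

  /* Early, before initcalls: entry text must already be cloned so that
   * usermode helpers / modprobe invoked from initcalls keep working. */
  pti_clone_entry_text();

  /* Late, from pti_finalize() after free_initmem()/mark_readonly():
   * clone again, this time tolerating (and performing) the PMD splits
   * that mark_readonly() may have introduced, so the user and kernel
   * page tables stay in sync. */
  pti_clone_entry_text();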
* perf/x86/intel: Hide Topdown metrics events if the feature is not enumeratedKan Liang2024-09-121-1/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 556a7c039a52c21da33eaae9269984a1ef59189b ] The below error is observed on Ice Lake VM. $ perf stat Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (slots). /bin/dmesg | grep -i perf may provide additional information. In a virtualization env, the Topdown metrics and the slots event haven't been supported yet. The guest CPUID doesn't enumerate them. However, the current kernel unconditionally exposes the slots event and the Topdown metrics events to sysfs, which misleads the perf tool and triggers the error. Hide the perf-metrics topdown events and the slots event if the perf-metrics feature is not enumerated. The big core of a hybrid platform can also support the perf-metrics feature. Fix the hybrid platform as well. Closes: https://lore.kernel.org/lkml/CAM9d7cj8z+ryyzUHR+P1Dcpot2jjW+Qcc4CPQpfafTXN=LEU0Q@mail.gmail.com/ Reported-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lkml.kernel.org/r/20240708193336.1192217-2-kan.liang@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
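A sketch of the sysfs gating (close in spirit to the description above; the callback and attribute-group names are assumptions):

  static umode_t td_is_visible(struct kobject *kobj, struct attribute *attr, int i)
  {
      /* Hide the slots/topdown-* events unless the CPU (or the guest
       * CPUID) actually enumerates the perf-metrics feature. */
      return x86_pmu.intel_cap.perf_metrics ? attr->mode : 0;
  }

  static struct attribute_group td_events_group = {
      .name       = "events",
      .is_visible = td_is_visible,
  };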
* x86/kmsan: Fix hook for unaligned accessesBrian Johannesmeyer2024-09-121-1/+4
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit bf6ab33d8487f5e2a0998ce75286eae65bb0a6d6 ] When called with a 'from' that is not 4-byte-aligned, string_memcpy_fromio() calls the movs() macro to copy the first few bytes, so that 'from' becomes 4-byte-aligned before calling rep_movs(). This movs() macro modifies 'to', and the subsequent line modifies 'n'. As a result, on unaligned accesses, kmsan_unpoison_memory() uses the updated (aligned) values of 'to' and 'n'. Hence, it does not unpoison the entire region. Save the original values of 'to' and 'n', and pass those to kmsan_unpoison_memory(), so that the entire region is unpoisoned. Signed-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/r/20240523215029.4160518-1-bjohannesmeyer@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
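The shape of the fix, sketched with the copy loop elided (string_memcpy_fromio() and kmsan_unpoison_memory() are the real names; the body is condensed):

  static void string_memcpy_fromio(void *to, const volatile void __iomem *from, size_t n)
  {
      const void *orig_to = to;
      const size_t orig_n = n;

      /* ... alignment fixups advance 'to'/'from' and shrink 'n'
       *     before the rep_movs() bulk copy ... */

      /* Unpoison the whole destination, not just the aligned tail. */
      kmsan_unpoison_memory(orig_to, orig_n);
  }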
* x86/apic: Make x2apic_disable() work correctlyYuntao Wang2024-09-121-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | commit 0ecc5be200c84e67114f3640064ba2bae3ba2f5a upstream. x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even when the state is X2APIC_ON_LOCKED, which prevents the kernel from disabling it, thereby creating inconsistent state. Due to the early state check for X2APIC_ON, the code path which warns about a locked X2APIC cannot be reached. Test for state < X2APIC_ON instead and move the clearing of the state and mode variables to the place which actually disables X2APIC. [ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the right place for back ports ] Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup") Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240813014827.895381-1-yuntao.wang@linux.dev Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
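A condensed sketch of the reworked function (ordering per the description above, not a verbatim copy; the locked-state check is schematic):

  void x2apic_disable(void)
  {
      u32 x2apic_id;

      if (x2apic_state < X2APIC_ON)    /* was: != X2APIC_ON, which skipped LOCKED */
          return;

      x2apic_id = read_apic_id();
      if (x2apic_id >= 255)
          panic("Cannot disable x2APIC, id: %08x\n", x2apic_id);

      if (x2apic_state == X2APIC_ON_LOCKED) {
          pr_warn("APIC locked in x2apic mode, can't disable\n");
          return;
      }

      /* Only clear state/mode once X2APIC is really being disabled. */
      x2apic_state = X2APIC_DISABLED;
      x2apic_mode = 0;
      __x2apic_disable();
  }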
* x86/fpu: Avoid writing LBR bit to IA32_XSS unless supportedMitchell Levy2024-09-123-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2848ff28d180bd63a95da8e5dcbcdd76c1beeb7b upstream. There are two distinct CPU features related to the use of XSAVES and LBR: whether LBR is itself supported and whether XSAVES supports LBR. The LBR subsystem correctly checks both in intel_pmu_arch_lbr_init(), but the XSTATE subsystem does not. The LBR bit is only removed from xfeatures_mask_independent when LBR is not supported by the CPU, but there is no validation of XSTATE support. If XSAVES does not support LBR, the write to IA32_XSS causes a #GP fault, leaving the state of IA32_XSS unchanged, i.e. zero. The fault is handled with a warning and the boot continues. Consequently the next XRSTORS which tries to restore supervisor state fails with #GP because the RFBM has zero for all supervisor features, which does not match the XCOMP_BV field. As XFEATURE_MASK_FPSTATE includes supervisor features, setting up the FPU causes a #GP, which ends up in fpu_reset_from_exception_fixup(). That fails due to the same problem, resulting in recursive #GPs until the kernel runs out of stack space and double faults. Prevent this by storing the supported independent features in fpu_kernel_cfg during XSTATE initialization and use that cached value for retrieving the independent feature bits to be written into IA32_XSS. [ tglx: Massaged change log ] Fixes: f0dccc9da4c0 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Mitchell Levy <levymitchell0@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240812-xsave-lbr-fix-v3-1-95bac1bf62f4@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
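A sketch of the cached-mask approach (the independent_features field follows the description above; the surrounding details are paraphrased):

  /* During xstate init: remember which independent features XSAVES
   * really supports on this CPU. */
  fpu_kernel_cfg.independent_features = fpu_kernel_cfg.max_features &
                                        XFEATURE_MASK_INDEPENDENT;

  /* When writing IA32_XSS: derive the mask from the cached value, so a
   * CPU with LBR but without XSAVES support for LBR never sets the bit. */
  static u64 xfeatures_mask_independent(void)
  {
      if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR))
          return fpu_kernel_cfg.independent_features & ~XFEATURE_MASK_LBR;

      return fpu_kernel_cfg.independent_features;
  }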
* x86/kaslr: Expose and use the end of the physical memory address spaceThomas Gleixner2024-09-124-6/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ea72ce5da22806d5713f3ffb39a6d5ae73841f93 upstream. iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually require to know the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hiccups, add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <max8rr8@gmail.com> Reported-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-By: Max Ramanouski <max8rr8@gmail.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
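The core of the change is the new PHYSMEM_END limit; a sketch of how it could be defined and checked (simplified; the exact fallback definition is an assumption):

  /* With KASLR the end of the direct map is a runtime value; otherwise
   * fall back to the MAX_PHYSMEM_BITS-derived constant. */
  #ifdef CONFIG_RANDOMIZE_MEMORY
  extern unsigned long physmem_end;
  # define PHYSMEM_END   physmem_end              /* set during KASLR init */
  #else
  # define PHYSMEM_END   ((1ULL << MAX_PHYSMEM_BITS) - 1)
  #endif

  /* add_pages(): refuse memory hot-added beyond the direct map. */
  if (WARN_ON_ONCE(end > PHYSMEM_END))
      return -ERANGE;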
* perf/x86/intel: Limit the period on HaswellKan Liang2024-09-121-2/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 25dfc9e357af8aed1ca79b318a73f2c59c1f0b2b upstream. Running the ltp test cve-2015-3290 concurrently reports the following warnings. perfevents: irq loop stuck! WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174 intel_pmu_handle_irq+0x285/0x370 Call Trace: <NMI> ? __warn+0xa4/0x220 ? intel_pmu_handle_irq+0x285/0x370 ? __report_bug+0x123/0x130 ? intel_pmu_handle_irq+0x285/0x370 ? __report_bug+0x123/0x130 ? intel_pmu_handle_irq+0x285/0x370 ? report_bug+0x3e/0xa0 ? handle_bug+0x3c/0x70 ? exc_invalid_op+0x18/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? irq_work_claim+0x1e/0x40 ? intel_pmu_handle_irq+0x285/0x370 perf_event_nmi_handler+0x3d/0x60 nmi_handle+0x104/0x330 Thanks to Thomas Gleixner's analysis, the issue is caused by the low initial period (1) of the frequency estimation algorithm, which triggers the defects of the HW, specifically errata HSW11 and HSW143. (For the details, please refer to https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/) The HSW11 requires a period larger than 100 for the INST_RETIRED.ALL event, but the initial period in the freq mode is 1. The erratum is the same as the BDM11, which has been supported in the kernel. A minimum period of 128 is enforced as well on HSW. HSW143 concerns fixed counter 1, which may overcount by 32 when Hyper-Threading is enabled. However, based on testing, the hardware has more issues than the errata suggest. Besides the fixed counter 1, the message 'interrupt took too long' can be observed on any counter armed with a period < 32 when two events expired in the same NMI. A minimum period of 32 is enforced for the rest of the events. The recommended workaround code for HSW143 is not implemented, because it only addresses the issue for the fixed counter and brings extra overhead through extra MSR writes. No related overcounting issue has been reported so far. Fixes: 3a632cb229bf ("perf/x86/intel: Add simple Haswell PMU support") Reported-by: Li Huafei <lihuafei1@huawei.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240819183004.3132920-1-kan.liang@linux.intel.com Closes: https://lore.kernel.org/lkml/20240729223328.327835-1-lihuafei1@huawei.com/ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
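A sketch of the period clamping described above (the INST_RETIRED.ALL test is schematic; the real code keys off the event encoding):

  static bool is_inst_retired_all(struct perf_event *event)
  {
      /* INST_RETIRED.ALL is event 0xc0, umask 0x01 (erratum HSW11). */
      return (event->hw.config & INTEL_ARCH_EVENT_MASK) ==
             X86_CONFIG(.event = 0xc0, .umask = 0x01);
  }

  static void hsw_limit_period(struct perf_event *event, s64 *left)
  {
      /* HSW11 needs a period of at least 128 for INST_RETIRED.ALL;
       * everything else gets at least 32 to dodge the overcount and
       * irq-loop problems seen with tiny periods. */
      *left = max(*left, is_inst_retired_all(event) ? 128LL : 32LL);
  }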
* x86/tdx: Fix data leak in mmio_read()Kirill A. Shutemov2024-09-121-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | commit b6fb565a2d15277896583d471b21bc14a0c99661 upstream. The mmio_read() function makes a TDVMCALL to retrieve MMIO data for an address from the VMM. Sean noticed that mmio_read() unintentionally exposes the value of an uninitialized variable (val) on the stack to the VMM. This variable is only needed as an output value. It did not need to be passed to the VMM in the first place. Do not send the original value of *val to the VMM. [ dhansen: clarify what 'val' is used for. ] Fixes: 31d58c4e557d ("x86/tdx: Handle in-kernel MMIO") Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240826125304.1566719-1-kirill.shutemov%40linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
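The fix is essentially deleting one initializer; sketched in context (struct and field names follow the in-kernel TDX hypercall helpers as remembered, so treat them as assumptions):

  static bool mmio_read(int size, unsigned long addr, unsigned long *val)
  {
      struct tdx_module_args args = {
          .r10 = TDX_HYPERCALL_STANDARD,
          .r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
          .r12 = size,
          .r13 = EPT_READ,
          .r14 = addr,
          /* The old ".r15 = *val" initializer is gone: 'val' is output
           * only, and passing its stale stack contents would leak them
           * to the VMM. */
      };

      /* ... issue the TDVMCALL; on success the data read back from the
       *     VMM is copied into *val ... */
      return true;
  }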
* KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missingRavi Bangoria2024-09-121-0/+3
| | | | | | | | | | | | | | | | | | | | commit 54950bfe2b69cdc06ef753872b5225e54eb73506 upstream. If host supports Bus Lock Detect, KVM advertises it to guests even if SVM support is absent. Additionally, guest wouldn't be able to use it despite guest CPUID bit being set. Fix it by unconditionally clearing the feature bit in KVM cpu capability. Reported-by: Jim Mattson <jmattson@google.com> Closes: https://lore.kernel.org/r/CALMp9eRet6+v8Y1Q-i6mqPm4hUow_kJNhmVHfOV8tMfuSS=tVg@mail.gmail.com Fixes: 76ea438b4afc ("KVM: X86: Expose bus lock debug exception to guest") Cc: stable@vger.kernel.org Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240808062937.1149-4-ravi.bangoria@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
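The fix itself is one call; a sketch of where it lives (svm_set_cpu_caps() is the existing capability-setup function; the snippet is schematic):

  static __init void svm_set_cpu_caps(void)
  {
      /* ... existing capability setup ... */

      /* Don't advertise Bus Lock Detect to guests: SVM provides no
       * virtualization support for it, so a guest could see the CPUID
       * bit yet never be able to use the feature. */
      kvm_cpu_cap_clear(X86_FEATURE_BUS_LOCK_DETECT);
  }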
* KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASEMaxim Levitsky2024-09-121-0/+12
| | | | | | | | | | | | | | | | commit dad1613e0533b380318281c1519e1a3477c2d0d2 upstream. If these MSRs are read by the emulator (e.g. due to the 'force emulation' prefix), SVM code currently fails to extract the corresponding segment bases and return them to the emulator. Fix that. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240802151608.72896-3-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
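A sketch of the missing read-side cases (the vmcb01 access pattern is an assumption based on existing SVM code; the write side mirrors it):

  /* svm_get_msr(), inside the existing switch (msr_info->index): */
  case MSR_FS_BASE:
      msr_info->data = svm->vmcb01.ptr->save.fs.base;
      break;
  case MSR_GS_BASE:
      msr_info->data = svm->vmcb01.ptr->save.gs.base;
      break;
  /* svm_set_msr() gains the matching cases that write the same fields. */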