summaryrefslogtreecommitdiffstats
path: root/arch/x86
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'perf_urgent_for_v6.9_rc2' of ↵Linus Torvalds2024-03-317-16/+62
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf fixes from Borislav Petkov: - Define the correct set of default hw events on AMD Zen4 - Use the correct stalled cycles PMCs on AMD Zen2 and newer - Fix detection of the LBR freeze feature on AMD * tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and later perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later perf/x86/amd/lbr: Use freeze based on availability x86/cpufeatures: Add new word for scattered features
| * perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and laterSandipan Das2024-03-261-0/+15
| | | | | | | | | | | | | | | | | | | | | | Add the "ref-cycles" event for AMD processors based on Zen 4 and later microarchitectures. The backing event is based on PMCx120 which counts cycles not in halt state in P0 frequency (same as MPERF). Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/089155f19f7c7e65aeb1caa727a882e2ca9b8b04.1711352180.git.sandipan.das@amd.com
| * perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and laterSandipan Das2024-03-261-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AMD processors based on Zen 2 and later microarchitectures do not support PMCx087 (instruction pipe stalls) which is used as the backing event for "stalled-cycles-frontend" and "stalled-cycles-backend". Use PMCx0A9 (cycles where micro-op queue is empty) instead to count frontend stalls and remove the entry for backend stalls since there is no direct replacement. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Fixes: 3fe3331bb285 ("perf/x86/amd: Add event map for AMD Family 17h") Link: https://lore.kernel.org/r/03d7fc8fa2a28f9be732116009025bdec1b3ec97.1711352180.git.sandipan.das@amd.com
| * perf/x86/amd/lbr: Use freeze based on availabilitySandipan Das2024-03-254-8/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the LBR code assumes that LBR Freeze is supported on all processors when X86_FEATURE_AMD_LBR_V2 is available i.e. CPUID leaf 0x80000022[EAX] bit 1 is set. This is incorrect as the availability of the feature is additionally dependent on CPUID leaf 0x80000022[EAX] bit 2 being set, which may not be set for all Zen 4 processors. Define a new feature bit for LBR and PMC freeze and set the freeze enable bit (FLBRI) in DebugCtl (MSR 0x1d9) conditionally. It should still be possible to use LBR without freeze for profile-guided optimization of user programs by using an user-only branch filter during profiling. When the user-only filter is enabled, branches are no longer recorded after the transition to CPL 0 upon PMI arrival. When branch entries are read in the PMI handler, the branch stack does not change. E.g. $ perf record -j any,u -e ex_ret_brn_tkn ./workload Since the feature bit is visible under flags in /proc/cpuinfo, it can be used to determine the feasibility of use-cases which require LBR Freeze to be supported by the hardware such as profile-guided optimization of kernels. Fixes: ca5b7c0d9621 ("perf/x86/amd/lbr: Add LbrExtV2 branch record support") Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/69a453c97cfd11c6f2584b19f937fe6df741510f.1711091584.git.sandipan.das@amd.com
| * x86/cpufeatures: Add new word for scattered featuresSandipan Das2024-03-254-5/+9
| | | | | | | | | | | | | | | | | | Add a new word for scattered features because all free bits among the existing Linux-defined auxiliary flags have been exhausted. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/8380d2a0da469a1f0ad75b8954a79fb689599ff6.1711091584.git.sandipan.das@amd.com
* | Merge tag 'x86_urgent_for_v6.9_rc2' of ↵Linus Torvalds2024-03-3115-63/+63
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Make sure single object builds in arch/x86/virt/ ala make ... arch/x86/virt/vmx/tdx/seamcall.o work again - Do not do ROM range scans and memory validation when the kernel is running as a SEV-SNP guest as those can get problematic and, before that, are not really needed in such a guest - Exclude the build-time generated vdso-image-x32.o object from objtool validation and in particular the return sites in there due to a warning which fires when an unpatched return thunk is being used - Improve the NMI CPUs stall message to show additional information about the state of each CPU wrt the NMI handler - Enable gcc named address spaces support only on !KCSAN configs due to compiler options incompatibility - Revert a change which was trying to use GB pages for mapping regions only when the regions would be large enough but that change lead to kexec failing - A documentation fixlet * tag 'x86_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/build: Use obj-y to descend into arch/x86/virt/ x86/sev: Skip ROM range scans and validation for SEV-SNP guests x86/vdso: Fix rethunk patching for vdso-image-x32.o too x86/nmi: Upgrade NMI backtrace stall checks & messages x86/percpu: Disable named address spaces for KCSAN Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped." Documentation/x86: Fix title underline length
| * | x86/build: Use obj-y to descend into arch/x86/virt/Masahiro Yamada2024-03-303-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c33621b4c5ad ("x86/virt/tdx: Wire up basic SEAMCALL functions") introduced a new instance of core-y instead of the standardized obj-y syntax. X86 Makefiles descend into subdirectories of arch/x86/virt inconsistently; into arch/x86/virt/ via core-y defined in arch/x86/Makefile, but into arch/x86/virt/svm/ via obj-y defined in arch/x86/Kbuild. This is problematic when you build a single object in parallel because multiple threads attempt to build the same file. $ make -j$(nproc) arch/x86/virt/vmx/tdx/seamcall.o [ snip ] AS arch/x86/virt/vmx/tdx/seamcall.o AS arch/x86/virt/vmx/tdx/seamcall.o fixdep: error opening file: arch/x86/virt/vmx/tdx/.seamcall.o.d: No such file or directory make[4]: *** [scripts/Makefile.build:362: arch/x86/virt/vmx/tdx/seamcall.o] Error 2 Use the obj-y syntax, as it works correctly. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240330060554.18524-1-masahiroy@kernel.org
| * | x86/sev: Skip ROM range scans and validation for SEV-SNP guestsKevin Loughlin2024-03-268-31/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SEV-SNP requires encrypted memory to be validated before access. Because the ROM memory range is not part of the e820 table, it is not pre-validated by the BIOS. Therefore, if a SEV-SNP guest kernel wishes to access this range, the guest must first validate the range. The current SEV-SNP code does indeed scan the ROM range during early boot and thus attempts to validate the ROM range in probe_roms(). However, this behavior is neither sufficient nor necessary for the following reasons: * With regards to sufficiency, if EFI_CONFIG_TABLES are not enabled and CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK is set, the kernel will attempt to access the memory at SMBIOS_ENTRY_POINT_SCAN_START (which falls in the ROM range) prior to validation. For example, Project Oak Stage 0 provides a minimal guest firmware that currently meets these configuration conditions, meaning guests booting atop Oak Stage 0 firmware encounter a problematic call chain during dmi_setup() -> dmi_scan_machine() that results in a crash during boot if SEV-SNP is enabled. * With regards to necessity, SEV-SNP guests generally read garbage (which changes across boots) from the ROM range, meaning these scans are unnecessary. The guest reads garbage because the legacy ROM range is unencrypted data but is accessed via an encrypted PMD during early boot (where the PMD is marked as encrypted due to potentially mapping actually-encrypted data in other PMD-contained ranges). In one exceptional case, EISA probing treats the ROM range as unencrypted data, which is inconsistent with other probing. Continuing to allow SEV-SNP guests to use garbage and to inconsistently classify ROM range encryption status can trigger undesirable behavior. For instance, if garbage bytes appear to be a valid signature, memory may be unnecessarily reserved for the ROM range. Future code or other use cases may result in more problematic (arbitrary) behavior that should be avoided. While one solution would be to overhaul the early PMD mapping to always treat the ROM region of the PMD as unencrypted, SEV-SNP guests do not currently rely on data from the ROM region during early boot (and even if they did, they would be mostly relying on garbage data anyways). As a simpler solution, skip the ROM range scans (and the otherwise- necessary range validation) during SEV-SNP guest early boot. The potential SEV-SNP guest crash due to lack of ROM range validation is thus avoided by simply not accessing the ROM range. In most cases, skip the scans by overriding problematic x86_init functions during sme_early_init() to SNP-safe variants, which can be likened to x86_init overrides done for other platforms (ex: Xen); such overrides also avoid the spread of cc_platform_has() checks throughout the tree. In the exceptional EISA case, still use cc_platform_has() for the simplest change, given (1) checks for guest type (ex: Xen domain status) are already performed here, and (2) these checks occur in a subsys initcall instead of an x86_init function. [ bp: Massage commit message, remove "we"s. ] Fixes: 9704c07bf9f7 ("x86/kernel: Validate ROM memory before accessing when SEV-SNP is active") Signed-off-by: Kevin Loughlin <kevinloughlin@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20240313121546.2964854-1-kevinloughlin@google.com
| * | x86/vdso: Fix rethunk patching for vdso-image-x32.o tooBorislav Petkov (AMD)2024-03-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a similar fashion to b388e57d4628 ("x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o") annotate vdso-image-x32.o too for objtool so that it gets annotated properly and the unused return thunk warning doesn't fire. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202403251454.23df6278-lkp@intel.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/202403251454.23df6278-lkp@intel.com
| * | x86/nmi: Upgrade NMI backtrace stall checks & messagesPaul E. McKenney2024-03-261-10/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit to improve NMI stall debuggability: 344da544f177 ("x86/nmi: Print reasons why backtrace NMIs are ignored") ... has shown value, but widespread use has also identified a few opportunities for improvement. The systems have (as usual) shown far more creativity than that commit's author, demonstrating yet again that failing CPUs can do whatever they want. In addition, the current message format is less friendly than one might like to those attempting to use these messages to identify failing CPUs. Therefore, separately flag CPUs that, during the full time that the stack-backtrace request was waiting, were always in an NMI handler, were never in an NMI handler, or exited one NMI handler. Also, split the message identifying the CPU and the time since that CPU's last NMI-related activity so that a single line identifies the CPU without any other variable information, greatly reducing the processing overhead required to identify repeat-offender CPUs. Co-developed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/ab4d70c8-c874-42dc-b206-643018922393@paulmck-laptop
| * | x86/percpu: Disable named address spaces for KCSANUros Bizjak2024-03-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | -fsanitize=thread (KCSAN) is at the moment incompatible with named address spaces in a similar way as KASAN - see GCC PR sanitizer/111736: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736 The patch disables named address spaces with KCSAN. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240325110128.615933-1-ubizjak@gmail.com
| * | Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."Ingo Molnar2024-03-251-18/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit d794734c9bbfe22f86686dc2909c25f5ffe1a572. While the original change tries to fix a bug, it also unintentionally broke existing systems, see the regressions reported at: https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/ Since d794734c9bbf was also marked for -stable, let's back it out before causing more damage. Note that due to another upstream change the revert was not 100% automatic: 0a845e0f6348 mm/treewide: replace pud_large() with pud_leaf() Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Cc: Russ Anderson <rja@hpe.com> Cc: Steve Wahl <steve.wahl@hpe.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/ Fixes: d794734c9bbf ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
* | x86/bugs: Fix the SRSO mitigation on Zen3/4Borislav Petkov (AMD)2024-03-293-10/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original version of the mitigation would patch in the calls to the untraining routines directly. That is, the alternative() in UNTRAIN_RET will patch in the CALL to srso_alias_untrain_ret() directly. However, even if commit e7c25c441e9e ("x86/cpu: Cleanup the untrain mess") meant well in trying to clean up the situation, due to micro- architectural reasons, the untraining routine srso_alias_untrain_ret() must be the target of a CALL instruction and not of a JMP instruction as it is done now. Reshuffle the alternative macros to accomplish that. Fixes: e7c25c441e9e ("x86/cpu: Cleanup the untrain mess") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'efi-fixes-for-v6.9-3' of ↵Linus Torvalds2024-03-291-5/+15
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "These address all the outstanding EFI/x86 boot related regressions: - Revert to the old initrd memory allocation soft limit of INT_MAX, which was dropped inadvertently - Ensure that startup_32() is entered with a valid boot_params pointer when using the new EFI mixed mode protocol - Fix a compiler warning introduced by a fix from the previous pull" * tag 'efi-fixes-for-v6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: x86/efistub: Reinstate soft limit for initrd loading efi/libstub: Cast away type warning in use of max() x86/efistub: Add missing boot_params for mixed mode compat entry
| * | x86/efistub: Add missing boot_params for mixed mode compat entryArd Biesheuvel2024-03-261-5/+15
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pure EFI stub entry point does not take a struct boot_params from the boot loader, but creates it from scratch, and populates only the fields that still have meaning in this context (command line, initrd base and size, etc) The original mixed mode implementation used the EFI handover protocol instead, where the boot loader (i.e., GRUB) populates a boot_params struct and passes it to a special Linux specific EFI entry point that takes the boot_params pointer as its third argument. When the new mixed mode implementation was introduced, using a special 32-bit PE entrypoint in the 64-bit kernel, it adopted the pure approach, and relied on the EFI stub to create the struct boot_params. This is preferred because it makes the bootloader side much easier to implement, as it does not need any x86-specific knowledge on how struct boot_params and struct setup_header are put together. This mixed mode implementation was adopted by systemd-boot version 252 and later. When commit e2ab9eab324c ("x86/boot/compressed: Move 32-bit entrypoint code into .text section") refactored this code and moved it out of head_64.S, the fact that ESI was populated with the address of the base of the image was overlooked, and to simplify the code flow, ESI is now zeroed and stored to memory unconditionally in shared code, so that the NULL-ness of that variable can still be used later to determine which mixed mode boot protocol is in use. With ESI pointing to the base of the image, it can serve as a struct boot_params pointer for startup_32(), which only accesses the init_data and kernel_alignment fields (and the scratch field as a temporary stack). Zeroing ESI means that those accesses produce garbage now, even though things appear to work if the first page of memory happens to be zeroed, and the region right before LOAD_PHYSICAL_ADDR (== 16 MiB) happens to be free. The solution is to pass a special, temporary struct boot_params to startup_32() via ESI, one that is sufficient for getting it to create the page tables correctly and is discarded right after. This involves setting a minimal alignment of 4k, only to get the statically allocated page tables line up correctly, and setting init_size to the executable image size (_end - startup_32). This ensures that the page tables are covered by the static footprint of the PE image. Given that EFI boot no longer calls the decompressor and no longer pads the image to permit the decompressor to execute in place, the same temporary struct boot_params should be used in the EFI handover protocol based mixed mode implementation as well, to prevent the page tables from being placed outside of allocated memory. Fixes: e2ab9eab324c ("x86/boot/compressed: Move 32-bit entrypoint code into .text section") Cc: <stable@kernel.org> # v6.1+ Closes: https://lore.kernel.org/all/20240321150510.GI8211@craftyguy.net/ Reported-by: Clayton Craft <clayton@craftyguy.net> Tested-by: Clayton Craft <clayton@craftyguy.net> Tested-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
* / crash: use macro to add crashk_res into iomem early for specific archBaoquan He2024-03-261-0/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are regression reports[1][2] that crashkernel region on x86_64 can't be added into iomem tree sometime. This causes the later failure of kdump loading. This happened after commit 4a693ce65b18 ("kdump: defer the insertion of crashkernel resources") was merged. Even though, these reported issues are proved to be related to other component, they are just exposed after above commmit applied, I still would like to keep crashk_res and crashk_low_res being added into iomem early as before because the early adding has been always there on x86_64 and working very well. For safety of kdump, Let's change it back. Here, add a macro HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY to limit that only ARCH defining the macro can have the early adding crashk_res/_low_res into iomem. Then define HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY on x86 to enable it. Note: In reserve_crashkernel_low(), there's a remnant of crashk_low_res handling which was mistakenly added back in commit 85fcde402db1 ("kexec: split crashkernel reservation code out from crash_core.c"). [1] [PATCH V2] x86/kexec: do not update E820 kexec table for setup_data https://lore.kernel.org/all/Zfv8iCL6CT2JqLIC@darkstar.users.ipa.redhat.com/T/#u [2] Question about Address Range Validation in Crash Kernel Allocation https://lore.kernel.org/all/4eeac1f733584855965a2ea62fa4da58@huawei.com/T/#u Link: https://lkml.kernel.org/r/ZgDYemRQ2jxjLkq+@MiWiFi-R3L-srv Fixes: 4a693ce65b18 ("kdump: defer the insertion of crashkernel resources") Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Huacai Chen <chenhuacai@loongson.cn> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Li Huafei <lihuafei1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Merge tag 'efi-fixes-for-v6.9-2' of ↵Linus Torvalds2024-03-241-0/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: - Fix logic that is supposed to prevent placement of the kernel image below LOAD_PHYSICAL_ADDR - Use the firmware stack in the EFI stub when running in mixed mode - Clear BSS only once when using mixed mode - Check efi.get_variable() function pointer for NULL before trying to call it * tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi: fix panic in kdump kernel x86/efistub: Don't clear BSS twice in mixed mode x86/efistub: Call mixed mode boot services on the firmware's stack efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
| * x86/efistub: Call mixed mode boot services on the firmware's stackArd Biesheuvel2024-03-241-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Normally, the EFI stub calls into the EFI boot services using the stack that was live when the stub was entered. According to the UEFI spec, this stack needs to be at least 128k in size - this might seem large but all asynchronous processing and event handling in EFI runs from the same stack and so quite a lot of space may be used in practice. In mixed mode, the situation is a bit different: the bootloader calls the 32-bit EFI stub entry point, which calls the decompressor's 32-bit entry point, where the boot stack is set up, using a fixed allocation of 16k. This stack is still in use when the EFI stub is started in 64-bit mode, and so all calls back into the EFI firmware will be using the decompressor's limited boot stack. Due to the placement of the boot stack right after the boot heap, any stack overruns have gone unnoticed. However, commit 5c4feadb0011983b ("x86/decompressor: Move global symbol references to C code") moved the definition of the boot heap into C code, and now the boot stack is placed right at the base of BSS, where any overruns will corrupt the end of the .data section. While it would be possible to work around this by increasing the size of the boot stack, doing so would affect all x86 systems, and mixed mode systems are a tiny (and shrinking) fraction of the x86 installed base. So instead, record the firmware stack pointer value when entering from the 32-bit firmware, and switch to this stack every time a EFI boot service call is made. Cc: <stable@kernel.org> # v6.1+ Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
* | Merge tag 'x86-urgent-2024-03-24' of ↵Linus Torvalds2024-03-2414-85/+76
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: - Ensure that the encryption mask at boot is properly propagated on 5-level page tables, otherwise the PGD entry is incorrectly set to non-encrypted, which causes system crashes during boot. - Undo the deferred 5-level page table setup as it cannot work with memory encryption enabled. - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset to the default value but the cached variable is not, so subsequent comparisons might yield the wrong result and as a consequence the result prevents updating the MSR. - Register the local APIC address only once in the MPPARSE enumeration to prevent triggering the related WARN_ONs() in the APIC and topology code. - Handle the case where no APIC is found gracefully by registering a fake APIC in the topology code. That makes all related topology functions work correctly and does not affect the actual APIC driver code at all. - Don't evaluate logical IDs during early boot as the local APIC IDs are not yet enumerated and the invoked function returns an error code. Nothing requires the logical IDs before the final CPUID enumeration takes place, which happens after the enumeration. - Cure the fallout of the per CPU rework on UP which misplaced the copying of boot_cpu_data to per CPU data so that the final update to boot_cpu_data got lost which caused inconsistent state and boot crashes. - Use copy_from_kernel_nofault() in the kprobes setup as there is no guarantee that the address can be safely accessed. - Reorder struct members in struct saved_context to work around another kmemleak false positive - Remove the buggy code which tries to update the E820 kexec table for setup_data as that is never passed to the kexec kernel. - Update the resource control documentation to use the proper units. - Fix a Kconfig warning observed with tinyconfig * tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot/64: Move 5-level paging global variable assignments back x86/boot/64: Apply encryption mask to 5-level pagetable update x86/cpu: Add model number for another Intel Arrow Lake mobile processor x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD Documentation/x86: Document that resctrl bandwidth control units are MiB x86/mpparse: Register APIC address only once x86/topology: Handle the !APIC case gracefully x86/topology: Don't evaluate logical IDs during early boot x86/cpu: Ensure that CPU info updates are propagated on UP kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address x86/pm: Work around false positive kmemleak report in msr_build_context() x86/kexec: Do not update E820 kexec table for setup_data x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
| * | x86/boot/64: Move 5-level paging global variable assignments backTom Lendacky2024-03-241-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables") moved assignment of 5-level global variables to later in the boot in order to avoid having to use RIP relative addressing in order to set them. However, when running with 5-level paging and SME active (mem_encrypt=on), the variables are needed as part of the page table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(), etc.). Since the variables haven't been set, the page table manipulation is done as if 4-level paging is active, causing the system to crash on boot. While only a subset of the assignments that were moved need to be set early, move all of the assignments back into check_la57_support() so that these assignments aren't spread between two locations. Instead of just reverting the fix, this uses the new RIP_REL_REF() macro when assigning the variables. Fixes: 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com
| * | x86/boot/64: Apply encryption mask to 5-level pagetable updateTom Lendacky2024-03-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running with 5-level page tables, the kernel mapping PGD entry is updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC, which, when SME is active (mem_encrypt=on), results in a page table entry without the encryption mask set, causing the system to crash on boot. Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so that the encryption mask is set for the PGD entry. Fixes: 533568e06b15 ("x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/8f20345cda7dbba2cf748b286e1bc00816fe649a.1711122067.git.thomas.lendacky@amd.com
| * | x86/cpu: Add model number for another Intel Arrow Lake mobile processorTony Luck2024-03-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | This one is the regular laptop CPU. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240322161725.195614-1-tony.luck@intel.com
| * | x86/fpu: Keep xfd_state in sync with MSR_IA32_XFDAdamos Ttofari2024-03-242-6/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 672365477ae8 ("x86/fpu: Update XFD state where required") and commit 8bf26758ca96 ("x86/fpu: Add XFD state to fpstate") introduced a per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in order to avoid unnecessary writes to the MSR. On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which wipes out any stale state. But the per CPU cached xfd value is not reset, which brings them out of sync. As a consequence a subsequent xfd_update_state() might fail to update the MSR which in turn can result in XRSTOR raising a #NM in kernel space, which crashes the kernel. To fix this, introduce xfd_set_state() to write xfd_state together with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD. Fixes: 672365477ae8 ("x86/fpu: Update XFD state where required") Signed-off-by: Adamos Ttofari <attofari@amazon.de> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
| * | x86/mpparse: Register APIC address only onceThomas Gleixner2024-03-231-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The APIC address is registered twice. First during the early detection and afterwards when actually scanning the table for APIC IDs. The APIC and topology core warn about the second attempt. Restrict it to the early detection call. Fixes: 81287ad65da5 ("x86/apic: Sanitize APIC address setup") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.297774848@linutronix.de
| * | x86/topology: Handle the !APIC case gracefullyThomas Gleixner2024-03-231-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If there is no local APIC enumerated and registered then the topology bitmaps are empty. Therefore, topology_init_possible_cpus() will die with a division by zero exception. Prevent this by registering a fake APIC id to populate the topology bitmap. This also allows to use all topology query interfaces unconditionally. It does not affect the actual APIC code because either the local APIC address was not registered or no local APIC could be detected. Fixes: f1f758a80516 ("x86/topology: Add a mechanism to track topology via APIC IDs") Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.242709302@linutronix.de
| * | x86/topology: Don't evaluate logical IDs during early bootThomas Gleixner2024-03-231-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The local APICs have not yet been enumerated so the logical ID evaluation from the topology bitmaps does not work and would return an error code. Skip the evaluation during the early boot CPUID evaluation and only apply it on the final run. Fixes: 380414be78bf ("x86/cpu/topology: Use topology logical mapping mechanism") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.186943142@linutronix.de
| * | x86/cpu: Ensure that CPU info updates are propagated on UPThomas Gleixner2024-03-233-37/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The boot sequence evaluates CPUID information twice: 1) During early boot 2) When finalizing the early setup right before mitigations are selected and alternatives are patched. In both cases the evaluation is stored in boot_cpu_data, but on UP the copying of boot_cpu_data to the per CPU info of the boot CPU happens between #1 and #2. So any update which happens in #2 is never propagated to the per CPU info instance. Consolidate the whole logic and copy boot_cpu_data right before applying alternatives as that's the point where boot_cpu_data is in it's final state and not supposed to change anymore. This also removes the voodoo mb() from smp_prepare_cpus_common() which had absolutely no purpose. Fixes: 71eb4893cfaf ("x86/percpu: Cure per CPU madness on UP") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.127642785@linutronix.de
| * | kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe addressMasami Hiramatsu (Google)2024-03-221-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Read from an unsafe address with copy_from_kernel_nofault() in arch_adjust_kprobe_addr() because this function is used before checking the address is in text or not. Syzcaller bot found a bug and reported the case if user specifies inaccessible data area, arch_adjust_kprobe_addr() will cause a kernel panic. [ mingo: Clarified the comment. ] Fixes: cc66bb914578 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes") Reported-by: Qiang Zhang <zzqq0103.hey@gmail.com> Tested-by: Jinghao Jia <jinghao7@illinois.edu> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/171042945004.154897.2221804961882915806.stgit@devnote2
| * | x86/pm: Work around false positive kmemleak report in msr_build_context()Anton Altaparmakov2024-03-221-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since: 7ee18d677989 ("x86/power: Make restore_processor_context() sane") kmemleak reports this issue: unreferenced object 0xf68241e0 (size 32): comm "swapper/0", pid 1, jiffies 4294668610 (age 68.432s) hex dump (first 32 bytes): 00 cc cc cc 29 10 01 c0 00 00 00 00 00 00 00 00 ....)........... 00 42 82 f6 cc cc cc cc cc cc cc cc cc cc cc cc .B.............. backtrace: [<461c1d50>] __kmem_cache_alloc_node+0x106/0x260 [<ea65e13b>] __kmalloc+0x54/0x160 [<c3858cd2>] msr_build_context.constprop.0+0x35/0x100 [<46635aff>] pm_check_save_msr+0x63/0x80 [<6b6bb938>] do_one_initcall+0x41/0x1f0 [<3f3add60>] kernel_init_freeable+0x199/0x1e8 [<3b538fde>] kernel_init+0x1a/0x110 [<938ae2b2>] ret_from_fork+0x1c/0x28 Which is a false positive. Reproducer: - Run rsync of whole kernel tree (multiple times if needed). - start a kmemleak scan - Note this is just an example: a lot of our internal tests hit these. The root cause is similar to the fix in: b0b592cf0836 x86/pm: Fix false positive kmemleak report in msr_build_context() ie. the alignment within the packed struct saved_context which has everything unaligned as there is only "u16 gs;" at start of struct where in the past there were four u16 there thus aligning everything afterwards. The issue is with the fact that Kmemleak only searches for pointers that are aligned (see how pointers are scanned in kmemleak.c) so when the struct members are not aligned it doesn't see them. Testing: We run a lot of tests with our CI, and after applying this fix we do not see any kmemleak issues any more whilst without it we see hundreds of the above report. From a single, simple test run consisting of 416 individual test cases on kernel 5.10 x86 with kmemleak enabled we got 20 failures due to this, which is quite a lot. With this fix applied we get zero kmemleak related failures. Fixes: 7ee18d677989 ("x86/power: Make restore_processor_context() sane") Signed-off-by: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> Cc: stable@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240314142656.17699-1-anton@tuxera.com
| * | x86/kexec: Do not update E820 kexec table for setup_dataDave Young2024-03-221-16/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | crashkernel reservation failed on a Thinkpad t440s laptop recently. Actually the memblock reservation succeeded, but later insert_resource() failed. Test steps: kexec load -> /* make sure add crashkernel param eg. crashkernel=160M */ kexec reboot -> dmesg|grep "crashkernel reserved"; crashkernel memory range like below reserved successfully: 0x00000000d0000000 - 0x00000000da000000 But no such "Crash kernel" region in /proc/iomem The background story: Currently the E820 code reserves setup_data regions for both the current kernel and the kexec kernel, and it inserts them into the resources list. Before the kexec kernel reboots nobody passes the old setup_data, and kexec only passes fresh SETUP_EFI/SETUP_IMA/SETUP_RNG_SEED if needed. Thus the old setup data memory is not used at all. Due to old kernel updates the kexec e820 table as well so kexec kernel sees them as E820_TYPE_RESERVED_KERN regions, and later the old setup_data regions are inserted into resources list in the kexec kernel by e820__reserve_resources(). Note, due to no setup_data is passed in for those old regions they are not early reserved (by function early_reserve_memory), and the crashkernel memblock reservation will just treat them as usable memory and it could reserve the crashkernel region which overlaps with the old setup_data regions. And just like the bug I noticed here, kdump insert_resource failed because e820__reserve_resources has added the overlapped chunks in /proc/iomem already. Finally, looking at the code, the old setup_data regions are not used at all as no setup_data is passed in by the kexec boot loader. Although something like SETUP_PCI etc could be needed, kexec should pass the info as new setup_data so that kexec kernel can take care of them. This should be taken care of in other separate patches if needed. Thus drop the useless buggy code here. Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Eric DeVolder <eric.devolder@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/Zf0T3HCG-790K-pZ@darkstar.users.ipa.redhat.com
| * | x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'Masahiro Yamada2024-03-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kconfig emits a warning for the following command: $ make ARCH=x86_64 tinyconfig ... .config:1380:warning: override: UNWINDER_GUESS changes choice state When X86_64=y, the unwinder is exclusively selected from the following three options: - UNWINDER_ORC - UNWINDER_FRAME_POINTER - UNWINDER_GUESS However, arch/x86/configs/tiny.config only specifies the values of the last two. UNWINDER_ORC must be explicitly disabled. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240320154313.612342-1-masahiroy@kernel.org
* | | Merge tag 'kbuild-v6.9' of ↵Linus Torvalds2024-03-214-3/+5
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Generate a list of built DTB files (arch/*/boot/dts/dtbs-list) - Use more threads when building Debian packages in parallel - Fix warnings shown during the RPM kernel package uninstallation - Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to Makefile - Support GCC's -fmin-function-alignment flag - Fix a null pointer dereference bug in modpost - Add the DTB support to the RPM package - Various fixes and cleanups in Kconfig * tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits) kconfig: tests: test dependency after shuffling choices kconfig: tests: add a test for randconfig with dependent choices kconfig: tests: support KCONFIG_SEED for the randconfig runner kbuild: rpm-pkg: add dtb files in kernel rpm kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig() kconfig: check prompt for choice while parsing kconfig: lxdialog: remove unused dialog colors kconfig: lxdialog: fix button color for blackbg theme modpost: fix null pointer dereference kbuild: remove GCC's default -Wpacked-bitfield-compat flag kbuild: unexport abs_srctree and abs_objtree kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1 kconfig: remove named choice support kconfig: use linked list in get_symbol_str() to iterate over menus kconfig: link menus to a symbol kbuild: fix inconsistent indentation in top Makefile kbuild: Use -fmin-function-alignment when available alpha: merge two entries for CONFIG_ALPHA_GAMMA alpha: merge two entries for CONFIG_ALPHA_EV4 kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj) ...
| * | | kbuild: change tool coverage variables to take the path relative to $(obj)Masahiro Yamada2024-02-232-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 54b8ae66ae1a ("kbuild: change *FLAGS_<basetarget>.o to take the path relative to $(obj)") changed the syntax of per-file compiler flags. The situation is the same for the following variables: OBJECT_FILES_NON_STANDARD_<basetarget>.o GCOV_PROFILE_<basetarget>.o KASAN_SANITIZE_<basetarget>.o KMSAN_SANITIZE_<basetarget>.o KMSAN_ENABLE_CHECKS_<basetarget>.o UBSAN_SANITIZE_<basetarget>.o KCOV_INSTRUMENT_<basetarget>.o KCSAN_SANITIZE_<basetarget>.o KCSAN_INSTRUMENT_BARRIERS_<basetarget>.o The <basetarget> is the filename of the target with its directory and suffix stripped. This syntax comes into a trouble when two files with the same basename appear in one Makefile, for example: obj-y += dir1/foo.o obj-y += dir2/foo.o OBJECT_FILES_NON_STANDARD_foo.o := y OBJECT_FILES_NON_STANDARD_foo.o is applied to both dir1/foo.o and dir2/foo.o. This syntax is not flexbile enough to handle cases where one of them is a standard object, but the other is not. It is more sensible to use the relative path to the Makefile, like this: obj-y += dir1/foo.o OBJECT_FILES_NON_STANDARD_dir1/foo.o := y obj-y += dir2/foo.o OBJECT_FILES_NON_STANDARD_dir2/foo.o := y To maintain the current behavior, I made adjustments to the following two Makefiles: - arch/x86/entry/vdso/Makefile, which compiles vclock_gettime.o, vgetcpu.o, and their vdso32 variants. - arch/x86/kvm/Makefile, which compiles vmx/vmenter.o and svm/vmenter.o Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Acked-by: Sean Christopherson <seanjc@google.com>
| * | | treewide: replace or remove redundant def_bool in Kconfig filesMasahiro Yamada2024-02-202-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'def_bool X' is a shorthand for 'bool' plus 'default X'. 'def_bool' is redundant where 'bool' is already present, so 'def_bool X' can be replaced with 'default X', or removed if X is 'n'. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
* | | | Merge tag 'char-misc-6.9-rc1' of ↵Linus Torvalds2024-03-211-4/+0
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc and other driver subsystem updates from Greg KH: "Here is the big set of char/misc and a number of other driver subsystem updates for 6.9-rc1. Included in here are: - IIO driver updates, loads of new ones and evolution of existing ones - coresight driver updates - const cleanups for many driver subsystems - speakup driver additions - platform remove callback void cleanups - mei driver updates - mhi driver updates - cdx driver updates for MSI interrupt handling - nvmem driver updates - other smaller driver updates and cleanups, full details in the shortlog All of these have been in linux-next for a long time with no reported issue, other than a build warning for the speakup driver" The build warning hits clang and is a gcc (and C23) extension, and is fixed up in the merge. Link: https://lore.kernel.org/all/20240321134831.GA2762840@dev-arch.thelio-3990X/ * tag 'char-misc-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (279 commits) binder: remove redundant variable page_addr uio_dmem_genirq: UIO_MEM_DMA_COHERENT conversion uio_pruss: UIO_MEM_DMA_COHERENT conversion cnic,bnx2,bnx2x: use UIO_MEM_DMA_COHERENT uio: introduce UIO_MEM_DMA_COHERENT type cdx: add MSI support for CDX bus pps: use cflags-y instead of EXTRA_CFLAGS speakup: Add /dev/synthu device speakup: Fix 8bit characters from direct synth parport: sunbpp: Convert to platform remove callback returning void parport: amiga: Convert to platform remove callback returning void char: xillybus: Convert to platform remove callback returning void vmw_balloon: change maintainership MAINTAINERS: change the maintainer for hpilo driver char: xilinx_hwicap: Fix NULL vs IS_ERR() bug hpet: remove hpets::hp_clocksource platform: goldfish: move the separate 'default' propery for CONFIG_GOLDFISH char: xilinx_hwicap: drop casting to void in dev_set_drvdata greybus: move is_gb_* functions out of greybus.h greybus: Remove usage of the deprecated ida_simple_xx() API ...
| * | | | platform: goldfish: move the separate 'default' propery for CONFIG_GOLDFISHMasahiro Yamada2024-03-071-4/+0
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, there are two entries for CONFIG_GOLDFISH. In arch/x86/Kconfig: config GOLDFISH def_bool y depends on X86_GOLDFISH In drivers/platform/goldfish/Kconfig: menuconfig GOLDFISH bool "Platform support for Goldfish virtual devices" depends on HAS_IOMEM && HAS_DMA While Kconfig allows multiple entries, it generally leads to tricky code. Prior to commit bd2f348db503 ("goldfish: refactor goldfish platform configs"), CONFIG_GOLDFISH was an alias of CONFIG_X86_GOLDFISH. After the mentioned commit added the second entry with a user prompt, the former provides the 'default' property that is effective only when X86_GOLDFISH=y. Merge them tegether to clarify how it has worked in the past 8 years. Cc: Greg Hackmann <ghackmann@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Link: https://lore.kernel.org/r/20240204081004.33871-1-masahiroy@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | Merge tag 'hyperv-next-signed-20240320' of ↵Linus Torvalds2024-03-216-141/+157
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Use Hyper-V entropy to seed guest random number generator (Michael Kelley) - Convert to platform remove callback returning void for vmbus (Uwe Kleine-König) - Introduce hv_get_hypervisor_version function (Nuno Das Neves) - Rename some HV_REGISTER_* defines for consistency (Nuno Das Neves) - Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_* (Nuno Das Neves) - Cosmetic changes for hv_spinlock.c (Purna Pavan Chandra Aekkaladevi) - Use per cpu initial stack for vtl context (Saurabh Sengar) * tag 'hyperv-next-signed-20240320' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Use Hyper-V entropy to seed guest random number generator x86/hyperv: Cosmetic changes for hv_spinlock.c hyperv-tlfs: Rename some HV_REGISTER_* defines for consistency hv: vmbus: Convert to platform remove callback returning void mshyperv: Introduce hv_get_hypervisor_version function x86/hyperv: Use per cpu initial stack for vtl context hyperv-tlfs: Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_*
| * | | | x86/hyperv: Use Hyper-V entropy to seed guest random number generatorMichael Kelley2024-03-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A Hyper-V host provides its guest VMs with entropy in a custom ACPI table named "OEM0". The entropy bits are updated each time Hyper-V boots the VM, and are suitable for seeding the Linux guest random number generator (rng). See a brief description of OEM0 in [1]. Generation 2 VMs on Hyper-V use UEFI to boot. Existing EFI code in Linux seeds the rng with entropy bits from the EFI_RNG_PROTOCOL. Via this path, the rng is seeded very early during boot with good entropy. The ACPI OEM0 table provided in such VMs is an additional source of entropy. Generation 1 VMs on Hyper-V boot from BIOS. For these VMs, Linux doesn't currently get any entropy from the Hyper-V host. While this is not fundamentally broken because Linux can generate its own entropy, using the Hyper-V host provided entropy would get the rng off to a better start and would do so earlier in the boot process. Improve the rng seeding for Generation 1 VMs by having Hyper-V specific code in Linux take advantage of the OEM0 table to seed the rng. For Generation 2 VMs, use the OEM0 table to provide additional entropy beyond the EFI_RNG_PROTOCOL. Because the OEM0 table is custom to Hyper-V, parse it directly in the Hyper-V code in the Linux kernel and use add_bootloader_randomness() to add it to the rng. Once the entropy bits are read from OEM0, zero them out in the table so they don't appear in /sys/firmware/acpi/tables/OEM0 in the running VM. The zero'ing is done out of an abundance of caution to avoid potential security risks to the rng. Also set the OEM0 data length to zero so a kexec or other subsequent use of the table won't try to use the zero'ed bits. [1] https://download.microsoft.com/download/1/c/9/1c9813b8-089c-4fef-b2ad-ad80e79403ba/Whitepaper%20-%20The%20Windows%2010%20random%20number%20generation%20infrastructure.pdf Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://lore.kernel.org/r/20240318155408.216851-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240318155408.216851-1-mhklinux@outlook.com>
| * | | | x86/hyperv: Cosmetic changes for hv_spinlock.cPurna Pavan Chandra Aekkaladevi2024-03-181-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix issues reported by checkpatch.pl script for hv_spinlock.c file. - Place __initdata after variable name - Add missing blank line after enum declaration No functional changes intended. Signed-off-by: Purna Pavan Chandra Aekkaladevi <paekkaladevi@linux.microsoft.com> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Link: https://lore.kernel.org/r/1710763751-14137-1-git-send-email-paekkaladevi@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1710763751-14137-1-git-send-email-paekkaladevi@linux.microsoft.com>
| * | | | hyperv-tlfs: Rename some HV_REGISTER_* defines for consistencyNuno Das Neves2024-03-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename HV_REGISTER_GUEST_OSID to HV_REGISTER_GUEST_OS_ID. This matches the existing HV_X64_MSR_GUEST_OS_ID. Rename HV_REGISTER_CRASH_* to HV_REGISTER_GUEST_CRASH_*. Including GUEST_ is consistent with other #defines such as HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE. The new names also match the TLFS document more accurately, i.e. HvRegisterGuestCrash*. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Link: https://lore.kernel.org/r/1710285687-9160-1-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1710285687-9160-1-git-send-email-nunodasneves@linux.microsoft.com>
| * | | | mshyperv: Introduce hv_get_hypervisor_version functionNuno Das Neves2024-03-121-19/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce x86_64 and arm64 functions to get the hypervisor version information and store it in a structure for simpler parsing. Use the new function to get and parse the version at boot time. While at it, move the printing code to hv_common_init() so it is not duplicated. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Acked-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1709852618-29110-1-git-send-email-nunodasneves@linux.microsoft.com>
| * | | | x86/hyperv: Use per cpu initial stack for vtl contextSaurabh Sengar2024-03-081-4/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the secondary CPUs in Hyper-V VTL context lack support for parallel startup. Therefore, relying on the single initial_stack fetched from the current task structure suffices for all vCPUs. However, common initial_stack risks stack corruption when parallel startup is enabled. In order to facilitate parallel startup, use the initial_stack from the per CPU idle thread instead of the current task. Fixes: 3be1bc2fe9d2 ("x86/hyperv: VTL support for Hyper-V") Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1709452896-13342-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1709452896-13342-1-git-send-email-ssengar@linux.microsoft.com>
| * | | | hyperv-tlfs: Change prefix of generic HV_REGISTER_* MSRs to HV_MSR_*Nuno Das Neves2024-03-044-116/+123
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The HV_REGISTER_ are used as arguments to hv_set/get_register(), which delegate to arch-specific mechanisms for getting/setting synthetic Hyper-V MSRs. On arm64, HV_REGISTER_ defines are synthetic VP registers accessed via the get/set vp registers hypercalls. The naming matches the TLFS document, although these register names are not specific to arm64. However, on x86 the prefix HV_REGISTER_ indicates Hyper-V MSRs accessed via rdmsrl()/wrmsrl(). This is not consistent with the TLFS doc, where HV_REGISTER_ is *only* used for used for VP register names used by the get/set register hypercalls. To fix this inconsistency and prevent future confusion, change the arch-generic aliases used by callers of hv_set/get_register() to have the prefix HV_MSR_ instead of HV_REGISTER_. Use the prefix HV_X64_MSR_ for the x86-only Hyper-V MSRs. On x86, the generic HV_MSR_'s point to the corresponding HV_X64_MSR_. Move the arm64 HV_REGISTER_* defines to the asm-generic hyperv-tlfs.h, since these are not specific to arm64. On arm64, the generic HV_MSR_'s point to the corresponding HV_REGISTER_. While at it, rename hv_get/set_registers() and related functions to hv_get/set_msr(), hv_get/set_nested_msr(), etc. These are only used for Hyper-V MSRs and this naming makes that clear. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1708440933-27125-1-git-send-email-nunodasneves@linux.microsoft.com>
* | | | | Merge tag 'for-linus-6.9-rc1-tag' of ↵Linus Torvalds2024-03-196-44/+122
|\ \ \ \ \ | |_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - Xen event channel handling fix for a regression with a rare kernel config and some added hardening - better support of running Xen dom0 in PVH mode - a cleanup for the xen grant-dma-iommu driver * tag 'for-linus-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/events: increment refcnt only if event channel is refcounted xen/evtchn: avoid WARN() when unbinding an event channel x86/xen: attempt to inflate the memory balloon on PVH xen/grant-dma-iommu: Convert to platform remove callback returning void
| * | | | x86/xen: attempt to inflate the memory balloon on PVHRoger Pau Monne2024-03-136-44/+122
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running as PVH or HVM Linux will use holes in the memory map as scratch space to map grants, foreign domain pages and possibly miscellaneous other stuff. However the usage of such memory map holes for Xen purposes can be problematic. The request of holesby Xen happen quite early in the kernel boot process (grant table setup already uses scratch map space), and it's possible that by then not all devices have reclaimed their MMIO space. It's not unlikely for chunks of Xen scratch map space to end up using PCI bridge MMIO window memory, which (as expected) causes quite a lot of issues in the system. At least for PVH dom0 we have the possibility of using regions marked as UNUSABLE in the e820 memory map. Either if the region is UNUSABLE in the native memory map, or it has been converted into UNUSABLE in order to hide RAM regions from dom0, the second stage translation page-tables can populate those areas without issues. PV already has this kind of logic, where the balloon driver is inflated at boot. Re-use the current logic in order to also inflate it when running as PVH. onvert UNUSABLE regions up to the ratio specified in EXTRA_MEM_RATIO to RAM, while reserving them using xen_add_extra_mem() (which is also moved so it's no longer tied to CONFIG_PV). [jgross: fixed build for CONFIG_PVH without CONFIG_XEN_PVH] Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240220174341.56131-1-roger.pau@citrix.com Signed-off-by: Juergen Gross <jgross@suse.com>
* | | | | Merge tag 'perf-urgent-2024-03-17' of ↵Linus Torvalds2024-03-172-3/+4
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf event fixes from Ingo Molnar: - Work around AMD erratum to filter out bogus LBR stack entries - Fix incorrect PMU reset that can result in warnings (or worse) during suspend/hibernation * tag 'perf-urgent-2024-03-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/amd/core: Avoid register reset when CPU is dead perf/x86/amd/lbr: Discard erroneous branch entries
| * | | | | perf/x86/amd/core: Avoid register reset when CPU is deadSandipan Das2024-03-131-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When bringing a CPU online, some of the PMC and LBR related registers are reset. The same is done when a CPU is taken offline although that is unnecessary. This currently happens in the "cpu_dead" callback which is also incorrect as the callback runs on a control CPU instead of the one that is being taken offline. This also affects hibernation and suspend to RAM on some platforms as reported in the link below. Fixes: 21d59e3e2c40 ("perf/x86/amd/core: Detect PerfMonV2 support") Reported-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/550a026764342cf7e5812680e3e2b91fe662b5ac.1706526029.git.sandipan.das@amd.com
| * | | | | perf/x86/amd/lbr: Discard erroneous branch entriesSandipan Das2024-03-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Revision Guide for AMD Family 19h Model 10-1Fh processors declares Erratum 1452 which states that non-branch entries may erroneously be recorded in the Last Branch Record (LBR) stack with the valid and spec bits set. Such entries can be recognized by inspecting bit 61 of the corresponding LastBranchStackToIp register. This bit is currently reserved but if found to be set, the associated branch entry should be discarded. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://bugzilla.kernel.org/attachment.cgi?id=305518 Link: https://lore.kernel.org/r/3ad2aa305f7396d41a40e3f054f740d464b16b7f.1706526029.git.sandipan.das@amd.com
* | | | | | x86/CPU/AMD: Update the Zenbleed microcode revisionsBorislav Petkov (AMD)2024-03-161-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update them to the correct revision numbers. Fixes: 522b1d69219d ("x86/cpu/amd: Add a Zenbleed fix") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2024-03-1538-656/+1253
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Paolo Bonzini: "S390: - Changes to FPU handling came in via the main s390 pull request - Only deliver to the guest the SCLP events that userspace has requested - More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same) - Fix selftests undefined behavior x86: - Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec - Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests) - Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized - Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot - Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels - Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives - Fix the debugregs ABI for 32-bit KVM - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work - Cleanup the logic for checking if the currently loaded vCPU is in-kernel - Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel x86 Xen emulation: - Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior) - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs RISC-V: - Support exception and interrupt handling in selftests - New self test for RISC-V architectural timer (Sstc extension) - New extension support (Ztso, Zacas) - Support userspace emulation of random number seed CSRs ARM: - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: - Set reserved bits as zero in CPUCFG - Start SW timer only when vcpu is blocking - Do not restart SW timer when it is expired - Remove unnecessary CSR register saving during enter guest - Misc cleanups and fixes as usual Generic: - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else - Factor common "select" statements in common code instead of requiring each architecture to specify it - Remove thoroughly obsolete APIs from the uapi headers - Move architecture-dependent stuff to uapi/asm/kvm.h - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker Selftests: - Reduce boilerplate especially when utilize selftest TAP infrastructure - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory - Fix benign bugs where tests neglect to close() guest_memfd files" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) selftests: kvm: remove meaningless assignments in Makefiles KVM: riscv: selftests: Add Zacas extension to get-reg-list test RISC-V: KVM: Allow Zacas extension for Guest/VM KVM: riscv: selftests: Add Ztso extension to get-reg-list test RISC-V: KVM: Allow Ztso extension for Guest/VM RISC-V: KVM: Forward SEED CSR access to user space KVM: riscv: selftests: Add sstc timer test KVM: riscv: selftests: Change vcpu_has_ext to a common function KVM: riscv: selftests: Add guest helper to get vcpu id KVM: riscv: selftests: Add exception handling support LoongArch: KVM: Remove unnecessary CSR register saving during enter guest LoongArch: KVM: Do not restart SW timer when it is expired LoongArch: KVM: Start SW timer only when vcpu is blocking LoongArch: KVM: Set reserved bits as zero in CPUCFG KVM: selftests: Explicitly close guest_memfd files in some gmem tests KVM: x86/xen: fix recursive deadlock in timer injection KVM: pfncache: simplify locking and make more self-contained KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled KVM: x86/xen: improve accuracy of Xen timers ...