summaryrefslogtreecommitdiffstats
path: root/arch/x86
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | | | x86/topology: convert to use arch_cpu_is_hotpluggable()Russell King (Oracle)2023-12-061-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert x86 to use the arch_cpu_is_hotpluggable() helper rather than arch_register_cpu(). Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R3w-00Cszy-6k@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | x86/topology: use weak version of arch_unregister_cpu()Russell King (Oracle)2023-12-061-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the x86 version of arch_unregister_cpu() is the same as the weak version, drop the x86 specific version. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R3r-00Cszs-2R@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | x86/topology: Switch over to GENERIC_CPU_DEVICESJames Morse2023-12-063-27/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that GENERIC_CPU_DEVICES calls arch_register_cpu(), which can be overridden by the arch code, switch over to this to allow common code to choose when the register_cpu() call is made. x86's struct cpus come from struct x86_cpu, which has no other members or users. Remove this and use the version defined by common code. This is an intermediate step to the logic being moved to drivers/acpi, where GENERIC_CPU_DEVICES will do the work when booting with acpi=off. This patch also has the effect of moving the registration of CPUs from subsys to driver core initialisation, prior to any initcalls running. ---- Changes since RFC: * Fixed the second copy of arch_register_cpu() used for non-hotplug Changes since RFC v2: * Remove duplicate of the weak generic arch_register_cpu(), spotted by Jonathan Cameron. Add note about initialisation order change. Changes since RFC v3: * Adapt to removal of EXPORT_SYMBOL()s Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R3l-00Cszm-UA@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | ACPI: Move ACPI_HOTPLUG_CPU to be disabled on arm64 and riscvJames Morse2023-12-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Neither arm64 nor riscv support physical hotadd of CPUs that were not present at boot. For arm64 much of the platform description is in static tables which do not have update methods. arm64 does support HOTPLUG_CPU, which is backed by a firmware interface to turn CPUs on and off. acpi_processor_hotadd_init() and acpi_processor_remove() are for adding and removing CPUs that were not present at boot. arm64 systems that do this are not supported as there is currently insufficient information in the platform description. (e.g. did the GICR get removed too?) arm64 currently relies on the MADT enabled flag check in map_gicc_mpidr() to prevent CPUs that were not described as present at boot from being added to the system. Similarly, riscv relies on the same check in map_rintc_hartid(). Both architectures also rely on the weak 'always fails' definitions of acpi_map_cpu() and arch_register_cpu(). Subsequent changes will redefine ACPI_HOTPLUG_CPU as making possible CPUs present. Neither arm64 nor riscv support this. Disable ACPI_HOTPLUG_CPU for arm64 and riscv by removing 'default y' and selecting it on the other three ACPI architectures. This allows the weak definitions of some symbols to be removed. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R31-00Csyt-Jq@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | x86/topology: remove arch_*register_cpu() exportsRussell King (Oracle)2023-12-061-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | arch_register_cpu() and arch_unregister_cpu() are not used by anything that can be a module - they are used by drivers/base/cpu.c and drivers/acpi/acpi_processor.c, neither of which can be a module. Remove the exports. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R2r-00Csyh-7B@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | x86: intel_epb: Don't rely on link orderJames Morse2023-12-061-1/+1
| | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | intel_epb_init() is called as a subsys_initcall() to register cpuhp callbacks. The callbacks make use of get_cpu_device() which will return NULL unless register_cpu() has been called. register_cpu() is called from topology_init(), which is also a subsys_initcall(). This is fragile. Moving the register_cpu() to a different subsys_initcall() leads to a NULL dereference during boot. Make intel_epb_init() a late_initcall(), user-space can't provide a policy before this point anyway. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Acked-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R2m-00Csyb-2S@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | | Merge tag 'pci-v6.8-changes' of ↵Linus Torvalds2024-01-177-117/+145
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci updates from Bjorn Helgaas: "Enumeration: - Reserve ECAM so we don't assign it to PCI BARs; this works around bugs where BIOS included ECAM in a PNP0A03 host bridge window, didn't reserve it via a PNP0C02 motherboard device, and didn't allocate space for SR-IOV VF BARs (Bjorn Helgaas) - Add MMCONFIG/ECAM debug logging (Bjorn Helgaas) - Rename 'MMCONFIG' to 'ECAM' to match spec usage (Bjorn Helgaas) - Log device type (Root Port, Switch Port, etc) during enumeration (Bjorn Helgaas) - Log bridges before downstream devices so the dmesg order is more logical (Bjorn Helgaas) - Log resource names (BAR 0, VF BAR 0, bridge window, etc) consistently instead of a mix of names and "reg 0x10" (Puranjay Mohan, Bjorn Helgaas) - Fix 64GT/s effective data rate calculation to use 1b/1b encoding rather than the 8b/10b or 128b/130b used by lower rates (Ilpo Järvinen) - Use PCI_HEADER_TYPE_* instead of literals in x86, powerpc, SCSI lpfc (Ilpo Järvinen) - Clean up open-coded PCIBIOS return code mangling (Ilpo Järvinen) Resource management: - Restructure pci_dev_for_each_resource() to avoid computing the address of an out-of-bounds array element (the bounds check was performed later so the element was never actually *read*, but it's nicer to avoid even computing an out-of-bounds address) (Andy Shevchenko) Driver binding: - Convert pci-host-common.c platform .remove() callback to .remove_new() returning 'void' since it's not useful to return error codes here (Uwe Kleine-König) - Convert exynos, keystone, kirin from .remove() to .remove_new(), which returns void instead of int (Uwe Kleine-König) - Drop unused struct pci_driver.node member (Mathias Krause) Virtualization: - Add ACS quirk for more Zhaoxin Root Ports (LeoLiuoc) Error handling: - Log AER errors as "Correctable" (not "Corrected") or "Uncorrectable" to match spec terminology (Bjorn Helgaas) - Decode Requester ID when no error info found instead of printing the raw hex value (Bjorn Helgaas) Endpoint framework: - Use a unique test pattern for each BAR in the pci_endpoint_test to make it easier to debug address translation issues (Niklas Cassel) Broadcom STB PCIe controller driver: - Add DT property "brcm,clkreq-mode" and driver support for different CLKREQ# modes to make ASPM L1.x states possible (Jim Quinlan) Freescale Layerscape PCIe controller driver: - Add suspend/resume support for Layerscape LS1043a and LS1021a, including software-managed PME_Turn_Off and transitions between L0, L2/L3_Ready Link states (Frank Li) MediaTek PCIe controller driver: - Clear MSI interrupt status before handler to avoid missing MSIs that occur after the handler (qizhong cheng) MediaTek PCIe Gen3 controller driver: - Update mediatek-gen3 translation window setup to handle MMIO space that is not a power of two in size (Jianjun Wang) Qualcomm PCIe controller driver: - Increase qcom iommu-map maxItems to accommodate SDX55 (five entries) and SDM845 (sixteen entries) (Krzysztof Kozlowski) - Describe qcom,pcie-sc8180x clocks and resets accurately (Krzysztof Kozlowski) - Describe qcom,pcie-sm8150 clocks and resets accurately (Krzysztof Kozlowski) - Correct the qcom "reset-name" property, previously incorrectly called "reset-names" (Krzysztof Kozlowski) - Document qcom,pcie-sm8650, based on qcom,pcie-sm8550 (Neil Armstrong) Renesas R-Car PCIe controller driver: - Replace of_device.h with explicit of.h include to untangle header usage (Rob Herring) - Add DT and driver support for optional miniPCIe 1.5v and 3.3v regulators on KingFisher (Wolfram Sang) SiFive FU740 PCIe controller driver: - Convert fu740 CONFIG_PCIE_FU740 dependency from SOC_SIFIVE to ARCH_SIFIVE (Conor Dooley) Synopsys DesignWare PCIe controller driver: - Align iATU mapping for endpoint MSI-X (Niklas Cassel) - Drop "host_" prefix from struct dw_pcie_host_ops members (Yoshihiro Shimoda) - Drop "ep_" prefix from struct dw_pcie_ep_ops members (Yoshihiro Shimoda) - Rename struct dw_pcie_ep_ops.func_conf_select() to .get_dbi_offset() to be more descriptive (Yoshihiro Shimoda) - Add Endpoint DBI accessors to encapsulate offset lookups (Yoshihiro Shimoda) TI J721E PCIe driver: - Add j721e DT and driver support for 'num-lanes' for devices that support x1, x2, or x4 Links (Matt Ranostay) - Add j721e DT compatible strings and driver support for j784s4 (Matt Ranostay) - Make TI J721E Kconfig depend on ARCH_K3 since the hardware is specific to those TI SoC parts (Peter Robinson) TI Keystone PCIe controller driver: - Hold power management references to all PHYs while enabling them to avoid a race when one provides clocks to others (Siddharth Vadapalli) Xilinx XDMA PCIe controller driver: - Remove redundant dev_err(), since platform_get_irq() and platform_get_irq_byname() already log errors (Yang Li) - Fix uninitialized symbols in xilinx_pl_dma_pcie_setup_irq() (Krzysztof Wilczyński) - Fix xilinx_pl_dma_pcie_init_irq_domain() error return when irq_domain_add_linear() fails (Harshit Mogalapalli) MicroSemi Switchtec management driver: - Do dma_mrpc cleanup during switchtec_pci_remove() to match its devm ioremapping in switchtec_pci_probe(). Previously the cleanup was done in stdev_release(), which used stale pointers if stdev->cdev happened to be open when the PCI device was removed (Daniel Stodden) Miscellaneous: - Convert interrupt terminology from "legacy" to "INTx" to be more specific and match spec terminology (Damien Le Moal) - In dw-xdata-pcie, pci_endpoint_test, and vmd, replace usage of deprecated ida_simple_*() API with ida_alloc() and ida_free() (Christophe JAILLET)" * tag 'pci-v6.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (97 commits) PCI: Fix kernel-doc issues PCI: brcmstb: Configure HW CLKREQ# mode appropriate for downstream device dt-bindings: PCI: brcmstb: Add property "brcm,clkreq-mode" PCI: mediatek-gen3: Fix translation window size calculation PCI: mediatek: Clear interrupt status before dispatching handler PCI: keystone: Fix race condition when initializing PHYs PCI: xilinx-xdma: Fix error code in xilinx_pl_dma_pcie_init_irq_domain() PCI: xilinx-xdma: Fix uninitialized symbols in xilinx_pl_dma_pcie_setup_irq() PCI: rcar-gen4: Fix -Wvoid-pointer-to-enum-cast error PCI: iproc: Fix -Wvoid-pointer-to-enum-cast warning PCI: dwc: Add dw_pcie_ep_{read,write}_dbi[2] helpers PCI: dwc: Rename .func_conf_select to .get_dbi_offset in struct dw_pcie_ep_ops PCI: dwc: Rename .ep_init to .init in struct dw_pcie_ep_ops PCI: dwc: Drop host prefix from struct dw_pcie_host_ops members misc: pci_endpoint_test: Use a unique test pattern for each BAR PCI: j721e: Make TI J721E depend on ARCH_K3 PCI: j721e: Add TI J784S4 PCIe configuration PCI/AER: Use explicit register sizes for struct members PCI/AER: Decode Requester ID when no error info found PCI/AER: Use 'Correctable' and 'Uncorrectable' spec terms for errors ...
| * \ \ \ \ Merge branch 'pci/enumeration'Bjorn Helgaas2024-01-153-11/+24
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Convert pci-host-common.c platform .remove() callback to .remove_new() returning 'void' since it's not useful to return error codes here (Uwe Kleine-König) - Log a message about updating AMD USB controller class code (so dwc3, not xhci, claims it) only when we actually change it (Guilherme G. Piccoli) - Use PCI_HEADER_TYPE_* instead of literals in x86, powerpc, SCSI lpfc (Ilpo Järvinen) - Clean up open-coded PCIBIOS return code mangling (Ilpo Järvinen) - Fix 64GT/s effective data rate calculation to use 1b/1b encoding rather than the 8b/10b or 128b/130b used by lower rates (Ilpo Järvinen) * pci/enumeration: PCI: Fix 64GT/s effective data rate calculation x86/pci: Clean up open-coded PCIBIOS return code mangling scsi: lpfc: Use PCI_HEADER_TYPE_MFD instead of literal powerpc/fsl-pci: Use PCI_HEADER_TYPE_MASK instead of literal x86/pci: Use PCI_HEADER_TYPE_* instead of literals PCI: Only override AMD USB controller if required PCI: host-generic: Convert to platform remove callback returning void
| | * | | | | x86/pci: Clean up open-coded PCIBIOS return code manglingIlpo Järvinen2023-12-011-7/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per PCI Firmware spec r3.3, sec 2.5.2, 2.6.2, and 2.7, the return code for these PCI BIOS interfaces is in 8 bits of the EAX register. Previously it was extracted by open-coded masks and shifting. Name the return code bits with a #define and add pcibios_get_return_code() to extract the return code to improve code readability. In addition, replace zero test with PCIBIOS_SUCCESSFUL. No functional changes intended. Link: https://lore.kernel.org/r/20231124085924.13830-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | | | x86/pci: Use PCI_HEADER_TYPE_* instead of literalsIlpo Järvinen2023-12-012-4/+3
| | | |_|_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace 0x7f and 0x80 literals with PCI_HEADER_TYPE_* defines. Link: https://lore.kernel.org/r/20231124090919.23687-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Reorder pci_mmcfg_arch_map() definition before callsBjorn Helgaas2023-12-051-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The typical style is to define functions before calling them. Move pci_mmcfg_arch_map() and pci_mmcfg_arch_unmap() earlier so they're defined before they're called. No functional change intended. Link: https://lore.kernel.org/r/20231121183643.249006-10-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Return pci_mmconfig_add() failure earlyBjorn Helgaas2023-12-051-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If pci_mmconfig_alloc() fails, return the failure early so it's obvious that the failure is the exception, and the success is the normal case. No functional change intended. Link: https://lore.kernel.org/r/20231121183643.249006-9-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Comment pci_mmconfig_insert() obscure MCFG dependencyBjorn Helgaas2023-12-051-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In pci_mmconfig_insert(), there's no reference to "addr" between locking pci_mmcfg_lock and testing "addr", so it *looks* like we should move the test before the lock. But 07f9b61c3915 ("x86/PCI: MMCONFIG: Check earlier for MMCONFIG region at address zero") did that, which broke things by returning -EINVAL when "addr" is zero instead of -EEXIST. So 07f9b61c3915 was reverted by 67d470e0e171 ("Revert "x86/PCI: MMCONFIG: Check earlier for MMCONFIG region at address zero""). Add a comment about this issue to prevent it from happening again. Link: https://lore.kernel.org/r/20231121183643.249006-8-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Rename pci_mmcfg_check_reserved() to pci_mmcfg_reserved()Bjorn Helgaas2023-12-051-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "pci_mmcfg_check_reserved()" doesn't give a hint about what the boolean return value means. Rename it to pci_mmcfg_reserved() so testing "if (pci_mmcfg_reserved())" makes sense. Update callers to treat the return value as boolean instead of comparing with 0. Link: https://lore.kernel.org/r/20231121183643.249006-7-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Rename acpi_mcfg_check_entry() to acpi_mcfg_valid_entry()Bjorn Helgaas2023-12-051-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "acpi_mcfg_check_entry()" doesn't give a hint about what the return value means. Rename it to "acpi_mcfg_valid_entry()", convert the return value to bool, and update the return values and callers to match so testing "if (acpi_mcfg_valid_entry())" makes sense. Link: https://lore.kernel.org/r/20231121183643.249006-6-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Rename 'MMCONFIG' to 'ECAM', use pr_fmtBjorn Helgaas2023-12-053-67/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "MMCONFIG" term is not used in PCI/PCIe specs. Replace it with "ECAM", the term used in PCIe r6.0, sec 7.2.2. Define pr_fmt() instead of repeating PREFIX in every log message. Link: https://lore.kernel.org/r/20231121183643.249006-5-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Add MCFG debug loggingBjorn Helgaas2023-12-052-5/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MCFG handling is a frequent source of problems. Add more logging to aid in debugging. Enable the logging with CONFIG_DYNAMIC_DEBUG=y and the kernel boot parameter 'dyndbg="file arch/x86/pci +p"'. Link: https://lore.kernel.org/r/20231121183643.249006-4-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Reword ECAM EfiMemoryMappedIO logging to avoid 'reserved'Bjorn Helgaas2023-12-051-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fd3a8cff4d4a ("x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space") added the concept of using the EFI memory map to help decide whether ECAM space mentioned in the MCFG table is valid. Unfortunately it described that EfiMemoryMappedIO space as "reserved", but it is actually not *reserved* by the EFI memory map. EfiMemoryMappedIO only means the firmware requested that the OS map this space for use by firmware runtime services. Change the dmesg logging to describe it as simply "EfiMemoryMappedIO", not as "reserved as EfiMemoryMappedIO". A previous commit actually *does* reserve the space if ACPI PNP0C01/02 devices haven't done so: - PCI: ECAM at [mem 0xe0000000-0xefffffff] reserved as EfiMemoryMappedIO + PCI: ECAM at [mem 0xe0000000-0xefffffff] is EfiMemoryMappedIO; assuming valid PCI: ECAM [mem 0xe0000000-0xefffffff] reserved to work around lack of ACPI motherboard _CRS Link: https://lore.kernel.org/r/20231121183643.249006-3-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Reserve ECAM if BIOS didn't include it in PNP0C02 _CRSBjorn Helgaas2023-12-051-1/+12
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tomasz, Sebastian, and some Proxmox users reported problems initializing ixgbe NICs. I think the problem is that ECAM space described in the ACPI MCFG table is not reserved via a PNP0C02 _CRS method as required by the PCI Firmware spec (r3.3, sec 4.1.2), but it *is* included in the PNP0A03 host bridge _CRS as part of the MMIO aperture. If we allocate space for a PCI BAR, we're likely to allocate it from that ECAM space, which obviously cannot work. This could happen for any device, but in the ixgbe case it happens because it's an SR-IOV device and the BIOS didn't allocate space for the VF BARs, so Linux reallocated the bridge window leading to ixgbe and put it on top of the ECAM space. From Tomasz' system: PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000) PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] not reserved in ACPI motherboard resources pci_bus 0000:00: root bus resource [mem 0x80000000-0xfbffffff window] pci 0000:00:01.1: PCI bridge to [bus 02-03] pci 0000:00:01.1: bridge window [mem 0xfb900000-0xfbbfffff] pci 0000:02:00.0: [8086:10fb] type 00 class 0x020000 # ixgbe pci 0000:02:00.0: reg 0x10: [mem 0xfba80000-0xfbafffff 64bit] pci 0000:02:00.0: VF(n) BAR0 space: [mem 0x00000000-0x000fffff 64bit] (contains BAR0 for 64 VFs) pci 0000:02:00.0: BAR 7: no space for [mem size 0x00100000 64bit] # VF BAR 0 pci_bus 0000:00: No. 2 try to assign unassigned res pci 0000:00:01.1: resource 14 [mem 0xfb900000-0xfbbfffff] released pci 0000:00:01.1: BAR 14: assigned [mem 0x80000000-0x806fffff] pci 0000:02:00.0: BAR 0: assigned [mem 0x80000000-0x8007ffff 64bit] pci 0000:02:00.0: BAR 7: assigned [mem 0x80204000-0x80303fff 64bit] # VF BAR 0 Fixes: 07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map") Fixes: fd3a8cff4d4a ("x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space") Reported-by: Tomasz Pala <gotar@polanet.pl> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218050 Reported-by: Sebastian Manciulea <manciuleas@protonmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218107 Link: https://forum.proxmox.com/threads/proxmox-8-kernel-6-2-16-4-pve-ixgbe-driver-fails-to-load-due-to-pci-device-probing-failure.131203/ Link: https://lore.kernel.org/r/20231121183643.249006-2-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org # v6.2+
* | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2024-01-1752-1049/+1825
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
| * | | | | x86/kvm: Do not try to disable kvmclock if it was not enabledKirill A. Shutemov2024-01-081-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is present in the VM. It leads to write to a MSR that doesn't exist on some configurations, namely in TDX guest: unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000) at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30) kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt features. Do not disable kvmclock if it was not enabled. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown") Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Message-Id: <20231205004510.27164-6-kirill.shutemov@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | Merge tag 'kvm-x86-mmu-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-01-084-63/+54
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM x86 MMU changes for 6.8: - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read versus write so that KVM doesn't pass "bool shared" all over the place just to have precise assertions in paths that don't actually care about whether the caller is a reader or a writer.
| | * | | | | KVM: x86/mmu: fix comment about mmu_unsync_pages_lockPaolo Bonzini2023-12-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the comment about what can and cannot happen when mmu_unsync_pages_lock is not help. The comment correctly mentions "clearing sp->unsync", but then it talks about unsync going from 0 to 1. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231125083400.1399197-5-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | KVM: x86/mmu: always take tdp_mmu_pages_lockPaolo Bonzini2023-12-012-25/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is cheap to take tdp_mmu_pages_lock in all write-side critical sections. We already do it all the time when zapping with read_lock(), so it is not a problem to do it from the kvm_tdp_mmu_zap_all() path (aka kvm_arch_flush_shadow_all(), aka VM destruction and MMU notifier release). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231125083400.1399197-4-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | KVM: x86/mmu: remove unnecessary "bool shared" argument from iteratorsPaolo Bonzini2023-12-011-25/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "bool shared" argument is more or less unnecessary in the for_each_*_tdp_mmu_root_yield_safe() macros. Many users check for the lock before calling it; all of them either call small functions that do the check, or end up calling tdp_mmu_set_spte_atomic() and tdp_mmu_iter_set_spte(). Add a few assertions to make up for the lost check in for_each_*_tdp_mmu_root_yield_safe(), but even this is probably overkill and mostly for documentation reasons. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231125083400.1399197-3-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | KVM: x86/mmu: remove unnecessary "bool shared" argument from functionsPaolo Bonzini2023-12-013-16/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Neither tdp_mmu_next_root nor kvm_tdp_mmu_put_root need to know if the lock is taken for read or write. Either way, protection is achieved via RCU and tdp_mmu_pages_lock. Remove the argument and just assert that the lock is taken. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231125083400.1399197-2-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | KVM: x86/mmu: Check for leaf SPTE when clearing dirty bit in the TDP MMUDavid Matlack2023-12-011-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Re-check that the given SPTE is still a leaf and present SPTE after a failed cmpxchg in clear_dirty_gfn_range(). clear_dirty_gfn_range() intends to only operate on present leaf SPTEs, but that could change after a failed cmpxchg. A check for present was added in commit 3354ef5a592d ("KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU") but the check for leaf is still buried in tdp_root_for_each_leaf_pte() and does not get rechecked on retry. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20231027172640.2335197-3-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | KVM: x86/mmu: Fix off-by-1 when splitting huge pages during CLEARDavid Matlack2023-12-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix an off-by-1 error when passing in the range of pages to kvm_mmu_try_split_huge_pages() during CLEAR_DIRTY_LOG. Specifically, end is the last page that needs to be split (inclusive) so pass in `end + 1` since kvm_mmu_try_split_huge_pages() expects the `end` to be non-inclusive. At worst this will cause a huge page to be write-protected instead of eagerly split, which is purely a performance issue, not a correctness issue. But even that is unlikely as it would require userspace pass in a bitmap where the last page is the only 4K page on a huge page that needs to be split. Reported-by: Vipin Sharma <vipinsh@google.com> Fixes: cb00a70bd4b7 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG") Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20231027172640.2335197-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| * | | | | | Merge tag 'kvm-x86-xen-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-01-082-6/+31
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM Xen change for 6.8: To workaround Xen guests that don't expect Xen PV clocks to be marked as being based on a stable TSC, add a Xen config knob to allow userspace to opt out of KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't entirely blameless for the buggy guest behavior.
| | * | | | | | KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BITPaul Durrant2023-12-072-6/+31
| | |/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unless explicitly told to do so (by passing 'clocksource=tsc' and 'tsc=stable:socket', and then jumping through some hoops concerning potential CPU hotplug) Xen will never use TSC as its clocksource. Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set in either the primary or secondary pvclock memory areas. This has led to bugs in some guest kernels which only become evident if PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support such guests, give the VMM a new Xen HVM config flag to tell KVM to forcibly clear the bit in the Xen pvclocks. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20231102162128.2353459-1-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
| * | | | | | Merge tag 'kvm-x86-svm-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-01-083-19/+21
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM SVM changes for 6.8: - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - Fix a benign NMI virtualization bug where KVM would unnecessarily intercept IRET when manually injecting an NMI, e.g. when KVM pends an NMI and injects a second, "simultaneous" NMI.
| | * | | | | | KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabledSean Christopherson2023-11-301-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When vNMI is enabled, rely entirely on hardware to correctly handle NMI blocking, i.e. don't intercept IRET to detect when NMIs are no longer blocked. KVM already correctly ignores svm->nmi_masked when vNMI is enabled, so the effect of the bug is essentially an unnecessary VM-Exit. KVM intercepts IRET for two reasons: - To track NMI masking to be able to know at any point of time if NMI is masked. - To track NMI windows (to inject another NMI after the guest executes IRET, i.e. unblocks NMIs) When vNMI is enabled, both cases are handled by hardware: - NMI masking state resides in int_ctl.V_NMI_BLOCKING and can be read by KVM at will. - Hardware automatically "injects" pending virtual NMIs when virtual NMIs become unblocked. However, even though pending a virtual NMI for hardware to handle is the most common way to synthesize a guest NMI, KVM may still directly inject an NMI via when KVM is handling two "simultaneous" NMIs (see comments in process_nmi() for details on KVM's simultaneous NMI handling). Per AMD's APM, hardware sets the BLOCKING flag when software directly injects an NMI as well, i.e. KVM doesn't need to manually mark vNMIs as blocked: If Event Injection is used to inject an NMI when NMI Virtualization is enabled, VMRUN sets V_NMI_MASK in the guest state. Note, it's still possible that KVM could trigger a spurious IRET VM-Exit. When running a nested guest, KVM disables vNMI for L2 and thus will enable IRET interception (in both vmcb01 and vmcb02) while running L2 reason. If a nested VM-Exit happens before L2 executes IRET, KVM can end up running L1 with vNMI enable and IRET intercepted. This is also a benign bug, and even less likely to happen, i.e. can be safely punted to a future fix. Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI") Link: https://lore.kernel.org/all/ZOdnuDZUd4mevCqe@google.como Cc: Santosh Shukla <santosh.shukla@amd.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Santosh Shukla <santosh.shukla@amd.com> Link: https://lore.kernel.org/r/20231018192021.1893261-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: SVM: Explicitly require FLUSHBYASID to enable SEV supportSean Christopherson2023-11-301-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a sanity check that FLUSHBYASID is available if SEV is supported in hardware, as SEV (and beyond) guests are bound to a single ASID, i.e. KVM can't "flush" by assigning a new, fresh ASID to the guest. If FLUSHBYASID isn't supported for some bizarre reason, KVM would completely fail to do TLB flushes for SEV+ guests (see pre_svm_run() and pre_sev_run()). Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20231018193617.1895752-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: nSVM: Advertise support for flush-by-ASIDSean Christopherson2023-11-301-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Advertise support for FLUSHBYASID when nested SVM is enabled, as KVM can always emulate flushing TLB entries for a vmcb12 ASID, e.g. by running L2 with a new, fresh ASID in vmcb02. Some modern hypervisors, e.g. VMWare Workstation 17, require FLUSHBYASID support and will refuse to run if it's not present. Punt on proper support, as "Honor L1's request to flush an ASID on nested VMRUN" is one of the TODO items in the (incomplete) list of issues that need to be addressed in order for KVM to NOT do a full TLB flush on every nested SVM transition (see nested_svm_transition_tlb_flush()). Reported-by: Stefan Sterz <s.sterz@proxmox.com> Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231018194104.1896415-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | Revert "nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB"Sean Christopherson2023-11-301-15/+0
| | |/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert KVM's made-up consistency check on SVM's TLB control. The APM says that unsupported encodings are reserved, but the APM doesn't state that VMRUN checks for a supported encoding. Unless something is called out in "Canonicalization and Consistency Checks" or listed as MBZ (Must Be Zero), AMD behavior is typically to let software shoot itself in the foot. This reverts commit 174a921b6975ef959dd82ee9e8844067a62e3ec1. Fixes: 174a921b6975 ("nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB") Reported-by: Stefan Sterz <s.sterz@proxmox.com> Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231018194104.1896415-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| * | | | | | Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-01-0818-30/+134
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM x86 support for virtualizing Linear Address Masking (LAM) Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality checks for most virtual address usage in 64-bit mode, such that only the most significant bit of the untranslated address bits must match the polarity of the last translated address bit. This allows software to use ignored, untranslated address bits for metadata, e.g. to efficiently tag pointers for address sanitization. LAM can be enabled separately for user pointers and supervisor pointers, and for userspace LAM can be select between 48-bit and 57-bit masking - 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15. - 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6. For user pointers, LAM enabling utilizes two previously-reserved high bits from CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and 61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.: - CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers - CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers - CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.: - CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers - CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers The modified LAM canonicality checks: - LAM_S48 : [ 1 ][ metadata ][ 1 ] 63 47 - LAM_U48 : [ 0 ][ metadata ][ 0 ] 63 47 - LAM_S57 : [ 1 ][ metadata ][ 1 ] 63 56 - LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ] 63 56 - LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ] 63 56..47 The bulk of KVM support for LAM is to emulate LAM's modified canonicality checks. The approach taken by KVM is to "fill" the metadata bits using the highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value from the raw, untagged virtual address is kept for the canonicality check. This untagging allows Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26], enabling via CR3 and CR4 bits, etc.
| | * | | | | | KVM: x86: Use KVM-governed feature framework to track "LAM enabled"Binbin Wu2023-11-284-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the governed feature framework to track if Linear Address Masking (LAM) is "enabled", i.e. if LAM can be used by the guest. Using the framework to avoid the relative expensive call guest_cpuid_has() during cr3 and vmexit handling paths for LAM. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-14-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Advertise and enable LAM (user and supervisor)Robert Hoo2023-11-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | LAM is enumerated by CPUID.7.1:EAX.LAM[bit 26]. Advertise the feature to userspace and enable it as the final step after the LAM virtualization support for supervisor and user pointers. SGX LAM support is not advertised yet. SGX LAM support is enumerated in SGX's own CPUID and there's no hard requirement that it must be supported when LAM is reported in CPUID leaf 0x7. Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Jingqi Liu <jingqi.liu@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-13-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Virtualize LAM for user pointerRobert Hoo2023-11-283-3/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support to allow guests to set the new CR3 control bits for Linear Address Masking (LAM) and add implementation to get untagged address for user pointers. LAM modifies the canonical check for 64-bit linear addresses, allowing software to use the masked/ignored address bits for metadata. Hardware masks off the metadata bits before using the linear addresses to access memory. LAM uses two new CR3 non-address bits, LAM_U48 (bit 62) and LAM_U57 (bit 61), to configure LAM for user pointers. LAM also changes VMENTER to allow both bits to be set in VMCS's HOST_CR3 and GUEST_CR3 for virtualization. When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set any of the two LAM control bits. However, when EPT is off, the actual CR3 used by the guest is generated from the shadow MMU root which is different from the CR3 that is *set* by the guest, and KVM needs to manually apply any active control bits to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest. KVM manually checks guest's CR3 to make sure it points to a valid guest physical address (i.e. to support smaller MAXPHYSADDR in the guest). Extend this check to allow the two LAM control bits to be set. After check, LAM bits of guest CR3 will be stripped off to extract guest physical address. In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3 and GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly. KVM also manually checks VMCS12's HOST_CR3 and GUEST_CR3 being valid physical address. Extend such check to allow the new LAM control bits too. Note, LAM doesn't have a global control bit to turn on/off LAM completely, but purely depends on hardware's CPUID to determine it can be enabled or not. That means, when EPT is on, even when KVM doesn't expose LAM to guest, the guest can still set LAM control bits in CR3 w/o causing problem. This is an unfortunate virtualization hole. KVM could choose to intercept CR3 in this case and inject fault but this would hurt performance when running a normal VM w/o LAM support. This is undesirable. Just choose to let the guest do such illegal thing as the worst case is guest being killed when KVM eventually find out such illegal behaviour and that the guest is misbehaving. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-12-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Virtualize LAM for supervisor pointerRobert Hoo2023-11-283-2/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support to allow guests to set the new CR4 control bit for LAM and add implementation to get untagged address for supervisor pointers. LAM modifies the canonicality check applied to 64-bit linear addresses for data accesses, allowing software to use of the untranslated address bits for metadata and masks the metadata bits before using them as linear addresses to access memory. LAM uses CR4.LAM_SUP (bit 28) to configure and enable LAM for supervisor pointers. It also changes VMENTER to allow the bit to be set in VMCS's HOST_CR4 and GUEST_CR4 to support virtualization. Note CR4.LAM_SUP is allowed to be set even not in 64-bit mode, but it will not take effect since LAM only applies to 64-bit linear addresses. Move CR4.LAM_SUP out of CR4_RESERVED_BITS, its reservation depends on vcpu supporting LAM or not. Leave it intercepted to prevent guest from setting the bit if LAM is not exposed to guest as well as to avoid vmread every time when KVM fetches its value, with the expectation that guest won't toggle the bit frequently. Set CR4.LAM_SUP bit in the emulated IA32_VMX_CR4_FIXED1 MSR for guests to allow guests to enable LAM for supervisor pointers in nested VMX operation. Hardware is not required to do TLB flush when CR4.LAM_SUP toggled, KVM doesn't need to emulate TLB flush based on it. There's no other features or vmx_exec_controls connection, and no other code needed in {kvm,vmx}_set_cr4(). Skip address untag for instruction fetches (which includes branch targets), operand of INVLPG instructions, and implicit system accesses, all of which are not subject to untagging. Note, get_untagged_addr() isn't invoked for implicit system accesses as there is no reason to do so, but check the flag anyways for documentation purposes. Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-11-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Untag addresses for LAM emulation where applicableBinbin Wu2023-11-285-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stub in vmx_get_untagged_addr() and wire up calls from the emulator (via get_untagged_addr()) and "direct" calls from various VM-Exit handlers in VMX where LAM untagging is supposed to be applied. Defer implementing the guts of vmx_get_untagged_addr() to future patches purely to make the changes easier to consume. LAM is active only for 64-bit linear addresses and several types of accesses are exempted. - Cases need to untag address (handled in get_vmx_mem_address()) Operand(s) of VMX instructions and INVPCID. Operand(s) of SGX ENCLS. - Cases LAM doesn't apply to (no change needed) Operand of INVLPG. Linear address in INVPCID descriptor. Linear address in INVVPID descriptor. BASEADDR specified in SECS of ECREATE. Note: - LAM doesn't apply to write to control registers or MSRs - LAM masking is applied before walking page tables, i.e. the faulting linear address in CR2 doesn't contain the metadata. - The guest linear address saved in VMCS doesn't contain metadata. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-10-binbin.wu@linux.intel.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Introduce get_untagged_addr() in kvm_x86_ops and call it in emulatorBinbin Wu2023-11-285-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a new interface get_untagged_addr() to kvm_x86_ops to untag the metadata from linear address. Call the interface in linearization of instruction emulator for 64-bit mode. When enabled feature like Intel Linear Address Masking (LAM) or AMD Upper Address Ignore (UAI), linear addresses may be tagged with metadata that needs to be dropped prior to canonicality checks, i.e. the metadata is ignored. Introduce get_untagged_addr() to kvm_x86_ops to hide the vendor specific code, as sadly LAM and UAI have different semantics. Pass the emulator flags to allow vendor specific implementation to precisely identify the access type (LAM doesn't untag certain accesses). Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-9-binbin.wu@linux.intel.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Remove kvm_vcpu_is_illegal_gpa()Binbin Wu2023-11-283-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove kvm_vcpu_is_illegal_gpa() and use !kvm_vcpu_is_legal_gpa() instead. The "illegal" helper actually predates the "legal" helper, the only reason the "illegal" variant wasn't removed by commit 4bda0e97868a ("KVM: x86: Add a helper to check for a legal GPA") was to avoid code churn. Now that CR3 has a dedicated helper, there are fewer callers, and so the code churn isn't that much of a deterrent. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-8-binbin.wu@linux.intel.com [sean: provide a bit of history in the changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Add & use kvm_vcpu_is_legal_cr3() to check CR3's legalityBinbin Wu2023-11-284-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add and use kvm_vcpu_is_legal_cr3() to check CR3's legality to provide a clear distinction between CR3 and GPA checks. This will allow exempting bits from kvm_vcpu_is_legal_cr3() without affecting general GPA checks, e.g. for upcoming features that will use high bits in CR3 for feature enabling. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-7-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86/mmu: Drop non-PA bits when getting GFN for guest's PGDBinbin Wu2023-11-283-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drop non-PA bits when getting GFN for guest's PGD with the maximum theoretical mask for guest MAXPHYADDR. Do it unconditionally because it's harmless for 32-bit guests, querying 64-bit mode would be more expensive, and for EPT the mask isn't tied to guest mode. Using PT_BASE_ADDR_MASK would be technically wrong (PAE paging has 64-bit elements _except_ for CR3, which has only 32 valid bits), it wouldn't matter in practice though. Opportunistically use GENMASK_ULL() to define __PT_BASE_ADDR_MASK. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-6-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Add X86EMUL_F_INVLPG and pass it in em_invlpg()Binbin Wu2023-11-282-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an emulation flag X86EMUL_F_INVLPG, which is used to identify an instruction that does TLB invalidation without true memory access. Only invlpg & invlpga implemented in emulator belong to this kind. invlpga doesn't need additional information for emulation. Just pass the flag to em_invlpg(). Linear Address Masking (LAM) and Linear Address Space Separation (LASS) don't apply to addresses that are inputs to TLB invalidation. The flag will be consumed to support LAM/LASS virtualization. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-5-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Add an emulation flag for implicit system accessBinbin Wu2023-11-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an emulation flag X86EMUL_F_IMPLICIT to identify implicit system access in instruction emulation. Don't bother wiring up any usage at this point, as Linear Address Space Separation (LASS) will be the first "real" consumer of the flag and LASS support will require dedicated hooks, i.e. there aren't any existing calls where passing X86EMUL_F_IMPLICIT is meaningful. Add the IMPLICIT flag even though there's no imminent usage so that Linear Address Masking (LAM) support can reference the flag to document that addresses for implicit accesses aren't untagged. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-4-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | | KVM: x86: Consolidate flags for __linearize()Binbin Wu2023-11-282-10/+15
| | |/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Consolidate @write and @fetch of __linearize() into a set of flags so that additional flags can be added without needing more/new boolean parameters, to precisely identify the access type. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20230913124227.12574-2-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| * | | | | | Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-01-087-109/+137
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM x86 PMU changes for 6.8: - Fix a variety of bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow.
| | * | | | | | KVM: x86/pmu: Track emulated counter events instead of previous counterSean Christopherson2023-11-303-15/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Explicitly track emulated counter events instead of using the common counter value that's shared with the hardware counter owned by perf. Bumping the common counter requires snapshotting the pre-increment value in order to detect overflow from emulation, and the snapshot approach is inherently flawed. Snapshotting the previous counter at every increment assumes that there is at most one emulated counter event per emulated instruction (or rather, between checks for KVM_REQ_PMU). That's mostly holds true today because KVM only emulates (branch) instructions retired, but the approach will fall apart if KVM ever supports event types that don't have a 1:1 relationship with instructions. And KVM already has a relevant bug, as handle_invalid_guest_state() emulates multiple instructions without checking KVM_REQ_PMU, i.e. could miss an overflow event due to clobbering pmc->prev_counter. Not checking KVM_REQ_PMU is problematic in both cases, but at least with the emulated counter approach, the resulting behavior is delayed overflow detection, as opposed to completely lost detection. Tracking the emulated count fixes another bug where the snapshot approach can signal spurious overflow due to incorporating both the emulated count and perf's count in the check, i.e. if overflow is detected by perf, then KVM's emulation will also incorrectly signal overflow. Add a comment in the related code to call out the need to process emulated events *after* pausing the perf event (big kudos to Mingwei for figuring out that particular wrinkle). Cc: Mingwei Zhang <mizhang@google.com> Cc: Roman Kagan <rkagan@amazon.de> Cc: Jim Mattson <jmattson@google.com> Cc: Dapeng Mi <dapeng1.mi@linux.intel.com> Cc: Like Xu <like.xu.linux@gmail.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20231103230541.352265-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>