diff options
Diffstat (limited to 'Documentation')
29 files changed, 787 insertions, 522 deletions
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block index 3879963f0f01..11545c9e2e93 100644 --- a/Documentation/ABI/stable/sysfs-block +++ b/Documentation/ABI/stable/sysfs-block @@ -77,7 +77,7 @@ Description: What: /sys/block/<disk>/diskseq Date: February 2021 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: The /sys/block/<disk>/diskseq files reports the disk sequence number, which is a monotonically increasing diff --git a/Documentation/ABI/testing/sysfs-kernel-reboot b/Documentation/ABI/testing/sysfs-kernel-reboot index e117aba46be0..52571fd5ddba 100644 --- a/Documentation/ABI/testing/sysfs-kernel-reboot +++ b/Documentation/ABI/testing/sysfs-kernel-reboot @@ -1,7 +1,7 @@ What: /sys/kernel/reboot Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: Interface to set the kernel reboot behavior, similarly to what can be done via the reboot= cmdline option. (see Documentation/admin-guide/kernel-parameters.txt) @@ -9,25 +9,25 @@ Description: Interface to set the kernel reboot behavior, similarly to What: /sys/kernel/reboot/mode Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: Reboot mode. Valid values are: cold warm hard soft gpio What: /sys/kernel/reboot/type Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: Reboot type. Valid values are: bios acpi kbd triple efi pci What: /sys/kernel/reboot/cpu Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: CPU number to use to reboot. What: /sys/kernel/reboot/force Date: November 2020 KernelVersion: 5.11 -Contact: Matteo Croce <mcroce@microsoft.com> +Contact: Matteo Croce <teknoraver@meta.com> Description: Don't wait for any other CPUs on reboot and avoid anything that could hang. diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index ff0b440ef2dc..451874b8135d 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -22,3 +22,4 @@ are configurable at compile, boot or run time. srso gather_data_sampling reg-file-data-sampling + rsb diff --git a/Documentation/admin-guide/hw-vuln/rsb.rst b/Documentation/admin-guide/hw-vuln/rsb.rst new file mode 100644 index 000000000000..21dbf9cf25f8 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/rsb.rst @@ -0,0 +1,268 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +RSB-related mitigations +======================= + +.. warning:: + Please keep this document up-to-date, otherwise you will be + volunteered to update it and convert it to a very long comment in + bugs.c! + +Since 2018 there have been many Spectre CVEs related to the Return Stack +Buffer (RSB) (sometimes referred to as the Return Address Stack (RAS) or +Return Address Predictor (RAP) on AMD). + +Information about these CVEs and how to mitigate them is scattered +amongst a myriad of microarchitecture-specific documents. + +This document attempts to consolidate all the relevant information in +once place and clarify the reasoning behind the current RSB-related +mitigations. It's meant to be as concise as possible, focused only on +the current kernel mitigations: what are the RSB-related attack vectors +and how are they currently being mitigated? + +It's *not* meant to describe how the RSB mechanism operates or how the +exploits work. More details about those can be found in the references +below. + +Rather, this is basically a glorified comment, but too long to actually +be one. So when the next CVE comes along, a kernel developer can +quickly refer to this as a refresher to see what we're actually doing +and why. + +At a high level, there are two classes of RSB attacks: RSB poisoning +(Intel and AMD) and RSB underflow (Intel only). They must each be +considered individually for each attack vector (and microarchitecture +where applicable). + +---- + +RSB poisoning (Intel and AMD) +============================= + +SpectreRSB +~~~~~~~~~~ + +RSB poisoning is a technique used by SpectreRSB [#spectre-rsb]_ where +an attacker poisons an RSB entry to cause a victim's return instruction +to speculate to an attacker-controlled address. This can happen when +there are unbalanced CALLs/RETs after a context switch or VMEXIT. + +* All attack vectors can potentially be mitigated by flushing out any + poisoned RSB entries using an RSB filling sequence + [#intel-rsb-filling]_ [#amd-rsb-filling]_ when transitioning between + untrusted and trusted domains. But this has a performance impact and + should be avoided whenever possible. + + .. DANGER:: + **FIXME**: Currently we're flushing 32 entries. However, some CPU + models have more than 32 entries. The loop count needs to be + increased for those. More detailed information is needed about RSB + sizes. + +* On context switch, the user->user mitigation requires ensuring the + RSB gets filled or cleared whenever IBPB gets written [#cond-ibpb]_ + during a context switch: + + * AMD: + On Zen 4+, IBPB (or SBPB [#amd-sbpb]_ if used) clears the RSB. + This is indicated by IBPB_RET in CPUID [#amd-ibpb-rsb]_. + + On Zen < 4, the RSB filling sequence [#amd-rsb-filling]_ must be + always be done in addition to IBPB [#amd-ibpb-no-rsb]_. This is + indicated by X86_BUG_IBPB_NO_RET. + + * Intel: + IBPB always clears the RSB: + + "Software that executed before the IBPB command cannot control + the predicted targets of indirect branches executed after the + command on the same logical processor. The term indirect branch + in this context includes near return instructions, so these + predicted targets may come from the RSB." [#intel-ibpb-rsb]_ + +* On context switch, user->kernel attacks are prevented by SMEP. User + space can only insert user space addresses into the RSB. Even + non-canonical addresses can't be inserted due to the page gap at the + end of the user canonical address space reserved by TASK_SIZE_MAX. + A SMEP #PF at instruction fetch prevents the kernel from speculatively + executing user space. + + * AMD: + "Finally, branches that are predicted as 'ret' instructions get + their predicted targets from the Return Address Predictor (RAP). + AMD recommends software use a RAP stuffing sequence (mitigation + V2-3 in [2]) and/or Supervisor Mode Execution Protection (SMEP) + to ensure that the addresses in the RAP are safe for + speculation. Collectively, we refer to these mitigations as "RAP + Protection"." [#amd-smep-rsb]_ + + * Intel: + "On processors with enhanced IBRS, an RSB overwrite sequence may + not suffice to prevent the predicted target of a near return + from using an RSB entry created in a less privileged predictor + mode. Software can prevent this by enabling SMEP (for + transitions from user mode to supervisor mode) and by having + IA32_SPEC_CTRL.IBRS set during VM exits." [#intel-smep-rsb]_ + +* On VMEXIT, guest->host attacks are mitigated by eIBRS (and PBRSB + mitigation if needed): + + * AMD: + "When Automatic IBRS is enabled, the internal return address + stack used for return address predictions is cleared on VMEXIT." + [#amd-eibrs-vmexit]_ + + * Intel: + "On processors with enhanced IBRS, an RSB overwrite sequence may + not suffice to prevent the predicted target of a near return + from using an RSB entry created in a less privileged predictor + mode. Software can prevent this by enabling SMEP (for + transitions from user mode to supervisor mode) and by having + IA32_SPEC_CTRL.IBRS set during VM exits. Processors with + enhanced IBRS still support the usage model where IBRS is set + only in the OS/VMM for OSes that enable SMEP. To do this, such + processors will ensure that guest behavior cannot control the + RSB after a VM exit once IBRS is set, even if IBRS was not set + at the time of the VM exit." [#intel-eibrs-vmexit]_ + + Note that some Intel CPUs are susceptible to Post-barrier Return + Stack Buffer Predictions (PBRSB) [#intel-pbrsb]_, where the last + CALL from the guest can be used to predict the first unbalanced RET. + In this case the PBRSB mitigation is needed in addition to eIBRS. + +AMD RETBleed / SRSO / Branch Type Confusion +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On AMD, poisoned RSB entries can also be created by the AMD RETBleed +variant [#retbleed-paper]_ [#amd-btc]_ or by Speculative Return Stack +Overflow [#amd-srso]_ (Inception [#inception-paper]_). The kernel +protects itself by replacing every RET in the kernel with a branch to a +single safe RET. + +---- + +RSB underflow (Intel only) +========================== + +RSB Alternate (RSBA) ("Intel Retbleed") +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Some Intel Skylake-generation CPUs are susceptible to the Intel variant +of RETBleed [#retbleed-paper]_ (Return Stack Buffer Underflow +[#intel-rsbu]_). If a RET is executed when the RSB buffer is empty due +to mismatched CALLs/RETs or returning from a deep call stack, the branch +predictor can fall back to using the Branch Target Buffer (BTB). If a +user forces a BTB collision then the RET can speculatively branch to a +user-controlled address. + +* Note that RSB filling doesn't fully mitigate this issue. If there + are enough unbalanced RETs, the RSB may still underflow and fall back + to using a poisoned BTB entry. + +* On context switch, user->user underflow attacks are mitigated by the + conditional IBPB [#cond-ibpb]_ on context switch which effectively + clears the BTB: + + * "The indirect branch predictor barrier (IBPB) is an indirect branch + control mechanism that establishes a barrier, preventing software + that executed before the barrier from controlling the predicted + targets of indirect branches executed after the barrier on the same + logical processor." [#intel-ibpb-btb]_ + +* On context switch and VMEXIT, user->kernel and guest->host RSB + underflows are mitigated by IBRS or eIBRS: + + * "Enabling IBRS (including enhanced IBRS) will mitigate the "RSBU" + attack demonstrated by the researchers. As previously documented, + Intel recommends the use of enhanced IBRS, where supported. This + includes any processor that enumerates RRSBA but not RRSBA_DIS_S." + [#intel-rsbu]_ + + However, note that eIBRS and IBRS do not mitigate intra-mode attacks. + Like RRSBA below, this is mitigated by clearing the BHB on kernel + entry. + + As an alternative to classic IBRS, call depth tracking (combined with + retpolines) can be used to track kernel returns and fill the RSB when + it gets close to being empty. + +Restricted RSB Alternate (RRSBA) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Some newer Intel CPUs have Restricted RSB Alternate (RRSBA) behavior, +which, similar to RSBA described above, also falls back to using the BTB +on RSB underflow. The only difference is that the predicted targets are +restricted to the current domain when eIBRS is enabled: + +* "Restricted RSB Alternate (RRSBA) behavior allows alternate branch + predictors to be used by near RET instructions when the RSB is + empty. When eIBRS is enabled, the predicted targets of these + alternate predictors are restricted to those belonging to the + indirect branch predictor entries of the current prediction domain. + [#intel-eibrs-rrsba]_ + +When a CPU with RRSBA is vulnerable to Branch History Injection +[#bhi-paper]_ [#intel-bhi]_, an RSB underflow could be used for an +intra-mode BTI attack. This is mitigated by clearing the BHB on +kernel entry. + +However if the kernel uses retpolines instead of eIBRS, it needs to +disable RRSBA: + +* "Where software is using retpoline as a mitigation for BHI or + intra-mode BTI, and the processor both enumerates RRSBA and + enumerates RRSBA_DIS controls, it should disable this behavior." + [#intel-retpoline-rrsba]_ + +---- + +References +========== + +.. [#spectre-rsb] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://arxiv.org/pdf/1807.07940.pdf>`_ + +.. [#intel-rsb-filling] "Empty RSB Mitigation on Skylake-generation" in `Retpoline: A Branch Target Injection Mitigation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/retpoline-branch-target-injection-mitigation.html#inpage-nav-5-1>`_ + +.. [#amd-rsb-filling] "Mitigation V2-3" in `Software Techniques for Managing Speculation <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf>`_ + +.. [#cond-ibpb] Whether IBPB is written depends on whether the prev and/or next task is protected from Spectre attacks. It typically requires opting in per task or system-wide. For more details see the documentation for the ``spectre_v2_user`` cmdline option in Documentation/admin-guide/kernel-parameters.txt. + +.. [#amd-sbpb] IBPB without flushing of branch type predictions. Only exists for AMD. + +.. [#amd-ibpb-rsb] "Function 8000_0008h -- Processor Capacity Parameters and Extended Feature Identification" in `AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf>`_. SBPB behaves the same way according to `this email <https://lore.kernel.org/5175b163a3736ca5fd01cedf406735636c99a>`_. + +.. [#amd-ibpb-no-rsb] `Spectre Attacks: Exploiting Speculative Execution <https://comsec.ethz.ch/wp-content/files/ibpb_sp25.pdf>`_ + +.. [#intel-ibpb-rsb] "Introduction" in `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_ + +.. [#amd-smep-rsb] "Existing Mitigations" in `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_ + +.. [#intel-smep-rsb] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_ + +.. [#amd-eibrs-vmexit] "Extended Feature Enable Register (EFER)" in `AMD64 Architecture Programmer's Manual Volume 2: System Programming <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_ + +.. [#intel-eibrs-vmexit] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_ + +.. [#intel-pbrsb] `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_ + +.. [#retbleed-paper] `RETBleed: Arbitrary Speculative Code Execution with Return Instruction <https://comsec.ethz.ch/wp-content/files/retbleed_sec22.pdf>`_ + +.. [#amd-btc] `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_ + +.. [#amd-srso] `Technical Update Regarding Speculative Return Stack Overflow <https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf>`_ + +.. [#inception-paper] `Inception: Exposing New Attack Surfaces with Training in Transient Execution <https://comsec.ethz.ch/wp-content/files/inception_sec23.pdf>`_ + +.. [#intel-rsbu] `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_ + +.. [#intel-ibpb-btb] `Indirect Branch Predictor Barrier' <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-predictor-barrier.html>`_ + +.. [#intel-eibrs-rrsba] "Guidance for RSBU" in `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_ + +.. [#bhi-paper] `Branch History Injection: On the Effectiveness of Hardware Mitigations Against Cross-Privilege Spectre-v2 Attacks <http://download.vusec.net/papers/bhi-spectre-bhb_sec22.pdf>`_ + +.. [#intel-bhi] `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_ + +.. [#intel-retpoline-rrsba] "Retpoline" in `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_ diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 76e538c77e31..d9fd26b95b34 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1407,18 +1407,15 @@ earlyprintk=serial[,0x...[,baudrate]] earlyprintk=ttySn[,baudrate] earlyprintk=dbgp[debugController#] + earlyprintk=mmio32,membase[,{nocfg|baudrate}] earlyprintk=pciserial[,force],bus:device.function[,{nocfg|baudrate}] earlyprintk=xdbc[xhciController#] earlyprintk=bios - earlyprintk=mmio,membase[,{nocfg|baudrate}] earlyprintk is useful when the kernel crashes before the normal console is initialized. It is not enabled by default because it has some cosmetic problems. - Only 32-bit memory addresses are supported for "mmio" - and "pciserial" devices. - Use "nocfg" to skip UART configuration, assume BIOS/firmware has configured UART correctly. diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index b67772cf36d6..3e76276bd488 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -124,6 +124,14 @@ When mounting an XFS filesystem, the following options are accepted. controls the size of each buffer and so is also relevant to this case. + lifetime (default) or nolifetime + Enable data placement based on write life time hints provided + by the user. This turns on co-allocation of data of similar + life times when statistically favorable to reduce garbage + collection cost. + + These options are only available for zoned rt file systems. + logbsize=value Set the size of each in-memory log buffer. The size may be specified in bytes, or in kilobytes with a "k" suffix. @@ -143,6 +151,14 @@ When mounting an XFS filesystem, the following options are accepted. optional, and the log section can be separate from the data section or contained within it. + max_open_zones=value + Specify the max number of zones to keep open for writing on a + zoned rt device. Many open zones aids file data separation + but may impact performance on HDDs. + + If ``max_open_zones`` is not specified, the value is determined + by the capabilities and the size of the zoned rt device. + noalign Data allocations will not be aligned at stripe unit boundaries. This is only relevant to filesystems created @@ -542,3 +558,37 @@ The interesting knobs for XFS workqueues are as follows: nice Relative priority of scheduling the threads. These are the same nice levels that can be applied to userspace processes. ============ =========== + +Zoned Filesystems +================= + +For zoned file systems, the following attribute is exposed in: + + /sys/fs/xfs/<dev>/zoned/ + + max_open_zones (Min: 1 Default: Varies Max: UINTMAX) + This read-only attribute exposes the maximum number of open zones + available for data placement. The value is determined at mount time and + is limited by the capabilities of the backing zoned device, file system + size and the max_open_zones mount option. + +Zoned Filesystems +================= + +For zoned file systems, the following attributes are exposed in: + + /sys/fs/xfs/<dev>/zoned/ + + max_open_zones (Min: 1 Default: Varies Max: UINTMAX) + This read-only attribute exposes the maximum number of open zones + available for data placement. The value is determined at mount time and + is limited by the capabilities of the backing zoned device, file system + size and the max_open_zones mount option. + + zonegc_low_space (Min: 0 Default: 0 Max: 100) + Define a percentage for how much of the unused space that GC should keep + available for writing. A high value will reclaim more of the space + occupied by unused blocks, creating a larger buffer against write + bursts at the cost of increased write amplification. Regardless + of this value, garbage collection will always aim to free a minimum + amount of blocks to keep max_open_zones open for data placement purposes. diff --git a/Documentation/arch/riscv/hwprobe.rst b/Documentation/arch/riscv/hwprobe.rst index 53607d962653..f60bf5991755 100644 --- a/Documentation/arch/riscv/hwprobe.rst +++ b/Documentation/arch/riscv/hwprobe.rst @@ -51,7 +51,7 @@ The following keys are defined: * :c:macro:`RISCV_HWPROBE_KEY_MARCHID`: Contains the value of ``marchid``, as defined by the RISC-V privileged architecture specification. -* :c:macro:`RISCV_HWPROBE_KEY_MIMPLID`: Contains the value of ``mimplid``, as +* :c:macro:`RISCV_HWPROBE_KEY_MIMPID`: Contains the value of ``mimpid``, as defined by the RISC-V privileged architecture specification. * :c:macro:`RISCV_HWPROBE_KEY_BASE_BEHAVIOR`: A bitmask containing the base diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst index 6ef426a52cdc..f80e2a558d2a 100644 --- a/Documentation/arch/x86/cpuinfo.rst +++ b/Documentation/arch/x86/cpuinfo.rst @@ -79,8 +79,9 @@ feature flags. How are feature flags created? ============================== -a: Feature flags can be derived from the contents of CPUID leaves. ------------------------------------------------------------------- +Feature flags can be derived from the contents of CPUID leaves +-------------------------------------------------------------- + These feature definitions are organized mirroring the layout of CPUID leaves and grouped in words with offsets as mapped in enum cpuid_leafs in cpufeatures.h (see arch/x86/include/asm/cpufeatures.h for details). @@ -89,8 +90,9 @@ cpufeatures.h, and if it is detected at run time, the flags will be displayed accordingly in /proc/cpuinfo. For example, the flag "avx2" comes from X86_FEATURE_AVX2 in cpufeatures.h. -b: Flags can be from scattered CPUID-based features. ----------------------------------------------------- +Flags can be from scattered CPUID-based features +------------------------------------------------ + Hardware features enumerated in sparsely populated CPUID leaves get software-defined values. Still, CPUID needs to be queried to determine if a given feature is present. This is done in init_scattered_cpuid_features(). @@ -104,8 +106,9 @@ has only one feature and would waste 31 bits of space in the x86_capability[] array. Since there is a struct cpuinfo_x86 for each possible CPU, the wasted memory is not trivial. -c: Flags can be created synthetically under certain conditions for hardware features. -------------------------------------------------------------------------------------- +Flags can be created synthetically under certain conditions for hardware features +--------------------------------------------------------------------------------- + Examples of conditions include whether certain features are present in MSR_IA32_CORE_CAPS or specific CPU models are identified. If the needed conditions are met, the features are enabled by the set_cpu_cap or @@ -114,8 +117,8 @@ the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and "split_lock_detect" will be displayed. The flag "ring3mwait" will be displayed only when running on INTEL_XEON_PHI_[KNL|KNM] processors. -d: Flags can represent purely software features. ------------------------------------------------- +Flags can represent purely software features +-------------------------------------------- These flags do not represent hardware features. Instead, they represent a software feature implemented in the kernel. For example, Kernel Page Table Isolation is purely software feature and its feature flag X86_FEATURE_PTI is @@ -130,14 +133,18 @@ x86_cap/bug_flags[] arrays in kernel/cpu/capflags.c. The names in the resulting x86_cap/bug_flags[] are used to populate /proc/cpuinfo. The naming of flags in the x86_cap/bug_flags[] are as follows: -a: The name of the flag is from the string in X86_FEATURE_<name> by default. ----------------------------------------------------------------------------- -By default, the flag <name> in /proc/cpuinfo is extracted from the respective -X86_FEATURE_<name> in cpufeatures.h. For example, the flag "avx2" is from -X86_FEATURE_AVX2. +Flags do not appear by default in /proc/cpuinfo +----------------------------------------------- + +Feature flags are omitted by default from /proc/cpuinfo as it does not make +sense for the feature to be exposed to userspace in most cases. For example, +X86_FEATURE_ALWAYS is defined in cpufeatures.h but that flag is an internal +kernel feature used in the alternative runtime patching functionality. So the +flag does not appear in /proc/cpuinfo. + +Specify a flag name if absolutely needed +---------------------------------------- -b: The naming can be overridden. --------------------------------- If the comment on the line for the #define X86_FEATURE_* starts with a double-quote character (""), the string inside the double-quote characters will be the name of the flags. For example, the flag "sse4_1" comes from @@ -148,36 +155,31 @@ needed. For instance, /proc/cpuinfo is a userspace interface and must remain constant. If, for some reason, the naming of X86_FEATURE_<name> changes, one shall override the new naming with the name already used in /proc/cpuinfo. -c: The naming override can be "", which means it will not appear in /proc/cpuinfo. ----------------------------------------------------------------------------------- -The feature shall be omitted from /proc/cpuinfo if it does not make sense for -the feature to be exposed to userspace. For example, X86_FEATURE_ALWAYS is -defined in cpufeatures.h but that flag is an internal kernel feature used -in the alternative runtime patching functionality. So, its name is overridden -with "". Its flag will not appear in /proc/cpuinfo. - Flags are missing when one or more of these happen ================================================== -a: The hardware does not enumerate support for it. --------------------------------------------------- +The hardware does not enumerate support for it +---------------------------------------------- + For example, when a new kernel is running on old hardware or the feature is not enabled by boot firmware. Even if the hardware is new, there might be a problem enabling the feature at run time, the flag will not be displayed. -b: The kernel does not know about the flag. -------------------------------------------- +The kernel does not know about the flag +--------------------------------------- + For example, when an old kernel is running on new hardware. -c: The kernel disabled support for it at compile-time. ------------------------------------------------------- +The kernel disabled support for it at compile-time +-------------------------------------------------- + For example, if 5-level-paging is not enabled when building (i.e., CONFIG_X86_5LEVEL is not selected) the flag "la57" will not show up [#f1]_. Even though the feature will still be detected via CPUID, the kernel disables it by clearing via setup_clear_cpu_cap(X86_FEATURE_LA57). -d: The feature is disabled at boot-time. ----------------------------------------- +The feature is disabled at boot-time +------------------------------------ A feature can be disabled either using a command-line parameter or because it failed to be enabled. The command-line parameter clearcpuid= can be used to disable features using the feature number as defined in @@ -190,8 +192,9 @@ disable specific features. The list of parameters includes, but is not limited to, nofsgsbase, nosgx, noxsave, etc. 5-level paging can also be disabled using "no5lvl". -e: The feature was known to be non-functional. ----------------------------------------------- +The feature was known to be non-functional +------------------------------------------ + The feature was known to be non-functional because a dependency was missing at runtime. For example, AVX flags will not show up if XSAVE feature is disabled since they depend on XSAVE feature. Another example would be broken diff --git a/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml b/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml index 6ad78429dc74..c92341888a28 100644 --- a/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml +++ b/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml @@ -7,7 +7,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Ceva AHCI SATA Controller maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> description: | diff --git a/Documentation/devicetree/bindings/display/bridge/nwl-dsi.yaml b/Documentation/devicetree/bindings/display/bridge/nwl-dsi.yaml index 350fb8f400f0..5952e6448ed4 100644 --- a/Documentation/devicetree/bindings/display/bridge/nwl-dsi.yaml +++ b/Documentation/devicetree/bindings/display/bridge/nwl-dsi.yaml @@ -111,11 +111,27 @@ properties: unevaluatedProperties: false port@1: - $ref: /schemas/graph.yaml#/properties/port + $ref: /schemas/graph.yaml#/$defs/port-base + unevaluatedProperties: false description: DSI output port node to the panel or the next bridge in the chain + properties: + endpoint: + $ref: /schemas/media/video-interfaces.yaml# + unevaluatedProperties: false + + properties: + data-lanes: + description: array of physical DSI data lane indexes. + minItems: 1 + items: + - const: 1 + - const: 2 + - const: 3 + - const: 4 + required: - port@0 - port@1 diff --git a/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml b/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml index bb93baa88879..e13e9d6dd148 100644 --- a/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml +++ b/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml @@ -12,7 +12,6 @@ description: PS_MODE). Every pin can be configured as input/output. maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> properties: diff --git a/Documentation/devicetree/bindings/interrupt-controller/fsl,irqsteer.yaml b/Documentation/devicetree/bindings/interrupt-controller/fsl,irqsteer.yaml index 6076ddf56bb5..c49688be1058 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/fsl,irqsteer.yaml +++ b/Documentation/devicetree/bindings/interrupt-controller/fsl,irqsteer.yaml @@ -19,6 +19,7 @@ properties: - fsl,imx8mp-irqsteer - fsl,imx8qm-irqsteer - fsl,imx8qxp-irqsteer + - fsl,imx94-irqsteer - const: fsl,imx-irqsteer reg: diff --git a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.yaml b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.yaml index a4dfa09344dd..f85ee5d20ccb 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.yaml +++ b/Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.yaml @@ -9,15 +9,6 @@ title: Renesas R-Car Timer Pulse Unit PWM Controller maintainers: - Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com> -select: - properties: - compatible: - contains: - const: renesas,tpu - required: - - compatible - - '#pwm-cells' - properties: compatible: items: diff --git a/Documentation/devicetree/bindings/reset/xlnx,zynqmp-reset.yaml b/Documentation/devicetree/bindings/reset/xlnx,zynqmp-reset.yaml index 1f1b42dde94d..1db85fc9966f 100644 --- a/Documentation/devicetree/bindings/reset/xlnx,zynqmp-reset.yaml +++ b/Documentation/devicetree/bindings/reset/xlnx,zynqmp-reset.yaml @@ -7,7 +7,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Zynq UltraScale+ MPSoC and Versal reset maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> description: | diff --git a/Documentation/devicetree/bindings/soc/fsl/fsl,ls1028a-reset.yaml b/Documentation/devicetree/bindings/soc/fsl/fsl,ls1028a-reset.yaml index 31295be91013..234089b5954d 100644 --- a/Documentation/devicetree/bindings/soc/fsl/fsl,ls1028a-reset.yaml +++ b/Documentation/devicetree/bindings/soc/fsl/fsl,ls1028a-reset.yaml @@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Freescale Layerscape Reset Registers Module maintainers: - - Frank Li + - Frank Li <Frank.Li@nxp.com> description: Reset Module includes chip reset, service processor control and Reset Control diff --git a/Documentation/devicetree/bindings/timer/nxp,sysctr-timer.yaml b/Documentation/devicetree/bindings/timer/nxp,sysctr-timer.yaml index 891cca009528..6b80b060672e 100644 --- a/Documentation/devicetree/bindings/timer/nxp,sysctr-timer.yaml +++ b/Documentation/devicetree/bindings/timer/nxp,sysctr-timer.yaml @@ -18,9 +18,14 @@ description: | properties: compatible: - enum: - - nxp,imx95-sysctr-timer - - nxp,sysctr-timer + oneOf: + - enum: + - nxp,imx95-sysctr-timer + - nxp,sysctr-timer + - items: + - enum: + - nxp,imx94-sysctr-timer + - const: nxp,imx95-sysctr-timer reg: maxItems: 1 diff --git a/Documentation/devicetree/bindings/timer/renesas,tpu.yaml b/Documentation/devicetree/bindings/timer/renesas,tpu.yaml deleted file mode 100644 index 7a473b302775..000000000000 --- a/Documentation/devicetree/bindings/timer/renesas,tpu.yaml +++ /dev/null @@ -1,56 +0,0 @@ -# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) -%YAML 1.2 ---- -$id: http://devicetree.org/schemas/timer/renesas,tpu.yaml# -$schema: http://devicetree.org/meta-schemas/core.yaml# - -title: Renesas H8/300 Timer Pulse Unit - -maintainers: - - Yoshinori Sato <ysato@users.sourceforge.jp> - -description: - The TPU is a 16bit timer/counter with configurable clock inputs and - programmable compare match. - This implementation supports only cascade mode. - -select: - properties: - compatible: - contains: - const: renesas,tpu - '#pwm-cells': false - required: - - compatible - -properties: - compatible: - const: renesas,tpu - - reg: - items: - - description: First channel - - description: Second channel - - clocks: - maxItems: 1 - - clock-names: - const: fck - -required: - - compatible - - reg - - clocks - - clock-names - -additionalProperties: false - -examples: - - | - tpu: tpu@ffffe0 { - compatible = "renesas,tpu"; - reg = <0xffffe0 16>, <0xfffff0 12>; - clocks = <&pclk>; - clock-names = "fck"; - }; diff --git a/Documentation/devicetree/bindings/usb/dwc3-xilinx.yaml b/Documentation/devicetree/bindings/usb/dwc3-xilinx.yaml index b5843f4d17d8..379dacacb526 100644 --- a/Documentation/devicetree/bindings/usb/dwc3-xilinx.yaml +++ b/Documentation/devicetree/bindings/usb/dwc3-xilinx.yaml @@ -7,7 +7,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Xilinx SuperSpeed DWC3 USB SoC controller maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> properties: diff --git a/Documentation/devicetree/bindings/usb/microchip,usb5744.yaml b/Documentation/devicetree/bindings/usb/microchip,usb5744.yaml index e2a72deae776..c68c04da3399 100644 --- a/Documentation/devicetree/bindings/usb/microchip,usb5744.yaml +++ b/Documentation/devicetree/bindings/usb/microchip,usb5744.yaml @@ -17,7 +17,6 @@ description: maintainers: - Michal Simek <michal.simek@amd.com> - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> properties: diff --git a/Documentation/devicetree/bindings/usb/xlnx,usb2.yaml b/Documentation/devicetree/bindings/usb/xlnx,usb2.yaml index a7f75fe36665..f295aa9d9ee7 100644 --- a/Documentation/devicetree/bindings/usb/xlnx,usb2.yaml +++ b/Documentation/devicetree/bindings/usb/xlnx,usb2.yaml @@ -7,7 +7,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Xilinx udc controller maintainers: - - Mubin Sayyed <mubin.sayyed@amd.com> - Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> properties: diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst index a1eb4a11a1d0..1b240661bfa3 100644 --- a/Documentation/filesystems/ext4/super.rst +++ b/Documentation/filesystems/ext4/super.rst @@ -328,9 +328,13 @@ The ext4 superblock is laid out as follows in - s_checksum_type - Metadata checksum algorithm type. The only valid value is 1 (crc32c). * - 0x176 - - __le16 - - s_reserved_pad - - + - \_\_u8 + - s\_encryption\_level + - Versioning level for encryption. + * - 0x177 + - \_\_u8 + - s\_reserved\_pad + - Padding to next 32bits. * - 0x178 - __le64 - s_kbytes_written @@ -466,9 +470,13 @@ The ext4 superblock is laid out as follows in - s_last_error_time_hi - Upper 8 bits of the s_last_error_time field. * - 0x27A - - __u8 - - s_pad[2] - - Zero padding. + - \_\_u8 + - s\_first\_error\_errcode + - + * - 0x27B + - \_\_u8 + - s\_last\_error\_errcode + - * - 0x27C - __le16 - s_encoding diff --git a/Documentation/netlink/specs/ovs_vport.yaml b/Documentation/netlink/specs/ovs_vport.yaml index 86ba9ac2a521..b538bb99ee9b 100644 --- a/Documentation/netlink/specs/ovs_vport.yaml +++ b/Documentation/netlink/specs/ovs_vport.yaml @@ -123,12 +123,12 @@ attribute-sets: operations: name-prefix: ovs-vport-cmd- + fixed-header: ovs-header list: - name: new doc: Create a new OVS vport attribute-set: vport - fixed-header: ovs-header do: request: attributes: @@ -141,7 +141,6 @@ operations: name: del doc: Delete existing OVS vport from a data path attribute-set: vport - fixed-header: ovs-header do: request: attributes: @@ -152,7 +151,6 @@ operations: name: get doc: Get / dump OVS vport configuration and state attribute-set: vport - fixed-header: ovs-header do: &vport-get-op request: attributes: diff --git a/Documentation/netlink/specs/rt_link.yaml b/Documentation/netlink/specs/rt_link.yaml index 31238455f8e9..6b9d5ee87d93 100644 --- a/Documentation/netlink/specs/rt_link.yaml +++ b/Documentation/netlink/specs/rt_link.yaml @@ -1113,11 +1113,10 @@ attribute-sets: - name: prop-list type: nest - nested-attributes: link-attrs + nested-attributes: prop-list-link-attrs - name: alt-ifname type: string - multi-attr: true - name: perm-address type: binary @@ -1164,6 +1163,13 @@ attribute-sets: name: netns-immutable type: u8 - + name: prop-list-link-attrs + subset-of: link-attrs + attributes: + - + name: alt-ifname + multi-attr: true + - name: af-spec-attrs attributes: - @@ -1585,7 +1591,7 @@ attribute-sets: name: nf-call-iptables type: u8 - - name: nf-call-ip6-tables + name: nf-call-ip6tables type: u8 - name: nf-call-arptables @@ -2077,7 +2083,7 @@ attribute-sets: name: id type: u16 - - name: flag + name: flags type: binary struct: ifla-vlan-flags - @@ -2165,7 +2171,7 @@ attribute-sets: type: binary struct: ifla-cacheinfo - - name: icmp6-stats + name: icmp6stats type: binary struct: ifla-icmp6-stats - @@ -2179,9 +2185,10 @@ attribute-sets: type: u32 - name: mctp-attrs + name-prefix: ifla-mctp- attributes: - - name: mctp-net + name: net type: u32 - name: phys-binding @@ -2453,7 +2460,6 @@ operations: - min-mtu - max-mtu - prop-list - - alt-ifname - perm-address - proto-down-reason - parent-dev-name diff --git a/Documentation/netlink/specs/rt_neigh.yaml b/Documentation/netlink/specs/rt_neigh.yaml index e670b6dc07be..a843caa72259 100644 --- a/Documentation/netlink/specs/rt_neigh.yaml +++ b/Documentation/netlink/specs/rt_neigh.yaml @@ -13,25 +13,25 @@ definitions: type: struct members: - - name: family + name: ndm-family type: u8 - - name: pad + name: ndm-pad type: pad len: 3 - - name: ifindex + name: ndm-ifindex type: s32 - - name: state + name: ndm-state type: u16 enum: nud-state - - name: flags + name: ndm-flags type: u8 enum: ntf-flags - - name: type + name: ndm-type type: u8 enum: rtm-type - @@ -189,7 +189,7 @@ attribute-sets: type: binary display-hint: ipv4 - - name: lladr + name: lladdr type: binary display-hint: mac - diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst index 6c2d8945f597..eab601ab2db0 100644 --- a/Documentation/networking/netdevices.rst +++ b/Documentation/networking/netdevices.rst @@ -338,10 +338,11 @@ operations directly under the netdev instance lock. Devices drivers are encouraged to rely on the instance lock where possible. For the (mostly software) drivers that need to interact with the core stack, -there are two sets of interfaces: ``dev_xxx`` and ``netif_xxx`` (e.g., -``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx`` functions handle -acquiring the instance lock themselves, while the ``netif_xxx`` functions -assume that the driver has already acquired the instance lock. +there are two sets of interfaces: ``dev_xxx``/``netdev_xxx`` and ``netif_xxx`` +(e.g., ``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx``/``netdev_xxx`` +functions handle acquiring the instance lock themselves, while the +``netif_xxx`` functions assume that the driver has already acquired +the instance lock. Notifiers and netdev instance lock ================================== @@ -354,6 +355,7 @@ For devices with locked ops, currently only the following notifiers are running under the lock: * ``NETDEV_REGISTER`` * ``NETDEV_UP`` +* ``NETDEV_CHANGE`` The following notifiers are running without the lock: * ``NETDEV_UNREGISTER`` diff --git a/Documentation/power/runtime_pm.rst b/Documentation/power/runtime_pm.rst index 12f429359a82..63344bea8393 100644 --- a/Documentation/power/runtime_pm.rst +++ b/Documentation/power/runtime_pm.rst @@ -154,7 +154,7 @@ suspending the device are satisfied) and to queue up a suspend request for the device in that case. If there is no idle callback, or if the callback returns 0, then the PM core will attempt to carry out a runtime suspend of the device, also respecting devices configured for autosuspend. In essence this means a -call to __pm_runtime_autosuspend() (do note that drivers needs to update the +call to pm_runtime_autosuspend() (do note that drivers needs to update the device last busy mark, pm_runtime_mark_last_busy(), to control the delay under this circumstance). To prevent this (for example, if the callback routine has started a delayed suspend), the routine must return a non-zero value. Negative diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst index 1dabfc29be0d..7195a7f91107 100644 --- a/Documentation/userspace-api/mseal.rst +++ b/Documentation/userspace-api/mseal.rst @@ -27,7 +27,7 @@ SYSCALL ======= mseal syscall signature ----------------------- - ``int mseal(void \* addr, size_t len, unsigned long flags)`` + ``int mseal(void *addr, size_t len, unsigned long flags)`` **addr**/**len**: virtual memory address range. The address range set by **addr**/**len** must meet: diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 1f8625b7646a..47c7c3f92314 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -7447,6 +7447,75 @@ Unused bitfields in the bitarrays must be set to zero. This capability connects the vcpu to an in-kernel XIVE device. +6.76 KVM_CAP_HYPERV_SYNIC +------------------------- + +:Architectures: x86 +:Target: vcpu + +This capability, if KVM_CHECK_EXTENSION indicates that it is +available, means that the kernel has an implementation of the +Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is +used to support Windows Hyper-V based guest paravirt drivers(VMBus). + +In order to use SynIC, it has to be activated by setting this +capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this +will disable the use of APIC hardware virtualization even if supported +by the CPU, as it's incompatible with SynIC auto-EOI behavior. + +6.77 KVM_CAP_HYPERV_SYNIC2 +-------------------------- + +:Architectures: x86 +:Target: vcpu + +This capability enables a newer version of Hyper-V Synthetic interrupt +controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM +doesn't clear SynIC message and event flags pages when they are enabled by +writing to the respective MSRs. + +6.78 KVM_CAP_HYPERV_DIRECT_TLBFLUSH +----------------------------------- + +:Architectures: x86 +:Target: vcpu + +This capability indicates that KVM running on top of Hyper-V hypervisor +enables Direct TLB flush for its guests meaning that TLB flush +hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM. +Due to the different ABI for hypercall parameters between Hyper-V and +KVM, enabling this capability effectively disables all hypercall +handling by KVM (as some KVM hypercall may be mistakenly treated as TLB +flush hypercalls by Hyper-V) so userspace should disable KVM identification +in CPUID and only exposes Hyper-V identification. In this case, guest +thinks it's running on Hyper-V and only use Hyper-V hypercalls. + +6.79 KVM_CAP_HYPERV_ENFORCE_CPUID +--------------------------------- + +:Architectures: x86 +:Target: vcpu + +When enabled, KVM will disable emulated Hyper-V features provided to the +guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all +currently implemented Hyper-V features are provided unconditionally when +Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001) +leaf. + +6.80 KVM_CAP_ENFORCE_PV_FEATURE_CPUID +------------------------------------- + +:Architectures: x86 +:Target: vcpu + +When enabled, KVM will disable paravirtual features provided to the +guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf +(0x40000001). Otherwise, a guest may use the paravirtual features +regardless of what has actually been exposed through the CPUID leaf. + +.. _KVM_CAP_DIRTY_LOG_RING: + + .. _cap_enable_vm: 7. Capabilities that can be enabled on VMs @@ -7927,10 +7996,10 @@ by POWER10 processor. 7.24 KVM_CAP_VM_COPY_ENC_CONTEXT_FROM ------------------------------------- -Architectures: x86 SEV enabled -Type: vm -Parameters: args[0] is the fd of the source vm -Returns: 0 on success; ENOTTY on error +:Architectures: x86 SEV enabled +:Type: vm +:Parameters: args[0] is the fd of the source vm +:Returns: 0 on success; ENOTTY on error This capability enables userspace to copy encryption context from the vm indicated by the fd to the vm this is called on. @@ -7963,24 +8032,6 @@ default. See Documentation/arch/x86/sgx.rst for more details. -7.26 KVM_CAP_PPC_RPT_INVALIDATE -------------------------------- - -:Capability: KVM_CAP_PPC_RPT_INVALIDATE -:Architectures: ppc -:Type: vm - -This capability indicates that the kernel is capable of handling -H_RPT_INVALIDATE hcall. - -In order to enable the use of H_RPT_INVALIDATE in the guest, -user space might have to advertise it for the guest. For example, -IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is -present in the "ibm,hypertas-functions" device-tree property. - -This capability is enabled for hypervisors on platforms like POWER9 -that support radix MMU. - 7.27 KVM_CAP_EXIT_ON_EMULATION_FAILURE -------------------------------------- @@ -8038,24 +8089,9 @@ indicated by the fd to the VM this is called on. This is intended to support intra-host migration of VMs between userspace VMMs, upgrading the VMM process without interrupting the guest. -7.30 KVM_CAP_PPC_AIL_MODE_3 -------------------------------- - -:Capability: KVM_CAP_PPC_AIL_MODE_3 -:Architectures: ppc -:Type: vm - -This capability indicates that the kernel supports the mode 3 setting for the -"Address Translation Mode on Interrupt" aka "Alternate Interrupt Location" -resource that is controlled with the H_SET_MODE hypercall. - -This capability allows a guest kernel to use a better-performance mode for -handling interrupts and system calls. - 7.31 KVM_CAP_DISABLE_QUIRKS2 ---------------------------- -:Capability: KVM_CAP_DISABLE_QUIRKS2 :Parameters: args[0] - set of KVM quirks to disable :Architectures: x86 :Type: vm @@ -8210,27 +8246,6 @@ This capability is aimed to mitigate the threat that malicious VMs can cause CPU stuck (due to event windows don't open up) and make the CPU unavailable to host or other VMs. -7.34 KVM_CAP_MEMORY_FAULT_INFO ------------------------------- - -:Architectures: x86 -:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. - -The presence of this capability indicates that KVM_RUN will fill -kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if -there is a valid memslot but no backing VMA for the corresponding host virtual -address. - -The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns -an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set -to KVM_EXIT_MEMORY_FAULT. - -Note: Userspaces which attempt to resolve memory faults so that they can retry -KVM_RUN are encouraged to guard against repeatedly receiving the same -error/annotated fault. - -See KVM_EXIT_MEMORY_FAULT for more information. - 7.35 KVM_CAP_X86_APIC_BUS_CYCLES_NS ----------------------------------- @@ -8248,19 +8263,220 @@ by KVM_CHECK_EXTENSION. Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest. -7.36 KVM_CAP_X86_GUEST_MODE ------------------------------- +7.36 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL +---------------------------------------------------------- + +:Architectures: x86, arm64 +:Type: vm +:Parameters: args[0] - size of the dirty log ring + +KVM is capable of tracking dirty memory using ring buffers that are +mmapped into userspace; there is one dirty ring per vcpu. + +The dirty ring is available to userspace as an array of +``struct kvm_dirty_gfn``. Each dirty entry is defined as:: + + struct kvm_dirty_gfn { + __u32 flags; + __u32 slot; /* as_id | slot_id */ + __u64 offset; + }; + +The following values are defined for the flags field to define the +current state of the entry:: + + #define KVM_DIRTY_GFN_F_DIRTY BIT(0) + #define KVM_DIRTY_GFN_F_RESET BIT(1) + #define KVM_DIRTY_GFN_F_MASK 0x3 + +Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM +ioctl to enable this capability for the new guest and set the size of +the rings. Enabling the capability is only allowed before creating any +vCPU, and the size of the ring must be a power of two. The larger the +ring buffer, the less likely the ring is full and the VM is forced to +exit to userspace. The optimal size depends on the workload, but it is +recommended that it be at least 64 KiB (4096 entries). + +Just like for dirty page bitmaps, the buffer tracks writes to +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was +set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered +with the flag set, userspace can start harvesting dirty pages from the +ring buffer. + +An entry in the ring buffer can be unused (flag bits ``00``), +dirty (flag bits ``01``) or harvested (flag bits ``1X``). The +state machine for the entry is as follows:: + + dirtied harvested reset + 00 -----------> 01 -------------> 1X -------+ + ^ | + | | + +------------------------------------------+ + +To harvest the dirty pages, userspace accesses the mmapped ring buffer +to read the dirty GFNs. If the flags has the DIRTY bit set (at this stage +the RESET bit must be cleared), then it means this GFN is a dirty GFN. +The userspace should harvest this GFN and mark the flags from state +``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set +to show that this GFN is harvested and waiting for a reset), and move +on to the next GFN. The userspace should continue to do this until the +flags of a GFN have the DIRTY bit cleared, meaning that it has harvested +all the dirty GFNs that were available. + +Note that on weakly ordered architectures, userspace accesses to the +ring buffer (and more specifically the 'flags' field) must be ordered, +using load-acquire/store-release accessors when available, or any +other memory barrier that will ensure this ordering. + +It's not necessary for userspace to harvest the all dirty GFNs at once. +However it must collect the dirty GFNs in sequence, i.e., the userspace +program cannot skip one dirty GFN to collect the one next to it. + +After processing one or more entries in the ring buffer, userspace +calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about +it, so that the kernel will reprotect those collected GFNs. +Therefore, the ioctl must be called *before* reading the content of +the dirty pages. + +The dirty ring can get full. When it happens, the KVM_RUN of the +vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL. + +The dirty ring interface has a major difference comparing to the +KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from +userspace, it's still possible that the kernel has not yet flushed the +processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the +flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one +needs to kick the vcpu out of KVM_RUN using a signal. The resulting +vmexit ensures that all dirty GFNs are flushed to the dirty rings. + +NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that +should be exposed by weakly ordered architecture, in order to indicate +the additional memory ordering requirements imposed on userspace when +reading the state of an entry and mutating it from DIRTY to HARVESTED. +Architecture with TSO-like ordering (such as x86) are allowed to +expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL +to userspace. + +After enabling the dirty rings, the userspace needs to detect the +capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the +ring structures can be backed by per-slot bitmaps. With this capability +advertised, it means the architecture can dirty guest pages without +vcpu/ring context, so that some of the dirty information will still be +maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP +can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL +hasn't been enabled, or any memslot has been existing. + +Note that the bitmap here is only a backup of the ring structure. The +use of the ring and bitmap combination is only beneficial if there is +only a very small amount of memory that is dirtied out of vcpu/ring +context. Otherwise, the stand-alone per-slot bitmap mechanism needs to +be considered. + +To collect dirty bits in the backup bitmap, userspace can use the same +KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all +the generation of the dirty bits is done in a single pass. Collecting +the dirty bitmap should be the very last thing that the VMM does before +considering the state as complete. VMM needs to ensure that the dirty +state is final and avoid missing dirty pages from another ioctl ordered +after the bitmap collection. + +NOTE: Multiple examples of using the backup bitmap: (1) save vgic/its +tables through command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} on +KVM device "kvm-arm-vgic-its". (2) restore vgic/its tables through +command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} on KVM device +"kvm-arm-vgic-its". VGICv3 LPI pending status is restored. (3) save +vgic3 pending table through KVM_DEV_ARM_VGIC_{GRP_CTRL, SAVE_PENDING_TABLES} +command on KVM device "kvm-arm-vgic-v3". + +7.37 KVM_CAP_PMU_CAPABILITY +--------------------------- :Architectures: x86 -:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. +:Type: vm +:Parameters: arg[0] is bitmask of PMU virtualization capabilities. +:Returns: 0 on success, -EINVAL when arg[0] contains invalid bits -The presence of this capability indicates that KVM_RUN will update the -KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the -vCPU was executing nested guest code when it exited. +This capability alters PMU virtualization in KVM. -KVM exits with the register state of either the L1 or L2 guest -depending on which executed at the time of an exit. Userspace must -take care to differentiate between these cases. +Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of +PMU virtualization capabilities that can be adjusted on a VM. + +The argument to KVM_ENABLE_CAP is also a bitmask and selects specific +PMU virtualization capabilities to be applied to the VM. This can +only be invoked on a VM prior to the creation of VCPUs. + +At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting +this capability will disable PMU virtualization for that VM. Usermode +should adjust CPUID leaf 0xA to reflect that the PMU is disabled. + +7.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES +------------------------------------- + +:Architectures: x86 +:Type: vm +:Parameters: arg[0] must be 0. +:Returns: 0 on success, -EPERM if the userspace process does not + have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been + created. + +This capability disables the NX huge pages mitigation for iTLB MULTIHIT. + +The capability has no effect if the nx_huge_pages module parameter is not set. + +This capability may only be set before any vCPUs are created. + +7.39 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE +--------------------------------------- + +:Architectures: arm64 +:Type: vm +:Parameters: arg[0] is the new split chunk size. +:Returns: 0 on success, -EINVAL if any memslot was already created. + +This capability sets the chunk size used in Eager Page Splitting. + +Eager Page Splitting improves the performance of dirty-logging (used +in live migrations) when guest memory is backed by huge-pages. It +avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing +it eagerly when enabling dirty logging (with the +KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using +KVM_CLEAR_DIRTY_LOG. + +The chunk size specifies how many pages to break at a time, using a +single allocation for each chunk. Bigger the chunk size, more pages +need to be allocated ahead of time. + +The chunk size needs to be a valid block size. The list of acceptable +block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a +64-bit bitmap (each bit describing a block size). The default value is +0, to disable the eager page splitting. + +7.40 KVM_CAP_EXIT_HYPERCALL +--------------------------- + +:Architectures: x86 +:Type: vm + +This capability, if enabled, will cause KVM to exit to userspace +with KVM_EXIT_HYPERCALL exit reason to process some hypercalls. + +Calling KVM_CHECK_EXTENSION for this capability will return a bitmask +of hypercalls that can be configured to exit to userspace. +Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE. + +The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset +of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace +the hypercalls whose corresponding bit is in the argument, and return +ENOSYS for the others. + +7.41 KVM_CAP_ARM_SYSTEM_SUSPEND +------------------------------- + +:Architectures: arm64 +:Type: vm + +When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of +type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request. 7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS ------------------------------------- @@ -8297,21 +8513,6 @@ H_RANDOM hypercall backed by a hardware random-number generator. If present, the kernel H_RANDOM handler can be enabled for guest use with the KVM_CAP_PPC_ENABLE_HCALL capability. -8.2 KVM_CAP_HYPERV_SYNIC ------------------------- - -:Architectures: x86 - -This capability, if KVM_CHECK_EXTENSION indicates that it is -available, means that the kernel has an implementation of the -Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is -used to support Windows Hyper-V based guest paravirt drivers(VMBus). - -In order to use SynIC, it has to be activated by setting this -capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this -will disable the use of APIC hardware virtualization even if supported -by the CPU, as it's incompatible with SynIC auto-EOI behavior. - 8.3 KVM_CAP_PPC_MMU_RADIX ------------------------- @@ -8362,20 +8563,6 @@ may be incompatible with the MIPS VZ ASE. virtualization, including standard guest virtual memory segments. == ========================================================================== -8.6 KVM_CAP_MIPS_TE -------------------- - -:Architectures: mips - -This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that -it is available, means that the trap & emulate implementation is available to -run guest code in user mode, even if KVM_CAP_MIPS_VZ indicates that hardware -assisted virtualisation is also available. KVM_VM_MIPS_TE (0) must be passed -to KVM_CREATE_VM to create a VM which utilises it. - -If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is -available, it means that the VM is using trap & emulate. - 8.7 KVM_CAP_MIPS_64BIT ---------------------- @@ -8457,16 +8644,6 @@ virtual SMT modes that can be set using KVM_CAP_PPC_SMT. If bit N (counting from the right) is set, then a virtual SMT mode of 2^N is available. -8.11 KVM_CAP_HYPERV_SYNIC2 --------------------------- - -:Architectures: x86 - -This capability enables a newer version of Hyper-V Synthetic interrupt -controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM -doesn't clear SynIC message and event flags pages when they are enabled by -writing to the respective MSRs. - 8.12 KVM_CAP_HYPERV_VP_INDEX ---------------------------- @@ -8481,7 +8658,6 @@ capability is absent, userspace can still query this msr's value. ------------------------------- :Architectures: s390 -:Parameters: none This capability indicates if the flic device will be able to get/set the AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows @@ -8555,21 +8731,6 @@ This capability indicates that KVM supports paravirtualized Hyper-V IPI send hypercalls: HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx. -8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH ------------------------------------ - -:Architectures: x86 - -This capability indicates that KVM running on top of Hyper-V hypervisor -enables Direct TLB flush for its guests meaning that TLB flush -hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM. -Due to the different ABI for hypercall parameters between Hyper-V and -KVM, enabling this capability effectively disables all hypercall -handling by KVM (as some KVM hypercall may be mistakenly treated as TLB -flush hypercalls by Hyper-V) so userspace should disable KVM identification -in CPUID and only exposes Hyper-V identification. In this case, guest -thinks it's running on Hyper-V and only use Hyper-V hypercalls. - 8.22 KVM_CAP_S390_VCPU_RESETS ----------------------------- @@ -8647,142 +8808,6 @@ In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to trap and emulate MSRs that are outside of the scope of KVM as well as limit the attack surface on KVM's MSR emulation code. -8.28 KVM_CAP_ENFORCE_PV_FEATURE_CPUID -------------------------------------- - -Architectures: x86 - -When enabled, KVM will disable paravirtual features provided to the -guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf -(0x40000001). Otherwise, a guest may use the paravirtual features -regardless of what has actually been exposed through the CPUID leaf. - -.. _KVM_CAP_DIRTY_LOG_RING: - -8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL ----------------------------------------------------------- - -:Architectures: x86, arm64 -:Parameters: args[0] - size of the dirty log ring - -KVM is capable of tracking dirty memory using ring buffers that are -mmapped into userspace; there is one dirty ring per vcpu. - -The dirty ring is available to userspace as an array of -``struct kvm_dirty_gfn``. Each dirty entry is defined as:: - - struct kvm_dirty_gfn { - __u32 flags; - __u32 slot; /* as_id | slot_id */ - __u64 offset; - }; - -The following values are defined for the flags field to define the -current state of the entry:: - - #define KVM_DIRTY_GFN_F_DIRTY BIT(0) - #define KVM_DIRTY_GFN_F_RESET BIT(1) - #define KVM_DIRTY_GFN_F_MASK 0x3 - -Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM -ioctl to enable this capability for the new guest and set the size of -the rings. Enabling the capability is only allowed before creating any -vCPU, and the size of the ring must be a power of two. The larger the -ring buffer, the less likely the ring is full and the VM is forced to -exit to userspace. The optimal size depends on the workload, but it is -recommended that it be at least 64 KiB (4096 entries). - -Just like for dirty page bitmaps, the buffer tracks writes to -all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was -set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered -with the flag set, userspace can start harvesting dirty pages from the -ring buffer. - -An entry in the ring buffer can be unused (flag bits ``00``), -dirty (flag bits ``01``) or harvested (flag bits ``1X``). The -state machine for the entry is as follows:: - - dirtied harvested reset - 00 -----------> 01 -------------> 1X -------+ - ^ | - | | - +------------------------------------------+ - -To harvest the dirty pages, userspace accesses the mmapped ring buffer -to read the dirty GFNs. If the flags has the DIRTY bit set (at this stage -the RESET bit must be cleared), then it means this GFN is a dirty GFN. -The userspace should harvest this GFN and mark the flags from state -``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set -to show that this GFN is harvested and waiting for a reset), and move -on to the next GFN. The userspace should continue to do this until the -flags of a GFN have the DIRTY bit cleared, meaning that it has harvested -all the dirty GFNs that were available. - -Note that on weakly ordered architectures, userspace accesses to the -ring buffer (and more specifically the 'flags' field) must be ordered, -using load-acquire/store-release accessors when available, or any -other memory barrier that will ensure this ordering. - -It's not necessary for userspace to harvest the all dirty GFNs at once. -However it must collect the dirty GFNs in sequence, i.e., the userspace -program cannot skip one dirty GFN to collect the one next to it. - -After processing one or more entries in the ring buffer, userspace -calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about -it, so that the kernel will reprotect those collected GFNs. -Therefore, the ioctl must be called *before* reading the content of -the dirty pages. - -The dirty ring can get full. When it happens, the KVM_RUN of the -vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL. - -The dirty ring interface has a major difference comparing to the -KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from -userspace, it's still possible that the kernel has not yet flushed the -processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the -flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one -needs to kick the vcpu out of KVM_RUN using a signal. The resulting -vmexit ensures that all dirty GFNs are flushed to the dirty rings. - -NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that -should be exposed by weakly ordered architecture, in order to indicate -the additional memory ordering requirements imposed on userspace when -reading the state of an entry and mutating it from DIRTY to HARVESTED. -Architecture with TSO-like ordering (such as x86) are allowed to -expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL -to userspace. - -After enabling the dirty rings, the userspace needs to detect the -capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the -ring structures can be backed by per-slot bitmaps. With this capability -advertised, it means the architecture can dirty guest pages without -vcpu/ring context, so that some of the dirty information will still be -maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP -can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL -hasn't been enabled, or any memslot has been existing. - -Note that the bitmap here is only a backup of the ring structure. The -use of the ring and bitmap combination is only beneficial if there is -only a very small amount of memory that is dirtied out of vcpu/ring -context. Otherwise, the stand-alone per-slot bitmap mechanism needs to -be considered. - -To collect dirty bits in the backup bitmap, userspace can use the same -KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all -the generation of the dirty bits is done in a single pass. Collecting -the dirty bitmap should be the very last thing that the VMM does before -considering the state as complete. VMM needs to ensure that the dirty -state is final and avoid missing dirty pages from another ioctl ordered -after the bitmap collection. - -NOTE: Multiple examples of using the backup bitmap: (1) save vgic/its -tables through command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} on -KVM device "kvm-arm-vgic-its". (2) restore vgic/its tables through -command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} on KVM device -"kvm-arm-vgic-its". VGICv3 LPI pending status is restored. (3) save -vgic3 pending table through KVM_DEV_ARM_VGIC_{GRP_CTRL, SAVE_PENDING_TABLES} -command on KVM device "kvm-arm-vgic-v3". - 8.30 KVM_CAP_XEN_HVM -------------------- @@ -8847,10 +8872,9 @@ clearing the PVCLOCK_TSC_STABLE_BIT flag in Xen pvclock sources. This will be done when the KVM_CAP_XEN_HVM ioctl sets the KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag. -8.31 KVM_CAP_PPC_MULTITCE -------------------------- +8.31 KVM_CAP_SPAPR_MULTITCE +--------------------------- -:Capability: KVM_CAP_PPC_MULTITCE :Architectures: ppc :Type: vm @@ -8882,72 +8906,9 @@ This capability indicates that the KVM virtual PTP service is supported in the host. A VMM can check whether the service is available to the guest on migration. -8.33 KVM_CAP_HYPERV_ENFORCE_CPUID ---------------------------------- - -Architectures: x86 - -When enabled, KVM will disable emulated Hyper-V features provided to the -guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all -currently implemented Hyper-V features are provided unconditionally when -Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001) -leaf. - -8.34 KVM_CAP_EXIT_HYPERCALL ---------------------------- - -:Capability: KVM_CAP_EXIT_HYPERCALL -:Architectures: x86 -:Type: vm - -This capability, if enabled, will cause KVM to exit to userspace -with KVM_EXIT_HYPERCALL exit reason to process some hypercalls. - -Calling KVM_CHECK_EXTENSION for this capability will return a bitmask -of hypercalls that can be configured to exit to userspace. -Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE. - -The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset -of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace -the hypercalls whose corresponding bit is in the argument, and return -ENOSYS for the others. - -8.35 KVM_CAP_PMU_CAPABILITY ---------------------------- - -:Capability: KVM_CAP_PMU_CAPABILITY -:Architectures: x86 -:Type: vm -:Parameters: arg[0] is bitmask of PMU virtualization capabilities. -:Returns: 0 on success, -EINVAL when arg[0] contains invalid bits - -This capability alters PMU virtualization in KVM. - -Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of -PMU virtualization capabilities that can be adjusted on a VM. - -The argument to KVM_ENABLE_CAP is also a bitmask and selects specific -PMU virtualization capabilities to be applied to the VM. This can -only be invoked on a VM prior to the creation of VCPUs. - -At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting -this capability will disable PMU virtualization for that VM. Usermode -should adjust CPUID leaf 0xA to reflect that the PMU is disabled. - -8.36 KVM_CAP_ARM_SYSTEM_SUSPEND -------------------------------- - -:Capability: KVM_CAP_ARM_SYSTEM_SUSPEND -:Architectures: arm64 -:Type: vm - -When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of -type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request. - 8.37 KVM_CAP_S390_PROTECTED_DUMP -------------------------------- -:Capability: KVM_CAP_S390_PROTECTED_DUMP :Architectures: s390 :Type: vm @@ -8957,27 +8918,9 @@ PV guests. The `KVM_PV_DUMP` command is available for the dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is available and supports the `KVM_PV_DUMP_CPU` subcommand. -8.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES -------------------------------------- - -:Capability: KVM_CAP_VM_DISABLE_NX_HUGE_PAGES -:Architectures: x86 -:Type: vm -:Parameters: arg[0] must be 0. -:Returns: 0 on success, -EPERM if the userspace process does not - have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been - created. - -This capability disables the NX huge pages mitigation for iTLB MULTIHIT. - -The capability has no effect if the nx_huge_pages module parameter is not set. - -This capability may only be set before any vCPUs are created. - 8.39 KVM_CAP_S390_CPU_TOPOLOGY ------------------------------ -:Capability: KVM_CAP_S390_CPU_TOPOLOGY :Architectures: s390 :Type: vm @@ -8999,37 +8942,9 @@ structure. When getting the Modified Change Topology Report value, the attr->addr must point to a byte where the value will be stored or retrieved from. -8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE ---------------------------------------- - -:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE -:Architectures: arm64 -:Type: vm -:Parameters: arg[0] is the new split chunk size. -:Returns: 0 on success, -EINVAL if any memslot was already created. - -This capability sets the chunk size used in Eager Page Splitting. - -Eager Page Splitting improves the performance of dirty-logging (used -in live migrations) when guest memory is backed by huge-pages. It -avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing -it eagerly when enabling dirty logging (with the -KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using -KVM_CLEAR_DIRTY_LOG. - -The chunk size specifies how many pages to break at a time, using a -single allocation for each chunk. Bigger the chunk size, more pages -need to be allocated ahead of time. - -The chunk size needs to be a valid block size. The list of acceptable -block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a -64-bit bitmap (each bit describing a block size). The default value is -0, to disable the eager page splitting. - 8.41 KVM_CAP_VM_TYPES --------------------- -:Capability: KVM_CAP_MEMORY_ATTRIBUTES :Architectures: x86 :Type: system ioctl @@ -9046,6 +8961,67 @@ Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in production. The behavior and effective ABI for software-protected VMs is unstable. +8.42 KVM_CAP_PPC_RPT_INVALIDATE +------------------------------- + +:Architectures: ppc + +This capability indicates that the kernel is capable of handling +H_RPT_INVALIDATE hcall. + +In order to enable the use of H_RPT_INVALIDATE in the guest, +user space might have to advertise it for the guest. For example, +IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is +present in the "ibm,hypertas-functions" device-tree property. + +This capability is enabled for hypervisors on platforms like POWER9 +that support radix MMU. + +8.43 KVM_CAP_PPC_AIL_MODE_3 +--------------------------- + +:Architectures: ppc + +This capability indicates that the kernel supports the mode 3 setting for the +"Address Translation Mode on Interrupt" aka "Alternate Interrupt Location" +resource that is controlled with the H_SET_MODE hypercall. + +This capability allows a guest kernel to use a better-performance mode for +handling interrupts and system calls. + +8.44 KVM_CAP_MEMORY_FAULT_INFO +------------------------------ + +:Architectures: x86 + +The presence of this capability indicates that KVM_RUN will fill +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if +there is a valid memslot but no backing VMA for the corresponding host virtual +address. + +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set +to KVM_EXIT_MEMORY_FAULT. + +Note: Userspaces which attempt to resolve memory faults so that they can retry +KVM_RUN are encouraged to guard against repeatedly receiving the same +error/annotated fault. + +See KVM_EXIT_MEMORY_FAULT for more information. + +8.45 KVM_CAP_X86_GUEST_MODE +--------------------------- + +:Architectures: x86 + +The presence of this capability indicates that KVM_RUN will update the +KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the +vCPU was executing nested guest code when it exited. + +KVM exits with the register state of either the L1 or L2 guest +depending on which executed at the time of an exit. Userspace must +take care to differentiate between these cases. + 9. Known KVM API problems ========================= @@ -9076,9 +9052,10 @@ the local APIC. The same is true for the ``KVM_FEATURE_PV_UNHALT`` paravirtualized feature. -CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``. -It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel -has enabled in-kernel emulation of the local APIC. +On older versions of Linux, CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by +``KVM_GET_SUPPORTED_CPUID``, but it can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` +is present and the kernel has enabled in-kernel emulation of the local APIC. +On newer versions, ``KVM_GET_SUPPORTED_CPUID`` does report the bit as available. CPU topology ~~~~~~~~~~~~ diff --git a/Documentation/wmi/devices/msi-wmi-platform.rst b/Documentation/wmi/devices/msi-wmi-platform.rst index 31a136942892..73197b31926a 100644 --- a/Documentation/wmi/devices/msi-wmi-platform.rst +++ b/Documentation/wmi/devices/msi-wmi-platform.rst @@ -138,6 +138,10 @@ input data, the meaning of which depends on the subfeature being accessed. The output buffer contains a single byte which signals success or failure (``0x00`` on failure) and 31 bytes of output data, the meaning if which depends on the subfeature being accessed. +.. note:: + The ACPI control method responsible for handling the WMI method calls is not thread-safe. + This is a firmware bug that needs to be handled inside the driver itself. + WMI method Get_EC() ------------------- |