summaryrefslogtreecommitdiffstats
path: root/arch
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'arm64-upstream' of ↵Linus Torvalds2024-05-1421-258/+384
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The most interesting parts are probably the mm changes from Ryan which optimise the creation of the linear mapping at boot and (separately) implement write-protect support for userfaultfd. Outside of our usual directories, the Kbuild-related changes under scripts/ have been acked by Masahiro whilst the drivers/acpi/ parts have been acked by Rafael and the addition of cpumask_any_and_but() has been acked by Yury. ACPI: - Support for the Firmware ACPI Control Structure (FACS) signature feature which is used to reboot out of hibernation on some systems Kbuild: - Support for building Flat Image Tree (FIT) images, where the kernel Image is compressed alongside a set of devicetree blobs Memory management: - Optimisation of our early page-table manipulation for creation of the linear mapping - Support for userfaultfd write protection, which brings along some nice cleanups to our handling of invalid but present ptes - Extend our use of range TLBI invalidation at EL1 Perf and PMUs: - Ensure that the 'pmu->parent' pointer is correctly initialised by PMU drivers - Avoid allocating 'cpumask_t' types on the stack in some PMU drivers - Fix parsing of the CPU PMU "version" field in assembly code, as it doesn't follow the usual architectural rules - Add best-effort unwinding support for USER_STACKTRACE - Minor driver fixes and cleanups Selftests: - Minor cleanups to the arm64 selftests (missing NULL check, unused variable) Miscellaneous: - Add a command-line alias for disabling 32-bit application support - Add part number for Neoverse-V2 CPUs - Minor fixes and cleanups" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits) arm64/mm: Fix pud_user_accessible_page() for PGTABLE_LEVELS <= 2 arm64/mm: Add uffd write-protect support arm64/mm: Move PTE_PRESENT_INVALID to overlay PTE_NG arm64/mm: Remove PTE_PROT_NONE bit arm64/mm: generalize PMD_PRESENT_INVALID for all levels arm64: simplify arch_static_branch/_jump function arm64: Add USER_STACKTRACE support arm64: Add the arm64.no32bit_el0 command line option drivers/perf: hisi: hns3: Actually use devm_add_action_or_reset() drivers/perf: hisi: hns3: Fix out-of-bound access when valid event group drivers/perf: hisi_pcie: Fix out-of-bound access when valid event group kselftest: arm64: Add a null pointer check arm64: defer clearing DAIF.D arm64: assembler: update stale comment for disable_step_tsk arm64/sysreg: Update PIE permission encodings kselftest/arm64: Remove unused parameters in abi test perf/arm-spe: Assign parents for event_source device perf/arm-smmuv3: Assign parents for event_source device perf/arm-dsu: Assign parents for event_source device perf/arm-dmc620: Assign parents for event_source device ...
| * Merge branch 'for-next/tlbi' into for-next/coreWill Deacon2024-05-091-22/+31
| |\ | | | | | | | | | | | | | | | | | | * for-next/tlbi: arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGES arm64: tlb: Improve __TLBI_VADDR_RANGE() arm64: tlb: Fix TLBI RANGE operand
| | * arm64: tlb: Allow range operation for MAX_TLBI_RANGE_PAGESGavin Shan2024-04-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MAX_TLBI_RANGE_PAGES pages is covered by SCALE#3 and NUM#31 and it's supported now. Allow TLBI RANGE operation when the number of pages is equal to MAX_TLBI_RANGE_PAGES in __flush_tlb_range_nosync(). Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20240405035852.1532010-4-gshan@redhat.com Signed-off-by: Will Deacon <will@kernel.org>
| | * arm64: tlb: Improve __TLBI_VADDR_RANGE()Gavin Shan2024-04-111-11/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The macro returns the operand of TLBI RANGE instruction. A mask needs to be applied to each individual field upon producing the operand, to avoid the adjacent fields can interfere with each other when invalid arguments have been provided. The code looks more tidy at least with a mask and FIELD_PREP(). Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20240405035852.1532010-3-gshan@redhat.com Signed-off-by: Will Deacon <will@kernel.org>
| * | Merge branch 'for-next/perf' into for-next/coreWill Deacon2024-05-095-121/+134
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * for-next/perf: (41 commits) arm64: Add USER_STACKTRACE support drivers/perf: hisi: hns3: Actually use devm_add_action_or_reset() drivers/perf: hisi: hns3: Fix out-of-bound access when valid event group drivers/perf: hisi_pcie: Fix out-of-bound access when valid event group perf/arm-spe: Assign parents for event_source device perf/arm-smmuv3: Assign parents for event_source device perf/arm-dsu: Assign parents for event_source device perf/arm-dmc620: Assign parents for event_source device perf/arm-ccn: Assign parents for event_source device perf/arm-cci: Assign parents for event_source device perf/alibaba_uncore: Assign parents for event_source device perf/arm_pmu: Assign parents for event_source devices perf/imx_ddr: Assign parents for event_source devices perf/qcom: Assign parents for event_source devices Documentation: qcom-pmu: Use /sys/bus/event_source/devices paths perf/riscv: Assign parents for event_source devices perf/thunderx2: Assign parents for event_source devices Documentation: thunderx2-pmu: Use /sys/bus/event_source/devices paths perf/xgene: Assign parents for event_source devices Documentation: xgene-pmu: Use /sys/bus/event_source/devices paths ...
| | * | arm64: Add USER_STACKTRACE supportchenqiwu2024-05-033-114/+125
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, userstacktrace is unsupported for ftrace and uprobe tracers on arm64. This patch uses the perf_callchain_user() code as blueprint to implement the arch_stack_walk_user() which add userstacktrace support on arm64. Meanwhile, we can use arch_stack_walk_user() to simplify the implementation of perf_callchain_user(). This patch is tested pass with ftrace, uprobe and perf tracers profiling userstacktrace cases. Tested-by: chenqiwu <qiwu.chen@transsion.com> Signed-off-by: chenqiwu <qiwu.chen@transsion.com> Link: https://lore.kernel.org/r/20231219022229.10230-1-qiwu.chen@transsion.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | arm64: arm_pmuv3: Correctly extract and check the PMUVerYicong Yang2024-04-122-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we're using "sbfx" to extract the PMUVer from ID_AA64DFR0_EL1 and skip the init/reset if no PMU present when the extracted PMUVer is negative or is zero. However for PMUv3p8 the PMUVer will be 0b1000 and PMUVer extracted by "sbfx" will always be negative and we'll skip the init/reset in __init_el2_debug/reset_pmuserenr_el0 unexpectedly. So this patch use "ubfx" instead of "sbfx" to extract the PMUVer. If the PMUVer is implementation defined (0b1111) or not implemented(0b0000) then skip the reset/init. Previously we'll also skip the init/reset if the PMUVer is higher than the version we known (currently PMUv3p9), with this patch we'll only skip if the PMU is not implemented or implementation defined. This keeps consistence with how we probe the PMU in the driver with pmuv3_implemented(). Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20240411123030.7201-1-yangyicong@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
| * | | Merge branch 'for-next/mm' into for-next/coreWill Deacon2024-05-094-78/+157
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * for-next/mm: arm64/mm: Fix pud_user_accessible_page() for PGTABLE_LEVELS <= 2 arm64/mm: Add uffd write-protect support arm64/mm: Move PTE_PRESENT_INVALID to overlay PTE_NG arm64/mm: Remove PTE_PROT_NONE bit arm64/mm: generalize PMD_PRESENT_INVALID for all levels arm64: mm: Don't remap pgtables for allocate vs populate arm64: mm: Batch dsb and isb when populating pgtables arm64: mm: Don't remap pgtables per-cont(pte|pmd) block
| | * | | arm64/mm: Fix pud_user_accessible_page() for PGTABLE_LEVELS <= 2Ryan Roberts2024-05-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent change to use pud_valid() as part of the implementation of pud_user_accessible_page() fails to build when PGTABLE_LEVELS <= 2 because pud_valid() is not defined in that case. Fix this by defining pud_valid() to false for this case. This means that pud_user_accessible_page() will correctly always return false for this config. Fixes: f0f5863a0fb0 ("arm64/mm: Remove PTE_PROT_NONE bit") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202405082221.43rfWxz5-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/r/20240509122844.563320-1-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64/mm: Add uffd write-protect supportRyan Roberts2024-05-033-0/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let's use the newly-free PTE SW bit (58) to add support for uffd-wp. The standard handlers are implemented for set/test/clear for both pte and pmd. Additionally we must also track the uffd-wp state as a pte swp bit, so use a free swap pte bit (3). Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/r/20240503144604.151095-5-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64/mm: Move PTE_PRESENT_INVALID to overlay PTE_NGRyan Roberts2024-05-032-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PTE_PRESENT_INVALID was previously occupying bit 59, which when a PTE is valid can either be IGNORED, PBHA[0] or AttrIndex[3], depending on the HW configuration. In practice this is currently not a problem because PTE_PRESENT_INVALID can only be 1 when PTE_VALID=0 and upstream Linux always requires the bit set to 0 for a valid pte. However, if in future Linux wants to use the field (e.g. AttrIndex[3]) then we could end up with confusion when PTE_PRESENT_INVALID comes along and corrupts the field - we would ideally want to preserve it even for an invalid (but present) pte. The other problem with bit 59 is that it prevents the offset field of a swap entry within a swap pte from growing beyond 51 bits. By moving PTE_PRESENT_INVALID to a low bit we can lay the swap pte out so that the offset field could grow to 52 bits in future. So let's move PTE_PRESENT_INVALID to overlay PTE_NG (bit 11). There is no need to persist NG for a present-invalid entry; it is always set for user mappings and is not used by SW to derive any state from the pte. PTE_NS was considered instead of PTE_NG, but it is RES0 for non-secure SW, so there is a chance that future architecture may allocate the bit and we may therefore need to persist that bit for present-invalid ptes. These are both marginal benefits, but make things a bit tidier in my opinion. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/r/20240503144604.151095-4-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64/mm: Remove PTE_PROT_NONE bitRyan Roberts2024-05-032-16/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the PTE_PRESENT_INVALID and PTE_PROT_NONE functionality explicitly occupy 2 bits in the PTE when PTE_VALID/PMD_SECT_VALID is clear. This has 2 significant consequences: - PTE_PROT_NONE consumes a precious SW PTE bit that could be used for other things. - The swap pte layout must reserve those same 2 bits and ensure they are both always zero for a swap pte. It would be nice to reclaim at least one of those bits. But PTE_PRESENT_INVALID, which since the previous patch, applies uniformly to page/block descriptors at any level when PTE_VALID is clear, can already give us most of what PTE_PROT_NONE requires: If it is set, then the pte is still considered present; pte_present() returns true and all the fields in the pte follow the HW interpretation (e.g. SW can safely call pte_pfn(), etc). But crucially, the HW treats the pte as invalid and will fault if it hits. So let's remove PTE_PROT_NONE entirely and instead represent PROT_NONE as a present but invalid pte (PTE_VALID=0, PTE_PRESENT_INVALID=1) with PTE_USER=0 and PTE_UXN=1. This is a unique combination that is not used anywhere else. The net result is a clearer, simpler, more generic encoding scheme that applies uniformly to all levels. Additionally we free up a PTE SW bit and a swap pte bit (bit 58 in both cases). Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/r/20240503144604.151095-3-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64/mm: generalize PMD_PRESENT_INVALID for all levelsRyan Roberts2024-05-032-13/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As preparation for the next patch, which frees up the PTE_PROT_NONE present pte and swap pte bit, generalize PMD_PRESENT_INVALID to PTE_PRESENT_INVALID. This will then be used to mark PROT_NONE ptes (and entries at any other level) in the next patch. While we're at it, fix up the swap pte format comment to include PTE_PRESENT_INVALID. This is not new, it just wasn't previously documented. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/r/20240503144604.151095-2-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64: mm: Don't remap pgtables for allocate vs populateRyan Roberts2024-04-122-32/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During linear map pgtable creation, each pgtable is fixmapped / fixunmapped twice; once during allocation to zero the memory, and a again during population to write the entries. This means each table has 2 TLB invalidations issued against it. Let's fix this so that each table is only fixmapped/fixunmapped once, halving the number of TLBIs, and improving performance. Achieve this by separating allocation and initialization (zeroing) of the page. The allocated page is now fixmapped directly by the walker and initialized, before being populated and finally fixunmapped. This approach keeps the change small, but has the side effect that late allocations (using __get_free_page()) must also go through the generic memory clearing routine. So let's tell __get_free_page() not to zero the memory to avoid duplication. Additionally this approach means that fixmap/fixunmap is still used for late pgtable modifications. That's not technically needed since the memory is all mapped in the linear map by that point. That's left as a possible future optimization if found to be needed. Execution time of map_mem(), which creates the kernel linear map page tables, was measured on different machines with different RAM configs: | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra | VM, 16G | VM, 64G | VM, 256G | Metal, 512G ---------------|-------------|-------------|-------------|------------- | ms (%) | ms (%) | ms (%) | ms (%) ---------------|-------------|-------------|-------------|------------- before | 11 (0%) | 161 (0%) | 656 (0%) | 1654 (0%) after | 10 (-11%) | 104 (-35%) | 438 (-33%) | 1223 (-26%) Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com> Tested-by: Eric Chanudet <echanude@redhat.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240412131908.433043-4-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64: mm: Batch dsb and isb when populating pgtablesRyan Roberts2024-04-122-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After removing uneccessary TLBIs, the next bottleneck when creating the page tables for the linear map is DSB and ISB, which were previously issued per-pte in __set_pte(). Since we are writing multiple ptes in a given pte table, we can elide these barriers and insert them once we have finished writing to the table. Execution time of map_mem(), which creates the kernel linear map page tables, was measured on different machines with different RAM configs: | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra | VM, 16G | VM, 64G | VM, 256G | Metal, 512G ---------------|-------------|-------------|-------------|------------- | ms (%) | ms (%) | ms (%) | ms (%) ---------------|-------------|-------------|-------------|------------- before | 78 (0%) | 435 (0%) | 1723 (0%) | 3779 (0%) after | 11 (-86%) | 161 (-63%) | 656 (-62%) | 1654 (-56%) Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com> Tested-by: Eric Chanudet <echanude@redhat.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240412131908.433043-3-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64: mm: Don't remap pgtables per-cont(pte|pmd) blockRyan Roberts2024-04-121-13/+14
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A large part of the kernel boot time is creating the kernel linear map page tables. When rodata=full, all memory is mapped by pte. And when there is lots of physical ram, there are lots of pte tables to populate. The primary cost associated with this is mapping and unmapping the pte table memory in the fixmap; at unmap time, the TLB entry must be invalidated and this is expensive. Previously, each pmd and pte table was fixmapped/fixunmapped for each cont(pte|pmd) block of mappings (16 entries with 4K granule). This means we ended up issuing 32 TLBIs per (pmd|pte) table during the population phase. Let's fix that, and fixmap/fixunmap each page once per population, for a saving of 31 TLBIs per (pmd|pte) table. This gives a significant boot speedup. Execution time of map_mem(), which creates the kernel linear map page tables, was measured on different machines with different RAM configs: | Apple M2 VM | Ampere Altra| Ampere Altra| Ampere Altra | VM, 16G | VM, 64G | VM, 256G | Metal, 512G ---------------|-------------|-------------|-------------|------------- | ms (%) | ms (%) | ms (%) | ms (%) ---------------|-------------|-------------|-------------|------------- before | 168 (0%) | 2198 (0%) | 8644 (0%) | 17447 (0%) after | 78 (-53%) | 435 (-80%) | 1723 (-80%) | 3779 (-78%) Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Itaru Kitayama <itaru.kitayama@fujitsu.com> Tested-by: Eric Chanudet <echanude@redhat.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240412131908.433043-2-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| * | | Merge branch 'for-next/misc' into for-next/coreWill Deacon2024-05-099-43/+48
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * for-next/misc: arm64: simplify arch_static_branch/_jump function arm64: Add the arm64.no32bit_el0 command line option arm64: defer clearing DAIF.D arm64: assembler: update stale comment for disable_step_tsk arm64/sysreg: Update PIE permission encodings arm64: Add Neoverse-V2 part arm64: Remove unnecessary irqflags alternative.h include
| | * | | arm64: simplify arch_static_branch/_jump functionGeorge Guo2024-05-031-13/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extracted the jump table definition code from the arch_static_branch and arch_static_branch_jump functions into a macro JUMP_TABLE_ENTRY to reduce code duplication. Signed-off-by: George Guo <guodongtai@kylinos.cn> Link: https://lore.kernel.org/r/20240430085655.2798551-2-dongtai.guo@linux.dev Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64: Add the arm64.no32bit_el0 command line optionAndrea della Porta2024-05-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introducing the field 'el0' to the idreg-override for register ID_AA64PFR0_EL1. This field is also aliased to the new kernel command line option 'arm64.no32bit_el0' as a more recognizable and mnemonic name to disable the execution of 32 bit userspace applications (i.e. avoid Aarch32 execution state in EL0) from kernel command line. Link: https://lore.kernel.org/all/20240207105847.7739-1-andrea.porta@suse.com/ Signed-off-by: Andrea della Porta <andrea.porta@suse.com> Link: https://lore.kernel.org/r/20240429102833.6426-1-andrea.porta@suse.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64: defer clearing DAIF.DMark Rutland2024-04-284-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For historical reasons we unmask debug exceptions in __cpu_setup(), but it's not necessary to unmask debug exceptions this early in the boot/idle entry paths. It would be better to unmask debug exceptions later in C code as this simplifies the current code and will make it easier to rework exception masking logic to handle non-DAIF bits in future (e.g. PSTATE.{ALLINT,PM}). We started clearing DAIF.D in __cpu_setup() in commit: 2ce39ad15182604b ("arm64: debug: unmask PSTATE.D earlier") At the time, we needed to ensure that DAIF.D was clear on the primary CPU before scheduling and preemption were possible, and chose to do this in __cpu_setup() so that this occurred in the same place for primary and secondary CPUs. As we cannot handle debug exceptions this early, we placed an ISB between initializing MDSCR_EL1 and clearing DAIF.D so that no exceptions should be triggered. Subsequently we rewrote the return-from-{idle,suspend} paths to use __cpu_setup() in commit: cabe1c81ea5be983 ("arm64: Change cpu_resume() to enable mmu early then access sleep_sp by va") ... which allowed for earlier use of the MMU and had the desirable property of using the same code to reset the CPU in the cold and warm boot paths. This introduced a bug: DAIF.D was clear while cpu_do_resume() restored MDSCR_EL1 and other control registers (e.g. breakpoint/watchpoint control/value registers), and so we could unexpectedly take debug exceptions. We fixed that in commit: 744c6c37cc18705d ("arm64: kernel: Fix unmasked debug exceptions when restoring mdscr_el1") ... by having cpu_do_resume() use the `disable_dbg` macro to set DAIF.D before restoring MDSCR_EL1 and other control registers. This relies on DAIF.D being subsequently cleared again in cpu_resume(). Subsequently we reworked DAIF masking in commit: 0fbeb318754860b3 ("arm64: explicitly mask all exceptions") ... where we began enforcing a policy that DAIF.D being set implies all other DAIF bits are set, and so e.g. we cannot take an IRQ while DAIF.D is set. As part of this the use of `disable_dbg` in cpu_resume() was replaced with `disable_daif` for consistency with the rest of the kernel. These days, there's no need to clear DAIF.D early within __cpu_setup(): * setup_arch() clears DAIF.DA before scheduling and preemption are possible on the primary CPU, avoiding the problem we we originally trying to work around. Note: DAIF.IF get cleared later when interrupts are enabled for the first time. * secondary_start_kernel() clears all DAIF bits before scheduling and preemption are possible on secondary CPUs. Note: with pseudo-NMI, the PMR is initialized here before any DAIF bits are cleared. Similar will be necessary for the architectural NMI. * cpu_suspend() restores all DAIF bits when returning from idle, ensuring that we don't unexpectedly leave DAIF.D clear or set. Note: with pseudo-NMI, the PMR is initialized here before DAIF is cleared. Similar will be necessary for the architectural NMI. This patch removes the unmasking of debug exceptions from __cpu_setup(), relying on the above locations to initialize DAIF. This allows some other cleanups: * It is no longer necessary for cpu_resume() to explicitly mask debug (or other) exceptions, as it is always called with all DAIF bits set. Thus we drop the use of `disable_daif`. * The `enable_dbg` macro is no longer used, and so is dropped. * It is no longer necessary to have an ISB immediately after initializing MDSCR_EL1 in __cpu_setup(), and we can revert to relying on the context synchronization that occurs when the MMU is enabled between __cpu_setup() and code which clears DAIF.D Comments are added to setup_arch() and secondary_start_kernel() to explain the initial unmasking of the DAIF bits. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20240422113523.4070414-3-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64: assembler: update stale comment for disable_step_tskMark Rutland2024-04-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A comment in the disable_step_tsk macro refers to synchronising with enable_dbg, as historically the entry used enable_dbg to unmask debug exceptions after disabling single-stepping. These days the unmasking happens in entry-common.c via local_daif_restore() or local_daif_inherit(), so the comment is stale. This logic is likely to chang in future, so it would be best to avoid referring to those macros specifically. Update the comment to take this into account, and describe it in terms of clearing DAIF.D so that it doesn't macro where this logic lives nor what it is called. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20240422113523.4070414-2-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64/sysreg: Update PIE permission encodingsShiqi Liu2024-04-281-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix left shift overflow issue when the parameter idx is greater than or equal to 8 in the calculation of perm in PIRx_ELx_PERM macro. Fix this by modifying the encoding to use a long integer type. Signed-off-by: Shiqi Liu <shiqiliu@hust.edu.cn> Acked-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20240421063328.29710-1-shiqiliu@hust.edu.cn Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64: Add Neoverse-V2 partBesar Wicaksono2024-04-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the part number and MIDR for Neoverse-V2 Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com> Reviewed-by: James Clark <james.clark@arm.com> Link: https://lore.kernel.org/r/20240109192310.16234-2-bwicaksono@nvidia.com Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64: Remove unnecessary irqflags alternative.h includeJinjie Ruan2024-04-101-1/+0
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 20af807d806d ("arm64: Avoid cpus_have_const_cap() for ARM64_HAS_GIC_PRIO_MASKING"), the alternative.h include is not used, so remove it. Fixes: 20af807d806d ("arm64: Avoid cpus_have_const_cap() for ARM64_HAS_GIC_PRIO_MASKING") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://lore.kernel.org/r/20240314063819.2636445-1-ruanjinjie@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
| * | | Merge branch 'for-next/kbuild' into for-next/coreWill Deacon2024-05-093-3/+15
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | * for-next/kbuild: arm64: boot: Support Flat Image Tree arm64: Add BOOT_TARGETS variable
| | * | | arm64: boot: Support Flat Image TreeSimon Glass2024-04-123-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a script which produces a Flat Image Tree (FIT), a single file containing the built kernel and associated devicetree files. Compression defaults to gzip which gives a good balance of size and performance. The files compress from about 86MB to 24MB using this approach. The FIT can be used by bootloaders which support it, such as U-Boot and Linuxboot. It permits automatic selection of the correct devicetree, matching the compatible string of the running board with the closest compatible string in the FIT. There is no need for filenames or other workarounds. Add a 'make image.fit' build target for arm64, as well. The FIT can be examined using 'dumpimage -l'. This uses the 'dtbs-list' file but processes only .dtb files, ignoring the overlay .dtbo files. This features requires pylibfdt (use 'pip install libfdt'). It also requires compression utilities for the algorithm being used. Supported compression options are the same as the Image.xxx files. Use FIT_COMPRESSION to select an algorithm other than gzip. While FIT supports a ramdisk / initrd, no attempt is made to support this here, since it must be built separately from the Linux build. Signed-off-by: Simon Glass <sjg@chromium.org> Acked-by: Masahiro Yamada <masahiroy@kernel.org> Link: https://lore.kernel.org/r/20240329032836.141899-3-sjg@chromium.org Signed-off-by: Will Deacon <will@kernel.org>
| | * | | arm64: Add BOOT_TARGETS variableSimon Glass2024-04-121-1/+5
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new variable containing a list of possible targets. Mark them as phony. This matches the approach taken for arch/arm Signed-off-by: Simon Glass <sjg@chromium.org> Reviewed-by: Nicolas Schier <n.schier@avm.de> Link: https://lore.kernel.org/r/20240329032836.141899-2-sjg@chromium.org Signed-off-by: Will Deacon <will@kernel.org>
| * / / arm64: acpi: Honour firmware_signature field of FACS, if it existsDavid Woodhouse2024-04-181-0/+10
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the firmware_signature changes then OSPM should not attempt to resume from hibernate, but should instead perform a clean reboot. Set the global swsusp_hardware_signature to allow the generic code to include the value in the swsusp header on disk, and perform the appropriate check on resume. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20240412073530.2222496-3-dwmw2@infradead.org Signed-off-by: Will Deacon <will@kernel.org>
* | | Merge tag 'm68k-for-v6.10-tag1' of ↵Linus Torvalds2024-05-1417-62/+41
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k Pull m68k updates from Geert Uytterhoeven: - Fix invalid context sleep and reboot hang on Mac - Fix spinlock race in kernel thread creation - Miscellaneous fixes and improvements - defconfig updates * tag 'm68k-for-v6.10-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: m68k: defconfig: Update defconfigs for v6.9-rc1 m68k: Move ARCH_HAS_CPU_CACHE_ALIASING m68k: mac: Fix reboot hang on Mac IIci m68k: Fix spinlock race in kernel thread creation m68k: Let GENERIC_IOMAP depend on HAS_IOPORT m68k: amiga: Use str_plural() to fix Coccinelle warning macintosh/via-macii: Fix "BUG: sleeping function called from invalid context" zorro: Use helpers from ioport.h m68k: Calculate THREAD_SIZE from THREAD_SIZE_ORDER
| * | | m68k: defconfig: Update defconfigs for v6.9-rc1Geert Uytterhoeven2024-05-0812-36/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Enable trimming of unused exported kernel symbols, - Drop CONFIG_IP_NF_ARPTABLES=m (auto-enabled since commit 4654467dc7e111e8 ("netfilter: arptables: allow xtables-nft only builds")), - Drop CONFIG_STRING_SELFTEST=m (replaced by auto-modular CONFIG_STRING_KUNIT_TEST in commit 29d8568849fe5937 ("string: Convert selftest to KUnit")), - Drop CONFIG_TEST_STRING_HELPERS=m (replaced by auto-modular CONFIG_STRING_HELPERS_KUNIT_TEST in commit fb57550fcbd86839 ("string: Convert helpers selftest to KUnit")). Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/e17b3ac60832a3ff92d25d1a05bf814e8f15d0c5.1711475325.git.geert@linux-m68k.org
| * | | m68k: Move ARCH_HAS_CPU_CACHE_ALIASINGGeert Uytterhoeven2024-05-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the recently added ARCH_HAS_CPU_CACHE_ALIASING to restore alphabetical sort order. Fixes: 8690bbcf3b7010b3 ("Introduce cpu_dcache_is_aliasing() across all architectures") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/4574ad6cc1117e4b5d29812c165bf7f6e5b60773.1714978406.git.geert@linux-m68k.org
| * | | m68k: mac: Fix reboot hang on Mac IIciFinn Thain2024-05-081-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling mac_reset() on a Mac IIci does reset the system, but what follows is a POST failure that requires a manual reset to resolve. Avoid that by using the 68030 asm implementation instead of the C implementation. Apparently the SE/30 has a similar problem as it has used the asm implementation since before git. This patch extends that solution to other systems with a similar ROM. After this patch, the only systems still using the C implementation are 68040 systems where adb_type is either MAC_ADB_IOP or MAC_ADB_II. This implies a 1 MiB Quadra ROM. This now includes the Quadra 900/950, which previously fell through to the "should never get here" catch-all. Reported-and-tested-by: Stan Johnson <userm57@yahoo.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Finn Thain <fthain@linux-m68k.org> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/480ebd1249d229c6dc1f3f1c6d599b8505483fd8.1714797072.git.fthain@linux-m68k.org Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
| * | | m68k: Fix spinlock race in kernel thread creationMichael Schmitz2024-05-081-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Context switching does take care to retain the correct lock owner across the switch from 'prev' to 'next' tasks. This does rely on interrupts remaining disabled for the entire duration of the switch. This condition is guaranteed for normal process creation and context switching between already running processes, because both 'prev' and 'next' already have interrupts disabled in their saved copies of the status register. The situation is different for newly created kernel threads. The status register is set to PS_S in copy_thread(), which does leave the IPL at 0. Upon restoring the 'next' thread's status register in switch_to() aka resume(), interrupts then become enabled prematurely. resume() then returns via ret_from_kernel_thread() and schedule_tail() where run queue lock is released (see finish_task_switch() and finish_lock_switch()). A timer interrupt calling scheduler_tick() before the lock is released in finish_task_switch() will find the lock already taken, with the current task as lock owner. This causes a spinlock recursion warning as reported by Guenter Roeck. As far as I can ascertain, this race has been opened in commit 533e6903bea0 ("m68k: split ret_from_fork(), simplify kernel_thread()") but I haven't done a detailed study of kernel history so it may well predate that commit. Interrupts cannot be disabled in the saved status register copy for kernel threads (init will complain about interrupts disabled when finally starting user space). Disable interrupts temporarily when switching the tasks' register sets in resume(). Note that a simple oriw 0x700,%sr after restoring sr is not enough here - this leaves enough of a race for the 'spinlock recursion' warning to still be observed. Tested on ARAnyM and qemu (Quadra 800 emulation). Fixes: 533e6903bea0 ("m68k: split ret_from_fork(), simplify kernel_thread()") Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/all/07811b26-677c-4d05-aeb4-996cd880b789@roeck-us.net Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20240411033631.16335-1-schmitzmic@gmail.com Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
| * | | m68k: Let GENERIC_IOMAP depend on HAS_IOPORTNiklas Schnelle2024-05-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a future patch HAS_IOPORT=n will disable inb()/outb() and friends at compile time. With that choosing dynamically between I/O port and MMIO access via GNERIC_IOMAP will not work. So only select GENERIC_IOMAP when HAS_IOPORT is selected. Co-developed-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20240403122851.38808-2-schnelle@linux.ibm.com Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
| * | | m68k: amiga: Use str_plural() to fix Coccinelle warningThorsten Blum2024-04-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes the following Coccinelle/coccicheck warning reported by string_choices.cocci: opportunity for str_plural(zorro_num_autocon) Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20240412215704.204403-4-thorsten.blum@toblux.com Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
| * | | m68k: Calculate THREAD_SIZE from THREAD_SIZE_ORDERDawei Li2024-04-021-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current THREAD_SIZE_ORDER implementation is not generic. Improve it by: - Defining THREAD_SIZE_ORDER based on the specific platform config, - Calculating THREAD_SIZE from THREAD_SIZE_ORDER. Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Dawei Li <dawei.li@shingroup.cn> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20240304085455.125063-1-dawei.li@shingroup.cn Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
* | | | Merge tag 'x86-irq-2024-05-12' of ↵Linus Torvalds2024-05-1416-114/+335
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 interrupt handling updates from Thomas Gleixner: "Add support for posted interrupts on bare metal. Posted interrupts is a virtualization feature which allows to inject interrupts directly into a guest without host interaction. The VT-d interrupt remapping hardware sets the bit which corresponds to the interrupt vector in a vector bitmap which is either used to inject the interrupt directly into the guest via a virtualized APIC or in case that the guest is scheduled out provides a host side notification interrupt which informs the host that an interrupt has been marked pending in the bitmap. This can be utilized on bare metal for scenarios where multiple devices, e.g. NVME storage, raise interrupts with a high frequency. In the default mode these interrupts are handles independently and therefore require a full roundtrip of interrupt entry/exit. Utilizing posted interrupts this roundtrip overhead can be avoided by coalescing these interrupt entries to a single entry for the posted interrupt notification. The notification interrupt then demultiplexes the pending bits in a memory based bitmap and invokes the corresponding device specific handlers. Depending on the usage scenario and device utilization throughput improvements between 10% and 130% have been measured. As this is only relevant for high end servers with multiple device queues per CPU attached and counterproductive for situations where interrupts are arriving at distinct times, the functionality is opt-in via a kernel command line parameter" * tag 'x86-irq-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/irq: Use existing helper for pending vector check iommu/vt-d: Enable posted mode for device MSIs iommu/vt-d: Make posted MSI an opt-in command line option x86/irq: Extend checks for pending vectors to posted interrupts x86/irq: Factor out common code for checking pending interrupts x86/irq: Install posted MSI notification handler x86/irq: Factor out handler invocation from common_interrupt() x86/irq: Set up per host CPU posted interrupt descriptors x86/irq: Reserve a per CPU IDT vector for posted MSIs x86/irq: Add a Kconfig option for posted MSI x86/irq: Remove bitfields in posted interrupt descriptor x86/irq: Unionize PID.PIR for 64bit access w/o casting KVM: VMX: Move posted interrupt descriptor out of VMX code
| * | | | x86/irq: Use existing helper for pending vector checkJacob Pan2024-05-081-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | lapic_vector_set_in_irr() is already available, use it for checking pending vectors at the local APIC. No functional change. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Imran Khan <imran.f.khan@oracle.com> Link: https://lore.kernel.org/r/20240506175612.1141095-1-jacob.jun.pan@linux.intel.com
| * | | | iommu/vt-d: Make posted MSI an opt-in command line optionJacob Pan2024-04-301-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a command line opt-in option for posted MSI if CONFIG_X86_POSTED_MSI=y. Also introduce a helper function for testing if posted MSI is supported on the platform. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-12-jacob.jun.pan@linux.intel.com
| * | | | x86/irq: Extend checks for pending vectors to posted interruptsJacob Pan2024-04-302-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During interrupt affinity change, it is possible to have interrupts delivered to the old CPU after the affinity has changed to the new one. To prevent lost interrupts, local APIC IRR is checked on the old CPU. Similar checks must be done for posted MSIs given the same reason. Consider the following scenario: Device system agent iommu memory CPU/LAPIC 1 FEEX_XXXX 2 Interrupt request 3 Fetch IRTE -> 4 ->Atomic Swap PID.PIR(vec) Push to Global Observable(GO) 5 if (ON*) done;* else 6 send a notification -> * ON: outstanding notification, 1 will suppress new notifications If the affinity change happens between 3 and 5 in the IOMMU, the old CPU's posted interrupt request (PIR) could have the pending bit set for the vector being moved. Add a helper function to check individual vector status. Then use the helper to check for pending interrupts on the source CPU's PID. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-11-jacob.jun.pan@linux.intel.com
| * | | | x86/irq: Factor out common code for checking pending interruptsJacob Pan2024-04-303-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a common function for checking pending interrupt vector in APIC IRR instead of duplicated open coding them. Additional checks for posted MSI vectors can then be contained in this function. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-10-jacob.jun.pan@linux.intel.com
| * | | | x86/irq: Install posted MSI notification handlerJacob Pan2024-04-305-4/+135
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All MSI vectors are multiplexed into a single notification vector when posted MSI is enabled. It is the responsibility of the notification vector handler to demultiplex MSI vectors. In the handler the MSI vector handlers are dispatched without IDT delivery for each pending MSI interrupt. For example, the interrupt flow will change as follows: (3 MSIs of different vectors arrive in a a high frequency burst) BEFORE: interrupt(MSI) irq_enter() handler() /* EOI */ irq_exit() process_softirq() interrupt(MSI) irq_enter() handler() /* EOI */ irq_exit() process_softirq() interrupt(MSI) irq_enter() handler() /* EOI */ irq_exit() process_softirq() AFTER: interrupt /* Posted MSI notification vector */ irq_enter() atomic_xchg(PIR) handler() handler() handler() pi_clear_on() apic_eoi() irq_exit() process_softirq() Except for the leading MSI, CPU notifications are skipped/coalesced. For MSIs which arrive at a low frequency, the demultiplexing loop does not wait for more interrupts to coalesce. Therefore, there's no additional latency other than the processing time. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-9-jacob.jun.pan@linux.intel.com
| * | | | x86/irq: Factor out handler invocation from common_interrupt()Jacob Pan2024-04-301-9/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for calling external interrupt handlers directly from the posted MSI demultiplexing loop. Extract the common code from common_interrupt() to avoid code duplication. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-8-jacob.jun.pan@linux.intel.com
| * | | | x86/irq: Set up per host CPU posted interrupt descriptorsJacob Pan2024-04-304-0/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To support posted MSIs, create a posted interrupt descriptor (PID) for each host CPU. Later on, when setting up interrupt affinity, the IOMMU's interrupt remapping table entry (IRTE) will point to the physical address of the matching CPU's PID. Each PID is initialized with the owner CPU's physical APICID as the destination. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-7-jacob.jun.pan@linux.intel.com
| * | | | x86/irq: Reserve a per CPU IDT vector for posted MSIsJacob Pan2024-04-301-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When posted MSI is enabled, all device MSIs are multiplexed into a single notification vector. MSI handlers will be de-multiplexed at run-time by system software without IDT delivery. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-6-jacob.jun.pan@linux.intel.com
| * | | | x86/irq: Add a Kconfig option for posted MSIJacob Pan2024-04-301-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This option will be used to support delivering MSIs as posted interrupts. Interrupt remapping is required. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-5-jacob.jun.pan@linux.intel.com
| * | | | x86/irq: Remove bitfields in posted interrupt descriptorJacob Pan2024-04-303-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mixture of bitfields and types is weird and really not intuitive, remove bitfields and use typed data exclusively. Bitfields often result in inferior machine code. Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-4-jacob.jun.pan@linux.intel.com Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
| * | | | x86/irq: Unionize PID.PIR for 64bit access w/o castingJacob Pan2024-04-301-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make the PIR field into u64 such that atomic xchg64 can be used without ugly casting. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-3-jacob.jun.pan@linux.intel.com
| * | | | KVM: VMX: Move posted interrupt descriptor out of VMX codeJacob Pan2024-04-304-93/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To prepare native usage of posted interrupts, move the PID declarations out of VMX code such that they can be shared. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20240423174114.526704-2-jacob.jun.pan@linux.intel.com
* | | | | Merge tag 'irq-core-2024-05-12' of ↵Linus Torvalds2024-05-148-4/+330
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt subsystem updates from Thomas Gleixner: "Core code: - Interrupt storm detection for the lockup watchdog: Lockups which are caused by interrupt storms are not easy to debug because there is no information about the events which make the lockup detector trigger. To make this more user friendly, provide an extenstion to interrupt statistics which allows to take snapshots and an interface to retrieve the delta to the snapshot. Use this new mechanism in the watchdog code to do a two stage lockup analysis by taking the snapshot and printing the deltas for the topmost active interrupts on the second trigger. Note: This contains both the interrupt and the watchdog changes as the latter depend on the former obviously. - Avoid summation loops in the /proc/interrupts output and use the global counter when possible - Skip suspended interrupts on CPU hotplug operations to ensure that they are not delivered before the system resumes the device drivers when coming out of suspend. - On CPU hot-unplug interrupts which are affine to the outgoing CPU are migrated to a different CPU in the affinity mask. This can fail when the CPUs have no vectors left. Instead of giving up try to migrate it to any online CPU and thereby breaking the affinity setting in order to prevent a stale device interrupt which targets an offline CPU - The usual small cleanups Driver code: - Support for the RISCV AIA MSI controller - Make the interrupt allocation for the Loongson PCH controller more flexible to prevent vector exhaustion - The usual set of cleanups and fixes all over the place" * tag 'irq-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) irqchip/gic-v3-its: Remove BUG_ON in its_vpe_irq_domain_alloc cpuidle: Avoid explicit cpumask allocation on stack irqchip/sifive-plic: Avoid explicit cpumask allocation on stack irqchip/riscv-aplic-direct: Avoid explicit cpumask allocation on stack irqchip/loongson-eiointc: Avoid explicit cpumask allocation on stack irqchip/gic-v3-its: Avoid explicit cpumask allocation on stack irqchip/irq-bcm6345-l1: Avoid explicit cpumask allocation on stack cpumask: Introduce cpumask_first_and_and() irqchip/irq-brcmstb-l2: Avoid saving mask on shutdown genirq: Reuse irq_is_nmi() genirq/cpuhotplug: Retry with cpu_online_mask when migration fails genirq/cpuhotplug: Skip suspended interrupts when restoring affinity arm64: dts: st: Add interrupt parent to pinctrl on stm32mp251 arm64: dts: st: Add exti1 and exti2 nodes on stm32mp251 ARM: dts: stm32: List exti parent interrupts on stm32mp131 ARM: dts: stm32: List exti parent interrupts on stm32mp151 arm64: Kconfig.platforms: Enable STM32_EXTI for ARCH_STM32 irqchip/stm32-exti: Mark events reserved with RIF configuration check irqchip/stm32-exti: Skip secure events irqchip/stm32-exti: Convert driver to standard PM ...