summaryrefslogtreecommitdiffstats
path: root/arch/powerpc/kernel
Commit message (Collapse)AuthorAgeFilesLines
...
* | powerpc/vdso: Mark the vDSO code read-only after initChristophe Leroy2024-12-182-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | VDSO text is fixed-up during init so it can't be const, but it can be read-only after init. Do the same as x86 in commit 018ef8dcf3de ("x86/vdso: Mark the vDSO code read-only after init") and arm in commit 11bf9b865898 ("ARM/vdso: Mark the vDSO code read-only after init"), move it into ro_after_init section. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/e9892d288b646cbdfeef0b2b73edbaf6d3c6cabe.1734174500.git.christophe.leroy@csgroup.eu
* | powerpc/64: Use get_user() in start_thread()Michael Ellerman2024-12-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For ELFv1 binaries (big endian), the ELF entry point isn't the address of the first instruction, instead it points to the function descriptor for the entry point. The address of the first instruction is in the function descriptor. That means the kernel has to fetch the address of the first instruction from user memory. Because start_thread() uses __get_user(), which has no access_ok() checks, it looks like a malicious ELF binary could be crafted to point the entry point address at kernel memory. The kernel would load 8 bytes from kernel memory into the NIP and then start the process, it would typically crash, but a debugger could observe the NIP value which would be the result of reading from kernel memory. However that's NOT possible, because there is a check in load_elf_binary() that ensures the ELF entry point is < TASK_SIZE (look for BAD_ADDR(elf_entry)). However it's fragile for start_thread() to rely on a check elsewhere, even if the ELF parser is unlikely to ever drop the check that elf_entry is a user address. Make it more robust by using get_user(), which checks that the address points at userspace before doing the load. If the address doesn't point at userspace it will just set the result to zero, and the userspace program will crash at zero (which is fine because it's self-inflicted). Note that it's also possible for a malicious binary to have a valid ELF entry address, but with the first instruction address pointing into the kernel. However that's OK, because it is blocked by the MMU, just like any other attempt to jump into the kernel from userspace. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20241216121706.26790-1-mpe@ellerman.id.au
* | powerpc/32: Replace mulhdu() by mul_u64_u64_shr()Christophe Leroy2024-12-101-26/+0
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using mul_u64_u64_shr() provides similar calculation as mulhdu() assembly function, but enables inlining by the compiler. The home-made assembly function had special handling for when one of the arguments is not a fully populated u64 but time functions use it to multiply timebase by a calculated scale which is constructed to have most significant bit set. On mpc8xx sched_clock() runs 3% faster. On mpc83xx it is 2%. As you can see below, sched_clock() is not much bigger than before: c000cf68 <sched_clock>: c000cf68: 7d 2d 42 a6 mftbu r9 c000cf6c: 7d 0c 42 a6 mftb r8 c000cf70: 7d 4d 42 a6 mftbu r10 c000cf74: 7c 09 50 40 cmplw r9,r10 c000cf78: 40 82 ff f0 bne c000cf68 <sched_clock> c000cf7c: 3d 40 c1 37 lis r10,-16073 c000cf80: 38 8a b3 30 addi r4,r10,-19664 c000cf84: 80 ea b3 30 lwz r7,-19664(r10) c000cf88: 80 64 00 14 lwz r3,20(r4) c000cf8c: 39 40 00 00 li r10,0 c000cf90: 80 a4 00 04 lwz r5,4(r4) c000cf94: 80 c4 00 10 lwz r6,16(r4) c000cf98: 7c 63 40 10 subfc r3,r3,r8 c000cf9c: 80 84 00 08 lwz r4,8(r4) c000cfa0: 7d 06 49 10 subfe r8,r6,r9 c000cfa4: 7c c7 19 d6 mullw r6,r7,r3 c000cfa8: 7d 25 18 16 mulhwu r9,r5,r3 c000cfac: 7c 08 29 d6 mullw r0,r8,r5 c000cfb0: 7c 67 18 16 mulhwu r3,r7,r3 c000cfb4: 7d 29 30 14 addc r9,r9,r6 c000cfb8: 7c a8 28 16 mulhwu r5,r8,r5 c000cfbc: 7c ca 51 14 adde r6,r10,r10 c000cfc0: 7d 67 41 d6 mullw r11,r7,r8 c000cfc4: 7d 29 00 14 addc r9,r9,r0 c000cfc8: 7c c6 01 94 addze r6,r6 c000cfcc: 7c 63 28 14 addc r3,r3,r5 c000cfd0: 7d 4a 51 14 adde r10,r10,r10 c000cfd4: 7c e7 40 16 mulhwu r7,r7,r8 c000cfd8: 7c 63 58 14 addc r3,r3,r11 c000cfdc: 7d 4a 01 94 addze r10,r10 c000cfe0: 7c 63 30 14 addc r3,r3,r6 c000cfe4: 7d 4a 39 14 adde r10,r10,r7 c000cfe8: 35 24 ff e0 addic. r9,r4,-32 c000cfec: 41 80 00 10 blt c000cffc <sched_clock+0x94> c000cff0: 7c 63 48 30 slw r3,r3,r9 c000cff4: 38 80 00 00 li r4,0 c000cff8: 4e 80 00 20 blr c000cffc: 21 04 00 1f subfic r8,r4,31 c000d000: 54 69 f8 7e srwi r9,r3,1 c000d004: 7d 4a 20 30 slw r10,r10,r4 c000d008: 7d 29 44 30 srw r9,r9,r8 c000d00c: 7c 64 20 30 slw r4,r3,r4 c000d010: 7d 23 53 78 or r3,r9,r10 c000d014: 4e 80 00 20 blr Before this change: c000d0bc <sched_clock>: c000d0bc: 94 21 ff f0 stwu r1,-16(r1) c000d0c0: 7c 08 02 a6 mflr r0 c000d0c4: 90 01 00 14 stw r0,20(r1) c000d0c8: 93 e1 00 0c stw r31,12(r1) c000d0cc: 7d 2d 42 a6 mftbu r9 c000d0d0: 7d 0c 42 a6 mftb r8 c000d0d4: 7d 4d 42 a6 mftbu r10 c000d0d8: 7c 09 50 40 cmplw r9,r10 c000d0dc: 40 82 ff f0 bne c000d0cc <sched_clock+0x10> c000d0e0: 3f e0 c1 37 lis r31,-16073 c000d0e4: 3b ff b3 30 addi r31,r31,-19664 c000d0e8: 80 9f 00 14 lwz r4,20(r31) c000d0ec: 80 7f 00 10 lwz r3,16(r31) c000d0f0: 7c 84 40 10 subfc r4,r4,r8 c000d0f4: 80 bf 00 00 lwz r5,0(r31) c000d0f8: 80 df 00 04 lwz r6,4(r31) c000d0fc: 7c 63 49 10 subfe r3,r3,r9 c000d100: 48 00 37 85 bl c0010884 <mulhdu> c000d104: 81 3f 00 08 lwz r9,8(r31) c000d108: 35 49 ff e0 addic. r10,r9,-32 c000d10c: 41 80 00 20 blt c000d12c <sched_clock+0x70> c000d110: 80 01 00 14 lwz r0,20(r1) c000d114: 7c 83 50 30 slw r3,r4,r10 c000d118: 83 e1 00 0c lwz r31,12(r1) c000d11c: 38 80 00 00 li r4,0 c000d120: 7c 08 03 a6 mtlr r0 c000d124: 38 21 00 10 addi r1,r1,16 c000d128: 4e 80 00 20 blr c000d12c: 80 01 00 14 lwz r0,20(r1) c000d130: 54 8a f8 7e srwi r10,r4,1 c000d134: 21 09 00 1f subfic r8,r9,31 c000d138: 83 e1 00 0c lwz r31,12(r1) c000d13c: 7c 63 48 30 slw r3,r3,r9 c000d140: 7d 4a 44 30 srw r10,r10,r8 c000d144: 7c 84 48 30 slw r4,r4,r9 c000d148: 7d 43 1b 78 or r3,r10,r3 c000d14c: 7c 08 03 a6 mtlr r0 c000d150: 38 21 00 10 addi r1,r1,16 c000d154: 4e 80 00 20 blr c0010884 <mulhdu>: c0010884: 2c 06 00 00 cmpwi r6,0 c0010888: 2c 83 00 00 cmpwi cr1,r3,0 c001088c: 7c 8a 23 78 mr r10,r4 c0010890: 7c 84 28 16 mulhwu r4,r4,r5 c0010894: 41 82 00 14 beq c00108a8 <mulhdu+0x24> c0010898: 7c 0a 30 16 mulhwu r0,r10,r6 c001089c: 7c ea 29 d6 mullw r7,r10,r5 c00108a0: 7c e0 38 14 addc r7,r0,r7 c00108a4: 7c 84 01 94 addze r4,r4 c00108a8: 4d 86 00 20 beqlr cr1 c00108ac: 7d 23 29 d6 mullw r9,r3,r5 c00108b0: 7d 43 28 16 mulhwu r10,r3,r5 c00108b4: 41 82 00 18 beq c00108cc <mulhdu+0x48> c00108b8: 7c 03 31 d6 mullw r0,r3,r6 c00108bc: 7d 03 30 16 mulhwu r8,r3,r6 c00108c0: 7c e0 38 14 addc r7,r0,r7 c00108c4: 7c 84 41 14 adde r4,r4,r8 c00108c8: 7d 4a 01 94 addze r10,r10 c00108cc: 7c 84 48 14 addc r4,r4,r9 c00108d0: 7c 6a 01 94 addze r3,r10 c00108d4: 4e 80 00 20 blr Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/f29e473c193c87bdbd36b209dfdee99d2f0c60dc.1733566130.git.christophe.leroy@csgroup.eu
* powerpc/prom_init: Fixup missing powermac #size-cellsMichael Ellerman2024-11-271-2/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some powermacs `escc` nodes are missing `#size-cells` properties, which is deprecated and now triggers a warning at boot since commit 045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells handling"). For example: Missing '#size-cells' in /pci@f2000000/mac-io@c/escc@13000 WARNING: CPU: 0 PID: 0 at drivers/of/base.c:133 of_bus_n_size_cells+0x98/0x108 Hardware name: PowerMac3,1 7400 0xc0209 PowerMac ... Call Trace: of_bus_n_size_cells+0x98/0x108 (unreliable) of_bus_default_count_cells+0x40/0x60 __of_get_address+0xc8/0x21c __of_address_to_resource+0x5c/0x228 pmz_init_port+0x5c/0x2ec pmz_probe.isra.0+0x144/0x1e4 pmz_console_init+0x10/0x48 console_init+0xcc/0x138 start_kernel+0x5c4/0x694 As powermacs boot via prom_init it's possible to add the missing properties to the device tree during boot, avoiding the warning. Note that `escc-legacy` nodes are also missing `#size-cells` properties, but they are skipped by the macio driver, so leave them alone. Depends-on: 045b14ca5c36 ("of: WARN on deprecated #address-cells/#size-cells handling") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20241126025710.591683-1-mpe@ellerman.id.au
* Merge tag 'powerpc-6.13-1' of ↵Linus Torvalds2024-11-2328-344/+679
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Rework kfence support for the HPT MMU to work on systems with >= 16TB of RAM. - Remove the powerpc "maple" platform, used by the "Yellow Dog Powerstation". - Add support for DYNAMIC_FTRACE_WITH_CALL_OPS, DYNAMIC_FTRACE_WITH_DIRECT_CALLS & BPF Trampolines. - Add support for running KVM nested guests on Power11. - Other small features, cleanups and fixes. Thanks to Amit Machhiwal, Arnd Bergmann, Christophe Leroy, Costa Shulyupin, David Hunter, David Wang, Disha Goel, Gautam Menghani, Geert Uytterhoeven, Hari Bathini, Julia Lawall, Kajol Jain, Keith Packard, Lukas Bulwahn, Madhavan Srinivasan, Markus Elfring, Michal Suchanek, Ming Lei, Mukesh Kumar Chaurasiya, Nathan Chancellor, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A, Paulo Miguel Almeida, Pavithra Prakash, Ritesh Harjani (IBM), Rob Herring (Arm), Sachin P Bappalige, Shen Lichuan, Simon Horman, Sourabh Jain, Thomas Weißschuh, Thorsten Blum, Thorsten Leemhuis, Venkat Rao Bagalkote, Zhang Zekun, and zhang jiao. * tag 'powerpc-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (89 commits) EDAC/powerpc: Remove PPC_MAPLE drivers powerpc/perf: Add per-task/process monitoring to vpa_pmu driver powerpc/kvm: Add vpa latency counters to kvm_vcpu_arch docs: ABI: sysfs-bus-event_source-devices-vpa-pmu: Document sysfs event format entries for vpa_pmu powerpc/perf: Add perf interface to expose vpa counters MAINTAINERS: powerpc: Mark Maddy as "M" powerpc/Makefile: Allow overriding CPP powerpc-km82xx.c: replace of_node_put() with __free ps3: Correct some typos in comments powerpc/kexec: Fix return of uninitialized variable macintosh: Use common error handling code in via_pmu_led_init() powerpc/powermac: Use of_property_match_string() in pmac_has_backlight_type() powerpc: remove dead config options for MPC85xx platform support powerpc/xive: Use cpumask_intersects() selftests/powerpc: Remove the path after initialization. powerpc/xmon: symbol lookup length fixed powerpc/ep8248e: Use %pa to format resource_size_t powerpc/ps3: Reorganize kerneldoc parameter names KVM: PPC: Book3S HV: Fix kmv -> kvm typo powerpc/sstep: make emulate_vsx_load and emulate_vsx_store static ...
| * powerpc/Makefile: Allow overriding CPPArnd Bergmann2024-11-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike all other arches, powerpc doesn't allow the user to override CPP, because it sets it unconditionally in the arch Makefile. This can lead to strange build failures. Instead add the required flags to KBUILD_CPPFLAGS, which are passed to CPP, CC and AS invocations by the generic Makefile logic. Reported-by: Arnd Bergmann <arnd@arndb.de> Tested-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20240607061629.530301-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> [mpe: Rebase, write change log, add Arnd's SoB as communicated privately] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241107112646.32401-1-mpe@ellerman.id.au
| * powerpc/vdso: Remove unused clockmode asm offsetsThomas Weißschuh2024-11-141-2/+0
| | | | | | | | | | | | | | | | | | | | | | These offsets are not used anymore, delete them. Fixes: c39b1dcf055d ("powerpc/vdso: Add a page for non-time data") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241113-vdso-powerpc-asm-offsets-v1-1-3f7e589f090d@linutronix.de
| * powerpc/irq: use seq_put_decimal_ull_width() for decimal valuesDavid Wang2024-11-101-22/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | On a system with n CPUs and m interrupts, there will be n*m decimal values yielded via seq_printf(.."%10u "..) which is less efficient than seq_put_decimal_ull_width(), stress reading /proc/interrupts indicates ~30% performance improvement with this patch. Signed-off-by: David Wang <00107082@163.com> [mpe: Flesh out change log based on original submission] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/all/20241103080552.4787-1-00107082@163.com Link: https://patch.msgid.link/20241108162327.9887-1-00107082@163.com
| * powerpc/pseries: Fix KVM guest detection for disabling hardlockup detectorGautam Menghani2024-11-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As per the kernel documentation[1], hardlockup detector should be disabled in KVM guests as it may give false positives. On PPC, hardlockup detector is enabled inside KVM guests because disable_hardlockup_detector() is marked as early_initcall and it relies on kvm_guest static key (is_kvm_guest()) which is initialized later during boot by check_kvm_guest(), which is a core_initcall. check_kvm_guest() is also called in pSeries_smp_probe(), which is called before initcalls, but it is skipped if KVM guest does not have doorbell support or if the guest is launched with SMT=1. Call check_kvm_guest() in disable_hardlockup_detector() so that is_kvm_guest() check goes through fine and hardlockup detector can be disabled inside the KVM guest. [1]: Documentation/admin-guide/sysctl/kernel.rst Fixes: 633c8e9800f3 ("powerpc/pseries: Enable hardlockup watchdog for PowerVM partitions") Cc: stable@vger.kernel.org # v5.14+ Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241108094839.33084-1-gautam@linux.ibm.com
| * fadump: reserve param area if below boot_mem_topSourabh Jain2024-11-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The param area is a memory region where the kernel places additional command-line arguments for fadump kernel. Currently, the param memory area is reserved in fadump kernel if it is above boot_mem_top. However, it should be reserved if it is below boot_mem_top because the fadump kernel already reserves memory from boot_mem_top to the end of DRAM. Currently, there is no impact from not reserving param memory if it is below boot_mem_top, as it is not used after the early boot phase of the fadump kernel. However, if this changes in the future, it could lead to issues in the fadump kernel. Fixes: 3416c9daa6b1 ("powerpc/fadump: pass additional parameters when fadump is active") Acked-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241107055817.489795-2-sourabhjain@linux.ibm.com
| * powerpc/fadump: allocate memory for additional parameters earlyHari Bathini2024-11-102-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memory for passing additional parameters to fadump capture kernel is allocated during subsys_initcall level, using memblock. But as slab is already available by this time, allocation happens via the buddy allocator. This may work for radix MMU but is likely to fail in most cases for hash MMU as hash MMU needs this memory in the first memory block for it to be accessible in real mode in the capture kernel (second boot). So, allocate memory for additional parameters area as soon as MMU mode is obvious. Fixes: 683eab94da75 ("powerpc/fadump: setup additional parameters for dump capture kernel") Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Closes: https://lore.kernel.org/lkml/a70e4064-a040-447b-8556-1fd02f19383d@linux.vnet.ibm.com/T/#u Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241107055817.489795-1-sourabhjain@linux.ibm.com
| * powerpc: Use str_enabled_disabled() helper functionThorsten Blum2024-11-051-2/+3
| | | | | | | | | | | | | | | | | | | | | | Remove hard-coded strings by using the str_enabled_disabled() helper function. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241027222219.1173-2-thorsten.blum@linux.dev
| * powerpc/vdso: Drop -mstack-protector-guard flags in 32-bit files with clangNathan Chancellor2024-11-041-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under certain conditions, the 64-bit '-mstack-protector-guard' flags may end up in the 32-bit vDSO flags, resulting in build failures due to the structure of clang's argument parsing of the stack protector options, which validates the arguments of the stack protector guard flags unconditionally in the frontend, choking on the 64-bit values when targeting 32-bit: clang: error: invalid value 'r13' in 'mstack-protector-guard-reg=', expected one of: r2 clang: error: invalid value 'r13' in 'mstack-protector-guard-reg=', expected one of: r2 make[3]: *** [arch/powerpc/kernel/vdso/Makefile:85: arch/powerpc/kernel/vdso/vgettimeofday-32.o] Error 1 make[3]: *** [arch/powerpc/kernel/vdso/Makefile:87: arch/powerpc/kernel/vdso/vgetrandom-32.o] Error 1 Remove these flags by adding them to the CC32FLAGSREMOVE variable, which already handles situations similar to this. Additionally, reformat and align a comment better for the expanding CONFIG_CC_IS_CLANG block. Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030-powerpc-vdso-drop-stackp-flags-clang-v1-1-d95e7376d29c@kernel.org
| * powerpc/ftrace: Add support for DYNAMIC_FTRACE_WITH_DIRECT_CALLSNaveen N Rao2024-10-313-29/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for DYNAMIC_FTRACE_WITH_DIRECT_CALLS similar to the arm64 implementation. ftrace direct calls allow custom trampolines to be called into directly from function ftrace call sites, bypassing the ftrace trampoline completely. This functionality is currently utilized by BPF trampolines to hook into kernel function entries. Since we have limited relative branch range, we support ftrace direct calls through support for DYNAMIC_FTRACE_WITH_CALL_OPS. In this approach, ftrace trampoline is not entirely bypassed. Rather, it is re-purposed into a stub that reads direct_call field from the associated ftrace_ops structure and branches into that, if it is not NULL. For this, it is sufficient if we can ensure that the ftrace trampoline is reachable from all traceable functions. When multiple ftrace_ops are associated with a call site, we utilize a call back to set pt_regs->orig_gpr3 that can then be tested on the return path from the ftrace trampoline to branch into the direct caller. Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-16-hbathini@linux.ibm.com
| * powerpc/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPSNaveen N Rao2024-10-313-9/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement support for DYNAMIC_FTRACE_WITH_CALL_OPS similar to the arm64 implementation. This works by patching-in a pointer to an associated ftrace_ops structure before each traceable function. If multiple ftrace_ops are associated with a call site, then a special ftrace_list_ops is used to enable iterating over all the registered ftrace_ops. If no ftrace_ops are associated with a call site, then a special ftrace_nop_ops structure is used to render the ftrace call as a no-op. ftrace trampoline can then read the associated ftrace_ops for a call site by loading from an offset from the LR, and branch directly to the associated function. The primary advantage with this approach is that we don't have to iterate over all the registered ftrace_ops for call sites that have a single ftrace_ops registered. This is the equivalent of implementing support for dynamic ftrace trampolines, which set up a special ftrace trampoline for each registered ftrace_ops and have individual call sites branch into those directly. A secondary advantage is that this gives us a way to add support for direct ftrace callers without having to resort to using stubs. The address of the direct call trampoline can be loaded from the ftrace_ops structure. To support this, we reserve a nop before each function on 32-bit powerpc. For 64-bit powerpc, two nops are reserved before each out-of-line stub. During ftrace activation, we update this location with the associated ftrace_ops pointer. Then, on ftrace entry, we load from this location and call into ftrace_ops->func(). For 64-bit powerpc, we ensure that the out-of-line stub area is doubleword aligned so that ftrace_ops address can be updated atomically. Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-15-hbathini@linux.ibm.com
| * powerpc64/ftrace: Support .text larger than 32MB with out-of-line stubsNaveen N Rao2024-10-312-4/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are restricted to a .text size of ~32MB when using out-of-line function profile sequence. Allow this to be extended up to the previous limit of ~64MB by reserving space in the middle of .text. A new config option CONFIG_PPC_FTRACE_OUT_OF_LINE_NUM_RESERVE is introduced to specify the number of function stubs that are reserved in .text. On boot, ftrace utilizes stubs from this area first before using the stub area at the end of .text. A ppc64le defconfig has ~44k functions that can be traced. A more conservative value of 32k functions is chosen as the default value of PPC_FTRACE_OUT_OF_LINE_NUM_RESERVE so that we do not allot more space than necessary by default. If building a kernel that only has 32k trace-able functions, we won't allot any more space at the end of .text during the pass on vmlinux.o. Otherwise, only the remaining functions get space for stubs at the end of .text. This default value should help cover a .text size of ~48MB in total (including space reserved at the end of .text which can cover up to 32MB), which should be sufficient for most common builds. For a very small kernel build, this can be set to 0. Or, this can be bumped up to a larger value to support vmlinux .text size up to ~64MB. Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-14-hbathini@linux.ibm.com
| * powerpc64/ftrace: Move ftrace sequence out of lineNaveen N Rao2024-10-314-37/+303
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Function profile sequence on powerpc includes two instructions at the beginning of each function: mflr r0 bl ftrace_caller The call to ftrace_caller() gets nop'ed out during kernel boot and is patched in when ftrace is enabled. Given the sequence, we cannot return from ftrace_caller with 'blr' as we need to keep LR and r0 intact. This results in link stack (return address predictor) imbalance when ftrace is enabled. To address that, we would like to use a three instruction sequence: mflr r0 bl ftrace_caller mtlr r0 Further more, to support DYNAMIC_FTRACE_WITH_CALL_OPS, we need to reserve two instruction slots before the function. This results in a total of five instruction slots to be reserved for ftrace use on each function that is traced. Move the function profile sequence out-of-line to minimize its impact. To do this, we reserve a single nop at function entry using -fpatchable-function-entry=1 and add a pass on vmlinux.o to determine the total number of functions that can be traced. This is then used to generate a .S file reserving the appropriate amount of space for use as ftrace stubs, which is built and linked into vmlinux. On bootup, the stub space is split into separate stubs per function and populated with the proper instruction sequence. A pointer to the associated stub is maintained in dyn_arch_ftrace. For modules, space for ftrace stubs is reserved from the generic module stub space. This is restricted to and enabled by default only on 64-bit powerpc, though there are some changes to accommodate 32-bit powerpc. This is done so that 32-bit powerpc could choose to opt into this based on further tests and benchmarks. As an example, after this patch, kernel functions will have a single nop at function entry: <kernel_clone>: addis r2,r12,467 addi r2,r2,-16028 nop mfocrf r11,8 ... When ftrace is enabled, the nop is converted to an unconditional branch to the stub associated with that function: <kernel_clone>: addis r2,r12,467 addi r2,r2,-16028 b ftrace_ool_stub_text_end+0x11b28 mfocrf r11,8 ... The associated stub: <ftrace_ool_stub_text_end+0x11b28>: mflr r0 bl ftrace_caller mtlr r0 b kernel_clone+0xc ... This change showed an improvement of ~10% in null_syscall benchmark on a Power 10 system with ftrace enabled. Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-13-hbathini@linux.ibm.com
| * powerpc/ftrace: Move ftrace stub used for init text before _einittextNaveen N Rao2024-10-311-2/+1
| | | | | | | | | | | | | | | | | | | | | | Move the ftrace stub used to cover inittext before _einittext so that it is within kernel text, as seen through core_kernel_text(). This is required for a subsequent change to ftrace. Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-9-hbathini@linux.ibm.com
| * powerpc/ftrace: Skip instruction patching if the instructions are the sameNaveen N Rao2024-10-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | To simplify upcoming changes to ftrace, add a check to skip actual instruction patching if the old and new instructions are the same. We still validate that the instruction is what we expect, but don't actually patch the same instruction again. Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-8-hbathini@linux.ibm.com
| * powerpc/ftrace: Remove pointer to struct module from dyn_arch_ftraceNaveen N Rao2024-10-312-62/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Pointer to struct module is only relevant for ftrace records belonging to kernel modules. Having this field in dyn_arch_ftrace wastes memory for all ftrace records belonging to the kernel. Remove the same in favour of looking up the module from the ftrace record address, similar to other architectures. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-7-hbathini@linux.ibm.com
| * powerpc/module_64: Convert #ifdef to IS_ENABLED()Naveen N Rao2024-10-311-8/+2
| | | | | | | | | | | | | | | | | | | | Minor refactor for converting #ifdef to IS_ENABLED(). Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-6-hbathini@linux.ibm.com
| * powerpc32/ftrace: Unify 32-bit and 64-bit ftrace entry codeNaveen N Rao2024-10-312-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 32-bit powerpc, gcc generates a three instruction sequence for function profiling: mflr r0 stw r0, 4(r1) bl _mcount On kernel boot, the call to _mcount() is nop-ed out, to be patched back in when ftrace is actually enabled. The 'stw' instruction therefore is not necessary unless ftrace is enabled. Nop it out during ftrace init. When ftrace is enabled, we want the 'stw' so that stack unwinding works properly. Perform the same within the ftrace handler, similar to 64-bit powerpc. Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-5-hbathini@linux.ibm.com
| * powerpc64/ftrace: Nop out additional 'std' instruction emitted by gcc v5.xNaveen N Rao2024-10-311-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Gcc v5.x emits a 3-instruction sequence for -mprofile-kernel: mflr r0 std r0, 16(r1) bl _mcount Gcc v6.x moved to a simpler 2-instruction sequence by removing the 'std' instruction. The store saved the return address in the LR save area in the caller stack frame for stack unwinding. However, with dynamic ftrace, we no longer have a call to _mcount on kernel boot when ftrace is not enabled. When ftrace is enabled, that store is performed within ftrace_caller(). As such, the additional 'std' instruction is redundant. Nop it out on kernel boot. With this change, we now use the same 2-instruction profiling sequence with both -mprofile-kernel, as well as -fpatchable-function-entry on 64-bit powerpc. Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-4-hbathini@linux.ibm.com
| * powerpc/kprobes: Use ftrace to determine if a probe is at function entryNaveen N Rao2024-10-311-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than hard-coding the offset into a function to be used to determine if a kprobe is at function entry, use ftrace_location() to determine the ftrace location within the function and categorize all instructions till that offset to be function entry. For functions that cannot be traced, we fall back to using a fixed offset of 8 (two instructions) to categorize a probe as being at function entry for 64-bit elfv2, unless we are using pcrel. Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-3-hbathini@linux.ibm.com
| * powerpc/trace: Account for -fpatchable-function-entry support by toolchainNaveen N Rao2024-10-311-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far, we have relied on the fact that gcc supports both -mprofile-kernel, as well as -fpatchable-function-entry, and clang supports neither. Our Makefile only checks for CONFIG_MPROFILE_KERNEL to decide which files to build. Clang has a feature request out [*] to implement -fpatchable-function-entry, and is unlikely to support -mprofile-kernel. Update our Makefile checks so that we pick up the correct files to build once clang picks up support for -fpatchable-function-entry. [*] https://github.com/llvm/llvm-project/issues/57031 Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-2-hbathini@linux.ibm.com
| * powerpc/64: Remove maple platformMichael Ellerman2024-10-294-116/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The maple platform was added in 2004 [1], to support the "Maple" 970FX evaluation board. It was later used for IBM JS20/JS21 machines, as well as the Bimini machine, aka "Yellow Dog Powerstation". Sadly all those machines have passed into memory, and there's been no evidence for years that anyone is still using any of them. Remove the platform and related code. It can always be reinstated if there's interest. Note that this has no impact on support for 970FX based Power Macs. [1]: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/commit/?id=f0d068d65c5e555ffcfbc189de32598f6f00770c Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241013102957.548291-1-mpe@ellerman.id.au
| * powerpc/machdep: Drop include of dma-mapping.hMichael Ellerman2024-10-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Drop the include of dma-mapping.h in machdep.h, replace it with forward declarations of struct device and struct pci_dev, and include time64.h and page.h which are required for time64_t and pgprot_t respectively. Add direct includes of some other headers to some files that were getting them via machdep.h. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009051826.132805-2-mpe@ellerman.id.au
| * powerpc/64: Drop IPI_PRIORITY from asm-offsetsMichael Ellerman2024-10-291-1/+0
| | | | | | | | | | | | | | | | | | The last use of IPI_PRIORITY in asm was removed in commit 37f55d30df2e ("KVM: PPC: Book3S HV: Convert kvmppc_read_intr to a C function"). Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009051701.132282-1-mpe@ellerman.id.au
| * powerpc/fadump: Move fadump_cma_init to setup_arch() after initmem_init()Ritesh Harjani (IBM)2024-10-212-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During early init CMA_MIN_ALIGNMENT_BYTES can be PAGE_SIZE, since pageblock_order is still zero and it gets initialized later during initmem_init() e.g. setup_arch() -> initmem_init() -> sparse_init() -> set_pageblock_order() One such use case where this causes issue is - early_setup() -> early_init_devtree() -> fadump_reserve_mem() -> fadump_cma_init() This causes CMA memory alignment check to be bypassed in cma_init_reserved_mem(). Then later cma_activate_area() can hit a VM_BUG_ON_PAGE(pfn & ((1 << order) - 1)) if the reserved memory area was not pageblock_order aligned. Fix it by moving the fadump_cma_init() after initmem_init(), where other such cma reservations also gets called. <stack trace> ============== page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10010 flags: 0x13ffff800000000(node=1|zone=0|lastcpupid=0x7ffff) CMA raw: 013ffff800000000 5deadbeef0000100 5deadbeef0000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(pfn & ((1 << order) - 1)) ------------[ cut here ]------------ kernel BUG at mm/page_alloc.c:778! Call Trace: __free_one_page+0x57c/0x7b0 (unreliable) free_pcppages_bulk+0x1a8/0x2c8 free_unref_page_commit+0x3d4/0x4e4 free_unref_page+0x458/0x6d0 init_cma_reserved_pageblock+0x114/0x198 cma_init_reserved_areas+0x270/0x3e0 do_one_initcall+0x80/0x2f8 kernel_init_freeable+0x33c/0x530 kernel_init+0x34/0x26c ret_from_kernel_user_thread+0x14/0x1c Fixes: 11ac3e87ce09 ("mm: cma: use pageblock_order as the single alignment") Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: Sachin P Bappalige <sachinpb@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/3ae208e48c0d9cefe53d2dc4f593388067405b7d.1729146153.git.ritesh.list@gmail.com
| * powerpc/fadump: Reserve page-aligned boot_memory_size during fadump_reserve_memRitesh Harjani (IBM)2024-10-211-12/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch refactors all CMA related initialization and alignment code to within fadump_cma_init() which gets called in the end. This also means that we keep [reserve_dump_area_start, boot_memory_size] page aligned during fadump_reserve_mem(). Then later in fadump_cma_init() we extract the aligned chunk and provide it to CMA. This inherently also fixes an issue in the current code where the reserve_dump_area_start is not aligned when the physical memory can have holes and the suitable chunk starts at an unaligned boundary. After this we should be able to call fadump_cma_init() independently later in setup_arch() where pageblock_order is non-zero. Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/805d6b900968fb9402ad8f4e4775597db42085c4.1729146153.git.ritesh.list@gmail.com
| * powerpc/fadump: Refactor and prepare fadump_cma_init for late initRitesh Harjani (IBM)2024-10-211-14/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We anyway don't use any return values from fadump_cma_init(). Since fadump_reserve_mem() from where fadump_cma_init() gets called today, already has the required checks. This patch makes this function return type as void. Let's also handle extra cases like return if fadump_supported is false or dump_active, so that in later patches we can call fadump_cma_init() separately from setup_arch(). Acked-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/a2afc3d6481a87a305e89cfc4a3f3d2a0b8ceab3.1729146153.git.ritesh.list@gmail.com
| * powerpc/vdso: Implement __arch_get_vdso_rng_data()Christophe Leroy2024-10-163-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | VDSO time functions do not call any other function, so they don't need to save/restore LR. However, retrieving the address of VDSO data page requires using LR hence saving then restoring it, which can be heavy on some CPUs. On the other hand, VDSO functions on powerpc are not standard functions and require a wrapper function to call C VDSO functions. And that wrapper has to save and restore LR in order to call the C VDSO function, so retrieving VDSO data page address in that wrapper doesn't require additional save/restore of LR. For random VDSO functions it is a bit different. Because the function calls __arch_chacha20_blocks_nostack(), it saves and restores LR. Retrieving VDSO data page address can then be done there without additional save/restore of LR. So lets implement __arch_get_vdso_rng_data() and simplify the wrapper. It starts paving the way for the day powerpc will implement a more standard ABI for VDSO functions. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/a1a9bd0df508f1b5c04684b7366940577dfc6262.1727858295.git.christophe.leroy@csgroup.eu
| * powerpc/vdso: Add a page for non-time dataChristophe Leroy2024-10-167-16/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The page containing VDSO time data is swapped with the one containing TIME namespace data when a process uses a non-root time namespace. For other data like powerpc specific data and RNG data, it means tracking whether time namespace is the root one or not to know which page to use. Simplify the logic behind by moving time data out of first data page so that the first data page which contains everything else always remains the first page. Time data is in the second or third page depending on selected time namespace. While we are playing with get_datapage macro, directly take into account the data offset inside the macro instead of adding that offset afterwards. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/0557d3ec898c1d0ea2fc59fa8757618e524c5d94.1727858295.git.christophe.leroy@csgroup.eu
* | Merge tag 'mm-stable-2024-11-18-19-27' of ↵Linus Torvalds2024-11-2315-15/+15
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...
| * | asm-generic: introduce text-patching.hMike Rapoport (Microsoft)2024-11-0715-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several architectures support text patching, but they name the header files that declare patching functions differently. Make all such headers consistently named text-patching.h and add an empty header in asm-generic for architectures that do not support text patching. Link: https://lkml.kernel.org/r/20241023162711.2579610-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Tested-by: kdevops <kdevops@lists.linux.dev> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | Merge tag 'devicetree-for-6.13' of ↵Linus Torvalds2024-11-202-2/+2
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree updates from Rob Herring: "Bindings: - Enable dtc "interrupt_provider" warnings for binding examples. Fix the warnings in fsl,mu-msi and ti,sci-inta due to this. - Convert zii,rave-sp-wdt, zii,rave-sp-pwrbutton, and altr,fpga-passive-serial to DT schema format - Add some documentation on the different forms of YAML text blocks which are a constant source of review comments - Fix some schema errors in constraints for arrays - Add compatibles for qcom,sar2130p-pdc and onnn,adt7462 DT core: - Allow overlay kunit tests to run CONFIG_OF_OVERLAY=n - Add some warnings on deprecated address handling - Rework early_init_dt_scan() so the arch can pass in the phys address of the DTB as __pa() is not always valid to use. This fixes a warning for arm64 with kexec. - Add and use some new DT graph iterators for iterating over ports and endpoints - Rework reserved-memory handling to be sized dynamically for fixed regions - Optimize of_modalias() to avoid a strlen() call - Constify struct device_node and property pointers where ever possible" * tag 'devicetree-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (36 commits) of: Allow overlay kunit tests to run CONFIG_OF_OVERLAY=n dt-bindings: interrupt-controller: qcom,pdc: Add SAR2130P compatible of/address: Rework bus matching to avoid warnings of: WARN on deprecated #address-cells/#size-cells handling of/fdt: Don't use default address cell sizes for address translation dt-bindings: Enable dtc "interrupt_provider" warnings of/fdt: add dt_phys arg to early_init_dt_scan and early_init_dt_verify dt-bindings: cache: qcom,llcc: Fix X1E80100 reg entries dt-bindings: watchdog: convert zii,rave-sp-wdt.txt to yaml format dt-bindings: input: convert zii,rave-sp-pwrbutton.txt to yaml media: xilinx-tpg: use new of_graph functions fbdev: omapfb: use new of_graph functions gpu: drm: omapdrm: use new of_graph functions ASoC: audio-graph-card2: use new of_graph functions ASoC: audio-graph-card: use new of_graph functions ASoC: test-component: use new of_graph functions of: property: use new of_graph functions of: property: add of_graph_get_next_port_endpoint() of: property: add of_graph_get_next_port() of: module: remove strlen() call in of_modalias() ...
| * | | of/fdt: add dt_phys arg to early_init_dt_scan and early_init_dt_verifyUsama Arif2024-10-292-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __pa() is only intended to be used for linear map addresses and using it for initial_boot_params which is in fixmap for arm64 will give an incorrect value. Hence save the physical address when it is known at boot time when calling early_init_dt_scan for arm64 and use it at kexec time instead of converting the virtual address using __pa(). Note that arm64 doesn't need the FDT region reserved in the DT as the kernel explicitly reserves the passed in FDT. Therefore, only a debug warning is fixed with this change. Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Fixes: ac10be5cdbfa ("arm64: Use common of_kexec_alloc_and_setup_fdt()") Link: https://lore.kernel.org/r/20241023171426.452688-1-usamaarif642@gmail.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
* | | | Merge tag 'ftrace-v6.13' of ↵Linus Torvalds2024-11-202-3/+3
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace updates from Steven Rostedt: - Restructure the function graph shadow stack to prepare it for use with kretprobes With the goal of merging the shadow stack logic of function graph and kretprobes, some more restructuring of the function shadow stack is required. Move out function graph specific fields from the fgraph infrastructure and store it on the new stack variables that can pass data from the entry callback to the exit callback. Hopefully, with this change, the merge of kretprobes to use fgraph shadow stacks will be ready by the next merge window. - Make shadow stack 4k instead of using PAGE_SIZE. Some architectures have very large PAGE_SIZE values which make its use for shadow stacks waste a lot of memory. - Give shadow stacks its own kmem cache. When function graph is started, every task on the system gets a shadow stack. In the future, shadow stacks may not be 4K in size. Have it have its own kmem cache so that whatever size it becomes will still be efficient in allocations. - Initialize profiler graph ops as it will be needed for new updates to fgraph - Convert to use guard(mutex) for several ftrace and fgraph functions - Add more comments and documentation - Show function return address in function graph tracer Add an option to show the caller of a function at each entry of the function graph tracer, similar to what the function tracer does. - Abstract out ftrace_regs from being used directly like pt_regs ftrace_regs was created to store a partial pt_regs. It holds only the registers and stack information to get to the function arguments and return values. On several archs, it is simply a wrapper around pt_regs. But some users would access ftrace_regs directly to get the pt_regs which will not work on all archs. Make ftrace_regs an abstract structure that requires all access to its fields be through accessor functions. - Show how long it takes to do function code modifications When code modification for function hooks happen, it always had the time recorded in how long it took to do the conversion. But this value was never exported. Recently the code was touched due to new ROX modification handling that caused a large slow down in doing the modifications and had a significant impact on boot times. Expose the timings in the dyn_ftrace_total_info file. This file was created a while ago to show information about memory usage and such to implement dynamic function tracing. It's also an appropriate file to store the timings of this modification as well. This will make it easier to see the impact of changes to code modification on boot up timings. - Other clean ups and small fixes * tag 'ftrace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits) ftrace: Show timings of how long nop patching took ftrace: Use guard to take ftrace_lock in ftrace_graph_set_hash() ftrace: Use guard to take the ftrace_lock in release_probe() ftrace: Use guard to lock ftrace_lock in cache_mod() ftrace: Use guard for match_records() fgraph: Use guard(mutex)(&ftrace_lock) for unregister_ftrace_graph() fgraph: Give ret_stack its own kmem cache fgraph: Separate size of ret_stack from PAGE_SIZE ftrace: Rename ftrace_regs_return_value to ftrace_regs_get_return_value selftests/ftrace: Fix check of return value in fgraph-retval.tc test ftrace: Use arch_ftrace_regs() for ftrace_regs_*() macros ftrace: Consolidate ftrace_regs accessor functions for archs using pt_regs ftrace: Make ftrace_regs abstract from direct use fgragh: No need to invoke the function call_filter_check_discard() fgraph: Simplify return address printing in function graph tracer function_graph: Remove unnecessary initialization in ftrace_graph_ret_addr() function_graph: Support recording and printing the function return address ftrace: Have calltime be saved in the fgraph storage ftrace: Use a running sleeptime instead of saving on shadow stack fgraph: Use fgraph data to store subtime for profiler ...
| * | | | ftrace: Make ftrace_regs abstract from direct useSteven Rostedt2024-10-102-3/+3
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ftrace_regs was created to hold registers that store information to save function parameters, return value and stack. Since it is a subset of pt_regs, it should only be used by its accessor functions. But because pt_regs can easily be taken from ftrace_regs (on most archs), it is tempting to use it directly. But when running on other architectures, it may fail to build or worse, build but crash the kernel! Instead, make struct ftrace_regs an empty structure and have the architectures define __arch_ftrace_regs and all the accessor functions will typecast to it to get to the actual fields. This will help avoid usage of ftrace_regs directly. Link: https://lore.kernel.org/all/20241007171027.629bdafd@gandalf.local.home/ Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org> Cc: "x86@kernel.org" <x86@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Naveen N Rao <naveen@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/20241008230628.958778821@goodmis.org Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
* | | | Merge tag 'timers-core-2024-11-18' of ↵Linus Torvalds2024-11-191-14/+7
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A rather large update for timekeeping and timers: - The final step to get rid of auto-rearming posix-timers posix-timers are currently auto-rearmed by the kernel when the signal of the timer is ignored so that the timer signal can be delivered once the corresponding signal is unignored. This requires to throttle the timer to prevent a DoS by small intervals and keeps the system pointlessly out of low power states for no value. This is a long standing non-trivial problem due to the lock order of posix-timer lock and the sighand lock along with life time issues as the timer and the sigqueue have different life time rules. Cure this by: - Embedding the sigqueue into the timer struct to have the same life time rules. Aside of that this also avoids the lookup of the timer in the signal delivery and rearm path as it's just a always valid container_of() now. - Queuing ignored timer signals onto a seperate ignored list. - Moving queued timer signals onto the ignored list when the signal is switched to SIG_IGN before it could be delivered. - Walking the ignored list when SIG_IGN is lifted and requeue the signals to the actual signal lists. This allows the signal delivery code to rearm the timer. This also required to consolidate the signal delivery rules so they are consistent across all situations. With that all self test scenarios finally succeed. - Core infrastructure for VFS multigrain timestamping This is required to allow the kernel to use coarse grained time stamps by default and switch to fine grained time stamps when inode attributes are actively observed via getattr(). These changes have been provided to the VFS tree as well, so that the VFS specific infrastructure could be built on top. - Cleanup and consolidation of the sleep() infrastructure - Move all sleep and timeout functions into one file - Rework udelay() and ndelay() into proper documented inline functions and replace the hardcoded magic numbers by proper defines. - Rework the fsleep() implementation to take the reality of the timer wheel granularity on different HZ values into account. Right now the boundaries are hard coded time ranges which fail to provide the requested accuracy on different HZ settings. - Update documentation for all sleep/timeout related functions and fix up stale documentation links all over the place - Fixup a few usage sites - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks A system can have multiple PTP clocks which are participating in seperate and independent PTP clock domains. So far the kernel only considers the PTP clock which is based on CLOCK TAI relevant as that's the clock which drives the timekeeping adjustments via the various user space daemons through adjtimex(2). The non TAI based clock domains are accessible via the file descriptor based posix clocks, but their usability is very limited. They can't be accessed fast as they always go all the way out to the hardware and they cannot be utilized in the kernel itself. As Time Sensitive Networking (TSN) gains traction it is required to provide fast user and kernel space access to these clocks. The approach taken is to utilize the timekeeping and adjtimex(2) infrastructure to provide this access in a similar way how the kernel provides access to clock MONOTONIC, REALTIME etc. Instead of creating a duplicated infrastructure this rework converts timekeeping and adjtimex(2) into generic functionality which operates on pointers to data structures instead of using static variables. This allows to provide time accessors and adjtimex(2) functionality for the independent PTP clocks in a subsequent step. - Consolidate hrtimer initialization hrtimers are set up by initializing the data structure and then seperately setting the callback function for historical reasons. That's an extra unnecessary step and makes Rust support less straight forward than it should be. Provide a new set of hrtimer_setup*() functions and convert the core code and a few usage sites of the less frequently used interfaces over. The bulk of the htimer_init() to hrtimer_setup() conversion is already prepared and scheduled for the next merge window. - Drivers: - Ensure that the global timekeeping clocksource is utilizing the cluster 0 timer on MIPS multi-cluster systems. Otherwise CPUs on different clusters use their cluster specific clocksource which is not guaranteed to be synchronized with other clusters. - Mostly boring cleanups, fixes, improvements and code movement" * tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits) posix-timers: Fix spurious warning on double enqueue versus do_exit() clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties clocksource/drivers/gpx: Remove redundant casts clocksource/drivers/timer-ti-dm: Fix child node refcount handling dt-bindings: timer: actions,owl-timer: convert to YAML clocksource/drivers/ralink: Add Ralink System Tick Counter driver clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource clocksource/drivers/timer-ti-dm: Don't fail probe if int not found clocksource/drivers:sp804: Make user selectable clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions hrtimers: Delete hrtimer_init_on_stack() alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack() io_uring: Switch to use hrtimer_setup_on_stack() sched/idle: Switch to use hrtimer_setup_on_stack() hrtimers: Delete hrtimer_init_sleeper_on_stack() wait: Switch to use hrtimer_setup_sleeper_on_stack() timers: Switch to use hrtimer_setup_sleeper_on_stack() net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack() futex: Switch to use hrtimer_setup_sleeper_on_stack() fs/aio: Switch to use hrtimer_setup_sleeper_on_stack() ...
| * | | | powerpc/rtas: Use fsleep() to minimize additional sleep durationAnna-Maria Behnsen2024-10-161-14/+7
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When commit 38f7b7067dae ("powerpc/rtas: rtas_busy_delay() improvements") was introduced, documentation about proper usage of sleep related functions was outdated. The commit message references the usage of a HZ=100 system. When using a 20ms sleep duration on such a system and therefore using msleep(), the possible additional slack will be +10ms. When the system is configured with HZ=100 the granularity of a jiffy and of a bucket of the lowest timer wheel level is 10ms. To make sure a timer will not expire early (when queueing of the timer races with an concurrent update of jiffies), timers are always queued into the next bucket. This is the reason for the maximal possible slack of 10ms. fsleep() limits the maximal possible slack to 25% by making threshold between usleep_range() and msleep() HZ dependent. As soon as the accuracy of msleep() is sufficient, the less expensive timer list timer based sleeping function is used instead of the more expensive hrtimer based usleep_range() function. The udelay() will not be used in this specific usecase as the lowest sleep length is larger than 1 millisecond. Use fsleep() directly instead of using an own heuristic for the best sleeping mechanism to use. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-13-dc8b907cb62f@linutronix.de
* | | | Merge tag 'timers-vdso-2024-11-18' of ↵Linus Torvalds2024-11-195-33/+45
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull vdso data page handling updates from Thomas Gleixner: "First steps of consolidating the VDSO data page handling. The VDSO data page handling is architecture specific for historical reasons, but there is no real technical reason to do so. Aside of that VDSO data has become a dump ground for various mechanisms and fail to provide a clear separation of the functionalities. Clean this up by: - consolidating the VDSO page data by getting rid of architecture specific warts especially in x86 and PowerPC. - removing the last includes of header files which are pulling in other headers outside of the VDSO namespace. - seperating timekeeping and other VDSO data accordingly. Further consolidation of the VDSO page handling is done in subsequent changes scheduled for the next merge window. This also lays the ground for expanding the VDSO time getters for independent PTP clocks in a generic way without making every architecture add support seperately" * tag 'timers-vdso-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits) x86/vdso: Add missing brackets in switch case vdso: Rename struct arch_vdso_data to arch_vdso_time_data powerpc: Split systemcfg struct definitions out from vdso powerpc: Split systemcfg data out of vdso data page powerpc: Add kconfig option for the systemcfg page powerpc/pseries/lparcfg: Use num_possible_cpus() for potential processors powerpc/pseries/lparcfg: Fix printing of system_active_processors powerpc/procfs: Propagate error of remap_pfn_range() powerpc/vdso: Remove offset comment from 32bit vdso_arch_data x86/vdso: Split virtual clock pages into dedicated mapping x86/vdso: Delete vvar.h x86/vdso: Access vdso data without vvar.h x86/vdso: Move the rng offset to vsyscall.h x86/vdso: Access rng vdso data without vvar.h x86/vdso: Access timens vdso data without vvar.h x86/vdso: Allocate vvar page from C code x86/vdso: Access rng data from kernel without vvar x86/vdso: Place vdso_data at beginning of vvar page x86/vdso: Use __arch_get_vdso_data() to access vdso data x86/mm/mmap: Remove arch_vma_name() ...
| * | | | powerpc: Split systemcfg struct definitions out from vdsoThomas Weißschuh2024-11-024-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The systemcfg data has nothing to do anymore with the vdso. Split it into a dedicated header file. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-27-b64f0842d512@linutronix.de
| * | | | powerpc: Split systemcfg data out of vdso data pageThomas Weißschuh2024-11-025-26/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The systemcfg data only has minimal overlap with the vdso data. Splitting the two avoids mapping the implementation-defined vdso data into /proc/ppc64/systemcfg. It is also a preparation for the standardization of vdso data storage. The only field actually used by both systemcfg and vdso is tb_ticks_per_sec and it is only changed once during time_init(). Initialize it in both structures there. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-26-b64f0842d512@linutronix.de
| * | | | powerpc: Add kconfig option for the systemcfg pageThomas Weißschuh2024-11-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The systemcfg page through procfs is only a backwards-compatible interface for very old applications. Make it possible to be disabled. This also creates a convenient config #define to guard any accesses to the systemcfg page. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-25-b64f0842d512@linutronix.de
| * | | | powerpc/procfs: Propagate error of remap_pfn_range()Thomas Weißschuh2024-11-021-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the operation fails and userspace is unaware it will access unmapped memory, crashing the process. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-22-b64f0842d512@linutronix.de
| * | | | powerpc/vdso: Remove timekeeper includesThomas Weißschuh2024-10-151-1/+0
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the generic VDSO clock mode storage is used, this header file is unused and can be removed. This avoids including a non-VDSO header while building the VDSO, which can lead to compilation errors. Also drop the comment which is out of date and in the wrong place. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241010-vdso-generic-arch_update_vsyscall-v1-4-7fe5a3ea4382@linutronix.de
* | | | Merge tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2024-11-181-0/+4
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xattr updates from Al Viro: "Sanitize xattr and io_uring interactions with it, add *xattrat() syscalls, sanitize struct filename handling in there" * tag 'pull-xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: xattr: remove redundant check on variable err fs/xattr: add *at family syscalls new helpers: file_removexattr(), filename_removexattr() new helpers: file_listxattr(), filename_listxattr() replace do_getxattr() with saner helpers. replace do_setxattr() with saner helpers. new helper: import_xattr_name() fs: rename struct xattr_ctx to kernel_xattr_ctx xattr: switch to CLASS(fd) io_[gs]etxattr_prep(): just use getname() io_uring: IORING_OP_F[GS]ETXATTR is fine with REQ_F_FIXED_FILE getname_maybe_null() - the third variant of pathname copy-in teach filename_lookup() to treat NULL filename as ""
| * | | fs/xattr: add *at family syscallsChristian Göttsche2024-11-061-0/+4
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the four syscalls setxattrat(), getxattrat(), listxattrat() and removexattrat(). Those can be used to operate on extended attributes, especially security related ones, either relative to a pinned directory or on a file descriptor without read access, avoiding a /proc/<pid>/fd/<fd> detour, requiring a mounted procfs. One use case will be setfiles(8) setting SELinux file contexts ("security.selinux") without race conditions and without a file descriptor opened with read access requiring SELinux read permission. Use the do_{name}at() pattern from fs/open.c. Pass the value of the extended attribute, its length, and for setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added struct xattr_args to not exceed six syscall arguments and not merging the AT_* and XATTR_* flags. [AV: fixes by Christian Brauner folded in, the entire thing rebased on top of {filename,file}_...xattr() primitives, treatment of empty pathnames regularized. As the result, AT_EMPTY_PATH+NULL handling is cheap, so f...(2) can use it] Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Link: https://lore.kernel.org/r/20240426162042.191916-1-cgoettsche@seltendoof.de Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <brauner@kernel.org> CC: x86@kernel.org CC: linux-alpha@vger.kernel.org CC: linux-kernel@vger.kernel.org CC: linux-arm-kernel@lists.infradead.org CC: linux-ia64@vger.kernel.org CC: linux-m68k@lists.linux-m68k.org CC: linux-mips@vger.kernel.org CC: linux-parisc@vger.kernel.org CC: linuxppc-dev@lists.ozlabs.org CC: linux-s390@vger.kernel.org CC: linux-sh@vger.kernel.org CC: sparclinux@vger.kernel.org CC: linux-fsdevel@vger.kernel.org CC: audit@vger.kernel.org CC: linux-arch@vger.kernel.org CC: linux-api@vger.kernel.org CC: linux-security-module@vger.kernel.org CC: selinux@vger.kernel.org [brauner: slight tweaks] Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | powerpc/8xx: Fix kernel DTLB miss on dcbzChristophe Leroy2024-10-111-0/+1
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Following OOPS is encountered while loading test_bpf module on powerpc 8xx: [ 218.835567] BUG: Unable to handle kernel data access on write at 0xcb000000 [ 218.842473] Faulting instruction address: 0xc0017a80 [ 218.847451] Oops: Kernel access of bad area, sig: 11 [#1] [ 218.852854] BE PAGE_SIZE=16K PREEMPT CMPC885 [ 218.857207] SAF3000 DIE NOTIFICATION [ 218.860713] Modules linked in: test_bpf(+) test_module [ 218.865867] CPU: 0 UID: 0 PID: 527 Comm: insmod Not tainted 6.11.0-s3k-dev-09856-g3de3d71ae2e6-dirty #1280 [ 218.875546] Hardware name: MIAE 8xx 0x500000 CMPC885 [ 218.880521] NIP: c0017a80 LR: beab859c CTR: 000101d4 [ 218.885584] REGS: cac2bc90 TRAP: 0300 Not tainted (6.11.0-s3k-dev-09856-g3de3d71ae2e6-dirty) [ 218.894308] MSR: 00009032 <EE,ME,IR,DR,RI> CR: 55005555 XER: a0007100 [ 218.901290] DAR: cb000000 DSISR: c2000000 [ 218.901290] GPR00: 000185d1 cac2bd50 c21b9580 caf7c030 c3883fcc 00000008 cafffffc 00000000 [ 218.901290] GPR08: 00040000 18300000 20000000 00000004 99005555 100d815e ca669d08 00000369 [ 218.901290] GPR16: ca730000 00000000 ca2c004c 00000000 00000000 0000035d 00000311 00000369 [ 218.901290] GPR24: ca732240 00000001 00030ba3 c3800000 00000000 00185d48 caf7c000 ca2c004c [ 218.941087] NIP [c0017a80] memcpy+0x88/0xec [ 218.945277] LR [beab859c] test_bpf_init+0x22c/0x3c90 [test_bpf] [ 218.951476] Call Trace: [ 218.953916] [cac2bd50] [beab8570] test_bpf_init+0x200/0x3c90 [test_bpf] (unreliable) [ 218.962034] [cac2bde0] [c0004c04] do_one_initcall+0x4c/0x1fc [ 218.967706] [cac2be40] [c00a2ec4] do_init_module+0x68/0x360 [ 218.973292] [cac2be60] [c00a5194] init_module_from_file+0x8c/0xc0 [ 218.979401] [cac2bed0] [c00a5568] sys_finit_module+0x250/0x3f0 [ 218.985248] [cac2bf20] [c000e390] system_call_exception+0x8c/0x15c [ 218.991444] [cac2bf30] [c00120a8] ret_from_syscall+0x0/0x28 This happens in the main loop of memcpy() ==> c0017a80: 7c 0b 37 ec dcbz r11,r6 c0017a84: 80 e4 00 04 lwz r7,4(r4) c0017a88: 81 04 00 08 lwz r8,8(r4) c0017a8c: 81 24 00 0c lwz r9,12(r4) c0017a90: 85 44 00 10 lwzu r10,16(r4) c0017a94: 90 e6 00 04 stw r7,4(r6) c0017a98: 91 06 00 08 stw r8,8(r6) c0017a9c: 91 26 00 0c stw r9,12(r6) c0017aa0: 95 46 00 10 stwu r10,16(r6) c0017aa4: 42 00 ff dc bdnz c0017a80 <memcpy+0x88> Commit ac9f97ff8b32 ("powerpc/8xx: Inconditionally use task PGDIR in DTLB misses") relies on re-reading DAR register to know if an error is due to a missing copy of a PMD entry in task's PGDIR, allthough DAR was already read in the exception prolog and copied into thread struct. This is because is it done very early in the exception and there are not enough registers available to keep a pointer to thread struct. However, dcbz instruction is buggy and doesn't update DAR register on fault. That is detected and generates a call to FixupDAR workaround which updates DAR copy in thread struct but doesn't fix DAR register. Let's fix DAR in addition to the update of DAR copy in thread struct. Fixes: ac9f97ff8b32 ("powerpc/8xx: Inconditionally use task PGDIR in DTLB misses") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/2b851399bd87e81c6ccb87ea3a7a6b32c7aa04d7.1728118396.git.christophe.leroy@csgroup.eu