summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'timers-core-2023-04-24' of ↵Linus Torvalds2023-04-251-6/+11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers and timekeeping updates from Thomas Gleixner: - Improve the VDSO build time checks to cover all dynamic relocations VDSO does not allow dynamic relocations, but the build time check is incomplete and fragile. It's based on architectures specifying the relocation types to search for and does not handle R_*_NONE relocation entries correctly. R_*_NONE relocations are injected by some GNU ld variants if they fail to determine the exact .rel[a]/dyn_size to cover trailing zeros. R_*_NONE relocations must be ignored by dynamic loaders, so they should be ignored in the build time check too. Remove the architecture specific relocation types to check for and validate strictly that no other relocations than R_*_NONE end up in the VSDO .so file. - Prefer signal delivery to the current thread for CLOCK_PROCESS_CPUTIME_ID based posix-timers Such timers prefer to deliver the signal to the main thread of a process even if the context in which the timer expires is the current task. This has the downside that it might wake up an idle thread. As there is no requirement or guarantee that the signal has to be delivered to the main thread, avoid this by preferring the current task if it is part of the thread group which shares sighand. This not only avoids waking idle threads, it also distributes the signal delivery in case of multiple timers firing in the context of different threads close to each other better. - Align the tick period properly (again) For a long time the tick was starting at CLOCK_MONOTONIC zero, which allowed users space applications to either align with the tick or to place a periodic computation so that it does not interfere with the tick. The alignement of the tick period was more by chance than by intention as the tick is set up before a high resolution clocksource is installed, i.e. timekeeping is still tick based and the tick period advances from there. The early enablement of sched_clock() broke this alignement as the time accumulated by sched_clock() is taken into account when timekeeping is initialized. So the base value now(CLOCK_MONOTONIC) is not longer a multiple of tick periods, which breaks applications which relied on that behaviour. Cure this by aligning the tick starting point to the next multiple of tick periods, i.e 1000ms/CONFIG_HZ. - A set of NOHZ fixes and enhancements: * Cure the concurrent writer race for idle and IO sleeptime statistics The statitic values which are exposed via /proc/stat are updated from the CPU local idle exit and remotely by cpufreq, but that happens without any form of serialization. As a consequence sleeptimes can be accounted twice or worse. Prevent this by restricting the accumulation writeback to the CPU local idle exit and let the remote access compute the accumulated value. * Protect idle/iowait sleep time with a sequence count Reading idle/iowait sleep time, e.g. from /proc/stat, can race with idle exit updates. As a consequence the readout may result in random and potentially going backwards values. Protect this by a sequence count, which fixes the idle time statistics issue, but cannot fix the iowait time problem because iowait time accounting races with remote wake ups decrementing the remote runqueues nr_iowait counter. The latter is impossible to fix, so the only way to deal with that is to document it properly and to remove the assertion in the selftest which triggers occasionally due to that. * Restructure struct tick_sched for better cache layout * Some small cleanups and a better cache layout for struct tick_sched - Implement the missing timer_wait_running() callback for POSIX CPU timers For unknown reason the introduction of the timer_wait_running() callback missed to fixup posix CPU timers, which went unnoticed for almost four years. While initially only targeted to prevent livelocks between a timer deletion and the timer expiry function on PREEMPT_RT enabled kernels, it turned out that fixing this for mainline is not as trivial as just implementing a stub similar to the hrtimer/timer callbacks. The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems there is a livelock issue independent of RT. CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers out from hard interrupt context to task work, which is handled before returning to user space or to a VM. The expiry mechanism moves the expired timers to a stack local list head with sighand lock held. Once sighand is dropped the task can be preempted and a task which wants to delete a timer will spin-wait until the expiry task is scheduled back in. In the worst case this will end up in a livelock when the preempting task and the expiry task are pinned on the same CPU. The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock which is held by the expiry code and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way. Add a per task mutex to struct posix_cputimers_work, let the expiry task hold it accross the expiry function and let the deleting task which waits for the expiry to complete block on the mutex. In the non-contended case this results in an extra mutex_lock()/unlock() pair on both sides. This avoids spin-waiting on a task which is scheduled out, prevents the livelock and cures the problem for RT and !RT systems * tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Implement the missing timer_wait_running callback selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity selftests/proc: Remove idle time monotonicity assertions MAINTAINERS: Remove stale email address timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick() timers/nohz: Add a comment about broken iowait counter update race timers/nohz: Protect idle/iowait sleep time under seqcount timers/nohz: Only ever update sleeptime from idle exit timers/nohz: Restructure and reshuffle struct tick_sched tick/common: Align tick period with the HZ tick. selftests/timers/posix_timers: Test delivery of signals across threads posix-timers: Prefer delivery of signals to the current thread vdso: Improve cmd_vdso_check to check all dynamic relocations
| * posix-cpu-timers: Implement the missing timer_wait_running callbackThomas Gleixner2023-04-211-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For some unknown reason the introduction of the timer_wait_running callback missed to fixup posix CPU timers, which went unnoticed for almost four years. Marco reported recently that the WARN_ON() in timer_wait_running() triggers with a posix CPU timer test case. Posix CPU timers have two execution models for expiring timers depending on CONFIG_POSIX_CPU_TIMERS_TASK_WORK: 1) If not enabled, the expiry happens in hard interrupt context so spin waiting on the remote CPU is reasonably time bound. Implement an empty stub function for that case. 2) If enabled, the expiry happens in task work before returning to user space or guest mode. The expired timers are marked as firing and moved from the timer queue to a local list head with sighand lock held. Once the timers are moved, sighand lock is dropped and the expiry happens in fully preemptible context. That means the expiring task can be scheduled out, migrated, interrupted etc. So spin waiting on it is more than suboptimal. The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock which is held by the expiry code and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way: - Add a mutex to struct posix_cputimers_work. This struct is per task and used to schedule the expiry task work from the timer interrupt. - Add a task_struct pointer to struct cpu_timer which is used to store a the task which runs the expiry. That's filled in when the task moves the expired timers to the local expiry list. That's not affecting the size of the k_itimer union as there are bigger union members already - Let the task take the expiry mutex around the expiry function - Let the waiter acquire a task reference with rcu_read_lock() held and block on the expiry mutex This avoids spin-waiting on a task which might not even be on a CPU and works nicely for RT too. Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT") Reported-by: Marco Elver <elver@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Marco Elver <elver@google.com> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
* | Merge tag 'irq-core-2023-04-24' of ↵Linus Torvalds2023-04-253-6/+65
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt updates from Thomas Gleixner: "Core: - Add tracepoints for tasklet callbacks which makes it possible to analyze individual tasklet functions instead of guess working from the overall duration of tasklet processing - Ensure that secondary interrupt threads have their affinity adjusted correctly Drivers: - A large rework of the RISC-V IPI management to prepare for a new RISC-V interrupt architecture - Small fixes and enhancements all over the place - Removal of support for various obsolete hardware platforms and the related code" * tag 'irq-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) irqchip/st: Remove stih415/stih416 and stid127 platforms support irqchip/gic-v3: Add Rockchip 3588001 erratum workaround genirq: Update affinity of secondary threads softirq: Add trace points for tasklet entry/exit irqchip/loongson-pch-pic: Fix pch_pic_acpi_init calling irqchip/loongson-pch-pic: Fix registration of syscore_ops irqchip/loongson-eiointc: Fix registration of syscore_ops irqchip/loongson-eiointc: Fix incorrect use of acpi_get_vec_parent irqchip/loongson-eiointc: Fix returned value on parsing MADT irqchip/riscv-intc: Add empty irq_eoi() for chained irq handlers RISC-V: Use IPIs for remote icache flush when possible RISC-V: Use IPIs for remote TLB flush when possible RISC-V: Allow marking IPIs as suitable for remote FENCEs RISC-V: Treat IPIs as normal Linux IRQs irqchip/riscv-intc: Allow drivers to directly discover INTC hwnode RISC-V: Clear SIP bit only when using SBI IPI operations irqchip/irq-sifive-plic: Add syscore callbacks for hibernation irqchip: Use of_property_read_bool() for boolean properties irqchip/bcm-6345-l1: Request memory region irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4 ...
| * \ Merge tag 'irqchip-6.4' of ↵Thomas Gleixner2023-04-212-6/+18
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core Pull irqchip changes from Marc Zyngier: - Large RISC-V IPI rework to make way for a new interrupt architecture - More Loongarch fixes from Lianmin Lv, fixing issues in the so called "dual-bridge" systems. - Workaround for the nvidia T241 chip that gets confused in 3 and 4 socket configurations, leading to the GIC malfunctionning in some contexts - Drop support for non-firmware driven GIC configurarations now that the old ARM11MP Cavium board is gone - Workaround for the Rockchip 3588 chip that doesn't correctly deal with the shareability attributes. - Replace uses of of_find_property() with the more appropriate of_property_read_bool() - Make bcm-6345-l1 request its MMIO region - Add suspend support to the SiFive PLIC - Drop support for stih415, stih416 and stid127 platforms Link: https://lore.kernel.org/lkml/20230421132104.3021536-1-maz@kernel.org
| | * \ Merge branch irq/gic-6.4 into irq/irqchip-nextMarc Zyngier2023-04-212-6/+18
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * irq/gic-6.4: : . : Collection of GIC/GICv3 fixes and cleanups : : - Workaround for the nvidia T241 chip that gets confused : in 3 and 4 socket configurations, leading to the GIC : malfunctionning in some contexts : : - Drop support for non-firmware driven GIC configurarations : now that the old ARM11MP Cavium board is gone : : - Workaround for the Rockchip 3588 chip that doesn't : correctly deal with the shareability attributes. : . irqchip/gic-v3: Add Rockchip 3588001 erratum workaround irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4 irqchip/gic: Drop support for board files Signed-off-by: Marc Zyngier <maz@kernel.org>
| | | * | irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4Shanker Donthineni2023-04-081-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The T241 platform suffers from the T241-FABRIC-4 erratum which causes unexpected behavior in the GIC when multiple transactions are received simultaneously from different sources. This hardware issue impacts NVIDIA server platforms that use more than two T241 chips interconnected. Each chip has support for 320 {E}SPIs. This issue occurs when multiple packets from different GICs are incorrectly interleaved at the target chip. The erratum text below specifies exactly what can cause multiple transfer packets susceptible to interleaving and GIC state corruption. GIC state corruption can lead to a range of problems, including kernel panics, and unexpected behavior. >From the erratum text: "In some cases, inter-socket AXI4 Stream packets with multiple transfers, may be interleaved by the fabric when presented to ARM Generic Interrupt Controller. GIC expects all transfers of a packet to be delivered without any interleaving. The following GICv3 commands may result in multiple transfer packets over inter-socket AXI4 Stream interface: - Register reads from GICD_I* and GICD_N* - Register writes to 64-bit GICD registers other than GICD_IROUTERn* - ITS command MOVALL Multiple commands in GICv4+ utilize multiple transfer packets, including VMOVP, VMOVI, VMAPP, and 64-bit register accesses." This issue impacts system configurations with more than 2 sockets, that require multi-transfer packets to be sent over inter-socket AXI4 Stream interface between GIC instances on different sockets. GICv4 cannot be supported. GICv3 SW model can only be supported with the workaround. Single and Dual socket configurations are not impacted by this issue and support GICv3 and GICv4." Link: https://developer.nvidia.com/docs/t241-fabric-4/nvidia-t241-fabric-4-errata.pdf Writing to the chip alias region of the GICD_In{E} registers except GICD_ICENABLERn has an equivalent effect as writing to the global distributor. The SPI interrupt deactivate path is not impacted by the erratum. To fix this problem, implement a workaround that ensures read accesses to the GICD_In{E} registers are directed to the chip that owns the SPI, and disable GICv4.x features. To simplify code changes, the gic_configure_irq() function uses the same alias region for both read and write operations to GICD_ICFGR. Co-developed-by: Vikram Sethi <vsethi@nvidia.com> Signed-off-by: Vikram Sethi <vsethi@nvidia.com> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> (for SMCCC/SOC ID bits) Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230319024314.3540573-2-sdonthineni@nvidia.com
| | | * | irqchip/gic: Drop support for board filesMarc Zyngier2023-04-081-6/+0
| | | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the last non-OF, non-ACPI user of the GIC being removed in e73307b9ebc4 ("ARM: cns3xxx: remove entire platform"), we can finally drop the entry point and do some minor cleanup. We also make the driver depend on CONFIG_OF, which is required even when CONFIG_ACPI is selected. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230315130218.3212033-1-maz@kernel.org
| * | | softirq: Add trace points for tasklet entry/exitLingutla Chandrasekhar2023-04-151-0/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tasklets are supposed to finish their work quickly and should not block the current running process, but it is not guaranteed that they do so. Currently softirq_entry/exit can be used to analyse the total tasklets execution time, but that's not helpful to track individual tasklets execution time. That makes it hard to identify tasklet functions, which take more time than expected. Add tasklet_entry/exit trace point support to track individual tasklet execution. Trivial usage example: # echo 1 > /sys/kernel/debug/tracing/events/irq/tasklet_entry/enable # echo 1 > /sys/kernel/debug/tracing/events/irq/tasklet_exit/enable # cat /sys/kernel/debug/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 4/4 #P:4 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <idle>-0 [003] ..s1. 314.011428: tasklet_entry: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func <idle>-0 [003] ..s1. 314.011432: tasklet_exit: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func <idle>-0 [003] ..s1. 314.017369: tasklet_entry: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func <idle>-0 [003] ..s1. 314.017371: tasklet_exit: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org> Signed-off-by: J. Avila <elavila@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230407230526.1685443-1-jstultz@google.com [elavila: Port to android-mainline] [jstultz: Rebased to upstream, cut unused trace points, added comments for the tracepoints, reworded commit]
* | | | Merge tag 'core-entry-2023-04-24' of ↵Linus Torvalds2023-04-252-0/+48
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core entry/ptrace update from Thomas Gleixner: "Provide a ptrace set/get interface for syscall user dispatch. The main purpose is to enable checkpoint/restore (CRIU) to handle processes which utilize syscall user dispatch correctly" * tag 'core-entry-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftest, ptrace: Add selftest for syscall user dispatch config api ptrace: Provide set/get interface for syscall user dispatch syscall_user_dispatch: Untag selector address before access_ok() syscall_user_dispatch: Split up set_syscall_user_dispatch()
| * | | | ptrace: Provide set/get interface for syscall user dispatchGregory Price2023-04-162-0/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The syscall user dispatch configuration can only be set by the task itself, but lacks a ptrace set/get interface which makes it impossible to implement checkpoint/restore for it. Add the required ptrace requests and the get/set functions in the syscall user dispatch code to make that possible. Signed-off-by: Gregory Price <gregory.price@memverge.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230407171834.3558-4-gregory.price@memverge.com
* | | | | Merge tag 'x86_sev_for_v6.4_rc1' of ↵Linus Torvalds2023-04-253-2/+25
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Add the necessary glue so that the kernel can run as a confidential SEV-SNP vTOM guest on Hyper-V. A vTOM guest basically splits the address space in two parts: encrypted and unencrypted. The use case being running unmodified guests on the Hyper-V confidential computing hypervisor - Double-buffer messages between the guest and the hardware PSP device so that no partial buffers are copied back'n'forth and thus potential message integrity and leak attacks are possible - Name the return value the sev-guest driver returns when the hw PSP device hasn't been called, explicitly - Cleanups * tag 'x86_sev_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/hyperv: Change vTOM handling to use standard coco mechanisms init: Call mem_encrypt_init() after Hyper-V hypercall init is done x86/mm: Handle decryption/re-encryption of bss_decrypted consistently Drivers: hv: Explicitly request decrypted in vmap_pfn() calls x86/hyperv: Reorder code to facilitate future work x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM x86/sev: Change snp_guest_issue_request()'s fw_err argument virt/coco/sev-guest: Double-buffer messages crypto: ccp: Get rid of __sev_platform_init_locked()'s local function pointer crypto: ccp - Name -1 return value as SEV_RET_NO_FW_CALL
| * | | | | x86/hyperv: Change vTOM handling to use standard coco mechanismsMichael Kelley2023-03-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hyper-V guests on AMD SEV-SNP hardware have the option of using the "virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP architecture. With vTOM, shared vs. private memory accesses are controlled by splitting the guest physical address space into two halves. vTOM is the dividing line where the uppermost bit of the physical address space is set; e.g., with 47 bits of guest physical address space, vTOM is 0x400000000000 (bit 46 is set). Guest physical memory is accessible at two parallel physical addresses -- one below vTOM and one above vTOM. Accesses below vTOM are private (encrypted) while accesses above vTOM are shared (decrypted). In this sense, vTOM is like the GPA.SHARED bit in Intel TDX. Support for Hyper-V guests using vTOM was added to the Linux kernel in two patch sets[1][2]. This support treats the vTOM bit as part of the physical address. For accessing shared (decrypted) memory, these patch sets create a second kernel virtual mapping that maps to physical addresses above vTOM. A better approach is to treat the vTOM bit as a protection flag, not as part of the physical address. This new approach is like the approach for the GPA.SHARED bit in Intel TDX. Rather than creating a second kernel virtual mapping, the existing mapping is updated using recently added coco mechanisms. When memory is changed between private and shared using set_memory_decrypted() and set_memory_encrypted(), the PTEs for the existing kernel mapping are changed to add or remove the vTOM bit in the guest physical address, just as with TDX. The hypercalls to change the memory status on the host side are made using the existing callback mechanism. Everything just works, with a minor tweak to map the IO-APIC to use private accesses. To accomplish the switch in approach, the following must be done: * Update Hyper-V initialization to set the cc_mask based on vTOM and do other coco initialization. * Update physical_mask so the vTOM bit is no longer treated as part of the physical address * Remove CC_VENDOR_HYPERV and merge the associated vTOM functionality under CC_VENDOR_AMD. Update cc_mkenc() and cc_mkdec() to set/clear the vTOM bit as a protection flag. * Code already exists to make hypercalls to inform Hyper-V about pages changing between shared and private. Update this code to run as a callback from __set_memory_enc_pgtable(). * Remove the Hyper-V special case from __set_memory_enc_dec() * Remove the Hyper-V specific call to swiotlb_update_mem_attributes() since mem_encrypt_init() will now do it. * Add a Hyper-V specific implementation of the is_private_mmio() callback that returns true for the IO-APIC and vTPM MMIO addresses [1] https://lore.kernel.org/all/20211025122116.264793-1-ltykernel@gmail.com/ [2] https://lore.kernel.org/all/20211213071407.314309-1-ltykernel@gmail.com/ [ bp: Touchups. ] Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/1679838727-87310-7-git-send-email-mikelley@microsoft.com
| * | | | | x86/sev: Change snp_guest_issue_request()'s fw_err argumentDionna Glaze2023-03-211-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The GHCB specification declares that the firmware error value for a guest request will be stored in the lower 32 bits of EXIT_INFO_2. The upper 32 bits are for the VMM's own error code. The fw_err argument to snp_guest_issue_request() is thus a misnomer, and callers will need access to all 64 bits. The type of unsigned long also causes problems, since sw_exit_info2 is u64 (unsigned long long) vs the argument's unsigned long*. Change this type for issuing the guest request. Pass the ioctl command struct's error field directly instead of in a local variable, since an incomplete guest request may not set the error code, and uninitialized stack memory would be written back to user space. The firmware might not even be called, so bookend the call with the no firmware call error and clear the error. Since the "fw_err" field is really exitinfo2 split into the upper bits' vmm error code and lower bits' firmware error code, convert the 64 bit value to a union. [ bp: - Massage commit message - adjust code - Fix a build issue as Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202303070609.vX6wp2Af-lkp@intel.com - print exitinfo2 in hex Tom: - Correct -EIO exit case. ] Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230214164638.1189804-5-dionnaglaze@google.com Link: https://lore.kernel.org/r/20230307192449.24732-12-bp@alien8.de
| * | | | | crypto: ccp - Name -1 return value as SEV_RET_NO_FW_CALLPeter Gonda2023-03-211-0/+7
| | |_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PSP can return a "firmware error" code of -1 in circumstances where the PSP has not actually been called. To make this protocol unambiguous, name the value SEV_RET_NO_FW_CALL. [ bp: Massage a bit. ] Signed-off-by: Peter Gonda <pgonda@google.com> Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20221207010210.2563293-2-dionnaglaze@google.com
* | | | | Merge tag 'pull-old-dio' of ↵Linus Torvalds2023-04-241-2/+2
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull legacy dio cleanup from Al Viro. * tag 'pull-old-dio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: __blockdev_direct_IO(): get rid of submit_io callback
| * | | | | __blockdev_direct_IO(): get rid of submit_io callbackAl Viro2023-03-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | always NULL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | | Merge tag 'pull-write-one-page' of ↵Linus Torvalds2023-04-241-6/+0
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs write_one_page removal from Al Viro: "write_one_page series" * tag 'pull-write-one-page' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: mm,jfs: move write_one_page/folio_write_one to jfs ocfs2: don't use write_one_page in ocfs2_duplicate_clusters_by_page ufs: don't flush page immediately for DIRSYNC directories
| * | | | | | mm,jfs: move write_one_page/folio_write_one to jfsChristoph Hellwig2023-03-121-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The last remaining user of folio_write_one through the write_one_page wrapper is jfs, so move the functionality there and hard code the call to metapage_writepage. Note that the use of the pagecache by the JFS 'metapage' buffer cache is a bit odd, and we could probably do without VM-level dirty tracking at all, but that's a change for another time. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | | | Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2023-04-241-1/+0
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull vfs fget updates from Al Viro: "fget() to fdget() conversions" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse_dev_ioctl(): switch to fdget() cgroup_get_from_fd(): switch to fdget_raw() bpf: switch to fdget_raw() build_mount_idmapped(): switch to fdget() kill the last remaining user of proc_ns_fget() SVM-SEV: convert the rest of fget() uses to fdget() in there convert sgx_set_attribute() to fdget()/fdput() convert setns(2) to fdget()/fdput()
| * | | | | | | kill the last remaining user of proc_ns_fget()Al Viro2023-04-201-1/+0
| | |/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | lookups by descriptor are better off closer to syscall surface... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | | | Merge tag 'erofs-for-6.4-rc1' of ↵Linus Torvalds2023-04-241-2/+2
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, sub-page block support for uncompressed files is available. It's mainly used to enable original signing ('golden') 4k-block images on arm64 with 16/64k pages. In addition, end users could also use this feature to build a manifest to directly refer to golden tar data. Besides, long xattr name prefix support is also introduced in this cycle to avoid too many xattrs with the same prefix (e.g. overlayfs xattrs). It's useful for erofs + overlayfs combination (like Composefs model): the image size is reduced by ~14% and runtime performance is also slightly improved. Others are random fixes and cleanups as usual. Summary: - Add sub-page block size support for uncompressed files - Support flattened block device for multi-blob images to be attached into virtual machines (including cloud servers) and bare metals - Support long xattr name prefixes to optimize images with common xattr namespaces (e.g. files with overlayfs xattrs) use cases - Various minor cleanups & fixes" * tag 'erofs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: cleanup i_format-related stuffs erofs: sunset erofs_dbg() erofs: fix potential overflow calculating xattr_isize erofs: get rid of z_erofs_fill_inode() erofs: enable long extended attribute name prefixes erofs: handle long xattr name prefixes properly erofs: add helpers to load long xattr name prefixes erofs: introduce on-disk format for long xattr name prefixes erofs: move packed inode out of the compression part erofs: keep meta inode into erofs_buf erofs: initialize packed inode after root inode is assigned erofs: stop parsing non-compact HEAD index if clusterofs is invalid erofs: don't warn ztailpacking feature anymore erofs: simplify erofs_xattr_generic_get() erofs: rename init_inode_xattrs with erofs_ prefix erofs: move several xattr helpers into xattr.c erofs: tidy up EROFS on-disk naming erofs: support flattened block device for multi-blob images erofs: set block size to the on-disk block size erofs: avoid hardcoded blocksize for subpage block support
| * | | | | | | erofs: avoid hardcoded blocksize for subpage block supportJingbo Xu2023-04-171-2/+2
| | |_|_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the first step of converting hardcoded blocksize to that specified in on-disk superblock, convert all call sites of hardcoded blocksize to sb->s_blocksize except for: 1) use sbi->blkszbits instead of sb->s_blocksize in erofs_superblock_csum_verify() since sb->s_blocksize has not been updated with the on-disk blocksize yet when the function is called. 2) use inode->i_blkbits instead of sb->s_blocksize in erofs_bread(), since the inode operated on may be an anonymous inode in fscache mode. Currently the anonymous inode is allocated from an anonymous mount maintained in erofs, while in the near future we may allocate anonymous inodes from a generic API directly and thus have no access to the anonymous inode's i_sb. Thus we keep the block size in i_blkbits for anonymous inodes in fscache mode. Be noted that this patch only gets rid of the hardcoded blocksize, in preparation for actually setting the on-disk block size in the following patch. The hard limit of constraining the block size to PAGE_SIZE still exists until the next patch. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230313135309.75269-2-jefflexu@linux.alibaba.com [ Gao Xiang: fold a patch to fix incorrect truncated offsets. ] Link: https://lore.kernel.org/r/20230413035734.15457-1-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
* | | | | | | Merge tag 'v6.4/vfs.open' of ↵Linus Torvalds2023-04-241-1/+0
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs open fixlet from Christian Brauner: "EINVAL ist keinmal: This contains the changes to make O_DIRECTORY when specified together with O_CREAT an invalid request. The wider background is that a regression report about the behavior of O_DIRECTORY | O_CREAT was sent to fsdevel about a behavior that was changed multiple years and LTS releases earlier during v5.7 development. This has also been covered in https://lwn.net/Articles/926782/ which provides an excellent summary of the discussion" * tag 'v6.4/vfs.open' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: open: return EINVAL for O_DIRECTORY | O_CREAT
| * | | | | | | open: return EINVAL for O_DIRECTORY | O_CREATChristian Brauner2023-03-221-1/+0
| | |_|_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After a couple of years and multiple LTS releases we received a report that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7. On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL had the following semantics: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: create regular file * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: create regular file * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory On kernels since to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL have the following semantics: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory This is a fairly substantial semantic change that userspace didn't notice until Pedro took the time to deliberately figure out corner cases. Since no one noticed this breakage we can somewhat safely assume that O_DIRECTORY | O_CREAT combinations are likely unused. The v5.7 breakage is especially weird because while ENOTDIR is returned indicating failure a regular file is actually created. This doesn't make a lot of sense. Time was spent finding potential users of this combination. Searching on codesearch.debian.net showed that codebases often express semantical expectations about O_DIRECTORY | O_CREAT which are completely contrary to what our code has done and currently does. The expectation often is that this particular combination would create and open a directory. This suggests users who tried to use that combination would stumble upon the counterintuitive behavior no matter if pre-v5.7 or post v5.7 and quickly realize neither semantics give them what they want. For some examples see the code examples in [1] to [3] and the discussion in [4]. There are various ways to address this issue. The lazy/simple option would be to restore the pre-v5.7 behavior and to just live with that bug forever. But since there's a real chance that the O_DIRECTORY | O_CREAT quirk isn't relied upon we should try to get away with murder(ing bad semantics) first. If we need to Frankenstein pre-v5.7 behavior later so be it. So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT combinations. In addition to cleaning up the old bug this also opens up the possiblity to make that flag combination do something more intuitive in the future. Starting with this commit the following semantics apply: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory One additional note, O_TMPFILE is implemented as: #define __O_TMPFILE 020000000 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) For older kernels it was important to return an explicit error when O_TMPFILE wasn't supported. So O_TMPFILE requires that O_DIRECTORY is raised alongside __O_TMPFILE. It also enforced that O_CREAT wasn't specified. Since O_DIRECTORY | O_CREAT could be used to create a regular allowing that combination together with __O_TMPFILE would've meant that false positives were possible, i.e., that a regular file was created instead of a O_TMPFILE. This could've been used to trick userspace into thinking it operated on a O_TMPFILE when it wasn't. Now that we block O_DIRECTORY | O_CREAT completely the check for O_CREAT in the __O_TMPFILE branch via if ((flags & O_TMPFILE_MASK) != O_TMPFILE) can be dropped. Instead we can simply check verify that O_DIRECTORY is raised via if (!(flags & O_DIRECTORY)) and explain this in two comments. As Aleksa pointed out O_PATH is unaffected by this change since it always returned EINVAL if O_CREAT was specified - with or without O_DIRECTORY. Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1] Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2] Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3] Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4] Reported-by: Pedro Falcato <pedro.falcato@gmail.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | | | | | Merge tag 'v6.4/vfs.misc' of ↵Linus Torvalds2023-04-243-2/+3
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains a pile of various smaller fixes. Most of them aren't very interesting so this just highlights things worth mentioning: - Various filesystems contained the same little helper to convert from the mode of a dentry to the DT_* type of that dentry. They have now all been switched to rely on the generic fs_umode_to_dtype() helper. All custom helpers are removed (Jeff) - Fsnotify now reports ACCESS and MODIFY events for splice (Chung-Chiang Cheng) - After converting timerfd a long time ago to rely on wait_event_interruptible_*() apis, convert eventfd as well. This removes the complex open-coded wait code (Wen Yang) - Simplify sysctl registration for devpts, avoiding the declaration of two tables. Instead, just use a prefixed path with register_sysctl() (Luis) - The setattr_should_drop_sgid() helper is now exported so NFS can use it. By switching NFS to this helper an NFS setgid inheritance bug is fixed (me)" * tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode() pnode: pass mountpoint directly eventfd: use wait_event_interruptible_locked_irq() helper splice: report related fsnotify events fs: consolidate duplicate dt_type helpers nfs: use vfs setgid helper Update relatime comments to include equality fs/buffer: Remove redundant assignment to err fs_context: drop the unused lsm_flags member fs/namespace: fnic: Switch to use %ptTd Documentation: update idmappings.rst devpts: simplify two-level sysctl registration for pty_kern_table eventpoll: align comment with nested epoll limitation
| * | | | | | | nfs: use vfs setgid helperChristian Brauner2023-03-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've aligned setgid behavior over multiple kernel releases. The details can be found in the following two merge messages: cf619f891971 ("Merge tag 'fs.ovl.setgid.v6.2') 426b4ca2d6a5 ("Merge tag 'fs.setgid.v6.0') Consistent setgid stripping behavior is now encapsulated in the setattr_should_drop_sgid() helper which is used by all filesystems that strip setgid bits outside of vfs proper. Switch nfs to rely on this helper as well. Without this patch the setgid stripping tests in xfstests will fail. Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230313-fs-nfs-setgid-v2-1-9a59f436cfc0@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | | | | fs_context: drop the unused lsm_flags memberOndrej Mosnacek2023-03-162-2/+1
| | |_|_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This isn't ever used by VFS now, and it couldn't even work. Any FS that uses the SECURITY_LSM_NATIVE_LABELS flag needs to also process the value returned back from the LSM, so it needs to do its security_sb_set_mnt_opts() call on its own anyway. Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | | | | | | Merge tag 'v6.4/vfs.acl' of ↵Linus Torvalds2023-04-243-3/+28
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull acl updates from Christian Brauner: "After finishing the introduction of the new posix acl api last cycle the generic POSIX ACL xattr handlers are still around in the filesystems xattr handlers for two reasons: (1) Because a few filesystems rely on the ->list() method of the generic POSIX ACL xattr handlers in their ->listxattr() inode operation. (2) POSIX ACLs are only available if IOP_XATTR is raised. The IOP_XATTR flag is raised in inode_init_always() based on whether the sb->s_xattr pointer is non-NULL. IOW, the registered xattr handlers of the filesystem are used to raise IOP_XATTR. Removing the generic POSIX ACL xattr handlers from all filesystems would risk regressing filesystems that only implement POSIX ACL support and no other xattrs (nfs3 comes to mind). This contains the work to decouple POSIX ACLs from the IOP_XATTR flag as they don't depend on xattr handlers anymore. So it's now possible to remove the generic POSIX ACL xattr handlers from the sb->s_xattr list of all filesystems. This is a crucial step as the generic POSIX ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs don't depend on the xattr infrastructure anymore. Adressing problem (1) will require more long-term work. It would be best to get rid of the ->list() method of xattr handlers completely at some point. For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX ACL xattr handler is kept around so they can continue to use array-based xattr handler indexing. This update does simplify the ->listxattr() implementation of all these filesystems however" * tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: acl: don't depend on IOP_XATTR ovl: check for ->listxattr() support reiserfs: rework priv inode handling fs: rename generic posix acl handlers reiserfs: rework ->listxattr() implementation fs: simplify ->listxattr() implementation fs: drop unused posix acl handlers xattr: remove unused argument xattr: add listxattr helper xattr: simplify listxattr helpers
| * | | | | | | fs: rename generic posix acl handlersChristian Brauner2023-03-061-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reflect in their naming and document that they are kept around for legacy reasons and shouldn't be used anymore by new code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | | xattr: remove unused argumentChristian Brauner2023-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | his helpers is really just used to check for user.* xattr support so don't make it pointlessly generic. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | | xattr: add listxattr helperChristian Brauner2023-03-061-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a tiny helper to determine whether an xattr handler given a specific dentry supports listing the requested xattr. We will use this helper in various filesystems in later commits. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | | xattr: simplify listxattr helpersChristian Brauner2023-03-062-0/+8
| | |_|_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The generic_listxattr() and simple_xattr_list() helpers list xattrs and contain duplicated code. Add two helpers that both generic_listxattr() and simple_xattr_list() can use. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | | | | | | Merge tag 'v6.4/pidfd.file' of ↵Linus Torvalds2023-04-241-0/+1
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd updates from Christian Brauner: "This adds a new pidfd_prepare() helper which allows the caller to reserve a pidfd number and allocates a new pidfd file that stashes the provided struct pid. It should be avoided installing a file descriptor into a task's file descriptor table just to close it again via close_fd() in case an error occurs. The fd has been visible to userspace and might already be in use. Instead, a file descriptor should be reserved but not installed into the caller's file descriptor table. If another failure path is hit then the reserved file descriptor and file can just be put without any userspace visible side-effects. And if all failure paths are cleared the file descriptor and file can be installed into the task's file descriptor table. This helper is now used in all places that open coded this functionality before. For example, this is currently done during copy_process() and fanotify used pidfd_create(), which returns a pidfd that has already been made visibile in the caller's file descriptor table, but then closed it using close_fd(). In one of the next merge windows there is also new functionality coming to unix domain sockets that will have to rely on pidfd_prepare()" * tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fanotify: use pidfd_prepare() fork: use pidfd_prepare() pid: add pidfd_prepare()
| * | | | | | | pid: add pidfd_prepare()Christian Brauner2023-04-031-0/+1
| | |_|_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new helper that allows to reserve a pidfd and allocates a new pidfd file that stashes the provided struct pid. This will allow us to remove places that either open code this function or that call pidfd_create() but then have to call close_fd() because there are still failure points after pidfd_create() has been called. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230327-pidfd-file-api-v1-1-5c0e9a3158e4@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | | | | | Merge tag 'v6.4/kernel.user_worker' of ↵Linus Torvalds2023-04-243-4/+34
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull user work thread updates from Christian Brauner: "This contains the work generalizing the ability to create a kernel worker from a userspace process. Such user workers will run with the same credentials as the userspace process they were created from providing stronger security and accounting guarantees than the traditional override_creds() approach ever could've hoped for. The original work was heavily based and optimzed for the needs of io_uring which was the first user. However, as it quickly turned out the ability to create user workers inherting properties from a userspace process is generally useful. The vhost subsystem currently creates workers using the kthread api. The consequences of using the kthread api are that RLIMITs don't work correctly as they are inherited from khtreadd. This leads to bugs where more workers are created than would be allowed by the RLIMITs of the userspace process in lieu of which workers are created. Problems like this disappear with user workers created from the userspace processes for which they perform the work. In addition, providing this api allows vhost to remove additional complexity. For example, cgroup and mm sharing will just work out of the box with user workers based on the relevant userspace process instead of manually ensuring the correct cgroup and mm contexts are used. So the vhost subsystem should simply be made to use the same mechanism as io_uring. To this end the original mechanism used for create_io_thread() is generalized into user workers: - Introduce PF_USER_WORKER as a generic indicator that a given task is a user worker, i.e., a kernel task that was created from a userspace process. Now a PF_IO_WORKER thread is just a specialized version of PF_USER_WORKER. So io_uring io workers raise both flags. - Make copy_process() available to core kernel code - Extend struct kernel_clone_args with the following bitfields allowing to indicate to copy_process(): - to create a user worker (raise PF_USER_WORKER) - to not inherit any files from the userspace process - to ignore signals After all generic changes are in place the vhost subsystem implements a new dedicated vhost api based on user workers. Finally, vhost is switched to rely on the new api moving it off of kthreads. Thanks to Mike for sticking it out and making it through this rather arduous journey" * tag 'v6.4/kernel.user_worker' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: vhost: use vhost_tasks for worker threads vhost: move worker thread fields to new struct vhost_task: Allow vhost layer to use copy_process fork: allow kernel code to call copy_process fork: Add kernel_clone_args flag to ignore signals fork: add kernel_clone_args flag to not dup/clone files fork/vm: Move common PF_IO_WORKER behavior to new flag kernel: Make io_thread and kthread bit fields kthread: Pass in the thread's name during creation kernel: Allow a kernel thread's name to be set in copy_process csky: Remove kernel_thread declaration
| * | | | | | | vhost_task: Allow vhost layer to use copy_processMike Christie2023-03-231-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Qemu will create vhost devices in the kernel which perform network, SCSI, etc IO and management operations from worker threads created by the kthread API. Because the kthread API does a copy_process on the kthreadd thread, the vhost layer has to use kthread_use_mm to access the Qemu thread's memory and cgroup_attach_task_all to add itself to the Qemu thread's cgroups, and it bypasses the RLIMIT_NPROC limit which can result in VMs creating more threads than the admin expected. This patch adds a new struct vhost_task which can be used instead of kthreads. They allow the vhost layer to use copy_process and inherit the userspace process's mm and cgroups, the task is accounted for under the userspace's nproc count and can be seen in its process tree, and other features like namespaces work and are inherited by default. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | | | | fork: allow kernel code to call copy_processMike Christie2023-03-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The next patch adds helpers like create_io_thread, but for use by the vhost layer. There are several functions, so they are in their own file instead of cluttering up fork.c. This patch allows that new file to call copy_process. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | | fork: Add kernel_clone_args flag to ignore signalsMike Christie2023-03-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since: commit 10ab825bdef8 ("change kernel threads to ignore signals instead of blocking them") kthreads have been ignoring signals by default, and the vhost layer has never had a need to change that. This patch adds an option flag, USER_WORKER_SIG_IGN, handled in copy_process() after copy_sighand() and copy_signals() so vhost_tasks added in the next patches can continue to ignore singals. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | | fork: add kernel_clone_args flag to not dup/clone filesMike Christie2023-03-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each vhost device gets a thread that is used to perform IO and management operations. Instead of a thread that is accessing a device, the thread is part of the device, so when it creates a thread using a helper based on copy_process we can't dup or clone the parent's files/FDS because it would do an extra increment on ourself. Later, when we do: Qemu process exits: do_exit -> exit_files -> put_files_struct -> close_files we would leak the device's resources because of that extra refcount on the fd or file_struct. This patch adds a no_files option so these worker threads can prevent taking an extra refcount on themselves. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | | fork/vm: Move common PF_IO_WORKER behavior to new flagMike Christie2023-03-122-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a new flag, PF_USER_WORKER, that's used for behavior common to to both PF_IO_WORKER and users like vhost which will use a new helper instead of create_io_thread because they require different behavior for operations like signal handling. The common behavior PF_USER_WORKER covers is the vm reclaim handling. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | | kernel: Make io_thread and kthread bit fieldsMike Christie2023-03-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We only set args->io_thread/kthread to 0 or 1 then test if they are set, so make them bit fields. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | | kernel: Allow a kernel thread's name to be set in copy_processMike Christie2023-03-121-1/+3
| | |/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows kernel users to pass in the thread name so it can be set during creation instead of having to use set_task_comm after the thread is created. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | | | | | | Merge tag 'linux-kselftest-kunit-6.4-rc1' of ↵Linus Torvalds2023-04-242-3/+3
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit updates from Shuah Khan: - several fixes to kunit tool - new klist structure test - support for m68k under QEMU - support for overriding the QEMU serial port - support for SH under QEMU * tag 'linux-kselftest-kunit-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: add tests for using current KUnit test field kunit: tool: Add support for SH under QEMU kunit: tool: Add support for overriding the QEMU serial port .gitignore: Unignore .kunitconfig list: test: Test the klist structure kunit: increase KUNIT_LOG_SIZE to 2048 bytes kunit: Use gfp in kunit_alloc_resource() kernel-doc kunit: tool: fix pre-existing `mypy --strict` errors and update run_checks.py kunit: tool: remove unused imports and variables kunit: tool: add subscripts for type annotations where appropriate kunit: fix bug of extra newline characters in debugfs logs kunit: fix bug in the order of lines in debugfs logs kunit: fix bug in debugfs logs of parameterized tests kunit: tool: Add support for m68k under QEMU
| * | | | | | | kunit: increase KUNIT_LOG_SIZE to 2048 bytesHeiko Carstens2023-03-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The s390 specific test_unwind kunit test has 39 parameterized tests. The results in debugfs are truncated since the full log doesn't fit into 1500 bytes. Therefore increase KUNIT_LOG_SIZE to 2048 bytes in a similar way like it was done recently with commit "kunit: fix bug in debugfs logs of parameterized tests". With that the whole test result is present. Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
| * | | | | | | kunit: Use gfp in kunit_alloc_resource() kernel-docStephen Boyd2023-03-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Copy/pasting the code from the kernel-doc here doesn't compile because kunit_alloc_resource() takes a gfp flags argument. Pass the gfp argument from the caller to complete the example. Signed-off-by: Stephen Boyd <sboyd@kernel.org> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
| * | | | | | | kunit: fix bug of extra newline characters in debugfs logsRae Moar2023-03-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix bug of the extra newline characters in debugfs logs. When a line is added to debugfs with a newline character at the end, an extra line appears in the debugfs log. This is due to a discrepancy between how the lines are printed and how they are added to the logs. Remove this discrepancy by checking if a newline character is present before adding a newline character. This should closely match the printk behavior. Add kunit_log_newline_test to provide test coverage for this issue. (Also, move kunit_log_test above suite definition to remove the unnecessary declaration prior to the suite definition) As an example, say we add these two lines to the log: kunit_log(..., "KTAP version 1\n"); kunit_log(..., "1..1"); The debugfs log before this fix: KTAP version 1 1..1 The debugfs log after this fix: KTAP version 1 1..1 Signed-off-by: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
| * | | | | | | kunit: fix bug in debugfs logs of parameterized testsRae Moar2023-03-101-1/+1
| |/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix bug in debugfs logs that causes individual parameterized results to not appear because the log is reinitialized (cleared) when each parameter is run. Ensure these results appear in the debugfs logs, increase log size to allow for the size of parameterized results. As a result, append lines to the log directly rather than using an intermediate variable that can cause stack size warnings due to the increased log size. Here is the debugfs log of ext4_inode_test which uses parameterized tests before the fix: KTAP version 1 # Subtest: ext4_inode_test 1..1 # Totals: pass:16 fail:0 skip:0 total:16 ok 1 ext4_inode_test As you can see, this log does not include any of the individual parametrized results. After (in combination with the next two fixes to remove extra empty line and ensure KTAP valid format): KTAP version 1 1..1 KTAP version 1 # Subtest: ext4_inode_test 1..1 KTAP version 1 # Subtest: inode_test_xtimestamp_decoding ok 1 1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits ... (the rest of the individual parameterized tests) ok 16 2446-05-10 Upper bound of 32bit >=0 timestamp. All extra # inode_test_xtimestamp_decoding: pass:16 fail:0 skip:0 total:16 ok 1 inode_test_xtimestamp_decoding # Totals: pass:16 fail:0 skip:0 total:16 ok 1 ext4_inode_test Signed-off-by: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* | | | | | | Merge tag 'rcu.6.4.april5.2023.3' of ↵Linus Torvalds2023-04-248-40/+116
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux Pull RCU updates from Joel Fernandes: - Updates and additions to MAINTAINERS files, with Boqun being added to the RCU entry and Zqiang being added as an RCU reviewer. I have also transitioned from reviewer to maintainer; however, Paul will be taking over sending RCU pull-requests for the next merge window. - Resolution of hotplug warning in nohz code, achieved by fixing cpu_is_hotpluggable() through interaction with the nohz subsystem. Tick dependency modifications by Zqiang, focusing on fixing usage of the TICK_DEP_BIT_RCU_EXP bitmask. - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n kernels, fixed by Zqiang. - Improvements to rcu-tasks stall reporting by Neeraj. - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for increased robustness, affecting several components like mac802154, drbd, vmw_vmci, tracing, and more. A report by Eric Dumazet showed that the API could be unknowingly used in an atomic context, so we'd rather make sure they know what they're asking for by being explicit: https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/ - Documentation updates, including corrections to spelling, clarifications in comments, and improvements to the srcu_size_state comments. - Better srcu_struct cache locality for readers, by adjusting the size of srcu_struct in support of SRCU usage by Christoph Hellwig. - Teach lockdep to detect deadlocks between srcu_read_lock() vs synchronize_srcu() contributed by Boqun. Previously lockdep could not detect such deadlocks, now it can. - Integration of rcutorture and rcu-related tools, targeted for v6.4 from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis module parameter, and more - Miscellaneous changes, various code cleanups and comment improvements * tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits) checkpatch: Error out if deprecated RCU API used mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep() ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep() net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep() net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep() lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep() tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep() misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep() drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep() rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan() rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early rcu: Remove never-set needwake assignment from rcu_report_qs_rdp() rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race rcu/trace: use strscpy() to instead of strncpy() tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem ...
| | \ \ \ \ \ \
| | \ \ \ \ \ \
| | \ \ \ \ \ \
| | \ \ \ \ \ \
| | \ \ \ \ \ \
| | \ \ \ \ \ \
| *-----. \ \ \ \ \ \ Merge branches 'rcu/staging-core', 'rcu/staging-docs' and ↵Joel Fernandes (Google)2023-04-055-36/+111
| |\ \ \ \ \ \ \ \ \ \ | | | | |_|/ / / / / / | | | |/| | | | | | | | | | | | | | | | | | 'rcu/staging-kfree', remote-tracking branches 'paul/srcu-cf.2023.04.04a', 'fbq/rcu/lockdep.2023.03.27a' and 'fbq/rcu/rcutorture.2023.03.20a' into rcu/staging
| | | | | * | | | | | locking/lockdep: Improve the deadlock scenario print for sync and read lockBoqun Feng2023-03-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lock scenario print is always a weak spot of lockdep splats. Improvement can be made if we rework the dependency search and the error printing. However without touching the graph search, we can improve a little for the circular deadlock case, since we have the to-be-added lock dependency, and know whether these two locks are read/write/sync. In order to know whether a held_lock is sync or not, a bit was "stolen" from ->references, which reduce our limit for the same lock class nesting from 2^12 to 2^11, and it should still be good enough. Besides, since we now have bit in held_lock for sync, we don't need the "hardirqoffs being 1" trick, and also we can avoid the __lock_release() if we jump out of __lock_acquire() before the held_lock stored. With these changes, a deadlock case evolved with read lock and sync gets a better print-out from: [...] Possible unsafe locking scenario: [...] [...] CPU0 CPU1 [...] ---- ---- [...] lock(srcuA); [...] lock(srcuB); [...] lock(srcuA); [...] lock(srcuB); to [...] Possible unsafe locking scenario: [...] [...] CPU0 CPU1 [...] ---- ---- [...] rlock(srcuA); [...] lock(srcuB); [...] lock(srcuA); [...] sync(srcuB); Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>