summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* sched: Fix TASK_state comparisonsPeter Zijlstra2022-09-283-4/+8
| | | | | | | | | | | | | | | | | | | Task state is fundamentally a bitmask; direct comparisons are probably not working as intended. Specifically the normal wait-state have a number of possible modifiers: TASK_UNINTERRUPTIBLE: TASK_WAKEKILL, TASK_NOLOAD, TASK_FREEZABLE TASK_INTERRUPTIBLE: TASK_FREEZABLE Specifically, the addition of TASK_FREEZABLE wrecked __wait_is_interruptible(). This however led to an audit of direct comparisons yielding the rest of the changes. Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com> Debugged-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
* sched/fair: Move call to list_last_entry() in detach_tasksVincent Guittot2022-09-151-2/+2
| | | | | | | | | Move the call to list_last_entry() in detach_tasks() after testing loop_max and loop_break. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-4-vincent.guittot@linaro.org
* sched/fair: Cleanup loop_max and loop_breakVincent Guittot2022-09-153-12/+11
| | | | | | | | | | | | | | | | | sched_nr_migrate_break is set to a fix value and never changes so we can replace it by a define SCHED_NR_MIGRATE_BREAK. Also, we adjust SCHED_NR_MIGRATE_BREAK to be aligned with the init value of sysctl_sched_nr_migrate which can be init to different values. Then, use SCHED_NR_MIGRATE_BREAK to init sysctl_sched_nr_migrate. The behavior stays unchanged unless you modify sysctl_sched_nr_migrate trough debugfs. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-3-vincent.guittot@linaro.org
* sched/fair: Make sure to try to detach at least one movable taskVincent Guittot2022-09-151-3/+9
| | | | | | | | | | | | | During load balance, we try at most env->loop_max time to move a task. But it can happen that the loop_max LRU tasks (ie tail of the cfs_tasks list) can't be moved to dst_cpu because of affinity. In this case, loop in the list until we found at least one. The maximum of detached tasks remained the same as before. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org
* sched: Show PF_flag holesPeter Zijlstra2022-09-071-0/+7
| | | | | | | | | Put placeholders in PF_flag so we can see how holey it is. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/YxctoffFFPXONESt@hirez.programming.kicks-ass.net
* freezer,sched: Rewrite core freezer logicPeter Zijlstra2022-09-0732-395/+210
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
* sched: Widen TAKS_state literalsPeter Zijlstra2022-09-071-15/+15
| | | | | | | | In preparation of adding more states, add a few 0s to the literals as we've just about ran out. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* sched/wait: Add wait_event_state()Peter Zijlstra2022-09-071-0/+28
| | | | | | | | Allows waiting with a custom @state. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220822114648.989212021@infradead.org
* sched/completion: Add wait_for_completion_state()Peter Zijlstra2022-09-072-0/+13
| | | | | | | | Allows waiting with a custom @state. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220822114648.922711674@infradead.org
* sched: Add TASK_ANY for wait_task_inactive()Peter Zijlstra2022-09-074-10/+12
| | | | | | | | | | Now that wait_task_inactive()'s @match_state argument is a mask (like ttwu()) it is possible to replace the special !match_state case with an 'all-states' value such that any blocked state will match. Suggested-by: Ingo Molnar (mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net
* sched: Change wait_task_inactive()s match_statePeter Zijlstra2022-09-071-2/+2
| | | | | | | | | | | | | Make wait_task_inactive()'s @match_state work like ttwu()'s @state. That is, instead of an equal comparison, use it as a mask. This allows matching multiple block conditions. (removes the unlikely; it doesn't make sense how it's only part of the condition) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220822114648.856734578@infradead.org
* freezer,umh: Clean up freezer/initrd interactionPeter Zijlstra2022-09-073-13/+14
| | | | | | | | | | | | | handle_initrd() marks itself as PF_FREEZER_SKIP in order to ensure that the UMH, which is going to freeze the system, doesn't indefinitely wait for it's caller. Rework things by adding UMH_FREEZABLE to indicate the completion is freezable. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114648.791019324@infradead.org
* freezer: Have {,un}lock_system_sleep() save/restore flagsPeter Zijlstra2022-09-077-44/+70
| | | | | | | | | | | | | Rafael explained that the reason for having both PF_NOFREEZE and PF_FREEZER_SKIP is that {,un}lock_system_sleep() is callable from kthread context that has previously called set_freezable(). In preparation of merging the flags, have {,un}lock_system_slee() save and restore current->flags. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114648.725003428@infradead.org
* sched: Rename task_running() to task_on_cpu()Peter Zijlstra2022-09-076-14/+14
| | | | | | | | | | There is some ambiguity about task_running() in that it is unrelated to TASK_RUNNING but instead tests ->on_cpu. As such, rename the thing task_on_cpu(). Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Yxhkhn55uHZx+NGl@hirez.programming.kicks-ass.net
* sched/fair: Cleanup for SIS_PROPAbel Wu2022-09-071-6/+6
| | | | | | | | | | | | | | The sched-domain of this cpu is only used for some heuristics when SIS_PROP is enabled, and it should be irrelevant whether the local sd_llc is valid or not, since all we care about is target sd_llc if !SIS_PROP. Access the local domain only when there is a need. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220907112000.1854-6-wuyun.abel@bytedance.com
* sched/fair: Default to false in test_idle_cores()Abel Wu2022-09-071-8/+8
| | | | | | | | | | | | | | | | | | | | | | It's uncertain whether idle cores exist or not if shared sched- domains are not ready, so returning "no idle cores" usually makes sense. While __update_idle_core() is an exception, it checks status of this core and set hint to shared sched-domain if necessary. So the whole logic of this function depends on the existence of shared sched-domain, and can certainly bail out early if it is not available. It's somehow a little tricky, and as Josh suggested that it should be transient while the domain isn't ready. So remove the self-defined default value to make things more clearer. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-5-wuyun.abel@bytedance.com
* sched/fair: Remove useless check in select_idle_core()Abel Wu2022-09-071-3/+0
| | | | | | | | | | | | | The function select_idle_core() only gets called when has_idle_cores is true which can be possible only when sched_smt_present is enabled. This change also aligns select_idle_core() with select_idle_smt() in the way that the caller do the check if necessary. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-4-wuyun.abel@bytedance.com
* sched/fair: Avoid double search on same cpuAbel Wu2022-09-071-0/+2
| | | | | | | | | | | | The prev cpu is checked at the beginning of SIS, and it's unlikely to be idle before the second check in select_idle_smt(). So we'd better focus on its SMT siblings. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-3-wuyun.abel@bytedance.com
* sched/fair: Remove redundant check in select_idle_smt()Abel Wu2022-09-071-7/+4
| | | | | | | | | | | If two cpus share LLC cache, then the two cores they belong to are also in the same LLC domain. Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josh Don <joshdon@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/20220907112000.1854-2-wuyun.abel@bytedance.com
* sched/deadline: Move __dl_clear_params out of dl_bw lockShang XiaoJing2022-09-011-1/+1
| | | | | | | | | | As members in sched_dl_entity are independent with dl_bw, move __dl_clear_params out of dl_bw lock. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220827020911.30641-1-shangxiaojing@huawei.com
* sched/deadline: Add replenish_dl_new_period helperShang XiaoJing2022-09-011-10/+13
| | | | | | | | | | | | Wrap repeated code in helper function replenish_dl_new_period, which set the deadline and runtime of input dl_se based on pi_of(dl_se). Note that setup_new_dl_entity originally set the deadline and runtime base on dl_se, which should equals to pi_of(dl_se) for non-boosted task. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220826100037.12146-1-shangxiaojing@huawei.com
* sched/deadline: Add dl_task_is_earliest_deadline helperShang XiaoJing2022-09-011-12/+12
| | | | | | | | | | | Wrap repeated code in helper function dl_task_is_earliest_deadline, which return true if there is no deadline task on the rq at all, or task's deadline earlier than the whole rq. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Link: https://lore.kernel.org/r/20220826083453.698-1-shangxiaojing@huawei.com
* Merge branch 'sched/warnings' into sched/core, to pick up WARN_ON_ONCE() ↵Ingo Molnar2022-08-3012394-275534/+1262981
|\ | | | | | | | | | | | | | | conversion commit Merge in the BUG_ON() => WARN_ON_ONCE() conversion commit. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()Ingo Molnar2022-08-127-25/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no good reason to crash a user's system with a BUG_ON(), chances are high that they'll never even see the crash message on Xorg, and it won't make it into the syslog either. By using a WARN_ON_ONCE() we at least give the user a chance to report any bugs triggered here - instead of getting silent hangs. None of these WARN_ON_ONCE()s are supposed to trigger, ever - so we ignore cases where a NULL check is done via a BUG_ON() and we let a NULL pointer through after a WARN_ON_ONCE(). There's one exception: WARN_ON_ONCE() arguments with side-effects, such as locking - in this case we use the return value of the WARN_ON_ONCE(), such as in: - BUG_ON(!lock_task_sighand(p, &flags)); + if (WARN_ON_ONCE(!lock_task_sighand(p, &flags))) + return; Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/YvSsKcAXISmshtHo@gmail.com
| * x86: link vdso and boot with -z noexecstack --no-warn-rwx-segmentsNick Desaulniers2022-08-103-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Users of GNU ld (BFD) from binutils 2.39+ will observe multiple instances of a new warning when linking kernels in the form: ld: warning: arch/x86/boot/pmjump.o: missing .note.GNU-stack section implies executable stack ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker ld: warning: arch/x86/boot/compressed/vmlinux has a LOAD segment with RWX permissions Generally, we would like to avoid the stack being executable. Because there could be a need for the stack to be executable, assembler sources have to opt-in to this security feature via explicit creation of the .note.GNU-stack feature (which compilers create by default) or command line flag --noexecstack. Or we can simply tell the linker the production of such sections is irrelevant and to link the stack as --noexecstack. LLVM's LLD linker defaults to -z noexecstack, so this flag isn't strictly necessary when linking with LLD, only BFD, but it doesn't hurt to be explicit here for all linkers IMO. --no-warn-rwx-segments is currently BFD specific and only available in the current latest release, so it's wrapped in an ld-option check. While the kernel makes extensive usage of ELF sections, it doesn't use permissions from ELF segments. Link: https://lore.kernel.org/linux-block/3af4127a-f453-4cf7-f133-a181cce06f73@kernel.dk/ Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107 Link: https://github.com/llvm/llvm-project/issues/57009 Reported-and-tested-by: Jens Axboe <axboe@kernel.dk> Suggested-by: Fangrui Song <maskray@google.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * Makefile: link with -z noexecstack --no-warn-rwx-segmentsNick Desaulniers2022-08-101-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Users of GNU ld (BFD) from binutils 2.39+ will observe multiple instances of a new warning when linking kernels in the form: ld: warning: vmlinux: missing .note.GNU-stack section implies executable stack ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker ld: warning: vmlinux has a LOAD segment with RWX permissions Generally, we would like to avoid the stack being executable. Because there could be a need for the stack to be executable, assembler sources have to opt-in to this security feature via explicit creation of the .note.GNU-stack feature (which compilers create by default) or command line flag --noexecstack. Or we can simply tell the linker the production of such sections is irrelevant and to link the stack as --noexecstack. LLVM's LLD linker defaults to -z noexecstack, so this flag isn't strictly necessary when linking with LLD, only BFD, but it doesn't hurt to be explicit here for all linkers IMO. --no-warn-rwx-segments is currently BFD specific and only available in the current latest release, so it's wrapped in an ld-option check. While the kernel makes extensive usage of ELF sections, it doesn't use permissions from ELF segments. Link: https://lore.kernel.org/linux-block/3af4127a-f453-4cf7-f133-a181cce06f73@kernel.dk/ Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=ba951afb99912da01a6e8434126b8fac7aa75107 Link: https://github.com/llvm/llvm-project/issues/57009 Reported-and-tested-by: Jens Axboe <axboe@kernel.dk> Suggested-by: Fangrui Song <maskray@google.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * crypto: blake2b: effectively disable frame size warningLinus Torvalds2022-08-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It turns out that gcc-12.1 has some nasty problems with register allocation on a 32-bit x86 build for the 64-bit values used in the generic blake2b implementation, where the pattern of 64-bit rotates and xor operations ends up making gcc generate horrible code. As a result it ends up with a ridiculously large stack frame for all the spills it generates, resulting in the following build problem: crypto/blake2b_generic.c: In function ‘blake2b_compress_one_generic’: crypto/blake2b_generic.c:109:1: error: the frame size of 2640 bytes is larger than 2048 bytes [-Werror=frame-larger-than=] on the same test-case, clang ends up generating a stack frame that is just 296 bytes (and older gcc versions generate a slightly bigger one at 428 bytes - still nowhere near that almost 3kB monster stack frame of gcc-12.1). The issue is fixed both in mainline and the GCC 12 release branch [1], but current release compilers end up failing the i386 allmodconfig build due to this issue. Disable the warning for now by simply raising the frame size for this one file, just to keep this issue from having people turn off WERROR. Link: https://lore.kernel.org/all/CAHk-=wjxqgeG2op+=W9sqgsWqCYnavC+SRfVyopu9-31S6xw+Q@mail.gmail.com/ Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105930 [1] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * Merge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2022-08-1037-457/+1032
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - pNFS/flexfiles: Fix infinite looping when the RDMA connection errors out Bugfixes: - NFS: fix port value parsing - SUNRPC: Reinitialise the backchannel request buffers before reuse - SUNRPC: fix expiry of auth creds - NFSv4: Fix races in the legacy idmapper upcall - NFS: O_DIRECT fixes from Jeff Layton - NFSv4.1: Fix OP_SEQUENCE error handling - SUNRPC: Fix an RPC/RDMA performance regression - NFS: Fix case insensitive renames - NFSv4/pnfs: Fix a use-after-free bug in open - NFSv4.1: RECLAIM_COMPLETE must handle EACCES Features: - NFSv4.1: session trunking enhancements - NFSv4.2: READ_PLUS performance optimisations - NFS: relax the rules for rsize/wsize mount options - NFS: don't unhash dentry during unlink/rename - SUNRPC: Fail faster on bad verifier - NFS/SUNRPC: Various tracing improvements" * tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits) NFS: Improve readpage/writepage tracing NFS: Improve O_DIRECT tracing NFS: Improve write error tracing NFS: don't unhash dentry during unlink/rename NFSv4/pnfs: Fix a use-after-free bug in open NFS: nfs_async_write_reschedule_io must not recurse into the writeback code SUNRPC: Don't reuse bvec on retransmission of the request SUNRPC: Reinitialise the backchannel request buffers before reuse NFSv4.1: RECLAIM_COMPLETE must handle EACCES NFSv4.1 probe offline transports for trunking on session creation SUNRPC create a function that probes only offline transports SUNRPC export xprt_iter_rewind function SUNRPC restructure rpc_clnt_setup_test_and_add_xprt NFSv4.1 remove xprt from xprt_switch if session trunking test fails SUNRPC create an rpc function that allows xprt removal from rpc_clnt SUNRPC enable back offline transports in trunking discovery SUNRPC create an iterator to list only OFFLINE xprts NFSv4.1 offline trunkable transports on DESTROY_SESSION SUNRPC add function to offline remove trunkable transports SUNRPC expose functions for offline remote xprt functionality ...
| | * NFS: Improve readpage/writepage tracingTrond Myklebust2022-08-091-28/+26
| | | | | | | | | | | | | | | | | | Switch formatting to better match that used by other NFS tracepoints. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFS: Improve O_DIRECT tracingTrond Myklebust2022-08-091-12/+9
| | | | | | | | | | | | | | | | | | Switch the formatting to match the other NFS tracepoints. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFS: Improve write error tracingTrond Myklebust2022-08-093-20/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | Don't leak request pointers, but use the "device:inode" labelling that is used by all the other trace points. Furthermore, replace use of page indexes with an offset, again in order to align behaviour with other NFS trace points. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFS: don't unhash dentry during unlink/renameNeilBrown2022-08-082-18/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NFS unlink() (and rename over existing target) must determine if the file is open, and must perform a "silly rename" instead of an unlink (or before rename) if it is. Otherwise the client might hold a file open which has been removed on the server. Consequently if it determines that the file isn't open, it must block any subsequent opens until the unlink/rename has been completed on the server. This is currently achieved by unhashing the dentry. This forces any open attempt to the slow-path for lookup which will block on i_rwsem on the directory until the unlink/rename completes. A future patch will change the VFS to only get a shared lock on i_rwsem for unlink, so this will no longer work. Instead we introduce an explicit interlock. A special value is stored in dentry->d_fsdata while the unlink/rename is running and ->d_revalidate blocks while that value is present. When ->d_revalidate unblocks, the dentry will be invalid. This closes the race without requiring exclusion on i_rwsem. d_fsdata is already used in two different ways. 1/ an IS_ROOT directory dentry might have a "devname" stored in d_fsdata. Such a dentry doesn't have a name and so cannot be the target of unlink or rename. For safety we check if an old devname is still stored, and remove it if it is. 2/ a dentry with DCACHE_NFSFS_RENAMED set will have a 'struct nfs_unlinkdata' stored in d_fsdata. While this is set maydelete() will fail, so an unlink or rename will never proceed on such a dentry. Neither of these can be in effect when a dentry is the target of unlink or rename. So we can expect d_fsdata to be NULL, and store a special value ((void*)1) which is given the name NFS_FSDATA_BLOCKED to indicate that any lookup will be blocked. The d_count() is incremented under d_lock() when a lookup finds the dentry, so we check d_count() is low, and set NFS_FSDATA_BLOCKED under the same lock to avoid any races. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFSv4/pnfs: Fix a use-after-free bug in openTrond Myklebust2022-08-021-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | If someone cancels the open RPC call, then we must not try to free either the open slot or the layoutget operation arguments, since they are likely still in use by the hung RPC call. Fixes: 6949493884fe ("NFSv4: Don't hold the layoutget locks across multiple RPC calls") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFS: nfs_async_write_reschedule_io must not recurse into the writeback codeTrond Myklebust2022-08-021-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | It is not safe to call filemap_fdatawrite_range() from nfs_async_write_reschedule_io(), since we're often calling from a page reclaim context. Just let fsync() redrive the writeback for us. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC: Don't reuse bvec on retransmission of the requestTrond Myklebust2022-07-274-21/+22
| | | | | | | | | | | | | | | | | | | | | | | | If a request is re-encoded and then retransmitted, we need to make sure that we also re-encode the bvec, in case the page lists have changed. Fixes: ff053dbbaffe ("SUNRPC: Move the call to xprt_send_pagedata() out of xprt_sock_sendmsg()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC: Reinitialise the backchannel request buffers before reuseTrond Myklebust2022-07-271-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we're reusing the backchannel requests instead of freeing them, then we should reinitialise any values of the send/receive xdr_bufs so that they reflect the available space. Fixes: 0d2a970d0ae5 ("SUNRPC: Fix a backchannel race") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFSv4.1: RECLAIM_COMPLETE must handle EACCESZhang Xianwei2022-07-271-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A client should be able to handle getting an EACCES error while doing a mount operation to reclaim state due to NFS4CLNT_RECLAIM_REBOOT being set. If the server returns RPC_AUTH_BADCRED because authentication failed when we execute "exportfs -au", then RECLAIM_COMPLETE will go a wrong way. After mount succeeds, all OPEN call will fail due to an NFS4ERR_GRACE error being returned. This patch is to fix it by resending a RPC request. Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn> Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Fixes: aa5190d0ed7d ("NFSv4: Kill nfs4_async_handle_error() abuses by NFSv4.1") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFSv4.1 probe offline transports for trunking on session creationOlga Kornievskaia2022-07-251-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | Once the session is established call into the SUNRPC layer to check if any offlined trunking connections should be re-enabled. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC create a function that probes only offline transportsOlga Kornievskaia2022-07-252-0/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | For only offline transports, attempt to check connectivity via a NULL call and, if that succeeds, call a provided session trunking detection function. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC export xprt_iter_rewind functionOlga Kornievskaia2022-07-252-1/+2
| | | | | | | | | | | | | | | | | | | | | Make xprt_iter_rewind callable outside of xprtmultipath.c Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC restructure rpc_clnt_setup_test_and_add_xprtOlga Kornievskaia2022-07-251-21/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for code re-use, pull out the part of the rpc_clnt_setup_test_and_add_xprt() portion that sends a NULL rpc and then calls a session trunking function into a helper function. Re-organize the end of the function for code re-use. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFSv4.1 remove xprt from xprt_switch if session trunking test failsOlga Kornievskaia2022-07-251-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | If we are doing a session trunking test and it fails for the transport, then remove this transport from the xprt_switch group. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC create an rpc function that allows xprt removal from rpc_clntOlga Kornievskaia2022-07-255-8/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Expose a function that allows a removal of xprt from the rpc_clnt. When called from NFS that's running a trunked transport then don't decrement the active transport counter. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC enable back offline transports in trunking discoveryOlga Kornievskaia2022-07-252-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we are adding a transport to a xprt_switch that's already on the list but has been marked OFFLINE, then make the state ONLINE since it's been tested now. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC create an iterator to list only OFFLINE xprtsOlga Kornievskaia2022-07-253-12/+101
| | | | | | | | | | | | | | | | | | | | | | | | Create a new iterator helper that will go thru the all the transports in the switch and return transports that are marked OFFLINE. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFSv4.1 offline trunkable transports on DESTROY_SESSIONOlga Kornievskaia2022-07-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | When session is destroy, some of the transports might no longer be valid trunks for the new session. Offline existing transports. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC add function to offline remove trunkable transportsOlga Kornievskaia2022-07-252-0/+47
| | | | | | | | | | | | | | | | | | | | | | | | Iterate thru available transports in the xprt_switch for all trunkable transports offline and possibly remote them as well. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC expose functions for offline remote xprt functionalityOlga Kornievskaia2022-07-253-23/+40
| | | | | | | | | | | | | | | | | | | | | | | | Re-arrange the code that make offline transport and delete transport callable functions. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * SUNRPC: Remove xdr_align_data() and xdr_expand_hole()Anna Schumaker2022-07-232-68/+0
| | | | | | | | | | | | | | | | | | | | | | | | These functions are no longer needed now that the NFS client places data and hole segments directly. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| | * NFS: Replace the READ_PLUS decoding codeAnna Schumaker2022-07-231-82/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We now take a 2-step process that allows us to place data and hole segments directly at their final position in the xdr_stream without needing to do a bunch of redundant copies to expand holes. Due to the variable lengths of each segment, the xdr metadata might cross page boundaries which I account for by setting a small scratch buffer so xdr_inline_decode() won't fail. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>