summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* bcachefs: move: move_stats refactoringKent Overstreet2023-10-318-62/+82
| | | | | | | | | | | data_progress_list is gone - it was redundant with moving_context_list The upcoming rebalance rewrite is going to have it using two different move_stats objects with the same moving_context, depending on whether it's scanning or using the rebalance_work btree - this patch plumbs stats around a bit differently so that will work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: move: convert to bbposKent Overstreet2023-10-317-31/+38
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: moving_context now owns a btree_transKent Overstreet2023-10-314-88/+70
| | | | | | | btree_trans and moving_context are used together, and having the moving_context owns the transaction object reduces some plumbing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: move.c exports, refactoringKent Overstreet2023-10-313-59/+85
| | | | | | Prep work for the new rebalance code - we need a few helpers exported. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Guard against unknown compression optionsKent Overstreet2023-10-316-12/+45
| | | | | | | | | | Since compression options now include compression level, proper validation is a bit more involved. This adds bch2_compression_opt_valid(), and plumbs it around appropriately. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: trivial extents.c refactoringKent Overstreet2023-10-311-11/+11
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix bch2_prt_bitflags()Kent Overstreet2023-10-311-2/+2
| | | | | | This fixes an infinite loop when there's a set bit at position >= 32. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Check for too-large encoded extentsKent Overstreet2023-10-316-4/+64
| | | | | | We don't yet repair (split) them, just check. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Ensure we don't exceed encoded_extent_maxKent Overstreet2023-10-311-0/+1
| | | | | | | | | The write path may (rarely) see an encoded (checksummed) extent that exceeds encoded_extent_max - this can happen when we're moving an existing extent that was not checksummed, but was given a checksum by bch2_write_rechecksum(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_disk_path_to_text() no longer takes sb_lockKent Overstreet2023-10-315-14/+61
| | | | | | | | | | | | | We're going to be using bch2_target_to_text() -> bch2_disk_path_to_text() from bch2_bkey_ptrs_to_text() and bch2_bkey_ptrs_invalid(), which can be called in any context. This patch adds the actual label to bch_disk_group_cpu so that it can be used by bch2_disk_path_to_text, and splits out bch2_disk_path_to_text() into two variants - like the previous patch, one for when we have a running filesystem and another for when we only have a superblock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Split out disk_groups_types.hKent Overstreet2023-10-314-12/+20
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Split apart bch2_target_to_text(), bch2_target_to_text_sb()Kent Overstreet2023-10-312-37/+59
| | | | | | | | | | Previously we just had bch2_opt_target_to_text() which could be passed either a filesystem object or just a superblock - depending on if we have a running filesystem or not. Split these into two functions for clarity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: All triggers are BTREE_TRIGGER_WANTS_OLD_AND_NEWKent Overstreet2023-10-315-82/+105
| | | | | | | | Upcoming rebalance_work btree will require extent triggers to be BTREE_TRIGGER_WANTS_OLD_AND_NEW - so to reduce potential confusion, let's just make all triggers BTREE_TRIGGER_WANTS_OLD_AND_NEW. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve io option handling in data move pathKent Overstreet2023-10-312-50/+107
| | | | | | | The data move path now correctly picks IO options when inodes in different snapshots have different options applied. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Ensure devices are always correctly initializedKent Overstreet2023-10-317-36/+73
| | | | | | | | | | | We can't mark device superblocks or allocate journal on a device that isn't online. That means we may need to do this on every mount, because we may have formatted a new filesystem and then done the first mount (bch2_fs_initialize()) in degraded mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Delete duplicate time stats initializationKent Overstreet2023-10-311-6/+0
| | | | | | | This code duplicated initialization already done in bch2_fs_btree_iter_init(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill dead code extent_save()Kent Overstreet2023-10-311-18/+0
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix ca->oldest_gen allocationKent Overstreet2023-10-311-4/+2
| | | | | | | | | The ca->oldest_gen array needs to be the same size as the bucket_gens array; ca->mi.nbuckets is updated with only state_lock held, not gc_lock, so bch2_gc_gens() could race with device resize and allocate too small of an oldest_gens array. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix shrinker namesKent Overstreet2023-10-312-2/+2
| | | | | | | Shrinkers are now exported to debugfs, so the names can't have slashes in them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix btree_node_type enumKent Overstreet2023-10-314-29/+41
| | | | | | | | More forwards compatibility fixups: having BKEY_TYPE_btree at the end of the enum conflicts with unnkown btree IDs, this shifts BKEY_TYPE_btree to slot 0 and fixes things up accordingly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_id_str()Kent Overstreet2023-10-3120-68/+74
| | | | | | | Since we can run with unknown btree IDs, we can't directly index btree IDs into fixed size arrays. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Don't run bch2_delete_dead_snapshots() unnecessarilyKent Overstreet2023-10-317-54/+51
| | | | | | | | | | | | | | | | | | Be a bit more careful about when bch2_delete_dead_snapshots needs to run: it only needs to run synchronously if we're running fsck, and it only needs to run at all if we have snapshot nodes to delete or if fsck has noticed that it needs to run. Also: Rename BCH_FS_HAVE_DELETED_SNAPSHOTS -> BCH_FS_NEED_DELETE_DEAD_SNAPSHOTS Kill bch2_delete_dead_snapshots_hook(), move functionality to bch2_mark_snapshot() Factor out bch2_check_snapshot_needs_deletion(), to explicitly check if we need to be running snapshot deletion. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix lock ordering with snapshot_create_lockKent Overstreet2023-10-311-0/+1
| | | | | | | We must not hold btree locks while taking snapshot_create_lock - this fixes a lockdep splat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* six locks: Lock contended tracepointsKent Overstreet2023-10-301-2/+6
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* closures: Fix race in closure_sync()Kent Overstreet2023-10-303-1/+13
| | | | | | | | | | | | As pointed out by Linus, closure_sync() was racy; we could skip blocking immediately after a get() and a put(), but then that would skip any barrier corresponding to the other thread's put() barrier. To fix this, always do the full __closure_sync() sequence whenever any get() has happened and the closure might have been used by other threads. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* closures: Better memory barriersKent Overstreet2023-10-302-4/+4
| | | | | | | | | | | | | atomic_(dec|sub)_return_release() are a thing now - use them. Also, delete the useless barrier in set_closure_fn(): it's redundant with the memory barrier in closure_put(0. Since closure_put() would now otherwise just have a release barrier, we also need a new barrier when the ref hits 0 - smp_acquire__after_ctrl_dep(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* Merge tag 'objtool-core-2023-10-28' of ↵Linus Torvalds2023-10-308-27/+41
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Ingo Molnar: "Misc fixes and cleanups: - Fix potential MAX_NAME_LEN limit related build failures - Fix scripts/faddr2line symbol filtering bug - Fix scripts/faddr2line on LLVM=1 - Fix scripts/faddr2line to accept readelf output with mapping symbols - Minor cleanups" * tag 'objtool-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: scripts/faddr2line: Skip over mapping symbols in output from readelf scripts/faddr2line: Use LLVM addr2line and readelf if LLVM=1 scripts/faddr2line: Don't filter out non-function symbols from readelf objtool: Remove max symbol name length limitation objtool: Propagate early errors objtool: Use 'the fallthrough' pseudo-keyword x86/speculation, objtool: Use absolute relocations for annotations x86/unwind/orc: Remove redundant initialization of 'mid' pointer in __orc_find()
| * scripts/faddr2line: Skip over mapping symbols in output from readelfWill Deacon2023-10-231-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mapping symbols emitted in the readelf output can confuse the 'faddr2line' symbol size calculation, resulting in the erroneous rejection of valid offsets. This is especially prevalent when building an arm64 kernel with CONFIG_CFI_CLANG=y, where most functions are prefixed with a 32-bit data value in a '$d.n' section. For example: 447538: ffff800080014b80 548 FUNC GLOBAL DEFAULT 2 do_one_initcall 104: ffff800080014c74 0 NOTYPE LOCAL DEFAULT 2 $x.73 106: ffff800080014d30 0 NOTYPE LOCAL DEFAULT 2 $x.75 111: ffff800080014da4 0 NOTYPE LOCAL DEFAULT 2 $d.78 112: ffff800080014da8 0 NOTYPE LOCAL DEFAULT 2 $x.79 36: ffff800080014de0 200 FUNC LOCAL DEFAULT 2 run_init_process Adding a warning to do_one_initcall() results in: | WARNING: CPU: 0 PID: 1 at init/main.c:1236 do_one_initcall+0xf4/0x260 Which 'faddr2line' refuses to accept: $ ./scripts/faddr2line vmlinux do_one_initcall+0xf4/0x260 skipping do_one_initcall address at 0xffff800080014c74 due to size mismatch (0x260 != 0x224) no match for do_one_initcall+0xf4/0x260 Filter out these entries from readelf using a shell reimplementation of is_mapping_symbol(), so that the size of a symbol is calculated as a delta to the next symbol present in ksymtab. Suggested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20231002165750.1661-4-will@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
| * scripts/faddr2line: Use LLVM addr2line and readelf if LLVM=1Will Deacon2023-10-231-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GNU utilities cannot necessarily parse objects built by LLVM, which can result in confusing errors when using 'faddr2line': $ CROSS_COMPILE=aarch64-linux-gnu- ./scripts/faddr2line vmlinux do_one_initcall+0xf4/0x260 aarch64-linux-gnu-addr2line: vmlinux: unknown type [0x13] section `.relr.dyn' aarch64-linux-gnu-addr2line: DWARF error: invalid or unhandled FORM value: 0x25 do_one_initcall+0xf4/0x260: aarch64-linux-gnu-addr2line: vmlinux: unknown type [0x13] section `.relr.dyn' aarch64-linux-gnu-addr2line: DWARF error: invalid or unhandled FORM value: 0x25 $x.73 at main.c:? Although this can be worked around by setting CROSS_COMPILE to "llvm=-", it's cleaner to follow the same syntax as the top-level Makefile and accept LLVM= as an indication to use the llvm- tools, optionally specifying their location or specific version number. Suggested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20231002165750.1661-3-will@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
| * scripts/faddr2line: Don't filter out non-function symbols from readelfWill Deacon2023-10-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As Josh points out in 20230724234734.zy67gm674vl3p3wv@treble: > Problem is, I think the kernel's symbol printing code prints the > nearest kallsyms symbol, and there are some valid non-FUNC code > symbols. For example, syscall_return_via_sysret. so we shouldn't be considering only 'FUNC'-type symbols in the output from readelf. Drop the function symbol type filtering from the faddr2line outer loop. Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20230724234734.zy67gm674vl3p3wv@treble Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20231002165750.1661-2-will@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
| * objtool: Remove max symbol name length limitationAaron Plattner2023-10-051-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | If one of the symbols processed by read_symbols() happens to have a .cold variant with a name longer than objtool's MAX_NAME_LEN limit, the build fails. Avoid this problem by just using strndup() to copy the parent function's name, rather than strncpy()ing it onto the stack. Signed-off-by: Aaron Plattner <aplattner@nvidia.com> Link: https://lore.kernel.org/r/41e94cfea1d9131b758dd637fecdeacd459d4584.1696355111.git.aplattner@nvidia.com Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
| * objtool: Propagate early errorsAaron Plattner2023-10-051-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If objtool runs into a problem that causes it to exit early, the overall tool still returns a status code of 0, which causes the build to continue as if nothing went wrong. Note this only affects early errors, as later errors are still ignored by check(). Fixes: b51277eb9775 ("objtool: Ditch subcommands") Signed-off-by: Aaron Plattner <aplattner@nvidia.com> Link: https://lore.kernel.org/r/cb6a28832d24b2ebfafd26da9abb95f874c83045.1696355111.git.aplattner@nvidia.com Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
| * objtool: Use 'the fallthrough' pseudo-keywordRuan Jinjie2023-10-031-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Replace the existing /* fallthrough */ comments with the new 'fallthrough' pseudo-keyword macro: https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
| * x86/speculation, objtool: Use absolute relocations for annotationsFangrui Song2023-09-223-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | .discard.retpoline_safe sections do not have the SHF_ALLOC flag. These sections referencing text sections' STT_SECTION symbols with PC-relative relocations like R_386_PC32 [0] is conceptually not suitable. Newer LLD will report warnings for REL relocations even for relocatable links [1]: ld.lld: warning: vmlinux.a(drivers/i2c/busses/i2c-i801.o):(.discard.retpoline_safe+0x120): has non-ABS relocation R_386_PC32 against symbol '' Switch to absolute relocations instead, which indicate link-time addresses. In a relocatable link, these addresses are also output section offsets, used by checks in tools/objtool/check.c. When linking vmlinux, these .discard.* sections will be discarded, therefore it is not a problem that R_X86_64_32 cannot represent a kernel address. Alternatively, we could set the SHF_ALLOC flag for .discard.* sections, but I think non-SHF_ALLOC for sections to be discarded makes more sense. Note: if we decide to never support REL architectures (e.g. arm, i386), we can utilize R_*_NONE relocations (.reloc ., BFD_RELOC_NONE, sym), making .discard.* sections zero-sized. That said, the section content waste is 4 bytes per entry, much smaller than sizeof(Elf{32,64}_Rel). [0] commit 1c0c1faf5692 ("objtool: Use relative pointers for annotations") [1] https://github.com/ClangBuiltLinux/linux/issues/1937 Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20230920001728.1439947-1-maskray@google.com
| * x86/unwind/orc: Remove redundant initialization of 'mid' pointer in __orc_find()Colin Ian King2023-09-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'mid' pointer is being initialized with a value that is never read, it is being re-assigned and used inside a for-loop. Remove the redundant initialization. Cleans up clang scan build warning: arch/x86/kernel/unwind_orc.c:88:7: warning: Value stored to 'mid' during its initialization is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20230920114141.118919-1-colin.i.king@gmail.com
* | Merge tag 'sched-core-2023-10-28' of ↵Linus Torvalds2023-10-3045-965/+1013
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Fair scheduler (SCHED_OTHER) improvements: - Remove the old and now unused SIS_PROP code & option - Scan cluster before LLC in the wake-up path - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup NUMA scheduling improvements: - Improve the VMA access-PID code to better skip/scan VMAs - Extend tracing to cover VMA-skipping decisions - Improve/fix the recently introduced sched_numa_find_nth_cpu() code - Generalize numa_map_to_online_node() Energy scheduling improvements: - Remove the EM_MAX_COMPLEXITY limit - Add tracepoints to track energy computation - Make the behavior of the 'sched_energy_aware' sysctl more consistent - Consolidate and clean up access to a CPU's max compute capacity - Fix uclamp code corner cases RT scheduling improvements: - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates - Drive the ->rto_mask with rt_rq->pushable_tasks updates Scheduler scalability improvements: - Rate-limit updates to tg->load_avg - On x86 disable IBRS when CPU is offline to improve single-threaded performance - Micro-optimize in_task() and in_interrupt() - Micro-optimize the PSI code - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes Core scheduler infrastructure improvements: - Use saved_state to reduce some spurious freezer wakeups - Bring in a handful of fast-headers improvements to scheduler headers - Make the scheduler UAPI headers more widely usable by user-space - Simplify the control flow of scheduler syscalls by using lock guards - Fix sched_setaffinity() vs. CPU hotplug race Scheduler debuggability improvements: - Disallow writing invalid values to sched_rt_period_us - Fix a race in the rq-clock debugging code triggering warnings - Fix a warning in the bandwidth distribution code - Micro-optimize in_atomic_preempt_off() checks - Enforce that the tasklist_lock is held in for_each_thread() - Print the TGID in sched_show_task() - Remove the /proc/sys/kernel/sched_child_runs_first sysctl ... and misc cleanups & fixes" * tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits) sched/fair: Remove SIS_PROP sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup sched/fair: Scan cluster before scanning LLC in wake-up path sched: Add cpus_share_resources API sched/core: Fix RQCF_ACT_SKIP leak sched/fair: Remove unused 'curr' argument from pick_next_entity() sched/nohz: Update comments about NEWILB_KICK sched/fair: Remove duplicate #include sched/psi: Update poll => rtpoll in relevant comments sched: Make PELT acronym definition searchable sched: Fix stop_one_cpu_nowait() vs hotplug sched/psi: Bail out early from irq time accounting sched/topology: Rename 'DIE' domain to 'PKG' sched/psi: Delete the 'update_total' function parameter from update_triggers() sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes sched/headers: Remove comment referring to rq::cpu_load, since this has been removed sched/numa: Complete scanning of inactive VMAs when there is no alternative sched/numa: Complete scanning of partial VMAs regardless of PID activity sched/numa: Move up the access pid reset logic sched/numa: Trace decisions related to skipping VMAs ...
| * | sched/fair: Remove SIS_PROPPeter Zijlstra2023-10-245-59/+0
| | | | | | | | | | | | | | | | | | | | | | | | SIS_UTIL seems to work well, lets remove the old thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net
| * | sched/fair: Use candidate prev/recent_used CPU if scanning failed for ↵Yicong Yang2023-10-241-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cluster wakeup Chen Yu reports a hackbench regression of cluster wakeup when hackbench threads equal to the CPU number [1]. Analysis shows it's because we wake up more on the target CPU even if the prev_cpu is a good wakeup candidate and leads to the decrease of the CPU utilization. Generally if the task's prev_cpu is idle we'll wake up the task on it without scanning. On cluster machines we'll try to wake up the task in the same cluster of the target for better cache affinity, so if the prev_cpu is idle but not sharing the same cluster with the target we'll still try to find an idle CPU within the cluster. This will improve the performance at low loads on cluster machines. But in the issue above, if the prev_cpu is idle but not in the cluster with the target CPU, we'll try to scan an idle one in the cluster. But since the system is busy, we're likely to fail the scanning and use target instead, even if the prev_cpu is idle. Then leads to the regression. This patch solves this in 2 steps: o record the prev_cpu/recent_used_cpu if they're good wakeup candidates but not sharing the cluster with the target. o on scanning failure use the prev_cpu/recent_used_cpu if they're recorded as idle [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/ Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/ Reported-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20231019033323.54147-4-yangyicong@huawei.com
| * | sched/fair: Scan cluster before scanning LLC in wake-up pathBarry Song2023-10-243-4/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For platforms having clusters like Kunpeng920, CPUs within the same cluster have lower latency when synchronizing and accessing shared resources like cache. Thus, this patch tries to find an idle cpu within the cluster of the target CPU before scanning the whole LLC to gain lower latency. This will be implemented in 2 steps in select_idle_sibling(): 1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them if they're sharing cluster with the target CPU. Otherwise trying to scan for an idle CPU in the target's cluster. 2. Scanning the cluster prior to the LLC of the target CPU for an idle CPU to wakeup. Testing has been done on Kunpeng920 by pinning tasks to one numa and two numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. With this patch, We noticed enhancement on tbench and netperf within one numa or cross two numa on top of tip-sched-core commit 9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()") tbench results (node 0): baseline patched 1: 327.2833 372.4623 ( 13.80%) 4: 1320.5933 1479.8833 ( 12.06%) 8: 2638.4867 2921.5267 ( 10.73%) 16: 5282.7133 5891.5633 ( 11.53%) 32: 9810.6733 9877.3400 ( 0.68%) 64: 7408.9367 7447.9900 ( 0.53%) 128: 6203.2600 6191.6500 ( -0.19%) tbench results (node 0-1): baseline patched 1: 332.0433 372.7223 ( 12.25%) 4: 1325.4667 1477.6733 ( 11.48%) 8: 2622.9433 2897.9967 ( 10.49%) 16: 5218.6100 5878.2967 ( 12.64%) 32: 10211.7000 11494.4000 ( 12.56%) 64: 13313.7333 16740.0333 ( 25.74%) 128: 13959.1000 14533.9000 ( 4.12%) netperf results TCP_RR (node 0): baseline patched 1: 76546.5033 90649.9867 ( 18.42%) 4: 77292.4450 90932.7175 ( 17.65%) 8: 77367.7254 90882.3467 ( 17.47%) 16: 78519.9048 90938.8344 ( 15.82%) 32: 72169.5035 72851.6730 ( 0.95%) 64: 25911.2457 25882.2315 ( -0.11%) 128: 10752.6572 10768.6038 ( 0.15%) netperf results TCP_RR (node 0-1): baseline patched 1: 76857.6667 90892.2767 ( 18.26%) 4: 78236.6475 90767.3017 ( 16.02%) 8: 77929.6096 90684.1633 ( 16.37%) 16: 77438.5873 90502.5787 ( 16.87%) 32: 74205.6635 88301.5612 ( 19.00%) 64: 69827.8535 71787.6706 ( 2.81%) 128: 25281.4366 25771.3023 ( 1.94%) netperf results UDP_RR (node 0): baseline patched 1: 96869.8400 110800.8467 ( 14.38%) 4: 97744.9750 109680.5425 ( 12.21%) 8: 98783.9863 110409.9637 ( 11.77%) 16: 99575.0235 110636.2435 ( 11.11%) 32: 95044.7250 97622.8887 ( 2.71%) 64: 32925.2146 32644.4991 ( -0.85%) 128: 12859.2343 12824.0051 ( -0.27%) netperf results UDP_RR (node 0-1): baseline patched 1: 97202.4733 110190.1200 ( 13.36%) 4: 95954.0558 106245.7258 ( 10.73%) 8: 96277.1958 105206.5304 ( 9.27%) 16: 97692.7810 107927.2125 ( 10.48%) 32: 79999.6702 103550.2999 ( 29.44%) 64: 80592.7413 87284.0856 ( 8.30%) 128: 27701.5770 29914.5820 ( 7.99%) Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch in the code has not been tested but it supposed to work. Chen Yu also noticed this will improve the performance of tbench and netperf on a 24 CPUs Jacobsville machine, there are 4 CPUs in one cluster sharing L2 Cache. [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com
| * | sched: Add cpus_share_resources APIBarry Song2023-10-245-1/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add cpus_share_resources() API. This is the preparation for the optimization of select_idle_cpu() on platforms with cluster scheduler level. On a machine with clusters cpus_share_resources() will test whether two cpus are within the same cluster. On a non-cluster machine it will behaves the same as cpus_share_cache(). So we use "resources" here for cache resources. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
| * | sched/core: Fix RQCF_ACT_SKIP leakHao Jia2023-10-241-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Igor Raits and Bagas Sanjaya report a RQCF_ACT_SKIP leak warning. This warning may be triggered in the following situations: CPU0 CPU1 __schedule() *rq->clock_update_flags <<= 1;* unregister_fair_sched_group() pick_next_task_fair+0x4a/0x410 destroy_cfs_bandwidth() newidle_balance+0x115/0x3e0 for_each_possible_cpu(i) *i=0* rq_unpin_lock(this_rq, rf) __cfsb_csd_unthrottle() raw_spin_rq_unlock(this_rq) rq_lock(*CPU0_rq*, &rf) rq_clock_start_loop_update() rq->clock_update_flags & RQCF_ACT_SKIP <-- raw_spin_rq_lock(this_rq) The purpose of RQCF_ACT_SKIP is to skip the update rq clock, but the update is very early in __schedule(), but we clear RQCF_*_SKIP very late, causing it to span that gap above and triggering this warning. In __schedule() we can clear the RQCF_*_SKIP flag immediately after update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning. And set rq->clock_update_flags to RQCF_UPDATED to avoid rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered later. Fixes: ebb83d84e49b ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com Reported-by: Igor Raits <igor.raits@gmail.com> Reported-by: Bagas Sanjaya <bagasdotme@gmail.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com
| * | Merge tag 'v6.6-rc7' into sched/core, to pick up fixesIngo Molnar2023-10-23831-3953/+8728
| |\ \ | | | | | | | | | | | | | | | | | | | | Pick up recent sched/urgent fixes merged upstream. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/fair: Remove unused 'curr' argument from pick_next_entity()Yiwei Lin2023-10-201-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'curr' argument of pick_next_entity() has become unused after the EEVDF changes. [ mingo: Updated the changelog. ] Signed-off-by: Yiwei Lin <s921975628@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231020055617.42064-1-s921975628@gmail.com
| * | | sched/nohz: Update comments about NEWILB_KICKJoel Fernandes (Google)2023-10-201-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | How ILB is triggered without IPIs is cryptic. Out of mercy for future code readers, document it in code comments. The comments are derived from a discussion with Vincent in a past review. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231020014031.919742-2-joel@joelfernandes.org
| * | | sched/fair: Remove duplicate #includeJiapeng Chong2023-10-181-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ./kernel/sched/fair.c: linux/sched/cond_resched.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231018062759.44375-1-jiapeng.chong@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6907
| * | | sched/psi: Update poll => rtpoll in relevant commentsFan Yu2023-10-161-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PSI trigger code is now making a distinction between privileged and unprivileged triggers, after the following commit: 65457b74aa94 ("sched/psi: Rename existing poll members in preparation") But some comments have not been modified along with the code, so they need to be updated. This will help readers better understand the code. Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Ziljstra <peterz@infradead.org> Link: https://lore.kernel.org/r/202310161920399921184@zte.com.cn
| * | | sched: Make PELT acronym definition searchableMathieu Desnoyers2023-10-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PELT acronym definition can be found right at the top of kernel/sched/pelt.c (of course), but it cannot be found through use of grep -r PELT kernel/sched/ Add the acronym "(PELT)" after "Per Entity Load Tracking" at the top of the source file. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231012125824.1260774-1-mathieu.desnoyers@efficios.com
| * | | sched: Fix stop_one_cpu_nowait() vs hotplugPeter Zijlstra2023-10-134-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning. Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete(). The race scenario is: CPU0 CPU1 // doing _cpu_down() __set_cpus_allowed_ptr() task_rq_lock(); takedown_cpu() stop_machine_cpuslocked(take_cpu_down..) <PREEMPT: cpu_stopper_thread() MULTI_STOP_PREPARE ... __set_cpus_allowed_ptr_locked() affine_move_task() task_rq_unlock(); <PREEMPT: cpu_stopper_thread()\> ack_state() MULTI_STOP_RUN take_cpu_down() __cpu_disable(); stop_machine_park(); stopper->enabled = false; /> /> stop_one_cpu_nowait(.fn = migration_cpu_stop); if (stopper->enabled) // false!!! That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete. OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be ran under the scheduler locks. Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination: preempt_disable(); task_rq_unlock(...); if (!stop_pending) stop_one_cpu_nowait(...) preempt_enable(); This respects the lock ordering contraints while still avoiding the above race. That is, if we find the CPU is online under rq-lock, the targeted stop_one_cpu_nowait() must succeed. Apply this pattern to all similar stop_one_cpu_nowait() invocations. Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
| * | | sched/psi: Bail out early from irq time accountingHaifeng Xu2023-10-131-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We could bail out early when psi was disabled. Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20230926115722.467833-1-haifeng.xu@shopee.com
| * | | sched/topology: Rename 'DIE' domain to 'PKG'Peter Zijlstra2023-10-125-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain for the package mask. :-) Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the name less ambiguous. [ Shrikanth Hegde: rename on s390 as well. ] [ Valentin Schneider: also rename it in the comments. ] [ mingo: port to recent kernels & find all remaining occurances. ] Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net