summaryrefslogtreecommitdiffstats
path: root/kernel/sched
Commit message (Collapse)AuthorAgeFilesLines
* sched/fair: Ensure that the CFS parent is added after unthrottlingRik van Riel2021-06-221-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that a CFS parent will be in the list whenever one of its children is also in the list. A warning on rq->tmp_alone_branch != &rq->leaf_cfs_rq_list has been reported while running LTP test cfs_bandwidth01. Odin Ugedal found the root cause: $ tree /sys/fs/cgroup/ltp/ -d --charset=ascii /sys/fs/cgroup/ltp/ |-- drain `-- test-6851 `-- level2 |-- level3a | |-- worker1 | `-- worker2 `-- level3b `-- worker3 Timeline (ish): - worker3 gets throttled - level3b is decayed, since it has no more load - level2 get throttled - worker3 get unthrottled - level2 get unthrottled - worker3 is added to list - level3b is not added to list, since nr_running==0 and is decayed [ Vincent Guittot: Rebased and updated to fix for the reported warning. ] Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Acked-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210621174330.11258-1-vincent.guittot@linaro.org
* Merge tag 'sched_urgent_for_v5.13_rc6' of ↵Linus Torvalds2021-06-201-19/+25
|\ | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "A single fix to restore fairness between control groups with equal priority" * tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Correctly insert cfs_rq's to list on unthrottle
| * sched/fair: Correctly insert cfs_rq's to list on unthrottleOdin Ugedal2021-06-141-19/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix an issue where fairness is decreased since cfs_rq's can end up not being decayed properly. For two sibling control groups with the same priority, this can often lead to a load ratio of 99/1 (!!). This happens because when a cfs_rq is throttled, all the descendant cfs_rq's will be removed from the leaf list. When they initial cfs_rq is unthrottled, it will currently only re add descendant cfs_rq's if they have one or more entities enqueued. This is not a perfect heuristic. Instead, we insert all cfs_rq's that contain one or more enqueued entities, or it its load is not completely decayed. Can often lead to situations like this for equally weighted control groups: $ ps u -C stress USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 10009 88.8 0.0 3676 100 pts/1 R+ 11:04 0:13 stress --cpu 1 root 10023 3.0 0.0 3676 104 pts/1 R+ 11:04 0:00 stress --cpu 1 Fixes: 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") [vingo: !SMP build fix] Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210612112815.61678-1-odin@uged.al
* | Revert "cpufreq: CPPC: Add support for frequency invariance"Viresh Kumar2021-06-141-1/+0
|/ | | | | | | | | | | | | | | | This reverts commit 4c38f2df71c8e33c0b64865992d693f5022eeaad. There are few races in the frequency invariance support for CPPC driver, namely the driver doesn't stop the kthread_work and irq_work on policy exit during suspend/resume or CPU hotplug. A proper fix won't be possible for the 5.13-rc, as it requires a lot of changes. Lets revert the patch instead for now. Fixes: 4c38f2df71c8 ("cpufreq: CPPC: Add support for frequency invariance") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* sched/fair: Fix util_est UTIL_AVG_UNCHANGED handlingDietmar Eggemann2021-06-033-13/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent unnecessary util_est updates uses the LSB of util_est.enqueued. It is exposed via _task_util_est() (and task_util_est()). Commit 92a801e5d5b7 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages") mentions that the LSB is lost for util_est resolution but find_energy_efficient_cpu() checks if task_util_est() returns 0 to return prev_cpu early. _task_util_est() returns the max value of util_est.ewma and util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED. So task_util_est() returning the max of task_util() and _task_util_est() will never return 0 under the default SCHED_FEAT(UTIL_EST, true). To fix this use the MSB of util_est.enqueued instead and keep the flag util_est internal, i.e. don't export it via _task_util_est(). The maximal possible util_avg value for a task is 1024 so the MSB of 'unsigned int util_est.enqueued' isn't used to store a util value. As a caveat the code behind the util_est_se trace point has to filter UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should be easy to do. This also fixes an issue report by Xuewen Yan that util_est_update() only used UTIL_AVG_UNCHANGED for the subtrahend of the equation: last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED) Fixes: b89997aa88f0b sched/pelt: Fix task util_est update filtering Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
* sched/pelt: Ensure that *_sum is always synced with *_avgVincent Guittot2021-06-031-2/+8
| | | | | | | | | | | | | | Rounding in PELT calculation happening when entities are attached/detached of a cfs_rq can result into situations where util/runnable_avg is not null but util/runnable_sum is. This is normally not possible so we need to ensure that util/runnable_sum stays synced with util/runnable_avg. detach_entity_load_avg() is the last place where we don't sync util/runnable_sum with util/runnbale_avg when moving some sched_entities Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210601085832.12626-1-vincent.guittot@linaro.org
* sched/fair: Make sure to update tg contrib for blocked loadVincent Guittot2021-05-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During the update of fair blocked load (__update_blocked_fair()), we update the contribution of the cfs in tg->load_avg if cfs_rq's pelt has decayed. Nevertheless, the pelt values of a cfs_rq could have been recently updated while propagating the change of a child. In this case, cfs_rq's pelt will not decayed because it has already been updated and we don't update tg->load_avg. __update_blocked_fair ... for_each_leaf_cfs_rq_safe: child cfs_rq update cfs_rq_load_avg() for child cfs_rq ... update_load_avg(cfs_rq_of(se), se, 0) ... update cfs_rq_load_avg() for parent cfs_rq -propagation of child's load makes parent cfs_rq->load_sum becoming null -UPDATE_TG is not set so it doesn't update parent cfs_rq->tg_load_avg_contrib .. for_each_leaf_cfs_rq_safe: parent cfs_rq update cfs_rq_load_avg() for parent cfs_rq - nothing to do because parent cfs_rq has already been updated recently so cfs_rq->tg_load_avg_contrib is not updated ... parent cfs_rq is decayed list_del_leaf_cfs_rq parent cfs_rq - but it still contibutes to tg->load_avg we must set UPDATE_TG flags when propagting pending load to the parent Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Reported-by: Odin Ugedal <odin@uged.al> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lkml.kernel.org/r/20210527122916.27683-3-vincent.guittot@linaro.org
* sched/fair: Keep load_avg and load_sum syncedVincent Guittot2021-05-311-6/+5
| | | | | | | | | | | | | | | | when removing a cfs_rq from the list we only check _sum value so we must ensure that _avg and _sum stay synced so load_sum can't be null whereas load_avg is not after propagating load in the cgroup hierarchy. Use load_avg to compute load_sum similarly to what is done for util_sum and runnable_sum. Fixes: 0e2d2aaaae52 ("sched/fair: Rewrite PELT migration propagation") Reported-by: Odin Ugedal <odin@uged.al> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lkml.kernel.org/r/20210527122916.27683-2-vincent.guittot@linaro.org
* sched/fair: Fix clearing of has_idle_cores flag in select_idle_cpu()Gautham R. Shenoy2021-05-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | In commit: 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()") in select_idle_cpu(), we check if an idle core is present in the LLC of the target CPU via the flag "has_idle_cores". We look for the idle core in select_idle_cores(). If select_idle_cores() isn't able to find an idle core/CPU, we need to unset the has_idle_cores flag in the LLC of the target to prevent other CPUs from going down this route. However, the current code is unsetting it in the LLC of the current CPU instead of the target CPU. This patch fixes this issue. Fixes: 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()") Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lore.kernel.org/r/1620746169-13996-1-git-send-email-ego@linux.vnet.ibm.com
* sched/fair: Fix unfairness caused by missing load decayOdin Ugedal2021-05-061-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes an issue where old load on a cfs_rq is not properly decayed, resulting in strange behavior where fairness can decrease drastically. Real workloads with equally weighted control groups have ended up getting a respective 99% and 1%(!!) of cpu time. When an idle task is attached to a cfs_rq by attaching a pid to a cgroup, the old load of the task is attached to the new cfs_rq and sched_entity by attach_entity_cfs_rq. If the task is then moved to another cpu (and therefore cfs_rq) before being enqueued/woken up, the load will be moved to cfs_rq->removed from the sched_entity. Such a move will happen when enforcing a cpuset on the task (eg. via a cgroup) that force it to move. The load will however not be removed from the task_group itself, making it look like there is a constant load on that cfs_rq. This causes the vruntime of tasks on other sibling cfs_rq's to increase faster than they are supposed to; causing severe fairness issues. If no other task is started on the given cfs_rq, and due to the cpuset it would not happen, this load would never be properly unloaded. With this patch the load will be properly removed inside update_blocked_averages. This also applies to tasks moved to the fair scheduling class and moved to another cpu, and this path will also fix that. For fork, the entity is queued right away, so this problem does not affect that. This applies to cases where the new process is the first in the cfs_rq, issue introduced 3d30544f0212 ("sched/fair: Apply more PELT fixes"), and when there has previously been load on the cgroup but the cgroup was removed from the leaflist due to having null PELT load, indroduced in 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path"). For a simple cgroup hierarchy (as seen below) with two equally weighted groups, that in theory should get 50/50 of cpu time each, it often leads to a load of 60/40 or 70/30. parent/ cg-1/ cpu.weight: 100 cpuset.cpus: 1 cg-2/ cpu.weight: 100 cpuset.cpus: 1 If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2 equally weighted, they should still get a 50/50 balance of cpu time. This however sometimes results in a balance of 10/90 or 1/99(!!) between the task groups. $ ps u -C stress USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 18568 1.1 0.0 3684 100 pts/12 R+ 13:36 0:00 stress --cpu 1 root 18580 99.3 0.0 3684 100 pts/12 R+ 13:36 0:09 stress --cpu 1 parent/ cg-1/ cpu.weight: 100 sub-group/ cpu.weight: 1 cpuset.cpus: 1 cg-2/ cpu.weight: 100 sub-group/ cpu.weight: 10000 cpuset.cpus: 1 This can be reproduced by attaching an idle process to a cgroup and moving it to a given cpuset before it wakes up. The issue is evident in many (if not most) container runtimes, and has been reproduced with both crun and runc (and therefore docker and all its "derivatives"), and with both cgroup v1 and v2. Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes") Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210501141950.23622-2-odin@uged.al
* sched: Fix out-of-bound access in uclampQuentin Perret2021-05-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | Util-clamp places tasks in different buckets based on their clamp values for performance reasons. However, the size of buckets is currently computed using a rounding division, which can lead to an off-by-one error in some configurations. For instance, with 20 buckets, the bucket size will be 1024/20=51. A task with a clamp of 1024 will be mapped to bucket id 1024/51=20. Sadly, correct indexes are in range [0,19], hence leading to an out of bound memory access. Clamp the bucket id to fix the issue. Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting") Suggested-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20210430151412.160913-1-qperret@google.com
* psi: Fix psi state corruption when schedule() races with cgroup moveJohannes Weiner2021-05-061-10/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") introduced a race condition that corrupts internal psi state. This manifests as kernel warnings, sometimes followed by bogusly high IO pressure: psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0 (schedule() decreasing RUNNING and ONCPU, both of which are 0) psi: incosistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0 (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU) What the offending commit does is batch the two psi callbacks in schedule() to reduce the number of cgroup tree updates. When prev is deactivated and removed from the runqueue, nothing is done in psi at first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are updated along with TSK_ONCPU. However, the deactivation and the task switch inside schedule() aren't atomic: pick_next_task() may drop the rq lock for load balancing. When this happens, cgroup_move_task() can run after the task has been physically dequeued, but the psi updates are still pending. Since it looks at the task's scheduler state, it doesn't move everything to the new cgroup that the task switch that follows is about to clear from it. cgroup_move_task() will leak the TSK_RUNNING count in the old cgroup, and psi_sched_switch() will underflow it in the new cgroup. A similar thing can happen for iowait. TSK_IOWAIT is usually set when a p->in_iowait task is dequeued, but again this update is deferred to the switch. cgroup_move_task() can see an unqueued p->in_iowait task and move a non-existent TSK_IOWAIT. This results in the inconsistent task state warning, as well as a counter underflow that will result in permanent IO ghost pressure being reported. Fix this bug by making cgroup_move_task() use task->psi_flags instead of looking at the potentially mismatching scheduler state. [ We used the scheduler state historically in order to not rely on task->psi_flags for anything but debugging. But that ship has sailed anyway, and this is simpler and more robust. We previously already batched TSK_ONCPU clearing with the TSK_RUNNING update inside the deactivation call from schedule(). But that ordering was safe and didn't result in TSK_ONCPU corruption: unlike most places in the scheduler, cgroup_move_task() only checked task_current() and handled TSK_ONCPU if the task was still queued. ] Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210503174917.38579-1-hannes@cmpxchg.org
* Merge tag 'sched-core-2021-04-28' of ↵Linus Torvalds2021-04-2820-662/+798
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Clean up SCHED_DEBUG: move the decades old mess of sysctl, procfs and debugfs interfaces to a unified debugfs interface. - Signals: Allow caching one sigqueue object per task, to improve performance & latencies. - Improve newidle_balance() irq-off latencies on systems with a large number of CPU cgroups. - Improve energy-aware scheduling - Improve the PELT metrics for certain workloads - Reintroduce select_idle_smt() to improve load-balancing locality - but without the previous regressions - Add 'scheduler latency debugging': warn after long periods of pending need_resched. This is an opt-in feature that requires the enabling of the LATENCY_WARN scheduler feature, or the use of the resched_latency_warn_ms=xx boot parameter. - CPU hotplug fixes for HP-rollback, and for the 'fail' interface. Fix remaining balance_push() vs. hotplug holes/races - PSI fixes, plus allow /proc/pressure/ files to be written by CAP_SYS_RESOURCE tasks as well - Fix/improve various load-balancing corner cases vs. capacity margins - Fix sched topology on systems with NUMA diameter of 3 or above - Fix PF_KTHREAD vs to_kthread() race - Minor rseq optimizations - Misc cleanups, optimizations, fixes and smaller updates * tag 'sched-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits) cpumask/hotplug: Fix cpu_dying() state tracking kthread: Fix PF_KTHREAD vs to_kthread() race sched/debug: Fix cgroup_path[] serialization sched,psi: Handle potential task count underflow bugs more gracefully sched: Warn on long periods of pending need_resched sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to simplify the code & fix an unused function warning sched/debug: Rename the sched_debug parameter to sched_verbose sched,fair: Alternative sched_slice() sched: Move /proc/sched_debug to debugfs sched,debug: Convert sysctl sched_domains to debugfs debugfs: Implement debugfs_create_str() sched,preempt: Move preempt_dynamic to debug.c sched: Move SCHED_DEBUG sysctl to debugfs sched: Don't make LATENCYTOP select SCHED_DEBUG sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG sched: Use cpu_dying() to fix balance_push vs hotplug-rollback cpumask: Introduce DYING mask cpumask: Make cpu_{online,possible,present,active}() inline rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs() ...
| * kthread: Fix PF_KTHREAD vs to_kthread() racePeter Zijlstra2021-04-212-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kthread_is_per_cpu() construct relies on only being called on PF_KTHREAD tasks (per the WARN in to_kthread). This gives rise to the following usage pattern: if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p)) However, as reported by syzcaller, this is broken. The scenario is: CPU0 CPU1 (running p) (p->flags & PF_KTHREAD) // true begin_new_exec() me->flags &= ~(PF_KTHREAD|...); kthread_is_per_cpu(p) to_kthread(p) WARN(!(p->flags & PF_KTHREAD) <-- *SPLAT* Introduce __to_kthread() that omits the WARN and is sure to check both values. Use this to remove the problematic pattern for kthread_is_per_cpu() and fix a number of other kthread_*() functions that have similar issues but are currently not used in ways that would expose the problem. Notably kthread_func() is only ever called on 'current', while kthread_probe_data() is only used for PF_WQ_WORKER, which implies the task is from kthread_create*(). Fixes: ac687e6e8c26 ("kthread: Extract KTHREAD_IS_PER_CPU") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Link: https://lkml.kernel.org/r/YH6WJc825C4P0FCK@hirez.programming.kicks-ass.net
| * sched/debug: Fix cgroup_path[] serializationWaiman Long2021-04-211-13/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The handling of sysrq key can be activated by echoing the key to /proc/sysrq-trigger or via the magic key sequence typed into a terminal that is connected to the system in some way (serial, USB or other mean). In the former case, the handling is done in a user context. In the latter case, it is likely to be in an interrupt context. Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupt disabled for the whole duration of the calls to print_*_stats() and print_rq() which could last for the quite some time if the information dump happens on the serial console. If the system has many cpus and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic depending on the actually serial console implementation of the system. The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rests of the printk calls don't need serialization from sched_debug_lock. Calling printk() with interrupt disabled can still be problematic if multiple instances are running. Allocating a stack buffer of PATH_MAX bytes is not feasible because of the limited size of the kernel stack. The solution implemented in this patch is to allow only one caller at a time to use the full size group_path[], while other simultaneous callers will have to use shorter stack buffers with the possibility of path name truncation. A "..." suffix will be printed if truncation may have happened. The cgroup path name is provided for informational purpose only, so occasional path name truncation should not be a big problem. Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210415195426.6677-1-longman@redhat.com
| * sched,psi: Handle potential task count underflow bugs more gracefullyCharan Teja Reddy2021-04-211-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | psi_group_cpu->tasks, represented by the unsigned int, stores the number of tasks that could be stalled on a psi resource(io/mem/cpu). Decrementing these counters at zero leads to wrapping which further leads to the psi_group_cpu->state_mask is being set with the respective pressure state. This could result into the unnecessary time sampling for the pressure state thus cause the spurious psi events. This can further lead to wrong actions being taken at the user land based on these psi events. Though psi_bug is set under these conditions but that just for debug purpose. Fix it by decrementing the ->tasks count only when it is non-zero. Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/1618585336-37219-1-git-send-email-charante@codeaurora.org
| * sched: Warn on long periods of pending need_reschedPaul Turner2021-04-214-1/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CPU scheduler marks need_resched flag to signal a schedule() on a particular CPU. But, schedule() may not happen immediately in cases where the current task is executing in the kernel mode (no preemption state) for extended periods of time. This patch adds a warn_on if need_resched is pending for more than the time specified in sysctl resched_latency_warn_ms. If it goes off, it is likely that there is a missing cond_resched() somewhere. Monitoring is done via the tick and the accuracy is hence limited to jiffy scale. This also means that we won't trigger the warning if the tick is disabled. This feature (LATENCY_WARN) is default disabled. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210416212936.390566-1-joshdon@google.com
| * sched/fair: Move update_nohz_stats() to the CONFIG_NO_HZ_COMMON block to ↵YueHaibing2021-04-201-22/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | simplify the code & fix an unused function warning When !CONFIG_NO_HZ_COMMON we get this new GCC warning: kernel/sched/fair.c:8398:13: warning: ‘update_nohz_stats’ defined but not used [-Wunused-function] Move update_nohz_stats() to an already existing CONFIG_NO_HZ_COMMON #ifdef block. Beyond fixing the GCC warning, this also simplifies the update_nohz_stats() function. [ mingo: Rewrote the changelog. ] Fixes: 0826530de3cb ("sched/fair: Remove update of blocked load from newidle_balance") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210329144029.29200-1-yuehaibing@huawei.com
| * sched/debug: Rename the sched_debug parameter to sched_verbosePeter Zijlstra2021-04-173-9/+9
| | | | | | | | | | | | | | | | | | CONFIG_SCHED_DEBUG is the build-time Kconfig knob, the boot param sched_debug and the /debug/sched/debug_enabled knobs control the sched_debug_enabled variable, but what they really do is make SCHED_DEBUG more verbose, so rename the lot. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
| * sched,fair: Alternative sched_slice()Peter Zijlstra2021-04-162-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current sched_slice() seems to have issues; there's two possible things that could be improved: - the 'nr_running' used for __sched_period() is daft when cgroups are considered. Using the RQ wide h_nr_running seems like a much more consistent number. - (esp) cgroups can slice it real fine, which makes for easy over-scheduling, ensure min_gran is what the name says. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.611897312@infradead.org
| * sched: Move /proc/sched_debug to debugfsPeter Zijlstra2021-04-161-9/+16
| | | | | | | | | | | | | | Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.548833671@infradead.org
| * sched,debug: Convert sysctl sched_domains to debugfsPeter Zijlstra2021-04-163-211/+59
| | | | | | | | | | | | | | | | | | | | Stop polluting sysctl, move to debugfs for SCHED_DEBUG stuff. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/YHgB/s4KCBQ1ifdm@hirez.programming.kicks-ass.net
| * sched,preempt: Move preempt_dynamic to debug.cPeter Zijlstra2021-04-163-77/+78
| | | | | | | | | | | | | | | | | | Move the #ifdef SCHED_DEBUG bits to kernel/sched/debug.c in order to collect all the debugfs bits. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.353833279@infradead.org
| * sched: Move SCHED_DEBUG sysctl to debugfsPeter Zijlstra2021-04-164-12/+77
| | | | | | | | | | | | | | | | | | | | | | Stop polluting sysctl with undocumented knobs that really are debug only, move them all to /debug/sched/ along with the existing /debug/sched_* files that already exist. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.287610138@infradead.org
| * sched: Use cpu_dying() to fix balance_push vs hotplug-rollbackPeter Zijlstra2021-04-162-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the new cpu_dying() state to simplify and fix the balance_push() vs CPU hotplug rollback state. Specifically, we currently rely on notifiers sched_cpu_dying() / sched_cpu_activate() to terminate balance_push, however if the cpu_down() fails when we're past sched_cpu_deactivate(), it should terminate balance_push at that point and not wait until we hit sched_cpu_activate(). Similarly, when cpu_up() fails and we're going back down, balance_push should be active, where it currently is not. So instead, make sure balance_push is enabled below SCHED_AP_ACTIVE (when !cpu_active()), and gate it's utility with cpu_dying(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/YHgAYef83VQhKdC2@hirez.programming.kicks-ass.net
| * sched/fair: Introduce a CPU capacity comparison helperValentin Schneider2021-04-091-23/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During load-balance, groups classified as group_misfit_task are filtered out if they do not pass group_smaller_max_cpu_capacity(<candidate group>, <local group>); which itself employs fits_capacity() to compare the sgc->max_capacity of both groups. Due to the underlying margin, fits_capacity(X, 1024) will return false for any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium" CPUs, misfit migration will never intentionally upmigrate it to a CPU of higher capacity due to the aforementioned margin. One may argue the 20% margin of fits_capacity() is excessive in the advent of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here is that fits_capacity() is meant to compare a utilization value to a capacity value, whereas here it is being used to compare two capacity values. As CPU capacity and task utilization have different dynamics, a sensible approach here would be to add a new helper dedicated to comparing CPU capacities. Also note that comparing capacity extrema of local and source sched_group's doesn't make much sense when at the day of the day the imbalance will be pulled by a known env->dst_cpu, whose capacity can be anywhere within the local group's capacity extrema. While at it, replace group_smaller_{min, max}_cpu_capacity() with comparisons of the source group's min/max capacity and the destination CPU's capacity. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org> Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
| * sched/fair: Clean up active balance nr_balance_failed trickeryValentin Schneider2021-04-091-16/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When triggering an active load balance, sd->nr_balance_failed is set to such a value that any further can_migrate_task() using said sd will ignore the output of task_hot(). This behaviour makes sense, as active load balance intentionally preempts a rq's running task to migrate it right away, but this asynchronous write is a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop before the sd->nr_balance_failed write either becomes visible to the stopper's CPU or even happens on the CPU that appended the stopper work. Add a struct lb_env flag to denote active balancing, and use it in can_migrate_task(). Remove the sd->nr_balance_failed write that served the same purpose. Cleanup the LBF_DST_PINNED active balance special case. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
| * sched/fair: Ignore percpu threads for imbalance pullsLingutla Chandrasekhar2021-04-091-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During load balance, LBF_SOME_PINNED will be set if any candidate task cannot be detached due to CPU affinity constraints. This can result in setting env->sd->parent->sgc->group_imbalance, which can lead to a group being classified as group_imbalanced (rather than any of the other, lower group_type) when balancing at a higher level. In workloads involving a single task per CPU, LBF_SOME_PINNED can often be set due to per-CPU kthreads being the only other runnable tasks on any given rq. This results in changing the group classification during load-balance at higher levels when in reality there is nothing that can be done for this affinity constraint: per-CPU kthreads, as the name implies, don't get to move around (modulo hotplug shenanigans). It's not as clear for userspace tasks - a task could be in an N-CPU cpuset with N-1 offline CPUs, making it an "accidental" per-CPU task rather than an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which we can leverage here to not set LBF_SOME_PINNED. Note that the aforementioned classification to group_imbalance (when nothing can be done) is especially problematic on big.LITTLE systems, which have a topology the likes of: DIE [ ] MC [ ][ ] 0 1 2 3 L L B B arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B) Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads can significantly delay the upgmigration of said misfit tasks. Systems relying on ASYM_PACKING are likely to face similar issues. Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed] [Reword changelog] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
| * sched/fair: Bring back select_idle_smt(), but differentlyRik van Riel2021-04-091-12/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mel Gorman did some nice work in 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()"), resulting in the kernel being more efficient at finding an idle CPU, and in tasks spending less time waiting to be run, both according to the schedstats run_delay numbers, and according to measured application latencies. Yay. The flip side of this is that we see more task migrations (about 30% more), higher cache misses, higher memory bandwidth utilization, and higher CPU use, for the same number of requests/second. This is most pronounced on a memcache type workload, which saw a consistent 1-3% increase in total CPU use on the system, due to those increased task migrations leading to higher L2 cache miss numbers, and higher memory utilization. The exclusive L3 cache on Skylake does us no favors there. On our web serving workload, that effect is usually negligible. It appears that the increased number of CPU migrations is generally a good thing, since it leads to lower cpu_delay numbers, reflecting the fact that tasks get to run faster. However, the reduced locality and the corresponding increase in L2 cache misses hurts a little. The patch below appears to fix the regression, while keeping the benefit of the lower cpu_delay numbers, by reintroducing select_idle_smt with a twist: when a socket has no idle cores, check to see if the sibling of "prev" is idle, before searching all the other CPUs. This fixes both the occasional 9% regression on the web serving workload, and the continuous 2% CPU use regression on the memcache type workload. With Mel's patches and this patch together, task migrations are still high, but L2 cache misses, memory bandwidth, and CPU time used are back down to what they were before. The p95 and p99 response times for the memcache type application improve by about 10% over what they were before Mel's patches got merged. Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
| * psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi filesJosh Hunt2021-04-081-6/+14
| | | | | | | | | | | | | | | | | | | | | | Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. Signed-off-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210402025833.27599-1-johunt@akamai.com
| * sched/topology: Remove redundant cpumask_and() in init_overlap_sched_group()Barry Song2021-03-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so it must be a subset of sched_group_span(sg). So the cpumask_and() call is redundant - remove it. [ mingo: Adjusted the changelog a bit. ] Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Link: https://lore.kernel.org/r/20210325023140.23456-1-song.bao.hua@hisilicon.com
| * sched/core: Use -EINVAL in sched_dynamic_mode()Rasmus Villemoes2021-03-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | -1 is -EPERM which is a somewhat odd error to return from sched_dynamic_write(). No other callers care about which negative value is used. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20210325004515.531631-2-linux@rasmusvillemoes.dk
| * sched/core: Stop using magic values in sched_dynamic_mode()Rasmus Villemoes2021-03-251-3/+3
| | | | | | | | | | | | | | | | | | | | Use the enum names which are also what is used in the switch() in sched_dynamic_update(). Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: https://lore.kernel.org/r/20210325004515.531631-1-linux@rasmusvillemoes.dk
| * sched/fair: Reduce long-tail newly idle balance costAubrey Li2021-03-231-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A long-tail load balance cost is observed on the newly idle path, this is caused by a race window between the first nr_running check of the busiest runqueue and its nr_running recheck in detach_tasks. Before the busiest runqueue is locked, the tasks on the busiest runqueue could be pulled by other CPUs and nr_running of the busiest runqueu becomes 1 or even 0 if the running task becomes idle, this causes detach_tasks breaks with LBF_ALL_PINNED flag set, and triggers load_balance redo at the same sched_domain level. In order to find the new busiest sched_group and CPU, load balance will recompute and update the various load statistics, which eventually leads to the long-tail load balance cost. This patch clears LBF_ALL_PINNED flag for this race condition, and hence reduces the long-tail cost of newly idle balance. Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/1614154549-116078-1-git-send-email-aubrey.li@intel.com
| * sched/fair: Optimize test_idle_cores() for !SMTBarry Song2021-03-231-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | update_idle_core() is only done for the case of sched_smt_present. but test_idle_cores() is done for all machines even those without SMT. This can contribute to up 8%+ hackbench performance loss on a machine like kunpeng 920 which has no SMT. This patch removes the redundant test_idle_cores() for !SMT machines. Hackbench is ran with -g {2..14}, for each g it is ran 10 times to get an average. $ numactl -N 0 hackbench -p -T -l 20000 -g $1 The below is the result of hackbench w/ and w/o this patch: g= 2 4 6 8 10 12 14 w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929 w/ : 1.8428 3.7436 5.4501 6.9522 8.2882 9.9535 11.3367 +4.1% +8.3% +7.3% +6.3% Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/20210320221432.924-1-song.bao.hua@hisilicon.com
| * psi: Reduce calls to sched_clock() in psiShakeel Butt2021-03-231-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We noticed that the cost of psi increases with the increase in the levels of the cgroups. Particularly the cost of cpu_clock() sticks out as the kernel calls it multiple times as it traverses up the cgroup tree. This patch reduces the calls to cpu_clock(). Performed perf bench on Intel Broadwell with 3 levels of cgroup. Before the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.747 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 3.516 [sec] 3.516689 usecs/op 284358 ops/sec After the patch: $ perf bench sched all # Running sched/messaging benchmark... # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.640 [sec] # Running sched/pipe benchmark... # Executed 1000000 pipe operations between two processes Total time: 3.329 [sec] 3.329820 usecs/op 300316 ops/sec Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210321205156.4186483-1-shakeelb@google.com
| * sched: Fix various typosIngo Molnar2021-03-2219-41/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix ~42 single-word typos in scheduler code comments. We have accumulated a few fun ones over the years. :-) Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: linux-kernel@vger.kernel.org
| * sched: Remove unnecessary variable from schedule_tail()Edmundo Carmona Antoranz2021-03-101-3/+1
| | | | | | | | | | | | | | | | | | | | | | Since 565790d28b1 (sched: Fix balance_callback(), 2020-05-11), there is no longer a need to reuse the result value of the call to finish_task_switch() inside schedule_tail(), therefore the variable used to hold that value (rq) is no longer needed. Signed-off-by: Edmundo Carmona Antoranz <eantoranz@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210306210739.1370486-1-eantoranz@gmail.com
| * sched: Optimize __calc_delta()Clement Courbet2021-03-102-8/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A significant portion of __calc_delta() time is spent in the loop shifting a u64 by 32 bits. Use `fls` instead of iterating. This is ~7x faster on benchmarks. The generic `fls` implementation (`generic_fls`) is still ~4x faster than the loop. Architectures that have a better implementation will make use of it. For example, on x86 we get an additional factor 2 in speed without dedicated implementation. On GCC, the asm versions of `fls` are about the same speed as the builtin. On Clang, the versions that use fls are more than twice as slow as the builtin. This is because the way the `fls` function is written, clang puts the value in memory: https://godbolt.org/z/EfMbYe. This bug is filed at https://bugs.llvm.org/show_bug.cgi?idI406. ``` name cpu/op BM_Calc<__calc_delta_loop> 9.57ms Â=B112% BM_Calc<__calc_delta_generic_fls> 2.36ms Â=B113% BM_Calc<__calc_delta_asm_fls> 2.45ms Â=B113% BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms Â=B112% BM_Calc<__calc_delta_asm_fls64> 2.46ms Â=B113% BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms Â=B115% BM_Calc<__calc_delta_builtin> 1.32ms Â=B111% ``` Signed-off-by: Clement Courbet <courbet@google.com> Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com
| * psi: Optimize task switch inside shared cgroupsChengming Zhou2021-03-062-26/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 36b238d57172 ("psi: Optimize switching tasks inside shared cgroups") only update cgroups whose state actually changes during a task switch only in task preempt case, not in task sleep case. We actually don't need to clear and set TSK_ONCPU state for common cgroups of next and prev task in sleep case, that can save many psi_group_change especially when most activity comes from one leaf cgroup. sleep before: psi_dequeue() while ((group = iterate_groups(prev))) # all ancestors psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU) psi_task_switch() while ((group = iterate_groups(next))) # all ancestors psi_group_change(next, .set=TSK_ONCPU) sleep after: psi_dequeue() nop psi_task_switch() while ((group = iterate_groups(next))) # until (prev & next) psi_group_change(next, .set=TSK_ONCPU) while ((group = iterate_groups(prev))) # all ancestors psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU) When a voluntary sleep switches to another task, we remove one call of psi_group_change() for every common cgroup ancestor of the two tasks. Co-developed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
| * psi: Pressure states are unlikelyJohannes Weiner2021-03-061-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | Move the unlikely branches out of line. This eliminates undesirable jumps during wakeup and sleeps for workloads that aren't under any sort of resource pressure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengming@bytedance.com
| * psi: Use ONCPU state tracking machinery to detect reclaimChengming Zhou2021-03-063-51/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the reclaim detection from the timer tick to the task state tracking machinery using the recently added ONCPU state. And we also add task psi_flags changes checking in the psi_task_switch() optimization to update the parents properly. In terms of performance and cost, this ONCPU task state tracking is not cheaper than previous timer tick in aggregate. But the code is simpler and shorter this way, so it's a maintainability win. And Johannes did some testing with perf bench, the performace and cost changes would be acceptable for real workloads. Thanks to Johannes Weiner for pointing out the psi_task_switch() optimization things and the clearer changelog. Co-developed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210303034659.91735-3-zhouchengming@bytedance.com
| * psi: Add PSI_CPU_FULL stateChengming Zhou2021-03-061-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The FULL state doesn't exist for the CPU resource at the system level, but exist at the cgroup level, means all non-idle tasks in a cgroup are delayed on the CPU resource which used by others outside of the cgroup or throttled by the cgroup cpu.max configuration. Co-developed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210303034659.91735-2-zhouchengming@bytedance.com
| * sched/topology: fix the issue groups don't span domain->span for NUMA ↵Barry Song2021-03-061-30/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | diameter > 2 As long as NUMA diameter > 2, building sched_domain by sibling's child domain will definitely create a sched_domain with sched_group which will span out of the sched_domain: +------+ +------+ +-------+ +------+ | node | 12 |node | 20 | node | 12 |node | | 0 +---------+1 +--------+ 2 +-------+3 | +------+ +------+ +-------+ +------+ domain0 node0 node1 node2 node3 domain1 node0+1 node0+1 node2+3 node2+3 + domain2 node0+1+2 | group: node0+1 | group:node2+3 <-------------------+ when node2 is added into the domain2 of node0, kernel is using the child domain of node2's domain2, which is domain1(node2+3). Node 3 is outside the span of the domain including node0+1+2. This will make load_balance() run based on screwed avg_load and group_type in the sched_group spanning out of the sched_domain, and it also makes select_task_rq_fair() pick an idle CPU outside the sched_domain. Real servers which suffer from this problem include Kunpeng920 and 8-node Sun Fire X4600-M2, at least. Here we move to use the *child* domain of the *child* domain of node2's domain2 as the new added sched_group. At the same, we re-use the lower level sgc directly. +------+ +------+ +-------+ +------+ | node | 12 |node | 20 | node | 12 |node | | 0 +---------+1 +--------+ 2 +-------+3 | +------+ +------+ +-------+ +------+ domain0 node0 node1 +- node2 node3 | domain1 node0+1 node0+1 | node2+3 node2+3 | domain2 node0+1+2 | group: node0+1 | group:node2 <-------------------+ While the lower level sgc is re-used, this patch only changes the remote sched_groups for those sched_domains playing grandchild trick, therefore, sgc->next_update is still safe since it's only touched by CPUs that have the group span as local group. And sgc->imbalance is also safe because sd_parent remains the same in load_balance and LB only tries other CPUs from the local group. Moreover, since local groups are not touched, they are still getting roughly equal size in a TL. And should_we_balance() only matters with local groups, so the pull probability of those groups are still roughly equal. Tested by the below topology: qemu-system-aarch64 -M virt -nographic \ -smp cpus=8 \ -numa node,cpus=0-1,nodeid=0 \ -numa node,cpus=2-3,nodeid=1 \ -numa node,cpus=4-5,nodeid=2 \ -numa node,cpus=6-7,nodeid=3 \ -numa dist,src=0,dst=1,val=12 \ -numa dist,src=0,dst=2,val=20 \ -numa dist,src=0,dst=3,val=22 \ -numa dist,src=1,dst=2,val=22 \ -numa dist,src=2,dst=3,val=12 \ -numa dist,src=1,dst=3,val=24 \ -m 4G -cpu cortex-a57 -kernel arch/arm64/boot/Image w/o patch, we get lots of "groups don't span domain->span": [ 0.802139] CPU0 attaching sched-domain(s): [ 0.802193] domain-0: span=0-1 level=MC [ 0.802443] groups: 0:{ span=0 cap=1013 }, 1:{ span=1 cap=979 } [ 0.802693] domain-1: span=0-3 level=NUMA [ 0.802731] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 } [ 0.802811] domain-2: span=0-5 level=NUMA [ 0.802829] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 } [ 0.802881] ERROR: groups don't span domain->span [ 0.803058] domain-3: span=0-7 level=NUMA [ 0.803080] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 } [ 0.804055] CPU1 attaching sched-domain(s): [ 0.804072] domain-0: span=0-1 level=MC [ 0.804096] groups: 1:{ span=1 cap=979 }, 0:{ span=0 cap=1013 } [ 0.804152] domain-1: span=0-3 level=NUMA [ 0.804170] groups: 0:{ span=0-1 cap=1992 }, 2:{ span=2-3 cap=1943 } [ 0.804219] domain-2: span=0-5 level=NUMA [ 0.804236] groups: 0:{ span=0-3 cap=3935 }, 4:{ span=4-7 cap=3937 } [ 0.804302] ERROR: groups don't span domain->span [ 0.804520] domain-3: span=0-7 level=NUMA [ 0.804546] groups: 0:{ span=0-5 mask=0-1 cap=5843 }, 6:{ span=4-7 mask=6-7 cap=4077 } [ 0.804677] CPU2 attaching sched-domain(s): [ 0.804687] domain-0: span=2-3 level=MC [ 0.804705] groups: 2:{ span=2 cap=934 }, 3:{ span=3 cap=1009 } [ 0.804754] domain-1: span=0-3 level=NUMA [ 0.804772] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 } [ 0.804820] domain-2: span=0-5 level=NUMA [ 0.804836] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 } [ 0.804944] ERROR: groups don't span domain->span [ 0.805108] domain-3: span=0-7 level=NUMA [ 0.805134] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 } [ 0.805223] CPU3 attaching sched-domain(s): [ 0.805232] domain-0: span=2-3 level=MC [ 0.805249] groups: 3:{ span=3 cap=1009 }, 2:{ span=2 cap=934 } [ 0.805319] domain-1: span=0-3 level=NUMA [ 0.805336] groups: 2:{ span=2-3 cap=1943 }, 0:{ span=0-1 cap=1992 } [ 0.805383] domain-2: span=0-5 level=NUMA [ 0.805399] groups: 2:{ span=0-3 mask=2-3 cap=3991 }, 4:{ span=0-1,4-7 mask=4-5 cap=5985 } [ 0.805458] ERROR: groups don't span domain->span [ 0.805605] domain-3: span=0-7 level=NUMA [ 0.805626] groups: 2:{ span=0-5 mask=2-3 cap=5899 }, 6:{ span=0-1,4-7 mask=6-7 cap=6125 } [ 0.805712] CPU4 attaching sched-domain(s): [ 0.805721] domain-0: span=4-5 level=MC [ 0.805738] groups: 4:{ span=4 cap=984 }, 5:{ span=5 cap=924 } [ 0.805787] domain-1: span=4-7 level=NUMA [ 0.805803] groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 } [ 0.805851] domain-2: span=0-1,4-7 level=NUMA [ 0.805867] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 } [ 0.805915] ERROR: groups don't span domain->span [ 0.806108] domain-3: span=0-7 level=NUMA [ 0.806130] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 } [ 0.806214] CPU5 attaching sched-domain(s): [ 0.806222] domain-0: span=4-5 level=MC [ 0.806240] groups: 5:{ span=5 cap=924 }, 4:{ span=4 cap=984 } [ 0.806841] domain-1: span=4-7 level=NUMA [ 0.806866] groups: 4:{ span=4-5 cap=1908 }, 6:{ span=6-7 cap=2029 } [ 0.806934] domain-2: span=0-1,4-7 level=NUMA [ 0.806953] groups: 4:{ span=4-7 cap=3937 }, 0:{ span=0-3 cap=3935 } [ 0.807004] ERROR: groups don't span domain->span [ 0.807312] domain-3: span=0-7 level=NUMA [ 0.807386] groups: 4:{ span=0-1,4-7 mask=4-5 cap=5985 }, 2:{ span=0-3 mask=2-3 cap=3991 } [ 0.807686] CPU6 attaching sched-domain(s): [ 0.807710] domain-0: span=6-7 level=MC [ 0.807750] groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1012 } [ 0.807840] domain-1: span=4-7 level=NUMA [ 0.807870] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 } [ 0.807952] domain-2: span=0-1,4-7 level=NUMA [ 0.807985] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 } [ 0.808045] ERROR: groups don't span domain->span [ 0.808257] domain-3: span=0-7 level=NUMA [ 0.808571] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6125 }, 2:{ span=0-5 mask=2-3 cap=5899 } [ 0.808848] CPU7 attaching sched-domain(s): [ 0.808860] domain-0: span=6-7 level=MC [ 0.808880] groups: 7:{ span=7 cap=1012 }, 6:{ span=6 cap=1017 } [ 0.808953] domain-1: span=4-7 level=NUMA [ 0.808974] groups: 6:{ span=6-7 cap=2029 }, 4:{ span=4-5 cap=1908 } [ 0.809034] domain-2: span=0-1,4-7 level=NUMA [ 0.809055] groups: 6:{ span=4-7 mask=6-7 cap=4077 }, 0:{ span=0-5 mask=0-1 cap=5843 } [ 0.809128] ERROR: groups don't span domain->span [ 0.810361] domain-3: span=0-7 level=NUMA [ 0.810400] groups: 6:{ span=0-1,4-7 mask=6-7 cap=5961 }, 2:{ span=0-5 mask=2-3 cap=5903 } w/ patch, we don't get "groups don't span domain->span" any more: [ 1.486271] CPU0 attaching sched-domain(s): [ 1.486820] domain-0: span=0-1 level=MC [ 1.500924] groups: 0:{ span=0 cap=980 }, 1:{ span=1 cap=994 } [ 1.515717] domain-1: span=0-3 level=NUMA [ 1.515903] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 } [ 1.516989] domain-2: span=0-5 level=NUMA [ 1.517124] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 } [ 1.517369] domain-3: span=0-7 level=NUMA [ 1.517423] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 } [ 1.520027] CPU1 attaching sched-domain(s): [ 1.520097] domain-0: span=0-1 level=MC [ 1.520184] groups: 1:{ span=1 cap=994 }, 0:{ span=0 cap=980 } [ 1.520429] domain-1: span=0-3 level=NUMA [ 1.520487] groups: 0:{ span=0-1 cap=1974 }, 2:{ span=2-3 cap=1989 } [ 1.520687] domain-2: span=0-5 level=NUMA [ 1.520744] groups: 0:{ span=0-3 cap=3963 }, 4:{ span=4-5 cap=1949 } [ 1.520948] domain-3: span=0-7 level=NUMA [ 1.521038] groups: 0:{ span=0-5 mask=0-1 cap=5912 }, 6:{ span=4-7 mask=6-7 cap=4054 } [ 1.522068] CPU2 attaching sched-domain(s): [ 1.522348] domain-0: span=2-3 level=MC [ 1.522606] groups: 2:{ span=2 cap=1003 }, 3:{ span=3 cap=986 } [ 1.522832] domain-1: span=0-3 level=NUMA [ 1.522885] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 } [ 1.523043] domain-2: span=0-5 level=NUMA [ 1.523092] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 } [ 1.523302] domain-3: span=0-7 level=NUMA [ 1.523352] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 } [ 1.523748] CPU3 attaching sched-domain(s): [ 1.523774] domain-0: span=2-3 level=MC [ 1.523825] groups: 3:{ span=3 cap=986 }, 2:{ span=2 cap=1003 } [ 1.524009] domain-1: span=0-3 level=NUMA [ 1.524086] groups: 2:{ span=2-3 cap=1989 }, 0:{ span=0-1 cap=1974 } [ 1.524281] domain-2: span=0-5 level=NUMA [ 1.524331] groups: 2:{ span=0-3 mask=2-3 cap=4037 }, 4:{ span=4-5 cap=1949 } [ 1.524534] domain-3: span=0-7 level=NUMA [ 1.524586] groups: 2:{ span=0-5 mask=2-3 cap=5986 }, 6:{ span=0-1,4-7 mask=6-7 cap=6102 } [ 1.524847] CPU4 attaching sched-domain(s): [ 1.524873] domain-0: span=4-5 level=MC [ 1.524954] groups: 4:{ span=4 cap=958 }, 5:{ span=5 cap=991 } [ 1.525105] domain-1: span=4-7 level=NUMA [ 1.525153] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 } [ 1.525368] domain-2: span=0-1,4-7 level=NUMA [ 1.525428] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 } [ 1.532726] domain-3: span=0-7 level=NUMA [ 1.532811] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=4037 } [ 1.534125] CPU5 attaching sched-domain(s): [ 1.534159] domain-0: span=4-5 level=MC [ 1.534303] groups: 5:{ span=5 cap=991 }, 4:{ span=4 cap=958 } [ 1.534490] domain-1: span=4-7 level=NUMA [ 1.534572] groups: 4:{ span=4-5 cap=1949 }, 6:{ span=6-7 cap=2006 } [ 1.534734] domain-2: span=0-1,4-7 level=NUMA [ 1.534783] groups: 4:{ span=4-7 cap=3955 }, 0:{ span=0-1 cap=1974 } [ 1.536057] domain-3: span=0-7 level=NUMA [ 1.536430] groups: 4:{ span=0-1,4-7 mask=4-5 cap=6003 }, 2:{ span=0-3 mask=2-3 cap=3896 } [ 1.536815] CPU6 attaching sched-domain(s): [ 1.536846] domain-0: span=6-7 level=MC [ 1.536934] groups: 6:{ span=6 cap=1005 }, 7:{ span=7 cap=1001 } [ 1.537144] domain-1: span=4-7 level=NUMA [ 1.537262] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 } [ 1.537553] domain-2: span=0-1,4-7 level=NUMA [ 1.537613] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 } [ 1.537872] domain-3: span=0-7 level=NUMA [ 1.537998] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 } [ 1.538448] CPU7 attaching sched-domain(s): [ 1.538505] domain-0: span=6-7 level=MC [ 1.538586] groups: 7:{ span=7 cap=1001 }, 6:{ span=6 cap=1005 } [ 1.538746] domain-1: span=4-7 level=NUMA [ 1.538798] groups: 6:{ span=6-7 cap=2006 }, 4:{ span=4-5 cap=1949 } [ 1.539048] domain-2: span=0-1,4-7 level=NUMA [ 1.539111] groups: 6:{ span=4-7 mask=6-7 cap=4054 }, 0:{ span=0-1 cap=1805 } [ 1.539571] domain-3: span=0-7 level=NUMA [ 1.539610] groups: 6:{ span=0-1,4-7 mask=6-7 cap=6102 }, 2:{ span=0-5 mask=2-3 cap=5845 } Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Meelis Roos <mroos@linux.ee> Link: https://lkml.kernel.org/r/20210224030944.15232-1-song.bao.hua@hisilicon.com
| * sched/pelt: Fix task util_est update filteringVincent Donnefort2021-03-061-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Being called for each dequeue, util_est reduces the number of its updates by filtering out when the EWMA signal is different from the task util_avg by less than 1%. It is a problem for a sudden util_avg ramp-up. Due to the decay from a previous high util_avg, EWMA might now be close enough to the new util_avg. No update would then happen while it would leave ue.enqueued with an out-of-date value. Taking into consideration the two util_est members, EWMA and enqueued for the filtering, ensures, for both, an up-to-date value. This is for now an issue only for the trace probe that might return the stale value. Functional-wise, it isn't a problem, as the value is always accessed through max(enqueued, ewma). This problem has been observed using LISA's UtilConvergence:test_means on the sd845c board. No regression observed with Hackbench on sd845c and Perf-bench sched pipe on hikey/hikey960. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210225165820.1377125-1-vincent.donnefort@arm.com
| * sched/fair: Fix shift-out-of-bounds in load_balance()Valentin Schneider2021-03-062-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot reported a handful of occurrences where an sd->nr_balance_failed can grow to much higher values than one would expect. A successful load_balance() resets it to 0; a failed one increments it. Once it gets to sd->cache_nice_tries + 3, this *should* trigger an active balance, which will either set it to sd->cache_nice_tries+1 or reset it to 0. However, in case the to-be-active-balanced task is not allowed to run on env->dst_cpu, then the increment is done without any further modification. This could then be repeated ad nauseam, and would explain the absurdly high values reported by syzbot (86, 149). VincentG noted there is value in letting sd->cache_nice_tries grow, so the shift itself should be fixed. That means preventing: """ If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined. """ Thus we need to cap the shift exponent to BITS_PER_TYPE(typeof(lefthand)) - 1. I had a look around for other similar cases via coccinelle: @expr@ position pos; expression E1; expression E2; @@ ( E1 >> E2@pos | E1 >> E2@pos ) @cst depends on expr@ position pos; expression expr.E1; constant cst; @@ ( E1 >> cst@pos | E1 << cst@pos ) @script:python depends on !cst@ pos << expr.pos; exp << expr.E2; @@ # Dirty hack to ignore constexpr if exp.upper() != exp: coccilib.report.print_report(pos[0], "Possible UB shift here") The only other match in kernel/sched is rq_clock_thermal() which employs sched_thermal_decay_shift, and that exponent is already capped to 10, so that one is fine. Fixes: 5a7f55590467 ("sched/fair: Relax constraint on task's load during load balance") Reported-by: syzbot+d7581744d5fd27c9fbe1@syzkaller.appspotmail.com Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: http://lore.kernel.org/r/000000000000ffac1205b9a2112f@google.com
| * sched/fair: use lsub_positive in cpu_util_next()Vincent Donnefort2021-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The sub_positive local version is saving an explicit load-store and is enough for the cpu_util_next() usage. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Quentin Perret <qperret@google.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20210225083612.1113823-3-vincent.donnefort@arm.com
| * sched/fair: Fix task utilization accountability in compute_energy()Vincent Donnefort2021-03-061-4/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an energy delta as follows: feec(task) for_each_pd base_energy = compute_energy(task, -1, pd) -> for_each_cpu(pd) -> cpu_util_next(cpu, task, -1) energy_delta = compute_energy(task, dst_cpu, pd) -> for_each_cpu(pd) -> cpu_util_next(cpu, task, dst_cpu) energy_delta -= base_energy Then it picks the best CPU as being the one that minimizes energy_delta. cpu_util_next() estimates the CPU utilization that would happen if the task was placed on dst_cpu as follows: max(cpu_util + task_util, cpu_util_est + _task_util_est) The task contribution to the energy delta can then be either: (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0 and _task_util_est > cpu_util. (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est. (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and otherwise must be small enough so that feec() takes the CPU as a potential target for the task placement) This is problematic for feec(), as cpu_util_next() might give an unfair advantage to a CPU which is mostly busy (2) compared to one which is mostly idle (1). _task_util_est being always bigger than task_util in feec() (as the task is waking up), the task contribution to the energy might look smaller on certain CPUs (2) and this breaks the energy comparison. This issue is, moreover, not sporadic. By starving idle CPUs, it keeps their cpu_util < _task_util_est (1) while others will maintain cpu_util > _task_util_est (2). Fix this problem by always using max(task_util, _task_util_est) as a task contribution to the energy (ENERGY_UTIL). The new estimated CPU utilization for the energy would then be: max(cpu_util, cpu_util_est) + max(task_util, _task_util_est) compute_energy() still needs to know which OPP would be selected if the task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence, cpu_util_next() is still used to estimate the maximum util within the pd. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Quentin Perret <qperret@google.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
| * sched/fair: Reduce the window for duplicated updateVincent Guittot2021-03-061-3/+8
| | | | | | | | | | | | | | | | | | | | | | Start to update last_blocked_load_update_tick to reduce the possibility of another cpu starting the update one more time Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210224133007.28644-8-vincent.guittot@linaro.org
| * sched/fair: Trigger the update of blocked load on newly idle cpuVincent Guittot2021-03-064-4/+35
| | | | | | | | | | | | | | | | | | | | | | | | Instead of waking up a random and already idle CPU, we can take advantage of this_cpu being about to enter idle to run the ILB and update the blocked load. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210224133007.28644-7-vincent.guittot@linaro.org