summaryrefslogtreecommitdiffstats
path: root/kernel/sched/pelt.c
diff options
context:
space:
mode:
authorPeter Zijlstra <peterz@infradead.org>2023-10-10 20:57:39 +0200
committerPeter Zijlstra <peterz@infradead.org>2023-10-13 09:56:29 +0200
commitf0498d2a54e7966ce23cd7c7ff42c64fa0059b07 (patch)
tree937edf90a458c4586d5164f739c00c9e34c55273 /kernel/sched/pelt.c
parent0c2924079f5a83ed715630680e338b3685a0bf7d (diff)
downloadlinux-f0498d2a54e7966ce23cd7c7ff42c64fa0059b07.tar.gz
linux-f0498d2a54e7966ce23cd7c7ff42c64fa0059b07.tar.bz2
linux-f0498d2a54e7966ce23cd7c7ff42c64fa0059b07.zip
sched: Fix stop_one_cpu_nowait() vs hotplug
Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning. Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete(). The race scenario is: CPU0 CPU1 // doing _cpu_down() __set_cpus_allowed_ptr() task_rq_lock(); takedown_cpu() stop_machine_cpuslocked(take_cpu_down..) <PREEMPT: cpu_stopper_thread() MULTI_STOP_PREPARE ... __set_cpus_allowed_ptr_locked() affine_move_task() task_rq_unlock(); <PREEMPT: cpu_stopper_thread()\> ack_state() MULTI_STOP_RUN take_cpu_down() __cpu_disable(); stop_machine_park(); stopper->enabled = false; /> /> stop_one_cpu_nowait(.fn = migration_cpu_stop); if (stopper->enabled) // false!!! That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete. OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be ran under the scheduler locks. Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination: preempt_disable(); task_rq_unlock(...); if (!stop_pending) stop_one_cpu_nowait(...) preempt_enable(); This respects the lock ordering contraints while still avoiding the above race. That is, if we find the CPU is online under rq-lock, the targeted stop_one_cpu_nowait() must succeed. Apply this pattern to all similar stop_one_cpu_nowait() invocations. Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()") Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com> Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
Diffstat (limited to 'kernel/sched/pelt.c')
0 files changed, 0 insertions, 0 deletions