summaryrefslogtreecommitdiffstats
path: root/kernel/cgroup/cpuset.c
Commit message (Collapse)AuthorAgeFilesLines
* cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()Waiman Long2022-05-051-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are 3 places where the cpu and node masks of the top cpuset can be initialized in the order they are executed: 1) start_kernel -> cpuset_init() 2) start_kernel -> cgroup_init() -> cpuset_bind() 3) kernel_init_freeable() -> do_basic_setup() -> cpuset_init_smp() The first cpuset_init() call just sets all the bits in the masks. The second cpuset_bind() call sets cpus_allowed and mems_allowed to the default v2 values. The third cpuset_init_smp() call sets them back to v1 values. For systems with cgroup v2 setup, cpuset_bind() is called once. As a result, cpu and memory node hot add may fail to update the cpu and node masks of the top cpuset to include the newly added cpu or node in a cgroup v2 environment. For systems with cgroup v1 setup, cpuset_bind() is called again by rebind_subsystem() when the v1 cpuset filesystem is mounted as shown in the dmesg log below with an instrumented kernel. [ 2.609781] cpuset_bind() called - v2 = 1 [ 3.079473] cpuset_init_smp() called [ 7.103710] cpuset_bind() called - v2 = 0 smp_init() is called after the first two init functions. So we don't have a complete list of active cpus and memory nodes until later in cpuset_init_smp() which is the right time to set up effective_cpus and effective_mems. To fix this cgroup v2 mask setup problem, the potentially incorrect cpus_allowed & mems_allowed setting in cpuset_init_smp() are removed. For cgroup v2 systems, the initial cpuset_bind() call will set the masks correctly. For cgroup v1 systems, the second call to cpuset_bind() will do the right setup. cc: stable@vger.kernel.org Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* Merge branch 'for-5.18' of ↵Linus Torvalds2022-03-231-5/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "All trivial cleanups without meaningful behavior changes" * 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: cleanup comments cgroup: Fix cgroup_can_fork() and cgroup_post_fork() kernel-doc comment cgroup: rstat: retrieve current bstat to delta directly cgroup: rstat: use same convention to assign cgroup_base_stat
| * cgroup: cleanup commentsTom Rix2022-03-131-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | for spdx, add a space before // replacements judgement to judgment transofrmed to transformed partitition to partition histrical to historical migratecd to migrated Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | Merge tag 'v5.17-rc8' into sched/core, to pick up fixesIngo Molnar2022-03-151-5/+7
|\ \ | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | cpuset: Fix kernel-docJiapeng Chong2022-02-221-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following W=1 kernel warnings: kernel/cgroup/cpuset.c:3718: warning: expecting prototype for cpuset_memory_pressure_bump(). Prototype was for __cpuset_memory_pressure_bump() instead. kernel/cgroup/cpuset.c:3568: warning: expecting prototype for cpuset_node_allowed(). Prototype was for __cpuset_node_allowed() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplugZhang Qiao2022-02-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As previously discussed(https://lkml.org/lkml/2022/1/20/51), cpuset_attach() is affected with similar cpu hotplug race, as follow scenario: cpuset_attach() cpu hotplug --------------------------- ---------------------- down_write(cpuset_rwsem) guarantee_online_cpus() // (load cpus_attach) sched_cpu_deactivate set_cpu_active() // will change cpu_active_mask set_cpus_allowed_ptr(cpus_attach) __set_cpus_allowed_ptr_locked() // (if the intersection of cpus_attach and cpu_active_mask is empty, will return -EINVAL) up_write(cpuset_rwsem) To avoid races such as described above, protect cpuset_attach() call with cpu_hotplug_lock. Fixes: be367d099270 ("cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time") Cc: stable@vger.kernel.org # v2.6.32+ Reported-by: Zhao Gongyi <zhaogongyi@huawei.com> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | Merge tag 'v5.17-rc5' into sched/core, to resolve conflictsIngo Molnar2022-02-211-14/+51
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | New conflicts in sched/core due to the following upstream fixes: 44585f7bc0cb ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n") a06247c6804f ("psi: Fix uaf issue when psi trigger is destroyed while being polled") Conflicts: include/linux/psi_types.h kernel/sched/psi.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | cgroup/cpuset: Fix "suspicious RCU usage" lockdep warningWaiman Long2022-02-031-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It was found that a "suspicious RCU usage" lockdep warning was issued with the rcu_read_lock() call in update_sibling_cpumasks(). It is because the update_cpumasks_hier() function may sleep. So we have to release the RCU lock, call update_cpumasks_hier() and reacquire it afterward. Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks() instead of stating that in the comment. Fixes: 4716909cc5c5 ("cpuset: Track cpusets that use parent's effective_cpus") Signed-off-by: Waiman Long <longman@redhat.com> Tested-by: Phil Auld <pauld@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()Tianchen Ding2022-01-261-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | subparts_cpus should be limited as a subset of cpus_allowed, but it is updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to fix it. Fixes: ee8dde0cd2ce ("cpuset: Add new v2 cpuset.sched.partition flag") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchyMichal Koutný2022-01-121-12/+40
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy") inteded to relax the check only on the default hierarchy (or v2 mode) but it dropped the check in v1 too. This patch returns and separates the legacy-only validations so that they can be considered only in the v1 mode, which should enforce the old constraints for the sake of compatibility. Fixes: 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy") Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* / sched/isolation: Use single feature type while referring to housekeeping cpumaskFrederic Weisbecker2022-02-161-3/+3
|/ | | | | | | | | | | | | Refer to housekeeping APIs using single feature types instead of flags. This prevents from passing multiple isolation features at once to housekeeping interfaces, which soon won't be possible anymore as each isolation features will have their own cpumask. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
* cpuset: convert 'allowed' in __cpuset_node_allowed() to be booleanQi Zheng2022-01-071-1/+1
| | | | | | | | | Convert 'allowed' in __cpuset_node_allowed() to be boolean since the return types of node_isset() and __cpuset_node_allowed() are both boolean. Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchyWaiman Long2021-12-131-11/+3
| | | | | | | | | | | | | | | | | | In validate_change(), there is a check since v2.6.12 to make sure that each of the child cpusets must be a subset of a parent cpuset. IOW, it allows child cpusets to restrict what changes can be made to a parent's "cpuset.cpus". This actually violates one of the core principles of the default hierarchy where a cgroup higher up in the hierarchy should be able to change configuration however it sees fit as deligation breaks down otherwise. To address this issue, the check is now removed for the default hierarchy to free parent cpusets from being restricted by child cpusets. The check will still apply for legacy hierarchy. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* mm/page_alloc: detect allocation forbidden by cpuset and bail out earlyFeng Tang2021-11-061-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was a report that starting an Ubuntu in docker while using cpuset to bind it to movable nodes (a node only has movable zone, like a node for hotplug or a Persistent Memory node in normal usage) will fail due to memory allocation failure, and then OOM is involved and many other innocent processes got killed. It can be reproduced with command: $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status" (where node 4 is a movable node) runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0 CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G W I E 5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased) Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020 Call Trace: dump_stack+0x6b/0x88 dump_header+0x4a/0x1e2 oom_kill_process.cold+0xb/0x10 out_of_memory.part.0+0xaf/0x230 out_of_memory+0x3d/0x80 __alloc_pages_slowpath.constprop.0+0x954/0xa20 __alloc_pages_nodemask+0x2d3/0x300 pipe_write+0x322/0x590 new_sync_write+0x196/0x1b0 vfs_write+0x1c3/0x1f0 ksys_write+0xa7/0xe0 do_syscall_64+0x52/0xd0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Mem-Info: active_anon:392832 inactive_anon:182 isolated_anon:0 active_file:68130 inactive_file:151527 isolated_file:0 unevictable:2701 dirty:0 writeback:7 slab_reclaimable:51418 slab_unreclaimable:116300 mapped:45825 shmem:735 pagetables:2540 bounce:0 free:159849484 free_pcp:73 free_cma:0 Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 0 Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0 Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0 oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB The reason is that in this case, the target cpuset nodes only have movable zone, while the creation of an OS in docker sometimes needs to allocate memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and the cpuset limit forbids the allocation, then out-of-memory killing is involved even when normal nodes and movable nodes both have many free memory. The OOM killer cannot help to resolve the situation as there is no usable memory for the request in the cpuset scope. The only reasonable measure to take is to fail the allocation right away and have the caller to deal with it. So add a check for cases like this in the slowpath of allocation, and bail out early returning NULL for the allocation. As page allocation is one of the hottest path in kernel, this check will hurt all users with sane cpuset configuration, add a static branch check and detect the abnormal config in cpuset memory binding setup so that the extra check cost in page allocation is not paid by everyone. [thanks to Micho Hocko and David Rientjes for suggesting not handling it inside OOM code, adding cpuset check, refining comments] Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsemWaiman Long2021-09-131-27/+29
| | | | | | | | | | | | | Since commit 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem"), cpuset_mutex has been replaced by cpuset_rwsem which is a percpu rwsem. However, the comments in kernel/cgroup/cpuset.c still reference cpuset_mutex which are now incorrect. Change all the references of cpuset_mutex to cpuset_rwsem. Fixes: 1243dc518c9d ("cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* Merge branch 'for-5.15' of ↵Linus Torvalds2021-08-311-56/+106
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Two cpuset behavior changes: - cpuset on cgroup2 is changed to enable memory migration based on nodemask by default. - A notification is generated when cpuset partition state changes. All other patches are minor fixes and cleanups" * 'for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Avoid compiler warnings with no subsystems cgroup/cpuset: Avoid memory migration when nodemasks match cgroup/cpuset: Enable memory migration for cpuset v2 cgroup/cpuset: Enable event notification when partition state changes cgroup: cgroup-v1: clean up kernel-doc notation cgroup: Replace deprecated CPU-hotplug functions. cgroup/cpuset: Fix violation of cpuset locking rule cgroup/cpuset: Fix a partition bug with hotplug cgroup/cpuset: Miscellaneous code cleanup cgroup: remove cgroup_mount from comments
| * cgroup/cpuset: Avoid memory migration when nodemasks matchNicolas Saenz Julienne2021-08-251-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the introduction of ee9707e8593d ("cgroup/cpuset: Enable memory migration for cpuset v2") attaching a process to a different cgroup will trigger a memory migration regardless of whether it's really needed. Memory migration is an expensive operation, so bypass it if the nodemasks passed to cpuset_migrate_mm() are equal. Note that we're not only avoiding the migration work itself, but also a call to lru_cache_disable(), which triggers and flushes an LRU drain work on every online CPU. Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * cgroup/cpuset: Enable memory migration for cpuset v2Waiman Long2021-08-121-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a user changes cpuset.cpus, each task in a v2 cpuset will be moved to one of the new cpus if it is not there already. For memory, however, they won't be migrated to the new nodes when cpuset.mems changes. This is an inconsistency in behavior. In cpuset v1, there is a memory_migrate control file to enable such behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default for cpuset v2 so that we have a consistent set of behavior for both cpus and memory. There is certainly a cost to make memory migration the default, but it is a one time cost that shouldn't really matter as long as cpuset.mems isn't changed frequenty. Update the cgroup-v2.rst file to document the new behavior and recommend against changing cpuset.mems frequently. Since there won't be any concurrent access to the newly allocated cpuset structure in cpuset_css_alloc(), we can use the cheaper non-atomic __set_bit() instead of the more expensive atomic set_bit(). Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>
| * cgroup/cpuset: Enable event notification when partition state changesWaiman Long2021-08-111-11/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A valid cpuset partition can become invalid if all its CPUs are offlined or somehow removed. This can happen through external events without "cpuset.cpus.partition" being touched at all. Users that rely on the property of a partition being present do not currently have a simple way to get such an event notified other than constant periodic polling which is both inefficient and cumbersome. To make life easier for those users, event notification is now enabled for "cpuset.cpus.partition" whenever its state changes. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * cgroup: Replace deprecated CPU-hotplug functions.Sebastian Andrzej Siewior2021-08-091-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
| * cgroup/cpuset: Fix violation of cpuset locking ruleWaiman Long2021-08-091-23/+35
| | | | | | | | | | | | | | | | | | | | | | | | The cpuset fields that manage partition root state do not strictly follow the cpuset locking rule that update to cpuset has to be done with both the callback_lock and cpuset_mutex held. This is now fixed by making sure that the locking rule is upheld. Fixes: 3881b86128d0 ("cpuset: Add an error state to cpuset.sched.partition") Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * cgroup/cpuset: Fix a partition bug with hotplugWaiman Long2021-07-261-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | In cpuset_hotplug_workfn(), the detection of whether the cpu list has been changed is done by comparing the effective cpus of the top cpuset with the cpu_active_mask. However, in the rare case that just all the CPUs in the subparts_cpus are offlined, the detection fails and the partition states are not updated correctly. Fix it by forcing the cpus_updated flag to true in this particular case. Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * cgroup/cpuset: Miscellaneous code cleanupWaiman Long2021-07-261-21/+19
| | | | | | | | | | | | | | | | Use more descriptive variable names for update_prstate(), remove unnecessary code and fix some typos. There is no functional change. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | cpuset: Cleanup cpuset_cpus_allowed_fallback() use in select_fallback_rq()Will Deacon2021-08-201-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | select_fallback_rq() only needs to recheck for an allowed CPU if the affinity mask of the task has changed since the last check. Return a 'bool' from cpuset_cpus_allowed_fallback() to indicate whether the affinity mask was updated, and use this to elide the allowed check when the mask has been left alone. No functional change. Suggested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20210730112443.23245-5-will@kernel.org
* | cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()Will Deacon2021-08-201-17/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Asymmetric systems may not offer the same level of userspace ISA support across all CPUs, meaning that some applications cannot be executed by some CPUs. As a concrete example, upcoming arm64 big.LITTLE designs do not feature support for 32-bit applications on both clusters. Modify guarantee_online_cpus() to take task_cpu_possible_mask() into account when trying to find a suitable set of online CPUs for a given task. This will avoid passing an invalid mask to set_cpus_allowed_ptr() during ->attach() and will subsequently allow the cpuset hierarchy to be taken into account when forcefully overriding the affinity mask for a task which requires migration to a compatible CPU. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com> Link: https://lkml.kernel.org/r/20210730112443.23245-4-will@kernel.org
* | cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1Will Deacon2021-08-201-2/+6
|/ | | | | | | | | | | | | | | | | If the scheduler cannot find an allowed CPU for a task, cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask if cgroup v1 is in use. In preparation for allowing architectures to provide their own fallback mask, just return early if we're either using cgroup v1 or we're using cgroup v2 with a mask that contains invalid CPUs. This will allow select_fallback_rq() to figure out the mask by itself. Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Quentin Perret <qperret@google.com> Link: https://lkml.kernel.org/r/20210730112443.23245-3-will@kernel.org
* cgroup: fix spelling mistakesZhen Lei2021-05-241-1/+1
| | | | | | | | | | | | | | | | | Fix some spelling mistakes in comments: hierarhcy ==> hierarchy automtically ==> automatically overriden ==> overridden In absense of .. or ==> In absence of .. and assocaited ==> associated taget ==> target initate ==> initiate succeded ==> succeeded curremt ==> current udpated ==> updated Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cgroup/cpuset: fix typos in commentsLu Jialin2021-04-121-3/+3
| | | | | | | | Change hierachy to hierarchy and unrechable to unreachable, no functionality changed. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: fix typos in commentsAubrey Li2021-01-151-3/+3
| | | | | | | | | Change hierachy to hierarchy and congifured to configured, no functionality changed. Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* cpuset: fix race between hotplug work and later CPU offlineDaniel Jordan2020-11-191-5/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of our machines keeled over trying to rebuild the scheduler domains. Mainline produces the same splat: BUG: unable to handle page fault for address: 0000607f820054db CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6 Workqueue: events cpuset_hotplug_workfn RIP: build_sched_domains Call Trace: partition_sched_domains_locked rebuild_sched_domains_locked cpuset_hotplug_workfn It happens with cgroup2 and exclusive cpusets only. This reproducer triggers it on an 8-cpu vm and works most effectively with no preexisting child cgroups: cd $UNIFIED_ROOT mkdir cg1 echo 4-7 > cg1/cpuset.cpus echo root > cg1/cpuset.cpus.partition # with smt/control reading 'on', echo off > /sys/devices/system/cpu/smt/control RIP maps to sd->shared = *per_cpu_ptr(sdd->sds, sd_id); from sd_init(). sd_id is calculated earlier in the same function: cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu)); sd_id = cpumask_first(sched_domain_span(sd)); tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus value from per_cpu_ptr() above. The problem is a race between cpuset_hotplug_workfn() and a later offline of CPU N. cpuset_hotplug_workfn() updates the effective masks when N is still online, the offline clears N from cpu_sibling_map, and then the worker uses the stale effective masks that still have N to generate the scheduling domains, leading the worker to read N's empty cpu_sibling_map in sd_init(). rebuild_sched_domains_locked() prevented the race during the cgroup2 cpuset series up until the Fixes commit changed its check. Make the check more robust so that it can detect an offline CPU in any exclusive cpuset's effective mask, not just the top one. Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with partition") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
* kernel/: fix repeated words in commentsRandy Dunlap2020-10-161-1/+1
| | | | | | | | | | | | | Fix multiple occurrences of duplicated words in kernel/. Fix one typo/spello on the same line as a duplicate word. Change one instance of "the the" to "that the". Otherwise just drop one of the repeated words. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/98202fa6-8919-ef63-9efe-c0fad5ca7af1@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mmap locking API: convert mmap_sem commentsMichel Lespinasse2020-06-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* docs: cgroup-v1: Document the cpuset_v2_mode mount optionWaiman Long2020-04-031-2/+6
| | | | | | | | | | The cpuset in cgroup v1 accepts a special "cpuset_v2_mode" mount option that make cpuset.cpus and cpuset.mems behave more like those in cgroup v2. Document it to make other people more aware of this feature that can be useful in some circumstances. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* Revert "cpuset: Make cpuset hotplug synchronous"Tejun Heo2020-04-031-19/+12
| | | | | | | | | | This reverts commit a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") as it may deadlock with cpu hotplug path. Link: http://lkml.kernel.org/r/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Qian Cai <cai@lca.pw> Cc: Prateek Sood <prsood@codeaurora.org>
* cpuset: Make cpuset hotplug synchronousPrateek Sood2020-02-121-12/+19
| | | | | | | | | | | | Convert cpuset_hotplug_workfn() into synchronous call for cpu hotplug path. For memory hotplug path it still gets queued as a work item. Since cpuset_hotplug_workfn() can be made synchronous for cpu hotplug path, it is not required to wait for cpuset hotplug while thawing processes. Signed-off-by: Prateek Sood <prsood@codeaurora.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* Merge branch 'for-5.5' of ↵Linus Torvalds2019-11-251-2/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "There are several notable changes here: - Single thread migrating itself has been optimized so that it doesn't need threadgroup rwsem anymore. - Freezer optimization to avoid unnecessary frozen state changes. - cgroup ID unification so that cgroup fs ino is the only unique ID used for the cgroup and can be used to directly look up live cgroups through filehandle interface on 64bit ino archs. On 32bit archs, cgroup fs ino is still the only ID in use but it is only unique when combined with gen. - selftest and other changes" * 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits) writeback: fix -Wformat compilation warnings docs: cgroup: mm: Fix spelling of "list" cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root() cgroup: use cgrp->kn->id as the cgroup ID kernfs: use 64bit inos if ino_t is 64bit kernfs: implement custom exportfs ops and fid type kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id() kernfs: convert kernfs_node->id from union kernfs_node_id to u64 kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes kernfs: use dumber locking for kernfs_find_and_get_node_by_ino() netprio: use css ID instead of cgroup ID writeback: use ino_t for inodes in tracepoints kernfs: fix ino wrap-around detection kselftests: cgroup: Avoid the reuse of fd after it is deallocated cgroup: freezer: don't change task and cgroups status unnecessarily cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency cgroup: remove cgroup_enable_task_cg_lists() optimization cgroup: pids: use atomic64_t for pids->limit selftests: cgroup: Run test_core under interfering stress selftests: cgroup: Add task migration tests ...
| * cgroup: remove cgroup_enable_task_cg_lists() optimizationTejun Heo2019-10-251-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_enable_task_cg_lists() is used to lazyily initialize task cgroup associations on the first use to reduce fork / exit overheads on systems which don't use cgroup. Unfortunately, locking around it has never been actually correct and its value is dubious given how the vast majority of systems use cgroup right away from boot. This patch removes the optimization. For now, replace the cg_list based branches with WARN_ON_ONCE()'s to be on the safe side. We can simplify the logic further in the future. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | sched/topology: Don't try to build empty sched domainsValentin Schneider2019-10-291-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Turns out hotplugging CPUs that are in exclusive cpusets can lead to the cpuset code feeding empty cpumasks to the sched domain rebuild machinery. This leads to the following splat: Internal error: Oops: 96000004 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 235 Comm: kworker/5:2 Not tainted 5.4.0-rc1-00005-g8d495477d62e #23 Hardware name: ARM Juno development board (r0) (DT) Workqueue: events cpuset_hotplug_workfn pstate: 60000005 (nZCv daif -PAN -UAO) pc : build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969) lr : build_sched_domains (kernel/sched/topology.c:1966) Call trace: build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969) partition_sched_domains_locked (kernel/sched/topology.c:2250) rebuild_sched_domains_locked (./include/linux/bitmap.h:370 ./include/linux/cpumask.h:538 kernel/cgroup/cpuset.c:955 kernel/cgroup/cpuset.c:978 kernel/cgroup/cpuset.c:1019) rebuild_sched_domains (kernel/cgroup/cpuset.c:1032) cpuset_hotplug_workfn (kernel/cgroup/cpuset.c:3205 (discriminator 2)) process_one_work (./arch/arm64/include/asm/jump_label.h:21 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:114 kernel/workqueue.c:2274) worker_thread (./include/linux/compiler.h:199 ./include/linux/list.h:268 kernel/workqueue.c:2416) kthread (kernel/kthread.c:255) ret_from_fork (arch/arm64/kernel/entry.S:1167) Code: f860dae2 912802d6 aa1603e1 12800000 (f8616853) The faulty line in question is: cap = arch_scale_cpu_capacity(cpumask_first(cpu_map)); and we're not checking the return value against nr_cpu_ids (we shouldn't have to!), which leads to the above. Prevent generate_sched_domains() from returning empty cpumasks, and add some assertion in build_sched_domains() to scream bloody murder if it happens again. The above splat was obtained on my Juno r0 with the following reproducer: $ cgcreate -g cpuset:asym $ cgset -r cpuset.cpus=0-3 asym $ cgset -r cpuset.mems=0 asym $ cgset -r cpuset.cpu_exclusive=1 asym $ cgcreate -g cpuset:smp $ cgset -r cpuset.cpus=4-5 smp $ cgset -r cpuset.mems=0 smp $ cgset -r cpuset.cpu_exclusive=1 smp $ cgset -r cpuset.sched_load_balance=0 . $ echo 0 > /sys/devices/system/cpu/cpu4/online $ echo 0 > /sys/devices/system/cpu/cpu5/online Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hannes@cmpxchg.org Cc: lizefan@huawei.com Cc: morten.rasmussen@arm.com Cc: qperret@google.com Cc: tj@kernel.org Cc: vincent.guittot@linaro.org Fixes: 05484e098448 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection") Link: https://lkml.kernel.org/r/20191023153745.19515-2-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/core: Prevent race condition between cpuset and __sched_setscheduler()Juri Lelli2019-07-251-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No synchronisation mechanism exists between the cpuset subsystem and calls to function __sched_setscheduler(). As such, it is possible that new root domains are created on the cpuset side while a deadline acceptance test is carried out in __sched_setscheduler(), leading to a potential oversell of CPU bandwidth. Grab cpuset_rwsem read lock from core scheduler, so to prevent situations such as the one described above from happening. The only exception is normalize_rt_tasks() which needs to work under tasklist_lock and can't therefore grab cpuset_rwsem. We are fine with this, as this function is only called by sysrq and, if that gets triggered, DEADLINE guarantees are already gone out of the window anyway. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: rostedt@goodmis.org Cc: tj@kernel.org Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-9-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* cgroup/cpuset: Change cpuset_rwsem and hotplug lock orderJuri Lelli2019-07-251-5/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpuset_rwsem is going to be acquired from sched_setscheduler() with a following patch. There are however paths (e.g., spawn_ksoftirqd) in which sched_scheduler() is eventually called while holding hotplug lock; this creates a dependecy between hotplug lock (to be always acquired first) and cpuset_rwsem (to be always acquired after hotplug lock). Fix paths which currently take the two locks in the wrong order (after a following patch is applied). Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: rostedt@goodmis.org Cc: tj@kernel.org Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-7-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* cgroup/cpuset: Convert cpuset_mutex to percpu_rwsemJuri Lelli2019-07-251-33/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Holding cpuset_mutex means that cpusets are stable (only the holder can make changes) and this is required for fixing a synchronization issue between cpusets and scheduler core. However, grabbing cpuset_mutex from setscheduler() hotpath (as implemented in a later patch) is a no-go, as it would create a bottleneck for tasks concurrently calling setscheduler(). Convert cpuset_mutex to be a percpu_rwsem (cpuset_rwsem), so that setscheduler() will then be able to read lock it and avoid concurrency issues. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: rostedt@goodmis.org Cc: tj@kernel.org Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-6-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* cpusets: Rebuild root domain deadline accounting informationMathieu Poirier2019-07-251-1/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the topology of root domains is modified by CPUset or CPUhotplug operations information about the current deadline bandwidth held in the root domain is lost. This patch addresses the issue by recalculating the lost deadline bandwidth information by circling through the deadline tasks held in CPUsets and adding their current load to the root domain they are associated with. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> [ Various additional modifications. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: rostedt@goodmis.org Cc: tj@kernel.org Cc: tommaso.cucinotta@santannapisa.it Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'work.mount0' of ↵Linus Torvalds2019-07-191-60/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount updates from Al Viro: "The first part of mount updates. Convert filesystems to use the new mount API" * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) mnt_init(): call shmem_init() unconditionally constify ksys_mount() string arguments don't bother with registering rootfs init_rootfs(): don't bother with init_ramfs_fs() vfs: Convert smackfs to use the new mount API vfs: Convert selinuxfs to use the new mount API vfs: Convert securityfs to use the new mount API vfs: Convert apparmorfs to use the new mount API vfs: Convert openpromfs to use the new mount API vfs: Convert xenfs to use the new mount API vfs: Convert gadgetfs to use the new mount API vfs: Convert oprofilefs to use the new mount API vfs: Convert ibmasmfs to use the new mount API vfs: Convert qib_fs/ipathfs to use the new mount API vfs: Convert efivarfs to use the new mount API vfs: Convert configfs to use the new mount API vfs: Convert binfmt_misc to use the new mount API convenience helper: get_tree_single() convenience helper get_tree_nodev() vfs: Kill sget_userns() ...
| * cpuset: move mount -t cpuset logics into cgroup.cAl Viro2019-05-251-60/+1
| | | | | | | | | | | | | | ... and get rid of the weird dances in ->get_tree() - that logics can be easily handled in ->init_fs_context(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | docs: cgroup-v1: add it to the admin-guide bookMauro Carvalho Chehab2019-07-151-1/+1
| | | | | | | | | | | | Those files belong to the admin guide, so add them. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
* | Merge branch 'for-5.3' of ↵Linus Torvalds2019-07-081-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Documentation updates and the addition of cgroup_parse_float() which will be used by new controllers including blk-iocost" * 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1: convert docs to ReST and rename to *.rst cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS cgroup: add cgroup_parse_float()
| * | docs: cgroup-v1: convert docs to ReST and rename to *.rstMauro Carvalho Chehab2019-06-141-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert the cgroup-v1 files to ReST format, in order to allow a later addition to the admin-guide. The conversion is actually: - add blank lines and identation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* | Merge tag 'v5.2-rc5' into sched/core, to pick up fixesIngo Molnar2019-06-171-1/+14
|\ \ | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | cpuset: restore sanity to cpuset_cpus_allowed_fallback()Joel Savitz2019-06-121-1/+14
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the case that a process is constrained by taskset(1) (i.e. sched_setaffinity(2)) to a subset of available cpus, and all of those are subsequently offlined, the scheduler will set tsk->cpus_allowed to the current value of task_cs(tsk)->effective_cpus. This is done via a call to do_set_cpus_allowed() in the context of cpuset_cpus_allowed_fallback() made by the scheduler when this case is detected. This is the only call made to cpuset_cpus_allowed_fallback() in the latest mainline kernel. However, this is not sane behavior. I will demonstrate this on a system running the latest upstream kernel with the following initial configuration: # grep -i cpu /proc/$$/status Cpus_allowed: ffffffff,fffffff Cpus_allowed_list: 0-63 (Where cpus 32-63 are provided via smt.) If we limit our current shell process to cpu2 only and then offline it and reonline it: # taskset -p 4 $$ pid 2272's current affinity mask: ffffffffffffffff pid 2272's new affinity mask: 4 # echo off > /sys/devices/system/cpu/cpu2/online # dmesg | tail -3 [ 2195.866089] process 2272 (bash) no longer affine to cpu2 [ 2195.872700] IRQ 114: no longer affine to CPU2 [ 2195.879128] smpboot: CPU 2 is now offline # echo on > /sys/devices/system/cpu/cpu2/online # dmesg | tail -1 [ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4 We see that our current process now has an affinity mask containing every cpu available on the system _except_ the one we originally constrained it to: # grep -i cpu /proc/$$/status Cpus_allowed: ffffffff,fffffffb Cpus_allowed_list: 0-1,3-63 This is not sane behavior, as the scheduler can now not only place the process on previously forbidden cpus, it can't even schedule it on the cpu it was originally constrained to! Other cases result in even more exotic affinity masks. Take for instance a process with an affinity mask containing only cpus provided by smt at the moment that smt is toggled, in a configuration such as the following: # taskset -p f000000000 $$ # grep -i cpu /proc/$$/status Cpus_allowed: 000000f0,00000000 Cpus_allowed_list: 36-39 A double toggle of smt results in the following behavior: # echo off > /sys/devices/system/cpu/smt/control # echo on > /sys/devices/system/cpu/smt/control # grep -i cpus /proc/$$/status Cpus_allowed: ffffff00,ffffffff Cpus_allowed_list: 0-31,40-63 This is even less sane than the previous case, as the new affinity mask excludes all smt-provided cpus with ids less than those that were previously in the affinity mask, as well as those that were actually in the mask. With this patch applied, both of these cases end in the following state: # grep -i cpu /proc/$$/status Cpus_allowed: ffffffff,ffffffff Cpus_allowed_list: 0-63 The original policy is discarded. Though not ideal, it is the simplest way to restore sanity to this fallback case without reinventing the cpuset wheel that rolls down the kernel just fine in cgroup v2. A user who wishes for the previous affinity mask to be restored in this fallback case can use that mechanism instead. This patch modifies scheduler behavior by instead resetting the mask to task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy mode. I tested the cases above on both modes. Note that the scheduler uses this fallback mechanism if and only if _every_ other valid avenue has been traveled, and it is the last resort before calling BUG(). Suggested-by: Waiman Long <longman@redhat.com> Suggested-by: Phil Auld <pauld@redhat.com> Signed-off-by: Joel Savitz <jsavitz@redhat.com> Acked-by: Phil Auld <pauld@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* / sched/core: Provide a pointer to the valid CPU maskSebastian Andrzej Siewior2019-06-031-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit: 4b53a3412d66 ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper") the tsk_nr_cpus_allowed() wrapper was removed. There was not much difference in !RT but in RT we used this to implement migrate_disable(). Within a migrate_disable() section the CPU mask is restricted to single CPU while the "normal" CPU mask remains untouched. As an alternative implementation Ingo suggested to use: struct task_struct { const cpumask_t *cpus_ptr; cpumask_t cpus_mask; }; with t->cpus_ptr = &t->cpus_mask; In -RT we then can switch the cpus_ptr to: t->cpus_ptr = &cpumask_of(task_cpu(p)); in a migration disabled region. The rules are simple: - Code that 'uses' ->cpus_allowed would use the pointer. - Code that 'modifies' ->cpus_allowed would use the direct mask. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>