path: root/kernel/sched_rt.c
Commit message | Author | Age | Files | Lines
* sched: fix goto retry in pick_next_task_rt() | Dmitry Adamushko | 2008-01-25 | 1 | -7/+2

  Looking at it one more time:

  (1) It looks to me that there is no need to call sched_rt_ratio_exceeded()
  from pick_next_rt_entity():

  - [ for CONFIG_FAIR_GROUP_SCHED ] queues with rt_rq->rt_throttled are not
    within this 'tree-like hierarchy' (or whatever we should call it :-)

  - there is also no need to re-check 'rt_rq->rt_time > ratio' at this point,
    as 'rt_rq->rt_time' couldn't have been increased since the last call to
    update_curr_rt() (which obviously calls sched_rt_ratio_exceeded()).

  Well, it might be that 'ratio' for this rt_rq has been re-configured (and
  the period over which this rt_rq was active has not yet finished)... but I
  don't think we should really take this into account.

  (2) Now pick_next_rt_entity() must never return NULL, so let's change
  pick_next_task_rt() accordingly.

  Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: rt-watchdog: fix .rlim_max = RLIM_INFINITY | Peter Zijlstra | 2008-01-25 | 1 | -7/+1

  Remove the curious logic to set it_sched_expires in the future. It is
  useless because rt.timeout wouldn't be incremented anyway.

  Explicitly check for RLIM_INFINITY, as a test program that had a 1s soft
  limit and an infinite hard limit would SIGKILL at 1s. This is because
  RLIM_INFINITY+d-1 is d-2.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  CC: Michal Schmidt <mschmidt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
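A minimal user-space sketch of the wraparound described above; it is not the
kernel patch itself, and the variable names and values are illustrative.
It shows why doing arithmetic on RLIM_INFINITY misfires and why the fix is an
explicit comparison:

    #include <stdio.h>
    #include <sys/resource.h>   /* rlim_t, RLIM_INFINITY */

    int main(void)
    {
            rlim_t soft = 1;               /* 1 second soft limit          */
            rlim_t hard = RLIM_INFINITY;   /* "unlimited" hard limit       */
            rlim_t timeout = 2;            /* seconds of RT runtime so far */
            rlim_t d = 10;                 /* some rounding granule        */

            /* RLIM_INFINITY is all-ones, so this wraps around to d-2: */
            printf("RLIM_INFINITY + d - 1 = %llu\n",
                   (unsigned long long)(hard + d - 1));

            /* Explicitly special-case RLIM_INFINITY instead: */
            if (hard != RLIM_INFINITY && timeout > hard)
                    printf("hard limit exceeded: SIGKILL\n");
            else if (soft != RLIM_INFINITY && timeout > soft)
                    printf("soft limit exceeded: SIGXCPU\n");
            return 0;
    }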
* sched: rt-group: reduce rescheduling | Peter Zijlstra | 2008-01-25 | 1 | -1/+4

  Only reschedule if the new group has a higher prio task.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: rt throttling vs no_hz | Peter Zijlstra | 2008-01-25 | 1 | -14/+16

  We need to teach no_hz about the rt throttling because it is tick driven.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: pull_rt_task() cleanup | Mike Galbraith | 2008-01-25 | 1 | -6/+4

  "goto out" is an odd way to spell "skip".

  Signed-off-by: Mike Galbraith <efault@gmx.de>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: rt group scheduling | Peter Zijlstra | 2008-01-25 | 1 | -119/+336

  Extend group scheduling to also cover the realtime classes. It uses the
  time limiting introduced by the previous patch to allow multiple realtime
  groups.

  The hard time limit is required to keep behaviour deterministic. The
  algorithms used make the realtime scheduler O(tg), i.e. linear scaling
  with the number of task groups. This is the worst-case behaviour I can't
  seem to get out of; the average case of the algorithms can be improved,
  but I focused on correctness and the worst case.

  [ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ]

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: rt time limit | Peter Zijlstra | 2008-01-25 | 1 | -0/+53

  Very simple time limit on the realtime scheduling classes. Allow the rq's
  realtime class to consume sched_rt_ratio of every sched_rt_period slice.
  If the class exceeds this quota, the fair class will preempt the realtime
  class.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
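A simplified sketch of the throttling idea, using made-up field and constant
names rather than the actual sched_rt.c data structures: the RT class gets
rt_ratio worth of every rt_period, and once its accumulated runtime exceeds
that quota it is marked throttled so the fair class can run until the period
rolls over.

    #include <stdbool.h>

    #define RT_RATIO_SHIFT 16   /* ratio is a 16-bit fixed-point fraction */

    struct rt_rq_sketch {
            unsigned long long rt_time;    /* RT runtime used this period, ns */
            unsigned long long rt_period;  /* period length, ns               */
            unsigned int       rt_ratio;   /* allowed fraction of the period  */
            bool               rt_throttled;
    };

    /* Charge runtime and report whether the quota has been exceeded. */
    bool rt_ratio_exceeded(struct rt_rq_sketch *rt_rq, unsigned long long delta)
    {
            unsigned long long quota =
                    (rt_rq->rt_period >> RT_RATIO_SHIFT) * rt_rq->rt_ratio;

            rt_rq->rt_time += delta;
            if (rt_rq->rt_time > quota)
                    rt_rq->rt_throttled = true;   /* let the fair class run */
            return rt_rq->rt_throttled;
    }

    /* Periodic timer: start a fresh period and lift the throttle. */
    void rt_period_rollover(struct rt_rq_sketch *rt_rq)
    {
            rt_rq->rt_time = 0;
            rt_rq->rt_throttled = false;
    }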
* sched: high-res preemption tick | Peter Zijlstra | 2008-01-25 | 1 | -1/+1

  Use HR-timers (when available) to deliver an accurate preemption tick. The
  regular scheduler tick that runs at 1/HZ can be too coarse when nice
  levels are used. The fairness system will still keep the cpu utilisation
  'fair' by then delaying the task that got an excessive amount of CPU time,
  but tries to minimize this by delivering preemption points spot-on.

  The average frequency of this extra interrupt is sched_latency /
  nr_latency. This need not be higher than 1/HZ; it's just that the
  distribution within the sched_latency period is important.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: SCHED_FIFO/SCHED_RR watchdog timer | Peter Zijlstra | 2008-01-25 | 1 | -0/+30

  Introduce a new rlimit that allows the user to set a runtime timeout on a
  real-time task's slice. Once this limit is exceeded the task will receive
  SIGXCPU. So it measures runtime since the last sleep.

  Input and ideas by Thomas Gleixner and Lennart Poettering.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  CC: Lennart Poettering <mzxreary@0pointer.de>
  CC: Michael Kerrisk <mtk.manpages@googlemail.com>
  CC: Ulrich Drepper <drepper@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
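A sketch of the watchdog mechanism as described above, with illustrative
names and a plain callback standing in for the actual signal delivery; it is
not the kernel implementation. The timeout counter grows by one tick while
the RT task keeps running and is reset when it sleeps, so the limit bounds
runtime since the last sleep:

    struct rt_watchdog_sketch {
            unsigned long timeout;     /* ticks run since the last sleep  */
            unsigned long soft_limit;  /* soft runtime limit, in ticks    */
            void (*send_sigxcpu)(void);
    };

    /* Called from the scheduler tick while the RT task is on the CPU. */
    void rt_watchdog_tick(struct rt_watchdog_sketch *wd)
    {
            if (wd->soft_limit == 0)          /* no limit configured (sketch convention) */
                    return;
            if (++wd->timeout > wd->soft_limit)
                    wd->send_sigxcpu();       /* task exceeded its budget */
    }

    /* Called when the task voluntarily sleeps: the budget starts over. */
    void rt_watchdog_sleep(struct rt_watchdog_sketch *wd)
    {
            wd->timeout = 0;
    }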
* sched: sched_rt_entity | Peter Zijlstra | 2008-01-25 | 1 | -10/+10

  Move the task_struct members specific to rt scheduling together. A future
  optimization could be to put sched_entity and sched_rt_entity into a
  union.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: remove some old cpuset logic | Gregory Haskins | 2008-01-25 | 1 | -33/+0

  We had support for overlapping cpuset based rto logic in early prototypes
  that is no longer used, so remove it.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: RT-balance, only adjust overload state when changing | Gregory Haskins | 2008-01-25 | 1 | -3/+5

  The overload set/clears were originally idempotent when this logic was
  first implemented. But that is no longer true due to the addition of the
  atomic counter, and this logic was never updated to work properly with
  that change. So only adjust the overload state if it is actually changing,
  to avoid getting out of sync.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
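A sketch of why the transition has to be explicit, with illustrative names
(the real code keeps a cpumask plus an atomic counter): the shared counter
must move exactly once per set and once per clear, so each runqueue remembers
whether it already declared itself overloaded.

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int rto_count;   /* runqueues currently overloaded */

    struct rq_sketch {
            int  rt_nr_running;    /* RT tasks queued on this runqueue */
            bool overloaded;       /* did we already bump rto_count?   */
    };

    void update_rt_overload(struct rq_sketch *rq)
    {
            bool should_overload = rq->rt_nr_running > 1;

            /* Only touch the shared counter on an actual state change. */
            if (should_overload && !rq->overloaded) {
                    atomic_fetch_add(&rto_count, 1);
                    rq->overloaded = true;
            } else if (!should_overload && rq->overloaded) {
                    atomic_fetch_sub(&rto_count, 1);
                    rq->overloaded = false;
            }
    }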
* sched: RT-balance, add new methods to sched_class | Steven Rostedt | 2008-01-25 | 1 | -0/+89

  Dmitry Adamushko found that the current implementation of the RT balancing
  code left out changes to sched_setscheduler and rt_mutex_setprio.

  This patch addresses this issue by adding methods to the scheduler classes
  to handle being switched out of (switched_from) and being switched into
  (switched_to) a sched_class. A method for priority changes (prio_changed)
  is added as well.

  This patch also removes some duplicate logic between rt_mutex_setprio and
  sched_setscheduler.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: RT-balance, replace hooks with pre/post schedule and wakeup methods | Steven Rostedt | 2008-01-25 | 1 | -10/+7

  Make the main sched.c code more agnostic to the scheduler classes: instead
  of having specific hooks in the schedule code for RT class balancing, they
  are replaced with pre_schedule, post_schedule and task_wake_up methods.
  These methods may be used by any of the classes, but currently only the
  sched_rt class implements them.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: fix sched_rt.c:join/leave_domain | Ingo Molnar | 2008-01-25 | 1 | -17/+16

  Fix build bug in sched_rt.c:join/leave_domain and make them only be
  included on SMP builds.

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: only balance our RT tasks within our domain | Gregory Haskins | 2008-01-25 | 1 | -26/+31

  We move the rt-overload data, as the first global, to per-domain
  reclassification. This limits the scope of overload-related cache-line
  bouncing to stay within a specified partition instead of affecting all
  cpus in the system.

  Finally, we limit the scope of find_lowest_cpu searches to the domain
  instead of the entire system. Note that we would always respect domain
  boundaries even without this patch, but we first would scan potentially
  all cpus before whittling the list down. Now we can avoid looking at RQs
  that are out of scope, again reducing cache-line hits.

  Note: In some cases, task->cpus_allowed will effectively reduce our search
  to within our domain. However, I believe there are cases where the
  cpus_allowed mask may be all ones, and therefore we err on the side of
  caution. If it can be optimized later, so be it.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  CC: Christoph Lameter <clameter@sgi.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up schedule_balance_rt() | Ingo Molnar | 2008-01-25 | 1 | -4/+2

  Clean up schedule_balance_rt().

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up pull_rt_task() | Ingo Molnar | 2008-01-25 | 1 | -12/+10

  Clean up pull_rt_task().

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: remove leftover debugging | Ingo Molnar | 2008-01-25 | 1 | -8/+0

  Remove leftover debugging.

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: remove rt_overload() | Ingo Molnar | 2008-01-25 | 1 | -9/+1

  Remove rt_overload() - it's an unnecessary indirection.

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up kernel/sched_rt.c | Ingo Molnar | 2008-01-25 | 1 | -0/+9

  Clean up whitespace damage and missing comments in kernel/sched_rt.c.

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up overlong line in kernel/sched_debug.c | Ingo Molnar | 2008-01-25 | 1 | -2/+4

  Clean up overlong line in kernel/sched_debug.c.

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up find_lock_lowest_rq() | Ingo Molnar | 2008-01-25 | 1 | -4/+5

  Clean up find_lock_lowest_rq().

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up pick_next_highest_task_rt() | Ingo Molnar | 2008-01-25 | 1 | -3/+3

  Clean up pick_next_highest_task_rt().

  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: RT-balance, optimize cpu search | Steven Rostedt | 2008-01-25 | 1 | -13/+36

  This patch removes several cpumask operations by keeping track of the
  first of the CPUs that is of the lowest priority. When the search for the
  lowest priority runqueue is completed, all the bits up to the first CPU
  with the lowest priority runqueue are cleared.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
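A sketch of the search with a plain 64-bit word standing in for the kernel's
cpumask and an array of per-CPU "highest queued prio" values (names and types
are illustrative). As in sched_rt.c, a larger prio number means a less
important runqueue, so the lowest-priority runqueue is the one with the
largest value; CPUs that can no longer win are pruned from the mask as the
scan proceeds, and everything before the first winning CPU is cleared at the
end:

    #include <stdint.h>

    /* Returns the first CPU whose runqueue has the lowest priority, and
     * trims *mask down to the CPUs that tie for that priority. */
    int find_lowest_cpu_sketch(uint64_t *mask, const int *cpu_prio, int nr_cpus)
    {
            int best_prio = -1, best_cpu = -1;

            for (int cpu = 0; cpu < nr_cpus; cpu++) {
                    if (!(*mask & (1ULL << cpu)))
                            continue;
                    if (cpu_prio[cpu] > best_prio) {
                            best_prio = cpu_prio[cpu];   /* new lowest-prio rq */
                            best_cpu  = cpu;             /* first CPU at it    */
                    } else if (cpu_prio[cpu] < best_prio) {
                            *mask &= ~(1ULL << cpu);     /* can never win      */
                    }
            }
            /* Bits below the first lowest-prio CPU cannot be candidates. */
            if (best_cpu > 0)
                    *mask &= ~((1ULL << best_cpu) - 1);
            return best_cpu;
    }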
* sched: RT-balance, optimize | Gregory Haskins | 2008-01-25 | 1 | -7/+18

  We can cheaply track the number of bits set in the cpumask for the lowest
  priority CPUs. Therefore, compute the mask's weight and use it to skip the
  optimal domain search logic when there is only one CPU available.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: break out early if RT task cannot be migrated | Gregory Haskins | 2008-01-25 | 1 | -1/+2

  We don't need to bother searching if the task cannot be migrated.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: RT-balance, avoid overloading | Steven Rostedt | 2008-01-25 | 1 | -4/+16

  This patch changes the search for a run queue by a waking RT task to try
  to pick another runqueue if the currently running task is an RT task.

  The reason is that RT tasks behave differently from normal tasks.
  Preempting a normal task to run an RT task, to keep its cache hot, is
  fine, because the preempted non-RT task may wait on that same runqueue to
  run again unless the migration thread comes along and pulls it off.

  RT tasks behave differently. If one is preempted, it makes an active
  effort to continue to run. So by having a high priority task preempt a
  lower priority RT task, that lower RT task will then quickly try to run on
  another runqueue. This will cause that lower RT task to replace its nice
  hot cache (and TLB) with a completely cold one. This is done in the hope
  that the new high priority RT task will keep its cache hot.

  Remember that this high priority RT task was just woken up. So it may
  likely have been sleeping for several milliseconds, and will end up with a
  cold cache anyway. RT tasks run till they voluntarily stop, or are
  preempted by a higher priority task. This means that it is unlikely that
  the woken RT task will have a hot cache to wake up to. So pushing off a
  lower RT task is just killing its cache for no good reason.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
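A sketch of that wakeup decision, with made-up types and a caller-supplied
lookup standing in for the real lowest-priority-runqueue search; it only
illustrates the policy described above, not the actual select_task_rq_rt():

    struct task_sketch {
            int is_rt;             /* non-zero for SCHED_FIFO/SCHED_RR   */
            int nr_cpus_allowed;   /* weight of the task's affinity mask */
    };

    /* Pick a CPU for a waking RT task 'p'; 'curr' runs on 'this_cpu'. */
    int select_cpu_for_rt_wakeup(const struct task_sketch *p,
                                 const struct task_sketch *curr,
                                 int this_cpu,
                                 int (*find_lowest_cpu)(const struct task_sketch *p))
    {
            /*
             * If an RT task is already running here and the waking task may
             * run elsewhere, look for a lower-priority runqueue instead of
             * kicking the running task off its hot cache.
             */
            if (curr && curr->is_rt && p->nr_cpus_allowed > 1) {
                    int cpu = find_lowest_cpu(p);
                    if (cpu >= 0)
                            return cpu;
            }
            return this_cpu;   /* otherwise the usual choice stands */
    }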
* sched: wake-balance fixes | Gregory Haskins | 2008-01-25 | 1 | -2/+8

  We have logic to detect whether the system has migratable tasks, but we
  are not using it when deciding whether to push tasks away. So we add
  support for considering this new information.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: optimize RT affinity | Gregory Haskins | 2008-01-25 | 1 | -12/+88

  The current code base assumes a relatively flat CPU/core topology and will
  route RT tasks to any CPU fairly equally. In the real world, there are
  various topologies and affinities that govern where a task is best suited
  to run with the smallest amount of overhead. NUMA and multi-core CPUs are
  prime examples of topologies that can impact cache performance.

  Fortunately, Linux is already structured to represent these topologies via
  the sched_domains interface. So we change our RT router to consult a
  combination of topology and affinity policy to best place tasks during
  migration.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: pre-route RT tasks on wakeup | Gregory Haskins | 2008-01-25 | 1 | -0/+19

  In the original patch series that Steven Rostedt and I worked on together,
  we both took different approaches to the low-priority wakeup path. I
  utilized "pre-routing" (push the task away to a less important RQ before
  activating), while Steve utilized a "post-routing" approach. The advantage
  of my approach is that you avoid the overhead of a wasted
  activate/deactivate cycle and peripherally related burdens. The advantage
  of Steve's method is that it neatly solves an issue preventing a "pull"
  optimization from being deployed.

  In the end, we ended up deploying Steve's idea. But it later dawned on me
  that we could get the best of both worlds by deploying both ideas
  together, albeit slightly modified.

  The idea is simple: use a "light-weight" lookup for pre-routing, since we
  only need to approximate a good home for the task. And we also retain the
  post-routing push logic to clean up any inaccuracies caused by a condition
  of "priority mistargeting" introduced by the lightweight lookup. Most of
  the time, the pre-routing should work and yield lower overhead. In the
  cases where it doesn't, the post-router will bat cleanup.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: RT balancing: include current CPU | Gregory Haskins | 2008-01-25 | 1 | -4/+1

  It doesn't hurt if we allow the current CPU to be included in the search.
  We will just simply skip it later if the current CPU turns out to be the
  lowest. We will use this later in the series.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: break out search for RT tasks | Gregory Haskins | 2008-01-25 | 1 | -27/+39

  Isolate the search logic into a function so that it can be used later in
  places other than find_locked_lowest_rq().

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: de-SCHED_OTHER-ize the RT path | Gregory Haskins | 2008-01-25 | 1 | -0/+10

  The current wake-up code path tries to determine if it can optimize the
  wake-up to "this_cpu" by computing load calculations. The problem is that
  these calculations are only relevant to SCHED_OTHER tasks, where load is
  king. For RT tasks, priority is king. So the load calculation is
  completely wasted bandwidth.

  Therefore, we create a new sched_class interface to help with pre-wakeup
  routing decisions and move the load calculation into the CFS class.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: clean up this_rq use in kernel/sched_rt.c | Gregory Haskins | 2008-01-25 | 1 | -11/+11

  "this_rq" is normally used to denote the RQ on the current cpu (i.e.
  "cpu_rq(this_cpu)"). So clean up the usage of this_rq to be more
  consistent with the rest of the code.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: add RT-balance cpu-weight | Gregory Haskins | 2008-01-25 | 1 | -5/+45

  Some RT tasks (particularly kthreads) are bound to one specific CPU. It is
  fairly common for two or more bound tasks to get queued up at the same
  time. Consider, for instance, softirq_timer and softirq_sched. A timer
  goes off in an ISR which schedules softirq_thread to run at RT50. Then the
  timer handler determines that it's time to smp-rebalance the system, so it
  schedules softirq_sched to run. So we are in a situation where we have two
  RT50 tasks queued, and the system will go into rt-overload condition to
  request other CPUs for help.

  This causes two problems in the current code:

  1) If a high-priority bound task and a low-priority unbounded task queue
     up behind the running task, we will fail to ever relocate the unbounded
     task because we terminate the search on the first unmovable task.

  2) We spend precious futile cycles in the fast-path trying to pull
     overloaded tasks over. It is therefore optimal to strive to avoid the
     overhead altogether if we can cheaply detect the condition before
     overload even occurs.

  This patch tries to achieve this optimization by utilizing the hamming
  weight of the task->cpus_allowed mask. A weight of 1 indicates that the
  task cannot be migrated. We will then utilize this information to skip
  non-migratable tasks and to eliminate unnecessary rebalance attempts.

  We introduce a per-rq variable to count the number of migratable tasks
  that are currently running. We only go into overload if we have more than
  one rt task, AND at least one of them is migratable.

  In addition, we introduce a per-task variable to cache the cpus_allowed
  weight, since the hamming calculation is probably relatively expensive. We
  only update the cached value when the mask is updated, which should be
  relatively infrequent, especially compared to scheduling frequency in the
  fast path.

  Signed-off-by: Gregory Haskins <ghaskins@novell.com>
  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
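A sketch of that bookkeeping with illustrative field names (a uint64_t plays
the role of the cpumask and the GCC-style __builtin_popcountll the role of
the kernel's weight helper): the popcount is cached when the affinity mask
changes, and the enqueue path only needs cheap integer tests to decide
overload.

    #include <stdint.h>

    struct task_aff_sketch {
            uint64_t cpus_allowed;     /* affinity bitmask             */
            int      nr_cpus_allowed;  /* cached popcount of the mask  */
    };

    struct rq_count_sketch {
            int rt_nr_running;    /* RT tasks queued on this runqueue   */
            int rt_nr_migratory;  /* of those, how many could move away */
            int overloaded;
    };

    /* Recompute the cached weight only when the mask itself changes. */
    void set_cpus_allowed_sketch(struct task_aff_sketch *p, uint64_t new_mask)
    {
            p->cpus_allowed    = new_mask;
            p->nr_cpus_allowed = __builtin_popcountll(new_mask);
    }

    void enqueue_rt_sketch(struct rq_count_sketch *rq,
                           const struct task_aff_sketch *p)
    {
            rq->rt_nr_running++;
            if (p->nr_cpus_allowed > 1)
                    rq->rt_nr_migratory++;
            /* Overloaded only with >1 RT task AND at least one migratable. */
            rq->overloaded = rq->rt_nr_running > 1 && rq->rt_nr_migratory > 0;
    }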
* sched: disable standard balancer for RT tasks | Steven Rostedt | 2008-01-25 | 1 | -91/+4

  Since we now take an active approach to load balancing, we don't need to
  balance RT tasks via the normal task balancer. In fact, this code was
  found to pull RT tasks away from CPUs that the active movement performed,
  resulting in large latencies.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: push RT tasks from overloaded CPUs | Steven Rostedt | 2008-01-25 | 1 | -0/+10

  This patch adds pushing of overloaded RT tasks from a runqueue that is
  having tasks (most likely RT tasks) added to the run queue.

  TODO: We don't cover the case of waking of new RT tasks (yet).

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: pull RT tasks from overloaded runqueues | Steven Rostedt | 2008-01-25 | 1 | -11/+176

  This patch adds the algorithm to pull tasks from RT overloaded runqueues.

  When a pull is initiated, all overloaded runqueues are examined for an RT
  task that is higher in prio than the highest prio task queued on the
  target runqueue. If another runqueue holds an RT task of higher prio than
  the highest prio task on the target runqueue, it is pulled to the target
  runqueue.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
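A sketch of the pull loop with illustrative names and integer priorities
where a smaller number means more important, mirroring the kernel's
convention; the helpers next_pushable_prio() and move_task() stand in for
the real queue walking and migration, which are elided here.

    /* Priority convention: lower value == more important, as in the kernel. */
    struct rq_pull_sketch {
            int overloaded;      /* has more than one RT task queued        */
            int highest_prio;    /* best queued RT prio, or 100 when no RT  */
            int (*next_pushable_prio)(const struct rq_pull_sketch *rq);
    };

    /* Hypothetical helper: actually migrating a task is elided. */
    void move_task(struct rq_pull_sketch *src, struct rq_pull_sketch *dst);

    /* Pull any RT task that beats what the target runqueue already has. */
    void pull_rt_tasks_sketch(struct rq_pull_sketch *this_rq,
                              struct rq_pull_sketch **rqs, int nr_rqs)
    {
            for (int i = 0; i < nr_rqs; i++) {
                    struct rq_pull_sketch *src = rqs[i];
                    int prio;

                    if (src == this_rq || !src->overloaded)
                            continue;

                    prio = src->next_pushable_prio(src);
                    /* Pull only if the candidate is more important than
                     * anything already queued here. */
                    if (prio < this_rq->highest_prio)
                            move_task(src, this_rq);
            }
    }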
* sched: add rt-overload tracking | Steven Rostedt | 2008-01-25 | 1 | -0/+36

  This patch adds an RT overload accounting system. When a runqueue has more
  than one RT task queued, it is marked as overloaded. That is, it is a
  candidate to have RT tasks pulled from it.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: add RT task pushing | Steven Rostedt | 2008-01-25 | 1 | -1/+224

  This patch adds an algorithm to push extra RT tasks off a run queue to
  other CPU runqueues.

  When more than one RT task is added to a run queue, this algorithm takes
  an assertive approach to push the RT tasks that are not running onto other
  run queues that have lower priority. The way this works is that the
  highest RT task that is not running is looked at and we examine the
  runqueues on the CPUs for that task's affinity mask. We find the runqueue
  with the lowest prio in the CPU affinity of the picked task, and if it is
  lower in prio than the picked task, we push the task onto that CPU
  runqueue.

  We continue pushing RT tasks off the current runqueue until we don't push
  any more. The algorithm stops when the next highest RT task can't preempt
  any other processes on other CPUs.

  TODO: The algorithm may stop when there are still RT tasks that can be
  migrated. Specifically, if the highest non-running RT task's CPU affinity
  is restricted to CPUs that are running higher priority tasks, there may be
  a lower priority task queued that has an affinity with a CPU that is
  running a lower priority task that it could be migrated to. This patch set
  does not address this issue.

  Note: checkpatch reveals two over-80-character instances. I'm not sure
  that breaking them up will help visually, so I left them as is.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
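A sketch of the push loop described above, again with illustrative names and
the kernel's "lower number is more important" priority convention; the
lookup of the next pushable task and the actual migration are reduced to
hypothetical helpers.

    #include <stddef.h>

    /* Hypothetical helpers, bodies elided: */
    struct task_push_sketch { int prio; };
    struct task_push_sketch *pick_next_pushable_task(void);
    int find_lowest_rq_prio_for(const struct task_push_sketch *p, int *cpu);
    void migrate_to(struct task_push_sketch *p, int cpu);

    /* Keep pushing queued, non-running RT tasks while a better home exists. */
    void push_rt_tasks_sketch(void)
    {
            struct task_push_sketch *p;

            while ((p = pick_next_pushable_task()) != NULL) {
                    int cpu;
                    /* prio of the least important runqueue p may run on
                     * (numerically the largest value found in its affinity). */
                    int target_prio = find_lowest_rq_prio_for(p, &cpu);

                    /* Stop once nothing p may run on is less important
                     * than p itself. */
                    if (target_prio <= p->prio)
                            break;
                    migrate_to(p, cpu);
            }
    }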
* sched: track highest prio task queued | Steven Rostedt | 2008-01-25 | 1 | -0/+18

  This patch adds accounting to each runqueue to keep track of the highest
  prio task queued on the run queue. We only care about RT tasks, so if the
  run queue does not contain any active RT tasks its priority will be
  considered MAX_RT_PRIO.

  This information will be used for later patches.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
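A sketch of the bookkeeping with illustrative names: highest_prio follows the
kernel convention where a smaller number is more important and the
MAX_RT_PRIO value means "no RT task queued"; recomputing the value on dequeue
(a scan of the remaining tasks) is elided here.

    #define MAX_RT_PRIO_SK 100   /* stand-in for MAX_RT_PRIO */

    struct rq_prio_sketch {
            int rt_nr_running;
            int highest_prio;    /* best queued RT prio, MAX_RT_PRIO_SK if none */
    };

    void rt_prio_enqueue(struct rq_prio_sketch *rq, int prio)
    {
            rq->rt_nr_running++;
            if (prio < rq->highest_prio)    /* numerically lower == better */
                    rq->highest_prio = prio;
    }

    void rt_prio_dequeue(struct rq_prio_sketch *rq, int prio)
    {
            rq->rt_nr_running--;
            if (rq->rt_nr_running == 0)
                    rq->highest_prio = MAX_RT_PRIO_SK;
            /* else: if prio equalled rq->highest_prio, the new highest has
             * to be looked up among the remaining queued tasks (not shown). */
            (void)prio;
    }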
* sched: count # of queued RT tasks | Steven Rostedt | 2008-01-25 | 1 | -0/+17

  This patch adds accounting to keep track of the number of RT tasks running
  on a runqueue. This information will be used in later patches.

  Signed-off-by: Steven Rostedt <srostedt@redhat.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: group scheduling, change how cpu load is calculated | Srivatsa Vaddagiri | 2008-01-25 | 1 | -0/+2

  This patch changes how the cpu load exerted by fair_sched_class tasks is
  calculated. Load exerted by fair_sched_class tasks on a cpu is now a
  summation of the group weights, rather than summation of task weights.
  Weight exerted by a group on a cpu is dependent on the shares allocated to
  it.

  This version of the patch has a minor impact on code size, but should have
  no runtime/functional impact for !CONFIG_FAIR_GROUP_SCHED.

  Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: rt: account the cpu time during the tick | Peter Zijlstra | 2007-12-20 | 1 | -0/+2

  Realtime tasks would not account their runtime during ticks, which would
  lead to:

        struct sched_param param = { .sched_priority = 10 };
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);

        while (1) ;

  not showing up in top.

  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
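A sketch of the accounting idea with illustrative names: the scheduler tick
charges the time that elapsed since the last update to the running RT task,
so a busy loop like the one above accumulates sum_exec_runtime and becomes
visible to tools such as top.

    struct rt_se_sketch {
            unsigned long long exec_start;        /* when we last charged, ns */
            unsigned long long sum_exec_runtime;  /* total accounted, ns      */
    };

    /* Charge everything that ran since exec_start up to 'now'. */
    void update_curr_rt_sketch(struct rt_se_sketch *se, unsigned long long now)
    {
            unsigned long long delta = now - se->exec_start;

            se->sum_exec_runtime += delta;
            se->exec_start = now;
    }

    /* The per-tick hook simply reuses the same accounting routine. */
    void task_tick_rt_sketch(struct rt_se_sketch *curr, unsigned long long now)
    {
            update_curr_rt_sketch(curr, now);
    }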
* sched: cpu accounting controller (V2) | Srivatsa Vaddagiri | 2007-12-02 | 1 | -0/+1

  Commit cfb5285660aad4931b2ebbfa902ea48a37dfffa1 removed a useful feature
  for us, which provided a cpu accounting resource controller. This feature
  would be useful if someone wants to group tasks only for accounting
  purposes and doesn't really want to exercise any control over their cpu
  consumption.

  The patch below reintroduces the feature. It is based on Paul Menage's
  original patch (commit 62d0df64065e7c135d0002f069444fbdfc64768f), with
  these differences:

  - Removed load average information. I felt it needs more thought (esp. to
    deal with SMP and virtualized platforms) and can be added for 2.6.25
    after more discussion.
  - Convert group cpu usage to be nanosecond accurate (as the rest of the
    cfs stats are) and invoke cpuacct_charge() from the respective scheduler
    classes.
  - Make accounting scalable on SMP systems by splitting the usage counter
    to be per-cpu.
  - Move the code from kernel/cpu_acct.c to kernel/sched.c (since the code
    is not big enough to warrant a new file, and also this rightly needs to
    live inside the scheduler; things like accessing rq->lock while reading
    cpu usage become easier if the code lives in kernel/sched.c).

  The patch also modifies the cpu controller not to provide the same
  accounting information.

  Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>

  Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran some
  simple tests like cpuspin (spin on the cpu), ran several tasks in the same
  group and timed them. Compared their time stamps with cpuacct.usage.

  Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
  Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: isolate SMP balancing code a bit more | Peter Williams | 2007-10-24 | 1 | -0/+4

  At the moment, a lot of load balancing code that is irrelevant to non-SMP
  systems gets included during non-SMP builds.

  This patch addresses this issue and reduces the binary size on non-SMP
  systems:

       text    data     bss     dec     hex filename
      10983      28    1192   12203    2fab sched.o.before
      10739      28    1192   11959    2eb7 sched.o.after

  Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: reduce balance-tasks overhead | Peter Williams | 2007-10-24 | 1 | -9/+19

  At the moment, balance_tasks() provides low level functionality for both
  move_tasks() and move_one_task() (indirectly) via the load_balance()
  function (in the sched_class interface), which also provides dual
  functionality. This dual functionality complicates the interfaces and
  internal mechanisms and increases the run time overhead of operations that
  are called with two run queue locks held.

  This patch addresses this issue and reduces the overhead of these
  operations.

  Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: tidy up SCHED_RR | Dmitry Adamushko | 2007-10-15 | 1 | -1/+1

  - make timeslices of SCHED_RR tasks constant and not dependent on the
    task's static_prio [1];
  - remove obsolete code (timeslice related bits);
  - make sched_rr_get_interval() return something more meaningful [2] for
    SCHED_OTHER tasks.

  [1] according to the following link, it's not compliant with SUSv3 (not
  sure though, what is the reference for us :-)
  http://lkml.org/lkml/2007/3/7/656

  [2] the interval is dynamic and can be depicted as follows: "should a task
  be one of the runnable tasks at this particular moment, it would expect to
  run for this interval of time before being re-scheduled by the scheduler
  tick". (i.e. it's more precise if a task is runnable at the moment) yeah,
  this seems to require task_rq_lock/unlock() but this is not a hot path.

  results:

  (SCHED_FIFO)
  dimm@earth:~/storage/prog$ sudo chrt -f 10 ./rr_interval
  time_slice: 0 : 0

  (SCHED_RR)
  dimm@earth:~/storage/prog$ sudo chrt 10 ./rr_interval
  time_slice: 0 : 99984800

  (SCHED_NORMAL)
  dimm@earth:~/storage/prog$ ./rr_interval
  time_slice: 0 : 19996960

  (SCHED_NORMAL + a cpu_hog of similar 'weight' on the same CPU --- so
  should be a half of the previous result)
  dimm@earth:~/storage/prog$ taskset 1 ./rr_interval
  time_slice: 0 : 9998480

  Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>
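The rr_interval test program quoted above is not part of the patch; a minimal
user-space equivalent, assuming only the POSIX sched_rr_get_interval()
interface and the "time_slice: sec : nsec" output format seen in the runs
above, could look like this (the scheduling policy is set externally via
chrt, as in the quoted commands):

    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
            struct timespec ts;

            /* Query the round-robin interval of the calling process (pid 0). */
            if (sched_rr_get_interval(0, &ts) != 0) {
                    perror("sched_rr_get_interval");
                    return 1;
            }
            printf("time_slice: %ld : %ld\n", (long)ts.tv_sec, ts.tv_nsec);
            return 0;
    }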
* sched: uninline scheduler | Alexey Dobriyan | 2007-10-15 | 1 | -1/+1

  * save ~300 bytes
  * activate_idle_task() was moved to avoid a warning

  bloat-o-meter output:

  add/remove: 6/0 grow/shrink: 0/16 up/down: 438/-733 (-295) <===

  function                    old     new   delta
  __enqueue_entity              -     165    +165
  finish_task_switch            -     110    +110
  update_curr_rt                -      79     +79
  __load_balance_iterator       -      32     +32
  __task_rq_unlock              -      28     +28
  find_process_by_pid           -      24     +24
  do_sched_setscheduler       133     123     -10
  sys_sched_rr_get_interval   176     165     -11
  sys_sched_getparam          156     145     -11
  normalize_rt_tasks          482     470     -12
  sched_getaffinity           112      99     -13
  sys_sched_getscheduler       86      72     -14
  sched_setaffinity           226     212     -14
  sched_setscheduler          666     642     -24
  load_balance_start_fair      33       9     -24
  load_balance_next_fair       33       9     -24
  dequeue_task_rt             133      67     -66
  put_prev_task_rt             97      28     -69
  schedule_tail               133      50     -83
  schedule                    682     594     -88
  enqueue_entity              499     366    -133
  task_new_fair               317     180    -137

  Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
  Signed-off-by: Ingo Molnar <mingo@elte.hu>