summaryrefslogtreecommitdiffstats
path: root/include/linux/hardirq.h
Commit message (Collapse)AuthorAgeFilesLines
* genirq: Provide synchronize_hardirq()Thomas Gleixner2014-02-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | synchronize_irq() waits for hard irq and threaded handlers to complete before returning. For some special cases we only need to make sure that the hard interrupt part of the irq line is not in progress when we disabled the - possibly shared - interrupt at the device level. A proper use case for this was provided by Russell. The sdhci driver requires some irq triggered functions to be run in thread context. The current implementation of the thread context is a sdio private kthread construct, which has quite some shortcomings. These can be avoided when the thread is directly associated to the device interrupt via the generic threaded irq infrastructure. Though there is a corner case related to run time power management where one side disables the device interrupts at the device level and needs to make sure, that an already running hard interrupt handler has completed before proceeding further. Though that hard interrupt handler might wake the associated thread, which in turn can request the runtime PM to reenable the device. Using synchronize_irq() leads to an immediate deadlock of the irq thread waiting for the PM lock and the synchronize_irq() waiting for the irq thread to complete. Due to the fact that it is sufficient for this case to ensure that no hard irq handler is executing a new function which avoids the check for the thread is required. Add a function, which just monitors the hard irq parts and ignores the threaded handlers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Russell King <linux@arm.linux.org.uk> Cc: Chris Ball <chris@printf.net> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140215003823.653236081@linutronix.de
* sched/preempt, locking: Rework local_bh_{dis,en}able()Peter Zijlstra2014-01-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently local_bh_disable() is out-of-line for no apparent reason. So inline it to save a few cycles on call/return nonsense, the function body is a single add on x86 (a few loads and store extra on load/store archs). Also expose two new local_bh functions: __local_bh_{dis,en}able_ip(unsigned long ip, unsigned int cnt); Which implement the actual local_bh_{dis,en}able() behaviour. The next patch uses the exposed @cnt argument to optimize bh lock functions. With build fixes from Jacob Pan. Cc: rjw@rjwysocki.net Cc: rui.zhang@intel.com Cc: jacob.jun.pan@linux.intel.com Cc: Mike Galbraith <bitbucket@online.de> Cc: hpa@zytor.com Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: lenb@kernel.org Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched: Extract the basic add/sub preempt_count modifiersPeter Zijlstra2013-09-251-4/+4
| | | | | | | | | | | | | | | | | | | | | | Rewrite the preempt_count macros in order to extract the 3 basic preempt_count value modifiers: __preempt_count_add() __preempt_count_sub() and the new: __preempt_count_dec_and_test() And since we're at it anyway, replace the unconventional $op_preempt_count names with the more conventional preempt_count_$op. Since these basic operators are equivalent to the previous _notrace() variants, do away with the _notrace() versions. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Remove GENERIC_HARDIRQ config optionMartin Schwidefsky2013-09-131-4/+0
| | | | | | | | After the last architecture switched to generic hard irqs the config options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code for !CONFIG_GENERIC_HARDIRQS can be removed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* hardirq: Split preempt count mask definitionsFrederic Weisbecker2013-08-141-116/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to use static keys with vtime APIs, we'll need to add static keys headers to vtime.h hardirq.h then becomes a problem because it needs vtime.h for irqtime accounting in irq_enter/irq_exit, but it's often included just to get the irq mask definitions in the task preempt_count field and the APIs that come along: in_interrupt(), in_hardirq(), etc... Some very low level arch headers sometimes need these masks and APIs such as arch/m68k/include/asm/irqflags.h for example. But they don't want to include hardirq.h if vtime.h, jump_label.h and even workqueue.h come along. Including such bloated high level header from arch headers can quickly result in circular headers dependency that crash the build. So let's split hardirq.h in two parts: * preempt_mask.h that gathers all the preempt_count definitions and the APIs associated. This one is considered low level and can be safely included anywhere. * hardirq.h that includes the previous one. It defines the irq entry/exit APIs. To avoid future circular headers dependencies, the preempt_mask.h inclusion can replace hardirq.h on files that don't implement irq low level handlers but just need the atomic/context check APIs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Kevin Hilman <khilman@linaro.org>
* rcu: Remove TINY_PREEMPT_RCUPaul E. McKenney2013-06-101-1/+1
| | | | | | | | | | | | | | | | TINY_PREEMPT_RCU adds significant code and complexity, but does not offer commensurate benefits. People currently using TINY_PREEMPT_RCU can get much better memory footprint with TINY_RCU, or, if they really need preemptible RCU, they can use TREE_PREEMPT_RCU with a relatively minor degradation in memory footprint. Please note that this move has been widely publicized on LKML (https://lkml.org/lkml/2012/11/12/545) and on LWN (http://lwn.net/Articles/541037/). This commit therefore removes TINY_PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Updated to eliminate #else in rcutiny.h as suggested by Josh ] Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* irq: Remove IRQ_EXIT_OFFSET workaroundFrederic Weisbecker2013-02-221-2/+0
| | | | | | | | | | | | | | | | | | | | | The IRQ_EXIT_OFFSET trick was used to make sure the irq doesn't get preempted after we substract the HARDIRQ_OFFSET until we are entirely done with any code in irq_exit(). This workaround was necessary because some archs may call irq_exit() with irqs enabled and there is still some code in the end of this function that is not covered by the HARDIRQ_OFFSET but want to stay non-preemptible. Now that irq are always disabled in irq_exit(), the whole code is guaranteed not to be preempted. We can thus remove this hack. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2013-02-191-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Main changes: - scheduler side full-dynticks (user-space execution is undisturbed and receives no timer IRQs) preparation changes that convert the cputime accounting code to be full-dynticks ready, from Frederic Weisbecker. - Initial sched.h split-up changes, by Clark Williams - select_idle_sibling() performance improvement by Mike Galbraith: " 1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs " - sched_rr_get_interval() ABI fix/change. We think this detail is not used by apps (so it's not an ABI in practice), but lets keep it under observation. - misc RT scheduling cleanups, optimizations" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h> cputime: Remove irqsave from seqlock readers sched, powerpc: Fix sched.h split-up build failure cputime: Restore CPU_ACCOUNTING config defaults for PPC64 sched/rt: Move rt specific bits into new header file sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice sched: Move sched.h sysctl bits into separate header sched: Fix signedness bug in yield_to() sched: Fix select_idle_sibling() bouncing cow syndrome sched/rt: Further simplify pick_rt_task() sched/rt: Do not account zero delta_exec in update_curr_rt() cputime: Safely read cputime of full dynticks CPUs kvm: Prepare to add generic guest entry/exit callbacks cputime: Use accessors to read task cputime stats cputime: Allow dynamic switch between tick/virtual based cputime accounting cputime: Generic on-demand virtual cputime accounting cputime: Move default nsecs_to_cputime() to jiffies based cputime file cputime: Librarize per nsecs resolution cputime definitions cputime: Avoid multiplication overflow on utime scaling context_tracking: Export context state for generic vtime ... Fix up conflict in kernel/context_tracking.c due to comment additions.
| * cputime: Safely read cputime of full dynticks CPUsFrederic Weisbecker2013-01-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While remotely reading the cputime of a task running in a full dynticks CPU, the values stored in utime/stime fields of struct task_struct may be stale. Its values may be those of the last kernel <-> user transition time snapshot and we need to add the tickless time spent since this snapshot. To fix this, flush the cputime of the dynticks CPUs on kernel <-> user transition and record the time / context where we did this. Then on top of this snapshot and the current time, perform the fixup on the reader side from task_times() accessors. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> [fixed kvm module related build errors] Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
* | tracing/lockdep: Disable lockdep first in entering NMISteven Rostedt2013-01-211-2/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When function tracing with either debug locks enabled or tracing preempt disabled, the add_preempt_count() is traced. This is an issue with lockdep and function tracing. As function tracing can disable interrupts, and lockdep records that change, lockdep may not be able to handle this recursion if it happens from an NMI context. The first thing that an NMI does is: #define nmi_enter() \ do { \ ftrace_nmi_enter(); \ BUG_ON(in_nmi()); \ add_preempt_count(NMI_OFFSET + HARDIRQ_OFFSET); \ lockdep_off(); \ rcu_nmi_enter(); \ trace_hardirq_enter(); \ } while (0) When the add_preempt_count() is traced, and the tracing callback disables interrupts, it will jump into the lockdep code. There's some places in lockdep that can't handle this re-entrance, and causes lockdep to fail. As the lockdep_off() (and lockdep_on) is a simple: void lockdep_off(void) { current->lockdep_recursion++; } and is never traced, it can be called first in nmi_enter() and lockdep_on() last in nmi_exit(). Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* cputime: Specialize irq vtime hooksFrederic Weisbecker2012-10-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With CONFIG_VIRT_CPU_ACCOUNTING, when vtime_account() is called in irq entry/exit, we perform a check on the context: if we are interrupting the idle task we account the pending cputime to idle, otherwise account to system time or its sub-areas: tsk->stime, hardirq time, softirq time, ... However this check for idle only concerns the hardirq entry and softirq entry: * Hardirq may directly interrupt the idle task, in which case we need to flush the pending CPU time to idle. * The idle task may be directly interrupted by a softirq if it calls local_bh_enable(). There is probably no such call in any idle task but we need to cover every case. Ksoftirqd is not concerned because the idle time is flushed on context switch and softirq in the end of hardirq have the idle time already flushed from the hardirq entry. In the other cases we always account to system/irq time: * On hardirq exit we account the time to hardirq time. * On softirq exit we account the time to softirq time. To optimize this and avoid the indirect call to vtime_account() and the checks it performs, specialize the vtime irq APIs and only perform the check on irq entry. Irq exit can directly call vtime_account_system(). CONFIG_IRQ_TIME_ACCOUNTING behaviour doesn't change and directly maps to its own vtime_account() implementation. One may want to take benefits from the new APIs to optimize irq time accounting as well in the future. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
* vtime: Gather vtime declarations to their own header fileFrederic Weisbecker2012-10-291-10/+1
| | | | | | | | | | | | These APIs are scattered around and are going to expand a bit. Let's create a dedicated header file for sanity. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
* cputime: Use a proper subsystem naming for vtime related APIsFrederic Weisbecker2012-09-251-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Use a naming based on vtime as a prefix for virtual based cputime accounting APIs: - account_system_vtime() -> vtime_account() - account_switch_vtime() -> vtime_task_switch() It makes it easier to allow for further declension such as vtime_account_system(), vtime_account_idle(), ... if we want to find out the context we account to from generic code. This also make it better to know on which subsystem these APIs refer to. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
* sched: Fix comment about PREEMPT_ACTIVE bit locationSrivatsa S. Bhat2012-07-261-1/+1
| | | | | | | | | | | | PREEMPT_ACTIVE flag is bit 27, not 28. Fix the comment. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: paulmck@linux.vnet.ibm.com Cc: josh@joshtriplett.org Cc: rostedt@goodmis.org Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120720192459.6149.14821.stgit@srivatsabhat.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* rcu: Track idleness independent of idle tasksPaul E. McKenney2011-12-111-21/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Earlier versions of RCU used the scheduling-clock tick to detect idleness by checking for the idle task, but handled idleness differently for CONFIG_NO_HZ=y. But there are now a number of uses of RCU read-side critical sections in the idle task, for example, for tracing. A more fine-grained detection of idleness is therefore required. This commit presses the old dyntick-idle code into full-time service, so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is always invoked at the beginning of an idle loop iteration. Similarly, rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked at the end of an idle-loop iteration. This allows the idle task to use RCU everywhere except between consecutive rcu_idle_enter() and rcu_idle_exit() calls, in turn allowing architecture maintainers to specify exactly where in the idle loop that RCU may be used. Because some of the userspace upcall uses can result in what looks to RCU like half of an interrupt, it is not possible to expect that the irq_enter() and irq_exit() hooks will give exact counts. This patch therefore expands the ->dynticks_nesting counter to 64 bits and uses two separate bitfields to count process/idle transitions and interrupt entry/exit transitions. It is presumed that userspace upcalls do not happen in the idle loop or from usermode execution (though usermode might do a system call that results in an upcall). The counter is hard-reset on each process/idle transition, which avoids the interrupt entry/exit error from accumulating. Overflow is avoided by the 64-bitness of the ->dyntick_nesting counter. This commit also adds warnings if a non-idle task asks RCU to enter idle state (and these checks will need some adjustment before applying Frederic's OS-jitter patches (http://lkml.org/lkml/2011/10/7/246). In addition, validation of ->dynticks and ->dynticks_nesting is added. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* sched: Isolate preempt counting in its own config optionFrederic Weisbecker2011-06-101-2/+2
| | | | | | | | | | | | | | | | | Create a new CONFIG_PREEMPT_COUNT that handles the inc/dec of preempt count offset independently. So that the offset can be updated by preempt_disable() and preempt_enable() even without the need for CONFIG_PREEMPT beeing set. This prepares to make CONFIG_DEBUG_SPINLOCK_SLEEP working with !CONFIG_PREEMPT where it currently doesn't detect code that sleeps inside explicit preemption disabled sections. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
* BKL: That's all, folksArnd Bergmann2011-03-051-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes the implementation of the big kernel lock, at last. A lot of people have worked on this in the past, I so the credit for this patch should be with everyone who participated in the hunt. The names on the Cc list are the people that were the most active in this, according to the recorded git history, in alphabetical order. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Alan Cox <alan@linux.intel.com> Cc: Alessio Igor Bogani <abogani@texware.it> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Hendry <andrew.hendry@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jan Blunck <jblunck@infradead.org> Cc: John Kacur <jkacur@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Oliver Neukum <oliver@neukum.org> Cc: Paul Menage <menage@google.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
* hardirq.h: needs sched.h if using BKLLinus Torvalds2010-11-181-0/+1
| | | | | | | | | | | | | | | | | | | This really isn't the right thing to do, and strictly speaking we should have the BKL depth count in the thread info right next to the preempt count. The two really do go together. However, since that would involve a patch to all architectures, and the BKL is finally going away, it's simply not worth the effort to do the RightThing(tm). Just re-instate the <linux/sched.h> include that we used to get accidentally from the smp_lock.h one. This is all fallout from the same old "BKL: remove extraneous #include <smp_lock.h>" commit. Reported-by: Ingo Molnar <mingo@elte.hu> Tested-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hardirq.h: remove now-empty #ifdef/#endif pairLinus Torvalds2010-11-171-2/+0
| | | | | | | | | | | Commit 451a3c24b013 ("BKL: remove extraneous #include <smp_lock.h>") removed the #include line that was the only thing that was surrounded by the #ifdef/#endif. So now that #ifdef is guarding nothing at all. Just remove it. Reported-by: Byeong-ryeol Kim <brofkims@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix build failure due to hwirq.h needing smp_lock.hLinus Torvalds2010-11-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Arnd Bergmann did an automated scripting run to find left-over instances of <linux/smp_lock.h>, and had made it trigger it on the normal BKL use of lock_kernel and unlock_lernel (and apparently release_kernel_lock and reacquire_kernel_lock too, used by the scheduler). That resulted in commit 451a3c24b013 ("BKL: remove extraneous #include <smp_lock.h>"). However, hardirq.h was the only remaining user of the old 'kernel_locked()' interface, and Arnd's script hadn't checked for that. So depending on your configuration and what header files had been included, you would get errors like "implicit declaration of function 'kernel_locked'" during the build. The right fix is not to just re-instate the smp_lock.h include - it is to just remove 'kernel_locked()' entirely, since the only use was this one special low-level detail. Just make hardirq.h do it directly. In fact this simplifies and clarifies the code, because some trivial analysis makes it clear that hardirq.h only ever used _one_ of the two definitions of kernel_locked(), so we can remove the other one entirely. Reported-by: Zimny Lech <napohybelskurwysynom2010@gmail.com> Reported-and-acked-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* BKL: remove extraneous #include <smp_lock.h>Arnd Bergmann2010-11-171-1/+0
| | | | | | | | | | The big kernel lock has been removed from all these files at some point, leaving only the #include. Remove this too as a cleanup. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* preempt: fix kernel build with !CONFIG_BKLArnd Bergmann2010-11-021-2/+6
| | | | | | | | | | | | The preempt count logic tries to take the BKL into account, which breaks when CONFIG_BKL is not set. Use the same preempt_count offset that we use without CONFIG_PREEMPT when CONFIG_BKL is disabled. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-and-tested-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-irqflagsLinus Torvalds2010-10-211-1/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-irqflags: Fix IRQ flag handling naming MIPS: Add missing #inclusions of <linux/irq.h> smc91x: Add missing #inclusion of <linux/irq.h> Drop a couple of unnecessary asm/system.h inclusions SH: Add missing consts to sys_execve() declaration Blackfin: Rename IRQ flags handling functions Blackfin: Add missing dep to asm/irqflags.h Blackfin: Rename DES PC2() symbol to avoid collision Blackfin: Split the BF532 BFIN_*_FIO_FLAG() functions to their own header Blackfin: Split PLL code from mach-specific cdef headers
| * Drop a couple of unnecessary asm/system.h inclusionsDavid Howells2010-10-071-1/+0
| | | | | | | | | | | | | | | | | | | | Drop inclusions of asm/system.h from linux/hardirq.h and linux/list.h as they're no longer required and prevent the M68K arch's IRQ flag handling macros from being made into inlined functions due to circular dependencies. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
* | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2010-10-211-1/+8
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits) sched: Export account_system_vtime() sched: Call tick_check_idle before __irq_enter sched: Remove irq time from available CPU power sched: Do not account irq time to current task x86: Add IRQ_TIME_ACCOUNTING sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time sched: Add a PF flag for ksoftirqd identification sched: Consolidate account_system_vtime extern declaration sched: Fix softirq time accounting sched: Drop group_capacity to 1 only if local group has extra capacity sched: Force balancing on newidle balance if local group has capacity sched: Set group_imb only a task can be pulled from the busiest cpu sched: Do not consider SCHED_IDLE tasks to be cache hot sched: Drop all load weight manipulation for RT tasks sched: Create special class for stop/migrate work sched: Unindent labels sched: Comment updates: fix default latency and granularity numbers tracing/sched: Add sched_pi_setprio tracepoint sched: Give CPU bound RT tasks preference sched: Try not to migrate higher priority RT tasks ...
| * | sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq timeVenkatesh Pallipadi2010-10-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does the fine granularity accounting of user, system, hardirq, softirq times. Adding that option on archs like x86 will be challenging however, given the state of TSC reliability on various platforms and also the overhead it will add in syscall entry exit. Instead, add a lighter variant that only does finer accounting of hardirq and softirq times, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. This accounting is based on sched_clock, with the code being generic. So, other archs may find it useful as well. This patch just adds the core logic and does not enable this logic yet. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched: Consolidate account_system_vtime extern declarationVenkatesh Pallipadi2010-10-181-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just a minor cleanup patch that makes things easier to the following patches. No functionality change in this patch. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | sched: Fix softirq time accountingVenkatesh Pallipadi2010-10-181-0/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Peter Zijlstra found a bug in the way softirq time is accounted in VIRT_CPU_ACCOUNTING on this thread: http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html The problem is, softirq processing uses local_bh_disable internally. There is no way, later in the flow, to differentiate between whether softirq is being processed or is it just that bh has been disabled. So, a hardirq when bh is disabled results in time being wrongly accounted as softirq. Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING as well. As account_system_time() in normal tick based accouting also uses softirq_count, which will be set even when not in softirq with bh disabled. Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq processing. The patch below does that and adds API in_serving_softirq() which returns whether we are currently processing softirq or not. Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c to in_serving_softirq. Looks like many usages of in_softirq really want in_serving_softirq. Those changes can be made individually on a case by case basis. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* / rcu: Add a TINY_PREEMPT_RCUPaul E. McKenney2010-08-201-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement a small-memory-footprint uniprocessor-only implementation of preemptible RCU. This implementation uses but a single blocked-tasks list rather than the combinatorial number used per leaf rcu_node by TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies processing. This version also takes advantage of uniprocessor execution to accelerate grace periods in the case where there are no readers. The general design is otherwise broadly similar to that of TREE_PREEMPT_RCU. This implementation is a step towards having RCU implementation driven off of the SMP and PREEMPT kernel configuration variables, which can happen once this implementation has accumulated sufficient experience. Removed ACCESS_ONCE() from __rcu_read_unlock() and added barrier() as suggested by Steve Rostedt in order to avoid the compiler-reordering issue noted by Mathieu Desnoyers (http://lkml.org/lkml/2010/8/16/183). As can be seen below, CONFIG_TINY_PREEMPT_RCU represents almost 5Kbyte savings compared to CONFIG_TREE_PREEMPT_RCU. Of course, for non-real-time workloads, CONFIG_TINY_RCU is even better. CONFIG_TREE_PREEMPT_RCU text data bss dec filename 13 0 0 13 kernel/rcupdate.o 6170 825 28 7023 kernel/rcutree.o ---- 7026 Total CONFIG_TINY_PREEMPT_RCU text data bss dec filename 13 0 0 13 kernel/rcupdate.o 2081 81 8 2170 kernel/rcutiny.o ---- 2183 Total CONFIG_TINY_RCU (non-preemptible) text data bss dec filename 13 0 0 13 kernel/rcupdate.o 719 25 0 744 kernel/rcutiny.o --- 757 Total Requested-by: Loïc Minier <loic.minier@canonical.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: "Tiny RCU", The Bloatwatch EditionPaul E. McKenney2009-10-261-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is a version of RCU designed for !SMP provided for a small-footprint RCU implementation. In particular, the implementation of synchronize_rcu() is extremely lightweight and high performance. It passes rcutorture testing in each of the four relevant configurations (combinations of NO_HZ and PREEMPT) on x86. This saves about 1K bytes compared to old Classic RCU (which is no longer in mainline), and more than three kilobytes compared to Hierarchical RCU (updated to 2.6.30): CONFIG_TREE_RCU: text data bss dec filename 183 4 0 187 kernel/rcupdate.o 2783 520 36 3339 kernel/rcutree.o 3526 Total (vs 4565 for v7) CONFIG_TREE_PREEMPT_RCU: text data bss dec filename 263 4 0 267 kernel/rcupdate.o 4594 776 52 5422 kernel/rcutree.o 5689 Total (6155 for v7) CONFIG_TINY_RCU: text data bss dec filename 96 4 0 100 kernel/rcupdate.o 734 24 0 758 kernel/rcutiny.o 858 Total (vs 848 for v7) The above is for x86. Your mileage may vary on other platforms. Further compression is possible, but is being procrastinated. Changes from v7 (http://lkml.org/lkml/2009/10/9/388) o Apply Lai Jiangshan's review comments (aside from might_sleep() in synchronize_sched(), which is covered by SMP builds). o Fix up expedited primitives. Changes from v6 (http://lkml.org/lkml/2009/9/23/293). o Forward ported to put it into the 2.6.33 stream. o Added lockdep support. o Make lightweight rcu_barrier. Changes from v5 (http://lkml.org/lkml/2009/6/23/12). o Ported to latest pre-2.6.32 merge window kernel. - Renamed rcu_qsctr_inc() to rcu_sched_qs(). - Renamed rcu_bh_qsctr_inc() to rcu_bh_qs(). - Provided trivial rcu_cpu_notify(). - Provided trivial exit_rcu(). - Provided trivial rcu_needs_cpu(). - Fixed up the rcu_*_enter/exit() functions in linux/hardirq.h. o Removed the dependence on EMBEDDED, with a view to making TINY_RCU default for !SMP at some time in the future. o Added (trivial) support for expedited grace periods. Changes from v4 (http://lkml.org/lkml/2009/5/2/91) include: o Squeeze the size down a bit further by removing the ->completed field from struct rcu_ctrlblk. o This permits synchronize_rcu() to become the empty function. Previous concerns about rcutorture were unfounded, as rcutorture correctly handles a constant value from rcu_batches_completed() and rcu_batches_completed_bh(). Changes from v3 (http://lkml.org/lkml/2009/3/29/221) include: o Changed rcu_batches_completed(), rcu_batches_completed_bh() rcu_enter_nohz(), rcu_exit_nohz(), rcu_nmi_enter(), and rcu_nmi_exit(), to be static inlines, as suggested by David Howells. Doing this saves about 100 bytes from rcutiny.o. (The numbers between v3 and this v4 of the patch are not directly comparable, since they are against different versions of Linux.) Changes from v2 (http://lkml.org/lkml/2009/2/3/333) include: o Fix whitespace issues. o Change short-circuit "||" operator to instead be "+" in order to fix performance bug noted by "kraai" on LWN. (http://lwn.net/Articles/324348/) Changes from v1 (http://lkml.org/lkml/2009/1/13/440) include: o This version depends on EMBEDDED as well as !SMP, as suggested by Ingo. o Updated rcu_needs_cpu() to unconditionally return zero, permitting the CPU to enter dynticks-idle mode at any time. This works because callbacks can be invoked upon entry to dynticks-idle mode. o Paul is now OK with this being included, based on a poll at the Kernel Miniconf at linux.conf.au, where about ten people said that they cared about saving 900 bytes on single-CPU systems. o Applies to both mainline and tip/core/rcu. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: avi@redhat.com Cc: mtosatti@redhat.com LKML-Reference: <12565226351355-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2009-09-111-0/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (64 commits) sched: Fix sched::sched_stat_wait tracepoint field sched: Disable NEW_FAIR_SLEEPERS for now sched: Keep kthreads at default priority sched: Re-tune the scheduler latency defaults to decrease worst-case latencies sched: Turn off child_runs_first sched: Ensure that a child can't gain time over it's parent after fork() sched: enable SD_WAKE_IDLE sched: Deal with low-load in wake_affine() sched: Remove short cut from select_task_rq_fair() sched: Turn on SD_BALANCE_NEWIDLE sched: Clean up topology.h sched: Fix dynamic power-balancing crash sched: Remove reciprocal for cpu_power sched: Try to deal with low capacity, fix update_sd_power_savings_stats() sched: Try to deal with low capacity sched: Scale down cpu_power due to RT tasks sched: Implement dynamic cpu_power sched: Add smt_gain sched: Update the cpu_power sum during load-balance sched: Add SD_PREFER_SIBLING ...
| * sched: Add default defines for PREEMPT_ACTIVEArnd Bergmann2009-08-091-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PREEMPT_ACTIVE setting doesn't actually need to be arch-specific, so set up a sane default for all arches to (hopefully) migrate to. > if we look at linux/hardirq.h, it makes this claim: > * - bit 28 is the PREEMPT_ACTIVE flag > if that's true, then why are we letting any arch set this define ? a > quick survey shows that half the arches (11) are using 0x10000000 (bit > 28) while the other half (10) are using 0x4000000 (bit 26). and then > there is the ia64 oddity which uses bit 30. the exact value here > shouldnt really matter across arches though should it ? actually alpha, arm and avr32 also use bit 30 (0x40000000), there are only five (or eight, depending on how you count) architectures (blackfin, h8300, m68k, s390 and sparc) using bit 26. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | rcu: Expunge lingering references to CONFIG_CLASSIC_RCU, optimize on !SMPPaul E. McKenney2009-08-221-2/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | A couple of references to CONFIG_CLASSIC_RCU have survived. Although these are harmless, it is past time for them to go. The one in hardirq.h is strictly a readability problem. The two in pagemap.h appear to disable a !SMP performance optimization (which this patch re-enables). This does raise the issue as to whether pagemap.h should really be referring to the CPU implementation. Long term, I intend to make the RCU implementation driven by CONFIG_PREEMPT, at which point these should change from defined(CONFIG_TREE_RCU) to !defined(CONFIG_PREEMPT). In the meantime, is there something else that could be done in pagemap.h? Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: akpm@linux-foundation.org Cc: mathieu.desnoyers@polymtl.ca Cc: josht@linux.vnet.ibm.com Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org LKML-Reference: <20090822050851.GA8414@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* headers: smp_lock.h reduxAlexey Dobriyan2009-07-121-0/+2
| | | | | | | | | | | | | * Remove smp_lock.h from files which don't need it (including some headers!) * Add smp_lock.h to files which do need it * Make smp_lock.h include conditional in hardirq.h It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT This will make hardirq.h inclusion cheaper for every PREEMPT=n config (which includes allmodconfig/allyesconfig, BTW) Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'linus' into irq/threadedIngo Molnar2009-04-061-29/+44
|\ | | | | | | | | | | Conflicts: include/linux/irq.h kernel/irq/handle.c
| * sched: do not account for NMIsSteven Rostedt2009-02-121-16/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: avoid corruption in system time accounting Martin Schwidefsky told me that there was an issue with NMIs and system accounting. The problem is that the accounting code is not reentrant, and if an NMI goes off after an interrupt it can corrupt the accounting. For now, the best we can do is to treat NMIs like SMIs and they are not accounted for. This patch changes nmi_enter to not call __irq_enter and to do the preempt-count and tracing calls directly. Signed-off-by: Steven Rostedt <srostedt@redhat.com>
| * preempt-count: force hardirq-count to max of 10Steven Rostedt2009-02-121-24/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To add a bit in the preempt_count to be set when in NMI context, we found that some archs did not have enough bits to spare. This is due to the hardirq_count being a mask that can hold NR_IRQS. Some archs allow for over 16000 IRQs, and that would require a mask of 14 bits. The sofitrq mask is 8 bits and the preempt disable mask is also 8 bits. The PREEMP_ACTIVE bit is bit 30, and bit 31 would make the preempt_count (which is type int) a negative number. A negative preempt_count is a sign of failure. Add them up 14+8+8+1+1 you get 32 bits. No room for the NMI bit. But the hardirq_count is to track the number of nested IRQs, not the number of total IRQs. This originally took the paranoid approach of setting the max nesting to NR_IRQS. But when we have archs with over 1000 IRQs, it is not practical to think they will ever all nest on a single CPU. Not to mention that this would most definitely cause a stack overflow. This patch sets a max of 10 bits to be used for IRQ nesting. I did a 'git grep HARDIRQ' to examine all users of HARDIRQ_BITS and HARDIRQ_MASK, and found that making it a max of 10 would not hurt anyone. I did find that the m68k expected it to be 8 bits, so I allow for the archs to set the number to be less than 10. I removed the setting of HARDIRQ_BITS from the archs that set it to more than 10. This includes ALPHA, ia64 and avr32. This will always allow room for the NMI bit, and if we need to allow for NMI nesting, we have 4 bits to play with. Signed-off-by: Steven Rostedt <srostedt@redhat.com>
| * nmi: add generic nmi tracking stateSteven Rostedt2009-02-071-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | This code adds an in_nmi() macro that uses the current tasks preempt count to track when it is in NMI context. Other parts of the kernel can use this to determine if the context is in NMI context or not. This code was inspired by the -rt patch in_nmi version that was written by Peter Zijlstra, who borrowed that code from Mathieu Desnoyers. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
* | genirq: add threaded interrupt handler supportThomas Gleixner2009-03-241-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | Add support for threaded interrupt handlers: A device driver can request that its main interrupt handler runs in a thread. To achive this the device driver requests the interrupt with request_threaded_irq() and provides additionally to the handler a thread function. The handler function is called in hard interrupt context and needs to check whether the interrupt originated from the device. If the interrupt originated from the device then the handler can either return IRQ_HANDLED or IRQ_WAKE_THREAD. IRQ_HANDLED is returned when no further action is required. IRQ_WAKE_THREAD causes the genirq code to invoke the threaded (main) handler. When IRQ_WAKE_THREAD is returned handler must have disabled the interrupt on the device level. This is mandatory for shared interrupt handlers, but we need to do it as well for obscure x86 hardware where disabling an interrupt on the IO_APIC level redirects the interrupt to the legacy PIC interrupt lines. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'core-for-linus' of ↵Linus Torvalds2008-12-301-4/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits) stacktrace: provide save_stack_trace_tsk() weak alias rcu: provide RCU options on non-preempt architectures too printk: fix discarding message when recursion_bug futex: clean up futex_(un)lock_pi fault handling "Tree RCU": scalable classic RCU implementation futex: rename field in futex_q to clarify single waiter semantics x86/swiotlb: add default swiotlb_arch_range_needs_mapping x86/swiotlb: add default phys<->bus conversion x86: unify pci iommu setup and allow swiotlb to compile for 32 bit x86: add swiotlb allocation functions swiotlb: consolidate swiotlb info message printing swiotlb: support bouncing of HighMem pages swiotlb: factor out copy to/from device swiotlb: add arch hook to force mapping swiotlb: allow architectures to override phys<->bus<->phys conversions swiotlb: add comment where we handle the overflow of a dma mask on 32 bit rcu: fix rcutorture behavior during reboot resources: skip sanity check of busy resources swiotlb: move some definitions to header swiotlb: allow architectures to override swiotlb pool allocation ... Fix up trivial conflicts in arch/x86/kernel/Makefile arch/x86/mm/init_32.c include/linux/hardirq.h as per Ingo's suggestions.
| * "Tree RCU": scalable classic RCU implementationPaul E. McKenney2008-12-181-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a long-standing performance bug in classic RCU that results in massive internal-to-RCU lock contention on systems with more than a few hundred CPUs. Although this patch creates a separate flavor of RCU for ease of review and patch maintenance, it is intended to replace classic RCU. This patch still handles stress better than does mainline, so I am still calling it ready for inclusion. This patch is against the -tip tree. Nevertheless, experience on an actual 1000+ CPU machine would still be most welcome. Most of the changes noted below were found while creating an rcutiny (which should permit ejecting the current rcuclassic) and while doing detailed line-by-line documentation. Updates from v9 (http://lkml.org/lkml/2008/12/2/334): o Fixes from remainder of line-by-line code walkthrough, including comment spelling, initialization, undesirable narrowing due to type conversion, removing redundant memory barriers, removing redundant local-variable initialization, and removing redundant local variables. I do not believe that any of these fixes address the CPU-hotplug issues that Andi Kleen was seeing, but please do give it a whirl in case the machine is smarter than I am. A writeup from the walkthrough may be found at the following URL, in case you are suffering from terminal insomnia or masochism: http://www.kernel.org/pub/linux/kernel/people/paulmck/tmp/rcutree-walkthrough.2008.12.16a.pdf o Made rcutree tracing use seq_file, as suggested some time ago by Lai Jiangshan. o Added a .csv variant of the rcudata debugfs trace file, to allow people having thousands of CPUs to drop the data into a spreadsheet. Tested with oocalc and gnumeric. Updated documentation to suit. Updates from v8 (http://lkml.org/lkml/2008/11/15/139): o Fix a theoretical race between grace-period initialization and force_quiescent_state() that could occur if more than three jiffies were required to carry out the grace-period initialization. Which it might, if you had enough CPUs. o Apply Ingo's printk-standardization patch. o Substitute local variables for repeated accesses to global variables. o Fix comment misspellings and redundant (but harmless) increments of ->n_rcu_pending (this latter after having explicitly added it). o Apply checkpatch fixes. Updates from v7 (http://lkml.org/lkml/2008/10/10/291): o Fixed a number of problems noted by Gautham Shenoy, including the cpu-stall-detection bug that he was having difficulty convincing me was real. ;-) o Changed cpu-stall detection to wait for ten seconds rather than three in order to reduce false positive, as suggested by Ingo Molnar. o Produced a design document (http://lwn.net/Articles/305782/). The act of writing this document uncovered a number of both theoretical and "here and now" bugs as noted below. o Fix dynticks_nesting accounting confusion, simplify WARN_ON() condition, fix kerneldoc comments, and add memory barriers in dynticks interface functions. o Add more data to tracing. o Remove unused "rcu_barrier" field from rcu_data structure. o Count calls to rcu_pending() from scheduling-clock interrupt to use as a surrogate timebase should jiffies stop counting. o Fix a theoretical race between force_quiescent_state() and grace-period initialization. Yes, initialization does have to go on for some jiffies for this race to occur, but given enough CPUs... Updates from v6 (http://lkml.org/lkml/2008/9/23/448): o Fix a number of checkpatch.pl complaints. o Apply review comments from Ingo Molnar and Lai Jiangshan on the stall-detection code. o Fix several bugs in !CONFIG_SMP builds. o Fix a misspelled config-parameter name so that RCU now announces at boot time if stall detection is configured. o Run tests on numerous combinations of configurations parameters, which after the fixes above, now build and run correctly. Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line): o Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a changeset some time ago, and finally got around to retesting this option). o Fix some tracing bugs in rcupreempt that caused incorrect totals to be printed. o I now test with a more brutal random-selection online/offline script (attached). Probably more brutal than it needs to be on the people reading it as well, but so it goes. o A number of optimizations and usability improvements: o Make rcu_pending() ignore the grace-period timeout when there is no grace period in progress. o Make force_quiescent_state() avoid going for a global lock in the case where there is no grace period in progress. o Rearrange struct fields to improve struct layout. o Make call_rcu() initiate a grace period if RCU was idle, rather than waiting for the next scheduling clock interrupt. o Invoke rcu_irq_enter() and rcu_irq_exit() only when idle, as suggested by Andi Kleen. I still don't completely trust this change, and might back it out. o Make CONFIG_RCU_TRACE be the single config variable manipulated for all forms of RCU, instead of the prior confusion. o Document tracing files and formats for both rcupreempt and rcutree. Updates from v4 for those missing v5 given its bad subject line: o Separated dynticks interface so that NMIs and irqs call separate functions, greatly simplifying it. In particular, this code no longer requires a proof of correctness. ;-) o Separated dynticks state out into its own per-CPU structure, avoiding the duplicated accounting. o The case where a dynticks-idle CPU runs an irq handler that invokes call_rcu() is now correctly handled, forcing that CPU out of dynticks-idle mode. o Review comments have been applied (thank you all!!!). For but one example, fixed the dynticks-ordering issue that Manfred pointed out, saving me much debugging. ;-) o Adjusted rcuclassic and rcupreempt to handle dynticks changes. Attached is an updated patch to Classic RCU that applies a hierarchy, greatly reducing the contention on the top-level lock for large machines. This passes 10-hour concurrent rcutorture and online-offline testing on 128-CPU ppc64 without dynticks enabled, and exposes some timekeeping bugs in presence of dynticks (exciting working on a system where "sleep 1" hangs until interrupted...), which were fixed in the 2.6.27 kernel. It is getting more reliable than mainline by some measures, so the next version will be against -tip for inclusion. See also Manfred Spraul's recent patches (or his earlier work from 2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2). We will converge onto a common patch in the fullness of time, but are currently exploring different regions of the design space. That said, I have already gratefully stolen quite a few of Manfred's ideas. This patch provides CONFIG_RCU_FANOUT, which controls the bushiness of the RCU hierarchy. Defaults to 32 on 32-bit machines and 64 on 64-bit machines. If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT, there is no hierarchy. By default, the RCU initialization code will adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable this balancing, allowing the hierarchy to be exactly aligned to the underlying hardware. Up to two levels of hierarchy are permitted (in addition to the root node), allowing up to 16,384 CPUs on 32-bit systems and up to 262,144 CPUs on 64-bit systems. I just know that I am going to regret saying this, but this seems more than sufficient for the foreseeable future. (Some architectures might wish to set CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs. If this becomes a real problem, additional levels can be added, but I doubt that it will make a significant difference on real hardware.) In the common case, a given CPU will manipulate its private rcu_data structure and the rcu_node structure that it shares with its immediate neighbors. This can reduce both lock and memory contention by multiple orders of magnitude, which should eliminate the need for the strange manipulations that are reported to be required when running Linux on very large systems. Some shortcomings: o More bugs will probably surface as a result of an ongoing line-by-line code inspection. Patches will be provided as required. o There are probably hangs, rcutorture failures, &c. Seems quite stable on a 128-CPU machine, but that is kind of small compared to 4096 CPUs. However, seems to do better than mainline. Patches will be provided as required. o The memory footprint of this version is several KB larger than rcuclassic. A separate UP-only rcutiny patch will be provided, which will reduce the memory footprint significantly, even compared to the old rcuclassic. One such patch passes light testing, and has a memory footprint smaller even than rcuclassic. Initial reaction from various embedded guys was "it is not worth it", so am putting it aside. Credits: o Manfred Spraul for ideas, review comments, and bugs spotted, as well as some good friendly competition. ;-) o Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers, Lai Jiangshan, Andi Kleen, Andy Whitcroft, and Andrew Morton for reviews and comments. o Thomas Gleixner for much-needed help with some timer issues (see patches below). o Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos, Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton Blanchard, Dave Kleikamp, and Nathan Lynch for keeping machines alive despite my heavy abuse^Wtesting. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | ftrace: split out hardirq ftrace code into own headerSteven Rostedt2008-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Impact: moving of function prototypes into own header file ftrace.h is too big of a file for hardirq.h, and some archs will fail to build because of the include dependencies not being met. This patch pulls out the required prototypes for hardirq.h into a smaller and safer ftrace_irq.h file. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | ftrace: fix hardirq header for non ftrace archsSteven Rostedt2008-11-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Impact: build fix for non-ftrace architectures Not all archs implement ftrace, and therefore do not have an asm/ftrace.h. This patch corrects the problem. The ftrace_nmi_enter/exit now must be defined for all archs that implement dynamic ftrace. Currently, only x86 does. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | ftrace: nmi safe code modificationSteven Rostedt2008-10-301-2/+13
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix crashes that can occur in NMI handlers, if their code is modified Modifying code is something that needs special care. On SMP boxes, if code that is being modified is also being executed on another CPU, that CPU will have undefined results. The dynamic ftrace uses kstop_machine to make the system act like a uniprocessor system. But this does not address NMIs, that can still run on other CPUs. One approach to handle this is to make all code that are used by NMIs not be traced. But NMIs can call notifiers that spread throughout the kernel and this will be very hard to maintain, and the chance of missing a function is very high. The approach that this patch takes is to have the NMIs modify the code if the modification is taking place. The way this works is that just writing to code executing on another CPU is not harmful if what is written is the same as what exists. Two buffers are used: an IP buffer and a "code" buffer. The steps that the patcher takes are: 1) Put in the instruction pointer into the IP buffer and the new code into the "code" buffer. 2) Set a flag that says we are modifying code 3) Wait for any running NMIs to finish. 4) Write the code 5) clear the flag. 6) Wait for any running NMIs to finish. If an NMI is executed, it will also write the pending code. Multiple writes are OK, because what is being written is the same. Then the patcher must wait for all running NMIs to finish before going to the next line that must be patched. This is basically the RCU approach to code modification. Thanks to Ingo Molnar for suggesting the idea, and to Arjan van de Ven for his guidence on what is safe and what is not. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* BKL: revert back to the old spinlock implementationLinus Torvalds2008-05-101-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The generic semaphore rewrite had a huge performance regression on AIM7 (and potentially other BKL-heavy benchmarks) because the generic semaphores had been rewritten to be simple to understand and fair. The latter, in particular, turns a semaphore-based BKL implementation into a mess of scheduling. The attempt to fix the performance regression failed miserably (see the previous commit 00b41ec2611dc98f87f30753ee00a53db648d662 'Revert "semaphore: fix"'), and so for now the simple and sane approach is to instead just go back to the old spinlock-based BKL implementation that never had any issues like this. This patch also has the advantage of being reported to fix the regression completely according to Yanmin Zhang, unlike the semaphore hack which still left a couple percentage point regression. As a spinlock, the BKL obviously has the potential to be a latency issue, but it's not really any different from any other spinlock in that respect. We do want to get rid of the BKL asap, but that has been the plan for several years. These days, the biggest users are in the tty layer (open/release in particular) and Alan holds out some hope: "tty release is probably a few months away from getting cured - I'm afraid it will almost certainly be the very last user of the BKL in tty to get fixed as it depends on everything else being sanely locked." so while we're not there yet, we do have a plan of action. Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Alexander Viro <viro@ftp.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* in_atomic(): document why it is unsuitable for general useJonathan Corbet2008-03-281-0/+7
| | | | | | | | Discourage people from inappropriately using in_atomic() Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rcu: add support for dynamic ticks and preempt rcuSteven Rostedt2008-02-291-0/+10
| | | | | | | | | | | | | | | The PREEMPT-RCU can get stuck if a CPU goes idle and NO_HZ is set. The idle CPU will not progress the RCU through its grace period and a synchronize_rcu my get stuck. Without this patch I have a box that will not boot when PREEMPT_RCU and NO_HZ are set. That same box boots fine with this patch. This patch comes from the -rt kernel where it has been tested for several months. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: remove the !PREEMPT_BKL codeIngo Molnar2008-01-251-5/+1
| | | | | | | | remove the !PREEMPT_BKL code. this removes 160 lines of legacy code. Signed-off-by: Ingo Molnar <mingo@elte.hu>
* sched: add in_atomic_preempt_off()Ingo Molnar2007-07-091-0/+13
| | | | | | | add in_atomic_preempt_off() - debugging helper that will simplify schedule(). Signed-off-by: Ingo Molnar <mingo@elte.hu>
* [PATCH] tick-management: dyntick / highres functionalityThomas Gleixner2007-02-161-1/+11
| | | | | | | | | | | | | | | | | With Ingo Molnar <mingo@elte.hu> Add functions to provide dynamic ticks and high resolution timers. The code which keeps track of jiffies and handles the long idle periods is shared between tick based and high resolution timer based dynticks. The dyntick functionality can be disabled on the kernel commandline. Provide also the infrastructure to support high resolution timers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>