summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* rcu: Fix day-one dyntick-idle stall-warning bugPaul E. McKenney2012-10-131-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a10d206ef1a83121ab7430cb196e0376a7145b22 upstream. Each grace period is supposed to have at least one callback waiting for that grace period to complete. However, if CONFIG_NO_HZ=n, an extra callback-free grace period is no big problem -- it will chew up a tiny bit of CPU time, but it will complete normally. In contrast, CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to sleep indefinitely, in turn indefinitely delaying completion of the callback-free grace period. Given that nothing is waiting on this grace period, this is also not a problem. That is, unless RCU CPU stall warnings are also enabled, as they are in recent kernels. In this case, if a CPU wakes up after at least one minute of inactivity, an RCU CPU stall warning will result. The reason that no one noticed until quite recently is that most systems have enough OS noise that they will never remain absolutely idle for a full minute. But there are some embedded systems with cut-down userspace configurations that consistently get into this situation. All this begs the question of exactly how a callback-free grace period gets started in the first place. This can happen due to the fact that CPUs do not necessarily agree on which grace period is in progress. If a CPU still believes that the grace period that just completed is still ongoing, it will believe that it has callbacks that need to wait for another grace period, never mind the fact that the grace period that they were waiting for just completed. This CPU can therefore erroneously decide to start a new grace period. Note that this can happen in TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system: Deadlock considerations mean that the CPU that detected the end of the grace period is not necessarily officially informed of this fact for some time. Once this CPU notices that the earlier grace period completed, it will invoke its callbacks. It then won't have any callbacks left. If no other CPU has any callbacks, we now have a callback-free grace period. This commit therefore makes CPUs check more carefully before starting a new grace period. This new check relies on an array of tail pointers into each CPU's list of callbacks. If the CPU is up to date on which grace periods have completed, it checks to see if any callbacks follow the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks follow the RCU_WAIT_TAIL segment. The reason that this works is that the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment as soon as the CPU is officially notified that the old grace period has ended. This change is to cpu_needs_another_gp(), which is called in a number of places. The only one that really matters is in rcu_start_gp(), where the root rcu_node structure's ->lock is held, which prevents any other CPU from starting or completing a grace period, so that the comparison that determines whether the CPU is missing the completion of a grace period is stable. Reported-by: Becky Bruce <bgillbruce@gmail.com> Reported-by: Subodh Nijsure <snijsure@grid-net.com> Reported-by: Paul Walmsley <paul@pwsan.com> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Paul Walmsley <paul@pwsan.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* workqueue: add missing smp_wmb() in process_one_work()Tejun Heo2012-10-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | commit 959d1af8cffc8fd38ed53e8be1cf4ab8782f9c00 upstream. WORK_STRUCT_PENDING is used to claim ownership of a work item and process_one_work() releases it before starting execution. When someone else grabs PENDING, all pre-release updates to the work item should be visible and all updates made by the new owner should happen afterwards. Grabbing PENDING uses test_and_set_bit() and thus has a full barrier; however, clearing doesn't have a matching wmb. Given the preceding spin_unlock and use of clear_bit, I don't believe this can be a problem on an actual machine and there hasn't been any related report but it still is theretically possible for clear_pending to permeate upwards and happen before work->entry update. Add an explicit smp_wmb() before work_clear_pending(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()Shawn Guo2012-10-131-0/+1
| | | | | | | | | | | | | | | | | | | | commit f96972f2dc6365421cf2366ebd61ee4cf060c8d5 upstream. As kernel_power_off() calls disable_nonboot_cpus(), we may also want to have kernel_restart() call disable_nonboot_cpus(). Doing so can help machines that require boot cpu be the last alive cpu during reboot to survive with kernel restart. This fixes one reboot issue seen on imx6q (Cortex-A9 Quad). The machine requires that the restart routine be run on the primary cpu rather than secondary ones. Otherwise, the secondary core running the restart routine will fail to come to online after reboot. Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kthread_worker: reimplement flush_kthread_work() to allow freeing the work ↵Tejun Heo2012-10-021-21/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | item being executed commit 46f3d976213452350f9d10b0c2780c2681f7075b upstream. kthread_worker provides minimalistic workqueue-like interface for users which need a dedicated worker thread (e.g. for realtime priority). It has basic queue, flush_work, flush_worker operations which mostly match the workqueue counterparts; however, due to the way flush_work() is implemented, it has a noticeable difference of not allowing work items to be freed while being executed. While the current users of kthread_worker are okay with the current behavior, the restriction does impede some valid use cases. Also, removing this difference isn't difficult and actually makes the code easier to understand. This patch reimplements flush_kthread_work() such that it uses a flush_work item instead of queue/done sequence numbers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Colin Cross <ccross@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kthread_worker: reorganize to prepare for flush_kthread_work() reimplementationTejun Heo2012-10-021-16/+26
| | | | | | | | | | | | | | | | | | | | commit 9a2e03d8ed518a61154f18d83d6466628e519f94 upstream. Make the following two non-functional changes. * Separate out insert_kthread_work() from queue_kthread_work(). * Relocate struct kthread_flush_work and kthread_flush_work_fn() definitions above flush_kthread_work(). v2: Added lockdep_assert_held() in insert_kthread_work() as suggested by Andy Walls. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andy Walls <awalls@md.metrocast.net> Cc: Colin Cross <ccross@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* time: Move ktime_t overflow checking into timespec_valid_strictJohn Stultz2012-10-021-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | commit cee58483cf56e0ba355fdd97ff5e8925329aa936 upstream Andreas Bombe reported that the added ktime_t overflow checking added to timespec_valid in commit 4e8b14526ca7 ("time: Improve sanity checking of timekeeping inputs") was causing problems with X.org because it caused timeouts larger then KTIME_T to be invalid. Previously, these large timeouts would be clamped to KTIME_MAX and would never expire, which is valid. This patch splits the ktime_t overflow checking into a new timespec_valid_strict function, and converts the timekeeping codes internal checking to use this more strict function. Reported-and-tested-by: Andreas Bombe <aeb@debian.org> Cc: Zhouping Liu <zliu@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* time: Avoid making adjustments if we haven't accumulated anythingJohn Stultz2012-10-021-1/+4
| | | | | | | | | | | | | | | | | commit bf2ac312195155511a0f79325515cbb61929898a upstream. If update_wall_time() is called and the current offset isn't large enough to accumulate, avoid re-calling timekeeping_adjust which may change the clock freq and can cause 1ns inconsistencies with CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* time: Improve sanity checking of timekeeping inputsJohn Stultz2012-10-021-2/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4e8b14526ca7fb046a81c94002c1c43b6fdf0e9b upstream. Unexpected behavior could occur if the time is set to a value large enough to overflow a 64bit ktime_t (which is something larger then the year 2262). Also unexpected behavior could occur if large negative offsets are injected via adjtimex. So this patch improves the sanity check timekeeping inputs by improving the timespec_valid() check, and then makes better use of timespec_valid() to make sure we don't set the time to an invalid negative value or one that overflows ktime_t. Note: This does not protect from setting the time close to overflowing ktime_t and then letting natural accumulation cause the overflow. Reported-by: CAI Qian <caiqian@redhat.com> Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Zhouping Liu <zliu@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1344454580-17031-1-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched: Fix race in task_group()Peter Zijlstra2012-10-022-14/+18
| | | | | | | | | | | | | | | | | | | | | | | | commit 8323f26ce3425460769605a6aece7a174edaa7d1 upstream. Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"), he found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. Its not pretty, but I can't really see another way given how all the cgroup stuff works. Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Fix a dead loop in async_synchronize_full()Li Zhong2012-10-021-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [Fixed upstream by commits 2955b47d2c1983998a8c5915cb96884e67f7cb53 and a4683487f90bfe3049686fc5c566bdc1ad03ace6 from Dan Williams, but they are much more intrusive than this tiny fix, according to Andrew - gregkh] This patch tries to fix a dead loop in async_synchronize_full(), which could be seen when preemption is disabled on a single cpu machine. void async_synchronize_full(void) { do { async_synchronize_cookie(next_cookie); } while (!list_empty(&async_running) || ! list_empty(&async_pending)); } async_synchronize_cookie() calls async_synchronize_cookie_domain() with &async_running as the default domain to synchronize. However, there might be some works in the async_pending list from other domains. On a single cpu system, without preemption, there is no chance for the other works to finish, so async_synchronize_full() enters a dead loop. It seems async_synchronize_full() wants to synchronize all entries in all running lists(domains), so maybe we could just check the entry_count to know whether all works are finished. Currently, async_synchronize_cookie_domain() expects a non-NULL running list ( if NULL, there would be NULL pointer dereference ), so maybe a NULL pointer could be used as an indication for the functions to synchronize all works in all domains. Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Christian Kujau <lists@nerdbynature.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomicLai Jiangshan2012-10-021-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 96e65306b81351b656835c15931d1d237b252f27 upstream. The compiler may compile the following code into TWO write/modify instructions. worker->flags &= ~WORKER_UNBOUND; worker->flags |= WORKER_REBIND; so the other CPU may temporarily see worker->flags which doesn't have either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup prematurely. Fix it by using single explicit assignment via ACCESS_ONCE(). Because idle workers have another WORKER_NOT_RUNNING flag, this bug doesn't exist for them; however, update it to use the same pattern for consistency. tj: Applied the change to idle workers too and updated comments and patch description a bit. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched: Add missing call to calc_load_exit_idle()Charles Wang2012-10-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | commit 749c8814f08f12baa4a9c2812a7c6ede7d69507d upstream. Azat Khuzhin reported high loadavg in Linux v3.6 After checking the upstream scheduler code, I found Peter's commit: 5167e8d5417b sched/nohz: Rewrite and fix load-avg computation -- again not fully applied, missing the call to calc_load_exit_idle(). After that idle exit in sampling window will always be calculated to non-idle, and the load will be higher than normal. This patch adds the missing call to calc_load_exit_idle(). Signed-off-by: Charles Wang <muming.wq@taobao.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1345449754-27130-1-git-send-email-muming.wq@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* perf_event: Switch to internal refcount, fix race with close()Al Viro2012-10-021-29/+33
| | | | | | | | | | | | | | | | | | | | | commit a6fa941d94b411bbd2b6421ffbde6db3c93e65ab upstream. Don't mess with file refcounts (or keep a reference to file, for that matter) in perf_event. Use explicit refcount of its own instead. Deal with the race between the final reference to event going away and new children getting created for it by use of atomic_long_inc_not_zero() in inherit_event(); just have the latter free what it had allocated and return NULL, that works out just fine (children of siblings of something doomed are created as singletons, same as if the child of leader had been created and immediately killed). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120820135925.GG23464@ZenIV.linux.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* workqueue: reimplement work_on_cpu() using system_wqTejun Heo2012-10-021-17/+8
| | | | | | | | | | | | | | | | | | | | | | | | | commit ed48ece27cd3d5ee0354c32bbaec0f3e1d4715c3 upstream. The existing work_on_cpu() implementation is hugely inefficient. It creates a new kthread, execute that single function and then let the kthread die on each invocation. Now that system_wq can handle concurrent executions, there's no advantage of doing this. Reimplement work_on_cpu() using system_wq which makes it simpler and way more efficient. stable: While this isn't a fix in itself, it's needed to fix a workqueue related bug in cpufreq/powernow-k8. AFAICS, this shouldn't break other existing users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jiri Kosina <jkosina@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched: fix divide by zero at {thread_group,task}_timesStanislaw Gruszka2012-09-141-14/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit bea6832cc8c4a0a9a65dd17da6aaa657fe27bc3e upstream. On architectures where cputime_t is 64 bit type, is possible to trigger divide by zero on do_div(temp, (__force u32) total) line, if total is a non zero number but has lower 32 bit's zeroed. Removing casting is not a good solution since some do_div() implementations do cast to u32 internally. This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched,cgroup: Fix up task_groups listMike Galbraith2012-09-142-1/+2
| | | | | | | | | | | | | | | | | commit 35cf4e50b16331def6cfcbee11e49270b6db07f5 upstream. With multiple instances of task_groups, for_each_rt_rq() is a noop, no task groups having been added to the rt.c list instance. This renders __enable/disable_runtime() and print_rt_stats() noop, the user (non) visible effect being that rt task groups are missing in /proc/sched_debug. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* audit: fix refcounting in audit-treeMiklos Szeredi2012-09-141-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a2140fc0cb0325bb6384e788edd27b9a568714e2 upstream. Refcounting of fsnotify_mark in audit tree is broken. E.g: refcount create_chunk alloc_chunk 1 fsnotify_add_mark 2 untag_chunk fsnotify_get_mark 3 fsnotify_destroy_mark audit_tree_freeing_mark 2 fsnotify_put_mark 1 fsnotify_put_mark 0 via destroy_list fsnotify_mark_destroy -1 This was reported by various people as triggering Oops when stopping auditd. We could just remove the put_mark from audit_tree_freeing_mark() but that would break freeing via inode destruction. So this patch simply omits a put_mark after calling destroy_mark or adds a get_mark before. The additional get_mark is necessary where there's no other put_mark after fsnotify_destroy_mark() since it assumes that the caller is holding a reference (or the inode is keeping the mark pinned, not the case here AFAICS). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reported-by: Valentin Avram <aval13@gmail.com> Reported-by: Peter Moody <pmoody@google.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* audit: don't free_chunk() after fsnotify_add_mark()Miklos Szeredi2012-09-141-3/+3
| | | | | | | | | | | | commit 0fe33aae0e94b4097dd433c9399e16e17d638cd8 upstream. Don't do free_chunk() after fsnotify_add_mark(). That one does a delayed unref via the destroy list and this results in use-after-free. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* random: remove rand_initialize_irq()Theodore Ts'o2012-08-151-17/+0
| | | | | | | | | | | | | | | | commit c5857ccf293968348e5eb4ebedc68074de3dcda6 upstream. With the new interrupt sampling system, we are no longer using the timer_rand_state structure in the irq descriptor, so we can stop initializing it now. [ Merged in fixes from Sedat to find some last missing references to rand_initialize_irq() ] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* random: make 'add_interrupt_randomness()' do something saneTheodore Ts'o2012-08-151-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 775f4b297b780601e61787b766f306ed3e1d23eb upstream. We've been moving away from add_interrupt_randomness() for various reasons: it's too expensive to do on every interrupt, and flooding the CPU with interrupts could theoretically cause bogus floods of entropy from a somewhat externally controllable source. This solves both problems by limiting the actual randomness addition to just once a second or after 64 interrupts, whicever comes first. During that time, the interrupt cycle data is buffered up in a per-cpu pool. Also, we make sure the the nonblocking pool used by urandom is initialized before we start feeding the normal input pool. This assures that /dev/urandom is returning unpredictable data as soon as possible. (Based on an original patch by Linus, but significantly modified by tytso.) Tested-by: Eric Wustrow <ewust@umich.edu> Reported-by: Eric Wustrow <ewust@umich.edu> Reported-by: Nadia Heninger <nadiah@cs.ucsd.edu> Reported-by: Zakir Durumeric <zakir@umich.edu> Reported-by: J. Alex Halderman <jhalderm@umich.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi()Darren Hart2012-08-091-5/+8
| | | | | | | | | | | | | | | | | | | commit 6f7b0a2a5c0fb03be7c25bd1745baa50582348ef upstream. If uaddr == uaddr2, then we have broken the rule of only requeueing from a non-pi futex to a pi futex with this call. If we attempt this, as the trinity test suite manages to do, we miss early wakeups as q.key is equal to key2 (because they are the same uaddr). We will then attempt to dereference the pi_mutex (which would exist had the futex_q been properly requeued to a pi futex) and trigger a NULL pointer dereference. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: Dave Jones <davej@redhat.com> Link: http://lkml.kernel.org/r/ad82bfe7f7d130247fbe2b5b4275654807774227.1342809673.git.dvhart@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Fix bug in WARN_ON for NULL q.pi_stateDarren Hart2012-08-091-1/+1
| | | | | | | | | | | | | | | commit f27071cb7fe3e1d37a9dbe6c0dfc5395cd40fa43 upstream. The WARN_ON in futex_wait_requeue_pi() for a NULL q.pi_state was testing the address (&q.pi_state) of the pointer instead of the value (q.pi_state) of the pointer. Correct it accordingly. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: Dave Jones <davej@redhat.com> Link: http://lkml.kernel.org/r/1c85d97f6e5f79ec389a4ead3e367363c74bd09a.1342809673.git.dvhart@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* futex: Test for pi_mutex on fault in futex_wait_requeue_pi()Darren Hart2012-08-091-1/+1
| | | | | | | | | | | | | | | | commit b6070a8d9853eda010a549fa9a09eb8d7269b929 upstream. If fixup_pi_state_owner() faults, pi_mutex may be NULL. Test for pi_mutex != NULL before testing the owner against current and possibly unlocking it. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: Dave Jones <davej@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Link: http://lkml.kernel.org/r/dc59890338fc413606f04e5c5b131530734dae3d.1342809673.git.dvhart@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* posix_types.h: Cleanup stale __NFDBITS and related definitionsJosh Boyer2012-08-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8ded2bbc1845e19c771eb55209aab166ef011243 upstream. Recently, glibc made a change to suppress sign-conversion warnings in FD_SET (glibc commit ceb9e56b3d1). This uncovered an issue with the kernel's definition of __NFDBITS if applications #include <linux/types.h> after including <sys/select.h>. A build failure would be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2 flags to gcc. It was suggested that the kernel should either match the glibc definition of __NFDBITS or remove that entirely. The current in-kernel uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no uses of the related __FDELT and __FDMASK defines. Given that, we'll continue the cleanup that was started with commit 8b3d1cda4f5f ("posix_types: Remove fd_set macros") and drop the remaining unused macros. Additionally, linux/time.h has similar macros defined that expand to nothing so we'll remove those at the same time. Reported-by: Jeff Law <law@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Josh Boyer <jwboyer@redhat.com> [ .. and fix up whitespace as per akpm ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* workqueue: perform cpu down operations from low priority cpu_notifier()Tejun Heo2012-08-091-1/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6575820221f7a4dd6eadecf7bf83cdd154335eda upstream. Currently, all workqueue cpu hotplug operations run off CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to ensure that workqueue is up and running while bringing up a CPU before other notifiers try to use workqueue on the CPU. Per-cpu workqueues are supposed to remain working and bound to the CPU for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even with workqueue offlining running with higher priority because workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which runs the per-cpu workqueue without concurrency management without explicitly detaching the existing workers. However, if the trustee needs to create new workers, it creates unbound workers which may wander off to other CPUs while CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU down is cancelled, the per-CPU workqueue may end up with workers which aren't bound to the CPU. While reliably reproducible with a convoluted artificial test-case involving scheduling and flushing CPU burning work items from CPU down notifiers, this isn't very likely to happen in the wild, and, even when it happens, the effects are likely to be hidden by the following successful CPU down. Fix it by using different priorities for up and down notifiers - high priority for up operations and low priority for down operations. Workqueue cpu hotplug operations will soon go through further cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ftrace: Disable function tracing during suspend/resume and hibernation, againSrivatsa S. Bhat2012-08-092-0/+9
| | | | | | | | | | | | | | | | | | | | | | | commit 443772d408a25af62498793f6f805ce3c559309a upstream. If function tracing is enabled for some of the low-level suspend/resume functions, it leads to triple fault during resume from suspend, ultimately ending up in a reboot instead of a resume (or a total refusal to come out of suspended state, on some machines). This issue was explained in more detail in commit f42ac38c59e0a03d (ftrace: disable tracing for suspend to ram). However, the changes made by that commit got reverted by commit cbe2f5a6e84eebb (tracing: allow tracing of suspend/resume & hibernation code again). So, unfortunately since things are not yet robust enough to allow tracing of low-level suspend/resume functions, suspend/resume is still broken when ftrace is enabled. So fix this by disabling function tracing during suspend/resume & hibernation. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ntp: Fix STA_INS/DEL clearing bugJohn Stultz2012-07-291-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | commit 6b1859dba01c7d512b72d77e3fd7da8354235189 upstream. In commit 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d, I introduced a bug that kept the STA_INS or STA_DEL bit from being cleared from time_status via adjtimex() without forcing STA_PLL first. Usually once the STA_INS is set, it isn't cleared until the leap second is applied, so its unlikely this affected anyone. However during testing I noticed it took some effort to cancel a leap second once STA_INS was set. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timekeeping: Add missing update call in timekeeping_resume()Thomas Gleixner2012-07-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a backport of 3e997130bd2e8c6f5aaa49d6e3161d4d29b43ab0 The leap second rework unearthed another issue of inconsistent data. On timekeeping_resume() the timekeeper data is updated, but nothing calls timekeeping_update(), so now the update code in the timer interrupt sees stale values. This has been the case before those changes, but then the timer interrupt was using stale data as well so this went unnoticed for quite some time. Add the missing update call, so all the data is consistent everywhere. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Reported-and-tested-by: "Rafael J. Wysocki" <rjw@sisk.pl> Reported-and-tested-by: Martin Steigerwald <Martin@lichtvoll.de> Cc: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>, Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hrtimer: Update hrtimer base offsets each hrtimer_interruptJohn Stultz2012-07-191-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a backport of 5baefd6d84163443215f4a99f6a20f054ef11236 The update of the hrtimer base offsets on all cpus cannot be made atomically from the timekeeper.lock held and interrupt disabled region as smp function calls are not allowed there. clock_was_set(), which enforces the update on all cpus, is called either from preemptible process context in case of do_settimeofday() or from the softirq context when the offset modification happened in the timer interrupt itself due to a leap second. In both cases there is a race window for an hrtimer interrupt between dropping timekeeper lock, enabling interrupts and clock_was_set() issuing the updates. Any interrupt which arrives in that window will see the new time but operate on stale offsets. So we need to make sure that an hrtimer interrupt always sees a consistent state of time and offsets. ktime_get_update_offsets() allows us to get the current monotonic time and update the per cpu hrtimer base offsets from hrtimer_interrupt() to capture a consistent state of monotonic time and the offsets. The function replaces the existing ktime_get() calls in hrtimer_interrupt(). The overhead of the new function vs. ktime_get() is minimal as it just adds two store operations. This ensures that any changes to realtime or boottime offsets are noticed and stored into the per-cpu hrtimer base structures, prior to any hrtimer expiration and guarantees that timers are not expired early. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timekeeping: Provide hrtimer update functionThomas Gleixner2012-07-191-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | This is a backport of f6c06abfb3972ad4914cef57d8348fcb2932bc3b To finally fix the infamous leap second issue and other race windows caused by functions which change the offsets between the various time bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a function which atomically gets the current monotonic time and updates the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic overhead. The previous patch which provides ktime_t offsets allows us to make this function almost as cheap as ktime_get() which is going to be replaced in hrtimer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hrtimers: Move lock held region in hrtimer_interrupt()Thomas Gleixner2012-07-191-2/+3
| | | | | | | | | | | | | | | | | | | | | This is a backport of 196951e91262fccda81147d2bcf7fdab08668b40 We need to update the base offsets from this code and we need to do that under base->lock. Move the lock held region around the ktime_get() calls. The ktime_get() calls are going to be replaced with a function which gets the time and the offsets atomically. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timekeeping: Maintain ktime_t based offsets for hrtimersThomas Gleixner2012-07-191-2/+23
| | | | | | | | | | | | | | | | | | | | | | | This is a backport of 5b9fe759a678e05be4937ddf03d50e950207c1c0 We need to update the hrtimer clock offsets from the hrtimer interrupt context. To avoid conversions from timespec to ktime_t maintain a ktime_t based representation of those offsets in the timekeeper. This puts the conversion overhead into the code which updates the underlying offsets and provides fast accessible values in the hrtimer interrupt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timekeeping: Fix leapsecond triggered load spike issueJohn Stultz2012-07-191-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a backport of 4873fa070ae84a4115f0b3c9dfabc224f1bc7c51 The timekeeping code misses an update of the hrtimer subsystem after a leap second happened. Due to that timers based on CLOCK_REALTIME are either expiring a second early or late depending on whether a leap second has been inserted or deleted until an operation is initiated which causes that update. Unless the update happens by some other means this discrepancy between the timekeeping and the hrtimer data stays forever and timers are expired either early or late. The reported immediate workaround - $ data -s "`date`" - is causing a call to clock_was_set() which updates the hrtimer data structures. See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix Add the missing clock_was_set() call to update_wall_time() in case of a leap second event. The actual update is deferred to softirq context as the necessary smp function call cannot be invoked from hard interrupt context. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hrtimer: Provide clock_was_set_delayed()John Stultz2012-07-191-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a backport of f55a6faa384304c89cfef162768e88374d3312cb clock_was_set() cannot be called from hard interrupt context because it calls on_each_cpu(). For fixing the widely reported leap seconds issue it is necessary to call it from hard interrupt context, i.e. the timer tick code, which does the timekeeping updates. Provide a new function which denotes it in the hrtimer cpu base structure of the cpu on which it is called and raise the hrtimer softirq. We then execute the clock_was_set() notificiation from softirq context in run_hrtimer_softirq(). The hrtimer softirq is rarely used, so polling the flag there is not a performance issue. [ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get rid of all this ifdeffery ASAP ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched/nohz: Rewrite and fix load-avg computation -- againPeter Zijlstra2012-07-194-75/+205
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 5167e8d5417bf5c322a703d2927daec727ea40dd upstream. Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <dsmythies@telus.net> Reported-and-tested-by: Charles Wang <muming.wq@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* splice: fix racy pipe->buffers usesEric Dumazet2012-07-162-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 047fe3605235888f3ebcda0c728cb31937eadfe6 upstream. Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered by splice_shrink_spd() called from vmsplice_to_pipe() commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes) added capability to adjust pipe->buffers. Problem is some paths don't hold pipe mutex and assume pipe->buffers doesn't change for their duration. Fix this by adding nr_pages_max field in struct splice_pipe_desc, and use it in place of pipe->buffers where appropriate. splice_shrink_spd() loses its struct pipe_inode_info argument. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Tom Herbert <therbert@google.com> Tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> [bwh: Backported to 3.2: - Adjust context in vmsplice_to_pipe() - Update one more call to splice_shrink_spd(), from skb_splice_bits()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: change CPU ring buffer state from tracing_cpumaskVaibhav Nagarnaik2012-07-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 71babb2705e2203a64c27ede13ae3508a0d2c16c upstream. According to Documentation/trace/ftrace.txt: tracing_cpumask: This is a mask that lets the user only trace on specified CPUS. The format is a hex string representing the CPUS. The tracing_cpumask currently doesn't affect the tracing state of per-CPU ring buffers. This patch enables/disables CPU recording as its corresponding bit in tracing_cpumask is set/unset. Link: http://lkml.kernel.org/r/1336096792-25373-3-git-send-email-vnagarnaik@google.com Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Laurent Chavey <chavey@google.com> Cc: Justin Teravest <teravest@google.com> Cc: David Sharp <dhsharp@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: correctly synchronize rss-counters at exit/execKonstantin Khlebnikov2012-07-161-0/+1
| | | | | | | | | | | | | | | | | | | | commit 4fe7efdbdfb1c7e7a7f31decfd831c0f31d37091 upstream. do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does put_user(clear_child_tid) which can update task->rss_stat and thus make mm->rss_stat inconsistent. This triggers the "BUG:" printk in check_mm(). Let's fix this bug in the safest way, and optimize/cleanup this later. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ntp: Correct TAI offset during leap secondRichard Cochran2012-06-221-1/+1
| | | | | | | | | | | | | | commit dd48d708ff3e917f6d6b6c2b696c3f18c019feed upstream. When repeating a UTC time value during a leap second (when the UTC time should be 23:59:60), the TAI timescale should not stop. The kernel NTP code increments the TAI offset one second too late. This patch fixes the issue by incrementing the offset during the leap second itself. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop()Seiji Aguchi2012-06-221-2/+2
| | | | | | | | | | | | | | | | | | | commit 62be73eafaa045d3233337303fb140f7f8a61135 upstream. This patch moves kmsg_dump(KMSG_DUMP_PANIC) below smp_send_stop(), to serialize the crash-logging process via smp_send_stop() and to thus retrieve a more stable crash image of all CPUs stopped. Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: dle-develop@lists.sourceforge.net <dle-develop@lists.sourceforge.net> Cc: Satoru Moriya <satoru.moriya@hds.com> Cc: Tony Luck <tony.luck@intel.com> Cc: a.p.zijlstra@chello.nl <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/5C4C569E8A4B9B42A84A977CF070A35B2E4D7A5CE2@USINDEVS01.corp.hds.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tracing: Have tracing_off() actually turn tracing offSteven Rostedt2012-06-221-1/+1
| | | | | | | | | | | | | | | | | | commit f2bf1f6f5f89d031245067512449fc889b2f4bb2 upstream. A recent update to have tracing_on/off() only affect the ftrace ring buffers instead of all ring buffers had a cut and paste error. The tracing_off() did the exact same thing as tracing_on() and would not actually turn off tracing. Unfortunately, tracing_off() is more important to be working than tracing_on() as this is a key development tool, as it lets the developer turn off tracing as soon as a problem is discovered. It is also used by panic and oops code. This bug also breaks the 'echo func:traceoff > set_ftrace_filter' Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sched: Fix the relax_domain_level boot parameterDimitri Sivanich2012-06-171-6/+3
| | | | | | | | | | | | | | | | | | | | | | commit a841f8cef4bb124f0f5563314d0beaf2e1249d72 upstream. It does not get processed because sched_domain_level_max is 0 at the time that setup_relax_domain_level() is run. Simply accept the value as it is, as we don't know the value of sched_domain_level_max until sched domain construction is completed. Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls the set_domain_attribute() routine prior to setting the sd->level, however, the set_domain_attribute() routine relies on the sd->level to decide whether idle load balancing will be off/on. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecondJohn Stultz2012-06-171-0/+2
| | | | | | | | | | | | | | | | | | commit fad0c66c4bb836d57a5f125ecd38bed653ca863a upstream. Commit 6b43ae8a61 (ntp: Fix leap-second hrtimer livelock) broke the leapsecond update of CLOCK_MONOTONIC. The missing leapsecond update to wall_to_monotonic causes discontinuities in CLOCK_MONOTONIC. Adjust wall_to_monotonic when NTP inserted a leapsecond. Reported-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Tested-by: Richard Cochran <richardcochran@gmail.com> Link: http://lkml.kernel.org/r/1338400497-12420-1-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm/fork: fix overflow in vma length when copying mmap on cloneSiddhesh Poyarekar2012-06-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7edc8b0ac16cbaed7cb4ea4c6b95ce98d2997e84 upstream. The vma length in dup_mmap is calculated and stored in a unsigned int, which is insufficient and hence overflows for very large maps (beyond 16TB). The following program demonstrates this: #include <stdio.h> #include <unistd.h> #include <sys/mman.h> #define GIG 1024 * 1024 * 1024L #define EXTENT 16393 int main(void) { int i, r; void *m; char buf[1024]; for (i = 0; i < EXTENT; i++) { m = mmap(NULL, (size_t) 1 * 1024 * 1024 * 1024L, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); if (m == (void *)-1) printf("MMAP Failed: %d\n", m); else printf("%d : MMAP returned %p\n", i, m); r = fork(); if (r == 0) { printf("%d: successed\n", i); return 0; } else if (r < 0) printf("FORK Failed: %d\n", r); else if (r > 0) wait(NULL); } return 0; } Increase the storage size of the result to unsigned long, which is sufficient for storing the difference between addresses. Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is ↵Tejun Heo2012-06-011-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | active commit 544ecf310f0e7f51fa057ac2a295fc1b3b35a9d3 upstream. worker_enter_idle() has WARN_ON_ONCE() which triggers if nr_running isn't zero when every worker is idle. This can trigger spuriously while a cpu is going down due to the way trustee sets %WORKER_ROGUE and zaps nr_running. It first sets %WORKER_ROGUE on all workers without updating nr_running, releases gcwq->lock, schedules, regrabs gcwq->lock and then zaps nr_running. If the last running worker enters idle inbetween, it would see stale nr_running which hasn't been zapped yet and trigger the WARN_ON_ONCE(). Fix it by performing the sanity check iff the trustee is idle. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
*---. Merge branches 'perf-urgent-for-linus', 'x86-urgent-for-linus' and ↵Linus Torvalds2012-05-173-1/+7
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf, x86 and scheduler updates from Ingo Molnar. * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tracing: Do not enable function event with enable perf stat: handle ENXIO error for perf_event_open perf: Turn off compiler warnings for flex and bison generated files perf stat: Fix case where guest/host monitoring is not supported by kernel perf build-id: Fix filename size calculation * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, kvm: KVM paravirt kernels don't check for CPUID being unavailable x86: Fix section annotation of acpi_map_cpu2node() x86/microcode: Ensure that module is only loaded on supported Intel CPUs * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption
| | | * sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list ↵Igor Mammedov2012-05-091-0/+2
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | assumption If we have one cpu that failed to boot and boot cpu gave up on waiting for it and then another cpu is being booted, kernel might crash with following OOPS: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80 Call Trace: [<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50 The crash happens in init_sched_groups_power() that expects sched_groups to be circular linked list. However it is not always true, since sched_groups preallocated in __sdt_alloc are initialized in build_sched_groups and it may exit early if (cpu != cpumask_first(sched_domain_span(sd))) return 0; without initializing sd->groups->next field. Fix bug by initializing next field right after sched_group was allocated. Also-Reported-by: Jiang Liu <liuj97@gmail.com> Signed-off-by: Igor Mammedov <imammedo@redhat.com> Cc: a.p.zijlstra@chello.nl Cc: pjt@google.com Cc: seto.hidetoshi@jp.fujitsu.com Link: http://lkml.kernel.org/r/1336559908-32533-1-git-send-email-imammedo@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * / tracing: Do not enable function event with enableSteven Rostedt2012-05-102-1/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the adding of function tracing event to perf, it caused a side effect that produces the following warning when enabling all events in ftrace: # echo 1 > /sys/kernel/debug/tracing/events/enable [console] event trace: Could not enable event function This is because when enabling all events via the debugfs system it ignores events that do not have a ->reg() function assigned. This was to skip over the ftrace internal events (as they are not TRACE_EVENTs). But as the ftrace function event now has a ->reg() function attached to it for use with perf, it is no longer ignored. Worse yet, this ->reg() function is being called when it should not be. It returns an error and causes the above warning to be printed. By adding a new event_call flag (TRACE_EVENT_FL_IGNORE_ENABLE) and have all ftrace internel event structures have it set, setting the events/enable will no longe try to incorrectly enable the function event and does not warn. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | genirq: export handle_edge_irq() and irq_to_desc()Jiri Kosina2012-05-152-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Export handle_edge_irq() and irq_to_desc() to modules to allow them to do things such as __irq_set_handler_locked(...., handle_edge_irq); This fixes ERROR: "handle_edge_irq" [drivers/gpio/gpio-pch.ko] undefined! ERROR: "irq_to_desc" [drivers/gpio/gpio-pch.ko] undefined! when gpio-pch is being built as a module. This was introduced by commit df9541a60af0 ("gpio: pch9: Use proper flow type handlers") that added __irq_set_handler_locked(d->irq, handle_edge_irq); but handle_edge_irq() was not exported for modules (and inlined __irq_set_handler_locked() requires irq_to_desc() exported as well) Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | namespaces, pid_ns: fix leakage on fork() failureMike Galbraith2012-05-101-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fork() failure post namespace creation for a child cloned with CLONE_NEWPID leaks pid_namespace/mnt_cache due to proc being mounted during creation, but not unmounted during cleanup. Call pid_ns_release_proc() during cleanup. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Louis Rilling <louis.rilling@kerlabs.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>