summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Linux 3.2.44v3.2.44Ben Hutchings2013-04-251-1/+1
|
* sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()sTejun Heo2013-04-251-2/+4
| | | | | | | | | | | | | | | | | | | | | | commit 383efcd00053ec40023010ce5034bd702e7ab373 upstream. try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130318192234.GD3042@htj.dyndns.org Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* KVM: Allow cross page reads and writes from cached translations.Andrew Honig2013-04-254-18/+45
| | | | | | | | | | | | | | | | | | | commit 8f964525a121f2ff2df948dac908dcc65be21b5b upstream. This patch adds support for kvm_gfn_to_hva_cache_init functions for reads and writes that will cross a page. If the range falls within the same memslot, then this will be a fast operation. If the range is split between two memslots, then the slower kvm_read_guest and kvm_write_guest are used. Tested: Test against kvm_clock unit tests. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> [bwh: Backported to 3.2: - Drop change in lapic.c - Keep using __gfn_to_memslot() in kvm_gfn_to_hva_cache_init()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798)Andy Honig2013-04-251-2/+5
| | | | | | | | | | | | | | | | | | commit a2c118bfab8bc6b8bb213abfc35201e441693d55 upstream. If the guest specifies a IOAPIC_REG_SELECT with an invalid value and follows that with a read of the IOAPIC_REG_WINDOW KVM does not properly validate that request. ioapic_read_indirect contains an ASSERT(redir_index < IOAPIC_NUM_PINS), but the ASSERT has no effect in non-debug builds. In recent kernels this allows a guest to cause a kernel oops by reading invalid memory. In older kernels (pre-3.3) this allows a guest to read from large ranges of host memory. Tested: tested against apic unit tests. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions ↵Andy Honig2013-04-252-27/+16
| | | | | | | | | | | | | | | | | | | | | | | (CVE-2013-1797) commit 0b79459b482e85cb7426aa7da683a9f2c97aeae1 upstream. There is a potential use after free issue with the handling of MSR_KVM_SYSTEM_TIME. If the guest specifies a GPA in a movable or removable memory such as frame buffers then KVM might continue to write to that address even after it's removed via KVM_SET_USER_MEMORY_REGION. KVM pins the page in memory so it's unlikely to cause an issue, but if the user space component re-purposes the memory previously used for the guest, then the guest will be able to corrupt that memory. Tested: Tested against kvmclock unit test Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> [bwh: Backported to 3.2: - Adjust context - We do not implement the PVCLOCK_GUEST_STOPPED flag] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME ↵Andy Honig2013-04-251-0/+5
| | | | | | | | | | | | | | | | | | | | (CVE-2013-1796) commit c300aa64ddf57d9c5d9c898a64b36877345dd4a9 upstream. If the guest sets the GPA of the time_page so that the request to update the time straddles a page then KVM will write onto an incorrect page. The write is done byusing kmap atomic to get a pointer to the page for the time structure and then performing a memcpy to that page starting at an offset that the guest controls. Well behaved guests always provide a 32-byte aligned address, however a malicious guest could use this to corrupt host kernel memory. Tested: Tested against kvmclock unit test. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* hfsplus: fix potential overflow in hfsplus_file_truncate()Vyacheslav Dubeyko2013-04-251-1/+1
| | | | | | | | | | | | | | commit 12f267a20aecf8b84a2a9069b9011f1661c779b4 upstream. Change a u32 to loff_t hfsplus_file_truncate(). Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* fbcon: fix locking harderDave Airlie2013-04-253-3/+13
| | | | | | | | | | | | | | | | | | | | commit 054430e773c9a1e26f38e30156eff02dedfffc17 upstream. Okay so Alan's patch handled the case where there was no registered fbcon, however the other path entered in set_con2fb_map pit. In there we called fbcon_takeover, but we also took the console lock in a couple of places. So push the console lock out to the callers of set_con2fb_map, this means fbmem and switcheroo needed to take the lock around the fb notifier entry points that lead to this. This should fix the efifb regression seen by Maarten. Tested-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> Tested-by: Lu Hua <huax.lu@intel.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* mtd: Disable mtdchar mmap on MMU systemsDavid Woodhouse2013-04-251-1/+5
| | | | | | | | | | | commit f5cf8f07423b2677cebebcebc863af77223a4972 upstream. This code was broken because it assumed that all MTD devices were map-based. Disable it for now, until it can be fixed properly for the next merge window. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* r8169: fix auto speed down issuehayeswang2013-04-251-3/+25
| | | | | | | | | | | | | | | commit e2409d83434d77874b461b78af6a19cd6e6a1280 upstream. It would cause no link after suspending or shutdowning when the nic changes the speed to 10M and connects to a link partner which forces the speed to 100M. Check the link partner ability to determine which speed to set. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Acked-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* block: avoid using uninitialized value in from queue_var_storeArnd Bergmann2013-04-251-0/+2
| | | | | | | | | | | | | | | | | | | commit c678ef5286ddb5cf70384ad5af286b0afc9b73e1 upstream. As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions that use a value generated by queue_var_store independent of whether that value was set or not. block/blk-sysfs.c: In function 'queue_store_nonrot': block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized] Unlike most other such warnings, this one is not a false positive, writing any non-number string into the sysfs files indeed has an undefined result, rather than returning an error. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ALSA: hda - fix typo in proc outputDavid Henningsson2013-04-251-1/+1
| | | | | | | | | | | commit aeb3a97222832e5457c4b72d72235098ce4bfe8d upstream. Rename "Digitial In" to "Digital In". This function is only used for proc output, so should not cause any problems to change. Signed-off-by: David Henningsson <david.henningsson@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ALSA: hda - Enabling Realtek ALC 671 codecRainer Koenig2013-04-251-1/+3
| | | | | | | | | | | | commit 1d87caa69c04008e09f5ff47b5e6acb6116febc7 upstream. * Added the device ID to the modalias list and assinged ALC662 patches for it * Added 4 port support for the device ID 0671 in alc662_parse_auto_config Signed-off-by: Rainer Koenig <Rainer.Koenig@ts.fujitsu.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* msi-wmi: Fix memory leakMaxim Mikityanskiy2013-04-251-1/+3
| | | | | | | | | | | | | commit 51c94491c82c3d9029f6e87a1a153db321d88e35 upstream. Fix memory leak - don't forget to kfree ACPI object when returning from msi_wmi_notify() after suppressing key event. Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com> Acked-by: Anisse Astier <anisse@astier.eu> Signed-off-by: Lee, Chun-Yi <jlee@suse.com> Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* drm/i915: add quirk to invert brightness on Packard Bell NCL20Jani Nikula2013-04-251-0/+3
| | | | | | | | | | commit 5559ecadad5a73b27f863e92f4b4f369501dce6f upstream. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=44156 Reported-by: Alan Zimmerman <alan.zimm@gmail.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* drm/i915: add quirk to invert brightness on eMachines e725Jani Nikula2013-04-251-0/+3
| | | | | | | | | | | commit 01e3a8feb40e54b962a20fa7eb595c5efef5e109 upstream. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=31522#c35 [Note: There are more than one broken setups in the bug. This fixes one.] Reported-by: Martins <andrissr@inbox.lv> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* drm/i915: add quirk to invert brightness on eMachines G725Jani Nikula2013-04-251-0/+3
| | | | | | | | | | commit 1ffff60320879830e469e26062c18f75236822ba upstream. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=59628 Reported-by: Roland Gruber <post@rolandgruber.de> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* DRM/i915: Add QUIRK_INVERT_BRIGHTNESS for NCR machines.Egbert Eich2013-04-251-0/+33
| | | | | | | | | | | | | | | | | | commit 5f85f176c2f1c9d2a23f60ca0b99e4d0aa5a26a7 upstream. NCR machines with LVDS panels using Intel chipsets need to have the QUIRK_INVERT_BRIGHTNESS bit set. Unfortunately NCR doesn't set a meaningful subvendor/subdevice ID, therefore we add a DMI dependent quirk list. Signed-off-by: Egbert Eich <eich@suse.de> [danvet: fixup whitespace fail.] Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Jani Nikula <jani.nikula@intel.com> [bwh: Backported to 3.2: - Adjust context - Add #include <linux/dmi.h>] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* drm/i915: panel: invert brightness acer aspire 5734zCarsten Emde2013-04-251-1/+5
| | | | | | | | | | | | | | commit 5a15ab5b93e4a3ebcd4fa6c76cf646a45e9cf806 upstream. Mark the Acer Aspire 5734Z that this machines requires the module to invert the panel backlight brightness value after reading from and prior to writing to the PCI configuration space. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Acked-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* drm/i915: panel: invert brightness via quirkCarsten Emde2013-04-254-10/+32
| | | | | | | | | | | | | commit 4dca20efb1a9c2efefc28ad2867e5d6c3f5e1955 upstream. A machine may need to invert the panel backlight brightness value. This patch adds the infrastructure for a quirk to do so. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* drm/i915: panel: invert brightness via parameterCarsten Emde2013-04-252-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | commit 7bd90909bbf9ce7c40e1da3d72b97b93839c188a upstream. Following the documentation of the Legacy Backlight Brightness (LBB) Register in the configuration space of some Intel PCI graphics adapters, setting the LBB register with the value 0x0 causes the backlight to be turned off, and 0xFF causes the backlight to be set to 100% intensity (http://download.intel.com/embedded/processors/Whitepaper/324567.pdf). The Acer Aspire 5734Z, however, turns the backlight off at 0xFF and sets it to maximum intensity at 0. In consequence, the screen of this systems becomes dark at an early boot stage which makes it unusable. The same inversion applies to the BLC_PWM_CTL I915 register. This problem was introduced in kernel version 2.6.38 when the PCI device of this system was first supported by the i915 KMS module. This patch adds a parameter to the i915 module to enable inversion of the brightness variable (i915.invert_brightness). Signed-off-by: Carsten Emde <C.Emde@osadl.org> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Btrfs: fix race between mmap writes and compressionChris Mason2013-04-253-0/+49
| | | | | | | | | | | | | | | | | | | | | | | commit 4adaa611020fa6ac65b0ac8db78276af4ec04e63 upstream. Btrfs uses page_mkwrite to ensure stable pages during crc calculations and mmap workloads. We call clear_page_dirty_for_io before we do any crcs, and this forces any application with the file mapped to wait for the crc to finish before it is allowed to change the file. With compression on, the clear_page_dirty_for_io step is happening after we've compressed the pages. This means the applications might be changing the pages while we are compressing them, and some of those modifications might not hit the disk. This commit adds the clear_page_dirty_for_io before compression starts and makes sure to redirty the page if we have to fallback to uncompressed IO as well. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* writeback: fix dirtied pages accounting on redirtyWu Fengguang2013-04-252-0/+21
| | | | | | | | | | | | | | | | | | | | | commit 2f800fbd777b792de54187088df19a7df0251254 upstream. De-account the accumulative dirty counters on page redirty. Page redirties (very common in ext4) will introduce mismatch between counters (a) and (b) a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied b) NR_WRITTEN, BDI_WRITTEN This will introduce systematic errors in balanced_rate and result in dirty page position errors (ie. the dirty pages are no longer balanced around the global/bdi setpoints). Acked-by: Jan Kara <jack@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* thermal: return an error on failure to register thermal classRichard Guy Briggs2013-04-251-0/+1
| | | | | | | | | | | | | | | | commit da28d966f6aa942ae836d09729f76a1647932309 upstream. The return code from the registration of the thermal class is used to unallocate resources, but this failure isn't passed back to the caller of thermal_init. Return this failure back to the caller. This bug was introduced in changeset 4cb18728 which overwrote the return code when the variable was re-used to catch the return code of the registration of the genetlink thermal socket family. Signed-off-by: Richard Guy Briggs <rbriggs@redhat.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* net: fix incorrect credentials passingLinus Torvalds2013-04-253-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 83f1b4ba917db5dc5a061a44b3403ddb6e783494 upstream. Commit 257b5358b32f ("scm: Capture the full credentials of the scm sender") changed the credentials passing code to pass in the effective uid/gid instead of the real uid/gid. Obviously this doesn't matter most of the time (since normally they are the same), but it results in differences for suid binaries when the wrong uid/gid ends up being used. This just undoes that (presumably unintentional) part of the commit. Reported-by: Andy Lutomirski <luto@amacapital.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 3.2: scm_set_cred() does user namespace conversion of euid/egid using cred_to_ucred(). Add and use cred_real_to_ucred() to do the same thing for real uid/gid.] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* kernel/signal.c: stop info leak via the tkill and the tgkill syscallsEmese Revfy2013-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b9e146d8eb3b9ecae5086d373b50fa0c1f3e7f0f upstream. This fixes a kernel memory contents leak via the tkill and tgkill syscalls for compat processes. This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field when handling signals delivered from tkill. The place of the infoleak: int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from) { ... put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr); ... } Signed-off-by: Emese Revfy <re.emese@gmail.com> Reviewed-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* hugetlbfs: add swap entry check in follow_hugetlb_page()Naoya Horiguchi2013-04-251-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9cc3a5bd40067b9a0fbd49199d0780463fc2140f upstream. With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory error happens on a hugepage and the affected processes try to access the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) in get_page(). The reason for this bug is that coredump-related code doesn't recognise "hugepage hwpoison entry" with which a pmd entry is replaced when a memory error occurs on a hugepage. In other words, physical address information is stored in different bit layout between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page() which is called in get_dump_page() returns a wrong page from a given address. The expected behavior is like this: absent is_swap_pte FOLL_DUMP Expected behavior ------------------------------------------------------------------- true false false hugetlb_fault false true false hugetlb_fault false false false return page true false true skip page (to avoid allocation) false true true hugetlb_fault false false true return page With this patch, we can call hugetlb_fault() and take proper actions (we wait for migration entries, fail with VM_FAULT_HWPOISON_LARGE for hwpoisoned entries,) and as the result we can dump all hugepages except for hwpoisoned ones. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: 7698/1: perf: fix group validation when using enable_on_execWill Deacon2013-04-251-1/+4
| | | | | | | | | | | | | | | | | | | | commit cb2d8b342aa084d1f3ac29966245dec9163677fb upstream. Events may be created with attr->disabled == 1 and attr->enable_on_exec == 1, which confuses the group validation code because events with the PERF_EVENT_STATE_OFF are not considered candidates for scheduling, which may lead to failure at group scheduling time. This patch fixes the validation check for ARM, so that events in the OFF state are still considered when enable_on_exec is true. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Jiri Olsa <jolsa@redhat.com> Reported-by: Sudeep KarkadaNagesha <Sudeep.KarkadaNagesha@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: 7696/1: Fix kexec by setting outer_cache.inv_all for FeroceonIllia Ragozin2013-04-251-0/+1
| | | | | | | | | | | | | | | | | | | | commit cd272d1ea71583170e95dde02c76166c7f9017e6 upstream. On Feroceon the L2 cache becomes non-coherent with the CPU when the L1 caches are disabled. Thus the L2 needs to be invalidated after both L1 caches are disabled. On kexec before the starting the code for relocation the kernel, the L1 caches are disabled in cpu_froc_fin (cpu_v7_proc_fin for Feroceon), but after L2 cache is never invalidated, because inv_all is not set in cache-feroceon-l2.c. So kernel relocation and decompression may has (and usually has) errors. Setting the function enables L2 invalidation and fixes the issue. Signed-off-by: Illia Ragozin <illia.ragozin@grapecom.com> Acked-by: Jason Cooper <jason@lakedaemon.net> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ath9k_hw: change AR9580 initvals to fix a stability issueFelix Fietkau2013-04-251-1/+1
| | | | | | | | | | | | | commit f09a878511997c25a76bf111a32f6b8345a701a5 upstream. The hardware parsing of Control Wrapper Frames needs to be disabled, as it has been causing spurious decryption error reports. The initvals for other chips have been updated to disable it, but AR9580 was left out for some reason. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* can: sja1000: fix handling on dt properties on little endian systemsChristoph Fritz2013-04-251-16/+15
| | | | | | | | | | | | | commit 0443de5fbf224abf41f688d8487b0c307dc5a4b4 upstream. To get correct endianes on little endian cpus (like arm) while reading device tree properties, this patch replaces of_get_property() with of_property_read_u32(). While there use of_property_read_bool() for the handling of the boolean "nxp,no-comparator-bypass" property. Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* of: introduce helper to manage booleanJean-Christophe PLAGNIOL-VILLARD2013-04-251-0/+16
| | | | | | | | | | | | | | | commit fa4d34ccd0914ac87336ea2c17e9370dfecef286 upstream. of_property_read_bool Search for a property in a device node. Returns true if the property exist false otherwise. Signed-off-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com> Acked-by: Rob Herring <rob.herring@calxeda.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ath9k_htc: accept 1.x firmware newer than 1.3Felix Fietkau2013-04-251-1/+1
| | | | | | | | | | | | commit 319e7bd96aca64a478f3aad40711c928405b8b77 upstream. Since the firmware has been open sourced, the minor version has been bumped to 1.4 and the API/ABI will stay compatible across further 1.x releases. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: Do 15e0d9e37c (ARM: pm: let platforms select cpu_suspend support) properlyRussell King2013-04-256-6/+6
| | | | | | | | | | | commit b6c7aabd923a17af993c5a5d5d7995f0b27c000a upstream. Let's do the changes properly and fix the same problem everywhere, not just for one case. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> [bwh: Backported to 3.2: mohawk doesn't support suspend at all] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* vfs: Revert spurious fix to spinning prevention in prune_icache_sbSuleiman Souhlal2013-04-251-1/+1
| | | | | | | | | | | | | | | | | | | commit 5b55d708335a9e3e4f61f2dadf7511502205ccd1 upstream. Revert commit 62a3ddef6181 ("vfs: fix spinning prevention in prune_icache_sb"). This commit doesn't look right: since we are looking at the tail of the list (sb->s_inode_lru.prev) if we want to skip an inode, we should put it back at the head of the list instead of the tail, otherwise we will keep spinning on it. Discovered when investigating why prune_icache_sb came top in perf reports of a swapping load. Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* kobject: fix kset_find_obj() race with concurrent last kobject_put()Linus Torvalds2013-04-251-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a49b7e82cab0f9b41f483359be83f44fbb6b4979 upstream. Anatol Pomozov identified a race condition that hits module unloading and re-loading. To quote Anatol: "This is a race codition that exists between kset_find_obj() and kobject_put(). kset_find_obj() might return kobject that has refcount equal to 0 if this kobject is freeing by kobject_put() in other thread. Here is timeline for the crash in case if kset_find_obj() searches for an object tht nobody holds and other thread is doing kobject_put() on the same kobject: THREAD A (calls kset_find_obj()) THREAD B (calls kobject_put()) splin_lock() atomic_dec_return(kobj->kref), counter gets zero here ... starts kobject cleanup .... spin_lock() // WAIT thread A in kobj_kset_leave() iterate over kset->list atomic_inc(kobj->kref) (counter becomes 1) spin_unlock() spin_lock() // taken // it does not know that thread A increased counter so it remove obj from list spin_unlock() vfree(module) // frees module object with containing kobj // kobj points to freed memory area!! kobject_put(kobj) // OOPS!!!! The race above happens because module.c tries to use kset_find_obj() when somebody unloads module. The module.c code was introduced in commit 6494a93d55fa" Anatol supplied a patch specific for module.c that worked around the problem by simply not using kset_find_obj() at all, but rather than make a local band-aid, this just fixes kset_find_obj() to be thread-safe using the proper model of refusing the get a new reference if the refcount has already dropped to zero. See examples of this proper refcount handling not only in the kref documentation, but in various other equivalent uses of this pattern by grepping for atomic_inc_not_zero(). [ Side note: the module race does indicate that module loading and unloading is not properly serialized wrt sysfs information using the module mutex. That may require further thought, but this is the correct fix at the kobject layer regardless. ] Reported-analyzed-and-tested-by: Anatol Pomozov <anatol.pomozov@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* kref: Implement kref_get_unless_zero v3Thomas Hellstrom2013-04-251-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | commit 4b20db3de8dab005b07c74161cb041db8c5ff3a7 upstream. This function is intended to simplify locking around refcounting for objects that can be looked up from a lookup structure, and which are removed from that lookup structure in the object destructor. Operations on such objects require at least a read lock around lookup + kref_get, and a write lock around kref_put + remove from lookup structure. Furthermore, RCU implementations become extremely tricky. With a lookup followed by a kref_get_unless_zero *with return value check* locking in the kref_put path can be deferred to the actual removal from the lookup structure and RCU lookups become trivial. v2: Formatting fixes. v3: Invert the return value. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Dave Airlie <airlied@redhat.com> [bwh: Backported to 3.2: - Adjust context - Add #include <linux/atomic.h>] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Btrfs: make sure nbytes are right after log replayJosef Bacik2013-04-251-6/+42
| | | | | | | | | | | | | | | | | | | commit 4bc4bee4595662d8bff92180d5c32e3313a704b0 upstream. While trying to track down a tree log replay bug I noticed that fsck was always complaining about nbytes not being right for our fsynced file. That is because the new fsync stuff doesn't wait for ordered extents to complete, so the inodes nbytes are not necessarily updated properly when we log it. So to fix this we need to set nbytes to whatever it is on the inode that is on disk, so when we replay the extents we can just add the bytes that are being added as we replay the extent. This makes it work for the case that we have the wrong nbytes or the case that we logged everything and nbytes is actually correct. With this I'm no longer getting nbytes errors out of btrfsck. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* tracing: Fix possible NULL pointer dereferencesNamhyung Kim2013-04-251-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6a76f8c0ab19f215af2a3442870eeb5f0e81998d upstream. Currently set_ftrace_pid and set_graph_function files use seq_lseek for their fops. However seq_open() is called only for FMODE_READ in the fops->open() so that if an user tries to seek one of those file when she open it for writing, it sees NULL seq_file and then panic. It can be easily reproduced with following command: $ cd /sys/kernel/debug/tracing $ echo 1234 | sudo tee -a set_ftrace_pid In this example, GNU coreutils' tee opens the file with fopen(, "a") and then the fopen() internally calls lseek(). Link: http://lkml.kernel.org/r/1365663302-2170-1-git-send-email-namhyung@kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bwh: Backported to 3.2: ftrace_regex_lseek() is static] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* target: Fix incorrect fallthrough of ALUA Standby/Offline/Transition CDBsNicholas Bellinger2013-04-251-0/+3
| | | | | | | | | | | | | | | commit 30f359a6f9da65a66de8cadf959f0f4a0d498bba upstream. This patch fixes a bug where a handful of informational / control CDBs that should be allowed during ALUA access state Standby/Offline/Transition where incorrectly returning CHECK_CONDITION + ASCQ_04H_ALUA_TG_PT_*. This includes INQUIRY + REPORT_LUNS, which would end up preventing LUN registration when LUN scanning occured during these ALUA access states. Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* target: Fix MAINTENANCE_IN service action CDB checks to use lower 5 bitsNicholas Bellinger2013-04-252-4/+5
| | | | | | | | | | | | | | | | | | | | | commit ba539743b70cd160c84bab1c82910d0789b820f8 upstream. This patch fixes the MAINTENANCE_IN service action type checks to only look at the proper lower 5 bits of cdb byte 1. This addresses the case where MI_REPORT_TARGET_PGS w/ extended header using the upper three bits of cdb byte 1 was not processed correctly in transport_generic_cmd_sequencer, as well as the three cases for standby, unavailable, and transition ALUA primary access state checks. Also add MAINTENANCE_IN to the excluded list in transport_generic_prepare_cdb() to prevent the PARAMETER DATA FORMAT bits from being cleared. Cc: Hannes Reinecke <hare@suse.de> Cc: Rob Evers <revers@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Roland Dreier <roland@purestorage.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metalBoris Ostrovsky2013-04-255-13/+21
| | | | | | | | | | | | | | | | | | | | | | | | | commit 511ba86e1d386f671084b5d0e6f110bb30b8eeb2 upstream. Invoking arch_flush_lazy_mmu_mode() results in calls to preempt_enable()/disable() which may have performance impact. Since lazy MMU is not used on bare metal we can patch away arch_flush_lazy_mmu_mode() so that it is never called in such environment. [ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU updates" may cause a minor performance regression on bare metal. This patch resolves that performance regression. It is somewhat unclear to me if this is a good -stable candidate. ] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1364045796-10720-2-git-send-email-konrad.wilk@oracle.com Tested-by: Josh Boyer <jwboyer@redhat.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updatesSamu Kallio2013-04-251-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1160c2779b826c6f5c08e5cc542de58fd1f667d5 upstream. In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops when lazy MMU updates are enabled, because set_pgd effects are being deferred. One instance of this problem is during process mm cleanup with memory cgroups enabled. The chain of events is as follows: - zap_pte_range enables lazy MMU updates - zap_pte_range eventually calls mem_cgroup_charge_statistics, which accesses the vmalloc'd mem_cgroup per-cpu stat area - vmalloc_fault is triggered which tries to sync the corresponding PGD entry with set_pgd, but the update is deferred - vmalloc_fault oopses due to a mismatch in the PUD entries The OOPs usually looks as so: ------------[ cut here ]------------ kernel BUG at arch/x86/mm/fault.c:396! invalid opcode: 0000 [#1] SMP .. snip .. CPU 1 Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1 RIP: e030:[<ffffffff816271bf>] [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208 .. snip .. Call Trace: [<ffffffff81627759>] do_page_fault+0x399/0x4b0 [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110 [<ffffffff81624065>] page_fault+0x25/0x30 [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50 [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350 [<ffffffff8118aac7>] mem_cgroup_uncharge_page+0x57/0x60 [<ffffffff8115fbc0>] page_remove_rmap+0xe0/0x150 [<ffffffff8115311a>] ? vm_normal_page+0x1a/0x80 [<ffffffff81153e61>] unmap_single_vma+0x531/0x870 [<ffffffff81154962>] unmap_vmas+0x52/0xa0 [<ffffffff81007442>] ? pte_mfn_to_pfn+0x72/0x100 [<ffffffff8115c8f8>] exit_mmap+0x98/0x170 [<ffffffff810050d9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [<ffffffff81059ce3>] mmput+0x83/0xf0 [<ffffffff810624c4>] exit_mm+0x104/0x130 [<ffffffff8106264a>] do_exit+0x15a/0x8c0 [<ffffffff810630ff>] do_group_exit+0x3f/0xa0 [<ffffffff81063177>] sys_exit_group+0x17/0x20 [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the changes visible to the consistency checks. RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=914737 Tested-by: Josh Boyer <jwboyer@redhat.com> Reported-and-Tested-by: Krishna Raman <kraman@redhat.com> Signed-off-by: Samu Kallio <samu.kallio@aberdeencloud.com> Link: http://lkml.kernel.org/r/1364045796-10720-1-git-send-email-konrad.wilk@oracle.com Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* tracing: Fix double free when function profile init failedNamhyung Kim2013-04-251-1/+0
| | | | | | | | | | | | | | | commit 83e03b3fe4daffdebbb42151d5410d730ae50bd1 upstream. On the failure path, stat->start and stat->pages will refer same page. So it'll attempt to free the same page again and get kernel panic. Link: http://lkml.kernel.org/r/1364820385-32027-1-git-send-email-namhyung@kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* spinlocks and preemption points need to be at least compiler barriersLinus Torvalds2013-04-252-17/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 386afc91144b36b42117b0092893f15bc8798a80 upstream. In UP and non-preempt respectively, the spinlocks and preemption disable/enable points are stubbed out entirely, because there is no regular code that can ever hit the kind of concurrency they are meant to protect against. However, while there is no regular code that can cause scheduling, we _do_ end up having some exceptional (literally!) code that can do so, and that we need to make sure does not ever get moved into the critical region by the compiler. In particular, get_user() and put_user() is generally implemented as inline asm statements (even if the inline asm may then make a call instruction to call out-of-line), and can obviously cause a page fault and IO as a result. If that inline asm has been scheduled into the middle of a preemption-safe (or spinlock-protected) code region, we obviously lose. Now, admittedly this is *very* unlikely to actually ever happen, and we've not seen examples of actual bugs related to this. But partly exactly because it's so hard to trigger and the resulting bug is so subtle, we should be extra careful to get this right. So make sure that even when preemption is disabled, and we don't have to generate any actual *code* to explicitly tell the system that we are in a preemption-disabled region, we need to at least tell the compiler not to move things around the critical region. This patch grew out of the same discussion that caused commits 79e5f05edcbf ("ARC: Add implicit compiler barrier to raw_local_irq* functions") and 3e2e0d2c222b ("tile: comment assumption about __insn_mtspr for <asm/irqflags.h>") to come about. Note for stable: use discretion when/if applying this. As mentioned, this bug may never have actually bitten anybody, and gcc may never have done the required code motion for it to possibly ever trigger in practice. Cc: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.2: drop sched_preempt_enable_no_resched()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ASoC: wm8903: Fix the bypass to HP/LINEOUT when no DAC or ADC is runningAlban Bedel2013-04-251-0/+2
| | | | | | | | | | | | | commit f1ca493b0b5e8f42d3b2dc8877860db2983f47b6 upstream. The Charge Pump needs the DSP clock to work properly, without it the bypass to HP/LINEOUT is not working properly. This requirement is not mentioned in the datasheet but has been confirmed by Mark Brown from Wolfson. Signed-off-by: Alban Bedel <alban.bedel@avionic-design.de> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* can: gw: use kmem_cache_free() instead of kfree()Wei Yongjun2013-04-251-3/+3
| | | | | | | | | | | | commit 3480a2125923e4b7a56d79efc76743089bf273fc upstream. Memory allocated by kmem_cache_alloc() should be freed using kmem_cache_free(), not kfree(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* PM / reboot: call syscore_shutdown() after disable_nonboot_cpus()Huacai Chen2013-04-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | commit 6f389a8f1dd22a24f3d9afc2812b30d639e94625 upstream. As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core subsystems PM) say, syscore_ops operations should be carried with one CPU on-line and interrupts disabled. However, after commit f96972f2d (kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()), syscore_shutdown() is called before disable_nonboot_cpus(), so break the rules. We have a MIPS machine with a 8259A PIC, and there is an external timer (HPET) linked at 8259A. Since 8259A has been shutdown too early (by syscore_shutdown()), disable_nonboot_cpus() runs without timer interrupt, so it hangs and reboot fails. This patch call syscore_shutdown() a little later (after disable_nonboot_cpus()) to avoid reboot failure, this is the same way as poweroff does. For consistency, add disable_nonboot_cpus() to kernel_halt(). Signed-off-by: Huacai Chen <chenhc@lemote.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ftrace: Consistently restore trace function on sysctl enablingJan Kiszka2013-04-251-6/+2
| | | | | | | | | | | | | | commit 5000c418840b309251c5887f0b56503aae30f84c upstream. If we reenable ftrace via syctl, we currently set ftrace_trace_function based on the previous simplistic algorithm. This is inconsistent with what update_ftrace_function does. So better call that helper instead. Link: http://lkml.kernel.org/r/5151D26F.1070702@siemens.com Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* sched_clock: Prevent 64bit inatomicity on 32bit systemsThomas Gleixner2013-04-251-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 upstream. The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. #1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>