summaryrefslogtreecommitdiffstats
path: root/arch
Commit message (Collapse)AuthorAgeFilesLines
* KVM: Allow cross page reads and writes from cached translations.Andrew Honig2013-04-251-7/+6
| | | | | | | | | | | | | | | | | | | commit 8f964525a121f2ff2df948dac908dcc65be21b5b upstream. This patch adds support for kvm_gfn_to_hva_cache_init functions for reads and writes that will cross a page. If the range falls within the same memslot, then this will be a fast operation. If the range is split between two memslots, then the slower kvm_read_guest and kvm_write_guest are used. Tested: Test against kvm_clock unit tests. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> [bwh: Backported to 3.2: - Drop change in lapic.c - Keep using __gfn_to_memslot() in kvm_gfn_to_hva_cache_init()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions ↵Andy Honig2013-04-252-27/+16
| | | | | | | | | | | | | | | | | | | | | | | (CVE-2013-1797) commit 0b79459b482e85cb7426aa7da683a9f2c97aeae1 upstream. There is a potential use after free issue with the handling of MSR_KVM_SYSTEM_TIME. If the guest specifies a GPA in a movable or removable memory such as frame buffers then KVM might continue to write to that address even after it's removed via KVM_SET_USER_MEMORY_REGION. KVM pins the page in memory so it's unlikely to cause an issue, but if the user space component re-purposes the memory previously used for the guest, then the guest will be able to corrupt that memory. Tested: Tested against kvmclock unit test Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> [bwh: Backported to 3.2: - Adjust context - We do not implement the PVCLOCK_GUEST_STOPPED flag] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME ↵Andy Honig2013-04-251-0/+5
| | | | | | | | | | | | | | | | | | | | (CVE-2013-1796) commit c300aa64ddf57d9c5d9c898a64b36877345dd4a9 upstream. If the guest sets the GPA of the time_page so that the request to update the time straddles a page then KVM will write onto an incorrect page. The write is done byusing kmap atomic to get a pointer to the page for the time structure and then performing a memcpy to that page starting at an offset that the guest controls. Well behaved guests always provide a 32-byte aligned address, however a malicious guest could use this to corrupt host kernel memory. Tested: Tested against kvmclock unit test. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: 7698/1: perf: fix group validation when using enable_on_execWill Deacon2013-04-251-1/+4
| | | | | | | | | | | | | | | | | | | | commit cb2d8b342aa084d1f3ac29966245dec9163677fb upstream. Events may be created with attr->disabled == 1 and attr->enable_on_exec == 1, which confuses the group validation code because events with the PERF_EVENT_STATE_OFF are not considered candidates for scheduling, which may lead to failure at group scheduling time. This patch fixes the validation check for ARM, so that events in the OFF state are still considered when enable_on_exec is true. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Jiri Olsa <jolsa@redhat.com> Reported-by: Sudeep KarkadaNagesha <Sudeep.KarkadaNagesha@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: 7696/1: Fix kexec by setting outer_cache.inv_all for FeroceonIllia Ragozin2013-04-251-0/+1
| | | | | | | | | | | | | | | | | | | | commit cd272d1ea71583170e95dde02c76166c7f9017e6 upstream. On Feroceon the L2 cache becomes non-coherent with the CPU when the L1 caches are disabled. Thus the L2 needs to be invalidated after both L1 caches are disabled. On kexec before the starting the code for relocation the kernel, the L1 caches are disabled in cpu_froc_fin (cpu_v7_proc_fin for Feroceon), but after L2 cache is never invalidated, because inv_all is not set in cache-feroceon-l2.c. So kernel relocation and decompression may has (and usually has) errors. Setting the function enables L2 invalidation and fixes the issue. Signed-off-by: Illia Ragozin <illia.ragozin@grapecom.com> Acked-by: Jason Cooper <jason@lakedaemon.net> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: Do 15e0d9e37c (ARM: pm: let platforms select cpu_suspend support) properlyRussell King2013-04-256-6/+6
| | | | | | | | | | | commit b6c7aabd923a17af993c5a5d5d7995f0b27c000a upstream. Let's do the changes properly and fix the same problem everywhere, not just for one case. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> [bwh: Backported to 3.2: mohawk doesn't support suspend at all] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metalBoris Ostrovsky2013-04-255-13/+21
| | | | | | | | | | | | | | | | | | | | | | | | | commit 511ba86e1d386f671084b5d0e6f110bb30b8eeb2 upstream. Invoking arch_flush_lazy_mmu_mode() results in calls to preempt_enable()/disable() which may have performance impact. Since lazy MMU is not used on bare metal we can patch away arch_flush_lazy_mmu_mode() so that it is never called in such environment. [ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU updates" may cause a minor performance regression on bare metal. This patch resolves that performance regression. It is somewhat unclear to me if this is a good -stable candidate. ] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1364045796-10720-2-git-send-email-konrad.wilk@oracle.com Tested-by: Josh Boyer <jwboyer@redhat.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updatesSamu Kallio2013-04-251-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1160c2779b826c6f5c08e5cc542de58fd1f667d5 upstream. In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops when lazy MMU updates are enabled, because set_pgd effects are being deferred. One instance of this problem is during process mm cleanup with memory cgroups enabled. The chain of events is as follows: - zap_pte_range enables lazy MMU updates - zap_pte_range eventually calls mem_cgroup_charge_statistics, which accesses the vmalloc'd mem_cgroup per-cpu stat area - vmalloc_fault is triggered which tries to sync the corresponding PGD entry with set_pgd, but the update is deferred - vmalloc_fault oopses due to a mismatch in the PUD entries The OOPs usually looks as so: ------------[ cut here ]------------ kernel BUG at arch/x86/mm/fault.c:396! invalid opcode: 0000 [#1] SMP .. snip .. CPU 1 Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1 RIP: e030:[<ffffffff816271bf>] [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208 .. snip .. Call Trace: [<ffffffff81627759>] do_page_fault+0x399/0x4b0 [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110 [<ffffffff81624065>] page_fault+0x25/0x30 [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50 [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350 [<ffffffff8118aac7>] mem_cgroup_uncharge_page+0x57/0x60 [<ffffffff8115fbc0>] page_remove_rmap+0xe0/0x150 [<ffffffff8115311a>] ? vm_normal_page+0x1a/0x80 [<ffffffff81153e61>] unmap_single_vma+0x531/0x870 [<ffffffff81154962>] unmap_vmas+0x52/0xa0 [<ffffffff81007442>] ? pte_mfn_to_pfn+0x72/0x100 [<ffffffff8115c8f8>] exit_mmap+0x98/0x170 [<ffffffff810050d9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [<ffffffff81059ce3>] mmput+0x83/0xf0 [<ffffffff810624c4>] exit_mm+0x104/0x130 [<ffffffff8106264a>] do_exit+0x15a/0x8c0 [<ffffffff810630ff>] do_group_exit+0x3f/0xa0 [<ffffffff81063177>] sys_exit_group+0x17/0x20 [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the changes visible to the consistency checks. RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=914737 Tested-by: Josh Boyer <jwboyer@redhat.com> Reported-and-Tested-by: Krishna Raman <kraman@redhat.com> Signed-off-by: Samu Kallio <samu.kallio@aberdeencloud.com> Link: http://lkml.kernel.org/r/1364045796-10720-1-git-send-email-konrad.wilk@oracle.com Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* powerpc: pSeries_lpar_hpte_remove fails from Adjunct partition being ↵Michael Wolf2013-04-251-1/+7
| | | | | | | | | | | | | | | | | | performed before the ANDCOND test commit 9fb2640159f9d4f5a2a9d60e490482d4cbecafdb upstream. Some versions of pHyp will perform the adjunct partition test before the ANDCOND test. The result of this is that H_RESOURCE can be returned and cause the BUG_ON condition to occur. The HPTE is not removed. So add a check for H_RESOURCE, it is ok if this HPTE is not removed as pSeries_lpar_hpte_remove is looking for an HPTE to remove and not a specific HPTE to remove. So it is ok to just move on to the next slot and try again. Signed-off-by: Michael Wolf <mjw@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* alpha: Add irongate_io to PCI bus resourcesJay Estabrook2013-04-251-0/+5
| | | | | | | | | | | | | commit aa8b4be3ac049c8b1df2a87e4d1d902ccfc1f7a9 upstream. Fixes a NULL pointer dereference at boot on UP1500. Reviewed-and-Tested-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Jay Estabrook <jay.estabrook@gmail.com> Signed-off-by: Matt Turner <mattst88@gmail.com> Signed-off-by: Michael Cree <mcree@orcon.net.nz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* tile: expect new initramfs name from hypervisor file systemChris Metcalf2013-04-101-6/+12
| | | | | | | | | | | | | | | | | | commit ff7f3efb9abf986f4ecd8793a9593f7ca4d6431a upstream. The current Tilera boot infrastructure now provides the initramfs to Linux as a Tilera-hypervisor file named "initramfs", rather than "initramfs.cpio.gz", as before. (This makes it reasonable to use other compression techniques than gzip on the file without having to worry about the name causing confusion.) Adapt to use the new name, but also fall back to checking for the old name. Cc'ing to stable so that older kernels will remain compatible with newer Tilera boot infrastructure. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* signal: Define __ARCH_HAS_SA_RESTORER so we know whether to clear sa_restorerBen Hutchings2013-03-2712-0/+13
| | | | | | | | | | | | flush_signal_handlers() needs to know whether sigaction::sa_restorer is defined, not whether SA_RESTORER is defined. Define the __ARCH_HAS_SA_RESTORER macro to indicate this. Vaguely based on upstream commit 574c4866e33d 'consolidate kernel-side struct sigaction declarations'. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Al Viro <viro@zeniv.linux.org.uk>
* x86-64: Fix the failure case in copy_user_handle_tail()CQ Tang2013-03-271-2/+2
| | | | | | | | | | | | | | | | commit 66db3feb486c01349f767b98ebb10b0c3d2d021b upstream. The increment of "to" in copy_user_handle_tail() will have incremented before a failure has been noted. This causes us to skip a byte in the failure case. Only do the increment when assured there is no failure. Signed-off-by: CQ Tang <cq.tang@intel.com> Link: http://lkml.kernel.org/r/20130318150221.8439.993.stgit@phlsvslse11.ph.intel.com Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* powerpc: Fix cputable entry for 970MP rev 1.0Benjamin Herrenschmidt2013-03-271-1/+1
| | | | | | | | | | | | | commit d63ac5f6cf31c8a83170a9509b350c1489a7262b upstream. Commit 44ae3ab3358e962039c36ad4ae461ae9fb29596c forgot to update the entry for the 970MP rev 1.0 processor when moving some CPU features bits to the MMU feature bit mask. This breaks booting on some rare G5 models using that chip revision. Reported-by: Phileas Fogg <phileas-fogg@mail.ru> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* s390/mm: fix flush_tlb_kernel_range()Heiko Carstens2013-03-271-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | commit f6a70a07079518280022286a1dceb797d12e1edf upstream. Our flush_tlb_kernel_range() implementation calls __tlb_flush_mm() with &init_mm as argument. __tlb_flush_mm() however will only flush tlbs for the passed in mm if its mm_cpumask is not empty. For the init_mm however its mm_cpumask has never any bits set. Which in turn means that our flush_tlb_kernel_range() implementation doesn't work at all. This can be easily verified with a vmalloc/vfree loop which allocates a page, writes to it and then frees the page again. A crash will follow almost instantly. To fix this remove the cpumask_empty() check in __tlb_flush_mm() since there shouldn't be too many mms with a zero mm_cpumask, besides the init_mm of course. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* perf,x86: fix wrmsr_on_cpu() warning on suspend/resumeLinus Torvalds2013-03-271-1/+3
| | | | | | | | | | | | | | | | | | | | | | | commit 2a6e06b2aed6995af401dcd4feb5e79a0c7ea554 upstream. Commit 1d9d8639c063 ("perf,x86: fix kernel crash with PEBS/BTS after suspend/resume") fixed a crash when doing PEBS performance profiling after resuming, but in using init_debug_store_on_cpu() to restore the DS_AREA mtrr it also resulted in a new WARN_ON() triggering. init_debug_store_on_cpu() uses "wrmsr_on_cpu()", which in turn uses CPU cross-calls to do the MSR update. Which is not really valid at the early resume stage, and the warning is quite reasonable. Now, it all happens to _work_, for the simple reason that smp_call_function_single() ends up just doing the call directly on the CPU when the CPU number matches, but we really should just do the wrmsr() directly instead. This duplicates the wrmsr() logic, but hopefully we can just remove the wrmsr_on_cpu() version eventually. Reported-and-tested-by: Parag Warudkar <parag.lkml@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* perf,x86: fix kernel crash with PEBS/BTS after suspend/resumeStephane Eranian2013-03-272-0/+10
| | | | | | | | | | | | | | | | | | | commit 1d9d8639c063caf6efc2447f5f26aa637f844ff6 upstream. This patch fixes a kernel crash when using precise sampling (PEBS) after a suspend/resume. Turns out the CPU notifier code is not invoked on CPU0 (BP). Therefore, the DS_AREA (used by PEBS) is not restored properly by the kernel and keeps it power-on/resume value of 0 causing any PEBS measurement to crash when running on CPU0. The workaround is to add a hook in the actual resume code to restore the DS Area MSR value. It is invoked for all CPUS. So for all but CPU0, the DS_AREA will be restored twice but this is harmless. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: 7663/1: perf: fix ARMv7 EVTYPE_MASK to include NSH bitWill Deacon2013-03-201-1/+1
| | | | | | | | | | | | | | commit f2fe09b055e2549de41fb107b34c60bac4a1b0cf upstream. Masked out PMXEVTYPER.NSH means that we can't enable profiling at PL2, regardless of the settings in the HDCR. This patch fixes the broken mask. Reported-by: Christoffer Dall <cdall@cs.columbia.edu> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* xen/pci: We don't do multiple MSI's.Konrad Rzeszutek Wilk2013-03-201-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 884ac2978a295b7df3c4a686d3bff6932bbbb460 upstream. There is no hypercall to setup multiple MSI per PCI device. As such with these two new commits: - 08261d87f7d1b6253ab3223756625a5c74532293 PCI/MSI: Enable multiple MSIs with pci_enable_msi_block_auto() - 5ca72c4f7c412c2002363218901eba5516c476b1 AHCI: Support multiple MSIs we would call the PHYSDEVOP_map_pirq 'nvec' times with the same contents of the PCI device. Sander discovered that we would get the same PIRQ value 'nvec' times and return said values to the caller. That of course meant that the device was configured only with one MSI and AHCI would fail with: ahci 0000:00:11.0: version 3.0 xen: registering gsi 19 triggering 0 polarity 1 xen: --> pirq=19 -> irq=19 (gsi=19) (XEN) [2013-02-27 19:43:07] IOAPIC[0]: Set PCI routing entry (6-19 -> 0x99 -> IRQ 19 Mode:1 Active:1) ahci 0000:00:11.0: AHCI 0001.0200 32 slots 4 ports 6 Gbps 0xf impl SATA mode ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ahci: probe of 0000:00:11.0 failed with error -22 That is b/c in ahci_host_activate the second call to devm_request_threaded_irq would return -EINVAL as we passed in (on the second run) an IRQ that was never initialized. Reported-and-Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: fix scheduling while atomic warning in alignment handling codeRussell King2013-03-201-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b255188f90e2bade1bd11a986dd1ca4861869f4d upstream. Paolo Pisati reports that IPv6 triggers this warning: BUG: scheduling while atomic: swapper/0/0/0x40000100 Modules linked in: [<c001b1c4>] (unwind_backtrace+0x0/0xf0) from [<c0503c5c>] (__schedule_bug+0x48/0x5c) [<c0503c5c>] (__schedule_bug+0x48/0x5c) from [<c0508608>] (__schedule+0x700/0x740) [<c0508608>] (__schedule+0x700/0x740) from [<c007007c>] (__cond_resched+0x24/0x34) [<c007007c>] (__cond_resched+0x24/0x34) from [<c05086dc>] (_cond_resched+0x3c/0x44) [<c05086dc>] (_cond_resched+0x3c/0x44) from [<c0021f6c>] (do_alignment+0x178/0x78c) [<c0021f6c>] (do_alignment+0x178/0x78c) from [<c00083e0>] (do_DataAbort+0x34/0x98) [<c00083e0>] (do_DataAbort+0x34/0x98) from [<c0509a60>] (__dabt_svc+0x40/0x60) Exception stack(0xc0763d70 to 0xc0763db8) 3d60: e97e805e e97e806e 2c000000 11000000 3d80: ea86bb00 0000002c 00000011 e97e807e c076d2a8 e97e805e e97e806e 0000002c 3da0: 3d000000 c0763dbc c04b98fc c02a8490 00000113 ffffffff [<c0509a60>] (__dabt_svc+0x40/0x60) from [<c02a8490>] (__csum_ipv6_magic+0x8/0xc8) Fix this by using probe_kernel_address() stead of __get_user(). Reported-by: Paolo Pisati <p.pisati@gmail.com> Tested-by: Paolo Pisati <p.pisati@gmail.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: VFP: fix emulation of second VFP instructionRussell King2013-03-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5e4ba617c1b584b2e376f31a63bd4e734109318a upstream. Martin Storsjö reports that the sequence: ee312ac1 vsub.f32 s4, s3, s2 ee702ac0 vsub.f32 s5, s1, s0 e59f0028 ldr r0, [pc, #40] ee111a90 vmov r1, s3 on Raspberry Pi (implementor 41 architecture 1 part 20 variant b rev 5) where s3 is a denormal and s2 is zero results in incorrect behaviour - the instruction "vsub.f32 s5, s1, s0" is not executed: VFP: bounce: trigger ee111a90 fpexc d0000780 VFP: emulate: INST=0xee312ac1 SCR=0x00000000 ... As we can see, the instruction triggering the exception is the "vmov" instruction, and we emulate the "vsub.f32 s4, s3, s2" but fail to properly take account of the FPEXC_FP2V flag in FPEXC. This is because the test for the second instruction register being valid is bogus, and will always skip emulation of the second instruction. Reported-by: Martin Storsjö <martin@martin.st> Tested-by: Martin Storsjö <martin@martin.st> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Revert "powerpc/eeh: Fix crash when adding a device in a slot with DDW"Ben Hutchings2013-03-204-34/+3
| | | | | | | | | | This reverts commit 066f289835f09a3f744d6bac96f25e25d20b3ded which was 6a040ce72598159a74969a2d01ab0ba5ee6536b3 upstream. This was not needed and is not suitable for 3.2.y. Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* s390/timer: avoid overflow when programming clock comparatorHeiko Carstens2013-03-061-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | commit d911e03d097bdc01363df5d81c43f69432eb785c upstream. Since ed4f209 "s390/time: fix sched_clock() overflow" a new helper function is used to avoid overflows when converting TOD format values to nanosecond values. The kvm interrupt code formerly however only worked by accident because of an overflow. It tried to program a timer that would expire in more than ~29 years. Because of the old TOD-to-nanoseconds overflow bug the real expiry value however was much smaller, but now it isn't anymore. This however triggers yet another bug in the function that programs the clock comparator s390_next_ktime(): if the absolute "expires" value is after 2042 this will result in an overflow and the programmed value is lower than the current TOD value which immediatly triggers a clock comparator (= timer) interrupt. Since the timer isn't expired it will be programmed immediately again and so on... the result is a dead system. To fix this simply program the maximum possible value if an overflow is detected. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86: Make sure we can boot in the case the BDA contains pure garbageH. Peter Anvin2013-03-061-19/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7c10093692ed2e6f318387d96b829320aa0ca64c upstream. On non-BIOS platforms it is possible that the BIOS data area contains garbage instead of being zeroed or something equivalent (firmware people: we are talking of 1.5K here, so please do the sane thing.) We need on the order of 20-30K of low memory in order to boot, which may grow up to < 64K in the future. We probably want to avoid the lowest of the low memory. At the same time, it seems extremely unlikely that a legitimate EBDA would ever reach down to the 128K (which would require it to be over half a megabyte in size.) Thus, pick 128K as the cutoff for "this is insane, ignore." We may still end up reserving a bunch of extra memory on the low megabyte, but that is not really a major issue these days. In the worst case we lose 512K of RAM. This code really should be merged with trim_bios_range() in arch/x86/kernel/setup.c, but that is a bigger patch for a later merge window. Reported-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/n/tip-oebml055yyfm8yxmria09rja@git.kernel.org Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* powerpc/kexec: Disable hard IRQ before kexecPhileas Fogg2013-03-061-0/+5
| | | | | | | | | | | | commit 8520e443aa56cc157b015205ea53e7b9fc831291 upstream. Disable hard IRQ before kexec a new kernel image. Not doing it can result in corrupted data in the memory segments reserved for the new kernel. Signed-off-by: Phileas Fogg <phileas-fogg@mail.ru> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86, efi: Make "noefi" really disable EFI runtime serivcesMatt Fleming2013-03-061-28/+31
| | | | | | | | | | | | | | | | | | | | | | | | | commit fb834c7acc5e140cf4f9e86da93a66de8c0514da upstream. commit 1de63d60cd5b ("efi: Clear EFI_RUNTIME_SERVICES rather than EFI_BOOT by "noefi" boot parameter") attempted to make "noefi" true to its documentation and disable EFI runtime services to prevent the bricking bug described in commit e0094244e41c ("samsung-laptop: Disable on EFI hardware"). However, it's not possible to clear EFI_RUNTIME_SERVICES from an early param function because EFI_RUNTIME_SERVICES is set in efi_init() *after* parse_early_param(). This resulted in "noefi" effectively becoming a no-op and no longer providing users with a way to disable EFI, which is bad for those users that have buggy machines. Reported-by: Walt Nelson Jr <walt0924@gmail.com> Cc: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1361392572-25657-1-git-send-email-matt@console-pimps.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> [bwh: Backported to 3.2: efi_runtime_init() is not a separate function, so put a whole set of statements in an if (!disable_runtime) block] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* xen: Send spinlock IPI to all waitersStefan Bader2013-03-061-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 76eaca031f0af2bb303e405986f637811956a422 upstream. There is a loophole between Xen's current implementation of pv-spinlocks and the scheduler. This was triggerable through a testcase until v3.6 changed the TLB flushing code. The problem potentially is still there just not observable in the same way. What could happen was (is): 1. CPU n tries to schedule task x away and goes into a slow wait for the runq lock of CPU n-# (must be one with a lower number). 2. CPU n-#, while processing softirqs, tries to balance domains and goes into a slow wait for its own runq lock (for updating some records). Since this is a spin_lock_irqsave in softirq context, interrupts will be re-enabled for the duration of the poll_irq hypercall used by Xen. 3. Before the runq lock of CPU n-# is unlocked, CPU n-1 receives an interrupt (e.g. endio) and when processing the interrupt, tries to wake up task x. But that is in schedule and still on_cpu, so try_to_wake_up goes into a tight loop. 4. The runq lock of CPU n-# gets unlocked, but the message only gets sent to the first waiter, which is CPU n-# and that is busily stuck. 5. CPU n-# never returns from the nested interruption to take and release the lock because the scheduler uses a busy wait. And CPU n never finishes the task migration because the unlock notification only went to CPU n-#. To avoid this and since the unlocking code has no real sense of which waiter is best suited to grab the lock, just send the IPI to all of them. This causes the waiters to return from the hyper- call (those not interrupted at least) and do active spinlocking. BugLink: http://bugs.launchpad.net/bugs/1011792 Acked-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: w90x900: fix legacy assembly syntaxArnd Bergmann2013-03-061-2/+2
| | | | | | | | | | | | | | | | | | | | | commit fa5ce5f94c0f2bfa41ba68d2d2524298e1fc405e upstream. New ARM binutils don't allow extraneous whitespace inside of brackets, which causes this error on all mach-w90x900 defconfigs: arch/arm/kernel/entry-armv.S: Assembler messages: arch/arm/kernel/entry-armv.S:214: Error: ARM register expected -- `ldr r0,[ r6,#(0x10C)]' arch/arm/kernel/entry-armv.S:214: Error: ARM register expected -- `ldr r0,[ r6,#(0x110)]' arch/arm/kernel/entry-armv.S:430: Error: ARM register expected -- `ldr r0,[ r6,#(0x10C)]' arch/arm/kernel/entry-armv.S:430: Error: ARM register expected -- `ldr r0,[ r6,#(0x110)]' This removes the whitespace in order to build the kernel again. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Wan ZongShun <mcuos.com@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ARM: samsung: fix assembly syntax for new gasArnd Bergmann2013-03-066-30/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2815774bb38445006074e16251b9ef5123bdc616 upstream. Recent assembler versions complain about extraneous whitespace inside [] brackets. This fixes all of these instances for the samsung platforms. We should backport this to all kernels that might need to be built with new binutils. arch/arm/kernel/entry-armv.S: Assembler messages: arch/arm/kernel/entry-armv.S:214: Error: ARM register expected -- `ldr r2,[ r6,#(0x10)]' arch/arm/kernel/entry-armv.S:214: Error: ARM register expected -- `ldr r0,[ r6,#(0x14)]' arch/arm/kernel/entry-armv.S:430: Error: ARM register expected -- `ldr r2,[ r6,#(0x10)]' arch/arm/kernel/entry-armv.S:430: Error: ARM register expected -- `ldr r0,[ r6,#(0x14)]' arch/arm/mach-s3c24xx/sleep-s3c2410.S: Assembler messages: arch/arm/mach-s3c24xx/sleep-s3c2410.S:48: Error: ARM register expected -- `ldr r7,[ r4 ]' arch/arm/mach-s3c24xx/sleep-s3c2410.S:49: Error: ARM register expected -- `ldr r8,[ r5 ]' arch/arm/mach-s3c24xx/sleep-s3c2410.S:50: Error: ARM register expected -- `ldr r9,[ r6 ]' arch/arm/mach-s3c24xx/sleep-s3c2410.S:64: Error: ARM register expected -- `streq r7,[ r4 ]' arch/arm/mach-s3c24xx/sleep-s3c2410.S:65: Error: ARM register expected -- `streq r8,[ r5 ]' arch/arm/mach-s3c24xx/sleep-s3c2410.S:66: Error: ARM register expected -- `streq r9,[ r6 ]' arch/arm/kernel/debug.S: Assembler messages: arch/arm/kernel/debug.S:83: Error: ARM register expected -- `ldr r2,[ r2,#((0x0B0)+(((0x56000000)-(0x50000000))+(0xF6000000+(0x01000000))))-((0)+(((0x56000000)-(0x50000000))+(0xF6000000+(0x01000000))))]' arch/arm/kernel/debug.S:83: Error: ARM register expected -- `ldr r2,[ r3,#(0x18)]' arch/arm/kernel/debug.S:85: Error: ARM register expected -- `ldr r2,[ r2,#((0x0B0)+(((0x56000000)-(0x50000000))+(0xF6000000+(0x01000000))))-((0)+(((0x56000000)-(0x50000000))+(0xF6000000+(0x01000000))))]' arch/arm/kernel/debug.S:85: Error: ARM register expected -- `ldr r2,[ r3,#(0x18)]' arch/arm/mach-s3c24xx/pm-h1940.S: Assembler messages: arch/arm/mach-s3c24xx/pm-h1940.S:33: Error: ARM register expected -- `ldr pc,[ r0,#((0x0B8)+(((0x56000000)-(0x50000000))+(0xF6000000+(0x01000000))))-(((0x56000000)-(0x50000000))+(0xF6000000+(0x01000000)))]' arch/arm/mach-s3c24xx/sleep-s3c2412.S: Assembler messages: arch/arm/mach-s3c24xx/sleep-s3c2412.S:60: Error: ARM register expected -- `ldrne r9,[ r1 ]' arch/arm/mach-s3c24xx/sleep-s3c2412.S:61: Error: ARM register expected -- `strne r9,[ r1 ]' arch/arm/mach-s3c24xx/sleep-s3c2412.S:62: Error: ARM register expected -- `ldrne r9,[ r2 ]' arch/arm/mach-s3c24xx/sleep-s3c2412.S:63: Error: ARM register expected -- `strne r9,[ r2 ]' arch/arm/mach-s3c24xx/sleep-s3c2412.S:64: Error: ARM register expected -- `ldrne r9,[ r3 ]' arch/arm/mach-s3c24xx/sleep-s3c2412.S:65: Error: ARM register expected -- `strne r9,[ r3 ]' arch/arm/kernel/debug.S:83: Error: ARM register expected -- `ldr r2,[ r3,#(0x08)]' arch/arm/kernel/debug.S:83: Error: ARM register expected -- `ldr r2,[ r3,#(0x18)]' arch/arm/kernel/debug.S:83: Error: ARM register expected -- `ldr r2,[ r3,#(0x10)]' arch/arm/kernel/debug.S:85: Error: ARM register expected -- `ldr r2,[ r3,#(0x08)]' arch/arm/kernel/debug.S:85: Error: ARM register expected -- `ldr r2,[ r3,#(0x18)]' arch/arm/kernel/debug.S:85: Error: ARM register expected -- `ldr r2,[ r3,#(0x10)]' Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kukjin Kim <kgene.kim@samsung.com> Cc: Ben Dooks <ben-linux@fluff.org> [bwh: Backported to 3.2: adjust filenames] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* efi: Clear EFI_RUNTIME_SERVICES rather than EFI_BOOT by "noefi" boot parameterSatoru Takeuchi2013-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1de63d60cd5b0d33a812efa455d5933bf1564a51 upstream. There was a serious problem in samsung-laptop that its platform driver is designed to run under BIOS and running under EFI can cause the machine to become bricked or can cause Machine Check Exceptions. Discussion about this problem: https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557 https://bugzilla.kernel.org/show_bug.cgi?id=47121 The patches to fix this problem: efi: Make 'efi_enabled' a function to query EFI facilities 83e68189745ad931c2afd45d8ee3303929233e7f samsung-laptop: Disable on EFI hardware e0094244e41c4d0c7ad69920681972fc45d8ce34 Unfortunately this problem comes back again if users specify "noefi" option. This parameter clears EFI_BOOT and that driver continues to run even if running under EFI. Refer to the document, this parameter should clear EFI_RUNTIME_SERVICES instead. Documentation/kernel-parameters.txt: =============================================================================== ... noefi [X86] Disable EFI runtime services support. ... =============================================================================== Documentation/x86/x86_64/uefi.txt: =============================================================================== ... - If some or all EFI runtime services don't work, you can try following kernel command line parameters to turn off some or all EFI runtime services. noefi turn off all EFI runtime services ... =============================================================================== Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Link: http://lkml.kernel.org/r/511C2C04.2070108@jp.fujitsu.com Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86/mm: Check if PUD is large when validating a kernel addressMel Gorman2013-03-062-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0ee364eb316348ddf3e0dfcd986f5f13f528f821 upstream. A user reported the following oops when a backup process reads /proc/kcore: BUG: unable to handle kernel paging request at ffffbb00ff33b000 IP: [<ffffffff8103157e>] kern_addr_valid+0xbe/0x110 [...] Call Trace: [<ffffffff811b8aaa>] read_kcore+0x17a/0x370 [<ffffffff811ad847>] proc_reg_read+0x77/0xc0 [<ffffffff81151687>] vfs_read+0xc7/0x130 [<ffffffff811517f3>] sys_read+0x53/0xa0 [<ffffffff81449692>] system_call_fastpath+0x16/0x1b Investigation determined that the bug triggered when reading system RAM at the 4G mark. On this system, that was the first address using 1G pages for the virt->phys direct mapping so the PUD is pointing to a physical address, not a PMD page. The problem is that the page table walker in kern_addr_valid() is not checking pud_large() and treats the physical address as if it was a PMD. If it happens to look like pmd_none then it'll silently fail, probably returning zeros instead of real data. If the data happens to look like a present PMD though, it will be walked resulting in the oops above. This patch adds the necessary pud_large() check. Unfortunately the problem was not readily reproducible and now they are running the backup program without accessing /proc/kcore so the patch has not been validated but I think it makes sense. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.coM> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20130211145236.GX21389@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86: Hyper-V: register clocksource only if its advertisedOlaf Hering2013-03-061-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 32068f6527b8f1822a30671dedaf59c567325026 upstream. Enable hyperv_clocksource only if its advertised as a feature. XenServer 6 returns the signature which is checked in ms_hyperv_platform(), but it does not offer all features. Currently the clocksource is enabled unconditionally in ms_hyperv_init_platform(), and the result is a hanging guest. Hyper-V spec Bit 1 indicates the availability of Partition Reference Counter. Register the clocksource only if this bit is set. The guest in question prints this in dmesg: [ 0.000000] Hypervisor detected: Microsoft HyperV [ 0.000000] HyperV: features 0x70, hints 0x0 This bug can be reproduced easily be setting 'viridian=1' in a HVM domU .cfg file. A workaround without this patch is to boot the HVM guest with 'clocksource=jiffies'. Signed-off-by: Olaf Hering <olaf@aepfle.de> Link: http://lkml.kernel.org/r/1359940959-32168-1-git-send-email-kys@microsoft.com Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Cc: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86/apic: Work around boot failure on HP ProLiant DL980 G7 Server systemsStoney Wang2013-03-061-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cb214ede7657db458fd0b2a25ea0b28dbf900ebc upstream. When a HP ProLiant DL980 G7 Server boots a regular kernel, there will be intermittent lost interrupts which could result in a hang or (in extreme cases) data loss. The reason is that this system only supports x2apic physical mode, while the kernel boots with a logical-cluster default setting. This bug can be worked around by specifying the "x2apic_phys" or "nox2apic" boot option, but we want to handle this system without requiring manual workarounds. The BIOS sets ACPI_FADT_APIC_PHYSICAL in FADT table. As all apicids are smaller than 255, BIOS need to pass the control to the OS with xapic mode, according to x2apic-spec, chapter 2.9. Current code handle x2apic when BIOS pass with xapic mode enabled: When user specifies x2apic_phys, or FADT indicates PHYSICAL: 1. During madt oem check, apic driver is set with xapic logical or xapic phys driver at first. 2. enable_IR_x2apic() will enable x2apic_mode. 3. if user specifies x2apic_phys on the boot line, x2apic_phys_probe() will install the correct x2apic phys driver and use x2apic phys mode. Otherwise it will skip the driver will let x2apic_cluster_probe to take over to install x2apic cluster driver (wrong one) even though FADT indicates PHYSICAL, because x2apic_phys_probe does not check FADT PHYSICAL. Add checking x2apic_fadt_phys in x2apic_phys_probe() to fix the problem. Signed-off-by: Stoney Wang <song-bo.wang@hp.com> [ updated the changelog and simplified the code ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1360263182-16226-1-git-send-email-yinghai@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86/apic: Use x2apic physical mode based on FADT settingGreg Pearson2013-03-061-0/+6
| | | | | | | | | | | | | | | | | | commit ea0dcf903e7d76aa5d483d876215fedcfdfe140f upstream. Provide systems that do not support x2apic cluster mode a mechanism to select x2apic physical mode using the FADT FORCE_APIC_PHYSICAL_DESTINATION_MODE bit. Changes from v1: (based on Suresh's comments) - removed #ifdef CONFIG_ACPI - removed #include <linux/acpi.h> Signed-off-by: Greg Pearson <greg.pearson@hp.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1335313436-32020-1-git-send-email-greg.pearson@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86: Do not leak kernel page mapping locationsKees Cook2013-03-061-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e575a86fdc50d013bf3ad3aa81d9100e8e6cc60d upstream. Without this patch, it is trivial to determine kernel page mappings by examining the error code reported to dmesg[1]. Instead, declare the entire kernel memory space as a violation of a present page. Additionally, since show_unhandled_signals is enabled by default, switch branch hinting to the more realistic expectation, and unobfuscate the setting of the PF_PROT bit to improve readability. [1] http://vulnfactory.org/blog/2013/02/06/a-linux-memory-trick/ Reported-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20130207174413.GA12485@www.outflux.net Signed-off-by: Ingo Molnar <mingo@kernel.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86-32, mm: Rip out x86_32 NUMA remapping codeDave Hansen2013-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f03574f2d5b2d6229dcdf2d322848065f72953c7 upstream. This code was an optimization for 32-bit NUMA systems. It has probably been the cause of a number of subtle bugs over the years, although the conditions to excite them would have been hard to trigger. Essentially, we remap part of the kernel linear mapping area, and then sometimes part of that area gets freed back in to the bootmem allocator. If those pages get used by kernel data structures (say mem_map[] or a dentry), there's no big deal. But, if anyone ever tried to use the linear mapping for these pages _and_ cared about their physical address, bad things happen. For instance, say you passed __GFP_ZERO to the page allocator and then happened to get handed one of these pages, it zero the remapped page, but it would make a pte to the _old_ page. There are probably a hundred other ways that it could screw with things. We don't need to hang on to performance optimizations for these old boxes any more. All my 32-bit NUMA systems are long dead and buried, and I probably had access to more than most people. This code is causing real things to break today: https://lkml.org/lkml/2013/1/9/376 I looked in to actually fixing this, but it requires surgery to way too much brittle code, as well as stuff like per_cpu_ptr_to_phys(). [ hpa: Cc: this for -stable, since it is a memory corruption issue. However, an alternative is to simply mark NUMA as depends BROKEN rather than EXPERIMENTAL in the X86_32 subclause... ] Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> [bwh: For 3.2, using the suggested alternative] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* s390/kvm: Fix store status for ACRS/FPRSChristian Borntraeger2013-03-061-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | commit 15bc8d8457875f495c59d933b05770ba88d1eacb upstream. On store status we need to copy the current state of registers into a save area. Currently we might save stale versions: The sie state descriptor doesnt have fields for guest ACRS,FPRS, those registers are simply stored in the host registers. The host program must copy these away if needed. We do that in vcpu_put/load. If we now do a store status in KVM code between vcpu_put/load, the saved values are not up-to-date. Lets collect the ACRS/FPRS before saving them. This also fixes some strange problems with hotplug and virtio-ccw, since the low level machine check handler (on hotplug a machine check will happen) will revalidate all registers with the content of the save area. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> [bwh: Backported to 3.2 as done in 3.0 by Jiri Slaby] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Jiri Slaby <jslaby@suse.cz>
* ARM: PXA3xx: program the CSMSADRCFG registerIgor Grinberg2013-03-062-1/+15
| | | | | | | | | | | | | | | | | | commit d107a204154ddd79339203c2deeb7433f0cf6777 upstream. The Chip Select Configuration Register must be programmed to 0x2 in order to achieve the correct behavior of the Static Memory Controller. Without this patch devices wired to DFI and accessed through SMC cannot be accessed after resume from S2. Do not rely on the boot loader to program the CSMSADRCFG register by programming it in the kernel smemc module. Signed-off-by: Igor Grinberg <grinberg@compulab.co.il> Acked-by: Eric Miao <eric.y.miao@gmail.com> Signed-off-by: Haojian Zhuang <haojian.zhuang@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Purge existing TLB entries in set_pte_at and ptep_set_wrprotectJohn David Anglin2013-03-062-3/+28
| | | | | | | | | | | | | | | | commit 7139bc1579901b53db7e898789e916ee2fb52d78 upstream. This patch goes a long way toward fixing the minifail bug, and it  significantly improves the stability of SMP machines such as the rp3440.  When write  protecting a page for COW, we need to purge the existing translation.  Otherwise, the COW break doesn't occur as expected because the TLB may still have a stale entry which allows writes. [jejb: fix up checkpatch errors] Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* powerpc/eeh: Fix crash when adding a device in a slot with DDWThadeu Lima de Souza Cascardo2013-03-064-3/+34
| | | | | | | | | | | | | | | | | | | | | | | | commit 6a040ce72598159a74969a2d01ab0ba5ee6536b3 upstream. The DDW code uses a eeh_dev struct from the pci_dev. However, this is not set until eeh_add_device_late is called. Since pci_bus_add_devices is called before eeh_add_device_late, the PCI devices are added to the bus, making drivers' probe hooks to be called. These will call set_dma_mask, which will call the DDW code, which will require the eeh_dev struct from pci_dev. This would result in a crash, due to a NULL dereference. Calling eeh_add_device_late after pci_bus_add_devices would make the system BUG, because device files shouldn't be added to devices there were not added to the system. So, a new function is needed to add such files only after pci_bus_add_devices have been called. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86/xen: don't assume %ds is usable in xen_iret for 32-bit PVOPS.Jan Beulich2013-02-201-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 13d2b4d11d69a92574a55bfd985cfb0ca77aebdc upstream. This fixes CVE-2013-0228 / XSA-42 Drew Jones while working on CVE-2013-0190 found that that unprivileged guest user in 32bit PV guest can use to crash the > guest with the panic like this: ------------- general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/vbd-51712/block/xvda/dev Modules linked in: sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 1250, comm: r Not tainted 2.6.32-356.el6.i686 #1 EIP: 0061:[<c0407462>] EFLAGS: 00010086 CPU: 0 EIP is at xen_iret+0x12/0x2b EAX: eb8d0000 EBX: 00000001 ECX: 08049860 EDX: 00000010 ESI: 00000000 EDI: 003d0f00 EBP: b77f8388 ESP: eb8d1fe0 DS: 0000 ES: 007b FS: 0000 GS: 00e0 SS: 0069 Process r (pid: 1250, ti=eb8d0000 task=c2953550 task.ti=eb8d0000) Stack: 00000000 0027f416 00000073 00000206 b77f8364 0000007b 00000000 00000000 Call Trace: Code: c3 8b 44 24 18 81 4c 24 38 00 02 00 00 8d 64 24 30 e9 03 00 00 00 8d 76 00 f7 44 24 08 00 00 02 80 75 33 50 b8 00 e0 ff ff 21 e0 <8b> 40 10 8b 04 85 a0 f6 ab c0 8b 80 0c b0 b3 c0 f6 44 24 0d 02 EIP: [<c0407462>] xen_iret+0x12/0x2b SS:ESP 0069:eb8d1fe0 general protection fault: 0000 [#2] ---[ end trace ab0d29a492dcd330 ]--- Kernel panic - not syncing: Fatal exception Pid: 1250, comm: r Tainted: G D --------------- 2.6.32-356.el6.i686 #1 Call Trace: [<c08476df>] ? panic+0x6e/0x122 [<c084b63c>] ? oops_end+0xbc/0xd0 [<c084b260>] ? do_general_protection+0x0/0x210 [<c084a9b7>] ? error_code+0x73/ ------------- Petr says: " I've analysed the bug and I think that xen_iret() cannot cope with mangled DS, in this case zeroed out (null selector/descriptor) by either xen_failsafe_callback() or RESTORE_REGS because the corresponding LDT entry was invalidated by the reproducer. " Jan took a look at the preliminary patch and came up a fix that solves this problem: "This code gets called after all registers other than those handled by IRET got already restored, hence a null selector in %ds or a non-null one that got loaded from a code or read-only data descriptor would cause a kernel mode fault (with the potential of crashing the kernel as a whole, if panic_on_oops is set)." The way to fix this is to realize that the we can only relay on the registers that IRET restores. The two that are guaranteed are the %cs and %ss as they are always fixed GDT selectors. Also they are inaccessible from user mode - so they cannot be altered. This is the approach taken in this patch. Another alternative option suggested by Jan would be to relay on the subtle realization that using the %ebp or %esp relative references uses the %ss segment. In which case we could switch from using %eax to %ebp and would not need the %ss over-rides. That would also require one extra instruction to compensate for the one place where the register is used as scaled index. However Andrew pointed out that is too subtle and if further work was to be done in this code-path it could escape folks attention and lead to accidents. Reviewed-by: Petr Matousek <pmatouse@redhat.com> Reported-by: Petr Matousek <pmatouse@redhat.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILLOleg Nesterov2013-02-201-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9899d11f654474d2d54ea52ceaa2a1f4db3abd68 upstream. putreg() assumes that the tracee is not running and pt_regs_access() can safely play with its stack. However a killed tracee can return from ptrace_stop() to the low-level asm code and do RESTORE_REST, this means that debugger can actually read/modify the kernel stack until the tracee does SAVE_REST again. set_task_blockstep() can race with SIGKILL too and in some sense this race is even worse, the very fact the tracee can be woken up breaks the logic. As Linus suggested we can clear TASK_WAKEKILL around the arch_ptrace() call, this ensures that nobody can ever wakeup the tracee while the debugger looks at it. Not only this fixes the mentioned problems, we can do some cleanups/simplifications in arch_ptrace() paths. Probably ptrace_unfreeze_traced() needs more callers, for example it makes sense to make the tracee killable for oom-killer before access_process_vm(). While at it, add the comment into may_ptrace_stop() to explain why ptrace_stop() still can't rely on SIGKILL and signal_pending_state(). Reported-by: Salman Qazi <sqazi@google.com> Reported-by: Suleiman Souhlal <suleiman@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ptrace/x86: Partly fix set_task_blockstep()->update_debugctlmsr() logicOleg Nesterov2013-02-201-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 95cf00fa5d5e2a200a2c044c84bde8389a237e02 upstream. Afaics the usage of update_debugctlmsr() and TIF_BLOCKSTEP in step.c was always very wrong. 1. update_debugctlmsr() was simply unneeded. The child sleeps TASK_TRACED, __switch_to_xtra(next_p => child) should notice TIF_BLOCKSTEP and set/clear DEBUGCTLMSR_BTF after resume if needed. 2. It is wrong. The state of DEBUGCTLMSR_BTF bit in CPU register should always match the state of current's TIF_BLOCKSTEP bit. 3. Even get_debugctlmsr() + update_debugctlmsr() itself does not look right. Irq can change other bits in MSR_IA32_DEBUGCTLMSR register or the caller can be preempted in between. 4. It is not safe to play with TIF_BLOCKSTEP if task != current. DEBUGCTLMSR_BTF and TIF_BLOCKSTEP should always match each other if the task is running. The tracee is stopped but it can be SIGKILL'ed right before set/clear_tsk_thread_flag(). However, now that uprobes uses user_enable_single_step(current) we can't simply remove update_debugctlmsr(). So this patch adds the additional "task == current" check and disables irqs to avoid the race with interrupts/preemption. Unfortunately this patch doesn't solve the last problem, we need another fix. Probably we should teach ptrace_stop() to set/clear single/block stepping after resume. And afaics there is yet another problem: perf can play with MSR_IA32_DEBUGCTLMSR from nmi, this obviously means that even __switch_to_xtra() has problems. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ptrace/x86: Introduce set_task_blockstep() helperOleg Nesterov2013-02-201-20/+21
| | | | | | | | | | | | | | commit 848e8f5f0ad3169560c516fff6471be65f76e69f upstream. No functional changes, preparation for the next fix and for uprobes single-step fixes. Move the code playing with TIF_BLOCKSTEP/DEBUGCTLMSR_BTF into the new helper, set_task_blockstep(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86-64: Replace left over sti/cli in ia32 audit exit codeJan Beulich2013-02-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | commit 40a1ef95da85843696fc3ebe5fce39b0db32669f upstream. For some reason they didn't get replaced so far by their paravirt equivalents, resulting in code to be run with interrupts disabled that doesn't expect so (causing, in the observed case, a BUG_ON() to trigger) when syscall auditing is enabled. David (Cc-ed) came up with an identical fix, so likely this can be taken to count as an ack from him. Reported-by: Peter Moody <pmoody@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/5108E01902000078000BA9C5@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Tested-by: Peter Moody <pmoody@google.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86, efi: Set runtime_version to the EFI spec revisionMatt Fleming2013-02-061-1/+1
| | | | | | | | | | | | | | | | | | commit 712ba9e9afc4b3d3d6fa81565ca36fe518915c01 upstream. efi.runtime_version is erroneously being set to the value of the vendor's firmware revision instead of that of the implemented EFI specification. We can't deduce which EFI functions are available based on the revision of the vendor's firmware since the version scheme is likely to be unique to each vendor. What we really need to know is the revision of the implemented EFI specification, which is available in the EFI System Table header. Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86: Use enum instead of literals for trap valuesKees Cook2013-02-063-57/+90
| | | | | | | | | | | | | | | | commit c94082656dac74257f63e91f78d5d458ac781fa5 upstream. The traps are referred to by their numbers and it can be difficult to understand them while reading the code without context. This patch adds enumeration of the trap numbers and replaces the numbers with the correct enum for x86. Signed-off-by: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/20120310000710.GA32667@www.outflux.net Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cherry-picked-for: v2.3.37 Signed-off-by: John Kacur <jkacur@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86/Sandy Bridge: Sandy Bridge workaround depends on CONFIG_PCIH. Peter Anvin2013-02-061-0/+2
| | | | | | | | | | | commit e43b3cec711a61edf047adf6204d542f3a659ef8 upstream. early_pci_allowed() and read_pci_config_16() are only available if CONFIG_PCI is defined. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86/Sandy Bridge: mark arrays in __init functions as __initconstH. Peter Anvin2013-02-061-2/+2
| | | | | | | | | | | | commit ab3cd8670e0b3fcde7f029e1503ed3c5138e9571 upstream. Mark static arrays as __initconst so they get removed when the init sections are flushed. Reported-by: Mathias Krause <minipli@googlemail.com> Link: http://lkml.kernel.org/r/75F4BEE6-CB0E-4426-B40B-697451677738@googlemail.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* x86/Sandy Bridge: reserve pages when integrated graphics is presentJesse Barnes2013-02-061-0/+78
| | | | | | | | | | | | | | | | | | | | | commit a9acc5365dbda29f7be2884efb63771dc24bd815 upstream. SNB graphics devices have a bug that prevent them from accessing certain memory ranges, namely anything below 1M and in the pages listed in the table. So reserve those at boot if set detect a SNB gfx device on the CPU to avoid GPU hangs. Stephane Marchesin had a similar patch to the page allocator awhile back, but rather than reserving pages up front, it leaked them at allocation time. [ hpa: made a number of stylistic changes, marked arrays as static const, and made less verbose; use "memblock=debug" for full verbosity. ] Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>