| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Variable Length Arrays In Structs (VLAIS) is not supported by Clang, and
frowned upon by others.
https://lkml.org/lkml/2013/9/23/500
Here, the VLAIS was used because the size of the bitmap returned from
xen_mc_entry() depended on possibly (based on kernel configuration)
runtime sized data. Rather than declaring args as a VLAIS then calling
sizeof on *args, we calculate the appropriate sizeof args manually.
Further, we can get rid of the #ifdef's and rely on num_possible_cpus()
(thanks to a helpful checkpatch warning from an earlier version of this
patch).
Suggested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
| |
The header declares this function as __init but is defined in __ref
section.
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit f5775e0b6116 ("x86/xen: discard RAM regions above the maximum
reservation") left host memory not assigned to dom0 as available for
memory hotplug.
Unfortunately this also meant that those regions could be used by
others. Specifically, commit fa564ad96366 ("x86/PCI: Enable a 64bit BAR
on AMD Family 15h (Models 00-1f, 30-3f, 60-7f)") may try to map those
addresses as MMIO.
To prevent this mark unallocated host memory as E820_TYPE_UNUSABLE (thus
effectively reverting f5775e0b6116) and keep track of that region as
a hostmem resource that can be used for the hotplug.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A few thousand such pages are usually left around due to the re-use of
L1 tables having been provided by the hypervisor (Dom0) or tool stack
(DomU). Set NX in the direct map variant, which needs to be done in L2
due to the dual use of the re-used L1s.
For x86_configure_nx() to actually do what it is supposed to do, call
get_cpu_cap() first. This was broken by commit 4763ed4d45 ("x86, mm:
Clean up and simplify NX enablement") when switching away from the
direct EFER read.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unconditionally reporting a value seen on the P4 or older invokes
functionality like io_apic_get_unique_id() on 32-bit builds, resulting
in a panic() with sufficiently many CPUs and/or IO-APICs. Doing what
that function does would be the hypervisor's responsibility anyway, so
makes no sense to be used when running on Xen. Uniformly report a more
modern version; this shouldn't matter much as both LAPIC and IO-APIC are
being managed entirely / mostly by the hypervisor.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In order to support pvclock vdso on xen we need to setup the time
info page for vcpu 0 and register the page with Xen using the
VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall
will also forcefully update the pvti which will set some of the
necessary flags for vdso. Afterwards we check if it supports the
PVCLOCK_TSC_STABLE_BIT flag which is mandatory for having
vdso/vsyscall support. And if so, it will set the cpu 0 pvti that
will be later on used when mapping the vdso image.
The xen headers are also updated to include the new hypercall for
registering the secondary vcpu_time_info struct.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
|
| |
Specifically check for PVCLOCK_TSC_STABLE_BIT and if this bit is set,
then set it too on pvclock flags. This allows Xen clocksource to use it
and thus speeding up xen_clocksource_read() callers (i.e. sched_clock())
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Right now there is only a pvclock_pvti_cpu0_va() which is defined
on kvmclock since:
commit dac16fba6fc5
("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
The only user of this interface so far is kvm. This commit adds a
setter function for the pvti page and moves pvclock_pvti_cpu0_va
to pvclock, which is a more generic place to have it; and would
allow other PV clocksources to use it, such as Xen.
While moving pvclock_pvti_cpu0_va into pvclock, rename also this
function to pvclock_get_pvti_cpu0_va (including its call sites)
to be symmetric with the setter (pvclock_set_pvti_cpu0_va).
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
| |
Update arch/x86/include/asm/xen/cpuid.h from the Xen tree to get newest
definitions.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The grant v2 support was removed from the kernel with
commit 438b33c7145ca8a5131a30c36d8f59bce119a19a ("xen/grant-table:
remove support for V2 tables") as the higher memory footprint of v2
grants resulted in less grants being possible for a kernel compared
to the v1 grant interface.
As machines with more than 16TB of memory are expected to be more
common in the near future support of grant v2 is mandatory in order
to be able to run a Xen pv domain at any memory location.
So re-add grant v2 support basically by reverting above commit.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the domain has XENFEAT_auto_translated_physmap then use of the PV-
specific HYPERVISOR_mmu_update hypercall is clearly incorrect.
This patch adds checks in xen_remap_domain_gfn_array() and
xen_unmap_domain_gfn_array() which call through to the approprate
xlate_mmu function if the feature is present. A check is also added
to xen_remap_domain_gfn_range() to fail with -EOPNOTSUPP since this
should not be used in an HVM tools domain.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
| |
For any other error sync_cmos_clock() will reschedule itself
every second or so, for no good reason.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Physical addresses on processors supporting 5 level paging can be up to
52 bits wide. For a Xen pv guest running on such a machine those
physical addresses have to be supported in order to be able to use any
memory on the machine even if the guest itself does not support 5 level
paging.
So when reading/writing a MFN from/to a pte don't use the kernel's
PTE_PFN_MASK but a new XEN_PTE_MFN_MASK allowing full 40 bit wide MFNs.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- revert a /dev/mem restriction change that crashes with certain boot
parameters
- an AMD erratum fix for cases where the BIOS doesn't apply it
- fix unwinder debuginfo
- improve ORC unwinder warning printouts"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "x86/mm: Limit mmap() of /dev/mem to valid physical addresses"
x86/unwind: Show function name+offset in ORC error messages
x86/entry: Fix idtentry unwind hint
x86/cpu/AMD: Apply the Erratum 688 fix when the BIOS doesn't
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts commit ce56a86e2ade45d052b3228cdfebe913a1ae7381.
There's unanticipated interaction with some boot parameters like 'mem=',
which now cause the new checks via valid_mmap_phys_addr_range() to be too
restrictive, crashing a Qemu bootup in fact, as reported by Fengguang Wu.
So while the motivation of the change is still entirely valid, we
need a few more rounds of testing to get it right - it's way too late
after -rc6, so revert it for now.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Craig Bergstrom <craigb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: dsafonov@virtuozzo.com
Cc: kirill.shutemov@linux.intel.com
Cc: mhocko@suse.com
Cc: oleg@redhat.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Improve the warning messages to show the relevant function name+offset.
This makes it much easier to diagnose problems with the ORC metadata.
Before:
WARNING: can't dereference iret registers at ffff8801c5f17fe0 for ip ffffffff95f0d94b
After:
WARNING: can't dereference iret registers at ffff880178f5ffe0 for ip int3+0x5b/0x60
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder")
Link: http://lkml.kernel.org/r/6bada6b9eac86017e16bd79e1e77877935cb50bb.1508516398.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This fixes the following ORC warning in the 'int3' entry code:
WARNING: can't dereference iret registers at ffff8801c5f17fe0 for ip ffffffff95f0d94b
The ORC metadata had the wrong stack offset for the iret registers.
Their location on the stack is dependent on whether the exception has an
error code.
Reported-and-tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8c1f75587a18 ("x86/entry/64: Add unwind hint annotations")
Link: http://lkml.kernel.org/r/931d57f0551ed7979d5e7e05370d445c8e5137f8.1508516398.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Some F14h machines have an erratum which, "under a highly specific
and detailed set of internal timing conditions" can lead to skipping
instructions and RIP corruption.
Add the fix for those machines when their BIOS doesn't apply it or
there simply isn't BIOS update for them.
Tested-by: <mirh@protonmail.ch>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Link: http://lkml.kernel.org/r/20171022104731.28249-1-bp@alien8.de
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197285
[ Added pr_info() that we activated the workaround. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit:
d2878d642a4ed ("perf/x86/intel/bts: Disallow use by unprivileged users on paranoid systems")
... adds a privilege check in the exactly wrong place in the event init path:
after the 'LBR exclusive' reference has been taken, and doesn't release it
in the case of insufficient privileges. After this, nobody in the system
gets to use PT or LBR afterwards.
This patch moves the privilege check to where it should have been in the
first place.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d2878d642a4ed ("perf/x86/intel/bts: Disallow use by unprivileged users on paranoid systems")
Link: http://lkml.kernel.org/r/20171023123533.16973-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, it is possible to mmap() any offset from /dev/mem. If a
program mmaps() /dev/mem offsets outside of the addressable limits
of a system, the page table can be corrupted by setting reserved bits.
For example if you mmap() offset 0x0001000000000000 of /dev/mem on an
x86_64 system with a 48-bit bus, the page fault handler will be called
with error_code set to RSVD. The kernel then crashes with a page table
corruption error.
This change prevents this page table corruption on x86 by refusing
to mmap offsets higher than the highest valid address in the system.
Signed-off-by: Craig Bergstrom <craigb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: dsafonov@virtuozzo.com
Cc: kirill.shutemov@linux.intel.com
Cc: mhocko@suse.com
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/20171019192856.39672-1-craigb@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Borislav thinks that we don't need this knob in a released kernel.
Get rid of it.
Requested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode")
Link: http://lkml.kernel.org/r/1fa72431924e81e86c164ff7881bf9240d1f1a6c.1508000261.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Due to timezones, commit:
b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode")
was an outdated patch that well tested and fixed the bug but didn't
address Borislav's review comments.
Tidy it up:
- The name "tlb_use_lazy_mode()" was highly confusing. Change it to
"tlb_defer_switch_to_init_mm()", which describes what it actually
means.
- Move the static_branch crap into a helper.
- Improve comments.
Actually removing the debugfs option is in the next patch.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b956575bed91 ("x86/mm: Flush more aggressively in lazy TLB mode")
Link: http://lkml.kernel.org/r/154ef95428d4592596b6e98b0af1d2747d6cfbf8.1508000261.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let's avoid hard-to-diagnose crashes in the future.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f423bbc97864089fbdeb813f1ea126c6eaed844a.1508000261.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Blacklist Broadwell X model 79 for late loading due to an erratum.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171018111225.25635-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Moving the early IDT setup out of assembly code breaks the boot on first
generation 486 systems.
The reason is that the call of idt_setup_early_handler, which sets up the
early handlers was added after the call to cr4_init_shadow().
cr4_init_shadow() tries to read CR4 which is not available on those
systems. The accessor function uses a extable fixup to handle the resulting
fault. As the IDT is not set up yet, the cr4 read exception causes an
instantaneous reboot for obvious reasons.
Call idt_setup_early_handler() before cr4_init_shadow() so IDT is set up
before the first exception hits.
Fixes: 87e81786b13b ("x86/idt: Move early IDT setup out of 32-bit asm")
Reported-and-tested-by: Matthew Whitehead <whiteheadm@acm.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710161210290.1973@nanos
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The 'this_leaf' variable is assigned a value that is never
read and it is updated a little later with a newer value,
hence we can remove the redundant assignment.
Cleans up the following Clang warning:
Value stored to 'this_leaf' is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20171015160203.12332-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"A landry list of fixes:
- fix reboot breakage on some PCID-enabled system
- fix crashes/hangs on some PCID-enabled systems
- fix microcode loading on certain older CPUs
- various unwinder fixes
- extend an APIC quirk to more hardware systems and disable APIC
related warning on virtualized systems
- various Hyper-V fixes
- a macro definition robustness fix
- remove jprobes IRQ disabling
- various mem-encryption fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Do the family check first
x86/mm: Flush more aggressively in lazy TLB mode
x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping
x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on hypervisors
x86/mm: Disable various instrumentations of mm/mem_encrypt.c and mm/tlb.c
x86/hyperv: Fix hypercalls with extended CPU ranges for TLB flushing
x86/hyperv: Don't use percpu areas for pcpu_flush/pcpu_flush_ex structures
x86/hyperv: Clear vCPU banks between calls to avoid flushing unneeded vCPUs
x86/unwind: Disable unwinder warnings on 32-bit
x86/unwind: Align stack pointer in unwinder dump
x86/unwind: Use MSB for frame pointer encoding on 32-bit
x86/unwind: Fix dereference of untrusted pointer
x86/alternatives: Fix alt_max_short macro to really be a max()
x86/mm/64: Fix reboot interaction with CR4.PCIDE
kprobes/x86: Remove IRQ disabling from jprobe handlers
kprobes/x86: Set up frame pointer in kprobe trampoline
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
On CPUs like AMD's Geode, for example, we shouldn't even try to load
microcode because they do not support the modern microcode loading
interface.
However, we do the family check *after* the other checks whether the
loader has been disabled on the command line or whether we're running in
a guest.
So move the family checks first in order to exit early if we're being
loaded on an unsupported family.
Reported-and-tested-by: Sven Glodowski <glodi1@arcor.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.11..
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://bugzilla.suse.com/show_bug.cgi?id=1061396
Link: http://lkml.kernel.org/r/20171012112316.977-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Since commit:
94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.
From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.
Unfortunately, I forgot about a subtlety. By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent. This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.
I can imagine this causing two different problems:
- A speculative page walk starting from a bogus page table could read
IO addresses. I haven't seen any reports of this causing problems.
- A speculative page walk that involves a bogus page table can install
garbage in the TLB. Such garbage would always be at a user VA, but
some AMD CPUs have logic that triggers a machine check when it notices
these bogus entries. I've seen a couple reports of this.
Boris further explains the failure mode:
> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB- DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.
To fix this, reinstate TLB coherence in lazy mode. With this patch
applied, we do it in one of two ways:
- If we have PCID, we simply switch back to init_mm's page tables
when we enter a kernel thread -- this seems to be quite cheap
except for the cost of serializing the CPU.
- If we don't have PCID, then we set a flag and switch to init_mm
the first time we would otherwise need to flush the TLB.
The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.
In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed. Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
SKX stepping-3 fixed the TSC_DEADLINE issue in a different ucode
version number than stepping-4. Linux needs to know this stepping-3
specific version number to also enable the TSC_DEADLINE on stepping-3.
The steppings and ucode versions are documented in the SKX BIOS update:
https://downloadmirror.intel.com/26978/eng/ReleaseNotes_R00.01.0004.txt
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Link: https://lkml.kernel.org/r/60f2bbf7cf617e212b522e663f84225bfebc50e5.1507756305.git.len.brown@intel.com
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 594a30fb1242 ("x86/apic: Silence "FW_BUG TSC_DEADLINE disabled
due to Errata" on CPUs without the feature", 2017-08-30) was also about
silencing the warning on VirtualBox; however, KVM does expose the TSC
deadline timer, and it's virtualized so that it is immune from CPU errata.
Therefore, booting 4.13 with "-cpu Haswell" shows this in the logs:
[ 0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata;
please update microcode to version: 0xb2 (or later)
Even if you had a hypervisor that does _not_ virtualize the TSC deadline
and rather exposes the hardware one, it should be the hypervisors task
to update microcode and possibly hide the flag from CPUID. So just
hide the message when running on _any_ hypervisor, not just those that
do not support the TSC deadline timer.
The older check still makes sense, so keep it.
Fixes: bd9240a18e ("x86/apic: Add TSC_DEADLINE quirk due to errata")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1507630377-54471-1-git-send-email-pbonzini@redhat.com
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Some routines in mem_encrypt.c are called very early in the boot process,
e.g. sme_enable(). When CONFIG_KCOV=y is defined the resulting code added
to sme_enable() (and others) for KCOV instrumentation results in a kernel
crash. Disable the KCOV instrumentation for mem_encrypt.c by adding
KCOV_INSTRUMENT_mem_encrypt.o := n to arch/x86/mm/Makefile.
In order to avoid other possible early boot issues, model mem_encrypt.c
after head64.c in regards to tools. In addition to disabling KCOV as
stated above and a previous patch that disables branch profiling, also
remove the "-pg" CFLAG if CONFIG_FUNCTION_TRACER is enabled and set
KASAN_SANITIZE to "n", each of which are done on a file basis.
Reported-by: kernel test robot <lkp@01.org>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171010194504.18887.38053.stgit@tlendack-t1.amdoffice.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Do not consider the fixed size of hv_vp_set when passing the variable
header size to hv_do_rep_hypercall().
The Hyper-V hypervisor specification states that for a hypercall with a
variable header only the size of the variable portion should be supplied
via the input control.
For HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX/LIST_EX calls that means the
fixed portion of hv_vp_set should not be considered.
That fixes random failures of some applications that are unexpectedly
killed with SIGBUS or SIGSEGV.
Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: devel@linuxdriverproject.org
Fixes: 628f54cc6451 ("x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls")
Link: http://lkml.kernel.org/r/1507210469-29065-1-git-send-email-marcelo.cerri@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
hv_do_hypercall() does virt_to_phys() translation and with some configs
(CONFIG_SLAB) this doesn't work for percpu areas, we pass wrong memory to
hypervisor and get #GP. We could use working slow_virt_to_phys() instead
but doing so kills the performance.
Move pcpu_flush/pcpu_flush_ex structures out of percpu areas and
allocate memory on first call. The additional level of indirection gives
us a small performance penalty, in future we may consider introducing
hypercall functions which avoid virt_to_phys() conversion and cache
physical addresses of pcpu_flush/pcpu_flush_ex structures somewhere.
Reported-by: Simon Xiao <sixiao@microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20171005113924.28021-1-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
hv_flush_pcpu_ex structures are not cleared between calls for performance
reasons (they're variable size up to PAGE_SIZE each) but we must clear
hv_vp_set.bank_contents part of it to avoid flushing unneeded vCPUs. The
rest of the structure is formed correctly.
To do the clearing in an efficient way stash the maximum possible vCPU
number (this may differ from Linux CPU id).
Reported-by: Jork Loeser <Jork.Loeser@microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Link: http://lkml.kernel.org/r/20171006154854.18092-1-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
x86-32 doesn't have stack validation, so in most cases it doesn't make
sense to warn about bad frame pointers.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a69658760800bf281e6353248c23e0fa0acf5230.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When printing the unwinder dump, the stack pointer could be unaligned,
for one of two reasons:
- stack corruption; or
- GCC created an unaligned stack.
There's no way for the unwinder to tell the difference between the two,
so we have to assume one or the other. GCC unaligned stacks are very
rare, and have only been spotted before GCC 5. Presumably, if we're
doing an unwinder stack dump, stack corruption is more likely than a
GCC unaligned stack. So always align the stack before starting the
dump.
Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/2f540c515946ab09ed267e1a1d6421202a0cce08.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
On x86-32, Tetsuo Handa and Fengguang Wu reported unwinder warnings
like:
WARNING: kernel stack regs at f60bb9c8 in swapper:1 has bad 'bp' value 0ba00000
And also there were some stack dumps with a bunch of unreliable '?'
symbols after an apic_timer_interrupt symbol, meaning the unwinder got
confused when it tried to read the regs.
The cause of those issues is that, with GCC 4.8 (and possibly older),
there are cases where GCC misaligns the stack pointer in a leaf function
for no apparent reason:
c124a388 <acpi_rs_move_data>:
c124a388: 55 push %ebp
c124a389: 89 e5 mov %esp,%ebp
c124a38b: 57 push %edi
c124a38c: 56 push %esi
c124a38d: 89 d6 mov %edx,%esi
c124a38f: 53 push %ebx
c124a390: 31 db xor %ebx,%ebx
c124a392: 83 ec 03 sub $0x3,%esp
...
c124a3e3: 83 c4 03 add $0x3,%esp
c124a3e6: 5b pop %ebx
c124a3e7: 5e pop %esi
c124a3e8: 5f pop %edi
c124a3e9: 5d pop %ebp
c124a3ea: c3 ret
If an interrupt occurs in such a function, the regs on the stack will be
unaligned, which breaks the frame pointer encoding assumption. So on
32-bit, use the MSB instead of the LSB to encode the regs.
This isn't an issue on 64-bit, because interrupts align the stack before
writing to it.
Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/279a26996a482ca716605c7dbc7f2db9d8d91e81.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Tetsuo Handa and Fengguang Wu reported a panic in the unwinder:
BUG: unable to handle kernel NULL pointer dereference at 000001f2
IP: update_stack_state+0xd4/0x340
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 18728 Comm: 01-cpu-hotplug Not tainted 4.13.0-rc4-00170-gb09be67 #592
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
task: bb0b53c0 task.stack: bb3ac000
EIP: update_stack_state+0xd4/0x340
EFLAGS: 00010002 CPU: 0
EAX: 0000a570 EBX: bb3adccb ECX: 0000f401 EDX: 0000a570
ESI: 00000001 EDI: 000001ba EBP: bb3adc6b ESP: bb3adc3f
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: 000001f2 CR3: 0b3a7000 CR4: 00140690
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: fffe0ff0 DR7: 00000400
Call Trace:
? unwind_next_frame+0xea/0x400
? __unwind_start+0xf5/0x180
? __save_stack_trace+0x81/0x160
? save_stack_trace+0x20/0x30
? __lock_acquire+0xfa5/0x12f0
? lock_acquire+0x1c2/0x230
? tick_periodic+0x3a/0xf0
? _raw_spin_lock+0x42/0x50
? tick_periodic+0x3a/0xf0
? tick_periodic+0x3a/0xf0
? debug_smp_processor_id+0x12/0x20
? tick_handle_periodic+0x23/0xc0
? local_apic_timer_interrupt+0x63/0x70
? smp_trace_apic_timer_interrupt+0x235/0x6a0
? trace_apic_timer_interrupt+0x37/0x3c
? strrchr+0x23/0x50
Code: 0f 95 c1 89 c7 89 45 e4 0f b6 c1 89 c6 89 45 dc 8b 04 85 98 cb 74 bc 88 4d e3 89 45 f0 83 c0 01 84 c9 89 04 b5 98 cb 74 bc 74 3b <8b> 47 38 8b 57 34 c6 43 1d 01 25 00 00 02 00 83 e2 03 09 d0 83
EIP: update_stack_state+0xd4/0x340 SS:ESP: 0068:bb3adc3f
CR2: 00000000000001f2
---[ end trace 0d147fd4aba8ff50 ]---
Kernel panic - not syncing: Fatal exception in interrupt
On x86-32, after decoding a frame pointer to get a regs address,
regs_size() dereferences the regs pointer when it checks regs->cs to see
if the regs are user mode. This is dangerous because it's possible that
what looks like a decoded frame pointer is actually a corrupt value, and
we don't want the unwinder to make things worse.
Instead of calling regs_size() on an unsafe pointer, just assume they're
kernel regs to start with. Later, once it's safe to access the regs, we
can do the user mode check and corresponding safety check for the
remaining two regs.
Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: LKP <lkp@01.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5ed8d8bb38c5 ("x86/unwind: Move common code into update_stack_state()")
Link: http://lkml.kernel.org/r/7f95b9a6993dec7674b3f3ab3dcd3294f7b9644d.1507597785.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The alt_max_short() macro in asm/alternative.h does not work as
intended, leading to nasty bugs. E.g. alt_max_short("1", "3")
evaluates to 3, but alt_max_short("3", "1") evaluates to 1 -- not
exactly the maximum of 1 and 3.
In fact, I had to learn it the hard way by crashing my kernel in not
so funny ways by attempting to make use of the ALTENATIVE_2 macro
with alternatives where the first one was larger than the second
one.
According to [1] and commit dbe4058a6a44 ("x86/alternatives: Fix
ALTERNATIVE_2 padding generation properly") the right handed side
should read "-(-(a < b))" not "-(-(a - b))". Fix that, to make the
macro work as intended.
While at it, fix up the comments regarding the additional "-", too.
It's not about gas' usage of s32 but brain dead logic of having a
"true" value of -1 for the < operator ... *sigh*
Btw., the one in asm/alternative-asm.h is correct. And, apparently,
all current users of ALTERNATIVE_2() pass same sized alternatives,
avoiding to hit the bug.
[1] http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax
Reviewed-and-tested-by: Borislav Petkov <bp@suse.de>
Fixes: dbe4058a6a44 ("x86/alternatives: Fix ALTERNATIVE_2 padding generation properly")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1507228213-13095-1-git-send-email-minipli@googlemail.com
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Trying to reboot via real mode fails with PCID on: long mode cannot
be exited while CR4.PCIDE is set. (No, I have no idea why, but the
SDM and actual CPUs are in agreement here.) The result is a GPF and
a hang instead of a reboot.
I didn't catch this in testing because neither my computer nor my VM
reboots this way. I can trigger it with reboot=bios, though.
Fixes: 660da7c9228f ("x86/mm: Enable CR4.PCIDE on supported systems")
Reported-and-tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/f1e7d965998018450a7a70c2823873686a8b21c0.1507524746.git.luto@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Jprobes actually don't need to disable IRQs while calling
handlers, because of how we specify the kernel interface in
Documentation/kprobes.txt:
-----
Probe handlers are run with preemption disabled. Depending on the
architecture and optimization state, handlers may also run with
interrupts disabled (e.g., kretprobe handlers and optimized kprobe
handlers run without interrupt disabled on x86/x86-64).
-----
So let's remove IRQ disabling from jprobes too.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/150701508194.32266.14458959863314097305.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Richard Weinberger saw an unwinder warning when running bcc's opensnoop:
WARNING: kernel stack frame pointer at ffff99ef4076bea0 in opensnoop:2008 has bad value 0000000000000008
unwind stack type:0 next_sp: (null) mask:0x2 graph_idx:0
...
ffff99ef4076be88: ffff99ef4076bea0 (0xffff99ef4076bea0)
ffff99ef4076be90: ffffffffac442721 (optimized_callback +0x81/0x90)
...
A lockdep stack trace was initiated from inside a kprobe handler, when
the unwinder noticed a bad frame pointer on the stack. The bad frame
pointer is related to the fact that the kprobe optprobe trampoline
doesn't save the frame pointer before calling into optimized_callback().
Reported-and-tested-by: Richard Weinberger <richard@sigma-star.at>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/7aef2f8ecd75c2f505ef9b80490412262cf4a44c.1507038547.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixes from Ingo Molnar:
"A boot parameter fix, plus a header export fix"
* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Hide mca_cfg
RAS/CEC: Use the right length for "cec_disable"
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Now that lguest is gone, put it in the internal header which should be
used only by MCA/RAS code.
Add missing header guards while at it.
No functional change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20171002092836.22971-3-bp@alien8.de
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Some tooling fixes plus three kernel fixes: a memory leak fix, a
statistics fix and a crash fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Fix memory leaks on allocation failures
perf/core: Fix cgroup time when scheduling descendants
perf/core: Avoid freeing static PMU contexts when PMU is unregistered
tools include uapi bpf.h: Sync kernel ABI header with tooling header
perf pmu: Unbreak perf record for arm/arm64 with events with explicit PMU
perf script: Add missing separator for "-F ip,brstack" (and brstackoff)
perf callchain: Compare dsos (as well) for CCKEY_FUNCTION
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Currently if an allocation fails then the error return paths
don't free up any currently allocated pmus[].boxes and pmus causing
a memory leak. Add an error clean up exit path that frees these
objects.
Detected by CoverityScan, CID#711632 ("Resource Leak")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Fixes: 087bfbb03269 ("perf/x86: Add generic Intel uncore PMU support")
Link: http://lkml.kernel.org/r/20171009172655.6132-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixlet from Juergen Gross:
"A minor fix correcting the cpu hotplug name for Xen guests"
* tag 'for-linus-4.14c-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/vcpu: Use a unified name about cpu hotplug state for pv and pvhvm
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
As xen_cpuhp_setup is called by PV and PVHVM, the name of "x86/xen/hvm_guest"
is confusing.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When KVM emulates an exit from L2 to L1, it loads L1 CR4 into the
guest CR4. Before this CR4 loading, the guest CR4 refers to L2
CR4. Because these two CR4's are in different levels of guest, we
should vmx_set_cr4() rather than kvm_set_cr4() here. The latter, which
is used to handle guest writes to its CR4, checks the guest change to
CR4 and may fail if the change is invalid.
The failure may cause trouble. Consider we start
a L1 guest with non-zero L1 PCID in use,
(i.e. L1 CR4.PCIDE == 1 && L1 CR3.PCID != 0)
and
a L2 guest with L2 PCID disabled,
(i.e. L2 CR4.PCIDE == 0)
and following events may happen:
1. If kvm_set_cr4() is used in load_vmcs12_host_state() to load L1 CR4
into guest CR4 (in VMCS01) for L2 to L1 exit, it will fail because
of PCID check. As a result, the guest CR4 recorded in L0 KVM (i.e.
vcpu->arch.cr4) is left to the value of L2 CR4.
2. Later, if L1 attempts to change its CR4, e.g., clearing VMXE bit,
kvm_set_cr4() in L0 KVM will think L1 also wants to enable PCID,
because the wrong L2 CR4 is used by L0 KVM as L1 CR4. As L1
CR3.PCID != 0, L0 KVM will inject GP to L1 guest.
Fixes: 4704d0befb072 ("KVM: nVMX: Exiting from L2 to L1")
Cc: qemu-stable@nongnu.org
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|