| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
with POWER9, and the result is undefined on earlier CPUs.
Commits 7b9f71f974 ("powerpc/64s: POWER9 machine check handler") and
d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on
POWER9") caused POWER7/8 code to use this instruction. Remove it. An
ERAT flush can be made by invalidatig the SLB, but before POWER9 that
requires a flush and rebolt.
Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler")
Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When enumerating page size definitions to check hardware support,
we construct a constant which is (1U << (def->shift - 10)).
However, the array of page size definitions is only initalised for
various MMU_PAGE_* constants, so it contains a number of 0-initialised
elements with def->shift == 0. This means we end up shifting by a
very large number, which gives the following UBSan splat:
================================================================================
UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
shift exponent 4294967286 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6
Call Trace:
[c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
[c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64
[c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
[c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0
[c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130
[c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80
================================================================================
Fix this by first checking if the element exists (shift != 0) before
constructing the constant.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When a process allocates a hugepage, the following leak is
reported by kmemleak. This is a false positive which is
due to the pointer to the table being stored in the PGD
as physical memory address and not virtual memory pointer.
unreferenced object 0xc30f8200 (size 512):
comm "mmap", pid 374, jiffies 4872494 (age 627.630s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<e32b68da>] huge_pte_alloc+0xdc/0x1f8
[<9e0df1e1>] hugetlb_fault+0x560/0x8f8
[<7938ec6c>] follow_hugetlb_page+0x14c/0x44c
[<afbdb405>] __get_user_pages+0x1c4/0x3dc
[<b8fd7cd9>] __mm_populate+0xac/0x140
[<3215421e>] vm_mmap_pgoff+0xb4/0xb8
[<c148db69>] ksys_mmap_pgoff+0xcc/0x1fc
[<4fcd760f>] ret_from_syscall+0x0/0x38
See commit a984506c542e2 ("powerpc/mm: Don't report PUDs as
memory leaks when using kmemleak") for detailed explanation.
To fix that, this patch tells kmemleak to ignore the allocated
hugepage table.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Make sure we are operating on THP and hugetlb entries in the respective hash
fault handling routines.
No functional change in this patch. If we walked the table wrongly before, we
will retry the access.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Update few code paths to check for pmd_large.
set_pmd_at:
We want to use this to store swap pte at pmd level. For swap ptes we don't want
to set H_PAGE_THP_HUGE. Hence check for pmd_large in set_pmd_at. This remove
the false WARN_ON when using this with swap pmd entry.
pmd_page:
We don't really use them on pmd migration entries. But they can also work with
migration entries and we don't differentiate at the pte level. Hence update
pmd_page to work with pmd migration entries too
__find_linux_pte:
lockless page table walk need to handle pmd migration entries. pmd_trans_huge
check will return false on them. We don't set thp = 1 for such entries, but
update hpage_shift correctly. Without this we will walk pmd migration entries
as a pte page pointer which is wrong.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This make hugetlb directory pointer similar to other page able entries. A hugepd
entry is identified by lack of _PAGE_PTE bit set and directory size stored in
HUGEPD_SHIFT_MASK. We update that to also look at _PAGE_PRESENT
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
With this patch we use 0x8000000000000000UL (_PAGE_PRESENT) to indicate a valid
pgd/pud/pmd entry. We also switch the p**_present() to look at this bit.
With pmd_present, we have a special case. We need to make sure we consider a
pmd marked invalid during THP split as present. Right now we clear the
_PAGE_PRESENT bit during a pmdp_invalidate. Inorder to consider this special
case we add a new pte bit _PAGE_INVALID (mapped to _RPAGE_SW0). This bit is
only used with _PAGE_PRESENT cleared. Hence we are not really losing a pte bit
for this special case. pmd_present is also updated to look at _PAGE_INVALID.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This reverts commits:
5e46e29e6a97 ("powerpc/64s/hash: convert SLB miss handlers to C")
8fed04d0f6ae ("powerpc/64s/hash: remove user SLB data from the paca")
655deecf67b2 ("powerpc/64s/hash: SLB allocation status bitmaps")
2e1626744e8d ("powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup")
89ca4e126a3f ("powerpc/64s/hash: Add a SLB preload cache")
This series had a few bugs, and the fixes are not all trivial. So
revert most of it for now.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When switching processes, currently all user SLBEs are cleared, and a
few (exec_base, pc, and stack) are preloaded. In trivial testing with
small apps, this tends to miss the heap and low 256MB segments, and it
will also miss commonly accessed segments on large memory workloads.
Add a simple round-robin preload cache that just inserts the last SLB
miss into the head of the cache and preloads those at context switch
time. Every 256 context switches, the oldest entry is removed from the
cache to shrink the cache and require fewer slbmte if they are unused.
Much more could go into this, including into the SLB entry reclaim
side to track some LRU information etc, which would require a study of
large memory workloads. But this is a simple thing we can do now that
is an obvious win for common workloads.
With the full series, process switching speed on the context_switch
benchmark on POWER9/hash (with kernel speculation security masures
disabled) increases from 140K/s to 178K/s (27%).
POWER8 does not change much (within 1%), it's unclear why it does not
see a big gain like POWER9.
Booting to busybox init with 256MB segments has SLB misses go down
from 945 to 69, and with 1T segments 900 to 21. These could almost all
be eliminated by preloading a bit more carefully with ELF binary
loading.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This will be used by the SLB code in the next patch, but for now this
sets the slb_addr_limit to the correct size for 32-bit tasks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add 32-entry bitmaps to track the allocation status of the first 32
SLB entries, and whether they are user or kernel entries. These are
used to allocate free SLB entries first, before resorting to the round
robin allocator.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
User SLB mappig data is copied into the PACA from the mm->context so
it can be accessed by the SLB miss handlers.
After the C conversion, SLB miss handlers now run with relocation on,
and user SLB misses are able to take recursive kernel SLB misses, so
the user SLB mapping data can be removed from the paca and accessed
directly.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch moves SLB miss handlers completely to C, using the standard
exception handler macros to set up the stack and branch to C.
This can be done because the segment containing the kernel stack is
always bolted, so accessing it with relocation on will not cause an
SLB exception.
Arbitrary kernel memory may not be accessed when handling kernel space
SLB misses, so care should be taken there. However user SLB misses can
access any kernel memory, which can be used to move some fields out of
the paca (in later patches).
User SLB misses could quite easily reconcile IRQs and set up a first
class kernel environment and exit via ret_from_except, however that
doesn't seem to be necessary at the moment, so we only do that if a
bad fault is encountered.
[ Credit to Aneesh for bug fixes, error checks, and improvements to bad
address handling, etc ]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Since RFC:
- Added MSR[RI] handling
- Fixed up a register loss bug exposed by irq tracing (Aneesh)
- Reject misses outside the defined kernel regions (Aneesh)
- Added several more sanity checks and error handling (Aneesh), we may
look at consolidating these tests and tightenig up the code but for
a first pass we decided it's better to check carefully.
Since v1:
- Fixed SLB cache corruption (Aneesh)
- Fixed untidy SLBE allocation "leak" in get_vsid error case
- Now survives some stress testing on real hardware
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
POWER9 introduces SLBIA IH=3, which invalidates all SLB entries and
associated lookaside information that have a class value of 1, which
Linux assigns to user addresses. This matches what switch_slb wants,
and allows a simple fast implementation that avoids the slb_cache
complexity.
As a side-effect, the POWER5 < DD2.1 SLB invalidation workaround is
also avoided on POWER9.
Process context switching rate is improved about 2.2% for a small
process that hits the slb cache which is the best case for the current
code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The SLBIA IH=1 hint will remove all non-zero SLBEs, but only
invalidate ERAT entries associated with a class value of 1, for
processors that support the hint (e.g., POWER6 and newer), which
Linux assigns to user addresses.
This prevents kernel ERAT entries from being invalidated when
context switchig (if the thread faulted in more than 8 user SLBEs).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Remove the vmalloc segment from bolted SLBEs. This is not required to
be bolted, and seems like it was added to help pre-load the SLB on
context switch. However there are now other segments like the vmemmap
segment and non-zero node memory that often take misses after a context
switch, so it is better to solve this in a more general way.
A subsequent change will track free SLB entries and uses those rather
than round-robin overwrite valid entries, which makes it far less
likely for kernel SLBEs to be evicted after they are installed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The POWER5 < DD2.1 issue is that slbie needs to be issued more than
once. It came in with this change:
ChangeSet@1.1608, 2004-04-29 07:12:31-07:00, david@gibson.dropbear.id.au
[PATCH] POWER5 erratum workaround
Early POWER5 revisions (<DD2.1) have a problem requiring slbie
instructions to be repeated under some circumstances. The patch below
adds a workaround (patch made by Anton Blanchard).
(aka. 3e4520f7605243abf66a7ccd3d2e49e48e8c0483 in the full history tree)
The extra slbie in switch_slb is done even for the case where slbia is
called (slb_flush_and_rebolt). I don't believe that is required
because there are other slb_flush_and_rebolt callers which do not
issue the workaround slbie, which would be broken if it was required.
It also seems to be fine inside the isync with the first slbie, as it
is in the kernel stack switch code.
So move this workaround to where it is required. This is not much of
an optimisation because this is the fast path, but it makes the code
more understandable and neater.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Retain slbie_data initialisation to avoid compiler warning]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
I only have POWER8/9 to test, so just remove it for those.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This causes SLB alloation to start 1 beyond the start of the SLB.
There is no real problem because after it wraps it stats behaving
properly, it's just surprisig to see when looking at SLB traces.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If we get a machine check exceptions due to SLB errors then dump the
current SLB contents which will be very much helpful in debugging the
root cause of SLB errors. Introduce an exclusive buffer per cpu to hold
faulty SLB entries. In real mode mce handler saves the old SLB contents
into this buffer accessible through paca and print it out later in virtual
mode.
With this patch the console will log SLB contents like below on SLB MCE
errors:
[ 507.297236] SLB contents of cpu 0x1
[ 507.297237] Last SLB entry inserted at slot 16
[ 507.297238] 00 c000000008000000 400ea1b217000500
[ 507.297239] 1T ESID= c00000 VSID= ea1b217 LLP:100
[ 507.297240] 01 d000000008000000 400d43642f000510
[ 507.297242] 1T ESID= d00000 VSID= d43642f LLP:110
[ 507.297243] 11 f000000008000000 400a86c85f000500
[ 507.297244] 1T ESID= f00000 VSID= a86c85f LLP:100
[ 507.297245] 12 00007f0008000000 4008119624000d90
[ 507.297246] 1T ESID= 7f VSID= 8119624 LLP:110
[ 507.297247] 13 0000000018000000 00092885f5150d90
[ 507.297247] 256M ESID= 1 VSID= 92885f5150 LLP:110
[ 507.297248] 14 0000010008000000 4009e7cb50000d90
[ 507.297249] 1T ESID= 1 VSID= 9e7cb50 LLP:110
[ 507.297250] 15 d000000008000000 400d43642f000510
[ 507.297251] 1T ESID= d00000 VSID= d43642f LLP:110
[ 507.297252] 16 d000000008000000 400d43642f000510
[ 507.297253] 1T ESID= d00000 VSID= d43642f LLP:110
[ 507.297253] ----------------------------------
[ 507.297254] SLB cache ptr value = 3
[ 507.297254] Valid SLB cache entries:
[ 507.297255] 00 EA[0-35]= 7f000
[ 507.297256] 01 EA[0-35]= 1
[ 507.297257] 02 EA[0-35]= 1000
[ 507.297257] Rest of SLB cache entries:
[ 507.297258] 03 EA[0-35]= 7f000
[ 507.297258] 04 EA[0-35]= 1
[ 507.297259] 05 EA[0-35]= 1000
[ 507.297260] 06 EA[0-35]= 12
[ 507.297260] 07 EA[0-35]= 7f000
Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull KVM updates from Radim Krčmář:
"ARM:
- Improved guest IPA space support (32 to 52 bits)
- RAS event delivery for 32bit
- PMU fixes
- Guest entry hardening
- Various cleanups
- Port of dirty_log_test selftest
PPC:
- Nested HV KVM support for radix guests on POWER9. The performance
is much better than with PR KVM. Migration and arbitrary level of
nesting is supported.
- Disable nested HV-KVM on early POWER9 chips that need a particular
hardware bug workaround
- One VM per core mode to prevent potential data leaks
- PCI pass-through optimization
- merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base
s390:
- Initial version of AP crypto virtualization via vfio-mdev
- Improvement for vfio-ap
- Set the host program identifier
- Optimize page table locking
x86:
- Enable nested virtualization by default
- Implement Hyper-V IPI hypercalls
- Improve #PF and #DB handling
- Allow guests to use Enlightened VMCS
- Add migration selftests for VMCS and Enlightened VMCS
- Allow coalesced PIO accesses
- Add an option to perform nested VMCS host state consistency check
through hardware
- Automatic tuning of lapic_timer_advance_ns
- Many fixes, minor improvements, and cleanups"
* tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
Revert "kvm: x86: optimize dr6 restore"
KVM: PPC: Optimize clearing TCEs for sparse tables
x86/kvm/nVMX: tweak shadow fields
selftests/kvm: add missing executables to .gitignore
KVM: arm64: Safety check PSTATE when entering guest and handle IL
KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
arm/arm64: KVM: Enable 32 bits kvm vcpu events support
arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()
KVM: arm64: Fix caching of host MDCR_EL2 value
KVM: VMX: enable nested virtualization by default
KVM/x86: Use 32bit xor to clear registers in svm.c
kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD
kvm: vmx: Defer setting of DR6 until #DB delivery
kvm: x86: Defer setting of CR2 until #PF delivery
kvm: x86: Add payload operands to kvm_multiple_exception
kvm: x86: Add exception payload fields to kvm_vcpu_events
kvm: x86: Add has_payload and payload to kvm_queued_exception
KVM: Documentation: Fix omission in struct kvm_vcpu_events
KVM: selftests: add Enlightened VMCS test
...
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Consider a normal (L1) guest running under the main hypervisor (L0),
and then a nested guest (L2) running under the L1 guest which is acting
as a nested hypervisor. L0 has page tables to map the address space for
L1 providing the translation from L1 real address -> L0 real address;
L1
|
| (L1 -> L0)
|
----> L0
There are also page tables in L1 used to map the address space for L2
providing the translation from L2 real address -> L1 read address. Since
the hardware can only walk a single level of page table, we need to
maintain in L0 a "shadow_pgtable" for L2 which provides the translation
from L2 real address -> L0 real address. Which looks like;
L2 L2
| |
| (L2 -> L1) |
| |
----> L1 | (L2 -> L0)
| |
| (L1 -> L0) |
| |
----> L0 --------> L0
When a page fault occurs while running a nested (L2) guest we need to
insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
do this we need to:
1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
provide a page fault to L1 if this mapping doesn't exist.
2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
or try to insert a pte for that mapping if it doesn't exist.
3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable
Once this mapping exists we can take rc faults when hardware is unable
to automatically set the reference and change bits in the pte. On these
we need to:
1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
the fault down to L1.
2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
host page.
3. Set the rc bits in the L2 -> L0 pte.
As we reuse a large number of functions in book3s_64_mmu_radix.c for
this we also needed to refactor a number of these functions to take
an lpid parameter so that the correct lpid is used for tlb invalidations.
The functionality however has remained the same.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo updates from Eric Biederman:
"I have been slowly sorting out siginfo and this is the culmination of
that work.
The primary result is in several ways the signal infrastructure has
been made less error prone. The code has been updated so that manually
specifying SEND_SIG_FORCED is never necessary. The conversion to the
new siginfo sending functions is now complete, which makes it
difficult to send a signal without filling in the proper siginfo
fields.
At the tail end of the patchset comes the optimization of decreasing
the size of struct siginfo in the kernel from 128 bytes to about 48
bytes on 64bit. The fundamental observation that enables this is by
definition none of the known ways to use struct siginfo uses the extra
bytes.
This comes at the cost of a small user space observable difference.
For the rare case of siginfo being injected into the kernel only what
can be copied into kernel_siginfo is delivered to the destination, the
rest of the bytes are set to 0. For cases where the signal and the
si_code are known this is safe, because we know those bytes are not
used. For cases where the signal and si_code combination is unknown
the bits that won't fit into struct kernel_siginfo are tested to
verify they are zero, and the send fails if they are not.
I made an extensive search through userspace code and I could not find
anything that would break because of the above change. If it turns out
I did break something it will take just the revert of a single change
to restore kernel_siginfo to the same size as userspace siginfo.
Testing did reveal dependencies on preferring the signo passed to
sigqueueinfo over si->signo, so bit the bullet and added the
complexity necessary to handle that case.
Testing also revealed bad things can happen if a negative signal
number is passed into the system calls. Something no sane application
will do but something a malicious program or a fuzzer might do. So I
have fixed the code that performs the bounds checks to ensure negative
signal numbers are handled"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
signal: Guard against negative signal numbers in copy_siginfo_from_user32
signal: Guard against negative signal numbers in copy_siginfo_from_user
signal: In sigqueueinfo prefer sig not si_signo
signal: Use a smaller struct siginfo in the kernel
signal: Distinguish between kernel_siginfo and siginfo
signal: Introduce copy_siginfo_from_user and use it's return value
signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
signal: Fail sigqueueinfo if si_signo != sig
signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
signal/unicore32: Use force_sig_fault where appropriate
signal/unicore32: Generate siginfo in ucs32_notify_die
signal/unicore32: Use send_sig_fault where appropriate
signal/arc: Use force_sig_fault where appropriate
signal/arc: Push siginfo generation into unhandled_exception
signal/ia64: Use force_sig_fault where appropriate
signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
signal/ia64: Use the generic force_sigsegv in setup_frame
signal/arm/kvm: Use send_sig_mceerr
signal/arm: Use send_sig_fault where appropriate
signal/arm: Use force_sig_fault where appropriate
...
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Now that _exception no longer calls _exception_pkey it is no longer
necessary to handle any signal with any si_code. All pkey exceptions
are SIGSEGV with paired with SEGV_PKUERR. So just handle
that case and remove the now unnecessary parameters from _exception_pkey.
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Now that bad_key_fault_exception no longer calls __bad_area_nosemaphore
there is no reason for __bad_area_nosemaphore to handle pkeys.
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This removes the need for other code paths to deal with pkey exceptions.
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
There are no callers of __bad_area that pass in a pkey parameter so it makes
no sense to take one.
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
| | |/ /
| |/| |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In do_sigbus isolate the mceerr signaling code and call
force_sig_mceerr instead of falling through to the force_sig_info that
works for all of the other signals.
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|\ \ \ \
| | |_|/
| |/| |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Michael writes:
"powerpc fixes for 4.19 #4
Four regression fixes.
A fix for a change to lib/xz which broke our zImage loader when
building with XZ compression. OK'ed by Herbert who merged the
original patch.
The recent fix we did to avoid patching __init text broke some 32-bit
machines, fix that.
Our show_user_instructions() could be tricked into printing kernel
memory, add a check to avoid that.
And a fix for a change to our NUMA initialisation logic, which causes
crashes in some kdump configurations.
Thanks to:
Christophe Leroy, Hari Bathini, Jann Horn, Joel Stanley, Meelis
Roos, Murilo Opsfelder Araujo, Srikar Dronamraju."
* tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/numa: Skip onlining a offline node in kdump path
powerpc: Don't print kernel instructions in show_user_instructions()
powerpc/lib: fix book3s/32 boot failure due to code patching
lib/xz: Put CRC32_POLY_LE in xz_private.h
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
With commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot"), kdump kernel on shared LPAR may crash.
The necessary conditions are
- Shared LPAR with at least 2 nodes having memory and CPUs.
- Memory requirement for kdump kernel must be met by the first N-1
nodes where there are at least N nodes with memory and CPUs.
Example numactl of such a machine.
$ numactl -H
available: 5 nodes (0,2,5-7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 255 MB
node 2 free: 189 MB
node 5 cpus: 24 25 26 27 28 29 30 31
node 5 size: 4095 MB
node 5 free: 4024 MB
node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 6 size: 6353 MB
node 6 free: 5998 MB
node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
node 7 size: 7640 MB
node 7 free: 7164 MB
node distances:
node 0 2 5 6 7
0: 10 40 40 40 40
2: 40 10 40 40 40
5: 40 40 10 40 40
6: 40 40 40 10 20
7: 40 40 40 20 10
Steps to reproduce.
1. Load / start kdump service.
2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger)
When booting a kdump kernel with 2048M:
kexec: Starting switchover sequence.
I'm in purgatory
Using 1TB segments
hash-mmu: Initializing hash mmu with SLB
Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018
Found initrd at 0xc000000009e70000:0xc00000000ae554b4
Using pSeries machine description
-----------------------------------------------------
ppc64_pft_size = 0x1e
phys_mem_size = 0x88000000
dcache_bsize = 0x80
icache_bsize = 0x80
cpu_features = 0x000000ff8f5d91a7
possible = 0x0000fbffcf5fb1a7
always = 0x0000006f8b5c91a1
cpu_user_features = 0xdc0065c2 0xef000000
mmu_features = 0x7c006001
firmware_features = 0x00000007c45bfc57
htab_hash_mask = 0x7fffff
physical_start = 0x8000000
-----------------------------------------------------
numa: NODE_DATA [mem 0x87d5e300-0x87d67fff]
numa: NODE_DATA(0) on node 6
numa: NODE_DATA [mem 0x87d54600-0x87d5e2ff]
Top of RAM: 0x88000000, Total RAM: 0x88000000
Memory hole size: 0MB
Zone ranges:
DMA [mem 0x0000000000000000-0x0000000087ffffff]
DMA32 empty
Normal empty
Movable zone start for each node
Early memory node ranges
node 6: [mem 0x0000000000000000-0x0000000087ffffff]
Could not find start_pfn for node 0
Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
On node 0 totalpages: 0
Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff]
On node 6 totalpages: 34816
Unable to handle kernel paging request for data at address 0x00000060
Faulting instruction address: 0xc000000008703a54
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1
NIP: c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000
REGS: c00000000b673440 TRAP: 0380 Not tainted (4.19.0-rc5-master+)
MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 24022022 XER: 20000002
CFAR: c0000000086fc238 IRQMASK: 0
GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000
GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220
GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008
GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000
GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008
GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78
GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400
NIP [c000000008703a54] bus_add_device+0x84/0x1e0
LR [c000000008703a38] bus_add_device+0x68/0x1e0
Call Trace:
[c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable)
[c00000000b673740] [c000000008700194] device_add+0x454/0x7c0
[c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240
[c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180
[c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90
[c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190
[c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580
[c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108
[c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c
[c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450
[c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160
[c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80
Instruction dump:
60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c
e8bf0050 e93e0098 3b9f0010 2fa50000 <e8690060> 38630018 419e0114 7f84e378
---[ end trace 593577668c2daa65 ]---
However a regular kernel with 4096M (2048 gets reserved for crash
kernel) boots properly.
Unlike regular kernels, which mark all available nodes as online,
kdump kernel only marks just enough nodes as online and marks the rest
as offline at boot. However kdump kernel boots with all available
CPUs. With Commit 2ea626306810 ("powerpc/topology: Get topology for
shared processors at boot"), all CPUs are onlined on their respective
nodes at boot time. try_online_node() tries to online the offline
nodes but fails as all needed subsystems are not yet initialized.
As part of fix, detect and skip early onlining of a offline node.
Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
Reported-by: Pavithra Prakash <pavrampu@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|\| | |
| |_|/
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Michael writes:
"powerpc fixes for 4.19 #3
A reasonably big batch of fixes due to me being away for a few weeks.
A fix for the TM emulation support on Power9, which could result in
corrupting the guest r11 when running under KVM.
Two fixes to the TM code which could lead to userspace GPR corruption
if we take an SLB miss at exactly the wrong time.
Our dynamic patching code had a bug that meant we could patch freed
__init text, which could lead to corrupting userspace memory.
csum_ipv6_magic() didn't work on little endian platforms since we
optimised it recently.
A fix for an endian bug when reading a device tree property telling
us how many storage keys the machine has available.
Fix a crash seen on some configurations of PowerVM when migrating the
partition from one machine to another.
A fix for a regression in the setup of our CPU to NUMA node mapping
in KVM guests.
A fix to our selftest Makefiles to make them work since a recent
change to the shared Makefile logic."
* tag 'powerpc-4.19-3' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Fix Makefiles for headers_install change
powerpc/numa: Use associativity if VPHN hcall is successful
powerpc/tm: Avoid possible userspace r1 corruption on reclaim
powerpc/tm: Fix userspace r13 corruption
powerpc/pseries: Fix unitialized timer reset on migration
powerpc/pkeys: Fix reading of ibm, processor-storage-keys property
powerpc: fix csum_ipv6_magic() on little endian platforms
powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again)
powerpc: Avoid code patching freed init sections
KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workarounds
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Currently associativity is used to lookup node-id even if the
preceding VPHN hcall failed. However this can cause CPU to be made
part of the wrong node, (most likely to be node 0). This is because
VPHN is not enabled on KVM guests.
With 2ea6263 ("powerpc/topology: Get topology for shared processors at
boot"), associativity is used to set to the wrong node. Hence KVM
guest topology is broken.
For example : A 4 node KVM guest before would have reported.
[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 1746 MB
node 0 free: 1604 MB
node 1 cpus: 4 5 6 7
node 1 size: 2044 MB
node 1 free: 1765 MB
node 2 cpus: 8 9 10 11
node 2 size: 2044 MB
node 2 free: 1837 MB
node 3 cpus: 12 13 14 15
node 3 size: 2044 MB
node 3 free: 1903 MB
node distances:
node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10
Would now report:
[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 1746 MB
node 0 free: 1244 MB
node 1 cpus:
node 1 size: 2044 MB
node 1 free: 2032 MB
node 2 cpus: 1
node 2 size: 2044 MB
node 2 free: 2028 MB
node 3 cpus:
node 3 size: 2044 MB
node 3 free: 2032 MB
node distances:
node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10
Fix this by skipping associativity lookup if the VPHN hcall failed.
Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
After migration of a powerpc LPAR, the kernel executes code to
update the system state to reflect new platform characteristics.
Such changes include modifications to device tree properties provided
to the system by PHYP. Property notifications received by the
post_mobility_fixup() code are passed along to the kernel in general
through a call to of_update_property() which in turn passes such
events back to all modules through entries like the '.notifier_call'
function within the NUMA module.
When the NUMA module updates its state, it resets its event timer. If
this occurs after a previous call to stop_topology_update() or on a
system without VPHN enabled, the code runs into an unitialized timer
structure and crashes. This patch adds a safety check along this path
toward the problem code.
An example crash log is as follows.
ibmvscsi 30000081: Re-enabling adapter!
------------[ cut here ]------------
kernel BUG at kernel/time/timer.c:958!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
...
NIP mod_timer+0x4c/0x400
LR reset_topology_timer+0x40/0x60
Call Trace:
0xc0000003f9407830 (unreliable)
reset_topology_timer+0x40/0x60
dt_update_callback+0x100/0x120
notifier_call_chain+0x90/0x100
__blocking_notifier_call_chain+0x60/0x90
of_property_notify+0x90/0xd0
of_update_property+0x104/0x150
update_dt_property+0xdc/0x1f0
pseries_devicetree_update+0x2d0/0x510
post_mobility_fixup+0x7c/0xf0
migration_store+0xa4/0xc0
kobj_attr_store+0x30/0x60
sysfs_kf_write+0x64/0xa0
kernfs_fop_write+0x16c/0x240
__vfs_write+0x40/0x200
vfs_write+0xc8/0x240
ksys_write+0x5c/0x100
system_call+0x58/0x6c
Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated")
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
scan_pkey_feature() uses of_property_read_u32_array() to read the
ibm,processor-storage-keys property and calls be32_to_cpu() on the
value it gets. The problem is that of_property_read_u32_array() already
returns the value converted to the CPU byte order.
The value of pkeys_total ends up more or less sane because there's a min()
call in pkey_initialize() which reduces pkeys_total to 32. So in practice
the kernel ignores the fact that the hypervisor reserved one key for
itself (the device tree advertises 31 keys in my test VM).
This is wrong, but the effect in practice is that when a process tries to
allocate the 32nd key, it gets an -EINVAL error instead of -ENOSPC which
would indicate that there aren't any keys available
Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This stops us from doing code patching in init sections after they've
been freed.
In this chain:
kvm_guest_init() ->
kvm_use_magic_page() ->
fault_in_pages_readable() ->
__get_user() ->
__get_user_nocheck() ->
barrier_nospec();
We have a code patching location at barrier_nospec() and
kvm_guest_init() is an init function. This whole chain gets inlined,
so when we free the init section (hence kvm_guest_init()), this code
goes away and hence should no longer be patched.
We seen this as userspace memory corruption when using a memory
checker while doing partition migration testing on powervm (this
starts the code patching post migration via
/sys/kernel/mobility/migration). In theory, it could also happen when
using /sys/kernel/debug/powerpc/barrier_nospec.
Cc: stable@vger.kernel.org # 4.13+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm()
which in turn reads the old TCE and if it was a valid entry, marks
the physical page dirty if it was mapped for writing. Since it is in
real mode, realmode_pfn_to_page() is used instead of pfn_to_page()
to get the page struct. However SetPageDirty() itself reads the compound
page head and returns a virtual address for the head page struct and
setting dirty bit for that kills the system.
This adds additional dirty bit tracking into the MM/IOMMU API for use
in the real mode. Note that this does not change how VFIO and
KVM (in virtual mode) set this bit. The KVM (real mode) changes include:
- use the lowest bit of the cached host phys address to carry
the dirty bit;
- mark pages dirty when they are unpinned which happens when
the preregistered memory is released which always happens in virtual
mode;
- add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit;
- change iommu_tce_xchg_rm() to take the kvm struct for the mm to use
in the new mm_iommu_ua_mark_dirty_rm() helper;
- move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only
caller anyway) to reduce the real mode KVM and IOMMU knowledge
across different subsystems.
This removes realmode_pfn_to_page() as it is not used anymore.
While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for
the real mode only and modules cannot call it anyway.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull IDA updates from Matthew Wilcox:
"A better IDA API:
id = ida_alloc(ida, GFP_xxx);
ida_free(ida, id);
rather than the cumbersome ida_simple_get(), ida_simple_remove().
The new IDA API is similar to ida_simple_get() but better named. The
internal restructuring of the IDA code removes the bitmap
preallocation nonsense.
I hope the net -200 lines of code is convincing"
* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
ida: Change ida_get_new_above to return the id
ida: Remove old API
test_ida: check_ida_destroy and check_ida_alloc
test_ida: Convert check_ida_conv to new API
test_ida: Move ida_check_max
test_ida: Move ida_check_leaf
idr-test: Convert ida_check_nomem to new API
ida: Start new test_ida module
target/iscsi: Allocate session IDs from an IDA
iscsi target: fix session creation failure handling
drm/vmwgfx: Convert to new IDA API
dmaengine: Convert to new IDA API
ppc: Convert vas ID allocation to new IDA API
media: Convert entity ID allocation to new IDA API
ppc: Convert mmu context allocation to new IDA API
Convert net_namespace to new IDA API
cb710: Convert to new IDA API
rsxx: Convert to new IDA API
osd: Convert to new IDA API
sd: Convert to new IDA API
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
ida_alloc_range is the perfect fit for this use case. Eliminates
a custom spinlock, a call to ida_pre_get and a local check for the
allocated ID exceeding a maximum.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- An implementation for the newly added hv_ops->flush() for the OPAL
hvc console driver backends, I forgot to apply this after merging the
hvc driver changes before the merge window.
- Enable all PCI bridges at boot on powernv, to avoid races when
multiple children of a bridge try to enable it simultaneously. This
is a workaround until the PCI core can be enhanced to fix the races.
- A fix to query PowerVM for the correct system topology at boot before
initialising sched domains, seen in some configurations to cause
broken scheduling etc.
- A fix for pte_access_permitted() on "nohash" platforms.
- Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9
due to a workaround when using the nest MMU (GPUs, accelerators).
- Another fix to the VFIO code used by KVM, the previous fix had some
bugs which caused guests to not start in some configurations.
- A handful of other minor fixes.
Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy,
Hari Bathini, Luke Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul
Mackerras, Srikar Dronamraju.
* tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mce: Fix SLB rebolting during MCE recovery path.
KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages
powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition
powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.
powerpc/nohash: fix pte_access_permitted()
powerpc/topology: Get topology for shared processors at boot
powerpc64/ftrace: Include ftrace.h needed for enable/disable calls
powerpc/powernv/pci: Work around races in PCI bridge enabling
powerpc/fadump: cleanup crash memory ranges support
powerpc/powernv: provide a console flush operation for opal hvc driver
powerpc/traps: Avoid rate limit messages from show unhandled signals
powerpc/64s: Fix PACA_IRQ_HARD_DIS accounting in idle_power4()
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The commit e7e81847478 ("powerpc/64s: move machine check SLB flushing
to mm/slb.c") introduced a bug in reloading bolted SLB entries. Unused
bolted entries are stored with .esid=0 in the slb_shadow area, and
that value is now used directly as the RB input to slbmte, which means
the RB[52:63] index field is set to 0, which causes SLB entry 0 to be
cleared.
Fix this by storing the index bits in the unused bolted entries, which
directs the slbmte to the right place.
The SLB shadow area is also used by the hypervisor, but PAPR is okay
with that, from LoPAPR v1.1, 14.11.1.3 SLB Shadow Buffer:
Note: SLB is filled sequentially starting at index 0
from the shadow buffer ignoring the contents of
RB field bits 52-63
Fixes: e7e81847478b ("powerpc/64s: move machine check SLB flushing to mm/slb.c")
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in
the pinned physical page", 2018-07-17) added some checks to ensure
that guest DMA mappings don't attempt to map more than the guest is
entitled to access. However, errors in the logic mean that legitimate
guest requests to map pages for DMA are being denied in some
situations. Specifically, if the first page of the range passed to
mm_iommu_get() is mapped with a normal page, and subsequent pages are
mapped with transparent huge pages, we end up with mem->pageshift ==
0. That means that the page size checks in mm_iommu_ua_to_hpa() and
mm_iommu_up_to_hpa_rm() will always fail for every page in that
region, and thus the guest can never map any memory in that region for
DMA, typically leading to a flood of error messages like this:
qemu-system-ppc64: VFIO_MAP_DMA: -22
qemu-system-ppc64: vfio_dma_map(0x10005f47780, 0x800000000000000, 0x10000, 0x7fff63ff0000) = -22 (Invalid argument)
The logic errors in mm_iommu_get() are:
(a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte()
call (meaning that find_linux_pte() returns the pte for the
first address in the range, not the address we are currently up
to);
(b) use of 'pageshift' as the variable to receive the hugepage shift
returned by find_linux_pte() - for a normal page this gets set
to 0, leading to us setting mem->pageshift to 0 when we conclude
that the pte returned by find_linux_pte() didn't match the page
we were looking at;
(c) comparing 'compshift', which is a page order, i.e. log base 2 of
the number of pages, with 'pageshift', which is a log base 2 of
the number of bytes.
To fix these problems, this patch introduces 'cur_ua' to hold the
current user address and uses that in the find_linux_pte() call;
introduces 'pteshift' to hold the hugepage shift found by
find_linux_pte(); and compares 'pteshift' with 'compshift +
PAGE_SHIFT' rather than 'compshift'.
The patch also moves the local_irq_restore to the point after the PTE
pointer returned by find_linux_pte() has been dereferenced because
otherwise the PTE could change underneath us, and adds a check to
avoid doing the find_linux_pte() call once mem->pageshift has been
reduced to PAGE_SHIFT, as an optimization.
Fixes: 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The Nest MMU workaround is only needed for RW upgrades. Avoid doing
that for other PTE updates.
We also avoid clearing the PTE while marking it invalid. This is
because other page table walkers will find this PTE none and can
result in unexpected behaviour due to that. Instead we clear
_PAGE_PRESENT and set the software PTE bit _PAGE_INVALID.
pte_present() is already updated to check for both bits. This makes
sure page table walkers will find the PTE present and things like
pte_pfn(pte) returns the right value.
Based on an original patch from Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
On a shared LPAR, Phyp will not update the CPU associativity at boot
time. Just after the boot system does recognize itself as a shared
LPAR and trigger a request for correct CPU associativity. But by then
the scheduler would have already created/destroyed its sched domains.
This causes
- Broken load balance across Nodes causing islands of cores.
- Performance degradation esp if the system is lightly loaded
- dmesg to wrongly report all CPUs to be in Node 0.
- Messages in dmesg saying borken topology.
- With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
node sched domain"), can cause rcu stalls at boot up.
The sched_domains_numa_masks table which is used to generate cpumasks
is only created at boot time just before creating sched domains and
never updated. Hence, its better to get the topology correct before
the sched domains are created.
For example on 64 core Power 8 shared LPAR, dmesg reports
Brought up 512 CPUs
Node 0 CPUs: 0-511
Node 1 CPUs:
Node 2 CPUs:
Node 3 CPUs:
Node 4 CPUs:
Node 5 CPUs:
Node 6 CPUs:
Node 7 CPUs:
Node 8 CPUs:
Node 9 CPUs:
Node 10 CPUs:
Node 11 CPUs:
...
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain
BUG: arch topology borken
the DIE domain not a subset of the NUMA domain
numactl/lscpu output will still be correct with cores spreading across
all nodes:
Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455
Currently on this LPAR, the scheduler detects 2 levels of Numa and
created numa sched domains for all CPUs, but it finds a single DIE
domain consisting of all CPUs. Hence it deletes all numa sched
domains.
To address this, detect the shared processor and update topology soon
after CPUs are setup so that correct topology is updated just before
scheduler creates sched domain.
With the fix, dmesg reports:
numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
numa: Node 11 CPUs: 160-167 256-263 352-359 448-455
and lscpu also reports:
Socket(s): 64
NUMA node(s): 12
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
NUMA node11 CPU(s): 160-167,256-263,352-359,448-455
Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[mpe: Trim / format change log]
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|\ \ \
| |/ /
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Merge updates from Andrew Morton:
- a few misc things
- a few Y2038 fixes
- ntfs fixes
- arch/sh tweaks
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
mm/hmm.c: remove unused variables align_start and align_end
fs/userfaultfd.c: remove redundant pointer uwq
mm, vmacache: hash addresses based on pmd
mm/list_lru: introduce list_lru_shrink_walk_irq()
mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
mm/sparse: delete old sparse_init and enable new one
mm/sparse: add new sparse_init_nid() and sparse_init()
mm/sparse: move buffer init/fini to the common place
mm/sparse: use the new sparse buffer functions in non-vmemmap
mm/sparse: abstract sparse buffer allocations
mm/hugetlb.c: don't zero 1GiB bootmem pages
mm, page_alloc: double zone's batchsize
mm/oom_kill.c: document oom_lock
mm/hugetlb: remove gigantic page support for HIGHMEM
mm, oom: remove sleep from under oom_lock
kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
...
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")
In this patch all the caller of handle_mm_fault() are changed to return
vm_fault_t type.
Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add statistics that show how memory is mapped within the kernel linear mapping.
This is similar to commit 37cd944c8d8f ("s390/pgtable: add mapping statistics")
We don't do this with Hash translation mode. Hash uses one size (mmu_linear_psize)
to map the kernel linear mapping and we print the linear psize during boot as
below.
"Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24"
A sample output looks like:
DirectMap4k: 0 kB
DirectMap64k: 18432 kB
DirectMap2M: 1030144 kB
DirectMap1G: 11534336 kB
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|\|
| |
| |
| |
| | |
Merge our fixes branch from the 4.18 cycle to resolve some minor
conflicts.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
A VM which has:
- a DMA capable device passed through to it (eg. network card);
- running a malicious kernel that ignores H_PUT_TCE failure;
- capability of using IOMMU pages bigger that physical pages
can create an IOMMU mapping that exposes (for example) 16MB of
the host physical memory to the device when only 64K was allocated to the VM.
The remaining 16MB - 64K will be some other content of host memory, possibly
including pages of the VM, but also pages of host kernel memory, host
programs or other VMs.
The attacking VM does not control the location of the page it can map,
and is only allowed to map as many pages as it has pages of RAM.
We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
an IOMMU page is contained in the physical page so the PCI hardware won't
get access to unassigned host memory; however this check is missing in
the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
did not hit this yet as the very first time when the mapping happens
we do not have tbl::it_userspace allocated yet and fall back to
the userspace which in turn calls VFIO IOMMU driver, this fails and
the guest does not retry,
This stores the smallest preregistered page size in the preregistered
region descriptor and changes the mm_iommu_xxx API to check this against
the IOMMU page size.
This calculates maximum page size as a minimum of the natural region
alignment and compound page size. For the page shift this uses the shift
returned by find_linux_pte() which indicates how the page is mapped to
the current userspace - if the page is huge and this is not a zero, then
it is a leaf pte and the page is mapped within the range.
Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The machine check code that flushes and restores bolted segments in
real mode belongs in mm/slb.c. This will also be used by pseries
machine check and idle code in future changes.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|