summaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel/asm-offsets.c
Commit message (Collapse)AuthorAgeFilesLines
* x86/bugs: Rename CONFIG_CALL_DEPTH_TRACKING => ↵Breno Leitao2024-01-101-1/+1
| | | | | | | | | | | | | CONFIG_MITIGATION_CALL_DEPTH_TRACKING Step 3/10 of the namespace unification of CPU mitigations related Kconfig options. Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-4-leitao@debian.org
* x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALLKai Huang2023-09-121-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro share similar code pattern too. The __tdx_hypercall() and __tdcall() should be unified to use the same assembly code. As a preparation to unify them, simplify the TDX_HYPERCALL to make it more like the TDX_MODULE_CALL. The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as function call argument, and does below extra things comparing to the TDX_MODULE_CALL: 1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally; 2) It sets RCX to the (fixed) bitmap of shared registers internally; 3) It calls __tdx_hypercall_failed() internally (and panics) when the TDCALL instruction itself fails; 4) After TDCALL, it moves R10 to RAX to return the return code of the VMCALL leaf, regardless the '\ret' asm macro argument; Firstly, change the TDX_HYPERCALL to take the same function call arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer to 'struct tdx_module_args'. Then 1) and 2) can be moved to the caller: - TDG.VP.VMCALL leaf ID can be passed via the function call argument; - 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus the bitmap of shared registers can be passed via RCX in the structure. Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL always save output registers to the structure. The caller then can: - Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error; - Return R10 in the structure as the return code of the VMCALL leaf; With above changes, change the asm function from __tdx_hypercall() to __tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper of it. This avoids having to add another wrapper of __tdx_hypercall() (_tdx_hypercall() is already taken). The __tdcall_hypercall() will be replaced with a __tdcall() variant using TDX_MODULE_CALL in a later commit as the final goal is to have one assembly to handle both TDCALL and TDVMCALL. Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep this unchanged, annotate __tdx_hypercall(), which is a C function now, as 'noinstr'. Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so. Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the compressed code. Opportunistically fix a checkpatch error complaining using space around parenthesis '(' and ')' while moving the bitmap of shared registers to <asm/shared/tdx.h>. [ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
* x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafsKai Huang2023-09-111-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The TDX guest live migration support (TDX 1.5) adds new TDCALL/SEAMCALL leaf functions. Those new TDCALLs/SEAMCALLs take additional registers for input (R10-R13) and output (R12-R13). TDG.SERVTD.RD is an example. Also, the current TDX_MODULE_CALL doesn't aim to handle TDH.VP.ENTER SEAMCALL, which monitors the TDG.VP.VMCALL in input/output registers when it returns in case of VMCALL from TDX guest. With those new TDCALLs/SEAMCALLs and the TDH.VP.ENTER covered, the TDX_MODULE_CALL macro basically needs to handle the same input/output registers as the TDX_HYPERCALL does. And as a result, they also share similar logic in the assembly, thus should be unified to use one common assembly. Extend the TDX_MODULE_CALL asm to support the new TDCALLs/SEAMCALLs and also the TDH.VP.ENTER SEAMCALL. Eventually it will be unified with the TDX_HYPERCALL. The new input/output registers fit with the "callee-saved" registers in the x86 calling convention. Add a new "saved" parameter to support those new TDCALLs/SEAMCALLs and TDH.VP.ENTER and keep the existing TDCALLs/SEAMCALLs minimally impacted. For TDH.VP.ENTER, after it returns the registers shared by the guest contain guest's values. Explicitly clear them to prevent speculative use of guest's values. Note most TDX live migration related SEAMCALLs may also clobber AVX* state ("AVX, AVX2 and AVX512 state: may be reset to the architectural INIT state" -- see TDH.EXPORT.MEM for example). And TDH.VP.ENTER also clobbers XMM0-XMM15 when the corresponding bit is set in RCX. Don't handle them in the TDX_MODULE_CALL macro but let the caller save and restore when needed. This is basically based on Peter's code. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/d4785de7c392f7c5684407f6c24a73b92148ec49.1692096753.git.kai.huang%40intel.com
* x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structureKai Huang2023-09-111-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and SEAMCALL, takes one parameter for each input register and an optional 'struct tdx_module_output' (a collection of output registers) as output. This is different from the TDX_HYPERCALL macro which uses a single 'struct tdx_hypercall_args' to carry all input/output registers. The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more input/output registers. Also, the TDH.VP.ENTER (which isn't covered by the current TDX_MODULE_CALL macro) basically can use all registers that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't extendible to cover those cases. Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro to use a single structure 'struct tdx_module_args' to carry all the input/output registers. Currently, R10/R11 are only used as output register but not as input by any TDCALL/SEAMCALL. Change to also use R10/R11 as input register to make input/output registers symmetric. Currently, the TDX_MODULE_CALL macro depends on the caller to pass a non-NULL 'struct tdx_module_output' to get additional output registers. Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to take a new 'ret' macro argument to indicate whether to save the output registers to the 'struct tdx_module_args'. Also introduce a new __tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret(). Note the tdcall(), which is a wrapper of __tdcall(), is called by three callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init(). The former two need the additional output but the last one doesn't. For simplicity, make tdcall() always call __tdcall_ret() to avoid another "_ret()" wrapper. The last caller tdx_early_init() isn't performance critical anyway. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
* x86/smpboot: Remove initial_stack on 64-bitBrian Gerst2023-03-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to facilitate parallel startup, start to eliminate some of the global variables passing information to CPUs in the startup path. However, start by introducing one more: smpboot_control. For now this merely holds the CPU# of the CPU which is coming up. Each CPU can then find its own per-cpu data, and everything else it needs can be found from there, allowing the other global variables to be removed. First to be removed is initial_stack. Each CPU can load %rsp from its current_task->thread.sp instead. That is already set up with the correct idle thread for APs. Set up the .sp field in INIT_THREAD on x86 so that the BSP also finds a suitable stack pointer in the static per-cpu data when coming up on first boot. On resume from S3, the CPU needs a temporary stack because its idle task is already active. Instead of setting initial_stack, the sleep code can simply set its own current->thread.sp to point to the temporary stack. Nobody else cares about ->thread.sp for a thread which is currently on a CPU, because the true value is actually in the %rsp register. Which is restored with the rest of the CPU context in do_suspend_lowlevel(). Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Usama Arif <usama.arif@bytedance.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Usama Arif <usama.arif@bytedance.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20230316222109.1940300-7-usama.arif@bytedance.com
* Merge tag 'x86_tdx_for_6.3' of ↵Linus Torvalds2023-02-251-0/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull Intel Trust Domain Extensions (TDX) updates from Dave Hansen: "Other than a minor fixup, the content here is to ensure that TDX guests never see virtualization exceptions (#VE's) that might be induced by the untrusted VMM. This is a highly desirable property. Without it, #VE exception handling would fall somewhere between NMIs, machine checks and total insanity. With it, #VE handling remains pretty mundane. Summary: - Fixup comment typo - Prevent unexpected #VE's from: - Hosts removing perfectly good guest mappings (SEPT_VE_DISABLE) - Excessive #VE notifications (NOTIFY_ENABLES) which are delivered via a #VE" * tag 'x86_tdx_for_6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall() x86/tdx: Disable NOTIFY_ENABLES x86/tdx: Relax SEPT_VE_DISABLE check for debug TD x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLE x86/tdx: Expand __tdx_hypercall() to handle more arguments x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments x86/tdx: Add more registers to struct tdx_hypercall_args x86/tdx: Fix typo in comment in __tdx_hypercall()
| * x86/tdx: Add more registers to struct tdx_hypercall_argsKirill A. Shutemov2023-01-271-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | struct tdx_hypercall_args is used to pass down hypercall arguments to __tdx_hypercall() assembly routine. Currently __tdx_hypercall() handles up to 6 arguments. In preparation to changes in __tdx_hypercall(), expand the structure to 6 more registers and generate asm offsets for them. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-3-kirill.shutemov%40linux.intel.com
* | crypto: x86/aria - do not use magic number offsets of aria_ctxTaehee Yoo2023-01-061-0/+8
|/ | | | | | | | | | | | | | | | aria-avx assembly code accesses members of aria_ctx with magic number offset. If the shape of struct aria_ctx is changed carelessly, aria-avx will not work. So, we need to ensure accessing members of aria_ctx with correct offset values, not with magic numbers. It adds ARIA_CTX_enc_key, ARIA_CTX_dec_key, and ARIA_CTX_rounds in the asm-offsets.c So, correct offset definitions will be generated. aria-avx assembly code can access members of aria_ctx safely with these definitions. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* Merge tag 'v6.1-rc6' into x86/core, to resolve conflictsIngo Molnar2022-11-211-6/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Resolve conflicts between these commits in arch/x86/kernel/asm-offsets.c: # upstream: debc5a1ec0d1 ("KVM: x86: use a separate asm-offsets.c file") # retbleed work in x86/core: 5d8213864ade ("x86/retbleed: Add SKL return thunk") ... and these commits in include/linux/bpf.h: # upstram: 18acb7fac22f ("bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop")") # x86/core commits: 931ab63664f0 ("x86/ibt: Implement FineIBT") bea75b33895f ("x86/Kconfig: Introduce function padding") The latter two modify BPF_DISPATCHER_ATTRIBUTES(), which was removed upstream. Conflicts: arch/x86/kernel/asm-offsets.c include/linux/bpf.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * KVM: x86: use a separate asm-offsets.c filePaolo Bonzini2022-11-091-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This already removes an ugly #include "" from asm-offsets.c, but especially it avoids a future error when trying to define asm-offsets for KVM's svm/svm.h header. This would not work for kernel/asm-offsets.c, because svm/svm.h includes kvm_cache_regs.h which is not in the include path when compiling asm-offsets.c. The problem is not there if the .c file is in arch/x86/kvm. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Fixes: a149180fbcf3 ("x86: Add magic AMD return-thunk") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | x86/retbleed: Add SKL return thunkThomas Gleixner2022-10-171-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To address the Intel SKL RSB underflow issue in software it's required to do call depth tracking. Provide a return thunk for call depth tracking on Intel SKL CPUs. The tracking does not use a counter. It uses uses arithmetic shift right on call entry and logical shift left on return. The depth tracking variable is initialized to 0x8000.... when the call depth is zero. The arithmetic shift right sign extends the MSB and saturates after the 12th call. The shift count is 5 so the tracking covers 12 nested calls. On return the variable is shifted left logically so it becomes zero again. CALL RET 0: 0x8000000000000000 0x0000000000000000 1: 0xfc00000000000000 0xf000000000000000 ... 11: 0xfffffffffffffff8 0xfffffffffffffc00 12: 0xffffffffffffffff 0xffffffffffffffe0 After a return buffer fill the depth is credited 12 calls before the next stuffing has to take place. There is a inaccuracy for situations like this: 10 calls 5 returns 3 calls 4 returns 3 calls .... The shift count might cause this to be off by one in either direction, but there is still a cushion vs. the RSB depth. The algorithm does not claim to be perfect, but it should obfuscate the problem enough to make exploitation extremly difficult. The theory behind this is: RSB is a stack with depth 16 which is filled on every call. On the return path speculation "pops" entries to speculate down the call chain. Once the speculative RSB is empty it switches to other predictors, e.g. the Branch History Buffer, which can be mistrained by user space and misguide the speculation path to a gadget. Call depth tracking is designed to break this speculation path by stuffing speculation trap calls into the RSB which are never getting a corresponding return executed. This stalls the prediction path until it gets resteered, The assumption is that stuffing at the 12th return is sufficient to break the speculation before it hits the underflow and the fallback to the other predictors. Testing confirms that it works. Johannes, one of the retbleed researchers. tried to attack this approach but failed. There is obviously no scientific proof that this will withstand future research progress, but all we can do right now is to speculate about it. The SAR/SHL usage was suggested by Andi Kleen. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111147.890071690@infradead.org
* | x86/percpu: Move current_top_of_stack next to current_taskThomas Gleixner2022-10-171-0/+2
|/ | | | | | | | | Extend the struct pcpu_hot cacheline with current_top_of_stack; another very frequently used value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111145.493038635@infradead.org
* KVM: VMX: Prevent RSB underflow before vmenterJosh Poimboeuf2022-06-271-0/+6
| | | | | | | | | | | | | | | | | | On VMX, there are some balanced returns between the time the guest's SPEC_CTRL value is written, and the vmenter. Balanced returns (matched by a preceding call) are usually ok, but it's at least theoretically possible an NMI with a deep call stack could empty the RSB before one of the returns. For maximum paranoia, don't allow *any* returns (balanced or otherwise) between the SPEC_CTRL write and the vmenter. [ bp: Fix 32-bit build. ] Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>
* x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functionsKuppuswamy Sathyanarayanan2022-04-071-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Guests communicate with VMMs with hypercalls. Historically, these are implemented using instructions that are known to cause VMEXITs like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer expose the guest state to the host. This prevents the old hypercall mechanisms from working. So, to communicate with VMM, TDX specification defines a new instruction called TDCALL. In a TDX based VM, since the VMM is an untrusted entity, an intermediary layer -- TDX module -- facilitates secure communication between the host and the guest. TDX module is loaded like a firmware into a special CPU mode called SEAM. TDX guests communicate with the TDX module using the TDCALL instruction. A guest uses TDCALL to communicate with both the TDX module and VMM. The value of the RAX register when executing the TDCALL instruction is used to determine the TDCALL type. A leaf of TDCALL used to communicate with the VMM is called TDVMCALL. Add generic interfaces to communicate with the TDX module and VMM (using the TDCALL instruction). __tdx_module_call() - Used to communicate with the TDX module (via TDCALL instruction). __tdx_hypercall() - Used by the guest to request services from the VMM (via TDVMCALL leaf of TDCALL). Also define an additional wrapper _tdx_hypercall(), which adds error handling support for the TDCALL failure. The __tdx_module_call() and __tdx_hypercall() helper functions are implemented in assembly in a .S file. The TDCALL ABI requires shuffling arguments in and out of registers, which proved to be awkward with inline assembly. Just like syscalls, not all TDVMCALL use cases need to use the same number of argument registers. The implementation here picks the current worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer than 4 arguments, there will end up being a few superfluous (cheap) instructions. But, this approach maximizes code reuse. For registers used by the TDCALL instruction, please check TDX GHCI specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL Interface". Based on previous patch by Sean Christopherson. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20220405232939.73860-4-kirill.shutemov@linux.intel.com
* x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappersKirill A. Shutemov2022-04-071-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Secure Arbitration Mode (SEAM) is an extension of VMX architecture. It defines a new VMX root operation (SEAM VMX root) and a new VMX non-root operation (SEAM VMX non-root) which are both isolated from the legacy VMX operation where the host kernel runs. A CPU-attested software module (called 'TDX module') runs in SEAM VMX root to manage and protect VMs running in SEAM VMX non-root. SEAM VMX root is also used to host another CPU-attested software module (called 'P-SEAMLDR') to load and update the TDX module. Host kernel transits to either P-SEAMLDR or TDX module via the new SEAMCALL instruction, which is essentially a VMExit from VMX root mode to SEAM VMX root mode. SEAMCALLs are leaf functions defined by P-SEAMLDR and TDX module around the new SEAMCALL instruction. A guest kernel can also communicate with TDX module via TDCALL instruction. TDCALLs and SEAMCALLs use an ABI different from the x86-64 system-v ABI. RAX is used to carry both the SEAMCALL leaf function number (input) and the completion status (output). Additional GPRs (RCX, RDX, R8-R11) may be further used as both input and output operands in individual leaf. TDCALL and SEAMCALL share the same ABI and require the largely same code to pass down arguments and retrieve results. Define an assembly macro that can be used to implement C wrapper for both TDCALL and SEAMCALL. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20220405232939.73860-3-kirill.shutemov@linux.intel.com
* um, x86: bury crypto_tfm_ctx_offsetAl Viro2021-12-211-3/+0
| | | | | | | unused since 2011 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>
* x86/paravirt: Switch iret pvops to ALTERNATIVEJuergen Gross2021-03-111-5/+0
| | | | | | | | | | The iret paravirt op is rather special as it is using a jmp instead of a call instruction. Switch it to ALTERNATIVE. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311142319.4723-12-jgross@suse.com
* x86/paravirt: Remove no longer needed 32-bit pvops cruftJuergen Gross2021-03-111-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | PVOP_VCALL4() is only used for Xen PV, while PVOP_CALL4() isn't used at all. Keep PVOP_CALL4() for 64 bits due to symmetry reasons. This allows to remove the 32-bit definitions of those macros leading to a substantial simplification of the paravirt macros, as those were the only ones needing non-empty "pre" and "post" parameters. PVOP_CALLEE2() and PVOP_VCALLEE2() are used nowhere, so remove them. Another no longer needed case is special handling of return types larger than unsigned long. Replace that with a BUILD_BUG_ON(). DISABLE_INTERRUPTS() is used in 32-bit code only, so it can just be replaced by cli. INTERRUPT_RETURN in 32-bit code can be replaced by iret. ENABLE_INTERRUPTS is used nowhere, so it can be removed. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311142319.4723-10-jgross@suse.com
* x86/head/64: Remove unused GET_CR2_INTO() macroArvind Sankar2020-11-181-1/+0
| | | | | | | | | | | | | | | Commit 4b47cdbda6f1 ("x86/head/64: Move early exception dispatch to C code") removed the usage of GET_CR2_INTO(). Drop the definition as well, and related definitions in paravirt.h and asm-offsets.h Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201005151208.2212886-3-nivedita@alum.mit.edu
* x86: remove address space overrides using set_fs()Christoph Hellwig2020-09-081-3/+0
| | | | | | | | | | | | | | Stop providing the possibility to override the address space using set_fs() now that there is no need for that any more. To properly handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on x86 a new alternative is introduced, which just like the one in entry_64.S has to use the hardcoded virtual address bits to escape the fact that TASK_SIZE_MAX isn't actually a constant when 5-level page tables are enabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* efi/x86: Avoid using code32_startArvind Sankar2020-03-081-1/+0
| | | | | | | | | | | | | | | | | | | | code32_start is meant for 16-bit real-mode bootloaders to inform the kernel where the 32-bit protected mode code starts. Nothing in the protected mode kernel except the EFI stub uses it. efi_main() currently returns boot_params, with code32_start set inside it to tell efi_stub_entry() where startup_32 is located. Since it was invoked by efi_stub_entry() in the first place, boot_params is already known. Return the address of startup_32 instead. This will allow a 64-bit kernel to live above 4Gb, for example, and it's cleaner as well. Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200301230436.2246909-5-nivedita@alum.mit.edu Link: https://lore.kernel.org/r/20200308080859.21568-13-ardb@kernel.org
* x86/paravirt: Make read_cr2() CALLEE_SAVEPeter Zijlstra2019-07-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | The one paravirt read_cr2() implementation (Xen) is actually quite trivial and doesn't need to clobber anything other than the return register. Making read_cr2() CALLEE_SAVE avoids all the PUSH/POP nonsense and allows more convenient use from assembly. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: bp@alien8.de Cc: rostedt@goodmis.org Cc: luto@kernel.org Cc: torvalds@linux-foundation.org Cc: hpa@zytor.com Cc: dave.hansen@linux.intel.com Cc: zhe.he@windriver.com Cc: joel@joelfernandes.org Cc: devel@etsukata.com Link: https://lkml.kernel.org/r/20190711114335.887392493@infradead.org
* x86/asm: Remove unused TASK_TI_flags from asm-offsets.cSteven Rostedt (VMware)2019-05-241-1/+0
| | | | | | | | | | | | | | | | | | | Since commit: 21d375b6b34ff5 ("x86/entry/64: Remove the SYSCALL64 fast path") there is no user of TASK_TI_flags in assembly. There's no need to keep it around in asm-offsets.c Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20190523102325.22eacdf7@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/kernel: Fix more -Wmissing-prototypes warningsBorislav Petkov2018-12-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | ... with the goal of eventually enabling -Wmissing-prototypes by default. At least on x86. Make functions static where possible, otherwise add prototypes or make them visible through includes. asm/trace/ changes courtesy of Steven Rostedt <rostedt@goodmis.org>. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> # ACPI + cpufreq bits Cc: Andrew Banman <andrew.banman@hpe.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mike Travis <mike.travis@hpe.com> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yi Wang <wang.yi59@zte.com.cn> Cc: linux-acpi@vger.kernel.org
* Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds2018-10-231-3/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 pti updates from Ingo Molnar: "The main changes: - Make the IBPB barrier more strict and add STIBP support (Jiri Kosina) - Micro-optimize and clean up the entry code (Andy Lutomirski) - ... plus misc other fixes" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation: Propagate information about RSB filling mitigation to sysfs x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigation x86/speculation: Apply IBPB more strictly to avoid cross-process data leak x86/speculation: Add RETPOLINE_AMD support to the inline asm CALL_NOSPEC variant x86/CPU: Fix unused variable warning when !CONFIG_IA32_EMULATION x86/pti/64: Remove the SYSCALL64 entry trampoline x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space x86/entry/64: Document idtentry
| * x86/pti/64: Remove the SYSCALL64 entry trampolineAndy Lutomirski2018-09-121-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SYSCALL64 trampoline has a couple of nice properties: - The usual sequence of SWAPGS followed by two GS-relative accesses to set up RSP is somewhat slow because the GS-relative accesses need to wait for SWAPGS to finish. The trampoline approach allows RIP-relative accesses to set up RSP, which avoids the stall. - The trampoline avoids any percpu access before CR3 is set up, which means that no percpu memory needs to be mapped in the user page tables. This prevents using Meltdown to read any percpu memory outside the cpu_entry_area and prevents using timing leaks to directly locate the percpu areas. The downsides of using a trampoline may outweigh the upsides, however. It adds an extra non-contiguous I$ cache line to system calls, and it forces an indirect jump to transfer control back to the normal kernel text after CR3 is set up. The latter is because x86 lacks a 64-bit direct jump instruction that could jump from the trampoline to the entry text. With retpolines enabled, the indirect jump is extremely slow. Change the code to map the percpu TSS into the user page tables to allow the non-trampoline SYSCALL64 path to work under PTI. This does not add a new direct information leak, since the TSS is readable by Meltdown from the cpu_entry_area alias regardless. It does allow a timing attack to locate the percpu area, but KASLR is more or less a lost cause against local attack on CPUs vulnerable to Meltdown regardless. As far as I'm concerned, on current hardware, KASLR is only useful to mitigate remote attacks that try to attack the kernel without first gaining RCE against a vulnerable user process. On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall overhead from ~237ns to ~228ns. There is a possible alternative approach: Move the trampoline within 2G of the entry text and make a separate copy for each CPU. This would allow a direct jump to rejoin the normal entry path. There are pro's and con's for this approach: + It avoids a pipeline stall - It executes from an extra page and read from another extra page during the syscall. The latter is because it needs to use a relative addressing mode to find sp1 -- it's the same *cacheline*, but accessed using an alias, so it's an extra TLB entry. - Slightly more memory. This would be one page per CPU for a simple implementation and 64-ish bytes per CPU or one page per node for a more complex implementation. - More code complexity. The current approach is chosen for simplicity and because the alternative does not provide a significant benefit, which makes it worth. [ tglx: Added the alternative discussion to the changelog ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
| * x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch spaceAndy Lutomirski2018-09-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the non-trampoline SYSCALL64 path, a percpu variable is used to temporarily store the user RSP value. Instead of a separate variable, use the otherwise unused sp2 slot in the TSS. This will improve cache locality, as the sp1 slot is already used in the same code to find the kernel stack. It will also simplify a future change to make the non-trampoline path work in PTI mode. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/08e769a0023dbad4bac6f34f3631dbaf8ad59f4f.1536015544.git.luto@kernel.org
* | x86/paravirt: Move the Xen-only pv_mmu_ops under the PARAVIRT_XXL umbrellaJuergen Gross2018-09-031-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the paravirt ops defined in pv_mmu_ops are for Xen PV guests only. Define them only if CONFIG_PARAVIRT_XXL is set. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Cc: virtualization@lists.linux-foundation.org Cc: akataria@vmware.com Cc: rusty@rustcorp.com.au Cc: boris.ostrovsky@oracle.com Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20180828074026.820-15-jgross@suse.com
* | x86/paravirt: Move the pv_irq_ops under the PARAVIRT_XXL umbrellaJuergen Gross2018-09-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All of the paravirt ops defined in pv_irq_ops are for Xen PV guests or VSMP only. Define them only if CONFIG_PARAVIRT_XXL is set. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Cc: virtualization@lists.linux-foundation.org Cc: akataria@vmware.com Cc: rusty@rustcorp.com.au Cc: boris.ostrovsky@oracle.com Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20180828074026.820-14-jgross@suse.com
* | x86/paravirt: Move the Xen-only pv_cpu_ops under the PARAVIRT_XXL umbrellaJuergen Gross2018-09-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the paravirt ops defined in pv_cpu_ops are for Xen PV guests only. Define them only if CONFIG_PARAVIRT_XXL is set. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Cc: virtualization@lists.linux-foundation.org Cc: akataria@vmware.com Cc: rusty@rustcorp.com.au Cc: boris.ostrovsky@oracle.com Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20180828074026.820-13-jgross@suse.com
* | x86/paravirt: Remove unused paravirt bitsJuergen Gross2018-09-031-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The macros ENABLE_INTERRUPTS_SYSEXIT, GET_CR0_INTO_EAX and PARAVIRT_ADJUST_EXCEPTION_FRAME are used nowhere. Remove their definitions. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Cc: virtualization@lists.linux-foundation.org Cc: akataria@vmware.com Cc: rusty@rustcorp.com.au Cc: boris.ostrovsky@oracle.com Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20180828074026.820-10-jgross@suse.com
* | x86/paravirt: Use a single ops structureJuergen Gross2018-09-031-7/+5
|/ | | | | | | | | | | | | | | | | | | | | | | Instead of using six globally visible paravirt ops structures combine them in a single structure, keeping the original structures as sub-structures. This avoids the need to assemble struct paravirt_patch_template at runtime on the stack each time apply_paravirt() is being called (i.e. when loading a module). [ tglx: Made the struct and the initializer tabular for readability sake ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Cc: virtualization@lists.linux-foundation.org Cc: akataria@vmware.com Cc: rusty@rustcorp.com.au Cc: boris.ostrovsky@oracle.com Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20180828074026.820-9-jgross@suse.com
* x86/entry/32: Enter the kernel via trampoline stackJoerg Roedel2018-07-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the entry-stack as a trampoline to enter the kernel. The entry-stack is already in the cpu_entry_area and will be mapped to userspace when PTI is enabled. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-8-git-send-email-joro@8bytes.org
* x86/asm-offsets: Move TSS_sp0 and TSS_sp1 to asm-offsets.cJoerg Roedel2018-07-201-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These offsets will be used in 32 bit assembly code as well, so make them available for all of x86 code. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Pavel Machek <pavel@ucw.cz> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Waiman Long <llong@redhat.com> Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca> Cc: joro@8bytes.org Link: https://lkml.kernel.org/r/1531906876-13451-2-git-send-email-joro@8bytes.org
* Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variablesLinus Torvalds2018-06-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The changes to automatically test for working stack protector compiler support in the Kconfig files removed the special STACKPROTECTOR_AUTO option that picked the strongest stack protector that the compiler supported. That was all a nice cleanup - it makes no sense to have the AUTO case now that the Kconfig phase can just determine the compiler support directly. HOWEVER. It also meant that doing "make oldconfig" would now _disable_ the strong stackprotector if you had AUTO enabled, because in a legacy config file, the sane stack protector configuration would look like CONFIG_HAVE_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_NONE is not set # CONFIG_CC_STACKPROTECTOR_REGULAR is not set # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_STACKPROTECTOR_AUTO=y and when you ran this through "make oldconfig" with the Kbuild changes, it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version used to be disabled (because it was really enabled by AUTO), and would disable it in the new config, resulting in: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_HAS_SANE_STACKPROTECTOR=y That's dangerously subtle - people could suddenly find themselves with the weaker stack protector setup without even realizing. The solution here is to just rename not just the old RECULAR stack protector option, but also the strong one. This does that by just removing the CC_ prefix entirely for the user choices, because it really is not about the compiler support (the compiler support now instead automatially impacts _visibility_ of the options to users). This results in "make oldconfig" actually asking the user for their choice, so that we don't have any silent subtle security model changes. The end result would generally look like this: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_STACKPROTECTOR=y CONFIG_STACKPROTECTOR_STRONG=y CONFIG_CC_HAS_SANE_STACKPROTECTOR=y where the "CC_" versions really are about internal compiler infrastructure, not the user selections. Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x86/mm: Use/Fix PCID to optimize user/kernel switchesPeter Zijlstra2017-12-231-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can use PCID to retain the TLBs across CR3 switches; including those now part of the user/kernel switch. This increases performance of kernel entry/exit at the cost of more expensive/complicated TLB flushing. Now that we have two address spaces, one for kernel and one for user space, we need two PCIDs per mm. We use the top PCID bit to indicate a user PCID (just like we use the PFN LSB for the PGD). Since we do TLB invalidation from kernel space, the existing code will only invalidate the kernel PCID, we augment that by marking the corresponding user PCID invalid, and upon switching back to userspace, use a flushing CR3 write for the switch. In order to access the user_pcid_flush_mask we use PER_CPU storage, which means the previously established SWAPGS vs CR3 ordering is now mandatory and required. Having to do this memory access does require additional registers, most sites have a functioning stack and we can spill one (RAX), sites without functional stack need to otherwise provide the second scratch register. Note: PCID is generally available on Intel Sandybridge and later CPUs. Note: Up until this point TLB flushing was broken in this series. Based-on-code-from: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stackDave Hansen2017-12-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | If the kernel oopses while on the trampoline stack, it will print "<SYSENTER>" even if SYSENTER is not involved. That is rather confusing. The "SYSENTER" stack is used for a lot more than SYSENTER now. Give it a better string to display in stack dumps, and rename the kernel code to match. Also move the 32-bit code over to the new naming even though it still uses the entry stack only for SYSENTER. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/entry/64: Make cpu_entry_area.tss read-onlyAndy Lutomirski2017-12-171-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The TSS is a fairly juicy target for exploits, and, now that the TSS is in the cpu_entry_area, it's no longer protected by kASLR. Make it read-only on x86_64. On x86_32, it can't be RO because it's written by the CPU during task switches, and we use a task gate for double faults. I'd also be nervous about errata if we tried to make it RO even on configurations without double fault handling. [ tglx: AMD confirmed that there is no problem on 64-bit with TSS RO. So it's probably safe to assume that it's a non issue, though Intel might have been creative in that area. Still waiting for confirmation. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bpetkov@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.733700132@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/entry: Clean up the SYSENTER_stack codeAndy Lutomirski2017-12-171-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing code was a mess, mainly because C arrays are nasty. Turn SYSENTER_stack into a struct, add a helper to find it, and do all the obvious cleanups this enables. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bpetkov@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.653244723@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/entry/64: Create a per-CPU SYSCALL entry trampolineAndy Lutomirski2017-12-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Handling SYSCALL is tricky: the SYSCALL handler is entered with every single register (except FLAGS), including RSP, live. It somehow needs to set RSP to point to a valid stack, which means it needs to save the user RSP somewhere and find its own stack pointer. The canonical way to do this is with SWAPGS, which lets us access percpu data using the %gs prefix. With PAGE_TABLE_ISOLATION-like pagetable switching, this is problematic. Without a scratch register, switching CR3 is impossible, so %gs-based percpu memory would need to be mapped in the user pagetables. Doing that without information leaks is difficult or impossible. Instead, use a different sneaky trick. Map a copy of the first part of the SYSCALL asm at a different address for each CPU. Now RIP varies depending on the CPU, so we can use RIP-relative memory access to access percpu memory. By putting the relevant information (one scratch slot and the stack address) at a constant offset relative to RIP, we can make SYSCALL work without relying on %gs. A nice thing about this approach is that we can easily switch it on and off if we want pagetable switching to be configurable. The compat variant of SYSCALL doesn't have this problem in the first place -- there are plenty of scratch registers, since we don't care about preserving r8-r15. This patch therefore doesn't touch SYSCALL32 at all. This patch actually seems to be a small speedup. With this patch, SYSCALL touches an extra cache line and an extra virtual page, but the pipeline no longer stalls waiting for SWAPGS. It seems that, at least in a tight loop, the latter outweights the former. Thanks to David Laight for an optimization tip. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bpetkov@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150606.403607157@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/entry: Remap the TSS into the CPU entry areaAndy Lutomirski2017-12-171-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This has a secondary purpose: it puts the entry stack into a region with a well-controlled layout. A subsequent patch will take advantage of this to streamline the SYSCALL entry code to be able to find it more easily. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bpetkov@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150605.962042855@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/entry/64: Allocate and enable the SYSENTER stackAndy Lutomirski2017-12-171-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This will simplify future changes that want scratch variables early in the SYSENTER handler -- they'll be able to spill registers to the stack. It also lets us get rid of a SWAPGS_UNSAFE_STACK user. This does not depend on CONFIG_IA32_EMULATION=y because we'll want the stack space even without IA32 emulation. As far as I can tell, the reason that this wasn't done from day 1 is that we use IST for #DB and #BP, which is IMO rather nasty and causes a lot more problems than it solves. But, since #DB uses IST, we don't actually need a real stack for SYSENTER (because SYSENTER with TF set will invoke #DB on the IST stack rather than the SYSENTER stack). I want to remove IST usage from these vectors some day, and this patch is a prerequisite for that as well. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Laight <David.Laight@aculab.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eduardo Valentin <eduval@amazon.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: aliguori@amazon.com Cc: daniel.gruss@iaik.tugraz.at Cc: hughd@google.com Cc: keescook@google.com Link: https://lkml.kernel.org/r/20171204150605.312726423@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman2017-11-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* efi: Get and store the secure boot statusDavid Howells2017-02-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Get the firmware's secure-boot status in the kernel boot wrapper and stash it somewhere that the main kernel image can find. The efi_get_secureboot() function is extracted from the ARM stub and (a) generalised so that it can be called from x86 and (b) made to use efi_call_runtime() so that it can be run in mixed-mode. For x86, it is stored in boot_params and can be overridden by the boot loader or kexec. This allows secure-boot mode to be passed on to a new kernel. Suggested-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/1486380166-31868-5-git-send-email-ard.biesheuvel@linaro.org [ Small readability edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86: Move thread_info into task_structAndy Lutomirski2016-09-151-3/+1
| | | | | | | | | | | | | | | | | | | | | Now that most of the thread_info users have been cleaned up, this is straightforward. Most of this code was written by Linus. Originally-from: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jann Horn <jann@thejh.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a50eab40abeaec9cb9a9e3cbdeafd32190206654.1473801993.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/asm: Move the thread_info::status field to thread_structAndy Lutomirski2016-09-151-1/+0
| | | | | | | | | | | | | | | | | | | | Because sched.h and thread_info.h are a tangled mess, I turned in_compat_syscall() into a macro. If we had current_thread_struct() or similar and we could use it from thread_info.h, then this would be a bit cleaner. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jann Horn <jann@thejh.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/ccc8a1b2f41f9c264a41f771bb4a6539a642ad72.1473801993.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/x86: Rewrite the switch_to() codeBrian Gerst2016-08-241-0/+6
| | | | | | | | | | | | | | | | | | | | | Move the low-level context switch code to an out-of-line asm stub instead of using complex inline asm. This allows constructing a new stack frame for the child process to make it seamlessly flow to ret_from_fork without an extra test and branch in __switch_to(). It also improves code generation for __schedule() by using the C calling convention instead of clobbering all registers. Signed-off-by: Brian Gerst <brgerst@gmail.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1471106302-10159-5-git-send-email-brgerst@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/uaccess: Move thread_info::addr_limit to thread_structAndy Lutomirski2016-07-151-1/+3
| | | | | | | | | | | | | | | | | | | struct thread_info is a legacy mess. To prepare for its partial removal, move thread_info::addr_limit out. As an added benefit, this way is simpler. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/15bee834d09402b47ac86f2feccdf6529f9bc5b0.1468527351.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/boot: Move compressed kernel to the end of the decompression bufferYinghai Lu2016-04-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change makes later calculations about where the kernel is located easier to reason about. To better understand this change, we must first clarify what 'VO' and 'ZO' are. These values were introduced in commits by hpa: 77d1a4999502 ("x86, boot: make symbols from the main vmlinux available") 37ba7ab5e33c ("x86, boot: make kernel_alignment adjustable; new bzImage fields") Specifically: All names prefixed with 'VO_': - relate to the uncompressed kernel image - the size of the VO image is: VO__end-VO__text ("VO_INIT_SIZE" define) All names prefixed with 'ZO_': - relate to the bootable compressed kernel image (boot/compressed/vmlinux), which is composed of the following memory areas: - head text - compressed kernel (VO image and relocs table) - decompressor code - the size of the ZO image is: ZO__end - ZO_startup_32 ("ZO_INIT_SIZE" define, though see below) The 'INIT_SIZE' value is used to find the larger of the two image sizes: #define ZO_INIT_SIZE (ZO__end - ZO_startup_32 + ZO_z_extract_offset) #define VO_INIT_SIZE (VO__end - VO__text) #if ZO_INIT_SIZE > VO_INIT_SIZE # define INIT_SIZE ZO_INIT_SIZE #else # define INIT_SIZE VO_INIT_SIZE #endif The current code uses extract_offset to decide where to position the copied ZO (i.e. ZO starts at extract_offset). (This is why ZO_INIT_SIZE currently includes the extract_offset.) Why does z_extract_offset exist? It's needed because we are trying to minimize the amount of RAM used for the whole act of creating an uncompressed, executable, properly relocation-linked kernel image in system memory. We do this so that kernels can be booted on even very small systems. To achieve the goal of minimal memory consumption we have implemented an in-place decompression strategy: instead of cleanly separating the VO and ZO images and also allocating some memory for the decompression code's runtime needs, we instead create this elaborate layout of memory buffers where the output (decompressed) stream, as it progresses, overlaps with and destroys the input (compressed) stream. This can only be done safely if the ZO image is placed to the end of the VO range, plus a certain amount of safety distance to make sure that when the last bytes of the VO range are decompressed, the compressed stream pointer is safely beyond the end of the VO range. z_extract_offset is calculated in arch/x86/boot/compressed/mkpiggy.c during the build process, at a point when we know the exact compressed and uncompressed size of the kernel images and can calculate this safe minimum offset value. (Note that the mkpiggy.c calculation is not perfect, because we don't know the decompressor used at that stage, so the z_extract_offset calculation is necessarily imprecise and is mostly based on gzip internals - we'll improve that in the next patch.) When INIT_SIZE is bigger than VO_INIT_SIZE (uncommon but possible), the copied ZO occupies the memory from extract_offset to the end of decompression buffer. It overlaps with the soon-to-be-uncompressed kernel like this: |-----compressed kernel image------| V V 0 extract_offset +INIT_SIZE |-----------|---------------|-------------------------|--------| | | | | VO__text startup_32 of ZO VO__end ZO__end ^ ^ |-------uncompressed kernel image---------| When INIT_SIZE is equal to VO_INIT_SIZE (likely) there's still space left from end of ZO to the end of decompressing buffer, like below. |-compressed kernel image-| V V 0 extract_offset +INIT_SIZE |-----------|---------------|-------------------------|--------| | | | | VO__text startup_32 of ZO ZO__end VO__end ^ ^ |------------uncompressed kernel image-------------| To simplify calculations and avoid special cases, it is cleaner to always place the compressed kernel image in memory so that ZO__end is at the end of the decompression buffer, instead of placing t at the start of extract_offset as is currently done. This patch adds BP_init_size (which is the INIT_SIZE as passed in from the boot_params) into asm-offsets.c to make it visible to the assembly code. Then when moving the ZO, it calculates the starting position of the copied ZO (via BP_init_size and the ZO run size) so that the VO__end will be at the end of the decompression buffer. To make the position calculation safe, the end of ZO is page aligned (and a comment is added to the existing VO alignment for good measure). Signed-off-by: Yinghai Lu <yinghai@kernel.org> [ Rewrote changelog and comments. ] Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: lasse.collin@tukaani.org Link: http://lkml.kernel.org/r/1461888548-32439-3-git-send-email-keescook@chromium.org [ Rewrote the changelog some more. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
* x86/asm-offsets: Remove PARAVIRT_enabledAndy Lutomirski2016-03-081-1/+0
| | | | | | | | | | | | | | | | | | | | | | | It no longer has any users. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boris.ostrovsky@oracle.com Cc: david.vrabel@citrix.com Cc: konrad.wilk@oracle.com Cc: lguest@lists.ozlabs.org Cc: xen-devel@lists.xensource.com Signed-off-by: Ingo Molnar <mingo@kernel.org>