linux.git - Linux kernel mainline tree

	Commit message (Collapse)	Author	Age	Files	Lines
*	capabilities: Allow privileged user in s_user_ns to set security.* xattrs	Eric W. Biederman	2018-05-24	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A privileged user in s_user_ns will generally have the ability to manipulate the backing store and insert security.* xattrs into the filesystem directly. Therefore the kernel must be prepared to handle these xattrs from unprivileged mounts, and it makes little sense for commoncap to prevent writing these xattrs to the filesystem. The capability and LSM code have already been updated to appropriately handle xattrs from unprivileged mounts, so it is safe to loosen this restriction on setting xattrs. The exception to this logic is that writing xattrs to a mounted filesystem may also cause the LSM inode_post_setxattr or inode_setsecurity callbacks to be invoked. SELinux will deny the xattr update by virtue of applying mountpoint labeling to unprivileged userns mounts, and Smack will deny the writes for any user without global CAP_MAC_ADMIN, so loosening the capability check in commoncap is safe in this respect as well. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
*	commoncap: Handle memory allocation failure.	Tetsuo Handa	2018-04-10	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	syzbot is reporting NULL pointer dereference at xattr_getsecurity() [1], for cap_inode_getsecurity() is returning sizeof(struct vfs_cap_data) when memory allocation failed. Return -ENOMEM if memory allocation failed. [1] https://syzkaller.appspot.com/bug?id=a55ba438506fe68649a5f50d2d82d56b365e0107 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 8db6c34f1dbc8e06 ("Introduce v3 namespaced file capabilities") Reported-by: syzbot <syzbot+9369930ca44f29e60e2d@syzkaller.appspotmail.com> Cc: stable <stable@vger.kernel.org> # 4.14+ Acked-by: Serge E. Hallyn <serge@hallyn.com> Acked-by: James Morris <james.morris@microsoft.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
*	capabilities: fix buffer overread on very short xattr	Eric Biggers	2018-01-02	1	-12/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If userspace attempted to set a "security.capability" xattr shorter than 4 bytes (e.g. 'setfattr -n security.capability -v x file'), then cap_convert_nscap() read past the end of the buffer containing the xattr value because it accessed the ->magic_etc field without verifying that the xattr value is long enough to contain that field. Fix it by validating the xattr value size first. This bug was found using syzkaller with KASAN. The KASAN report was as follows (cleaned up slightly): BUG: KASAN: slab-out-of-bounds in cap_convert_nscap+0x514/0x630 security/commoncap.c:498 Read of size 4 at addr ffff88002d8741c0 by task syz-executor1/2852 CPU: 0 PID: 2852 Comm: syz-executor1 Not tainted 4.15.0-rc6-00200-gcc0aac99d977 #253 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0xe3/0x195 lib/dump_stack.c:53 print_address_description+0x73/0x260 mm/kasan/report.c:252 kasan_report_error mm/kasan/report.c:351 [inline] kasan_report+0x235/0x350 mm/kasan/report.c:409 cap_convert_nscap+0x514/0x630 security/commoncap.c:498 setxattr+0x2bd/0x350 fs/xattr.c:446 path_setxattr+0x168/0x1b0 fs/xattr.c:472 SYSC_setxattr fs/xattr.c:487 [inline] SyS_setxattr+0x36/0x50 fs/xattr.c:483 entry_SYSCALL_64_fastpath+0x18/0x85 Fixes: 8db6c34f1dbc ("Introduce v3 namespaced file capabilities") Cc: <stable@vger.kernel.org> # v4.14+ Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
*	Merge branch 'next-general' of ↵	Linus Torvalds	2017-11-13	1	-65/+128
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull general security subsystem updates from James Morris: "TPM (from Jarkko): - essential clean up for tpm_crb so that ARM64 and x86 versions do not distract each other as much as before - /dev/tpm0 rejects now too short writes (shorter buffer than specified in the command header - use DMA-safe buffer in tpm_tis_spi - otherwise mostly minor fixes. Smack: - base support for overlafs Capabilities: - BPRM_FCAPS fixes, from Richard Guy Briggs: The audit subsystem is adding a BPRM_FCAPS record when auditing setuid application execution (SYSCALL execve). This is not expected as it was supposed to be limited to when the file system actually had capabilities in an extended attribute. It lists all capabilities making the event really ugly to parse what is happening. The PATH record correctly records the setuid bit and owner. Suppress the BPRM_FCAPS record on setid. TOMOYO: - Y2038 timestamping fixes" 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (28 commits) MAINTAINERS: update the IMA, EVM, trusted-keys, encrypted-keys entries Smack: Base support for overlayfs MAINTAINERS: remove David Safford as maintainer for encrypted+trusted keys tomoyo: fix timestamping for y2038 capabilities: audit log other surprising conditions capabilities: fix logic for effective root or real root capabilities: invert logic for clarity capabilities: remove a layer of conditional logic capabilities: move audit log decision to function capabilities: use intuitive names for id changes capabilities: use root_priveleged inline to clarify logic capabilities: rename has_cap to has_fcap capabilities: intuitive names for cap gain status capabilities: factor out cap_bprm_set_creds privileged root tpm, tpm_tis: use ARRAY_SIZE() to define TPM_HID_USR_IDX tpm: fix duplicate inline declaration specifier tpm: fix type of a local variables in tpm_tis_spi.c tpm: fix type of a local variable in tpm2_map_command() tpm: fix type of a local variable in tpm2_get_cc_attrs_tbl() tpm-dev-common: Reject too short writes ...
\| *	capabilities: audit log other surprising conditions	Richard Guy Briggs	2017-10-20	1	-7/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The existing condition tested for process effective capabilities set by file attributes but intended to ignore the change if the result was unsurprisingly an effective full set in the case root is special with a setuid root executable file and we are root. Stated again: - When you execute a setuid root application, it is no surprise and expected that it got all capabilities, so we do not want capabilities recorded. if (pE_grew && !(pE_fullset && (eff_root \|\| real_root) && root_priveleged) ) Now make sure we cover other cases: - If something prevented a setuid root app getting all capabilities and it wound up with one capability only, then it is a surprise and should be logged. When it is a setuid root file, we only want capabilities when the process does not get full capabilities.. root_priveleged && setuid_root && !pE_fullset - Similarly if a non-setuid program does pick up capabilities due to file system based capabilities, then we want to know what capabilities were picked up. When it has file system based capabilities we want the capabilities. !is_setuid && (has_fcap && pP_gained) - If it is a non-setuid file and it gets ambient capabilities, we want the capabilities. !is_setuid && pA_gained - These last two are combined into one due to the common first parameter. Related: https://github.com/linux-audit/audit-kernel/issues/16 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
\| *	capabilities: fix logic for effective root or real root	Richard Guy Briggs	2017-10-20	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that the logic is inverted, it is much easier to see that both real root and effective root conditions had to be met to avoid printing the BPRM_FCAPS record with audit syscalls. This meant that any setuid root applications would print a full BPRM_FCAPS record when it wasn't necessary, cluttering the event output, since the SYSCALL and PATH records indicated the presence of the setuid bit and effective root user id. Require only one of effective root or real root to avoid printing the unnecessary record. Ref: commit 3fc689e96c0c ("Add audit_log_bprm_fcaps/AUDIT_BPRM_FCAPS") See: https://github.com/linux-audit/audit-kernel/issues/16 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
\| *	capabilities: invert logic for clarity	Richard Guy Briggs	2017-10-20	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The way the logic was presented, it was awkward to read and verify. Invert the logic using DeMorgan's Law to be more easily able to read and understand. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Okay-ished-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
\| *	capabilities: remove a layer of conditional logic	Richard Guy Briggs	2017-10-20	1	-13/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove a layer of conditional logic to make the use of conditions easier to read and analyse. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Okay-ished-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
\| *	capabilities: move audit log decision to function	Richard Guy Briggs	2017-10-20	1	-20/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the audit log decision logic to its own function to isolate the complexity in one place. Suggested-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Okay-ished-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
\| *	capabilities: use intuitive names for id changes	Richard Guy Briggs	2017-10-20	1	-6/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce a number of inlines to make the use of the negation of uid_eq() easier to read and analyse. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Okay-ished-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
\| *	capabilities: use root_priveleged inline to clarify logic	Richard Guy Briggs	2017-10-20	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce inline root_privileged() to make use of SECURE_NONROOT easier to read. Suggested-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Okay-ished-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
\| *	capabilities: rename has_cap to has_fcap	Richard Guy Briggs	2017-10-20	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rename has_cap to has_fcap to clarify it applies to file capabilities since the entire source file is about capabilities. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Okay-ished-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
\| *	capabilities: intuitive names for cap gain status	Richard Guy Briggs	2017-10-20	1	-7/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce macros cap_gained, cap_grew, cap_full to make the use of the negation of is_subset() easier to read and analyse. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Okay-ished-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
\| *	capabilities: factor out cap_bprm_set_creds privileged root	Richard Guy Briggs	2017-10-20	1	-28/+48
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Factor out the case of privileged root from the function cap_bprm_set_creds() to make the latter easier to read and analyse. Suggested-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Kees Cook <keescook@chromium.org> Okay-ished-by: Paul Moore <paul@paul-moore.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
* \|	commoncap: move assignment of fs_ns to avoid null pointer dereference	Colin Ian King	2017-10-19	1	-1/+2
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The pointer fs_ns is assigned from inode->i_ib->s_user_ns before a null pointer check on inode, hence if inode is actually null we will get a null pointer dereference on this assignment. Fix this by only dereferencing inode after the null pointer check on inode. Detected by CoverityScan CID#1455328 ("Dereference before null check") Fixes: 8db6c34f1dbc ("Introduce v3 namespaced file capabilities") Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: stable@vger.kernel.org Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
*	Merge branch 'next-general' of ↵	Linus Torvalds	2017-09-24	1	-3/+3
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull misc security layer update from James Morris: "This is the remaining 'general' change in the security tree for v4.14, following the direct merging of SELinux (+ TOMOYO), AppArmor, and seccomp. That's everything now for the security tree except IMA, which will follow shortly (I've been traveling for the past week with patchy internet)" * 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: security: fix description of values returned by cap_inode_need_killpriv
\| *	security: fix description of values returned by cap_inode_need_killpriv	Stefan Berger	2017-09-23	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cap_inode_need_killpriv returns 1 if security.capability exists and has a value and inode_killpriv() is required, 0 otherwise. Fix the description of the return value to reflect this. Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
* \|	Merge branch 'for-linus' of ↵	Linus Torvalds	2017-09-11	1	-21/+256
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "Life has been busy and I have not gotten half as much done this round as I would have liked. I delayed it so that a minor conflict resolution with the mips tree could spend a little time in linux-next before I sent this pull request. This includes two long delayed user namespace changes from Kirill Tkhai. It also includes a very useful change from Serge Hallyn that allows the security capability attribute to be used inside of user namespaces. The practical effect of this is people can now untar tarballs and install rpms in user namespaces. It had been suggested to generalize this and encode some of the namespace information information in the xattr name. Upon close inspection that makes the things that should be hard easy and the things that should be easy more expensive. Then there is my bugfix/cleanup for signal injection that removes the magic encoding of the siginfo union member from the kernel internal si_code. The mips folks reported the case where I had used FPE_FIXME me is impossible so I have remove FPE_FIXME from mips, while at the same time including a return statement in that case to keep gcc from complaining about unitialized variables. I almost finished the work to get make copy_siginfo_to_user a trivial copy to user. The code is available at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3 But I did not have time/energy to get the code posted and reviewed before the merge window opened. I was able to see that the security excuse for just copying fields that we know are initialized doesn't work in practice there are buggy initializations that don't initialize the proper fields in siginfo. So we still sometimes copy unitialized data to userspace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: Introduce v3 namespaced file capabilities mips/signal: In force_fcr31_sig return in the impossible case signal: Remove kernel interal si_code magic fcntl: Don't use ambiguous SIG_POLL si_codes prctl: Allow local CAP_SYS_ADMIN changing exe_file security: Use user_namespace::level to avoid redundant iterations in cap_capable() userns,pidns: Verify the userns for new pid namespaces signal/testing: Don't look for __SI_FAULT in userspace signal/mips: Document a conflict with SI_USER with SIGFPE signal/sparc: Document a conflict with SI_USER with SIGFPE signal/ia64: Document a conflict with SI_USER with SIGFPE signal/alpha: Document a conflict with SI_USER for SIGTRAP
\| * \|	Introduce v3 namespaced file capabilities	Serge E. Hallyn	2017-09-01	1	-19/+251
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Root in a non-initial user ns cannot be trusted to write a traditional security.capability xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the xattr, and execute the file with privilege on the host. However supporting file capabilities in a user namespace is very desirable. Not doing so means that any programs designed to run with limited privilege must continue to support other methods of gaining and dropping privilege. For instance a program installer must detect whether file capabilities can be assigned, and assign them if so but set setuid-root otherwise. The program in turn must know how to drop partial capabilities, and do so only if setuid-root. This patch introduces v3 of the security.capability xattr. It builds a vfs_ns_cap_data struct by appending a uid_t rootid to struct vfs_cap_data. This is the absolute uid_t (that is, the uid_t in user namespace which mounted the filesystem, usually init_user_ns) of the root id in whose namespaces the file capabilities may take effect. When a task asks to write a v2 security.capability xattr, if it is privileged with respect to the userns which mounted the filesystem, then nothing should change. Otherwise, the kernel will transparently rewrite the xattr as a v3 with the appropriate rootid. This is done during the execution of setxattr() to catch user-space-initiated capability writes. Subsequently, any task executing the file which has the noted kuid as its root uid, or which is in a descendent user_ns of such a user_ns, will run the file with capabilities. Similarly when asking to read file capabilities, a v3 capability will be presented as v2 if it applies to the caller's namespace. If a task writes a v3 security.capability, then it can provide a uid for the xattr so long as the uid is valid in its own user namespace, and it is privileged with CAP_SETFCAP over its namespace. The kernel will translate that rootid to an absolute uid, and write that to disk. After this, a task in the writer's namespace will not be able to use those capabilities (unless rootid was 0), but a task in a namespace where the given uid is root will. Only a single security.capability xattr may exist at a time for a given file. A task may overwrite an existing xattr so long as it is privileged over the inode. Note this is a departure from previous semantics, which required privilege to remove a security.capability xattr. This check can be re-added if deemed useful. This allows a simple setxattr to work, allows tar/untar to work, and allows us to tar in one namespace and untar in another while preserving the capability, without risking leaking privilege into a parent namespace. Example using tar: $ cp /bin/sleep sleepx $ mkdir b1 b2 $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1 $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2 $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx b2/sleepx = cap_sys_admin+ep # /opt/ltp/testcases/bin/getv3xattr b2/sleepx v3 xattr, rootid is 100001 A patch to linux-test-project adding a new set of tests for this functionality is in the nsfscaps branch at github.com/hallyn/ltp Changelog: Nov 02 2016: fix invalid check at refuse_fcap_overwrite() Nov 07 2016: convert rootid from and to fs user_ns (From ebiederm: mar 28 2017) commoncap.c: fix typos - s/v4/v3 get_vfs_caps_from_disk: clarify the fs_ns root access check nsfscaps: change the code split for cap_inode_setxattr() Apr 09 2017: don't return v3 cap for caps owned by current root. return a v2 cap for a true v2 cap in non-init ns Apr 18 2017: . Change the flow of fscap writing to support s_user_ns writing. . Remove refuse_fcap_overwrite(). The value of the previous xattr doesn't matter. Apr 24 2017: . incorporate Eric's incremental diff . move cap_convert_nscap to setxattr and simplify its usage May 8, 2017: . fix leaking dentry refcount in cap_inode_getsecurity Signed-off-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
\| * \|	security: Use user_namespace::level to avoid redundant iterations in ↵	Kirill Tkhai	2017-07-20	1	-2/+5
\| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cap_capable() When ns->level is not larger then cred->user_ns->level, then ns can't be cred->user_ns's descendant, and there is no a sense to search in parents. So, break the cycle earlier and skip needless iterations. v2: Change comment on suggested by Andy Lutomirski. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* \|	commoncap: Move cap_elevated calculation into bprm_set_creds	Kees Cook	2017-08-01	1	-42/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of a separate function, open-code the cap_elevated test, which lets us entirely remove bprm->cap_effective (to use the local "effective" variable instead), and more accurately examine euid/egid changes via the existing local "is_setid". The following LTP tests were run to validate the changes: # ./runltp -f syscalls -s cap # ./runltp -f securebits # ./runltp -f cap_bounds # ./runltp -f filecaps All kernel selftests for capabilities and exec continue to pass as well. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: James Morris <james.l.morris@oracle.com> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Andy Lutomirski <luto@kernel.org>
* \|	commoncap: Refactor to remove bprm_secureexec hook	Kees Cook	2017-08-01	1	-4/+8
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The commoncap implementation of the bprm_secureexec hook is the only LSM that depends on the final call to its bprm_set_creds hook (since it may be called for multiple files, it ignores bprm->called_set_creds). As a result, it cannot safely _clear_ bprm->secureexec since other LSMs may have set it. Instead, remove the bprm_secureexec hook by introducing a new flag to bprm specific to commoncap: cap_elevated. This is similar to cap_effective, but that is used for a specific subset of elevated privileges, and exists solely to track state from bprm_set_creds to bprm_secureexec. As such, it will be removed in the next patch. Here, set the new bprm->cap_elevated flag when setuid/setgid has happened from bprm_fill_uid() or fscapabilities have been prepared. This temporarily moves the bprm_secureexec hook to a static inline. The helper will be removed in the next patch; this makes the step easier to review and bisect, since this does not introduce any changes to inputs nor outputs to the "elevated privileges" calculation. The new flag is merged with the bprm->secureexec flag in setup_new_exec() since this marks the end of any further prepare_binprm() calls. Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andy Lutomirski <luto@kernel.org> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Serge Hallyn <serge@hallyn.com>
*	security: mark LSM hooks as __ro_after_init	James Morris	2017-03-06	1	-1/+1
\| \| \| \| \| \| \| \| \|	Mark all of the registration hooks as __ro_after_init (via the __lsm_ro_after_init macro). Signed-off-by: James Morris <james.l.morris@oracle.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Kees Cook <keescook@chromium.org>
*	Merge branch 'for-linus' of ↵	Linus Torvalds	2017-02-23	1	-2/+3
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There is a lot here. A lot of these changes result in subtle user visible differences in kernel behavior. I don't expect anything will care but I will revert/fix things immediately if any regressions show up. From Seth Forshee there is a continuation of the work to make the vfs ready for unpriviled mounts. We had thought the previous changes prevented the creation of files outside of s_user_ns of a filesystem, but it turns we missed the O_CREAT path. Ooops. Pavel Tikhomirov and Oleg Nesterov worked together to fix a long standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only children that are forked after the prctl are considered and not children forked before the prctl. The only known user of this prctl systemd forks all children after the prctl. So no userspace regressions will occur. Holding earlier forked children to the same rules as later forked children creates a semantic that is sane enough to allow checkpoing of processes that use this feature. There is a long delayed change by Nikolay Borisov to limit inotify instances inside a user namespace. Michael Kerrisk extends the API for files used to maniuplate namespaces with two new trivial ioctls to allow discovery of the hierachy and properties of namespaces. Konstantin Khlebnikov with the help of Al Viro adds code that when a network namespace exits purges it's sysctl entries from the dcache. As in some circumstances this could use a lot of memory. Vivek Goyal fixed a bug with stacked filesystems where the permissions on the wrong inode were being checked. I continue previous work on ptracing across exec. Allowing a file to be setuid across exec while being ptraced if the tracer has enough credentials in the user namespace, and if the process has CAP_SETUID in it's own namespace. Proc files for setuid or otherwise undumpable executables are now owned by the root in the user namespace of their mm. Allowing debugging of setuid applications in containers to work better. A bug I introduced with permission checking and automount is now fixed. The big change is to mark the mounts that the kernel initiates as a result of an automount. This allows the permission checks in sget to be safely suppressed for this kind of mount. As the permission check happened when the original filesystem was mounted. Finally a special case in the mount namespace is removed preventing unbounded chains in the mount hash table, and making the semantics simpler which benefits CRIU. The vfs fix along with related work in ima and evm I believe makes us ready to finish developing and merge fully unprivileged mounts of the fuse filesystem. The cleanups of the mount namespace makes discussing how to fix the worst case complexity of umount. The stacked filesystem fixes pave the way for adding multiple mappings for the filesystem uids so that efficient and safer containers can be implemented" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc/sysctl: Don't grab i_lock under sysctl_lock. vfs: Use upper filesystem inode in bprm_fill_uid() proc/sysctl: prune stale dentries during unregistering mnt: Tuck mounts under others instead of creating shadow/side mounts. prctl: propagate has_child_subreaper flag to every descendant introduce the walk_process_tree() helper nsfs: Add an ioctl() to return owner UID of a userns fs: Better permission checking for submounts exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction vfs: open() with O_CREAT should not create inodes with unknown ids nsfs: Add an ioctl() to return the namespace type proc: Better ownership of files for non-dumpable tasks in user namespaces exec: Remove LSM_UNSAFE_PTRACE_CAP exec: Test the ptracer's saved cred to see if the tracee can gain caps exec: Don't reset euid and egid when the tracee has CAP_SETUID inotify: Convert to using per-namespace limits
\| *	exec: Remove LSM_UNSAFE_PTRACE_CAP	Eric W. Biederman	2017-01-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With previous changes every location that tests for LSM_UNSAFE_PTRACE_CAP also tests for LSM_UNSAFE_PTRACE making the LSM_UNSAFE_PTRACE_CAP redundant, so remove it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
\| *	exec: Test the ptracer's saved cred to see if the tracee can gain caps	Eric W. Biederman	2017-01-24	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have user namespaces and non-global capabilities verify the tracer has capabilities in the relevant user namespace instead of in the current_user_ns(). As the test for setting LSM_UNSAFE_PTRACE_CAP is currently ptracer_capable(p, current_user_ns()) and the new task credentials are in current_user_ns() this change does not have any user visible change and simply moves the test to where it is used, making the code easier to read. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
\| *	exec: Don't reset euid and egid when the tracee has CAP_SETUID	Eric W. Biederman	2017-01-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Don't reset euid and egid when the tracee has CAP_SETUID in it's user namespace. I punted on relaxing this permission check long ago but now that I have read this code closely it is clear it is safe to test against CAP_SETUID in the user namespace. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* \|	LSM: Add /sys/kernel/security/lsm	Casey Schaufler	2017-01-19	1	-1/+2
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I am still tired of having to find indirect ways to determine what security modules are active on a system. I have added /sys/kernel/security/lsm, which contains a comma separated list of the active security modules. No more groping around in /proc/filesystems or other clever hacks. Unchanged from previous versions except for being updated to the latest security next branch. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <james.l.morris@oracle.com>
*	xattr: Add __vfs_{get,set,remove}xattr helpers	Andreas Gruenbacher	2016-10-07	1	-15/+10
\| \| \| \| \| \| \| \| \| \| \|	Right now, various places in the kernel check for the existence of getxattr, setxattr, and removexattr inode operations and directly call those operations. Switch to helper functions and test for the IOP_XATTR flag instead. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	fs: Treat foreign mounts as nosuid	Andy Lutomirski	2016-06-24	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a process gets access to a mount from a different user namespace, that process should not be able to take advantage of setuid files or selinux entrypoints from that filesystem. Prevent this by treating mounts from other mount namespaces and those not owned by current_user_ns() or an ancestor as nosuid. This will make it safer to allow more complex filesystems to be mounted in non-root user namespaces. This does not remove the need for MNT_LOCK_NOSUID. The setuid, setgid, and file capability bits can no longer be abused if code in a user namespace were to clear nosuid on an untrusted filesystem, but this patch, by itself, is insufficient to protect the system from abuse of files that, when execed, would increase MAC privilege. As a more concrete explanation, any task that can manipulate a vfsmount associated with a given user namespace already has capabilities in that namespace and all of its descendents. If they can cause a malicious setuid, setgid, or file-caps executable to appear in that mount, then that executable will only allow them to elevate privileges in exactly the set of namespaces in which they are already privileges. On the other hand, if they can cause a malicious executable to appear with a dangerous MAC label, running it could change the caller's security context in a way that should not have been possible, even inside the namespace in which the task is confined. As a hardening measure, this would have made CVE-2014-5207 much more difficult to exploit. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
*	fs: Limit file caps to the user namespace of the super block	Seth Forshee	2016-06-24	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Capability sets attached to files must be ignored except in the user namespaces where the mounter is privileged, i.e. s_user_ns and its descendants. Otherwise a vector exists for gaining privileges in namespaces where a user is not already privileged. Add a new helper function, current_in_user_ns(), to test whether a user namespace is the same as or a descendant of another namespace. Use this helper to determine whether a file's capability set should be applied to the caps constructed during exec. --EWB Replaced in_userns with the simpler current_in_userns. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
*	Merge branch 'for-linus' of ↵	Linus Torvalds	2016-05-17	1	-3/+3
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull parallel filesystem directory handling update from Al Viro. This is the main parallel directory work by Al that makes the vfs layer able to do lookup and readdir in parallel within a single directory. That's a big change, since this used to be all protected by the directory inode mutex. The inode mutex is replaced by an rwsem, and serialization of lookups of a single name is done by a "in-progress" dentry marker. The series begins with xattr cleanups, and then ends with switching filesystems over to actually doing the readdir in parallel (switching to the "iterate_shared()" that only takes the read lock). A more detailed explanation of the process from Al Viro: "The xattr work starts with some acl fixes, then switches ->getxattr to passing inode and dentry separately. This is the point where the things start to get tricky - that got merged into the very beginning of the -rc3-based #work.lookups, to allow untangling the security_d_instantiate() mess. The xattr work itself proceeds to switch a lot of filesystems to generic_...xattr(); no complications there. After that initial xattr work, the series then does the following: - untangle security_d_instantiate() - convert a bunch of open-coded lookup_one_len_unlocked() to calls of that thing; one such place (in overlayfs) actually yields a trivial conflict with overlayfs fixes later in the cycle - overlayfs ended up switching to a variant of lookup_one_len_unlocked() sans the permission checks. I would've dropped that commit (it gets overridden on merge from #ovl-fixes in #for-next; proper resolution is to use the variant in mainline fs/overlayfs/super.c), but I didn't want to rebase the damn thing - it was fairly late in the cycle... - some filesystems had managed to depend on lookup/lookup exclusion for fs-internal data structures in a way that would break if we relaxed the VFS exclusion. Fixing hadn't been hard, fortunately. - core of that series - parallel lookup machinery, replacing ->i_mutex with rwsem, making lookup_slow() take it only shared. At that point lookups happen in parallel; lookups on the same name wait for the in-progress one to be done with that dentry. Surprisingly little code, at that - almost all of it is in fs/dcache.c, with fs/namei.c changes limited to lookup_slow() - making it use the new primitive and actually switching to locking shared. - parallel readdir stuff - first of all, we provide the exclusion on per-struct file basis, same as we do for read() vs lseek() for regular files. That takes care of most of the needed exclusion in readdir/readdir; however, these guys are trickier than lookups, so I went for switching them one-by-one. To do that, a new method '->iterate_shared()' is added and filesystems are switched to it as they are either confirmed to be OK with shared lock on directory or fixed to be OK with that. I hope to kill the original method come next cycle (almost all in-tree filesystems are switched already), but it's still not quite finished. - several filesystems get switched to parallel readdir. The interesting part here is dealing with dcache preseeding by readdir; that needs minor adjustment to be safe with directory locked only shared. Most of the filesystems doing that got switched to in those commits. Important exception: NFS. Turns out that NFS folks, with their, er, insistence on VFS getting the fuck out of the way of the Smart Filesystem Code That Knows How And What To Lock(tm) have grown the locking of their own. They had their own homegrown rwsem, with lookup/readdir/atomic_open being writers (sillyunlink is the reader there). Of course, with VFS getting the fuck out of the way, as requested, the actual smarts of the smart filesystem code etc. had become exposed... - do_last/lookup_open/atomic_open cleanups. As the result, open() without O_CREAT locks the directory only shared. Including the ->atomic_open() case. Backmerge from #for-linus in the middle of that - atomic_open() fix got brought in. - then comes NFS switch to saner (VFS-based ;-) locking, killing the homegrown "lookup and readdir are writers" kinda-sorta rwsem. All exclusion for sillyunlink/lookup is done by the parallel lookups mechanism. Exclusion between sillyunlink and rmdir is a real rwsem now - rmdir being the writer. Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel now. - the rest of the series consists of switching a lot of filesystems to parallel readdir; in a lot of cases ->llseek() gets simplified as well. One backmerge in there (again, #for-linus - rockridge fix)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits) ext4: switch to ->iterate_shared() hfs: switch to ->iterate_shared() hfsplus: switch to ->iterate_shared() hostfs: switch to ->iterate_shared() hpfs: switch to ->iterate_shared() hpfs: handle allocation failures in hpfs_add_pos() gfs2: switch to ->iterate_shared() f2fs: switch to ->iterate_shared() afs: switch to ->iterate_shared() befs: switch to ->iterate_shared() befs: constify stuff a bit isofs: switch to ->iterate_shared() get_acorn_filename(): deobfuscate a bit btrfs: switch to ->iterate_shared() logfs: no need to lock directory in lseek switch ecryptfs to ->iterate_shared 9p: switch to ->iterate_shared() fat: switch to ->iterate_shared() romfs, squashfs: switch to ->iterate_shared() more trivial ->iterate_shared conversions ...
\| *	->getxattr(): pass dentry and inode as separate arguments	Al Viro	2016-04-11	1	-3/+3
\| \| \| \| \| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \|	security: Introduce security_settime64()	Baolin Wang	2016-04-22	1	-1/+1
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	security_settime() uses a timespec, which is not year 2038 safe on 32bit systems. Thus this patch introduces the security_settime64() function with timespec64 type. We also convert the cap_settime() helper function to use the 64bit types. This patch then moves security_settime() to the header file as an inline helper function so that existing users can be iteratively converted. None of the existing hooks is using the timespec argument and therefor the patch is not making any functional changes. Cc: Serge Hallyn <serge.hallyn@canonical.com>, Cc: James Morris <james.l.morris@oracle.com>, Cc: "Serge E. Hallyn" <serge@hallyn.com>, Cc: Paul Moore <pmoore@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Kees Cook <keescook@chromium.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Reviewed-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Baolin Wang <baolin.wang@linaro.org> [jstultz: Reworded commit message] Signed-off-by: John Stultz <john.stultz@linaro.org>
*	ptrace: use fsuid, fsgid, effective creds for fs access checks	Jann Horn	2016-01-20	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks: /proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR /proc/$pid/maps - also useful for bypassing ASLR /proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario: lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar drwx------ root root /root drwxr-xr-x root root /root/foobar -rw-r--r-- root root /root/foobar/secret Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [akpm@linux-foundation.org: fix warning] Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	capabilities: add a securebit to disable PR_CAP_AMBIENT_RAISE	Andy Lutomirski	2015-09-04	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Per Andrew Morgan's request, add a securebit to allow admins to disable PR_CAP_AMBIENT_RAISE. This securebit will prevent processes from adding capabilities to their ambient set. For simplicity, this disables PR_CAP_AMBIENT_RAISE entirely rather than just disabling setting previously cleared bits. Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Kees Cook <keescook@chromium.org> Cc: Christoph Lameter <cl@linux.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Aaron Jones <aaronmdjones@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Andrew G. Morgan <morgan@kernel.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Austin S Hemmelgarn <ahferroin7@gmail.com> Cc: Markku Savela <msa@moth.iki.fi> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	capabilities: ambient capabilities	Andy Lutomirski	2015-09-04	1	-10/+92
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Credit where credit is due: this idea comes from Christoph Lameter with a lot of valuable input from Serge Hallyn. This patch is heavily based on Christoph's patch. ===== The status quo ===== On Linux, there are a number of capabilities defined by the kernel. To perform various privileged tasks, processes can wield capabilities that they hold. Each task has four capability masks: effective (pE), permitted (pP), inheritable (pI), and a bounding set (X). When the kernel checks for a capability, it checks pE. The other capability masks serve to modify what capabilities can be in pE. Any task can remove capabilities from pE, pP, or pI at any time. If a task has a capability in pP, it can add that capability to pE and/or pI. If a task has CAP_SETPCAP, then it can add any capability to pI, and it can remove capabilities from X. Tasks are not the only things that can have capabilities; files can also have capabilities. A file can have no capabilty information at all [1]. If a file has capability information, then it has a permitted mask (fP) and an inheritable mask (fI) as well as a single effective bit (fE) [2]. File capabilities modify the capabilities of tasks that execve(2) them. A task that successfully calls execve has its capabilities modified for the file ultimately being excecuted (i.e. the binary itself if that binary is ELF or for the interpreter if the binary is a script.) [3] In the capability evolution rules, for each mask Z, pZ represents the old value and pZ' represents the new value. The rules are: pP' = (X & fP) \| (pI & fI) pI' = pI pE' = (fE ? pP' : 0) X is unchanged For setuid binaries, fP, fI, and fE are modified by a moderately complicated set of rules that emulate POSIX behavior. Similarly, if euid == 0 or ruid == 0, then fP, fI, and fE are modified differently (primary, fP and fI usually end up being the full set). For nonroot users executing binaries with neither setuid nor file caps, fI and fP are empty and fE is false. As an extra complication, if you execute a process as nonroot and fE is set, then the "secure exec" rules are in effect: AT_SECURE gets set, LD_PRELOAD doesn't work, etc. This is rather messy. We've learned that making any changes is dangerous, though: if a new kernel version allows an unprivileged program to change its security state in a way that persists cross execution of a setuid program or a program with file caps, this persistent state is surprisingly likely to allow setuid or file-capped programs to be exploited for privilege escalation. ===== The problem ===== Capability inheritance is basically useless. If you aren't root and you execute an ordinary binary, fI is zero, so your capabilities have no effect whatsoever on pP'. This means that you can't usefully execute a helper process or a shell command with elevated capabilities if you aren't root. On current kernels, you can sort of work around this by setting fI to the full set for most or all non-setuid executable files. This causes pP' = pI for nonroot, and inheritance works. No one does this because it's a PITA and it isn't even supported on most filesystems. If you try this, you'll discover that every nonroot program ends up with secure exec rules, breaking many things. This is a problem that has bitten many people who have tried to use capabilities for anything useful. ===== The proposed change ===== This patch adds a fifth capability mask called the ambient mask (pA). pA does what most people expect pI to do. pA obeys the invariant that no bit can ever be set in pA if it is not set in both pP and pI. Dropping a bit from pP or pI drops that bit from pA. This ensures that existing programs that try to drop capabilities still do so, with a complication. Because capability inheritance is so broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and then calling execve effectively drops capabilities. Therefore, setresuid from root to nonroot conditionally clears pA unless SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can re-add bits to pA afterwards. The capability evolution rules are changed: pA' = (file caps or setuid or setgid ? 0 : pA) pP' = (X & fP) \| (pI & fI) \| pA' pI' = pI pE' = (fE ? pP' : pA') X is unchanged If you are nonroot but you have a capability, you can add it to pA. If you do so, your children get that capability in pA, pP, and pE. For example, you can set pA = CAP_NET_BIND_SERVICE, and your children can automatically bind low-numbered ports. Hallelujah! Unprivileged users can create user namespaces, map themselves to a nonzero uid, and create both privileged (relative to their namespace) and unprivileged process trees. This is currently more or less impossible. Hallelujah! You cannot use pA to try to subvert a setuid, setgid, or file-capped program: if you execute any such program, pA gets cleared and the resulting evolution rules are unchanged by this patch. Users with nonzero pA are unlikely to unintentionally leak that capability. If they run programs that try to drop privileges, dropping privileges will still work. It's worth noting that the degree of paranoia in this patch could possibly be reduced without causing serious problems. Specifically, if we allowed pA to persist across executing non-pA-aware setuid binaries and across setresuid, then, naively, the only capabilities that could leak as a result would be the capabilities in pA, and any attacker already has those capabilities. This would make me nervous, though -- setuid binaries that tried to privilege-separate might fail to do so, and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have unexpected side effects. (Whether these unexpected side effects would be exploitable is an open question.) I've therefore taken the more paranoid route. We can revisit this later. An alternative would be to require PR_SET_NO_NEW_PRIVS before setting ambient capabilities. I think that this would be annoying and would make granting otherwise unprivileged users minor ambient capabilities (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than it is with this patch. ===== Footnotes ===== [1] Files that are missing the "security.capability" xattr or that have unrecognized values for that xattr end up with has_cap set to false. The code that does that appears to be complicated for no good reason. [2] The libcap capability mask parsers and formatters are dangerously misleading and the documentation is flat-out wrong. fE is not a mask; it's a single bit. This has probably confused every single person who has tried to use file capabilities. [3] Linux very confusingly processes both the script and the interpreter if applicable, for reasons that elude me. The results from thinking about a script's file capabilities and/or setuid bits are mostly discarded. Preliminary userspace code is here, but it needs updating: https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2 Here is a test program that can be used to verify the functionality (from Christoph): /* * Test program for the ambient capabilities. This program spawns a shell * that allows running processes with a defined set of capabilities. * * (C) 2015 Christoph Lameter <cl@linux.com> * Released under: GPL v3 or later. * * * Compile using: * * gcc -o ambient_test ambient_test.o -lcap-ng * * This program must have the following capabilities to run properly: * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE * * A command to equip the binary with the right caps is: * * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test * * * To get a shell with additional caps that can be inherited by other processes: * * ./ambient_test /bin/bash * * * Verifying that it works: * * From the bash spawed by ambient_test run * * cat /proc/$$/status * * and have a look at the capabilities. / #include <stdlib.h> #include <stdio.h> #include <errno.h> #include <cap-ng.h> #include <sys/prctl.h> #include <linux/capability.h> / * Definitions from the kernel header files. These are going to be removed * when the /usr/include files have these defined. / #define PR_CAP_AMBIENT 47 #define PR_CAP_AMBIENT_IS_SET 1 #define PR_CAP_AMBIENT_RAISE 2 #define PR_CAP_AMBIENT_LOWER 3 #define PR_CAP_AMBIENT_CLEAR_ALL 4 static void set_ambient_cap(int cap) { int rc; capng_get_caps_process(); rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap); if (rc) { printf("Cannot add inheritable cap\n"); exit(2); } capng_apply(CAPNG_SELECT_CAPS); / Note the two 0s at the end. Kernel checks for these / if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) { perror("Cannot set cap"); exit(1); } } int main(int argc, char *argv) { int rc; set_ambient_cap(CAP_NET_RAW); set_ambient_cap(CAP_NET_ADMIN); set_ambient_cap(CAP_SYS_NICE); printf("Ambient_test forking shell\n"); if (execv(argv[1], argv + 1)) perror("Cannot exec"); return 0; } Signed-off-by: Christoph Lameter <cl@linux.com> # Original author Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Aaron Jones <aaronmdjones@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Andrew G. Morgan <morgan@kernel.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Austin S Hemmelgarn <ahferroin7@gmail.com> Cc: Markku Savela <msa@moth.iki.fi> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	LSM: Switch to lists of hooks	Casey Schaufler	2015-05-12	1	-8/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of using a vector of security operations with explicit, special case stacking of the capability and yama hooks use lists of hooks with capability and yama hooks included as appropriate. The security_operations structure is no longer required. Instead, there is a union of the function pointers that allows all the hooks lists to use a common mechanism for list management while retaining typing. Each module supplies an array describing the hooks it provides instead of a sparsely populated security_operations structure. The description includes the element that gets put on the hook list, avoiding the issues surrounding individual element allocation. The method for registering security modules is changed to reflect the information available. The method for removing a module, currently only used by SELinux, has also changed. It should be generic now, however if there are potential race conditions based on ordering of hook removal that needs to be addressed by the calling module. The security hooks are called from the lists and the first failure is returned. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: James Morris <james.l.morris@oracle.com>
*	VFS: security/: d_backing_inode() annotations	David Howells	2015-04-15	1	-3/+3
\| \| \| \| \| \| \| \|	most of the ->d_inode uses there refer to the same inode IO would go to, i.e. d_backing_inode() Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	file->f_path.dentry is pinned down for as long as the file is open...	Al Viro	2015-01-25	1	-5/+1
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	kill f_dentry uses	Al Viro	2014-11-19	1	-1/+1
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	CAPABILITIES: remove undefined caps from all processes	Eric Paris	2014-07-24	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744 plus fixing it a different way... We found, when trying to run an application from an application which had dropped privs that the kernel does security checks on undefined capability bits. This was ESPECIALLY difficult to debug as those undefined bits are hidden from /proc/$PID/status. Consider a root application which drops all capabilities from ALL 4 capability sets. We assume, since the application is going to set eff/perm/inh from an array that it will clear not only the defined caps less than CAP_LAST_CAP, but also the higher 28ish bits which are undefined future capabilities. The BSET gets cleared differently. Instead it is cleared one bit at a time. The problem here is that in security/commoncap.c::cap_task_prctl() we actually check the validity of a capability being read. So any task which attempts to 'read all things set in bset' followed by 'unset all things set in bset' will not even attempt to unset the undefined bits higher than CAP_LAST_CAP. So the 'parent' will look something like: CapInh: 0000000000000000 CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: ffffffc000000000 All of this 'should' be fine. Given that these are undefined bits that aren't supposed to have anything to do with permissions. But they do... So lets now consider a task which cleared the eff/perm/inh completely and cleared all of the valid caps in the bset (but not the invalid caps it couldn't read out of the kernel). We know that this is exactly what the libcap-ng library does and what the go capabilities library does. They both leave you in that above situation if you try to clear all of you capapabilities from all 4 sets. If that root task calls execve() the child task will pick up all caps not blocked by the bset. The bset however does not block bits higher than CAP_LAST_CAP. So now the child task has bits in eff which are not in the parent. These are 'meaningless' undefined bits, but still bits which the parent doesn't have. The problem is now in cred_cap_issubset() (or any operation which does a subset test) as the child, while a subset for valid cap bits, is not a subset for invalid cap bits! So now we set durring commit creds that the child is not dumpable. Given it is 'more priv' than its parent. It also means the parent cannot ptrace the child and other stupidity. The solution here: 1) stop hiding capability bits in status This makes debugging easier! 2) stop giving any task undefined capability bits. it's simple, it you don't put those invalid bits in CAP_FULL_SET you won't get them in init and you won't get them in any other task either. This fixes the cap_issubset() tests and resulting fallout (which made the init task in a docker container untraceable among other things) 3) mask out undefined bits when sys_capset() is called as it might use ~0, ~0 to denote 'all capabilities' for backward/forward compatibility. This lets 'capsh --caps="all=eip" -- -c /bin/bash' run. 4) mask out undefined bit when we read a file capability off of disk as again likely all bits are set in the xattr for forward/backward compatibility. This lets 'setcap all+pe /bin/bash; /bin/bash' run Signed-off-by: Eric Paris <eparis@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Andrew G. Morgan <morgan@kernel.org> Cc: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Kees Cook <keescook@chromium.org> Cc: Steve Grubb <sgrubb@redhat.com> Cc: Dan Walsh <dwalsh@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: James Morris <james.l.morris@oracle.com>
*	commoncap: don't alloc the credential unless needed in cap_task_prctl	Tetsuo Handa	2014-07-24	1	-42/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In function cap_task_prctl(), we would allocate a credential unconditionally and then check if we support the requested function. If not we would release this credential with abort_creds() by using RCU method. But on some archs such as powerpc, the sys_prctl is heavily used to get/set the floating point exception mode. So the unnecessary allocating/releasing of credential not only introduce runtime overhead but also do cause OOM due to the RCU implementation. This patch removes abort_creds() from cap_task_prctl() by calling prepare_creds() only when we need to modify it. Reported-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Paul Moore <paul@paul-moore.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <james.l.morris@oracle.com>
*	capabilities: allow nice if we are privileged	Serge Hallyn	2013-08-30	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	We allow task A to change B's nice level if it has a supserset of B's privileges, or of it has CAP_SYS_NICE. Also allow it if A has CAP_SYS_NICE with respect to B - meaning it is root in the same namespace, or it created B's namespace. Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
*	userns: Allow PR_CAPBSET_DROP in a user namespace.	Eric W. Biederman	2013-08-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	As the capabilites and capability bounding set are per user namespace properties it is safe to allow changing them with just CAP_SETPCAP permission in the user namespace. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Tested-by: Richard Weinberger <richard@nod.at> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
*	kill f_vfsmnt	Al Viro	2013-02-26	1	-1/+1
\| \| \| \| \| \|	very few users left... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Fix cap_capable to only allow owners in the parent user namespace to have caps.	Eric W. Biederman	2012-12-14	1	-8/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Andy Lutomirski pointed out that the current behavior of allowing the owner of a user namespace to have all caps when that owner is not in a parent user namespace is wrong. Add a test to ensure the owner of a user namespace is in the parent of the user namespace to fix this bug. Thankfully this bug did not apply to the initial user namespace, keeping the mischief that can be caused by this bug quite small. This is bug was introduced in v3.5 by commit 783291e6900 "Simplify the user_namespace by making userns->creator a kuid." But did not matter until the permisions required to create a user namespace were relaxed allowing a user namespace to be created inside of a user namespace. The bug made it possible for the owner of a user namespace to be present in a child user namespace. Since the owner of a user nameapce is granted all capabilities it became possible for users in a grandchild user namespace to have all privilges over their parent user namspace. Reorder the checks in cap_capable. This should make the common case faster and make it clear that nothing magic happens in the initial user namespace. The reordering is safe because cred->user_ns can only be in targ_ns or targ_ns->parent but not both. Add a comment a the top of the loop to make the logic of the code clear. Add a distinct variable ns that changes as we walk up the user namespace hierarchy to make it clear which variable is changing. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
*	split ->file_mmap() into ->mmap_addr()/->mmap_file()	Al Viro	2012-05-31	1	-18/+3
\| \| \| \| \| \|	... i.e. file-dependent and address-dependent checks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	split cap_mmap_addr() out of cap_file_mmap()	Al Viro	2012-05-31	1	-9/+23
\| \| \| \| \| \|	... switch callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Merge branch 'for-linus' of ↵	Linus Torvalds	2012-05-23	1	-25/+36
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...