summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of ↵Linus Torvalds2016-07-2919-155/+242
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull userns vfs updates from Eric Biederman: "This tree contains some very long awaited work on generalizing the user namespace support for mounting filesystems to include filesystems with a backing store. The real world target is fuse but the goal is to update the vfs to allow any filesystem to be supported. This patchset is based on a lot of code review and testing to approach that goal. While looking at what is needed to support the fuse filesystem it became clear that there were things like xattrs for security modules that needed special treatment. That the resolution of those concerns would not be fuse specific. That sorting out these general issues made most sense at the generic level, where the right people could be drawn into the conversation, and the issues could be solved for everyone. At a high level what this patchset does a couple of simple things: - Add a user namespace owner (s_user_ns) to struct super_block. - Teach the vfs to handle filesystem uids and gids not mapping into to kuids and kgids and being reported as INVALID_UID and INVALID_GID in vfs data structures. By assigning a user namespace owner filesystems that are mounted with only user namespace privilege can be detected. This allows security modules and the like to know which mounts may not be trusted. This also allows the set of uids and gids that are communicated to the filesystem to be capped at the set of kuids and kgids that are in the owning user namespace of the filesystem. One of the crazier corner casees this handles is the case of inodes whose i_uid or i_gid are not mapped into the vfs. Most of the code simply doesn't care but it is easy to confuse the inode writeback path so no operation that could cause an inode write-back is permitted for such inodes (aka only reads are allowed). This set of changes starts out by cleaning up the code paths involved in user namespace permirted mounts. Then when things are clean enough adds code that cleanly sets s_user_ns. Then additional restrictions are added that are possible now that the filesystem superblock contains owner information. These changes should not affect anyone in practice, but there are some parts of these restrictions that are changes in behavior. - Andy's restriction on suid executables that does not honor the suid bit when the path is from another mount namespace (think /proc/[pid]/fd/) or when the filesystem was mounted by a less privileged user. - The replacement of the user namespace implicit setting of MNT_NODEV with implicitly setting SB_I_NODEV on the filesystem superblock instead. Using SB_I_NODEV is a stronger form that happens to make this state user invisible. The user visibility can be managed but it caused problems when it was introduced from applications reasonably expecting mount flags to be what they were set to. There is a little bit of work remaining before it is safe to support mounting filesystems with backing store in user namespaces, beyond what is in this set of changes. - Verifying the mounter has permission to read/write the block device during mount. - Teaching the integrity modules IMA and EVM to handle filesystems mounted with only user namespace root and to reduce trust in their security xattrs accordingly. - Capturing the mounters credentials and using that for permission checks in d_automount and the like. (Given that overlayfs already does this, and we need the work in d_automount it make sense to generalize this case). Furthermore there are a few changes that are on the wishlist: - Get all filesystems supporting posix acls using the generic posix acls so that posix_acl_fix_xattr_from_user and posix_acl_fix_xattr_to_user may be removed. [Maintainability] - Reducing the permission checks in places such as remount to allow the superblock owner to perform them. - Allowing the superblock owner to chown files with unmapped uids and gids to something that is mapped so the files may be treated normally. I am not considering even obvious relaxations of permission checks until it is clear there are no more corner cases that need to be locked down and handled generically. Many thanks to Seth Forshee who kept this code alive, and putting up with me rewriting substantial portions of what he did to handle more corner cases, and for his diligent testing and reviewing of my changes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits) fs: Call d_automount with the filesystems creds fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns evm: Translate user/group ids relative to s_user_ns when computing HMAC dquot: For now explicitly don't support filesystems outside of init_user_ns quota: Handle quota data stored in s_user_ns in quota_setxquota quota: Ensure qids map to the filesystem vfs: Don't create inodes with a uid or gid unknown to the vfs vfs: Don't modify inodes with a uid or gid unknown to the vfs cred: Reject inodes with invalid ids in set_create_file_as() fs: Check for invalid i_uid in may_follow_link() vfs: Verify acls are valid within superblock's s_user_ns. userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS fs: Refuse uid/gid changes which don't map into s_user_ns selinux: Add support for unprivileged mounts from user namespaces Smack: Handle labels consistently in untrusted mounts Smack: Add support for unprivileged mounts from user namespaces fs: Treat foreign mounts as nosuid fs: Limit file caps to the user namespace of the super block userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag userns: Remove implicit MNT_NODEV fragility. ...
| * fs: Call d_automount with the filesystems credsEric W. Biederman2016-07-231-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Seth Forshee reported a mount regression in nfs autmounts with "fs: Add user namespace member to struct super_block". It turns out that the assumption that current->cred is something reasonable during mount while necessary to improve support of unprivileged mounts is wrong in the automount path. To fix the existing filesystems override current->cred with the init_cred before calling d_automount and restore current->cred after d_automount completes. To support unprivileged mounts would require a more nuanced cred selection, so fail on unprivileged mounts for the time being. As none of the filesystems that currently set FS_USERNS_MOUNT implement d_automount this check is only good for preventing future problems. Fixes: 6e4eab577a0c ("fs: Add user namespace member to struct super_block") Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * dquot: For now explicitly don't support filesystems outside of init_user_nsEric W. Biederman2016-07-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | Mostly supporting filesystems outside of init_user_ns is s/&init_usre_ns/dquot->dq_sb->s_user_ns/. An actual need for supporting quotas on filesystems outside of s_user_ns is quite a ways away and to be done responsibily needs an audit on what can happen with hostile quota files. Until that audit is complete don't attempt to support quota files on filesystems outside of s_user_ns. Cc: Jan Kara <jack@suse.cz> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * quota: Handle quota data stored in s_user_ns in quota_setxquotaEric W. Biederman2016-07-051-1/+1
| | | | | | | | | | | | | | | | | | | | In Q_XSETQLIMIT use sb->s_user_ns to detect when we are dealing with the filesystems notion of id 0. Cc: Jan Kara <jack@suse.cz> Acked-by: Seth Forshee <seth.forshee@canonical.com> Inspired-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * quota: Ensure qids map to the filesystemEric W. Biederman2016-07-052-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce the helper qid_has_mapping and use it to ensure that the quota system only considers qids that map to the filesystems s_user_ns. In practice for quota supporting filesystems today this is the exact same check as qid_valid. As only 0xffffffff aka (qid_t)-1 does not map into init_user_ns. Replace the qid_valid calls with qid_has_mapping as values come in from userspace. This is harmless today and it prepares the quota system to work on filesystems with quotas but mounted by unprivileged users. Call qid_has_mapping from dqget. This ensures the passed in qid has a prepresentation on the underlying filesystem. Previously this was unnecessary as filesystesm never had qids that could not map. With the introduction of filesystems outside of s_user_ns this will not remain true. All of this ensures the quota code never has to deal with qids that don't map to the underlying filesystem. Cc: Jan Kara <jack@suse.cz> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * vfs: Don't create inodes with a uid or gid unknown to the vfsEric W. Biederman2016-07-051-2/+8
| | | | | | | | | | | | | | | | | | It is expected that filesystems can not represent uids and gids from outside of their user namespace. Keep things simple by not even trying to create filesystem nodes with non-sense uids and gids. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * vfs: Don't modify inodes with a uid or gid unknown to the vfsEric W. Biederman2016-07-054-5/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a filesystem outside of init_user_ns is mounted it could have uids and gids stored in it that do not map to init_user_ns. The plan is to allow those filesystems to set i_uid to INVALID_UID and i_gid to INVALID_GID for unmapped uids and gids and then to handle that strange case in the vfs to ensure there is consistent robust handling of the weirdness. Upon a careful review of the vfs and filesystems about the only case where there is any possibility of confusion or trouble is when the inode is written back to disk. In that case filesystems typically read the inode->i_uid and inode->i_gid and write them to disk even when just an inode timestamp is being updated. Which leads to a rule that is very simple to implement and understand inodes whose i_uid or i_gid is not valid may not be written. In dealing with access times this means treat those inodes as if the inode flag S_NOATIME was set. Reads of the inodes appear safe and useful, but any write or modification is disallowed. The only inode write that is allowed is a chown that sets the uid and gid on the inode to valid values. After such a chown the inode is normal and may be treated as such. Denying all writes to inodes with uids or gids unknown to the vfs also prevents several oddball cases where corruption would have occurred because the vfs does not have complete information. One problem case that is prevented is attempting to use the gid of a directory for new inodes where the directories sgid bit is set but the directories gid is not mapped. Another problem case avoided is attempting to update the evm hash after setxattr, removexattr, and setattr. As the evm hash includeds the inode->i_uid or inode->i_gid not knowning the uid or gid prevents a correct evm hash from being computed. evm hash verification also fails when i_uid or i_gid is unknown but that is essentially harmless as it does not cause filesystem corruption. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * fs: Check for invalid i_uid in may_follow_link()Seth Forshee2016-06-301-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Filesystem uids which don't map into a user namespace may result in inode->i_uid being INVALID_UID. A symlink and its parent could have different owners in the filesystem can both get mapped to INVALID_UID, which may result in following a symlink when this would not have otherwise been permitted when protected symlinks are enabled. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * vfs: Verify acls are valid within superblock's s_user_ns.Eric W. Biederman2016-06-302-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | Update posix_acl_valid to verify that an acl is within a user namespace. Update the callers of posix_acl_valid to pass in an appropriate user namespace. For posix_acl_xattr_set and v9fs_xattr_set_acl pass in inode->i_sb->s_user_ns to posix_acl_valid. For md_unpack_acl pass in &init_user_ns as no inode or superblock is in sight. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * fs: Refuse uid/gid changes which don't map into s_user_nsSeth Forshee2016-06-271-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add checks to notify_change to verify that uid and gid changes will map into the superblock's user namespace. If they do not fail with -EOVERFLOW. This is mandatory so that fileystems don't have to even think of dealing with ia_uid and ia_gid that --EWB Moved the test from inode_change_ok to notify_change Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * fs: Treat foreign mounts as nosuidAndy Lutomirski2016-06-242-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a process gets access to a mount from a different user namespace, that process should not be able to take advantage of setuid files or selinux entrypoints from that filesystem. Prevent this by treating mounts from other mount namespaces and those not owned by current_user_ns() or an ancestor as nosuid. This will make it safer to allow more complex filesystems to be mounted in non-root user namespaces. This does not remove the need for MNT_LOCK_NOSUID. The setuid, setgid, and file capability bits can no longer be abused if code in a user namespace were to clear nosuid on an untrusted filesystem, but this patch, by itself, is insufficient to protect the system from abuse of files that, when execed, would increase MAC privilege. As a more concrete explanation, any task that can manipulate a vfsmount associated with a given user namespace already has capabilities in that namespace and all of its descendents. If they can cause a malicious setuid, setgid, or file-caps executable to appear in that mount, then that executable will only allow them to elevate privileges in exactly the set of namespaces in which they are already privileges. On the other hand, if they can cause a malicious executable to appear with a dangerous MAC label, running it could change the caller's security context in a way that should not have been possible, even inside the namespace in which the task is confined. As a hardening measure, this would have made CVE-2014-5207 much more difficult to exploit. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flagEric W. Biederman2016-06-232-3/+3
| | | | | | | | | | | | | | | | | | | | Now that SB_I_NODEV controls the nodev behavior devpts can just clear this flag during mount. Simplifying the code and making it easier to audit how the code works. While still preserving the invariant that s_iflags is only modified during mount. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * userns: Remove implicit MNT_NODEV fragility.Eric W. Biederman2016-06-232-18/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the implict setting of MNT_NODEV on mounts that happen with just user namespace permissions with an implicit setting of SB_I_NODEV in s_iflags. The visibility of the implicit MNT_NODEV has caused problems in the past. With this change the fragile case where an implicit MNT_NODEV needs to be preserved in do_remount is removed. Using SB_I_NODEV is much less fragile as s_iflags are set during the original mount and never changed. In do_new_mount with the implicit setting of MNT_NODEV gone, the only code that can affect mnt_flags is fs_fully_visible so simplify the if statement and reduce the indentation of the code to make that clear. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * mnt: Simplify mount_too_revealingEric W. Biederman2016-06-231-17/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Verify all filesystems that we check in mount_too_revealing set SB_I_NOEXEC and SB_I_NODEV in sb->s_iflags. That is true for today and it should remain true in the future. Remove the now unnecessary checks from mnt_already_visibile that ensure MNT_LOCK_NOSUID, MNT_LOCK_NOEXEC, and MNT_LOCK_NODEV are preserved. Making the code shorter and easier to read. Relying on SB_I_NOEXEC and SB_I_NODEV instead of the user visible MNT_NOSUID, MNT_NOEXEC, and MNT_NODEV ensures the many current systems where proc and sysfs are mounted with "nosuid, nodev, noexec" and several slightly buggy container applications don't bother to set those flags continue to work. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * vfs: Generalize filesystem nodev handling.Eric W. Biederman2016-06-234-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a function may_open_dev that tests MNT_NODEV and a new superblock flab SB_I_NODEV. Use this new function in all of the places where MNT_NODEV was previously tested. Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those filesystems should never support device nodes, and a simple superblock flags makes that very hard to get wrong. With SB_I_NODEV set if any device nodes somehow manage to show up on on a filesystem those device nodes will be unopenable. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * kernfs: The cgroup filesystem also benefits from SB_I_NOEXECEric W. Biederman2016-06-232-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The cgroup filesystem is in the same boat as sysfs. No one ever permits executables of any kind on the cgroup filesystem, and there is no reasonable future case to support executables in the future. Therefore move the setting of SB_I_NOEXEC which makes the code proof against future mistakes of accidentally creating executables from sysfs to kernfs itself. Making the code simpler and covering the sysfs, cgroup, and cgroup2 filesystems. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * mnt: Move the FS_USERNS_MOUNT check into sget_usernsEric W. Biederman2016-06-232-4/+4
| | | | | | | | | | | | | | | | | | | | Allowing a filesystem to be mounted by other than root in the initial user namespace is a filesystem property not a mount namespace property and as such should be checked in filesystem specific code. Move the FS_USERNS_MOUNT test into super.c:sget_userns(). Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * fs: Add user namespace member to struct super_blockEric W. Biederman2016-06-232-7/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Start marking filesystems with a user namespace owner, s_user_ns. In this change this is only used for permission checks of who may mount a filesystem. Ultimately s_user_ns will be used for translating ids and checking capabilities for filesystems mounted from user namespaces. The default policy for setting s_user_ns is implemented in sget(), which arranges for s_user_ns to be set to current_user_ns() and to ensure that the mounter of the filesystem has CAP_SYS_ADMIN in that user_ns. The guts of sget are split out into another function sget_userns(). The function sget_userns calls alloc_super with the specified user namespace or it verifies the existing superblock that was found has the expected user namespace, and fails with EBUSY when it is not. This failing prevents users with the wrong privileges mounting a filesystem. The reason for the split of sget_userns from sget is that in some cases such as mount_ns and kernfs_mount_ns a different policy for permission checking of mounts and setting s_user_ns is necessary, and the existence of sget_userns() allows those policies to be implemented. The helper mount_ns is expected to be used for filesystems such as proc and mqueuefs which present per namespace information. The function mount_ns is modified to call sget_userns instead of sget to ensure the user namespace owner of the namespace whose information is presented by the filesystem is used on the superblock. For sysfs and cgroup the appropriate permission checks are already in place, and kernfs_mount_ns is modified to call sget_userns so that the init_user_ns is the only user namespace used. For the cgroup filesystem cgroup namespace mounts are bind mounts of a subset of the full cgroup filesystem and as such s_user_ns must be the same for all of them as there is only a single superblock. Mounts of sysfs that vary based on the network namespace could in principle change s_user_ns but it keeps the analysis and implementation of kernfs simpler if that is not supported, and at present there appear to be no benefits from supporting a different s_user_ns on any sysfs mount. Getting the details of setting s_user_ns correct has been a long process. Thanks to Pavel Tikhorirorv who spotted a leak in sget_userns. Thanks to Seth Forshee who has kept the work alive. Thanks-to: Seth Forshee <seth.forshee@canonical.com> Thanks-to: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * proc: Convert proc_mount to use mount_ns.Eric W. Biederman2016-06-233-51/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | Move the call of get_pid_ns, the call of proc_parse_options, and the setting of s_iflags into proc_fill_super so that mount_ns can be used. Convert proc_mount to call mount_ns and remove the now unnecessary code. Acked-by: Seth Forshee <seth.forshee@canonical.com> Reviewed-by: Djalal Harouni <tixxdz@gmail.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * vfs: Pass data, ns, and ns->userns to mount_nsEric W. Biederman2016-06-232-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today what is normally called data (the mount options) is not passed to fill_super through mount_ns. Pass the mount options and the namespace separately to mount_ns so that filesystems such as proc that have mount options, can use mount_ns. Pass the user namespace to mount_ns so that the standard permission check that verifies the mounter has permissions over the namespace can be performed in mount_ns instead of in each filesystems .mount method. Thus removing the duplication between mqueuefs and proc in terms of permission checks. The extra permission check does not currently affect the rpc_pipefs filesystem and the nfsd filesystem as those filesystems do not currently allow unprivileged mounts. Without unpvileged mounts it is guaranteed that the caller has already passed capable(CAP_SYS_ADMIN) which guarantees extra permission check will pass. Update rpc_pipefs and the nfsd filesystem to ensure that the network namespace reference is always taken in fill_super and always put in kill_sb so that the logic is simpler and so that errors originating inside of fill_super do not cause a network namespace leak. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * mnt: Refactor fs_fully_visible into mount_too_revealingEric W. Biederman2016-06-234-16/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the call of fs_fully_visible in do_new_mount from before the new superblock is allocated with a call of mount_too_revealing after the superblock is allocated. This winds up being a much better location for maintainability of the code. The first change this enables is the replacement of FS_USERNS_VISIBLE with SB_I_USERNS_VISIBLE. Moving the flag from struct filesystem_type to sb_iflags on the superblock. Unfortunately mount_too_revealing fundamentally needs to touch mnt_flags adding several MNT_LOCKED_XXX flags at the appropriate times. If the mnt_flags did not need to be touched the code could be easily moved into the filesystem specific mount code. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2016-07-294-40/+28
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This fixes error propagation from writeback to fsync/close for writeback cache mode as well as adding a missing capability flag to the INIT message. The rest are cleanups. (The commits are recent but all the code actually sat in -next for a while now. The recommits are due to conflict avoidance and the addition of Cc: stable@...)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: use filemap_check_errors() mm: export filemap_check_errors() to modules fuse: fix wrong assignment of ->flags in fuse_send_init() fuse: fuse_flush must check mapping->flags for errors fuse: fsync() did not return IO errors fuse: don't mess with blocking signals new helper: wait_event_killable_exclusive() fuse: improve aio directIO write performance for size extending writes
| * | fuse: use filemap_check_errors()Miklos Szeredi2016-07-291-12/+2
| | | | | | | | | | | | Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: fix wrong assignment of ->flags in fuse_send_init()Wei Fang2016-07-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | FUSE_HAS_IOCTL_DIR should be assigned to ->flags, it may be a typo. Signed-off-by: Wei Fang <fangwei1@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 69fe05c90ed5 ("fuse: add missing INIT flags") Cc: <stable@vger.kernel.org>
| * | fuse: fuse_flush must check mapping->flags for errorsMaxim Patlasov2016-07-291-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fuse_flush() calls write_inode_now() that triggers writeback, but actual writeback will happen later, on fuse_sync_writes(). If an error happens, fuse_writepage_end() will set error bit in mapping->flags. So, we have to check mapping->flags after fuse_sync_writes(). Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on") Cc: <stable@vger.kernel.org> # v3.15+
| * | fuse: fsync() did not return IO errorsAlexey Kuznetsov2016-07-291-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to implementation of fuse writeback filemap_write_and_wait_range() does not catch errors. We have to do this directly after fuse_sync_writes() Signed-off-by: Alexey Kuznetsov <kuznet@virtuozzo.com> Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 4d99ff8f12eb ("fuse: Turn writeback cache on") Cc: <stable@vger.kernel.org> # v3.15+
| * | Merge branch 'for-miklos' of ↵Miklos Szeredi2016-07-2133-173/+332
| |\ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-next
| | * | fuse: don't mess with blocking signalsAl Viro2016-07-191-27/+3
| | | | | | | | | | | | | | | | | | | | | | | | just use wait_event_killable{,_exclusive}(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fuse: improve aio directIO write performance for size extending writesAshish Sangwan2016-06-302-12/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While sending the blocking directIO in fuse, the write request is broken into sub-requests, each of default size 128k and all the requests are sent in non-blocking background mode if async_dio mode is supported by libfuse. The process which issue the write wait for the completion of all the sub-requests. Sending multiple requests parallely gives a chance to perform parallel writes in the user space fuse implementation if it is multi-threaded and hence improves the performance. When there is a size extending aio dio write, we switch to blocking mode so that we can properly update the size of the file after completion of the writes. However, in this situation all the sub-requests are sent in serialized manner where the next request is sent only after receiving the reply of the current request. Hence the multi-threaded user space implementation is not utilized properly. This patch changes the size extending aio dio behavior to exactly follow blocking dio. For multi threaded fuse implementation having 10 threads and using buffer size of 64MB to perform async directIO, we are getting double the speed. Signed-off-by: Ashish Sangwan <ashishsangwan2@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | | | Revert "vfs: add lookup_hash() helper"Linus Torvalds2016-07-291-28/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 3c9fe8cdff1b889a059a30d22f130372f2b3885f. As Miklos points out in commit c1b2cc1a765a, the "lookup_hash()" helper is now unused, and in fact, with the hash salting changes, since the hash of a dentry name now depends on the directory dentry it is in, the helper function isn't even really likely to be useful. So rather than keep it around in case somebody else might end up finding a use for it, let's just remove the helper and not trick people into thinking it might be a useful thing. For example, I had obviously completely missed how the helper didn't follow the normal dentry hashing patterns, and how the hash salting patch broke overlayfs. Things would quietly build and look sane, but not work. Suggested-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'overlayfs-linus' of ↵Linus Torvalds2016-07-295-236/+445
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: "First of all, this fixes a regression in overlayfs introduced by the dentry hash salting. I've moved the patch fixing this to the front of the queue, so if (god forbid) something needs to be bisected in overlayfs this regression won't interfere with that. The biggest part is preparation for selinux support, done by Vivek Goyal. Essentially this makes all operations on underlying filesystems be done with credentials of mounter. This makes everything nicely consistent. There are also fixes for a number of known and recently discovered non-standard behavior (thanks to Eryu Guan for testing and improving the test suites)" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (23 commits) ovl: simplify empty checking qstr: constify instances in overlayfs ovl: clear nlink on rmdir ovl: disallow overlayfs as upperdir ovl: fix warning ovl: remove duplicated include from super.c ovl: append MAY_READ when diluting write checks ovl: dilute permission checks on lower only if not special file ovl: fix POSIX ACL setting ovl: share inode for hard link ovl: store real inode pointer in ->i_private ovl: permission: return ECHILD instead of ENOENT ovl: update atime on upper ovl: fix sgid on directory ovl: simplify permission checking ovl: do not require mounter to have MAY_WRITE on lower ovl: do operations on underlying file system in mounter's context ovl: modify ovl_permission() to do checks on two inodes ovl: define ->get_acl() for overlay inodes ovl: move some common code in a function ...
| * | | | ovl: simplify empty checkingMiklos Szeredi2016-07-291-29/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The empty checking logic is duplicated in ovl_check_empty_and_clear() and ovl_remove_and_whiteout(), except the condition for clearing whiteouts is different: ovl_check_empty_and_clear() checked for being upper ovl_remove_and_whiteout() checked for merge OR lower Move the intersection of those checks (upper AND merge) into ovl_check_empty_and_clear() and simplify ovl_remove_and_whiteout(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | qstr: constify instances in overlayfsAl Viro2016-07-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: clear nlink on rmdirMiklos Szeredi2016-07-291-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To make delete notification work on fa/inotify. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: disallow overlayfs as upperdirMiklos Szeredi2016-07-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This does not work and does not make sense. So instead of fixing it (probably not hard) just disallow. Reported-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org>
| * | | | ovl: fix warningMiklos Szeredi2016-07-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a superfluous newline in the warning message in ovl_d_real(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: remove duplicated include from super.cWei Yongjun2016-07-291-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove duplicated include. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: append MAY_READ when diluting write checksVivek Goyal2016-07-291-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now we remove MAY_WRITE/MAY_APPEND bits from mask if realfile is on lower/. This is done as files on lower will never be written and will be copied up. But to copy up a file, mounter should have MAY_READ permission otherwise copy up will fail. So set MAY_READ in mask when MAY_WRITE is reset. Dan Walsh noticed this when he did access(lowerfile, W_OK) and it returned True (context mounts) but when he tried to actually write to file, it failed as mounter did not have permission on lower file. [SzM] don't set MAY_READ if only MAY_APPEND is set without MAY_WRITE; this won't trigger a copy-up. Reported-by: Dan Walsh <dwalsh@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: dilute permission checks on lower only if not special fileVivek Goyal2016-07-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now if file is on lower/, we remove MAY_WRITE/MAY_APPEND bits from mask as lower/ will never be written and file will be copied up. But this is not true for special files. These files are not copied up and are opened in place. So don't dilute the checks for these types of files. Reported-by: Dan Walsh <dwalsh@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: fix POSIX ACL settingMiklos Szeredi2016-07-294-11/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setting POSIX ACL needs special handling: 1) Some permission checks are done by ->setxattr() which now uses mounter's creds ("ovl: do operations on underlying file system in mounter's context"). These permission checks need to be done with current cred as well. 2) Setting ACL can fail for various reasons. We do not need to copy up in these cases. In the mean time switch to using generic_setxattr. [Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr() doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is disabled, and I could not come up with an obvious way to do it. This instead avoids the link error by defining two sets of ACL operations and letting the compiler drop one of the two at compile time depending on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code, also leading to smaller code. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: share inode for hard linkMiklos Szeredi2016-07-294-48/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inode attributes are copied up to overlay inode (uid, gid, mode, atime, mtime, ctime) so generic code using these fields works correcty. If a hard link is created in overlayfs separate inodes are allocated for each link. If chmod/chown/etc. is performed on one of the links then the inode belonging to the other ones won't be updated. This patch attempts to fix this by sharing inodes for hard links. Use inode hash (with real inode pointer as a key) to make sure overlay inodes are shared for hard links on upper. Hard links on lower are still split (which is not user observable until the copy-up happens, see Documentation/filesystems/overlayfs.txt under "Non-standard behavior"). The inode is only inserted in the hash if it is non-directoy and upper. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: store real inode pointer in ->i_privateMiklos Szeredi2016-07-295-37/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To get from overlay inode to real inode we currently use 'struct ovl_entry', which has lifetime connected to overlay dentry. This is okay, since each overlay dentry had a new overlay inode allocated. Following patch will break that assumption, so need to leave out ovl_entry. This patch stores the real inode directly in i_private, with the lowest bit used to indicate whether the inode is upper or lower. Lifetime rules remain, using ovl_inode_real() must only be done while caller holds ref on overlay dentry (and hence on real dentry), or within RCU protected regions. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: permission: return ECHILD instead of ENOENTMiklos Szeredi2016-07-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The error is due to RCU and is temporary. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: update atime on upperMiklos Szeredi2016-07-294-5/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix atime update logic in overlayfs. This patch adds an i_op->update_time() handler to overlayfs inodes. This forwards atime updates to the upper layer only. No atime updates are done on lower layers. Remove implicit atime updates to underlying files and directories with O_NOATIME. Remove explicit atime update in ovl_readlink(). Clear atime related mnt flags from cloned upper mount. This means atime updates are controlled purely by overlayfs mount options. Reported-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: fix sgid on directoryMiklos Szeredi2016-07-291-4/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When creating directory in workdir, the group/sgid inheritance from the parent dir was omitted completely. Fix this by calling inode_init_owner() on overlay inode and using the resulting uid/gid/mode to create the file. Unfortunately the sgid bit can be stripped off due to umask, so need to reset the mode in this case in workdir before moving the directory in place. Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: simplify permission checkingMiklos Szeredi2016-07-293-53/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fact that we always do permission checking on the overlay inode and clear MAY_WRITE for checking access to the lower inode allows cruft to be removed from ovl_permission(). 1) "default_permissions" option effectively did generic_permission() on the overlay inode with i_mode, i_uid and i_gid updated from underlying filesystem. This is what we do by default now. It did the update using vfs_getattr() but that's only needed if the underlying filesystem can change (which is not allowed). We may later introduce a "paranoia_mode" that verifies that mode/uid/gid are not changed. 2) splitting out the IS_RDONLY() check from inode_permission() also becomes unnecessary once we remove the MAY_WRITE from the lower inode check. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: do not require mounter to have MAY_WRITE on lowerVivek Goyal2016-07-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we have two levels of checks in ovl_permission(). overlay inode is checked with the creds of task while underlying inode is checked with the creds of mounter. Looks like mounter does not have to have WRITE access to files on lower/. So remove the MAY_WRITE from access mask for checks on underlying lower inode. This means task should still have the MAY_WRITE permission on lower inode and mounter is not required to have MAY_WRITE. It also solves the problem of read only NFS mounts being used as lower. If __inode_permission(lower_inode, MAY_WRITE) is called on read only NFS, it fails. By resetting MAY_WRITE, check succeeds and case of read only NFS shold work with overlay without having to specify any special mount options (default permission). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: do operations on underlying file system in mounter's contextVivek Goyal2016-07-292-37/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Given we are now doing checks both on overlay inode as well underlying inode, we should be able to do checks and operations on underlying file system using mounter's context. So modify all operations to do checks/operations on underlying dentry/inode in the context of mounter. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: modify ovl_permission() to do checks on two inodesVivek Goyal2016-07-291-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now ovl_permission() calls __inode_permission(realinode), to do permission checks on real inode and no checks are done on overlay inode. Modify it to do checks both on overlay inode as well as underlying inode. Checks on overlay inode will be done with the creds of calling task while checks on underlying inode will be done with the creds of mounter. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | | | ovl: define ->get_acl() for overlay inodesVivek Goyal2016-07-294-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we are planning to do DAC permission checks on overlay inode itself. And to make it work, we will need to make sure we can get acls from underlying inode. So define ->get_acl() for overlay inodes and this in turn calls into underlying filesystem to get acls, if any. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>