linux-stable.git - Linux kernel stable tree

	Commit message (Collapse)	Author	Age	Files	Lines
...
* \|	fix handling of nd->depth on LOOKUP_CACHED failures in try_to_unlazy*	Al Viro	2021-02-20	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After switching to non-RCU mode, we want nd->depth to match the number of entries in nd->stack[] that need eventual path_put(). legitimize_links() takes care of that on failures; unfortunately, failure exits added for LOOKUP_CACHED do not. We could add the logics for that into those failure exits, both in try_to_unlazy() and in try_to_unlazy_next(), but since both checks are immediately followed by legitimize_links() and there's no calls of legitimize_links() other than those two... It's easier to move the check (and required handling of nd->depth on failure) into legitimize_links() itself. [caught by Jens: ... and since we are zeroing ->depth here, we need to do drop_links() first] Fixes: 6c6ec2b0a3e0 "fs: add support for LOOKUP_CACHED" Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \|	fs: add support for LOOKUP_CACHED	Jens Axboe	2021-01-04	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	io_uring always punts opens to async context, since there's no control over whether the lookup blocks or not. Add LOOKUP_CACHED to support just doing the fast RCU based lookups, which we know will not block. If we can do a cached path resolution of the filename, then we don't have to always punt lookups for a worker. During path resolution, we always do LOOKUP_RCU first. If that fails and we terminate LOOKUP_RCU, then fail a LOOKUP_CACHED attempt as well. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \|	saner calling conventions for unlazy_child()	Al Viro	2021-01-04	1	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	same as for the previous commit - instead of 0/-ECHILD make it return true/false, rename to try_to_unlazy_child(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \|	fs: make unlazy_walk() error handling consistent	Jens Axboe	2021-01-04	1	-26/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Most callers check for non-zero return, and assume it's -ECHILD (which it always will be). One caller uses the actual error return. Clean this up and make it fully consistent, by having unlazy_walk() return a bool instead. Rename it to try_to_unlazy() and return true on success, and failure on error. That's easier to read. No functional changes in this patch. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \|	fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast()	Steven Rostedt (VMware)	2021-01-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Running my yearly branch profiling code, it detected a 100% wrong branch condition in name.c for lookup_fast(). The code in question has: status = d_revalidate(dentry, nd->flags); if (likely(status > 0)) return dentry; if (unlazy_child(nd, dentry, seq)) return ERR_PTR(-ECHILD); if (unlikely(status == -ECHILD)) /* we'd been told to redo it in non-rcu mode */ status = d_revalidate(dentry, nd->flags); If the status of the d_revalidate() is greater than zero, then the function finishes. Otherwise, if it is an "unlazy_child" it returns with -ECHILD. After the above two checks, the status is compared to -ECHILD, as that is what is returned if the original d_revalidate() needed to be done in a non-rcu mode. Especially this path is called in a condition of: if (nd->flags & LOOKUP_RCU) { And most of the d_revalidate() functions have: if (flags & LOOKUP_RCU) return -ECHILD; It appears that that is the only case that this if statement is triggered on two of my machines, running in production. As it is dependent on what filesystem mix is configured in the running kernel, simply remove the unlikely() from the if statement. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \|	do_tmpfile(): don't mess with finish_open()	Al Viro	2021-01-03	1	-4/+2
\|/ \| \| \| \| \|	use vfs_open() instead Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Merge branch 'work.misc' of ↵	Linus Torvalds	2020-12-25	1	-1/+3
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted patches from previous cycle(s)..." * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix hostfs_open() use of ->f_path.dentry Make sure that make_create_in_sticky() never sees uninitialized value of dir_mode fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode() fs/namespace.c: WARN if mnt_count has become negative
\| *	Make sure that make_create_in_sticky() never sees uninitialized value of ↵	Al Viro	2020-12-10	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	dir_mode make sure nd->dir_mode is always initialized after success exit from link_path_walk(); in case of empty path it did not happen. Reported-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Tested-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \|	fs: make do_renameat2() take struct filename	Jens Axboe	2020-12-09	1	-18/+22
\|/ \| \| \| \| \| \| \| \| \|	Pass in the struct filename pointers instead of the user string, and update the three callers to do the same. This behaves like do_unlinkat(), which also takes a filename struct and puts it when it is done. Converting callers is then trivial. Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	Merge branch 'work.misc' of ↵	Linus Torvalds	2020-10-24	1	-1/+2
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff all over the place (the largest group here is Christoph's stat cleanups)" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove KSTAT_QUERY_FLAGS fs: remove vfs_stat_set_lookup_flags fs: move vfs_fstatat out of line fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat fs: remove vfs_statx_fd fs: omfs: use kmemdup() rather than kmalloc+memcpy [PATCH] reduce boilerplate in fsid handling fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS selftests: mount: add nosymfollow tests Add a "nosymfollow" mount option.
\| *	Add a "nosymfollow" mount option.	Mattias Nissler	2020-08-27	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For mounts that have the new "nosymfollow" option, don't follow symlinks when resolving paths. The new option is similar in spirit to the existing "nodev", "noexec", and "nosuid" options, as well as to the LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD variants have been supporting the "nosymfollow" mount option for a long time with equivalent implementations. Note that symlinks may still be created on file systems mounted with the "nosymfollow" option present. readlink() remains functional, so user space code that is aware of symlinks can still choose to follow them explicitly. Setting the "nosymfollow" mount option helps prevent privileged writers from modifying files unintentionally in case there is an unexpected link along the accessed path. The "nosymfollow" option is thus useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts. More information on the history and motivation for this patch can be found here: https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal Signed-off-by: Mattias Nissler <mnissler@chromium.org> Signed-off-by: Ross Zwisler <zwisler@google.com> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \|	fs: remove the unused SB_I_MULTIROOT flag	Christoph Hellwig	2020-09-24	1	-2/+2
\|/ \| \| \| \| \| \| \| \| \|	The last user of SB_I_MULTIROOT is disappeared with commit f2aedb713c28 ("NFS: Add fs_context support.") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
*	exec: restore EACCES of S_ISDIR execve()	Kees Cook	2020-08-14	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Patch series "Fix S_ISDIR execve() errno". Fix an errno change for execve() of directories, noticed by Marc Zyngier. Along with the fix, include a regression test to avoid seeing this return in the future. This patch (of 2): The return code for attempting to execute a directory has always been EACCES. Adjust the S_ISDIR exec test to reflect the old errno instead of the general EISDIR for other kinds of "open" attempts on directories. Fixes: 633fb6ac3980 ("exec: move S_ISREG() check earlier") Reported-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Greg Kroah-Hartman <gregkh@android.com> Reviewed-by: Greg Kroah-Hartman <gregkh@google.com> Link: http://lkml.kernel.org/r/20200813231723.2725102-2-keescook@chromium.org Link: https://lore.kernel.org/lkml/20200813151305.6191993b@why Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	2020-08-12	1	-2/+8
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Merge more updates from Andrew Morton: - most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction, mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util, memory-hotplug, cleanups, uaccess, migration, gup, pagemap), - various other subsystems (alpha, misc, sparse, bitmap, lib, bitops, checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump, exec, kdump, rapidio, panic, kcov, kgdb, ipc). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits) mm/gup: remove task_struct pointer for all gup code mm: clean up the last pieces of page fault accountings mm/xtensa: use general page fault accounting mm/x86: use general page fault accounting mm/sparc64: use general page fault accounting mm/sparc32: use general page fault accounting mm/sh: use general page fault accounting mm/s390: use general page fault accounting mm/riscv: use general page fault accounting mm/powerpc: use general page fault accounting mm/parisc: use general page fault accounting mm/openrisc: use general page fault accounting mm/nios2: use general page fault accounting mm/nds32: use general page fault accounting mm/mips: use general page fault accounting mm/microblaze: use general page fault accounting mm/m68k: use general page fault accounting mm/ia64: use general page fault accounting mm/hexagon: use general page fault accounting mm/csky: use general page fault accounting ...
\| *	exec: move path_noexec() check earlier	Kees Cook	2020-08-12	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The path_noexec() check, like the regular file check, was happening too late, letting LSMs see impossible execve()s. Check it earlier as well in may_open() and collect the redundant fs/exec.c path_noexec() test under the same robustness comment as the S_ISREG() check. My notes on the call path, and related arguments, checks, etc: do_open_execat() struct open_flags open_exec_flags = { .open_flag = O_LARGEFILE \| O_RDONLY \| __FMODE_EXEC, .acc_mode = MAY_EXEC, ... do_filp_open(dfd, filename, open_flags) path_openat(nameidata, open_flags, flags) file = alloc_empty_file(open_flags, current_cred()); do_open(nameidata, file, open_flags) may_open(path, acc_mode, open_flag) /* new location of MAY_EXEC vs path_noexec() test / inode_permission(inode, MAY_OPEN \| acc_mode) security_inode_permission(inode, acc_mode) vfs_open(path, file) do_dentry_open(file, path->dentry->d_inode, open) security_file_open(f) open() / old location of path_noexec() test */ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
\| *	exec: move S_ISREG() check earlier	Kees Cook	2020-08-12	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The execve(2)/uselib(2) syscalls have always rejected non-regular files. Recently, it was noticed that a deadlock was introduced when trying to execute pipes, as the S_ISREG() test was happening too late. This was fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files during execve()"), but it was added after inode_permission() had already run, which meant LSMs could see bogus attempts to execute non-regular files. Move the test into the other inode type checks (which already look for other pathological conditions[1]). Since there is no need to use FMODE_EXEC while we still have access to "acc_mode", also switch the test to MAY_EXEC. Also include a comment with the redundant S_ISREG() checks at the end of execve(2)/uselib(2) to note that they are present to avoid any mistakes. My notes on the call path, and related arguments, checks, etc: do_open_execat() struct open_flags open_exec_flags = { .open_flag = O_LARGEFILE \| O_RDONLY \| __FMODE_EXEC, .acc_mode = MAY_EXEC, ... do_filp_open(dfd, filename, open_flags) path_openat(nameidata, open_flags, flags) file = alloc_empty_file(open_flags, current_cred()); do_open(nameidata, file, open_flags) may_open(path, acc_mode, open_flag) /* new location of MAY_EXEC vs S_ISREG() test / inode_permission(inode, MAY_OPEN \| acc_mode) security_inode_permission(inode, acc_mode) vfs_open(path, file) do_dentry_open(file, path->dentry->d_inode, open) / old location of FMODE_EXEC vs S_ISREG() test */ security_file_open(f) open() [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	fix breakage in do_rmdir()	Al Viro	2020-08-12	1	-1/+1
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \|	syzbot reported and bisected a use-after-free due to the recent init cleanups. The putname() should happen only after we'd not branched to retry, same as it's done in do_unlinkat(). Reported-by: syzbot+bbeb1c88016c7db4aa24@syzkaller.appspotmail.com Fixes: e24ab0ef689d "fs: push the getname from do_rmdir into the callers" Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	init: add an init_mknod helper	Christoph Hellwig	2020-07-31	1	-1/+1
\| \| \| \| \| \| \|	Add a simple helper to mknod with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_mknod. Signed-off-by: Christoph Hellwig <hch@lst.de>
*	init: add an init_mkdir helper	Christoph Hellwig	2020-07-31	1	-1/+1
\| \| \| \| \| \| \|	Add a simple helper to mkdir with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_mkdir. Signed-off-by: Christoph Hellwig <hch@lst.de>
*	init: add an init_symlink helper	Christoph Hellwig	2020-07-31	1	-1/+1
\| \| \| \| \| \| \|	Add a simple helper to symlink with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_symlink. Signed-off-by: Christoph Hellwig <hch@lst.de>
*	init: add an init_link helper	Christoph Hellwig	2020-07-31	1	-2/+2
\| \| \| \| \| \| \|	Add a simple helper to link with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_link. Signed-off-by: Christoph Hellwig <hch@lst.de>
*	fs: push the getname from do_rmdir into the callers	Christoph Hellwig	2020-07-31	1	-6/+4
\| \| \| \| \| \| \|	This mirrors do_unlinkat and will make life a little easier for the init code to reuse the whole function with a kernel filename. Signed-off-by: Christoph Hellwig <hch@lst.de>
*	vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK	Linus Torvalds	2020-06-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	posix_acl_permission() does not care about MAY_NOT_BLOCK, and in fact the permission logic internally must not check that bit (it's only for upper layers to decide whether they can block to do IO to look up the acl information or not). But the way the code was written, it _looked_ like it cared, since the function explicitly did not mask that bit off. But it has exactly two callers: one for when that bit is set, which first clears the bit before calling posix_acl_permission(), and the other call site when that bit was clear. So stop the silly games "saving" the MAY_NOT_BLOCK bit that must not be used for the actual permission test, and that currently is pointlessly cleared by the callers when the function itself should just not care. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	vfs: do not do group lookup when not necessary	Linus Torvalds	2020-06-08	1	-15/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rasmus Villemoes points out that the 'in_group_p()' tests can be a noticeable expense, and often completely unnecessary. A common situation is that the 'group' bits are the same as the 'other' bits wrt the permissions we want to test. So rewrite 'acl_permission_check()' to not bother checking for group ownership when the permission check doesn't care. For example, if we're asking for read permissions, and both 'group' and 'other' allow reading, there's really no reason to check if we're part of the group or not: either way, we'll allow it. Rasmus says: "On a bog-standard Ubuntu 20.04 install, a workload consisting of compiling lots of userspace programs (i.e., calling lots of short-lived programs that all need to get their shared libs mapped in, and the compilers poking around looking for system headers - lots of /usr/lib, /usr/bin, /usr/include/ accesses) puts in_group_p around 0.1% according to perf top. System-installed files are almost always 0755 (directories and binaries) or 0644, so in most cases, we can avoid the binary search and the cost of pulling the cred->groups array and in_group_p() .text into the cpu cache" Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	vfs: allow unprivileged whiteout creation	Miklos Szeredi	2020-05-14	1	-18/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Whiteouts, unlike real device node should not require privileges to create. The general concern with device nodes is that opening them can have side effects. The kernel already avoids zero major (see Documentation/admin-guide/devices.txt). To be on the safe side the patch explicitly forbids registering a char device with 0/0 number (see cdev_add()). This guarantees that a non-O_PATH open on a whiteout will fail with ENODEV; i.e. it won't have any side effect. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
*	fix a braino in legitimize_path()	Al Viro	2020-04-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	brown paperbag time... wrong order of arguments ended up confusing the values to check dentry and mount_lock seqcounts against. Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: 2aa38470853a ("non-RCU analogue of the previous commit") Tested-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	lookup_open(): don't bother with fallbacks to lookup+create	Al Viro	2020-04-02	1	-25/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We fall back to lookup+create (instead of atomic_open) in several cases: 1) we don't have write access to filesystem and O_TRUNC is present in the flags. It's not something we want ->atomic_open() to see - it just might go ahead and truncate the file. However, we can pass it the flags sans O_TRUNC - eventually do_open() will call handle_truncate() anyway. 2) we have O_CREAT \| O_EXCL and we can't write to parent. That's going to be an error, of course, but we want to know _which_ error should that be - might be EEXIST (if file exists), might be EACCES or EROFS. Simply stripping O_CREAT (and checking if we see ENOENT) would suffice, if not for O_EXCL. However, we used to have ->atomic_open() fully responsible for rejecting O_CREAT \| O_EXCL on existing file and just stripping O_CREAT would've disarmed those checks. With nothing downstream to catch the problem - FMODE_OPENED used to be "don't bother with EEXIST checks, ->atomic_open() has done those". Now EEXIST checks downstream are skipped only if FMODE_CREATED is set - FMODE_OPENED alone is not enough. That has eliminated the need to fall back onto lookup+create path in this case. 3) O_WRONLY or O_RDWR when we have no write access to filesystem, with nothing else objectionable. Fallback is (and had always been) pointless. IOW, we don't really need that fallback; all we need in such cases is to trim O_TRUNC and O_CREAT properly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	atomic_open(): no need to pass struct open_flags anymore	Al Viro	2020-04-02	1	-2/+1
\| \| \| \| \| \| \|	argument had been unused since 1643b43fbd052 (lookup_open(): lift the "fallback to !O_CREAT" logics from atomic_open()) back in 2016 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	open_last_lookups(): move complete_walk() into do_open()	Al Viro	2020-04-02	1	-10/+8
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	open_last_lookups(): lift O_EXCL\|O_CREAT handling into do_open()	Al Viro	2020-04-02	1	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently path_openat() has "EEXIST on O_EXCL\|O_CREAT" checks done on one of the ways out of open_last_lookups(). There are 4 cases: 1) the last component is . or ..; check is not done. 2) we had FMODE_OPENED or FMODE_CREATED set while in lookup_open(); check is not done. 3) symlink to be traversed is found; check is not done (nor should it be) 4) everything else: check done (before complete_walk(), even). In case (1) O_EXCL\|O_CREAT ends up failing with -EISDIR - that's open("/tmp/.", O_CREAT\|O_EXCL, 0600) Note that in the same conditions open("/tmp", O_CREAT\|O_EXCL, 0600) would have yielded EEXIST. Either error is allowed, switching to -EEXIST in these cases would've been more consistent. Case (2) is more subtle; first of all, if we have FMODE_CREATED set, the object hadn't existed prior to the call. The check should not be done in such a case. The rest is problematic, though - we have FMODE_OPENED set (i.e. it went through ->atomic_open() and got successfully opened there) FMODE_CREATED is NOT set O_CREAT and O_EXCL are both set. Any such case is a bug - either we failed to set FMODE_CREATED when we had, in fact, created an object (no such instances in the tree) or we have opened a pre-existing file despite having had both O_CREAT and O_EXCL passed. One of those was, in fact caught (and fixed) while sorting out this mess (gfs2 on cold dcache). And in such situations we should fail with EEXIST. Note that for (1) and (4) FMODE_CREATED is not set - for (1) there's nothing in handle_dots() to set it, for (4) we'd explicitly checked that. And (1), (2) and (4) are exactly the cases when we leave the loop in the caller, with do_open() called immediately after that loop. IOW, we can move the check over there, and make it If we have O_CREAT\|O_EXCL and after successful pathname resolution FMODE_CREATED is not set, we must have run into a preexisting file and should fail with EEXIST. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	open_last_lookups(): don't abuse complete_walk() when all we want is unlazy	Al Viro	2020-04-02	1	-9/+5
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	open_last_lookups(): consolidate fsnotify_create() calls	Al Viro	2020-04-02	1	-5/+2
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	take post-lookup part of do_last() out of loop	Al Viro	2020-04-02	1	-12/+9
\| \| \| \| \| \| \| \| \| \| \|	now we can have open_last_lookups() directly from the loop in path_openat() - the rest of do_last() never returns a symlink to follow, so we can bloody well leave the loop first. Rename the rest of that thing from do_last() to do_open() and make it return an int. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	link_path_walk(): sample parent's i_uid and i_mode for the last component	Al Viro	2020-04-02	1	-10/+7
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	__nd_alloc_stack(): make it return bool	Al Viro	2020-04-02	1	-27/+18
\| \| \| \| \| \| \|	... and adjust the caller (reserve_stack()). Rename to nd_alloc_stack(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	reserve_stack(): switch to __nd_alloc_stack()	Al Viro	2020-04-02	1	-11/+8
\| \| \| \| \| \| \|	expand the call of nd_alloc_stack() into it (and don't recheck the depth on the second call) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	pick_link(): take reserving space on stack into a new helper	Al Viro	2020-04-02	1	-21/+25
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	pick_link(): more straightforward handling of allocation failures	Al Viro	2020-04-02	1	-8/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pick_link() needs to push onto stack; we start with using two-element array embedded into struct nameidata and the first time we need more than that we switch to separately allocated array. Allocation can fail, of course, and handling of that would be simple enough - we need to drop 'link' and bugger off. However, the things get more complicated in RCU mode. There we must do GFP_ATOMIC allocation. If that fails, we try to switch to non-RCU mode and repeat the allocation. To switch to non-RCU mode we need to grab references to 'link' and to everything in nameidata. The latter done by unlazy_walk(); the former - legitimize_path(). 'link' must go first - after unlazy_walk() we are out of RCU-critical period and it's too late to call legitimize_path() since the references in link->mnt and link->dentry might be pointing to freed and reused memory. So we do legitimize_path(), then unlazy_walk(). And that's where it gets too subtle: what to do if the former fails? We MUST do path_put(link) to avoid leaks. And we can't do that under rcu_read_lock(). Solution in mainline was to empty then nameidata manually, drop out of RCU mode and then do put_path(). In effect, we open-code the things eventual terminate_walk() would've done on error in RCU mode. That looks badly out of place and confusing. We could add a comment along the lines of the explanation above, but... there's a simpler solution. Call unlazy_walk() even if legitimaze_path() fails. It will take us out of RCU mode, so we'll be able to do path_put(link). Yes, it will do unnecessary work - attempt to grab references on the stuff in nameidata, only to have them dropped as soon as we return the error to upper layer and get terminate_walk() called there. So what? We are thoroughly off the fast path by that point - we had GFP_ATOMIC allocation fail, we had ->d_seq or mount_lock mismatch and we are about to try walking the same path from scratch in non-RCU mode. Which will need to do the same allocation, this time with GFP_KERNEL, so it will be able to apply memory pressure for blocking stuff. Compared to that the cost of several lockref_get_not_dead() is noise. And the logics become much easier to understand that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	fold path_to_nameidata() into its only remaining caller	Al Viro	2020-04-02	1	-13/+6
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	pick_link(): pass it struct path already with normal refcounting rules	Al Viro	2020-04-02	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	step_into() tries to avoid grabbing and dropping mount references on the steps that do not involve crossing mountpoints (which is obviously the majority of cases). So it uses a local struct path with unusual refcounting rules - path.mnt is pinned if and only if it's not equal to nd->path.mnt. We used to have similar beasts all over the place and we had quite a few bugs crop up in their handling - it's easy to get confused when changing e.g. cleanup on failure exits (or adding a new check, etc.) Now that's mostly gone - the step_into() instance (which is what we need them for) is the only one left. It is exposed to mount traversal and it's (shortly) seen by pick_link(). Since pick_link() needs to store it in link stack, where the normal rules apply, it has to make sure that mount is pinned regardless of nd->path.mnt value. That's done on all calls of pick_link() and very early in those. Let's do that in the caller (step_into()) instead - that way the fewer places need to be aware of such struct path instances. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	fs/namei.c: kill follow_mount()	Al Viro	2020-04-02	1	-20/+2
\| \| \| \| \| \| \|	The only remaining caller (path_pts()) should be using follow_down() anyway. And clean path_pts() a bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	non-RCU analogue of the previous commit	Al Viro	2020-04-02	1	-17/+39
\| \| \| \| \| \| \| \| \| \|	new helper: choose_mountpoint(). Wrapper around choose_mountpoint_rcu(), similar to lookup_mnt() vs. __lookup_mnt(). follow_dotdot() switched to it. Now we don't grab mount_lock exclusive anymore; note that the primitive used non-RCU mount traversals in other direction (lookup_mnt()) doesn't bother with that either - it uses mount_lock seqcount instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	helper for mount rootwards traversal	Al Viro	2020-04-02	1	-16/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The loops in follow_dotdot{_rcu()} are doing the same thing: we have a mount and we want to find out how far up the chain of mounts do we need to go. We follow the chain of mount until we find one that is not directly overmounting the root of another mount. If such a mount is found, we want the location it's mounted upon. If we run out of chain (i.e. get to a mount that is not mounted on anything else) or run into process' root, we report failure. On success, we want (in RCU case) d_seq of resulting location sampled or (in non-RCU case) references to that location acquired. This commit introduces such primitive for RCU case and switches follow_dotdot_rcu() to it; non-RCU case will be go in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	follow_dotdot(): be lazy about changing nd->path	Al Viro	2020-04-02	1	-5/+13
\| \| \| \| \| \| \| \| \| \| \| \| \|	Change nd->path only after the loop is done and only in case we hadn't ended up finding ourselves in root. Same for NO_XDEV check. That separates the "check how far back do we need to go through the mount stack" logics from the rest of .. traversal. NOTE: path_get/path_put introduced here are temporary. They will go away later in the series. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	follow_dotdot_rcu(): be lazy about changing nd->path	Al Viro	2020-04-02	1	-15/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Change nd->path only after the loop is done and only in case we hadn't ended up finding ourselves in root. Same for NO_XDEV check. Don't recheck mount_lock on each step either. That separates the "check how far back do we need to go through the mount stack" logics from the rest of .. traversal. Note that the sequence for d_seq/d_inode here is * sample mount_lock seqcount ... * sample d_seq * fetch d_inode * verify mount_lock seqcount The last step makes sure that d_inode value we'd got matches d_seq - it dentry is guaranteed to have been a mountpoint through the entire thing, so its d_inode must have been stable. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	follow_dotdot{,_rcu}(): massage loops	Al Viro	2020-04-02	1	-32/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The logics in both of them is the same: while true if in process' root // uncommon break if not in mount root // normal case find the parent return if at absolute root // very uncommon break move to underlying mountpoint report that we are in root Pull the common path out of the loop: if in process' root // uncommon goto in_root if unlikely(in mount root) while true if at absolute root goto in_root move to underlying mountpoint if in process' root goto in_root if in mount root break; find the parent // we are not in mount root return in_root: report that we are in root The reason for that transformation is that we get to keep the common path straight and get a separate block for "move through underlying mountpoints", which will allow to sanitize NO_XDEV handling there. What's more, the pared-down loops will be easier to deal with - in particular, non-RCU case has no need to grab mount_lock and rewriting it to the form that wouldn't do that is a non-trivial change. Better do that with less stuff getting in the way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	lift all calls of step_into() out of follow_dotdot/follow_dotdot_rcu	Al Viro	2020-04-02	1	-34/+37
\| \| \| \| \| \| \| \| \|	lift step_into() into handle_dots() (where they merge with each other); have follow_... return dentry and pass inode/seq to the caller. [braino fix folded; kudos to Qian Cai <cai@lca.pw> for reporting it] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	follow_dotdot{,_rcu}(): switch to use of step_into()	Al Viro	2020-03-13	1	-24/+7
\| \| \| \| \| \|	gets the regular mount crossing on result of .. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	handle_dots(), follow_dotdot{,_rcu}(): preparation to switch to step_into()	Al Viro	2020-03-13	1	-27/+25
\| \| \| \| \| \| \| \| \| \| \| \| \|	Right now the tail ends of follow_dotdot{,_rcu}() are pretty much the open-coded analogues of step_into(). The differences: * the lack of proper LOOKUP_NO_XDEV handling in non-RCU case (arguably a bug) * the lack of ->d_manage() handling (again, arguably a bug) Adjust the calling conventions so that on the next step with could just switch those functions to returning step_into(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	move handle_dots(), follow_dotdot() and follow_dotdot_rcu() past step_into()	Al Viro	2020-03-13	1	-130/+130
\| \| \| \| \| \|	pure move; we are going to have step_into() called by that bunch. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>