summaryrefslogtreecommitdiffstats
path: root/fs/read_write.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2020-10-231-437/+107
|\ | | | | | | | | | | | | | | | | | | | | | | Pull clone/dedupe/remap code refactoring from Darrick Wong: "Move the generic file range remap (aka reflink and dedupe) functions out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to reduce clutter in the first two files" * tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: vfs: move the generic write and copy checks out of mm vfs: move the remap range helpers to remap_range.c vfs: move generic_remap_checks out of mm
| * vfs: move the generic write and copy checks out of mmDarrick J. Wong2020-10-151-0/+143
| | | | | | | | | | | | | | The generic write check helpers also don't have much to do with the page cache, so move them to the vfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * vfs: move the remap range helpers to remap_range.cDarrick J. Wong2020-10-151-473/+0
| | | | | | | | | | | | | | | | | | Complete the migration by moving the file remapping helper functions out of read_write.c and into remap_range.c. This reduces the clutter in the first file and (eventually) will make it so that we can compile out the second file if it isn't needed. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | Merge branch 'work.set_fs' of ↵Linus Torvalds2020-10-221-26/+45
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull initial set_fs() removal from Al Viro: "Christoph's set_fs base series + fixups" * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Allow a NULL pos pointer to __kernel_read fs: Allow a NULL pos pointer to __kernel_write powerpc: remove address space overrides using set_fs() powerpc: use non-set_fs based maccess routines x86: remove address space overrides using set_fs() x86: make TASK_SIZE_MAX usable from assembly code x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h lkdtm: remove set_fs-based tests test_bitmap: remove user bitmap tests uaccess: add infrastructure for kernel builds with set_fs() fs: don't allow splice read/write without explicit ops fs: don't allow kernel reads and writes without iter ops sysctl: Convert to iter interfaces proc: add a read_iter method to proc proc_ops proc: cleanup the compat vs no compat file ops proc: remove a level of indentation in proc_get_inode
| * | fs: Allow a NULL pos pointer to __kernel_readMatthew Wilcox (Oracle)2020-10-151-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | Match the behaviour of new_sync_read() and __kernel_write(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | fs: Allow a NULL pos pointer to __kernel_writeMatthew Wilcox (Oracle)2020-10-151-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Linus prefers that callers be allowed to pass in a NULL pointer for ppos like new_sync_write(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | fs: don't allow splice read/write without explicit opsChristoph Hellwig2020-09-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | default_file_splice_write is the last piece of generic code that uses set_fs to make the uaccess routines operate on kernel pointers. It implements a "fallback loop" for splicing from files that do not actually provide a proper splice_read method. The usual file systems and other high bandwidth instances all provide a ->splice_read, so this just removes support for various device drivers and procfs/debugfs files. If splice support for any of those turns out to be important it can be added back by switching them to the iter ops and using generic_file_splice_read. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | fs: don't allow kernel reads and writes without iter opsChristoph Hellwig2020-09-081-25/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't allow calling ->read or ->write with set_fs as a preparation for killing off set_fs. All the instances that we use kernel_read/write on are using the iter ops already. If a file has both the regular ->read/->write methods and the iter variants those could have different semantics for messed up enough drivers. Also fails the kernel access to them in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'work.iov_iter' of ↵Linus Torvalds2020-10-121-336/+26
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat iovec cleanups from Al Viro: "Christoph's series around import_iovec() and compat variant thereof" * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: security/keys: remove compat_keyctl_instantiate_key_iov mm: remove compat_process_vm_{readv,writev} fs: remove compat_sys_vmsplice fs: remove the compat readv/writev syscalls fs: remove various compat readv/writev helpers iov_iter: transparently handle compat iovecs in import_iovec iov_iter: refactor rw_copy_check_uvector and import_iovec iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c compat.h: fix a spelling error in <linux/compat.h>
| * | fs: remove the compat readv/writev syscallsChristoph Hellwig2020-10-031-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | Now that import_iovec handles compat iovecs, the native readv and writev syscalls can be used for the compat case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | fs: remove various compat readv/writev helpersChristoph Hellwig2020-10-031-149/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | Now that import_iovec handles compat iovecs as well, all the duplicated code in the compat readv/writev helpers is not needed. Remove them and switch the compat syscall handlers to use the native helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | iov_iter: transparently handle compat iovecs in import_iovecChristoph Hellwig2020-10-031-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use in compat_syscall to import either native or the compat iovecs, and remove the now superflous compat_import_iovec. This removes the need for special compat logic in most callers, and the remaining ones can still be simplified by using __import_iovec with a bool compat parameter. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | iov_iter: move rw_copy_check_uvector() into lib/iov_iter.cDavid Laight2020-09-251-179/+0
| |/ | | | | | | | | | | | | | | | | This lets the compiler inline it into import_iovec() generating much better code. Signed-off-by: David Laight <david.laight@aculab.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* / autofs: use __kernel_write() for the autofs pipe writingLinus Torvalds2020-09-291-0/+8
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | autofs got broken in some configurations by commit 13c164b1a186 ("autofs: switch to kernel_write") because there is now an extra LSM permission check done by security_file_permission() in rw_verify_area(). autofs is one if the few places that really does want the much more limited __kernel_write(), because the write is an internal kernel one that shouldn't do any user permission checks (it also doesn't need the file_start_write/file_end_write logic, since it's just a pipe). There are a couple of other cases like that - accounting, core dumping, and splice - but autofs stands out because it can be built as a module. As a result, we need to export this internal __kernel_write() function again. We really don't want any other module to use this, but we don't have a "EXPORT_SYMBOL_FOR_AUTOFS_ONLY()". But we can mark it GPL-only to at least approximate that "internal use only" for licensing. While in this area, make autofs pass in NULL for the file position pointer, since it's always a pipe, and we now use a NULL file pointer for streaming file descriptors (see file_ppos() and commit 438ab720c675: "vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files") This effectively reverts commits 9db977522449 ("fs: unexport __kernel_write") and 13c164b1a186 ("autofs: switch to kernel_write"). Fixes: 13c164b1a186 ("autofs: switch to kernel_write") Reported-by: Ondrej Mosnacek <omosnace@redhat.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* initrd: switch initrd loading to struct file based APIsChristoph Hellwig2020-07-301-1/+1
| | | | | | | | | | | | | There is no good reason to mess with file descriptors from in-kernel code, switch the initrd loading to struct file based read and writes instead. Also Pass an explicit offset instead of ->f_pos, and to make that easier, use file scope file structs and offsets everywhere except for identify_ramdisk_image instead of the current strange mix. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: remove __vfs_readChristoph Hellwig2020-07-081-22/+21
| | | | | | Fold it into the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de>
* fs: implement kernel_read using __kernel_readChristoph Hellwig2020-07-081-8/+5
| | | | | | | | | Consolidate the two in-kernel read helpers to make upcoming changes easier. The only difference are the missing call to rw_verify_area in kernel_read, and an access_ok check that doesn't make sense for kernel buffers to start with. Signed-off-by: Christoph Hellwig <hch@lst.de>
* fs: add a __kernel_read helperChristoph Hellwig2020-07-081-0/+23
| | | | | | | This is the counterpart to __kernel_write, and skip the rw_verify_area call compared to kernel_read. Signed-off-by: Christoph Hellwig <hch@lst.de>
* fs: remove __vfs_writeChristoph Hellwig2020-07-081-24/+22
| | | | | | Fold it into the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de>
* fs: implement kernel_write using __kernel_writeChristoph Hellwig2020-07-081-8/+9
| | | | | | | | | Consolidate the two in-kernel write helpers to make upcoming changes easier. The only difference are the missing call to rw_verify_area in kernel_write, and an access_ok check that doesn't make sense for kernel buffers to start with. Signed-off-by: Christoph Hellwig <hch@lst.de>
* fs: check FMODE_WRITE in __kernel_writeChristoph Hellwig2020-07-081-0/+2
| | | | | | | | | Add a WARN_ON_ONCE if the file isn't actually open for write. This matches the check done in vfs_write, but actually warn warns as a kernel user calling write on a file not opened for writing is a pretty obvious programming error. Signed-off-by: Christoph Hellwig <hch@lst.de>
* fs: unexport __kernel_writeChristoph Hellwig2020-07-081-1/+0
| | | | | | | This is a very special interface that skips sb_writes protection, and not used by modules anymore. Signed-off-by: Christoph Hellwig <hch@lst.de>
* powerpc: Add back __ARCH_WANT_SYS_LLSEEK macroMichal Suchanek2020-04-031-1/+2
| | | | | | | | | | | | | | | | | | This partially reverts commit caf6f9c8a326 ("asm-generic: Remove unneeded __ARCH_WANT_SYS_LLSEEK macro") When CONFIG_COMPAT is disabled on ppc64 the kernel does not build. There is resistance to both removing the llseek syscall from the 64bit syscall tables and building the llseek interface unconditionally. Signed-off-by: Michal Suchanek <msuchanek@suse.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/lkml/20190828151552.GA16855@infradead.org/ Link: https://lore.kernel.org/lkml/20190829214319.498c7de2@naga/ Link: https://lore.kernel.org/r/dd4575c51e31766e87f7e7fa121d099ab78d3290.1584699455.git.msuchanek@suse.de
* Merge tag 'ovl-update-5.6' of ↵Linus Torvalds2020-02-041-0/+56
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: - Try to preserve holes in sparse files when copying up, thus saving disk space and improving performance. - Fix a performance regression introduced in v4.19 by preserving asynchronicity of IO when fowarding to underlying layers. Add VFS helpers to submit async iocbs. - Fix a regression in lseek(2) introduced in v4.19 that breaks >2G seeks on 32bit kernels. - Fix a corner case where st_ino/st_dev was not preserved across copy up. - Miscellaneous fixes and cleanups. * tag 'ovl-update-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix lseek overflow on 32bit ovl: add splice file read write helper ovl: implement async IO routines vfs: add vfs_iocb_iter_[read|write] helper functions ovl: layer is const ovl: fix corner case of non-constant st_dev;st_ino ovl: fix corner case of conflicting lower layer uuid ovl: generalize the lower_fs[] array ovl: simplify ovl_same_sb() helper ovl: generalize the lower_layers[] array ovl: improving copy-up efficiency for big sparse file ovl: use ovl_inode_lock in ovl_llseek() ovl: use pr_fmt auto generate prefix ovl: fix wrong WARN_ON() in ovl_cache_update_ino()
| * vfs: add vfs_iocb_iter_[read|write] helper functionsJiufei Xue2020-01-241-0/+56
| | | | | | | | | | | | | | | | This doesn't cause any behavior changes and will be used by overlay async IO implementation. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | fs: allow deduplication of eof block into the end of the destination fileFilipe Manana2020-01-231-6/+4
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We always round down, to a multiple of the filesystem's block size, the length to deduplicate at generic_remap_check_len(). However this is only needed if an attempt to deduplicate the last block into the middle of the destination file is requested, since that leads into a corruption if the length of the source file is not block size aligned. When an attempt to deduplicate the last block into the end of the destination file is requested, we should allow it because it is safe to do it - there's no stale data exposure and we are prepared to compare the data ranges for a length not aligned to the block (or page) size - in fact we even do the data compare before adjusting the deduplication length. After btrfs was updated to use the generic helpers from VFS (by commit 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")) we started to have user reports of deduplication not reflinking the last block anymore, and whence users getting lower deduplication scores. The main use case is deduplication of entire files that have a size not aligned to the block size of the filesystem. We already allow cloning the last block to the end (and beyond) of the destination file, so allow for deduplication as well. Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/ CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* vfs: fix page locking deadlocks when deduping filesDarrick J. Wong2019-08-161-8/+41
| | | | | | | | | | | | | | | When dedupe wants to use the page cache to compare parts of two files for dedupe, we must be very careful to handle locking correctly. The current code doesn't do this. It must lock and unlock the page only once if the two pages are the same, since the overlapping range check doesn't catch this when blocksize < pagesize. If the pages are distinct but from the same file, we must observe page locking order and lock them in order of increasing offset to avoid clashing with writeback locking. Fixes: 876bec6f9bbfcb3 ("vfs: refactor clone/dedupe_file_range common functions") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* vfs: allow copy_file_range to copy across devicesAmir Goldstein2019-06-091-6/+12
| | | | | | | | | | | | | | | | | | | | | | | We want to enable cross-filesystem copy_file_range functionality where possible, so push the "same superblock only" checks down to the individual filesystem callouts so they can make their own decisions about cross-superblock copy offload and fallack to generic_copy_file_range() for cross-superblock copy. [Amir] We do not call ->remap_file_range() in case the files are not on the same sb and do not call ->copy_file_range() in case the files do not belong to the same filesystem driver. This changes behavior of the copy_file_range(2) syscall, which will now allow cross filesystem in-kernel copy. CIFS already supports cross-superblock copy, between two shares to the same server. This functionality will now be available via the copy_file_range(2) syscall. Cc: Steve French <stfrench@microsoft.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* vfs: introduce file_modified() helperAmir Goldstein2019-06-091-18/+3
| | | | | | | | | | | | | | | | The combination of file_remove_privs() and file_update_mtime() is quite common in filesystem ->write_iter() methods. Modelled after the helper file_accessed(), introduce file_modified() and use it from generic_remap_file_range_prep(). Note that the order of calling file_remove_privs() before file_update_mtime() in the helper was matched to the more common order by filesystems and not the current order in generic_remap_file_range_prep(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* vfs: add missing checks to copy_file_rangeAmir Goldstein2019-06-091-1/+2
| | | | | | | | | | | | | | | | | Like the clone and dedupe interfaces we've recently fixed, the copy_file_range() implementation is missing basic sanity, limits and boundary condition tests on the parameters that are passed to it from userspace. Create a new "generic_copy_file_checks()" function modelled on the generic_remap_checks() function to provide this missing functionality. [Amir] Shorten copy length instead of checking pos_in limits because input file size already abides by the limits. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* vfs: introduce generic_file_rw_checks()Amir Goldstein2019-06-091-27/+11
| | | | | | | | | | Factor out helper with some checks on in/out file that are common to clone_file_range and copy_file_range. Suggested-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* vfs: no fallback for ->copy_file_rangeDave Chinner2019-06-091-9/+16
| | | | | | | | | | | | | | | | | | | | | Now that we have generic_copy_file_range(), remove it as a fallback case when offloads fail. This puts the responsibility for executing fallbacks on the filesystems that implement ->copy_file_range and allows us to add operational validity checks to generic_copy_file_range(). Rework vfs_copy_file_range() to call a new do_copy_file_range() helper to execute the copying callout, and move calls to generic_file_copy_range() into filesystem methods where they currently return failures. [Amir] overlayfs is not responsible of executing the fallback. It is the responsibility of the underlying filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* vfs: introduce generic_copy_file_range()Dave Chinner2019-06-091-3/+32
| | | | | | | | | | | | | | | | | Right now if vfs_copy_file_range() does not use any offload mechanism, it falls back to calling do_splice_direct(). This fails to do basic sanity checks on the files being copied. Before we start adding this necessarily functionality to the fallback path, separate it out into generic_copy_file_range(). generic_copy_file_range() has the same prototype as ->copy_file_range() so that filesystems can use it in their custom ->copy_file_range() method if they so choose. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM filesKirill Smelkov2019-05-061-46/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This amends commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") in how position is passed into .read()/.write() handler for stream-like files: Rasmus noticed that we currently pass 0 as position and ignore any position change if that is done by a file implementation. This papers over bugs if ppos is used in files that declare themselves as being stream-like as such bugs will go unnoticed. Even if a file implementation is correctly converted into using stream_open, its read/write later could be changed to use ppos and even though that won't be working correctly, that bug might go unnoticed without someone doing wrong behaviour analysis. It is thus better to pass ppos=NULL into read/write for stream-like files as that don't give any chance for ppos usage bugs because it will oops if ppos is ever used inside .read() or .write(). Note 1: rw_verify_area, new_sync_{read,write} needs to be updated because they are called by vfs_read/vfs_write & friends before file_operations .read/.write . Note 2: if file backend uses new-style .read_iter/.write_iter, position is still passed into there as non-pointer kiocb.ki_pos . Currently stream_open.cocci (semantic patch added by 10dce8af3422) ignores files whose file_operations has *_iter methods. Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
* fs: stream_open - opener for stream-like files so that read and write can ↵Kirill Smelkov2019-04-061-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | run simultaneously without deadlock Commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") added locking for file.f_pos access and in particular made concurrent read and write not possible - now both those functions take f_pos lock for the whole run, and so if e.g. a read is blocked waiting for data, write will deadlock waiting for that read to complete. This caused regression for stream-like files where previously read and write could run simultaneously, but after that patch could not do so anymore. See e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") which fixes such regression for particular case of /proc/xen/xenbus. The patch that added f_pos lock in 2014 did so to guarantee POSIX thread safety for read/write/lseek and added the locking to file descriptors of all regular files. In 2014 that thread-safety problem was not new as it was already discussed earlier in 2006. However even though 2006'th version of Linus's patch was adding f_pos locking "only for files that are marked seekable with FMODE_LSEEK (thus avoiding the stream-like objects like pipes and sockets)", the 2014 version - the one that actually made it into the tree as 9c225f2655e3 - is doing so irregardless of whether a file is seekable or not. See https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/ https://lwn.net/Articles/180387 https://lwn.net/Articles/180396 for historic context. The reason that it did so is, probably, that there are many files that are marked non-seekable, but e.g. their read implementation actually depends on knowing current position to correctly handle the read. Some examples: kernel/power/user.c snapshot_read fs/debugfs/file.c u32_array_read fs/fuse/control.c fuse_conn_waiting_read + ... drivers/hwmon/asus_atk0110.c atk_debugfs_ggrp_read arch/s390/hypfs/inode.c hypfs_read_iter ... Despite that, many nonseekable_open users implement read and write with pure stream semantics - they don't depend on passed ppos at all. And for those cases where read could wait for something inside, it creates a situation similar to xenbus - the write could be never made to go until read is done, and read is waiting for some, potentially external, event, for potentially unbounded time -> deadlock. Besides xenbus, there are 14 such places in the kernel that I've found with semantic patch (see below): drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write() drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write() drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write() drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write() net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write() drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write() drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write() drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write() net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write() drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write() drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write() drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write() drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write() drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write() In addition to the cases above another regression caused by f_pos locking is that now FUSE filesystems that implement open with FOPEN_NONSEEKABLE flag, can no longer implement bidirectional stream-like files - for the same reason as above e.g. read can deadlock write locking on file.f_pos in the kernel. FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f715 ("fuse: implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and write routines not depending on current position at all, and with both read and write being potentially blocking operations: See https://github.com/libfuse/osspd https://lwn.net/Articles/308445 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477 https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510 Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as "somewhat pipe-like files ..." with read handler not using offset. However that test implements only read without write and cannot exercise the deadlock scenario: https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163 https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216 I've actually hit the read vs write deadlock for real while implementing my FUSE filesystem where there is /head/watch file, for which open creates separate bidirectional socket-like stream in between filesystem and its user with both read and write being later performed simultaneously. And there it is semantically not easy to split the stream into two separate read-only and write-only channels: https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169 Let's fix this regression. The plan is: 1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS - doing so would break many in-kernel nonseekable_open users which actually use ppos in read/write handlers. 2. Add stream_open() to kernel to open stream-like non-seekable file descriptors. Read and write on such file descriptors would never use nor change ppos. And with that property on stream-like files read and write will be running without taking f_pos lock - i.e. read and write could be running simultaneously. 3. With semantic patch search and convert to stream_open all in-kernel nonseekable_open users for which read and write actually do not depend on ppos and where there is no other methods in file_operations which assume @offset access. 4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via steam_open if that bit is present in filesystem open reply. It was tempting to change fs/fuse/ open handler to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. 5. Add stream_open and FOPEN_STREAM handling to stable kernels starting from v3.14+ (the kernel where 9c225f2655 first appeared). This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in their open handler and this way avoid the deadlock on all kernel versions. This should work because fs/fuse/ ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. This patch adds stream_open, converts /proc/xen/xenbus to it and adds semantic patch to automatically locate in-kernel places that are either required to be converted due to read vs write deadlock, or that are just safe to be converted because read and write do not use ppos and there are no other funky methods in file_operations. Regarding semantic patch I've verified each generated change manually - that it is correct to convert - and each other nonseekable_open instance left - that it is either not correct to convert there, or that it is not converted due to current stream_open.cocci limitations. The script also does not convert files that should be valid to convert, but that currently have .llseek = noop_llseek or generic_file_llseek for unknown reason despite file being opened with nonseekable_open (e.g. drivers/input/mousedev.c) Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Yongzhi Pan <panyongzhi@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Juergen Gross <jgross@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Tejun Heo <tj@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Nikolaus Rath <Nikolaus@rath.org> Cc: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'work.misc' of ↵Linus Torvalds2019-03-121-2/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted fixes (really no common topic here)" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Make __vfs_write() static vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1 pipe: stop using ->can_merge splice: don't merge into linked buffers fs: move generic stat response attr handling to vfs_getattr_nosec orangefs: don't reinitialize result_mask in ->getattr fs/devpts: always delete dcache dentry-s in dput()
| * vfs: Make __vfs_write() staticGeert Uytterhoeven2019-02-221-2/+2
| | | | | | | | | | | | | | | | | | __vfs_write() was unexported, and removed from <linux/fs.h>, but forgotten to be made static. Fixes: eb031849d52e61d2 ("fs: unexport __vfs_read/__vfs_write") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1Aurelien Jarno2019-02-161-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The preadv2 and pwritev2 syscalls are supposed to emulate the readv and writev syscalls when offset == -1. Therefore the compat code should check for offset before calling do_compat_preadv64 and do_compat_pwritev64. This is the case for the preadv2 and pwritev2 syscalls, but handling of offset == -1 is missing in their 64-bit equivalent. This patch fixes that, calling do_compat_readv and do_compat_writev when offset == -1. This fixes the following glibc tests on x32: - misc/tst-preadvwritev2 - misc/tst-preadvwritev64v2 Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | get rid of legacy 'get_ds()' functionLinus Torvalds2019-03-041-3/+3
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every in-kernel use of this function defined it to KERNEL_DS (either as an actual define, or as an inline function). It's an entirely historical artifact, and long long long ago used to actually read the segment selector valueof '%ds' on x86. Which in the kernel is always KERNEL_DS. Inspired by a patch from Jann Horn that just did this for a very small subset of users (the ones in fs/), along with Al who suggested a script. I then just took it to the logical extreme and removed all the remaining gunk. Roughly scripted with git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/' git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d' plus manual fixups to remove a few unusual usage patterns, the couple of inline function cases and to fix up a comment that had become stale. The 'get_ds()' function remains in an x86 kvm selftest, since in user space it actually does something relevant. Inspired-by: Jann Horn <jannh@google.com> Inspired-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Remove 'type' argument from access_ok() functionLinus Torvalds2019-01-031-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand. It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact. A discussion about extending 'user_access_begin()' to do the range checking resulted this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all. This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form. There were a couple of notable cases: - csky still had the old "verify_area()" name as an alias. - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it) - microblaze used the type argument for a debug printout but other than those oddities this should be a total no-op patch. I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: allow some remap flags to be passed to vfs_clone_file_rangeDarrick J. Wong2018-12-041-1/+1
| | | | | | | | | | | | | In overlayfs, ovl_remap_file_range calls vfs_clone_file_range on the lower filesystem's inode, passing through whatever remap flags it got from its caller. Since vfs_copy_file_range first tries a filesystem's remap function with REMAP_FILE_CAN_SHORTEN, this can get passed through to the second vfs_copy_file_range call, and this isn't an issue. Change the WARN_ON to look only for the DEDUP flag. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPPDave Chinner2018-11-211-8/+7
| | | | | | | | | | | | | | It returns EINVAL when the operation is not supported by the filesystem. Fix it to return EOPNOTSUPP to be consistent with the man page and clone_file_range(). Clean up the inconsistent error return handling while I'm there. (I know, lipstick on a pig, but every little bit helps...) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2018-11-021-179/+226
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull vfs dedup fixes from Dave Chinner: "This reworks the vfs data cloning infrastructure. We discovered many issues with these interfaces late in the 4.19 cycle - the worst of them (data corruption, setuid stripping) were fixed for XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all the problems was needed. That rework is the contents of this pull request. Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use a common .remap_file_range method and supply generic bounds and sanity checking functions that are shared with the data write path. The current VFS infrastructure has problems with rlimit, LFS file sizes, file time stamps, maximum filesystem file sizes, stripping setuid bits, etc and so they are addressed in these commits. We also introduce the ability for the ->remap_file_range methods to return short clones so that clones for vfs_copy_file_range() don't get rejected if the entire range can't be cloned. It also allows filesystems to sliently skip deduplication of partial EOF blocks if they are not capable of doing so without requiring errors to be thrown to userspace. Existing filesystems are converted to user the new remap_file_range method, and both XFS and ocfs2 are modified to make use of the new generic checking infrastructure" * tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits) xfs: remove [cm]time update from reflink calls xfs: remove xfs_reflink_remap_range xfs: remove redundant remap partial EOF block checks xfs: support returning partial reflink results xfs: clean up xfs_reflink_remap_blocks call site xfs: fix pagecache truncation prior to reflink ocfs2: remove ocfs2_reflink_remap_range ocfs2: support partial clone range and dedupe range ocfs2: fix pagecache truncation prior to reflink ocfs2: truncate page cache for clone destination file before remapping vfs: clean up generic_remap_file_range_prep return value vfs: hide file range comparison function vfs: enable remap callers that can handle short operations vfs: plumb remap flags through the vfs dedupe functions vfs: plumb remap flags through the vfs clone functions vfs: make remap_file_range functions take and return bytes completed vfs: remap helper should update destination inode metadata vfs: pass remap flags to generic_remap_checks vfs: pass remap flags to generic_remap_file_range_prep vfs: combine the clone and dedupe into a single remap_file_range ...
| * vfs: clean up generic_remap_file_range_prep return valueDarrick J. Wong2018-10-301-3/+3
| | | | | | | | | | | | | | | | | | | | | | Since the remap prep function can update the length of the remap request, we can change this function to return the usual return status instead of the odd behavior it has now. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * vfs: hide file range comparison functionDarrick J. Wong2018-10-301-96/+91
| | | | | | | | | | | | | | | | | | | | | | There are no callers of vfs_dedupe_file_range_compare, so we might as well make it a static helper and remove the export. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * vfs: enable remap callers that can handle short operationsDarrick J. Wong2018-10-301-8/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Plumb in a remap flag that enables the filesystem remap handler to shorten remapping requests for callers that can handle it. Now copy_file_range can report partial success (in case we run up against alignment problems, resource limits, etc.). We also enable CAN_SHORTEN for fideduperange to maintain existing userspace-visible behavior where xfs/btrfs shorten the dedupe range to avoid stale post-eof data exposure. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * vfs: plumb remap flags through the vfs dedupe functionsDarrick J. Wong2018-10-301-3/+6
| | | | | | | | | | | | | | | | | | | | Plumb a remap_flags argument through the vfs_dedupe_file_range_one functions so that dedupe can take advantage of it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * vfs: plumb remap flags through the vfs clone functionsDarrick J. Wong2018-10-301-4/+9
| | | | | | | | | | | | | | | | | | | | Plumb a remap_flags argument through the {do,vfs}_clone_file_range functions so that clone can take advantage of it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * vfs: make remap_file_range functions take and return bytes completedDarrick J. Wong2018-10-301-22/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the remap_file_range functions to take a number of bytes to operate upon and return the number of bytes they operated on. This is a requirement for allowing fs implementations to return short clone/dedupe results to the user, which will enable us to obey resource limits in a graceful manner. A subsequent patch will enable copy_file_range to signal to the ->clone_file_range implementation that it can handle a short length, which will be returned in the function's return value. For now the short return is not implemented anywhere so the behavior won't change -- either copy_file_range manages to clone the entire range or it tries an alternative. Neither clone ioctl can take advantage of this, alas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * vfs: remap helper should update destination inode metadataDarrick J. Wong2018-10-301-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | Extend generic_remap_file_range_prep to handle inode metadata updates when remapping into a file. If the operation can possibly alter the file contents, we must update the ctime and mtime and remove security privileges, just like we do for regular file writes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Dave Chinner <david@fromorbit.com>