summaryrefslogtreecommitdiffstats
path: root/fs/fuse
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'fuse-update-5.15' of ↵Linus Torvalds2021-09-076-80/+214
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Allow mounting an active fuse device. Previously the fuse device would always be mounted during initialization, and sharing a fuse superblock was only possible through mount or namespace cloning - Fix data flushing in syncfs (virtiofs only) - Fix data flushing in copy_file_range() - Fix a possible deadlock in atomic O_TRUNC - Misc fixes and cleanups * tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: remove unused arg in fuse_write_file_get() fuse: wait for writepages in syncfs fuse: flush extending writes fuse: truncate pagecache on atomic_o_trunc fuse: allow sharing existing sb fuse: move fget() to fuse_get_tree() fuse: move option checking into fuse_fill_super() fuse: name fs_context consistently fuse: fix use after free in fuse_read_interrupt()
| * fuse: remove unused arg in fuse_write_file_get()Miklos Szeredi2021-09-061-9/+6
| | | | | | | | | | | | The struct fuse_conn argument is not used and can be removed. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: wait for writepages in syncfsMiklos Szeredi2021-09-063-0/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of fuse the MM subsystem doesn't guarantee that page writeback completes by the time ->sync_fs() is called. This is because fuse completes page writeback immediately to prevent DoS of memory reclaim by the userspace file server. This means that fuse itself must ensure that writes are synced before sending the SYNCFS request to the server. Introduce sync buckets, that hold a counter for the number of outstanding write requests. On syncfs replace the current bucket with a new one and wait until the old bucket's counter goes down to zero. It is possible to have multiple syncfs calls in parallel, in which case there could be more than one waited-on buckets. Descendant buckets must not complete until the parent completes. Add a count to the child (new) bucket until the (parent) old bucket completes. Use RCU protection to dereference the current bucket and to wake up an emptied bucket. Use fc->lock to protect against parallel assignments to the current bucket. This leaves just the counter to be a possible scalability issue. The fc->num_waiting counter has a similar issue, so both should be addressed at the same time. Reported-by: Amir Goldstein <amir73il@gmail.com> Fixes: 2d82ab251ef0 ("virtiofs: propagate sync() to file server") Cc: <stable@vger.kernel.org> # v5.14 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: flush extending writesMiklos Szeredi2021-08-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Callers of fuse_writeback_range() assume that the file is ready for modification by the server in the supplied byte range after the call returns. If there's a write that extends the file beyond the end of the supplied range, then the file needs to be extended to at least the end of the range, but currently that's not done. There are at least two cases where this can cause problems: - copy_file_range() will return short count if the file is not extended up to end of the source range. - FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file, hence the region may not be fully allocated. Fix by flushing writes from the start of the range up to the end of the file. This could be optimized if the writes are non-extending, etc, but it's probably not worth the trouble. Fixes: a2bc92362941 ("fuse: fix copy_file_range() in the writeback case") Fixes: 6b1bdb56b17c ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)") Cc: <stable@vger.kernel.org> # v5.2 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: truncate pagecache on atomic_o_truncMiklos Szeredi2021-08-171-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic O_TRUNC. This can deadlock with fuse_wait_on_page_writeback() in fuse_launder_page() triggered by invalidate_inode_pages2(). Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a truncate_pagecache() call. This makes sense regardless of FOPEN_KEEP_CACHE or fc->writeback cache, so do it unconditionally. Reported-by: Xie Yongji <xieyongji@bytedance.com> Reported-and-tested-by: syzbot+bea44a5189836d956894@syzkaller.appspotmail.com Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: allow sharing existing sbMiklos Szeredi2021-08-051-1/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make it possible to create a new mount from a already working server. Here's a detailed description of the problem from Jakob: "The background for this question is occasional problems we see with our fuse filesystem [1] and mount namespaces. On a usual client, we have system-wide, autofs managed mountpoints. When a new mount namespace is created (which can be done unprivileged in combination with user namespaces), it can happen that a mountpoint is used inside the new namespace but idle in the root mount namespace. So autofs unmounts the parent, system-wide mountpoint. But the fuse module stays active and still serves mountpoint in the child mount namespace. Because the fuse daemon also blocks other system wide resources corresponding to the mountpoint, this situation effectively prevents new mounts until the child mount namespaces closes. [1] https://github.com/cvmfs/cvmfs" Reported-by: Jakob Blomer <jblomer@cern.ch> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: move fget() to fuse_get_tree()Miklos Szeredi2021-08-052-23/+22
| | | | | | | | | | | | | | | | | | | | | | | | Affected call chains: fuse_get_tree -> get_tree_(bdev|nodev) -> fuse_fill_super Needed for following patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: move option checking into fuse_fill_super()Miklos Szeredi2021-08-041-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | Checking whether the "fd=", "rootmode=", "user_id=" and "group_id=" mount options are present can be moved from fuse_get_tree() into fuse_fill_super() where the value of the options are consumed. This relaxes semantics of reusing a fuse blockdev mount using the device name. Before this patch presence of these options were enforced but values ignored, after this patch these options are completely ignored in this case. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: name fs_context consistentlyMiklos Szeredi2021-08-043-41/+41
| | | | | | | | | | | | | | | | | | Naming convention under fs/fuse/: struct fuse_conn *fc; struct fs_context *fsc; Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: fix use after free in fuse_read_interrupt()Miklos Szeredi2021-08-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a potential race between fuse_read_interrupt() and fuse_request_end(). TASK1 in fuse_read_interrupt(): delete req->intr_entry (while holding fiq->lock) TASK2 in fuse_request_end(): req->intr_entry is empty -> skip fiq->lock wake up TASK3 TASK3 request is freed TASK1 in fuse_read_interrupt(): dereference req->in.h.unique ***BAM*** Fix by always grabbing fiq->lock if the request was ever interrupted (FR_INTERRUPTED set) thereby serializing with concurrent fuse_read_interrupt() calls. FR_INTERRUPTED is set before the request is queued on fiq->interrupts. Dequeing the request is done with list_del_init() but FR_INTERRUPTED is not cleared in this case. Reported-by: lijiazi <lijiazi@xiaomi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | Merge tag 'ovl-update-5.15' of ↵Linus Torvalds2021-09-022-2/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: - Copy up immutable/append/sync/noatime attributes (Amir Goldstein) - Improve performance by enabling RCU lookup. - Misc fixes and improvements The reason this touches so many files is that the ->get_acl() method now gets a "bool rcu" argument. The ->get_acl() API was updated based on comments from Al and Linus: Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/ * tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: enable RCU'd ->get_acl() vfs: add rcu argument to ->get_acl() callback ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup() ovl: use kvalloc in xattr copy-up ovl: update ctime when changing fileattr ovl: skip checking lower file's i_writecount on truncate ovl: relax lookup error on mismatch origin ftype ovl: do not set overlay.opaque for new directories ovl: add ovl_allow_offline_changes() helper ovl: disable decoding null uuid with redirect_dir ovl: consistent behavior for immutable/append-only inodes ovl: copy up sync/noatime fileattr flags ovl: pass ovl_fs to ovl_check_setxattr() fs: add generic helper for filling statx attribute flags
| * | vfs: add rcu argument to ->get_acl() callbackMiklos Szeredi2021-08-182-2/+5
| | | | | | | | | | | | | | | | | | | | | Add a rcu argument to the ->get_acl() callback to allow get_cached_acl_rcu() to call the ->get_acl() method in the next patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | | Merge tag 'hole_punch_for_v5.15-rc1' of ↵Linus Torvalds2021-08-305-44/+35
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fs hole punching vs cache filling race fixes from Jan Kara: "Fix races leading to possible data corruption or stale data exposure in multiple filesystems when hole punching races with operations such as readahead. This is the series I was sending for the last merge window but with your objection fixed - now filemap_fault() has been modified to take invalidate_lock only when we need to create new page in the page cache and / or bring it uptodate" * tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: filesystems/locking: fix Malformed table warning cifs: Fix race between hole punch and page fault ceph: Fix race between hole punch and page fault fuse: Convert to using invalidate_lock f2fs: Convert to using invalidate_lock zonefs: Convert to using invalidate_lock xfs: Convert double locking of MMAPLOCK to use VFS helpers xfs: Convert to use invalidate_lock xfs: Refactor xfs_isilocked() ext2: Convert to using invalidate_lock ext4: Convert to use mapping->invalidate_lock mm: Add functions to lock invalidate_lock for two mappings mm: Protect operations adding pages to page cache with invalidate_lock documentation: Sync file_operations members with reality mm: Fix comments mentioning i_mutex
| * | fuse: Convert to using invalidate_lockJan Kara2021-07-135-44/+35
| |/ | | | | | | | | | | | | | | | | | | | | Use invalidate_lock instead of fuse's private i_mmap_sem. The intended purpose is exactly the same. By this conversion we fix a long standing race between hole punching and read(2) / readahead(2) paths that can lead to stale page cache contents. CC: Miklos Szeredi <miklos@szeredi.hu> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
* | Merge branch 'for-5.14/dax' into libnvdimm-fixesDan Williams2021-08-111-4/+2
|\ \ | |/ |/| | | | | Pick up some small dax cleanups that make some of Ira's follow on work easier.
| * fs/fuse: Remove unneeded kaddr parameterIra Weiny2021-07-071-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fuse_dax_mem_range_init() does not need the address or the pfn of the memory requested in dax_direct_access(). It is only calling direct access to get the number of pages. Remove the unused variables and stop requesting the kaddr and pfn from dax_direct_access(). Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/r/20210525172428.3634316-2-ira.weiny@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | Merge tag 'fuse-update-5.14' of ↵Linus Torvalds2021-07-068-80/+152
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fixes for virtiofs submounts - Misc fixes and cleanups * tag 'fuse-update-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtiofs: Fix spelling mistakes fuse: use DIV_ROUND_UP helper macro for calculations fuse: fix illegal access to inode with reused nodeid fuse: allow fallocate(FALLOC_FL_ZERO_RANGE) fuse: Make fuse_fill_super_submount() static fuse: Switch to fc_mount() for submounts fuse: Call vfs_get_tree() for submounts fuse: add dedicated filesystem context ops for submounts virtiofs: propagate sync() to file server fuse: reject internal errno fuse: check connected before queueing on fpq->io fuse: ignore PG_workingset after stealing fuse: Fix infinite loop in sget_fc() fuse: Fix crash if superblock of submount gets killed early fuse: Fix crash in fuse_dentry_automount() error path
| * | virtiofs: Fix spelling mistakesZheng Yongjun2021-06-223-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix some spelling mistakes in comments: refernce ==> reference happnes ==> happens threhold ==> threshold splitted ==> split mached ==> matched Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: use DIV_ROUND_UP helper macro for calculationsWu Bo2021-06-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Replace open coded divisor calculations with the DIV_ROUND_UP kernel macro for better readability. Signed-off-by: Wu Bo <wubo40@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: fix illegal access to inode with reused nodeidAmir Goldstein2021-06-224-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...) with ourarg containing nodeid and generation. If a fuse inode is found in inode cache with the same nodeid but different generation, the existing fuse inode should be unhashed and marked "bad" and a new inode with the new generation should be hashed instead. This can happen, for example, with passhrough fuse filesystem that returns the real filesystem ino/generation on lookup and where real inode numbers can get recycled due to real files being unlinked not via the fuse passthrough filesystem. With current code, this situation will not be detected and an old fuse dentry that used to point to an older generation real inode, can be used to access a completely new inode, which should be accessed only via the new dentry. Note that because the FORGET message carries the nodeid w/o generation, the server should wait to get FORGET counts for the nlookup counts of the old and reused inodes combined, before it can free the resources associated to that nodeid. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)Richard W.M. Jones2021-06-221-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current fuse module filters out fallocate(FALLOC_FL_ZERO_RANGE) returning -EOPNOTSUPP. libnbd's nbdfuse would like to translate FALLOC_FL_ZERO_RANGE requests into the NBD command NBD_CMD_WRITE_ZEROES which allows NBD servers that support it to do zeroing efficiently. This commit treats this flag exactly like FALLOC_FL_PUNCH_HOLE. A way to test this, requiring fuse >= 3, nbdkit >= 1.8 and the latest nbdfuse from https://gitlab.com/nbdkit/libnbd/-/tree/master/fuse is to create a file containing some data and "mirror" it to a fuse file: $ dd if=/dev/urandom of=disk.img bs=1M count=1 $ nbdkit file disk.img $ touch mirror.img $ nbdfuse mirror.img nbd://localhost & (mirror.img -> nbdfuse -> NBD over loopback -> nbdkit -> disk.img) You can then run commands such as: $ fallocate -z -o 1024 -l 1024 mirror.img and check that the content of the original file ("disk.img") stays synchronized. To show NBD commands, export LIBNBD_DEBUG=1 before running nbdfuse. To clean up: $ fusermount3 -u mirror.img $ killall nbdkit Signed-off-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: Make fuse_fill_super_submount() staticGreg Kurz2021-06-222-11/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function used to be called from fuse_dentry_automount(). This code was moved to fuse_get_tree_submount() in the same file since then. It is unlikely there will ever be another user. No need to be extern in this case. Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: Switch to fc_mount() for submountsGreg Kurz2021-06-221-23/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fc_mount() already handles the vfs_get_tree(), sb->s_umount unlocking and vfs_create_mount() sequence. Using it greatly simplifies fuse_dentry_automount(). Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: Call vfs_get_tree() for submountsGreg Kurz2021-06-222-48/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We recently fixed an infinite loop by setting the SB_BORN flag on submounts along with the write barrier needed by super_cache_count(). This is the job of vfs_get_tree() and FUSE shouldn't have to care about the barrier at all. Split out some code from fuse_dentry_automount() to the dedicated fuse_get_tree_submount() handler for submounts and call vfs_get_tree(). Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: add dedicated filesystem context ops for submountsGreg Kurz2021-06-223-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The creation of a submount is open-coded in fuse_dentry_automount(). This brings a lot of complexity and we recently had to fix bugs because we weren't setting SB_BORN or because we were unlocking sb->s_umount before sb was fully configured. Most of these could have been avoided by using the mount API instead of open-coding. Basically, this means coming up with a proper ->get_tree() implementation for submounts and call vfs_get_tree(), or better fc_mount(). The creation of the superblock for submounts is quite different from the root mount. Especially, it doesn't require to allocate a FUSE filesystem context, nor to parse parameters. Introduce a dedicated context ops for submounts to make this clear. This is just a placeholder for now, fuse_get_tree_submount() will be populated in a subsequent patch. Only visible change is that we stop allocating/freeing a useless FUSE filesystem context with submounts. Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | virtiofs: propagate sync() to file serverGreg Kurz2021-06-223-0/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even if POSIX doesn't mandate it, linux users legitimately expect sync() to flush all data and metadata to physical storage when it is located on the same system. This isn't happening with virtiofs though: sync() inside the guest returns right away even though data still needs to be flushed from the host page cache. This is easily demonstrated by doing the following in the guest: $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync 5120+0 records in 5120+0 records out 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s sync() = 0 <0.024068> and start the following in the host when the 'dd' command completes in the guest: $ strace -T -e fsync /usr/bin/sync virtiofs/foo fsync(3) = 0 <10.371640> There are no good reasons not to honor the expected behavior of sync() actually: it gives an unrealistic impression that virtiofs is super fast and that data has safely landed on HW, which isn't the case obviously. Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS request type for this purpose. Provision a 64-bit placeholder for possible future extensions. Since the file server cannot handle the wait == 0 case, we skip it to avoid a gratuitous roundtrip. Note that this is per-superblock: a FUSE_SYNCFS is send for the root mount and for each submount. Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in the file server is treated as permanent success. This ensures compatibility with older file servers: the client will get the current behavior of sync() not being propagated to the file server. Note that such an operation allows the file server to DoS sync(). Since a typical FUSE file server is an untrusted piece of software running in userspace, this is disabled by default. Only enable it with virtiofs for now since virtiofsd is supposedly trusted by the guest kernel. Reported-by: Robert Krawitz <rlk@redhat.com> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: reject internal errnoMiklos Szeredi2021-06-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Don't allow userspace to report errors that could be kernel-internal. Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Fixes: 334f485df85a ("[PATCH] FUSE - device functions") Cc: <stable@vger.kernel.org> # v2.6.14 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: check connected before queueing on fpq->ioMiklos Szeredi2021-06-221-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A request could end up on the fpq->io list after fuse_abort_conn() has reset fpq->connected and aborted requests on that list: Thread-1 Thread-2 ======== ======== ->fuse_simple_request() ->shutdown ->__fuse_request_send() ->queue_request() ->fuse_abort_conn() ->fuse_dev_do_read() ->acquire(fpq->lock) ->wait_for(fpq->lock) ->set err to all req's in fpq->io ->release(fpq->lock) ->acquire(fpq->lock) ->add req to fpq->io After the userspace copy is done the request will be ended, but req->out.h.error will remain uninitialized. Also the copy might block despite being already aborted. Fix both issues by not allowing the request to be queued on the fpq->io list after fuse_abort_conn() has processed this list. Reported-by: Pradeep P V K <pragalla@codeaurora.org> Fixes: fd22d62ed0c3 ("fuse: no fc->lock for iqueue parts") Cc: <stable@vger.kernel.org> # v4.2 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: ignore PG_workingset after stealingMiklos Szeredi2021-06-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the "fuse: trying to steal weird page" warning. Description from Johannes Weiner: "Think of it as similar to PG_active. It's just another usage/heat indicator of file and anon pages on the reclaim LRU that, unlike PG_active, persists across deactivation and even reclaim (we store it in the page cache / swapper cache tree until the page refaults). So if fuse accepts pages that can legally have PG_active set, PG_workingset is fine too." Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com> Fixes: 1899ad18c607 ("mm: workingset: tell cache transitions from workingset thrashing") Cc: <stable@vger.kernel.org> # v4.20 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: Fix infinite loop in sget_fc()Greg Kurz2021-06-091-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't set the SB_BORN flag on submounts. This is wrong as these superblocks are then considered as partially constructed or dying in the rest of the code and can break some assumptions. One such case is when you have a virtiofs filesystem with submounts and you try to mount it again : virtio_fs_get_tree() tries to obtain a superblock with sget_fc(). The logic in sget_fc() is to loop until it has either found an existing matching superblock with SB_BORN set or to create a brand new one. It is assumed that a superblock without SB_BORN is transient and the loop is restarted. Forgetting to set SB_BORN on submounts hence causes sget_fc() to retry forever. Setting SB_BORN requires special care, i.e. a write barrier for super_cache_count() which can check SB_BORN without taking any lock. We should call vfs_get_tree() to deal with that but this requires to have a proper ->get_tree() implementation for submounts, which is a bigger piece of work. Go for a simple bug fix in the meatime. Fixes: bf109c64040f ("fuse: implement crossmounts") Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: Fix crash if superblock of submount gets killed earlyGreg Kurz2021-06-091-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As soon as fuse_dentry_automount() does up_write(&sb->s_umount), the superblock can theoretically be killed. If this happens before the submount was added to the &fc->mounts list, fuse_mount_remove() later crashes in list_del_init() because it assumes the submount to be already there. Add the submount before dropping sb->s_umount to fix the inconsistency. It is okay to nest fc->killsb under sb->s_umount, we already do this on the ->kill_sb() path. Signed-off-by: Greg Kurz <groug@kaod.org> Fixes: bf109c64040f ("fuse: implement crossmounts") Cc: stable@vger.kernel.org # v5.10+ Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: Fix crash in fuse_dentry_automount() error pathGreg Kurz2021-06-091-1/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If fuse_fill_super_submount() returns an error, the error path triggers a crash: [ 26.206673] BUG: kernel NULL pointer dereference, address: 0000000000000000 [...] [ 26.226362] RIP: 0010:__list_del_entry_valid+0x25/0x90 [...] [ 26.247938] Call Trace: [ 26.248300] fuse_mount_remove+0x2c/0x70 [fuse] [ 26.248892] virtio_kill_sb+0x22/0x160 [virtiofs] [ 26.249487] deactivate_locked_super+0x36/0xa0 [ 26.250077] fuse_dentry_automount+0x178/0x1a0 [fuse] The crash happens because fuse_mount_remove() assumes that the FUSE mount was already added to list under the FUSE connection, but this only done after fuse_fill_super_submount() has returned success. This means that until fuse_fill_super_submount() has returned success, the FUSE mount isn't actually owned by the superblock. We should thus reclaim ownership by clearing sb->s_fs_info, which will skip the call to fuse_mount_remove(), and perform rollback, like virtio_fs_get_tree() already does for the root sb. Fixes: bf109c64040f ("fuse: implement crossmounts") Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | Merge branch 'work.iov_iter' of ↵Linus Torvalds2021-07-031-3/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "iov_iter cleanups and fixes. There are followups, but this is what had sat in -next this cycle. IMO the macro forest in there became much thinner and easier to follow..." * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller clean up copy_mc_pipe_to_iter() pipe_zero(): we don't need no stinkin' kmap_atomic()... iov_iter: clean csum_and_copy_...() primitives up a bit copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases iterate_xarray(): only of the first iteration we might get offset != 0 pull handling of ->iov_offset into iterate_{iovec,bvec,xarray} iov_iter: make iterator callbacks use base and len instead of iovec iov_iter: make the amount already copied available to iterator callbacks iov_iter: get rid of separate bvec and xarray callbacks iov_iter: teach iterate_{bvec,xarray}() about possible short copies iterate_bvec(): expand bvec.h macro forest, massage a bit iov_iter: unify iterate_iovec and iterate_kvec iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec iterate_and_advance(): get rid of magic in case when n is 0 csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter() iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant [xarray] iov_iter_npages(): just use DIV_ROUND_UP() iov_iter_npages(): don't bother with iterate_all_kinds() ...
| * | iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing ↵Al Viro2021-06-101-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | variant Replacement is called copy_page_from_iter_atomic(); unlike the old primitive the callers do *not* need to do iov_iter_advance() after it. In case when they end up consuming less than they'd been given they need to do iov_iter_revert() on everything they had not consumed. That, however, needs to be done only on slow paths. All in-tree callers converted. And that kills the last user of iterate_all_kinds() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | fuse_fill_write_pages(): don't bother with iov_iter_single_seg_count()Al Viro2021-06-031-1/+0
| |/ | | | | | | | | | | | | another rudiment of fault-in originally having been limited to the first segment, same as in generic_perform_write() and friends. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | mm: move page dirtying prototypes from mm.hMatthew Wilcox (Oracle)2021-06-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions implement the address_space ->set_page_dirty operation and should live in pagemap.h, not mm.h so that the rest of the kernel doesn't get funny ideas about calling them directly. Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | fs: remove noop_set_page_dirty()Matthew Wilcox (Oracle)2021-06-291-1/+1
|/ | | | | | | | | | | | | | | | | | | Use __set_page_dirty_no_writeback() instead. This will set the dirty bit on the page, which will be used to avoid calling set_page_dirty() in the future. It will have no effect on actually writing the page back, as the pages are not on any LRU lists. [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules] Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'work.misc' of ↵Linus Torvalds2021-05-021-2/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff all over the place" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: useful constants: struct qstr for ".." hostfs_open(): don't open-code file_dentry() whack-a-mole: kill strlen_user() (again) autofs: should_expire() argument is guaranteed to be positive apparmor:match_mn() - constify devpath argument buffer: a small optimization in grow_buffers get rid of autofs_getpath() constify dentry argument of dentry_path()/dentry_path_raw()
| * useful constants: struct qstr for ".."Al Viro2021-04-151-2/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'fuse-update-5.13' of ↵Linus Torvalds2021-04-308-57/+97
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fix a page locking bug in write (introduced in 2.6.26) - Allow sgid bit to be killed in setacl() - Miscellaneous fixes and cleanups * tag 'fuse-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: cuse: simplify refcount cuse: prevent clone virtiofs: fix userns virtiofs: remove useless function virtiofs: split requests that exceed virtqueue size virtiofs: fix memory leak in virtio_fs_probe() fuse: invalidate attrs when page writeback completes fuse: add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID fuse: extend FUSE_SETXATTR request fuse: fix matching of FUSE_DEV_IOC_CLONE command fuse: fix a typo fuse: don't zero pages twice fuse: fix typo for fuse_conn.max_pages comment fuse: fix write deadlock
| * | cuse: simplify refcountMiklos Szeredi2021-04-141-7/+3
| | | | | | | | | | | | | | | | | | Put extra reference early in cuse_channel_open(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | cuse: prevent cloneMiklos Szeredi2021-04-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For cloned connections cuse_channel_release() will be called more than once, resulting in use after free. Prevent device cloning for CUSE, which does not make sense at this point, and highly unlikely to be used in real life. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | virtiofs: fix usernsMiklos Szeredi2021-04-141-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | get_user_ns() is done twice (once in virtio_fs_get_tree() and once in fuse_conn_init()), resulting in a reference leak. Also looks better to use fsc->user_ns (which *should* be the current_user_ns() at this point). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | virtiofs: remove useless functionJiapeng Chong2021-04-141-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following clang warning: fs/fuse/virtio_fs.c:130:35: warning: unused function 'vq_to_fpq' [-Wunused-function]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | virtiofs: split requests that exceed virtqueue sizeConnor Kuehl2021-04-143-3/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an incoming FUSE request can't fit on the virtqueue, the request is placed onto a workqueue so a worker can try to resubmit it later where there will (hopefully) be space for it next time. This is fine for requests that aren't larger than a virtqueue's maximum capacity. However, if a request's size exceeds the maximum capacity of the virtqueue (even if the virtqueue is empty), it will be doomed to a life of being placed on the workqueue, removed, discovered it won't fit, and placed on the workqueue yet again. Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect Descriptors) of the virtio spec: "A driver MUST NOT create a descriptor chain longer than the Queue Size of the device." To fix this, limit the number of pages FUSE will use for an overall request. This way, each request can realistically fit on the virtqueue when it is decomposed into a scattergather list and avoid violating section 2.6.5.3.1 of the virtio spec. Signed-off-by: Connor Kuehl <ckuehl@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | virtiofs: fix memory leak in virtio_fs_probe()Luis Henriques2021-04-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When accidentally passing twice the same tag to qemu, kmemleak ended up reporting a memory leak in virtiofs. Also, looking at the log I saw the following error (that's when I realised the duplicated tag): virtiofs: probe of virtio5 failed with error -17 Here's the kmemleak log for reference: unreferenced object 0xffff888103d47800 (size 1024): comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff ................ backtrace: [<000000000ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs] [<00000000f8aca419>] virtio_dev_probe+0x15f/0x210 [<000000004d6baf3c>] really_probe+0xea/0x430 [<00000000a6ceeac8>] device_driver_attach+0xa8/0xb0 [<00000000196f47a7>] __driver_attach+0x98/0x140 [<000000000b20601d>] bus_for_each_dev+0x7b/0xc0 [<00000000399c7b7f>] bus_add_driver+0x11b/0x1f0 [<0000000032b09ba7>] driver_register+0x8f/0xe0 [<00000000cdd55998>] 0xffffffffa002c013 [<000000000ea196a2>] do_one_initcall+0x64/0x2e0 [<0000000008f727ce>] do_init_module+0x5c/0x260 [<000000003cdedab6>] __do_sys_finit_module+0xb5/0x120 [<00000000ad2f48c6>] do_syscall_64+0x33/0x40 [<00000000809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae Cc: stable@vger.kernel.org Signed-off-by: Luis Henriques <lhenriques@suse.de> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem") Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: invalidate attrs when page writeback completesVivek Goyal2021-04-141-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In fuse when a direct/write-through write happens we invalidate attrs because that might have updated mtime/ctime on server and cached mtime/ctime will be stale. What about page writeback path. Looks like we don't invalidate attrs there. To be consistent, invalidate attrs in writeback path as well. Only exception is when writeback_cache is enabled. In that case we strust local mtime/ctime and there is no need to invalidate attrs. Recently users started experiencing failure of xfstests generic/080, geneirc/215 and generic/614 on virtiofs. This happened only newer "stat" utility and not older one. This patch fixes the issue. So what's the root cause of the issue. Here is detailed explanation. generic/080 test does mmap write to a file, closes the file and then checks if mtime has been updated or not. When file is closed, it leads to flushing of dirty pages (and that should update mtime/ctime on server). But we did not explicitly invalidate attrs after writeback finished. Still generic/080 passed so far and reason being that we invalidated atime in fuse_readpages_end(). This is called in fuse_readahead() path and always seems to trigger before mmaped write. So after mmaped write when lstat() is called, it sees that atleast one of the fields being asked for is invalid (atime) and that results in generating GETATTR to server and mtime/ctime also get updated and test passes. But newer /usr/bin/stat seems to have moved to using statx() syscall now (instead of using lstat()). And statx() allows it to query only ctime or mtime (and not rest of the basic stat fields). That means when querying for mtime, fuse_update_get_attr() sees that mtime is not invalid (only atime is invalid). So it does not generate a new GETATTR and fill stat with cached mtime/ctime. And that means updated mtime is not seen by xfstest and tests start failing. Invalidating attrs after writeback completion should solve this problem in a generic manner. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGIDVivek Goyal2021-04-141-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When posix access ACL is set, it can have an effect on file mode and it can also need to clear SGID if. - None of caller's group/supplementary groups match file owner group. AND - Caller is not priviliged (No CAP_FSETID). As of now fuser server is responsible for changing the file mode as well. But it does not know whether to clear SGID or not. So add a flag FUSE_SETXATTR_ACL_KILL_SGID and send this info with SETXATTR to let file server know that sgid needs to be cleared as well. Reported-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: extend FUSE_SETXATTR requestVivek Goyal2021-04-144-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | Fuse client needs to send additional information to file server when it calls SETXATTR(system.posix_acl_access), so add extra flags field to the structure. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: fix matching of FUSE_DEV_IOC_CLONE commandAlessio Balsini2021-04-141-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit f8425c939663 ("fuse: 32-bit user space ioctl compat for fuse device") the matching constraints for the FUSE_DEV_IOC_CLONE ioctl command are relaxed, limited to the testing of command type and number. As Arnd noticed, this is wrong as it wouldn't ensure the correctness of the data size or direction for the received FUSE device ioctl. Fix by bringing back the comparison of the ioctl received by the FUSE device to the originally generated FUSE_DEV_IOC_CLONE. Fixes: f8425c939663 ("fuse: 32-bit user space ioctl compat for fuse device") Reported-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Alessio Balsini <balsini@android.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>