summaryrefslogtreecommitdiffstats
path: root/fs/fuse/inode.c
Commit message (Collapse)AuthorAgeFilesLines
* fuse: mark inode DONT_CACHE when per inode DAX hint changesJeffle Xu2021-12-141-0/+3
| | | | | | | | | | | | | | | | | | | | When the per inode DAX hint changes while the file is still *opened*, it is quite complicated and maybe fragile to dynamically change the DAX state. Hence mark the inode and corresponding dentries as DONE_CACHE once the per inode DAX hint changes, so that the inode instance will be evicted and freed as soon as possible once the file is closed and the last reference to the inode is put. And then when the file gets reopened next time, the new instantiated inode will reflect the new DAX state. In summary, when the per inode DAX hint changes for an *opened* file, the DAX state of the file won't be updated until this file is closed and reopened later. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: negotiate per inode DAX in FUSE_INITJeffle Xu2021-12-141-4/+9
| | | | | | | | | | | | | | Among the FUSE_INIT phase, client shall advertise per inode DAX if it's mounted with "dax=inode". Then server is aware that client is in per inode DAX mode, and will construct per-inode DAX attribute accordingly. Server shall also advertise support for per inode DAX. If server doesn't support it while client is mounted with "dax=inode", client will silently fallback to "dax=never" since "dax=inode" is advisory only. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: enable per inode DAXJeffle Xu2021-12-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DAX may be limited in some specific situation. When the number of usable DAX windows is under watermark, the recalim routine will be triggered to reclaim some DAX windows. It may have a negative impact on the performance, since some processes may need to wait for DAX windows to be recalimed and reused then. To mitigate the performance degradation, the overall DAX window need to be expanded larger. However, simply expanding the DAX window may not be a good deal in some scenario. To maintain one DAX window chunk (i.e., 2MB in size), 32KB (512 * 64 bytes) memory footprint will be consumed for page descriptors inside guest, which is greater than the memory footprint if it uses guest page cache when DAX disabled. Thus it'd better disable DAX for those files smaller than 32KB, to reduce the demand for DAX window and thus avoid the unworthy memory overhead. Per inode DAX feature is introduced to address this issue, by offering a finer grained control for dax to users, trying to achieve a balance between performance and memory overhead. The FUSE_ATTR_DAX flag in FUSE_LOOKUP reply is used to indicate whether DAX should be enabled or not for corresponding file. Currently the state whether DAX is enabled or not for the file is initialized only when inode is instantiated. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: make DAX mount option a tri-stateJeffle Xu2021-12-141-3/+7
| | | | | | | | | | | | | | | | | | | | | | We add 'always', 'never', and 'inode' (default). '-o dax' continues to operate the same which is equivalent to 'always'. The following behavior is consistent with that on ext4/xfs: - The default behavior (when neither '-o dax' nor '-o dax=always|never|inode' option is specified) is equal to 'inode' mode, while 'dax=inode' won't be printed among the mount option list. - The 'inode' mode is only advisory. It will silently fallback to 'never' mode if fuse server doesn't support that. Also noted that by the time of this commit, 'inode' mode is actually equal to 'always' mode, before the per inode DAX flag is introduced in the following patch. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: send security context of inode on fileVivek Goyal2021-11-251-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a new inode is created, send its security context to server along with creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK). This gives server an opportunity to create new file and set security context (possibly atomically). In all the configurations it might not be possible to set context atomically. Like nfs and ceph, use security_dentry_init_security() to dermine security context of inode and send it with create, mkdir, mknod, and symlink requests. Following is the information sent to server. fuse_sectx_header, fuse_secctx, xattr_name, security_context - struct fuse_secctx_header This contains total number of security contexts being sent and total size of all the security contexts (including size of fuse_secctx_header). - struct fuse_secctx This contains size of security context which follows this structure. There is one fuse_secctx instance per security context. - xattr name string This string represents name of xattr which should be used while setting security context. - security context This is the actual security context whose size is specified in fuse_secctx struct. Also add the FUSE_SECURITY_CTX flag for the `flags` field of the fuse_init_out struct. When this flag is set the kernel will append the security context for a newly created inode to the request (create, mkdir, mknod, and symlink). The server is responsible for ensuring that the inode appears atomically (preferrably) with the requested security context. For example, If the server is using SELinux and backed by a "real" linux file system that supports extended attributes it can write the security context value to /proc/thread-self/attr/fscreate before making the syscall to create the inode. This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: extend init flagsMiklos Szeredi2021-11-251-27/+33
| | | | | | | | | | | | | | | | | | | | FUSE_INIT flags are close to running out, so add another 32bits worth of space. Add FUSE_INIT_EXT flag to the old flags field in fuse_init_in. If this flag is set, then fuse_init_in is extended by 48bytes, in which a flags_hi field is allocated to contain the high 32bits of the flags. A flags_hi field is also added to fuse_init_out, allocated out of the remaining unused fields. Known userspace implementations of the fuse protocol have been checked to accept the extended FUSE_INIT request, but this might cause problems with other implementations. If that happens to be the case, the protocol negotiation will have to be extended with an extra initialization request roundtrip. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: add cache_maskMiklos Szeredi2021-10-281-7/+24
| | | | | | | | | | | | | If writeback_cache is enabled, then the size, mtime and ctime attributes of regular files are always valid in the kernel's cache. They are retrieved from userspace only when the inode is freshly looked up. Add a more generic "cache_mask", that indicates which attributes are currently valid in cache. This patch doesn't change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: move reverting attributes to fuse_change_attributes()Miklos Szeredi2021-10-281-0/+13
| | | | | | | | | | | | | In case of writeback_cache fuse_fillattr() would revert the queried attributes to the cached version. Move this to fuse_change_attributes() in order to manage the writeback logic in a central helper. This will be necessary for patches that follow. Only fuse_do_getattr() -> fuse_fillattr() uses the attributes after calling fuse_change_attributes(), so this should not change behavior. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: simplify local variables holding writeback cache stateMiklos Szeredi2021-10-281-2/+2
| | | | | | | | | There are two instances of "bool is_wb = fc->writeback_cache" where the actual use mostly involves checking "is_wb && S_ISREG(inode->i_mode)". Clean up these cases by storing the second condition in the local variable. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: make sure reclaim doesn't write the inodeMiklos Szeredi2021-10-221-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | In writeback cache mode mtime/ctime updates are cached, and flushed to the server using the ->write_inode() callback. Closing the file will result in a dirty inode being immediately written, but in other cases the inode can remain dirty after all references are dropped. This result in the inode being written back from reclaim, which can deadlock on a regular allocation while the request is being served. The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because serving a request involves unrelated userspace process(es). Instead do the same as for dirty pages: make sure the inode is written before the last reference is gone. - fallocate(2)/copy_file_range(2): these call file_update_time() or file_modified(), so flush the inode before returning from the call - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so flush the ctime directly from this helper Reported-by: chenguanyou <chenguanyou@xiaomi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: clean up error exits in fuse_fill_super()Miklos Szeredi2021-10-211-6/+2
| | | | | | | Instead of "goto err", return error directly, since there's no error cleanup to do now. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: always initialize sb->s_fs_infoMiklos Szeredi2021-10-211-25/+25
| | | | | | | | | | | | | | | | | | | Syzkaller reports a null pointer dereference in fuse_test_super() that is caused by sb->s_fs_info being NULL. This is due to the fact that fuse_fill_super() is initializing s_fs_info, which is too late, it's already on the fs_supers list. The initialization needs to be done in sget_fc() with the sb_lock held. Move allocation of fuse_mount and fuse_conn from fuse_fill_super() into fuse_get_tree(). After this ->kill_sb() will always be called with non-NULL ->s_fs_info, hence fuse_mount_destroy() can drop the test for non-NULL "fm". Reported-by: syzbot+74a15f02ccb51f398601@syzkaller.appspotmail.com Fixes: 5d5b74aa9c76 ("fuse: allow sharing existing sb") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: clean up fuse_mount destructionMiklos Szeredi2021-10-211-12/+5
| | | | | | | | | | | | 1. call fuse_mount_destroy() for open coded variants 2. before deactivate_locked_super() don't need fuse_mount destruction since that will now be done (if ->s_fs_info is not cleared) 3. rearrange fuse_mount setup in fuse_get_tree_submount() so that the regular pattern can be used Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: get rid of fuse_put_super()Miklos Szeredi2021-10-211-9/+11
| | | | | | | | | | | | | | | | The ->put_super callback is called from generic_shutdown_super() in case of a fully initialized sb. This is called from kill_***_super(), which is called from ->kill_sb instances. Fuse uses ->put_super to destroy the fs specific fuse_mount and drop the reference to the fuse_conn, while it does the same on each error case during sb setup. This patch moves the destruction from fuse_put_super() to fuse_mount_destroy(), called at the end of all ->kill_sb instances. A follup patch will clean up the error paths. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: check s_root when destroying sbMiklos Szeredi2021-10-211-1/+1
| | | | | | | | | | Checking "fm" works because currently sb->s_fs_info is cleared on error paths; however, sb->s_root is what generic_shutdown_super() checks to determine whether the sb was fully initialized or not. This change will allow cleanup of sb setup error paths. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* Merge tag 'fuse-update-5.15' of ↵Linus Torvalds2021-09-071-55/+148
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Allow mounting an active fuse device. Previously the fuse device would always be mounted during initialization, and sharing a fuse superblock was only possible through mount or namespace cloning - Fix data flushing in syncfs (virtiofs only) - Fix data flushing in copy_file_range() - Fix a possible deadlock in atomic O_TRUNC - Misc fixes and cleanups * tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: remove unused arg in fuse_write_file_get() fuse: wait for writepages in syncfs fuse: flush extending writes fuse: truncate pagecache on atomic_o_trunc fuse: allow sharing existing sb fuse: move fget() to fuse_get_tree() fuse: move option checking into fuse_fill_super() fuse: name fs_context consistently fuse: fix use after free in fuse_read_interrupt()
| * fuse: wait for writepages in syncfsMiklos Szeredi2021-09-061-0/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of fuse the MM subsystem doesn't guarantee that page writeback completes by the time ->sync_fs() is called. This is because fuse completes page writeback immediately to prevent DoS of memory reclaim by the userspace file server. This means that fuse itself must ensure that writes are synced before sending the SYNCFS request to the server. Introduce sync buckets, that hold a counter for the number of outstanding write requests. On syncfs replace the current bucket with a new one and wait until the old bucket's counter goes down to zero. It is possible to have multiple syncfs calls in parallel, in which case there could be more than one waited-on buckets. Descendant buckets must not complete until the parent completes. Add a count to the child (new) bucket until the (parent) old bucket completes. Use RCU protection to dereference the current bucket and to wake up an emptied bucket. Use fc->lock to protect against parallel assignments to the current bucket. This leaves just the counter to be a possible scalability issue. The fc->num_waiting counter has a similar issue, so both should be addressed at the same time. Reported-by: Amir Goldstein <amir73il@gmail.com> Fixes: 2d82ab251ef0 ("virtiofs: propagate sync() to file server") Cc: <stable@vger.kernel.org> # v5.14 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: allow sharing existing sbMiklos Szeredi2021-08-051-1/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make it possible to create a new mount from a already working server. Here's a detailed description of the problem from Jakob: "The background for this question is occasional problems we see with our fuse filesystem [1] and mount namespaces. On a usual client, we have system-wide, autofs managed mountpoints. When a new mount namespace is created (which can be done unprivileged in combination with user namespaces), it can happen that a mountpoint is used inside the new namespace but idle in the root mount namespace. So autofs unmounts the parent, system-wide mountpoint. But the fuse module stays active and still serves mountpoint in the child mount namespace. Because the fuse daemon also blocks other system wide resources corresponding to the mountpoint, this situation effectively prevents new mounts until the child mount namespaces closes. [1] https://github.com/cvmfs/cvmfs" Reported-by: Jakob Blomer <jblomer@cern.ch> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: move fget() to fuse_get_tree()Miklos Szeredi2021-08-051-23/+21
| | | | | | | | | | | | | | | | | | | | | | | | Affected call chains: fuse_get_tree -> get_tree_(bdev|nodev) -> fuse_fill_super Needed for following patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: move option checking into fuse_fill_super()Miklos Szeredi2021-08-041-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | Checking whether the "fd=", "rootmode=", "user_id=" and "group_id=" mount options are present can be moved from fuse_get_tree() into fuse_fill_super() where the value of the options are consumed. This relaxes semantics of reusing a fuse blockdev mount using the device name. Before this patch presence of these options were enforced but values ignored, after this patch these options are completely ignored in this case. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: name fs_context consistentlyMiklos Szeredi2021-08-041-30/+30
| | | | | | | | | | | | | | | | | | Naming convention under fs/fuse/: struct fuse_conn *fc; struct fs_context *fsc; Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | fuse: Convert to using invalidate_lockJan Kara2021-07-131-1/+0
|/ | | | | | | | | | | Use invalidate_lock instead of fuse's private i_mmap_sem. The intended purpose is exactly the same. By this conversion we fix a long standing race between hole punching and read(2) / readahead(2) paths that can lead to stale page cache contents. CC: Miklos Szeredi <miklos@szeredi.hu> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
* fuse: fix illegal access to inode with reused nodeidAmir Goldstein2021-06-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...) with ourarg containing nodeid and generation. If a fuse inode is found in inode cache with the same nodeid but different generation, the existing fuse inode should be unhashed and marked "bad" and a new inode with the new generation should be hashed instead. This can happen, for example, with passhrough fuse filesystem that returns the real filesystem ino/generation on lookup and where real inode numbers can get recycled due to real files being unlinked not via the fuse passthrough filesystem. With current code, this situation will not be detected and an old fuse dentry that used to point to an older generation real inode, can be used to access a completely new inode, which should be accessed only via the new dentry. Note that because the FORGET message carries the nodeid w/o generation, the server should wait to get FORGET counts for the nlookup counts of the old and reused inodes combined, before it can free the resources associated to that nodeid. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: Make fuse_fill_super_submount() staticGreg Kurz2021-06-221-2/+2
| | | | | | | | | | | This function used to be called from fuse_dentry_automount(). This code was moved to fuse_get_tree_submount() in the same file since then. It is unlikely there will ever be another user. No need to be extern in this case. Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: Call vfs_get_tree() for submountsGreg Kurz2021-06-221-0/+36
| | | | | | | | | | | | | | We recently fixed an infinite loop by setting the SB_BORN flag on submounts along with the write barrier needed by super_cache_count(). This is the job of vfs_get_tree() and FUSE shouldn't have to care about the barrier at all. Split out some code from fuse_dentry_automount() to the dedicated fuse_get_tree_submount() handler for submounts and call vfs_get_tree(). Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: add dedicated filesystem context ops for submountsGreg Kurz2021-06-221-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | The creation of a submount is open-coded in fuse_dentry_automount(). This brings a lot of complexity and we recently had to fix bugs because we weren't setting SB_BORN or because we were unlocking sb->s_umount before sb was fully configured. Most of these could have been avoided by using the mount API instead of open-coding. Basically, this means coming up with a proper ->get_tree() implementation for submounts and call vfs_get_tree(), or better fc_mount(). The creation of the superblock for submounts is quite different from the root mount. Especially, it doesn't require to allocate a FUSE filesystem context, nor to parse parameters. Introduce a dedicated context ops for submounts to make this clear. This is just a placeholder for now, fuse_get_tree_submount() will be populated in a subsequent patch. Only visible change is that we stop allocating/freeing a useless FUSE filesystem context with submounts. Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* virtiofs: propagate sync() to file serverGreg Kurz2021-06-221-0/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even if POSIX doesn't mandate it, linux users legitimately expect sync() to flush all data and metadata to physical storage when it is located on the same system. This isn't happening with virtiofs though: sync() inside the guest returns right away even though data still needs to be flushed from the host page cache. This is easily demonstrated by doing the following in the guest: $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync 5120+0 records in 5120+0 records out 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s sync() = 0 <0.024068> and start the following in the host when the 'dd' command completes in the guest: $ strace -T -e fsync /usr/bin/sync virtiofs/foo fsync(3) = 0 <10.371640> There are no good reasons not to honor the expected behavior of sync() actually: it gives an unrealistic impression that virtiofs is super fast and that data has safely landed on HW, which isn't the case obviously. Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS request type for this purpose. Provision a 64-bit placeholder for possible future extensions. Since the file server cannot handle the wait == 0 case, we skip it to avoid a gratuitous roundtrip. Note that this is per-superblock: a FUSE_SYNCFS is send for the root mount and for each submount. Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in the file server is treated as permanent success. This ensures compatibility with older file servers: the client will get the current behavior of sync() not being propagated to the file server. Note that such an operation allows the file server to DoS sync(). Since a typical FUSE file server is an untrusted piece of software running in userspace, this is disabled by default. Only enable it with virtiofs for now since virtiofsd is supposedly trusted by the guest kernel. Reported-by: Robert Krawitz <rlk@redhat.com> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* Merge branch 'work.misc' of ↵Linus Torvalds2021-05-021-2/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff all over the place" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: useful constants: struct qstr for ".." hostfs_open(): don't open-code file_dentry() whack-a-mole: kill strlen_user() (again) autofs: should_expire() argument is guaranteed to be positive apparmor:match_mn() - constify devpath argument buffer: a small optimization in grow_buffers get rid of autofs_getpath() constify dentry argument of dentry_path()/dentry_path_raw()
| * useful constants: struct qstr for ".."Al Viro2021-04-151-2/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'fuse-update-5.13' of ↵Linus Torvalds2021-04-301-2/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fix a page locking bug in write (introduced in 2.6.26) - Allow sgid bit to be killed in setacl() - Miscellaneous fixes and cleanups * tag 'fuse-update-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: cuse: simplify refcount cuse: prevent clone virtiofs: fix userns virtiofs: remove useless function virtiofs: split requests that exceed virtqueue size virtiofs: fix memory leak in virtio_fs_probe() fuse: invalidate attrs when page writeback completes fuse: add a flag FUSE_SETXATTR_ACL_KILL_SGID to kill SGID fuse: extend FUSE_SETXATTR request fuse: fix matching of FUSE_DEV_IOC_CLONE command fuse: fix a typo fuse: don't zero pages twice fuse: fix typo for fuse_conn.max_pages comment fuse: fix write deadlock
| * | virtiofs: split requests that exceed virtqueue sizeConnor Kuehl2021-04-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an incoming FUSE request can't fit on the virtqueue, the request is placed onto a workqueue so a worker can try to resubmit it later where there will (hopefully) be space for it next time. This is fine for requests that aren't larger than a virtqueue's maximum capacity. However, if a request's size exceeds the maximum capacity of the virtqueue (even if the virtqueue is empty), it will be doomed to a life of being placed on the workqueue, removed, discovered it won't fit, and placed on the workqueue yet again. Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect Descriptors) of the virtio spec: "A driver MUST NOT create a descriptor chain longer than the Queue Size of the device." To fix this, limit the number of pages FUSE will use for an overall request. This way, each request can realistically fit on the virtqueue when it is decomposed into a scattergather list and avoid violating section 2.6.5.3.1 of the virtio spec. Signed-off-by: Connor Kuehl <ckuehl@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: extend FUSE_SETXATTR requestVivek Goyal2021-04-141-1/+3
| |/ | | | | | | | | | | | | | | | | Fuse client needs to send additional information to file server when it calls SETXATTR(system.posix_acl_access), so add extra flags field to the structure. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* / new helper: inode_wrong_type()Al Viro2021-03-081-1/+1
|/ | | | | | | | inode_wrong_type(inode, mode) returns true if setting inode->i_mode to given value would've changed the inode type. We have enough of those checks open-coded to make a helper worthwhile. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fuse: fix bad inodeMiklos Szeredi2020-12-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Jan Kara's analysis of the syzbot report (edited): The reproducer opens a directory on FUSE filesystem, it then attaches dnotify mark to the open directory. After that a fuse_do_getattr() call finds that attributes returned by the server are inconsistent, and calls make_bad_inode() which, among other things does: inode->i_mode = S_IFREG; This then confuses dnotify which doesn't tear down its structures properly and eventually crashes. Avoid calling make_bad_inode() on a live inode: switch to a private flag on the fuse inode. Also add the test to ops which the bad_inode_ops would have caught. This bug goes back to the initial merge of fuse in 2.6.14... Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org>
* fuse: support SB_NOSEC flag to improve write performanceVivek Goyal2020-11-111-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Virtiofs can be slow with small writes if xattr are enabled and we are doing cached writes (No direct I/O). Ganesh Mahalingam noticed this. Some debugging showed that file_remove_privs() is called in cached write path on every write. And everytime it calls security_inode_need_killpriv() which results in call to __vfs_getxattr(XATTR_NAME_CAPS). And this goes to file server to fetch xattr. This extra round trip for every write slows down writes tremendously. Normally to avoid paying this penalty on every write, vfs has the notion of caching this information in inode (S_NOSEC). So vfs sets S_NOSEC, if filesystem opted for it using super block flag SB_NOSEC. And S_NOSEC is cleared when setuid/setgid bit is set or when security xattr is set on inode so that next time a write happens, we check inode again for clearing setuid/setgid bits as well clear any security.capability xattr. This seems to work well for local file systems but for remote file systems it is possible that VFS does not have full picture and a different client sets setuid/setgid bit or security.capability xattr on file and that means VFS information about S_NOSEC on another client will be stale. So for remote filesystems SB_NOSEC was disabled by default. Commit 9e1f1de02c22 ("more conservative S_NOSEC handling") mentioned that these filesystems can still make use of SB_NOSEC as long as they clear S_NOSEC when they are refreshing inode attriutes from server. So this patch tries to enable SB_NOSEC on fuse (regular fuse as well as virtiofs). And clear SB_NOSEC when we are refreshing inode attributes. This is enabled only if server supports FUSE_HANDLE_KILLPRIV_V2. This says that server will clear setuid/setgid/security.capability on chown/truncate/write as apporpriate. This should provide tighter coherency because now suid/sgid/ security.capability will be cleared even if fuse client cache has not seen these attrs. Basic idea is that fuse client will trigger suid/sgid/security.capability clearing based on its attr cache. But even if cache has gone stale, it is fine because FUSE_HANDLE_KILLPRIV_V2 will make sure WRITE clear suid/sgid/security.capability. We make this change only if server supports FUSE_HANDLE_KILLPRIV_V2. This should make sure that existing filesystems which might be relying on seucurity.capability always being queried from server are not impacted. This tighter coherency relies on WRITE showing up on server (and not being cached in guest). So writeback_cache mode will not provide that tight coherency and it is not recommended to use two together. Having said that it might work reasonably well for lot of use cases. This change improves random write performance very significantly. Running virtiofsd with cache=auto and following fio command: fio --ioengine=libaio --direct=1 --name=test --filename=/mnt/virtiofs/random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randwrite Bandwidth increases from around 50MB/s to around 250MB/s as a result of applying this patch. So improvement is very significant. Link: https://github.com/kata-containers/runtime/issues/2815 Reported-by: "Mahalingam, Ganesh" <ganesh.mahalingam@intel.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: introduce the notion of FUSE_HANDLE_KILLPRIV_V2Vivek Goyal2020-11-111-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already have FUSE_HANDLE_KILLPRIV flag that says that file server will remove suid/sgid/caps on truncate/chown/write. But that's little different from what Linux VFS implements. To be consistent with Linux VFS behavior what we want is. - caps are always cleared on chown/write/truncate - suid is always cleared on chown, while for truncate/write it is cleared only if caller does not have CAP_FSETID. - sgid is always cleared on chown, while for truncate/write it is cleared only if caller does not have CAP_FSETID as well as file has group execute permission. As previous flag did not provide above semantics. Implement a V2 of the protocol with above said constraints. Server does not know if caller has CAP_FSETID or not. So for the case of write()/truncate(), client will send information in special flag to indicate whether to kill priviliges or not. These changes are in subsequent patches. FUSE_HANDLE_KILLPRIV_V2 relies on WRITE being sent to server to clear suid/sgid/security.capability. But with ->writeback_cache, WRITES are cached in guest. So it is not recommended to use FUSE_HANDLE_KILLPRIV_V2 and writeback_cache together. Though it probably might be good enough for lot of use cases. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: add fuse_sb_destroy() helperMiklos Szeredi2020-11-111-9/+7
| | | | | | | This is to avoid minor code duplication between fuse_kill_sb_anon() and fuse_kill_sb_blk(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* fuse: get rid of fuse_mount refcountMiklos Szeredi2020-11-111-13/+4
| | | | | | | | | | | Fuse mount now only ever has a refcount of one (before being freed) so the count field is unnecessary. Remove the refcounting and fold fuse_mount_put() into callers. The only caller of fuse_mount_put() where fm->fc was NULL is fuse_dentry_automount() and here the fuse_conn_put() can simply be omitted. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* virtiofs: simplify sb setupMiklos Szeredi2020-11-111-7/+0
| | | | | | | | | | | | | | | Currently when acquiring an sb for virtiofs fuse_mount_get() is being called from virtio_fs_set_super() if a new sb is being filled and fuse_mount_put() is called unconditionally after sget_fc() returns. The exact same result can be obtained by checking whether fs_contex->s_fs_info was set to NULL (ref trasferred to sb->s_fs_info) and only calling fuse_mount_put() if the ref wasn't transferred (error or matching sb found). This allows getting rid of virtio_fs_set_super() and fuse_mount_get(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* Merge tag 'fuse-update-5.10' of ↵Linus Torvalds2020-10-191-88/+303
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Support directly accessing host page cache from virtiofs. This can improve I/O performance for various workloads, as well as reducing the memory requirement by eliminating double caching. Thanks to Vivek Goyal for doing most of the work on this. - Allow automatic submounting inside virtiofs. This allows unique st_dev/ st_ino values to be assigned inside the guest to files residing on different filesystems on the host. Thanks to Max Reitz for the patches. - Fix an old use after free bug found by Pradeep P V K. * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) virtiofs: calculate number of scatter-gather elements accurately fuse: connection remove fix fuse: implement crossmounts fuse: Allow fuse_fill_super_common() for submounts fuse: split fuse_mount off of fuse_conn fuse: drop fuse_conn parameter where possible fuse: store fuse_conn in fuse_req fuse: add submount support to <uapi/linux/fuse.h> fuse: fix page dereference after free virtiofs: add logic to free up a memory range virtiofs: maintain a list of busy elements virtiofs: serialize truncate/punch_hole and dax fault path virtiofs: define dax address space operations virtiofs: add DAX mmap support virtiofs: implement dax read/write operations virtiofs: introduce setupmapping/removemapping commands virtiofs: implement FUSE_INIT map_alignment field virtiofs: keep a list of free dax memory ranges virtiofs: add a mount option to enable dax virtiofs: set up virtio_fs dax_device ...
| * fuse: connection remove fixMiklos Szeredi2020-10-121-0/+7
| | | | | | | | | | | | | | | | Re-add lost removal of fc from fuse_conn_list and the control filesystem. Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: fcee216beb9c ("fuse: split fuse_mount off of fuse_conn") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: implement crossmountsMax Reitz2020-10-091-2/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FUSE servers can indicate crossmount points by setting FUSE_ATTR_SUBMOUNT in fuse_attr.flags. The inode will then be marked as S_AUTOMOUNT, and the .d_automount implementation creates a new submount at that location, so that the submount gets a distinct st_dev value. Note that all submounts get a distinct superblock and a distinct st_dev value, so for virtio-fs, even if the same filesystem is mounted more than once on the host, none of its mount points will have the same st_dev. We need distinct superblocks because the superblock points to the root node, but the different host mounts may show different trees (e.g. due to submounts in some of them, but not in others). Right now, this behavior is only enabled when fuse_conn.auto_submounts is set, which is the case only for virtio-fs. Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: Allow fuse_fill_super_common() for submountsMax Reitz2020-09-181-21/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Submounts have their own superblock, which needs to be initialized. However, they do not have a fuse_fs_context associated with them, and the root node's attributes should be taken from the mountpoint's node. Extend fuse_fill_super_common() to work for submounts by making the @ctx parameter optional, and by adding a @submount_finode parameter. (There is a plain "unsigned" in an existing code block that is being indented by this commit. Extend it to "unsigned int" so checkpatch does not complain.) Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * fuse: split fuse_mount off of fuse_connMax Reitz2020-09-181-49/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to allow submounts for the same fuse_conn, but with different superblocks so that each of the submounts has its own device ID. To do so, we need to split all mount-specific information off of fuse_conn into a new fuse_mount structure, so that multiple mounts can share a single fuse_conn. We need to take care only to perform connection-level actions once (i.e. when the fuse_conn and thus the first fuse_mount are established, or when the last fuse_mount and thus the fuse_conn are destroyed). For example, fuse_sb_destroy() must invoke fuse_send_destroy() until the last superblock is released. To do so, we keep track of which fuse_mount is the root mount and perform all fuse_conn-level actions only when this fuse_mount is involved. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * virtiofs: serialize truncate/punch_hole and dax fault pathVivek Goyal2020-09-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently in fuse we don't seem have any lock which can serialize fault path with truncate/punch_hole path. With dax support I need one for following reasons. 1. Dax requirement DAX fault code relies on inode size being stable for the duration of fault and want to serialize with truncate/punch_hole and they explicitly mention it. static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, const struct iomap_ops *ops) /* * Check whether offset isn't beyond end of file now. Caller is * supposed to hold locks serializing us with truncate / punch hole so * this is a reliable test. */ max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 2. Make sure there are no users of pages being truncated/punch_hole get_user_pages() might take references to page and then do some DMA to said pages. Filesystem might truncate those pages without knowing that a DMA is in progress or some I/O is in progress. So use dax_layout_busy_page() to make sure there are no such references and I/O is not in progress on said pages before moving ahead with truncation. 3. Limitation of kvm page fault error reporting If we are truncating file on host first and then removing mappings in guest lateter (truncate page cache etc), then this could lead to a problem with KVM. Say a mapping is in place in guest and truncation happens on host. Now if guest accesses that mapping, then host will take a fault and kvm will either exit to qemu or spin infinitely. IOW, before we do truncation on host, we need to make sure that guest inode does not have any mapping in that region or whole file. 4. virtiofs memory range reclaim Soon I will introduce the notion of being able to reclaim dax memory ranges from a fuse dax inode. There also I need to make sure that no I/O or fault is going on in the reclaimed range and nobody is using it so that range can be reclaimed without issues. Currently if we take inode lock, that serializes read/write. But it does not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem for this purpose. It can be used to serialize with faults. As of now, I am adding taking this semaphore only in dax fault path and not regular fault path because existing code does not have one. May be existing code can benefit from it as well to take care of some races, but that we can fix later if need be. For now, I am just focussing only on DAX path which is new path. Also added logic to take fuse_inode->i_mmap_sem in truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and fuse dax fault are mutually exlusive and avoid all the above problems. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * virtiofs: implement dax read/write operationsVivek Goyal2020-09-101-4/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements basic DAX support. mmap() is not implemented yet and will come in later patches. This patch looks into implemeting read/write. We make use of interval tree to keep track of per inode dax mappings. Do not use dax for file extending writes, instead just send WRITE message to daemon (like we do for direct I/O path). This will keep write and i_size change atomic w.r.t crash. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * virtiofs: implement FUSE_INIT map_alignment fieldStefan Hajnoczi2020-09-101-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment constraints via the FUST_INIT map_alignment field. Parse this field and ensure our DAX mappings meet the alignment constraints. We don't actually align anything differently since our mappings are already 2MB aligned. Just check the value when the connection is established. If it becomes necessary to honor arbitrary alignments in the future we'll have to adjust how mappings are sized. The upshot of this commit is that we can be confident that mappings will work even when emulating x86 on Power and similar combinations where the host page sizes are different. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * virtiofs: add a mount option to enable daxVivek Goyal2020-09-101-1/+17
| | | | | | | | | | | | | | Add a mount option to allow using dax with virtio_fs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * virtiofs: get rid of no_mount_optionsVivek Goyal2020-09-101-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | This option was introduced so that for virtio_fs we don't show any mounts options fuse_show_options(). Because we don't offer any of these options to be controlled by mounter. Very soon we are planning to introduce option "dax" which mounter should be able to specify. And no_mount_options does not work anymore. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | bdi: invert BDI_CAP_NO_ACCT_WBChristoph Hellwig2020-09-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to make the checks more obvious. Also remove the pointless bdi_cap_account_writeback wrapper that just obsfucates the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>