summaryrefslogtreecommitdiffstats
path: root/fs/nfs
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | | | nfs: convert to new timestamp accessorsJeff Layton2023-10-183-18/+18
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-49-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | | | Merge tag 'vfs-6.7.xattr' of ↵Linus Torvalds2023-10-303-3/+3
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs xattr updates from Christian Brauner: "The 's_xattr' field of 'struct super_block' currently requires a mutable table of 'struct xattr_handler' entries (although each handler itself is const). However, no code in vfs actually modifies the tables. This changes the type of 's_xattr' to allow const tables, and modifies existing file systems to move their tables to .rodata. This is desirable because these tables contain entries with function pointers in them; moving them to .rodata makes it considerably less likely to be modified accidentally or maliciously at runtime" * tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits) const_structs.checkpatch: add xattr_handler net: move sockfs_xattr_handlers to .rodata shmem: move shmem_xattr_handlers to .rodata overlayfs: move xattr tables to .rodata xfs: move xfs_xattr_handlers to .rodata ubifs: move ubifs_xattr_handlers to .rodata squashfs: move squashfs_xattr_handlers to .rodata smb: move cifs_xattr_handlers to .rodata reiserfs: move reiserfs_xattr_handlers to .rodata orangefs: move orangefs_xattr_handlers to .rodata ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata ntfs3: move ntfs_xattr_handlers to .rodata nfs: move nfs4_xattr_handlers to .rodata kernfs: move kernfs_xattr_handlers to .rodata jfs: move jfs_xattr_handlers to .rodata jffs2: move jffs2_xattr_handlers to .rodata hfsplus: move hfsplus_xattr_handlers to .rodata hfs: move hfs_xattr_handlers to .rodata gfs2: move gfs2_xattr_handlers_max to .rodata fuse: move fuse_xattr_handlers to .rodata ...
| * | | | | nfs: move nfs4_xattr_handlers to .rodataWedson Almeida Filho2023-10-093-3/+3
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This makes it harder for accidental or malicious changes to nfs4_xattr_handlers at runtime. Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: linux-nfs@vger.kernel.org Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com> Link: https://lore.kernel.org/r/20230930050033.41174-19-wedsonaf@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | | | Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds2023-10-301-1/+1
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Rename and export helpers that get write access to a mount. They are used in overlayfs to get write access to the upper mount. - Print the pretty name of the root device on boot failure. This helps in scenarios where we would usually only print "unknown-block(1,2)". - Add an internal SB_I_NOUMASK flag. This is another part in the endless POSIX ACL saga in a way. When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip the umask because if the relevant inode has POSIX ACLs set it might take the umask from there. But if the inode doesn't have any POSIX ACLs set then we apply the umask in the filesytem itself. So we end up with: (1) no SB_POSIXACL -> strip umask in vfs (2) SB_POSIXACL -> strip umask in filesystem The umask semantics associated with SB_POSIXACL allowed filesystems that don't even support POSIX ACLs at all to raise SB_POSIXACL purely to avoid umask stripping. That specifically means NFS v4 and Overlayfs. NFS v4 does it because it delegates this to the server and Overlayfs because it needs to delegate umask stripping to the upper filesystem, i.e., the filesystem used as the writable layer. This went so far that SB_POSIXACL is raised eve on kernels that don't even have POSIX ACL support at all. Stop this blatant abuse and add SB_I_NOUMASK which is an internal superblock flag that filesystems can raise to opt out of umask handling. That should really only be the two mentioned above. It's not that we want any filesystems to do this. Ideally we have all umask handling always in the vfs. - Make overlayfs use SB_I_NOUMASK too. - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in IS_POSIXACL() if the kernel doesn't have support for it. This is a very old patch but it's only possible to do this now with the wider cleanup that was done. - Follow-up work on fake path handling from last cycle. Citing mostly from Amir: When overlayfs was first merged, overlayfs files of regular files and directories, the ones that are installed in file table, had a "fake" path, namely, f_path is the overlayfs path and f_inode is the "real" inode on the underlying filesystem. In v6.5, we took another small step by introducing of the backing_file container and the file_real_path() helper. This change allowed vfs and filesystem code to get the "real" path of an overlayfs backing file. With this change, we were able to make fsnotify work correctly and report events on the "real" filesystem objects that were accessed via overlayfs. This method works fine, but it still leaves the vfs vulnerable to new code that is not aware of files with fake path. A recent example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version"). This commit uses direct referencing to f_path in IMA code that otherwise uses file_inode() and file_dentry() to reference the filesystem objects that it is measuring. This contains work to switch things around: instead of having filesystem code opt-in to get the "real" path, have generic code opt-in for the "fake" path in the few places that it is needed. Is it far more likely that new filesystems code that does not use the file_dentry() and file_real_path() helpers will end up causing crashes or averting LSM/audit rules if we keep the "fake" path exposed by default. This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions. After the dust settles on this change, we can make file_dentry() a plain accessor and we can drop the inode argument to ->d_real(). - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small change but it really isn't and I would like to see everyone on their tippie toes for any possible bugs from this work. Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for files since a very long time because of the nasty interactions between the SCM_RIGHTS file descriptor garbage collection. So extending it makes a lot of sense but it is a subtle change. There are almost no places that fiddle with file rcu semantics directly and the ones that did mess around with struct file internal under rcu have been made to stop doing that because it really was always dodgy. I forgot to put in the link tag for this change and the discussion in the commit so adding it into the merge message: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com Cleanups: - Various smaller pipe cleanups including the removal of a spin lock that was only used to protect against writes without pipe_lock() from O_NOTIFICATION_PIPE aka watch queues. As that was never implemented remove the additional locking from pipe_write(). - Annotate struct watch_filter with the new __counted_by attribute. - Clarify do_unlinkat() cleanup so that it doesn't look like an extra iput() is done that would cause issues. - Simplify file cleanup when the file has never been opened. - Use module helper instead of open-coding it. - Predict error unlikely for stale retry. - Use WRITE_ONCE() for mount expiry field instead of just commenting that one hopes the compiler doesn't get smart. Fixes: - Fix readahead on block devices. - Fix writeback when layztime is enabled and inodes whose timestamp is the only thing that changed reside on wb->b_dirty_time. This caused excessively large zombie memory cgroup when lazytime was enabled as such inodes weren't handled fast enough. - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()" * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits) file, i915: fix file reference for mmap_singleton() vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs chardev: Simplify usage of try_module_get() ovl: rely on SB_I_NOUMASK fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n fs: store real path instead of fake path in backing file f_path fs: create helper file_user_path() for user displayed mapped file path fs: get mnt_writers count for an open backing file's real path vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked vfs: predict the error in retry_estale as unlikely backing file: free directly vfs: fix readahead(2) on block devices io_uring: use files_lookup_fd_locked() file: convert to SLAB_TYPESAFE_BY_RCU vfs: shave work on failed file open fs: simplify misleading code to remove ambiguity regarding ihold()/iput() watch_queue: Annotate struct watch_filter with __counted_by fs/pipe: use spinlock in pipe_read() only if there is a watch_queue fs/pipe: remove unnecessary spinlock from pipe_write() ...
| * | | | | fs: add a new SB_I_NOUMASK flagJeff Layton2023-10-191-1/+1
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SB_POSIXACL must be set when a filesystem supports POSIX ACLs, but NFSv4 also sets this flag to prevent the VFS from applying the umask on newly-created files. NFSv4 doesn't support POSIX ACLs however, which causes confusion when other subsystems try to test for them. Add a new SB_I_NOUMASK flag that allows filesystems to opt-in to umask stripping without advertising support for POSIX ACLs. Set the new flag on NFSv4 instead of SB_POSIXACL. Also, move mode_strip_umask to namei.h and convert init_mknod and init_mkdir to use it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230911-acl-fix-v3-1-b25315333f6c@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | | / nfs/blocklayout: Convert to use bdev_open_by_dev/path()Jan Kara2023-10-282-40/+38
| |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert block device handling to use bdev_open_by_dev/path() and pass the handle around. CC: linux-nfs@vger.kernel.org CC: Trond Myklebust <trond.myklebust@hammerspace.com> CC: Anna Schumaker <anna@kernel.org> Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-25-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | | NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS serverOlga Kornievskaia2023-10-181-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patches fixes commit 51d674a5e488 "NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server", purpose of that commit was to mark EXCHANGE_ID to the DS with the appropriate flag. However, connection to MDS can return both EXCHGID4_FLAG_USE_PNFS_DS and EXCHGID4_FLAG_USE_PNFS_MDS set but previous patch would only remember the USE_PNFS_DS and for the 2nd EXCHANGE_ID send that to the MDS. Instead, just mark the pnfs path exclusively. Fixes: 51d674a5e488 ("NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | | | pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_statsTrond Myklebust2023-10-181-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that we check the layout pointer and validity after dereferencing it in ff_layout_mirror_prepare_stats. Fixes: 08e2e5bc6c9a ("pNFS/flexfiles: Clean up layoutstats") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | | | pNFS: Fix a hang in nfs4_evict_inode()Trond Myklebust2023-10-181-10/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are not allowed to call pnfs_mark_matching_lsegs_return() without also holding a reference to the layout header, since doing so could lead to the reference count going to zero when we call pnfs_layout_remove_lseg(). This again can lead to a hang when we get to nfs4_evict_inode() and are unable to clear the layout pointer. pnfs_layout_return_unused_byserver() is guilty of this behaviour, and has been seen to trigger the refcount warning prior to a hang. Fixes: b6d49ecd1081 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | | | NFS: Fix potential oops in nfs_inode_remove_request()Scott Mayhew2023-10-111-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once a folio's private data has been cleared, it's possible for another process to clear the folio->mapping (e.g. via invalidate_complete_folio2 or evict_mapping_folio), so it wouldn't be safe to call nfs_page_to_inode() after that. Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | | | nfs42: client needs to strip file mode's suid/sgid bit after ALLOCATE opDai Ngo2023-10-111-1/+2
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Linux NFS server strips the SUID and SGID from the file mode on ALLOCATE op. Modify _nfs42_proc_fallocate to add NFS_INO_REVAL_FORCED to nfs_set_cache_invalid's argument to force update of the file mode suid/sgid bit. Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | | nfs: decrement nrequests counter before releasing the reqJeff Layton2023-09-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I hit this panic in testing: [ 6235.500016] run fstests generic/464 at 2023-09-18 22:51:24 [ 6288.410761] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 6288.412174] #PF: supervisor read access in kernel mode [ 6288.413160] #PF: error_code(0x0000) - not-present page [ 6288.413992] PGD 0 P4D 0 [ 6288.414603] Oops: 0000 [#1] PREEMPT SMP PTI [ 6288.415419] CPU: 0 PID: 340798 Comm: kworker/u18:8 Not tainted 6.6.0-rc1-gdcf620ceebac #95 [ 6288.416538] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014 [ 6288.417701] Workqueue: nfsiod rpc_async_release [sunrpc] [ 6288.418676] RIP: 0010:nfs_inode_remove_request+0xc8/0x150 [nfs] [ 6288.419836] Code: ff ff 48 8b 43 38 48 8b 7b 10 a8 04 74 5b 48 85 ff 74 56 48 8b 07 a9 00 00 08 00 74 58 48 8b 07 f6 c4 10 74 50 e8 c8 44 b3 d5 <48> 8b 00 f0 48 ff 88 30 ff ff ff 5b 5d 41 5c c3 cc cc cc cc 48 8b [ 6288.422389] RSP: 0018:ffffbd618353bda8 EFLAGS: 00010246 [ 6288.423234] RAX: 0000000000000000 RBX: ffff9a29f9a25280 RCX: 0000000000000000 [ 6288.424351] RDX: ffff9a29f9a252b4 RSI: 000000000000000b RDI: ffffef41448e3840 [ 6288.425345] RBP: ffffef41448e3840 R08: 0000000000000038 R09: ffffffffffffffff [ 6288.426334] R10: 0000000000033f80 R11: ffff9a2a7fffa000 R12: ffff9a29093f98c4 [ 6288.427353] R13: 0000000000000000 R14: ffff9a29230f62e0 R15: ffff9a29230f62d0 [ 6288.428358] FS: 0000000000000000(0000) GS:ffff9a2a77c00000(0000) knlGS:0000000000000000 [ 6288.429513] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6288.430427] CR2: 0000000000000000 CR3: 0000000264748002 CR4: 0000000000770ef0 [ 6288.431553] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6288.432715] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6288.433698] PKRU: 55555554 [ 6288.434196] Call Trace: [ 6288.434667] <TASK> [ 6288.435132] ? __die+0x1f/0x70 [ 6288.435723] ? page_fault_oops+0x159/0x450 [ 6288.436389] ? try_to_wake_up+0x98/0x5d0 [ 6288.437044] ? do_user_addr_fault+0x65/0x660 [ 6288.437728] ? exc_page_fault+0x7a/0x180 [ 6288.438368] ? asm_exc_page_fault+0x22/0x30 [ 6288.439137] ? nfs_inode_remove_request+0xc8/0x150 [nfs] [ 6288.440112] ? nfs_inode_remove_request+0xa0/0x150 [nfs] [ 6288.440924] nfs_commit_release_pages+0x16e/0x340 [nfs] [ 6288.441700] ? __pfx_call_transmit+0x10/0x10 [sunrpc] [ 6288.442475] ? _raw_spin_lock_irqsave+0x23/0x50 [ 6288.443161] nfs_commit_release+0x15/0x40 [nfs] [ 6288.443926] rpc_free_task+0x36/0x60 [sunrpc] [ 6288.444741] rpc_async_release+0x29/0x40 [sunrpc] [ 6288.445509] process_one_work+0x171/0x340 [ 6288.446135] worker_thread+0x277/0x3a0 [ 6288.446724] ? __pfx_worker_thread+0x10/0x10 [ 6288.447376] kthread+0xf0/0x120 [ 6288.447903] ? __pfx_kthread+0x10/0x10 [ 6288.448500] ret_from_fork+0x2d/0x50 [ 6288.449078] ? __pfx_kthread+0x10/0x10 [ 6288.449665] ret_from_fork_asm+0x1b/0x30 [ 6288.450283] </TASK> [ 6288.450688] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nls_iso8859_1 nls_cp437 vfat fat 9p netfs ext4 kvm_intel crc16 mbcache jbd2 joydev kvm xfs irqbypass virtio_net pcspkr net_failover psmouse failover 9pnet_virtio cirrus drm_shmem_helper virtio_balloon drm_kms_helper button evdev drm loop dm_mod zram zsmalloc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic virtio_blk nvme aesni_intel crypto_simd cryptd nvme_core t10_pi i6300esb crc64_rocksoft_generic crc64_rocksoft crc64 virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring serio_raw btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel xor raid6_pq autofs4 [ 6288.460211] CR2: 0000000000000000 [ 6288.460787] ---[ end trace 0000000000000000 ]--- [ 6288.461571] RIP: 0010:nfs_inode_remove_request+0xc8/0x150 [nfs] [ 6288.462500] Code: ff ff 48 8b 43 38 48 8b 7b 10 a8 04 74 5b 48 85 ff 74 56 48 8b 07 a9 00 00 08 00 74 58 48 8b 07 f6 c4 10 74 50 e8 c8 44 b3 d5 <48> 8b 00 f0 48 ff 88 30 ff ff ff 5b 5d 41 5c c3 cc cc cc cc 48 8b [ 6288.465136] RSP: 0018:ffffbd618353bda8 EFLAGS: 00010246 [ 6288.465963] RAX: 0000000000000000 RBX: ffff9a29f9a25280 RCX: 0000000000000000 [ 6288.467035] RDX: ffff9a29f9a252b4 RSI: 000000000000000b RDI: ffffef41448e3840 [ 6288.468093] RBP: ffffef41448e3840 R08: 0000000000000038 R09: ffffffffffffffff [ 6288.469121] R10: 0000000000033f80 R11: ffff9a2a7fffa000 R12: ffff9a29093f98c4 [ 6288.470109] R13: 0000000000000000 R14: ffff9a29230f62e0 R15: ffff9a29230f62d0 [ 6288.471106] FS: 0000000000000000(0000) GS:ffff9a2a77c00000(0000) knlGS:0000000000000000 [ 6288.472216] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6288.473059] CR2: 0000000000000000 CR3: 0000000264748002 CR4: 0000000000770ef0 [ 6288.474096] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6288.475097] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6288.476148] PKRU: 55555554 [ 6288.476665] note: kworker/u18:8[340798] exited with irqs disabled Once we've released "req", it's not safe to dereference it anymore. Decrement the nrequests counter before dropping the reference. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | | NFSv4: Fix a state manager thread deadlock regressionTrond Myklebust2023-09-272-13/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4dc73c679114 reintroduces the deadlock that was fixed by commit aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") because it prevents the setup of new threads to handle reboot recovery, while the older recovery thread is stuck returning delegations. Fixes: 4dc73c679114 ("NFSv4: keep state manager thread active if swap is enabled") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | | NFSv4: Fix a nfs4_state_manager() raceTrond Myklebust2023-09-271-0/+7
| |/ |/| | | | | | | | | | | | | | | | | | | If the NFS4CLNT_RUN_MANAGER flag got set just before we cleared NFS4CLNT_MANAGER_RUNNING, then we might have won the race against nfs4_schedule_state_manager(), and are responsible for handling the recovery situation. Fixes: aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | NFSv4.1: fix zero value filehandle in post open getattrOlga Kornievskaia2023-09-151-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, if the OPEN compound experiencing an error and needs to get the file attributes separately, it will send a stand alone GETATTR but it would use the filehandle from the results of the OPEN compound. In case of the CLAIM_FH OPEN, nfs_openres's fh is zero value. That generate a GETATTR that's sent with a zero value filehandle, and results in the server returning an error. Instead, for the CLAIM_FH OPEN, take the filehandle that was used in the PUTFH of the OPEN compound. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | NFSv4.1: fix pnfs MDS=DS session trunkingOlga Kornievskaia2023-09-131-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when GETDEVICEINFO returns multiple locations where each is a different IP but the server's identity is same as MDS, then nfs4_set_ds_client() finds the existing nfs_client structure which has the MDS's max_connect value (and if it's 1), then the 1st IP on the DS's list will get dropped due to MDS trunking rules. Other IPs would be added as they fall under the pnfs trunking rules. For the list of IPs the 1st goes thru calling nfs4_set_ds_client() which will eventually call nfs4_add_trunk() and call into rpc_clnt_test_and_add_xprt() which has the check for MDS trunking. The other IPs (after the 1st one), would call rpc_clnt_add_xprt() which doesn't go thru that check. nfs4_add_trunk() is called when MDS trunking is happening and it needs to enforce the usage of max_connect mount option of the 1st mount. However, this shouldn't be applied to pnfs flow. Instead, this patch proposed to treat MDS=DS as DS trunking and make sure that MDS's max_connect limit does not apply to the 1st IP returned in the GETDEVICEINFO list. It does so by marking the newly created client with a new flag NFS_CS_PNFS which then used to pass max_connect value to use into the rpc_clnt_test_and_add_xprt() instead of the existing rpc client's max_connect value set by the MDS connection. For example, mount was done without max_connect value set so MDS's rpc client has cl_max_connect=1. Upon calling into rpc_clnt_test_and_add_xprt() and using rpc client's value, the caller passes in max_connect value which is previously been set in the pnfs path (as a part of handling GETDEVICEINFO list of IPs) in nfs4_set_ds_client(). However, when NFS_CS_PNFS flag is not set and we know we are doing MDS trunking, comparing a new IP of the same server, we then set the max_connect value to the existing MDS's value and pass that into rpc_clnt_test_and_add_xprt(). Fixes: dc48e0abee24 ("SUNRPC enforce creation of no more than max_connect xprts") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | NFS/pNFS: Report EINVAL errors from connect() to the serverTrond Myklebust2023-09-131-0/+1
| | | | | | | | | | | | | | | | | | | | With IPv6, connect() can occasionally return EINVAL if a route is unavailable. If this happens during I/O to a data server, we want to report it using LAYOUTERROR as an inability to connect. Fixes: dd52128afdde ("NFSv4.1/pnfs Ensure flexfiles reports all connection related errors") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | NFS: More fixes for nfs_direct_write_reschedule_io()Trond Myklebust2023-09-131-6/+11
| | | | | | | | | | | | | | | | | | Ensure that all requests are put back onto the commit list so that they can be rescheduled. Fixes: 4daaeba93822 ("NFS: Fix nfs_direct_write_reschedule_io()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | NFS: Use the correct commit info in nfs_join_page_group()Trond Myklebust2023-09-132-14/+17
| | | | | | | | | | | | | | | | | | Ensure that nfs_clear_request_commit() updates the correct counters when it removes them from the commit list. Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | NFS: More O_DIRECT accounting fixes for error pathsTrond Myklebust2023-09-131-16/+31
| | | | | | | | | | | | | | | | | | If we hit a fatal error when retransmitting, we do need to record the removal of the request from the count of written bytes. Fixes: 031d73ed768a ("NFS: Fix O_DIRECT accounting of number of bytes read/written") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | NFS: Fix O_DIRECT locking issuesTrond Myklebust2023-09-131-4/+4
| | | | | | | | | | | | | | | | The dreq fields are protected by the dreq->lock. Fixes: 954998b60caa ("NFS: Fix error handling for O_DIRECT write scheduling") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | NFS: Fix error handling for O_DIRECT write schedulingTrond Myklebust2023-09-111-16/+46
|/ | | | | | | | | | | | | | If we fail to schedule a request for transmission, there are 2 possibilities: 1) Either we hit a fatal error, and we just want to drop the remaining requests on the floor. 2) We were asked to try again, in which case we should allow the outstanding RPC calls to complete, so that we can recoalesce requests and try again. Fixes: d600ad1f2bdb ("NFS41: pop some layoutget errors to application") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds2023-08-3119-43/+88
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Anna Schumaker: "New Features: - Enable the NFS v4.2 READ_PLUS operation by default Stable Fixes: - NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info - NFS: Fix a potential data corruption Bugfixes: - Fix various READ_PLUS issues including: - smatch warnings - xdr size calculations - scratch buffer handling - 32bit / highmem xdr page handling - Fix checkpatch errors in file.c - Fix redundant readdir request after an EOF - Fix handling of COPY ERR_OFFLOAD_NO_REQ - Fix assignment of xprtdata.cred Cleanups: - Remove unused xprtrdma function declarations - Clean up an integer overflow check to avoid a warning - Clean up #includes in dns_resolve.c - Clean up nfs4_get_device_info so we don't pass a NULL pointer to __free_page() - Clean up sunrpc TCP socket timeout configuration - Guard against READDIR loops when entry names are too long - Use EXCHID4_FLAG_USE_PNFS_DS for DS servers" * tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (22 commits) pNFS: Fix assignment of xprtdata.cred NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQ NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver SUNRPC: Don't override connect timeouts in rpc_clnt_add_xprt() SUNRPC: Allow specification of TCP client connect timeout at setup SUNRPC: Refactor and simplify connect timeout SUNRPC: Set the TCP_SYNCNT to match the socket timeout NFS: Fix a potential data corruption nfs: fix redundant readdir request after get eof nfs/blocklayout: Use the passed in gfp flags filemap: Fix errors in file.c NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info NFS: Move common includes outside ifdef SUNRPC: clean up integer overflow check xprtrdma: Remove unused function declaration rpcrdma_bc_post_recv() NFS: Enable the READ_PLUS operation by default SUNRPC: kmap() the xdr pages during decode NFSv4.2: Rework scratch handling for READ_PLUS (again) ...
| * pNFS: Fix assignment of xprtdata.credAnna Schumaker2023-08-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The comma at the end of the line was leftover from an earlier refactor of the _nfs4_pnfs_v3_ds_connect() function. This is technically valid C, so the compilers didn't catch it, but if I'm understanding how it works correctly it assigns the return value of rpc_clnt_add_xprtr() to xprtdata.cred. Reported-by: Olga Kornievskaia <kolga@netapp.com> Fixes: a12f996d3413 ("NFSv4/pNFS: Use connections to a DS that are all of the same protocol family") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQOlga Kornievskaia2023-08-301-2/+3
| | | | | | | | | | | | | | | | | | | | If the client sent a synchronous copy and the server replied with ERR_OFFLOAD_NO_REQ indicating that it wants an asynchronous copy instead, the client should retry with asynchronous copy. Fixes: 539f57b3e0fd ("NFS handle COPY ERR_OFFLOAD_NO_REQS") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFS: Guard against READDIR loop when entry names exceed MAXNAMELENBenjamin Coddington2023-08-302-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 64cfca85bacd asserts the only valid return values for nfs2/3_decode_dirent should not include -ENAMETOOLONG, but for a server that sends a filename3 which exceeds MAXNAMELEN in a READDIR response the client's behavior will be to endlessly retry the operation. We could map -ENAMETOOLONG into -EBADCOOKIE, but that would produce truncated listings without any error. The client should return an error for this case to clearly assert that the server implementation must be corrected. Fixes: 64cfca85bacd ("NFS: Return valid errors from nfs2/3_decode_dirent()") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS serverOlga Kornievskaia2023-08-242-0/+7
| | | | | | | | | | | | | | | | | | After receiving the location(s) of the DS server(s) in the GETDEVINCEINFO, create the request for the clientid to such server and indicate that the client is connecting to a DS. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFS/pNFS: Set the connect timeout for the pNFS flexfiles driverTrond Myklebust2023-08-244-0/+10
| | | | | | | | | | | | | | | | | | Ensure that the connect timeout for the pNFS flexfiles driver is of the same order as the I/O timeout, so that we can fail over quickly when trying to read from a data server that is down. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFS: Fix a potential data corruptionTrond Myklebust2023-08-241-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We must ensure that the subrequests are joined back into the head before we can retransmit a request. If the head was not on the commit lists, because the server wrote it synchronously, we still need to add it back to the retransmission list. Add a call that mirrors the effect of nfs_cancel_remove_inode() for O_DIRECT. Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * nfs: fix redundant readdir request after get eofKinglong Mee2023-08-241-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a directory contains 17 files (except . and ..), nfs client sends a redundant readdir request after get eof. A simple reproduce, At NFS server, create a directory with 17 files under exported directory. # mkdir test # cd test # for i in {0..16} ; do touch $i; done At NFS client, no matter mounting through nfsv3 or nfsv4, does ls (or ll) at the created test directory. A tshark output likes following (for nfsv4), # tshark -i eth0 tcp port 2049 -Tfields -e ip.src -e ip.dst -e nfs -e nfs.cookie4 srcip dstip SEQUENCE, PUTFH, READDIR 0 dstip srcip SEQUENCE PUTFH READDIR 909539109313539306,2108391201987888856,2305312124304486544,2566335452463141496,2978225129081509984,4263037479923412583,4304697173036510679,4666703455469210097,4759208201298769007,4776701232145978803,5338408478512081262,5949498658935544804,5971526429894832903,6294060338267709855,6528840566229532529,8600463293536422524,9223372036854775807 srcip dstip srcip dstip SEQUENCE, PUTFH, READDIR 9223372036854775807 dstip srcip SEQUENCE PUTFH READDIR The READDIR with cookie 9223372036854775807(0x7FFFFFFFFFFFFFFF) is redundant. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * nfs/blocklayout: Use the passed in gfp flagsDan Carpenter2023-08-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | This allocation should use the passed in GFP_ flags instead of GFP_KERNEL. One places where this matters is in filelayout_pg_init_write() which uses GFP_NOFS as the allocation flags. Fixes: 5c83746a0cf2 ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * filemap: Fix errors in file.chuzhi001@208suo.com2023-08-241-1/+1
| | | | | | | | | | | | | | | | | | The following checkpatch errors are removed: ERROR: "foo * bar" should be "foo *bar" "foo * bar" should be "foo *bar" Signed-off-by: ZhiHu <huzhi001@208suo.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_infoFedor Pchelkin2023-08-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | It is an almost improbable error case but when page allocating loop in nfs4_get_device_info() fails then we should only free the already allocated pages, as __free_page() can't deal with NULL arguments. Found by Linux Verification Center (linuxtesting.org). Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFS: Move common includes outside ifdefGUO Zihua2023-08-241-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | module.h, clnt.h, addr.h and dns_resolve.h is always included whether CONFIG_NFS_USE_KERNEL_DNS is set or not and their order does not seems to matter. Move them outside the ifdef to simplify code and avoid checkincludes message. Signed-off-by: GUO Zihua <guozihua@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFS: Enable the READ_PLUS operation by defaultAnna Schumaker2023-08-231-4/+2
| | | | | | | | | | | | | | Now that the remaining issues have been worked out, including some unexpected 32 bit issues, we can safely enable the feature by default. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFSv4.2: Rework scratch handling for READ_PLUS (again)Anna Schumaker2023-08-235-13/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I found that the read code might send multiple requests using the same nfs_pgio_header, but nfs4_proc_read_setup() is only called once. This is how we ended up occasionally double-freeing the scratch buffer, but also means we set a NULL pointer but non-zero length to the xdr scratch buffer. This results in an oops the first time decoding needs to copy something to scratch, which frequently happens when decoding READ_PLUS hole segments. I fix this by moving scratch handling into the pageio read code. I provide a function to allocate scratch space for decoding read replies, and free the scratch buffer when the nfs_pgio_header is freed. Fixes: fbd2a05f29a9 (NFSv4.2: Rework scratch handling for READ_PLUS) Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFSv4.2: Fix READ_PLUS size calculationsAnna Schumaker2023-08-231-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | I bump the decode_read_plus_maxsz to account for hole segments, but I need to subtract out this increase when calling rpc_prepare_reply_pages() so the common case of single data segment replies can be directly placed into the xdr pages without needing to be shifted around. Reported-by: Chuck Lever <chuck.lever@oracle.com> Fixes: d3b00a802c845 ("NFS: Replace the READ_PLUS decoding code") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * NFSv4.2: Fix READ_PLUS smatch warningsAnna Schumaker2023-08-231-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Smatch reports: fs/nfs/nfs42xdr.c:1131 decode_read_plus() warn: missing error code? 'status' Which Dan suggests to fix by doing a hardcoded "return 0" from the "if (segments == 0)" check. Additionally, smatch reports that the "status = -EIO" assignment is not used. This patch addresses both these issues. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/r/202305222209.6l5VM2lL-lkp@intel.com/ Fixes: d3b00a802c845 ("NFS: Replace the READ_PLUS decoding code") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | Merge tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds2023-08-311-19/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Chuck Lever: "I'm thrilled to announce that the Linux in-kernel NFS server now offers NFSv4 write delegations. A write delegation enables a client to cache data and metadata for a single file more aggressively, reducing network round trips and server workload. Many thanks to Dai Ngo for contributing this facility, and to Jeff Layton and Neil Brown for reviewing and testing it. This release also sees the removal of all support for DES- and triple-DES-based Kerberos encryption types in the kernel's SunRPC implementation. These encryption types have been deprecated by the Internet community for years and are considered insecure. This change affects both the in-kernel NFS client and server. The server's UDP and TCP socket transports have now fully adopted David Howells' new bio_vec iterator so that no more than one sendmsg() call is needed to transmit each RPC message. In particular, this helps kTLS optimize record boundaries when sending RPC-with-TLS replies, and it takes the server a baby step closer to handling file I/O via folios. We've begun work on overhauling the SunRPC thread scheduler to remove a costly linked-list walk when looking for an idle RPC service thread to wake. The pre-requisites are included in this release. Thanks to Neil Brown for his ongoing work on this improvement" * tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits) Documentation: Add missing documentation for EXPORT_OP flags SUNRPC: Remove unused declaration rpc_modcount() SUNRPC: Remove unused declarations NFSD: da_addr_body field missing in some GETDEVICEINFO replies SUNRPC: Remove return value of svc_pool_wake_idle_thread() SUNRPC: make rqst_should_sleep() idempotent() SUNRPC: Clean up svc_set_num_threads SUNRPC: Count ingress RPC messages per svc_pool SUNRPC: Deduplicate thread wake-up code SUNRPC: Move trace_svc_xprt_enqueue SUNRPC: Add enum svc_auth_status SUNRPC: change svc_xprt::xpt_flags bits to enum SUNRPC: change svc_rqst::rq_flags bits to enum SUNRPC: change svc_pool::sp_flags bits to enum SUNRPC: change cache_head.flags bits to enum SUNRPC: remove timeout arg from svc_recv() SUNRPC: change svc_recv() to return void. SUNRPC: call svc_process() from svc_recv(). nfsd: separate nfsd_last_thread() from nfsd_put() nfsd: Simplify code around svc_exit_thread() call in nfsd() ...
| * | SUNRPC: Add enum svc_auth_statusChuck Lever2023-08-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In addition to the benefits of using an enum rather than a set of macros, we now have a named type that can improve static type checking of function return values. As part of this change, I removed a stale comment from svcauth.h; the return values from current implementations of the auth_ops::release method are all zero/negative errno, not the SVC_OK enum values as the old comment suggested. Suggested-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | SUNRPC: remove timeout arg from svc_recv()NeilBrown2023-08-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most svc threads have no interest in a timeout. nfsd sets it to 1 hour, but this is a wart of no significance. lockd uses the timeout so that it can call nlmsvc_retry_blocked(). It also sometimes calls svc_wake_up() to ensure this is called. So change lockd to be consistent and always use svc_wake_up() to trigger nlmsvc_retry_blocked() - using a timer instead of a timeout to svc_recv(). And change svc_recv() to not take a timeout arg. This makes the sp_threads_timedout counter always zero. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | SUNRPC: change svc_recv() to return void.NeilBrown2023-08-291-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | svc_recv() currently returns a 0 on success or one of two errors: - -EAGAIN means no message was successfully received - -EINTR means the thread has been told to stop Previously nfsd would stop as the result of a signal as well as following kthread_stop(). In that case the difference was useful: EINTR means stop unconditionally. EAGAIN means stop if kthread_should_stop(), continue otherwise. Now threads only exit when kthread_should_stop() so we don't need the distinction. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | SUNRPC: call svc_process() from svc_recv().NeilBrown2023-08-291-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All callers of svc_recv() go on to call svc_process() on success. Simplify callers by having svc_recv() do that for them. This loses one call to validate_process_creds() in nfsd. That was debugging code added 14 years ago. I don't think we need to keep it. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * | nfsd: don't allow nfsd threads to be signalled.NeilBrown2023-08-291-8/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original implementation of nfsd used signals to stop threads during shutdown. In Linux 2.3.46pre5 nfsd gained the ability to shutdown threads internally it if was asked to run "0" threads. After this user-space transitioned to using "rpc.nfsd 0" to stop nfsd and sending signals to threads was no longer an important part of the API. In commit 3ebdbe5203a8 ("SUNRPC: discard svo_setup and rename svc_set_num_threads_sync()") (v5.17-rc1~75^2~41) we finally removed the use of signals for stopping threads, using kthread_stop() instead. This patch makes the "obvious" next step and removes the ability to signal nfsd threads - or any svc threads. nfsd stops allowing signals and we don't check for their delivery any more. This will allow for some simplification in later patches. A change worth noting is in nfsd4_ssc_setup_dul(). There was previously a signal_pending() check which would only succeed when the thread was being shut down. It should really have tested kthread_should_stop() as well. Now it just does the latter, not the former. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | NFS: switch back to using kill_anon_superChristoph Hellwig2023-08-311-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NFS switch to open coding kill_anon_super in 7b14a213890a ("nfs: don't call bdi_unregister") to avoid the extra bdi_unregister call. At that point bdi_destroy was called in nfs_free_server and thus it required a later freeing of the anon dev_t. But since 0db10944a76b ("nfs: Convert to separately allocated bdi") the bdi has been free implicitly by the sb destruction, so this isn't needed anymore. By not open coding kill_anon_super, nfs now inherits the fix in dc3216b14160 ("super: ensure valid info"), and we remove the only open coded version of kill_anon_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230831052940.256193-1-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
* | Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds2023-08-291-0/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
| * | mm, netfs, fscache: stop read optimisation when folio removed from pagecacheDavid Howells2023-08-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fscache has an optimisation by which reads from the cache are skipped until we know that (a) there's data there to be read and (b) that data isn't entirely covered by pages resident in the netfs pagecache. This is done with two flags manipulated by fscache_note_page_release(): if (... test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) && test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags)) clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags); where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to indicate that netfslib should download from the server or clear the page instead. The fscache_note_page_release() function is intended to be called from ->releasepage() - but that only gets called if PG_private or PG_private_2 is set - and currently the former is at the discretion of the network filesystem and the latter is only set whilst a page is being written to the cache, so sometimes we miss clearing the optimisation. Fix this by following Willy's suggestion[1] and adding an address_space flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call ->release_folio() if it's set, even if PG_private or PG_private_2 aren't set. Note that this would require folio_test_private() and page_has_private() to become more complicated. To avoid that, in the places[*] where these are used to conditionalise calls to filemap_release_folio() and try_to_release_page(), the tests are removed the those functions just jumped to unconditionally and the test is performed there. [*] There are some exceptions in vmscan.c where the check guards more than just a call to the releaser. I've added a function, folio_needs_release() to wrap all the checks for that. AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from fscache and cleared in ->evict_inode() before truncate_inode_pages_final() is called. Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared and the optimisation cancelled if a cachefiles object already contains data when we open it. [dwysocha@redhat.com: call folio_mapping() inside folio_needs_release()] Link: https://github.com/DaveWysochanskiRH/kernel/commit/902c990e311120179fa5de99d68364b2947b79ec Link: https://lkml.kernel.org/r/20230628104852.3391651-3-dhowells@redhat.com Fixes: 1f67e6d0b188 ("fscache: Provide a function to note the release of a page") Fixes: 047487c947e8 ("cachefiles: Implement the I/O routines") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: SeongJae Park <sj@kernel.org> Cc: Daire Byrne <daire.byrne@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | Merge tag 'v6.6-vfs.misc' of ↵Linus Torvalds2023-08-283-4/+4
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual filesystems. Features: - Block mode changes on symlinks and rectify our broken semantics - Report file modifications via fsnotify() for splice - Allow specifying an explicit timeout for the "rootwait" kernel command line option. This allows to timeout and reboot instead of always waiting indefinitely for the root device to show up - Use synchronous fput for the close system call Cleanups: - Get rid of open-coded lockdep workarounds for async io submitters and replace it all with a single consolidated helper - Simplify epoll allocation helper - Convert simple_write_begin and simple_write_end to use a folio - Convert page_cache_pipe_buf_confirm() to use a folio - Simplify __range_close to avoid pointless locking - Disable per-cpu buffer head cache for isolated cpus - Port ecryptfs to kmap_local_page() api - Remove redundant initialization of pointer buf in pipe code - Unexport the d_genocide() function which is only used within core vfs - Replace printk(KERN_ERR) and WARN_ON() with WARN() Fixes: - Fix various kernel-doc issues - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE - Fix a mainly theoretical issue in devpts - Check the return value of __getblk() in reiserfs - Fix a racy assert in i_readcount_dec - Fix integer conversion issues in various functions - Fix LSM security context handling during automounts that prevented NFS superblock sharing" * tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) cachefiles: use kiocb_{start,end}_write() helpers ovl: use kiocb_{start,end}_write() helpers aio: use kiocb_{start,end}_write() helpers io_uring: use kiocb_{start,end}_write() helpers fs: create kiocb_{start,end}_write() helpers fs: add kerneldoc to file_{start,end}_write() helpers io_uring: rename kiocb_end_write() local helper splice: Convert page_cache_pipe_buf_confirm() to use a folio libfs: Convert simple_write_begin and simple_write_end to use a folio fs/dcache: Replace printk and WARN_ON by WARN fs/pipe: remove redundant initialization of pointer buf fs: Fix kernel-doc warnings devpts: Fix kernel-doc warnings doc: idmappings: fix an error and rephrase a paragraph init: Add support for rootwait timeout parameter vfs: fix up the assert in i_readcount_dec fs: Fix one kernel-doc comment docs: filesystems: idmappings: clarify from where idmappings are taken fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing ...
| * | | fs: Pass argument to fcntl_setlease as intLuca Vizzarro2023-07-103-4/+4
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The interface for fcntl expects the argument passed for the command F_SETLEASE to be of type int. The current code wrongly treats it as a long. In order to avoid access to undefined bits, we should explicitly cast the argument to int. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: David Laight <David.Laight@ACULAB.com> Cc: Mark Rutland <Mark.Rutland@arm.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: linux-nfs@vger.kernel.org Cc: linux-morello@op-lists.linaro.org Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com> Message-Id: <20230414152459.816046-3-Luca.Vizzarro@arm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | Merge tag 'v6.6-vfs.ctime' of ↵Linus Torvalds2023-08-284-15/+16
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...