summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'pull-old-dio' of ↵Linus Torvalds2023-04-241-2/+2
|\ | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull legacy dio cleanup from Al Viro. * tag 'pull-old-dio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: __blockdev_direct_IO(): get rid of submit_io callback
| * __blockdev_direct_IO(): get rid of submit_io callbackAl Viro2023-03-051-2/+2
| | | | | | | | | | | | always NULL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'pull-write-one-page' of ↵Linus Torvalds2023-04-241-6/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs write_one_page removal from Al Viro: "write_one_page series" * tag 'pull-write-one-page' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: mm,jfs: move write_one_page/folio_write_one to jfs ocfs2: don't use write_one_page in ocfs2_duplicate_clusters_by_page ufs: don't flush page immediately for DIRSYNC directories
| * | mm,jfs: move write_one_page/folio_write_one to jfsChristoph Hellwig2023-03-121-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The last remaining user of folio_write_one through the write_one_page wrapper is jfs, so move the functionality there and hard code the call to metapage_writepage. Note that the use of the pagecache by the JFS 'metapage' buffer cache is a bit odd, and we could probably do without VM-level dirty tracking at all, but that's a change for another time. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2023-04-241-1/+0
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull vfs fget updates from Al Viro: "fget() to fdget() conversions" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse_dev_ioctl(): switch to fdget() cgroup_get_from_fd(): switch to fdget_raw() bpf: switch to fdget_raw() build_mount_idmapped(): switch to fdget() kill the last remaining user of proc_ns_fget() SVM-SEV: convert the rest of fget() uses to fdget() in there convert sgx_set_attribute() to fdget()/fdput() convert setns(2) to fdget()/fdput()
| * | | kill the last remaining user of proc_ns_fget()Al Viro2023-04-201-1/+0
| | |/ | |/| | | | | | | | | | | | | | | | lookups by descriptor are better off closer to syscall surface... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge tag 'erofs-for-6.4-rc1' of ↵Linus Torvalds2023-04-241-2/+2
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, sub-page block support for uncompressed files is available. It's mainly used to enable original signing ('golden') 4k-block images on arm64 with 16/64k pages. In addition, end users could also use this feature to build a manifest to directly refer to golden tar data. Besides, long xattr name prefix support is also introduced in this cycle to avoid too many xattrs with the same prefix (e.g. overlayfs xattrs). It's useful for erofs + overlayfs combination (like Composefs model): the image size is reduced by ~14% and runtime performance is also slightly improved. Others are random fixes and cleanups as usual. Summary: - Add sub-page block size support for uncompressed files - Support flattened block device for multi-blob images to be attached into virtual machines (including cloud servers) and bare metals - Support long xattr name prefixes to optimize images with common xattr namespaces (e.g. files with overlayfs xattrs) use cases - Various minor cleanups & fixes" * tag 'erofs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: cleanup i_format-related stuffs erofs: sunset erofs_dbg() erofs: fix potential overflow calculating xattr_isize erofs: get rid of z_erofs_fill_inode() erofs: enable long extended attribute name prefixes erofs: handle long xattr name prefixes properly erofs: add helpers to load long xattr name prefixes erofs: introduce on-disk format for long xattr name prefixes erofs: move packed inode out of the compression part erofs: keep meta inode into erofs_buf erofs: initialize packed inode after root inode is assigned erofs: stop parsing non-compact HEAD index if clusterofs is invalid erofs: don't warn ztailpacking feature anymore erofs: simplify erofs_xattr_generic_get() erofs: rename init_inode_xattrs with erofs_ prefix erofs: move several xattr helpers into xattr.c erofs: tidy up EROFS on-disk naming erofs: support flattened block device for multi-blob images erofs: set block size to the on-disk block size erofs: avoid hardcoded blocksize for subpage block support
| * | | erofs: avoid hardcoded blocksize for subpage block supportJingbo Xu2023-04-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the first step of converting hardcoded blocksize to that specified in on-disk superblock, convert all call sites of hardcoded blocksize to sb->s_blocksize except for: 1) use sbi->blkszbits instead of sb->s_blocksize in erofs_superblock_csum_verify() since sb->s_blocksize has not been updated with the on-disk blocksize yet when the function is called. 2) use inode->i_blkbits instead of sb->s_blocksize in erofs_bread(), since the inode operated on may be an anonymous inode in fscache mode. Currently the anonymous inode is allocated from an anonymous mount maintained in erofs, while in the near future we may allocate anonymous inodes from a generic API directly and thus have no access to the anonymous inode's i_sb. Thus we keep the block size in i_blkbits for anonymous inodes in fscache mode. Be noted that this patch only gets rid of the hardcoded blocksize, in preparation for actually setting the on-disk block size in the following patch. The hard limit of constraining the block size to PAGE_SIZE still exists until the next patch. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230313135309.75269-2-jefflexu@linux.alibaba.com [ Gao Xiang: fold a patch to fix incorrect truncated offsets. ] Link: https://lore.kernel.org/r/20230413035734.15457-1-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
* | | | Merge tag 'v6.4/vfs.open' of ↵Linus Torvalds2023-04-241-1/+0
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs open fixlet from Christian Brauner: "EINVAL ist keinmal: This contains the changes to make O_DIRECTORY when specified together with O_CREAT an invalid request. The wider background is that a regression report about the behavior of O_DIRECTORY | O_CREAT was sent to fsdevel about a behavior that was changed multiple years and LTS releases earlier during v5.7 development. This has also been covered in https://lwn.net/Articles/926782/ which provides an excellent summary of the discussion" * tag 'v6.4/vfs.open' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: open: return EINVAL for O_DIRECTORY | O_CREAT
| * | | | open: return EINVAL for O_DIRECTORY | O_CREATChristian Brauner2023-03-221-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After a couple of years and multiple LTS releases we received a report that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7. On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL had the following semantics: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: create regular file * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: create regular file * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory On kernels since to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL have the following semantics: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: ENOTDIR * d exists and is a directory: EISDIR (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: ENOTDIR (create regular file) * d exists and is a regular file: EEXIST * d exists and is a directory: EEXIST (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory This is a fairly substantial semantic change that userspace didn't notice until Pedro took the time to deliberately figure out corner cases. Since no one noticed this breakage we can somewhat safely assume that O_DIRECTORY | O_CREAT combinations are likely unused. The v5.7 breakage is especially weird because while ENOTDIR is returned indicating failure a regular file is actually created. This doesn't make a lot of sense. Time was spent finding potential users of this combination. Searching on codesearch.debian.net showed that codebases often express semantical expectations about O_DIRECTORY | O_CREAT which are completely contrary to what our code has done and currently does. The expectation often is that this particular combination would create and open a directory. This suggests users who tried to use that combination would stumble upon the counterintuitive behavior no matter if pre-v5.7 or post v5.7 and quickly realize neither semantics give them what they want. For some examples see the code examples in [1] to [3] and the discussion in [4]. There are various ways to address this issue. The lazy/simple option would be to restore the pre-v5.7 behavior and to just live with that bug forever. But since there's a real chance that the O_DIRECTORY | O_CREAT quirk isn't relied upon we should try to get away with murder(ing bad semantics) first. If we need to Frankenstein pre-v5.7 behavior later so be it. So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT combinations. In addition to cleaning up the old bug this also opens up the possiblity to make that flag combination do something more intuitive in the future. Starting with this commit the following semantics apply: (1) open("/tmp/d", O_DIRECTORY | O_CREAT) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL (2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL) * d doesn't exist: EINVAL * d exists and is a regular file: EINVAL * d exists and is a directory: EINVAL (3) open("/tmp/d", O_DIRECTORY | O_EXCL) * d doesn't exist: ENOENT * d exists and is a regular file: ENOTDIR * d exists and is a directory: open directory One additional note, O_TMPFILE is implemented as: #define __O_TMPFILE 020000000 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) For older kernels it was important to return an explicit error when O_TMPFILE wasn't supported. So O_TMPFILE requires that O_DIRECTORY is raised alongside __O_TMPFILE. It also enforced that O_CREAT wasn't specified. Since O_DIRECTORY | O_CREAT could be used to create a regular allowing that combination together with __O_TMPFILE would've meant that false positives were possible, i.e., that a regular file was created instead of a O_TMPFILE. This could've been used to trick userspace into thinking it operated on a O_TMPFILE when it wasn't. Now that we block O_DIRECTORY | O_CREAT completely the check for O_CREAT in the __O_TMPFILE branch via if ((flags & O_TMPFILE_MASK) != O_TMPFILE) can be dropped. Instead we can simply check verify that O_DIRECTORY is raised via if (!(flags & O_DIRECTORY)) and explain this in two comments. As Aleksa pointed out O_PATH is unaffected by this change since it always returned EINVAL if O_CREAT was specified - with or without O_DIRECTORY. Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1] Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2] Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3] Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4] Reported-by: Pedro Falcato <pedro.falcato@gmail.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | | | Merge tag 'v6.4/vfs.misc' of ↵Linus Torvalds2023-04-243-2/+3
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains a pile of various smaller fixes. Most of them aren't very interesting so this just highlights things worth mentioning: - Various filesystems contained the same little helper to convert from the mode of a dentry to the DT_* type of that dentry. They have now all been switched to rely on the generic fs_umode_to_dtype() helper. All custom helpers are removed (Jeff) - Fsnotify now reports ACCESS and MODIFY events for splice (Chung-Chiang Cheng) - After converting timerfd a long time ago to rely on wait_event_interruptible_*() apis, convert eventfd as well. This removes the complex open-coded wait code (Wen Yang) - Simplify sysctl registration for devpts, avoiding the declaration of two tables. Instead, just use a prefixed path with register_sysctl() (Luis) - The setattr_should_drop_sgid() helper is now exported so NFS can use it. By switching NFS to this helper an NFS setgid inheritance bug is fixed (me)" * tag 'v6.4/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode() pnode: pass mountpoint directly eventfd: use wait_event_interruptible_locked_irq() helper splice: report related fsnotify events fs: consolidate duplicate dt_type helpers nfs: use vfs setgid helper Update relatime comments to include equality fs/buffer: Remove redundant assignment to err fs_context: drop the unused lsm_flags member fs/namespace: fnic: Switch to use %ptTd Documentation: update idmappings.rst devpts: simplify two-level sysctl registration for pty_kern_table eventpoll: align comment with nested epoll limitation
| * | | | | nfs: use vfs setgid helperChristian Brauner2023-03-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've aligned setgid behavior over multiple kernel releases. The details can be found in the following two merge messages: cf619f891971 ("Merge tag 'fs.ovl.setgid.v6.2') 426b4ca2d6a5 ("Merge tag 'fs.setgid.v6.0') Consistent setgid stripping behavior is now encapsulated in the setattr_should_drop_sgid() helper which is used by all filesystems that strip setgid bits outside of vfs proper. Switch nfs to rely on this helper as well. Without this patch the setgid stripping tests in xfstests will fail. Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230313-fs-nfs-setgid-v2-1-9a59f436cfc0@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | | fs_context: drop the unused lsm_flags memberOndrej Mosnacek2023-03-162-2/+1
| | |_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This isn't ever used by VFS now, and it couldn't even work. Any FS that uses the SECURITY_LSM_NATIVE_LABELS flag needs to also process the value returned back from the LSM, so it needs to do its security_sb_set_mnt_opts() call on its own anyway. Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | | | | Merge tag 'v6.4/vfs.acl' of ↵Linus Torvalds2023-04-243-3/+28
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull acl updates from Christian Brauner: "After finishing the introduction of the new posix acl api last cycle the generic POSIX ACL xattr handlers are still around in the filesystems xattr handlers for two reasons: (1) Because a few filesystems rely on the ->list() method of the generic POSIX ACL xattr handlers in their ->listxattr() inode operation. (2) POSIX ACLs are only available if IOP_XATTR is raised. The IOP_XATTR flag is raised in inode_init_always() based on whether the sb->s_xattr pointer is non-NULL. IOW, the registered xattr handlers of the filesystem are used to raise IOP_XATTR. Removing the generic POSIX ACL xattr handlers from all filesystems would risk regressing filesystems that only implement POSIX ACL support and no other xattrs (nfs3 comes to mind). This contains the work to decouple POSIX ACLs from the IOP_XATTR flag as they don't depend on xattr handlers anymore. So it's now possible to remove the generic POSIX ACL xattr handlers from the sb->s_xattr list of all filesystems. This is a crucial step as the generic POSIX ACL xattr handlers aren't used for POSIX ACLs anymore and POSIX ACLs don't depend on the xattr infrastructure anymore. Adressing problem (1) will require more long-term work. It would be best to get rid of the ->list() method of xattr handlers completely at some point. For erofs, ext{2,4}, f2fs, jffs2, ocfs2, and reiserfs the nop POSIX ACL xattr handler is kept around so they can continue to use array-based xattr handler indexing. This update does simplify the ->listxattr() implementation of all these filesystems however" * tag 'v6.4/vfs.acl' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: acl: don't depend on IOP_XATTR ovl: check for ->listxattr() support reiserfs: rework priv inode handling fs: rename generic posix acl handlers reiserfs: rework ->listxattr() implementation fs: simplify ->listxattr() implementation fs: drop unused posix acl handlers xattr: remove unused argument xattr: add listxattr helper xattr: simplify listxattr helpers
| * | | | | fs: rename generic posix acl handlersChristian Brauner2023-03-061-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reflect in their naming and document that they are kept around for legacy reasons and shouldn't be used anymore by new code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | xattr: remove unused argumentChristian Brauner2023-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | his helpers is really just used to check for user.* xattr support so don't make it pointlessly generic. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | xattr: add listxattr helperChristian Brauner2023-03-061-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a tiny helper to determine whether an xattr handler given a specific dentry supports listing the requested xattr. We will use this helper in various filesystems in later commits. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | xattr: simplify listxattr helpersChristian Brauner2023-03-062-0/+8
| | |_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The generic_listxattr() and simple_xattr_list() helpers list xattrs and contain duplicated code. Add two helpers that both generic_listxattr() and simple_xattr_list() can use. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | | | | Merge tag 'v6.4/pidfd.file' of ↵Linus Torvalds2023-04-241-0/+1
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd updates from Christian Brauner: "This adds a new pidfd_prepare() helper which allows the caller to reserve a pidfd number and allocates a new pidfd file that stashes the provided struct pid. It should be avoided installing a file descriptor into a task's file descriptor table just to close it again via close_fd() in case an error occurs. The fd has been visible to userspace and might already be in use. Instead, a file descriptor should be reserved but not installed into the caller's file descriptor table. If another failure path is hit then the reserved file descriptor and file can just be put without any userspace visible side-effects. And if all failure paths are cleared the file descriptor and file can be installed into the task's file descriptor table. This helper is now used in all places that open coded this functionality before. For example, this is currently done during copy_process() and fanotify used pidfd_create(), which returns a pidfd that has already been made visibile in the caller's file descriptor table, but then closed it using close_fd(). In one of the next merge windows there is also new functionality coming to unix domain sockets that will have to rely on pidfd_prepare()" * tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fanotify: use pidfd_prepare() fork: use pidfd_prepare() pid: add pidfd_prepare()
| * | | | | pid: add pidfd_prepare()Christian Brauner2023-04-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new helper that allows to reserve a pidfd and allocates a new pidfd file that stashes the provided struct pid. This will allow us to remove places that either open code this function or that call pidfd_create() but then have to call close_fd() because there are still failure points after pidfd_create() has been called. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230327-pidfd-file-api-v1-1-5c0e9a3158e4@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | | | | Merge tag 'v6.4/kernel.user_worker' of ↵Linus Torvalds2023-04-243-4/+34
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull user work thread updates from Christian Brauner: "This contains the work generalizing the ability to create a kernel worker from a userspace process. Such user workers will run with the same credentials as the userspace process they were created from providing stronger security and accounting guarantees than the traditional override_creds() approach ever could've hoped for. The original work was heavily based and optimzed for the needs of io_uring which was the first user. However, as it quickly turned out the ability to create user workers inherting properties from a userspace process is generally useful. The vhost subsystem currently creates workers using the kthread api. The consequences of using the kthread api are that RLIMITs don't work correctly as they are inherited from khtreadd. This leads to bugs where more workers are created than would be allowed by the RLIMITs of the userspace process in lieu of which workers are created. Problems like this disappear with user workers created from the userspace processes for which they perform the work. In addition, providing this api allows vhost to remove additional complexity. For example, cgroup and mm sharing will just work out of the box with user workers based on the relevant userspace process instead of manually ensuring the correct cgroup and mm contexts are used. So the vhost subsystem should simply be made to use the same mechanism as io_uring. To this end the original mechanism used for create_io_thread() is generalized into user workers: - Introduce PF_USER_WORKER as a generic indicator that a given task is a user worker, i.e., a kernel task that was created from a userspace process. Now a PF_IO_WORKER thread is just a specialized version of PF_USER_WORKER. So io_uring io workers raise both flags. - Make copy_process() available to core kernel code - Extend struct kernel_clone_args with the following bitfields allowing to indicate to copy_process(): - to create a user worker (raise PF_USER_WORKER) - to not inherit any files from the userspace process - to ignore signals After all generic changes are in place the vhost subsystem implements a new dedicated vhost api based on user workers. Finally, vhost is switched to rely on the new api moving it off of kthreads. Thanks to Mike for sticking it out and making it through this rather arduous journey" * tag 'v6.4/kernel.user_worker' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: vhost: use vhost_tasks for worker threads vhost: move worker thread fields to new struct vhost_task: Allow vhost layer to use copy_process fork: allow kernel code to call copy_process fork: Add kernel_clone_args flag to ignore signals fork: add kernel_clone_args flag to not dup/clone files fork/vm: Move common PF_IO_WORKER behavior to new flag kernel: Make io_thread and kthread bit fields kthread: Pass in the thread's name during creation kernel: Allow a kernel thread's name to be set in copy_process csky: Remove kernel_thread declaration
| * | | | | | vhost_task: Allow vhost layer to use copy_processMike Christie2023-03-231-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Qemu will create vhost devices in the kernel which perform network, SCSI, etc IO and management operations from worker threads created by the kthread API. Because the kthread API does a copy_process on the kthreadd thread, the vhost layer has to use kthread_use_mm to access the Qemu thread's memory and cgroup_attach_task_all to add itself to the Qemu thread's cgroups, and it bypasses the RLIMIT_NPROC limit which can result in VMs creating more threads than the admin expected. This patch adds a new struct vhost_task which can be used instead of kthreads. They allow the vhost layer to use copy_process and inherit the userspace process's mm and cgroups, the task is accounted for under the userspace's nproc count and can be seen in its process tree, and other features like namespaces work and are inherited by default. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | | | fork: allow kernel code to call copy_processMike Christie2023-03-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The next patch adds helpers like create_io_thread, but for use by the vhost layer. There are several functions, so they are in their own file instead of cluttering up fork.c. This patch allows that new file to call copy_process. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | fork: Add kernel_clone_args flag to ignore signalsMike Christie2023-03-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since: commit 10ab825bdef8 ("change kernel threads to ignore signals instead of blocking them") kthreads have been ignoring signals by default, and the vhost layer has never had a need to change that. This patch adds an option flag, USER_WORKER_SIG_IGN, handled in copy_process() after copy_sighand() and copy_signals() so vhost_tasks added in the next patches can continue to ignore singals. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | fork: add kernel_clone_args flag to not dup/clone filesMike Christie2023-03-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each vhost device gets a thread that is used to perform IO and management operations. Instead of a thread that is accessing a device, the thread is part of the device, so when it creates a thread using a helper based on copy_process we can't dup or clone the parent's files/FDS because it would do an extra increment on ourself. Later, when we do: Qemu process exits: do_exit -> exit_files -> put_files_struct -> close_files we would leak the device's resources because of that extra refcount on the fd or file_struct. This patch adds a no_files option so these worker threads can prevent taking an extra refcount on themselves. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | fork/vm: Move common PF_IO_WORKER behavior to new flagMike Christie2023-03-122-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a new flag, PF_USER_WORKER, that's used for behavior common to to both PF_IO_WORKER and users like vhost which will use a new helper instead of create_io_thread because they require different behavior for operations like signal handling. The common behavior PF_USER_WORKER covers is the vm reclaim handling. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | kernel: Make io_thread and kthread bit fieldsMike Christie2023-03-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We only set args->io_thread/kthread to 0 or 1 then test if they are set, so make them bit fields. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
| * | | | | | kernel: Allow a kernel thread's name to be set in copy_processMike Christie2023-03-121-1/+3
| | |/ / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows kernel users to pass in the thread name so it can be set during creation instead of having to use set_task_comm after the thread is created. Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | | | | | Merge tag 'linux-kselftest-kunit-6.4-rc1' of ↵Linus Torvalds2023-04-242-3/+3
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull KUnit updates from Shuah Khan: - several fixes to kunit tool - new klist structure test - support for m68k under QEMU - support for overriding the QEMU serial port - support for SH under QEMU * tag 'linux-kselftest-kunit-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: add tests for using current KUnit test field kunit: tool: Add support for SH under QEMU kunit: tool: Add support for overriding the QEMU serial port .gitignore: Unignore .kunitconfig list: test: Test the klist structure kunit: increase KUNIT_LOG_SIZE to 2048 bytes kunit: Use gfp in kunit_alloc_resource() kernel-doc kunit: tool: fix pre-existing `mypy --strict` errors and update run_checks.py kunit: tool: remove unused imports and variables kunit: tool: add subscripts for type annotations where appropriate kunit: fix bug of extra newline characters in debugfs logs kunit: fix bug in the order of lines in debugfs logs kunit: fix bug in debugfs logs of parameterized tests kunit: tool: Add support for m68k under QEMU
| * | | | | | kunit: increase KUNIT_LOG_SIZE to 2048 bytesHeiko Carstens2023-03-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The s390 specific test_unwind kunit test has 39 parameterized tests. The results in debugfs are truncated since the full log doesn't fit into 1500 bytes. Therefore increase KUNIT_LOG_SIZE to 2048 bytes in a similar way like it was done recently with commit "kunit: fix bug in debugfs logs of parameterized tests". With that the whole test result is present. Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
| * | | | | | kunit: Use gfp in kunit_alloc_resource() kernel-docStephen Boyd2023-03-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Copy/pasting the code from the kernel-doc here doesn't compile because kunit_alloc_resource() takes a gfp flags argument. Pass the gfp argument from the caller to complete the example. Signed-off-by: Stephen Boyd <sboyd@kernel.org> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
| * | | | | | kunit: fix bug of extra newline characters in debugfs logsRae Moar2023-03-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix bug of the extra newline characters in debugfs logs. When a line is added to debugfs with a newline character at the end, an extra line appears in the debugfs log. This is due to a discrepancy between how the lines are printed and how they are added to the logs. Remove this discrepancy by checking if a newline character is present before adding a newline character. This should closely match the printk behavior. Add kunit_log_newline_test to provide test coverage for this issue. (Also, move kunit_log_test above suite definition to remove the unnecessary declaration prior to the suite definition) As an example, say we add these two lines to the log: kunit_log(..., "KTAP version 1\n"); kunit_log(..., "1..1"); The debugfs log before this fix: KTAP version 1 1..1 The debugfs log after this fix: KTAP version 1 1..1 Signed-off-by: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
| * | | | | | kunit: fix bug in debugfs logs of parameterized testsRae Moar2023-03-101-1/+1
| |/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix bug in debugfs logs that causes individual parameterized results to not appear because the log is reinitialized (cleared) when each parameter is run. Ensure these results appear in the debugfs logs, increase log size to allow for the size of parameterized results. As a result, append lines to the log directly rather than using an intermediate variable that can cause stack size warnings due to the increased log size. Here is the debugfs log of ext4_inode_test which uses parameterized tests before the fix: KTAP version 1 # Subtest: ext4_inode_test 1..1 # Totals: pass:16 fail:0 skip:0 total:16 ok 1 ext4_inode_test As you can see, this log does not include any of the individual parametrized results. After (in combination with the next two fixes to remove extra empty line and ensure KTAP valid format): KTAP version 1 1..1 KTAP version 1 # Subtest: ext4_inode_test 1..1 KTAP version 1 # Subtest: inode_test_xtimestamp_decoding ok 1 1901-12-13 Lower bound of 32bit < 0 timestamp, no extra bits ... (the rest of the individual parameterized tests) ok 16 2446-05-10 Upper bound of 32bit >=0 timestamp. All extra # inode_test_xtimestamp_decoding: pass:16 fail:0 skip:0 total:16 ok 1 inode_test_xtimestamp_decoding # Totals: pass:16 fail:0 skip:0 total:16 ok 1 ext4_inode_test Signed-off-by: Rae Moar <rmoar@google.com> Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* | | | | | Merge tag 'rcu.6.4.april5.2023.3' of ↵Linus Torvalds2023-04-248-40/+116
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux Pull RCU updates from Joel Fernandes: - Updates and additions to MAINTAINERS files, with Boqun being added to the RCU entry and Zqiang being added as an RCU reviewer. I have also transitioned from reviewer to maintainer; however, Paul will be taking over sending RCU pull-requests for the next merge window. - Resolution of hotplug warning in nohz code, achieved by fixing cpu_is_hotpluggable() through interaction with the nohz subsystem. Tick dependency modifications by Zqiang, focusing on fixing usage of the TICK_DEP_BIT_RCU_EXP bitmask. - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n kernels, fixed by Zqiang. - Improvements to rcu-tasks stall reporting by Neeraj. - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for increased robustness, affecting several components like mac802154, drbd, vmw_vmci, tracing, and more. A report by Eric Dumazet showed that the API could be unknowingly used in an atomic context, so we'd rather make sure they know what they're asking for by being explicit: https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/ - Documentation updates, including corrections to spelling, clarifications in comments, and improvements to the srcu_size_state comments. - Better srcu_struct cache locality for readers, by adjusting the size of srcu_struct in support of SRCU usage by Christoph Hellwig. - Teach lockdep to detect deadlocks between srcu_read_lock() vs synchronize_srcu() contributed by Boqun. Previously lockdep could not detect such deadlocks, now it can. - Integration of rcutorture and rcu-related tools, targeted for v6.4 from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis module parameter, and more - Miscellaneous changes, various code cleanups and comment improvements * tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits) checkpatch: Error out if deprecated RCU API used mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep() ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep() net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep() net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep() lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep() tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep() misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep() drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep() rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan() rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early rcu: Remove never-set needwake assignment from rcu_report_qs_rdp() rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race rcu/trace: use strscpy() to instead of strncpy() tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem ...
| | \ \ \ \ \
| | \ \ \ \ \
| | \ \ \ \ \
| | \ \ \ \ \
| | \ \ \ \ \
| | \ \ \ \ \
| *-----. \ \ \ \ \ Merge branches 'rcu/staging-core', 'rcu/staging-docs' and ↵Joel Fernandes (Google)2023-04-055-36/+111
| |\ \ \ \ \ \ \ \ \ | | | | |_|/ / / / / | | | |/| | | | | | | | | | | | | | | | 'rcu/staging-kfree', remote-tracking branches 'paul/srcu-cf.2023.04.04a', 'fbq/rcu/lockdep.2023.03.27a' and 'fbq/rcu/rcutorture.2023.03.20a' into rcu/staging
| | | | | * | | | | locking/lockdep: Improve the deadlock scenario print for sync and read lockBoqun Feng2023-03-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lock scenario print is always a weak spot of lockdep splats. Improvement can be made if we rework the dependency search and the error printing. However without touching the graph search, we can improve a little for the circular deadlock case, since we have the to-be-added lock dependency, and know whether these two locks are read/write/sync. In order to know whether a held_lock is sync or not, a bit was "stolen" from ->references, which reduce our limit for the same lock class nesting from 2^12 to 2^11, and it should still be good enough. Besides, since we now have bit in held_lock for sync, we don't need the "hardirqoffs being 1" trick, and also we can avoid the __lock_release() if we jump out of __lock_acquire() before the held_lock stored. With these changes, a deadlock case evolved with read lock and sync gets a better print-out from: [...] Possible unsafe locking scenario: [...] [...] CPU0 CPU1 [...] ---- ---- [...] lock(srcuA); [...] lock(srcuB); [...] lock(srcuA); [...] lock(srcuB); to [...] Possible unsafe locking scenario: [...] [...] CPU0 CPU1 [...] ---- ---- [...] rlock(srcuA); [...] lock(srcuB); [...] lock(srcuA); [...] sync(srcuB); Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
| | | | | * | | | | rcu: Annotate SRCU's update-side lockdep dependenciesBoqun Feng2023-03-271-2/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although all flavors of RCU readers are annotated correctly with lockdep as recursive read locks, they do not set the lock_acquire 'check' parameter. This means that RCU read locks are not added to the lockdep dependency graph, which in turn means that lockdep cannot detect RCU-based deadlocks. This is not a problem for RCU flavors having atomic read-side critical sections because context-based annotations can catch these deadlocks, see for example the RCU_LOCKDEP_WARN() statement in synchronize_rcu(). But context-based annotations are not helpful for sleepable RCU, especially given that it is perfectly legal to do synchronize_srcu(&srcu1) within an srcu_read_lock(&srcu2). However, we can detect SRCU-based by: (1) Making srcu_read_lock() a 'check'ed recursive read lock and (2) Making synchronize_srcu() a empty write lock critical section. Even better, with the newly introduced lock_sync(), we can avoid false positives about irq-unsafe/safe. This commit therefore makes it so. Note that NMI-safe SRCU read side critical sections are currently not annotated, but might be annotated in the future. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> [ boqun: Add comments for annotation per Waiman's suggestion ] [ boqun: Fix comment warning reported by Stephen Rothwell ] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
| | | | | * | | | | locking/lockdep: Introduce lock_sync()Boqun Feng2023-03-271-0/+5
| | | | |/ / / / / | | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, functions like synchronize_srcu() do not have lockdep annotations resembling those of other write-side locking primitives. Such annotations might look as follows: lock_acquire(); lock_release(); Such annotations would tell lockdep that synchronize_srcu() acts like an empty critical section that waits for other (read-side) critical sections to finish. This would definitely catch some deadlock, but as pointed out by Paul Mckenney [1], this could also introduce false positives because of irq-safe/unsafe detection. Of course, there are tricks could help with this: might_sleep(); // Existing statement in __synchronize_srcu(). if (IS_ENABLED(CONFIG_PROVE_LOCKING)) { local_irq_disable(); lock_acquire(); lock_release(); local_irq_enable(); } But it would be better for lockdep to provide a separate annonation for functions like synchronize_srcu(), so that people won't need to repeat the ugly tricks above. Therefore introduce lock_sync(), which is simply an lock+unlock pair with no irq safe/unsafe deadlock check. This works because the to-be-annontated functions do not create real critical sections, and there is therefore no way that irq can create extra dependencies. [1]: https://lore.kernel.org/lkml/20180412021233.ewncg5jjuzjw3x62@tardis/ Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> [ boqun: Fix typos reported by Davidlohr Bueso and Paul E. Mckenney ] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
| | | | * | | | | srcu: Move work-scheduling fields from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->reschedule_jiffies, ->reschedule_count, and ->work fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. However, this means that the container_of() calls cannot get a pointer to the srcu_struct because they are no longer in the srcu_struct. This issue is addressed by adding a ->srcu_ssp field in the srcu_usage structure that references the corresponding srcu_struct structure. And given the presence of the sup pointer to the srcu_usage structure, replace some ssp->srcu_usage-> instances with sup->. [ paulmck Apply feedback from kernel test robot. ] Link: https://lore.kernel.org/oe-kbuild-all/202303191400.iO5BOqka-lkp@intel.com/ Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Move srcu_barrier() fields from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->srcu_barrier_seq, ->srcu_barrier_mutex, ->srcu_barrier_completion, and ->srcu_barrier_cpu_cnt fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Move ->sda_is_static from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->sda_is_static field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Move heuristics fields from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->srcu_size_jiffies, ->srcu_n_lock_retries, and ->srcu_n_exp_nodelay fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Move grace-period fields from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-12/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->srcu_gp_seq, ->srcu_gp_seq_needed, ->srcu_gp_seq_needed_exp, ->srcu_gp_start, and ->srcu_last_gp_end fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Move ->srcu_gp_mutex from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->srcu_gp_mutex field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Move ->lock from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->lock field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Move ->srcu_cb_mutex from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->srcu_cb_mutex field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Move ->srcu_size_state from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->srcu_size_state field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Move ->level from srcu_struct to srcu_usagePaul E. McKenney2023-04-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit moves the ->level[] array from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Begin offloading srcu_struct fields to srcu_updatePaul E. McKenney2023-04-043-13/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current srcu_struct structure is on the order of 200 bytes in size (depending on architecture and .config), which is much better than the old-style 26K bytes, but still all too inconvenient when one is trying to achieve good cache locality on a fastpath involving SRCU readers. However, only a few fields in srcu_struct are used by SRCU readers. The remaining fields could be offloaded to a new srcu_update structure, thus shrinking the srcu_struct structure down to a few tens of bytes. This commit begins this noble quest, a quest that is complicated by open-coded initialization of the srcu_struct within the srcu_notifier_head structure. This complication is addressed by updating the srcu_notifier_head structure's open coding, given that there does not appear to be a straightforward way of abstracting that initialization. This commit moves only the ->node pointer to srcu_update. Later commits will move additional fields. [ paulmck: Fold in qiang1.zhang@intel.com's memory-leak fix. ] Link: https://lore.kernel.org/all/20230320055751.4120251-1-qiang1.zhang@intel.com/ Suggested-by: Christoph Hellwig <hch@lst.de> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
| | | | * | | | | srcu: Use static init for statically allocated in-module srcu_structPaul E. McKenney2023-04-041-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Further shrinking the srcu_struct structure is eased by requiring that in-module srcu_struct structures rely more heavily on static initialization. In particular, this preserves the property that a module-load-time srcu_struct initialization can fail only due to memory-allocation failure of the per-CPU srcu_data structures. It might also slightly improve robustness by keeping the number of memory allocations that must succeed down percpu_alloc() call. This is in preparation for splitting an srcu_usage structure out of the srcu_struct structure. [ paulmck: Fold in qiang1.zhang@intel.com feedback. ] Cc: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>