summaryrefslogtreecommitdiffstats
path: root/fs/xfs
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'xfs-5.20-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2022-08-135-46/+193
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more xfs updates from Darrick Wong: "There's not a lot this time around, just the usual bug fixes and corrections for missing error returns. - Return error codes from block device flushes to userspace - Fix a deadlock between reclaim and mount time quotacheck - Fix an unnecessary ENOSPC return when doing COW on a filesystem with severe free space fragmentation - Fix a miscalculation in the transaction reservation computations for file removal operations" * tag 'xfs-5.20-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix inode reservation space for removing transaction xfs: Fix false ENOSPC when performing direct write on a delalloc extent in cow fork xfs: fix intermittent hang during quotacheck xfs: check return codes when flushing block devices
| * xfs: fix inode reservation space for removing transactionhexiaole2022-08-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In 'fs/xfs/libxfs/xfs_trans_resv.c', the comment for transaction of removing a directory entry writes: /* fs/xfs/libxfs/xfs_trans_resv.c begin */ /* * For removing a directory entry we can modify: * the parent directory inode: inode size * the removed inode: inode size ... xfs_calc_remove_reservation( struct xfs_mount *mp) { return XFS_DQUOT_LOGRES(mp) + xfs_calc_iunlink_add_reservation(mp) + max((xfs_calc_inode_res(mp, 1) + ... /* fs/xfs/libxfs/xfs_trans_resv.c end */ There has 2 inode size of space to be reserverd, but the actual code for inode reservation space writes. There only count for 1 inode size to be reserved in 'xfs_calc_inode_res(mp, 1)', rather than 2. Signed-off-by: hexiaole <hexiaole@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: remove redundant code citations] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * xfs: Fix false ENOSPC when performing direct write on a delalloc extent in ↵Chandan Babu R2022-08-051-35/+163
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cow fork On a higly fragmented filesystem a Direct IO write can fail with -ENOSPC error even though the filesystem has sufficient number of free blocks. This occurs if the file offset range on which the write operation is being performed has a delalloc extent in the cow fork and this delalloc extent begins much before the Direct IO range. In such a scenario, xfs_reflink_allocate_cow() invokes xfs_bmapi_write() to allocate the blocks mapped by the delalloc extent. The extent thus allocated may not cover the beginning of file offset range on which the Direct IO write was issued. Hence xfs_reflink_allocate_cow() ends up returning -ENOSPC. The following script reliably recreates the bug described above. #!/usr/bin/bash device=/dev/loop0 shortdev=$(basename $device) mntpnt=/mnt/ file1=${mntpnt}/file1 file2=${mntpnt}/file2 fragmentedfile=${mntpnt}/fragmentedfile punchprog=/root/repos/xfstests-dev/src/punch-alternating errortag=/sys/fs/xfs/${shortdev}/errortag/bmap_alloc_minlen_extent umount $device > /dev/null 2>&1 echo "Create FS" mkfs.xfs -f -m reflink=1 $device > /dev/null 2>&1 if [[ $? != 0 ]]; then echo "mkfs failed." exit 1 fi echo "Mount FS" mount $device $mntpnt > /dev/null 2>&1 if [[ $? != 0 ]]; then echo "mount failed." exit 1 fi echo "Create source file" xfs_io -f -c "pwrite 0 32M" $file1 > /dev/null 2>&1 sync echo "Create Reflinked file" xfs_io -f -c "reflink $file1" $file2 &>/dev/null echo "Set cowextsize" xfs_io -c "cowextsize 16M" $file1 > /dev/null 2>&1 echo "Fragment FS" xfs_io -f -c "pwrite 0 64M" $fragmentedfile > /dev/null 2>&1 sync $punchprog $fragmentedfile echo "Allocate block sized extent from now onwards" echo -n 1 > $errortag echo "Create 16MiB delalloc extent in CoW fork" xfs_io -c "pwrite 0 4k" $file1 > /dev/null 2>&1 sync echo "Direct I/O write at offset 12k" xfs_io -d -c "pwrite 12k 8k" $file1 This commit fixes the bug by invoking xfs_bmapi_write() in a loop until disk blocks are allocated for atleast the starting file offset of the Direct IO write range. Fixes: 3c68d44a2b49 ("xfs: allocate direct I/O COW blocks in iomap_begin") Reported-and-Root-caused-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: slight editing to make the locking less grody, and fix some style things] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * xfs: fix intermittent hang during quotacheckDarrick J. Wong2022-08-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every now and then, I see the following hang during mount time quotacheck when running fstests. Turning on KASAN seems to make it happen somewhat more frequently. I've edited the backtrace for brevity. XFS (sdd): Quotacheck needed: Please wait. XFS: Assertion failed: bp->b_flags & _XBF_DELWRI_Q, file: fs/xfs/xfs_buf.c, line: 2411 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1831409 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs] CPU: 0 PID: 1831409 Comm: mount Tainted: G W 5.19.0-rc6-xfsx #rc6 09911566947b9f737b036b4af85e399e4b9aef64 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:assfail+0x46/0x4a [xfs] Code: a0 8f 41 a0 e8 45 fe ff ff 8a 1d 2c 36 10 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 c0 f0 4f a0 e8 10 f0 02 e1 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24 RSP: 0018:ffffc900078c7b30 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8880099ac000 RCX: 000000007fffffff RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa0418fa0 RBP: ffff8880197bc1c0 R08: 0000000000000000 R09: 000000000000000a R10: 000000000000000a R11: f000000000000000 R12: ffffc900078c7d20 R13: 00000000fffffff5 R14: ffffc900078c7d20 R15: 0000000000000000 FS: 00007f0449903800(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005610ada631f0 CR3: 0000000014dd8002 CR4: 00000000001706f0 Call Trace: <TASK> xfs_buf_delwri_pushbuf+0x150/0x160 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368] xfs_qm_flush_one+0xd6/0x130 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368] xfs_qm_dquot_walk.isra.0+0x109/0x1e0 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368] xfs_qm_quotacheck+0x319/0x490 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368] xfs_qm_mount_quotas+0x65/0x2c0 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368] xfs_mountfs+0x6b5/0xab0 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368] xfs_fs_fill_super+0x781/0x990 [xfs 4561f5b32c9bfb874ec98d58d0719464e1f87368] get_tree_bdev+0x175/0x280 vfs_get_tree+0x1a/0x80 path_mount+0x6f5/0xaa0 __x64_sys_mount+0x103/0x140 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 I /think/ this can happen if xfs_qm_flush_one is racing with xfs_qm_dquot_isolate (i.e. dquot reclaim) when the second function has taken the dquot flush lock but xfs_qm_dqflush hasn't yet locked the dquot buffer, let alone queued it to the delwri list. In this case, flush_one will fail to get the dquot flush lock, but it can lock the incore buffer, but xfs_buf_delwri_pushbuf will then trip over this ASSERT, which checks that the buffer isn't on a delwri list. The hang results because the _delwri_submit_buffers ignores non DELWRI_Q buffers, which means that xfs_buf_iowait waits forever for an IO that has not yet been scheduled. AFAICT, a reasonable solution here is to detect a dquot buffer that is not on a DELWRI list, drop it, and return -EAGAIN to try the flush again. It's not /that/ big of a deal if quotacheck writes the dquot buffer repeatedly before we even set QUOTA_CHKD. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: check return codes when flushing block devicesDarrick J. Wong2022-08-052-10/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a blkdev_issue_flush fails, fsync needs to report that to upper levels. Modify xfs_file_fsync to capture the errors, while trying to flush as much data and log updates to disk as possible. If log writes cannot flush the data device, we need to shut down the log immediately because we've violated a log invariant. Modify this code to check the return value of blkdev_issue_flush as well. This behavior seems to go back to about 2.6.15 or so, which makes this fixes tag a bit misleading. Link: https://elixir.bootlin.com/linux/v2.6.15/source/fs/xfs/xfs_vnodeops.c#L1187 Fixes: b5071ada510a ("xfs: remove xfs_blkdev_issue_flush") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* | Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds2022-08-0515-22/+386
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
| * | xfs: fail dax mount if reflink is enabled on a partitionShiyang Ruan2022-07-291-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Failure notification is not supported on partitions. So, when we mount a reflink enabled xfs on a partition with dax option, let it fail with -EINVAL code. Link: https://lkml.kernel.org/r/20220609143435.393724-1-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | xfs: add dax dedupe supportShiyang Ruan2022-07-174-7/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files who are going to be deduped. After that, call compare range function only when files are both DAX or not. Link: https://lkml.kernel.org/r/20220603053738.1218681-15-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | xfs: support CoW in fsdax modeShiyang Ruan2022-07-173-6/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In fsdax mode, WRITE and ZERO on a shared extent need CoW performed. After that, new allocated extents needs to be remapped to the file. So, add a CoW identification in ->iomap_begin(), and implement ->iomap_end() to do the remapping work. [akpm@linux-foundation.org: make xfs_dax_fault() static] Link: https://lkml.kernel.org/r/20220603053738.1218681-14-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | fsdax: dedup file range to use a compare functionShiyang Ruan2022-07-171-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With dax we cannot deal with readpage() etc. So, we create a dax comparison function which is similar with vfs_dedupe_file_range_compare(). And introduce dax_remap_file_range_prep() for filesystem use. Link: https://lkml.kernel.org/r/20220603053738.1218681-13-ruansy.fnst@fujitsu.com Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | xfs: implement ->notify_failure() for XFSShiyang Ruan2022-07-176-3/+238
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce xfs_notify_failure.c to handle failure related works, such as implement ->notify_failure(), register/unregister dax holder in xfs, and so on. If the rmap feature of XFS enabled, we can query it to find files and metadata which are associated with the corrupt data. For now all we do is kill processes with that file mapped into their address spaces, but future patches could actually do something about corrupt metadata. After that, the memory failure needs to notify the processes who are using those files. Link: https://lkml.kernel.org/r/20220603053738.1218681-7-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | dax: introduce holder for dax_deviceShiyang Ruan2022-07-171-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2. The patchset fsdax-rmap is aimed to support shared pages tracking for fsdax. It moves owner tracking from dax_assocaite_entry() to pmem device driver, by introducing an interface ->memory_failure() for struct pagemap. This interface is called by memory_failure() in mm, and implemented by pmem device. Then call holder operations to find the filesystem which the corrupted data located in, and call filesystem handler to track files or metadata associated with this page. Finally we are able to try to fix the corrupted data in filesystem and do other necessary processing, such as killing processes who are using the files affected. The call trace is like this: memory_failure() |* fsdax case |------------ |pgmap->ops->memory_failure() => pmem_pgmap_memory_failure() | dax_holder_notify_failure() => | dax_device->holder_ops->notify_failure() => | - xfs_dax_notify_failure() | |* xfs_dax_notify_failure() | |-------------------------- | | xfs_rmap_query_range() | | xfs_dax_failure_fn() | | * corrupted on metadata | | try to recover data, call xfs_force_shutdown() | | * corrupted on file data | | try to recover data, call mf_dax_kill_procs() |* normal case |------------- |mf_generic_kill_procs() The patchset fsdax-reflink attempts to add CoW support for fsdax, and takes XFS, which has both reflink and fsdax features, as an example. One of the key mechanisms needed to be implemented in fsdax is CoW. Copy the data from srcmap before we actually write data to the destination iomap. And we just copy range in which data won't be changed. Another mechanism is range comparison. In page cache case, readpage() is used to load data on disk to page cache in order to be able to compare data. In fsdax case, readpage() does not work. So, we need another compare data with direct access support. With the two mechanisms implemented in fsdax, we are able to make reflink and fsdax work together in XFS. This patch (of 14): To easily track filesystem from a pmem device, we introduce a holder for dax_device structure, and also its operation. This holder is used to remember who is using this dax_device: - When it is the backend of a filesystem, the holder will be the instance of this filesystem. - When this pmem device is one of the targets in a mapped device, the holder will be this mapped device. In this case, the mapped device has its own dax_device and it will follow the first rule. So that we can finally track to the filesystem we needed. The holder and holder_ops will be set when filesystem is being mounted, or an target device is being activated. Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.wiliams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: shrinkers: provide shrinkers with namesRoman Gushchin2022-07-033-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | Merge tag 'xfs-5.20-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2022-08-0485-1657/+1984
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs updates from Darrick Wong: "The biggest changes for this release are the log scalability improvements, lockless lookups for the buffer cache, and making the attr fork a permanent part of the incore inode in preparation for directory parent pointers. There's also a bunch of bug fixes that have accumulated since -rc5. I might send you a second pull request with some more bug fixes that I'm still working on. Once the merge window ends, I will hand maintainership back to Dave Chinner until the 6.1-rc1 release so that I can conduct the design review for the online fsck feature, and try to get it merged. Summary: - Improve scalability of the XFS log by removing spinlocks and global synchronization points. - Add security labels to whiteout inodes to match the other filesystems. - Clean up per-ag pointer passing to simplify call sites. - Reduce verifier overhead by precalculating more AG geometry. - Implement fast-path lockless lookups in the buffer cache to reduce spinlock hammering. - Make attr forks a permanent part of the inode structure to fix a UAF bug and because most files these days tend to have security labels and soon will have parent pointers too. - Clean up XFS_IFORK_Q usage and give it a better name. - Fix more UAF bugs in the xattr code. - SOB my tags. - Fix some typos in the timestamp range documentation. - Fix a few more memory leaks. - Code cleanups and typo fixes. - Fix an unlocked inode fork pointer access in getbmap" * tag 'xfs-5.20-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (61 commits) xfs: delete extra space and tab in blank line xfs: fix NULL pointer dereference in xfs_getbmap() xfs: Fix typo 'the the' in comment xfs: Fix comment typo xfs: don't leak memory when attr fork loading fails xfs: fix for variable set but not used warning xfs: xfs_buf cache destroy isn't RCU safe xfs: delete unnecessary NULL checks xfs: fix comment for start time value of inode with bigtime enabled xfs: fix use-after-free in xattr node block inactivation xfs: lockless buffer lookup xfs: remove a superflous hash lookup when inserting new buffers xfs: reduce the number of atomic when locking a buffer after lookup xfs: merge xfs_buf_find() and xfs_buf_get_map() xfs: break up xfs_buf_find() into individual pieces xfs: add in-memory iunlink log item xfs: add log item precommit operation xfs: combine iunlink inode update functions xfs: clean up xfs_iunlink_update_inode() xfs: double link the unlinked inode list ...
| * | xfs: delete extra space and tab in blank lineXie Shaowen2022-07-312-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | delete extra space and tab in blank line, there is no functional change. Reported-by: Hacash Robot <hacashRobot@santino.com> Signed-off-by: Xie Shaowen <studentxswpy@163.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | xfs: fix NULL pointer dereference in xfs_getbmap()ChenXiaoSong2022-07-311-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reproducer: 1. fallocate -l 100M image 2. mkfs.xfs -f image 3. mount image /mnt 4. setxattr("/mnt", "trusted.overlay.upper", NULL, 0, XATTR_CREATE) 5. char arg[32] = "\x01\xff\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00" "\x00\x00\x00\x00\x00\x08\x00\x00\x00\xc6\x2a\xf7"; fd = open("/mnt", O_RDONLY|O_DIRECTORY); ioctl(fd, _IOC(_IOC_READ|_IOC_WRITE, 0x58, 0x2c, 0x20), arg); NULL pointer dereference will occur when race happens between xfs_getbmap() and xfs_bmap_set_attrforkoff(): ioctl | setxattr ----------------------------|--------------------------- xfs_getbmap | xfs_ifork_ptr | xfs_inode_has_attr_fork | ip->i_forkoff == 0 | return NULL | ifp == NULL | | xfs_bmap_set_attrforkoff | ip->i_forkoff > 0 xfs_inode_has_attr_fork | ip->i_forkoff > 0 | ifp == NULL | ifp->if_format | Fix this by locking i_lock before xfs_ifork_ptr(). Fixes: abbf9e8a4507 ("xfs: rewrite getbmap using the xfs_iext_* helpers") Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Guo Xuenan <guoxuenan@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: added fixes tag] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | xfs: Fix typo 'the the' in commentSlark Xiao2022-07-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Replace 'the the' with 'the' in the comment. Signed-off-by: Slark Xiao <slark_xiao@163.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | xfs: Fix comment typoXin Gao2022-07-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The double `the' is duplicated in line 552, remove one. Signed-off-by: Xin Gao <gaoxin@cdjrlc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | xfs: don't leak memory when attr fork loading failsDarrick J. Wong2022-07-204-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I observed the following evidence of a memory leak while running xfs/399 from the xfs fsck test suite (edited for brevity): XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs] CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G W 5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:assfail+0x46/0x4a [xfs] Call Trace: <TASK> xfs_ifork_zap_attr+0x7c/0xb0 xfs_iformat_attr_fork+0x86/0x110 xfs_inode_from_disk+0x41d/0x480 xfs_iget+0x389/0xd70 xfs_bulkstat_one_int+0x5b/0x540 xfs_bulkstat_iwalk+0x1e/0x30 xfs_iwalk_ag_recs+0xd1/0x160 xfs_iwalk_run_callbacks+0xb9/0x180 xfs_iwalk_ag+0x1d8/0x2e0 xfs_iwalk+0x141/0x220 xfs_bulkstat+0x105/0x180 xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130 xfs_file_ioctl+0xa5f/0xef0 __x64_sys_ioctl+0x82/0xa0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This newly-added assertion checks that there aren't any incore data structures hanging off the incore fork when we're trying to reset its contents. From the call trace, it is evident that iget was trying to construct an incore inode from the ondisk inode, but the attr fork verifier failed and we were trying to undo all the memory allocations that we had done earlier. The three assertions in xfs_ifork_zap_attr check that the caller has already called xfs_idestroy_fork, which clearly has not been done here. As the zap function then zeroes the pointers, we've effectively leaked the memory. The shortest change would have been to insert an extra call to xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork call into _zap_attr, since all other callsites call _idestroy_fork immediately prior to calling _zap_attr. IOWs, it eliminates one way to fail. Note: This change only applies cleanly to 2ed5b09b3e8f, since we just reworked the attr fork lifetime. However, I think this memory leak has existed since 0f45a1b20cd8, since the chain xfs_iformat_attr_fork -> xfs_iformat_local -> xfs_init_local_fork will allocate ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails, xfs_iformat_attr_fork will free i_afp without freeing any of the stuff hanging off i_afp. The solution for older kernels I think is to add the missing call to xfs_idestroy_fork just prior to calling kmem_cache_free. Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399. Fixes: 2ed5b09b3e8f ("xfs: make inode attribute forks a permanent part of struct xfs_inode") Probably-Fixes: 0f45a1b20cd8 ("xfs: improve local fork verification") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | xfs: fix for variable set but not used warningsunliming2022-07-201-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix below kernel warning: fs/xfs/scrub/repair.c:539:19: warning: variable 'agno' set but not used [-Wunused-but-set-variable] Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: sunliming <sunliming@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | xfs: xfs_buf cache destroy isn't RCU safeDave Chinner2022-07-203-37/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Darrick and Sachin Sant reported that xfs/435 and xfs/436 would report an non-empty xfs_buf slab on module remove. This isn't easily to reproduce, but is clearly a side effect of converting the buffer caceh to RUC freeing and lockless lookups. Sachin bisected and Darrick hit it when testing the patchset directly. Turns out that the xfs_buf slab is not destroyed when all the other XFS slab caches are destroyed. Instead, it's got it's own little wrapper function that gets called separately, and so it doesn't have an rcu_barrier() call in it that is needed to drain all the rcu callbacks before the slab is destroyed. Fix it by removing the xfs_buf_init/terminate wrappers that just allocate and destroy the xfs_buf slab, and move them to the same place that all the other slab caches are set up and destroyed. Reported-and-tested-by: Sachin Sant <sachinp@linux.ibm.com> Fixes: 298f34224506 ("xfs: lockless buffer lookup") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | xfs: delete unnecessary NULL checksDan Carpenter2022-07-201-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | These NULL check are no long needed after commit 2ed5b09b3e8f ("xfs: make inode attribute forks a permanent part of struct xfs_inode"). Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | xfs: fix comment for start time value of inode with bigtime enabledXiaole He2022-07-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'ctime', 'mtime', and 'atime' for inode is the type of 'xfs_timestamp_t', which is a 64-bit type: /* fs/xfs/libxfs/xfs_format.h begin */ typedef __be64 xfs_timestamp_t; /* fs/xfs/libxfs/xfs_format.h end */ When the 'bigtime' feature is disabled, this 64-bit type is splitted into two parts of 32-bit, one part is encoded for seconds since 1970-01-01 00:00:00 UTC, the other part is encoded for nanoseconds above the seconds, this two parts are the type of 'xfs_legacy_timestamp' and the min and max time value of this type are defined as macros 'XFS_LEGACY_TIME_MIN' and 'XFS_LEGACY_TIME_MAX': /* fs/xfs/libxfs/xfs_format.h begin */ struct xfs_legacy_timestamp { __be32 t_sec; /* timestamp seconds */ __be32 t_nsec; /* timestamp nanoseconds */ }; #define XFS_LEGACY_TIME_MIN ((int64_t)S32_MIN) #define XFS_LEGACY_TIME_MAX ((int64_t)S32_MAX) /* fs/xfs/libxfs/xfs_format.h end */ /* include/linux/limits.h begin */ #define U32_MAX ((u32)~0U) #define S32_MAX ((s32)(U32_MAX >> 1)) #define S32_MIN ((s32)(-S32_MAX - 1)) /* include/linux/limits.h end */ 'XFS_LEGACY_TIME_MIN' is the min time value of the 'xfs_legacy_timestamp', that is -(2^31) seconds relative to the 1970-01-01 00:00:00 UTC, it can be converted to human-friendly time value by 'date' command: /* command begin */ [root@~]# date --utc -d '@0' +'%Y-%m-%d %H:%M:%S' 1970-01-01 00:00:00 [root@~]# date --utc -d "@`echo '-(2^31)'|bc`" +'%Y-%m-%d %H:%M:%S' 1901-12-13 20:45:52 [root@~]# /* command end */ When 'bigtime' feature is enabled, this 64-bit type becomes a 64-bit nanoseconds counter, with the start time value is the min time value of 'xfs_legacy_timestamp'(start time means the value of 64-bit nanoseconds counter is 0). We have already caculated the min time value of 'xfs_legacy_timestamp', that is 1901-12-13 20:45:52 UTC, but the comment for the start time value of inode with 'bigtime' feature enabled writes the value is 1901-12-31 20:45:52 UTC: /* fs/xfs/libxfs/xfs_format.h begin */ /* * XFS Timestamps * ============== * When the bigtime feature is enabled, ondisk inode timestamps become an * unsigned 64-bit nanoseconds counter. This means that the bigtime inode * timestamp epoch is the start of the classic timestamp range, which is * Dec 31 20:45:52 UTC 1901. ... ... */ /* fs/xfs/libxfs/xfs_format.h end */ That is a typo, and this patch corrects the typo, from 'Dec 31' to 'Dec 13'. Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Xiaole He <hexiaole@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | xfs: fix use-after-free in xattr node block inactivationDarrick J. Wong2022-07-141-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel build robot reported a UAF error while running xfs/433 (edited somewhat for brevity): BUG: KASAN: use-after-free in xfs_attr3_node_inactive (fs/xfs/xfs_attr_inactive.c:214) xfs Read of size 4 at addr ffff88820ac2bd44 by task kworker/0:2/139 CPU: 0 PID: 139 Comm: kworker/0:2 Tainted: G S 5.19.0-rc2-00004-g7cf2b0f9611b #1 Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 02/05/2013 Workqueue: xfs-inodegc/sdb4 xfs_inodegc_worker [xfs] Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1)) print_address_description+0x1f/0x200 print_report.cold (mm/kasan/report.c:430) kasan_report (mm/kasan/report.c:162 mm/kasan/report.c:493) xfs_attr3_node_inactive (fs/xfs/xfs_attr_inactive.c:214) xfs xfs_attr3_root_inactive (fs/xfs/xfs_attr_inactive.c:296) xfs xfs_attr_inactive (fs/xfs/xfs_attr_inactive.c:371) xfs xfs_inactive (fs/xfs/xfs_inode.c:1781) xfs xfs_inodegc_worker (fs/xfs/xfs_icache.c:1837 fs/xfs/xfs_icache.c:1860) xfs process_one_work worker_thread kthread ret_from_fork </TASK> Allocated by task 139: kasan_save_stack (mm/kasan/common.c:39) __kasan_slab_alloc (mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:469) kmem_cache_alloc (mm/slab.h:750 mm/slub.c:3214 mm/slub.c:3222 mm/slub.c:3229 mm/slub.c:3239) _xfs_buf_alloc (include/linux/instrumented.h:86 include/linux/atomic/atomic-instrumented.h:41 fs/xfs/xfs_buf.c:232) xfs xfs_buf_get_map (fs/xfs/xfs_buf.c:660) xfs xfs_buf_read_map (fs/xfs/xfs_buf.c:777) xfs xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:289) xfs xfs_da_read_buf (fs/xfs/libxfs/xfs_da_btree.c:2652) xfs xfs_da3_node_read (fs/xfs/libxfs/xfs_da_btree.c:392) xfs xfs_attr3_root_inactive (fs/xfs/xfs_attr_inactive.c:272) xfs xfs_attr_inactive (fs/xfs/xfs_attr_inactive.c:371) xfs xfs_inactive (fs/xfs/xfs_inode.c:1781) xfs xfs_inodegc_worker (fs/xfs/xfs_icache.c:1837 fs/xfs/xfs_icache.c:1860) xfs process_one_work worker_thread kthread ret_from_fork Freed by task 139: kasan_save_stack (mm/kasan/common.c:39) kasan_set_track (mm/kasan/common.c:45) kasan_set_free_info (mm/kasan/generic.c:372) __kasan_slab_free (mm/kasan/common.c:368 mm/kasan/common.c:328 mm/kasan/common.c:374) kmem_cache_free (mm/slub.c:1753 mm/slub.c:3507 mm/slub.c:3524) xfs_buf_rele (fs/xfs/xfs_buf.c:1040) xfs xfs_attr3_node_inactive (fs/xfs/xfs_attr_inactive.c:210) xfs xfs_attr3_root_inactive (fs/xfs/xfs_attr_inactive.c:296) xfs xfs_attr_inactive (fs/xfs/xfs_attr_inactive.c:371) xfs xfs_inactive (fs/xfs/xfs_inode.c:1781) xfs xfs_inodegc_worker (fs/xfs/xfs_icache.c:1837 fs/xfs/xfs_icache.c:1860) xfs process_one_work worker_thread kthread ret_from_fork I reproduced this for my own satisfaction, and got the same report, along with an extra morsel: The buggy address belongs to the object at ffff88802103a800 which belongs to the cache xfs_buf of size 432 The buggy address is located 396 bytes inside of 432-byte region [ffff88802103a800, ffff88802103a9b0) I tracked this code down to: error = xfs_trans_get_buf(*trans, mp->m_ddev_targp, child_blkno, XFS_FSB_TO_BB(mp, mp->m_attr_geo->fsbcount), 0, &child_bp); if (error) return error; error = bp->b_error; That doesn't look right -- I think this should be dereferencing child_bp, not bp. Looking through the codebase history, I think this was added by commit 2911edb653b9 ("xfs: remove the mappedbno argument to xfs_da_get_buf"), which replaced a call to xfs_da_get_buf with the current call to xfs_trans_get_buf. Not sure why we trans_brelse'd @bp earlier in the function, but I'm guessing it's to avoid pinning too many buffers in memory while we inactivate the bottom of the attr tree. Hence we now have to get the buffer back. I /think/ this was supposed to check child_bp->b_error and fail the rest of the invalidation if child_bp had experienced any kind of IO or corruption error. I bet the xfs_da3_node_read earlier in the loop will catch most cases of incoming on-disk corruption which makes this check mostly moot unless someone corrupts the buffer and the AIL pushes it out to disk while the buffer's unlocked. In the first case we'll never get to the bad check, and in the second case the AIL will shut down the log, at which point there's no reason to check b_error. Remove the check, and null out @bp to avoid this problem in the future. Cc: hch@lst.de Reported-by: kernel test robot <oliver.sang@intel.com> Fixes: 2911edb653b9 ("xfs: remove the mappedbno argument to xfs_da_get_buf") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | Merge tag 'make-attr-fork-permanent-5.20_2022-07-14' of ↵Darrick J. Wong2022-07-1435-247/+285
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.20-mergeB xfs: make attr forks permanent This series fixes a use-after-free bug that syzbot uncovered. The UAF itself is a result of a race condition between getxattr and removexattr because callers to getxattr do not necessarily take any sort of locks before calling into the filesystem. Although the race condition itself can be fixed through clever use of a memory barrier, further consideration of the use cases of extended attributes shows that most files always have at least one attribute, so we might as well make them permanent. v2: Minor tweaks suggested by Dave, and convert some more macros to helper functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> * tag 'make-attr-fork-permanent-5.20_2022-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: replace inode fork size macros with functions xfs: replace XFS_IFORK_Q with a proper predicate function xfs: use XFS_IFORK_Q to determine the presence of an xattr fork xfs: make inode attribute forks a permanent part of struct xfs_inode xfs: convert XFS_IFORK_PTR to a static inline helper
| | * | xfs: replace inode fork size macros with functionsDarrick J. Wong2022-07-1215-39/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the shouty macros here with typechecked helper functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * | xfs: replace XFS_IFORK_Q with a proper predicate functionDarrick J. Wong2022-07-1213-27/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace this shouty macro with a real C function that has a more descriptive name. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * | xfs: use XFS_IFORK_Q to determine the presence of an xattr forkDarrick J. Wong2022-07-0911-29/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the attribute fork but i_forkoff is zero. This eliminates the ambiguity between i_forkoff and i_af.if_present, which should make it easier to understand the lifetime of attr forks. While we're at it, remove the if_present checks around calls to xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr forks that have already been torn down. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * | xfs: make inode attribute forks a permanent part of struct xfs_inodeDarrick J. Wong2022-07-0916-104/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzkaller reported a UAF bug a while back: ================================================================== BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958 CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted 5.15.0-0.30.3-20220406_1406 #3 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106 print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459 xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159 xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36 __vfs_getxattr+0xdf/0x13d fs/xattr.c:399 cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300 security_inode_need_killpriv+0x4c/0x97 security/security.c:1408 dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912 dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908 do_truncate+0xc3/0x1e0 fs/open.c:56 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 RIP: 0033:0x7f7ef4bb753d Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0 RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0 </TASK> Allocated by task 2953: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467 kasan_slab_alloc include/linux/kasan.h:254 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slub.c:3213 [inline] slab_alloc mm/slub.c:3221 [inline] kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226 kmem_cache_zalloc include/linux/slab.h:711 [inline] xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287 xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098 xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_setxattr+0x11b/0x177 fs/xattr.c:180 __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214 __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275 vfs_setxattr+0x154/0x33d fs/xattr.c:301 setxattr+0x216/0x29f fs/xattr.c:575 __do_sys_fsetxattr fs/xattr.c:632 [inline] __se_sys_fsetxattr fs/xattr.c:621 [inline] __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 Freed by task 2949: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track+0x1c/0x21 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:230 [inline] slab_free_hook mm/slub.c:1700 [inline] slab_free_freelist_hook mm/slub.c:1726 [inline] slab_free mm/slub.c:3492 [inline] kmem_cache_free+0xdc/0x3ce mm/slub.c:3508 xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773 xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822 xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413 xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684 xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_removexattr+0x106/0x16a fs/xattr.c:468 cap_inode_killpriv+0x24/0x47 security/commoncap.c:324 security_inode_killpriv+0x54/0xa1 security/security.c:1414 setattr_prepare+0x1a6/0x897 fs/attr.c:146 xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682 xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065 xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093 notify_change+0xae5/0x10a1 fs/attr.c:410 do_truncate+0x134/0x1e0 fs/open.c:64 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 The buggy address belongs to the object at ffff88802cec9188 which belongs to the cache xfs_ifork of size 40 The buggy address is located 20 bytes inside of 40-byte region [ffff88802cec9188, ffff88802cec91b0) The buggy address belongs to the page: page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2cec9 flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80 raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb ^ ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb ================================================================== The root cause of this bug is the unlocked access to xfs_inode.i_afp from the getxattr code paths while trying to determine which ILOCK mode to use to stabilize the xattr data. Unfortunately, the VFS does not acquire i_rwsem when vfs_getxattr (or listxattr) call into the filesystem, which means that getxattr can race with a removexattr that's tearing down the attr fork and crash: xfs_attr_set: xfs_attr_get: xfs_attr_fork_remove: xfs_ilock_attr_map_shared: xfs_idestroy_fork(ip->i_afp); kmem_cache_free(xfs_ifork_cache, ip->i_afp); if (ip->i_afp && ip->i_afp = NULL; xfs_need_iread_extents(ip->i_afp)) <KABOOM> ip->i_forkoff = 0; Regrettably, the VFS is much more lax about i_rwsem and getxattr than is immediately obvious -- not only does it not guarantee that we hold i_rwsem, it actually doesn't guarantee that we *don't* hold it either. The getxattr system call won't acquire the lock before calling XFS, but the file capabilities code calls getxattr with and without i_rwsem held to determine if the "security.capabilities" xattr is set on the file. Fixing the VFS locking requires a treewide investigation into every code path that could touch an xattr and what i_rwsem state it expects or sets up. That could take years or even prove impossible; fortunately, we can fix this UAF problem inside XFS. An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to ensure that i_forkoff is always zeroed before i_afp is set to null and changed the read paths to use smp_rmb before accessing i_forkoff and i_afp, which avoided these UAF problems. However, the patch author was too busy dealing with other problems in the meantime, and by the time he came back to this issue, the situation had changed a bit. On a modern system with selinux, each inode will always have at least one xattr for the selinux label, so it doesn't make much sense to keep incurring the extra pointer dereference. Furthermore, Allison's upcoming parent pointer patchset will also cause nearly every inode in the filesystem to have extended attributes. Therefore, make the inode attribute fork structure part of struct xfs_inode, at a cost of 40 more bytes. This patch adds a clunky if_present field where necessary to maintain the existing logic of xattr fork null pointer testing in the existing codebase. The next patch switches the logic over to XFS_IFORK_Q and it all goes away. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * | xfs: convert XFS_IFORK_PTR to a static inline helperDarrick J. Wong2022-07-0923-83/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're about to make this logic do a bit more, so convert the macro to a static inline function for better typechecking and fewer shouty macros. No functional changes here. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | | Merge tag 'xfs-buf-lockless-lookup-5.20' of ↵Darrick J. Wong2022-07-145-135/+188
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB xfs: lockless buffer cache lookups Current work to merge the XFS inode life cycle with the VFS inode life cycle is finding some interesting issues. If we have a path that hits buffer trylocks fairly hard (e.g. a non-blocking background inode freeing function), we end up hitting massive contention on the buffer cache hash locks: - 92.71% 0.05% [kernel] [k] xfs_inodegc_worker - 92.67% xfs_inodegc_worker - 92.13% xfs_inode_unlink - 91.52% xfs_inactive_ifree - 85.63% xfs_read_agi - 85.61% xfs_trans_read_buf_map - 85.59% xfs_buf_read_map - xfs_buf_get_map - 85.55% xfs_buf_find - 72.87% _raw_spin_lock - do_raw_spin_lock 71.86% __pv_queued_spin_lock_slowpath - 8.74% xfs_buf_rele - 7.88% _raw_spin_lock - 7.88% do_raw_spin_lock 7.63% __pv_queued_spin_lock_slowpath - 1.70% xfs_buf_trylock - 1.68% down_trylock - 1.41% _raw_spin_lock_irqsave - 1.39% do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.76% _raw_spin_unlock 0.75% do_raw_spin_unlock This is basically hammering the pag->pag_buf_lock from lots of CPUs doing trylocks at the same time. Most of the buffer trylock operations ultimately fail after we've done the lookup, so we're really hammering the buf hash lock whilst making no progress. We can also see significant spinlock traffic on the same lock just under normal operation when lots of tasks are accessing metadata from the same AG, so let's avoid all this by creating a lookup fast path which leverages the rhashtable's ability to do RCU protected lookups. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> * tag 'xfs-buf-lockless-lookup-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: lockless buffer lookup xfs: remove a superflous hash lookup when inserting new buffers xfs: reduce the number of atomic when locking a buffer after lookup xfs: merge xfs_buf_find() and xfs_buf_get_map() xfs: break up xfs_buf_find() into individual pieces xfs: rework xfs_buf_incore() API
| | * | | xfs: lockless buffer lookupDave Chinner2022-07-142-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we have a standalone fast path for buffer lookup, we can easily convert it to use rcu lookups. When we continually hammer the buffer cache with trylock lookups, we end up with a huge amount of lock contention on the per-ag buffer hash locks: - 92.71% 0.05% [kernel] [k] xfs_inodegc_worker - 92.67% xfs_inodegc_worker - 92.13% xfs_inode_unlink - 91.52% xfs_inactive_ifree - 85.63% xfs_read_agi - 85.61% xfs_trans_read_buf_map - 85.59% xfs_buf_read_map - xfs_buf_get_map - 85.55% xfs_buf_find - 72.87% _raw_spin_lock - do_raw_spin_lock 71.86% __pv_queued_spin_lock_slowpath - 8.74% xfs_buf_rele - 7.88% _raw_spin_lock - 7.88% do_raw_spin_lock 7.63% __pv_queued_spin_lock_slowpath - 1.70% xfs_buf_trylock - 1.68% down_trylock - 1.41% _raw_spin_lock_irqsave - 1.39% do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.76% _raw_spin_unlock 0.75% do_raw_spin_unlock This is basically hammering the pag->pag_buf_lock from lots of CPUs doing trylocks at the same time. Most of the buffer trylock operations ultimately fail after we've done the lookup, so we're really hammering the buf hash lock whilst making no progress. We can also see significant spinlock traffic on the same lock just under normal operation when lots of tasks are accessing metadata from the same AG, so let's avoid all this by converting the lookup fast path to leverages the rhashtable's ability to do rcu protected lookups. We avoid races with the buffer release path by using atomic_inc_not_zero() on the buffer hold count. Any buffer that is in the LRU will have a non-zero count, thereby allowing the lockless fast path to be taken in most cache hit situations. If the buffer hold count is zero, then it is likely going through the release path so in that case we fall back to the existing lookup miss slow path. The slow path will then do an atomic lookup and insert under the buffer hash lock and hence serialise correctly against buffer release freeing the buffer. The use of rcu protected lookups means that buffer handles now need to be freed by RCU callbacks (same as inodes). We still free the buffer pages before the RCU callback - we won't be trying to access them at all on a buffer that has zero references - but we need the buffer handle itself to be present for the entire rcu protected read side to detect a zero hold count correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: remove a superflous hash lookup when inserting new buffersDave Chinner2022-07-141-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently on the slow path insert we repeat the initial hash table lookup before we attempt the insert, resulting in a two traversals of the hash table to ensure the insert is valid. The rhashtable API provides a method for an atomic lookup and insert operation, so we can avoid one of the hash table traversals by using this method. Adapted from a large patch containing this optimisation by Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: reduce the number of atomic when locking a buffer after lookupDave Chinner2022-07-141-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid an extra atomic operation in the non-trylock case by only doing a trylock if the XBF_TRYLOCK flag is set. This follows the pattern in the IO path with NOWAIT semantics where the "trylock-fail-lock" path showed 5-10% reduced throughput compared to just using single lock call when not under NOWAIT conditions. So make that same change here, too. See commit 942491c9e6d6 ("xfs: fix AIM7 regression") for details. Signed-off-by: Dave Chinner <dchinner@redhat.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: merge xfs_buf_find() and xfs_buf_get_map()Dave Chinner2022-07-141-109/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we factored xfs_buf_find(), we can start separating into distinct fast and slow paths from xfs_buf_get_map(). We start by moving the lookup map and perag setup to _get_map(), and then move all the specifics of the fast path lookup into xfs_buf_lookup() and call it directly from _get_map(). We the move all the slow path code to xfs_buf_find_insert(), which is now also called directly from _get_map(). As such, xfs_buf_find() now goes away. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: break up xfs_buf_find() into individual piecesDave Chinner2022-07-141-54/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_buf_find() is made up of three main parts: lookup, insert and locking. The interactions with xfs_buf_get_map() require it to be called twice - once for a pure lookup, and again on lookup failure so the insert path can be run. We want to simplify this down a lot, so split it into a fast path lookup, a slow path insert and a "lock the found buffer" helper. This will then let us integrate these operations more effectively into xfs_buf_get_map() in future patches. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: rework xfs_buf_incore() APIDave Chinner2022-07-075-37/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make it consistent with the other buffer APIs to return a error and the buffer is placed in a parameter. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| * | | | Merge tag 'xfs-iunlink-item-5.20' of ↵Darrick J. Wong2022-07-1414-545/+539
| |\ \ \ \ | | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB xfs: introduce in-memory inode unlink log items To facilitate future improvements in inode logging and improving inode cluster buffer locking order consistency, we need a new mechanism for defering inode cluster buffer modifications during unlinked list modifications. The unlinked inode list buffer locking is complex. The unlinked list is unordered - we add to the tail, remove from where-ever the inode is in the list. Hence we might need to lock two inode buffers here (previous inode in list and the one being removed). While we can order the locking of these buffers correctly within the confines of the unlinked list, there may be other inodes that need buffer locking in the same transaction. e.g. O_TMPFILE being linked into a directory also modifies the directory inode. Hence we need a mechanism for defering unlinked inode list updates until a point where we know that all modifications have been made and all that remains is to lock and modify the cluster buffers. We can do this by first observing that we serialise unlinked list modifications by holding the AGI buffer lock. IOWs, the AGI is going to be locked until the transaction commits any time we modify the unlinked list. Hence it doesn't matter when in the unlink transactions that we actually load, lock and modify the inode cluster buffer. We add an in-memory unlinked inode log item to defer the inode cluster buffer update to transaction commit time where it can be ordered with all the other inode cluster operations that need to be done. Essentially all we need to do is record the inodes that need to have their unlinked list pointer updated in a new log item that we attached to the transaction. This log item exists purely for the purpose of delaying the update of the unlinked list pointer until the inode cluster buffer can be locked in the correct order around the other inode cluster buffers. It plays no part in the actual commit, and there's no change to anything that is written to the log. i.e. the inode cluster buffers still have to be fully logged here (not just ordered) as log recovery depedends on this to replay mods to the unlinked inode list. Hence if we add a "precommit" hook into xfs_trans_commit() to run a "precommit" operation on these iunlink log items, we can delay the locking, modification and logging of the inode cluster buffer until after all other modifications have been made. The precommit hook reuires us to sort the items that are going to be run so that we can lock precommit items in the correct order as we perform the modifications they describe. To make this unlinked inode list processing simpler and easier to implement as a log item, we need to change the way we track the unlinked list in memory. Starting from the observation that an inode on the unlinked list is pinned in memory by the VFS, we can use the xfs_inode itself to track the unlinked list. To do this efficiently, we want the unlinked list to be a double linked list. The problem here is that we need a list per AGI unlinked list, and there are 64 of these per AGI. The approach taken in this patchset is to shadow the AGI unlinked list heads in the perag, and link inodes by agino, hence requiring only 8 extra bytes per inode to track this state. We can then use the agino pointers for lockless inode cache lookups to retreive the inode. The aginos in the inode are modified only under the AGI lock, just like the cluster buffer pointers, so we don't need any extra locking here. The i_next_unlinked field tracks the on-disk value of the unlinked list, and the i_prev_unlinked is a purely in-memory pointer that enables us to efficiently remove inodes from the middle of the list. This results in moving a lot of the unlink modification work into the precommit operations on the unlink log item. Tracking all the unlinked inodes in the inodes themselves also gets rid of the unlinked list reference hash table that is used to track this back pointer relationship. This greatly simplifies the the unlinked list modification code, and removes memory allocations in this hot path to track back pointers. This, overall, slightly reduces the CPU overhead of the unlink path. The result of this log item means that we move all the actual manipulation of objects to be logged out of the iunlink path and into the iunlink item. This allows for future optimisation of this mechanism without needing changes to high level unlink path, as well as making the unlink lock ordering predictable and synchronised with other operations that may require inode cluster locking. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> * tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: add in-memory iunlink log item xfs: add log item precommit operation xfs: combine iunlink inode update functions xfs: clean up xfs_iunlink_update_inode() xfs: double link the unlinked inode list xfs: introduce xfs_iunlink_lookup xfs: refactor xlog_recover_process_iunlinks() xfs: track the iunlink list pointer in the xfs_inode xfs: factor the xfs_iunlink functions xfs: flush inode gc workqueue before clearing agi bucket
| | * | | xfs: add in-memory iunlink log itemDave Chinner2022-07-145-63/+219
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we have a clean operation to update the di_next_unlinked field of inode cluster buffers, we can easily defer this operation to transaction commit time so we can order the inode cluster buffer locking consistently. To do this, we introduce a new in-memory log item to track the unlinked list item modification that we are going to make. This follows the same observations as the in-memory double linked list used to track unlinked inodes in that the inodes on the list are pinned in memory and cannot go away, and hence we can simply reference them for the duration of the transaction without needing to take active references or pin them or look them up. This allows us to pass the xfs_inode to the transaction commit code along with the modification to be made, and then order the logged modifications via the ->iop_sort and ->iop_precommit operations for the new log item type. As this is an in-memory log item, it doesn't have formatting, CIL or AIL operational hooks - it exists purely to run the inode unlink modifications and is then removed from the transaction item list and freed once the precommit operation has run. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
| | * | | xfs: add log item precommit operationDave Chinner2022-07-143-2/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For inodes that are dirty, we have an attached cluster buffer that we want to use to track the dirty inode through the AIL. Unfortunately, locking the cluster buffer and adding it to the transaction when the inode is first logged in a transaction leads to buffer lock ordering inversions. The specific problem is ordering against the AGI buffer. When modifying unlinked lists, the buffer lock order is AGI -> inode cluster buffer as the AGI buffer lock serialises all access to the unlinked lists. Unfortunately, functionality like xfs_droplink() logs the inode before calling xfs_iunlink(), as do various directory manipulation functions. The inode can be logged way down in the stack as far as the bmapi routines and hence, without a major rewrite of lots of APIs there's no way we can avoid the inode being logged by something until after the AGI has been logged. As we are going to be using ordered buffers for inode AIL tracking, there isn't a need to actually lock that buffer against modification as all the modifications are captured by logging the inode item itself. Hence we don't actually need to join the cluster buffer into the transaction until just before it is committed. This means we do not perturb any of the existing buffer lock orders in transactions, and the inode cluster buffer is always locked last in a transaction that doesn't otherwise touch inode cluster buffers. We do this by introducing a precommit log item method. This commit just introduces the mechanism; the inode item implementation is in followup commits. The precommit items need to be sorted into consistent order as we may be locking multiple items here. Hence if we have two dirty inodes in cluster buffers A and B, and some other transaction has two separate dirty inodes in the same cluster buffers, locking them in different orders opens us up to ABBA deadlocks. Hence we sort the items on the transaction based on the presence of a sort log item method. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
| | * | | xfs: combine iunlink inode update functionsDave Chinner2022-07-141-36/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Combine the logging of the inode unlink list update into the calling function that looks up the buffer we end up logging. These do not need to be separate functions as they are both short, simple operations and there's only a single call path through them. This new function will end up being the core of the iunlink log item processing... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: clean up xfs_iunlink_update_inode()Dave Chinner2022-07-141-13/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We no longer need to have this function return the previous next agino value from the on-disk inode as we have it in the in-core inode now. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: double link the unlinked inode listDave Chinner2022-07-146-310/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we have forwards traversal via the incore inode in place, we now need to add back pointers to the incore inode to entirely replace the back reference cache. We use the same lookup semantics and constraints as for the forwards pointer lookups during unlinks, and so we can look up any inode in the unlinked list directly and update the list pointers, forwards or backwards, at any time. The only wrinkle in converting the unlinked list manipulations to use in-core previous pointers is that log recovery doesn't have the incore inode state built up so it can't just read in an inode and release it to finish off the unlink. Hence we need to modify the traversal in recovery to read one inode ahead before we release the inode at the head of the list. This populates the next->prev relationship sufficient to be able to replay the unlinked list and hence greatly simplify the runtime code. This recovery algorithm also requires that we actually remove inodes from the unlinked list one at a time as background inode inactivation will result in unlinked list removal racing with the building of the in-memory unlinked list state. We could serialise this by holding the AGI buffer lock when constructing the in memory state, but all that does is lockstep background processing with list building. It is much simpler to flush the inodegc immediately after releasing the inode so that it is unlinked immediately and there is no races present at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
| | * | | xfs: introduce xfs_iunlink_lookupDave Chinner2022-07-142-96/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an inode is on an unlinked list during normal operation, it is guaranteed to be pinned in memory as it is either referenced by the current unlink operation or it has a open file descriptor that references it and has it pinned in memory. Hence to look up an inode on the unlinked list, we can do a direct inode cache lookup and always expect the lookup to succeed. Add a function to do this lookup based on the agino that we use to link the chain of unlinked inodes together so we can begin the conversion the unlinked list manipulations to use in-memory inodes rather than inode cluster buffers and remove the backref cache. Use this lookup function to replace the on-disk inode buffer walk when removing inodes from the unlinked list with an in-core inode unlinked list walk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: refactor xlog_recover_process_iunlinks()Dave Chinner2022-07-141-66/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For upcoming changes to the way inode unlinked list processing is done, the structure of recovery needs to change slightly. We also really need to untangle the messy error handling in list recovery so that actions like emptying the bucket on inode lookup failure are associated with the bucket list walk failing, not failing to look up the inode. Refactor the recovery code now to keep the re-organisation seperate to the algorithm changes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: track the iunlink list pointer in the xfs_inodeDave Chinner2022-07-144-18/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Having direct access to the i_next_unlinked pointer in unlinked inodes greatly simplifies the processing of inodes on the unlinked list. We no longer need to look up the inode buffer just to find next inode in the list if the xfs_inode is in memory. These improvements will be realised over upcoming patches as other dependencies on the inode buffer for unlinked list processing are removed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
| | * | | xfs: factor the xfs_iunlink functionsDave Chinner2022-07-141-42/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prep work that separates the locking that protects the unlinked list from the actual operations being performed. This also helps document the fact they are performing list insert and remove operations. No functional code change. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| | * | | xfs: flush inode gc workqueue before clearing agi bucketZhang Yi2022-07-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the procedure of recover AGI unlinked lists, if something bad happenes on one of the unlinked inode in the bucket list, we would call xlog_recover_clear_agi_bucket() to clear the whole unlinked bucket list, not the unlinked inodes after the bad one. If we have already added some inodes to the gc workqueue before the bad inode in the list, we could get below error when freeing those inodes, and finaly fail to complete the log recover procedure. XFS (ram0): Internal error xfs_iunlink_remove at line 2456 of file fs/xfs/xfs_inode.c. Caller xfs_ifree+0xb0/0x360 [xfs] The problem is xlog_recover_clear_agi_bucket() clear the bucket list, so the gc worker fail to check the agino in xfs_verify_agino(). Fix this by flush workqueue before clearing the bucket. Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: removed useless condition in function xfs_attr_node_getAndrey Strachuk2022-07-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At line 1561, variable "state" is being compared with NULL every loop iteration. ------------------------------------------------------------------- 1561 for (i = 0; state != NULL && i < state->path.active; i++) { 1562 xfs_trans_brelse(args->trans, state->path.blk[i].bp); 1563 state->path.blk[i].bp = NULL; 1564 } ------------------------------------------------------------------- However, it cannot be NULL. ---------------------------------------- 1546 state = xfs_da_state_alloc(args); ---------------------------------------- xfs_da_state_alloc calls kmem_cache_zalloc. kmem_cache_zalloc is called with __GFP_NOFAIL flag and, therefore, it cannot return NULL. -------------------------------------------------------------------------- struct xfs_da_state * xfs_da_state_alloc( struct xfs_da_args *args) { struct xfs_da_state *state; state = kmem_cache_zalloc(xfs_da_state_cache, GFP_NOFS | __GFP_NOFAIL); state->args = args; state->mp = args->dp->i_mount; return state; } -------------------------------------------------------------------------- Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Andrey Strachuk <strochuk@ispras.ru> Fixes: 4d0cdd2bb8f0 ("xfs: clean up xfs_attr_node_hasname") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | | | xfs: add selinux labels to whiteout inodesEric Sandeen2022-07-093-7/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We got a report that "renameat2() with flags=RENAME_WHITEOUT doesn't apply an SELinux label on xfs" as it does on other filesystems (for example, ext4 and tmpfs.) While I'm not quite sure how labels may interact w/ whiteout files, leaving them as unlabeled seems inconsistent at best. Now that xfs_init_security is not static, rename it to xfs_inode_init_security per dchinner's suggestion. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>