summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* afs: Fix vlserver probe RTT handlingDavid Howells2023-06-161-2/+2
| | | | | | | | | | | | | | | | | | | In the same spirit as commit ca57f02295f1 ("afs: Fix fileserver probe RTT handling"), don't rule out using a vlserver just because there haven't been enough packets yet to calculate a real rtt. Always set the server's probe rtt from the estimate provided by rxrpc_kernel_get_srtt, which is capped at 1 second. This could lead to EDESTADDRREQ errors when accessing a cell for the first time, even though the vl servers are known and have responded to a probe. Fixes: 1d4adfaf6574 ("rxrpc: Make rxrpc_kernel_get_srtt() indicate validity") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2023-June/006746.html Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'for-6.4-rc6-tag' of ↵Linus Torvalds2023-06-163-10/+22
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two fixes for NOCOW files, a regression fix in scrub and an assertion fix: - NOCOW fixes: - keep length of iomap direct io request in case of a failure - properly pass mode of extent reference checking, this can break some cases for swapfile - fix error value confusion when scrubbing a stripe - convert assertion to a proper error handling when loading global roots, reported by syzbot" * tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: fix a return value overwrite in scrub_stripe() btrfs: do not ASSERT() on duplicated global roots btrfs: can_nocow_file_extent should pass down args->strict from callers btrfs: fix iomap_begin length for nocow writes
| * btrfs: scrub: fix a return value overwrite in scrub_stripe()Qu Wenruo2023-06-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [RETURN VALUE OVERWRITE] Inside scrub_stripe(), we would submit all the remaining stripes after iterating all extents. But since flush_scrub_stripes() can return error, we need to avoid overwriting the existing @ret if there is any error. However the existing check is doing the wrong check: ret2 = flush_scrub_stripes(); if (!ret2) ret = ret2; This would overwrite the existing @ret to 0 as long as the final flush detects no critical errors. [FIX] We should check @ret other than @ret2 in that case. Fixes: 8eb3dd17eadd ("btrfs: dev-replace: error out if we have unrepaired metadata error during") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: do not ASSERT() on duplicated global rootsQu Wenruo2023-06-131-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot mount option on a corrupted fs. The full report can be found here: https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0 BTRFS error (device loop0: state C): failed to load root csum assertion failed: !tmp, in fs/btrfs/disk-io.c:1103 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3664! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663 RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246 RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400 RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000 RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1 R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000 R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000 FS: 0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103 load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467 load_global_roots fs/btrfs/disk-io.c:2501 [inline] btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline] init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939 open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574 btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456 btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824 legacy_get_tree+0xea/0x180 fs/fs_context.c:610 vfs_get_tree+0x88/0x270 fs/super.c:1530 fc_mount fs/namespace.c:1043 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073 btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884 [CAUSE] Since the introduction of global roots, we handle csum/extent/free-space-tree roots as global roots, even if no extent-tree-v2 feature is enabled. So for regular csum/extent/fst roots, we load them into fs_info::global_root_tree rb tree. And we should not expect any conflicts in that rb tree, thus we have an ASSERT() inside btrfs_global_root_insert(). But rescue=usebackuproot can break the assumption, as we will try to load those trees again and again as long as we have bad roots and have backup roots slot remaining. So in that case we can have conflicting roots in the rb tree, and triggering the ASSERT() crash. [FIX] We can safely remove that ASSERT(), as the caller will properly put the offending root. To make further debugging easier, also add two explicit error messages: - Error message for conflicting global roots - Error message when using backup roots slot Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com Fixes: abed4aaae4f7 ("btrfs: track the csum, extent, and free space trees in a rb tree") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: can_nocow_file_extent should pass down args->strict from callersChris Mason2023-06-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 619104ba453ad0 ("btrfs: move common NOCOW checks against a file extent into a helper") changed our call to btrfs_cross_ref_exist() to always pass false for the 'strict' parameter. We're passing this down through the stack so that we can do a full check for cross references during swapfile activation. With strict always false, this test fails: btrfs subvol create swappy chattr +C swappy fallocate -l1G swappy/swapfile chmod 600 swappy/swapfile mkswap swappy/swapfile btrfs subvol snap swappy swapsnap btrfs subvol del -C swapsnap btrfs fi sync / sync;sync;sync swapon swappy/swapfile The fix is to just use args->strict, and everyone except swapfile activation is passing false. Fixes: 619104ba453ad0 ("btrfs: move common NOCOW checks against a file extent into a helper") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix iomap_begin length for nocow writesChristoph Hellwig2023-06-131-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | can_nocow_extent can reduce the len passed in, which needs to be propagated to btrfs_dio_iomap_begin so that iomap does not submit more data then is mapped. This problems exists since the btrfs_get_blocks_direct helper was added in commit c5794e51784a ("btrfs: Factor out write portion of btrfs_get_blocks_direct"), but the ordered_extent splitting added in commit b73a6fd1b1ef ("btrfs: split partial dio bios before submit") added a WARN_ON that made a syzkaller test fail. Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com Fixes: c5794e51784a ("btrfs: Factor out write portion of btrfs_get_blocks_direct") CC: stable@vger.kernel.org # 6.1+ Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2023-06-151-12/+13
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix two regressions in ext4, one report by syzkaller[1], and reported by multiple users (and tracked by regzbot[2])" [1] https://syzkaller.appspot.com/bug?extid=4acc7d910e617b360859 [2] https://linux-regtracking.leemhuis.info/regzbot/regression/ZIauBR7YiV3rVAHL@glitch/ * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: drop the call to ext4_error() from ext4_get_group_info() Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"
| * | ext4: drop the call to ext4_error() from ext4_get_group_info()Fabio M. De Francesco2023-06-141-11/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent patch added a call to ext4_error() which is problematic since some callers of the ext4_get_group_info() function may be holding a spinlock, whereas ext4_error() must never be called in atomic context. This triggered a report from Syzbot: "BUG: sleeping function called from invalid context in ext4_update_super" (see the link below). Therefore, drop the call to ext4_error() from ext4_get_group_info(). In the meantime use eight characters tabs instead of nine characters ones. Reported-by: syzbot+4acc7d910e617b360859@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000070575805fdc6cdb2@google.com/ Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail") Suggested-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Link: https://lore.kernel.org/r/20230614100446.14337-1-fmdefrancesco@gmail.com
| * | Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"Kemeng Shi2023-06-141-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit ad3f09be6cfe332be8ff46c78e6ec0f8839107aa. The reverted commit was intended to simpfy the code to get group descriptor block number in non-meta block group by assuming s_gdb_count is block number used for all non-meta block group descriptors. However s_gdb_count is block number used for all meta *and* non-meta group descriptors. So s_gdb_group will be > actual group descriptor block number used for all non-meta block group which should be "total non-meta block group" / "group descriptors per block", e.g. s_first_meta_bg. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230613225025.3859522-1-shikemeng@huaweicloud.com Fixes: ad3f09be6cfe ("ext4: remove unnecessary check in ext4_bg_num_gdb_nometa") Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | Merge tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2023-06-159-56/+190
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull smb client fixes from Steve French: "Eight, mostly small, smb3 client fixes: - important fix for deferred close oops (race with unmount) found with xfstest generic/098 to some servers - important reconnect fix - fix problem with max_credits mount option - two multichannel (interface related) fixes - one trivial removal of confusing comment - two small debugging improvements (to better spot crediting problems)" * tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: add a warning when the in-flight count goes negative cifs: fix lease break oops in xfstest generic/098 cifs: fix max_credits implementation cifs: fix sockaddr comparison in iface_cmp smb/client: print "Unknown" instead of bogus link speed value cifs: print all credit counters in DebugData cifs: fix status checks in cifs_tree_connect smb: remove obsolete comment
| * | | cifs: add a warning when the in-flight count goes negativeShyam Prasad N2023-06-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've seen the in-flight count go into negative with some internal stress testing in Microsoft. Adding a WARN when this happens, in hope of understanding why this happens when it happens. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | cifs: fix lease break oops in xfstest generic/098Steve French2023-06-141-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | umount can race with lease break so need to check if tcon->ses->server is still valid to send the lease break response. Reviewed-by: Bharath SM <bharathsm@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Fixes: 59a556aebc43 ("SMB3: drop reference to cfile before sending oplock break") Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | cifs: fix max_credits implementationShyam Prasad N2023-06-112-4/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current implementation of max_credits on the client does not work because the CreditRequest logic for several commands does not take max_credits into account. Still, we can end up asking the server for more credits, depending on the number of credits in flight. For this, we need to limit the credits while parsing the responses too. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | cifs: fix sockaddr comparison in iface_cmpShyam Prasad N2023-06-114-37/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iface_cmp used to simply do a memcmp of the two provided struct sockaddrs. The comparison needs to do more based on the address family. Similar logic was already present in cifs_match_ipaddr. Doing something similar now. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | smb/client: print "Unknown" instead of bogus link speed valueEnzo Matsumiya2023-06-111-1/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The virtio driver for Linux guests will not set a link speed to its paravirtualized NICs. This will be seen as -1 in the ethernet layer, and when some servers (e.g. samba) fetches it, it's converted to an unsigned value (and multiplied by 1000 * 1000), so in client side we end up with: 1) Speed: 4294967295000000 bps in DebugData. This patch introduces a helper that returns a speed string (in Mbps or Gbps) if interface speed is valid (>= SPEED_10 and <= SPEED_800000), or "Unknown" otherwise. The reason to not change the value in iface->speed is because we don't know the real speed of the HW backing the server NIC, so let's keep considering these as the fastest NICs available. Also print "Capabilities: None" when the interface doesn't support any. Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | cifs: print all credit counters in DebugDataShyam Prasad N2023-06-111-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Output of /proc/fs/cifs/DebugData shows only the per-connection counter for the number of credits of regular type. i.e. the credits reserved for echo and oplocks are not displayed. There have been situations recently where having this info would have been useful. This change prints the credit counters of all three types: regular, echo, oplocks. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | cifs: fix status checks in cifs_tree_connectShyam Prasad N2023-06-112-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ordering of status checks at the beginning of cifs_tree_connect is wrong. As a result, a tcon which is good may stay marked as needing reconnect infinitely. Fixes: 2f0e4f034220 ("cifs: check only tcon status on tcon related functions") Cc: stable@vger.kernel.org # 6.3 Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | smb: remove obsolete comment鑫华2023-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because do_gettimeofday has been removed and replaced by ktime_get_real_ts64, So just remove the comment as it's not needed now. Signed-off-by: 鑫华 <jixianghua@xfusion.com> Signed-off-by: Steve French <stfrench@microsoft.com>
* | | | Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of ↵Linus Torvalds2023-06-127-9/+88
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "19 hotfixes. 14 are cc:stable and the remainder address issues which were introduced during this development cycle or which were considered inappropriate for a backport" * tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: zswap: do not shrink if cgroup may not zswap page cache: fix page_cache_next/prev_miss off by one ocfs2: check new file size on fallocate call mailmap: add entry for John Keeping mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp() epoll: ep_autoremove_wake_function should use list_del_init_careful mm/gup_test: fix ioctl fail for compat task nilfs2: reject devices with insufficient block count ocfs2: fix use-after-free when unmounting read-only filesystem lib/test_vmalloc.c: avoid garbage in page array nilfs2: fix possible out-of-bounds segment allocation in resize ioctl riscv/purgatory: remove PGO flags powerpc/purgatory: remove PGO flags x86/purgatory: remove PGO flags kexec: support purgatories with .text.hot sections mm/uffd: allow vma to merge as much as possible mm/uffd: fix vma operation where start addr cuts part of vma radix-tree: move declarations to header nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
| * | | | ocfs2: check new file size on fallocate callLuís Henriques2023-06-121-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When changing a file size with fallocate() the new size isn't being checked. In particular, the FSIZE ulimit isn't being checked, which makes fstest generic/228 fail. Simply adding a call to inode_newsize_ok() fixes this issue. Link: https://lkml.kernel.org/r/20230529152645.32680-1-lhenriques@suse.de Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Mark Fasheh <mark@fasheh.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | epoll: ep_autoremove_wake_function should use list_del_init_carefulBenjamin Segall2023-06-121-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | autoremove_wake_function uses list_del_init_careful, so should epoll's more aggressive variant. It only doesn't because it was copied from an older wait.c rather than the most recent. [bsegall@google.com: add comment] Link: https://lkml.kernel.org/r/xm26bki0ulsr.fsf_-_@google.com Link: https://lkml.kernel.org/r/xm26pm6hvfer.fsf@google.com Fixes: a16ceb139610 ("epoll: autoremove wakers even more aggressively") Signed-off-by: Ben Segall <bsegall@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | nilfs2: reject devices with insufficient block countRyusuke Konishi2023-06-121-1/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current sanity check for nilfs2 geometry information lacks checks for the number of segments stored in superblocks, so even for device images that have been destructively truncated or have an unusually high number of segments, the mount operation may succeed. This causes out-of-bounds block I/O on file system block reads or log writes to the segments, the latter in particular causing "a_ops->writepages" to repeatedly fail, resulting in sync_inodes_sb() to hang. Fix this issue by checking the number of segments stored in the superblock and avoiding mounting devices that can cause out-of-bounds accesses. To eliminate the possibility of overflow when calculating the number of blocks required for the device from the number of segments, this also adds a helper function to calculate the upper bound on the number of segments and inserts a check using it. Link: https://lkml.kernel.org/r/20230526021332.3431-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+7d50f1e54a12ba3aeae2@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=7d50f1e54a12ba3aeae2 Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | ocfs2: fix use-after-free when unmounting read-only filesystemLuís Henriques2023-06-121-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's trivial to trigger a use-after-free bug in the ocfs2 quotas code using fstest generic/452. After a read-only remount, quotas are suspended and ocfs2_mem_dqinfo is freed through ->ocfs2_local_free_info(). When unmounting the filesystem, an UAF access to the oinfo will eventually cause a crash. BUG: KASAN: slab-use-after-free in timer_delete+0x54/0xc0 Read of size 8 at addr ffff8880389a8208 by task umount/669 ... Call Trace: <TASK> ... timer_delete+0x54/0xc0 try_to_grab_pending+0x31/0x230 __cancel_work_timer+0x6c/0x270 ocfs2_disable_quotas.isra.0+0x3e/0xf0 [ocfs2] ocfs2_dismount_volume+0xdd/0x450 [ocfs2] generic_shutdown_super+0xaa/0x280 kill_block_super+0x46/0x70 deactivate_locked_super+0x4d/0xb0 cleanup_mnt+0x135/0x1f0 ... </TASK> Allocated by task 632: kasan_save_stack+0x1c/0x40 kasan_set_track+0x21/0x30 __kasan_kmalloc+0x8b/0x90 ocfs2_local_read_info+0xe3/0x9a0 [ocfs2] dquot_load_quota_sb+0x34b/0x680 dquot_load_quota_inode+0xfe/0x1a0 ocfs2_enable_quotas+0x190/0x2f0 [ocfs2] ocfs2_fill_super+0x14ef/0x2120 [ocfs2] mount_bdev+0x1be/0x200 legacy_get_tree+0x6c/0xb0 vfs_get_tree+0x3e/0x110 path_mount+0xa90/0xe10 __x64_sys_mount+0x16f/0x1a0 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Freed by task 650: kasan_save_stack+0x1c/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x50 __kasan_slab_free+0xf9/0x150 __kmem_cache_free+0x89/0x180 ocfs2_local_free_info+0x2ba/0x3f0 [ocfs2] dquot_disable+0x35f/0xa70 ocfs2_susp_quotas.isra.0+0x159/0x1a0 [ocfs2] ocfs2_remount+0x150/0x580 [ocfs2] reconfigure_super+0x1a5/0x3a0 path_mount+0xc8a/0xe10 __x64_sys_mount+0x16f/0x1a0 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Link: https://lkml.kernel.org/r/20230522102112.9031-1-lhenriques@suse.de Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | nilfs2: fix possible out-of-bounds segment allocation in resize ioctlRyusuke Konishi2023-06-121-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot reports that in its stress test for resize ioctl, the log writing function nilfs_segctor_do_construct hits a WARN_ON in nilfs_segctor_truncate_segments(). It turned out that there is a problem with the current implementation of the resize ioctl, which changes the writable range on the device (the range of allocatable segments) at the end of the resize process. This order is necessary for file system expansion to avoid corrupting the superblock at trailing edge. However, in the case of a file system shrink, if log writes occur after truncating out-of-bounds trailing segments and before the resize is complete, segments may be allocated from the truncated space. The userspace resize tool was fine as it limits the range of allocatable segments before performing the resize, but it can run into this issue if the resize ioctl is called alone. Fix this issue by changing nilfs_sufile_resize() to update the range of allocatable segments immediately after successful truncation of segment space in case of file system shrink. Link: https://lkml.kernel.org/r/20230524094348.3784-1-konishi.ryusuke@gmail.com Fixes: 4e33f9eab07e ("nilfs2: implement resize ioctl") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+33494cd0df2ec2931851@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/0000000000005434c405fbbafdc5@google.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/uffd: allow vma to merge as much as possiblePeter Xu2023-06-121-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We used to not pass in the pgoff correctly when register/unregister uffd regions, it caused incorrect behavior on vma merging and can cause mergeable vmas being separate after ioctls return. For example, when we have: vma1(range 0-9, with uffd), vma2(range 10-19, no uffd) Then someone unregisters uffd on range (5-9), it should logically become: vma1(range 0-4, with uffd), vma2(range 5-19, no uffd) But with current code we'll have: vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd) This patch allows such merge to happen correctly before ioctl returns. This behavior seems to have existed since the 1st day of uffd. Since pgoff for vma_merge() is only used to identify the possibility of vma merging, meanwhile here what we did was always passing in a pgoff smaller than what we should, so there should have no other side effect besides not merging it. Let's still tentatively copy stable for this, even though I don't see anything will go wrong besides vma being split (which is mostly not user visible). Link: https://lkml.kernel.org/r/20230517190916.3429499-3-peterx@redhat.com Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/uffd: fix vma operation where start addr cuts part of vmaPeter Xu2023-06-121-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm/uffd: Fix vma merge/split", v2. This series contains two patches that fix vma merge/split for userfaultfd on two separate issues. Patch 1 fixes a regression since 6.1+ due to something we overlooked when converting to maple tree apis. The plan is we use patch 1 to replace the commit "2f628010799e (mm: userfaultfd: avoid passing an invalid range to vma_merge())" in mm-hostfixes-unstable tree if possible, so as to bring uffd vma operations back aligned with the rest code again. Patch 2 fixes a long standing issue that vma can be left unmerged even if we can for either uffd register or unregister. Many thanks to Lorenzo on either noticing this issue from the assert movement patch, looking at this problem, and also provided a reproducer on the unmerged vma issue [1]. [1] https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e This patch (of 2): It seems vma merging with uffd paths is broken with either register/unregister, where right now we can feed wrong parameters to vma_merge() and it's found by recent patch which moved asserts upwards in vma_merge() by Lorenzo Stoakes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/ It's possible that "start" is contained within vma but not clamped to its start. We need to convert this into either "cannot merge" case or "can merge" case 4 which permits subdivision of prev by assigning vma to prev. As we loop, each subsequent VMA will be clamped to the start. This patch will eliminate the report and make sure vma_merge() calls will become legal again. One thing to mention is that the "Fixes: 29417d292bd0" below is there only to help explain where the warning can start to trigger, the real commit to fix should be 69dbe6daf104. Commit 29417d292bd0 helps us to identify the issue, but unfortunately we may want to keep it in Fixes too just to ease kernel backporters for easier tracking. Link: https://lkml.kernel.org/r/20230517190916.3429499-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230517190916.3429499-2-peterx@redhat.com Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Closes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/ Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()Ryusuke Konishi2023-06-121-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A syzbot fault injection test reported that nilfs_btnode_create_block, a helper function that allocates a new node block for b-trees, causes a kernel BUG for disk images where the file system block size is smaller than the page size. This was due to unexpected flags on the newly allocated buffer head, and it turned out to be because the buffer flags were not cleared by nilfs_btnode_abort_change_key() after an error occurred during a b-tree update operation and the buffer was later reused in that state. Fix this issue by using nilfs_btnode_delete() to abandon the unused preallocated buffer in nilfs_btnode_abort_change_key(). Link: https://lkml.kernel.org/r/20230513102428.10223-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+b0a35a5c1f7e846d3b09@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/000000000000d1d6c205ebc4d512@google.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | | | Merge tag 'for-6.4-rc6-tag' of ↵Linus Torvalds2023-06-123-11/+30
|\ \ \ \ \ | |_|/ / / |/| | | / | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A more fixes and regression fixes: - in subpage mode, fix crash when repairing metadata at the end of a stripe - properly enable async discard when remounting from read-only to read-write - scrub regression fixes: - respect read-only scrub when attempting to do a repair - fix reporting of found errors, the stats don't get properly accounted after a stripe repair" * tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: scrub: also report errors hit during the initial read btrfs: scrub: respect the read-only flag during repair btrfs: properly enable async discard when switching from RO->RW btrfs: subpage: fix a crash in metadata repair path
| * | | btrfs: scrub: also report errors hit during the initial readQu Wenruo2023-06-081-6/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] After the recent scrub rework introduced in commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"), btrfs scrub no longer reports repaired errors any more: # mkfs.btrfs -f $dev -d DUP # mount $dev $mnt # xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file # umount $dev # xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror # mount $dev $mnt # btrfs scrub start -BR $mnt scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218 Scrub started: Tue Jun 6 14:56:50 2023 Status: finished Duration: 0:00:00 data_extents_scrubbed: 2 tree_extents_scrubbed: 18 data_bytes_scrubbed: 131072 tree_bytes_scrubbed: 294912 read_errors: 0 csum_errors: 0 <<< No errors here verify_errors: 0 [...] uncorrectable_errors: 0 unverified_errors: 0 corrected_errors: 16 <<< Only corrected errors last_physical: 2723151872 This can confuse btrfs-progs, as it relies on the csum_errors to determine if there is anything wrong. While on v6.3.x kernels, the report is different: csum_errors: 16 <<< verify_errors: 0 [...] uncorrectable_errors: 0 unverified_errors: 0 corrected_errors: 16 <<< [CAUSE] In the reworked scrub, we update the scrub progress inside scrub_stripe_report_errors(), using various bitmaps to update the result. For example for csum_errors, we use bitmap_weight() of stripe->csum_error_bitmap. Unfortunately at that stage, all error bitmaps (except init_error_bitmap) are the result of the latest repair attempt, thus if the stripe is fully repaired, those error bitmaps will all be empty, resulting the above output mismatch. To fix this, record the number of errors into stripe->init_nr_*_errors. Since we don't really care about where those errors are, we only need to record the number of errors. Then in scrub_stripe_report_errors(), use those initial numbers to update the progress other than using the latest error bitmaps. Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: scrub: respect the read-only flag during repairQu Wenruo2023-06-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] With recent scrub rework, the scrub operation no longer respects the read-only flag passed by "-r" option of "btrfs scrub start" command. # mkfs.btrfs -f -d raid1 $dev1 $dev2 # mount $dev1 $mnt # xfs_io -f -d -c "pwrite -b 128K -S 0xaa 0 128k" $mnt/file # sync # xfs_io -c "pwrite -S 0xff $phy1 64k" $dev1 # xfs_io -c "pwrite -S 0xff $((phy2 + 65536)) 64k" $dev2 # mount $dev1 $mnt -o ro # btrfs scrub start -BrRd $mnt Scrub device $dev1 (id 1) done Scrub started: Tue Jun 6 09:59:14 2023 Status: finished Duration: 0:00:00 [...] corrected_errors: 16 <<< Still has corrupted sectors last_physical: 1372585984 Scrub device $dev2 (id 2) done Scrub started: Tue Jun 6 09:59:14 2023 Status: finished Duration: 0:00:00 [...] corrected_errors: 16 <<< Still has corrupted sectors last_physical: 1351614464 # btrfs scrub start -BrRd $mnt Scrub device $dev1 (id 1) done Scrub started: Tue Jun 6 10:00:17 2023 Status: finished Duration: 0:00:00 [...] corrected_errors: 0 <<< No more errors last_physical: 1372585984 Scrub device $dev2 (id 2) done [...] corrected_errors: 0 <<< No more errors last_physical: 1372585984 [CAUSE] In the newly reworked scrub code, repair is always submitted no matter if we're doing a read-only scrub. [FIX] Fix it by skipping the write submission if the scrub is a read-only one. Unfortunately for the report part, even for a read-only scrub we will still report it as corrected errors, as we know it's repairable, even we won't really submit the write. Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: properly enable async discard when switching from RO->RWChris Mason2023-06-061-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The async discard uses the BTRFS_FS_DISCARD_RUNNING bit in the fs_info to force discards off when the filesystem has aborted or we're generally not able to run discards. This gets flipped on when we're mounted rw, and also when we go from ro->rw. Commit 63a7cb13071842 ("btrfs: auto enable discard=async when possible") enabled async discard by default, and this meant "mount -o ro /dev/xxx /yyy" had async discards turned on. Unfortunately, this meant our check in btrfs_remount_cleanup() would see that discards are already on: /* If we toggled discard async */ if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) && btrfs_test_opt(fs_info, DISCARD_ASYNC)) btrfs_discard_resume(fs_info); So, we'd never call btrfs_discard_resume() when remounting the root filesystem from ro->rw. drgn shows this really nicely: import os import sys from drgn.helpers.linux.fs import path_lookup from drgn import NULL, Object, Type, cast def btrfs_sb(sb): return cast("struct btrfs_fs_info *", sb.s_fs_info) if len(sys.argv) == 1: path = "/" else: path = sys.argv[1] fs_info = cast("struct btrfs_fs_info *", path_lookup(prog, path).mnt.mnt_sb.s_fs_info) BTRFS_FS_DISCARD_RUNNING = 1 << prog['BTRFS_FS_DISCARD_RUNNING'] if fs_info.flags & BTRFS_FS_DISCARD_RUNNING: print("discard running flag is on") else: print("discard running flag is off") [root]# mount | grep nvme /dev/nvme0n1p3 on / type btrfs (rw,relatime,compress-force=zstd:3,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/) [root]# ./discard_running.drgn discard running flag is off [root]# mount -o remount,discard=sync / [root]# mount -o remount,discard=async / [root]# ./discard_running.drgn discard running flag is on The fix is to call btrfs_discard_resume() when we're going from ro->rw. It already checks to make sure the async discard flag is on, so it'll do the right thing. Fixes: 63a7cb13071842 ("btrfs: auto enable discard=async when possible") CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: subpage: fix a crash in metadata repair pathQu Wenruo2023-06-051-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] Test case btrfs/027 would crash with subpage (64K page size, 4K sectorsize) with the following dying messages: debug: map_length=16384 length=65536 type=metadata|raid6(0x104) assertion failed: map_length >= length, in fs/btrfs/volumes.c:8093 ------------[ cut here ]------------ kernel BUG at fs/btrfs/messages.c:259! Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: btrfs_assertfail+0x28/0x2c [btrfs] btrfs_map_repair_block+0x150/0x2b8 [btrfs] btrfs_repair_io_failure+0xd4/0x31c [btrfs] btrfs_read_extent_buffer+0x150/0x16c [btrfs] read_tree_block+0x38/0xbc [btrfs] read_tree_root_path+0xfc/0x1bc [btrfs] btrfs_get_root_ref.part.0+0xd4/0x3a8 [btrfs] open_ctree+0xa30/0x172c [btrfs] btrfs_mount_root+0x3c4/0x4a4 [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xec vfs_kern_mount.part.0+0x90/0xd4 vfs_kern_mount+0x14/0x28 btrfs_mount+0x114/0x418 [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xec path_mount+0x3e0/0xb64 __arm64_sys_mount+0x200/0x2d8 invoke_syscall+0x48/0x114 el0_svc_common.constprop.0+0x60/0x11c do_el0_svc+0x38/0x98 el0_svc+0x40/0xa8 el0t_64_sync_handler+0xf4/0x120 el0t_64_sync+0x190/0x194 Code: aa0403e2 b0fff060 91010000 959c2024 (d4210000) [CAUSE] In btrfs/027 we test RAID6 with missing devices, in this particular case, we're repairing a metadata at the end of a data stripe. But at btrfs_repair_io_failure(), we always pass a full PAGE for repair, and for subpage case this can cross stripe boundary and lead to the above BUG_ON(). This metadata repair code is always there, since the introduction of subpage support, but this can trigger BUG_ON() after the bio split ability at btrfs_map_bio(). [FIX] Instead of passing the old PAGE_SIZE, we calculate the correct length based on the eb size and page size for both regular and subpage cases. CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | Merge tag '6.4-rc5-smb3-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds2023-06-116-56/+62
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull smb server fixes from Steve French: "Five smb3 server fixes, all also for stable: - Fix four slab out of bounds warnings: improve checks for protocol id, and for small packet length, and for create context parsing, and for negotiate context parsing - Fix for incorrect dereferencing POSIX ACLs" * tag '6.4-rc5-smb3-server-fixes' of git://git.samba.org/ksmbd: ksmbd: validate smb request protocol id ksmbd: check the validation of pdu_size in ksmbd_conn_handler_loop ksmbd: fix posix_acls and acls dereferencing possible ERR_PTR() ksmbd: fix out-of-bound read in parse_lease_state() ksmbd: fix out-of-bound read in deassemble_neg_contexts()
| * | | | ksmbd: validate smb request protocol idNamjae Jeon2023-06-022-3/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch add the validation for smb request protocol id. If it is not one of the four ids(SMB1_PROTO_NUMBER, SMB2_PROTO_NUMBER, SMB2_TRANSFORM_PROTO_NUM, SMB2_COMPRESSION_TRANSFORM_ID), don't allow processing the request. And this will fix the following KASAN warning also. [ 13.905265] BUG: KASAN: slab-out-of-bounds in init_smb2_rsp_hdr+0x1b9/0x1f0 [ 13.905900] Read of size 16 at addr ffff888005fd2f34 by task kworker/0:2/44 ... [ 13.908553] Call Trace: [ 13.908793] <TASK> [ 13.908995] dump_stack_lvl+0x33/0x50 [ 13.909369] print_report+0xcc/0x620 [ 13.910870] kasan_report+0xae/0xe0 [ 13.911519] kasan_check_range+0x35/0x1b0 [ 13.911796] init_smb2_rsp_hdr+0x1b9/0x1f0 [ 13.912492] handle_ksmbd_work+0xe5/0x820 Cc: stable@vger.kernel.org Reported-by: Chih-Yen Chang <cc85nod@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | ksmbd: check the validation of pdu_size in ksmbd_conn_handler_loopNamjae Jeon2023-06-021-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The length field of netbios header must be greater than the SMB header sizes(smb1 or smb2 header), otherwise the packet is an invalid SMB packet. If `pdu_size` is 0, ksmbd allocates a 4 bytes chunk to `conn->request_buf`. In the function `get_smb2_cmd_val` ksmbd will read cmd from `rcv_hdr->Command`, which is `conn->request_buf + 12`, causing the KASAN detector to print the following error message: [ 7.205018] BUG: KASAN: slab-out-of-bounds in get_smb2_cmd_val+0x45/0x60 [ 7.205423] Read of size 2 at addr ffff8880062d8b50 by task ksmbd:42632/248 ... [ 7.207125] <TASK> [ 7.209191] get_smb2_cmd_val+0x45/0x60 [ 7.209426] ksmbd_conn_enqueue_request+0x3a/0x100 [ 7.209712] ksmbd_server_process_request+0x72/0x160 [ 7.210295] ksmbd_conn_handler_loop+0x30c/0x550 [ 7.212280] kthread+0x160/0x190 [ 7.212762] ret_from_fork+0x1f/0x30 [ 7.212981] </TASK> Cc: stable@vger.kernel.org Reported-by: Chih-Yen Chang <cc85nod@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | ksmbd: fix posix_acls and acls dereferencing possible ERR_PTR()Namjae Jeon2023-06-022-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dan reported the following error message: fs/smb/server/smbacl.c:1296 smb_check_perm_dacl() error: 'posix_acls' dereferencing possible ERR_PTR() fs/smb/server/vfs.c:1323 ksmbd_vfs_make_xattr_posix_acl() error: 'posix_acls' dereferencing possible ERR_PTR() fs/smb/server/vfs.c:1830 ksmbd_vfs_inherit_posix_acl() error: 'acls' dereferencing possible ERR_PTR() __get_acl() returns a mix of error pointers and NULL. This change it with IS_ERR_OR_NULL(). Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | ksmbd: fix out-of-bound read in parse_lease_state()Namjae Jeon2023-06-021-42/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This bug is in parse_lease_state, and it is caused by the missing check of `struct create_context`. When the ksmbd traverses the create_contexts, it doesn't check if the field of `NameOffset` and `Next` is valid, The KASAN message is following: [ 6.664323] BUG: KASAN: slab-out-of-bounds in parse_lease_state+0x7d/0x280 [ 6.664738] Read of size 2 at addr ffff888005c08988 by task kworker/0:3/103 ... [ 6.666644] Call Trace: [ 6.666796] <TASK> [ 6.666933] dump_stack_lvl+0x33/0x50 [ 6.667167] print_report+0xcc/0x620 [ 6.667903] kasan_report+0xae/0xe0 [ 6.668374] kasan_check_range+0x35/0x1b0 [ 6.668621] parse_lease_state+0x7d/0x280 [ 6.668868] smb2_open+0xbe8/0x4420 [ 6.675137] handle_ksmbd_work+0x282/0x820 Use smb2_find_context_vals() to find smb2 create request lease context. smb2_find_context_vals validate create context fields. Cc: stable@vger.kernel.org Reported-by: Chih-Yen Chang <cc85nod@gmail.com> Tested-by: Chih-Yen Chang <cc85nod@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | ksmbd: fix out-of-bound read in deassemble_neg_contexts()Namjae Jeon2023-06-021-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The check in the beginning is `clen + sizeof(struct smb2_neg_context) <= len_of_ctxts`, but in the end of loop, `len_of_ctxts` will subtract `((clen + 7) & ~0x7) + sizeof(struct smb2_neg_context)`, which causes integer underflow when clen does the 8 alignment. We should use `(clen + 7) & ~0x7` in the check to avoid underflow from happening. Then there are some variables that need to be declared unsigned instead of signed. [ 11.671070] BUG: KASAN: slab-out-of-bounds in smb2_handle_negotiate+0x799/0x1610 [ 11.671533] Read of size 2 at addr ffff888005e86cf2 by task kworker/0:0/7 ... [ 11.673383] Call Trace: [ 11.673541] <TASK> [ 11.673679] dump_stack_lvl+0x33/0x50 [ 11.673913] print_report+0xcc/0x620 [ 11.674671] kasan_report+0xae/0xe0 [ 11.675171] kasan_check_range+0x35/0x1b0 [ 11.675412] smb2_handle_negotiate+0x799/0x1610 [ 11.676217] ksmbd_smb_negotiate_common+0x526/0x770 [ 11.676795] handle_ksmbd_work+0x274/0x810 ... Cc: stable@vger.kernel.org Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com> Tested-by: Chih-Yen Chang <cc85nod@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
* | | | | Merge tag 'ceph-for-6.4-rc6' of https://github.com/ceph/ceph-clientLinus Torvalds2023-06-092-1/+9
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph fixes from Ilya Dryomov: "A fix for a potential data corruption in differential backup and snapshot-based mirroring scenarios in RBD and a reference counting fixup to avoid use-after-free in CephFS, all marked for stable" * tag 'ceph-for-6.4-rc6' of https://github.com/ceph/ceph-client: ceph: fix use-after-free bug for inodes when flushing capsnaps rbd: get snapshot context after exclusive lock is ensured to be held rbd: move RBD_OBJ_FLAG_COPYUP_ENABLED flag setting
| * | | | | ceph: fix use-after-free bug for inodes when flushing capsnapsXiubo Li2023-06-082-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a race between capsnaps flush and removing the inode from 'mdsc->snap_flush_list' list: == Thread A == == Thread B == ceph_queue_cap_snap() -> allocate 'capsnapA' ->ihold('&ci->vfs_inode') ->add 'capsnapA' to 'ci->i_cap_snaps' ->add 'ci' to 'mdsc->snap_flush_list' ... == Thread C == ceph_flush_snaps() ->__ceph_flush_snaps() ->__send_flush_snap() handle_cap_flushsnap_ack() ->iput('&ci->vfs_inode') this also will release 'ci' ... == Thread D == ceph_handle_snap() ->flush_snaps() ->iterate 'mdsc->snap_flush_list' ->get the stale 'ci' ->remove 'ci' from ->ihold(&ci->vfs_inode) this 'mdsc->snap_flush_list' will WARNING To fix this we will increase the inode's i_count ref when adding 'ci' to the 'mdsc->snap_flush_list' list. [ idryomov: need_put int -> bool ] Cc: stable@vger.kernel.org Link: https://bugzilla.redhat.com/show_bug.cgi?id=2209299 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | | | | | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2023-06-092-7/+5
|\ \ \ \ \ \ | | |_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fix from Ted Ts'o: "Fix an ext4 regression which breaks remounting r/w file systems that have the quota feature enabled" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: only check dquot_initialize_needed() when debugging Revert "ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled"
| * | | | | ext4: only check dquot_initialize_needed() when debuggingTheodore Ts'o2023-06-081-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4_xattr_block_set() relies on its caller to call dquot_initialize() on the inode. To assure that this has happened there are WARN_ON checks. Unfortunately, this is subject to false positives if there is an antagonist thread which is flipping the file system at high rates between r/o and rw. So only do the check if EXT4_XATTR_DEBUG is enabled. Link: https://lore.kernel.org/r/20230608044056.GA1418535@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | | | Revert "ext4: don't clear SB_RDONLY when remounting r/w until quota is ↵Theodore Ts'o2023-06-081-5/+1
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | re-enabled" This reverts commit a44be64bbecb15a452496f60db6eacfee2b59c79. Link: https://lore.kernel.org/r/653b3359-2005-21b1-039d-c55ca4cffdcc@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | | | Merge tag 'xfs-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2023-06-0824-229/+427
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Dave Chinner: "These are a set of regression fixes discovered on recent kernels. I was hoping to send this to you a week and half ago, but events out of my control delayed finalising the changes until early this week. Whilst the diffstat looks large for this stage of the merge window, a large chunk of it comes from moving the guts of one function from one file to another i.e. it's the same code, it is just run in a different context where it is safe to hold a specific lock. Otherwise the individual changes are relatively small and straigtht forward. Summary: - Propagate unlinked inode list corruption back up to log recovery (regression fix) - improve corruption detection for AGFL entries, AGFL indexes and XEFI extents (syzkaller fuzzer oops report) - Avoid double perag reference release (regression fix) - Improve extent merging detection in scrub (regression fix) - Fix a new undefined high bit shift (regression fix) - Fix for AGF vs inode cluster buffer deadlock (regression fix)" * tag 'xfs-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: collect errors from inodegc for unlinked inode recovery xfs: validate block number being freed before adding to xefi xfs: validity check agbnos on the AGFL xfs: fix agf/agfl verification on v4 filesystems xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag() xfs: fix broken logic when detecting mergeable bmap records xfs: Fix undefined behavior of shift into sign bit xfs: fix AGF vs inode cluster buffer deadlock xfs: defered work could create precommits xfs: restore allocation trylock iteration xfs: buffer pins need to hold a buffer reference
| * | | | | xfs: collect errors from inodegc for unlinked inode recoveryDave Chinner2023-06-058-37/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlinked list recovery requires errors removing the inode the from the unlinked list get fed back to the main recovery loop. Now that we offload the unlinking to the inodegc work, we don't get errors being fed back when we trip over a corruption that prevents the inode from being removed from the unlinked list. This means we never clear the corrupt unlinked list bucket, resulting in runtime operations eventually tripping over it and shutting down. Fix this by collecting inodegc worker errors and feed them back to the flush caller. This is largely best effort - the only context that really cares is log recovery, and it only flushes a single inode at a time so we don't need complex synchronised handling. Essentially the inodegc workers will capture the first error that occurs and the next flush will gather them and clear them. The flush itself will only report the first gathered error. In the cases where callers can return errors, propagate the collected inodegc flush error up the error handling chain. In the case of inode unlinked list recovery, there are several superfluous calls to flush queued unlinked inodes - xlog_recover_iunlink_bucket() guarantees that it has flushed the inodegc and collected errors before it returns. Hence nothing in the calling path needs to run a flush, even when an error is returned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | | xfs: validate block number being freed before adding to xefiDave Chinner2023-06-058-23/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bad things happen in defered extent freeing operations if it is passed a bad block number in the xefi. This can come from a bogus agno/agbno pair from deferred agfl freeing, or just a bad fsbno being passed to __xfs_free_extent_later(). Either way, it's very difficult to diagnose where a null perag oops in EFI creation is coming from when the operation that queued the xefi has already been completed and there's no longer any trace of it around.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | | xfs: validity check agbnos on the AGFLDave Chinner2023-06-051-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the agfl or the indexing in the AGF has been corrupted, getting a block form the AGFL could return an invalid block number. If this happens, bad things happen. Check the agbno we pull off the AGFL and return -EFSCORRUPTED if we find somethign bad. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | | xfs: fix agf/agfl verification on v4 filesystemsDave Chinner2023-06-051-17/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a v4 filesystem has fl_last - fl_first != fl_count, we do not not detect the corruption and allow the AGF to be used as it if was fully valid. On V5 filesystems, we reset the AGFL to empty in these cases and avoid the corruption at a small cost of leaked blocks. If we don't catch the corruption on V4 filesystems, bad things happen later when an allocation attempts to trim the free list and either double-frees stale entries in the AGFl or tries to free NULLAGBNO entries. Either way, this is bad. Prevent this from happening by using the AGFL_NEED_RESET logic for v4 filesysetms, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | | xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag()Dave Chinner2023-06-051-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_bmap_longest_free_extent() can return an error when accessing the AGF fails. In this case, the behaviour of xfs_filestream_pick_ag() is conditional on the error. We may continue the loop, or break out of it. The error handling after the loop cleans up the perag reference held when the break occurs. If we continue, the next loop iteration handles cleaning up the perag reference. EIther way, we don't need to release the active perag reference when xfs_bmap_longest_free_extent() fails. Doing so means we do a double decrement on the active reference count, and this causes tha active reference count to fall to zero. At this point, new active references will fail. This leads to unmount hanging because it tries to grab active references to that perag, only for it to fail. This happens inside a loop that retries until a inode tree radix tree tag is cleared, which cannot happen because we can't get an active reference to the perag. The unmount livelocks in this path: xfs_reclaim_inodes+0x80/0xc0 xfs_unmount_flush_inodes+0x5b/0x70 xfs_unmountfs+0x5b/0x1a0 xfs_fs_put_super+0x49/0x110 generic_shutdown_super+0x7c/0x1a0 kill_block_super+0x27/0x50 deactivate_locked_super+0x30/0x90 deactivate_super+0x3c/0x50 cleanup_mnt+0xc2/0x160 __cleanup_mnt+0x12/0x20 task_work_run+0x5e/0xa0 exit_to_user_mode_prepare+0x1bc/0x1c0 syscall_exit_to_user_mode+0x16/0x40 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Reported-by: Pengfei Xu <pengfei.xu@intel.com> Fixes: eb70aa2d8ed9 ("xfs: use for_each_perag_wrap in xfs_filestream_pick_ag") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | | xfs: fix broken logic when detecting mergeable bmap recordsDarrick J. Wong2023-06-051-12/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6bc6c99a944c was a well-intentioned effort to initiate consolidation of adjacent bmbt mapping records by setting the PREEN flag. Consolidation can only happen if the length of the combined record doesn't overflow the 21-bit blockcount field of the bmbt recordset. Unfortunately, the length test is inverted, leading to it triggering on data forks like these: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL 0: [0..16777207]: 76110848..92888055 0 (76110848..92888055) 16777208 1: [16777208..20639743]: 92888056..96750591 0 (92888056..96750591) 3862536 Note that record 0 has a length of 16777208 512b blocks. This corresponds to 2097151 4k fsblocks, which is the maximum. Hence the two records cannot be merged. However, the logic is still wrong even if we change the in-loop comparison, because the scope of our examination isn't broad enough inside the loop to detect mappings like this: 0: [0..9]: 76110838..76110847 0 (76110838..76110847) 10 1: [10..16777217]: 76110848..92888055 0 (76110848..92888055) 16777208 2: [16777218..20639753]: 92888056..96750591 0 (92888056..96750591) 3862536 These three records could be merged into two, but one cannot determine this purely from looking at records 0-1 or 1-2 in isolation. Hoist the mergability detection outside the loop, and base its decision making on whether or not a merged mapping could be expressed in fewer bmbt records. While we're at it, fix the incorrect return type of the iter function. Fixes: 336642f79283 ("xfs: alert the user about data/attr fork mappings that could be merged") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>