summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'for-5.18-rc2-tag' of ↵Linus Torvalds2022-04-148-23/+49
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more code and warning fixes. There's one feature ioctl removal patch slated for 5.18 that did not make it to the main pull request. It's just a one-liner and the ioctl has a v2 that's in use for a long time, no point to postpone it to 5.19. Late update: - remove balance v1 ioctl, superseded by v2 in 2012 Fixes: - add back cgroup attribution for compressed writes - add super block write start/end annotations to asynchronous balance - fix root reference count on an error handling path - in zoned mode, activate zone at the chunk allocation time to avoid ENOSPC due to timing issues - fix delayed allocation accounting for direct IO Warning fixes: - simplify assertion condition in zoned check - remove an unused variable" * tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix btrfs_submit_compressed_write cgroup attribution btrfs: fix root ref counts in error handling in btrfs_get_root_ref btrfs: zoned: activate block group only for extent allocation btrfs: return allocated block group from do_chunk_alloc() btrfs: mark resumed async balance as writing btrfs: remove support of balance v1 ioctl btrfs: release correct delalloc amount in direct IO write path btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups() btrfs: zoned: remove redundant condition in btrfs_run_delalloc_range
| * btrfs: fix btrfs_submit_compressed_write cgroup attributionDennis Zhou2022-04-061-0/+8
| | | | | | | | | | | | | | | | | | | | | | This restores the logic from commit 46bcff2bfc5e ("btrfs: fix compressed write bio blkcg attribution") which added cgroup attribution to btrfs writeback. It also adds back the REQ_CGROUP_PUNT flag for these ios. Fixes: 91507240482e ("btrfs: determine stripe boundary at bio allocation time in btrfs_submit_compressed_write") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix root ref counts in error handling in btrfs_get_root_refJia-Ju Bai2022-04-061-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In btrfs_get_root_ref(), when btrfs_insert_fs_root() fails, btrfs_put_root() can happen for two reasons: - the root already exists in the tree, in that case it returns the reference obtained in btrfs_lookup_fs_root() - another error so the cleanup is done in the fail label Calling btrfs_put_root() unconditionally would lead to double decrement of the root reference possibly freeing it in the second case. Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Fixes: bc44d7c4b2b1 ("btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root") CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: zoned: activate block group only for extent allocationNaohiro Aota2022-04-063-9/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In btrfs_make_block_group(), we activate the allocated block group, expecting that the block group is soon used for allocation. However, the chunk allocation from flush_space() context broke the assumption. There can be a large time gap between the chunk allocation time and the extent allocation time from the chunk. Activating the empty block groups pre-allocated from flush_space() context can exhaust the active zone counter of a device. Once we use all the active zone counts for empty pre-allocated block groups, we cannot activate new block group for the other things: metadata, tree-log, or data relocation block group. That failure results in a fake -ENOSPC. This patch introduces CHUNK_ALLOC_FORCE_FOR_EXTENT to distinguish the chunk allocation from find_free_extent(). Now, the new block group is activated only in that context. Fixes: eb66a010d518 ("btrfs: zoned: activate new block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: return allocated block group from do_chunk_alloc()Naohiro Aota2022-04-061-3/+13
| | | | | | | | | | | | | | | | | | | | | | Return the allocated block group from do_chunk_alloc(). This is a preparation patch for the next patch. CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: mark resumed async balance as writingNaohiro Aota2022-04-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | When btrfs balance is interrupted with umount, the background balance resumes on the next mount. There is a potential deadlock with FS freezing here like as described in commit 26559780b953 ("btrfs: zoned: mark relocation as writing"). Mark the process as sb_writing to avoid it. Reviewed-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: remove support of balance v1 ioctlNikolay Borisov2022-04-061-2/+0
| | | | | | | | | | | | | | | | | | | | It was scheduled for removal in kernel v5.18 commit 6c405b24097c ("btrfs: deprecate BTRFS_IOC_BALANCE ioctl") thus its time has come. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: release correct delalloc amount in direct IO write pathNaohiro Aota2022-04-061-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running generic/406 causes the following WARNING in btrfs_destroy_inode() which tells there are outstanding extents left. In btrfs_get_blocks_direct_write(), we reserve a temporary outstanding extents with btrfs_delalloc_reserve_metadata() (or indirectly from btrfs_delalloc_reserve_space(()). We then release the outstanding extents with btrfs_delalloc_release_extents(). However, the "len" can be modified in the COW case, which releases fewer outstanding extents than expected. Fix it by calling btrfs_delalloc_release_extents() for the original length. To reproduce the warning, the filesystem should be 1 GiB. It's triggering a short-write, due to not being able to allocate a large extent and instead allocating a smaller one. WARNING: CPU: 0 PID: 757 at fs/btrfs/inode.c:8848 btrfs_destroy_inode+0x1e6/0x210 [btrfs] Modules linked in: btrfs blake2b_generic xor lzo_compress lzo_decompress raid6_pq zstd zstd_decompress zstd_compress xxhash zram zsmalloc CPU: 0 PID: 757 Comm: umount Not tainted 5.17.0-rc8+ #101 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014 RIP: 0010:btrfs_destroy_inode+0x1e6/0x210 [btrfs] RSP: 0018:ffffc9000327bda8 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff888100548b78 RCX: 0000000000000000 RDX: 0000000000026900 RSI: 0000000000000000 RDI: ffff888100548b78 RBP: ffff888100548940 R08: 0000000000000000 R09: ffff88810b48aba8 R10: 0000000000000001 R11: ffff8881004eb240 R12: ffff88810b48a800 R13: ffff88810b48ec08 R14: ffff88810b48ed00 R15: ffff888100490c68 FS: 00007f8549ea0b80(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f854a09e733 CR3: 000000010a2e9003 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> destroy_inode+0x33/0x70 dispose_list+0x43/0x60 evict_inodes+0x161/0x1b0 generic_shutdown_super+0x2d/0x110 kill_anon_super+0xf/0x20 btrfs_kill_super+0xd/0x20 [btrfs] deactivate_locked_super+0x27/0x90 cleanup_mnt+0x12c/0x180 task_work_run+0x54/0x80 exit_to_user_mode_prepare+0x152/0x160 syscall_exit_to_user_mode+0x12/0x30 do_syscall_64+0x42/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f854a000fb7 Fixes: f0bfa76a11e9 ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range") CC: stable@vger.kernel.org # 5.17 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups()Nathan Chancellor2022-04-061-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clang's version of -Wunused-but-set-variable recently gained support for unary operations, which reveals two unused variables: fs/btrfs/block-group.c:2949:6: error: variable 'num_started' set but not used [-Werror,-Wunused-but-set-variable] int num_started = 0; ^ fs/btrfs/block-group.c:3116:6: error: variable 'num_started' set but not used [-Werror,-Wunused-but-set-variable] int num_started = 0; ^ 2 errors generated. These variables appear to be unused from their introduction, so just remove them to silence the warnings. Fixes: c9dc4c657850 ("Btrfs: two stage dirty block group writeout") Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") CC: stable@vger.kernel.org # 5.4+ Link: https://github.com/ClangBuiltLinux/linux/issues/1614 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: zoned: remove redundant condition in btrfs_run_delalloc_rangeHaowen Bai2022-04-061-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The logic !A || A && B is equivalent to !A || B. so we can make code clear. Note: though it's preferred to be in the more human readable form, there have been repeated reports and patches as the expression is detected by tools so apply it to reduce the load. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Haowen Bai <baihaowen@meizu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag 'for-5.18-rc1-tag' of ↵Linus Torvalds2022-04-056-55/+81
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - prevent deleting subvolume with active swapfile - fix qgroup reserve limit calculation overflow - remove device count in superblock and its item in one transaction so they cant't get out of sync - skip defragmenting an isolated sector, this could cause some extra IO - unify handling of mtime/permissions in hole punch with fallocate - zoned mode fixes: - remove assert checking for only single mode, we have the DUP mode implemented - fix potential lockdep warning while traversing devices when checking for zone activation * tag 'for-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: prevent subvol with swapfile from being deleted btrfs: do not warn for free space inode in cow_file_range btrfs: avoid defragging extents whose next extents are not targets btrfs: fix fallocate to use file_modified to update permissions consistently btrfs: remove device item and update super block in the same transaction btrfs: fix qgroup reserve overflow the qgroup limit btrfs: zoned: remove left over ASSERT checking for single profile btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone
| * btrfs: prevent subvol with swapfile from being deletedKaiwen Hu2022-03-241-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A subvolume with an active swapfile must not be deleted otherwise it would not be possible to deactivate it. After the subvolume is deleted, we cannot swapoff the swapfile in this deleted subvolume because the path is unreachable. The swapfile is still active and holding references, the filesystem cannot be unmounted. The test looks like this: mkfs.btrfs -f $dev > /dev/null mount $dev $mnt btrfs sub create $mnt/subvol touch $mnt/subvol/swapfile chmod 600 $mnt/subvol/swapfile chattr +C $mnt/subvol/swapfile dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096 mkswap $mnt/subvol/swapfile swapon $mnt/subvol/swapfile btrfs sub delete $mnt/subvol swapoff $mnt/subvol/swapfile # failed: No such file or directory swapoff --all unmount $mnt # target is busy. To prevent above issue, we simply check that whether the subvolume contains any active swapfile, and stop the deleting process. This behavior is like snapshot ioctl dealing with a swapfile. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Kaiwen Hu <kevinhu@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: do not warn for free space inode in cow_file_rangeJosef Bacik2022-03-241-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a long time leftover from when I originally added the free space inode, the point was to catch cases where we weren't honoring the NOCOW flag. However there exists a race with relocation, if we allocate our free space inode in a block group that is about to be relocated, we could trigger the COW path before the relocation has the opportunity to find the extents and delete the free space cache. In production where we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around 5k times in a 2 week period, so not super common but enough that it's at the top of our metrics. We're properly handling the error here, and with us phasing out v1 space cache anyway just drop the WARN_ON_ONCE. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: avoid defragging extents whose next extents are not targetsQu Wenruo2022-03-241-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] There is a report that autodefrag is defragging single sector, which is completely waste of IO, and no help for defragging: btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096 [CAUSE] In defrag_collect_targets(), we check if the current range (A) can be merged with next one (B). If mergeable, we will add range A into target for defrag. However there is a catch for autodefrag, when checking mergeability against range B, we intentionally pass 0 as @newer_than, hoping to get a higher chance to merge with the next extent. But in the next iteration, range B will looked up by defrag_lookup_extent(), with non-zero @newer_than. And if range B is not really newer, it will rejected directly, causing only range A being defragged, while we expect to defrag both range A and B. [FIX] Since the root cause is the difference in check condition of defrag_check_next_extent() and defrag_collect_targets(), we fix it by: 1. Pass @newer_than to defrag_check_next_extent() 2. Pass @extent_thresh to defrag_check_next_extent() This makes the check between defrag_collect_targets() and defrag_check_next_extent() more consistent. While there is still some minor difference, the remaining checks are focus on runtime flags like writeback/delalloc, which are mostly transient and safe to be checked only in defrag_collect_targets(). Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856 CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix fallocate to use file_modified to update permissions consistentlyDarrick J. Wong2022-03-241-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the initial introduction of (posix) fallocate back at the turn of the century, it has been possible to use this syscall to change the user-visible contents of files. This can happen by extending the file size during a preallocation, or through any of the newer modes (punch, zero range). Because the call can be used to change file contents, we should treat it like we do any other modification to a file -- update the mtime, and drop set[ug]id privileges/capabilities. The VFS function file_modified() does all this for us if pass it a locked inode, so let's make fallocate drop permissions correctly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: remove device item and update super block in the same transactionQu Wenruo2022-03-241-37/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] There is a report that a btrfs has a bad super block num devices. This makes btrfs to reject the fs completely. BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here BTRFS error (device sdd3): failed to read chunk tree: -22 BTRFS error (device sdd3): open_ctree failed [CAUSE] During btrfs device removal, chunk tree and super block num devs are updated in two different transactions: btrfs_rm_device() |- btrfs_rm_dev_item(device) | |- trans = btrfs_start_transaction() | | Now we got transaction X | | | |- btrfs_del_item() | | Now device item is removed from chunk tree | | | |- btrfs_commit_transaction() | Transaction X got committed, super num devs untouched, | but device item removed from chunk tree. | (AKA, super num devs is already incorrect) | |- cur_devices->num_devices--; |- cur_devices->total_devices--; |- btrfs_set_super_num_devices() All those operations are not in transaction X, thus it will only be written back to disk in next transaction. So after the transaction X in btrfs_rm_dev_item() committed, but before transaction X+1 (which can be minutes away), a power loss happen, then we got the super num mismatch. [FIX] Instead of starting and committing a transaction inside btrfs_rm_dev_item(), start a transaction in side btrfs_rm_device() and pass it to btrfs_rm_dev_item(). And only commit the transaction after everything is done. Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix qgroup reserve overflow the qgroup limitEthan Lien2022-03-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We use extent_changeset->bytes_changed in qgroup_reserve_data() to record how many bytes we set for EXTENT_QGROUP_RESERVED state. Currently the bytes_changed is set as "unsigned int", and it will overflow if we try to fallocate a range larger than 4GiB. The result is we reserve less bytes and eventually break the qgroup limit. Unlike regular buffered/direct write, which we use one changeset for each ordered extent, which can never be larger than 256M. For fallocate, we use one changeset for the whole range, thus it no longer respects the 256M per extent limit, and caused the problem. The following example test script reproduces the problem: $ cat qgroup-overflow.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT # Set qgroup limit to 2GiB. btrfs quota enable $MNT btrfs qgroup limit 2G $MNT # Try to fallocate a 3GiB file. This should fail. echo echo "Try to fallocate a 3GiB file..." fallocate -l 3G $MNT/3G.file # Try to fallocate a 5GiB file. echo echo "Try to fallocate a 5GiB file..." fallocate -l 5G $MNT/5G.file # See we break the qgroup limit. echo sync btrfs qgroup show -r $MNT umount $MNT When running the test: $ ./qgroup-overflow.sh (...) Try to fallocate a 3GiB file... fallocate: fallocate failed: Disk quota exceeded Try to fallocate a 5GiB file... qgroupid         rfer         excl     max_rfer --------         ----         ----     -------- 0/5           5.00GiB      5.00GiB      2.00GiB Since we have no control of how bytes_changed is used, it's better to set it to u64. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: zoned: remove left over ASSERT checking for single profileJohannes Thumshirn2022-03-231-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups") we started allowing DUP on metadata block groups, so the ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are no longer valid and in fact even harmful. Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups") CC: stable@vger.kernel.org # 5.17 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zoneJohannes Thumshirn2022-03-231-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_can_activate_zone() can be called with the device_list_mutex already held, which will lead to a deadlock: insert_dev_extents() // Takes device_list_mutex `-> insert_dev_extent() `-> btrfs_insert_empty_item() `-> btrfs_insert_empty_items() `-> btrfs_search_slot() `-> btrfs_cow_block() `-> __btrfs_cow_block() `-> btrfs_alloc_tree_block() `-> btrfs_reserve_extent() `-> find_free_extent() `-> find_free_extent_update_loop() `-> can_allocate_chunk() `-> btrfs_can_activate_zone() // Takes device_list_mutex again Instead of using the RCU on fs_devices->device_list we can use fs_devices->alloc_list, protected by the chunk_mutex to traverse the list of active devices. We are in the chunk allocation thread. The newer chunk allocation happens from the devices in the fs_device->alloc_list protected by the chunk_mutex. btrfs_create_chunk() lockdep_assert_held(&info->chunk_mutex); gather_device_info list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) Also, a device that reappears after the mount won't join the alloc_list yet and, it will be in the dev_list, which we don't want to consider in the context of the chunk alloc. [15.166572] WARNING: possible recursive locking detected [15.167117] 5.17.0-rc6-dennis #79 Not tainted [15.167487] -------------------------------------------- [15.167733] kworker/u8:3/146 is trying to acquire lock: [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs] [15.167733] [15.167733] but task is already holding lock: [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs] [15.167733] [15.167733] other info that might help us debug this: [15.167733] Possible unsafe locking scenario: [15.167733] [15.171834] CPU0 [15.171834] ---- [15.171834] lock(&fs_devs->device_list_mutex); [15.171834] lock(&fs_devs->device_list_mutex); [15.171834] [15.171834] *** DEADLOCK *** [15.171834] [15.171834] May be due to missing lock nesting notation [15.171834] [15.171834] 5 locks held by kworker/u8:3/146: [15.171834] #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0 [15.171834] #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0 [15.176244] #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs] [15.176244] #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs] [15.176244] #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs] [15.179641] [15.179641] stack backtrace: [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79 [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014 [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] [15.179641] Call Trace: [15.179641] <TASK> [15.179641] dump_stack_lvl+0x45/0x59 [15.179641] __lock_acquire.cold+0x217/0x2b2 [15.179641] lock_acquire+0xbf/0x2b0 [15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] __mutex_lock+0x8e/0x970 [15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] ? lock_is_held_type+0xd7/0x130 [15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] ? _raw_spin_unlock+0x24/0x40 [15.183838] ? btrfs_get_alloc_profile+0x106/0x230 [btrfs] [15.187601] btrfs_reserve_extent+0x131/0x260 [btrfs] [15.187601] btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs] [15.187601] __btrfs_cow_block+0x138/0x600 [btrfs] [15.187601] btrfs_cow_block+0x10f/0x230 [btrfs] [15.187601] btrfs_search_slot+0x55f/0xbc0 [btrfs] [15.187601] ? lock_is_held_type+0xd7/0x130 [15.187601] btrfs_insert_empty_items+0x2d/0x60 [btrfs] [15.187601] btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs] [15.187601] __btrfs_end_transaction+0x36/0x2a0 [btrfs] [15.192037] flush_space+0x374/0x600 [btrfs] [15.192037] ? find_held_lock+0x2b/0x80 [15.192037] ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs] [15.192037] ? lock_release+0x131/0x2b0 [15.192037] btrfs_async_reclaim_data_space+0x70/0x180 [btrfs] [15.192037] process_one_work+0x24c/0x5a0 [15.192037] worker_thread+0x4a/0x3d0 Fixes: a85f05e59bc1 ("btrfs: zoned: avoid chunk allocation if active block group has enough space") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: Remove a use of PAGE_SIZE in btrfs_invalidate_folio()Matthew Wilcox (Oracle)2022-04-011-1/+1
| | | | | | | | | | | | | | | | | | | | While btrfs doesn't use large folios yet, this should have been changed as part of the conversion from invalidatepage to invalidate_folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk>
* | fs: Remove ->readpages address space operationMatthew Wilcox (Oracle)2022-04-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | All filesystems have now been converted to use ->readahead, so remove the ->readpages operation and fix all the comments that used to refer to it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds2022-03-261-1/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
| * | block: remove the per-bio/request write hintChristoph Hellwig2022-03-071-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | Merge branch 'for-5.18/block' into for-5.18/write-streamsJens Axboe2022-03-073-10/+7
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * for-5.18/block: (96 commits) block: remove bio_devname ext4: stop using bio_devname raid5-ppl: stop using bio_devname raid1: stop using bio_devname md-multipath: stop using bio_devname dm-integrity: stop using bio_devname dm-crypt: stop using bio_devname pktcdvd: remove a pointless debug check in pkt_submit_bio block: remove handle_bad_sector block: fix and cleanup bio_check_ro bfq: fix use-after-free in bfq_dispatch_request blk-crypto: show crypto capabilities in sysfs block: don't delete queue kobject before its children block: simplify calling convention of elv_unregister_queue() block: remove redundant semicolon block: default BLOCK_LEGACY_AUTOLOAD to y block: update io_ticks when io hang block, bfq: don't move oom_bfqq block, bfq: avoid moving bfqq to it's parent bfqg block, bfq: cleanup bfq_bfqq_to_bfqg() ...
* | \ \ Merge tag 'kbuild-gnu11-v5.18' of ↵Linus Torvalds2022-03-251-0/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild update for C11 language base from Masahiro Yamada: "Kbuild -std=gnu11 updates for v5.18 Linus pointed out the benefits of C99 some years ago, especially variable declarations in loops [1]. At that time, we were not ready for the migration due to old compilers. Recently, Jakob Koschel reported a bug in list_for_each_entry(), which leaks the invalid pointer out of the loop [2]. In the discussion, we agreed that the time had come. Now that GCC 5.1 is the minimum compiler version, there is nothing to prevent us from going to -std=gnu99, or even straight to -std=gnu11. Discussions for a better list iterator implementation are ongoing, but this patch set must land first" [1] https://lore.kernel.org/all/CAHk-=wgr12JkKmRd21qh-se-_Gs69kbPgR9x4C+Es-yJV2GLkA@mail.gmail.com/ [2] https://lore.kernel.org/lkml/86C4CE7D-6D93-456B-AA82-F8ADEACA40B7@gmail.com/ * tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: Kbuild: use -std=gnu11 for KBUILD_USERCFLAGS Kbuild: move to -std=gnu11 Kbuild: use -Wdeclaration-after-statement Kbuild: add -Wno-shift-negative-value where -Wextra is used
| * | | | Kbuild: add -Wno-shift-negative-value where -Wextra is usedArnd Bergmann2022-03-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As a preparation for moving to -std=gnu11, turn off the -Wshift-negative-value option. This warning is enabled by gcc when building with -Wextra for c99 or higher, but not for c89. Since the kernel already relies on well-defined overflow behavior, the warning is not helpful and can simply be disabled in all locations that use -Wextra. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v13.0.0 (x86-64) Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
* | | | | Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds2022-03-225-86/+87
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
| * | | | | btrfs: Convert extent_range_redirty_for_io() to use foliosMatthew Wilcox (Oracle)2022-03-151-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes a call to __set_page_dirty_nobuffers(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
| * | | | | fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folioMatthew Wilcox (Oracle)2022-03-151-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These filesystems use __set_page_dirty_nobuffers() either directly or with a very thin wrapper; convert them en masse. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
| * | | | | btrfs: Convert from set_page_dirty to dirty_folioMatthew Wilcox (Oracle)2022-03-151-12/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimise the non-DEBUG case to just call filemap_dirty_folio directly. The DEBUG case doesn't actually compile, but convert it to dirty_folio anyway. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
| * | | | | btrfs: Convert from invalidatepage to invalidate_folioMatthew Wilcox (Oracle)2022-03-155-59/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A lot of the underlying infrastructure in btrfs needs to be switched over to folios, but this at least documents that invalidatepage can't be passed a tail page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
| * | | | | btrfs: Use folio_invalidate()Matthew Wilcox (Oracle)2022-03-151-2/+3
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of calling ->invalidatepage directly, use folio_invalidate(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
* | | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2022-03-221-1/+1
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge updates from Andrew Morton: - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs - Most the MM patches which precede the patches in Willy's tree: kasan, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb, userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp, cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap, zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits) mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release() Docs/ABI/testing: add DAMON sysfs interface ABI document Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface selftests/damon: add a test for DAMON sysfs interface mm/damon/sysfs: support DAMOS stats mm/damon/sysfs: support DAMOS watermarks mm/damon/sysfs: support schemes prioritization mm/damon/sysfs: support DAMOS quotas mm/damon/sysfs: support DAMON-based Operation Schemes mm/damon/sysfs: support the physical address space monitoring mm/damon/sysfs: link DAMON for virtual address spaces monitoring mm/damon: implement a minimal stub for sysfs-based DAMON interface mm/damon/core: add number of each enum type values mm/damon/core: allow non-exclusive DAMON start/stop Docs/damon: update outdated term 'regions update interval' Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling Docs/vm/damon: call low level monitoring primitives the operations mm/damon: remove unnecessary CONFIG_DAMON option mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() mm/damon/dbgfs-test: fix is_target_id() change ...
| * | | | | fs: allocate inode by using alloc_inode_sb()Muchun Song2022-03-221-1/+1
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4] Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge tag 'for-5.18-tag' of ↵Linus Torvalds2022-03-2242-1303/+2948
|\ \ \ \ \ | | |_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This contains feature updates, performance improvements, preparatory and core work and some related VFS updates: Features: - encoded read/write ioctls, allows user space to read or write raw data directly to extents (now compressed, encrypted in the future), will be used by send/receive v2 where it saves processing time - zoned mode now works with metadata DUP (the mkfs.btrfs default) - error message header updates: - print error state: transaction abort, other error, log tree errors - print transient filesystem state: remount, device replace, ignored checksum verifications - tree-checker: verify the transaction id of the to-be-written dirty extent buffer Performance improvements for fsync: - directory logging speedups (up to -90% run time) - avoid logging all directory changes during renames (up to -60% run time) - avoid inode logging during rename and link when possible (up to -60% run time) - prepare extents to be logged before locking a log tree path (throughput +7%) - stop copying old file extents when doing a full fsync() - improved logging of old extents after truncate Core, fixes: - improved stale device identification by dev_t and not just path (for devices that are behind other layers like device mapper) - continued extent tree v2 preparatory work - disable features that won't work yet - add wrappers and abstractions for new tree roots - improved error handling - add super block write annotations around background block group reclaim - fix device scanning messages potentially accessing stale pointer - cleanups and refactoring VFS: - allow reflinks/deduplication from two different mounts of the same filesystem - export and add helpers for read/write range verification, for the encoded ioctls" * tag 'for-5.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (98 commits) btrfs: zoned: put block group after final usage btrfs: don't access possibly stale fs_info data in device_list_add btrfs: add lockdep_assert_held to need_preemptive_reclaim btrfs: verify the tranisd of the to-be-written dirty extent buffer btrfs: unify the error handling of btrfs_read_buffer() btrfs: unify the error handling pattern for read_tree_block() btrfs: factor out do_free_extent_accounting helper btrfs: remove last_ref from the extent freeing code btrfs: add a alloc_reserved_extent helper btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block btrfs: add and use helper for unlinking inode during log replay btrfs: extend locking to all space_info members accesses btrfs: zoned: mark relocation as writing fs: allow cross-vfsmount reflink/dedupe btrfs: remove the cross file system checks from remap btrfs: pass btrfs_fs_info to btrfs_recover_relocation btrfs: pass btrfs_fs_info for deleting snapshots and cleaner btrfs: add filesystems state details to error messages btrfs: deal with unexpected extent type during reflinking btrfs: fix unexpected error path when reflinking an inline extent ...
| * | | | btrfs: zoned: put block group after final usageNikolay Borisov2022-03-141-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's counter-intuitive (and wrong) to put the block group _before_ the final usage in submit_eb_page. Fix it by re-ordering the call to btrfs_put_block_group after its final reference. Also fix a minor typo in 'implies' Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: don't access possibly stale fs_info data in device_list_addDongliang Mu2022-03-141-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot reported a possible use-after-free in printing information in device_list_add. Very similar with the bug fixed by commit 0697d9a61099 ("btrfs: don't access possibly stale fs_info data for printing duplicate device"), but this time the use occurs in btrfs_info_in_rcu. Call Trace: kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 btrfs_printk+0x395/0x425 fs/btrfs/super.c:244 device_list_add.cold+0xd7/0x2ed fs/btrfs/volumes.c:957 btrfs_scan_one_device+0x4c7/0x5c0 fs/btrfs/volumes.c:1387 btrfs_control_ioctl+0x12a/0x2d0 fs/btrfs/super.c:2409 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fix this by modifying device->fs_info to NULL too. Reported-and-tested-by: syzbot+82650a4e0ed38f218363@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: add lockdep_assert_held to need_preemptive_reclaimNiels Dossche2022-03-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a previous patch ("btrfs: extend locking to all space_info members accesses") the locking for the space_info members was extended in btrfs_preempt_reclaim_metadata_space because not all the member accesses that needed locks were actually locked (bytes_pinned et al). It was then suggested to also add a call to lockdep_assert_held to need_preemptive_reclaim. This function also works with space_info members. As of now, it has only two call sites which both hold the lock. Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: verify the tranisd of the to-be-written dirty extent bufferQu Wenruo2022-03-141-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] There is a bug report that a bitflip in the transid part of an extent buffer makes btrfs to reject certain tree blocks: BTRFS error (device dm-0): parent transid verify failed on 1382301696 wanted 262166 found 22 [CAUSE] Note the failed transid check, hex(262166) = 0x40016, while hex(22) = 0x16. It's an obvious bitflip. Furthermore, the reporter also confirmed the bitflip is from the hardware, so it's a real hardware caused bitflip, and such problem can not be detected by the existing tree-checker framework. As tree-checker can only verify the content inside one tree block, while generation of a tree block can only be verified against its parent. So such problem remain undetected. [FIX] Although tree-checker can not verify it at write-time, we still have a quick (but not the most accurate) way to catch such obvious corruption. Function csum_one_extent_buffer() is called before we submit metadata write. Thus it means, all the extent buffer passed in should be dirty tree blocks, and should be newer than last committed transaction. Using that we can catch the above bitflip. Although it's not a perfect solution, as if the corrupted generation is higher than the correct value, we have no way to catch it at all. Reported-by: Christoph Anton Mitterer <calestyo@scientia.org> Link: https://lore.kernel.org/linux-btrfs/2dfcbc130c55cc6fd067b93752e90bd2b079baca.camel@scientia.org/ CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@sus,ree.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: unify the error handling of btrfs_read_buffer()Qu Wenruo2022-03-141-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is one oddball error handling of btrfs_read_buffer(): ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key); if (!ret) { *eb_ret = tmp; return 0; } free_extent_buffer(tmp); btrfs_release_path(p); return -EIO; While all other call sites check the error first. Unify the behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: unify the error handling pattern for read_tree_block()Qu Wenruo2022-03-145-27/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We had an error handling pattern for read_tree_block() like this: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here * Normally ended up with return or goto out. */ } else if (!extent_buffer_uptodate(eb)) { /* * Different error handling here * Normally also ended up with return or goto out; */ } This is fine, but if we want to add extra check for each read_tree_block(), the existing if-else-if is not that expandable and will take reader some seconds to figure out there is no extra branch. Here we change it to a more common way, without the extra else: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here */ return eb or goto out; } if (!extent_buffer_uptodate(eb)) { /* * Different error handling here */ return eb or goto out; } This also removes some oddball call sites which uses some creative way to check error. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: factor out do_free_extent_accounting helperJosef Bacik2022-03-141-22/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __btrfs_free_extent() does all of the hard work of updating the extent ref items, and then at the end if we dropped the extent completely it does the cleanup accounting work. We're going to only want to do that work for metadata with extent tree v2, so extract this bit into its own helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: remove last_ref from the extent freeing codeJosef Bacik2022-03-141-23/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a remnant of the work I did for qgroups a long time ago to only run for a block when we had dropped the last ref. We haven't done that for years, but the code remains. Drop this remnant. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: add a alloc_reserved_extent helperJosef Bacik2022-03-141-32/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We duplicate this logic for both data and metadata, at this point we've already done our type specific extent root operations, this is just doing the accounting and removing the space from the free space tree. Extract this common logic out into a helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: remove BUG_ON(ret) in alloc_reserved_tree_blockJosef Bacik2022-03-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Switch this to an ASSERT() and return the error in the normal case. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: add and use helper for unlinking inode during log replayFilipe Manana2022-03-141-48/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During log replay there is this pattern of running delayed items after every inode unlink. To avoid repeating this several times, move the logic into an helper function and use it instead of calling btrfs_unlink_inode() followed by btrfs_run_delayed_items(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: extend locking to all space_info members accessesNiels Dossche2022-03-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bytes_pinned is always accessed under space_info->lock, except in btrfs_preempt_reclaim_metadata_space, however the other members are accessed under that lock. The reserved member of the rsv's are also partially accessed under a lock and partially not. Move all these accesses into the same lock to ensure consistency. This could potentially race and lead to a flush instead of a commit but it's not a big problem as it's only for preemptive flush. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Niels Dossche <niels.dossche@ugent.be> Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: zoned: mark relocation as writingNaohiro Aota2022-03-142-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a hung_task issue with running generic/068 on an SMR device. The hang occurs while a process is trying to thaw the filesystem. The process is trying to take sb->s_umount to thaw the FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is waiting for an ordered extent to finish. However, as the FS is frozen, the ordered extents never finish. Having an ordered extent while the FS is frozen is the root cause of the hang. The ordered extent is initiated from btrfs_relocate_chunk() which is called from btrfs_reclaim_bgs_work(). This commit adds sb_*_write() around btrfs_relocate_chunk() call site. For the usual "btrfs balance" command, we already call it with mnt_want_file() in btrfs_ioctl_balance(). Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: stable@vger.kernel.org # 5.13+ Link: https://github.com/naota/linux/issues/56 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: remove the cross file system checks from remapJosef Bacik2022-03-141-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sb check is already done in do_clone_file_range, and the mnt check (which will hopefully go away in a subsequent patch) is done in ioctl_file_clone(). Remove the check in our code and put an ASSERT() to make sure it doesn't change underneath us. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | btrfs: pass btrfs_fs_info to btrfs_recover_relocationJosef Bacik2022-03-143-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't need a root here, we just need the btrfs_fs_info, we can just get the specific roots we need from fs_info. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>