Commit message | Author | Age | Files | Lines
* btrfs: print-tree: parent bytenr must be aligned to sector size | Anastasia Belova | 2023-05-09 | 1 | -3/+3

Use sectorsize instead of nodesize in the alignment check in print_extent_item(). The comment above the check already says so, and a similar check is done elsewhere in the function.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: ea57788eb76d ("btrfs: require only sector size alignment for parent eb bytenr")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anastasia Belova <abelova@astralinux.ru>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
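
The alignment rule in the commit above can be illustrated with a small userspace sketch. This is not the kernel code: `IS_ALIGNED_POW2` and `parent_bytenr_is_valid` are hypothetical names, and the sizes in the usage note (4096-byte sectors, 16384-byte nodes) are assumed example values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for an alignment macro; only valid when a is a power of two. */
#define IS_ALIGNED_POW2(x, a) (((x) & ((a) - 1)) == 0)

/*
 * Hypothetical helper: reports whether a parent extent buffer bytenr
 * meets the sector size alignment requirement described in the commit.
 */
static bool parent_bytenr_is_valid(uint64_t bytenr, uint32_t sectorsize)
{
	/* The mask trick above needs a non-zero power-of-two size. */
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);
	return IS_ALIGNED_POW2(bytenr, (uint64_t)sectorsize);
}
```

With these example sizes, a bytenr of 20480 passes the sector size check even though it is not a multiple of 16384, which is the case a nodesize-based check would wrongly flag.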
* btrfs: don't free qgroup space unless specified | Josef Bacik | 2023-05-03 | 1 | -1/+2

Boris noticed in his simple quotas testing that he was getting a leak with Sweet Tea's change to subvol create that stopped doing a transaction commit. This was just a side effect of that change.

In the delayed inode code we have an optimization that will free extra reservations if we think we can pack a dir item into an already modified leaf. Previously this wouldn't be triggered in the subvolume create case because we'd commit the transaction; it was still possible but much harder to trigger. It could actually be triggered if we did a mkdir && subvol create with qgroups enabled.

This occurs because in btrfs_insert_delayed_dir_index(), which gets called when we're adding the dir item, we do the following:

  btrfs_block_rsv_release(fs_info, trans->block_rsv, bytes, NULL);

if we're able to skip reserving space.

The problem here is that trans->block_rsv points at the temporary block rsv for the subvolume create, which has qgroup reservations in the block rsv.

This is a problem because btrfs_block_rsv_release() will do the following:

  if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
	  qgroup_to_release = block_rsv->qgroup_rsv_reserved -
		  block_rsv->qgroup_rsv_size;
	  block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
  }

The temporary block rsv just has ->qgroup_rsv_reserved set, ->qgroup_rsv_size == 0. The optimization in btrfs_insert_delayed_dir_index() sets ->qgroup_rsv_reserved = 0. Then later on when we call btrfs_subvolume_release_metadata() which has

  btrfs_block_rsv_release(fs_info, rsv, (u64)-1, &qgroup_to_release);
  btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release);

qgroup_to_release is set to 0, and we do not convert the reserved metadata space.

The problem here is that the block rsv code has been unconditionally messing with ->qgroup_rsv_reserved, because the main place this is used is delalloc, and any time we call btrfs_block_rsv_release() we do it with qgroup_to_release set, and thus do the proper accounting.

The subvolume code is the only other code that uses the qgroup reservation stuff, but it's intermingled with the above optimization, and thus was getting its reservation freed out from underneath it and thus leaking the reserved space.

The solution is to simply not mess with the qgroup reservations if we don't have qgroup_to_release set. This works with the existing code as anything that messes with the delalloc reservations always has qgroup_to_release set. This fixes the leak that Boris was observing.

Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
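
The rule described above (leave the qgroup reservation alone unless the caller asks for the released amount) can be sketched as a userspace model. This is an illustration under assumptions, not the kernel implementation: `struct block_rsv_model` and `release_qgroup_part` are hypothetical names, and only the two fields quoted in the commit message are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model holding only the two qgroup fields quoted in the message. */
struct block_rsv_model {
	uint64_t qgroup_rsv_reserved;
	uint64_t qgroup_rsv_size;
};

/*
 * Model of the qgroup part of a block rsv release. When the caller passes
 * NULL, the reservation is left untouched so that a later caller can still
 * see it and convert it.
 */
static void release_qgroup_part(struct block_rsv_model *rsv,
				uint64_t *qgroup_to_release)
{
	assert(rsv != NULL);

	if (qgroup_to_release == NULL)
		return;

	*qgroup_to_release = 0;
	if (rsv->qgroup_rsv_reserved >= rsv->qgroup_rsv_size) {
		*qgroup_to_release = rsv->qgroup_rsv_reserved -
				     rsv->qgroup_rsv_size;
		rsv->qgroup_rsv_reserved = rsv->qgroup_rsv_size;
	}
}
```

In this model, a release with a NULL pointer (the dir index optimization case) no longer zeroes the reservation, so the later release that does pass a pointer reports the full amount.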
* btrfs: fix encoded write i_size corruption with no-holes | Boris Burkov | 2023-05-02 | 1 | -2/+3

We have observed a btrfs filesystem corruption on workloads using no-holes and encoded writes via send stream v2. The symptom is that a file appears to be truncated to the end of its last aligned extent, even though the final unaligned extent and even the file extent and otherwise correctly updated inode item have been written.

So if we were writing out a 1MiB+X file via 8 128K extents and one extent of length X, i_size would be set to 1MiB, but the ninth extent, nbytes, etc. would all appear correct otherwise.

The source of the race is a narrow (one line of code) window in which a no-holes fs has read in an updated i_size, but has not yet set a shared disk_i_size variable to write. Therefore, if two ordered extents run in parallel (par for the course for receive workloads), the following sequence can play out: (following "threads" a bit loosely, since there are callbacks involved for endio but extra threads aren't needed to cause the issue)

  ENC-WR1 (second to last)                 ENC-WR2 (last)
  -------                                  -------
  btrfs_do_encoded_write
    set i_size = 1M
    submit bio B1 ending at 1M
  endio B1
  btrfs_inode_safe_disk_i_size_write
    local i_size = 1M
    falls off a cliff for some reason
                                           btrfs_do_encoded_write
                                             set i_size = 1M+X
                                             submit bio B2 ending at 1M+X
                                           endio B2
                                           btrfs_inode_safe_disk_i_size_write
                                             local i_size = 1M+X
                                             disk_i_size = 1M+X
    disk_i_size = 1M
    btrfs_delayed_update_inode
                                           btrfs_delayed_update_inode

And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and writes respect i_size and present a corrupted file missing its last extents.

Fix this by holding the inode lock in the no-holes case so that a thread can't sneak in a write to disk_i_size that gets overwritten with an out of date i_size.

Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: fix wrong use of bitops API in btrfs_ensure_empty_zones | Naohiro Aota | 2023-04-28 | 1 | -3/+3

find_next_bit and find_next_zero_bit take @size as the second parameter and @offset as the third parameter. They are passed in the opposite order in btrfs_ensure_empty_zones(). Thanks to the later loop, it never failed to detect the empty zones. Fix them and (maybe) return the result a bit faster.

Note: the naming is a bit confusing, size has two meanings here, bitmap and our range size.

Fixes: 1cd6121f2a38 ("btrfs: zoned: implement zoned chunk allocator")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
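
To make the argument-order problem above concrete, here is a userspace model of a bit search that takes its parameters in the same order as the kernel helpers: bitmap, then size, then offset. The `_model` functions are hypothetical stand-ins written for this illustration, not the kernel implementations.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Returns 1 if bit nr is set in the bitmap, 0 otherwise. */
static int test_bit_model(const unsigned long *addr, unsigned long nr)
{
	return (int)((addr[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL);
}

/*
 * Parameter order mirrors the helper discussed above: (addr, size, offset).
 * Returns the index of the first zero bit in [offset, size), or size if
 * there is none.
 */
static unsigned long find_next_zero_bit_model(const unsigned long *addr,
					      unsigned long size,
					      unsigned long offset)
{
	assert(addr != NULL);

	for (unsigned long i = offset; i < size; i++) {
		if (!test_bit_model(addr, i))
			return i;
	}
	return size;
}
```

With a bitmap whose bits 0 to 3 are set and a range of begin = 2, end = 8, the correct call `find_next_zero_bit_model(map, 8, 2)` finds bit 4. The swapped call `find_next_zero_bit_model(map, 2, 8)` starts past its own size and immediately returns 2, that is, "nothing found", which matches the commit's remark that only the later loop caught the real state.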
* btrfs: properly reject clear_cache and v1 cache for block-group-tree | Qu Wenruo | 2023-04-28 | 1 | -1/+6

[BUG]
With the block-group-tree feature enabled, mounting with clear_cache causes the following transaction abort at mount or remount:

  BTRFS info (device dm-4): force clearing of disk cache
  BTRFS info (device dm-4): using free space tree
  BTRFS info (device dm-4): auto enabling async discard
  BTRFS info (device dm-4): clearing free space tree
  BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
  BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
  BTRFS error (device dm-4): block-group-tree feature requires fres-space-tree and no-holes
  BTRFS error (device dm-4): super block corruption detected before writing it to disk
  BTRFS: error (device dm-4) in write_all_supers:4288: errno=-117 Filesystem corrupted (unexpected superblock corruption detected)
  BTRFS warning (device dm-4: state E): Skipping commit of aborted transaction.

[CAUSE]
For the block-group-tree feature, we have an artificial dependency on free-space-tree.

This means that if we detect block-group-tree without v2 cache, we consider it a corruption and cause the problem.

The clear_cache mount option temporarily disables v2 cache, then re-enables it.

But unfortunately, while v2 cache is temporarily disabled, we refuse to write a superblock with only the bg tree flag set, which leads to the above transaction abort.

[FIX]
For now, just reject the clear_cache and v1 cache mount options for block group tree. So now we get a graceful rejection rather than a transaction abort:

  BTRFS info (device dm-4): force clearing of disk cache
  BTRFS error (device dm-4): cannot disable free space tree with block-group-tree feature
  BTRFS error (device dm-4): open_ctree failed

CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: print extent buffers when sibling keys check fails | Filipe Manana | 2023-04-28 | 1 | -0/+4

When trying to move keys from one node/leaf to another sibling node/leaf, if the sibling keys check fails we just print an error message with the last key of the left sibling and the first key of the right sibling. However it's also useful to print all the keys of each sibling, as it may provide some clues to what went wrong, which code path may be inserting keys in an incorrect order. So just do that, print the siblings with btrfs_print_tree(), as it works for both leaves and nodes.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: abort transaction when sibling keys check fails for leaves | Filipe Manana | 2023-04-28 | 1 | -0/+2

If the sibling keys check fails before we move keys from one sibling leaf to another, we are not aborting the transaction - we leave that to some higher level caller of btrfs_search_slot() (or anything else that uses it to insert items into a b+tree).

This means that the transaction abort will provide a stack trace that omits the b+tree modification call chain. So change this to immediately abort the transaction and therefore get a more useful stack trace that shows us the call chain in the b+tree modification code.

It's also important to immediately abort the transaction just in case some higher level caller is not doing it, as this indicates a very serious corruption and we should stop the possibility of doing further damage.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix leak of source device allocation state after device replace | Filipe Manana | 2023-04-28 | 1 | -0/+1

When a device replace finishes, the source device is freed by calling btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the allocation state, tracked in the device's alloc_state io tree, is never freed.

This is a regression recently introduced by commit f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state"), which removed a call to extent_io_tree_release() from btrfs_free_device(), with the rationale that btrfs_close_one_device() already releases the allocation state from a device and btrfs_close_one_device() is always called before a device is freed with btrfs_free_device(). However that is not true for the device replace case, as btrfs_free_device() is called without any previous call to btrfs_close_one_device().

The issue is trivial to reproduce, for example, by running test btrfs/027 from fstests:

  $ ./check btrfs/027
  $ rmmod btrfs
  $ dmesg
  (...)
  [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
  [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
  [84519.552251] BTRFS info (device sdc): scrub: started on devid 1
  [84519.552277] BTRFS info (device sdc): scrub: started on devid 2
  [84519.552332] BTRFS info (device sdc): scrub: started on devid 3
  [84519.552705] BTRFS info (device sdc): scrub: started on devid 4
  [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
  [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
  [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
  [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
  [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
  [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
  [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1

So fix this by adding back the call to extent_io_tree_release() at btrfs_free_device().

Fixes: f0bb5474cff0 ("btrfs: remove redundant release of btrfs_device::alloc_state")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix assertion of exclop condition when starting balance | xiaoshoukui | 2023-04-28 | 1 | -1/+3

Balance as exclusive state is compatible with paused balance and device add, which makes some things more complicated. The assertion of valid states when starting from paused balance needs to take into account two more states. The combinations can be hit when there are several threads racing to start balance and device add. This won't typically happen when the commands are started from the command line.

Scenario 1: With exclusive_operation state == BTRFS_EXCLOP_NONE.

When multiple devices are added concurrently to the same mount point and btrfs_exclop_finish() completes before the assertion in btrfs_exclop_balance() is evaluated, exclusive_operation changes to BTRFS_EXCLOP_NONE, which leads to the assertion failure:

  fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE ||
  fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD,
  in fs/btrfs/ioctl.c:456
  Call Trace:
   <TASK>
   btrfs_exclop_balance+0x13c/0x310
   ? memdup_user+0xab/0xc0
   ? PTR_ERR+0x17/0x20
   btrfs_ioctl_add_dev+0x2ee/0x320
   btrfs_ioctl+0x9d5/0x10d0
   ? btrfs_ioctl_encoded_write+0xb80/0xb80
   __x64_sys_ioctl+0x197/0x210
   do_syscall_64+0x3c/0xb0
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

Scenario 2: With exclusive_operation state == BTRFS_EXCLOP_BALANCE_PAUSED.

When multiple devices are added concurrently to the same mount point and one thread finishes btrfs_exclop_balance() before a later thread evaluates the assertion in btrfs_exclop_balance(), exclusive_operation changes to BTRFS_EXCLOP_BALANCE_PAUSED, which leads to the assertion failure:

  fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE ||
  fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD ||
  fs_info->exclusive_operation == BTRFS_EXCLOP_NONE,
  fs/btrfs/ioctl.c:458
  Call Trace:
   <TASK>
   btrfs_exclop_balance+0x240/0x410
   ? memdup_user+0xab/0xc0
   ? PTR_ERR+0x17/0x20
   btrfs_ioctl_add_dev+0x2ee/0x320
   btrfs_ioctl+0x9d5/0x10d0
   ? btrfs_ioctl_encoded_write+0xb80/0xb80
   __x64_sys_ioctl+0x197/0x210
   do_syscall_64+0x3c/0xb0
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

An example of the failed assertion is below, which shows that the paused balance state also needs to be checked.

  root@syzkaller:/home/xsk# ./repro
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  [ 416.611428][ T7970] BTRFS info (device loop0): fs_info exclusive_operation: 0
  Failed to add device /dev/vda, errno 14
  [ 416.613973][ T7971] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.615456][ T7972] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.617528][ T7973] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.618359][ T7974] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.622589][ T7975] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.624034][ T7976] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.626420][ T7977] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.627643][ T7978] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.629006][ T7979] BTRFS info (device loop0): fs_info exclusive_operation: 3
  [ 416.630298][ T7980] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  [ 416.632787][ T7981] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.634282][ T7982] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [ 416.636202][ T7983] BTRFS info (device loop0): fs_info exclusive_operation: 3
  [ 416.637012][ T7984] BTRFS info (device loop0): fs_info exclusive_operation: 1
  Failed to add device /dev/vda, errno 14
  [ 416.637759][ T7984] assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation == BTRFS_EXCLOP_NONE, in fs/btrfs/ioctl.c:458
  [ 416.639845][ T7984] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  [ 416.640485][ T7984] CPU: 0 PID: 7984 Comm: repro Not tainted 6.2.0 #7
  [ 416.641172][ T7984] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
  [ 416.642090][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e
  [ 416.644423][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282
  [ 416.645018][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000
  [ 416.645763][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7
  [ 416.646554][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000
  [ 416.647299][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0
  [ 416.648041][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000
  [ 416.648785][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000
  [ 416.649616][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 416.650238][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0
  [ 416.650980][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 416.651725][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 416.652502][ T7984] PKRU: 55555554
  [ 416.652888][ T7984] Call Trace:
  [ 416.653241][ T7984] <TASK>
  [ 416.653527][ T7984] btrfs_exclop_balance+0x240/0x410
  [ 416.654036][ T7984] ? memdup_user+0xab/0xc0
  [ 416.654465][ T7984] ? PTR_ERR+0x17/0x20
  [ 416.654874][ T7984] btrfs_ioctl_add_dev+0x2ee/0x320
  [ 416.655380][ T7984] btrfs_ioctl+0x9d5/0x10d0
  [ 416.655822][ T7984] ? btrfs_ioctl_encoded_write+0xb80/0xb80
  [ 416.656400][ T7984] __x64_sys_ioctl+0x197/0x210
  [ 416.656874][ T7984] do_syscall_64+0x3c/0xb0
  [ 416.657346][ T7984] entry_SYSCALL_64_after_hwframe+0x63/0xcd
  [ 416.657922][ T7984] RIP: 0033:0x4546af
  [ 416.660170][ T7984] RSP: 002b:00007fa2985d4150 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [ 416.660972][ T7984] RAX: ffffffffffffffda RBX: 00007fa2985d4640 RCX: 00000000004546af
  [ 416.661714][ T7984] RDX: 0000000000000000 RSI: 000000005000940a RDI: 0000000000000003
  [ 416.662449][ T7984] RBP: 00007fa2985d41d0 R08: 0000000000000000 R09: 00007ffee37a4c4f
  [ 416.663195][ T7984] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa2985d4640
  [ 416.663951][ T7984] R13: 0000000000000009 R14: 000000000041b320 R15: 00007fa297dd4000
  [ 416.664703][ T7984] </TASK>
  [ 416.665040][ T7984] Modules linked in:
  [ 416.665590][ T7984] ---[ end trace 0000000000000000 ]---
  [ 416.666176][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e
  [ 416.668775][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282
  [ 416.669425][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000
  [ 416.670235][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7
  [ 416.671050][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000
  [ 416.671867][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0
  [ 416.672685][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000
  [ 416.673501][ T7984] FS: 00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000
  [ 416.674425][ T7984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 416.675114][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0
  [ 416.675933][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 416.676760][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Link: https://lore.kernel.org/linux-btrfs/20230324031611.98986-1-xiaoshoukui@gmail.com/
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
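
The widened state check described in the two scenarios can be sketched in userspace as follows. This is an assumption-laden model, not the kernel code: the enum is a local copy whose numeric values are chosen to be consistent with the log above (0 for no operation, 3 while devices are being added, 1 when the assertion fires), and `exclop_state_expected` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Local model of the exclusive operation states named in the message.
 * The numeric values are an assumption chosen to match the log output.
 */
enum exclop_model {
	EXCLOP_NONE = 0,
	EXCLOP_BALANCE_PAUSED = 1,
	EXCLOP_BALANCE = 2,
	EXCLOP_DEV_ADD = 3,
	EXCLOP_DEV_REMOVE = 4,	/* example of a state that stays unexpected */
};

/*
 * States that a thread may legitimately observe when starting balance
 * from a paused balance while other threads race on device add.
 */
static bool exclop_state_expected(enum exclop_model state)
{
	switch (state) {
	case EXCLOP_BALANCE:
	case EXCLOP_DEV_ADD:
	case EXCLOP_NONE:		/* scenario 1 */
	case EXCLOP_BALANCE_PAUSED:	/* scenario 2 */
		return true;
	default:
		return false;
	}
}
```

A check of this shape accepts both states seen in the scenarios while still rejecting unrelated exclusive operations.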
* btrfs: fix btrfs_prev_leaf() to not return the same key twice | Filipe Manana | 2023-04-28 | 1 | -1/+31

A call to btrfs_prev_leaf() may end up returning a path that points to the same item (key) again. This happens if, while in btrfs_prev_leaf(), after we release the path, a concurrent insertion happens, which moves items off from a sibling into the front of the previous leaf, and an item with the computed previous key does not exist.

For example, suppose we have the two following leaves:

  Leaf A

  -------------------------------------------------------------
  | ...   key (300 96 10)   key (300 96 15)   key (300 96 16) |
  -------------------------------------------------------------
              slot 20           slot 21           slot 22

  Leaf B

  -------------------------------------------------------------
  | key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
  -------------------------------------------------------------
      slot 0             slot 1             slot 2

If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with a path pointing to leaf B and slot 0 and the following happens:

1) At btrfs_prev_leaf() we compute the previous key to search as: (300 96 19), which is a key that does not exist in the tree;

2) Then we call btrfs_release_path() at btrfs_prev_leaf();

3) Some other task inserts a key at leaf A, that sorts before the key at slot 20, for example it has an objectid of 299. In order to make room for the new key, the key at slot 22 is moved to the front of leaf B. This happens at push_leaf_right(), called from split_leaf().

   After this leaf B now looks like:

  --------------------------------------------------------------------------------
  | key (300 96 16)   key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
  --------------------------------------------------------------------------------
       slot 0              slot 1            slot 2             slot 3

4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed previous key: (300 96 19). Since the key does not exist, btrfs_search_slot() returns 1 and with a path pointing to leaf B and slot 1, the item with key (300 96 20);

5) This makes btrfs_prev_leaf() return a path that points to slot 1 of leaf B, the same key as before it was called, since the key at slot 0 of leaf B (300 96 16) is less than the computed previous key, which is (300 96 19);

6) As a consequence btrfs_previous_item() returns a path that points again to the item with key (300 96 20).

For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not be a functional problem, despite not making sense to return a new path pointing again to the same item/key. However for a caller such as tree-log.c:log_dir_items(), this has a bad consequence, as it can result in not logging some dir index deletions in case the directory is being logged without holding the inode's VFS lock (logging triggered while logging a child inode for example) - for the example scenario above, in case the dir index keys 17, 18 and 19 were deleted in the current transaction.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
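
Steps 4 to 6 above can be modeled on a single leaf with a plain sorted array. The sketch below is a userspace illustration only: keys are reduced to their offset component, `search_slot_model` is a hypothetical stand-in for a search that lands on the first key not smaller than the target, and `landed_on_original` is a hypothetical detector for the situation the commit describes, not the fix that was merged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the first slot whose key is >= target, or nritems if every key
 * is smaller. This is where a search for a missing key ends up.
 */
static size_t search_slot_model(const uint64_t *keys, size_t nritems,
				uint64_t target)
{
	size_t lo = 0, hi = nritems;

	assert(keys != NULL);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (keys[mid] < target)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Hypothetical check: after searching for prev_key, does the resulting
 * slot point back at orig_key, the key the caller started from?
 */
static int landed_on_original(const uint64_t *keys, size_t nritems,
			      uint64_t orig_key, uint64_t prev_key)
{
	size_t slot = search_slot_model(keys, nritems, prev_key);

	return slot < nritems && keys[slot] == orig_key;
}
```

For leaf B after the push (offsets 16, 20, 21, 22), searching for 19 lands on slot 1, which holds 20 again; the item the caller actually wants is one slot earlier, at slot 0.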
* btrfs: mark btrfs_assertfail() __noreturn | Josh Poimboeuf | 2023-04-17 | 3 | -2/+3

Fixes a bunch of warnings including:

  vmlinux.o: warning: objtool: select_reloc_root+0x314: unreachable instruction
  vmlinux.o: warning: objtool: finish_inode_if_needed+0x15b1: unreachable instruction
  vmlinux.o: warning: objtool: get_bio_sector_nr+0x259: unreachable instruction
  vmlinux.o: warning: objtool: raid_wait_read_end_io+0xc26: unreachable instruction
  vmlinux.o: warning: objtool: raid56_parity_alloc_scrub_rbio+0x37b: unreachable instruction
  ...

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202302210709.IlXfgMpX-lkp@intel.com/
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix uninitialized variable warnings | Genjian Zhang | 2023-04-17 | 2 | -2/+2

There are some warnings on older compilers (gcc 10, 7) or non-x86_64 architectures (aarch64). As btrfs wants to enable -Wmaybe-uninitialized by default, fix the warnings even though it's not necessary on recent compilers (gcc 12+).

  ../fs/btrfs/volumes.c: In function ‘btrfs_init_new_device’:
  ../fs/btrfs/volumes.c:2703:3: error: ‘seed_devices’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
   2703 |   btrfs_setup_sprout(fs_info, seed_devices);
        |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  ../fs/btrfs/send.c: In function ‘get_cur_inode_state’:
  ../include/linux/compiler.h:70:32: error: ‘right_gen’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
     70 |   (__if_trace.miss_hit[1]++,1) :  \
        |                                ^
  ../fs/btrfs/send.c:1878:6: note: ‘right_gen’ was declared here
   1878 |  u64 right_gen;
        |      ^~~~~~~~~

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use log root when iterating over index keys when logging directory | Filipe Manana | 2023-04-17 | 1 | -27/+24

When logging dir dentries of a directory, we iterate over the subvolume tree to find dir index keys on leaves modified in the current transaction. This however is heavy on locking, since btrfs_search_forward() may often keep locks on extent buffers for quite a while when walking the tree to find a suitable leaf modified in the current transaction and with a key not smaller than the provided minimum key. That means it will block other tasks trying to access the subvolume tree, which may be common fs operations like creating, renaming, linking, unlinking, reflinking files, etc.

A better solution is to iterate the log tree, since it's much smaller than a subvolume tree and only contains dir index keys added in the current transaction, and to just use plain btrfs_search_slot() (or the wrapper btrfs_for_each_slot()).

The following bonnie++ test on a non-debug kernel (with Debian's default kernel config) on a 20G null block device, was used to measure the impact:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  NR_DIRECTORIES=20
  NR_FILES=20480 # must be a multiple of 1024
  DATASET_SIZE=$(( (8 * 1024 * 1024 * 1024) / 1048576 )) # 8 GiB as megabytes
  DIRECTORY_SIZE=$(( DATASET_SIZE / NR_FILES ))
  NR_FILES=$(( NR_FILES / 1024 ))

  umount $DEV &> /dev/null
  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  bonnie++ -u root -d $MNT \
      -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
      -r 0 -s $DATASET_SIZE -b

  umount $MNT

Before patchset:

  Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                      -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
  Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
  debian0          8G  376k  99  1.1g  98  939m  92 1527k  99  3.2g  99  9060 256
  Latency             24920us     207us     680ms    5594us     171us    2891us
  Version 2.00a       ------Sequential Create------ --------Random Create--------
  debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                20/20 20480  96 +++++ +++ 20480  95 20480  99 +++++ +++ 20480  97
  Latency              8708us     137us    5128us    6743us      60us   19712us

After patchset:

  Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                      -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
  Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
  debian0          8G  384k  99  1.2g  99  971m  91 1533k  99  3.3g  99  9180 309
  Latency             24930us     125us     661ms    5587us      46us    2020us
  Version 2.00a       ------Sequential Create------ --------Random Create--------
  debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                20/20 20480  90 +++++ +++ 20480  99 20480  99 +++++ +++ 20480  97
  Latency              7030us      61us    1246us    4942us      56us   16855us

The patchset consists of this patch plus a previous one that has the following subject:

  "btrfs: avoid iterating over all indexes when logging directory"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: avoid iterating over all indexes when logging directoryFilipe Manana2023-04-172-7/+57
  When logging a directory, after copying all directory index items from the
  subvolume tree to the log tree, we iterate over the subvolume tree to find
  all dir index items that are located in leaves COWed (or created) in the
  current transaction. If we keep logging a directory several times during
  the same transaction, we end up iterating over the same dir index items
  every time we log the directory, wasting time and adding extra lock
  contention on the subvolume tree.

  So just keep track of the last logged dir index offset in order to start
  the search for that index (+1) the next time the directory is logged, as
  dir index values (key offsets) come from a monotonically increasing
  counter.

  The following test measures the difference before and after this change:

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/nullb0
    MNT=/mnt/nullb0

    umount $DEV &> /dev/null
    mkfs.btrfs -f $DEV
    mount -o ssd $DEV $MNT

    # Time values in milliseconds.
    declare -a fsync_times
    # Total number of files added to the test directory.
    num_files=1000000
    # Fsync directory after every N files are added.
    fsync_period=100

    mkdir $MNT/testdir

    fsync_total_time=0
    for ((i = 1; i <= $num_files; i++)); do
        echo -n > $MNT/testdir/file_$i

        if [ $((i % fsync_period)) -eq 0 ]; then
            start=$(date +%s%N)
            xfs_io -c "fsync" $MNT/testdir
            end=$(date +%s%N)
            fsync_total_time=$((fsync_total_time + (end - start)))
            fsync_times[i]=$(( (end - start) / 1000000 ))
            echo -n -e "Progress $i / $num_files\r"
        fi
    done

    echo -e "\nHistogram of directory fsync duration in ms:\n"

    printf '%s\n' "${fsync_times[@]}" | \
        perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);'

    fsync_total_time=$((fsync_total_time / 1000000))
    echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n"
    echo

    umount $MNT

  The test was run on a non-debug kernel (Debian's default kernel config)
  against a 15G null block device.

  Result before this change:

    Histogram of directory fsync duration in ms:

    Count: 10000
    Range: 3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751
    Percentiles: 90th: 71.000; 95th: 77.000; 99th: 81.000
    3.000 - 5.278: 1423 #################################
    5.278 - 8.854: 1173 ###########################
    8.854 - 14.467: 591 ##############
    14.467 - 23.277: 1025 #######################
    23.277 - 37.105: 1422 #################################
    37.105 - 58.809: 2036 ###############################################
    58.809 - 92.876: 2316 #####################################################
    92.876 - 146.346: 6 |
    146.346 - 230.271: 6 |
    230.271 - 362.000: 2 |

    Total time spent in fsync: 350527 ms

  Result after this change:

    Histogram of directory fsync duration in ms:

    Count: 10000
    Range: 3.000 - 1088.000; Mean: 8.704; Median: 8.000; Stddev: 12.576
    Percentiles: 90th: 12.000; 95th: 14.000; 99th: 17.000
    3.000 - 6.007: 3222 #################################
    6.007 - 11.276: 5197 #####################################################
    11.276 - 20.506: 1551 ################
    20.506 - 36.674: 24 |
    36.674 - 201.552: 1 |
    201.552 - 353.841: 4 |
    353.841 - 1088.000: 1 |

    Total time spent in fsync: 92114 ms

  Signed-off-by: Filipe Manana <fdmanana@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
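The "remember the last logged index and resume at +1" idea can be sketched in user-space C as below. This is only an illustration under stated assumptions: `struct dir_log_state`, `next_scan_start()` and `record_logged()` are hypothetical names and are not taken from the kernel sources.

```c
#include <stdint.h>

/* Hypothetical per-directory state; the field name is illustrative only. */
struct dir_log_state {
    /* Highest dir index offset logged so far in this transaction, 0 if none. */
    uint64_t last_logged_index;
};

/* First dir index offset the next logging pass has to look at. */
uint64_t next_scan_start(const struct dir_log_state *st, uint64_t first_index)
{
    if (st->last_logged_index == 0)
        return first_index;
    /* Indexes come from a monotonically increasing counter, so resume at +1. */
    return st->last_logged_index + 1;
}

/* Remember the highest index seen while logging; never move backwards. */
void record_logged(struct dir_log_state *st, uint64_t index)
{
    if (index > st->last_logged_index)
        st->last_logged_index = index;
}
```

Because the counter only grows, anything at or below the remembered offset was already visited by an earlier pass in the same transaction, which is what lets the repeated fsync calls in the test skip the old items.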
* btrfs: dev-replace: error out if we have unrepaired metadata error during | Qu Wenruo | 2023-04-17 | 1 | -5/+42
  [BUG]
  Even before the scrub rework, if some corrupted metadata fails to be
  repaired during replace, we still continue replacing and let it finish as
  if nothing is wrong:

    BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
    BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
    BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
    BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
    BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
    BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
    BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
    BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
    BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished

  This can lead to unexpected problems for the resulting filesystem.

  [CAUSE]
  Btrfs reuses the scrub code path for dev-replace to iterate all dev
  extents.

  But unlike scrub, dev-replace doesn't really bother to check the scrub
  progress, which records all the errors found during replace.

  And even if we check the progress, we cannot really determine which
  errors are minor and which are critical just by the plain numbers
  (remember we don't treat metadata/data checksum errors differently).

  This behavior is there from the very beginning.

  [FIX]
  Instead of continuing the replace, just error out if we hit an unrepaired
  metadata sector.

  Now the dev-replace would be rejected with -EIO, to let the user know.
  Although it also means the filesystem has some metadata error which
  cannot be repaired, the user would be upset anyway.

  The new dmesg would look like this:

    BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
    BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
    BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
    BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
    BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
    BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
    BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
    BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
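The [FIX] rule can be pictured as a check made after repair has been attempted on a stripe. The sketch below is a user-space approximation, assuming per-stripe bitmaps with one bit per sector; `struct stripe_result`, its fields and `check_replace_result()` are hypothetical names chosen for illustration, not the kernel's.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-stripe outcome, one bit per sector. */
struct stripe_result {
    uint16_t error_bitmap;    /* sectors still bad after all repair attempts */
    uint16_t metadata_bitmap; /* sectors that belong to metadata */
};

/*
 * Return -EIO when a dev-replace run is left with an unrepaired metadata
 * sector, 0 otherwise. A plain scrub keeps going and only reports errors.
 */
int check_replace_result(const struct stripe_result *res, bool is_dev_replace)
{
    if (is_dev_replace && (res->error_bitmap & res->metadata_bitmap))
        return -EIO;
    return 0;
}
```

On Linux `-EIO` is -5, which lines up with the `failed -5` in the new dmesg output quoted above.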
* btrfs: remove pointless loop at btrfs_get_next_valid_item() | Filipe Manana | 2023-04-17 | 1 | -17/+6
  It's pointless to have a while loop at btrfs_get_next_valid_item(), as if
  the slot on the current leaf is beyond the last item, we call
  btrfs_next_leaf(), which leaves us at a valid slot of the next leaf (or a
  valid slot in the current leaf if after releasing the path an item gets
  pushed from the next leaf to the current leaf).

  So just call btrfs_next_leaf() if the current slot on the current leaf is
  beyond the last item.

  Signed-off-by: Filipe Manana <fdmanana@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: reject unsupported scrub flags | Qu Wenruo | 2023-04-17 | 2 | -0/+6
  Since the introduction of the scrub interface, the only flag that we
  support is BTRFS_SCRUB_READONLY. Thus there are no sanity checks; if some
  undefined flags are passed in, we just ignore them.

  This is problematic if we want to introduce new scrub flags, as we have
  no way to determine if such flags are supported.

  Address the problem by introducing a check for the flags, and if
  unsupported flags are set, return -EOPNOTSUPP to inform the user space.

  This check should be backported for all supported kernels before any new
  scrub flags are introduced.

  CC: stable@vger.kernel.org # 4.14+
  Reviewed-by: Anand Jain <anand.jain@oracle.com>
  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
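The flag check described above follows a common mask pattern: anything outside the supported set is rejected. A minimal user-space sketch, where `SCRUB_FLAG_READONLY`, `SCRUB_SUPPORTED_FLAGS` and `validate_scrub_flags()` are stand-in names assumed for illustration (only BTRFS_SCRUB_READONLY and -EOPNOTSUPP come from the commit message):

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for BTRFS_SCRUB_READONLY; the other names here are hypothetical. */
#define SCRUB_FLAG_READONLY   (1ULL << 0)
#define SCRUB_SUPPORTED_FLAGS (SCRUB_FLAG_READONLY)

/* Reject any bit outside the supported mask instead of silently ignoring it. */
int validate_scrub_flags(uint64_t flags)
{
    if (flags & ~SCRUB_SUPPORTED_FLAGS)
        return -EOPNOTSUPP;
    return 0;
}
```

With a check like this in place, user space can probe for a future flag by passing it and looking for -EOPNOTSUPP, which is why the message asks for the check to be backported before new flags appear.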
* btrfs: reinterpret async discard iops_limit=0 as no delay | Boris Burkov | 2023-04-17 | 1 | -7/+12
  Currently, a limit of 0 results in a hard coded metering over 6 hours.
  Since the default is a set limit, I suspect no one truly depends on this
  rather arbitrary setting. Repurpose it for an arguably more useful
  "unlimited" mode, where the delay is 0.

  Note that if block groups are too new, or go fully empty, there is still
  a delay associated with those conditions. Those delays implement
  heuristics for not trimming a region we are relatively likely to fully
  overwrite soon.

  CC: stable@vger.kernel.org # 6.2+
  Reviewed-by: Neal Gompa <neal@gompa.dev>
  Signed-off-by: Boris Burkov <boris@bur.io>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: set default discard iops_limit to 1000 | Boris Burkov | 2023-04-17 | 1 | -1/+1
  Previously, the default was a relatively conservative 10. This results in
  a 100ms delay, so with ~300 discards in a commit, it takes the full 30s
  till the next commit to finish the discards. On a workstation, this
  results in the disk never going idle, wasting power/battery, etc.

  Set the default to 1000, which results in using the smallest possible
  delay, currently, which is 1ms. The original reporter has shown this to
  not pathologically keep the disk busy.

  Link: https://lore.kernel.org/linux-btrfs/Y%2F+n1wS%2F4XAH7X1p@nz/
  Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182228
  CC: stable@vger.kernel.org # 6.2+
  Reviewed-by: Neal Gompa <neal@gompa.dev>
  Signed-off-by: Boris Burkov <boris@bur.io>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
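The numbers in this entry (limit 10 gives 100ms, limit 1000 gives the 1ms minimum) are consistent with a delay of one second divided by the limit. The sketch below assumes exactly that relation plus a 1ms floor, and treats a limit of 0 as "no delay" as the neighbouring iops_limit=0 entry describes. `discard_delay_msec()` and the constants are hypothetical names; the kernel may apply further clamps not modelled here.

```c
#include <stdint.h>

#define SKETCH_MSEC_PER_SEC   1000u
#define SKETCH_MIN_DELAY_MSEC 1u /* smallest delay mentioned in the message */

/* Delay between discards in milliseconds for a given iops_limit. */
uint32_t discard_delay_msec(uint32_t iops_limit)
{
    uint32_t delay;

    if (iops_limit == 0)
        return 0; /* "unlimited" mode: no metering delay */

    delay = SKETCH_MSEC_PER_SEC / iops_limit;
    if (delay < SKETCH_MIN_DELAY_MSEC)
        delay = SKETCH_MIN_DELAY_MSEC;
    return delay;
}
```

Under these assumptions, about 300 discards at the old default take 300 x 100ms = 30s, which matches the full commit interval quoted above, while the new default needs roughly 300ms for the same batch.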
* btrfs: remove unused raid56 functions which were dedicated for scrub | Qu Wenruo | 2023-04-17 | 2 | -54/+0
  Since the scrub rework, the following RAID56 functions are no longer
  called:

  - raid56_add_scrub_pages()
  - raid56_alloc_missing_rbio()
  - raid56_submit_missing_rbio()

  Those functions were all utilized by scrub to handle missing device cases
  for RAID56.

  However the new scrub code handles them in a completely different way:

  - If it's a data stripe, go through the recovery path via
    btrfs_submit_bio()

  - If it's a P/Q stripe, it would be handled through
    raid56_parity_submit_scrub_rbio()
    And that function would handle dev-replace and repair properly.

  Thus we can safely remove those functions.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: remove scrub_bio structure | Qu Wenruo | 2023-04-17 | 2 | -244/+6
  Since scrub path has been fully moved to scrub_stripe based facilities,
  no more scrub_bio would be submitted.

  Thus we can remove it completely, this involves:

  - SCRUB_SECTORS_PER_BIO macro
  - SCRUB_BIOS_PER_SCTX macro
  - SCRUB_MAX_PAGES macro
  - BTRFS_MAX_MIRRORS macro
  - scrub_bio structure
  - scrub_ctx::bios member
  - scrub_ctx::curr member
  - scrub_ctx::bios_in_flight member
  - scrub_ctx::workers_pending member
  - scrub_ctx::list_lock member
  - scrub_ctx::list_wait member
  - function scrub_bio_end_io_worker()
  - function scrub_pending_bio_inc()
  - function scrub_pending_bio_dec()
  - function scrub_throttle()
  - function scrub_submit()
  - function scrub_find_csum()
  - function drop_csum_range()
  - Some unnecessary flush and scrub pauses

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: remove scrub_block and scrub_sector structures | Qu Wenruo | 2023-04-17 | 2 | -573/+0
  Those two structures are used to represent a bunch of sectors for scrub,
  but now they are fully replaced by scrub_stripe in one go, so we can
  remove them.

  This involves:

  - structure scrub_block
  - structure scrub_sector

  - structure scrub_page_private
  - function attach_scrub_page_private()
  - function detach_scrub_page_private()
    Now we no longer need to use page::private to handle subpage.

  - function alloc_scrub_block()
  - function alloc_scrub_sector()
  - function scrub_sector_get_page()
  - function scrub_sector_get_page_offset()
  - function scrub_sector_get_kaddr()
  - function bio_add_scrub_sector()

  - function scrub_checksum_data()
  - function scrub_checksum_tree_block()
  - function scrub_checksum_super()
  - function scrub_check_fsid()

  - function scrub_block_get()
  - function scrub_block_put()
  - function scrub_sector_get()
  - function scrub_sector_put()

  - function scrub_bio_end_io()
  - function scrub_block_complete()
  - function scrub_add_sector_to_rd_bio()

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: remove the old scrub recheck code | Qu Wenruo | 2023-04-17 | 4 | -1012/+14
  The old scrub code has different entrance to verify the content, and
  since we have removed the writeback path, now we can start removing the
  re-check part, including:

  - scrub_recover structure
  - scrub_sector::recover member
  - function scrub_setup_recheck_block()
  - function scrub_recheck_block()
  - function scrub_recheck_block_checksum()
  - function scrub_repair_block_group_good_copy()
  - function scrub_repair_sector_from_good_copy()
  - function scrub_is_page_on_raid56()

  - function full_stripe_lock()
  - function search_full_stripe_lock()
  - function get_full_stripe_logical()
  - function insert_full_stripe_lock()
  - function lock_full_stripe()
  - function unlock_full_stripe()
  - btrfs_block_group::full_stripe_locks_root member
  - btrfs_full_stripe_locks_tree structure
    This infrastructure is to ensure RAID56 scrub is properly handling
    recovery and P/Q scrub correctly.

    This is no longer needed, before P/Q scrub we will wait for all the
    involved data stripes to be scrubbed first, and RAID56 code has
    internal lock to ensure no race in the same full stripe.

  - function scrub_print_warning()
  - function scrub_get_recover()
  - function scrub_put_recover()
  - function scrub_handle_errored_block()
  - function scrub_setup_recheck_block()
  - function scrub_bio_wait_endio()
  - function scrub_submit_raid56_bio_wait()
  - function scrub_recheck_block_on_raid56()
  - function scrub_recheck_block()
  - function scrub_recheck_block_checksum()
  - function scrub_repair_block_from_good_copy()
  - function scrub_repair_sector_from_good_copy()

  And two more functions exported temporarily for later cleanup:

  - alloc_scrub_sector()
  - alloc_scrub_block()

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: remove the old writeback infrastructure | Qu Wenruo | 2023-04-17 | 2 | -219/+3
  Since the whole scrub path has been switched to scrub_stripe based
  solution, the old writeback path can be removed completely, which
  involves:

  - scrub_ctx::wr_curr_bio member
  - scrub_ctx::flush_all_writes member
  - function scrub_write_block_to_dev_replace()
  - function scrub_write_sector_to_dev_replace()
  - function scrub_add_sector_to_wr_bio()
  - function scrub_wr_submit()
  - function scrub_wr_bio_end_io()
  - function scrub_wr_bio_end_io_worker()

  And one more function needs to be exported temporarily:

  - scrub_sector_get()

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: remove scrub_parity structure | Qu Wenruo | 2023-04-17 | 3 | -524/+9
  The structure scrub_parity is used to indicate that some extents are
  scrubbed for the purpose of RAID56 P/Q scrubbing.

  Since the whole RAID56 P/Q scrubbing path has been replaced with new
  scrub_stripe infrastructure, and we no longer need to use scrub_parity to
  modify the behavior of data stripes, we can remove it completely.

  This removal involves:

  - scrub_parity_workers
    Now only one worker would be utilized, scrub_workers, to do the read
    and repair.
    All writeback would happen at the main scrub thread.

  - scrub_block::sparity member
  - scrub_parity structure
  - function scrub_parity_get()
  - function scrub_parity_put()
  - function scrub_free_parity()

  - function __scrub_mark_bitmap()
  - function scrub_parity_mark_sectors_error()
  - function scrub_parity_mark_sectors_data()
    These helpers are no longer needed, scrub_stripe has its bitmaps and
    we can use bitmap helpers to get the error/data status.

  - scrub_parity_bio_endio()
  - scrub_parity_check_and_repair()
  - function scrub_sectors_for_parity()
  - function scrub_extent_for_parity()
  - function scrub_raid56_data_stripe_for_parity()
  - function scrub_raid56_parity()
    The new code would reuse the scrub read-repair and writeback path.
    Just skip the dev-replace phase.
    And scrub_stripe infrastructure allows us to submit and wait for those
    data stripes before scrubbing P/Q, without extra infrastructure.

  The following two functions are temporarily exported for later cleanup:

  - scrub_find_csum()
  - scrub_add_sector_to_rd_bio()

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub | Qu Wenruo | 2023-04-17 | 2 | -10/+210
  Implement the only missing part for scrub: RAID56 P/Q stripe scrub.

  The workflow is pretty straightforward for the new function,
  scrub_raid56_parity_stripe():

  - Go through the regular scrub path for each data stripe

  - Wait for the verification and repair to finish

  - Writeback the repaired sectors to data stripes

  - Make sure all stripes are properly repaired
    If we have sectors unrepaired, we cannot continue, or we could further
    corrupt the P/Q stripe.

  - Submit the rbio for P/Q stripe
    The dev-replace would be handled inside
    raid56_parity_submit_scrub_rbio() path.

  - Wait for the above bio to finish

  Although the old code is no longer used, we still keep the declaration,
  as the cleanup can be several times larger than this patch itself.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure | Qu Wenruo | 2023-04-17 | 2 | -474/+29
  Switch scrub_simple_mirror() to the new scrub_stripe infrastructure.

  Since scrub_simple_mirror() is the core part of scrub (only RAID56 P/Q
  stripes don't utilize it), we can get rid of a big chunk of code, mostly
  scrub_extent(), scrub_sectors() and directly called functions.

  There is a functionality change:

  - Scrub speed throttle now only affects read on the scrubbing device
    Writes (for repair and replace), and reads from other mirrors won't be
    limited by the set limits.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: introduce helper to queue a stripe for scrub | Qu Wenruo | 2023-04-17 | 2 | -17/+181
  The new helper, queue_scrub_stripe(), would try to queue a stripe for
  scrub. If all stripes are already in use, we will submit all the existing
  ones and wait for them to finish.

  Currently we would queue up to 8 stripes, to enlarge the blocksize to
  512KiB to improve the performance. Sectors repaired on zoned need to be
  relocated instead of in-place fix.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
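The "queue until every slot is taken, then submit and wait" behaviour can be sketched with a fixed-size batch. This is a user-space illustration only: `struct stripe_queue`, `flush_stripes()` and `queue_stripe()` are hypothetical names, and the flush merely counts instead of doing I/O. The figures quoted above imply 512KiB / 8 = 64KiB per stripe.

```c
#include <stddef.h>

#define STRIPES_PER_GROUP 8 /* "queue up to 8 stripes" */

/* Hypothetical queue; an int stands in for the real per-stripe state. */
struct stripe_queue {
    int pending[STRIPES_PER_GROUP];
    size_t nr_pending;
    size_t nr_flushes; /* how many times a full batch was submitted */
};

/* Stand-in for "submit all the existing ones and wait for them to finish". */
void flush_stripes(struct stripe_queue *q)
{
    q->nr_pending = 0;
    q->nr_flushes++;
}

/* Queue one stripe, flushing first when every slot is already in use. */
void queue_stripe(struct stripe_queue *q, int stripe_id)
{
    if (q->nr_pending == STRIPES_PER_GROUP)
        flush_stripes(q);
    q->pending[q->nr_pending++] = stripe_id;
}
```

A caller would still need one final flush after the last stripe of a range, since a partially filled batch is never submitted by `queue_stripe()` itself in this sketch.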
* btrfs: scrub: introduce error reporting functionality for scrub_stripe | Qu Wenruo | 2023-04-17 | 1 | -11/+157
  The new helper, scrub_stripe_report_errors(), will report the result of
  the scrub to system log.

  The main reporting is done by introducing a new helper,
  scrub_print_common_warning(), which is mostly the same content from
  scrub_print_warning(), but without the need for a scrub_block.

  Since we're reporting the errors, it's the perfect time to update the
  scrub stats too.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: introduce a writeback helper for scrub_stripe | Qu Wenruo | 2023-04-17 | 2 | -0/+96
  Add a new helper, scrub_write_sectors(), to submit write bios for
  specified sectors to the target disk.

  There are several differences compared to read path:

  - Utilize btrfs_submit_scrub_write()
    Now we still rely on the @mirror_num based writeback, but the
    requirement is also a little different than regular writeback or read,
    thus we have to call btrfs_submit_scrub_write().

  - We cannot write the full stripe back
    We can only write the sectors we have.
    There will be two call sites later, one for repaired sectors, one for
    all utilized sectors of dev-replace.

    Thus the callers should specify their own write_bitmap.

  This function only submits the bios and will not wait for them, except
  for the zoned case.
  Caller must explicitly wait for the IO to finish.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
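Writing "only the sectors we have" from a caller-supplied write_bitmap amounts to walking the bitmap and grouping set bits into contiguous runs, one write per run. The sketch below assumes 16 sectors per stripe (64KiB stripe with 4KiB sectors, an assumption not stated in this entry); `struct sector_run` and `collect_write_runs()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTORS_PER_STRIPE 16 /* assumed: 64KiB stripe / 4KiB sector */

/* One contiguous range of sectors to be written with a single bio. */
struct sector_run {
    unsigned start; /* first sector number in the run */
    unsigned len;   /* number of sectors in the run */
};

/*
 * Split write_bitmap into contiguous runs of set bits.
 * runs[] must have room for SECTORS_PER_STRIPE / 2 entries.
 * Returns the number of runs found.
 */
size_t collect_write_runs(uint16_t write_bitmap, struct sector_run *runs)
{
    size_t nr = 0;
    unsigned i = 0;

    while (i < SECTORS_PER_STRIPE) {
        unsigned start;

        if (!(write_bitmap & (1u << i))) {
            i++;
            continue;
        }
        start = i;
        while (i < SECTORS_PER_STRIPE && (write_bitmap & (1u << i)))
            i++;
        runs[nr].start = start;
        runs[nr].len = i - start;
        nr++;
    }
    return nr;
}
```

The two call sites mentioned above would differ only in the bitmap they pass: the repaired sectors for read-repair, or all utilized good sectors for dev-replace.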
* btrfs: scrub: introduce the main read repair worker for scrub_stripe | Qu Wenruo | 2023-04-17 | 2 | -4/+204
  The new helper, scrub_stripe_read_repair_worker(), would handle the
  read-repair part:

  - Wait for the previous submitted read IO to finish

  - Verify the contents of the stripe

  - Go through the remaining mirrors, using as large blocksize as possible
    At this stage, we just read out all the failed sectors from each
    mirror and re-verify.
    If no more failed sector, we can exit.

  - Go through all mirrors again, sector-by-sector
    This time, we read sector by sector, this is to address cases where
    one bad sector mismatches the drive's internal checksum, and cause the
    whole read range to fail.

    We put this recovery method as the last resort, as sector-by-sector
    reading is slow, and reading from other mirrors may have already fixed
    the errors.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
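The two repair passes can be modelled with bitmaps. The sketch below is a simplified user-space model, not the kernel logic: each mirror is described by which sectors the device can return at all and which returned sectors pass verification, and the large read of pass 1 is treated as a single request that fails as a whole if any requested sector is unreadable. `struct mirror` and `read_repair()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical description of one remaining mirror, one bit per sector. */
struct mirror {
    uint16_t readable; /* device returns this sector without an I/O error */
    uint16_t valid;    /* returned content passes verification */
};

/* Returns the bitmap of sectors that are still bad after both passes. */
uint16_t read_repair(uint16_t error_bitmap, const struct mirror *mirrors,
                     int nr_mirrors)
{
    /* Pass 1: one large read per mirror covering every failed sector. */
    for (int m = 0; m < nr_mirrors && error_bitmap; m++) {
        if ((error_bitmap & mirrors[m].readable) != error_bitmap)
            continue; /* one unreadable sector fails the whole read */
        error_bitmap &= (uint16_t)~mirrors[m].valid;
    }

    /* Pass 2 (last resort): retry every mirror one sector at a time. */
    for (int m = 0; m < nr_mirrors && error_bitmap; m++) {
        uint16_t fixed = error_bitmap & mirrors[m].readable & mirrors[m].valid;

        error_bitmap &= (uint16_t)~fixed;
    }
    return error_bitmap;
}
```

In this model a mirror with a single unreadable sector contributes nothing in pass 1 but can still repair its neighbours in pass 2, which is the case the sector-by-sector fallback is described as addressing.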
* btrfs: scrub: introduce a helper to verify one scrub_stripe | Qu Wenruo | 2023-04-17 | 2 | -2/+77
  The new helper, scrub_verify_stripe(), shares the same main workflow of
  the old scrub code.

  The major differences are:

  - How pages/page_offset is grabbed
    Everything can be grabbed from scrub_stripe easily.

  - When error report happens
    Currently the helper only verifies the sectors, not really doing any
    error reporting.
    The error reporting would be done after we have done the repair.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: introduce a helper to verify one metadata block | Qu Wenruo | 2023-04-17 | 2 | -0/+107
  The new helper, scrub_verify_one_metadata(), is almost the same as
  scrub_checksum_tree_block().

  The difference is in how we grab the pages from other structures.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe | Qu Wenruo | 2023-04-17 | 5 | -3/+158
  The new helper will search the extent tree to find the first extent of a
  logical range, then fill the sectors array by two loops:

  - Loop 1 to fill common bits and metadata generation

  - Loop 2 to fill csum data (only for data bgs)
    This loop will use the new btrfs_lookup_csums_bitmap() to fill the
    full csum buffer, and set scrub_sector_verification::csum.

  With all the needed info filled by this function, later we only need to
  submit and verify the stripe.

  Here we temporarily export the helper to avoid warning on unused static
  function.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: introduce structure for new BTRFS_STRIPE_LEN based interface | Qu Wenruo | 2023-04-17 | 2 | -0/+150
  This patch introduces the following structures:

  - scrub_sector_verification
    Contains all the needed info to verify one sector (data or metadata).

  - scrub_stripe
    Contains all needed members (mostly bitmap based) to scrub one stripe
    (with a length of BTRFS_STRIPE_LEN).

  The basic idea is, we keep the existing per-device scrub behavior, but
  merge all the scrub_bio/scrub_bio into one generic structure, and read
  the full BTRFS_STRIPE_LEN stripe on the first try.

  This means we will read some sectors which are not scrub target, but
  that's fine. At dev-replace time we only writeback the utilized and good
  sectors, and for read-repair we only writeback the repaired sectors.

  With every read submitted in BTRFS_STRIPE_LEN, the need for complex bio
  form shaping would be gone. Although to get the same performance of the
  old scrub behavior, we would need to submit the initial read for two
  stripes at once.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: introduce a new helper to submit write bio for repair | Qu Wenruo | 2023-04-17 | 5 | -44/+132
  Both scrub and read-repair are utilizing special repair writes that:

  - Only write back to a single device
    Even for read-repair on RAID56, we only update the corrupted data
    stripe itself, not triggering the full RMW path.

  - Require a valid @mirror_num
    For RAID56 case, only @mirror_num == 1 is valid.
    For non-RAID56 cases, we need @mirror_num to locate our stripe.

  - Need no data csum generation

  These two call sites still have some differences though:

  - Read-repair goes plain bio
    It doesn't need a full btrfs_bio, and goes submit_bio_wait().

  - New scrub repair would go btrfs_bio
    To simplify both read and write path.

  So here this patch would:

  - Introduce a common helper, btrfs_map_repair_block()
    Due to the single device nature, we can use an on-stack
    btrfs_io_stripe to pass device and its physical bytenr.

  - Introduce a new interface, btrfs_submit_repair_bio(), for later scrub
    code
    This is for the incoming scrub code.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: introduce btrfs_bio::fs_info member | Qu Wenruo | 2023-04-17 | 6 | -28/+49
  Currently we're doing a lot of work for btrfs_bio:

  - Checksum verification for data read bios
  - Bio splits if it crosses stripe boundary
  - Read repair for data read bios

  However for the incoming scrub patches, we don't want this extra
  functionality at all, just plain logical + mirror -> physical mapping
  ability.

  Thus here we do the following changes:

  - Introduce btrfs_bio::fs_info
    This is for the new scrub specific btrfs_bio, which would not populate
    btrfs_bio::inode.
    Thus we need such new member to grab a fs_info

    This new member will always be populated.

  - Replace @inode argument with @fs_info for btrfs_bio_init() and its
    caller
    Since @inode is no longer a mandatory member, replace it with
    @fs_info, and let involved users populate @inode.

  - Skip checksum verification and generation if @bbio->inode is NULL

  - Add extra ASSERT()s
    To make sure:

    * bbio->inode is properly set for involved read repair path
    * if @file_offset is set, bbio->inode is also populated

  - Grab @fs_info from @bbio directly
    We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be
    NULL. This involves:

    * btrfs_simple_end_io()
    * should_async_write()
    * btrfs_wq_submit_bio()
    * btrfs_use_zone_append()

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: scrub: use dedicated super block verification function to scrub one super block | Qu Wenruo | 2023-04-17 | 1 | -8/+52
  There is really no need to go through the super complex scrub_sectors()
  to just handle super blocks.

  Introduce a dedicated function to handle super block scrubbing.

  This new function will introduce a behavior change, instead of using the
  complex but concurrent scrub_bio system, here we just go
  submit-and-wait.

  There is really not much sense to care about the performance of super
  block scrubbing. It only has 3 super blocks at most, and they are all
  scattered around the devices already.

  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Anand Jain <anand.jain@oracle.com>
  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove redundant release of btrfs_device::alloc_state | Anand Jain | 2023-04-17 | 1 | -1/+0
  Commit 321f69f86a0f ("btrfs: reset device back to allocation state when
  removing") included adding extent_io_tree_release(&device->alloc_state)
  to btrfs_close_one_device(), which had already been called in
  btrfs_free_device().

  The alloc_state tree (IO_TREE_DEVICE_ALLOC_STATE), is created in
  btrfs_alloc_device() and released in btrfs_close_one_device(). Therefore,
  the additional call to extent_io_tree_release(&device->alloc_state) in
  btrfs_free_device() is unnecessary and can be removed.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: warn for any missed cleanup at btrfs_close_one_device | Anand Jain | 2023-04-17 | 1 | -4/+4
  During my recent search for the root cause of a reported bug, I realized
  that it's a good idea to issue a warning for missed cleanup instead of
  using debug-only assertions. Since most installations run with debug off,
  missed cleanups and premature calls to close could go unnoticed. However,
  these issues are serious enough to warrant reporting and fixing.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* libcrc32c: remove crc32c_impl | Christoph Hellwig | 2023-04-17 | 2 | -7/+0
  This was only ever used by btrfs, and the usage just went away.
  This effectively reverts df91f56adce1 ("libcrc32c: Add crc32c_impl
  function").

  Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: don't print the crc32c implementation at module load time | Christoph Hellwig | 2023-04-17 | 1 | -1/+1
  Btrfs can use various different checksumming algorithms, and prints the
  one used for a given file system at mount time. Don't bother printing
  the crc32c implementation at module load time, the information is
  available in /sys/fs/btrfs/FSID/checksum.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: tree-log: factor out a clean_log_buffer helper | Christoph Hellwig | 2023-04-17 | 1 | -61/+31
  The tree-log code has three almost identical copies for the accounting on
  an extent_buffer that doesn't need to be written any more. The only
  difference is that walk_down_log_tree passed the bytenr used to find the
  buffer instead of extent_buffer.start and calculates the length using the
  nodesize, while the other two callers look at the extent_buffer.len field
  that must always be equivalent to the nodesize.

  Factor the code into a common helper.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* block: make blkcg_punt_bio_submit optional | Christoph Hellwig | 2023-04-17 | 4 | -36/+48
  Guard all the code to punt bios to a per-cgroup submission helper by a
  new CONFIG_BLK_CGROUP_PUNT_BIO symbol that is selected by btrfs.
  This way non-btrfs kernel builds don't need to have this code.

  Reviewed-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* block: async_bio_lock does not need to be bh-safe | Christoph Hellwig | 2023-04-17 | 1 | -4/+4
  async_bio_lock is only taken from bio submission and workqueue context,
  both are never in bottom halves.

  Reviewed-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs, block: move REQ_CGROUP_PUNT to btrfs | Christoph Hellwig | 2023-04-17 | 9 | -48/+40
  REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds a
  branch to the bio submission hot path. To fix this, export
  blkcg_punt_bio_submit and let btrfs call it directly. Add a new
  REQ_FS_PRIVATE flag for btrfs to indicate to its own low-level bio
  submission code that a punt to the cgroup submission helper is required.

  Reviewed-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs, mm: remove the punt_to_cgroup field in struct writeback_control | Christoph Hellwig | 2023-04-17 | 2 | -8/+3
  punt_to_cgroup is only used by extent_write_locked_range, but that
  function also directly controls the bio flags for the actual submission.
  Remove the punt_to_cgroup field, and just set REQ_CGROUP_PUNT directly in
  extent_write_locked_range.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: also use kthread_associate_blkcg for uncompressible ranges | Christoph Hellwig | 2023-04-17 | 1 | -4/+5
  submit_one_async_extent needs to use kthread_associate_blkcg no matter if
  the range it handles ends up being compressed or not, as the deadlock
  risk due to cgroup throttling is the same. Call kthread_associate_blkcg
  earlier to cover the submit_uncompressed_range case as well.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: don't free the async_extent in submit_uncompressed_range | Christoph Hellwig | 2023-04-17 | 1 | -13/+11
  Let submit_one_async_extent, which is the only caller of
  submit_uncompressed_range, handle freeing of the async_extent in one
  central place.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move kthread_associate_blkcg out of btrfs_submit_compressed_write | Christoph Hellwig | 2023-04-17 | 3 | -13/+8
  btrfs_submit_compressed_write should not have to care if it is called
  from a helper thread or not. Move the kthread_associate_blkcg handling
  into submit_one_async_extent, as that is the one caller that needs it.
  Also move the assignment of REQ_CGROUP_PUNT into cow_file_range_async,
  as that is the routine that sets up the helper thread offload.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>