summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
...
| * btrfs: update comment and drop assertion in extent item lookup in ↵David Sterba2024-03-041-2/+4
| | | | | | | | | | | | | | | | | | | | find_parent_nodes() Same comment was added to this type of error, unify that and drop the assertion as we'd find out quickly that something is wrong after returning -EUCLEAN. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: push errors up from add_async_extent()David Sterba2024-03-041-5/+8
| | | | | | | | | | | | | | | | The memory allocation error in add_async_extent() is not handled properly, return an error and push the BUG_ON to the caller. Handling it there is not trivial so at least make it visible. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: remove do_list variable at btrfs_clear_delalloc_extent()Filipe Manana2024-03-041-3/+3
| | | | | | | | | | | | | | | | | | | | | | The "do_list" variable has a rather confusing name, so remove it and directly use btrfs_is_free_space_inode() instead. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: remove do_list variable at btrfs_set_delalloc_extent()Filipe Manana2024-03-041-2/+1
| | | | | | | | | | | | | | | | | | | | | | The "do_list" variable is only used once, plus its name/meaning is a bit confusing, so remove it and directory use btrfs_is_free_space_inode(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: use assertion instead of BUG_ON when adding/removing to delalloc listFilipe Manana2024-03-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | When adding or removing and inode to/from the root's delalloc list, instead of using a BUG_ON() to validate list emptiness, use ASSERT() since this is to check logic errors rather than real errors. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: add lockdep assertion to remaining delalloc callbacksFilipe Manana2024-03-041-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | The merge and split callbacks for an inode's io tree are supposed to be called while the io tree's spinlock is being held, so that the given extent_state records are stable, not modified or freed while the callbacks are using them. So add lockdep assertions in the callbacks. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: reduce inode lock critical section when setting and clearing delallocFilipe Manana2024-03-042-21/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When setting and clearing a delalloc range, at btrfs_set_delalloc_extent() and btrfs_clear_delalloc_extent(), we are adding/removing the inode to/from the root's list of delalloc inodes while under the protection of the inode's lock. This however is not needed, we can add and remove the inode to the root's list without holding the inode's lock because here we are under the protection of the io tree's lock, reducing the size of the critical section delimited by the inode's lock. The inode's lock is used in many other places such as when finishing an ordered extent (when calling btrfs_update_inode_bytes() or btrfs_delalloc_release_metadata(), or decreasing the number of outstanding extents) or when reserving space when doing a buffered or direct IO write (calls to functions from delalloc-space.c). So move the inode add/remove operations to the root's list of delalloc inodes to outside the critical section delimited by the inode's lock. This also allows us to get rid of the BTRFS_INODE_IN_DELALLOC_LIST flag since we can rely on the inode's delalloc bytes counter to determine if the inode is or is not in the list. The following fio based test, that exercises IO to multiple files in the same subvolume, was used to test: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 MOUNT_OPTIONS="-o ssd" mkfs.btrfs -f $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT fio --direct=0 --ioengine=sync --thread --directory=$MNT \ --invalidate=1 --group_reporting=1 \ --new_group --rw=randwrite --size=50m --numjobs=200 \ --bs=4k --fsync_on_close=0 --fallocate=none --end_fsync=0 \ --name=foo --filename_format=FioWorkloads.\$jobnum umount $MNT The test was run on a non-debug kernel (Debian's default kernel config) against a 16G null block device. Result before this patch: WRITE: bw=81.9MiB/s (85.9MB/s), 81.9MiB/s-81.9MiB/s (85.9MB/s-85.9MB/s), io=9.77GiB (10.5GB), run=122136-122136msec Result after this patch: WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=9.77GiB (10.5GB), run=115180-115180msec Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: rename btrfs_add_delalloc_inodes() to singular formFilipe Manana2024-03-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | The function btrfs_add_delalloc_inodes() adds a single inode its root's list of delalloc inodes, so it doesn't make any sense at all for the function's name to be plural. Rename it to the singular form btrfs_add_delalloc_inode() to avoid any confusion. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: assert root delalloc lock is held at __btrfs_del_delalloc_inode()Filipe Manana2024-03-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | This function requires the delalloc lock of the inode's root to be held, so assert it's held. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: stop passing root argument to __btrfs_del_delalloc_inode()Filipe Manana2024-03-043-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | There's no need to pass a root argument to __btrfs_del_delalloc_inode() and btrfs_del_delalloc_inode(), we can just pass the inode since the root is always the root associated to that inode. Some remove the root argument from these functions. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: stop passing root argument to btrfs_add_delalloc_inodes()Filipe Manana2024-03-041-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | There's no need to pass a root argument to btrfs_add_delalloc_inodes(), we can just pass the inode since the root is always the root associated to the inode in the context it's called. So remove it and have the single caller pass only the inode. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: add forward declarations and headers, part 3David Sterba2024-03-0410-28/+151
| | | | | | | | | | | | | | | | | | | | | | | | Do a cleanup in the rest of the headers: - add forward declarations for types referenced by pointers - add includes when types need them This fixes potential compilation problems if the headers are reordered or the missing includes are not provided indirectly. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: add forward declarations and headers, part 2David Sterba2024-03-0426-19/+206
| | | | | | | | | | | | | | | | | | | | | | | | Do a cleanup in more headers: - add forward declarations for types referenced by pointers - add includes when types need them This fixes potential compilation problems if the headers are reordered or the missing includes are not provided indirectly. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: add forward declarations and headers, part 1David Sterba2024-03-0429-4/+176
| | | | | | | | | | | | | | | | | | | | | | | | Do a cleanup in the short headers: - add forward declarations for types referenced by pointers - add includes when types need them This fixes potential compilation problems if the headers are reordered or the missing includes are not provided indirectly. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: hoist fs_info out of loops in end_bbio_data_write and end_bbio_data_readDavid Sterba2024-03-041-5/+4
| | | | | | | | | | | | | | | | | | The fs_info and sectorsize remain the same during the loops, no need to set them on each iteration. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: add helper to get fs_info from struct inode pointerDavid Sterba2024-03-0414-68/+72
| | | | | | | | | | | | | | | | | | | | | | | | Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: add helpers to get fs_info from page/folio pointersDavid Sterba2024-03-046-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add convenience helpers to get a fs_info from a page or folio pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct page, folio, btrfs_root and btrfs_fs_info. The latter can't be static inlines as this would create loop between ctree.h <-> fs.h, or the headers would have to be restructured. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: add helpers to get inode from page/folio pointersDavid Sterba2024-03-044-6/+12
| | | | | | | | | | | | | | | | | | | | | | Add convenience helpers to get a struct btrfs_inode from a page or folio pointer instead of open coding the chain or intermediate BTRFS_I. This is implemented as a macro (still with type checking) so we don't need full definitions of struct page or address_space. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: tests: allocate dummy fs_info and root in test_find_delalloc()David Sterba2024-03-041-4/+24
| | | | | | | | | | | | | | Allocate fs_info and root to have a valid fs_info pointer in case it's dereferenced by a helper outside of tests, like find_lock_delalloc_range(). Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: mark __btrfs_add_free_space staticLijuan Li2024-03-042-3/+1
| | | | | | | | | | | | | | | | | | | | __btrfs_add_free_space is only used in free-space-cache.c, so mark it static. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: move transaction abort to the error site btrfs_rebuild_free_space_tree()David Sterba2024-03-041-8/+10
| | | | | | | | | | | | | | | | | | | | The recommended pattern for transaction abort after error is to place it right after the error is handled. That way it's easier to locate where it failed and help debugging. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: move transaction abort to the error site in ↵David Sterba2024-03-041-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | btrfs_create_free_space_tree() The recommended pattern for transaction abort after error is to place it right after the error is handled. That way it's easier to locate where it failed and help debugging. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: move transaction abort to the error site in ↵David Sterba2024-03-041-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | btrfs_delete_free_space_tree() The recommended pattern for transaction abort after error is to place it right after the error is handled. That way it's easier to locate where it failed and help debugging. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: unify handling of return values of btrfs_insert_empty_items()David Sterba2024-03-043-4/+5
| | | | | | | | | | | | | | | | | | | | | | The error values returned by btrfs_insert_empty_items() are following the common patter of 0/-errno, but some callers check for a value > 0, which can't happen. Document that and update calls to not expect positive values. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: change BUG_ON to assertion in reset_balance_state()David Sterba2024-03-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | The balance state machine is complex so it's good to verify the assumptions in helpers, however reset_balance_state() is used at the end of balance and fs_info::balance_ctl is properly set up before and protected by the exclusive op ownership in btrfs_balance(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: change BUG_ON to assertion when verifying root in ↵David Sterba2024-03-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | btrfs_alloc_reserved_file_extent() The file extents are normally reserved in subvolume roots but could be also in the data reloc tree. Change the BUG_ON to assertions as this verifies the usage assumptions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: change BUG_ON to assertion when verifying lockdep class setupDavid Sterba2024-03-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The BUG_ON in btrfs_set_buffer_lockdep_class() is a sanity check of the level which is verified in callers, e.g. when initializing an extent buffer or reading from an eb header. Change it to an assertion as this would not happen unless things are really bad and would fail elsewhere too. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: change BUG_ON to assertion in btrfs_read_roots()David Sterba2024-03-041-1/+1
| | | | | | | | | | | | | | | | | | | | There's one caller of btrfs_read_roots() and that already uses the tree_root pointer, it's pointless to BUG_ON on it. As it's an assumption of the initialization helpers make it an assert instead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: defrag: change BUG_ON to assertion in btrfs_defrag_leaves()David Sterba2024-03-041-1/+1
| | | | | | | | | | | | | | | | | | | | The BUG_ON verifies a condition that should be guaranteed by the correct use of the path search (with keep_locks and lowest_level set), an assertion is the suitable check. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: change BUG_ON to assertion when checking for delayed_node rootDavid Sterba2024-03-041-1/+1
| | | | | | | | | | | | | | | | | | The pointer to root is initialized in btrfs_init_delayed_node(), no need to check for it again. Change the BUG_ON to assertion. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: delayed-inode: drop pointless BUG_ON in __btrfs_remove_delayed_item()David Sterba2024-03-041-2/+0
| | | | | | | | | | | | | | | | | | | | There's a BUG_ON checking for a valid pointer of fs_info::delayed_root but it is valid since init_mount_fs_info() and has the same lifetime as fs_info. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: export: handle invalid inode or root reference in btrfs_get_parent()David Sterba2024-03-041-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | The get_parent handler looks up a parent of a given dentry, this can be either a subvolume or a directory. The search is set up with offset -1 but it's never expected to find such item, as it would break allowed range of inode number or a root id. This means it's a corruption (ext4 also returns this error code). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle invalid extent item reference found in check_committed_ref()David Sterba2024-03-041-1/+8
| | | | | | | | | | | | | | | | | | | | | | The check_committed_ref() helper looks up an extent item by a key, allowing to do an inexact search when key->offset is -1. It's never expected to find such item, as it would break the allowed range of a extent item offset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle chunk tree lookup error in btrfs_relocate_sys_chunks()David Sterba2024-03-041-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The unhandled case in btrfs_relocate_sys_chunks() loop is a corruption, as it could be caused only by two impossible conditions: - at first the search key is set up to look for a chunk tree item, with offset -1, this is an inexact search and the key->offset will contain the correct offset upon a successful search, a valid chunk tree item cannot have an offset -1 - after first successful search, the found_key corresponds to a chunk item, the offset is decremented by 1 before the next loop, it's impossible to find a chunk item there due to alignment and size constraints Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle invalid root reference found in btrfs_init_root_free_objectid()David Sterba2024-03-041-1/+8
| | | | | | | | | | | | | | | | | | | | The btrfs_init_root_free_objectid() looks up a root by a key, allowing to do an inexact search when key->offset is -1. It's never expected to find such item, as it would break the allowed range of a root id. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle invalid root reference found in btrfs_find_root()David Sterba2024-03-041-1/+8
| | | | | | | | | | | | | | | | | | | | The btrfs_find_root() looks up a root by a key, allowing to do an inexact search when key->offset is -1. It's never expected to find such item, as it would break allowed the range of a root id. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle root deletion lookup error in btrfs_del_root()David Sterba2024-03-041-2/+5
| | | | | | | | | | | | | | | | | | | | We're deleting a root and looking it up by key does not succeed, this is an inconsistent state and we can't do anything. All callers handle errors and abort a transaction. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle block group lookup error when it's being removedDavid Sterba2024-03-041-1/+3
| | | | | | | | | | | | | | | | | | | | The unlikely case of lookup error in btrfs_remove_block_group() can be handled properly, in its caller this would lead to a transaction abort. We can't do anything else, a block group must have been loaded first. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle invalid range and start in merge_extent_mapping()David Sterba2024-03-041-4/+5
| | | | | | | | | | | | | | | | | | | | | | Turn a BUG_ON to a properly handled error and update the error message in the caller. It is expected that @em_in and @start passed to btrfs_add_extent_mapping() overlap. Besides tests, the only caller btrfs_get_extent() makes sure this is true. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle directory and dentry mismatch in btrfs_may_delete()David Sterba2024-03-041-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The helper btrfs_may_delete() is a copy of generic fs/namei.c:may_delete() to verify various conditions before deletion. There's a BUG_ON added before linux.git started, we can turn it to a proper error handling at least in our local helper. A mistmatch between directory and the deleted dentry is clearly invalid. This won't be probably ever hit due to the way how the parameters are set from the caller btrfs_ioctl_snap_destroy(), using a VFS helper lookup_one(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: use READ/WRITE_ONCE for fs_devices->read_policyNaohiro Aota2024-03-042-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we can read/modify the value from the sysfs interface concurrently, it would be better to protect it from compiler optimizations. Currently, there is only one read policy BTRFS_READ_POLICY_PID available, so no actual problem can happen now. This is a preparation for the future expansion. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: preallocate temporary extent buffer for inode logging when neededFilipe Manana2024-03-043-36/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When logging an inode and we require to copy items from subvolume leaves to the log tree, we clone each subvolume leaf and than use that clone to copy items to the log tree. This is required to avoid possible deadlocks as stated in commit 796787c978ef ("btrfs: do not modify log tree while holding a leaf from fs tree locked"). The cloning requires allocating an extent buffer (struct extent_buffer) and then allocating pages (folios) to attach to the extent buffer. This may be slow in case we are under memory pressure, and since we are doing the cloning while holding a read lock on a subvolume leaf, it means we can be blocking other operations on that leaf for significant periods of time, which can increase latency on operations like creating other files, renaming files, etc. Similarly because we're under a log transaction, we may also cause extra delay on other tasks doing an fsync, because syncing the log requires waiting for tasks that joined a log transaction to exit the transaction. So to improve this, for any inode logging operation that needs to copy items from a subvolume leaf ("full sync" or "copy everything" bit set in the inode), preallocate a dummy extent buffer before locking any extent buffer from the subvolume tree, and even before joining a log transaction, add it to the log context and then use it when we need to copy items from a subvolume leaf to the log tree. This avoids making other operations get extra latency when waiting to lock a subvolume leaf that is used during inode logging and we are under heavy memory pressure. The following test script with bonnie++ was used to test this: $ cat test.sh #!/bin/bash DEV=/dev/sdh MNT=/mnt/sdh MOUNT_OPTIONS="-o ssd" MEMTOTAL_BYTES=`free -b | grep Mem: | awk '{ print $2 }'` NR_DIRECTORIES=20 NR_FILES=20480 DATASET_SIZE=$((MEMTOTAL_BYTES * 2 / 1048576)) DIRECTORY_SIZE=$((MEMTOTAL_BYTES * 2 / NR_FILES)) NR_FILES=$((NR_FILES / 1024)) echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT bonnie++ -u root -d $MNT \ -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \ -r 0 -s $DATASET_SIZE -b umount $MNT The results of this test on a 8G VM running a non-debug kernel (Debian's default kernel config), were the following. Before this change: Version 2.00a ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP debian0 7501M 376k 99 1.4g 96 117m 14 1510k 99 2.5g 95 +++++ +++ Latency 35068us 24976us 2944ms 30725us 71770us 26152us Version 2.00a ------Sequential Create------ --------Random Create-------- debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 20:384100:384100/20 20480 32 20480 58 20480 48 20480 39 20480 56 20480 61 Latency 411ms 11914us 119ms 617ms 10296us 110ms After this change: Version 2.00a ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP debian0 7501M 375k 99 1.4g 97 117m 14 1546k 99 2.3g 98 +++++ +++ Latency 35975us 20945us 2144ms 10297us 2217us 6004us Version 2.00a ------Sequential Create------ --------Random Create-------- debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 20:384100:384100/20 20480 35 20480 58 20480 48 20480 40 20480 57 20480 59 Latency 320ms 11237us 77779us 518ms 6470us 86389us Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: add comment about list_is_singular() use at btrfs_delete_unused_bgs()Filipe Manana2024-03-041-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At btrfs_delete_unused_bgs(), the use of the list_is_singular() check on a block group may not be immediately obvious. It is there to prevent losing raid profile information for a block group type (data, metadata or system), as that information is removed from fs_info->avail_[data|metadata|system]_alloc_bits when the last block group of a given type is deleted. So deleting the block group would later result in creating block groups of that type with a single profile (because fs_info->avail_*_alloc_bits would have a value of 0). This check was added in commit aefbe9a633b5 ("btrfs: Fix lost-data-profile caused by auto removing bg"). So add a comment mentioning the need for the check. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: document what the spinlock unused_bgs_lock protectsFilipe Manana2024-03-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add some comments to struct btrfs_fs_info to explicitly document which members are protected by the spinlock unused_bgs_lock. It is currently used to protect two linked lists, the reclaim_bgs and unused_bgs lists. So add an explicit comment on top of each list to mention its protected by unused_bgs_lock, as well as comment on top of unused_bgs_lock to mention the lists it protects. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: make btrfs_error_unpin_extent_range() return voidDavid Sterba2024-03-042-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | This helper is used in transaction abort or cleanup context and the callers cannot handle all errors, only do best effort. btrfs_cleanup_one_transaction btrfs_destroy_delayed_refs btrfs_error_unpin_extent_range btrfs_destroy_pinned_extent btrfs_error_unpin_extent_range Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: return errors from unpin_extent_range()David Sterba2024-03-042-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | Handle the lookup failure of the block group to unpin, this is a logic error as the block group must exist at this point. If not, something else must have freed it, like clean_pinned_extents() would do without locking the unused_bg_unpin_mutex. Push the errors to the callers, proper handling will be done in followup patches. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: handle errors returned from unpin_extent_cache()David Sterba2024-03-042-3/+16
| | | | | | | | | | | | | | | | | | | | We've had numerous attempts to let function unpin_extent_cache() return void as it only returns 0. There are still error cases to handle so do that, in addition to the verbose messages. The only caller btrfs_finish_one_ordered() will now abort the transaction, previously it let it continue which could lead to further problems. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: zlib: Fix spelling mistake "infalte" -> "inflate"Colin Ian King2024-03-041-1/+1
| | | | | | | | | | | | | | | | There is a spelling mistake in a warning message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: zstd: fix and simplify the inline extent decompression (v2)Qu Wenruo2024-03-042-54/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Note: this is a fixed version that was previously reverted as e01a83e12604 ("Revert "btrfs: zstd: fix and simplify the inline extent decompression""), with fixed parameters to memzero_page(). [BUG] If we have a filesystem with 4k sectorsize, and an inlined compressed extent created like this: item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160 generation 8 transid 8 size 4096 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24 index 2 namelen 14 name: source_inlined item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69 generation 8 type 0 (inline) inline extent data size 48 ram_bytes 4096 compression 3 (zstd) Then trying to reflink that extent in an aarch64 system with 64K page size, the reflink would just fail: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest XFS_IOC_CLONE_RANGE: Input/output error [CAUSE] In zstd_decompress(), we didn't treat @start_byte as just a page offset, but also use it as an indicator on whether we should error out, without any proper explanation (this is copied from other decompression code). In reality, for subpage cases, although @start_byte can be non-zero, we should never switch input/output buffer nor error out, since the whole input/output buffer should never exceed one sector, thus we should not need to do any buffer switch. Thus the current code using @start_byte as a condition to switch input/output buffer or finish the decompression is completely incorrect. [FIX] The fix involves several modification: - Rename @start_byte to @dest_pgoff to properly express its meaning - Use @sectorsize other than PAGE_SIZE to properly initialize the output buffer size - Use correct destination offset inside the destination page - Simplify the main loop Since the input/output buffer should never switch, we only need one zstd_decompress_stream() call. - Consider early end as an error After the fix, even on 64K page sized aarch64, above reflink now works as expected: # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest linked 4096/4096 bytes at offset 61440 And results the correct file layout: item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160 generation 10 transid 10 size 65536 nbytes 4096 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 1 flags 0x0(none) item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14 index 3 namelen 4 name: dest item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR transid 10 data_len 37 name_len 16 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53 generation 10 type 1 (regular) extent data disk byte 13631488 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: remove unused included headersDavid Sterba2024-03-0442-65/+6
| | | | | | | | | | | | | | | | | | | | | | With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>