summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/free-space-cache.c
Commit message (Collapse)AuthorAgeFilesLines
* btrfs: add inode to truncate controlJosef Bacik2022-01-071-1/+2
| | | | | | | | | | | | | | In the future we're going to want to use btrfs_truncate_inode_items without looking up the associated inode. In order to accommodate this add the inode to btrfs_truncate_control and handle the case where control->inode is NULL appropriately. This is fairly straightforward, we simply need to add a helper for the trace points, as the file extent map update is controlled by a flag on btrfs_truncate_control. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: pass the ino via truncate controlJosef Bacik2022-01-071-0/+1
| | | | | | | | | | | | In the future we are going to want to truncate inode items without needing to have an btrfs_inode to pass in, so add ino to the btrfs_truncate_control and use that to look up the inode items to truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use a flag to control when to clear the file extent rangeJosef Bacik2022-01-071-0/+1
| | | | | | | | | | | | We only care about updating the file extent range when we are doing a normal truncation. We skip this for tree logging currently, but we can also skip this for eviction as well. Using a flag makes it more explicit when we want to do this work. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: only call inode_sub_bytes in truncate paths that careJosef Bacik2022-01-071-0/+1
| | | | | | | | | | | | | We currently have a bunch of awkward checks to make sure we only update the inode i_bytes if we're truncating the real inode. Instead keep track of the number of bytes we need to sub in the btrfs_truncate_control, and then do the appropriate adjustment in the truncate paths that care. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: only update i_size in truncate paths that careJosef Bacik2022-01-071-0/+3
| | | | | | | | | | | | | | We currently will update the i_size of the inode as we truncate it down, however we skip this if we're calling btrfs_truncate_inode_items from the tree log code. However we also don't care about this in the case of evict. Instead keep track of this value in the btrfs_truncate_control and then have btrfs_truncate() and the free space cache truncate path both do the i_size update themselves. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: add truncate control structJosef Bacik2022-01-071-2/+5
| | | | | | | | | | | I'm going to be adding more arguments and counters to btrfs_truncate_inode_items, so add a control struct to handle all of the extra arguments to make it easier to follow. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move extent locking outside of btrfs_truncate_inode_itemsJosef Bacik2022-01-071-7/+13
| | | | | | | | | | | | | | | | | | | | | | | | | Currently we are locking the extent and dropping the extent cache for any inodes we truncate, unless they're in the tree log. We call this helper from: - truncate - evict - tree log - free space cache truncation For evict we've already dropped all of the extent cache for this inode once we've gotten here, and we're the only one accessing this inode, so this step is unnecessary. For the tree log code we already skip this part. Pull this work into the truncate path and the free space cache truncation path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: add an inode-item.hJosef Bacik2022-01-071-0/+1
| | | | | | | | | | | | We have a few helpers in inode-item.c, and I'm going to make a few changes to how we do truncate in the future, so break out these definitions into their own header file to trim down ctree.h some and make it easier to do the work on truncate in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: change name and type of private member of btrfs_free_space_ctlNikolay Borisov2022-01-031-6/+6
| | | | | | | | | | | btrfs_free_space_ctl::private is either unset or it always points to struct btrfs_block_group when it is set. So there's no point in keeping the unhelpful 'private' name and keeping it an untyped pointer. Change both the type and name to be self-describing. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: make __btrfs_add_free_space take just block group referenceNikolay Borisov2022-01-031-15/+10
| | | | | | | | | | | There is no point in the function taking an fs_info and a btrfs_free_space because the ctl passed always belongs to the block group. Furthermore fs_info can be referenced from the block group. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: consolidate unlink_free_space/__unlink_free_space functionsNikolay Borisov2022-01-031-25/+15
| | | | | | | | | | | The only difference between the two is whether btrfs_free_space::bytes is adjusted. Instead of having 2 separate functions control this behavior via an additional parameter and make them one function instead. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: consolidate bitmap_clear_bits/__bitmap_clear_bitsNikolay Borisov2022-01-031-24/+13
| | | | | | | | | | | The only difference is the former adjusts btrfs_free_space::bytes member. Consolidate the two function into 1 and add a bool parameter which controls whether the adjustment is made or not. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: eliminate if in main loop in tree_search_offsetNikolay Borisov2022-01-031-7/+4
| | | | | | | | | Reshuffle the code inside the first loop of tree_search_offset so that one if() is eliminated and the becomes more linear. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get next entry in tree_search_offset before doing checksNikolay Borisov2022-01-031-5/+4
| | | | | | | | | | | This is a small optimisation since the currently 'entry' is already checked in the if () {} else if {} construct above the loop. In essence the first iteration of the final while loop is redundant. To eliminate this extra check simply get the next entry at the beginning of the loop. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: index free space entries on sizeJosef Bacik2022-01-031-44/+137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we index free space on offset only, because usually we have a hint from the allocator that we want to honor for locality reasons. However if we fail to use this hint we have to go back to a brute force search through the free space entries to find a large enough extent. With sufficiently fragmented free space this becomes quite expensive, as we have to linearly search all of the free space entries to find if we have a part that's long enough. To fix this add a cached rb tree to index based on free space entry bytes. This will allow us to quickly look up the largest chunk in the free space tree for this block group, and stop searching once we've found an entry that is too small to satisfy our allocation. We simply choose to use this tree if we're searching from the beginning of the block group, as we know we do not care about locality at that point. I wrote an allocator test that creates a 10TiB ram backed null block device and then fallocates random files until the file system is full. I think go through and delete all of the odd files. Then I spawn 8 threads that fallocate 64MiB files (1/2 our extent size cap) until the file system is full again. I use bcc's funclatency to measure the latency of find_free_extent. The baseline results are nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 10356 |**** | 512 -> 1023 : 58242 |************************* | 1024 -> 2047 : 74418 |******************************** | 2048 -> 4095 : 90393 |****************************************| 4096 -> 8191 : 79119 |*********************************** | 8192 -> 16383 : 35614 |*************** | 16384 -> 32767 : 13418 |***** | 32768 -> 65535 : 12811 |***** | 65536 -> 131071 : 17090 |******* | 131072 -> 262143 : 26465 |*********** | 262144 -> 524287 : 40179 |***************** | 524288 -> 1048575 : 55469 |************************ | 1048576 -> 2097151 : 48807 |********************* | 2097152 -> 4194303 : 26744 |*********** | 4194304 -> 8388607 : 35351 |*************** | 8388608 -> 16777215 : 13918 |****** | 16777216 -> 33554431 : 21 | | avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690 And the patch results are nsecs : count distribution 0 -> 1 : 0 | | 2 -> 3 : 0 | | 4 -> 7 : 0 | | 8 -> 15 : 0 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 0 | | 256 -> 511 : 6883 |** | 512 -> 1023 : 54346 |********************* | 1024 -> 2047 : 79170 |******************************** | 2048 -> 4095 : 98890 |****************************************| 4096 -> 8191 : 81911 |********************************* | 8192 -> 16383 : 27075 |********** | 16384 -> 32767 : 14668 |***** | 32768 -> 65535 : 13251 |***** | 65536 -> 131071 : 15340 |****** | 131072 -> 262143 : 26715 |********** | 262144 -> 524287 : 43274 |***************** | 524288 -> 1048575 : 53870 |********************* | 1048576 -> 2097151 : 55368 |********************** | 2097152 -> 4194303 : 41036 |**************** | 4194304 -> 8388607 : 24927 |********** | 8388608 -> 16777215 : 33 | | 16777216 -> 33554431 : 9 | | avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042 There's a little variation in the amount of calls done because of timing of the threads with metadata requirements, but the avg, total, and count's are relatively consistent between runs (usually within 2-5% of each other). As you can see here we have around a 30% decrease in average latency with a 30% decrease in overall time spent in find_free_extent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: only use ->max_extent_size if it is set in the bitmapJosef Bacik2022-01-031-1/+25
| | | | | | | | | | | | | | | | | | While adding self tests for my space index change I was hitting a problem where the space indexed tree wasn't returning the expected ->max_extent_size. This is because we will skip searching any entry that doesn't have ->bytes >= the amount of bytes we want. However we'll still set the max_extent_size based on that entry. The problem is if we don't search the bitmap we won't have ->max_extent_size set properly, so we can't really trust it. This doesn't really result in a problem per-se, it can just result in us not finding contiguous area that may exist. Fix the max_extent_size helper to return ->bytes if ->max_extent_size isn't set, and add a big comment explaining why we're doing this. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: subpage: add bitmap for PageChecked flagQu Wenruo2021-10-261-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although in btrfs we have very limited usage of PageChecked flag, it's still some page flag not yet subpage compatible. Fix it by introducing btrfs_subpage::checked_offset to do the convert. For most call sites, especially for free-space cache, COW fixup and btrfs_invalidatepage(), they all work in full page mode anyway. For other call sites, they work as subpage compatible mode. Some call sites need extra modification: - btrfs_drop_pages() Needs extra parameter to get the real range we need to clear checked flag. Also since btrfs_drop_pages() will accept pages beyond the dirtied range, update btrfs_subpage_clamp_range() to handle such case by setting @len to 0 if the page is beyond target range. - btrfs_invalidatepage() We need to call subpage helper before calling __btrfs_releasepage(), or it will trigger ASSERT() as page->private will be cleared. - btrfs_verify_data_csum() In theory we don't need the io_bio->csum check anymore, but it's won't hurt. Just change the comment. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: implement active zone trackingNaohiro Aota2021-10-261-2/+3
| | | | | | | | | | | | | | | | | Add zone_is_active flag to btrfs_block_group. This flag indicates the underlying zones are all active. Such zone active block groups are tracked by fs_info->active_bg_list. btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying device part. They set/clear the bitmap to indicate zone activeness and count the number of zones we can activate left. btrfs_zone_{activate,finish}() take responsibility for the logical part and the list management. In addition, btrfs_zone_finish() wait for any writes on it and send REQ_OP_ZONE_FINISH to the zone. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: tweak reclaim threshold for zone capacityNaohiro Aota2021-10-261-2/+6
| | | | | | | | | | | With the introduction of zone capacity, the range [capacity, length] is always zone unusable. Counting this region as a reclaim target will cause reclaiming too early. Reclaim block groups based on bytes that can be usable after resetting. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: calculate free space from zone capacityNaohiro Aota2021-10-261-1/+6
| | | | | | | | | | | Now that we introduced capacity in a block group, we need to calculate free space using the capacity instead of the length. Thus, bytes we account capacity - alloc_pointer as free, and account bytes [capacity, length] as zone unusable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: fix block group alloc_offset calculationNaohiro Aota2021-08-231-2/+5
| | | | | | | | | | | alloc_offset is offset from the start of a block group and @offset is actually an address in logical space. Thus, we need to consider block_group->start when calculating them. Fixes: 011b41bffa3d ("btrfs: zoned: advance allocation pointer after tree log node") CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: allow disabling of zone auto reclaimJohannes Thumshirn2021-08-231-3/+4
| | | | | | | | | | | | | | Automatically reclaiming dirty zones might not always be desired for all workloads, especially as there are currently still some rough edges with the relocation code on zoned filesystems. Allow disabling zone auto reclaim on a per filesystem basis by writing 0 as the threshold value. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: allocate file_ra_state on stack in readahead_cacheGoldwyn Rodrigues2021-08-231-9/+3
| | | | | | | | | | Instead of allocating file_ra_state using kmalloc, allocate on stack. sizeof(struct readahead) = 32 bytes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: don't set the full sync flag when truncation does not touch extentsFilipe Manana2021-06-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At btrfs_truncate() where we truncate the inode either to the same size or to a smaller size, we always set the full sync flag on the inode. This is needed in case the truncation drops or trims any file extent items that start beyond or cross the new inode size, so that the next fsync drops all inode items from the log and scans again the fs/subvolume tree to find all items that must be logged. However if the truncation does not drop or trims any file extent items, we do not need to set the full sync flag and force the next fsync to use the slow code path. So do not set the full sync flag in such cases. One use case where it is frequent to do truncations that do not change the inode size and do not drop any extents (no prealloc extents beyond i_size) is when running Microsoft's SQL Server inside a Docker container. One example workload is the one Philipp Fent reported recently, in the thread with a link below. In this workload a large number of fsyncs are preceded by such truncate operations. After this change I constantly get the runtime for that workload from Philipp to be reduced by about -12%, for example from 184 seconds down to 162 seconds. Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/ Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: initialize return variable in cleanup_free_space_cache_v1Tom Rix2021-05-041-1/+1
| | | | | | | | | | | | | | | | | | Static analysis reports this problem free-space-cache.c:3965:2: warning: Undefined or garbage value returned return ret; ^~~~~~~~~~ ret is set in the node handling loop. Treat doing nothing as a success and initialize ret to 0, although it's unlikely the loop would be skipped. We always have block groups, but as it could lead to transaction abort in the caller it's better to be safe. CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: automatically reclaim zonesJohannes Thumshirn2021-04-201-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a file gets deleted on a zoned file system, the space freed is not returned back into the block group's free space, but is migrated to zone_unusable. As this zone_unusable space is behind the current write pointer it is not possible to use it for new allocations. In the current implementation a zone is reset once all of the block group's space is accounted as zone unusable. This behaviour can lead to premature ENOSPC errors on a busy file system. Instead of only reclaiming the zone once it is completely unusable, kick off a reclaim job once the amount of unusable bytes exceeds a user configurable threshold between 51% and 100%. It can be set per mounted filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75% by default. Similar to reclaiming unused block groups, these dirty block groups are added to a to_reclaim list and then on a transaction commit, the reclaim process is triggered but after we deleted unused block groups, which will free space for the relocation process. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: do not account freed region of read-only block group as ↵Naohiro Aota2021-03-041-1/+6
| | | | | | | | | | | | | | zone_unusable We migrate zone unusable bytes to read-only bytes when a block group is set to read-only, and account all the free region as bytes_readonly. Thus, we should not increase block_group->zone_unusable when the block group is read-only. Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: avoid double put of block group when emptying clusterJosef Bacik2021-02-221-4/+4
| | | | | | | | | | | | | | | | | | It's wrong calling btrfs_put_block_group in __btrfs_return_cluster_to_free_space if the block group passed is different than the block group the cluster represents. As this means the cluster doesn't have a reference to the passed block group. This results in double put and a use-after-free bug. Fix this by simply bailing if the block group we passed in does not match the block group on the cluster. Fixes: fa9c0d795f7b ("Btrfs: rework allocation clustering") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix race between extent freeing/allocation when using bitmapsNikolay Borisov2021-02-221-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During allocation the allocator will try to allocate an extent using cluster policy. Once the current cluster is exhausted it will remove the entry under btrfs_free_cluster::lock and subsequently acquire btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem because there exists a race condition between removing the entry under one lock and doing the necessary accounting holding a different lock since extent freeing only uses the 2nd lock. This can result in the following situation: T1: T2: btrfs_alloc_from_cluster insert_into_bitmap <holds tree_lock> if (entry->bytes == 0) if (block_group && !list_empty(&block_group->cluster_list)) { rb_erase(entry) spin_unlock(&cluster->lock); (total_bitmaps is still 4) spin_lock(&cluster->lock); <doesn't find entry in cluster->root> spin_lock(&ctl->tree_lock); <goes to new_bitmap label, adds <blocked since T2 holds tree_lock> <a new entry and calls add_new_bitmap> recalculate_thresholds <crashes, due to total_bitmaps becoming 5 and triggering an ASSERT> To fix this ensure that once depleted, the cluster entry is deleted when both cluster lock and tree locks are held in the allocator (T1), this ensures that even if there is a race with a concurrent insert_into_bitmap call it will correctly find the entry in the cluster and add the new space to it. CC: <stable@vger.kernel.org> # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: advance allocation pointer after tree log nodeNaohiro Aota2021-02-091-1/+15
| | | | | | | | | | | | | | | Since the allocation info of a tree log node is not recorded in the extent tree, calculate_alloc_pointer() cannot detect this node, so the pointer can be over a tree node. Replaying the log calls btrfs_remove_free_space() for each node in the log tree. So, advance the pointer after the node to not allocate over it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: implement sequential extent allocationNaohiro Aota2021-02-091-0/+6
| | | | | | | | | | | | | | | | | | | Implement a sequential extent allocator for zoned filesystems. This allocator only needs to check if there is enough space in the block group after the allocation pointer to satisfy the extent allocation request. Therefore the allocator never manages bitmaps or clusters. Also, add assertions to the corresponding functions. As zone append writing is used, it would be unnecessary to track the allocation offset, as the allocator only needs to check available space. But by tracking and returning the offset as an allocated region, we can skip modification of ordered extents and checksum information when there is no IO reordering. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: track unusable bytes for zonesNaohiro Aota2021-02-091-0/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | In a zoned filesystem a once written then freed region is not usable until the underlying zone has been reset. So we need to distinguish such unusable space from usable free space. Therefore we need to introduce the "zone_unusable" field to the block group structure, and "bytes_zone_unusable" to the space_info structure to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But, when an allocated region is returned before using e.g., the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For the situation, this commit introduces "btrfs_add_free_space_unused". This behaves the same as btrfs_add_free_space() on regular filesystem. On zoned filesystems, it rewinds the allocation offset. Because the read-only bytes tracks free but unusable bytes when the block group is read-only, we need to migrate the zone_unusable bytes to read-only bytes when a block group is marked read-only. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: introduce btrfs_subpage for data inodesQu Wenruo2021-02-081-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To support subpage sector size, data also need extra info to make sure which sectors in a page are uptodate/dirty/... This patch will make pages for data inodes get btrfs_subpage structure attached, and detached when the page is freed. This patch also slightly changes the timing when set_page_extent_mapped() is called to make sure: - We have page->mapping set page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after page is mapped to an inode. One call site attaches pages to inode manually, thus we have to modify the timing of set_page_extent_mapped() a bit. - As soon as possible, before other operations Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simply the error handling for several call sites. The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases. Currently the plan is to switch iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page, data balance require this feature). So we will stick to btrfs specific bitmap for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: improve parameter description for __btrfs_write_out_cacheNikolay Borisov2021-02-081-5/+8
| | | | | | | | | | | | | | Fixes following W=1 warnings: fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'root' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'inode' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'ctl' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'block_group' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'io_ctl' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'trans' not described in '__btrfs_write_out_cache' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectidNikolay Borisov2021-02-081-1/+1
| | | | | | | | | | This better reflects the semantics of the function i.e no search is performed whatsoever. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: clarify error returns values in __load_free_space_cacheZhihao Cheng2021-02-081-1/+5
| | | | | | | | | | | | | | Return value in __load_free_space_cache is not properly set after (unlikely) memory allocation failures and 0 is returned instead. This is not a problem for the caller load_free_space_cache because only value 1 is considered as 'cache loaded' but for clarity it's better to set the errors accordingly. Fixes: a67509c30079 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove free space items when disabling space cache v1Boris Burkov2020-12-091-4/+95
| | | | | | | | | | | | | | | | | | When the filesystem transitions from space cache v1 to v2 or to nospace_cache, it removes the old cached data, but does not remove the FREE_SPACE items nor the free space inodes they point to. This doesn't cause any issues besides being a bit inefficient, since these items no longer do anything useful. To fix it, when we are mounting, and plan to disable the space cache, destroy each block group's free space item and free space inode. The code to remove the items is lifted from the existing use case of removing the block group, with a light adaptation to handle whether or not we have already looked up the free space inode. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: keep sb cache_generation consistent with space_cacheBoris Burkov2020-12-091-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | When mounting, btrfs uses the cache_generation in the super block to determine if space cache v1 is in use. However, by mounting with nospace_cache or space_cache=v2, it is possible to disable space cache v1, which does not result in un-setting cache_generation back to 0. In order to base some logic, like mount option printing in /proc/mounts, on the current state of the space cache rather than just the values of the mount option, keep the value of cache_generation consistent with the status of space cache v1. We ensure that cache_generation > 0 iff the file system is using space_cache v1. This requires committing a transaction on any mount which changes whether we are using v1. (v1->nospace_cache, v1->v2, nospace_cache->v1, v2->v1). Since the mechanism for writing out the cache generation is transaction commit, but we want some finer grained control over when we un-set it, we can't just rely on the SPACE_CACHE mount option, and introduce an fs_info flag that mount can use when it wants to unset the generation. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove recalc_thresholds from free space opsDavid Sterba2020-12-091-44/+42
| | | | | | | | | | | | After removing the inode number cache that was using the free space cache code, we can remove at least the recalc_thresholds callback from the ops. Both code and tests use the same callback function. It's moved before its first use. The use_bitmaps callback is still needed by tests to create some extents/bitmap setup. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: always set NODATASUM/NODATACOW in __create_free_space_inodeNikolay Borisov2020-12-091-5/+3
| | | | | | | | | Since it's being used solely for the freespace cache unconditionally set the flags required for it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove crc_check logic from free spaceNikolay Borisov2020-12-091-32/+5
| | | | | | | | | | | Following removal of the ino cache io_ctl_init will be called only on behalf of the freespace inode. In this case we always want to check CRCs so conditional code that depended on io_ctl::check_crc can be removed. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove inode number cache featureNikolay Borisov2020-12-091-177/+0
| | | | | | | | | | | | | | | | | | It's been deprecated since commit b547a88ea577 ("btrfs: start deprecation of mount option inode_cache") which enumerates the reasons. A filesystem that uses the feature (mount -o inode_cache) tracks the inode numbers in bitmaps, that data stay on the filesystem after this patch. The size is roughly 5MiB for 1M inodes [1], which is considered small enough to be left there. Removal of the change can be implemented in btrfs-progs if needed. [1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: make btrfs_update_inode take btrfs_inodeNikolay Borisov2020-12-081-3/+3
| | | | | | | Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: make btrfs_truncate_inode_items take btrfs_inodeNikolay Borisov2020-12-081-1/+1
| | | | | | | Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: load free space cache into a temporary ctlJosef Bacik2020-12-081-89/+66
| | | | | | | | | | | | | | | | | | | | | | | The free space cache has been special in that we would load it right away instead of farming the work off to a worker thread. This resulted in some weirdness that had to be taken into account for this fact, namely that if we every found a block group being cached the fast way we had to wait for it to finish, because we could get the cache before it had been validated and we may throw the cache away. To handle this particular case instead create a temporary btrfs_free_space_ctl to load the free space cache into. Then once we've validated that it makes sense, copy it's contents into the actual block_group->free_space_ctl. This allows us to avoid the problems of needing to wait for the caching to complete, we can clean up the discard extent handling stuff in __load_free_space_cache, and we no longer need to do the merge_space_tree() because the space is added one by one into the real free_space_ctl. This will allow further reworks of how we handle loading the free space cache. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: cleanup btrfs_discard_update_discardable usageJosef Bacik2020-12-081-6/+8
| | | | | | | | | | | | | | | | | | | This passes in the block_group and the free_space_ctl, but we can get this from the block group itself. Part of this is because we call it from __load_free_space_cache, which can be called for the inode cache as well. Move that call into the block group specific load section, wrap it in the right lock that we need for the assertion (but otherwise this is safe without the lock because this happens in single-thread context). Fix up the arguments to only take the block group. Add a lockdep_assert as well for good measure to make sure we don't mess up the locking again. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages()Goldwyn Rodrigues2020-12-081-1/+1
| | | | | | | | | | | | | Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as opposed to calling set_extent_bits again later. Fold check for written length within the function. Note: EXTENT_NORESERVE is set before unlocking extents. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: free-space-cache: use unaligned helpers to access dataDavid Sterba2020-10-071-12/+9
| | | | | | | | | | The free space inode stores the tracking data, checksums etc, using the io_ctl structure and moving the pointers. The data are generally aligned to at least 4 bytes (u32 for CRC) so it's not completely unaligned but for clarity we should use the proper helpers whenever a struct is initialized from io_ctl->cur pointer. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: delete duplicated words + other fixes in commentsRandy Dunlap2020-10-071-1/+1
| | | | | | | | | | | Delete repeated words in fs/btrfs/. {to, the, a, and old} and change "into 2 part" to "into 2 parts". Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix space cache memory leak after transaction abortFilipe Manana2020-08-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a transaction aborts it can cause a memory leak of the pages array of a block group's io_ctl structure. The following steps explain how that can happen: 1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED and it's about to start writing out dirty extent buffers; 2) Transaction N + 1 already started and another task, task A, just called btrfs_commit_transaction() on it; 3) Block group B was dirtied (extents allocated from it) by transaction N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the very beginning of the transaction commit, it starts writeback for the block group's space cache by calling btrfs_write_out_cache(), which allocates the pages array for the block group's io_ctl with a call to io_ctl_init(). Block group A is added to the io_list of transaction N + 1 by btrfs_start_dirty_block_groups(); 4) While transaction N's commit is writing out the extent buffers, it gets an IO error and aborts transaction N, also setting the file system to RO mode; 5) Task A has already returned from btrfs_start_dirty_block_groups(), is at btrfs_commit_transaction() and has set transaction N + 1 state to TRANS_STATE_COMMIT_START. Immediately after that it checks that the filesystem was turned to RO mode, due to transaction N's abort, and jumps to the "cleanup_transaction" label. After that we end up at btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs(). That helper finds block group B in the transaction's io_list but it never releases the pages array of the block group's io_ctl, resulting in a memory leak. In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages array points to pages that were already released by us at __btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end up freeing the pages array only after waiting for the ordered extent to complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do that. But in the transaction abort case we don't wait for the space cache's ordered extent to complete through a call to btrfs_wait_cache_io(), so that's why we end up with a memory leak - we wait for the ordered extent to complete indirectly by shutting down the work queues and waiting for any jobs in them to complete before returning from close_ctree(). We can solve the leak simply by freeing the pages array right after releasing the pages (with the call to io_ctl_drop_pages()) at __btrfs_write_out_cache(), since we will never use it anymore after that and the pages array points to already released pages at that point, which is currently not a problem since no one will use it after that, but not a good practice anyway since it can easily lead to use-after-free issues. So fix this by freeing the pages array right after releasing the pages at __btrfs_write_out_cache(). This issue can often be reproduced with test case generic/475 from fstests and kmemleak can detect it and reports it with the following trace: unreferenced object 0xffff9bbf009fa600 (size 512): comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s) hex dump (first 32 bytes): 00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff ..|M=...@.|M=... 80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff ..|M=.....|M=... backtrace: [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0 [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs] [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs] [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs] [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs] [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs] [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs] [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs] [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs] [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0 [<0000000043ace2c9>] do_syscall_64+0x33/0x80 [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>