path: root/fs/btrfs/extent-tree.c
* btrfs: reserve delalloc metadata differently (Josef Bacik, 2019-05-02, 1 file, -92/+52)

  With the per-inode block reserves we started refilling the reserve based on
  the calculated size of the outstanding csum bytes and extents for the inode,
  including the amount we were adding with the new operation. However,
  generic/224 exposed a problem with this approach. With 1000 files all
  writing at the same time we ended up with a bunch of bytes being reserved
  but unusable.

  When you write to a file we reserve space for the csum leaves for those
  bytes, the number of extent items required to cover those bytes, and a
  single transaction item for updating the inode at ordered extent finish for
  that range of bytes. This is held until the ordered extent finishes and we
  release all of the reserved space.

  If a second write comes in at this point we would add a single reservation
  for the new outstanding extent and however many reservations for the csum
  leaves. At this point we find the delta of how much we have reserved and
  how much outstanding size this is and attempt to reserve this delta. If the
  first write finishes it will not release any space, because the space it
  had reserved for the initial write is still needed for the second write.
  However some space would have been used, as we have added csums, extent
  items, and dirtied the inode. Our reserved space would be > 0 but less than
  the total needed reserved space.

  This is just for a single inode; now consider generic/224. This has 1000
  inodes writing in parallel to a very small file system, 1GiB. In my testing
  this usually means we get about a 120MiB metadata area to work with, more
  than enough to allow the writes to continue, but not enough if all of the
  inodes are stuck trying to reserve the slack space while continuing to hold
  their leftovers from their initial writes.

  Fix this by pre-reserving _only_ the space we are currently trying to add.
  Then once that is successful modify the inode's csum count and outstanding
  extents, and then add the newly reserved space to the inode's block_rsv.
  This allows us to actually pass generic/224 without running out of metadata
  space.

  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
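  As an illustration only, a minimal userspace model of the fixed flow, with
  hypothetical names (reserve_bytes, delalloc_reserve) standing in for the
  btrfs internals: reserve just the space the current operation adds, and
  update the inode's counters and block_rsv only after that succeeds.

      #include <stdbool.h>
      #include <stdint.h>

      static uint64_t free_meta = 128ULL << 20; /* toy global metadata pool */

      static bool reserve_bytes(uint64_t bytes) /* may fail under pressure */
      {
          if (bytes > free_meta)
              return false;
          free_meta -= bytes;
          return true;
      }

      struct inode_rsv {
          uint64_t reserved;            /* bytes held in the block_rsv */
          uint32_t outstanding_extents;
      };

      static int delalloc_reserve(struct inode_rsv *rsv, uint64_t needed)
      {
          /* Pre-reserve only the space this operation adds. */
          if (!reserve_bytes(needed))
              return -1;                /* -ENOSPC in the kernel */
          /* Only on success, bump the counters and fold the fresh
           * reservation into the inode's block_rsv. */
          rsv->outstanding_extents++;
          rsv->reserved += needed;
          return 0;
      }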
* btrfs: track DIO bytes in flight (Josef Bacik, 2019-04-29, 1 file, -2/+13)

  When diagnosing a slowdown of generic/224 I noticed we were not doing
  anything when calling into shrink_delalloc(). This is because all writes in
  224 are O_DIRECT, not delalloc, and thus our delalloc_bytes counter is 0,
  which short circuits most of the work inside of shrink_delalloc(). However
  O_DIRECT writes still consume metadata resources and generate ordered
  extents, which we can still wait on.

  Fix this by tracking outstanding DIO write bytes, and use this as well as
  the delalloc bytes counter to decide if we need to look up and wait on any
  ordered extents. If we have more DIO writes than delalloc bytes we'll go
  ahead and wait on any ordered extents regardless of our flush state, as
  flushing delalloc is likely to not gain us anything.

  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  [ use dio instead of odirect in identifiers ]
  Signed-off-by: David Sterba <dsterba@suse.com>
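  A toy model of that accounting, with illustrative names (dio_bytes,
  must_wait_on_ordered) rather than the kernel's:

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdint.h>

      static atomic_ullong dio_bytes;      /* O_DIRECT bytes in flight */
      static atomic_ullong delalloc_bytes; /* buffered dirty bytes */

      static void dio_write_begin(uint64_t len)
      {
          atomic_fetch_add(&dio_bytes, len);
      }

      static void dio_write_end(uint64_t len) /* at ordered extent finish */
      {
          atomic_fetch_sub(&dio_bytes, len);
      }

      /* shrink_delalloc()-style decision: if DIO dominates, flushing
       * delalloc gains nothing, so wait on ordered extents regardless
       * of the flush state. */
      static bool must_wait_on_ordered(void)
      {
          unsigned long long dio = atomic_load(&dio_bytes);
          unsigned long long dal = atomic_load(&delalloc_bytes);

          return dio > 0 && dio >= dal;
      }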
* btrfs: remove unused parameter fs_info from btrfs_set_disk_extent_flags (David Sterba, 2019-04-29, 1 file, -2/+1)

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove unused parameter fs_info from btrfs_add_delayed_extent_op (David Sterba, 2019-04-29, 1 file, -2/+1)

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove unused parameter fs_info from btrfs_extend_item (David Sterba, 2019-04-29, 1 file, -1/+1)

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove unused parameter fs_info from btrfs_truncate_item (David Sterba, 2019-04-29, 1 file, -2/+1)

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: qgroup: Don't scan leaf if we're modifying reloc tree (Qu Wenruo, 2019-04-29, 1 file, -4/+6)

  Since the reloc tree doesn't contribute to qgroup numbers, just skip it.
  This should catch the final cause of unnecessary data ref processing when
  running balance of metadata with qgroups on.

  The 4G data, 16 snapshots test (*) should explain it pretty well:

                 | delayed subtree | refactor delayed ref | this patch
    -------------------------------------------------------------------
    relocated    |           22653 |                22673 |      22744
    qgroup dirty |          122792 |                48360 |         70
    time         |          24.494 |               11.606 |      3.944

  Finally, we're at the stage where qgroup + metadata balance costs no
  obvious overhead.

  (*) Test environment:
  Test VM:
  - vRAM 8G
  - vCPU 8
  - block dev virtio-blk, 'unsafe' cache mode
  - host block 850evo

  Test workload:
  - Copy 4G data from /usr/ to one subvolume
  - Create 16 snapshots of that subvolume, and modify 3 files in each
    snapshot
  - Enable quota, rescan
  - Time "btrfs balance start -m"

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent() (Qu Wenruo, 2019-04-29, 1 file, -29/+23)

  Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long
  parameter list and the confusing @owner parameter.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref() (Qu Wenruo, 2019-04-29, 1 file, -25/+32)

  Use the new btrfs_ref structure and replace the parameter list, to clean up
  the usage of owner and level to distinguish the extent types.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes() (Qu Wenruo, 2019-04-29, 1 file, -16/+10)

  Since add_pinned_bytes() only needs to know if the extent is metadata and
  if it's a chunk tree extent, btrfs_ref is a perfect match for it, as we
  don't need the various owner/level tricks to determine the extent type.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod() (Qu Wenruo, 2019-04-29, 1 file, -19/+8)

  It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as
  btrfs_ref describes a metadata/data reference update comprehensively. Now
  one less function uses the confusing owner/level trick.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref() (Qu Wenruo, 2019-04-29, 1 file, -13/+10)

  Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
  btrfs_add_delayed_data_ref().

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref() (Qu Wenruo, 2019-04-29, 1 file, -18/+26)

  btrfs_add_delayed_tree_ref() has a longer and longer parameter list, and
  some callers like btrfs_inc_extent_ref() are using @owner as the level for
  delayed tree refs. Instead of making the parameter list even longer, use
  btrfs_ref to refactor it, so each parameter assignment is self-explanatory
  without the dirty level/owner trick, and provides the basis for later
  refactoring.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
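  A compact sketch of the pattern, where a single descriptor struct replaces
  a long positional parameter list (struct and field names are illustrative,
  not the exact kernel struct btrfs_ref):

      #include <stdint.h>

      enum ref_action { REF_ADD, REF_DROP };

      struct tree_ref {
          enum ref_action action;
          uint64_t bytenr;
          uint64_t num_bytes;
          uint64_t parent; /* nonzero for shared (full backref) refs */
          uint64_t root;   /* owning root otherwise */
          int level;       /* a real level field, no @owner overloading */
      };

      static int add_delayed_tree_ref(const struct tree_ref *ref)
      {
          (void)ref;       /* queueing elided in this sketch */
          return 0;
      }

      /* Call site: every assignment names what it sets, instead of
       * being a positional argument buried in a long list. */
      static int drop_tree_block(uint64_t bytenr, uint64_t len, int level,
                                 uint64_t root)
      {
          struct tree_ref ref = {
              .action    = REF_DROP,
              .bytenr    = bytenr,
              .num_bytes = len,
              .root      = root,
              .level     = level,
          };
          return add_delayed_tree_ref(&ref);
      }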
* btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref (Qu Wenruo, 2019-04-29, 1 file, -14/+16)

  The process_func function pointer is local to __btrfs_mod_ref() and points
  to either btrfs_inc_extent_ref() or btrfs_free_extent(). Open code it to
  make later delayed ref refactor easier, so we can refactor
  btrfs_inc_extent_ref() and btrfs_free_extent() in different patches.

  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
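  A simplified, non-kernel sketch of the transformation:

      #include <stdint.h>

      static int inc_ref(uint64_t bytenr)  { (void)bytenr; return 0; }
      static int free_ref(uint64_t bytenr) { (void)bytenr; return 0; }

      static int mod_ref(uint64_t bytenr, int inc)
      {
          /* Before: a local pointer obscured which callee runs:
           *   int (*process_func)(uint64_t) = inc ? inc_ref : free_ref;
           *   return process_func(bytenr);
           * After: open-coded, so each callee can later be given a
           * different signature independently. */
          if (inc)
              return inc_ref(bytenr);
          return free_ref(bytenr);
      }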
* btrfs: get fs_info from block group in btrfs_find_space_cluster (David Sterba, 2019-04-29, 1 file, -4/+2)

  We can read fs_info from the block group cache structure and can drop it
  from the parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from block group in load_free_space_cache (David Sterba, 2019-04-29, 1 file, -1/+1)

  We can read fs_info from the block group cache structure and can drop it
  from the parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from block group in lookup_free_space_inode (David Sterba, 2019-04-29, 1 file, -2/+2)

  We can read fs_info from the block group cache structure and can drop it
  from the parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from block group in pin_down_extent (David Sterba, 2019-04-29, 1 file, -7/+7)

  We can read fs_info from the block group cache structure and can drop it
  from the parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from block group in next_block_group (David Sterba, 2019-04-29, 1 file, -5/+5)

  We can read fs_info from the block group cache structure and can drop it
  from the parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: remove no longer used function to run delayed refs asynchronously (Filipe Manana, 2019-04-29, 1 file, -85/+0)

  It used to be called from only two places (truncate path and releasing a
  transaction handle), but commits 28bad2125767c5 ("btrfs: fix truncate
  throttling") and db2462a6ad3dc4 ("btrfs: don't run delayed refs in the end
  transaction logic") removed their calls to this function, so it's not used
  anymore. Just remove it and all its helpers.

  Reviewed-by: Qu Wenruo <wqu@suse.com>
  Signed-off-by: Filipe Manana <fdmanana@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: remove no longer used member num_dirty_bgs from transaction (Filipe Manana, 2019-04-29, 1 file, -1/+0)

  The member num_dirty_bgs of struct btrfs_transaction is not used anymore,
  it is set and incremented but nothing reads its value anymore. Its last
  read use was removed by commit 64403612b73a94 ("btrfs: rework
  btrfs_check_space_for_delayed_refs"). So just remove that member.

  Signed-off-by: Filipe Manana <fdmanana@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in btrfs_write_out_cache (David Sterba, 2019-04-29, 1 file, -4/+2)

  We can read fs_info from the transaction and can drop it from the
  parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in create_free_space_inode (David Sterba, 2019-04-29, 1 file, -2/+1)

  We can read fs_info from the transaction and can drop it from the
  parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in btrfs_set_log_full_commit (David Sterba, 2019-04-29, 1 file, -1/+1)

  We can read fs_info from the transaction and can drop it from the
  parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in update_block_group (David Sterba, 2019-04-29, 1 file, -5/+5)

  We can read fs_info from the transaction and can drop it from the
  parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in btrfs_write_dirty_block_groups (David Sterba, 2019-04-29, 1 file, -2/+2)

  We can read fs_info from the transaction and can drop it from the
  parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in btrfs_setup_space_cache (David Sterba, 2019-04-29, 1 file, -2/+2)

  We can read fs_info from the transaction and can drop it from the
  parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in write_one_cache_group (David Sterba, 2019-04-29, 1 file, -7/+4)

  We can read fs_info from the transaction and can drop it from the
  parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit (Nikolay Borisov, 2019-04-29, 1 file, -63/+28)

  Instead of always calling the allocator to search for a free extent that
  satisfies the input criteria, switch btrfs_trim_free_extents to using
  find_first_clear_extent_bit. With this change it's no longer necessary to
  read the device tree in order to figure out holes in the devices. Now the
  code always searches the in-memory data structure to figure out the space
  range which contains the requested range, which should result in speed
  improvements.

  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Optimize unallocated chunks discard (Nikolay Borisov, 2019-04-29, 1 file, -1/+56)

  Currently unallocated chunks are always trimmed. For example, 2 consecutive
  trims on a large storage would trim free space twice, irrespective of
  whether the space was actually allocated or not between those trims.

  Optimise this behavior by exploiting the newly introduced alloc_state tree
  of btrfs_device. A new CHUNK_TRIMMED bit is used to mark those unallocated
  chunks which have been trimmed and have not been allocated afterwards. On
  chunk allocation the respective underlying devices' physical space will
  have its CHUNK_TRIMMED flag cleared. This avoids submitting discards for
  space which hasn't been changed since the last time discard was issued.

  This applies to the single mount period of the filesystem, as the
  information is not stored permanently.

  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
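  A toy model of the bookkeeping (the real state lives in the device's
  alloc_state extent io tree; names here are illustrative):

      #include <stdbool.h>

      #define CHUNK_ALLOCATED (1u << 0)
      #define CHUNK_TRIMMED   (1u << 1)

      struct dev_range { unsigned int state; };

      /* Skip ranges already trimmed and untouched since the last trim. */
      static bool should_trim(const struct dev_range *r)
      {
          return !(r->state & (CHUNK_ALLOCATED | CHUNK_TRIMMED));
      }

      static void mark_trimmed(struct dev_range *r)
      {
          r->state |= CHUNK_TRIMMED;
      }

      /* Chunk allocation invalidates the trimmed hint for that space. */
      static void mark_allocated(struct dev_range *r)
      {
          r->state |= CHUNK_ALLOCATED;
          r->state &= ~CHUNK_TRIMMED;
      }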
* btrfs: Factor out in_range macro (Nikolay Borisov, 2019-04-29, 1 file, -1/+0)

  This is used in more than one place, so let's factor it out into ctree.h.
  No functional changes.

  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
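  The helper boils down to a half-open interval check; modulo the exact
  spelling in ctree.h it is equivalent to:

      /* true iff b lies in [first, first + len) */
      #define in_range(b, first, len) \
              ((b) >= (first) && (b) < (first) + (len))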
* btrfs: Remove 'trans' argument from find_free_dev_extent(_start) (Nikolay Borisov, 2019-04-29, 1 file, -33/+3)

  Now that these functions no longer require a transaction handle to inspect
  pending/pinned chunks, the argument can be removed. At the same time also
  remove any surrounding code which acquired the handle.

  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: replace pending/pinned chunks lists with io tree (Jeff Mahoney, 2019-04-29, 1 file, -28/+0)

  The pending chunks list contains chunks that are allocated in the current
  transaction but haven't been created yet. The pinned chunks list contains
  chunks that are being released in the current transaction. Both describe
  chunks that are not reflected on disk as in use but are unavailable just
  the same.

  The pending chunks list is anchored by the transaction handle, which means
  that we need to hold a reference to a transaction when working with the
  list.

  The way we use them is by iterating over both lists to perform comparisons
  on the stripes they describe for each device. This is backwards and
  requires that we keep a transaction handle open while we're trimming.

  This patchset adds an extent_io_tree to btrfs_device that maintains the
  allocation state of the device. Extents are set dirty when chunks are first
  allocated -- when the extent maps are added to the mapping tree. They're
  cleared when last removed -- when the extent maps are removed from the
  mapping tree. This matches the lifespan of the pending and pinned chunks
  list and allows us to do trims on unallocated space safely without pinning
  the transaction for what may be a lengthy operation. We can also use this
  io tree to mark which chunks have already been trimmed so we don't repeat
  the operation.

  Signed-off-by: Jeff Mahoney <jeffm@suse.com>
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Honour FITRIM range constraints during free space trim (Nikolay Borisov, 2019-04-29, 1 file, -6/+19)

  Up until now trimming the free space was done irrespective of what the
  arguments of the FITRIM ioctl were. For example, fstrim's -o/-l arguments
  would be entirely ignored. Fix it by correctly handling those parameters.
  This requires breaking out if the found free space extent is after the end
  of the passed range, as well as completing the trim after trimming
  fstrim_range::len bytes.

  Fixes: 499f377f49f0 ("btrfs: iterate over unused chunk space in FITRIM")
  CC: stable@vger.kernel.org # 4.4+
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
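  A sketch of the corrected iteration, assuming extents arrive sorted by
  offset and using illustrative names that mirror struct fstrim_range:

      #include <stdint.h>

      struct trim_range { uint64_t start, len, minlen; };

      /* Returns the number of bytes that would be discarded. */
      static uint64_t trim_free_extents(const struct trim_range *range,
                                        const uint64_t *ext_start,
                                        const uint64_t *ext_len, int nr)
      {
          uint64_t end = range->start + range->len;
          uint64_t trimmed = 0;
          int i;

          for (i = 0; i < nr; i++) {
              uint64_t s = ext_start[i];
              uint64_t e = s + ext_len[i];

              if (s >= end)
                  break;               /* extent past the passed range */
              if (s < range->start)
                  s = range->start;
              if (e > end)
                  e = end;             /* clamp to the requested length */
              if (e > s && e - s >= range->minlen)
                  trimmed += e - s;    /* issue discard for [s, e) */
              if (trimmed >= range->len)
                  break;               /* fstrim_range::len bytes done */
          }
          return trimmed;
      }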
* btrfs: get fs_info from eb in clean_tree_block (David Sterba, 2019-04-29, 1 file, -3/+3)

  We can read fs_info from extent buffer and can drop it from the
  parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from eb in btrfs_exclude_logged_extents (David Sterba, 2019-04-29, 1 file, -2/+2)

  We can read fs_info from extent buffer and can drop it from the
  parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: remove no longer used 'sync' member from transaction handle (Filipe Manana, 2019-04-29, 1 file, -6/+0)

  Commit db2462a6ad3d ("btrfs: don't run delayed refs in the end transaction
  logic") removed its last use, so now it does absolutely nothing, therefore
  remove it.

  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: Filipe Manana <fdmanana@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size (Nikolay Borisov, 2019-03-19, 1 file, -1/+1)

  qgroup_rsv_size is calculated as the product of outstanding_extents *
  fs_info->nodesize. The product is calculated with 32 bit precision since
  both variables are defined as u32, yet qgroup_rsv_size expects a 64 bit
  result. Avoid possible multiplication overflow by casting
  outstanding_extents to u64. Such an overflow would in the worst case (64K
  nodesize) require more than 65536 extents, which is quite large and it's
  not likely that it would happen in practice.

  Fixes-coverity-id: 1435101
  Fixes: ff6bc37eb7f6 ("btrfs: qgroup: Use independent and accurate per inode qgroup rsv")
  CC: stable@vger.kernel.org # 4.19+
  Reviewed-by: Qu Wenruo <wqu@suse.com>
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
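  A standalone demonstration of the wrap and the fix:

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
          uint32_t outstanding_extents = 70000; /* > 65536 extents */
          uint32_t nodesize = 65536;            /* 64K nodes */

          /* u32 * u32 is evaluated in 32 bits and wraps before the
           * assignment widens it: */
          uint64_t bad = outstanding_extents * nodesize;

          /* Casting one operand forces a 64-bit multiplication: */
          uint64_t good = (uint64_t)outstanding_extents * nodesize;

          printf("bad=%llu good=%llu\n",
                 (unsigned long long)bad, (unsigned long long)good);
          return 0;
      }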
* btrfs: save drop_progress if we drop refs at all (Josef Bacik, 2019-02-27, 1 file, -6/+20)

  Previously we only updated the drop_progress key if we were in the
  DROP_REFERENCE stage of snapshot deletion. This is because the
  UPDATE_BACKREF stage checks the flags of the blocks it's converting to
  FULL_BACKREF, so if we go over a block we processed before it doesn't
  matter, we just don't do anything.

  The problem is in do_walk_down() we will go ahead and drop the root's
  reference to any blocks that we know we won't need to walk into.

  Given subvolume A and snapshot B. The root of B points to all of the nodes
  that belong to A, so all of those nodes have a refcnt > 1. If B did not
  modify those blocks it'll hit this condition in do_walk_down

      if (!wc->update_ref ||
          generation <= root->root_key.offset)
              goto skip;

  and in "goto skip" we simply do a btrfs_free_extent() for the bytenr that
  we point at.

  Now assume we modified some data in B, and then took a snapshot of B and
  call it C. C points to all the nodes in B, making every node the root of B
  points to have a refcnt > 1. This assumes the root level is 2 or higher.

  We delete snapshot B, which does the above work in do_walk_down, freeing
  our ref for nodes we share with A that we didn't modify. Now we hit a node
  we _did_ modify, thus we own. We need to walk down into this node and we
  set wc->stage == UPDATE_BACKREF. We walk down to level 0 which we also own
  because we modified data. We can't walk any further down and thus now need
  to walk up and start the next part of the deletion. Now walk_up_proc is
  supposed to put us back into DROP_REFERENCE, but there's an exception to
  this

      if (level < wc->shared_level)
              goto out;

  we are at level == 0, and our shared_level == 1. We skip out of this one
  and go up to level 1. Since path->slots[1] < nritems we path->slots[1]++
  and break out of walk_up_tree to stop our transaction and loop back around.
  Now in btrfs_drop_snapshot we have this snippet

      if (wc->stage == DROP_REFERENCE) {
              level = wc->level;
              btrfs_node_key(path->nodes[level],
                             &root_item->drop_progress,
                             path->slots[level]);
              root_item->drop_level = level;
      }

  our stage == UPDATE_BACKREF still, so we don't update the drop_progress
  key. This is a problem because we would have done btrfs_free_extent() for
  the nodes leading up to our current position. If we crash or unmount here
  and go to remount, we'll start over where we were before and try to free
  our ref for blocks we've already freed, and thus abort() out.

  Fix this by keeping track of the last place we dropped a reference for our
  block in do_walk_down. Then if wc->stage == UPDATE_BACKREF we know we'll
  start over from a place we meant to, and otherwise things continue to work
  as they did before.

  I have a complicated reproducer for this problem; without this patch we'll
  fail to fsck the fs when replaying the dm-log-writes log. With this patch
  we can replay the whole log without any fsck or mount failures.

  The steps to reproduce this easily are sort of tricky, I had to add a
  couple of debug patches to the kernel in order to make it easy, basically I
  just needed to make sure we did actually commit the transaction every time
  we finished a walk_down_tree/walk_up_tree combo.

  The reproducer:
  1) Creates a base subvolume.
  2) Creates 100k files in the subvolume.
  3) Snapshots the base subvolume (snap1).
  4) Touches files 5000-6000 in snap1.
  5) Snapshots snap1 (snap2).
  6) Deletes snap1.

  I do this with dm-log-writes, and then replay to every FUA in the log and
  fsck the fs.

  Reviewed-by: Filipe Manana <fdmanana@suse.com>
  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  [ copy reproducer steps ]
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: check for refs on snapshot delete resume (Josef Bacik, 2019-02-27, 1 file, -1/+47)

  There's a bug in snapshot deletion where we won't update the drop_progress
  key if we're in the UPDATE_BACKREF stage. This is a problem because we
  could drop refs for blocks we know don't belong to us. If we crash or
  umount at the right time we could experience messages such as the following
  when snapshot deletion resumes

      BTRFS error (device dm-3): unable to find ref byte nr 66797568 parent 0 root 258 owner 1 offset 0
      ------------[ cut here ]------------
      WARNING: CPU: 3 PID: 16052 at fs/btrfs/extent-tree.c:7108 __btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
      CPU: 3 PID: 16052 Comm: umount Tainted: G W OE 5.0.0-rc4+ #147
      Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
      RIP: 0010:__btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
      RSP: 0018:ffffc90005cd7b18 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
      RDX: ffff88842fade680 RSI: ffff88842fad6b18 RDI: ffff88842fad6b18
      RBP: ffffc90005cd7bc8 R08: 0000000000000000 R09: 0000000000000001
      R10: 0000000000000001 R11: ffffffff822696b8 R12: 0000000003fb4000
      R13: 0000000000000001 R14: 0000000000000102 R15: ffff88819c9d67e0
      FS: 00007f08bb138fc0(0000) GS:ffff88842fac0000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f8f5d861ea0 CR3: 00000003e99fe000 CR4: 00000000000006e0
      Call Trace:
       ? _raw_spin_unlock+0x27/0x40
       ? btrfs_merge_delayed_refs+0x356/0x3e0 [btrfs]
       __btrfs_run_delayed_refs+0x75a/0x13c0 [btrfs]
       ? join_transaction+0x2b/0x460 [btrfs]
       btrfs_run_delayed_refs+0xf3/0x1c0 [btrfs]
       btrfs_commit_transaction+0x52/0xa50 [btrfs]
       ? start_transaction+0xa6/0x510 [btrfs]
       btrfs_sync_fs+0x79/0x1c0 [btrfs]
       sync_filesystem+0x70/0x90
       generic_shutdown_super+0x27/0x120
       kill_anon_super+0x12/0x30
       btrfs_kill_super+0x16/0xa0 [btrfs]
       deactivate_locked_super+0x43/0x70
       deactivate_super+0x40/0x60
       cleanup_mnt+0x3f/0x80
       __cleanup_mnt+0x12/0x20
       task_work_run+0x8b/0xc0
       exit_to_usermode_loop+0xce/0xd0
       do_syscall_64+0x20b/0x210
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

  To fix this simply mark dead roots we read from disk as DEAD and then set
  the walk_control->restarted flag so we know we have a restarted deletion.
  From here whenever we try to drop refs for blocks we check to verify our
  ref is set on them, and if it is not we skip it. Once we find a ref that is
  set we unset walk_control->restarted since the tree should be in a normal
  state from then on, and any problems we run into from there are different
  issues. I tested this with an existing broken fs and my reproducer that
  creates a broken fs, and it fixed both file systems.

  Reviewed-by: Filipe Manana <fdmanana@suse.com>
  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record (Qu Wenruo, 2019-02-25, 1 file, -3/+0)

  [BUG]
  Btrfs/139 will fail with a high probability if the testing machine (VM) has
  only 2G RAM, resulting in the final write succeeding while it should fail
  due to EDQUOT, and the fs will have quota exceeding the limit by 16K.

  The simplified reproducer will be: (needs a 2G ram VM)

      $ mkfs.btrfs -f $dev
      $ mount $dev $mnt
      $ btrfs subv create $mnt/subv
      $ btrfs quota enable $mnt
      $ btrfs quota rescan -w $mnt
      $ btrfs qgroup limit -e 1G $mnt/subv

      $ for i in $(seq -w 1 8); do
            xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
            echo "file $i written" > /dev/kmsg
        done
      $ sync
      $ btrfs qgroup show -pcre --raw $mnt

  The last pwrite will not trigger EDQUOT and the final 'qgroup show' will
  show something like:

      qgroupid        rfer        excl  max_rfer    max_excl  parent  child
      --------        ----        ----  --------    --------  ------  -----
      0/5            16384       16384      none        none  ---     ---
      0/256     1073758208  1073758208      none  1073741824  ---     ---

  And 1073758208 is larger than 1073741824.

  [CAUSE]
  It's a bug in btrfs qgroup data reserved space management.

  For quota limit, we must ensure that:

      reserved (data + metadata) + rfer/excl <= limit

  Since rfer/excl is only updated at transaction commit time, reserved space
  needs to be taken special care of. One important part of reserved space is
  data, and for a new data extent written to disk, we still need to take the
  reserved space until rfer/excl numbers get updated.

  Originally when an ordered extent finishes, we migrate the reserved qgroup
  data space from the extent_io tree to the delayed ref head of the data
  extent, expecting the delayed ref will only be cleaned up at commit
  transaction time. However for a small RAM machine, due to memory pressure
  dirty pages can be flushed back to disk without committing a transaction.
  The related events will be something like:

      file 1 written
      btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
      btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
      btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
      btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
      btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
      cleanup_ref_head: num_bytes=54947840
      cleanup_ref_head: num_bytes=5636096
      cleanup_ref_head: num_bytes=569344
      cleanup_ref_head: num_bytes=57344
      cleanup_ref_head: num_bytes=8192
      ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
      file 2 written
      ...
      file 8 written
      cleanup_ref_head: num_bytes=8192
      ...
      btrfs_commit_transaction  <<< the only transaction committed during
                                    the test

  When file 2 is written, we have already freed 128M reserved qgroup data
  space for ino 258. Thus later writes won't trigger EDQUOT. This allows us
  to write more data beyond the qgroup limit. In my 2G ram VM, it could reach
  about 1.2G before hitting EDQUOT.

  [FIX]
  By moving reserved qgroup data space from btrfs_delayed_ref_head to
  btrfs_qgroup_extent_record, we can ensure that reserved qgroup data space
  won't be freed half way before commit transaction, thus fixing the problem.

  Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use WARN_ON in a canonical form btrfs_remove_block_group (Nikolay Borisov, 2019-02-25, 1 file, -6/+3)

  There is no point in using a construct like 'if (!condition) WARN_ON(1)'.
  Use WARN_ON(!condition) directly. No functional changes.

  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
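  A standalone before/after demonstration, with a toy WARN_ON stand-in for
  the kernel macro:

      #include <stdio.h>

      /* toy stand-in for the kernel macro, for demonstration only */
      #define WARN_ON(cond) \
              do { if (cond) fprintf(stderr, "WARN %s:%d\n", \
                                     __FILE__, __LINE__); } while (0)

      int main(void)
      {
          int ret = -1;

          if (!(ret == 0))    /* before: inverted condition ... */
              WARN_ON(1);     /* ... asserting a constant */

          WARN_ON(ret != 0);  /* after: canonical form */
          return 0;
      }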
* btrfs: be more explicit about allowed flush states (Josef Bacik, 2019-02-25, 1 file, -11/+11)

  For FLUSH_LIMIT flushers we really can only allocate chunks and flush
  delayed inode items, everything else is problematic. I added a bunch of new
  states and it led to weirdness in the FLUSH_LIMIT case because I forgot
  about how it worked. So instead explicitly declare the states that are ok
  for flushing with FLUSH_LIMIT and use that for our state machine. Then as
  we add new things that are safe we can just add them to this list.

  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
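  A sketch of the explicit allowed-state table; the state names mirror the
  description above (flush delayed inode items, allocate chunks) and are not
  the exact kernel enum:

      enum flush_state {
          FLUSH_DELAYED_ITEMS_NR,
          FLUSH_DELAYED_ITEMS,
          ALLOC_CHUNK,
      };

      static const enum flush_state limit_flush_states[] = {
          FLUSH_DELAYED_ITEMS_NR,
          FLUSH_DELAYED_ITEMS,
          ALLOC_CHUNK,
      };

      static void run_flush_state(enum flush_state s) { (void)s; /* stub */ }

      /* The FLUSH_LIMIT machine walks only this table, so new global
       * flush states must be opted in here explicitly. */
      static void flush_limit(void)
      {
          unsigned int i;
          unsigned int n = sizeof(limit_flush_states) /
                           sizeof(limit_flush_states[0]);

          for (i = 0; i < n; i++)
              run_flush_state(limit_flush_states[i]);
      }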
* btrfs: loop in inode_rsv_refill (Josef Bacik, 2019-02-25, 1 file, -16/+47)

  With severe fragmentation we can end up with our inode rsv size being huge
  during writeout, which would cause us to need to make very large metadata
  reservations. However we may not actually need that much once writeout is
  complete, because of the over-reservation for the worst case.

  So instead try to make our reservation, and if we couldn't make it,
  re-calculate our new reservation size and try again. If our reservation
  size doesn't change between tries then we know we are actually out of space
  and can error out: any flushing that could have been running in parallel
  did not make any space.

  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  [ rename to calc_refill_bytes, update comment and changelog ]
  Signed-off-by: David Sterba <dsterba@suse.com>
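  A sketch of the retry loop under those assumptions, with toy stand-ins for
  the reservation backend (try_reserve and calc_refill_bytes here are
  illustrative, not the btrfs functions):

      #include <stdbool.h>
      #include <stdint.h>

      static uint64_t pool;               /* toy metadata pool */
      static uint64_t rsv_need = 1 << 20; /* shrinks as writeout completes */

      static bool try_reserve(uint64_t bytes)
      {
          if (bytes > pool)
              return false;
          pool -= bytes;
          return true;
      }

      static uint64_t calc_refill_bytes(void) { return rsv_need; }

      static int inode_rsv_refill(void)
      {
          uint64_t need = calc_refill_bytes();

          while (need) {
              uint64_t again;

              if (try_reserve(need))
                  return 0;
              /* Writeout may have finished meanwhile; recompute. */
              again = calc_refill_bytes();
              if (again >= need)
                  return -1;    /* no progress: genuinely out of space */
              need = again;
          }
          return 0;
      }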
* btrfs: don't enospc all tickets on flush failure (Josef Bacik, 2019-02-25, 1 file, -20/+25)

  With the introduction of the per-inode block_rsv it became possible to have
  really really large reservation requests made because of data
  fragmentation. Since the ticket stuff assumed that we'd always have
  relatively small reservation requests, it just killed all tickets if we
  were unable to satisfy the current request. However, this is generally not
  the case anymore. So fix this logic to instead see if we had a ticket that
  we were able to give some reservation to, and if we did, continue the
  flushing loop again. Likewise we make the tickets use the
  space_info_add_old_bytes() method of returning what reservation they did
  receive in hopes that it could satisfy reservations down the line.

  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: don't use global reserve for chunk allocation (Josef Bacik, 2019-02-25, 1 file, -10/+16)

  We've done this forever because of the voodoo around knowing how much space
  we have. However, we have better ways of doing this now, and on normal file
  systems we'll easily have a global reserve of 512MiB, and since metadata
  chunks are usually 1GiB that means we'll allocate metadata chunks more
  readily. Instead use the actual used amount when determining if we need to
  allocate a chunk or not.

  This has a side effect for mixed block group filesystems, where we are no
  longer allocating enough chunks for the data/metadata requirements. To deal
  with this add an ALLOC_CHUNK_FORCE step to the flushing state machine. This
  will only get used if we've already made a full loop through the flushing
  machinery and tried committing the transaction. If we have, then we can try
  to force a chunk allocation since we likely need it to make progress. This
  resolves the issues I was seeing with the mixed bg tests in xfstests
  without the new flushing state.

  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dump block_rsv details when dumping space info (Josef Bacik, 2019-02-25, 1 file, -0/+15)

  For enospc_debug having the block rsvs is super helpful to see if we've
  done something wrong.

  Reviewed-by: Omar Sandoval <osandov@fb.com>
  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: check if there are free block groups for commit (Josef Bacik, 2019-02-25, 1 file, -12/+19)

  may_commit_transaction will skip committing the transaction if we don't
  have enough pinned space or if we're trying to find space for a SYSTEM
  chunk. However, if we have pending free block groups in this transaction we
  still want to commit as we may be able to allocate a chunk to make our
  reservation. So instead of just returning ENOSPC, check if we have free
  block groups pending, and if so commit the transaction to allow us to use
  that free space.

  Reviewed-by: Omar Sandoval <osandov@fb.com>
  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: replace cleaner_delayed_iput_mutex with a waitqueue (Josef Bacik, 2019-02-25, 1 file, -5/+8)

  The throttle path doesn't take cleaner_delayed_iput_mutex, which means we
  could think we're done flushing iputs in the data space reservation path
  when we could have a throttler doing an iput. There's no real reason to
  serialize the delayed iput flushing, so instead of taking the
  cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace
  it with an atomic counter and a waitqueue. This removes the short (or long,
  depending on how big the inode is) window where we think there are no more
  pending iputs when there really are some.

  The waiting is killable as it could be indirectly called from user
  operations like fallocate or zero-range. Such call sites should handle the
  error but otherwise it's not necessary. Eg. flush_space just needs to
  attempt to make space by waiting on iputs.

  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  [ add killable comment and changelog parts ]
  Signed-off-by: David Sterba <dsterba@suse.com>
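  A sketch of the resulting pattern in kernel idiom; member names and
  placement are illustrative, not the exact btrfs fields:

      #include <linux/atomic.h>
      #include <linux/wait.h>

      static atomic_t nr_delayed_iputs = ATOMIC_INIT(0);
      static DECLARE_WAIT_QUEUE_HEAD(delayed_iputs_wait);

      static void queue_delayed_iput(void)
      {
          atomic_inc(&nr_delayed_iputs);
          /* ... put the inode on the delayed iput list ... */
      }

      static void run_one_delayed_iput(void)
      {
          /* ... do the actual iput ... */
          if (atomic_dec_and_test(&nr_delayed_iputs))
              wake_up(&delayed_iputs_wait);
      }

      /* Killable because it can be reached indirectly from fallocate
       * or zero-range. */
      static int wait_on_delayed_iputs(void)
      {
          return wait_event_killable(delayed_iputs_wait,
                          atomic_read(&nr_delayed_iputs) == 0);
      }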
* btrfs: Output ENOSPC debug info in inc_block_group_ro (Qu Wenruo, 2019-02-25, 1 file, -2/+13)

  Since inc_block_group_ro() can return -ENOSPC, outputting debug info when
  the enospc_debug mount option is set would be helpful for debugging false
  ENOSPC reports from balance.

  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>