From 450114fc0db0cd5c2e7324b917e5de52cff991d7 Mon Sep 17 00:00:00 2001 From: Josef Bacik Date: Wed, 21 Nov 2018 14:03:08 -0500 Subject: btrfs: don't use global reserve for chunk allocation We've done this forever because of the voodoo around knowing how much space we have. However, we have better ways of doing this now, and on normal file systems we'll easily have a global reserve of 512MiB, and since metadata chunks are usually 1GiB that means we'll allocate metadata chunks more readily. Instead use the actual used amount when determining if we need to allocate a chunk or not. This has a side effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add a ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves issues I was seeing with the mixed bg tests in xfstests without the new flushing state. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ] Signed-off-by: David Sterba --- include/trace/events/btrfs.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/trace') diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 2887503e4d12..3f08b652363b 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush, { FLUSH_DELAYED_REFS_NR, "FLUSH_DELAYED_REFS_NR"}, \ { FLUSH_DELAYED_REFS, "FLUSH_ELAYED_REFS"}, \ { ALLOC_CHUNK, "ALLOC_CHUNK"}, \ + { ALLOC_CHUNK_FORCE, "ALLOC_CHUNK_FORCE"}, \ { COMMIT_TRANS, "COMMIT_TRANS"}) TRACE_EVENT(btrfs_flush_space, -- cgit v1.2.3 From 1418bae1c22951aad9883bc8f8f4dccb272cce1e Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Wed, 23 Jan 2019 15:15:12 +0800 Subject: btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record [BUG] Btrfs/139 will fail with a high probability if the testing machine (VM) has only 2G RAM. Resulting the final write success while it should fail due to EDQUOT, and the fs will have quota exceeding the limit by 16K. The simplified reproducer will be: (needs a 2G ram VM) $ mkfs.btrfs -f $dev $ mount $dev $mnt $ btrfs subv create $mnt/subv $ btrfs quota enable $mnt $ btrfs quota rescan -w $mnt $ btrfs qgroup limit -e 1G $mnt/subv $ for i in $(seq -w 1 8); do xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null echo "file $i written" > /dev/kmsg done $ sync $ btrfs qgroup show -pcre --raw $mnt The last pwrite will not trigger EDQUOT and final 'qgroup show' will show something like: qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16384 16384 none none --- --- 0/256 1073758208 1073758208 none 1073741824 --- --- And 1073758208 is larger than > 1073741824. [CAUSE] It's a bug in btrfs qgroup data reserved space management. For quota limit, we must ensure that: reserved (data + metadata) + rfer/excl <= limit Since rfer/excl is only updated at transaction commmit time, reserved space needs to be taken special care. One important part of reserved space is data, and for a new data extent written to disk, we still need to take the reserved space until rfer/excl numbers get updated. Originally when an ordered extent finishes, we migrate the reserved qgroup data space from extent_io tree to delayed ref head of the data extent, expecting delayed ref will only be cleaned up at commit transaction time. However for small RAM machine, due to memory pressure dirty pages can be flushed back to disk without committing a transaction. The related events will be something like: file 1 written btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840 btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096 btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344 btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192 btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344 cleanup_ref_head: num_bytes=54947840 cleanup_ref_head: num_bytes=5636096 cleanup_ref_head: num_bytes=569344 cleanup_ref_head: num_bytes=57344 cleanup_ref_head: num_bytes=8192 ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space file 2 written ... file 8 written cleanup_ref_head: num_bytes=8192 ... btrfs_commit_transaction <<< the only transaction committed during the test When file 2 is written, we have already freed 128M reserved qgroup data space for ino 258. Thus later write won't trigger EDQUOT. This allows us to write more data beyond qgroup limit. In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT. [FIX] By moving reserved qgroup data space from btrfs_delayed_ref_head to btrfs_qgroup_extent_record, we can ensure that reserved qgroup data space won't be freed half way before commit transaction, thus fix the problem. Fixes: f64d5ca86821 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref") Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- include/trace/events/btrfs.h | 29 ----------------------------- 1 file changed, 29 deletions(-) (limited to 'include/trace') diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 3f08b652363b..ab1cc33adbac 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1513,35 +1513,6 @@ DEFINE_EVENT(btrfs__qgroup_rsv_data, btrfs_qgroup_release_data, TP_ARGS(inode, start, len, reserved, op) ); -DECLARE_EVENT_CLASS(btrfs__qgroup_delayed_ref, - - TP_PROTO(const struct btrfs_fs_info *fs_info, - u64 ref_root, u64 reserved), - - TP_ARGS(fs_info, ref_root, reserved), - - TP_STRUCT__entry_btrfs( - __field( u64, ref_root ) - __field( u64, reserved ) - ), - - TP_fast_assign_btrfs(fs_info, - __entry->ref_root = ref_root; - __entry->reserved = reserved; - ), - - TP_printk_btrfs("root=%llu reserved=%llu op=free", - __entry->ref_root, __entry->reserved) -); - -DEFINE_EVENT(btrfs__qgroup_delayed_ref, btrfs_qgroup_free_delayed_ref, - - TP_PROTO(const struct btrfs_fs_info *fs_info, - u64 ref_root, u64 reserved), - - TP_ARGS(fs_info, ref_root, reserved) -); - DECLARE_EVENT_CLASS(btrfs_qgroup_extent, TP_PROTO(const struct btrfs_fs_info *fs_info, const struct btrfs_qgroup_extent_record *rec), -- cgit v1.2.3