summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/transaction.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'cleanups-4.7' into for-chris-4.7-20160525David Sterba2016-05-251-1/+1
|\
| * btrfs: sink gfp parameter to convert_extent_bitDavid Sterba2016-04-291-1/+1
| | | | | | | | | | | | | | | | Single caller passes GFP_NOFS. We can get rid of the gfpflags_allow_blocking checks as NOFS can block but does not recurse to filesystem through reclaim. Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge branch 'for-chris-4.7' of ↵Chris Mason2016-05-171-1/+1
|\ \ | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.7 Signed-off-by: Chris Mason <clm@fb.com>
| * | Btrfs: don't wait for unrelated IO to finish before relocationFilipe Manana2016-05-131-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before the relocation process of a block group starts, it sets the block group to readonly mode, then flushes all delalloc writes and then finally it waits for all ordered extents to complete. This last step includes waiting for ordered extents destinated at extents allocated in other block groups, making us waste unecessary time. So improve this by waiting only for ordered extents that fall into the block group's range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
* | Merge branch 'cleanups-4.7' into for-chris-4.7-20160516David Sterba2016-05-161-1/+1
|\ \
| * | btrfs: rename btrfs_std_error to btrfs_handle_fs_errorAnand Jain2016-04-281-1/+1
| |/ | | | | | | | | | | | | | | | | | | btrfs_std_error() handles errors, puts FS into readonly mode (as of now). So its good idea to rename it to btrfs_handle_fs_error(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ edit changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: build fixup for qgroup_account_snapshotDavid Sterba2016-05-121-0/+5
| | | | | | | | | | | | | | | | The macro btrfs_std_error got renamed to btrfs_handle_fs_error in an independent branch for the same merge target (4.7). To make the code compilable for bisectability reasons, add a temporary stub. Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: qgroup: Fix qgroup accounting when creating snapshotQu Wenruo2016-05-121-24/+105
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current btrfs qgroup design implies a requirement that after calling btrfs_qgroup_account_extents() there must be a commit root switch. Normally this is OK, as btrfs_qgroup_accounting_extents() is only called inside btrfs_commit_transaction() just be commit_cowonly_roots(). However there is a exception at create_pending_snapshot(), which will call btrfs_qgroup_account_extents() but no any commit root switch. In case of creating a snapshot whose parent root is itself (create a snapshot of fs tree), it will corrupt qgroup by the following trace: (skipped unrelated data) ====== btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 1 qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 0, excl = 0 qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer = 16384, excl = 16384 btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, nr_old_roots = 0, nr_new_roots = 0 ====== The problem here is in first qgroup_account_extent(), the nr_new_roots of the extent is 1, which means its reference got increased, and qgroup increased its rfer and excl. But at second qgroup_account_extent(), its reference got decreased, but between these two qgroup_account_extent(), there is no switch roots. This leads to the same nr_old_roots, and this extent just got ignored by qgroup, which means this extent is wrongly accounted. Fix it by call commit_cowonly_roots() after qgroup_account_extent() in create_pending_snapshot(), with needed preparation. Mark: I added a check at the top of qgroup_account_snapshot() to skip this code if qgroups are turned off. xfstest btrfs/122 exposes this problem. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
* Merge branch 'cleanups-4.6' into for-chris-4.6David Sterba2016-02-261-2/+5
|\
| * btrfs: Replace CURRENT_TIME by current_fs_time()Deepa Dinamani2016-02-181-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_fs_time() instead. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: add transaction space reservation tracepointsJosef Bacik2016-02-181-1/+5
|/ | | | | | | | | There are a few places where we add to trans->bytes_reserved but don't have the corresponding trace point. With these added my tool no longer sees transaction leaks. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Merge branch 'misc-cleanups-4.5' of ↵Chris Mason2016-01-111-5/+10
|\ | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
| * btrfs: pass proper enum type to start_transaction()Alexandru Moise2016-01-071-5/+10
| | | | | | | | | | | | Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge branch 'misc-for-4.5' of ↵Chris Mason2016-01-111-11/+7
|\ \ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
| * | btrfs: preallocate path for snapshot creation at ioctl timeDavid Sterba2016-01-071-5/+4
| | | | | | | | | | | | | | | | | | | | | We can also preallocate btrfs_path that's used during pending snapshot creation and avoid another late ENOMEM failure. Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: allocate root item at snapshot ioctl timeDavid Sterba2016-01-071-6/+3
| |/ | | | | | | | | | | | | | | | | | | The actual snapshot creation is delayed until transaction commit. If we cannot get enough memory for the root item there, we have to fail the whole transaction commit which is bad. So we'll allocate the memory at the ioctl call and pass it along with the pending_snapshot struct. The potential ENOMEM will be returned to the caller of snapshot ioctl. Signed-off-by: David Sterba <dsterba@suse.com>
* / Btrfs: fix memory leaks after transaction is abortedFilipe Manana2015-12-171-0/+17
|/ | | | | | | | | | | | When a transaction is aborted, or its commit fails before writing the new superblock and calling btrfs_finish_extent_commit(), we leak reference counts on the block groups attached to the transaction's delete_bgs list, because btrfs_finish_extent_commit() is never called for those two cases. Fix this by dropping their references at btrfs_put_transaction(), which is called when transactions are aborted (by making the transaction kthread commit the transaction) or if their commits fail. Signed-off-by: Filipe Manana <fdmanana@suse.com>
* Btrfs: fix unprotected list move from unused_bgs to deleted_bgs listFilipe Manana2015-12-101-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of my previous change titled "Btrfs: fix scrub preventing unused block groups from being deleted", the following warning at extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a filesysten with "-o discard": 10263 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) 10264 { (...) 10405 if (trimming) { 10406 WARN_ON(!list_empty(&block_group->bg_list)); 10407 spin_lock(&trans->transaction->deleted_bgs_lock); 10408 list_move(&block_group->bg_list, 10409 &trans->transaction->deleted_bgs); 10410 spin_unlock(&trans->transaction->deleted_bgs_lock); 10411 btrfs_get_block_group(block_group); 10412 } (...) This happens because scrub can now add back the block group to the list of unused block groups (fs_info->unused_bgs). This is dangerous because we are moving the block group from the unused block groups list to the list of deleted block groups without holding the lock that protects the source list (fs_info->unused_bgs_lock). The following diagram illustrates how this happens: CPU 1 CPU 2 cleaner_kthread() btrfs_delete_unused_bgs() sees bg X in list fs_info->unused_bgs deletes bg X from list fs_info->unused_bgs scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO (again) scrub_chunk(bg X) sets bg X back to RW mode adds bg X to the list fs_info->unused_bgs again, since it's still unused and currently not in that list sets bg X to RO mode btrfs_remove_chunk(bg X) --> discard is enabled and bg X is in the fs_info->unused_bgs list again so the warning is triggered --> we move it from that list into the transaction's delete_bgs list, but we can have another task currently manipulating the first list (fs_info->unused_bgs) Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both the list of unused block groups and the list of deleted block groups. This makes it safe and there's not much worry for more lock contention, as this lock is seldom used and only the cleaner kthread adds elements to the list of deleted block groups. The warning goes away too, as this was previously an impossible case (and would have been better a BUG_ON/ASSERT) but it's not impossible anymore. Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard"). Signed-off-by: Filipe Manana <fdmanana@suse.com>
* Btrfs: use global reserve when deleting unused block group after ENOSPCFilipe Manana2015-11-251-0/+32
| | | | | | | | | | | | | | | | | | | | | It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC. Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Also unmounting and mounting again the filesystem didn't help. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
* Merge branch 'allocator-fixes' into for-linus-4.4Chris Mason2015-10-211-38/+14
|\ | | | | | | Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: add a flags field to btrfs_transactionJosef Bacik2015-10-211-10/+8
| | | | | | | | | | | | | | | | | | I want to set some per transaction flags, so instead of adding yet another int lets just convert the current two int indicators to flags and add a flags field for future use. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * Btrfs: change how we wait for pending ordered extentsJosef Bacik2015-10-211-28/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a mechanism to make sure we don't lose updates for ordered extents that were logged in the transaction that is currently running. We add the ordered extent to a transaction list and then the transaction waits on all the ordered extents in that list. However are substantially large file systems this list can be extremely large, and can give us soft lockups, since the ordered extents don't remove themselves from the list when they do complete. To fix this we simply add a counter to the transaction that is incremented any time we have a logged extent that needs to be completed in the current transaction. Then when the ordered extent finally completes it decrements the per transaction counter and wakes up the transaction if we are the last ones. This will eliminate the softlockup. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* | btrfs: qgroup: Use new metadata reservation.Qu Wenruo2015-10-211-27/+7
|/ | | | | | | | As we have the new metadata reservation functions, use them to replace the old btrfs_qgroup_reserve() call for metadata. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* Merge branch 'cleanups/for-4.4' of ↵Chris Mason2015-10-211-24/+8
|\ | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
| * btrfs: change num_items type from u64 to unsigned intAlexandru Moise2015-10-211-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | The value of num_items that start_transaction() ultimately always takes is a small one, so a 64 bit integer is overkill. Also change num_items for btrfs_start_transaction() and btrfs_start_transaction_lflush() as well. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: memset cur_trans->delayed_refs to zeroAlexandru Moise2015-10-211-6/+2
| | | | | | | | | | | | | | | | | | Use memset() to null out the btrfs_delayed_ref_root of btrfs_transaction instead of setting all the members to 0 by hand. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: trimming some start_transaction() code awayAlexandru Moise2015-10-211-14/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Just call kmem_cache_zalloc() instead of calling kmem_cache_alloc(). We're just initializing most fields to 0, false and NULL later on _anyway_, so to make the code mode readable and potentially gain a bit of performance (completely untested claim), we should fill our btrfs_trans_handle with zeros on allocation then just initialize those five remaining fields (not counting the list_heads) as normal. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge branch 'fix/waitqueue-barriers' of ↵Chris Mason2015-10-121-0/+9
|\ \ | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
| * | btrfs: add comments to barriers before waitqueue_activeDavid Sterba2015-10-101-0/+3
| | | | | | | | | | | | | | | | | | Reduce number of undocumented barriers out there. Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: add barrier for waitqueue_active in clear_btree_io_treeDavid Sterba2015-10-101-0/+6
| | | | | | | | | | | | | | | | | | | | | waitqueue_active should be preceded by a barrier, in this function we don't need to call it all the time. Signed-off-by: David Sterba <dsterba@suse.com>
* | | Merge branch 'anand/sysfs-updates-v4.3-rc3' of ↵Chris Mason2015-10-121-1/+1
|\ \ \ | |_|/ |/| | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 Signed-off-by: Chris Mason <clm@fb.com>
| * | Btrfs: consolidate btrfs_error() to btrfs_std_error()Anand Jain2015-09-291-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | btrfs_error() and btrfs_std_error() does the same thing and calls _btrfs_std_error(), so consolidate them together. And the main motivation is that btrfs_error() is closely named with btrfs_err(), one handles error action the other is to log the error, so don't closely name them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge branch 'for-linus-4.3' of ↵Linus Torvalds2015-10-091-0/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are small and assorted. Neil's is the oldest, I dropped the ball thinking he was going to send it in" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: support NFSv2 export Btrfs: open_ctree: Fix possible memory leak Btrfs: fix deadlock when finalizing block group creation Btrfs: update fix for read corruption of compressed and shared extents Btrfs: send, fix corner case for reference overwrite detection
| * Btrfs: fix deadlock when finalizing block group creationFilipe Manana2015-10-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Josef ran into a deadlock while a transaction handle was finalizing the creation of its block groups, which produced the following trace: [260445.593112] fio D ffff88022a9df468 0 8924 4518 0x00000084 [260445.593119] ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488 [260445.593126] ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0 [260445.593132] ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00 [260445.593137] Call Trace: [260445.593145] [<ffffffff8175a437>] schedule+0x37/0x80 [260445.593189] [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs] [260445.593197] [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0 [260445.593225] [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs] [260445.593253] [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs] [260445.593295] [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs] [260445.593324] [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [260445.593351] [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [260445.593394] [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs] [260445.593427] [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs] [260445.593459] [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs] [260445.593491] [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs] [260445.593524] [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs] [260445.593532] [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170 [260445.593564] [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs] [260445.593597] [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs] [260445.593626] [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs] [260445.593654] [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs] [260445.593682] [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs] [260445.593724] [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs] [260445.593752] [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [260445.593830] [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [260445.593905] [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs] [260445.593946] [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs] [260445.593990] [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs] [260445.594042] [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs] [260445.594089] [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs] [260445.594115] [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0 [260445.594133] [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180 [260445.594149] [<ffffffff8123e35d>] do_fsync+0x3d/0x70 [260445.594169] [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110 [260445.594187] [<ffffffff8123e600>] SyS_fsync+0x10/0x20 [260445.594204] [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71 This happened because the same transaction handle created a large number of block groups and while finalizing their creation (inserting new items and updating existing items in the chunk and device trees) a new metadata extent had to be allocated and no free space was found in the current metadata block groups, which made find_free_extent() attempt to allocate a new block group via do_chunk_alloc(). However at do_chunk_alloc() we ended up allocating a new system chunk too and exceeded the threshold of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the final part of block group creation again (at btrfs_create_pending_block_groups()) and attempt to lock again the root of the chunk tree when it's already write locked by the same task. Similarly we can deadlock on extent tree nodes/leafs if while we are running delayed references we end up creating a new metadata block group in order to allocate a new node/leaf for the extent tree (as part of a CoW operation or growing the tree), as btrfs_create_pending_block_groups inserts items into the extent tree as well. In this case we get the following trace: [14242.773581] fio D ffff880428ca3418 0 3615 3100 0x00000084 [14242.773588] ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438 [14242.773594] ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460 [14242.773600] ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190 [14242.773606] Call Trace: [14242.773613] [<ffffffff8175a437>] schedule+0x37/0x80 [14242.773656] [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs] [14242.773664] [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0 [14242.773692] [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs] [14242.773720] [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs] [14242.773750] [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [14242.773758] [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200 [14242.773786] [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs] [14242.773818] [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs] [14242.773850] [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs] [14242.773934] [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs] [14242.773998] [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs] [14242.774041] [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs] [14242.774078] [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs] [14242.774118] [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs] [14242.774155] [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs] [14242.774194] [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs] [14242.774235] [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs] [14242.774274] [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [14242.774318] [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs] [14242.774358] [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs] [14242.774391] [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs] [14242.774432] [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs] [14242.774474] [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs] [14242.774516] [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs] [14242.774558] [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs] [14242.774599] [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs] [14242.774642] [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs] [14242.774650] [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0 [14242.774657] [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180 [14242.774663] [<ffffffff8123e35d>] do_fsync+0x3d/0x70 [14242.774669] [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110 [14242.774675] [<ffffffff8123e600>] SyS_fsync+0x10/0x20 [14242.774681] [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71 Fix this by never recursing into the finalization phase of block group creation and making sure we never trigger the finalization of block group creation while running delayed references. Reported-by: Josef Bacik <jbacik@fb.com> Fixes: 00d80e342c0f ("Btrfs: fix quick exhaustion of the system array in the superblock") Signed-off-by: Filipe Manana <fdmanana@suse.com>
* | Merge branch 'for-linus-4.3' of ↵Linus Torvalds2015-09-251-0/+32
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assorted set I've been queuing up: Jeff Mahoney tracked down a tricky one where we ended up starting IO on the wrong mapping for special files in btrfs_evict_inode. A few people reported this one on the list. Filipe found (and provided a test for) a difficult bug in reading compressed extents, and Josef fixed up some quota record keeping with snapshot deletion. Chandan killed off an accounting bug during DIO that lead to WARN_ONs as we freed inodes" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: keep dropped roots in cache until transaction commit Btrfs: Direct I/O: Fix space accounting btrfs: skip waiting on ordered range for special files Btrfs: fix read corruption of compressed and shared extents Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock Btrfs: don't initialize a space info as full to prevent ENOSPC
| * Btrfs: keep dropped roots in cache until transaction commitJosef Bacik2015-09-221-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When dropping a snapshot we need to account for the qgroup changes. If we drop the snapshot in all one go then the backref code will fail to find blocks from the snapshot we dropped since it won't be able to find the root in the fs root cache. This can lead to us failing to find refs from other roots that pointed at blocks in the now deleted root. To handle this we need to not remove the fs roots from the cache until after we process the qgroup operations. Do this by adding dropped roots to a list on the transaction, and letting the transaction remove the roots at the same time it drops the commit roots. This will keep all of the backref searching code in sync properly, and fixes a problem Mark was seeing with snapshot delete and qgroups. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Chris Mason <clm@fb.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2015-09-051-6/+2
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "In this one: - d_move fixes (Eric Biederman) - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error handling ought to be fixed) - switch of sb_writers to percpu rwsem (Oleg Nesterov) - superblock scalability (Josef Bacik and Dave Chinner) - swapon(2) race fix (Hugh Dickins)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits) vfs: Test for and handle paths that are unreachable from their mnt_root dcache: Reduce the scope of i_lock in d_splice_alias dcache: Handle escaped paths in prepend_path mm: fix potential data race in SyS_swapon inode: don't softlockup when evicting inodes inode: rename i_wb_list to i_io_list sync: serialise per-superblock sync operations inode: convert inode_sb_list_lock to per-sb inode: add hlist_fake to avoid the inode hash lock in evict writeback: plug writeback at a high level change sb_writers to use percpu_rw_semaphore shift percpu_counter_destroy() into destroy_super_work() percpu-rwsem: kill CONFIG_PERCPU_RWSEM percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() percpu-rwsem: introduce percpu_down_read_trylock() document rwsem_release() in sb_wait_write() fix the broken lockdep logic in __sb_start_write() introduce __sb_writers_{acquired,release}() helpers ufs_inode_get{frag,block}(): get rid of 'phys' argument ufs_getfrag_block(): tidy up a bit ...
| * introduce __sb_writers_{acquired,release}() helpersOleg Nesterov2015-08-151-6/+2
| | | | | | | | | | | | | | | | | | Preparation to hide the sb->s_writers internals from xfs and btrfs. Add 2 trivial define's they can use rather than play with ->s_writers directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Jan Kara <jack@suse.com>
* | Btrfs: check if previous transaction aborted to avoid fs corruptionFilipe Manana2015-08-191-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While we are committing a transaction, it's possible the previous one is still finishing its commit and therefore we wait for it to finish first. However we were not checking if that previous transaction ended up getting aborted after we waited for it to commit, so we ended up committing the current transaction which can lead to fs corruption because the new superblock can point to trees that have had one or more nodes/leafs that were never durably persisted. The following sequence diagram exemplifies how this is possible: CPU 0 CPU 1 transaction N starts (...) btrfs_commit_transaction(N) cur_trans->state = TRANS_STATE_COMMIT_START; (...) cur_trans->state = TRANS_STATE_COMMIT_DOING; (...) cur_trans->state = TRANS_STATE_UNBLOCKED; root->fs_info->running_transaction = NULL; btrfs_start_transaction() --> starts transaction N + 1 btrfs_write_and_wait_transaction(trans, root); --> starts writing all new or COWed ebs created at transaction N creates some new ebs, COWs some existing ebs but doesn't COW or deletes eb X btrfs_commit_transaction(N + 1) (...) cur_trans->state = TRANS_STATE_COMMIT_START; (...) wait_for_commit(root, prev_trans); --> prev_trans == transaction N btrfs_write_and_wait_transaction() continues writing ebs --> fails writing eb X, we abort transaction N and set bit BTRFS_FS_STATE_ERROR on fs_info->fs_state, so no new transactions can start after setting that bit cleanup_transaction() btrfs_cleanup_one_transaction() wakes up task at CPU 1 continues, doesn't abort because cur_trans->aborted (transaction N + 1) is zero, and no checks for bit BTRFS_FS_STATE_ERROR in fs_info->fs_state are made btrfs_write_and_wait_transaction(trans, root); --> succeeds, no errors during writeback write_ctree_super(trans, root, 0); --> succeeds --> we have now a superblock that points us to some root that uses eb X, which was never written to disk In this scenario future attempts to read eb X from disk results in an error message like "parent transid verify failed on X wanted Y found Z". So fix this by aborting the current transaction if after waiting for the previous transaction we verify that it was aborted. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
* | Merge branch 'jeffm-discard-4.3' into for-linus-4.3Chris Mason2015-08-091-0/+2
|\ \
| * | btrfs: add missing discards when unpinning extents with -o discardJeff Mahoney2015-07-291-0/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we clear the dirty bits in btrfs_delete_unused_bgs for extents in the empty block group, it results in btrfs_finish_extent_commit being unable to discard the freed extents. The block group removal patch added an alternate path to forget extents other than btrfs_finish_extent_commit. As a result, any extents that would be freed when the block group is removed aren't discarded. In my test run, with a large copy of mixed sized files followed by removal, it left nearly 2/3 of extents undiscarded. To clean up the block groups, we add the removed block group onto a list that will be discarded after transaction commit. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
* / btrfs: Remove unnecessary variants in relocation.cZhaolei2015-08-091-1/+1
|/ | | | | | | | These arguments are not used in functions, remove them for cleanup and make kernel stack happy. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()Zhao Lei2015-07-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of delayed_iput_sem in xfstests generic/241: [ 2061.345955] ============================================= [ 2061.346027] [ INFO: possible recursive locking detected ] [ 2061.346027] 4.1.0+ #268 Tainted: G W [ 2061.346027] --------------------------------------------- [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock: [ 2061.346027] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100 [ 2061.346027] but task is already holding lock: [ 2061.346027] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100 [ 2061.346027] other info that might help us debug this: [ 2061.346027] Possible unsafe locking scenario: [ 2061.346027] CPU0 [ 2061.346027] ---- [ 2061.346027] lock(&fs_info->delayed_iput_sem); [ 2061.346027] lock(&fs_info->delayed_iput_sem); [ 2061.346027] *** DEADLOCK *** It is rarely happened, about 1/400 in my test env. The reason is recursion of btrfs_run_delayed_iputs(): cleaner_kthread -> btrfs_run_delayed_iputs() *1 -> get delayed_iput_sem lock *2 -> iput() -> ... -> btrfs_commit_transaction() -> btrfs_run_delayed_iputs() *1 -> get delayed_iput_sem lock (dead lock) *2 *1: recursion of btrfs_run_delayed_iputs() *2: warning of lockdep about delayed_iput_sem When fs is in high stress, new iputs may added into fs_info->delayed_iputs list when btrfs_run_delayed_iputs() is running, which cause second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem) again, and cause above lockdep warning. Actually, it will not cause real problem because both locks are read lock, but to avoid lockdep warning, we can do a fix. Fix: Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for cleaner_kthread thread to break above recursion path. cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code, and don't need to call btrfs_run_delayed_iputs() again in btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow. Test: No above lockdep warning after patch in 1200 generic/241 tests. Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix list transaction->pending_ordered corruptionFilipe Manana2015-07-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we call btrfs_commit_transaction(), we splice the list "ordered" of our transaction handle into the transaction's "pending_ordered" list, but we don't re-initialize the "ordered" list of our transaction handle, this means it still points to the same elements it used to before the splice. Then we check if the current transaction's state is >= TRANS_STATE_COMMIT_START and if it is we end up calling btrfs_end_transaction() which simply splices again the "ordered" list of our handle into the transaction's "pending_ordered" list, leaving multiple pointers to the same ordered extents which results in list corruption when we are iterating, removing and freeing ordered extents at btrfs_wait_pending_ordered(), resulting in access to dangling pointers / use-after-free issues. Similarly, btrfs_end_transaction() can end up in some cases calling btrfs_commit_transaction(), and both did a list splice of the transaction handle's "ordered" list into the transaction's "pending_ordered" without re-initializing the handle's "ordered" list, resulting in exactly the same problem. This produces the following warning on a kernel with linked list debugging enabled: [109749.265416] ------------[ cut here ]------------ [109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98() [109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d (...) [109749.287505] Call Trace: [109749.288135] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [109749.298080] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2 [109749.331605] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [109749.334849] [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98 [109749.337093] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 [109749.337847] [<ffffffff81260642>] __list_del_entry+0x5a/0x98 [109749.338678] [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs] [109749.340145] [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs] [109749.348313] [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs] [109749.349745] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf [109749.350819] [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs] [109749.351976] [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e [109749.360341] [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e [109749.368828] [<ffffffff8118ee1d>] do_fsync+0x34/0x4e [109749.369790] [<ffffffff8118f045>] SyS_fsync+0x10/0x14 [109749.370925] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [109749.382274] ---[ end trace 48e0d07f7c03d95a ]--- On a non-debug kernel this leads to invalid memory accesses, causing a crash. Fix this by using list_splice_init() instead of list_splice() in btrfs_commit_transaction() and btrfs_end_transaction(). Cc: stable@vger.kernel.org Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3" Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
* btrfs: qgroup: Make snapshot accounting work with new extent-orientedQu Wenruo2015-06-101-20/+33
| | | | | | | | | | qgroup. Make snapshot accounting work with new extent-oriented mechanism by skipping given root in new/old_roots in create_pending_snapshot(). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots.Qu Wenruo2015-06-101-0/+1
| | | | | | | | | | | | | This is used by later qgroup fix patches for snapshot. As current snapshot accounting is done by btrfs_qgroup_inherit(), but new extent oriented quota mechanism will account extent from btrfs_copy_root() and other snapshot things, causing wrong result. So add this ability to handle snapshot accounting. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.Qu Wenruo2015-06-101-0/+18
| | | | | | | | | | | | Switch from old ref_node based qgroup to extent based qgroup mechanism for normal operations. The new mechanism should hugely reduce the overhead of btrfs quota system, and further more, the codes and logic should be more clean and easier to maintain. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: qgroup: Record possible quota-related extent for qgroup.Qu Wenruo2015-06-101-0/+1
| | | | | | | | | Add hook in add_delayed_ref_head() to record quota-related extent record into delayed_ref_root->dirty_extent_record rb-tree for later qgroup accounting. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix -ENOSPC when finishing block group creationFilipe Manana2015-06-031-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While creating a block group, we often end up getting ENOSPC while updating the chunk tree, which leads to a transaction abortion that produces a trace like the following: [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [30670.117777] BTRFS: Transaction aborted (error -28) (...) [30670.163567] Call Trace: [30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs] [30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs] [30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs] [30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15 [30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de [30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f [30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62 [30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [30670.187792] ---[ end trace 0373e6b491c4a8cc ]--- This is because we don't do proper space reservation for the chunk block reserve when we have multiple tasks allocating chunks in parallel. So block group creation has 2 phases, and the first phase essentially checks if there is enough space in the system space_info, allocating a new system chunk if there isn't, while the second phase updates the device, extent and chunk trees. However, because the updates to the chunk tree happen in the second phase, if we have N tasks, each with its own transaction handle, allocating new chunks in parallel and if there is only enough space in the system space_info to allocate M chunks, where M < N, none of the tasks ends up allocating a new system chunk in the first phase and N - M tasks will get -ENOSPC when attempting to update the chunk tree in phase 2 if they need to COW any nodes/leafs from the chunk tree. Fix this by doing proper reservation in the chunk block reserve. The issue could be reproduced by running fstests generic/038 in a loop, which eventually triggered the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: allow block group cache writeout outside critical section in commitChris Mason2015-04-101-1/+37
| | | | | | | | | | | | | | | | | We loop through all of the dirty block groups during commit and write the free space cache. In order to make sure the cache is currect, we do this while no other writers are allowed in the commit. If a large number of block groups are dirty, this can introduce long stalls during the final stages of the commit, which can block new procs trying to change the filesystem. This commit changes the block group cache writeout to take appropriate locks and allow it to run earlier in the commit. We'll still have to redo some of the block groups, but it means we can get most of the work out of the way without blocking the entire FS. Signed-off-by: Chris Mason <clm@fb.com>