path: root/fs/btrfs/extent-tree.c
Commit message | Author | Age | Files | Lines
* Btrfs: abort the transaction when we don't find our extent ref | Josef Bacik | 2014-04-07 | 1 | -0/+2
    I'm not sure why we weren't aborting here in the first place, it is obviously a bad time from the fact that we print the leaf and yell loudly about it. Fix this up, otherwise we panic because our path could be pointing into oblivion. Thanks,

    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: fix lockdep warning with reclaim lock inversion | Jeff Mahoney | 2014-04-07 | 1 | -3/+7
    When encountering memory pressure, testers have run into the following lockdep warning. It was caused by __link_block_group calling kobject_add with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL, which gets us into reclaim context. The kobject doesn't actually need to be added under the lock -- it just needs to ensure that it's only added for the first block group to be linked.

        =========================================================
        [ INFO: possible irq lock inversion dependency detected ]
        3.14.0-rc8-default #1 Not tainted
        ---------------------------------------------------------
        kswapd0/169 just changed the state of lock:
         (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
        but this lock took another, RECLAIM_FS-unsafe lock in the past:
         (&found->groups_sem){+++++.}

        and interrupts could create inverse lock ordering between them.

        other info that might help us debug this:
         Possible interrupt unsafe locking scenario:

               CPU0                    CPU1
               ----                    ----
          lock(&found->groups_sem);
                                       local_irq_disable();
                                       lock(&delayed_node->mutex);
                                       lock(&found->groups_sem);
          <Interrupt>
            lock(&delayed_node->mutex);

         *** DEADLOCK ***

        2 locks held by kswapd0/169:
         #0: (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
         #1: (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90

    Signed-off-by: Jeff Mahoney <jeffm@suse.com>
    Signed-off-by: Chris Mason <clm@fb.com>
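The locking change in the lockdep fix can be illustrated with a small userspace model: decide under the lock whether this is the first block group, then do the allocation-prone work only after dropping it. This is a sketch, not the kernel patch; `struct space_info`, `groups_sem_down()`, `groups_sem_up()`, `add_raid_kobject()` and `link_block_group()` are hypothetical stand-ins, and the rwsem is modelled as a plain flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a space_info and its groups_sem. */
struct space_info {
    bool groups_sem_held;   /* models the rwsem as a simple flag */
    int nr_groups;          /* block groups linked so far */
    int kobjects_added;     /* how many times the kobject was added */
};

static void groups_sem_down(struct space_info *s)
{
    assert(!s->groups_sem_held);
    s->groups_sem_held = true;
}

static void groups_sem_up(struct space_info *s)
{
    assert(s->groups_sem_held);
    s->groups_sem_held = false;
}

/* Stands in for kobject_add(): it may allocate and enter reclaim,
 * so in this model it must never run with groups_sem held. */
static void add_raid_kobject(struct space_info *s)
{
    assert(!s->groups_sem_held);
    s->kobjects_added++;
}

static void link_block_group(struct space_info *s)
{
    bool first;

    groups_sem_down(s);
    first = (s->nr_groups == 0);  /* decide under the lock */
    s->nr_groups++;
    groups_sem_up(s);

    if (first)                    /* act only after dropping it */
        add_raid_kobject(s);
}
```

Linking two block groups in this model adds the kobject exactly once, and the assertion in `add_raid_kobject()` would fire if the add ever ran under the lock.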
* Btrfs: remove transaction from send | Josef Bacik | 2014-04-06 | 1 | -10/+10
    Let's try this again. We can deadlock the box if we send on a box and try to write onto the same fs with the app that is trying to listen to the send pipe. This is because the writer could get stuck waiting for a transaction commit which is being blocked by the send.

    So fix this by making sure looking at the commit roots is always going to be consistent. We do this by keeping track of which roots need to have their commit roots swapped during commit, and then taking the commit_root_sem and swapping them all at once. Then make sure we take a read lock on the commit_root_sem in cases where we search the commit root to make sure we're always looking at a consistent view of the commit roots.

    Previously we had problems with this because we would swap a fs tree commit root and then swap the extent tree commit root independently, which would cause the backref walking code to screw up sometimes. With this patch we no longer deadlock and pass all the weird send/receive corner cases. Thanks,

    Reported-by: Hugo Mills <hugo@carfax.org.uk>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: check for an extent_op on the locked ref | Josef Bacik | 2014-04-06 | 1 | -1/+2
    We could have possibly added an extent_op to the locked_ref while we dropped locked_ref->lock, so check for this case as well and loop around. Otherwise we could lose flag updates which would lead to extent tree corruption. Thanks,

    cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock | Miao Xie | 2014-03-10 | 1 | -4/+4
    We needn't flush all delalloc inodes when we don't get the s_umount lock; otherwise we would make the tasks wait for a long time.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
* Btrfs: reclaim delalloc metadata more aggressively | Miao Xie | 2014-03-10 | 1 | -1/+1
    generic/074 in xfstests sometimes failed with an ENOSPC error. The reason is that we reclaimed only the space we needed from the space reserved for delalloc and then tried to reserve it, but some other task could do a no-flush reservation between the reclamation and the reservation:

        Task1                           Task2
        shrink_delalloc()
        reclaim 1 block
        (The space that can be
         reserved now is 1 block)
                                        do no-flush reservation
                                        reserve 1 block
                                        (The space that can be
                                         reserved now is 0 block)
        reserving 1 block failed

    The reservation of Task1 failed, but in fact there would have been enough space to reserve if we had reclaimed more space beforehand.

    Fix this problem by reclaiming the reserved delalloc metadata space more aggressively.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
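The Task1/Task2 interleaving described above can be replayed in a single-threaded model. This is only a sketch under stated assumptions: `struct space`, `reclaim()`, `reserve()` and `task1_reserve_with_race()` are hypothetical names, and the "aggressive" target of 2 blocks is an arbitrary illustration, since the commit message does not say how much extra space the real patch reclaims.

```c
#include <stdbool.h>

/* Hypothetical model of the metadata space accounting, in blocks. */
struct space {
    unsigned long reservable;   /* blocks that can be reserved right now */
    unsigned long reclaimable;  /* blocks held by delalloc reservations */
};

/* Move up to 'want' blocks from the delalloc pool to the reservable pool. */
static unsigned long reclaim(struct space *s, unsigned long want)
{
    unsigned long got = want < s->reclaimable ? want : s->reclaimable;

    s->reclaimable -= got;
    s->reservable += got;
    return got;
}

/* No-flush reservation: succeeds only if the space is already there. */
static bool reserve(struct space *s, unsigned long n)
{
    if (s->reservable < n)
        return false;
    s->reservable -= n;
    return true;
}

/* Replays the interleaving: Task1 reclaims, Task2 sneaks in a 1-block
 * no-flush reservation, then Task1 tries to reserve its 1 block. */
static bool task1_reserve_with_race(struct space *s, unsigned long reclaim_target)
{
    reclaim(s, reclaim_target);  /* Task1: shrink_delalloc() */
    reserve(s, 1);               /* Task2: no-flush reservation */
    return reserve(s, 1);        /* Task1: its own reservation */
}
```

With a reclaim target of exactly 1 block, Task1 loses the race and fails; with any larger target it succeeds even though Task2 got in first.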
* Btrfs: remove unnecessary lock in may_commit_transaction() | Miao Xie | 2014-03-10 | 1 | -8/+1
    The reason is:
    - The per-cpu counter has its own lock to protect itself.
    - Here we needn't get an exact value.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
* Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume | Miao Xie | 2014-03-10 | 1 | -0/+35
    If a snapshot is created after a nocow write but before the dirty data is flushed, we would fail to flush the dirty data because there is no space. So we must keep track of when nocow write operations start and when they end; if there are nocow writers, the snapshot creators must wait.

    To implement this, I introduce btrfs_{start, end}_nocow_write(), which is similar to mnt_{want,drop}_write(). These two functions are only used for nocow file write operations.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
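The start/end bracket described above is, in spirit, a writer count that snapshot creation waits on. The following single-threaded sketch shows the idea only; `struct subvol`, `start_nocow_write()`, `end_nocow_write()` and `snapshot_can_proceed()` are hypothetical names rather than the kernel functions, and refusing new nocow writers once a snapshot is pending is an assumption that the commit message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-subvolume state. */
struct subvol {
    int nocow_writers;      /* nocow writes currently in flight */
    bool snapshot_pending;  /* a snapshot creator is waiting */
};

/* Returns true if the caller may do a nocow write. */
static bool start_nocow_write(struct subvol *root)
{
    if (root->snapshot_pending)
        return false;       /* assumption: caller falls back to a COW write */
    root->nocow_writers++;
    return true;
}

static void end_nocow_write(struct subvol *root)
{
    assert(root->nocow_writers > 0);
    root->nocow_writers--;
}

/* The snapshot creator announces itself, then must wait until
 * every nocow writer that already started has finished. */
static bool snapshot_can_proceed(struct subvol *root)
{
    root->snapshot_pending = true;
    return root->nocow_writers == 0;
}
```

A real implementation would need atomic or per-cpu counters and a wait queue instead of polling; the model only shows the ordering rule.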
* btrfs: Cleanup the "_struct" suffix in btrfs_workequeue | Qu Wenruo | 2014-03-10 | 1 | -1/+1
    Since the "_struct" suffix was mainly used to distinguish the original btrfs_work from the newly created one, there is no need for the suffix now that all btrfs_workers have been changed into btrfs_workqueue.

    This patch also fixes some code whose style had been changed because of the overly long "_struct" suffix.

    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Tested-by: David Sterba <dsterba@suse.cz>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
* btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue. | Qu Wenruo | 2014-03-10 | 1 | -3/+3
    Replace the fs_info->cache_workers with the newly created btrfs_workqueue.

    Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
    Tested-by: David Sterba <dsterba@suse.cz>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
* Btrfs: don't loop forever if we can't run because of the tree mod log | Josef Bacik | 2014-02-08 | 1 | -0/+1
    A user reported a 100% cpu hang with my new delayed ref code. Turns out I forgot to increase the count check when we can't run a delayed ref because of the tree mod log. If we can't run any delayed refs during this there is no point in continuing to look, and we need to break out. Thanks,

    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix spin_unlock in check_ref_cleanup | Chris Mason | 2014-01-29 | 1 | -1/+3
    Our goto out should have gone a little farther.

    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix wrong block group in trace during the free space allocation | Miao Xie | 2014-01-28 | 1 | -1/+2
    We allocate the free space from the former block group, not the current one, so we should use the former one to output the trace information.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: cleanup the code of used_block_group in find_free_extent() | Miao Xie | 2014-01-28 | 1 | -20/+13
    used_block_group is only needed for a space cluster that doesn't belong to the current block group; the other places needn't use it. Using it elsewhere makes the logic of the code unclear.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: cleanup the redundant code for the block group allocation and init | Miao Xie | 2014-01-28 | 1 | -50/+44
    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix btrfs boot when compiled as built-in | Filipe David Borba Manana | 2014-01-28 | 1 | -3/+3
    After the change titled "Btrfs: add support for inode properties", if btrfs was built into the kernel (i.e. not as a module), it would cause a kernel panic, as reported recently by Fengguang:

        [ 2.024722] BUG: unable to handle kernel NULL pointer dereference at (null)
        [ 2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
        [ 2.028684] PGD 0
        [ 2.028684] Oops: 0000 [#1] SMP
        [ 2.028684] Modules linked in:
        [ 2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
        [ 2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        [ 2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
        [ 2.028684] RIP: 0010:[<ffffffff81501594>] [<ffffffff81501594>] crc32c+0xc/0x6b
        [ 2.028684] RSP: 0000:ffff88000edd7e58 EFLAGS: 00010246
        [ 2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
        [ 2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
        [ 2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
        [ 2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
        [ 2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
        [ 2.028684] FS: 0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
        [ 2.028684] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        [ 2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
        [ 2.028684] Stack:
        [ 2.028684] ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
        [ 2.028684] 0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
        [ 2.028684] ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
        [ 2.028684] Call Trace:
        [ 2.028684] [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
        [ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
        [ 2.028684] [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
        [ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
        [ 2.028684] [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
        [ 2.028684] [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
        [ 2.028684] [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
        [ 2.028684] [<ffffffff8234c785>] ? do_early_param+0x88/0x88
        [ 2.028684] [<ffffffff819f61b5>] ? rest_init+0x89/0x89
        [ 2.028684] [<ffffffff819f61c3>] kernel_init+0xe/0x109

    The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs) started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as part of the properties initialization routine), libcrc32c is not yet initialized, so crc32c dereferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel panic on boot.

    The approach to fix this is to use the crypto component directly to get its crc32c (which is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is doing as well: it uses crypto directly to get crc32c functionality.

    Verified this works both when btrfs is built-in and when it's a loadable kernel module.

    Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: throttle delayed refs better | Josef Bacik | 2014-01-28 | 1 | -1/+40
    On one of our gluster clusters we noticed some pretty big lag spikes. This turned out to be because our transaction commit was taking like 3 minutes to complete. This is because we have like 30 gigs of metadata, so our global reserve would end up being the max which is like 512 mb. So our throttling code would allow a ridiculous amount of delayed refs to build up and then they'd all get run at transaction commit time, and for a cold mounted file system that could take up to 3 minutes to run.

    So fix the throttling to be based on both the size of the global reserve and how long it takes us to run delayed refs. This patch tracks the time it takes to run delayed refs and then only allows 1 seconds worth of outstanding delayed refs at a time. This way it will auto-tune itself from cold cache up to when everything is in memory and it no longer has to go to disk. This makes our transaction commits take much less time to run. Thanks,

    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
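The time-based half of the throttling described above can be sketched as a running average of the cost per delayed ref, combined with a check that the pending refs would take no more than about one second to run. This is an illustration only: `struct ref_stats`, `record_run()` and `should_throttle()` are hypothetical names, the 3:1 smoothing weights are an assumption, and the global-reserve half of the real check is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical running statistics for delayed ref processing. */
struct ref_stats {
    uint64_t avg_ns_per_ref;  /* smoothed cost of running one delayed ref */
};

/* Fold one measured batch into the running average. */
static void record_run(struct ref_stats *st, uint64_t refs_run, uint64_t elapsed_ns)
{
    uint64_t sample;

    if (refs_run == 0)
        return;
    sample = elapsed_ns / refs_run;
    if (st->avg_ns_per_ref == 0)
        st->avg_ns_per_ref = sample;   /* first measurement */
    else                               /* assumed weights: 3 parts old, 1 part new */
        st->avg_ns_per_ref = (st->avg_ns_per_ref * 3 + sample) / 4;
}

/* Throttle once the pending refs represent a second or more of work. */
static bool should_throttle(const struct ref_stats *st, uint64_t pending_refs)
{
    return pending_refs * st->avg_ns_per_ref >= NSEC_PER_SEC;
}
```

Because the average drops as the cache warms up, the number of refs allowed to accumulate grows on its own, which is the auto-tuning behaviour the commit message describes.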
* Btrfs: attach delayed ref updates to delayed ref heads | Josef Bacik | 2014-01-28 | 1 | -215/+102
    Currently we have two rb-trees, one for delayed ref heads and one for all of the delayed refs, including the delayed ref heads. When we process the delayed refs we have to hold onto the delayed ref lock for all of the selecting and merging and such, which results in quite a bit of lock contention. This was solved by having a waitqueue and only one flusher at a time, however this hurts if we get a lot of delayed refs queued up.

    So instead just have an rb tree for the delayed ref heads, and then attach the delayed ref updates to an rb tree that is per delayed ref head. Then we only need to take the delayed ref lock when adding new delayed refs and when selecting a delayed ref head to process, all the rest of the time we deal with a per delayed ref head lock which will be much less contentious.

    The locking rules for this get a little more complicated since we have to lock up to 3 things to properly process delayed refs, but I will address that problem later. For now this passes all of xfstests and my overnight stress tests. Thanks,

    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() | Wang Shilong | 2014-01-28 | 1 | -1/+1
    We may return early in btrfs_drop_snapshot(), and we shouldn't call btrfs_std_err() in that case. Fix it.

    Cc: stable@vger.kernel.org
    Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: return free space to global_rsv as much as possible | Liu Bo | 2014-01-28 | 1 | -1/+1
    @full is not protected within global_rsv.lock, so we may think global_rsv is already full but in fact it's not, so we miss the opportunity to return free space to global_rsv directly when we release other block_rsvs.

    Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: stop caching thread if extent_commit_sem is contended | Josef Bacik | 2014-01-28 | 1 | -1/+2
    We can starve out the transaction commit with a bunch of caching threads all running at the same time. This is because we will only drop the extent_commit_sem if we need_resched(), which isn't likely to happen since we will be reading a lot from the disk so have already schedule()'ed plenty. Alex observed that he could starve out a transaction commit for up to a minute with 32 caching threads all running at once. This will allow us to drop the extent_commit_sem to allow the transaction commit to swap the commit_root out and then all the cachers will start back up.

    Here is an explanation provided by Ingo:

    So, just to fill in what happens in this loop:

        mutex_unlock(&caching_ctl->mutex);
        cond_resched();
        goto again;

    where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem again:

        again:
            mutex_lock(&caching_ctl->mutex);
            /* need to make sure the commit_root doesn't disappear */
            down_read(&fs_info->extent_commit_sem);

    So, if I'm reading the code correct, there can be a fair amount of concurrency here: there may be multiple 'caching kthreads' per filesystem active, while there's one fs_info->extent_commit_sem per filesystem AFAICS.

    So, what happens if there are a lot of CPUs all busy holding the ->extent_commit_sem rwsem read-locked and a writer arrives? They'd all rush to try to release the fs_info->extent_commit_sem, and they'd block in the down_read() because there's a writer waiting.

    So there's a guarantee of forward progress. This should answer akpm's concern I think. Thanks,

    Acked-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: convert printk to btrfs_ and fix BTRFS prefix | Frank Holton | 2014-01-28 | 1 | -6/+8
    Convert all applicable cases of printk and pr_* to the btrfs_* macros.

    Fix all uses of the BTRFS prefix.

    Signed-off-by: Frank Holton <fholton@gmail.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix double initialization of the raid kobject | Miao Xie | 2014-01-28 | 1 | -4/+5
    We met the following oops when doing space balance:

        kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong.
        ...
        Call Trace:
         [<ffffffff81937262>] dump_stack+0x49/0x5f
         [<ffffffff8137d259>] kobject_init+0x89/0xa0
         [<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70
         [<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs]
         [<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs]
         [<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs]
         [<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs]
         [<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs]
         [<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs]
         [<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs]
         [<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs]
         [<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs]
        ...

    Steps to reproduce:

        # mkfs.btrfs -f <dev>
        # mount -o nospace_cache <dev> <mnt>
        # btrfs balance start <mnt>
        # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1

    The cause of this problem is that we initialized the raid kobject whenever we added a block group to an empty raid list. When we mount a btrfs filesystem the raid list is empty, so we initialize the raid kobject when we add the first block group. But if no data is stored in the block group, it is freed during balance and the raid list becomes empty again. If we then allocate a new block group and add it to the raid list, we initialize the raid kobject a second time, and the oops happens.

    Fix this problem by initializing the raid kobject only when mounting the fs.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Reported-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: fix static checker warnings | Jeff Mahoney | 2014-01-28 | 1 | -2/+2
    This patch fixes the following warnings:

        fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static?
        fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security]
            get_raid_name(index));

    Signed-off-by: Jeff Mahoney <jeffm@suse.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Josef Bacik <jbacik@fb.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: remove unused variable from find_free_extent | Valentina Giusti | 2014-01-28 | 1 | -2/+0
    The variable found_uncached_bg in find_free_extent is not used since commit 285ff5af6ce358e73f53b55c9efadd4335f4c2ff (Btrfs: remove the ideal caching code)

    Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: publish allocation data in sysfs | Jeff Mahoney | 2014-01-28 | 1 | -5/+77
    While trying to debug ENOSPC issues, it's helpful to understand what the kernel's view of the available space is. We export this information via ioctl, but sysfs files are more easily used.

    Signed-off-by: Jeff Mahoney <jeffm@suse.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: introduce a head ref rbtree | Liu Bo | 2014-01-28 | 1 | -7/+14
    The way we process delayed refs is:
    1) get a bunch of head refs,
    2) pick up one head ref,
    3) go one node back for any delayed ref updates.

    The head ref is also linked in the same rbtree as the delayed refs, so in stage 1) we have to walk the nodes one by one, visiting not only head refs but also delayed refs. When we have a great number of delayed refs pending, this costs a lot of time.

    Here we introduce a head-ref-specific rbtree that contains only head refs, so the trouble goes away.

    Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: don't miss skinny extent items on delayed ref head contention | Filipe David Borba Manana | 2013-12-12 | 1 | -12/+10
    Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup of skinny extent items. This can happen when the execution flow is the following:

    * We do an extent tree lookup and fail to find a skinny extent item;

    * As a result, we attempt to see if a non-skinny extent item exists, either by looking at previous item in the leaf or by doing another full extent tree search;

    * We have a transaction and then we check for a matching delayed ref head in the transaction's delayed refs rbtree;

    * We find such delayed ref head and then we try to lock it with a call to mutex_trylock();

    * The lock was contended so we jump to the label "again", which repeats the extent tree search but for a non-skinny extent item, because we set previously metadata variable to 0 and the search key to look for a non-skinny extent-item;

    * After the jump (and after releasing the transaction's delayed refs lock), a skinny extent item might have been added to the extent tree but we will miss it because metadata is set to 0 and the search key is set for a non-skinny extent-item.

    The fix here is to not reset metadata to 0 and to jump to the initial search key setup if the delayed ref head is contended, instead of jumping directly to the extent tree search label ("again").

    This issue was found while investigating the issue reported at Bugzilla 64961.

    David Sterba suspected this function was missing extent items, and that this could be caused by the last change to this function, which was made in the following patch:

        [PATCH] Btrfs: optimize btrfs_lookup_extent_info()
        (commit 74be9510876a66ad9826613ac8a526d26f9e7f01)

    But in fact this issue already existed before, because after failing to find a skinny extent item, the code set the search key for a non-skinny extent item, and on contention of a matching delayed ref head it would not search the extent tree for a skinny extent item anymore.

    Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
    Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: rename btrfs_start_all_delalloc_inodes | Miao Xie | 2013-11-11 | 1 | -1/+1
    Rename btrfs_start_all_delalloc_inodes() so that its name is consistent with btrfs_wait_ordered_roots(), since the two are always used in the same place.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: don't wait for the completion of all the ordered extents | Miao Xie | 2013-11-11 | 1 | -5/+6
    It is very likely that there are lots of ordered extents in the filesystem. If we wait for all of them to complete when we want to reclaim some space for a metadata space reservation, we are blocked for a long time, and performance drops sharply for the duration.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: don't wait for all the async delalloc when shrinking delalloc | Miao Xie | 2013-11-11 | 1 | -2/+12
    It was very likely that there were lots of async delalloc pages in the filesystem. If we waited until all the pages were flushed, we would be blocked for a long time, and performance would also drop.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fix the confusion between delalloc bytes and metadata bytes | Miao Xie | 2013-11-11 | 1 | -0/+6
    In shrink_delalloc(), what we need to reclaim is metadata space, so flushing pages by to_reclaim is not reasonable: it is very likely that the pages we flush are not enough. We then had to invoke the flush function several times, and in the worst case we needed to call flush_space several times. That wasted time.

    We improve this by converting the metadata space size we need to reserve into delalloc bytes. This way, we can flush a reasonable number of pages. (For now we use a fixed number to do the conversion, which is not flexible; maybe we can find a better way in the future.)

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: pick up the code for the item number calculation in flush_space() | Miao Xie | 2013-11-11 | 1 | -9/+16
    This patch separates out the code that calculates the number of items for which we need to reserve space; we will use it in the next patch.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: wait for the ordered extent only when we want | Miao Xie | 2013-11-11 | 1 | -1/+2
    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: remove unnecessary initialization and memory barrior in shrink_delalloc() | Miao Xie | 2013-11-11 | 1 | -4/+3
    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* btrfs: Fix checkpatch.pl warning of spacing issues | Dulshani Gunawardhana | 2013-11-11 | 1 | -3/+3
    Fix spacing issues detected via checkpatch.pl in accordance with the kernel style guidelines.

    Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* btrfs: Use WARN_ON()'s return value in place of WARN_ON(1) | Dulshani Gunawardhana | 2013-11-11 | 1 | -7/+4
    Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source code that outputs more descriptive warnings. Also fix the style warning about redundant braces that came up as a result of this fix.

    Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
    Reviewed-by: Zach Brown <zab@redhat.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
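The before/after shape of the WARN_ON() cleanup can be shown with a userspace stand-in. The `WARN_ON` macro below is not the kernel's; it is a hypothetical imitation that reports the condition and returns its truth value, which is the property the cleanup relies on. `check_old()` and `check_new()` are illustrative names, and printing the expression text is a choice of this sketch, not a claim about the kernel's output.

```c
#include <errno.h>
#include <stdio.h>

/* Userspace stand-in: report the condition and hand its value back. */
static int warn_on_report(int cond, const char *expr, const char *file, int line)
{
    if (cond)
        fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
    return cond;
}

#define WARN_ON(cond) warn_on_report(!!(cond), #cond, __FILE__, __LINE__)

/* Before: test the condition, then warn unconditionally inside the branch. */
static int check_old(int ret)
{
    if (ret) {
        WARN_ON(1);
        return -EIO;
    }
    return 0;
}

/* After: WARN_ON() both tests and reports, so the braces disappear. */
static int check_new(int ret)
{
    if (WARN_ON(ret))
        return -EIO;
    return 0;
}
```

Both variants return the same values; the second is shorter, and in this stand-in the warning names the condition that fired instead of a bare `1`.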
* Btrfs: fix the free space write out failure when there is no data space | Miao Xie | 2013-11-11 | 1 | -3/+12
    After running space balance on a new fs, the fs check program output the following warning message:

        free space inode generation (0) did not match free space cache generation (20)

    Steps to reproduce:

        # mkfs.btrfs -f <dev>
        # mount <dev> <mnt>
        # btrfs balance start <mnt>
        # umount <mnt>
        # btrfs check <dev>

    This happened because there was no data space left after the space balance, and the free space write out task didn't try to allocate a new data chunk for the free space inode when doing the reservation. So the data space reservation failed, and in order to tell the free space loader that this free space inode could not be trusted, the generation of the free space inode wasn't updated. The check program then found this problem and output the above message.

    But in fact, it is safe to try to allocate a new data chunk when we find there is not enough data space. The patch fixes the above problem this way.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: optimize extent item search in run_delayed_extent_op | Filipe David Borba Manana | 2013-11-11 | 1 | -7/+20
    Instead of doing another extent tree search if the first search failed to find a metadata item, check if the previous item in the leaf is an extent item and use it if it is, otherwise do the second tree search for an extent item.

    Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
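The "look at the previous item first" optimisation works because an extent item key sorts immediately before the metadata (skinny) item key for the same extent start. The sketch below models one leaf as a sorted array. `struct key`, `key_cmp()`, `leaf_search()` and `lookup_extent()` are hypothetical names; the key type values 168 and 169 are believed to match BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY but should be treated as assumptions here, and a linear scan stands in for the real tree search.

```c
#include <stddef.h>
#include <stdint.h>

#define EXTENT_ITEM_KEY   168  /* assumed value of BTRFS_EXTENT_ITEM_KEY */
#define METADATA_ITEM_KEY 169  /* assumed value of BTRFS_METADATA_ITEM_KEY */

struct key {
    uint64_t objectid;  /* extent start (bytenr) */
    uint8_t type;
    uint64_t offset;    /* num_bytes, or tree level for skinny items */
};

/* Keys order by objectid, then type, then offset. */
static int key_cmp(const struct key *a, const struct key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Returns the slot of the first key >= target; *found is set on exact match. */
static size_t leaf_search(const struct key *leaf, size_t n,
                          const struct key *target, int *found)
{
    size_t slot = 0;

    *found = 0;
    while (slot < n) {
        int c = key_cmp(&leaf[slot], target);

        if (c >= 0) {
            *found = (c == 0);
            break;
        }
        slot++;
    }
    return slot;
}

/* Returns the slot holding the extent's item, or -1 if a second
 * full search would still be needed. */
static long lookup_extent(const struct key *leaf, size_t n,
                          uint64_t bytenr, uint64_t level, uint64_t num_bytes)
{
    struct key want = { bytenr, METADATA_ITEM_KEY, level };
    int found;
    size_t slot = leaf_search(leaf, n, &want, &found);

    if (found)
        return (long)slot;
    if (slot > 0) {
        const struct key *prev = &leaf[slot - 1];

        /* A matching non-skinny item sits right before the search position. */
        if (prev->objectid == bytenr && prev->type == EXTENT_ITEM_KEY &&
            prev->offset == num_bytes)
            return (long)(slot - 1);
    }
    return -1;
}
```

When the previous slot matches, the second tree search is skipped entirely; only the `-1` case pays for it.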
* btrfs: add tracing for failed reservations | Jeff Mahoney | 2013-11-11 | 1 | -0/+7
    When debugging ENOSPC issues, it's nice to be able to see which reservations failed as well as the ones which succeeded.

    Signed-off-by: Jeff Mahoney <jeffm@suse.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* btrfs: remove fs/btrfs/compat.h | Zach Brown | 2013-11-11 | 1 | -1/+0
    fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink() and inc_nlink(). This doesn't belong in mainline.

    Signed-off-by: Zach Brown <zab@redhat.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fixup error path in __btrfs_inc_extent_ref | Liu Bo | 2013-11-11 | 1 | -8/+2
    When we fail to add a reference after a non-inline insertion for some reason, e.g. ENOSPC, we abort the transaction but don't return the error to the caller, who then has to walk around again to find out that something went wrong. That's unnecessary.

    Also fix up the other error paths to keep them simple.

    Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: free reserved space on error in a few places | Josef Bacik | 2013-11-11 | 1 | -2/+19
    While trying to track down a reserved space leak I noticed a few places where we won't properly clean up reserved space if we have an error, this patch fixes those up. Thanks,

    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: fixup reserved trace points | Josef Bacik | 2013-11-11 | 1 | -2/+6
    In trying to track down where we were leaking reserved space I noticed our reserve extent tracepoints are a little off. First we were saying that the reserved space had been alloced in btrfs_reserve_extent, which isn't the case, this needs to be triggered when we actually allocate the space when we run the delayed ref. We were also missing a few places where we should have been tracing the btrfs_reserve_extent_free tracepoint. With these in place I was able to put together where we were leaking reserved space. Thanks,

    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: don't leak block group on error | Filipe David Borba Manana | 2013-11-11 | 1 | -2/+1
    In extent-tree.c:btrfs_write_dirty_block_groups(), if the call to write_one_cache_group() failed, we would return without putting the block group first.

    Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: remove path arg from btrfs_truncate_free_space_cache | Filipe David Borba Manana | 2013-11-11 | 1 | -2/+1
    Not used for anything, and removing it avoids caller's need to allocate a path structure.

    Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: remove space_info->reservation_progress | Josef Bacik | 2013-09-21 | 1 | -3/+0
    This isn't used for anything anymore, just remove it.

    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: kill delay_iput arg to the wait_ordered functions | Josef Bacik | 2013-09-21 | 1 | -3/+3
    This is a left over of how we used to wait for ordered extents, which was to grab the inode and then run filemap flush on it. However if we have an ordered extent then we already are holding a ref on the inode, and we just use btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on the inode to start work on the ordered extent. Thanks,

    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Revert "Btrfs: rework the overcommit logic to be based on the total size" | Josef Bacik | 2013-09-21 | 1 | -12/+3
    This reverts commit 70afa3998c9baed4186df38988246de1abdab56d. It is causing performance issues and wasn't actually correct. There were problems with the way we flushed delalloc and that was the real cause of the early enospc. Thanks,

    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
* Btrfs: allocate the free space by the existed max extent size when ENOSPC | Miao Xie | 2013-09-21 | 1 | -9/+24
    With the current code, if the requested size is very large and all the extents in the free space cache are small, we waste a lot of cpu time cutting the requested size in half and searching the cache again and again until it gets down to a size the allocator can return. In fact, we know the max extent size in the cache after the first search, so we needn't cut the size in half repeatedly and can just use the max extent size directly. This saves a lot of cpu time and improves performance when there are only fragments in the free space cache.

    According to my test, if there are only 4KB free space extents in the fs and the total size of those extents is 256MB, we can reduce the execution time of the following test from 5.4s to 1.4s.

        dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync

    Changelog v2 -> v3:
    - fix the problem that we skip the block group with the space which is less than we need.

    Changelog v1 -> v2:
    - address the problem that we return a wrong start position when searching the free space in a bitmap.

    Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    Signed-off-by: Chris Mason <chris.mason@fusionio.com>
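The saving from using the largest known extent instead of repeated halving can be counted in a toy model of the free space cache. This is a sketch under stated assumptions: the cache is a flat array of extent sizes, and `find_extent()`, `alloc_by_halving()` and `alloc_by_max_extent()` are hypothetical names that only count how many searches each strategy needs.

```c
#include <stddef.h>

/* First-fit search over extent sizes. Returns 1 on success and always
 * reports the largest extent seen up to the point where it stopped. */
static int find_extent(const size_t *ext, size_t n, size_t want, size_t *max_seen)
{
    *max_seen = 0;
    for (size_t i = 0; i < n; i++) {
        if (ext[i] > *max_seen)
            *max_seen = ext[i];
        if (ext[i] >= want)
            return 1;
    }
    return 0;
}

/* Old strategy: halve the request until it fits. 'min' must be > 0.
 * Returns the number of searches; *got is the size allocated (0 if none). */
static unsigned alloc_by_halving(const size_t *ext, size_t n,
                                 size_t want, size_t min, size_t *got)
{
    unsigned searches = 0;
    size_t max_seen;

    *got = 0;
    while (want >= min) {
        searches++;
        if (find_extent(ext, n, want, &max_seen)) {
            *got = want;
            break;
        }
        want /= 2;
    }
    return searches;
}

/* New strategy: a failed search already tells us the largest extent,
 * so retry once with exactly that size. */
static unsigned alloc_by_max_extent(const size_t *ext, size_t n,
                                    size_t want, size_t min, size_t *got)
{
    unsigned searches = 1;
    size_t max_seen, ignored;

    *got = 0;
    if (find_extent(ext, n, want, &max_seen)) {
        *got = want;
        return searches;
    }
    if (max_seen >= min) {
        searches++;
        if (find_extent(ext, n, max_seen, &ignored))
            *got = max_seen;
    }
    return searches;
}
```

For a 1 MiB request against a cache holding only 4 KiB extents, the halving strategy searches once per power of two from 1 MiB down to 4 KiB, while the max-extent strategy needs two searches and ends up with the same 4 KiB allocation.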