summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/block-rsv.c
Commit message (Collapse)AuthorAgeFilesLines
* btrfs: simplify percent calculation helpers, rename div_factorDavid Sterba2022-12-051-2/+2
| | | | | | | | | | | | | | | | | | | | The div_factor* helpers calculate fraction or percentage fraction. The name is a bit confusing, we use it only for percentage calculations and there are two helpers. There's a helper mult_frac that's for general fractions, that tries to be accurate but we multiply and divide by small numbers so we can use the div_u64 helper. Rename the div_factor* helpers and use 1..100 percentage range, also drop the case checking for percentage == 100, it's never hit. The conversions: * div_factor calculates tenths and the numbers need to be adjusted * div_factor_fine is direct replacement Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove unused btrfs_cond_migrate_bytesJosef Bacik2022-12-051-25/+0
| | | | | | | | | | | The last user of this was removed in 7f9fe6144076 ("btrfs: improve global reserve stealing logic"), drop this code as it's no longer called by anybody. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move accessor helpers into accessors.hJosef Bacik2022-12-051-0/+1
| | | | | | | | | | | | | This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move mount option definitions to fs.hJosef Bacik2022-12-051-0/+1
| | | | | | | | | | | These are fs wide definitions and helpers, move them out of ctree.h and into fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: introduce BTRFS_RESERVE_FLUSH_EMERGENCYJosef Bacik2022-12-051-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inside of FB, as well as some user reports, we've had a consistent problem of occasional ENOSPC transaction aborts. Inside FB we were seeing ~100-200 ENOSPC aborts per day in the fleet, which is a really low occurrence rate given the size of our fleet, but it's not nothing. There are two causes of this particular problem. First is delayed allocation. The reservation system for delalloc assumes that contiguous dirty ranges will result in 1 file extent item. However if there is memory pressure that results in fragmented writeout, or there is fragmentation in the block groups, this won't necessarily be true. Consider the case where we do a single 256MiB write to a file and then close it. We will have 1 reservation for the inode update, the reservations for the checksum updates, and 1 reservation for the file extent item. At some point later we decide to write this entire range out, but we're so fragmented that we break this into 100 different file extents. Since we've already closed the file and are no longer writing to it there's nothing to trigger a refill of the delalloc block rsv to satisfy the 99 new file extent reservations we need. At this point we exhaust our delalloc reservation, and we begin to steal from the global reserve. If you have enough of these cases going in parallel you can easily exhaust the global reserve, get an ENOSPC at btrfs_alloc_tree_block() time, and then abort the transaction. The other case is the delayed refs reserve. The delayed refs reserve updates its size based on outstanding delayed refs and dirty block groups. However we only refill this block reserve when returning excess reservations and when we call btrfs_start_transaction(root, X). We will reserve 2*X credits at transaction start time, and fill in X into the delayed refs reserve to make sure it stays topped off. Generally this works well, but clearly has downsides. If we do a particularly delayed ref heavy operation we may never catch up in our reservations. Additionally running delayed refs generates more delayed refs, and at that point we may be committing the transaction and have no way to trigger a refill of our delayed refs rsv. Then a similar thing occurs with the delalloc reserve. Generally speaking we well over-reserve in all of our block rsvs. If we reserve 1 credit we're usually reserving around 264k of space, but we'll often not use any of that reservation, or use a few blocks of that reservation. We can be reasonably sure that as long as you were able to reserve space up front for your operation you'll be able to find space on disk for that reservation. So introduce a new flushing state, BTRFS_RESERVE_FLUSH_EMERGENCY. This gets used in the case that we've exhausted our reserve and the global reserve. It simply forces a reservation if we have enough actual space on disk to make the reservation, which is almost always the case. This keeps us from hitting ENOSPC aborts in these odd occurrences where we've not kept up with the delayed work. Fixing this in a complete way is going to be relatively complicated and time consuming. This patch is what I discussed with Filipe earlier this year, and what I put into our kernels inside FB. With this patch we're down to 1-2 ENOSPC aborts per week, which is a significant reduction. This is a decent stop gap until we can work out a more wholistic solution to these two corner cases. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: add KCSAN annotations for unlocked access to block_rsv->fullDavid Sterba2022-09-261-1/+1
| | | | | | | | | | | | | | | | | | | | | KCSAN reports that there's unlocked access mixed with locked access, which is technically correct but is not a bug. To avoid false alerts at least from KCSAN, add annotation and use a wrapper whenever ->full is accessed for read outside of lock. It is used as a fast check and only advisory. In the worst case the block reserve is found !full and becomes full in the meantime, but properly handled. Depending on the value of ->full, btrfs_block_rsv_release decides where to return the reservation, and block_rsv_release_bytes handles a NULL pointer for block_rsv and if it's not NULL then it double checks the full status under a lock. Link: https://lore.kernel.org/linux-btrfs/CAAwBoOJDjei5Hnem155N_cJwiEkVwJYvgN-tQrwWbZQGhFU=cA@mail.gmail.com/ Link: https://lore.kernel.org/linux-btrfs/YvHU/vsXd7uz5V6j@hungrycats.org Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: don't save block group root into super blockQu Wenruo2022-09-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The extent tree v2 needs a new root for storing all block group items, the whole feature hasn't been finished yet so we can afford to do some changes. My initial proposal years ago just added a new tree rootid, and load it from tree root, just like what we did for quota/free space tree/uuid/extent roots. But the extent tree v2 patches introduced a completely new way to store block group tree root into super block which is arguably wasteful. Currently there are only 3 trees stored in super blocks, and they all have their valid reasons: - Chunk root Needed for bootstrap. - Tree root Really the entry point for all trees. - Log root This is special as log root has to be updated out of existing transaction mechanism. There is not even any reason to put block group root into super blocks, the block group tree is updated at the same time as the old extent tree, no need for extra bootstrap/out-of-transaction update. So just move block group root from super block into tree root. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use enum for btrfs_block_rsv::typeDavid Sterba2022-07-251-3/+3
| | | | | | | | | | | | | | | The number of block group reserve types BTRFS_BLOCK_RSV_* is small and fits to u8 and there's enough left in case we want to add more. For type safety use the enum but make it 8 bits in the structure to save space. The structure size is now 48 on release build, making a slight improvement in structures where it's embedded, like btrfs_fs_info or btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: switch btrfs_block_rsv::full to boolDavid Sterba2022-07-251-9/+6
| | | | | | | | | Use simple bool type for the block reserve full status, there's short to save space as there used to be int but there's no reason for that. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: reserve extra space for the free space treeJosef Bacik2022-01-071-0/+1
| | | | | | | | | | | | | | | | | | | Filipe reported a problem where sometimes he'd get an ENOSPC abort when running delayed refs with generic/619 and the free space tree enabled. This is partly because we do not reserve space for modifying the free space tree, nor do we have a block rsv associated with that tree. The delayed_refs_rsv tracks the amount of space required to run delayed refs. This means 1 modification means 1 change to the extent root. With the free space tree this turns into 2 changes, because modifying 1 extent means updating the extent tree and potentially updating the free space tree to either remove that entry or add the free space. Thus if we have the FST enabled, simply double the reservation size for our modification. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: include the free space tree in the global rsv minimum calculationJosef Bacik2022-01-071-13/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Filipe reported a problem where generic/619 was failing with an ENOSPC abort while running delayed refs, like the following BTRFS: Transaction aborted (error -28) WARNING: CPU: 3 PID: 522920 at fs/btrfs/free-space-tree.c:1049 add_to_free_space_tree+0xe5/0x110 [btrfs] CPU: 3 PID: 522920 Comm: kworker/u16:19 Tainted: G W 5.16.0-rc2-btrfs-next-106 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] RIP: 0010:add_to_free_space_tree+0xe5/0x110 [btrfs] RSP: 0000:ffffa65087fb7b20 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff9131eeaa RDI: 00000000ffffffff RBP: ffff8d62e26481b8 R08: ffffffff9ad97ce0 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffffe4 R13: ffff8d61c25fe688 R14: ffff8d61ebd88800 R15: ffff8d61ebd88a90 FS: 0000000000000000(0000) GS:ffff8d64ed400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa46a8b1000 CR3: 0000000148d18003 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __btrfs_free_extent+0x516/0x950 [btrfs] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] btrfs_run_delayed_refs+0x86/0x210 [btrfs] flush_space+0x403/0x630 [btrfs] ? call_rcu_tasks_generic+0x50/0x80 ? lock_release+0x223/0x4a0 ? btrfs_get_alloc_profile+0xb5/0x290 [btrfs] ? do_raw_spin_unlock+0x4b/0xa0 btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs] process_one_work+0x24c/0x5b0 worker_thread+0x55/0x3c0 ? process_one_work+0x5b0/0x5b0 kthread+0x17c/0x1a0 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 There's a couple of reasons for this, but in generic/619's case the largest reason is because it is a very small file system, ad we do not reserve enough space for the global reserve. With the free space tree we now have the free space tree that we need to modify when running delayed refs. This means we need the global reserve to take this into account when it calculates the minimum size it needs to be. This is especially important for very small file systems. Fix this by adjusting the minimum global block rsv size math to include the size of the free space tree when calculating the size. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: stop accessing ->csum_root directlyJosef Bacik2022-01-031-3/+5
| | | | | | | | | | We are going to have multiple csum roots in the future, so convert all users of ->csum_root to btrfs_csum_root() and rename ->csum_root to ->_csum_root so we can easily find remaining users in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: stop accessing ->extent_root directlyJosef Bacik2022-01-031-1/+3
| | | | | | | | | | | | When we start having multiple extent roots we'll need to use a helper to get to the correct extent_root. Rename fs_info->extent_root to _extent_root and convert all of the users of the extent root to using the btrfs_extent_root() helper. This will allow us to easily clean up the remaining direct accesses in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: init root block_rsv at init root timeJosef Bacik2022-01-031-16/+23
| | | | | | | | | | In the future we're going to have multiple csum and extent root trees, so init the roots block_rsv at setup_root time based on their root key objectid. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: change root to fs_info for btrfs_reserve_metadata_bytesJosef Bacik2022-01-031-5/+5
| | | | | | | | | | | | | | We used to need the root for btrfs_reserve_metadata_bytes to check the orphan cleanup state, but we no longer need that, we simply need the fs_info. Change btrfs_reserve_metadata_bytes() to use the fs_info, and change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the same as they simply call btrfs_reserve_metadata_bytes() and then manipulate the block_rsv that is being used. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: introduce mount option rescue=ignorebadrootsJosef Bacik2020-12-081-0/+8
| | | | | | | | | | | | | In the face of extent root corruption, or any other core fs wide root corruption we will fail to mount the file system. This makes recovery kind of a pain, because you need to fall back to userspace tools to scrape off data. Instead provide a mechanism to gracefully handle bad roots, so we can at least mount read-only and possibly recover data from the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: print the block rsv type when we fail our reservationJosef Bacik2020-11-051-1/+2
| | | | | | | | | | | | | | | | | | To help with debugging, print the type of the block rsv when we fail to use our target block rsv in btrfs_use_block_rsv. This now produces: [ 544.672035] BTRFS: block rsv 1 returned -28 which is still cryptic without consulting the enum in block-rsv.h but I guess it's better than nothing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note from Nikolay ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLEQu Wenruo2020-05-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | The name BTRFS_ROOT_REF_COWS is not very clear about the meaning. In fact, that bit can only be set to those trees: - Subvolume roots - Data reloc root - Reloc roots for above roots All other trees won't get this bit set. So just by the result, it is obvious that, roots with this bit set can have tree blocks shared with other trees. Either shared by snapshots, or by reloc roots (an special snapshot created by relocation). This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to make it easier to understand, and update all comment mentioning "reference counted" to follow the rename. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: force chunk allocation if our global rsv is larger than metadataJosef Bacik2020-05-251-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nikolay noticed a bunch of test failures with my global rsv steal patches. At first he thought they were introduced by them, but they've been failing for a while with 64k nodes. The problem is with 64k nodes we have a global reserve that calculates out to 13MiB on a freshly made file system, which only has 8MiB of metadata space. Because of changes I previously made we no longer account for the global reserve in the overcommit logic, which means we correctly allow overcommit to happen even though we are already overcommitted. However in some corner cases, for example btrfs/170, we will allocate the entire file system up with data chunks before we have enough space pressure to allocate a metadata chunk. Then once the fs is full we ENOSPC out because we cannot overcommit and the global reserve is taking up all of the available space. The most ideal way to deal with this is to change our space reservation stuff to take into account the height of the tree's that we're modifying, so that our global reserve calculation does not end up so obscenely large. However that is a huge undertaking. Instead fix this by forcing a chunk allocation if the global reserve is larger than the total metadata space. This gives us essentially the same behavior that happened before, we get a chunk allocated and these tests can pass. This is meant to be a stop-gap measure until we can tackle the "tree height only" project. Fixes: 0096420adb03 ("btrfs: do not account global reserve in can_overcommit") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove __ prefix from btrfs_block_rsv_releaseNikolay Borisov2020-03-231-5/+6
| | | | | | | | | | | | | Currently the non-prefixed version is a simple wrapper used to hide the 4th argument of the prefixed version. This doesn't bring much value in practice and only makes the code harder to follow by adding another level of indirection. Rectify this by removing the __ prefix and have only one public function to release bytes from a block reservation. No semantic changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: slightly simplify global block reserve calculationsAnand Jain2020-03-231-1/+1
| | | | | | | | | | | | | | | | | | In btrfs_update_global_block_rsv the lines: num_bytes = block_rsv->size - block_rsv->reserved; block_rsv->reserved += num_bytes; imply: block_rsv->reserved = block_rsv->size; Assign block_rsv->size to block_rsv->reserved directly and reorder lines so they match the other branch. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: add a comment describing block reservesJosef Bacik2020-03-231-0/+92
| | | | | | | | | | This is a giant comment at the top of block-rsv.c describing generally how block reserves work. It is purely about the block reserves themselves, and nothing to do with how the actual reservation system works. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use btrfs_try_granting_tickets in update_global_rsvJosef Bacik2019-09-091-0/+1
| | | | | | | | | | | | | | | | We have some annoying xfstests tests that will create a very small fs, fill it up, delete it, and repeat to make sure everything works right. This trips btrfs up sometimes because we may commit a transaction to free space, but most of the free metadata space was being reserved by the global reserve. So we commit and update the global reserve, but the space is simply added to bytes_may_use directly, instead of trying to add it to existing tickets. This results in ENOSPC when we really did have space. Fix this by calling btrfs_try_granting_tickets once we add back our excess space to wake any pending tickets. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: always reserve our entire size for the global reserveJosef Bacik2019-09-091-9/+4
| | | | | | | | | | | | | | | | While messing with the overcommit logic I noticed that sometimes we'd ENOSPC out when really we should have run out of space much earlier. It turns out it's because we'll only reserve up to the free amount left in the space info for the global reserve, but that doesn't make sense with overcommit because we could be well above our actual size. This results in the global reserve not carving out it's entire reservation, and thus not putting enough pressure on the rest of the infrastructure to do the right thing and ENOSPC out at a convenient time. Fix this by always taking our full reservation amount for the global reserve. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: change the minimum global reserve sizeJosef Bacik2019-09-091-1/+21
| | | | | | | | | | | | | | | It made sense to have the global reserve set at 16M in the past, but since it is used less nowadays set the minimum size to the number of items we'll need to update the main trees we update during a transaction commit, plus some slop area so we can do unlinks if we need to. In practice this doesn't affect normal file systems, but for xfstests where we do things like fill up a fs and then rm * it can fall over in weird ways. This enables us for more sane behavior at extremely small file system sizes. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename btrfs_space_info_add_old_bytesJosef Bacik2019-09-091-2/+3
| | | | | | | | | | | This name doesn't really fit with how the space reservation stuff works now, rename it to btrfs_space_info_free_bytes_may_use so it's clear what the function is doing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: roll tracepoint into btrfs_space_info_update helperJosef Bacik2019-09-091-5/+0
| | | | | | | | | We duplicate this tracepoint everywhere we call these helpers, so update the helper to have the tracepoint as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move math functions to misc.hDavid Sterba2019-09-091-1/+1
| | | | | Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: migrate the global_block_rsv helpers to block-rsv.cJosef Bacik2019-07-021-0/+172
| | | | | | | These helpers belong in block-rsv.c Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: migrate the block-rsv code to block-rsv.cJosef Bacik2019-07-021-0/+253
This moves everything out of extent-tree.c to block-rsv.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>