summaryrefslogtreecommitdiffstats
path: root/fs/bcachefs/btree_update_interior.c
Commit message (Collapse)AuthorAgeFilesLines
* bcachefs: bch_member.btree_allocated_bitmapKent Overstreet2024-04-141-0/+24
| | | | | | | | | | | | | | | | | | | | | | | This adds a small (64 bit) per-device bitmap that tracks ranges that have btree nodes, for accelerating btree node scan if it is ever needed. - New helpers, bch2_dev_btree_bitmap_marked() and bch2_dev_bitmap_mark(), for checking and updating the bitmap - Interior btree update path updates the bitmaps when required - The check_allocations pass has a new fsck_err check, btree_bitmap_not_marked - New on disk format version, mi_btree_mitmap, which indicates the new bitmap is present - Upgrade table lists the required recovery pass and expected fsck error - Btree node scan uses the bitmap to skip ranges if we're on the new version Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Disable merges from interior update pathKent Overstreet2024-04-131-0/+10
| | | | | | | | | | | | | | | | | | There's been a bug in the btree write buffer where it wasn't triggering btree node merges - and leaving behind a bunch of nearly empty btree nodes. Then during journal replay, when updates to the backpointers btree aren't using the btree write buffer (because we require synchronization with journal replay), we end up doing those merges all at once. Then if it's the interior update path running them, we deadlock because those run with the highest watermark. There's no real need for the interior update path to be doing btree node merges; other code paths can handle that at lower watermarks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Run merges at BCH_WATERMARK_btreeKent Overstreet2024-04-131-0/+6
| | | | | | | | This fixes a deadlock where the interior update path during journal replay ends up doing a ton of merges on the backpointers btree, and deadlocking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Don't use bch2_btree_node_lock_write_nofail() in btree split pathKent Overstreet2024-04-111-15/+26
| | | | | | | | | | It turns out - btree splits happen with the rest of the transaction still locked, to avoid unnecessary restarts, which means using nofail doesn't work here - we can deadlock. Fortunately, we now have the ability to return errors here. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix a race in btree_update_nodes_written()Kent Overstreet2024-04-101-3/+7
| | | | | | | | | | One btree update might have terminated in a node update, and then while it is in flight another btree update might free that original node. This race has to be handled in btree_update_nodes_written() - we were missing a READ_ONCE(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: JOURNAL_SPACE_LOWKent Overstreet2024-04-061-11/+7
| | | | | | | | | | | | | | | | | | "bcachefs; Fix deadlock in bch2_btree_update_start()" was a significant performance regression (nearly 50%) on multithreaded random writes with fio. The reason is that the journal watermark checks multiple things, including the state of the btree write buffer, and on multithreaded update heavy workloads we're bottleneked on write buffer flushing - we don't want kicknig off btree updates to depend on the state of the write buffer. This isn't strictly correct; the interior btree update path does do write buffer updates, but it's a tiny fraction of total accounting updates and we're more concerned with space in the journal itself. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Further improve btree_update_to_text()Kent Overstreet2024-04-041-54/+46
| | | | | | Print start and end level of the btree update; also a bit of cleanup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_root_alloc() -> bch2_btree_root_alloc_fake()Kent Overstreet2024-04-031-4/+4
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve bch2_btree_update_to_text()Kent Overstreet2024-04-021-13/+28
| | | | | | | Print out the mode as a string, and also print out the btree and watermark. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: BCH_WATERMARK_interior_updatesKent Overstreet2024-04-011-3/+3
| | | | | | | | | | | | This adds a new watermark, higher priority than BCH_WATERMARK_reclaim, for interior btree updates. We've seen a deadlock where journal replay triggers a ton of btree node merges, and these use up all available open buckets and then interior updates get stuck. One cause of this is that we're currently lacking btree node merging on write buffer btrees - that needs to be fixed as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix btree node reserveKent Overstreet2024-04-011-1/+1
| | | | | | Sign error when checking the watermark - oops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Split out recovery_passes.cKent Overstreet2024-03-311-1/+1
| | | | | | | | | | | We've grown a fair amount of code for managing recovery passes; tracking which ones we're running, which ones need to be run, and flagging in the superblock which ones need to be run on the next recovery. So it's worth splitting out into its own file, this code is pretty different from the code in recovery.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix bch2_btree_increase_depth()Kent Overstreet2024-03-311-2/+5
| | | | | | | When we haven't yet allocated any btree nodes for a given btree, we first need to call the regular split path to allocate one. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix btree node keys accounting in topology repair pathKent Overstreet2024-03-311-0/+1
| | | | | | | When dropping keys now outside a now because we're changing the node min/max, we need to redo the node's accounting as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improved topology repair checksKent Overstreet2024-03-311-39/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | Consolidate bch2_gc_check_topology() and btree_node_interior_verify(), and replace them with an improved version, bch2_btree_node_check_topology(). This checks that children of an interior node correctly span the full range of the parent node with no overlaps. Also, ensure that topology repairs at runtime are always a fatal error; in particular, this adds a check in btree_iter_down() - if we don't find a key while walking down the btree that's indicative of a topology error and should be flagged as such, not a null ptr deref. Some checks in btree_update_interior.c remaining BUG_ONS(), because we already checked the node for topology errors when starting the update, and the assertions indicate that we _just_ corrupted the btree node - i.e. the problem can't be that existing on disk corruption, they indicate an actual algorithmic bug. In the future, we'll be annotating the fsck errors list with which recovery pass corrects them; the open coded "run explicit recovery pass or fatal error" in bch2_btree_node_check_topology() will in the future be done for every fsck_err() call. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Be careful about btree node splits during journal replayKent Overstreet2024-03-311-1/+8
| | | | | | Don't pick a pivot that's going to be deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs; Fix deadlock in bch2_btree_update_start()Kent Overstreet2024-03-181-4/+9
| | | | | | | | | | | | BCH_TRANS_COMMIT_journal_reclaim with watermark != BCH_WATERMARK_reclaim means nonblocking, and we need the journal_res_get() in btree_update_start() to respect that. In a future refactoring we'll be deleting BCH_TRANS_COMMIT_journal_reclaim and replacing it with an explicit BCH_TRANS_COMMIT_nonblocking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: ratelimit errors from async_btree_node_rewriteKent Overstreet2024-03-181-1/+1
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve bch2_fatal_error()Kent Overstreet2024-03-181-2/+3
| | | | | | error messages should always include __func__ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve sysfs internal/btree_updatesKent Overstreet2024-03-171-6/+7
| | | | | | Print out the function that launched the btree update. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Split out btree_node_rewrite_workerKent Overstreet2024-03-171-2/+9
| | | | | | | | This fixes a deadlock due to using btree_interior_update_worker for non interior updates - async btree node rewrites were blocking, and then blocking other interior updates. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill unused flags argument to btree_split()Kent Overstreet2024-03-131-11/+8
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill more -EIO error codesKent Overstreet2024-03-131-2/+1
| | | | | | | | This converts -EIOs related to btree node errors to private error codes, which will help with some ongoing debugging by giving us better error messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bump max_active on btree_interior_update_workerKent Overstreet2024-03-101-1/+1
| | | | | | | | WQ_UNBOUND with max_active 1 means ordered workqueue, but we don't actually need or want ordered semantics - and probably want a higher concurrency limit anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix journal replay with unreadable btree rootsKent Overstreet2024-03-101-5/+54
| | | | | | | | | | | | When a btree root is unreadable, we still might be able to get some data back by replaying what's in the journal. Previously though, we got confused when journal replay would attempt to replay a key for a level that didn't exist. This adds bch2_btree_increase_depth(), so that journal replay can handle this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Clamp replicas_required to replicasKent Overstreet2024-02-131-1/+2
| | | | | | | This prevents going emergency read only when the user has specified replicas_required > replicas. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Prep work for variable size btree node buffersKent Overstreet2024-01-211-4/+4
| | | | | | | | | | | | | | | | | bcachefs btree nodes are big - typically 256k - and btree roots are pinned in memory. As we're now up to 18 btrees, we now have significant memory overhead in mostly empty btree roots. And in the future we're going to start enforcing that certain btree node boundaries exist, to solve lock contention issues - analagous to XFS's AGIs. Thus, we need to start allocating smaller btree node buffers when we can. This patch changes code that refers to the filesystem constant c->opts.btree_node_size to refer to the btree node buffer size - btree_buf_bytes() - where appropriate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Combine .trans_trigger, .atomic_triggerKent Overstreet2024-01-051-9/+10
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: trans_mark now takes bkey_sKent Overstreet2024-01-051-2/+2
| | | | | | | | Prep work for disk space accounting rewrite: we're going to want to use a single callback for both of our current triggers, so we need to change them to have the same type signature first. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix interior update path btree_path usesKent Overstreet2024-01-011-31/+40
| | | | | | | | | | | Since the btree_paths array is now about to become growable, we have to be careful not to refer to paths by pointer across contexts where they may be reallocated. This fixes the remaining btree_interior_update() paths - split and merge. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: get_unlocked_mut_path() -> btree_path_idx_tKent Overstreet2024-01-011-43/+42
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: trans_for_each_path_with_node() no longer uses path->idxKent Overstreet2024-01-011-1/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: trans_for_each_path() no longer uses path->idxKent Overstreet2024-01-011-4/+4
| | | | | | | | | | path->idx is now a code smell: we should be using path_idx_t, since it's stable across btree path reallocation. This is also a bit faster, using the same loop counter vs. fetching path->idx from each path we iterate over. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: btree_iter -> btree_path_idx_tKent Overstreet2024-01-011-17/+18
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_path_traverse() -> btree_path_idx_tKent Overstreet2024-01-011-4/+6
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_path_make_mut() -> btree_path_idx_tKent Overstreet2024-01-011-5/+5
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs; bch2_path_put() -> btree_path_idx_tKent Overstreet2024-01-011-6/+6
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_path_get() -> btree_path_idx_tKent Overstreet2024-01-011-2/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: for_each_keylist_key() declares loop iterKent Overstreet2024-01-011-2/+0
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Make sure allocation failure errors are loggedKent Overstreet2024-01-011-0/+3
| | | | | | | | | | | The previous patch fixed a bug in allocation path error handling, and it would've been noticed sooner had it been logged properly. Generally speaking, errors that shouldn't happen in normal operation and are being returned up the stack should be logged: the write path was already logging IO errors, but non IO errors were missed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch_err_(fn|msg) check if should printKent Overstreet2024-01-011-3/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_trans_node_add no longer uses trans_for_each_path()Kent Overstreet2024-01-011-5/+5
| | | | | | | | In the future we'll be making trans->paths resizable and potentially having _many_ more paths (for fsck); we need to start fixing algorithms that walk each path in a transaction where possible. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve trans->extra_journal_entriesKent Overstreet2024-01-011-9/+6
| | | | | | | | | | | | Instead of using a darray, we now allocate journal entries for the transaction commit path with our normal bump allocator - with an inlined fastpath, and using btree_transaction_stats to remember how much to initially allocate so as to avoid transaction restarts. This is prep work for converting write buffer updates to use this mechanism. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Include btree_trans in more tracepointsKent Overstreet2024-01-011-17/+18
| | | | | | | This gives us more context information - e.g. which codepath is invoking btree node reads. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: convert bch_fs_flags to x-macroKent Overstreet2024-01-011-2/+2
| | | | | | | Now we can print out filesystem flags in sysfs, useful for debugging various "what's my filesystem doing" issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Rename BTREE_INSERT flagsKent Overstreet2024-01-011-8/+8
| | | | | | | BTREE_INSERT flags are actually transaction commit flags - rename them for clarity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill dead BTREE_INSERT flagsKent Overstreet2024-01-011-10/+4
| | | | | | | BTREE_INSERT_NOWAIT and BTREE_INSERT_GC_LOCK_HELD are no longer used, and can be deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Journal pins must always have a flush_fnKent Overstreet2024-01-011-3/+18
| | | | | | | flush_fn is how we identify journal pins in debugfs - this is a debugging aid. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs; guard against overflow in btree node splitKent Overstreet2023-12-191-0/+12
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: btree_node_u64s_with_format() takes nr keysKent Overstreet2023-12-191-13/+14
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>