path: root/fs/bcachefs/btree_update_interior.h
Commit message | Author | Age | Files | Lines
* bcachefs: btree_node_u64s_with_format() takes nr keys | Kent Overstreet | 2023-12-19 | 1 | -4/+0
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill journal pre-reservations | Kent Overstreet | 2023-11-14 | 1 | -1/+0
    This deletes the complicated and somewhat expensive journal pre-reservation machinery in favor of just using journal watermarks: when the journal is more than half full, we run journal reclaim more aggressively, and when the journal is more than 3/4s full we only allow journal reclaim to get new journal reservations.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
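The watermark rule in the message above can be sketched as follows. This is a minimal illustration, not the bcachefs implementation: the enum, the function names, and the idea of passing `used`/`total` counts directly are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the bcachefs definitions. */
enum journal_watermark_sketch {
	WATERMARK_NORMAL,	/* at most half full: any caller may reserve */
	WATERMARK_HIGH,		/* more than half full: reclaim runs harder */
	WATERMARK_RECLAIM,	/* more than 3/4 full: only reclaim may reserve */
};

/* Classify journal fullness. Assumes used * 4 does not overflow size_t. */
enum journal_watermark_sketch journal_watermark(size_t used, size_t total)
{
	/* Multiply rather than divide so small journals are not rounded. */
	if (used * 4 > total * 3)
		return WATERMARK_RECLAIM;
	if (used * 2 > total)
		return WATERMARK_HIGH;
	return WATERMARK_NORMAL;
}

/* Above the 3/4 mark, only journal reclaim gets new reservations. */
bool journal_may_reserve(size_t used, size_t total, bool is_reclaim)
{
	return is_reclaim || journal_watermark(used, total) != WATERMARK_RECLAIM;
}

/* Above the half mark, reclaim is asked to run more aggressively. */
bool journal_reclaim_aggressive(size_t used, size_t total)
{
	return journal_watermark(used, total) != WATERMARK_NORMAL;
}
```

With a 100-unit journal in this sketch, a non-reclaim caller is still admitted at 75 units used and refused at 76.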
* bcachefs: Don't iterate over journal entries just for btree roots | Kent Overstreet | 2023-11-05 | 1 | -1/+1
    Small performance optimization, and a bit of a code cleanup too.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bkey_copy() is no longer a macro | Kent Overstreet | 2023-11-05 | 1 | -1/+1
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix build errors with gcc 10 | Kent Overstreet | 2023-11-04 | 1 | -1/+1
    gcc 10 seems to complain about array bounds in situations where gcc 11 does not - curious. This unfortunately requires adding some casts for now; we may investigate getting rid of our __u64 _data[] VLA in a future patch so that our start[0] members can be VLAs.
    Reported-by: John Stoffel <john@stoffel.org>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Move some declarations to the correct header | Kent Overstreet | 2023-10-22 | 1 | -0/+9
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix a null ptr deref in bch2_fs_alloc() error path | Kent Overstreet | 2023-10-22 | 1 | -0/+1
    This fixes a null ptr deref in bch2_free_pending_node_rewrites() when the list head wasn't initialized.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_journal_entries_postprocess() | Kent Overstreet | 2023-10-22 | 1 | -1/+1
    This brings back journal_entries_compact(), but in a more efficient form - we need to do multiple postprocess steps, so iterate over the journal entries being written just once to make it more efficient.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Handle btree node rewrites before going RW | Kent Overstreet | 2023-10-22 | 1 | -0/+3
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improved btree write statistics | Kent Overstreet | 2023-10-22 | 1 | -0/+1
    This replaces sysfs btree_avg_write_size with btree_write_stats, which now breaks out statistics by the source of the btree write.
    Btree writes that are too small are a source of inefficiency, and excessive btree resort overhead - this will let us see what's causing them.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Btree splits now only take the locks they need | Kent Overstreet | 2023-10-22 | 1 | -0/+1
    Previously, bch2_btree_update_start() would always take all intent locks, all the way up to the root. We've finally got data from users where this became a scalability issue - so, this patch fixes bch2_btree_update_start() to only take the locks we need.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: New locking functions | Kent Overstreet | 2023-10-22 | 1 | -0/+1
    In the future, with the new deadlock cycle detector, we won't be using bare six_lock_* anymore: lock wait entries will all be embedded in btree_trans, and we will need a btree_trans context whenever locking a btree node.
    This patch plumbs a btree_trans to the few places that need it, and adds two new locking functions
     - btree_node_lock_nopath, which may fail returning a transaction restart, and
     - btree_node_lock_nopath_nofail, to be used in places where we know we cannot deadlock (i.e. because we're holding no other locks).
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
* bcachefs: Shutdown path improvements | Kent Overstreet | 2023-10-22 | 1 | -1/+1
    We're seeing occasional firings of the assertion in the key cache shutdown code that nr_dirty == 0, which means we must sometimes be doing transaction commits after we've gone read only.
    Cleanups & changes:
     - BCH_FS_ALLOC_CLEAN renamed to BCH_FS_CLEAN_SHUTDOWN
     - new helper bch2_btree_interior_updates_flush(), which returns true if it had to wait
     - bch2_btree_flush_writes() now also returns true if there were btree writes in flight
     - __bch2_fs_read_only now checks if btree writes were in flight in the shutdown loop: btree write completion does a transaction update, to update the pointer in the parent node
     - assert that !BCH_FS_CLEAN_SHUTDOWN in __bch2_trans_commit
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
* bcachefs: Fix usage of six lock's percpu mode | Kent Overstreet | 2023-10-22 | 1 | -2/+4
    Six locks have a percpu mode, which we use for interior btree nodes, as well as btree key cache keys for the subvolumes btree. We've been switching locks back and forth between percpu and non percpu mode as needed, but it turns out this is racy - when we're reusing an existing node, other threads could be attempting to lock it while we're switching it between modes.
    This patch fixes this by never switching 'struct btree' between the two modes, and instead segregating them between two different freed lists.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix keylist size in btree_update | Kent Overstreet | 2023-10-22 | 1 | -2/+2
    This fixes a buffer overrun, fortunately caught by a BUG_ON().
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
* bcachefs: Option improvements | Kent Overstreet | 2023-10-22 | 1 | -1/+1
    This adds flags for options that must be a power of two (block size and btree node size), and options that are stored in the superblock as a power of two (encoded extent max).
    Also: options are now stored in memory in the same units they're displayed in (bytes): we now convert when getting and setting from the superblock.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
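The two ideas in the message above (validating power-of-two options, and converting between an in-memory byte count and an on-disk power-of-two exponent) can be sketched as below. The function names are hypothetical and the superblock field layout is not modelled; this only illustrates the arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v is a nonzero power of two (exactly one bit set). */
bool opt_is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical "set" conversion: in-memory value in bytes to the
 * exponent that would be stored on disk. Assumes opt_is_pow2(bytes).
 */
unsigned opt_bytes_to_log2(uint64_t bytes)
{
	unsigned shift = 0;

	while (bytes > 1) {
		bytes >>= 1;
		shift++;
	}
	return shift;
}

/* Hypothetical "get" conversion: stored exponent (< 64) back to bytes. */
uint64_t opt_log2_to_bytes(unsigned shift)
{
	return UINT64_C(1) << shift;
}
```

Under these assumptions a 64 KiB value would be stored as the exponent 16 and read back as 65536 bytes.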
* bcachefs: Add more time_stats | Kent Overstreet | 2023-10-22 | 1 | -0/+1
    This adds more latency/event measurements and breaks some apart into more events. Journal writes are broken apart into flush writes and noflush writes, btree compactions are broken out from btree splits, btree mergers are added, as well as btree_interior_updates - foreground and total.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
* bcachefs: Kill retry loop in btree merge path | Kent Overstreet | 2023-10-22 | 1 | -5/+1
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
* bcachefs: btree_path | Kent Overstreet | 2023-10-22 | 1 | -10/+10
    This splits btree_iter into two components: btree_iter is now the externally visible component, and it points to a btree_path which is now reference counted.
    This means we no longer have to clone iterators up front if they might be mutated - btree_path can be shared by multiple iterators, and cloned if an iterator would mutate a shared btree_path. This will help us use iterators more efficiently, as well as slimming down the main long lived state in btree_trans, and significantly cleans up the logic for iterator lifetimes.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
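The sharing scheme described above (a reference-counted path shared by several iterators, cloned only when one of them would mutate it) can be sketched as follows. The structs and functions are simplified stand-ins invented for this example; the real btree_path carries far more state and is managed inside btree_trans.

```c
#include <stdlib.h>

/* Hypothetical, simplified types; not the real bcachefs structs. */
struct path_sketch {
	unsigned ref;	/* number of iterators pointing at this path */
	unsigned pos;	/* stand-in for the path's btree position */
};

struct iter_sketch {
	struct path_sketch *path;
};

struct path_sketch *path_new(unsigned pos)
{
	struct path_sketch *p = malloc(sizeof(*p));

	if (p) {
		p->ref = 1;
		p->pos = pos;
	}
	return p;
}

/* Drop one reference; free the path when the last one goes away. */
void path_put(struct path_sketch *p)
{
	if (p && --p->ref == 0)
		free(p);
}

/* Copying an iterator is cheap: both iterators share one path. */
void iter_copy(struct iter_sketch *dst, const struct iter_sketch *src)
{
	dst->path = src->path;
	dst->path->ref++;
}

/*
 * Mutating an iterator clones the path only if it is shared.
 * Returns 0 on success, -1 if the clone could not be allocated.
 */
int iter_set_pos(struct iter_sketch *iter, unsigned pos)
{
	if (iter->path->ref > 1) {
		struct path_sketch *clone = path_new(iter->path->pos);

		if (!clone)
			return -1;
		iter->path->ref--;	/* leave the shared path to the others */
		iter->path = clone;
	}
	iter->path->pos = pos;
	return 0;
}
```

In this sketch, moving a copied iterator leaves the original iterator's position untouched, and no clone is made for an iterator that holds the only reference.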
* bcachefs: Further reduce iter->trans usage | Kent Overstreet | 2023-10-22 | 1 | -1/+1
    This is prep work for splitting btree_path out from btree_iter - btree_path will not have a pointer to btree_trans.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
* bcachefs: Reduce iter->trans usage | Kent Overstreet | 2023-10-22 | 1 | -14/+0
    Disfavoured, and should go away.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
* bcachefs: Regularize argument passing of btree_trans | Kent Overstreet | 2023-10-22 | 1 | -14/+11
    btree_trans should always be passed when we have one - iter->trans is disfavoured. This mainly updates old code in btree_update_interior.c, some of which predates btree_trans.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
* bcachefs: Fix a deadlock | Kent Overstreet | 2023-10-22 | 1 | -0/+4
    Waiting on a btree node write with btree locks held can deadlock, if the write errors: the write error path has to do a btree update to drop the pointer to the replica that errored.
    The interior update path has to wait on in flight btree writes before freeing nodes on disk. Previously, this was done in bch2_btree_interior_update_will_free_node(), and could deadlock; now, we just stash a pointer to the node and do it in btree_update_nodes_written(), just prior to the transactional part of the update.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve bset compaction | Kent Overstreet | 2023-10-22 | 1 | -1/+3
    The previous patch that fixed btree nodes being written too aggressively now meant that we weren't sorting btree node bsets optimally - this patch fixes that.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_foreground_maybe_merge() now correctly reports lock restarts | Kent Overstreet | 2023-10-22 | 1 | -12/+12
    This means that btree node splits don't have to automatically trigger a transaction restart.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve bch2_btree_update_start() | Kent Overstreet | 2023-10-22 | 1 | -2/+2
    bch2_btree_update_start() is now responsible for taking gc_lock and upgrading the iterator to lock parent nodes - greatly simplifying error handling and all of the callers.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Delete dead code | Kent Overstreet | 2023-10-22 | 1 | -1/+0
    The interior btree node update path has changed, this is no longer needed.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Hack around bch2_varint_decode invalid reads | Kent Overstreet | 2023-10-22 | 1 | -0/+3
    bch2_varint_decode can do reads up to 7 bytes past the end ptr, for the sake of performance - these extra bytes are always masked off.
    This won't be a problem in practice if we make sure to burn 8 bytes in any buffer that has bkeys in it.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
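The over-read-and-mask pattern described above can be illustrated with a toy encoding. The format below (one length byte followed by that many little-endian value bytes) is an assumption made for this example and is not the bcachefs varint format; `DECODE_PAD`, `toy_encode`, and `toy_decode` are hypothetical names. The point is only that a decoder which always loads 8 bytes is safe as long as every buffer reserves 8 spare bytes at the end.

```c
#include <stdint.h>
#include <string.h>

/* Spare bytes every buffer must reserve so the fixed 8-byte load is safe. */
#define DECODE_PAD 8

/* Toy format: out[0] = n (1..8), then n little-endian value bytes. */
size_t toy_encode(uint8_t *out, uint64_t v)
{
	size_t n = 1;

	while (n < 8 && (v >> (8 * n)))
		n++;
	out[0] = (uint8_t)n;
	for (size_t i = 0; i < n; i++)
		out[1 + i] = (uint8_t)(v >> (8 * i));
	return 1 + n;
}

/* Decode one value; returns the number of bytes consumed (1 + n). */
size_t toy_decode(const uint8_t *in, uint64_t *v)
{
	size_t n = in[0];
	uint8_t raw[8];
	uint64_t x = 0;

	/* Always load 8 bytes: up to 7 of them lie past the encoded value. */
	memcpy(raw, in + 1, sizeof(raw));
	for (size_t i = 0; i < 8; i++)
		x |= (uint64_t)raw[i] << (8 * i);

	/* Mask off whatever the over-read picked up. */
	if (n < 8)
		x &= (UINT64_C(1) << (8 * n)) - 1;

	*v = x;
	return 1 + n;
}
```

A caller holding a single two-byte encoded value would therefore size its buffer as `2 + DECODE_PAD` bytes; the contents of the padding do not affect the decoded result.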
* bcachefs: Convert various code to printbuf | Kent Overstreet | 2023-10-22 | 1 | -1/+1
    printbufs know how big the buffer is that was allocated, so we can get rid of the random PAGE_SIZEs all over the place.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
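The benefit named above (the buffer carries its own size, so callers no longer pass a length such as PAGE_SIZE around) can be sketched with a small size-aware output buffer. `struct pbuf_sketch` and `pbuf_printf` are hypothetical names for this illustration and do not reproduce the actual printbuf API.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical size-aware buffer: storage, its capacity, and a cursor. */
struct pbuf_sketch {
	char	*buf;
	size_t	size;	/* total bytes available in buf */
	size_t	pos;	/* bytes written so far, excluding the NUL */
};

/* Append formatted text, truncating silently when the buffer is full. */
void pbuf_printf(struct pbuf_sketch *out, const char *fmt, ...)
{
	va_list args;
	int n;

	if (out->pos >= out->size)
		return;

	va_start(args, fmt);
	n = vsnprintf(out->buf + out->pos, out->size - out->pos, fmt, args);
	va_end(args);

	if (n < 0)
		return;

	out->pos += (size_t)n;
	if (out->pos >= out->size)
		out->pos = out->size - 1;	/* output was truncated */
}
```

Every caller writes through the same struct, so the capacity check lives in one place instead of being repeated at each call site.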
* bcachefs: Fix bch2_btree_node_insert_fits() | Kent Overstreet | 2023-10-22 | 1 | -1/+1
    It should be checking for the recently added flag btree_node_needs_rewrite.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: More open buckets | Kent Overstreet | 2023-10-22 | 1 | -2/+2
    We need a larger open bucket reserve now that the btree interior update path holds onto open bucket references; filesystems with many high-throughput devices may need more open buckets now.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Interior btree updates are now fully transactional | Kent Overstreet | 2023-10-22 | 1 | -36/+28
    We now update the alloc info (bucket sector counts) atomically with journalling the update to the interior btree nodes, and we also set new btree roots atomically with the journalled part of the btree update.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Factor out bch2_fs_btree_interior_update_init() | Kent Overstreet | 2023-10-22 | 1 | -0/+3
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix a deadlock on starting an interior btree update | Kent Overstreet | 2023-10-22 | 1 | -3/+5
    Not legal to block on a journal prereservation with btree locks held.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix journalling of interior node updates | Kent Overstreet | 2023-10-22 | 1 | -0/+4
    We weren't journalling updates done while splitting/compacting nodes - oops.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Journal updates to interior nodes | Kent Overstreet | 2023-10-22 | 1 | -14/+2
    Previously, the btree has always been self contained and internally consistent on disk without anything from the journal - the journal just contained pointers to the btree roots.
    However, this meant that btree node split or compact operations - i.e. anything that changes btree node topology and involves updates to interior nodes - would require that interior btree node to be written immediately, which means emitting a btree node write that's mostly empty (using 4k of space on disk if the filesystem blocksize is 4k to only write perhaps ~100 bytes of new keys).
    More importantly, this meant most btree node writes had to be FUA, and consumer drives have a history of slow and/or buggy FUA support - other filesystems have been bitten by this.
    This patch changes the interior btree update path to journal updates to interior nodes, after the writes for the new btree nodes have completed. Best of all, it turns out to simplify the interior node update path somewhat.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Move extent overwrite handling out of core btree code | Kent Overstreet | 2023-10-22 | 1 | -9/+14
    Ever since the btree code was first written, handling of overwriting existing extents - including partially overwriting and splitting existing extents - was handled as part of the core btree insert path. The modern transaction and iterator infrastructure didn't exist then, so that was the only way for it to be done.
    This patch moves that outside of the core btree code to a pass that runs at transaction commit time.
    This is a significant simplification to the btree code and overall reduction in code size, but more importantly it gets us much closer to the core btree code being completely independent of extents and is important prep work for snapshots.
    This introduces a new feature bit; the old and new extent update models are incompatible when the filesystem needs journal replay.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Seralize btree_update operations at btree_update_nodes_written() | Kent Overstreet | 2023-10-22 | 1 | -0/+1
    Prep work for journalling updates to interior nodes - enforcing ordering will greatly simplify those changes.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Whiteout changes | Kent Overstreet | 2023-10-22 | 1 | -17/+12
    More prep work for snapshots: extents will soon be using KEY_TYPE_deleted for whiteouts, with 0 size. But we won't be able to keep these whiteouts with the rest of the extents in the btree node, due to sorting invariants breaking.
    We can deal with this by immediately moving the new whiteouts to the unwritten whiteouts area - this just means those whiteouts won't be sorted, so we need new code to sort them prior to merging them with the rest of the keys to be written.
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Inline more of bch2_trans_commit hot path | Kent Overstreet | 2023-10-22 | 1 | -3/+3
    The main optimization here is that if we let bch2_replicas_delta_list_apply() fail, we can completely skip calling bch2_bkey_replicas_marked_locked().
    And assorted other small optimizations.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: btree_bkey_cached_common | Kent Overstreet | 2023-10-22 | 1 | -3/+3
    This is prep work for the btree key cache: btree iterators will point to either struct btree, or a new struct bkey_cached.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Allocation code refactoring | Kent Overstreet | 2023-10-22 | 1 | -1/+0
    bch2_alloc_sectors_start() was a nightmare to work with - it's got some tricky stuff to do, since it wants to use the buckets the writepoint already has, unless they're not in the target it wants to write to, unless it can't allocate from any other devices in which case it will use those buckets if it has to - et cetera.
    This restructures the code to start with a new empty list of open buckets we're going to use for the new allocation, pulling buckets from the write point's list as we decide that we really are going to use them - making the code somewhat more functional and drastically easier to understand.
    Also fixes a bug where we could end up waiting on c->freelist_wait (because allocating from one device failed) but return success from bch2_bucket_alloc(), because allocating from a different device succeeded.
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: kill extent_insert_hook | Kent Overstreet | 2023-10-22 | 1 | -9/+0
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: BTREE_INSERT_JOURNAL_RES_FULL is no longer possible | Kent Overstreet | 2023-10-22 | 1 | -27/+1
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bkey_written() | Kent Overstreet | 2023-10-22 | 1 | -9/+12
    also cleanups of btree node offsets
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Initial commit | Kent Overstreet | 2023-10-22 | 1 | -0/+374
    Initially forked from drivers/md/bcache, bcachefs is a new copy-on-write filesystem with every feature you could possibly want.
    Website: https://bcachefs.org
    Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>