summaryrefslogtreecommitdiffstats
path: root/fs/bcachefs/move.c
Commit message (Collapse)AuthorAgeFilesLines
* bcachefs: Extra kthread_should_stop() calls for copygcKent Overstreet2023-11-281-3/+9
| | | | | | | | | This fixes a bug where going read-only was taking longer than it should have due to copygc forgetting to check kthread_should_stop() Additionally: fix a missing is_kthread check in bch2_move_ratelimit(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: -EROFS doesn't count as move_extent_start_failKent Overstreet2023-11-281-0/+4
| | | | | | | | The automated tests check if we've hit too many slowpath/error path events and fail the test - if we're just shutting down, that naturally shouldn't count. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: trace_move_extent_start_fail() now includes errcodeKent Overstreet2023-11-281-13/+10
| | | | | | | Renamed from trace_move_extent_alloc_mem_fail, because there are other reasons we colud fail (disk space allocation failure). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Data update path won't accidentaly grow replicasKent Overstreet2023-11-251-53/+5
| | | | | | | | | | | | | | | Previously, there was a bug where if an extent had greater durability than required (because we needed to move a durability=1 pointer and ended up putting it on a durability 2 device), we would submit a write for replicas=2 - the durability of the pointer being rewritten - instead of the number of replicas required to bring it back up to the data_replicas option. This, plus the allocation path sometimes allocating on a greater durability device than requested, meant that extents could continue having more and more replicas added as they were being rewritten. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Make sure bch2_move_ratelimit() also waits for move_opsKent Overstreet2023-11-241-13/+4
| | | | | | | | This adds move_ctxt_wait_event_timeout(), which can sleep for a timeout while also issueing pending moves as reads complete. Co-developed-by: Daniel Hill <daniel@gluo.nz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_moving_ctxt_flush_all()Kent Overstreet2023-11-241-5/+11
| | | | | | | | | | Introduce a new helper to flush all move IOs, and use it in a few places where we should have been. The new helper also drops btree locks before waiting on outstanding move writes, avoiding potential deadlocks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Data move path now uses bch2_trans_unlock_long()Kent Overstreet2023-11-041-5/+8
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: move: move_stats refactoringKent Overstreet2023-10-311-45/+53
| | | | | | | | | | | data_progress_list is gone - it was redundant with moving_context_list The upcoming rebalance rewrite is going to have it using two different move_stats objects with the same moving_context, depending on whether it's scanning or using the rebalance_work btree - this patch plumbs stats around a bit differently so that will work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: move: convert to bbposKent Overstreet2023-10-311-11/+8
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: moving_context now owns a btree_transKent Overstreet2023-10-311-51/+42
| | | | | | | btree_trans and moving_context are used together, and having the moving_context owns the transaction object reduces some plumbing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: move.c exports, refactoringKent Overstreet2023-10-311-55/+64
| | | | | | Prep work for the new rebalance code - we need a few helpers exported. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve io option handling in data move pathKent Overstreet2023-10-311-50/+81
| | | | | | | The data move path now correctly picks IO options when inodes in different snapshots have different options applied. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_id_str()Kent Overstreet2023-10-311-1/+1
| | | | | | | Since we can run with unknown btree IDs, we can't directly index btree IDs into fixed size arrays. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: More minor smatch fixesKent Overstreet2023-10-221-1/+1
| | | | | | | - fix a few uninitialized return values - return a proper error code in lookup_lostfound() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Heap allocate btree_transKent Overstreet2023-10-221-21/+18
| | | | | | | | | | We're using more stack than we'd like in a number of functions, and btree_trans is the biggest object that we stack allocate. But we have to do a heap allocatation to initialize it anyways, so there's no real downside to heap allocating the entire thing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix W=12 build errorsKent Overstreet2023-10-221-1/+0
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Break up io.cKent Overstreet2023-10-221-1/+2
| | | | | | | | | More reorganization, this splits up io.c into - io_read.c - io_misc.c - fallocate, fpunch, truncate - io_write.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve bch2_moving_ctxt_to_text()Kent Overstreet2023-10-221-25/+19
| | | | | | | | Print more information out about moving contexts - fold in the output of the redundant bch2_data_jobs_to_text(), and also include information relevant to whether move_data() should be blocked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Allow for unknown btree IDsKent Overstreet2023-10-221-2/+8
| | | | | | | | | | | | | | | | | We need to allow filesystems with metadata from newer versions to be mountable and usable by older versions. This patch enables us to roll out new btrees without a new major version number; we can now handle btree roots for unknown btree types. The unknown btree roots will be retained, and fsck (including backpointers) will check them, the same as other btree types. We add a dynamic array for the extra, unknown btree roots, in addition to the fixed size btree root array, and add new helpers for looking up btree roots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: New error message helpersKent Overstreet2023-10-221-3/+5
| | | | | | | | | | | | | Add two new helpers for printing error messages with __func__ and bch2_err_str(): - bch_err_fn - bch_err_msg Also kill the old error strings in the recovery path, which were causing us to incorrectly report memory allocation failures - they're not needed anymore. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Convert -ENOENT to private error codesKent Overstreet2023-10-221-1/+1
| | | | | | | As with previous conversions, replace -ENOENT uses with more informative private error codes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Delete an incorrect bch2_trans_unlock()Kent Overstreet2023-10-221-1/+0
| | | | | | | | | | | | These deletes a bch2_trans_unlock() call from __bch2_move_data(). It was redundant; bch2_move_extent() has the correct unlock call, and it was buggy because when move_extent calls bch2_extent_drop_ptrs() we don't want the transaction to be unlocked yet - this fixes a btree_iter.c assertion. Fixes https://github.com/koverstreet/bcachefs/issues/511. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_bkey_make_mut() now calls bch2_trans_update()Kent Overstreet2023-10-221-1/+1
| | | | | | | | | | | It's safe to call bch2_trans_update with a k/v pair where the value hasn't been filled out, as long as the key part has been and the value is filled out by transaction commit time. This patch folds the bch2_trans_update() call into bch2_bkey_make_mut(), eliminating a bit of boilerplate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill bch2_verify_bucket_evacuated()Kent Overstreet2023-10-221-79/+0
| | | | | | | | | | | | With backpointers, it's now impossible for bch2_evacuate_bucket() to be completely reliable: it can race with an extent being partially overwritten or split, which needs a new write buffer flush for the backpointer to be seen. This shouldn't be a real issue in practice; the previous patch added a new tracepoint so we'll be able to see more easily if it is. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve move path tracepointsKent Overstreet2023-10-221-3/+40
| | | | | | | Move path tracepoints now include the key being moved. Also, add new tracepoints for the start of move_extent, and evacuate_bucket. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Rip out code for storing backpointers in alloc keysKent Overstreet2023-10-221-14/+12
| | | | | | | | | | We don't store backpointers in alloc keys anymore, since we gained the btree write buffer. This patch drops support for backpointers in alloc keys, and revs the on disk format version so that we know a fsck is required. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Use BTREE_ITER_INTENT in ec_stripe_update_extent()Kent Overstreet2023-10-221-2/+2
| | | | | | | | This adds a flags param to bch2_backpointer_get_key() so that we can pass BTREE_ITER_INTENT, since ec_stripe_update_extent() is updating the extent immediately. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix bch2_verify_bucket_evacuated()Kent Overstreet2023-10-221-0/+5
| | | | | | | | | | We were going into an infinite loop when printing out backpointers, due to never incrementing bp_offset - whoops. Also limit the number of backpointers we print to 10; this is debug code and we only need to print a sample, not all of them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: verify_bucket_evacuated() -> set_btree_iter_dontneed()Kent Overstreet2023-10-221-0/+3
| | | | | | This should help with excessive 'would deadlock' transaction restarts. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix an unhandled transaction restart errorKent Overstreet2023-10-221-0/+5
| | | | | | | | This is a bit awkward: we're passing around a btree_trans, but we're not in a context where transaction restarts are handled - we should try to come up with a better way to denote situations like this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: New erasure coding shutdown pathKent Overstreet2023-10-221-6/+0
| | | | | | | | | | | | | | | | | | This implements a new shutdown path for erasure coding, which is needed for the upcoming BCH_WRITE_WAIT_FOR_EC write path. The process is: - Cancel new stripes being built up - Close out/cancel open buckets on write points or the partial list that are for stripes - Shutdown rebalance/copygc - Then wait for in flight new stripes to finish With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill up before they complete; the new ec shutdown path is needed for shutting down copygc/rebalance without deadlocking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_fs_moving_ctxts_to_text()Kent Overstreet2023-10-221-7/+96
| | | | | | | | This also adds bch2_write_op_to_text(): now we can see outstand moves, useful for debugging shutdown with the upcoming BCH_WRITE_WAIT_FOR_EC and likely for other things in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: evacuate_bucket() no longer moves cached ptrsKent Overstreet2023-10-221-1/+6
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: evacuate_bucket() no longer calls verify_bucket_evacuated()Kent Overstreet2023-10-221-8/+0
| | | | | | | The copygc code itself now calls this when all moves from a given bucket are complete. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improved copygc pipeliningKent Overstreet2023-10-221-15/+30
| | | | | | | | | | | | | | | | | | | | This improves copygc pipelining across multiple buckets: we now track each in flight bucket we're evacuating, with separate moving_contexts. This means that whereas previously we had to wait for outstanding moves to complete to ensure we didn't try to evacuate the same bucket twice, we can now just check buckets we want to evacuate against the pending list. This also mean we can run the verify_bucket_evacuated() check without killing pipelining - meaning it can now always be enabled, not just on debug builds. This is going to be important for the upcoming erasure coding work, where moving IOs that are being erasure coded will now skip the initial replication step; instead the IOs will wait on the stripe to complete. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: moving_context->stats is allowed to be NULLKent Overstreet2023-10-221-14/+23
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Don't call bch2_trans_update() unlockedKent Overstreet2023-10-221-1/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fragmentation LRUKent Overstreet2023-10-221-24/+27
| | | | | | | | | | | | | | | | | | | | | | | Now that we have much more efficient updates to the LRU btree, this patch adds a new LRU that indexes buckets by fragmentation. This means copygc no longer has to scan every bucket to find buckets that need to be evacuated. Changes: - A new field in bch_alloc_v4, fragmentation_lru - this corresponds to the bucket's position in the fragmentation LRU. We add a new field for this instead of calculating it as needed because we may make the fragmentation LRU optional; this field indicates whether a bucket is on the fragmentation LRU. Also, zoned devices will introduce variable bucket sizes; explicitly recording the LRU position will be safer for them. - A new copygc path for using the fragmentation LRU instead of scanning every bucket and building up an in-memory heap. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix verify_bucket_evacuated()Kent Overstreet2023-10-221-6/+5
| | | | | | This fixes an incorrectly handled transaction restart. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Add max nr of IOs in flight to the move pathKent Overstreet2023-10-221-6/+16
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix deadlock on nocow locks in data move pathKent Overstreet2023-10-221-15/+6
| | | | | | | | | | | | | | | | | | | | | The recent nocow locking rework introduced a deadlock in the data move path: the new nocow locking scheme uses a hash table with a fixed size array for chaining, meaning on hash collision we may have to wait for other locks to be released before we can lock a bucket. And since the data move path needs to submit writes from the same thread that's taking nocow locks and submitting reads, this introduces a deadlock. This shouldn't happen often in practice, but since the data move path can keep large numbers of IOs in flight simultaneously, it's something we have to handle. This patch makes move_ctxt_wait_event() available to bch2_data_update_init() and uses it when appropriate, which is our normal solution to this kind of thing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Nocow supportKent Overstreet2023-10-221-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds support for nocow mode, where we do writes in-place when possible. Patch components: - New boolean filesystem and inode option, nocow: note that when nocow is enabled, data checksumming and compression are implicitly disabled - To prevent in-place writes from racing with data moves (data_update.c) or bucket reuse (i.e. a bucket being reused and re-allocated while a nocow write is in flight, we have a new locking mechanism. Buckets can be locked for either data update or data move, using a fixed size hash table of two_state_shared locks. We don't have any chaining, meaning updates and moves to different buckets that hash to the same lock will wait unnecessarily - we'll want to watch for this becoming an issue. - The allocator path also needs to check for in-place writes in flight to a given bucket before giving it out: thus we add another counter to bucket_alloc_state so we can track this. - Fsync now may need to issue cache flushes to block devices instead of flushing the journal. We add a device bitmask to bch_inode_info, ei_devs_need_flush, which tracks devices that need to have flushes issued - note that this will lead to unnecessary flushes when other codepaths have already issued flushes, we may want to replace this with a sequence number. - New nocow write path: look up extents, and if they're writable write to them - otherwise fall back to the normal COW write path. XXX: switch to sequence numbers instead of bitmask for devs needing journal flush XXX: ei_quota_lock being a mutex means bch2_nocow_write_done() needs to run in process context - see if we can improve this Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Data update support for unwritten extentsKent Overstreet2023-10-221-1/+10
| | | | | | | | | | | | The data update path requires special support for unwritten extents - we still need to be able to move them, but there's no need to read or write anything. This patch adds a new error code to tell bch2_move_extent() that we're short circuiting the read, and adds bch2_update_unwritten_extent() to create a reservation then call __bch2_data_update_index_update(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Don't use key cache during fsckKent Overstreet2023-10-221-2/+4
| | | | | | | | | | The btree key cache mainly helps with lock contention, at the cost of additional memory overhead. During some fsck passes the memory overhead really matters, but fsck is single threaded so lock contention is an issue - so skipping the key cache during fsck will help with performance. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Copygc now uses backpointersKent Overstreet2023-10-221-17/+274
| | | | | | | | | | | | | | Previously, copygc needed to walk the entire extents & reflink btrees to find extents that needed to be moved. Now that we have backpointers, this patch implements bch2_evacuate_bucket() in the move code, which copygc now uses for evacuating mostly empty buckets. Also, thanks to the new backpointers code, copygc can now move btree nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Debug mode for c->writes referencesKent Overstreet2023-10-221-3/+3
| | | | | | | | | | This adds a debug mode where we split up the c->writes refcount into distinct refcounts for every codepath that takes a reference, and adds sysfs code to print the value of each ref. This will make it easier to debug shutdown hangs due to refcount leaks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_inode_opts_get()Kent Overstreet2023-10-221-1/+1
| | | | | | | | | | | | | This improves io_opts() and makes it a non-inline function - it's big enough that it probably shouldn't be. Also, bch_io_opts no longer needs fields for whether options are defined, so we can slim it down a bit. We'd like to stop passing around the full bch_io_opts, but that'll be tricky because of bch2_rebalance_add_key(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Convert EROFS errors to private error codesKent Overstreet2023-10-221-1/+1
| | | | | | More error code improvements - this gets us more useful error messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: New btree helpersKent Overstreet2023-10-221-3/+1
| | | | | | | | | | | | This introduces some new conveniences, to help cut down on boilerplate: - bch2_trans_kmalloc_nomemzero() - performance optimiation - bch2_bkey_make_mut() - bch2_bkey_get_mut() - bch2_bkey_get_mut_typed() - bch2_bkey_alloc() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: New bpos_cmp(), bkey_cmp() replacementsKent Overstreet2023-10-221-2/+2
| | | | | | | | | | | | | | | | | This patch introduces - bpos_eq() - bpos_lt() - bpos_le() - bpos_gt() - bpos_ge() and equivalent replacements for bkey_cmp(). Looking at the generated assembly these could probably be improved further, but we already see a significant code size improvement with this patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>