summaryrefslogtreecommitdiffstats
path: root/fs/bcachefs/journal.c
Commit message (Collapse)AuthorAgeFilesLines
* bcachefs: bch2_print_allocator_stuck()Kent Overstreet2024-05-081-1/+1
| | | | | | | If we block on the allocator for more than 10 seconds, print out some useful debugging info. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: x-macroize journal flags enumsKent Overstreet2024-05-081-7/+13
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: plumb data_type into bch2_bucket_alloc_trans()Kent Overstreet2024-05-081-1/+2
| | | | | | | prep work for making the allocator try to keep btree nodes within the existing member info btree allocated bitmap Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: fix flag printing in journal_buf_to_text()Kent Overstreet2024-05-081-2/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: iter/update/trigger/str_hash flag cleanupKent Overstreet2024-05-081-2/+2
| | | | | | | | | | | Combine iter/update/trigger/str_hash flags into a single enum, and x-macroize them for a to_text() function later. These flags are all for a specific iter/key/update context, so it makes sense to group them together - iter/update/trigger flags were already given distinct bits, this cleans up and unifies that handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: mark_superblock cleanupKent Overstreet2024-05-081-2/+3
| | | | | | | Consolidate mark_superblock() and trans_mark_superblock(), like we did with the other trigger paths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: New assertion for writing to the journal after shutdownKent Overstreet2024-05-081-2/+4
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: prt_printf() now respects \r\n\tKent Overstreet2024-05-081-56/+41
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Add missing sched_annotate_sleep() in bch2_journal_flush_seq_async()Kent Overstreet2024-05-071-0/+6
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix a scheduler splat in __bch2_next_write_buffer_flush_journal_buf()Kent Overstreet2024-05-061-0/+2
| | | | | | | | We're using mutex_lock() inside a wait_event() conditional - prepare_to_wait() has already flipped task state, so potentially blocking ops need annotation. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix lost wakeup on journal shutdownKent Overstreet2024-03-181-6/+6
| | | | | | | We need to check for journal shutdown first in __journal_res_get() - after the journal is shutdown, j->watermark won't be changing anymore. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: pull out time_stats.[ch]Kent Overstreet2024-03-131-2/+1
| | | | | | prep work for lifting out of fs/bcachefs/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: fix bch2_journal_buf_to_text()Kent Overstreet2024-03-131-5/+1
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix bch2_journal_noflush_seq()Kent Overstreet2024-03-131-5/+5
| | | | | | | | | | | Improved journal pipelining broke journal_noflush_seq(); it implicitly assumed only the oldest outstanding journal buf could be in flight, but that's no longer true. Make this more straightforward by just setting buf->must_flush whenever we know a journal buf is going to be flush. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: split out ignore_blacklisted, ignore_not_dirtyKent Overstreet2024-03-131-2/+2
| | | | | | prep work for replaying the journal backwards Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Add journal.blocked to journal_debug_to_text()Kent Overstreet2024-03-131-0/+1
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: improve bch2_journal_buf_to_text()Kent Overstreet2024-03-131-9/+24
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: kill kvpmalloc()Kent Overstreet2024-03-131-2/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: better journal pipeliningKent Overstreet2024-03-101-9/+38
| | | | | | | | | | | | | | | | | | | | | Recently a severe performance regression was discovered, which bisected to a6548c8b5eb5 bcachefs: Avoid flushing the journal in the discard path It turns out the old behaviour, which issued excessive journal flushes, worked around a performance issue where queueing delays would cause the journal to not be able to write quickly enough and stall. The journal flushes masked the issue because they periodically flushed the device write cache, reducing write latency for non flushes. This patch reworks the journalling code to allow more than one (non-flush) write to be in flight at a time. With this patch, doing 4k random writes and an iodepth of 128, we are now able to hit 560k iops to a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: closure per journal bufKent Overstreet2024-03-101-6/+12
| | | | | | Prep work for having multiple journal writes in flight. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bio per journal bufKent Overstreet2024-03-101-18/+22
| | | | | | Prep work for having multiple journal writes in flight. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: convert journal replay ptrs to darrayKent Overstreet2024-03-101-3/+2
| | | | | | | Eliminates some error paths - no longer have a hardcoded BCH_REPLICAS_MAX limit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Avoid taking journal lock unnecessarilyKent Overstreet2024-03-101-53/+54
| | | | | | | | | | | Previously, any time we failed to get a journal reservation we'd retry, with the journal lock held; but this isn't necessary given wait_event()/wake_up() ordering. This avoids performance cliffs when the journal starts to get backed up and lock contention shoots up. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Avoid setting j->write_work unnecessarilyKent Overstreet2024-03-101-13/+11
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Split out journal workqueueKent Overstreet2024-03-101-10/+12
| | | | | | | | We don't want journal write completions to be blocked behind btree transactions - io_complete_wq is used for btree updates after data and metadata writes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Add gfp flags param to bch2_prt_task_backtrace()Kent Overstreet2024-01-221-1/+1
| | | | | | Fixes: e6a2566f7a00 ("bcachefs: Better journal tracepoints") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reported-by: smatch
* bcachefs: Better journal tracepointsKent Overstreet2024-01-211-39/+72
| | | | | | | | | Factor out bch2_journal_bufs_to_text(), and use it in the journal_entry_full() tracepoint; when we can't get a journal reservation we need to know the outstanding journal entry sizes to know if the problem is due to excessive flushing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: for_each_member_device_rcu() now declares loop iterKent Overstreet2024-01-011-8/+4
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: for_each_member_device() now declares loop iterKent Overstreet2024-01-011-4/+1
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch_err_(fn|msg) check if should printKent Overstreet2024-01-011-4/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: btree write buffer now slurps keys from journalKent Overstreet2024-01-011-0/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previosuly, the transaction commit path would have to add keys to the btree write buffer as a separate operation, requiring additional global synchronization. This patch introduces a new journal entry type, which indicates that the keys need to be copied into the btree write buffer prior to being written out. We switch the journal entry type back to JSET_ENTRY_btree_keys prior to write, so this is not an on disk format change. Flushing the btree write buffer may require pulling keys out of journal entries yet to be written, and quiescing outstanding journal reservations; we previously added journal->buf_lock for synchronization with the journal write path. We also can't put strict bounds on the number of keys in the journal destined for the write buffer, which means we might overflow the size of the preallocated buffer and have to reallocate - this introduces a potentially fatal memory allocation failure. This is something we'll have to watch for, if it becomes an issue in practice we can do additional mitigation. The transaction commit path no longer has to explicitly check if the write buffer is full and wait on flushing; this is another performance optimization. Instead, when the btree write buffer is close to full we change the journal watermark, so that only reservations for journal reclaim are allowed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: journal->buf_lockKent Overstreet2024-01-011-0/+1
| | | | | | | Add a new lock for synchronizing between journal IO path and btree write buffer flush. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Add a tracepoint for journal entry closeKent Overstreet2024-01-011-0/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: track_event_change()Kent Overstreet2024-01-011-12/+4
| | | | | | | | | | | | | This introduces a new helper for connecting time_stats to state changes, i.e. when taking journal reservations is blocked for some reason. We use this to track separately the different reasons the journal might be blocked - i.e. space in the journal full, or the journal pin fifo full. Also do some cleanup and improvements on the time stats code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Include average write size in sysfs journal_debugKent Overstreet2024-01-011-9/+13
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Close journal entry if necessary when flushing all pinsKent Overstreet2023-12-101-4/+4
| | | | | | | | Since outstanding journal buffers hold a journal pin, when flushing all pins we need to close the current journal entry if necessary so its pin can be released. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: move journal seq assertionKent Overstreet2023-11-281-0/+2
| | | | | | | journal_cur_seq() can legitimately be used outside of the journal lock, where this assert can race Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill journal pre-reservationsKent Overstreet2023-11-141-31/+0
| | | | | | | | | | This deletes the complicated and somewhat expensive journal pre-reservation machinery in favor of just using journal watermarks: when the journal is more than half full, we run journal reclaim more aggressively, and when the journal is more than 3/4s full we only allow journal reclaim to get new journal reservations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Ensure devices are always correctly initializedKent Overstreet2023-10-311-0/+19
| | | | | | | | | | | We can't mark device superblocks or allocate journal on a device that isn't online. That means we may need to do this on every mount, because we may have formatted a new filesystem and then done the first mount (bch2_fs_initialize()) in degraded mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_sb_field_get() refactoringKent Overstreet2023-10-221-2/+2
| | | | | | | Instead of using token pasting to generate methods for each superblock section, just make the type a parameter to bch2_sb_field_get(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: fix race between journal entry close and pin setBrian Foster2023-10-221-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bcachefs freeze testing via fstests generic/390 occasionally reproduces the following BUG from bch2_fs_read_only(): BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty)); This indicates that one or more dirty key cache keys still exist after the attempt to flush and quiesce the fs. The sequence that leads to this problem actually occurs on unfreeze (ro->rw), and looks something like the following: - Task A begins a transaction commit and acquires journal_res for the current seq. This transaction intends to perform key cache insertion. - Task B begins a bch2_journal_flush() via bch2_sync_fs(). This ends up in journal_entry_want_write(), which closes the current journal entry and drops the reference to the pin list created on entry open. The pin put pops the front of the journal via fast reclaim since the reference count has dropped to 0. - Task A attempts to set the journal pin for the associated cached key, but bch2_journal_pin_set() skips the pin insert because the seq of the transaction reservation is behind the front of the pin list fifo. The end result is that the pin associated with the cached key is not added, which prevents a subsequent reclaim from processing the key and thus leaves it dangling at freeze time. The fundamental cause of this problem is that the front of the journal is allowed to pop before a transaction with outstanding reservation on the associated journal seq is able to add a pin. The count for the pin list associated with the seq drops to zero and is prematurely reclaimed as a result. The logical fix for this problem lies in how the journal buffer is managed in similar scenarios where the entry might have been closed before a transaction with outstanding reservations happens to be committed. When a journal entry is opened, the current sequence number is bumped, the associated pin list is initialized with a reference count of 1, and the journal buffer reference count is bumped (via journal_state_inc()). When a journal reservation is acquired, the reservation also acquires a reference on the associated buffer. If the journal entry is closed in the meantime, it drops both the pin and buffer references held by the open entry, but the buffer still has references held by outstanding reservation. After the associated transaction commits, the reservation release drops the associated buffer references and the buffer is written out once the reference count has dropped to zero. The fundamental problem here is that the lifecycle of the pin list reference held by an open journal entry is too short to cover the processing of transactions with outstanding reservations. The simplest way to address this is to expand the pin list reference to the lifecycle of the buffer vs. the shorter lifecycle of the open journal entry. This ensures the pin list for a seq with outstanding reservation cannot be popped and reclaimed before all outstanding reservations have been released, even if the associated journal entry has been closed for further reservations. Move the pin put from journal entry close to where final processing of the journal buffer occurs. Create a duplicate helper to cover the case where the caller doesn't already hold the journal lock. This allows generic/390 to pass reliably. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: prepare journal buf put to handle pin putBrian Foster2023-10-221-1/+1
| | | | | | | | | | | | | | | | | bcachefs freeze testing has uncovered some raciness between journal entry open/close and pin list reference count management. The details of the problem are described in a separate patch. In preparation for the associated fix, refactor the journal buffer put path a bit to allow it to eventually handle dropping the pin list reference currently held by an open journal entry. Retain the journal write dispatch helper since the closure code is inlined and we don't want to increase the amount of inline code in the transaction commit path, but rename the function to reflect the purpose of final processing of the journal buffer. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: refactor pin put helpersBrian Foster2023-10-221-1/+2
| | | | | | | | | | | We have a couple journal pin put helpers to handle cases where the journal lock is already held or not. Refactor the helpers to lock and reclaim from the highest level and open code the reclaim from the one caller of the internal variant. The latter call will be moved into the journal buf release helper in a later patch. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Heap allocate btree_transKent Overstreet2023-10-221-2/+2
| | | | | | | | | | We're using more stack than we'd like in a number of functions, and btree_trans is the biggest object that we stack allocate. But we have to do a heap allocatation to initialize it anyways, so there's no real downside to heap allocating the entire thing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix W=12 build errorsKent Overstreet2023-10-221-2/+7
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Convert more code to bch_err_msg()Kent Overstreet2023-10-221-1/+1
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix assorted checkpatch nitsKent Overstreet2023-10-221-2/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Convert more -EROFS to private error codesKent Overstreet2023-10-221-1/+1
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Delete redundant log messagesKent Overstreet2023-10-221-15/+5
| | | | | | | Now that we have distinct error codes for different memory allocation failures, the early init log messages are no longer needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill JOURNAL_WATERMARKKent Overstreet2023-10-221-10/+5
| | | | | | | This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards specifying watermarks once in the transaction commit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>