linux.git - Linux kernel mainline tree

	Commit message	Author	Age	Files	Lines
*	bcachefs: x-macroize journal flags enums	Kent Overstreet	2024-05-08	1	-5/+10
\| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
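For readers unfamiliar with the x-macro idiom named in the subject line, here is a minimal sketch. It assumes a hypothetical flag list (`EXAMPLE_JOURNAL_FLAGS` and the flag names are illustrative, not the kernel's actual definitions) and shows how a single list can generate both the enum and a matching table of names:

```c
#include <stddef.h>

/* Hypothetical flag list; the names are illustrative, not the kernel's. */
#define EXAMPLE_JOURNAL_FLAGS()		\
	x(replay_done)			\
	x(running)			\
	x(may_skip_flush)		\
	x(need_flush_write)

/* First expansion: one enum constant per flag. */
enum example_journal_flags {
#define x(n)	EXAMPLE_JOURNAL_##n,
	EXAMPLE_JOURNAL_FLAGS()
#undef x
	EXAMPLE_JOURNAL_FLAG_NR
};

/* Second expansion: a table of printable names in the same order. */
static const char * const example_journal_flag_strs[] = {
#define x(n)	#n,
	EXAMPLE_JOURNAL_FLAGS()
#undef x
	NULL
};

/* Look up a flag's name, with a fallback for out-of-range values. */
static const char *example_journal_flag_str(unsigned flag)
{
	return flag < EXAMPLE_JOURNAL_FLAG_NR
		? example_journal_flag_strs[flag]
		: "(unknown)";
}
```

Because both expansions come from the same list, adding a flag in one place keeps the enum and the string table in sync, which is typically the motivation for this kind of conversion.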
*	bcachefs: Simplify resuming of journal position	Kent Overstreet	2024-05-08	1	-0/+1
\| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: journal seq blacklist gc no longer has to walk btree	Kent Overstreet	2024-05-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	Since btree_ptr_v2, we no longer require the journal seq blacklist table for skipping blacklisted bsets (btree node entries); the pointer to a given node indicates how much data is present. Therefore there's no longer any need for journal seq blacklist gc to walk the btree - we can prune entries older than journal last_seq. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
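The pruning rule described above can be sketched as follows. This is an illustration only: the struct layout, the half-open `[start, end)` range convention, and the function name are assumptions, not the actual bcachefs code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one blacklisted range [start, end). */
struct example_blacklist_entry {
	uint64_t start;
	uint64_t end;
};

/*
 * Drop every entry that lies wholly before last_seq, compacting the array
 * in place. Returns the number of entries kept. No btree walk is needed:
 * the decision depends only on the entry and on last_seq.
 */
static size_t example_blacklist_gc(struct example_blacklist_entry *e,
				   size_t nr, uint64_t last_seq)
{
	size_t dst = 0;

	for (size_t src = 0; src < nr; src++)
		if (e[src].end > last_seq)
			e[dst++] = e[src];

	return dst;
}
```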
*	bcachefs: New assertion for writing to the journal after shutdown	Kent Overstreet	2024-05-08	1	-1/+1
\| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: JOURNAL_SPACE_LOW	Kent Overstreet	2024-04-06	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"bcachefs: Fix deadlock in bch2_btree_update_start()" was a significant performance regression (nearly 50%) on multithreaded random writes with fio. The reason is that the journal watermark checks multiple things, including the state of the btree write buffer, and on multithreaded update-heavy workloads we're bottlenecked on write buffer flushing - we don't want kicking off btree updates to depend on the state of the write buffer. This isn't strictly correct; the interior btree update path does do write buffer updates, but it's a tiny fraction of total accounting updates and we're more concerned with space in the journal itself. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: pull out time_stats.[ch]	Kent Overstreet	2024-03-13	1	-5/+0
\| \| \| \| \| \|	prep work for lifting out of fs/bcachefs/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: better journal pipelining	Kent Overstreet	2024-03-10	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recently a severe performance regression was discovered, which bisected to a6548c8b5eb5 ("bcachefs: Avoid flushing the journal in the discard path"). It turns out the old behaviour, which issued excessive journal flushes, worked around a performance issue where queueing delays would cause the journal to not be able to write quickly enough and stall. The journal flushes masked the issue because they periodically flushed the device write cache, reducing write latency for non-flush writes. This patch reworks the journalling code to allow more than one (non-flush) write to be in flight at a time. With this patch, doing 4k random writes at an iodepth of 128, we are now able to hit 560k iops to a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: closure per journal buf	Kent Overstreet	2024-03-10	1	-2/+10
\| \| \| \| \| \|	Prep work for having multiple journal writes in flight. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: bio per journal buf	Kent Overstreet	2024-03-10	1	-1/+1
\| \| \| \| \| \|	Prep work for having multiple journal writes in flight. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Avoid taking journal lock unnecessarily	Kent Overstreet	2024-03-10	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	Previously, any time we failed to get a journal reservation we'd retry, with the journal lock held; but this isn't necessary given wait_event()/wake_up() ordering. This avoids performance cliffs when the journal starts to get backed up and lock contention shoots up. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Split out journal workqueue	Kent Overstreet	2024-03-10	1	-0/+1
\| \| \| \| \| \| \| \|	We don't want journal write completions to be blocked behind btree transactions - io_complete_wq is used for btree updates after data and metadata writes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: btree write buffer now slurps keys from journal	Kent Overstreet	2024-01-01	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, the transaction commit path would have to add keys to the btree write buffer as a separate operation, requiring additional global synchronization. This patch introduces a new journal entry type, which indicates that the keys need to be copied into the btree write buffer prior to being written out. We switch the journal entry type back to JSET_ENTRY_btree_keys prior to write, so this is not an on disk format change. Flushing the btree write buffer may require pulling keys out of journal entries yet to be written, and quiescing outstanding journal reservations; we previously added journal->buf_lock for synchronization with the journal write path. We also can't put strict bounds on the number of keys in the journal destined for the write buffer, which means we might overflow the size of the preallocated buffer and have to reallocate - this introduces a potentially fatal memory allocation failure. This is something we'll have to watch for; if it becomes an issue in practice we can do additional mitigation. The transaction commit path no longer has to explicitly check if the write buffer is full and wait on flushing; this is another performance optimization. Instead, when the btree write buffer is close to full we change the journal watermark, so that only reservations for journal reclaim are allowed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
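A rough sketch of the "copy, then retag" step described above follows. All names here are hypothetical, and the fixed-size buffer is a simplification: the commit message describes reallocating when the preallocated buffer overflows, whereas this sketch simply reports failure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry types; only btree_keys is meant to reach disk. */
enum example_entry_type {
	EXAMPLE_ENTRY_btree_keys,
	EXAMPLE_ENTRY_write_buffer_keys,	/* in-memory marker only */
};

struct example_entry {
	enum example_entry_type	type;
	uint64_t		key;
};

#define EXAMPLE_WB_SIZE 4

struct example_write_buffer {
	uint64_t	keys[EXAMPLE_WB_SIZE];
	size_t		nr;
};

/*
 * Copy write-buffer keys out of a journal buffer and retag the entries so
 * that what gets written looks like ordinary btree keys.
 * Returns the number of keys copied, or -1 without modifying anything if
 * the fixed-size buffer would overflow.
 */
static int example_slurp_keys(struct example_entry *e, size_t nr,
			      struct example_write_buffer *wb)
{
	size_t want = 0;

	for (size_t i = 0; i < nr; i++)
		want += e[i].type == EXAMPLE_ENTRY_write_buffer_keys;

	if (want > EXAMPLE_WB_SIZE - wb->nr)
		return -1;

	for (size_t i = 0; i < nr; i++)
		if (e[i].type == EXAMPLE_ENTRY_write_buffer_keys) {
			wb->keys[wb->nr++] = e[i].key;
			e[i].type = EXAMPLE_ENTRY_btree_keys;
		}

	return (int) want;
}
```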
*	bcachefs: journal->buf_lock	Kent Overstreet	2024-01-01	1	-0/+6
\| \| \| \| \| \| \|	Add a new lock for synchronizing between journal IO path and btree write buffer flush. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: kill journal->preres_wait	Kent Overstreet	2024-01-01	1	-1/+0
\| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: track_event_change()	Kent Overstreet	2024-01-01	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	This introduces a new helper for connecting time_stats to state changes, i.e. when taking journal reservations is blocked for some reason. We use this to track separately the different reasons the journal might be blocked - i.e. space in the journal full, or the journal pin fifo full. Also do some cleanup and improvements on the time stats code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
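The idea of tying time accounting to state changes can be illustrated with the sketch below. The struct, field names, and function signature are assumptions made for this example and do not reproduce the kernel helper; one such struct would be kept per blocking reason (journal space full, pin fifo full, and so on).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-reason accounting state. */
struct example_blocked_stats {
	bool		blocked;	/* current state */
	uint64_t	blocked_since;	/* timestamp of last false -> true edge */
	uint64_t	total_blocked;	/* accumulated blocked time */
	unsigned	nr_events;	/* completed blocked intervals */
};

/*
 * Record a (possibly unchanged) state at time 'now'. Only edges matter:
 * false -> true starts an interval, true -> false closes it and accounts
 * its duration. Repeated reports of the same state are no-ops.
 */
static void example_track_event_change(struct example_blocked_stats *s,
				       uint64_t now, bool blocked)
{
	if (blocked == s->blocked)
		return;

	if (blocked) {
		s->blocked_since = now;
	} else {
		s->total_blocked += now - s->blocked_since;
		s->nr_events++;
	}

	s->blocked = blocked;
}
```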
*	bcachefs: Include average write size in sysfs journal_debug	Kent Overstreet	2024-01-01	1	-0/+1
\| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Kill journal pre-reservations	Kent Overstreet	2023-11-14	1	-26/+0
\| \| \| \| \| \| \| \| \| \|	This deletes the complicated and somewhat expensive journal pre-reservation machinery in favor of just using journal watermarks: when the journal is more than half full, we run journal reclaim more aggressively, and when the journal is more than 3/4s full we only allow journal reclaim to get new journal reservations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
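The two thresholds in this message can be sketched as below. The type and function names are hypothetical, and the two-level watermark enum is a simplification of whatever set of watermarks the real code uses; only the "more than half" and "more than 3/4" cutoffs come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified watermark levels (higher = more privileged). */
enum example_watermark {
	EXAMPLE_WM_normal,
	EXAMPLE_WM_reclaim,
};

struct example_journal_space {
	uint64_t total;		/* journal capacity, in arbitrary units */
	uint64_t used;
};

/* More than half full: run journal reclaim more aggressively. */
static bool example_reclaim_aggressively(const struct example_journal_space *s)
{
	return s->used * 2 > s->total;
}

/* More than 3/4 full: only journal reclaim may take new reservations. */
static enum example_watermark
example_min_watermark(const struct example_journal_space *s)
{
	return s->used * 4 > s->total * 3
		? EXAMPLE_WM_reclaim
		: EXAMPLE_WM_normal;
}

/* A caller gets a reservation only if its watermark is high enough. */
static bool example_may_get_res(const struct example_journal_space *s,
				enum example_watermark caller)
{
	return caller >= example_min_watermark(s);
}
```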
*	bcachefs: Kill JOURNAL_WATERMARK	Kent Overstreet	2023-10-22	1	-14/+1
\| \| \| \| \| \| \|	This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards specifying watermarks once in the transaction commit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: When shutting down, flush btree node writes last	Kent Overstreet	2023-10-22	1	-2/+8
\| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Log more messages in the journal	Kent Overstreet	2023-10-22	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch - Adds a mechanism for queuing up journal entries prior to the journal being started, which will be used for early journal log messages - Adds bch2_fs_log_msg() and improves bch2_trans_log_msg(), which now take format strings. bch2_fs_log_msg() can be used before or after the journal has been started, and will use the appropriate mechanism. - Deletes the now obsolete bch2_journal_log_msg() - And adds more log messages to the recovery path - messages for journal/filesystem started, journal entries being blacklisted, and journal replay starting/finishing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Fix a "no journal entries found" bug	Kent Overstreet	2023-10-22	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	On startup, we need to ensure the first journal entry written is a flush write: after a clean shutdown we generally don't read the journal, which means we might be overwriting whatever was there previously, and there must always be at least one flush entry in the journal or recovery will fail. Found by fstests generic/388. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Introduce a separate journal watermark for copygc	Kent Overstreet	2023-10-22	1	-10/+31
\| \| \| \| \| \| \| \| \| \| \| \|	Since journal reclaim -> btree key cache flushing may require the allocation of new btree nodes, it has an implicit dependency on copygc in order to make forward progress - so we should avoid blocking copygc unless the journal is really close to full. This introduces watermarks to replace our single MAY_GET_UNRESERVED bit in the journal, and adds a watermark for copygc and plumbs it through. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Skip periodic wakeup of journal reclaim when journal empty	Kent Overstreet	2023-10-22	1	-0/+4
\| \| \| \| \| \|	Less system noise. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: __journal_entry_close() never fails	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previous patch just moved responsibility for incrementing the journal sequence number and initializing the new journal entry from __journal_entry_close() to journal_entry_open(); this patch makes the analogous change for journal reservation state, incrementing the index into array of journal_bufs at open time. This means that __journal_entry_close() never fails to close an open journal entry, which is important for the next patch that will change our emergency shutdown behaviour. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Refactor journal code to not use unwritten_idx	Kent Overstreet	2023-10-22	1	-1/+1
\| \| \| \| \| \| \|	It makes the code more readable if we work off of sequence numbers, instead of direct indexes into the array of journal buffers. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Kill JOURNAL_NEED_WRITE	Kent Overstreet	2023-10-22	1	-8/+2
\| \| \| \| \| \| \| \|	This replaces the journal flag JOURNAL_NEED_WRITE with per-journal buf state - more explicit, and solving a race in the old code that would lead to entries being opened and written unnecessarily. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Improve struct journal layout	Kent Overstreet	2023-10-22	1	-9/+12
\| \| \| \| \| \| \| \|	This cacheline aligns struct journal, and puts j->reservations and j->prereserved on their own cacheline - we may want to split them up in a separate patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Revert "Ensure journal doesn't get stuck in nochanges mode"	Kent Overstreet	2023-10-22	1	-1/+0
\| \| \| \| \| \| \| \|	This patch was originally to work around the journal getting stuck in nochanges mode - but that was just a hack, we needed to fix the actual bug. It should be fixed now, so revert it. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Simplify journal replay	Kent Overstreet	2023-10-22	1	-1/+0
\| \| \| \| \| \| \| \|	With BTREE_ITER_WITH_JOURNAL, there are no longer any restrictions on the order we have to replay keys from the journal in, and we can also start up journal reclaim right away - and delete a bunch of code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Add more time_stats	Kent Overstreet	2023-10-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	This adds more latency/event measurements and breaks some apart into more events. Journal writes are broken apart into flush writes and noflush writes, btree compactions are broken out from btree splits, btree mergers are added, as well as btree_interior_updates - foreground and total. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Convert journal sysfs params to regular options	Kent Overstreet	2023-10-22	1	-2/+0
\| \| \| \| \| \| \| \|	This converts journal_write_delay, journal_flush_disabled, and journal_reclaim_delay to normal filesystem options, and also adds them to the superblock. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Kill journal buf bloom filter	Kent Overstreet	2023-10-22	1	-2/+0
\| \| \| \| \| \| \|	This was used for recording which inodes have been modified by in flight journal writes, but was broken and has been superseded. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Ensure journal doesn't get stuck in nochanges mode	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \| \| \|	This tweaks the journal code to always act as if there's space available in nochanges mode, when we're not going to be doing any writes. This helps in recovering filesystems that won't mount because they need journal replay and the journal has gotten stuck. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Fix journal write error path	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \|	Journal write errors were racing with the submission path - potentially causing writes to other replicas to not get submitted. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Fix usage of last_seq + encryption	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \| \| \|	jset->last_seq is in the region that's encrypted - on journal write completion, we were using it and getting garbage. This patch shadows it to fix. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Make sure to initialize j->last_flushed	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \| \| \|	If the journal reclaim thread makes it to the timeout without ever initializing j->last_flushed, we could end up sleeping for a very long time. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Don't flush btree writes more aggressively because of btree key cache	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to flush the btree key cache when it's too dirty, because otherwise the shrinker won't be able to reclaim memory - this is done by journal reclaim. But journal reclaim also kicks btree node writes: this meant that btree node writes were getting kicked much too often just because we needed to flush btree key cache keys. This patch splits journal pins into two different lists, and teaches journal reclaim to not flush btree node writes when it only needs to flush key cache keys. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Eliminate memory barrier from fast path of journal_preres_put()	Kent Overstreet	2023-10-22	1	-2/+3
\| \| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Be more careful about JOURNAL_RES_GET_RESERVED	Kent Overstreet	2023-10-22	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	JOURNAL_RES_GET_RESERVED should only be used for updates that need to be done to free up space in the journal. In particular, when we're flushing keys from the key cache, if we're flushing them out of order we shouldn't be using it, since we're using up our remaining space in the journal without dropping a pin that will let us make forward progress. With this patch, BTREE_INSERT_JOURNAL_RECLAIM without BTREE_INSERT_JOURNAL_RESERVED may return -EAGAIN - we can't wait on journal reclaim if we're already in journal reclaim. This means we need to propagate these errors up to journal reclaim, indicating that flushing a journal pin should be retried in the future. This is prep work for a patch to change the way journal reclaim works, to split out flushing key cache keys because the btree key cache is too dirty from journal reclaim because we need space in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Don't make foreground writes wait behind journal reclaim too long	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Correctly order flushes and journal writes on multi device filesystems	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	All writes prior to a journal write need to be flushed before the journal write itself happens. On single device filesystems, it suffices to mark the write with REQ_PREFLUSH\|REQ_FUA, but on multi device filesystems we need to issue flushes to every device - and wait for them to complete - before issuing the journal writes. Previously, we were issuing flushes to every device, but we weren't waiting for them to complete before issuing the journal writes. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Reduce/kill BKEY_PADDED use	Kent Overstreet	2023-10-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	With various newer key types - stripe keys, inline data extents - the old approach of calculating the maximum size of the value is becoming more and more error prone. Better to switch to bkey_on_stack, which can dynamically allocate if necessary to handle any size bkey. In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Be more conservative about journal pre-reservations	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	- Try to always keep 1/8th of the journal free, on top of pre-reservations - Move the check for whether the journal is stuck to bch2_journal_space_available, and make it only fire when there aren't any journal writes in flight (that might free up space by updating last_seq) Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Don't require flush/fua on every journal write	Kent Overstreet	2023-10-22	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds a flag to journal entries which, if set, indicates that they weren't done as flush/fua writes. - non flush/fua journal writes don't update last_seq (i.e. they don't free up space in the journal), thus the journal free space calculations now check whether nonflush journal writes are currently allowed (i.e. are we low on free space, or would doing a flush write free up a lot of space in the journal) - write_delay_ms, the user configurable option for when open journal entries are automatically written, is now interpreted as the max delay between flush journal writes (default 1 second). - bch2_journal_flush_seq_async is changed to ensure a flush write >= the requested sequence number has happened - journal read/replay must now ignore, and blacklist, any journal entries newer than the most recent flush entry in the journal. Also, the way the read_entire_journal option is handled has been improved; struct journal_replay now has an entry, 'ignore', for entries that were read but should not be used. - assorted refactoring and improvements related to journal read in journal_io.c and recovery.c Previously, we'd have to issue a flush/fua write every time we accumulated a full journal entry - typically the bucket size. Now we need to issue them much less frequently: when an fsync is requested, or it's been more than write_delay_ms since the last flush, or when we need to free up space in the journal. This is a significant performance improvement on many write heavy workloads. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
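The three conditions listed at the end of this message, under which a journal write still has to be a flush write, can be sketched as a single predicate. The struct and function names are hypothetical; only the three conditions and the 1 second default for write_delay_ms come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state the decision depends on. */
struct example_flush_state {
	bool		fsync_requested;	/* someone is waiting on durability */
	bool		need_journal_space;	/* last_seq must advance to free space */
	uint64_t	now_ms;
	uint64_t	last_flush_ms;		/* time of the last flush write */
	unsigned	write_delay_ms;		/* max delay between flush writes */
};

/*
 * Returns true if the next journal write must be a flush/fua write.
 * Otherwise it may go out as a cheaper noflush write, which does not
 * update last_seq and therefore frees no journal space.
 */
static bool example_must_flush(const struct example_flush_state *s)
{
	return s->fsync_requested ||
	       s->need_journal_space ||
	       s->now_ms - s->last_flush_ms > s->write_delay_ms;
}
```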
*	bcachefs: Improve journal free space calculations	Kent Overstreet	2023-10-22	1	-2/+16
\| \| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Increase journal pipelining	Kent Overstreet	2023-10-22	1	-8/+10
\| \| \| \| \| \| \| \| \|	This patch increases the maximum journal buffers in flight from 2 to 4 - this will be particularly helpful when in the future we stop requiring flush+fua for every journal write. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Refactor filesystem usage accounting	Kent Overstreet	2023-10-22	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \|	Various filesystem usage counters are kept in percpu counters, with one set per in flight journal buffer. Right now all the code that deals with it assumes that there are only two buffers/sets of counters, but the number of journal bufs is getting increased to 4 in the next patch - so refactor that code to not assume a constant. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Move journal reclaim to a kthread	Kent Overstreet	2023-10-22	1	-1/+5
\| \| \| \| \| \| \|	This is to make tracing easier. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Be more precise with journal error reporting	Kent Overstreet	2023-10-22	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \|	We were incorrectly detecting a journal deadlock - the journal filling up - when only the journal pin fifo had filled up; if the journal pin fifo is full that just means we need to wait on reclaim. This plumbs through better error reporting so we can better discriminate in the journal_res_get path what's going on. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Assorted journal refactoring	Kent Overstreet	2023-10-22	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	Improved the way we track various state by adding j->err_seq, which records the first journal sequence number that encountered an error being written, and j->last_empty_seq, which records the most recent journal entry that was completely empty. Also, use the low bits of the journal sequence number to index the corresponding journal_buf. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
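Indexing a buffer by the low bits of the sequence number, as mentioned in the last sentence, might look like the sketch below. The names are hypothetical; the choice of 4 buffers follows the "Increase journal pipelining" entry above, and the scheme relies on the buffer count being a power of two.

```c
#include <stdint.h>

/* Hypothetical constants: 2 bits of sequence number select 1 of 4 buffers. */
#define EXAMPLE_JOURNAL_BUF_BITS	2
#define EXAMPLE_JOURNAL_BUF_NR		(1U << EXAMPLE_JOURNAL_BUF_BITS)
#define EXAMPLE_JOURNAL_BUF_MASK	(EXAMPLE_JOURNAL_BUF_NR - 1)

struct example_journal_buf {
	uint64_t seq;
};

struct example_journal {
	struct example_journal_buf buf[EXAMPLE_JOURNAL_BUF_NR];
};

/* Map a journal sequence number to its buffer using only the low bits. */
static struct example_journal_buf *
example_seq_to_buf(struct example_journal *j, uint64_t seq)
{
	return &j->buf[seq & EXAMPLE_JOURNAL_BUF_MASK];
}
```

With this mapping, consecutive sequence numbers cycle through the buffers in order, so no separate index needs to be stored or kept in sync with the sequence number.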