linux-stable.git - Linux kernel stable tree

	Commit message (Collapse)	Author	Age	Files	Lines
*	bcachefs: BCH_WATERMARK_interior_updates	Kent Overstreet	2024-04-01	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	This adds a new watermark, higher priority than BCH_WATERMARK_reclaim, for interior btree updates. We've seen a deadlock where journal replay triggers a ton of btree node merges, and these use up all available open buckets and then interior updates get stuck. One cause of this is that we're currently lacking btree node merging on write buffer btrees - that needs to be fixed as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Fix assorted checkpatch nits	Kent Overstreet	2023-10-22	1	-2/+2
\| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Assorted fixes for clang	Kent Overstreet	2023-10-22	1	-1/+1
\| \| \| \| \| \| \|	clang had a few more warnings about enum conversion, and also didn't like the opts.c initializer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Kill JOURNAL_WATERMARK	Kent Overstreet	2023-10-22	1	-5/+8
\| \| \| \| \| \| \|	This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards specifying watermarks once in the transaction commit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: BCH_WATERMARK_reclaim	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \|	Add another watermark for journal reclaim - this is needed for the next patches, that unify BCH_WATERMARK with JOURNAL_WATERMARK. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Rename enum alloc_reserve -> bch_watermark	Kent Overstreet	2023-10-22	1	-10/+8
\| \| \| \| \| \|	This is prep work for consolidating with JOURNAL_WATERMARK. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Rework open bucket partial list allocation	Kent Overstreet	2023-10-22	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Now, any open_bucket can go on the partial list: allocating from the partial list has been moved to its own dedicated function, open_bucket_add_bucets() -> bucket_alloc_set_partial(). In particular, this means that erasure coded buckets can safely go on the partial list; the new location works with the "allocate an ec bucket first, then the rest" logic. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: RESERVE_stripe	Kent Overstreet	2023-10-22	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	Rework stripe creation path - new algorithm for deciding when to create new stripes or reuse existing stripes. We add a new allocation watermark, RESERVE_stripe, above RESERVE_none. Then we always try to create a new stripe by doing RESERVE_stripe allocations; if this fails, we reuse an existing stripe and allocate buckets for it with the reserve watermark for the given write (RESERVE_none or RESERVE_movinggc). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Improve dev_alloc_debug_to_text()	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \|	Now we also print the number of buckets reserved for each watermark. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Nocow support	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds support for nocow mode, where we do writes in-place when possible. Patch components: - New boolean filesystem and inode option, nocow: note that when nocow is enabled, data checksumming and compression are implicitly disabled - To prevent in-place writes from racing with data moves (data_update.c) or bucket reuse (i.e. a bucket being reused and re-allocated while a nocow write is in flight, we have a new locking mechanism. Buckets can be locked for either data update or data move, using a fixed size hash table of two_state_shared locks. We don't have any chaining, meaning updates and moves to different buckets that hash to the same lock will wait unnecessarily - we'll want to watch for this becoming an issue. - The allocator path also needs to check for in-place writes in flight to a given bucket before giving it out: thus we add another counter to bucket_alloc_state so we can track this. - Fsync now may need to issue cache flushes to block devices instead of flushing the journal. We add a device bitmask to bch_inode_info, ei_devs_need_flush, which tracks devices that need to have flushes issued - note that this will lead to unnecessary flushes when other codepaths have already issued flushes, we may want to replace this with a sequence number. - New nocow write path: look up extents, and if they're writable write to them - otherwise fall back to the normal COW write path. XXX: switch to sequence numbers instead of bitmask for devs needing journal flush XXX: ei_quota_lock being a mutex means bch2_nocow_write_done() needs to run in process context - see if we can improve this Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: bucket_alloc_state	Kent Overstreet	2023-10-22	1	-0/+7
\| \| \| \| \| \| \| \|	This refactoring puts our various allocation path counters into a dedicated struct - the upcoming nocow patch is going to add another counter. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Fold bucket_state in to BCH_DATA_TYPES()	Kent Overstreet	2023-10-22	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, we were missing accounting for buckets in need_gc_gens and need_discard states. This matters because buckets in those states need other btree operations done before they can be used, so they can't be conuted when checking current number of free buckets against the allocation watermark. Also, we weren't directly counting free buckets at all. Now, data type 0 == BCH_DATA_free, and free buckets are counted; this means we can get rid of the separate (poorly defined) count of unavailable buckets. This is a new on disk format version, with upgrade and fsck required for the accounting changes. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Kill allocator threads & freelists	Kent Overstreet	2023-10-22	1	-23/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have new persistent data structures for the allocator, this patch converts the allocator to use them. Now, foreground bucket allocation uses the freespace btree to find buckets to allocate, instead of popping buckets off the freelist. The background allocator threads are no longer needed and are deleted, as well as the allocator freelists. Now we only need background tasks for invalidating buckets containing cached data (when we are low on empty buckets), and for issuing discards. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Run btree updates after write out of write_point	Kent Overstreet	2023-10-22	1	-9/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the write path, after the write to the block device(s) complete we have to punt to process context to do the btree update. Instead of using the work item embedded in op->cl, this patch switches to a per write-point work item. This helps with two different issues: - lock contention: btree updates to the same writepoint will (usually) be updating the same alloc keys - context switch overhead: when we're bottlenecked on btree updates, having a thread (running out of a work item) checking the write point for completed ops is cheaper than queueing up a new work item and waking up a kworker. In an arbitrary benchmark, 4k random writes with fio running inside a VM, this patch resulted in a 10% improvement in total iops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: x-macroize alloc_reserve enum	Kent Overstreet	2023-10-22	1	-5/+10
\| \| \| \| \| \|	This makes an array of strings available, like our other enums. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Put open_buckets in a hashtable	Kent Overstreet	2023-10-22	1	-0/+4
\| \| \| \| \| \| \| \|	This is so that the copygc code doesn't have to refer to bucket_mark.owned_by_allocator - assisting in getting rid of the in memory bucket array. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Refactor open_bucket code	Kent Overstreet	2023-10-22	1	-3/+6
\| \| \| \| \| \| \| \| \|	Prep work for adding a hash table of open buckets - instead of embedding a bch_extent_ptr, we need to refer to the bucket directly so that we're not calling sector_to_bucket() in the hash table lookup code, which has an expensive divide. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
*	bcachefs: Add allocator thread state to sysfs	Kent Overstreet	2023-10-22	1	-0/+12
\| \| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Persist 64 bit io clocks	Kent Overstreet	2023-10-22	1	-24/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Originally, bcachefs - going back to bcache - stored, for each bucket, a 16 bit counter corresponding to how long it had been since the bucket was read from. But, this required periodically rescaling counters on every bucket to avoid wraparound. That wasn't an issue in bcache, where we'd perodically rewrite the per bucket metadata all at once, but in bcachefs we're trying to avoid having to walk every single bucket. This patch switches to persisting 64 bit io clocks, corresponding to the 64 bit bucket timestaps introduced in the previous patch with KEY_TYPE_alloc_v2. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Reserve some open buckets for btree allocations	Kent Overstreet	2023-10-22	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	This reverts part of the change from "bcachefs: Don't use BTREE_INSERT_USE_RESERVE so much" - it turns out we still should be reserving open buckets for btree node allocations, because otherwise data bucket allocations (especially with erasure coding enabled) can use up all our open buckets and we won't be able to do the metadata update that lets us release those open bucket references. Oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Use separate new stripes for copygc and non-copygc	Kent Overstreet	2023-10-22	1	-1/+0
\| \| \| \| \| \| \| \|	Allocations for copygc have to be kept separate from everything else, so that copygc doesn't get starved. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Don't use BTREE_INSERT_USE_RESERVE so much	Kent Overstreet	2023-10-22	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, we were using BTREE_INSERT_RESERVE in a lot of places where it no longer makes sense. - we now have more open_buckets than we used to, and the reserves work better, so we shouldn't need to use BTREE_INSERT_RESERVE just because we're holding open_buckets pinned anymore. - We have the btree key cache for updates to the alloc btree, meaning we no longer need the btree reserve to ensure the allocator can make forward progress. This means that we should only need a reserve for btree updates to ensure that copygc can make forward progress. Since it's now just for copygc, we can also fold RESERVE_BTREE into RESERVE_MOVINGGC (the allocator's freelist reserve). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Don't let copygc buckets be stolen by other threads	Kent Overstreet	2023-10-22	1	-0/+1
\| \| \| \| \| \| \|	And assorted other copygc fixes. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: More open buckets	Kent Overstreet	2023-10-22	1	-5/+11
\| \| \| \| \| \| \| \| \|	We need a larger open bucket reserve now that the btree interior update path holds onto open bucket references; filesystems with many high through devices may need more open buckets now. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Fix some reserve calculations	Kent Overstreet	2023-10-22	1	-2/+3
\| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Erasure coding	Kent Overstreet	2023-10-22	1	-1/+10
\| \| \| \|	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Scale down number of writepoints when low on space	Kent Overstreet	2023-10-22	1	-1/+3
\| \| \| \| \| \| \|	this means we don't have to reserve space for them when calculating filesystem capacity Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Allocation code refactoring	Kent Overstreet	2023-10-22	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bch2_alloc_sectors_start() was a nightmare to work with - it's got some tricky stuff to do, since it wants to use the buckets the writepoint already has, unless they're not in the target it wants to write to, unless it can't allocate from any other devices in which case it will use those buckets if it has to - et cetera. This restructures the code to start with a new empty list of open buckets we're going to use for the new allocation, pulling buckets from the write point's list as we decide that we really are going to use them - making the code somewhat more functional and drastically easier to understand. Also fixes a bug where we could end up waiting on c->freelist_wait (because allocating from one device failed) but return success from bch2_bucket_alloc(), because allocating from a different device succeeded. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
*	bcachefs: Initial commit	Kent Overstreet	2023-10-22	1	-0/+90
	Initially forked from drivers/md/bcache, bcachefs is a new copy-on-write filesystem with every feature you could possibly want. Website: https://bcachefs.org Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>