Commit message  Author  Age  Files  Lines
* Merge tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs  Linus Torvalds  2024-03-15  95  -2253/+3770
|\
| | Pull bcachefs updates from Kent Overstreet:
| |
| |  - Subvolume children btree; this is needed for providing a userspace
| |    interface for walking subvolumes, which will come later
| |
| |  - Lots of improvements to directory structure checking
| |
| |  - Improved journal pipelining, significantly improving performance on
| |    high iodepth write workloads
| |
| |  - Discard path improvements: the discard path is more efficient, and no
| |    longer flushes the journal unnecessarily
| |
| |  - Buffered write path can now avoid taking the inode lock
| |
| |  - new mm helper: memalloc_flags_{save|restore}
| |
| |  - mempool now does kvmalloc mempools
| |
| | * tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs: (128 commits)
| |   bcachefs: time_stats: shrink time_stat_buffer for better alignment
| |   bcachefs: time_stats: split stats-with-quantiles into a separate structure
| |   bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet
| |   bcachefs: time_stats: add larger units
| |   bcachefs: pull out time_stats.[ch]
| |   bcachefs: reconstruct_alloc cleanup
| |   bcachefs: fix bch_folio_sector padding
| |   bcachefs: Fix btree key cache coherency during replay
| |   bcachefs: Always flush write buffer in delete_dead_inodes()
| |   bcachefs: Fix order of gc_done passes
| |   bcachefs: fix deletion of indirect extents in btree_gc
| |   bcachefs: Prefer struct_size over open coded arithmetic
| |   bcachefs: Kill unused flags argument to btree_split()
| |   bcachefs: Check for writing superblocks with nonsense member seq fields
| |   bcachefs: fix bch2_journal_buf_to_text()
| |   lib/generic-radix-tree.c: Make nodes more reasonably sized
| |   bcachefs: copy_(to|from)_user_errcode()
| |   bcachefs: Split out bkey_types.h
| |   bcachefs: fix lost journal buf wakeup due to improved pipelining
| |   bcachefs: intercept mountoption value for bool type
| |   ...
| * bcachefs: time_stats: shrink time_stat_buffer for better alignment  Darrick J. Wong  2024-03-13  1  -1/+1
| | Shrink this percpu object by one array element so that the object size
| | becomes exactly 512 bytes. This will lead to more efficient memory use,
| | hopefully.
| |
| | Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: time_stats: split stats-with-quantiles into a separate structure  Darrick J. Wong  2024-03-13  7  -15/+41
| | Currently, struct time_stats has the optional ability to quantize the
| | information that it collects. This is /probably/ useful for callers who
| | want to see quantized information, but it more than doubles the size of
| | the structure from 224 bytes to 464. For users who don't care about that
| | (e.g. upcoming xfs patches) and want to avoid wasting 240 bytes per
| | counter, split the two into separate pieces.
| |
| | Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet  Darrick J. Wong  2024-03-13  6  -67/+84
| | The only caller of this code (time_stats) always knows the weights and
| | whether or not any information has been collected. Pass this
| | information into the mean and variance code so that it doesn't have to
| | store that information. This reduces the structure size from 24 to 16
| | bytes, which shrinks each time_stats counter to 192 bytes from 208.
| |
| | Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
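The size reduction described above can be illustrated with a small userspace sketch. The field layouts, the names weighted_old, weighted_new and weighted_update, and the update formula (a plain exponentially weighted mean and variance with weight 2^-w) are assumptions made for illustration; they are not taken from the bcachefs sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "before" layout: each counter carries state the caller already knows. */
struct weighted_old {
	bool     init;
	uint8_t  weight;
	int64_t  mean;
	uint64_t variance;
};

/* Hypothetical "after" layout: only the running values are stored. */
struct weighted_new {
	int64_t  mean;
	uint64_t variance;
};

/*
 * Simplified exponentially weighted update with weight 2^-w. The caller
 * passes in w and whether any sample has been recorded yet, so neither
 * needs to live in the structure. Overflow is not handled in this sketch.
 */
void weighted_update(struct weighted_new *s, int64_t x, bool initted, uint8_t w)
{
	if (!initted) {
		s->mean = x;
		s->variance = 0;
		return;
	}

	int64_t diff = x - s->mean;
	int64_t incr = diff / ((int64_t)1 << w);

	s->mean += incr;
	s->variance = ((((uint64_t)1 << w) - 1) *
		       (s->variance + (uint64_t)(diff * incr))) >> w;
}
```

On a typical 64-bit ABI the first structure occupies 24 bytes and the second 16, which is the same 8-byte saving per structure that the commit message reports.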
| * bcachefs: time_stats: add larger units  Darrick J. Wong  2024-03-13  1  -0/+3
| | Filesystems can stay mounted for a very long time, so add some larger
| | units.
| |
| | Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: pull out time_stats.[ch]  Kent Overstreet  2024-03-13  11  -279/+326
| | prep work for lifting out of fs/bcachefs/
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: reconstruct_alloc cleanup  Kent Overstreet  2024-03-13  7  -95/+113
| | Now that we've got the errors_silent mechanism, we don't have to check
| | if the reconstruct_alloc option is set all over the place.
| |
| | Also - users no longer have to explicitly select fsck and fix_errors.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix bch_folio_sector padding  Kent Overstreet  2024-03-13  1  -6/+3
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Fix btree key cache coherency during replay  Kent Overstreet  2024-03-13  2  -4/+6
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Always flush write buffer in delete_dead_inodes()  Kent Overstreet  2024-03-13  1  -5/+10
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Fix order of gc_done passes  Kent Overstreet  2024-03-13  1  -4/+4
| | gc_stripes_done() and gc_reflink_done() may do alloc btree updates (i.e.
| | when deleting an indirect extent) - we need bucket gens to be fixed by
| | then.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix deletion of indirect extents in btree_gc  Kent Overstreet  2024-03-13  1  -2/+2
| | we need to run the normal extent update path on deletion -
| | bch2_bkey_make_mut() is incorrect when key type is changing.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Prefer struct_size over open coded arithmetic  Erick Archer  2024-03-13  2  -3/+2
| | This is an effort to get rid of all multiplications from allocation
| | functions in order to prevent integer overflows [1][2].
| |
| | As the "op" variable is a pointer to "struct promote_op" and this
| | structure ends in a flexible array:
| |
| |     struct promote_op {
| |             [...]
| |             struct bio_vec bi_inline_vecs[];
| |     };
| |
| | and the "t" variable is a pointer to "struct journal_seq_blacklist_table"
| | and this structure also ends in a flexible array:
| |
| |     struct journal_seq_blacklist_table {
| |             [...]
| |             struct journal_seq_blacklist_table_entry {
| |                     u64 start;
| |                     u64 end;
| |                     bool dirty;
| |             } entries[];
| |     };
| |
| | the preferred way in the kernel is to use the struct_size() helper to do
| | the arithmetic instead of the argument "size + size * count" in the
| | kzalloc() functions.
| |
| | This way, the code is more readable and safer.
| |
| | Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments [1]
| | Link: https://github.com/KSPP/linux/issues/160 [2]
| | Signed-off-by: Erick Archer <erick.archer@gmx.com>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
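The overflow concern behind this change can be sketched in userspace C. This is not the kernel's struct_size() implementation. The helper flex_size(), the structure entry_table and the allocator entry_table_alloc() are hypothetical names, and the overflow checks rely on the GCC/Clang builtins __builtin_mul_overflow and __builtin_add_overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a structure that ends in a flexible array. */
struct entry_table {
	size_t nr;
	struct entry {
		uint64_t start;
		uint64_t end;
	} entries[];
};

/*
 * Userspace approximation of the guard: returns SIZE_MAX when
 * header + elem * count would overflow, so the allocation fails instead
 * of silently returning a buffer that is too small.
 */
static size_t flex_size(size_t header, size_t elem, size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(elem, count, &bytes) ||
	    __builtin_add_overflow(bytes, header, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Allocate a zeroed table with room for nr entries, or NULL on overflow/failure. */
struct entry_table *entry_table_alloc(size_t nr)
{
	size_t bytes = flex_size(sizeof(struct entry_table),
				 sizeof(struct entry), nr);
	struct entry_table *t = bytes == SIZE_MAX ? NULL : calloc(1, bytes);

	if (t)
		t->nr = nr;
	return t;
}
```

With open-coded "size + size * count", a huge count can wrap around to a small number and the allocation succeeds with too little memory. Saturating to SIZE_MAX turns that case into an ordinary allocation failure.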
| * bcachefs: Kill unused flags argument to btree_split()  Kent Overstreet  2024-03-13  1  -11/+8
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Check for writing superblocks with nonsense member seq fields  Kent Overstreet  2024-03-13  1  -0/+8
| | We're seeing some unmountable filesystems due to split brain detection
| | going awry; it seems we somehow wrote out superblocks where we updated
| | the superblock seq without updating any member seq fields.
| |
| | A given device's superblock should always have the main seq equal to
| | its member seq field, so this is easy to check for.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix bch2_journal_buf_to_text()  Kent Overstreet  2024-03-13  1  -5/+1
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * lib/generic-radix-tree.c: Make nodes more reasonably sized  Kent Overstreet  2024-03-13  2  -36/+28
| | this code originally used the page allocator directly, but most code
| | shouldn't do that - PAGE_SIZE varies with architecture, and slab is
| | faster.
| |
| | 4k is also on the large side for typical usage, 512 bytes is a better
| | choice for typical usage that might be somewhat sparse.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: copy_(to|from)_user_errcode()  Kent Overstreet  2024-03-13  2  -6/+16
| | we've got some helpers that return errors sanely, move them to a more
| | common location for use in fs-ioctl.c
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Split out bkey_types.h  Kent Overstreet  2024-03-13  2  -201/+214
| | We're going to need bkey_types.h in bcachefs_ioctl.h in a future patch.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix lost journal buf wakeup due to improved pipelining  Brian Foster  2024-03-13  1  -1/+1
| | The journal_write_done() handler was reworked into a loop in commit
| | 746a33c96b7a ("bcachefs: better journal pipelining"). As part of this,
| | the journal buffer wake was factored into a post-loop branch that
| | executes if at least one journal buffer has completed. The journal
| | buffer processing loop iterates on the journal buffer pointer, however.
| | This means that w refers to the last buffer processed by the loop,
| | which may or may not be done. This also means that if multiple buffers
| | are processed by the loop, only the last is awoken. This lost wakeup
| | behavior has led to stalling problems in various CI and fstests, such
| | as generic/703.
| |
| | Lift the wake into the loop so each done buffer sees a wake call as it
| | is processed.
| |
| | Signed-off-by: Brian Foster <bfoster@redhat.com>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
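The lost wakeup pattern described above can be modelled in a few lines of userspace C. This is a simplified sketch and not the bcachefs code: struct jbuf, its fields and both function names are hypothetical, and a counter stands in for the real wake call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified journal buffer; not the bcachefs structure. */
struct jbuf {
	bool write_done;  /* the write for this buffer has completed */
	int  wakeups;     /* how many times waiters on this buffer were woken */
};

/* Pre-fix shape: the wake sits after the loop and only sees the final w. */
size_t complete_wake_after_loop(struct jbuf *bufs, size_t nr)
{
	struct jbuf *w = NULL;
	size_t completed = 0;

	for (size_t i = 0; i < nr; i++) {
		w = &bufs[i];
		if (!w->write_done)
			break;          /* w now points at a buffer that is not done */
		completed++;
	}
	if (completed)
		w->wakeups++;           /* one wake, possibly on the wrong buffer */
	return completed;
}

/* Post-fix shape: every completed buffer gets its own wake as it is processed. */
size_t complete_wake_in_loop(struct jbuf *bufs, size_t nr)
{
	size_t completed = 0;

	for (size_t i = 0; i < nr; i++) {
		struct jbuf *w = &bufs[i];

		if (!w->write_done)
			break;
		w->wakeups++;
		completed++;
	}
	return completed;
}
```

With two finished buffers followed by one still in flight, the first variant wakes only the unfinished buffer, while the second wakes exactly the two finished ones.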
| * bcachefs: intercept mountoption value for bool type  Hongbo Li  2024-03-13  2  -1/+2
| | For a mount option of bool type, the value must be 0 or 1 (see
| | bch2_opt_parse). Other values (such as 2) are not intercepted properly:
| | an unexpected return code is returned and an error message is printed.
| |
| | Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: avoid returning private error code in bch2_xattr_bcachefs_set  Hongbo Li  2024-03-13  1  -2/+3
| | Avoid returning the private error code to the caller. The error code
| | should be transformed into a general error code.
| |
| | Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Buffered write path now can avoid the inode lock  Kent Overstreet  2024-03-13  2  -41/+111
| | Non append, non extending buffered writes can now avoid taking the inode
| | lock.
| |
| | To ensure atomicity of writes w.r.t. other writes, we lock every folio
| | that we'll be writing to, and if this fails we fall back to taking the
| | inode lock.
| |
| | Extensive comments are provided as to corner cases.
| |
| | Link: https://lore.kernel.org/linux-fsdevel/Zdkxfspq3urnrM6I@bombadil.infradead.org/
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * fs: file_remove_privs_flags()  Kent Overstreet  2024-03-13  2  -3/+5
| | Rename and export __file_remove_privs(); for a buffered write path that
| | doesn't take the inode lock we need to be able to check if the operation
| | needs to do work first.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| | Cc: Alexander Viro <viro@zeniv.linux.org.uk>
| | Cc: Christian Brauner <brauner@kernel.org>
| * bcachefs: Fix bch2_journal_noflush_seq()  Kent Overstreet  2024-03-13  2  -5/+6
| | Improved journal pipelining broke journal_noflush_seq(); it implicitly
| | assumed only the oldest outstanding journal buf could be in flight, but
| | that's no longer true.
| |
| | Make this more straightforward by just setting buf->must_flush whenever
| | we know a journal buf is going to be flushed.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix the error code when mounting with incorrect options.  Hongbo Li  2024-03-13  3  -4/+9
| | When mounting with incorrect options, such as:
| | "mount -t bcachefs -o errors=back /dev/loop1 /mnt/bcachefs/",
| | mount reports the error "mount: /mnt/bcachefs: permission denied."
| | because bch2_parse_mount_opts returns -1 and bch2_mount passes it up.
| | This is unreasonable.
| |
| | The real error message should be like this:
| | "mount: /mnt/bcachefs: wrong fs type, bad option, bad superblock on
| | /dev/loop1, missing codepage or helper program, or other error."
| |
| | Add three private error codes for mount errors:
| |  - BCH_ERR_mount_option is the parent class for option errors.
| |  - BCH_ERR_option_name represents an invalid option name.
| |  - BCH_ERR_option_value represents an invalid option value.
| |
| | Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
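A bare -1 is -EPERM, which is why mount printed "permission denied". An error that resolves to EINVAL yields the "wrong fs type, bad option, ..." message instead. The parent-class idea can be sketched as below. The numeric values, the ERR_* names, the parent table and both helpers are hypothetical and only loosely modelled on the three codes named in the commit message; the mapping of the top-level class to EINVAL is likewise an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Private codes live above the standard errno range (values are assumptions). */
enum {
	ERR_PRIVATE_START = 2048,
	ERR_mount_option = ERR_PRIVATE_START,
	ERR_option_name,
	ERR_option_value,
	ERR_PRIVATE_END,
};

/* Parent of each private code; a top-level class maps to a standard errno. */
static const int err_parent[] = {
	[ERR_mount_option - ERR_PRIVATE_START] = EINVAL,
	[ERR_option_name  - ERR_PRIVATE_START] = ERR_mount_option,
	[ERR_option_value - ERR_PRIVATE_START] = ERR_mount_option,
};

/* True if err is the given class itself or descends from it. */
bool err_matches(int err, int class)
{
	while (err >= ERR_PRIVATE_START && err < ERR_PRIVATE_END && err != class)
		err = err_parent[err - ERR_PRIVATE_START];
	return err == class;
}

/* Walk up to the standard errno that is safe to hand back to the caller. */
int err_to_errno(int err)
{
	while (err >= ERR_PRIVATE_START && err < ERR_PRIVATE_END)
		err = err_parent[err - ERR_PRIVATE_START];
	return err;
}
```

Internal code can test for the specific cause with err_matches(), while the mount entry point would return -err_to_errno(err) so that userspace only ever sees standard codes.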
| * bcachefs: split out ignore_blacklisted, ignore_not_dirty  Kent Overstreet  2024-03-13  5  -21/+33
| | prep work for replaying the journal backwards
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: improve move_gap()  Kent Overstreet  2024-03-13  3  -8/+9
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: journal_keys now uses darray helpers  Kent Overstreet  2024-03-13  2  -61/+25
| | nice bit of code cleanup
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Rename journal_keys.d -> journal_keys.data  Kent Overstreet  2024-03-13  3  -42/+42
| | This will let us use some darray helpers in the next patch.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: jset_entry for loops declare loop iter  Kent Overstreet  2024-03-13  4  -9/+2
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Errcode tracepoint, documentation  Kent Overstreet  2024-03-13  4  -6/+59
| | Add a tracepoint for downcasting private errors to standard errors, so
| | they can be recovered even when not logged; also, add some
| | documentation.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: remove redundant assignment to variable ret  Colin Ian King  2024-03-13  1  -1/+0
| | Variable ret is being assigned a value that is never read, it is being
| | re-assigned a couple of statements later on. The assignment is
| | redundant and can be removed.
| |
| | Cleans up clang scan build warning:
| | fs/bcachefs/super-io.c:806:2: warning: Value stored to 'ret' is never
| | read [deadcode.DeadStores]
| |
| | Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Silence gcc warnings about arm arch ABI drift  Calvin Owens  2024-03-13  1  -0/+3
| | 32-bit arm builds emit a lot of spam like this:
| |
| |     fs/bcachefs/backpointers.c: In function ‘extent_matches_bp’:
| |     fs/bcachefs/backpointers.c:15:13: note: parameter passing for argument of type ‘struct bch_backpointer’ changed in GCC 9.1
| |
| | Apply the change from commit ebcc5928c5d9 ("arm64: Silence gcc warnings
| | about arch ABI drift") to fs/bcachefs/ to silence them.
| |
| | Signed-off-by: Calvin Owens <jcalvinowens@gmail.com>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Add journal.blocked to journal_debug_to_text()  Kent Overstreet  2024-03-13  1  -0/+1
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Fix journal_buf bitfield accesses  Kent Overstreet  2024-03-13  2  -6/+13
| | All journal_buf bitfield updates must happen under the journal lock -
| | perhaps we should just switch these to atomic bit flags.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Split out discard fastpath  Kent Overstreet  2024-03-13  3  -7/+146
| | Buckets usually can't be discarded until the transaction that made them
| | empty has been committed in the journal.
| |
| | Tracing has indicated that we're queuing the discard worker excessively,
| | only for it to skip over many buckets that are still waiting on a
| | journal commit, discarding only one or two buckets per iteration.
| |
| | We want to switch to only queuing the discard worker after a journal
| | flush write, but there's an important optimization we need to preserve:
| | if a bucket becomes empty and it was never committed in the journal
| | while it was in use, we want to discard it and reuse it right away -
| | since overwriting it before the previous writes are flushed from the
| | device cache means those writes only cost bus bandwidth.
| |
| | So, this patch implements a fast path for buckets that can be discarded
| | right away. We need new locking between the two discard workers; the
| | new list of buckets being discarded provides that locking.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: improve bch2_journal_buf_to_text()  Kent Overstreet  2024-03-13  1  -9/+24
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Drop redundant btree_path_downgrade()s  Kent Overstreet  2024-03-13  1  -1/+2
| | If a path doesn't have any active references, we shouldn't downgrade it;
| | it'll either be reused, possibly with intent refs again, or dropped at
| | bch2_trans_begin() time.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: rebalance_status now shows correct units  Daniel Hill  2024-03-13  1  -2/+2
| | Signed-off-by: Daniel Hill <daniel@gluo.nz>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: more informative write path error message  Kent Overstreet  2024-03-13  1  -5/+11
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: check_path() now only needs to walk up to subvolume root  Kent Overstreet  2024-03-13  1  -3/+3
| | Now that checking subvolume structure is a separate pass, the main
| | check_directory_connectivity() pass only needs to walk up to a given
| | inode's subvolume root.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: bch2_check_subvolume_structure()  Kent Overstreet  2024-03-13  3  -27/+135
| | Now that we've got bch_subvolume.fs_path_parent, it's easy to write
| | subvolume
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: omit alignment attribute on big endian struct bkey  Thomas Bertschinger  2024-03-13  1  -2/+35
| | This is needed for building Rust bindings on big endian architectures
| | like s390x. Currently this is only done in userspace, but it might
| | happen in-kernel in the future. When creating a Rust binding for struct
| | bkey, the "packed" attribute is needed to get a type with the correct
| | member offsets in the big endian case. However, rustc does not allow
| | types to have both a "packed" and "align" attribute. Thus, in order to
| | get a Rust type compatible with the C type, we must omit the "aligned"
| | attribute in C.
| |
| | This does not affect the struct's size or member offsets, only its
| | toplevel alignment, which should be an acceptable impact.
| |
| | The little endian version can have the "align" attribute because the
| | "packed" attr is redundant, and rust-bindgen will omit the "packed" attr
| | when an "align" attr is present and it can do so without changing a
| | type's layout.
| |
| | Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
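The claim that dropping "aligned" changes only the top-level alignment can be checked with a small C11 example using the GCC/Clang attribute syntax. The two structures below are hypothetical and are not struct bkey. Their packed size (16 bytes) is deliberately a multiple of 8, since otherwise aligned(8) would also pad the size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key layout; packed is what places b at offset 1. */
struct key_packed_aligned {
	uint8_t  a;
	uint64_t b;
	uint8_t  c[7];
} __attribute__((packed, aligned(8)));

/* Same members with only packed: the form a Rust binding could mirror. */
struct key_packed_only {
	uint8_t  a;
	uint64_t b;
	uint8_t  c[7];
} __attribute__((packed));

/* Layout facts, checked at compile time. */
static_assert(sizeof(struct key_packed_aligned) ==
	      sizeof(struct key_packed_only), "same size");
static_assert(offsetof(struct key_packed_aligned, b) ==
	      offsetof(struct key_packed_only, b), "same member offsets");
static_assert(_Alignof(struct key_packed_aligned) == 8,
	      "aligned(8) sets the top-level alignment");
static_assert(_Alignof(struct key_packed_only) == 1,
	      "packed alone lowers it to 1");
```

Size and member offsets are identical in both variants. Only the alignment the compiler assumes for a whole object differs, which matches the impact the commit message describes.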
| * bcachefs: bch2_trigger_alloc() handles state changes better  Kent Overstreet  2024-03-13  1  -8/+13
| | bch2_trigger_alloc() kicks off certain tasks on bucket state changes;
| | e.g. triggering the bucket discard worker and the invalidate worker.
| |
| | We've observed the discard worker running too often - most runs it
| | doesn't do any work, according to the tracepoint - so clearly, we're
| | kicking it off too often.
| |
| | This adds an explicit statechange() macro to make these checks more
| | precise.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
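The idea of firing only on the transition can be sketched as follows. The commit names a statechange() macro, but the three-argument form used here, the reduced struct bucket, the state enum and the predicate are assumptions made for illustration and do not reproduce the bcachefs definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced bucket state; the real alloc key has more fields. */
enum bucket_state { BUCKET_free, BUCKET_need_discard, BUCKET_dirty };

struct bucket {
	enum bucket_state state;
};

static bool needs_discard(const struct bucket *b)
{
	return b->state == BUCKET_need_discard;
}

/*
 * True only when the predicate is false for the old key and true for the
 * new one, i.e. on the transition itself rather than on every update of
 * a bucket that is already in that state.
 */
#define statechange(old, new, pred)	(!pred(old) && pred(new))

/* Decide whether an update should kick the (hypothetical) discard worker. */
bool should_kick_discard(const struct bucket *old, const struct bucket *new)
{
	return statechange(old, new, needs_discard);
}
```

Checking only the new state would queue the worker on every update of a bucket that already needs a discard. Comparing old against new limits it to the update that actually creates work.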
| * bcachefs: bch2_print_opts()  Kent Overstreet  2024-03-13  3  -6/+27
| | Make sure early error messages get redirected, for
| | kernel-fsck-from-userland.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Improve error messages in device remove path  Kent Overstreet  2024-03-13  1  -5/+5
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Use kvzalloc() when dynamically allocating btree paths  Kent Overstreet  2024-03-13  1  -2/+2
| | This silences a mm/page_alloc.c warning about allocating more than a
| | page with GFP_NOFAIL - and there's no reason for this to not have a
| | vmalloc fallback anyways.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Track iter->ip_allocated at bch2_trans_copy_iter()  Kent Overstreet  2024-03-13  1  -0/+3
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Save key_cache_path in peek_slot()  Kent Overstreet  2024-03-13  1  -0/+1
| | When bch2_btree_iter_peek_slot() clones the iterator to search for the
| | next key, and then discovers that the key from the cloned iterator is
| | the key we want to return - we also want to save the
| | iter->key_cache_path as well, for the update path.
| |
| | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>