path: root/fs/bcachefs/super.c
Commit message (Author, Age, Files, Lines)
* bcachefs: Print shutdown journal sequence number (Kent Overstreet, 2024-04-04, 1 file, -0/+5)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Repair pass for scanning for btree nodes (Kent Overstreet, 2024-04-03, 1 file, -0/+3)
| | | | | | | | | | | | | | | If a btree root or interior btree node goes bad, we're going to lose a lot of data, unless we can recover the nodes that it pointed to by scanning. Fortunately btree node headers are fully self describing, and additionally the magic number is xored with the filesystem UUID, so we can do so safely. This implements the scanning - next patch will rework topology repair to make use of the found nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
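The safety argument in this message rests on the magic number being specific to one filesystem. A minimal userspace sketch of that idea follows; the base constant, the header layout, and the names `NODE_MAGIC_BASE`, `node_magic()` and `node_header_matches()` are all hypothetical and are not the bcachefs definitions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical base constant; the real bcachefs value is not shown in this log. */
#define NODE_MAGIC_BASE 0x0123456789abcdefULL

/* Hypothetical on-disk header: only the magic field matters for this sketch. */
struct node_header {
	uint64_t magic;
};

/* Derive a per-filesystem magic by xoring the base with the first 8 UUID bytes. */
uint64_t node_magic(const uint8_t uuid[16])
{
	uint64_t lo;

	memcpy(&lo, uuid, sizeof(lo));
	return lo ^ NODE_MAGIC_BASE;
}

/* A scanned block is a candidate only if its magic matches this filesystem's. */
bool node_header_matches(const struct node_header *h, const uint8_t uuid[16])
{
	return h->magic == node_magic(uuid);
}
```

Under these assumptions, a node written by a different filesystem on the same device carries a different magic and is skipped by the scan.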
* bcachefs: Improve -o norecovery; opts.recovery_pass_limit (Kent Overstreet, 2024-03-31, 1 file, -2/+3)
| | | | | | | | | | | | | | | | | | | | This adds opts.recovery_pass_limit, and redoes -o norecovery to make use of it; this fixes some issues with -o norecovery so it can be safely used for data recovery. Norecovery means "don't do journal replay"; it's an important data recovery tool when we're getting stuck in journal replay. When using it this way we need to make sure we don't free journal keys after startup, so we continue to overlay them: thus it needs to imply retain_recovery_info, as well as nochanges. recovery_pass_limit is an explicit option for telling recovery to exit after a specific recovery pass; this is a much cleaner way of implementing -o norecovery, as well as being a useful debug feature in its own right. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Ensure bch_sb_field_ext always exists (Kent Overstreet, 2024-03-31, 1 file, -0/+8)
| | | | | | | | | This makes bch_sb_field_ext more consistent with the rest of -o nochanges - we don't want to be varying other codepaths based on -o nochanges, since it's used for testing in dry run mode; also fixes some potential null ptr derefs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve bch2_fatal_error() (Kent Overstreet, 2024-03-18, 1 file, -0/+1)
| | | | | | error messages should always include __func__ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: fix for building in userspace (Kent Overstreet, 2024-03-17, 1 file, -16/+16)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: time_stats: split stats-with-quantiles into a separate structure (Darrick J. Wong, 2024-03-13, 1 file, -4/+4)
| | | | | | | | | | | | Currently, struct time_stats has the optional ability to quantize the information that it collects. This is /probably/ useful for callers who want to see quantized information, but it more than doubles the size of the structure from 224 bytes to 464. For users who don't care about that (e.g. upcoming xfs patches) and want to avoid wasting 240 bytes per counter, split the two into separate pieces. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_print_opts() (Kent Overstreet, 2024-03-13, 1 file, -0/+17)
| | | | | | | Make sure early error messages get redirected, for kernel-fsck-from-userland. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve error messages in device remove path (Kent Overstreet, 2024-03-13, 1 file, -5/+5)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: thread_with_stdio: convert to darray (Kent Overstreet, 2024-03-13, 1 file, -7/+2)
| | | | | | | | | | - eliminate the dependency on printbufs, so that we can lift thread_with_file for use in xfs - add a nonblocking parameter to stdio_redirect_printf(), and either block if the buffer is full or drop it on the floor - don't buffer infinitely Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: kill kvpmalloc() (Kent Overstreet, 2024-03-13, 1 file, -4/+4)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Workqueues should be WQ_HIGHPRI (Kent Overstreet, 2024-03-10, 1 file, -4/+4)
| | | | | | | | Most bcachefs workqueues are used for completions, and should be WQ_HIGHPRI - this helps reduce queuing delays, we want to complete quickly once we can no longer signal backpressure by blocking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: fix split brain message (Kent Overstreet, 2024-03-10, 1 file, -1/+1)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: no_splitbrain_check option (Kent Overstreet, 2024-03-10, 1 file, -8/+17)
| | | | | | | | This adds an option to disable kicking out devices when splitbrain is detected - it seems there are some issues with splitbrain detection and we're kicking out devices erroneously. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix null-ptr-deref in bch2_fs_alloc() (Li Zetao, 2024-03-10, 1 file, -3/+3)
| | | | | | | | | | | | | | | | | | | | | There is a null-ptr-deref issue reported by kasan: KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] Call Trace: <TASK> bch2_fs_alloc+0x1092/0x2170 [bcachefs] bch2_fs_open+0x683/0xe10 [bcachefs] ... When initializing the name of bch_fs, it needs to dynamically alloc memory to meet the length of the name. However, when name allocation failed, it will cause a null-ptr-deref access exception in subsequent string copy. Fix this issue by checking if name allocation is successful. Fixes: 401ec4db6308 ("bcachefs: Printbuf rework") Signed-off-by: Li Zetao <lizetao1@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
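The pattern behind this fix, checking an allocation before copying into it, can be sketched in userspace C. This is not the actual bch2_fs_alloc() code: `struct fs`, `fs_set_name()` and `failing_alloc()` are hypothetical names, and the allocator is passed in only so that the failure path can be exercised.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the filesystem object; only the name matters here. */
struct fs {
	char *name;
};

/*
 * Copy src into a freshly allocated c->name. The allocator is injected so
 * the failure path can be tested. Returns 0 on success, -ENOMEM on failure.
 */
int fs_set_name(struct fs *c, const char *src, void *(*alloc)(size_t))
{
	size_t len = strlen(src) + 1;

	c->name = alloc(len);
	if (!c->name)		/* bail out before the string copy */
		return -ENOMEM;

	memcpy(c->name, src, len);
	return 0;
}

/* Allocator that always fails, used to exercise the error path. */
void *failing_alloc(size_t n)
{
	(void) n;
	return NULL;
}
```

Calling `fs_set_name(&c, "sda1", malloc)` copies the name, while passing `failing_alloc` returns `-ENOMEM` without touching the NULL pointer.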
* bcachefs: Clamp replicas_required to replicas (Kent Overstreet, 2024-02-13, 1 file, -2/+2)
| | | | | | | This prevents going emergency read only when the user has specified replicas_required > replicas. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
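The clamp itself amounts to taking a minimum. A one-function sketch, where the name `clamp_replicas_required()` is hypothetical and not taken from the patch:

```c
/* Never require more replicas for a write than the filesystem is configured to keep. */
unsigned clamp_replicas_required(unsigned replicas_required, unsigned replicas)
{
	return replicas_required < replicas ? replicas_required : replicas;
}
```

With replicas_required = 3 and replicas = 2, the effective requirement becomes 2, so a write that reaches two devices no longer counts as a failure.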
* Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs (Linus Torvalds, 2024-01-21, 1 file, -3/+3)
|\
| |     Pull more bcachefs updates from Kent Overstreet:
| |     "Some fixes, some refactoring, some minor features:
| |     - Assorted prep work for disk space accounting rewrite
| |     - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this makes our trigger context more explicit
| |     - A few fixes to avoid excessive transaction restarts on multithreaded workloads: fstests (in addition to ktest tests) are now checking slowpath counters, and that's shaking out a few bugs
| |     - Assorted tracepoint improvements
| |     - Starting to break up bcachefs_format.h and move on disk types so they're with the code they belong to; this will make room to start documenting the on disk format better.
| |     - A few minor fixes"
| |     * tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
| |       bcachefs: Improve inode_to_text()
| |       bcachefs: logged_ops_format.h
| |       bcachefs: reflink_format.h
| |       bcachefs; extents_format.h
| |       bcachefs: ec_format.h
| |       bcachefs: subvolume_format.h
| |       bcachefs: snapshot_format.h
| |       bcachefs: alloc_background_format.h
| |       bcachefs: xattr_format.h
| |       bcachefs: dirent_format.h
| |       bcachefs: inode_format.h
| |       bcachefs; quota_format.h
| |       bcachefs: sb-counters_format.h
| |       bcachefs: counters.c -> sb-counters.c
| |       bcachefs: comment bch_subvolume
| |       bcachefs: bch_snapshot::btime
| |       bcachefs: add missing __GFP_NOWARN
| |       bcachefs: opts->compression can now also be applied in the background
| |       bcachefs: Prep work for variable size btree node buffers
| |       bcachefs: grab s_umount only if snapshotting
| |       ...
| * bcachefs: counters.c -> sb-counters.c (Kent Overstreet, 2024-01-21, 1 file, -1/+1)
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Prep work for variable size btree node buffers (Kent Overstreet, 2024-01-21, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bcachefs btree nodes are big - typically 256k - and btree roots are pinned in memory. As we're now up to 18 btrees, we now have significant memory overhead in mostly empty btree roots. And in the future we're going to start enforcing that certain btree node boundaries exist, to solve lock contention issues - analogous to XFS's AGIs. Thus, we need to start allocating smaller btree node buffers when we can. This patch changes code that refers to the filesystem constant c->opts.btree_node_size to refer to the btree node buffer size - btree_buf_bytes() - where appropriate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: helpers for printing data types (Kent Overstreet, 2024-01-21, 1 file, -1/+1)
| | | | | | | | | | | | We need bounds checking since new versions may introduce new data types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Replace strlcpy() with strscpy() (Kees Cook, 2024-01-18, 1 file, -2/+2)
|/ | | | | | | | | | | | | | | | | | | | strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated[1]. Additionally, it returns the size of the source string, not the resulting size of the destination string. In an effort to remove strlcpy() completely[2], replace strlcpy() here with strscpy(). Nothing checks the return value here, so a direct replacement with strscpy() is possible. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1] Link: https://github.com/KSPP/linux/issues/89 [2] Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Brian Foster <bfoster@redhat.com> Cc: <linux-bcachefs@vger.kernel.org> Link: https://lore.kernel.org/r/20240110235438.work.385-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
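The behavioural difference described in this message can be illustrated with a userspace approximation. `sketch_strscpy()` below is a hypothetical reimplementation written for this note, not the kernel's strscpy(). It mirrors the documented contract only: it reads at most `size` bytes of the source, always NUL-terminates a non-empty destination, and returns the number of characters copied, or -E2BIG on truncation.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Userspace approximation of the strscpy() contract; not the kernel implementation. */
ssize_t sketch_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;

	/* Copy until the source ends or only the terminator slot is left. */
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* Truncated if we stopped because dst filled up, not because src ended. */
	return src[i] == '\0' ? (ssize_t) i : -E2BIG;
}
```

Unlike strlcpy(), this never walks past the first `size` bytes of the source, which is why a missing NUL terminator in the source cannot cause an unbounded read.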
* bcachefs: %pg is banished (Kent Overstreet, 2024-01-05, 1 file, -13/+36)
| | | | | | not portable to userspace Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: increase max_active on io_complete_wq (Kent Overstreet, 2024-01-05, 1 file, -1/+1)
| | | | | | | | this definitely should _not_ be 1, and we don't actually want any concurrency limiting at all here - btree node read completions are getting blocked behind btree node write submissions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: factor out thread_with_file, thread_with_stdio (Kent Overstreet, 2024-01-05, 1 file, -11/+8)
| | | | | | | thread_with_stdio now knows how to handle input - fsck can now prompt to fix errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Split brain detection (Kent Overstreet, 2024-01-05, 1 file, -11/+64)
| | | | | | | Use the new bch_member->seq, sb->write_time fields to detect split brain and kick out devices when necessary. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix nochanges/read_only interaction (Kent Overstreet, 2024-01-05, 1 file, -11/+12)
| | | | | | | | | | | | | nochanges means "we cannot issue writes at all"; it's possible to go into a pseudo read-write mode where we pin dirty metadata in memory, which is used for fsck in dry run mode and doing journal replay on a read only mount, but we do not want to allow an actual read-write mount in nochanges mode. But we do always want to allow early read-write, during recovery - this patch clarifies that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: for_each_member_device_rcu() now declares loop iter (Kent Overstreet, 2024-01-01, 1 file, -16/+9)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: for_each_member_device() now declares loop iter (Kent Overstreet, 2024-01-01, 1 file, -23/+12)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: add more verbose logging (Kent Overstreet, 2024-01-01, 1 file, -4/+5)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: darray_for_each() now declares loop iter (Kent Overstreet, 2024-01-01, 1 file, -1/+1)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch_err_(fn|msg) check if should print (Kent Overstreet, 2024-01-01, 1 file, -48/+31)
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: btree write buffer now slurps keys from journal (Kent Overstreet, 2024-01-01, 1 file, -1/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the transaction commit path would have to add keys to the btree write buffer as a separate operation, requiring additional global synchronization. This patch introduces a new journal entry type, which indicates that the keys need to be copied into the btree write buffer prior to being written out. We switch the journal entry type back to JSET_ENTRY_btree_keys prior to write, so this is not an on disk format change. Flushing the btree write buffer may require pulling keys out of journal entries yet to be written, and quiescing outstanding journal reservations; we previously added journal->buf_lock for synchronization with the journal write path. We also can't put strict bounds on the number of keys in the journal destined for the write buffer, which means we might overflow the size of the preallocated buffer and have to reallocate - this introduces a potentially fatal memory allocation failure. This is something we'll have to watch for; if it becomes an issue in practice we can do additional mitigation. The transaction commit path no longer has to explicitly check if the write buffer is full and wait on flushing; this is another performance optimization. Instead, when the btree write buffer is close to full we change the journal watermark, so that only reservations for journal reclaim are allowed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: BCH_IOCTL_FSCK_ONLINE (Kent Overstreet, 2024-01-01, 1 file, -0/+1)
| | | | | | | | | | | | | | This adds a new ioctl for running fsck on a mounted, in use filesystem. This reuses the fsck_thread code from the previous patch for running fsck on an offline, unmounted filesystem, so that log messages for the fsck thread are redirected to userspace. Only one running fsck instance is allowed at a time; a new semaphore (since the lock will be taken by one thread and released by another) is added for this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Add ability to redirect log output (Kent Overstreet, 2024-01-01, 1 file, -0/+28)
| | | | | | | | | | | | | | | | | | | | Upcoming patches are going to add two new ioctls for running fsck in the kernel, but pretending that we're running our normal userspace fsck. This patch adds some plumbing for redirecting our normal log messages away from the dmesg log to a thread_with_file file descriptor - via a struct log_output, which will be consumed by the fsck f_op's read method. The new ioctls will allow for running fsck in the kernel against an offline filesystem (without mounting it), and an online filesystem. For an offline filesystem we need a way to pass in a pointer to the log_output, which is done via a new hidden opts.h option. For online fsck, we can set c->output directly, but only want to redirect log messages from the thread running fsck - hence the new c->output_filter method. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: c->ro_ref (Kent Overstreet, 2024-01-01, 1 file, -0/+6)
| | | | | | | | | Add a new refcount for async ops that don't necessarily need the fs to be RW, with similar lifetime/rules otherwise as c->writes. To be used by online fsck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: convert bch_fs_flags to x-macro (Kent Overstreet, 2024-01-01, 1 file, -28/+35)
| | | | | | | Now we can print out filesystem flags in sysfs, useful for debugging various "what's my filesystem doing" issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
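For readers unfamiliar with the x-macro technique, a minimal sketch follows. The flag list here (`started`, `rw`, `error`) and the names `FS_FLAGS()`, `fs_flag_strs` and `fs_flag_name()` are made up for illustration; the real bch_fs_flags entries are not reproduced in this log.

```c
#include <stddef.h>

/* Hypothetical flag list: a single source of truth, expanded twice below. */
#define FS_FLAGS()		\
	x(started)		\
	x(rw)			\
	x(error)

/* First expansion: one enum constant per flag. */
enum fs_flags {
#define x(n)	FS_FLAG_##n,
	FS_FLAGS()
#undef x
	FS_FLAG_NR
};

/* Second expansion: a matching table of printable names. */
static const char * const fs_flag_strs[] = {
#define x(n)	#n,
	FS_FLAGS()
#undef x
	NULL
};

/* Bounds-checked lookup, so an out-of-range value cannot index past the table. */
const char *fs_flag_name(unsigned flag)
{
	return flag < FS_FLAG_NR ? fs_flag_strs[flag] : "(unknown)";
}
```

Because the enum and the string table come from the same list, adding a flag in one place keeps both in sync, which is what makes printing the flags (for example in sysfs) cheap to maintain.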
* bcachefs: track_event_change() (Kent Overstreet, 2024-01-01, 1 file, -1/+2)
| | | | | | | | | | | | | This introduces a new helper for connecting time_stats to state changes, i.e. when taking journal reservations is blocked for some reason. We use this to track separately the different reasons the journal might be blocked - i.e. space in the journal full, or the journal pin fifo full. Also do some cleanup and improvements on the time stats code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Add extra verbose logging for ro path (Kent Overstreet, 2024-01-01, 1 file, -2/+12)
| | | | | | | Also log time waiting for c->writes references to be dropped; this will help in debugging why unmounts are taking longer than they should. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: improve modprobe support by providing softdeps (Daniel Hill, 2023-12-14, 1 file, -0/+6)
| | | | | | | | | We need to help modprobe load architecture specific modules so we don't fall back to generic software implementations, this should help performance when building as a module. Signed-off-by: Daniel Hill <daniel@gluo.nz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: fix invalid memory access in bch2_fs_alloc() error path (Thomas Bertschinger, 2023-12-14, 1 file, -0/+1)
| | | | | | | | | | When bch2_fs_alloc() gets an error before calling bch2_fs_btree_iter_init(), bch2_fs_btree_iter_exit() makes an invalid memory access because btree_trans_list is uninitialized. Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Fixes: 6bd68ec266ad ("bcachefs: Heap allocate btree_trans") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Proper refcounting for journal_keys (Kent Overstreet, 2023-11-24, 1 file, -2/+4)
| | | | | | | | | The btree iterator code overlays keys from the journal until journal replay is finished; since we're now starting copygc/rebalance etc. before replay is finished, this is multithreaded access and thus needs refcounting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Start gc, copygc, rebalance threads after initing writes ref (Kent Overstreet, 2023-11-24, 1 file, -12/+16)
| | | | | | | | | | | | This fixes a bug where copygc would occasionally race with going read-write and die, thinking we were read only, because it couldn't take a ref on c->writes. It's not necessary for copygc (or rebalance) to take write refs; they could run with BCH_TRANS_COMMIT_nocheck_rw, but this is an easier fix than making sure that flag is passed correctly everywhere. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Convert bch2_fs_open() to darray (Kent Overstreet, 2023-11-05, 1 file, -32/+28)
| | | | | | Open coded dynamic arrays are deprecated. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch_sb_field_errors (Kent Overstreet, 2023-11-01, 1 file, -4/+8)
| | | | | | | | | | Add a new superblock section to keep counts of errors seen since filesystem creation: we'll be adding counters for every distinct fsck error. The new superblock section has entries of the form [ id, count, time_of_last_error ]; this is intended to let us see what errors are occurring - and getting fixed - via show-super output. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
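The [ id, count, time_of_last_error ] entries described here can be pictured as a small table keyed by error id. The sketch below is an in-memory illustration only: the field widths, the fixed capacity, and the names `struct err_entry`, `struct err_table` and `err_table_record()` are assumptions, not the on-disk superblock layout.

```c
#include <stddef.h>
#include <stdint.h>

#define ERR_TABLE_MAX 8	/* arbitrary capacity for this sketch */

/* One entry per distinct error id (hypothetical field widths). */
struct err_entry {
	uint16_t id;
	uint64_t count;
	uint64_t last_error_time;
};

struct err_table {
	struct err_entry entries[ERR_TABLE_MAX];
	size_t nr;
};

/* Record one occurrence of error id at time now; returns 0, or -1 if the table is full. */
int err_table_record(struct err_table *t, uint16_t id, uint64_t now)
{
	size_t i;

	/* Existing entry: bump the count and refresh the timestamp. */
	for (i = 0; i < t->nr; i++)
		if (t->entries[i].id == id) {
			t->entries[i].count++;
			t->entries[i].last_error_time = now;
			return 0;
		}

	if (t->nr == ERR_TABLE_MAX)
		return -1;

	/* First occurrence of this id: append a new entry. */
	t->entries[t->nr++] = (struct err_entry) {
		.id = id, .count = 1, .last_error_time = now,
	};
	return 0;
}
```

Keeping the time of the last occurrence alongside the count is what lets a reader of the table tell an error that is still happening from one that stopped after a repair.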
* bcachefs: Add IO error counts to bch_member (Kent Overstreet, 2023-11-01, 1 file, -0/+5)
| | | | | | | | | We now track IO errors per device since filesystem creation. IO error counts can be viewed in sysfs, or with the 'bcachefs show-super' command. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix a kasan splat in bch2_dev_add() (Kent Overstreet, 2023-11-01, 1 file, -10/+2)
| | | | | | | | | | | This fixes a use after free - mi is dangling after the resize call. Additionally, resizing the device's member info section was useless - we were attempting to preallocate the space required before adding it to the filesystem superblock, but there's other sections that we should have been preallocating as well for that to work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_disk_path_to_text() no longer takes sb_lock (Kent Overstreet, 2023-10-31, 1 file, -1/+1)
| | | | | | | | | | | | | We're going to be using bch2_target_to_text() -> bch2_disk_path_to_text() from bch2_bkey_ptrs_to_text() and bch2_bkey_ptrs_invalid(), which can be called in any context. This patch adds the actual label to bch_disk_group_cpu so that it can be used by bch2_disk_path_to_text, and splits out bch2_disk_path_to_text() into two variants - like the previous patch, one for when we have a running filesystem and another for when we only have a superblock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Ensure devices are always correctly initialized (Kent Overstreet, 2023-10-31, 1 file, -13/+17)
| | | | | | | | | | | We can't mark device superblocks or allocate journal on a device that isn't online. That means we may need to do this on every mount, because we may have formatted a new filesystem and then done the first mount (bch2_fs_initialize()) in degraded mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Delete duplicate time stats initialization (Kent Overstreet, 2023-10-31, 1 file, -6/+0)
| | | | | | | This code duplicated initialization already done in bch2_fs_btree_iter_init(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: snapshot_create_lock (Kent Overstreet, 2023-10-22, 1 file, -0/+1)
| | | | | | | Add a new lock for snapshot creation - this addresses a few races with logged operations and snapshot deletion. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>