| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
| |
| |
| |
| |
| |
| |
| | |
check_topology doesn't need the srcu lock and doesn't use normal btree
transactions - we can just drop the srcu lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
fsck_err() jumps to the fsck_err label when bailing out; need to make
sure bp_iter was initialized...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Zero-initialize part of allocated bounce buffer which wasn't touched by
subsequent bch2_key_sort_fix_overlapping to mitigate later uinit-value
use KMSAN bug[1].
After applying the patch reproducer still triggers stack overflow[2] but
it seems unrelated to the uninit-value use warning. After further
investigation it was found that stack overflow occurs because KMSAN adds
too many function calls[3]. Backtrace of where the stack magic number gets
smashed was added as a reply to syzkaller thread[3].
It was confirmed that task's stack magic number gets smashed after the code
path where KSMAN detects uninit-value use is executed, so it can be assumed
that it doesn't contribute in any way to uninit-value use detection.
[1] https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718
[2] https://lore.kernel.org/lkml/66e57e46.050a0220.115905.0002.GAE@google.com
[3] https://lore.kernel.org/all/rVaWgPULej8K7HqMPNIu8kVNyXNjjCiTB-QBtItLFBmk0alH6fV2tk4joVPk97Evnuv4ZRDd8HB5uDCkiFG6u81xKdzDj-KrtIMJSlF6Kt8=@proton.me
Reported-by: syzbot+6f655a60d3244d0c6718@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718
Fixes: ec4edd7b9d20 ("bcachefs: Prep work for variable size btree node buffers")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| | |
This fixes a kasan splat in propagate_key_to_snapshot_leaves() -
varint_decode_fast() does reads (that it never uses) up to 7 bytes past
the end of the integer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
Most or all errors will be autofix in the future, we're currently just
doing the ones that we know are well tested.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| | |
As we iterate we need to mark that we no longer need iterators -
otherwise we'll infinite loop via the "too many iters" check when
there's many snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| | |
if it doesn't get set we'll never be able to flush the btree write
buffer; this only happens in fake rw mode, but prevents us from shutting
down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Syzbot reports a problem that a warning is triggered due to suspicious
use of rcu_dereference_check(). That is triggered by a call of
bch2_snapshot_tree_oldest_subvol().
The cause of the warning is that inside
bch2_snapshot_tree_oldest_subvol(), snapshot_t() is called which calls
rcu_dereference() that requires a read lock to be held. Also, the call
of bch2_snapshot_tree_next() eventually calls snapshot_t().
To fix this, call rcu_read_lock() before calling snapshot_t(). Then,
release the lock after the termination of the while loop.
Reported-by: <syzbot+f7c41a878676b72c16a6@syzkaller.appspotmail.com>
Signed-off-by: Ahmed Ehab <bottaawesome633@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
no_llseek had been defined to NULL two years ago, in commit 868941b14441
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull bcachefs updates from Kent Overstreet:
- rcu_pending, btree key cache rework: this solves lock contenting in
the key cache, eliminating the biggest source of the srcu lock hold
time warnings, and drastically improving performance on some metadata
heavy workloads - on multithreaded creates we're now 3-4x faster than
xfs.
- We're now using an rhashtable instead of the system inode hash table;
this is another significant performance improvement on multithreaded
metadata workloads, eliminating more lock contention.
- for_each_btree_key_in_subvolume_upto(): new helper for iterating over
keys within a specific subvolume, eliminating a lot of open coded
"subvolume_get_snapshot()" and also fixing another source of srcu
lock time warnings, by running each loop iteration in its own
transaction (as the existing for_each_btree_key() does).
- More work on btree_trans locking asserts; we now assert that we don't
hold btree node locks when trans->locked is false, which is important
because we don't use lockdep for tracking individual btree node
locks.
- Some cleanups and improvements in the bset.c btree node lookup code,
from Alan.
- Rework of btree node pinning, which we use in backpointers fsck. The
old hacky implementation, where the shrinker just skipped over nodes
in the pinned range, was causing OOMs; instead we now use another
shrinker with a much higher seeks number for pinned nodes.
- Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue
where rebalance would sometimes fall back to allocating from the full
filesystem, which is not what we want when it's trying to move data
to a specific target.
- Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache
allocations.
- Idmap mounts are now supported (Hongbo Li)
- Rename whiteouts are now supported (Hongbo Li)
- Erasure coding can now handle devices being marked as failed, or
forcibly removed. We still need the evacuate path for erasure coding,
but it's getting very close to ready for people to start using.
* tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs: (99 commits)
bcachefs: return err ptr instead of null in read sb clean
bcachefs: Remove duplicated include in backpointers.c
bcachefs: Don't drop devices with stripe pointers
bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices
bcachefs: bch_fs.rw_devs_change_count
bcachefs: bch2_dev_remove_stripes()
bcachefs: bch2_trigger_ptr() calculates sectors even when no device
bcachefs: improve error messages in bch2_ec_read_extent()
bcachefs: improve error message on too few devices for ec
bcachefs: improve bch2_new_stripe_to_text()
bcachefs: ec_stripe_head.nr_created
bcachefs: bch_stripe.disk_label
bcachefs: stripe_to_mem()
bcachefs: EIO errcode cleanup
bcachefs: Rework btree node pinning
bcachefs: split up btree cache counters for live, freeable
bcachefs: btree cache counters should be size_t
bcachefs: Don't count "skipped access bit" as touched in btree cache scan
bcachefs: Failed devices no longer require mounting in degraded mode
bcachefs: bch2_dev_rcu_noerror()
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
syzbot reported a null-ptr-deref in bch2_fs_start. [0]
When a sb is marked clear but doesn't have a clean section
bch2_read_superblock_clean returns NULL which PTR_ERR_OR_ZERO
lets through, eventually leading to a null ptr dereference down
the line. Adjust read sb clean to return an ERR_PTR indicating the
invalid clean section.
[0] https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543
Reported-by: syzbot+1cecc37d87c4286e5543@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543
Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The header files bbpos.h is included twice in backpointers.c,
so one inclusion of each can be removed.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10783
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This factors out ec_strie_head_devs_update(), which initializes the
bitmap of devices we're allocating from, and runs it every time
c->rw_devs_change_count changes.
We also cancel pending, not allocated stripes, since they may refer to
devices that are no longer available.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Add a counter that's incremented whenever rw devices change; this will
be used for erasure coding so that it can keep ec_stripe_head in sync
and not deadlock on a new stripe when a device it wants goes away.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
We can now correctly force-remove a device that has stripes on it; this
uses the new BCH_SB_MEMBER_INVALID sentinal value.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
This is necessary for erasure coded pointers to devices that have been
removed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| | |
also print out the new stripe key
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| | |
additional debug stat
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When reshaping existing stripes, we should keep them on the same target
that they were allocated on; to do this, we need to add a field to the
btree stripe type.
This is a tad awkward, because we only have 8 bits left, and targets are
16 bits - but we only need to store a label, not a full target.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| | |
factor out a common helper
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
We want to be using private errcodes whenever possible, for better error
messages.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In backpointers fsck, we do a seqential scan of one btree, and check
references to another: extents <-> backpointers
Checking references generates random lookups, so we want to pin that
btree in memory (or only a range, if it doesn't fit in ram).
Previously, this was done with a simple check in the shrinker - "if
btree node is in range being pinned, don't free it" - but this generated
OOMs, as our shrinker wasn't well behaved if there was less memory
available than expected.
Instead, we now have two different shrinkers and lru lists; the second
shrinker being for pinned nodes, with seeks set much higher than normal
- so they can still be freed if necessary, but we'll prefer not to.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
this is prep for introducing a second live list and shrinker for pinned
nodes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
32 bits won't overflow any time soon, but size_t is the correct type for
counting objects in memory.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| | |
bch2_dev_rcu() now properly errors if the device is invalid
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
Factor out bch2_show_options() into a generic helper, for debugging
option passing issues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Fix the following compilation error:
```
fs/bcachefs/sb-members.c: In function ‘bch2_sb_member_alloc’:
fs/bcachefs/sb-members.c:508:2: error: a label can only be part of a statement and a declaration is not a statement
508 | unsigned nr_devices = max_t(unsigned, dev_idx + 1, c->sb.nr_devices);
```
Fixes: a7d364a133c7 ("bcachefs: bch2_sb_member_alloc()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| | |
refactoring
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| | |
No reason for it not to be where it's needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| | |
This adds mount options for specifying recovery passes to run, or
exclude; the immediate need for this is that backpointers fsck is having
trouble completing, so we need a way to skip it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When freeing in a shrinker callback, we need to notify memory reclaim,
so it knows forward progress has been made.
Normally this is done in e.g. slab code, but we're not freeing through
slab - or rather we are, but these allocations are big, and use the
kmalloc_large() path.
This is really a bug in the slub code, but we're working around it here
for now.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
This is needed for overlayfs, which is used by container managers.
Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| | |
this was an oversight: rebalance is moving data to a specific device, so
we don't want it falling back to the full filesystem
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| | |
rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't
allocate from the full filesystem - but we don't want spurious
allocation failures due to open buckets.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| | |
like the previous patch, kill use of bare arrays; the encryption code
likes to work in big batches, so this is a small performance
improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| | |
convert to darray, and add a fallback when allocation fails
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| | |
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add the __counted_by compiler attribute to the flexible array members
devs to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Increment nr_devs before adding a new device to the devs array and
adjust the array indexes accordingly. Add a helper macro for adding a
new device.
In bch2_journal_read(), explicitly set nr_devs to 0.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|