summaryrefslogtreecommitdiffstats
path: root/fs/xfs
Commit message (Collapse)AuthorAgeFilesLines
...
| | * xfs: xattr scrub should ensure one namespace bit per nameDarrick J. Wong2023-04-111-1/+7
| | | | | | | | | | | | | | | | | | | | | Check that each extended attribute exists in only one namespace. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'scrub-detect-mergeable-records-6.4_2023-04-11' of ↵Dave Chinner2023-04-143-3/+196
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: detect mergeable and overlapping btree records [v24.5] While I was doing differential fuzz analysis between xfs_scrub and xfs_repair, I noticed that xfs_repair was only partially effective at detecting btree records that can be merged, and xfs_scrub totally didn't notice at all. For every interval btree type except for the bmbt, there should never exist two adjacent records with adjacent keyspaces because the blockcount field is always large enough to span the entire keyspace of the domain. This is because the free space, rmap, and refcount btrees have a blockcount field large enough to store the maximum AG length, and there can never be an allocation larger than an AG. The bmbt is a different story due to its ondisk encoding where the blockcount is only 21 bits wide. Because AGs can span up to 2^31 blocks and the RT volume can span up to 2^52 blocks, a preallocation of 2^22 blocks will be expressed as two records of 2^21 length. We don't opportunistically combine records when doing bmbt operations, which is why the fsck tools have never complained about this scenario. Offline repair is partially effective at detecting mergeable records because I taught it to do that for the rmap and refcount btrees. This series enhances the free space, rmap, and refcount scrubbers to detect mergeable records. For the bmbt, it will flag the file as being eligible for an optimization to shrink the size of the data structure. The last patch in this set also enhances the rmap scrubber to detect records that overlap incorrectly. This check is done automatically for non-overlapping btree types, but we have to do it separately for the rmapbt because there are constraints on which allocation types are allowed to overlap. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: check for reverse mapping records that could be mergedDarrick J. Wong2023-04-111-0/+52
| | | | | | | | | | | | | | | | | | | | | Enhance the rmap scrubber to flag adjacent records that could be merged. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: check overlapping rmap btree recordsDarrick J. Wong2023-04-111-2/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rmap btree scrubber doesn't contain sufficient checking for records that cannot overlap but do anyway. For the other btrees, this is enforced by the inorder checks in xchk_btree_rec, but the rmap btree is special because it allows overlapping records to handle shared data extents. Therefore, enhance the rmap btree record check function to compare each record against the previous one so that we can detect overlapping rmap records for space allocations that do not allow sharing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: flag refcount btree records that could be mergedDarrick J. Wong2023-04-111-0/+44
| | | | | | | | | | | | | | | | | | | | | Complain if we encounter refcount btree records that could be merged. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: flag free space btree records that could be mergedDarrick J. Wong2023-04-111-1/+28
| | | | | | | | | | | | | | | | | | | | | Complain if we encounter free space btree records that could be merged. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'scrub-merge-bmap-records-6.4_2023-04-12' of ↵Dave Chinner2023-04-142-136/+239
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: merge bmap records for faster scrubs [v24.5] I started looking into performance problems with the data fork scrubber in generic/333, and noticed a few things that needed improving. First, due to design reasons, it's possible for file forks btrees to contain multiple contiguous mappings to the same physical space. Instead of checking each ondisk mapping individually, it's much faster to combine them when possible and check the combined mapping because that's fewer trips through the rmap btree, and we can drop this check-around behavior that it does when an rmapbt lookup produces a record that starts before or ends after a particular bmbt mapping. Second, I noticed that the bmbt scrubber decides to walk every reverse mapping in the filesystem if the file fork is in btree format. This is very costly, and only necessary if the inode repair code had to zap a fork to convince iget to work. Constraining the full-rmap scan to this one case means we can skip it for normal files, which drives the runtime of this test from 8 hours down to 45 minutes (observed with realtime reflink and rebuild-all mode.) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: don't call xchk_bmap_check_rmaps for btree-format file forksDarrick J. Wong2023-04-111-16/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The logic at the end of xchk_bmap_want_check_rmaps tries to detect a file fork that has been zapped by what will become the online inode repair code. Zapped forks are in FMT_EXTENTS with zero extents, and some sort of hint that there's supposed to be data somewhere in the filesystem. Unfortunately, the inverted logic here is confusing and has the effect that we always call xchk_bmap_check_rmaps for FMT_BTREE forks. This is horribly inefficient and unnecessary, so invert the logic to get rid of this performance problem. This has caused 8h delays in generic/333 and generic/334. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: split the xchk_bmap_check_rmaps into a predicateDarrick J. Wong2023-04-111-22/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function has two parts: the second part scans every reverse mapping record for this file fork to make sure that there's a corresponding mapping in the fork, and the first part decides if we even want to do that. Split the first part into a separate predicate so that we can make more changes to it in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: alert the user about data/attr fork mappings that could be mergedDarrick J. Wong2023-04-111-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the data or attr forks have mappings that could be merged, let the user know that the structure could be optimized. This isn't a filesystem corruption since the regular filesystem does not try to be smart about merging bmbt records. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: split xchk_bmap_xref_rmap into two functionsDarrick J. Wong2023-04-111-40/+76
| | | | | | | | | | | | | | | | | | | | | | | | There's more special-cased functionality than not in this function. Split it into two so that each can be far more cohesive. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: accumulate iextent records when checking bmapDarrick J. Wong2023-04-112-78/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the bmap scrubber checks file fork mappings individually. In the case that the file uses multiple mappings to a single contiguous piece of space, the scrubber repeatedly locks the AG to check the existence of a reverse mapping that overlaps this file mapping. If the reverse mapping starts before or ends after the mapping we're checking, it will also crawl around in the bmbt checking correspondence for adjacent extents. This is not very time efficient because it does the crawling while holding the AGF buffer, and checks the middle mappings multiple times. Instead, create a custom iextent record iterator function that combines multiple adjacent allocated mappings into one large incore bmbt record. This is feasible because the incore bmbt record length is 64-bits wide. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: change bmap scrubber to store the previous mappingDarrick J. Wong2023-04-111-5/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert the inode data/attr/cow fork scrubber to remember the entire previous mapping, not just the next expected offset. No behavior changes here, but this will enable some better checking in subsequent patches. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'scrub-iget-fixes-6.4_2023-04-12' of ↵Dave Chinner2023-04-149-101/+438
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix iget/irele usage in online fsck [v24.5] This patchset fixes a handful of problems relating to how we get and release incore inodes in the online scrub code. The first patch fixes how we handle DONTCACHE -- our reasons for setting (or clearing it) depend entirely on the runtime environment at irele time. Hence we can refactor iget and irele to use our own wrappers that set that context appropriately. The second patch fixes a race between the iget call in the inode core scrubber and other writer threads that are allocating or freeing inodes in the same AG by changing the behavior of xchk_iget (and the inode core scrub setup function) to return either an incore inode or the AGI buffer so that we can be sure that the inode cannot disappear on us. The final patch elides MMAPLOCK from scrub paths when possible. It did not fit anywhere else. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: don't take the MMAPLOCK when scrubbing file metadataDarrick J. Wong2023-04-113-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The MMAPLOCK stabilizes mappings in a file's pagecache. Therefore, we do not need it to check directories, symlinks, extended attributes, or file-based metadata. Reduce its usage to the one case that requires it, which is when we want to scrub the data fork of a regular file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: retain the AGI when we can't iget an inode to scrub the coreDarrick J. Wong2023-04-113-24/+156
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xchk_get_inode is not quite the right function to be calling from the inode scrubber setup function. The common get_inode function either gets an inode and installs it in the scrub context, or it returns an error code explaining what happened. This is acceptable for most file scrubbers because it is not in their scope to fix corruptions in the inode core and fork areas that cause iget to fail. Dealing with these problems is within the scope of the inode scrubber, however. If iget fails with EFSCORRUPTED, we need to xchk_inode to flag that as corruption. Since we can't get our hands on an incore inode, we need to hold the AGI to prevent inode allocation activity so that nothing changes in the inode metadata. Looking ahead to the inode core repair patches, we will also need to hold the AGI buffer into xrep_inode so that we can make modifications to the xfs_dinode structure without any other thread swooping in to allocate or free the inode. Adapt the xchk_get_inode into xchk_setup_inode since this is a one-off use case where the error codes we check for are a little different, and the return state is much different from the common function. xchk_setup_inode prepares to check or repair an inode record, so it must continue the scrub operation even if the inode/inobt verifiers cause xfs_iget to return EFSCORRUPTED. This is done by attaching the locked AGI buffer to the scrub transaction and returning 0 to move on to the actual scrub. (Later, the online inode repair code will also want the xfs_imap structure so that it can reset the ondisk xfs_dinode structure.) xchk_get_inode retrieves an inode on behalf of a scrubber that operates on an incore inode -- data/attr/cow forks, directories, xattrs, symlinks, parent pointers, etc. If the inode/inobt verifiers fail and xfs_iget returns EFSCORRUPTED, we want to exit to userspace (because the caller should be fix the inode first) and drop everything we acquired along the way. A behavior common to both functions is that it's possible that xfs_scrub asked for a scrub-by-handle concurrent with the inode being freed or the passed-in inumber is invalid. In this case, we call xfs_imap to see if the inobt index thinks the inode is allocated, and return ENOENT ("nothing to check here") to userspace if this is not the case. The imap lookup is why both functions call xchk_iget_agi. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: rename xchk_get_inode -> xchk_iget_for_scrubbingDarrick J. Wong2023-04-114-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave Chinner suggested renaming this function to make more obvious what it does. The function returns an incore inode to callers that want to scrub a metadata structure that hangs off an inode. If the iget fails with EINVAL, it will single-step the loading process to distinguish between actually free inodes or impossible inumbers (ENOENT); discrepancies between the inobt freemask and the free status in the inode record (EFSCORRUPTED). Any other negative errno is returned unchanged. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: fix an inode lookup race in xchk_get_inodeDarrick J. Wong2023-04-114-46/+205
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit d658e, we tried to improve the robustnes of xchk_get_inode in the face of EINVAL returns from iget by calling xfs_imap to see if the inobt itself thinks that the inode is allocated. Unfortunately, that commit didn't consider the possibility that the inode gets allocated after iget but before imap. In this case, the imap call will succeed, but we turn that into a corruption error and tell userspace the inode is corrupt. Avoid this false corruption report by grabbing the AGI header and retrying the iget before calling imap. If the iget succeeds, we can proceed with the usual scrub-by-handle code. Fix all the incorrect comments too, since unreadable/corrupt inodes no longer result in EINVAL returns. Fixes: d658e72b4a09 ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: manage inode DONTCACHE status at irele timeDarrick J. Wong2023-04-115-24/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now, there are statements scattered all over the online fsck codebase about how we can't use XFS_IGET_DONTCACHE because of concerns about scrub's unusual practice of releasing inodes with transactions held. However, iget is the wrong place to handle this -- the DONTCACHE state doesn't matter at all until we try to *release* the inode, and here we get things wrong in multiple ways: First, if we /do/ have a transaction, we must NOT drop the inode, because the inode could have dirty pages, dropping the inode will trigger writeback, and writeback can trigger a nested transaction. Second, if the inode already had an active reference and the DONTCACHE flag set, the icache hit when scrub grabs another ref will not clear DONTCACHE. This is sort of by design, since DONTCACHE is now used to initiate cache drops so that sysadmins can change a file's access mode between pagecache and DAX. Third, if we do actually have the last active reference to the inode, we can set DONTCACHE to avoid polluting the cache. This is the /one/ case where we actually want that flag. Create an xchk_irele helper to encode all that logic and switch the online fsck code to use it. Since this now means that nearly all scrubbers use the same xfs_iget flags, we can wrap them too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'scrub-parent-fixes-6.4_2023-04-12' of ↵Dave Chinner2023-04-143-168/+71
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix bugs in parent pointer checking [v24.5] Jan Kara pointed out that the VFS doesn't take i_rwsem of a child subdirectory that is being moved from one parent to another. Upon deeper analysis, I realized that this was the source of a very hard to trigger false corruption report in the parent pointer checking code. Now that we've refactored how directory walks work in scrub, we can also get rid of all the unnecessary and broken locking to make parent pointer scrubbing work properly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: fix parent pointer scrub racing with subdirectory reparentingDarrick J. Wong2023-04-113-84/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jan Kara pointed out that rename() doesn't lock a subdirectory that is being moved from one parent to another, even though the move requires an update to the subdirectory's dotdot entry. This means that it's *not* sufficient to hold a directory's IOLOCK to stabilize the dotdot entry. We must hold the ILOCK of both the child and the alleged parent, and there's no use in holding the parent's IOLOCK. With that in mind, we can get rid of all the messy code that tries to grab the parent's IOLOCK, which means we don't need to let go of the ILOCK of the directory whose parent we are checking. We still have to use nonblocking mode to take the ILOCK of the alleged parent, so the revalidation loop has to stay. However, we can remove the retry counter, since threads aren't supposed to hold the ILOCK for long periods of time. Remove the inverted ilock helper from the common code since nobody uses it. Remove the entire source of -EDEADLOCK-based "retry harder" scrub executions. Link: https://lore.kernel.org/linux-xfs/20230117123735.un7wbamlbdihninm@quack3/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: simplify xchk_parent_validateDarrick J. Wong2023-04-111-77/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function is unnecessarily long because it contains code to revalidate a dotdot entry after cycling locks to try to confirm a subdirectory parent pointer. Shorten the codebase by making the parent's lookup call do double duty as the revalidation code. This weakeans the efficacy of this scrub function temporarily, but the next patch will resolve this as part of fixing an unhandled race that is the result of the VFS rename locking model not working the way Darrick thought it did. Rename this stupid 'dnum' variable too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: remove xchk_parent_count_parent_dentriesDarrick J. Wong2023-04-111-29/+13
| | | | | | | | | | | | | | | | | | | | | This helper is now trivial, so get rid of it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'scrub-dir-iget-fixes-6.4_2023-04-12' of ↵Dave Chinner2023-04-145-217/+497
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix iget usage in directory scrub [v24.5] In this series, we fix some problems with how the directory scrubber grabs child inodes. First, we want to reduce EDEADLOCK returns by replacing fixed-iteration loops with interruptible trylock loops. Second, we add UNTRUSTED to the child iget call so that we can detect a dirent that points to an unallocated inode. Third, we fix a bug where we weren't checking the inode pointed to by dotdot entries at all. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: always check the existence of a dirent's child inodeDarrick J. Wong2023-04-111-45/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we're scrubbing directory entries, we always need to iget the child inode to make sure that the inode pointer points to a valid inode. The original directory scrub code (commit a5c4) only set us up to do this for ftype=1 filesystems, which is not sufficient; and then commit 4b80 made it worse by exempting the dot and dotdot entries. Sorta-fixes: a5c46e5e8912 ("xfs: scrub directory metadata") Sorta-fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: xfs_iget in the directory scrubber needs to use UNTRUSTEDDarrick J. Wong2023-04-111-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 4b80ac64450f, we tried to strengthen the directory scrubber by using the iget call to detect directory entries that point to unallocated inodes. Unfortunately, that commit neglected to pass XFS_IGET_UNTRUSTED to xfs_iget, so we don't check the inode btree first. If the inode number points to something that isn't even an inode cluster, iget will throw corruption errors and return -EFSCORRUPTED, which means that we fail to mark the directory corrupt. Fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: streamline the directory iteration code for scrubDarrick J. Wong2023-04-115-183/+473
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, online scrub reuses the xfs_readdir code to walk every entry in a directory. This isn't awesome for performance, since we end up cycling the directory ILOCK needlessly and coding around the particular quirks of the VFS dir_context interface. Create a streamlined version of readdir that keeps the ILOCK (since the walk function isn't going to copy stuff to userspace), skips a whole lot of directory walk cursor checks (since we start at 0 and walk to the end) and has a sane way to return error codes. Note: Porting the dotdot checking code is left for a subsequent patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: use the directory name hash function for dir scrubbingDarrick J. Wong2023-04-111-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The directory code has a directory-specific hash computation function that includes a modified hash function for case-insensitive lookups. Hence we must use that function (and not the raw da_hashname) when checking the dabtree structure. Found by accidentally breaking xfs/188 to create an abnormally huge case-insensitive directory and watching scrub break. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'scrub-detect-rmapbt-gaps-6.4_2023-04-11' of ↵Dave Chinner2023-04-149-94/+195
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: detect incorrect gaps in rmap btree [v24.5] Following in the theme of the last two patchsets, this one strengthens the rmap btree record checking so that scrub can count the number of space records that map to a given owner and that do not map to a given owner. This enables us to determine exclusive ownership of space that can't be shared. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: ensure that single-owner file blocks are not owned by othersDarrick J. Wong2023-04-111-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For any file fork mapping that can only have a single owner, make sure that there are no other rmap owners for that mapping. This patch requires the more detailed checking provided by xfs_rmap_count_owners so that we can know how many rmap records for a given range of space had a matching owner, how many had a non-matching owner, and how many conflicted with the records that have a matching owner. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: teach scrub to check for sole ownership of metadata objectsDarrick J. Wong2023-04-118-93/+182
| | | | | | | | | | | | | | | | | | | | | | | | Strengthen online scrub's checking even further by enabling us to check that a range of blocks are owned solely by a given owner. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of ↵Dave Chinner2023-04-143-88/+269
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: detect incorrect gaps in inode btree [v24.5] This series continues the corrections for a couple of problems I found in the inode btree scrubber. The first problem is that we don't directly check the inobt records have a direct correspondence with the finobt records, and vice versa. The second problem occurs on filesystems with sparse inode chunks -- the cross-referencing we do detects sparseness, but it doesn't actually check the consistency between the inobt hole records and the rmap data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan resultsDarrick J. Wong2023-04-113-42/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill scan results because for a given range of inode numbers, we might have no indexed inodes at all; the entire region might be allocated ondisk inodes; or there might be a mix of the two. Unfortunately, sparse inodes adds to the complexity, because each inode record can have holes, which means that we cannot use the generic btree _scan_keyfill function because we must look for holes in individual records to decide the result. On the plus side, online fsck can now detect sub-chunk discrepancies in the inobt. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: directly cross-reference the inode btrees with each otherDarrick J. Wong2023-04-111-27/+198
| | | | | | | | | | | | | | | | | | | | | | | | Improve the cross-referencing of the two inode btrees by directly checking the free and hole state of each inode with the other btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: clean up broken eearly-exit code in the inode btree scrubberDarrick J. Wong2023-04-111-25/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | Corrupt inode chunks should cause us to exit early after setting the CORRUPT flag on the scrub state. While we're at it, collapse trivial helpers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: remove pointless shadow variable from xfs_difree_inobtDarrick J. Wong2023-04-111-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | In xfs_difree_inobt, the pag passed in was previously used to look up the AGI buffer. There's no need to extract it again, so remove the shadow variable and shut up -Wshadow. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of ↵Dave Chinner2023-04-1423-129/+610
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: detect incorrect gaps in refcount btree [v24.5] The next few patchsets address a deficiency in scrub that I found while QAing the refcount btree scrubber. If there's a gap between refcount records, we need to cross-reference that gap with the reverse mappings to ensure that there are no overlapping records in the rmap btree. If we find any, then the refcount btree is not consistent. This is not a property that is specific to the refcount btree; they all need to have this sort of keyspace scanning logic to detect inconsistencies. To do this accurately, we need to be able to scan the keyspace of a btree (which we already do) to be able to tell the caller if the keyspace is empty, sparse, or fully covered by records. The first few patches add the keyspace scanner to the generic btree code, along with the ability to mask off parts of btree keys because when we scan the rmapbt, we only care about space usage, not the owners. The final patch closes the scanning gap in the refcountbt scanner. v23.1: create helpers for the key extraction and comparison functions, improve documentation, and eliminate the ->mask_key indirect calls Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: ensure that all metadata and data blocks are not cow staging extentsDarrick J. Wong2023-04-117-4/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure that all filesystem metadata blocks and file data blocks are not also marked as CoW staging extents. The extra checking added here was inspired by an actual VM host filesystem corruption incident due to bugs in the CoW handling of 4.x kernels. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: check the reference counts of gaps in the refcount btreeDarrick J. Wong2023-04-111-5/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Gaps in the reference count btree are also significant -- for these regions, there must not be any overlapping reverse mappings. We don't currently check this, so make the refcount scrubber more complete. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: implement masked btree key comparisons for _has_records scansDarrick J. Wong2023-04-1110-40/+142
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For keyspace fullness scans, we want to be able to mask off the parts of the key that we don't care about. For most btree types we /do/ want the full keyspace, but for checking that a given space usage also has a full complement of rmapbt records (even if different/multiple owners) we need this masking so that we only track sparseness of rm_startblock, not the whole keyspace (which is extremely sparse). Augment the ->diff_two_keys and ->keys_contiguous helpers to take a third union xfs_btree_key argument, and wire up xfs_rmap_has_records to pass this through. This third "mask" argument should contain a nonzero value in each structure field that should be used in the key comparisons done during the scan. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: replace xfs_btree_has_record with a general keyspace scannerDarrick J. Wong2023-04-1117-43/+249
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current implementation of xfs_btree_has_record returns true if it finds /any/ record within the given range. Unfortunately, that's not sufficient for scrub. We want to be able to tell if a range of keyspace for a btree is devoid of records, is totally mapped to records, or is somewhere in between. By forcing this to be a boolean, we conflated sparseness and fullness, which caused scrub to return incorrect results. Fix the API so that we can tell the caller which of those three is the current state. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: refactor ->diff_two_keys callsitesDarrick J. Wong2023-04-113-45/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create wrapper functions around ->diff_two_keys so that we don't have to remember what the return values mean, and adjust some of the code comments to reflect the longtime code behavior. We're going to introduce more uses of ->diff_two_keys in the next patch, so reduce the cognitive load for readers by doing this refactoring now. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: refactor converting btree irec to btree keyDarrick J. Wong2023-04-111-8/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | We keep doing these conversions to support btree queries, so refactor this into a helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of ↵Dave Chinner2023-04-142-8/+63
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: enhance btree key scrubbing [v24.5] This series fixes the scrub btree block checker to ensure that the keys in the parent block accurately represent the block, and check the ordering of all interior key records. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: always scrub record/key order of interior recordsDarrick J. Wong2023-04-112-7/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit d47fef9342d0, we removed the firstrec and firstkey fields of struct xchk_btree because Christoph thought they were unnecessary because we could use the record index in the btree cursor. This is incorrect because bc_ptrs (now bc_levels[].ptr) tracks the cursor position within a specific btree block, not within the entire level. The end result is that scrub no longer detects situations where the rightmost record of a block is identical to the leftmost record of that block's right sibling. Fix this regression by reintroducing record validity booleans so that order checking skips *only* the leftmost record/key in each level. Fixes: d47fef9342d0 ("xfs: don't track firstrec/firstkey separately in xchk_btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: check btree keys reflect the child blockDarrick J. Wong2023-04-111-1/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When scrub is checking a non-root btree block, it should make sure that the keys in the parent btree block accurately capture the keyspace that the child block stores. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of ↵Dave Chinner2023-04-144-10/+95
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: fix rmap btree key flag handling [v24.5] This series fixes numerous flag handling bugs in the rmapbt key code. The most serious transgression is that key comparisons completely strip out all flag bits from rm_offset, including the ones that participate in record lookups. The second problem is that for years we've been letting the unwritten flag (which is an attribute of a specific record and not part of the record key) escape from leaf records into key records. The solution to the second problem is to filter attribute flags when creating keys from records, and the solution to the first problem is to preserve *only* the flags used for key lookups. The ATTR and BMBT flags are a part of the lookup key, and the UNWRITTEN flag is a record attribute. This has worked for years without generating user complaints because ATTR and BMBT extents cannot be shared, so key comparisons succeed solely on rm_startblock. Only file data fork extents can be shared, and those records never set any of the three flag bits, so comparisons that dig into rm_owner and rm_offset work just fine. A filesystem written with an unpatched kernel and mounted on a patched kernel will work correctly because the ATTR/BMBT flags have been conveyed into keys correctly all along, and we still ignore the UNWRITTEN flag in any key record. This was what doomed my previous attempt to correct this problem in 2019. A filesystem written with a patched kernel and mounted on an unpatched kernel will also work correctly because unpatched kernels ignore all flags. With this patchset applied, the scrub code gains the ability to detect rmap btrees with incorrectly set attr and bmbt flags in the key records. After three years of testing, I haven't encountered any problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
| | * xfs: detect unwritten bit set in rmapbt node block keysDarrick J. Wong2023-04-113-0/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the last patch, we changed the rmapbt code to remove the UNWRITTEN bit when creating an rmapbt key from an rmapbt record, and we changed the rmapbt key comparison code to start considering the ATTR and BMBT flags during lookup. This brought the behavior of the rmapbt implementation in line with its specification. However, there may exist filesystems that have the unwritten bit still set in the rmapbt keys. We should detect these situations and flag the rmapbt as one that would benefit from optimization. Eventually, online repair will be able to do something in response to this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: fix rm_offset flag handling in rmap keysDarrick J. Wong2023-04-111-10/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows: (physical block, owner, fork, is_btree, offset) This provides users the ability to look up a reverse mapping from a file block mapping record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset. Unfortunately, the code that creates rmap lookup keys from rmap records forgot to mask off the record attribute flags, leading to ondisk keys that look like this: (physical block, owner, fork, is_btree, unwritten state, offset) Fortunately, this has all worked ok for the past six years because the key comparison functions incorrectly ignore the fork/bmbt/unwritten information that's encoded in the on-disk offset. This means that lookup comparisons are only done with: (physical block, owner, offset) Queries can (theoretically) return incorrect results because of this omission. On consistent filesystems this isn't an issue because xattr and bmbt blocks cannot be shared and hence the comparisons succeed purely on the contents of the rm_startblock field. For the one case where we support sharing (written data fork blocks) all flag bits are zero, so the omission in the comparison has no ill effects. Unfortunately, this bug prevents scrub from detecting incorrect fork and bmbt flag bits in the rmap btree, so we really do need to fix the compare code. Old filesystems with the unwritten bit erroneously set in the rmap key struct will work fine on new kernels since we still ignore the unwritten bit. New filesystems on older kernels will work fine since the old kernels never paid attention to the unwritten bit. A previous version of this patch forgot to keep the (un)written state flag masked during the comparison and caused a major regression in 5.9.x since unwritten extent conversion can update an rmap record without requiring key updates. Note that blocks cannot go directly from data fork to attr fork without being deallocated and reallocated, nor can they be added to or removed from a bmbt without a free/alloc cycle, so this should not cause any regressions. Found by fuzzing keys[1].attrfork = ones on xfs/371. Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of ↵Dave Chinner2023-04-144-28/+31
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next xfs: hoist scrub record checks into libxfs [v24.5] There are a few things about btree records that scrub checked but the libxfs _get_rec functions didn't. Move these bits into libxfs so that everyone can benefit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>