summaryrefslogtreecommitdiffstats
path: root/fs/xfs
Commit message (Collapse)AuthorAgeFilesLines
* xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocateDarrick J. Wong2022-01-111-1/+2
| | | | | | | | | | | | | | | commit 983d8e60f50806f90534cc5373d0ce867e5aaf79 upstream. The old ALLOCSP/FREESP ioctls in XFS can be used to preallocate space at the end of files, just like fallocate and RESVSP. Make the behavior consistent with the other ioctls. Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: Fix assert failure in xfs_setattr_size()Yumei Huang2021-03-071-1/+1
| | | | | | | | | | | | | | | | commit 88a9e03beef22cc5fabea344f54b9a0dfe63de08 upstream. An assert failure is triggered by syzkaller test due to ATTR_KILL_PRIV is not cleared before xfs_setattr_size. As ATTR_KILL_PRIV is not checked/used by xfs_setattr_size, just remove it from the assert. Signed-off-by: Yumei Huang <yuhuang@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: revert "xfs: fix rmap key and record comparison functions"Darrick J. Wong2020-11-241-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit eb8409071a1d47e3593cfe077107ac46853182ab ] This reverts commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f. Your maintainer committed a major braino in the rmap code by adding the attr fork, bmbt, and unwritten extent usage bits into rmap record key comparisons. While XFS uses the usage bits *in the rmap records* for cross-referencing metadata in xfs_scrub and xfs_repair, it only needs the owner and offset information to distinguish between reverse mappings of the same physical extent into the data fork of a file at multiple offsets. The other bits are not important for key comparisons for index lookups, and never have been. Eric Sandeen reports that this causes regressions in generic/299, so undo this patch before it does more damage. Reported-by: Eric Sandeen <sandeen@sandeen.net> Fixes: 6ff646b2ceb0 ("xfs: fix rmap key and record comparison functions") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix a missing unlock on error in xfs_fs_map_blocksChristoph Hellwig2020-11-181-1/+1
| | | | | | | | | | | | | [ Upstream commit 2bd3fa793aaa7e98b74e3653fdcc72fa753913b5 ] We also need to drop the iolock when invalidate_inode_pages2 fails, not only on all other error or successful cases. Fixes: 527851124d10 ("xfs: implement pNFS export operations") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix rmap key and record comparison functionsDarrick J. Wong2020-11-181-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6ff646b2ceb0eec916101877f38da0b73e3a5b7f ] Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows: (physical block, owner, fork, is_btree, is_unwritten, offset) This provides users the ability to look up a reverse mapping from a bmbt record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset. However, the key comparison functions incorrectly remove the fork/btree/unwritten information that's encoded in the on-disk offset. This means that lookup comparisons are only done with: (physical block, owner, offset) This means that queries can return incorrect results. On consistent filesystems this hasn't been an issue because blocks are never shared between forks or with bmbt blocks; and are never unwritten. However, this bug means that online repair cannot always detect corruption in the key information in internal rmapbt nodes. Found by fuzzing keys[1].attrfork = ones on xfs/371. Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix flags argument to rmap lookup when converting shared file rmapsDarrick J. Wong2020-11-181-1/+1
| | | | | | | | | | | | | | [ Upstream commit ea8439899c0b15a176664df62aff928010fad276 ] Pass the same oldext argument (which contains the existing rmapping's unwritten state) to xfs_rmap_lookup_le_range at the start of xfs_rmap_convert_shared. At this point in the code, flags is zero, which means that we perform lookups using the wrong key. Fixes: 3f165b334e51 ("xfs: convert unwritten status of reverse mappings for shared files") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: flush new eof page on truncate to avoid post-eof corruptionBrian Foster2020-11-181-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 869ae85dae64b5540e4362d7fe4cd520e10ec05c ] It is possible to expose non-zeroed post-EOF data in XFS if the new EOF page is dirty, backed by an unwritten block and the truncate happens to race with writeback. iomap_truncate_page() will not zero the post-EOF portion of the page if the underlying block is unwritten. The subsequent call to truncate_setsize() will, but doesn't dirty the page. Therefore, if writeback happens to complete after iomap_truncate_page() (so it still sees the unwritten block) but before truncate_setsize(), the cached page becomes inconsistent with the on-disk block. A mapped read after the associated page is reclaimed or invalidated exposes non-zero post-EOF data. For example, consider the following sequence when run on a kernel modified to explicitly flush the new EOF page within the race window: $ xfs_io -fc "falloc 0 4k" -c fsync /mnt/file $ xfs_io -c "pwrite 0 4k" -c "truncate 1k" /mnt/file ... $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file 00000400: 00 00 00 00 00 00 00 00 ........ $ umount /mnt/; mount <dev> /mnt/ $ xfs_io -c "mmap 0 4k" -c "mread -v 1k 8" /mnt/file 00000400: cd cd cd cd cd cd cd cd ........ Update xfs_setattr_size() to explicitly flush the new EOF page prior to the page truncate to ensure iomap has the latest state of the underlying block. Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix realtime bitmap/summary file truncation when growing rt volumeDarrick J. Wong2020-11-101-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f4c32e87de7d66074d5612567c5eac7325024428 ] The realtime bitmap and summary files are regular files that are hidden away from the directory tree. Since they're regular files, inode inactivation will try to purge what it thinks are speculative preallocations beyond the incore size of the file. Unfortunately, xfs_growfs_rt forgets to update the incore size when it resizes the inodes, with the result that inactivating the rt inodes at unmount time will cause their contents to be truncated. Fix this by updating the incore size when we change the ondisk size as part of updating the superblock. Note that we don't do this when we're allocating blocks to the rt inodes because we actually want those blocks to get purged if the growfs fails. This fixes corruption complaints from the online rtsummary checker when running xfs/233. Since that test requires rmap, one can also trigger this by growing an rt volume, cycling the mount, and creating rt files. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: make sure the rt allocator doesn't run off the endDarrick J. Wong2020-10-291-0/+11
| | | | | | | | | | | | | | | | | | [ Upstream commit 2a6ca4baed620303d414934aa1b7b0a8e7bab05f ] There's an overflow bug in the realtime allocator. If the rt volume is large enough to handle a single allocation request that is larger than the maximum bmap extent length and the rt bitmap ends exactly on a bitmap block boundary, it's possible that the near allocator will try to check the freeness of a range that extends past the end of the bitmap. This fails with a corruption error and shuts down the fs. Therefore, constrain maxlen so that the range scan cannot run off the end of the rt bitmap. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: don't ever return a stale pointer from __xfs_dir3_free_readDarrick J. Wong2020-10-011-0/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit 1cb5deb5bc095c070c09a4540c45f9c9ba24be43 ] If we decide that a directory free block is corrupt, we must take care not to leak a buffer pointer to the caller. After xfs_trans_brelse returns, the buffer can be freed or reused, which means that we have to set *bpp back to NULL. Callers are supposed to notice the nonzero return value and not use the buffer pointer, but we should code more defensively, even if all current callers handle this situation correctly. Fixes: de14c5f541e7 ("xfs: verify free block header fields") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix attr leaf header freemap.size underflowBrian Foster2020-10-011-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2a2b5932db67586bacc560cc065d62faece5b996 ] The leaf format xattr addition helper xfs_attr3_leaf_add_work() adjusts the block freemap in a couple places. The first update drops the size of the freemap that the caller had already selected to place the xattr name/value data. Before the function returns, it also checks whether the entries array has encroached on a freemap range by virtue of the new entry addition. This is necessary because the entries array grows from the start of the block (but end of the block header) towards the end of the block while the name/value data grows from the end of the block in the opposite direction. If the associated freemap is already empty, however, size is zero and the subtraction underflows the field and causes corruption. This is reproduced rarely by generic/070. The observed behavior is that a smaller sized freemap is aligned to the end of the entries list, several subsequent xattr additions land in larger freemaps and the entries list expands into the smaller freemap until it is fully consumed and then underflows. Note that it is not otherwise a corruption for the entries array to consume an empty freemap because the nameval list (i.e. the firstused pointer in the xattr header) starts beyond the end of the corrupted freemap. Update the freemap size modification to account for the fact that the freemap entry can be empty and thus stale. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: initialize the shortform attr header padding entryDarrick J. Wong2020-09-231-2/+2
| | | | | | | | | | | | [ Upstream commit 125eac243806e021f33a1fdea3687eccbb9f7636 ] Don't leak kernel memory contents into the shortform attr fork. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_initEiichi Tsukata2020-08-261-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 96cf2a2c75567ff56195fe3126d497a2e7e4379f ] If xfs_sysfs_init is called with parent_kobj == NULL, UBSAN shows the following warning: UBSAN: null-ptr-deref in ./fs/xfs/xfs_sysfs.h:37:23 member access within null pointer of type 'struct xfs_kobj' Call Trace: dump_stack+0x10e/0x195 ubsan_type_mismatch_common+0x241/0x280 __ubsan_handle_type_mismatch_v1+0x32/0x40 init_xfs_fs+0x12b/0x28f do_one_initcall+0xdd/0x1d0 do_initcall_level+0x151/0x1b6 do_initcalls+0x50/0x8f do_basic_setup+0x29/0x2b kernel_init_freeable+0x19f/0x20b kernel_init+0x11/0x1e0 ret_from_fork+0x22/0x30 Fix it by checking parent_kobj before the code accesses its member. Signed-off-by: Eiichi Tsukata <devel@etsukata.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: minor whitespace edits] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix inode quota reservation checksDarrick J. Wong2020-08-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f959b5d037e71a4d69b5bf71faffa065d9269b4a ] xfs_trans_dqresv is the function that we use to make reservations against resource quotas. Each resource contains two counters: the q_core counter, which tracks resources allocated on disk; and the dquot reservation counter, which tracks how much of that resource has either been allocated or reserved by threads that are working on metadata updates. For disk blocks, we compare the proposed reservation counter against the hard and soft limits to decide if we're going to fail the operation. However, for inodes we inexplicably compare against the q_core counter, not the incore reservation count. Since the q_core counter is always lower than the reservation count and we unlock the dquot between reservation and transaction commit, this means that multiple threads can reserve the last inode count before we hit the hard limit, and when they commit, we'll be well over the hard limit. Fix this by checking against the incore inode reservation counter, since we would appear to maintain that correctly (and that's what we report in GETQUOTA). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix reflink quota reservation accounting errorDarrick J. Wong2020-08-211-7/+14
| | | | | | | | | | | | | | [ Upstream commit 83895227aba1ade33e81f586aa7b6b1e143096a5 ] Quota reservations are supposed to account for the blocks that might be allocated due to a bmap btree split. Reflink doesn't do this, so fix this to make the quota accounting more accurate before we start rearranging things. Fixes: 862bb360ef56 ("xfs: reflink extents from one file to another") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix missed wakeup on l_flush_waitRik van Riel2020-08-211-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cdea5459ce263fbc963657a7736762ae897a8ae6 upstream. The code in xlog_wait uses the spinlock to make adding the task to the wait queue, and setting the task state to UNINTERRUPTIBLE atomic with respect to the waker. Doing the wakeup after releasing the spinlock opens up the following race condition: Task 1 task 2 add task to wait queue wake up task set task state to UNINTERRUPTIBLE This issue was found through code inspection as a result of kworkers being observed stuck in UNINTERRUPTIBLE state with an empty wait queue. It is rare and largely unreproducable. Simply moving the spin_unlock to after the wake_up_all results in the waker not being able to see a task on the waitqueue before it has set its state to UNINTERRUPTIBLE. This bug dates back to the conversion of this code to generic waitqueue infrastructure from a counting semaphore back in 2008 which didn't place the wakeups consistently w.r.t. to the relevant spin locks. [dchinner: Also fix a similar issue in the shutdown path on xc_commit_wait. Update commit log with more details of the issue.] Fixes: d748c62367eb ("[XFS] Convert l_flushsema to a sv_t") Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: stable@vger.kernel.org # 4.9.x-4.19.x [modified for contextual change near xlog_state_do_callback()] Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Reviewed-by: Frank van der Linden <fllinden@amazon.com> Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Reviewed-by: Anchal Agarwal <anchalag@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: don't call xfs_da_shrink_inode with NULL bpEric Sandeen2020-08-211-3/+2
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit bb3d48dcf86a97dc25fe9fc2c11938e19cb4399a ] xfs_attr3_leaf_create may have errored out before instantiating a buffer, for example if the blkno is out of range. In that case there is no work to do to remove it, and in fact xfs_da_shrink_inode will lead to an oops if we try. This also seems to fix a flaw where the original error from xfs_attr3_leaf_create gets overwritten in the cleanup case, and it removes a pointless assignment to bp which isn't used after this. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199969 Reported-by: Xu, Wen <wen.xu@gatech.edu> Tested-by: Xu, Wen <wen.xu@gatech.edu> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: validate cached inodes are free when allocatedDave Chinner2020-08-211-25/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit afca6c5b2595fc44383919fba740c194b0b76aff ] A recent fuzzed filesystem image cached random dcache corruption when the reproducer was run. This often showed up as panics in lookup_slow() on a null inode->i_ops pointer when doing pathwalks. BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 .... Call Trace: lookup_slow+0x44/0x60 walk_component+0x3dd/0x9f0 link_path_walk+0x4a7/0x830 path_lookupat+0xc1/0x470 filename_lookup+0x129/0x270 user_path_at_empty+0x36/0x40 path_listxattr+0x98/0x110 SyS_listxattr+0x13/0x20 do_syscall_64+0xf5/0x280 entry_SYSCALL_64_after_hwframe+0x42/0xb7 but had many different failure modes including deadlocks trying to lock the inode that was just allocated or KASAN reports of use-after-free violations. The cause of the problem was a corrupt INOBT on a v4 fs where the root inode was marked as free in the inobt record. Hence when we allocated an inode, it chose the root inode to allocate, found it in the cache and re-initialised it. We recently fixed a similar inode allocation issue caused by inobt record corruption problem in xfs_iget_cache_miss() in commit ee457001ed6c ("xfs: catch inode allocation state mismatch corruption"). This change adds similar checks to the cache-hit path to catch it, and turns the reproducer into a corruption shutdown situation. Reported-by: Wen Xu <wen.xu@gatech.edu> Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix typos in comment] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: catch inode allocation state mismatch corruptionDave Chinner2020-08-211-1/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ee457001ed6c6f31ddad69c24c1da8f377d8472d ] We recently came across a V4 filesystem causing memory corruption due to a newly allocated inode being setup twice and being added to the superblock inode list twice. From code inspection, the only way this could happen is if a newly allocated inode was not marked as free on disk (i.e. di_mode wasn't zero). Running the metadump on an upstream debug kernel fails during inode allocation like so: XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inod= e.c, line: 838 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:114! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/0= 1/2014 RIP: 0010:assfail+0x28/0x30 RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202 RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000 RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30 R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10 FS: 00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000= 000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0 Call Trace: xfs_ialloc+0x383/0x570 xfs_dir_ialloc+0x6a/0x2a0 xfs_create+0x412/0x670 xfs_generic_create+0x1f7/0x2c0 ? capable_wrt_inode_uidgid+0x3f/0x50 vfs_mkdir+0xfb/0x1b0 SyS_mkdir+0xcf/0xf0 do_syscall_64+0x73/0x1a0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Extracting the inode number we crashed on from an event trace and looking at it with xfs_db: xfs_db> inode 184452204 xfs_db> p core.magic = 0x494e core.mode = 0100644 core.version = 2 core.format = 2 (extents) core.nlinkv2 = 1 core.onlink = 0 ..... Confirms that it is not a free inode on disk. xfs_repair also trips over this inode: ..... zero length extent (off = 0, fsbno = 0) in ino 184452204 correcting nextents for inode 184452204 bad attribute fork in inode 184452204, would clear attr fork bad nblocks 1 for inode 184452204, would reset to 0 bad anextents 1 for inode 184452204, would reset to 0 imap claims in-use inode 184452204 is free, would correct imap would have cleared inode 184452204 ..... disconnected inode 184452204, would move to lost+found And so we have a situation where the directory structure and the inobt thinks the inode is free, but the inode on disk thinks it is still in use. Where this corruption came from is not possible to diagnose, but we can detect it and prevent the kernel from oopsing on lookup. The reproducer now results in: $ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5} mkdir: cannot create directory =E2=80=98/mnt/scratch/00=E2=80=99: File ex= ists mkdir: cannot create directory =E2=80=98/mnt/scratch/01=E2=80=99: File ex= ists mkdir: cannot create directory =E2=80=98/mnt/scratch/03=E2=80=99: Structu= re needs cleaning mkdir: cannot create directory =E2=80=98/mnt/scratch/04=E2=80=99: Input/o= utput error mkdir: cannot create directory =E2=80=98/mnt/scratch/05=E2=80=99: Input/o= utput error .... And this corruption shutdown: [ 54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not= marked free on disk [ 54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 = of file fs/xfs/xfs_trans.c. Caller xfs_create+0x425/0x670 [ 54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #= 443 [ 54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIO= S 1.10.2-1 04/01/2014 [ 54.852859] Call Trace: [ 54.853531] dump_stack+0x85/0xc5 [ 54.854385] xfs_trans_cancel+0x197/0x1c0 [ 54.855421] xfs_create+0x425/0x670 [ 54.856314] xfs_generic_create+0x1f7/0x2c0 [ 54.857390] ? capable_wrt_inode_uidgid+0x3f/0x50 [ 54.858586] vfs_mkdir+0xfb/0x1b0 [ 54.859458] SyS_mkdir+0xcf/0xf0 [ 54.860254] do_syscall_64+0x73/0x1a0 [ 54.861193] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 54.862492] RIP: 0033:0x7fb73bddf547 [ 54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000= 000000000053 [ 54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73= bddf547 [ 54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffda= a55449a [ 54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a= 8670dd0 [ 54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 000000000= 00001ff [ 54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffda= a553500 [ 54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1= 024 of file fs/xfs/xfs_trans.c. Return address = ffffffff814cd050 [ 54.882790] XFS (loop0): Corruption of in-memory data detected. Shutt= ing down filesystem [ 54.884597] XFS (loop0): Please umount the filesystem and rectify the = problem(s) Note that this crash is only possible on v4 filesystemsi or v5 filesystems mounted with the ikeep mount option. For all other V5 filesystems, this problem cannot occur because we don't read inodes we are allocating from disk - we simply overwrite them with the new inode information. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Tested-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: set format back to extents if xfs_bmap_extents_to_btreeEric Sandeen2020-07-311-0/+4
| | | | | | | | | | | | | | | | | | | commit 2c4306f719b083d17df2963bc761777576b8ad1b upstream. If xfs_bmap_extents_to_btree fails in a mode where we call xfs_iroot_realloc(-1) to de-allocate the root, set the format back to extents. Otherwise we can assume we can dereference ifp->if_broot based on the XFS_DINODE_FMT_BTREE format, and crash. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Nobuhiro Iwamatsu (CIP) <nobuhiro1.iwamatsu@toshiba.co.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: add agf freeblocks verify in xfs_agf_verifyZheng Bin2020-06-301-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d0c7feaf87678371c2c09b3709400be416b2dc62 ] We recently used fuzz(hydra) to test XFS and automatically generate tmp.img(XFS v5 format, but some metadata is wrong) xfs_repair information(just one AG): agf_freeblks 0, counted 3224 in ag 0 agf_longest 536874136, counted 3224 in ag 0 sb_fdblocks 613, counted 3228 Test as follows: mount tmp.img tmpdir cp file1M tmpdir sync In 4.19-stable, sync will stuck, the reason is: xfs_mountfs xfs_check_summary_counts if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) || XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) && !xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS)) return 0; -->just return, incore sb_fdblocks still be 613 xfs_initialize_perag_data cp file1M tmpdir -->ok(write file to pagecache) sync -->stuck(write pagecache to disk) xfs_map_blocks xfs_iomap_write_allocate while (count_fsb != 0) { nimaps = 0; while (nimaps == 0) { --> endless loop nimaps = 1; xfs_bmapi_write(..., &nimaps) --> nimaps becomes 0 again xfs_bmapi_write xfs_bmap_alloc xfs_bmap_btalloc xfs_alloc_vextent xfs_alloc_fix_freelist xfs_alloc_space_available -->fail(agf_freeblks is 0) In linux-next, sync not stuck, cause commit c2b3164320b5 ("xfs: use the latest extent at writeback delalloc conversion time") remove the above while, dmesg is as follows: [ 55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0. Users do not know why this page is discard, the better soultion is: 1. Like xfs_repair, make sure sb_fdblocks is equal to counted (xfs_initialize_perag_data did this, who is not called at this mount) 2. Add agf verify, if fail, will tell users to repair This patch use the second soultion. Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: Ren Xudong <renxudong1@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix partially uninitialized structure in xfs_reflink_remap_extentDarrick J. Wong2020-05-021-0/+1
| | | | | | | | | | | | | | | [ Upstream commit c142932c29e533ee892f87b44d8abc5719edceec ] In the reflink extent remap function, it turns out that uirec (the block mapping corresponding only to the part of the passed-in mapping that got unmapped) was not fully initialized. Specifically, br_state was not being copied from the passed-in struct to the uirec. This could lead to unpredictable results such as the reflinked mapping being marked unwritten in the destination file. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: Sanity check flags of Q_XQUOTARM callJan Kara2020-01-291-0/+3
| | | | | | | | | | | | | | | | commit 3dd4d40b420846dd35869ccc8f8627feef2cff32 upstream. Flags passed to Q_XQUOTARM were not sanity checked for invalid values. Fix that. Fixes: 9da93f9b7cdf ("xfs: fix Q_XQUOTARM ioctl") Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: don't check for AG deadlock for realtime files in bunmapiOmar Sandoval2020-01-121-1/+1
| | | | | | | | | | | | | | | | | | | | | commit 69ffe5960df16938bccfe1b65382af0b3de51265 upstream. Commit 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") added a check in __xfs_bunmapi() to stop early if we would touch multiple AGs in the wrong order. However, this check isn't applicable for realtime files. In most cases, it just makes us do unnecessary commits. However, without the fix from the previous commit ("xfs: fix realtime file data space leak"), if the last and second-to-last extents also happen to have different "AG numbers", then the break actually causes __xfs_bunmapi() to return without making any progress, which sends xfs_itruncate_extents_flags() into an infinite loop. Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix mount failure crash on invalid iclog memory accessBrian Foster2020-01-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 798a9cada4694ca8d970259f216cec47e675bfd5 ] syzbot (via KASAN) reports a use-after-free in the error path of xlog_alloc_log(). Specifically, the iclog freeing loop doesn't handle the case of a fully initialized ->l_iclog linked list. Instead, it assumes that the list is partially constructed and NULL terminated. This bug manifested because there was no possible error scenario after iclog list setup when the original code was added. Subsequent code and associated error conditions were added some time later, while the original error handling code was never updated. Fix up the error loop to terminate either on a NULL iclog or reaching the end of the list. Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: Fix bulkstat compat ioctls on x32 userspace.Nick Bowler2019-12-051-4/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7ca860e3c1a74ad6bd8949364073ef1044cad758 ] The bulkstat family of ioctls are problematic on x32, because there is a mixup of native 32-bit and 64-bit conventions. The xfs_fsop_bulkreq struct contains pointers and 32-bit integers so that matches the native 32-bit layout, and that means the ioctl implementation goes into the regular compat path on x32. However, the 'ubuffer' member of that struct in turn refers to either struct xfs_inogrp or xfs_bstat (or an array of these). On x32, those structures match the native 64-bit layout. The compat implementation writes out the 32-bit version of these structures. This is not the expected format for x32 userspace, causing problems. Fortunately the functions which actually output these xfs_inogrp and xfs_bstat structures have an easy way to select which output format is required, so we just need a little tweak to select the right format on x32. Signed-off-by: Nick Bowler <nbowler@draconx.ca> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: Align compat attrlist_by_handle with native implementation.Nick Bowler2019-12-051-0/+6
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit c456d64449efe37da50832b63d91652a85ea1d20 ] While inspecting the ioctl implementations, I noticed that the compat implementation of XFS_IOC_ATTRLIST_BY_HANDLE does not do exactly the same thing as the native implementation. Specifically, the "cursor" does not appear to be written out to userspace on the compat path, like it is on the native path. This adjusts the compat implementation to copy out the cursor just like the native implementation does. The attrlist cursor does not require any special compat handling. This fixes xfstests xfs/269 on both IA-32 and x32 userspace, when running on an amd64 kernel. Signed-off-by: Nick Bowler <nbowler@draconx.ca> Fixes: 0facef7fb053b ("xfs: in _attrlist_by_handle, copy the cursor back to userspace") Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: require both realtime inodes to mountDarrick J. Wong2019-12-051-3/+1
| | | | | | | | | | | | | | | [ Upstream commit 64bafd2f1e484e27071e7584642005d56516cb77 ] Since mkfs always formats the filesystem with the realtime bitmap and summary inodes immediately after the root directory, we should expect that both of them are present and loadable, even if there isn't a realtime volume attached. There's no reason to skip this if rbmino == NULLFSINO; in fact, this causes an immediate crash if the there /is/ a realtime volume and someone writes to it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix use-after-free race in xfs_buf_releDave Chinner2019-11-281-1/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 37fd1678245f7a5898c1b05128bc481fb403c290 ] When looking at a 4.18 based KASAN use after free report, I noticed that racing xfs_buf_rele() may race on dropping the last reference to the buffer and taking the buffer lock. This was the symptom displayed by the KASAN report, but the actual issue that was reported had already been fixed in 4.19-rc1 by commit e339dd8d8b04 ("xfs: use sync buffer I/O for sync delwri queue submission"). Despite this, I think there is still an issue with xfs_buf_rele() in this code: release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock); spin_lock(&bp->b_lock); if (!release) { ..... If two threads race on the b_lock after both dropping a reference and one getting dropping the last reference so release = true, we end up with: CPU 0 CPU 1 atomic_dec_and_lock() atomic_dec_and_lock() spin_lock(&bp->b_lock) spin_lock(&bp->b_lock) <spins> <release = true bp->b_lru_ref = 0> <remove from lists> freebuf = true spin_unlock(&bp->b_lock) xfs_buf_free(bp) <gets lock, reading and writing freed memory> <accesses freed memory> spin_unlock(&bp->b_lock) <reads/writes freed memory> IOWs, we can't safely take bp->b_lock after dropping the hold reference because the buffer may go away at any time after we drop that reference. However, this can be fixed simply by taking the bp->b_lock before we drop the reference. It is safe to nest the pag_buf_lock inside bp->b_lock as the pag_buf_lock is only used to serialise against lookup in xfs_buf_find() and no other locks are held over or under the pag_buf_lock there. Make this clear by documenting the buffer lock orders at the top of the file. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: Correctly invert xfs_buftarg LRU isolation logicVratislav Bendel2019-11-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | commit 19957a181608d25c8f4136652d0ea00b3738972d upstream. Due to an inverted logic mistake in xfs_buftarg_isolate() the xfs_buffers with zero b_lru_ref will take another trip around LRU, while isolating buffers with non-zero b_lru_ref. Additionally those isolated buffers end up right back on the LRU once they are released, because b_lru_ref remains elevated. Fix that circuitous route by leaving them on the LRU as originally intended. Signed-off-by: Vratislav Bendel <vbendel@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Alex Lyakas <alex@zadara.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: clear sb->s_fs_info on mount failureDave Chinner2019-10-171-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c9fbd7bbc23dbdd73364be4d045e5d3612cf6e82 upstream. We recently had an oops reported on a 4.14 kernel in xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage and so the m_perag_tree lookup walked into lala land. Essentially, the machine was under memory pressure when the mount was being run, xfs_fs_fill_super() failed after allocating the xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and freed the xfs_mount, but the sb->s_fs_info field still pointed to the freed memory. Hence when the superblock shrinker then ran it fell off the bad pointer. With the superblock shrinker problem fixed at teh VFS level, this stale s_fs_info pointer is still a problem - we use it unconditionally in ->put_super when the superblock is being torn down, and hence we can still trip over it after a ->fill_super call failure. Hence we need to clear s_fs_info if xfs-fs_fill_super() fails, and we need to check if it's valid in the places it can potentially be dereferenced after a ->fill_super failure. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ajay Kaher <akaher@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: don't crash on null attr fork xfs_bmapi_readDarrick J. Wong2019-10-051-8/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8612de3f7ba6e900465e340516b8313806d27b2d ] Zorro Lang reported a crash in generic/475 if we try to inactivate a corrupt inode with a NULL attr fork (stack trace shortened somewhat): RIP: 0010:xfs_bmapi_read+0x311/0xb00 [xfs] RSP: 0018:ffff888047f9ed68 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: ffff888047f9f038 RCX: 1ffffffff5f99f51 RDX: 0000000000000002 RSI: 0000000000000008 RDI: 0000000000000012 RBP: ffff888002a41f00 R08: ffffed10005483f0 R09: ffffed10005483ef R10: ffffed10005483ef R11: ffff888002a41f7f R12: 0000000000000004 R13: ffffe8fff53b5768 R14: 0000000000000005 R15: 0000000000000001 FS: 00007f11d44b5b80(0000) GS:ffff888114200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000ef6000 CR3: 000000002e176003 CR4: 00000000001606e0 Call Trace: xfs_dabuf_map.constprop.18+0x696/0xe50 [xfs] xfs_da_read_buf+0xf5/0x2c0 [xfs] xfs_da3_node_read+0x1d/0x230 [xfs] xfs_attr_inactive+0x3cc/0x5e0 [xfs] xfs_inactive+0x4c8/0x5b0 [xfs] xfs_fs_destroy_inode+0x31b/0x8e0 [xfs] destroy_inode+0xbc/0x190 xfs_bulkstat_one_int+0xa8c/0x1200 [xfs] xfs_bulkstat_one+0x16/0x20 [xfs] xfs_bulkstat+0x6fa/0xf20 [xfs] xfs_ioc_bulkstat+0x182/0x2b0 [xfs] xfs_file_ioctl+0xee0/0x12a0 [xfs] do_vfs_ioctl+0x193/0x1000 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x6f/0xb0 do_syscall_64+0x9f/0x4d0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f11d39a3e5b The "obvious" cause is that the attr ifork is null despite the inode claiming an attr fork having at least one extent, but it's not so obvious why we ended up with an inode in that state. Reported-by: Zorro Lang <zlang@redhat.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204031 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOTDarrick J. Wong2019-09-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1fb254aa983bf190cfd685d40c64a480a9bafaee upstream. Benjamin Moody reported to Debian that XFS partially wedges when a chgrp fails on account of being out of disk quota. I ran his reproducer script: # adduser dummy # adduser dummy plugdev # dd if=/dev/zero bs=1M count=100 of=test.img # mkfs.xfs test.img # mount -t xfs -o gquota test.img /mnt # mkdir -p /mnt/dummy # chown -c dummy /mnt/dummy # xfs_quota -xc 'limit -g bsoft=100k bhard=100k plugdev' /mnt (and then as user dummy) $ dd if=/dev/urandom bs=1M count=50 of=/mnt/dummy/foo $ chgrp plugdev /mnt/dummy/foo and saw: ================================================ WARNING: lock held when returning to user space! 5.3.0-rc5 #rc5 Tainted: G W ------------------------------------------------ chgrp/47006 is leaving the kernel with locks still held! 1 lock held by chgrp/47006: #0: 000000006664ea2d (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0xd2/0x290 [xfs] ...which is clearly caused by xfs_setattr_nonsize failing to unlock the ILOCK after the xfs_qm_vop_chown_reserve call fails. Add the missing unlock. Reported-by: benjamin.moody@gmail.com Fixes: 253f4911f297 ("xfs: better xfs_trans_alloc interface") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Tested-by: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: don't fail when converting shortform attr to long form during ATTR_REPLACEDarrick J. Wong2018-12-081-1/+8
| | | | | | | | | | | | | | | | | | | | | commit 7b38460dc8e4eafba06c78f8e37099d3b34d473c upstream. Kanda Motohiro reported that expanding a tiny xattr into a large xattr fails on XFS because we remove the tiny xattr from a shortform fork and then try to re-add it after converting the fork to extents format having not removed the ATTR_REPLACE flag. This fails because the attr is no longer present, causing a fs shutdown. This is derived from the patch in his bug report, but we really shouldn't ignore a nonzero retval from the remove call. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119 Reported-by: kanda.motohiro@gmail.com Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: truncate transaction does not modify the inobtBrian Foster2018-11-101-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a606ebdb859e78beb757dfefa08001df366e2ef5 ] The truncate transaction does not ever modify the inode btree, but includes an associated log reservation. Update xfs_calc_itruncate_reservation() to remove the reservation associated with inobt updates. [Amir: This commit was merged for kernel v4.16 and a twin commit was merged for xfsprogs v4.16. As a result, a small xfs filesystem formatted with features -m rmapbt=1,reflink=1 using mkfs.xfs version >= v4.16 cannot be mounted with kernel < v4.16. For example, xfstests generic/17{1,2,3} format a small fs and when trying to mount it, they fail with an assert on this very demonic line: XFS (vdc): Log size 3075 blocks too small, minimum size is 3717 blocks XFS (vdc): AAIEEE! Log failed size checks. Abort! XFS: Assertion failed: 0, file: src/linux/fs/xfs/xfs_log.c, line: 666 The simple solution for stable kernels is to apply this patch, because mkfs.xfs v4.16 is already in the wild, so we have to assume that xfs filesystems with a "too small" log exist. Regardless, xfsprogs maintainers should also consider reverting the twin patch to stop creating those filesystems for the sake of users with unpatched kernels.] Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: <stable@vger.kernel.org> # v4.9+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J . Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* xfs: detect agfl count corruption and reset agflBrian Foster2018-06-063-1/+103
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a27ba2607e60312554cbcd43fc660b2c7f29dc9c upstream. The struct xfs_agfl v5 header was originally introduced with unexpected padding that caused the AGFL to operate with one less slot than intended. The header has since been packed, but the fix left an incompatibility for users who upgrade from an old kernel with the unpacked header to a newer kernel with the packed header while the AGFL happens to wrap around the end. The newer kernel recognizes one extra slot at the physical end of the AGFL that the previous kernel did not. The new kernel will eventually attempt to allocate a block from that slot, which contains invalid data, and cause a crash. This condition can be detected by comparing the active range of the AGFL to the count. While this detects a padding mismatch, it can also trigger false positives for unrelated flcount corruption. Since we cannot distinguish a size mismatch due to padding from unrelated corruption, we can't trust the AGFL enough to simply repopulate the empty slot. Instead, avoid unnecessarily complex detection logic and and use a solution that can handle any form of flcount corruption that slips through read verifiers: distrust the entire AGFL and reset it to an empty state. Any valid blocks within the AGFL are intentionally leaked. This requires xfs_repair to rectify (which was already necessary based on the state the AGFL was found in). The reset mitigates the side effect of the padding mismatch problem from a filesystem crash to a free space accounting inconsistency. The generic approach also means that this patch can be safely backported to kernels with or without a packed struct xfs_agfl. Check the AGF for an invalid freelist count on initial read from disk. If detected, set a flag on the xfs_perag to indicate that a reset is required before the AGFL can be used. In the first transaction that attempts to use a flagged AGFL, reset it to empty, warn the user about the inconsistency and allow the freelist fixup code to repopulate the AGFL with new blocks. The xfs_perag flag is cleared to eliminate the need for repeated checks on each block allocation operation. This allows kernels that include the packing fix commit 96f859d52bcb ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct") to handle older unpacked AGFL formats without a filesystem crash. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chiluk <chiluk+linuxxfs@indeed.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Force log to disk before reading the AGF during a fstrimCarlos Maiolino2018-05-301-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8c81dd46ef3c416b3b95e3020fb90dbd44e6140b ] Forcing the log to disk after reading the agf is wrong, we might be calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held. This can cause a deadlock when racing a fstrim with a filesystem shutdown. The deadlock has been identified due a miscalculation bug in device-mapper dm-thin, which returns lack of space to its users earlier than the device itself really runs out of space, changing the device-mapper volume into an error state. The problem happened while filling the filesystem with a single file, triggering the bug in device-mapper, consequently causing an IO error and shutting down the filesystem. If such file is removed, and fstrim executed before the XFS finishes the shut down process, the fstrim process will end up holding the buffer lock, and going to sleep on the cil wait queue. At this point, the shut down process will try to wake up all the threads waiting on the cil wait queue, but for this, it will try to hold the same buffer log already held my the fstrim, locking up the filesystem. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: prevent creating negative-sized file via INSERT_RANGEDarrick J. Wong2018-05-091-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7d83fb14258b9961920cd86f0b921caaeb3ebe85 upstream. During the "insert range" fallocate operation, i_size grows by the specified 'len' bytes. XFS verifies that i_size + len < s_maxbytes, as it should. But this comparison is done using the signed 'loff_t', and 'i_size + len' can wrap around to a negative value, causing the check to incorrectly pass, resulting in an inode with "negative" i_size. This is possible on 64-bit platforms, where XFS sets s_maxbytes = LLONG_MAX. ext4 and f2fs don't run into this because they set a smaller s_maxbytes. Fix it by using subtraction instead. Reproducer: xfs_io -f file -c "truncate $(((1<<63)-1))" -c "finsert 0 4096" Fixes: a904b1ca5751 ("xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate") Cc: <stable@vger.kernel.org> # v4.1+ Originally-From: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix signed integer addition overflow too] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: quota: check result of register_shrinker()Aliaksei Karaliou2018-03-031-16/+29
| | | | | | | | | | | | | | | [ Upstream commit 3a3882ff26fbdbaf5f7e13f6a0bccfbf7121041d ] xfs_qm_init_quotainfo() does not check result of register_shrinker() which was tagged as __must_check recently, reported by sparse. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> [darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: quota: fix missed destroy of qi_tree_lockAliaksei Karaliou2018-03-031-0/+1
| | | | | | | | | | | | | [ Upstream commit 2196881566225f3c3428d1a5f847a992944daa5b ] xfs_qm_destroy_quotainfo() does not destroy quotainfo->qi_tree_lock while destroys quotainfo->qi_quotaofflock. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: Properly retry failed dquot items in case of error during buffer writebackCarlos Maiolino2018-02-032-5/+49
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 373b0589dc8d58bc09c9a28d03611ae4fb216057 ] Once the inode item writeback errors is already fixed, it's time to fix the same problem in dquot code. Although there were no reports of users hitting this bug in dquot code (at least none I've seen), the bug is there and I was already planning to fix it when the correct approach to fix the inodes part was decided. This patch aims to fix the same problem in dquot code, regarding failed buffers being unable to be resubmitted once they are flush locked. Tested with the recently test-case sent to fstests list by Hou Tao. Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: ubsan fixesDarrick J. Wong2018-02-031-3/+3
| | | | | | | | | | | [ Upstream commit 22a6c83777ac7c17d6c63891beeeac24cf5da450 ] Fix some complaints from the UBSAN about signed integer addition overflows. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fortify xfs_alloc_buftarg error handlingMichal Hocko2018-02-031-5/+10
| | | | | | | | | | | | | | | | | | | | [ Upstream commit d210a9874b8f6166579408131cb74495caff1958 ] percpu_counter_init failure path doesn't clean up &btp->bt_lru list. Call list_lru_destroy in that error path. Similarly register_shrinker error path is not handled. While it is unlikely to trigger these error path, it is not impossible especially the later might fail with large NUMAs. Let's handle the failure to make the code more robust. Noticed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: always free inline data before resetting inode fork during ifreeDarrick J. Wong2018-02-031-0/+21
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 98c4f78dcdd8cec112d1cbc5e9a792ee6e5ab7a6 ] In xfs_ifree, we reset the data/attr forks to extents format without bothering to free any inline data buffer that might still be around after all the blocks have been truncated off the file. Prior to commit 43518812d2 ("xfs: remove support for inlining data/extents into the inode fork") nobody noticed because the leftover inline data after truncation was small enough to fit inside the inline buffer inside the fork itself. However, now that we've removed the inline buffer, we /always/ have to free the inline data buffer or else we leak them like crazy. This test was found by turning on kmemleak for generic/001 or generic/388. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix incorrect extent state in xfs_bmap_add_extent_unwritten_realChristoph Hellwig2017-12-201-1/+1
| | | | | | | | | | | | | | | | | | | [ Upstream commit 5e422f5e4fd71d18bc6b851eeb3864477b3d842e ] There was one spot in xfs_bmap_add_extent_unwritten_real that didn't use the passed in new extent state but always converted to normal, leading to wrong behavior when converting from normal to unwritten. Only found by code inspection, it seems like this code path to move partial extent from written to unwritten while merging it with the next extent is rarely exercised. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix log block underflow during recovery cycle verificationBrian Foster2017-12-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9f2a4505800607e537e9dd9dea4f55c4b0c30c7a ] It is possible for mkfs to format very small filesystems with too small of an internal log with respect to the various minimum size and block count requirements. If this occurs when the log happens to be smaller than the scan window used for cycle verification and the scan wraps the end of the log, the start_blk calculation in xlog_find_head() underflows and leads to an attempt to scan an invalid range of log blocks. This results in log recovery failure and a failed mount. Since there may be filesystems out in the wild with this kind of geometry, we cannot simply refuse to mount. Instead, cap the scan window for cycle verification to the size of the physical log. This ensures that the cycle verification proceeds as expected when the scan wraps the end of the log. Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: truncate pagecache before writeback in xfs_setattr_size()Eryu Guan2017-12-201-16/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 350976ae21873b0d36584ea005076356431b8f79 ] On truncate down, if new size is not block size aligned, we zero the rest of block to avoid exposing stale data to user, and iomap_truncate_page() skips zeroing if the range is already in unwritten state or a hole. Then we writeback from on-disk i_size to the new size if this range hasn't been written to disk yet, and truncate page cache beyond new EOF and set in-core i_size. The problem is that we could write data between di_size and newsize before removing the page cache beyond newsize, as the extents may still be in unwritten state right after a buffer write. As such, the page of data that newsize lies in has not been zeroed by page cache invalidation before it is written, and xfs_do_writepage() hasn't triggered it's "zero data beyond EOF" case because we haven't updated in-core i_size yet. Then a subsequent mmap read could see non-zeros past EOF. I occasionally see this in fsx runs in fstests generic/112, a simplified fsx operation sequence is like (assuming 4k block size xfs): fallocate 0x0 0x1000 0x0 keep_size write 0x0 0x1000 0x0 truncate 0x0 0x800 0x1000 punch_hole 0x0 0x800 0x800 mapread 0x0 0x800 0x800 where fallocate allocates unwritten extent but doesn't update i_size, buffer write populates the page cache and extent is still unwritten, truncate skips zeroing page past new EOF and writes the page to disk, punch_hole invalidates the page cache, at last mapread reads the block back and sees non-zero beyond EOF. Fix it by moving truncate_setsize() to before writeback so the page cache invalidation zeros the partial page at the new EOF. This also triggers "zero data beyond EOF" in xfs_do_writepage() at writeback time, because newsize has been set and page straddles the newsize. Also fixed the wrong 'end' param of filemap_write_and_wait_range() call while we're at it, the 'end' is inclusive and should be 'newsize - 1'. Suggested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eryu Guan <eguan@redhat.com> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix forgotten rcu read unlock when skipping inode reclaimDarrick J. Wong2017-12-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 962cc1ad6caddb5abbb9f0a43e5abe7131a71f18 ] In commit f2e9ad21 ("xfs: check for race with xfs_reclaim_inode"), we skip an inode if we're racing with freeing the inode via xfs_reclaim_inode, but we forgot to release the rcu read lock when dumping the inode, with the result that we exit to userspace with a lock held. Don't do that; generic/320 with a 1k block size fails this very occasionally. ================================================ WARNING: lock held when returning to user space! 4.14.0-rc6-djwong #4 Tainted: G W ------------------------------------------------ rm/30466 is leaving the kernel with locks still held! 1 lock held by rm/30466: #0: (rcu_read_lock){....}, at: [<ffffffffa01364d3>] xfs_ifree_cluster.isra.17+0x2c3/0x6f0 [xfs] ------------[ cut here ]------------ WARNING: CPU: 1 PID: 30466 at kernel/rcu/tree_plugin.h:329 rcu_note_context_switch+0x71/0x700 Modules linked in: deadline_iosched dm_snapshot dm_bufio ext4 mbcache jbd2 dm_flakey xfs libcrc32c dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: scsi_debug] CPU: 1 PID: 30466 Comm: rm Tainted: G W 4.14.0-rc6-djwong #4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014 task: ffff880037680000 task.stack: ffffc90001064000 RIP: 0010:rcu_note_context_switch+0x71/0x700 RSP: 0000:ffffc90001067e50 EFLAGS: 00010002 RAX: 0000000000000001 RBX: ffff880037680000 RCX: ffff88003e73d200 RDX: 0000000000000002 RSI: ffffffff819e53e9 RDI: ffffffff819f4375 RBP: 0000000000000000 R08: 0000000000000000 R09: ffff880062c900d0 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880037680000 R13: 0000000000000000 R14: ffffc90001067eb8 R15: ffff880037680690 FS: 00007fa3b8ce8700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f69bf77c000 CR3: 000000002450a000 CR4: 00000000000006e0 Call Trace: __schedule+0xb8/0xb10 schedule+0x40/0x90 exit_to_usermode_loop+0x6b/0xa0 prepare_exit_to_usermode+0x7a/0x90 retint_user+0x8/0x20 RIP: 0033:0x7fa3b87fda87 RSP: 002b:00007ffe41206568 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02 RAX: 0000000000000000 RBX: 00000000010e88c0 RCX: 00007fa3b87fda87 RDX: 0000000000000000 RSI: 00000000010e89c8 RDI: 0000000000000005 RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000000 R10: 000000000000015e R11: 0000000000000246 R12: 00000000010c8060 R13: 00007ffe41206690 R14: 0000000000000000 R15: 0000000000000000 ---[ end trace e88f83bf0cfbd07d ]--- Fixes: f2e9ad212def50bcf4c098c6288779dd97fff0f0 Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: trim writepage mapping to within eofBrian Foster2017-10-273-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 40214d128e07dd21bb07a8ed6a7fe2f911281ab2 upstream. The writeback rework in commit fbcc02561359 ("xfs: Introduce writeback context for writepages") introduced a subtle change in behavior with regard to the block mapping used across the ->writepages() sequence. The previous xfs_cluster_write() code would only flush pages up to EOF at the time of the writepage, thus ensuring that any pages due to file-extending writes would be handled on a separate cycle and with a new, updated block mapping. The updated code establishes a block mapping in xfs_writepage_map() that could extend beyond EOF if the file has post-eof preallocation. Because we now use the generic writeback infrastructure and pass the cached mapping to each writepage call, there is no implicit EOF limit in place. If eofblocks trimming occurs during ->writepages(), any post-eof portion of the cached mapping becomes invalid. The eofblocks code has no means to serialize against writeback because there are no pages associated with post-eof blocks. Therefore if an eofblocks trim occurs and is followed by a file-extending buffered write, not only has the mapping become invalid, but we could end up writing a page to disk based on the invalid mapping. Consider the following sequence of events: - A buffered write creates a delalloc extent and post-eof speculative preallocation. - Writeback starts and on the first writepage cycle, the delalloc extent is converted to real blocks (including the post-eof blocks) and the mapping is cached. - The file is closed and xfs_release() trims post-eof blocks. The cached writeback mapping is now invalid. - Another buffered write appends the file with a delalloc extent. - The concurrent writeback cycle picks up the just written page because the writeback range end is LLONG_MAX. xfs_writepage_map() attributes it to the (now invalid) cached mapping and writes the data to an incorrect location on disk (and where the file offset is still backed by a delalloc extent). This problem is reproduced by xfstests test generic/464, which triggers racing writes, appends, open/closes and writeback requests. To address this problem, trim the mapping used during writeback to within EOF when the mapping is validated. This ensures the mapping is revalidated for any pages encountered beyond EOF as of the time the current mapping was cached or last validated. Reported-by: Eryu Guan <eguan@redhat.com> Diagnosed-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: cancel dirty pages on invalidationDave Chinner2017-10-271-12/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 793d7dbe6d82a50b9d14bf992b9eaacb70a11ce6 upstream. Recently we've had warnings arise from the vm handing us pages without bufferheads attached to them. This should not ever occur in XFS, but we don't defend against it properly if it does. The only place where we remove bufferheads from a page is in xfs_vm_releasepage(), but we can't tell the difference here between "page is dirty so don't release" and "page is dirty but is being invalidated so release it". In some places that are invalidating pages ask for pages to be released and follow up afterward calling ->releasepage by checking whether the page was dirty and then aborting the invalidation. This is a possible vector for releasing buffers from a page but then leaving it in the mapping, so we really do need to avoid dirty pages in xfs_vm_releasepage(). To differentiate between invalidated pages and normal pages, we need to clear the page dirty flag when invalidating the pages. This can be done through xfs_vm_invalidatepage(), and will result xfs_vm_releasepage() seeing the page as clean which matches the bufferhead state on the page after calling block_invalidatepage(). Hence we can re-add the page dirty check in xfs_vm_releasepage to catch the case where we might be releasing a page that is actually dirty and so should not have the bufferheads on it removed. This will remove one possible vector of "dirty page with no bufferheads" and so help narrow down the search for the root cause of that problem. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>