| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit a27ba2607e60312554cbcd43fc660b2c7f29dc9c upstream.
The struct xfs_agfl v5 header was originally introduced with
unexpected padding that caused the AGFL to operate with one less
slot than intended. The header has since been packed, but the fix
left an incompatibility for users who upgrade from an old kernel
with the unpacked header to a newer kernel with the packed header
while the AGFL happens to wrap around the end. The newer kernel
recognizes one extra slot at the physical end of the AGFL that the
previous kernel did not. The new kernel will eventually attempt to
allocate a block from that slot, which contains invalid data, and
cause a crash.
This condition can be detected by comparing the active range of the
AGFL to the count. While this detects a padding mismatch, it can
also trigger false positives for unrelated flcount corruption. Since
we cannot distinguish a size mismatch due to padding from unrelated
corruption, we can't trust the AGFL enough to simply repopulate the
empty slot.
Instead, avoid unnecessarily complex detection logic and and use a
solution that can handle any form of flcount corruption that slips
through read verifiers: distrust the entire AGFL and reset it to an
empty state. Any valid blocks within the AGFL are intentionally
leaked. This requires xfs_repair to rectify (which was already
necessary based on the state the AGFL was found in). The reset
mitigates the side effect of the padding mismatch problem from a
filesystem crash to a free space accounting inconsistency. The
generic approach also means that this patch can be safely backported
to kernels with or without a packed struct xfs_agfl.
Check the AGF for an invalid freelist count on initial read from
disk. If detected, set a flag on the xfs_perag to indicate that a
reset is required before the AGFL can be used. In the first
transaction that attempts to use a flagged AGFL, reset it to empty,
warn the user about the inconsistency and allow the freelist fixup
code to repopulate the AGFL with new blocks. The xfs_perag flag is
cleared to eliminate the need for repeated checks on each block
allocation operation.
This allows kernels that include the packing fix commit 96f859d52bcb
("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct")
to handle older unpacked AGFL formats without a filesystem crash.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by Dave Chiluk <chiluk+linuxxfs@indeed.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit a78ee256c325ecfaec13cafc41b315bd4e1dd518 upstream.
The AGFL size calculation is about to get more complex, so lets turn
the macro into a function first and remove the macro.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: forward port to newer kernel, simplify the helper]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 4faa99965e027cc057c5145ce45fa772caa04e8d upstream.
If io_destroy() gets to cancelling everything that can be cancelled and
gets to kiocb_cancel() calling the function driver has left in ->ki_cancel,
it becomes vulnerable to a race with IO completion. At that point req
is already taken off the list and aio_complete() does *NOT* spin until
we (in free_ioctx_users()) releases ->ctx_lock. As the result, it proceeds
to kiocb_free(), freing req just it gets passed to ->ki_cancel().
Fix is simple - remove from the list after the call of kiocb_cancel(). All
instances of ->ki_cancel() already have to cope with the being called with
iocb still on list - that's what happens in io_cancel(2).
Cc: stable@kernel.org
Fixes: 0460fef2a921 "aio: use cancellation list lazily"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 116e5258e4115aca0c64ac0bf40ded3b353ed626 ]
Currently when UDF filesystem is recorded without uid / gid (ids are set
to -1), we will assign INVALID_[UG]ID to vfs inode unless user uses uid=
and gid= mount options. In such case filesystem could not be modified in
any way as VFS refuses to modify files with invalid ids (even by root).
This is confusing to users and not very useful default since such media
mode is generally used for removable media. Use overflow[ug]id instead
so that at least root can modify the filesystem.
Reported-by: Steve Kenton <skenton@ou.edu>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 174d1232ebc84fcde8f5889d1171c9c7e74a10a7 ]
The chunk size of allocations in __gfs2_fallocate is calculated
incorrectly. The size can collapse, causing __gfs2_fallocate to
allocate one block at a time, which is very inefficient. This needs
fixing in two places:
In gfs2_quota_lock_check, always set ap->allowed to UINT_MAX to indicate
that there is no quota limit. This fixes callers that rely on
ap->allowed to be set even when quotas are off.
In __gfs2_fallocate, reset max_blks to UINT_MAX in each iteration of the
loop to make sure that allocation limits from one resource group won't
spill over into another resource group.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit bf617f7a92edc6bb2909db2bfa4576f50b280ee5 ]
If noextent_cache mount option is on, we will never initialize extent tree
in inode, but still we're going to access it in f2fs_drop_extent_tree,
result in kernel panic as below:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: _raw_write_lock+0xc/0x30
Call Trace:
? f2fs_drop_extent_tree+0x41/0x70 [f2fs]
f2fs_fallocate+0x5a0/0xdd0 [f2fs]
? common_file_perm+0x47/0xc0
? apparmor_file_permission+0x1a/0x20
vfs_fallocate+0x15b/0x290
SyS_fallocate+0x44/0x70
do_syscall_64+0x6e/0x160
entry_SYSCALL64_slow_path+0x25/0x25
This patch fixes to check extent cache status before using in
f2fs_drop_extent_tree.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit cd36d7a17f9da68be9aa67185ba3ad7969934a19 ]
Once CP_TRIMMED_FLAG is set, after a reboot, we will never issue discard
before LBA becomes invalid again, fix it by clearing the flag in
checkpoint without CP_TRIMMED reason.
Fixes: 1f43e2ad7bff ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 17cd07ae95073c298af92c1ba14ac58ce84de33b ]
As Jayashree Mohan reported:
A simple workload to reproduce this would be :
1. create foo
2. Write (8K - 16K) // foo size = 16K now
3. fsync()
4. falloc zero_range , keep_size (4202496 - 4210688) // foo size must be 16K
5. fdatasync()
Crash now
On recovery, we see that the file size is 4210688 and not 16K, which
violates the semantics of keep_size flag. We have a test case to
reproduce this using CrashMonkey on 4.15 kernel. Try this out by
simply running :
./c_harness -f /dev/sda -d /dev/cow_ram0 -t f2fs -e 102400 -P -v
tests/generic_468_zero.so
The root cause is that we miss to set KEEP_SIZE bit correctly in zero_range
when zeroing block cross EOF with FALLOC_FL_KEEP_SIZE, let's fix this
missing case.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 0d9366d67bcf066b028e57d09c9a86ce879bcc28 ]
If mount is auto-probing for filesystem type, it will try various
filesystems in order, with the MS_SILENT flag set. We get
that flag as the silent arg to ext4_fill_super.
If we're probing (silent==1) then don't complain about feature
incompatibilities that are found if it looks like it's actually
a different valid extN type - failed probes should be silent
in this case.
If the on-disk features are unknown even to ext4, then complain.
Reported-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
Tested-by: Joakim Tjernlund <Joakim.Tjernlund@infinera.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
are created with quota enabled
[ Upstream commit 4d31778aa2fa342f5f92ca4025b293a1729161d1 ]
When multiple pending snapshots referring to the same source subvolume
are executed, enabled quota will cause root item corruption, where root
items are using old bytenr (no backref in extent tree).
This can be triggered by fstests btrfs/152.
The cause is when source subvolume is still dirty, extra commit
(simplied transaction commit) of qgroup_account_snapshot() can skip
dirty roots not recorded in current transaction, making root item of
source subvolume not updated.
Fix it by forcing recording source subvolume in current transaction
before qgroup sub-transaction commit.
Reported-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 8a5a916d9a35e13576d79cc16e24611821b13e34 ]
While running btrfs/011, I hit the following lockdep splat.
This is the important bit:
pcpu_alloc+0x1ac/0x5e0
__percpu_counter_init+0x4e/0xb0
btrfs_init_fs_root+0x99/0x1c0 [btrfs]
btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
resolve_indirect_refs+0x130/0x830 [btrfs]
find_parent_nodes+0x69e/0xff0 [btrfs]
btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
btrfs_find_all_roots+0x50/0x70 [btrfs]
btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
The percpu_counter_init call in btrfs_alloc_subvolume_writers
uses GFP_KERNEL, which we can't do during transaction commit.
This switches it to GFP_NOFS.
========================================================
WARNING: possible irq lock inversion dependency detected
4.12.14-kvmsmall #8 Tainted: G W
--------------------------------------------------------
kswapd0/50 just changed the state of lock:
(&delayed_node->mutex){+.+.-.}, at: [<ffffffffc06994fa>] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
(pcpu_alloc_mutex){+.+.+.}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(pcpu_alloc_mutex);
local_irq_disable();
lock(&delayed_node->mutex);
lock(&found->groups_sem);
<Interrupt>
lock(&delayed_node->mutex);
*** DEADLOCK ***
2 locks held by kswapd0/50:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff811dc11f>] shrink_slab+0x7f/0x5b0
#1: (&type->s_umount_key#30){+++++.}, at: [<ffffffff8126dec6>] trylock_super+0x16/0x50
the shortest dependencies between 2nd lock and 1st lock:
-> (pcpu_alloc_mutex){+.+.+.} ops: 4904 {
HARDIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
pcpu_alloc+0x1ac/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
__do_tune_cpucache+0x2c/0x220
do_tune_cpucache+0x26/0xc0
enable_cpucache+0x6d/0xf0
kmem_cache_init_late+0x42/0x75
start_kernel+0x343/0x4cb
x86_64_start_kernel+0x127/0x134
secondary_startup_64+0xa5/0xb0
SOFTIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
pcpu_alloc+0x1ac/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
__do_tune_cpucache+0x2c/0x220
do_tune_cpucache+0x26/0xc0
enable_cpucache+0x6d/0xf0
kmem_cache_init_late+0x42/0x75
start_kernel+0x343/0x4cb
x86_64_start_kernel+0x127/0x134
secondary_startup_64+0xa5/0xb0
RECLAIM_FS-ON-W at:
__kmalloc+0x47/0x310
pcpu_extend_area_map+0x2b/0xc0
pcpu_alloc+0x3ec/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
__do_tune_cpucache+0x2c/0x220
do_tune_cpucache+0x26/0xc0
enable_cpucache+0x6d/0xf0
__kmem_cache_create+0x1bf/0x390
create_cache+0xba/0x1b0
kmem_cache_create+0x1f8/0x2b0
ksm_init+0x6f/0x19d
do_one_initcall+0x50/0x1b0
kernel_init_freeable+0x201/0x289
kernel_init+0xa/0x100
ret_from_fork+0x3a/0x50
INITIAL USE at:
__mutex_lock+0x4e/0x8c0
pcpu_alloc+0x1ac/0x5e0
alloc_kmem_cache_cpus.isra.70+0x25/0xa0
setup_cpu_cache+0x2f/0x1f0
__kmem_cache_create+0x1bf/0x390
create_boot_cache+0x8b/0xb1
kmem_cache_init+0xa1/0x19e
start_kernel+0x270/0x4cb
x86_64_start_kernel+0x127/0x134
secondary_startup_64+0xa5/0xb0
}
... key at: [<ffffffff821d8e70>] pcpu_alloc_mutex+0x70/0xa0
... acquired at:
pcpu_alloc+0x1ac/0x5e0
__percpu_counter_init+0x4e/0xb0
btrfs_init_fs_root+0x99/0x1c0 [btrfs]
btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
resolve_indirect_refs+0x130/0x830 [btrfs]
find_parent_nodes+0x69e/0xff0 [btrfs]
btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
btrfs_find_all_roots+0x50/0x70 [btrfs]
btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
transaction_kthread+0x176/0x1b0 [btrfs]
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
-> (&fs_info->commit_root_sem){++++..} ops: 1566382 {
HARDIRQ-ON-W at:
down_write+0x3e/0xa0
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
cow_file_range.isra.66+0x133/0x470 [btrfs]
run_delalloc_range+0x121/0x410 [btrfs]
writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
__extent_writepage+0x19a/0x360 [btrfs]
extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
extent_writepages+0x4d/0x60 [btrfs]
do_writepages+0x1a/0x70
__filemap_fdatawrite_range+0xa7/0xe0
btrfs_rename+0x5ee/0xdb0 [btrfs]
vfs_rename+0x52a/0x7e0
SyS_rename+0x351/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
HARDIRQ-ON-R at:
down_read+0x35/0x90
caching_thread+0x57/0x560 [btrfs]
normal_work_helper+0x1c0/0x5e0 [btrfs]
process_one_work+0x1e0/0x5c0
worker_thread+0x44/0x390
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
SOFTIRQ-ON-W at:
down_write+0x3e/0xa0
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
cow_file_range.isra.66+0x133/0x470 [btrfs]
run_delalloc_range+0x121/0x410 [btrfs]
writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
__extent_writepage+0x19a/0x360 [btrfs]
extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
extent_writepages+0x4d/0x60 [btrfs]
do_writepages+0x1a/0x70
__filemap_fdatawrite_range+0xa7/0xe0
btrfs_rename+0x5ee/0xdb0 [btrfs]
vfs_rename+0x52a/0x7e0
SyS_rename+0x351/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-R at:
down_read+0x35/0x90
caching_thread+0x57/0x560 [btrfs]
normal_work_helper+0x1c0/0x5e0 [btrfs]
process_one_work+0x1e0/0x5c0
worker_thread+0x44/0x390
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
INITIAL USE at:
down_write+0x3e/0xa0
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
cow_file_range.isra.66+0x133/0x470 [btrfs]
run_delalloc_range+0x121/0x410 [btrfs]
writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
__extent_writepage+0x19a/0x360 [btrfs]
extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
extent_writepages+0x4d/0x60 [btrfs]
do_writepages+0x1a/0x70
__filemap_fdatawrite_range+0xa7/0xe0
btrfs_rename+0x5ee/0xdb0 [btrfs]
vfs_rename+0x52a/0x7e0
SyS_rename+0x351/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
}
... key at: [<ffffffffc0729578>] __key.61970+0x0/0xfffffffffff9aa88 [btrfs]
... acquired at:
cache_block_group+0x287/0x420 [btrfs]
find_free_extent+0x106c/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
btrfs_create_tree+0xbb/0x2a0 [btrfs]
btrfs_create_uuid_tree+0x37/0x140 [btrfs]
open_ctree+0x23c0/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> (&found->groups_sem){++++..} ops: 2134587 {
HARDIRQ-ON-W at:
down_write+0x3e/0xa0
__link_block_group+0x34/0x130 [btrfs]
btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
open_ctree+0x2054/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
HARDIRQ-ON-R at:
down_read+0x35/0x90
btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
open_ctree+0x207b/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-W at:
down_write+0x3e/0xa0
__link_block_group+0x34/0x130 [btrfs]
btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
open_ctree+0x2054/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-R at:
down_read+0x35/0x90
btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
open_ctree+0x207b/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
INITIAL USE at:
down_write+0x3e/0xa0
__link_block_group+0x34/0x130 [btrfs]
btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
open_ctree+0x2054/0x2660 [btrfs]
btrfs_mount+0xd36/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
btrfs_mount+0x18c/0xf90 [btrfs]
mount_fs+0x3a/0x160
vfs_kern_mount+0x66/0x150
do_mount+0x1c1/0xcc0
SyS_mount+0x7e/0xd0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
}
... key at: [<ffffffffc0729488>] __key.59101+0x0/0xfffffffffff9ab78 [btrfs]
... acquired at:
find_free_extent+0xcb4/0x12d0 [btrfs]
btrfs_reserve_extent+0xd8/0x170 [btrfs]
btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
__btrfs_cow_block+0x110/0x5b0 [btrfs]
btrfs_cow_block+0xd7/0x290 [btrfs]
btrfs_search_slot+0x1f6/0x960 [btrfs]
btrfs_lookup_inode+0x2a/0x90 [btrfs]
__btrfs_update_delayed_inode+0x65/0x210 [btrfs]
btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs]
btrfs_evict_inode+0x3fe/0x6a0 [btrfs]
evict+0xc4/0x190
__dentry_kill+0xbf/0x170
dput+0x2ae/0x2f0
SyS_rename+0x2a6/0x3b0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> (&delayed_node->mutex){+.+.-.} ops: 5580204 {
HARDIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
btrfs_update_inode+0x83/0x110 [btrfs]
btrfs_dirty_inode+0x62/0xe0 [btrfs]
touch_atime+0x8c/0xb0
do_generic_file_read+0x818/0xb10
__vfs_read+0xdc/0x150
vfs_read+0x8a/0x130
SyS_read+0x45/0xa0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
SOFTIRQ-ON-W at:
__mutex_lock+0x4e/0x8c0
btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
btrfs_update_inode+0x83/0x110 [btrfs]
btrfs_dirty_inode+0x62/0xe0 [btrfs]
touch_atime+0x8c/0xb0
do_generic_file_read+0x818/0xb10
__vfs_read+0xdc/0x150
vfs_read+0x8a/0x130
SyS_read+0x45/0xa0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
IN-RECLAIM_FS-W at:
__mutex_lock+0x4e/0x8c0
__btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
btrfs_evict_inode+0x22c/0x6a0 [btrfs]
evict+0xc4/0x190
dispose_list+0x35/0x50
prune_icache_sb+0x42/0x50
super_cache_scan+0x139/0x190
shrink_slab+0x262/0x5b0
shrink_node+0x2eb/0x2f0
kswapd+0x2eb/0x890
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
INITIAL USE at:
__mutex_lock+0x4e/0x8c0
btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
btrfs_update_inode+0x83/0x110 [btrfs]
btrfs_dirty_inode+0x62/0xe0 [btrfs]
touch_atime+0x8c/0xb0
do_generic_file_read+0x818/0xb10
__vfs_read+0xdc/0x150
vfs_read+0x8a/0x130
SyS_read+0x45/0xa0
do_syscall_64+0x79/0x1e0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
}
... key at: [<ffffffffc072d488>] __key.56935+0x0/0xfffffffffff96b78 [btrfs]
... acquired at:
__lock_acquire+0x264/0x11c0
lock_acquire+0xbd/0x1e0
__mutex_lock+0x4e/0x8c0
__btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
btrfs_evict_inode+0x22c/0x6a0 [btrfs]
evict+0xc4/0x190
dispose_list+0x35/0x50
prune_icache_sb+0x42/0x50
super_cache_scan+0x139/0x190
shrink_slab+0x262/0x5b0
shrink_node+0x2eb/0x2f0
kswapd+0x2eb/0x890
kthread+0x102/0x140
ret_from_fork+0x3a/0x50
stack backtrace:
CPU: 1 PID: 50 Comm: kswapd0 Tainted: G W 4.12.14-kvmsmall #8 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x78/0xb7
print_irq_inversion_bug.part.38+0x19f/0x1aa
check_usage_forwards+0x102/0x120
? ret_from_fork+0x3a/0x50
? check_usage_backwards+0x110/0x110
mark_lock+0x16c/0x270
__lock_acquire+0x264/0x11c0
? pagevec_lookup_entries+0x1a/0x30
? truncate_inode_pages_range+0x2b3/0x7f0
lock_acquire+0xbd/0x1e0
? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
__mutex_lock+0x4e/0x8c0
? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
? btrfs_evict_inode+0x1f6/0x6a0 [btrfs]
__btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
btrfs_evict_inode+0x22c/0x6a0 [btrfs]
evict+0xc4/0x190
dispose_list+0x35/0x50
prune_icache_sb+0x42/0x50
super_cache_scan+0x139/0x190
shrink_slab+0x262/0x5b0
shrink_node+0x2eb/0x2f0
kswapd+0x2eb/0x890
kthread+0x102/0x140
? mem_cgroup_shrink_node+0x2c0/0x2c0
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x3a/0x50
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 8434ec46c6e3232cebc25a910363b29f5c617820 ]
When logging an inode, at tree-log.c:copy_items(), if we call
btrfs_next_leaf() at the loop which checks for the need to log holes, we
need to make sure copy_items() returns the value 1 to its caller and
not 0 (on success). This is because the path the caller passed was
released and is now different from what is was before, and the caller
expects a return value of 0 to mean both success and that the path
has not changed, while a return value of 1 means both success and
signals the caller that it can not reuse the path, it has to perform
another tree search.
Even though this is a case that should not be triggered on normal
circumstances or very rare at least, its consequences can be very
unpredictable (especially when replaying a log tree).
Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 3c0efdf03b2d127f0e40e30db4e7aa0429b1b79a ]
The extent tree of the test fs is like the following:
BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
item 0 key (4096 168 4096) itemoff 3944 itemsize 51
extent refs 1 gen 1 flags 2
tree block key (68719476736 0 0) level 1
^^^^^^^
ref#0: tree block backref root 5
And it's using an empty tree for fs tree, so there is no way that its
level can be 1.
For REAL (created by mkfs) fs tree backref with no skinny metadata, the
result should look like:
item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
refs 1 gen 4 flags TREE_BLOCK
tree block key (256 INODE_ITEM 0) level 0
^^^^^^^
tree block backref root 5
Fix the level to 0, so it won't break later tree level checker.
Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 2c98425720233ae3e135add0c7e869b32913502f ]
If the fscache asynchronous write operation elects to discard a page that's
pending storage to the cache because the page would be over the store limit
then it needs to wake the page as someone may be waiting on completion of
the write.
The problem is that the store limit may be updated by a different
asynchronous operation - and so may miss the write - and that the store
limit may not even get updated until later by the netfs.
Fix the kernel hang by making fscache_write_op() mark as written any pages
that are over the limit.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit bb34f24c7d2c98d0c81838a7700e6068325b17a0 ]
We should not handle migrate lockres if we are already in
'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres remains after leaving
dlm domain. At last other nodes will get stuck into infinite loop when
requsting lock from us.
The problem is caused by concurrency umount between nodes. Before
receiveing N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as the
migrate target. So N2 will continue sending lockres to N1 even though
N1 has left domain.
N1 N2 (owner)
touch file
access the file,
and get pr lock
begin leave domain and
pick up N1 as new owner
begin leave domain and
migrate all lockres done
begin migrate lockres to N1
end leave domain, but
the lockres left
unexpectedly, because
migrate task has passed
[piaojun@huawei.com: v3]
Link: http://lkml.kernel.org/r/5A9CBD19.5020107@huawei.com
Link: http://lkml.kernel.org/r/5A99F028.2090902@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 1e1c50a929bc9e49bc3f9935b92450d9e69f8158 ]
do_chunk_alloc implements a loop checking whether there is a pending
chunk allocation and if so causes the caller do loop. Generally this
loop is executed only once, however testing with btrfs/072 on a single
core vm machines uncovered an extreme case where the system could loop
indefinitely. This is due to a missing cond_resched when loop which
doesn't give a chance to the previous chunk allocator finish its job.
The fix is to simply add the missing cond_resched.
Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 80c0b4210a963e31529e15bf90519708ec947596 ]
0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
returned, path->nodes[0] could be NULL, log_dir_items lacks such a
check for <0 and we may run into a null pointer dereference panic.
Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit b98def7ca6e152ee55e36863dddf6f41f12d1dc6 ]
If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
to bail out, otherwise @ret would be forced to be 0 after 'break;' and
the caller won't be aware of it.
Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 8c81dd46ef3c416b3b95e3020fb90dbd44e6140b ]
Forcing the log to disk after reading the agf is wrong, we might be
calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held.
This can cause a deadlock when racing a fstrim with a filesystem
shutdown.
The deadlock has been identified due a miscalculation bug in device-mapper
dm-thin, which returns lack of space to its users earlier than the device itself
really runs out of space, changing the device-mapper volume into an error state.
The problem happened while filling the filesystem with a single file,
triggering the bug in device-mapper, consequently causing an IO error
and shutting down the filesystem.
If such file is removed, and fstrim executed before the XFS finishes the
shut down process, the fstrim process will end up holding the buffer
lock, and going to sleep on the cil wait queue.
At this point, the shut down process will try to wake up all the threads
waiting on the cil wait queue, but for this, it will try to hold the
same buffer log already held my the fstrim, locking up the filesystem.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit a0b0d1c345d0317efe594df268feb5ccc99f651e ]
proc_sys_link_fill_cache() does not take currently unregistering sysctl
tables into account, which might result into a page fault in
sysctl_follow_link() - add a check to fix it.
This bug has been present since v3.4.
Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 471d557afed155b85da237ec46c549f443eeb5de ]
Currently if we allocate extents beyond an inode's i_size (through the
fallocate system call) and then fsync the file, we log the extents but
after a power failure we replay them and then immediately drop them.
This behaviour happens since about 2009, commit c71bf099abdd ("Btrfs:
Avoid orphan inodes cleanup while replaying log"), because it marks
the inode as an orphan instead of dropping any extents beyond i_size
before replaying logged extents, so after the log replay, and while
the mount operation is still ongoing, we find the inode marked as an
orphan and then perform a truncation (drop extents beyond the inode's
i_size). Because the processing of orphan inodes is still done
right after replaying the log and before the mount operation finishes,
the intention of that commit does not make any sense (at least as
of today). However reverting that behaviour is not enough, because
we can not simply discard all extents beyond i_size and then replay
logged extents, because we risk dropping extents beyond i_size created
in past transactions, for example:
add prealloc extent beyond i_size
fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
transaction commit
add another prealloc extent beyond i_size
fsync - triggers the fast fsync path
power failure
In that scenario, we would drop the first extent and then replay the
second one. To fix this just make sure that all prealloc extents
beyond i_size are logged, and if we find too many (which is far from
a common case), fallback to a full transaction commit (like we do when
logging regular extents in the fast fsync path).
Trivial reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
$ sync
$ xfs_io -c "falloc -k 256K 1M" /mnt/foo
$ xfs_io -c "fsync" /mnt/foo
<power failure>
# mount to replay log
$ mount /dev/sdb /mnt
# at this point the file only has one extent, at offset 0, size 256K
A test case for fstests follows soon, covering multiple scenarios that
involve adding prealloc extents with previous shrinking truncates and
without such truncates.
Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit af7227338135d2f1b1552bf9a6d43e02dcba10b9 ]
Currently if some fatal errors occur, like all IO get -EIO, resources
would be cleaned up when
a) transaction is being committed or
b) BTRFS_FS_STATE_ERROR is set
However, in some rare cases, resources may be left alone after transaction
gets aborted and umount may run into some ASSERT(), e.g.
ASSERT(list_empty(&block_group->dirty_list));
For case a), in btrfs_commit_transaciton(), there're several places at the
beginning where we just call btrfs_end_transaction() without cleaning up
resources. For case b), it is possible that the trans handle doesn't have
any dirty stuff, then only trans hanlde is marked as aborted while
BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
all resources won't stay in memory after umount.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 1c789249578895bb14ab62b4327306439b754857 ]
There is lack of cache destroy operation for ceph_file_cachep
when failing from fscache register.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 9a6509c4daa91400b52a5fd541a5521c649a8fea ]
If in the same transaction we rename a special file (fifo, character/block
device or symbolic link), create a hard link for it having its old name
then sync the log, we will end up with a log that can not be replayed and
at when attempting to replay it, an EEXIST error is returned and mounting
the filesystem fails. Example scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ mkfifo /mnt/testdir/foo
# Make sure everything done so far is durably persisted.
$ sync
# Create some unrelated file and fsync it, this is just to create a log
# tree. The file must be in the same directory as our special file.
$ touch /mnt/testdir/f1
$ xfs_io -c "fsync" /mnt/testdir/f1
# Rename our special file and then create a hard link with its old name.
$ mv /mnt/testdir/foo /mnt/testdir/bar
$ ln /mnt/testdir/bar /mnt/testdir/foo
# Create some other unrelated file and fsync it, this is just to persist
# the log tree which was modified by the previous rename and link
# operations. Alternatively we could have modified file f1 and fsync it.
$ touch /mnt/f2
$ xfs_io -c "fsync" /mnt/f2
<power failure>
$ mount /dev/sdc /mnt
mount: mount /dev/sdc on /mnt failed: File exists
This happens because when both the log tree and the subvolume's tree have
an entry in the directory "testdir" with the same name, that is, there
is one key (258 INODE_REF 257) in the subvolume tree and another one in
the log tree (where 258 is the inode number of our special file and 257
is the inode for directory "testdir"). Only the data of those two keys
differs, in the subvolume tree the index field for inode reference has
a value of 3 while the log tree it has a value of 5. Because the same key
exists in both trees, but have different index, the log replay fails with
an -EEXIST error when attempting to replay the inode reference from the
log tree.
Fix this by setting the last_unlink_trans field of the inode (our special
file) to the current transaction id when a hard link is created, as this
forces logging the parent directory inode, solving the conflict at log
replay time.
A new generic test case for fstests was also submitted.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit d4dfc0f4d39475ccbbac947880b5464a74c30b99 ]
When doing an incremental send of a filesystem with the no-holes feature
enabled, we end up issuing a write operation when using the no data mode
send flag, instead of issuing an update extent operation. Fix this by
issuing the update extent operation instead.
Trivial reproducer:
$ mkfs.btrfs -f -O no-holes /dev/sdc
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdc /mnt/sdc
$ mount /dev/sdd /mnt/sdd
$ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1
$ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
$ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2
$ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
$ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
| btrfs receive -vv /mnt/sdd
Before this change the output of the second receive command is:
receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
utimes
write foobar, offset 8192, len 8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...
After this change it is:
receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
utimes
update_extent foobar: offset=8192, len=8192
utimes foobar
BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit a8fd1f71749387c9a1053a83ff1c16287499a4e7 ]
The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On
kernels built with NR_CPUS=8192, this can result in kmalloc failures
that prevent mounting.
There is work in progress to try to resolve this for every user of
srcu_struct but using kvzalloc will work around the failures until
that is complete.
As an example with NR_CPUS=512 on x86_64: the overall size of
subvol_srcu is 3460 bytes, fs_info is 6496.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 18106734b512664a8541026519ce4b862498b6c3 ]
When failing from ceph_fs_debugfs_init() in ceph_real_mount(),
there is lack of dput of root_dentry and it causes slab errors,
so change the calling order of ceph_fs_debugfs_init() and
open_root_dentry() and do some cleanups to avoid this issue.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 937441f3a3158d5510ca8cc78a82453f57a96365 ]
When parsing string option, in order to avoid memory leak we need to
carefully free it first in case of specifying same option several times.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 8cc07c808c9d595e81cbe5aad419b7769eb2e5c9 ]
i_dir_seq is subject to concurrent modification by a cmpxchg or
store-release operation, so ensure that the relaxed access in
d_alloc_parallel uses READ_ONCE.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 015555fd4d2930bc0c86952c46ad88b3392f66e4 ]
If d_alloc_parallel runs concurrently with __d_add, it is possible for
d_alloc_parallel to continuously retry whilst i_dir_seq has been
incremented to an odd value by __d_add:
CPU0:
__d_add
n = start_dir_add(dir);
cmpxchg(&dir->i_dir_seq, n, n + 1) == n
CPU1:
d_alloc_parallel
retry:
seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
hlist_bl_lock(b);
bit_spin_lock(0, (unsigned long *)b); // Always succeeds
CPU0:
__d_lookup_done(dentry)
hlist_bl_lock
bit_spin_lock(0, (unsigned long *)b); // Never succeeds
CPU1:
if (unlikely(parent->d_inode->i_dir_seq != seq)) {
hlist_bl_unlock(b);
goto retry;
}
Since the simple bit_spin_lock used to implement hlist_bl_lock does not
provide any fairness guarantees, then CPU1 can starve CPU0 of the lock
and prevent it from reaching end_dir_add(dir), therefore CPU1 cannot
exit its retry loop because the sequence number always has the bottom
bit set.
This patch resolves the livelock by not taking hlist_bl_lock in
d_alloc_parallel if the sequence counter is odd, since any subsequent
masked comparison with i_dir_seq will fail anyway.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Naresh Madhusudana <naresh.madhusudana@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit ad86f605c59500da82d196ac312cfbac3daba31d ]
nfs4_update_server unconditionally releases the nfs_client for the
source server. If migration fails, this can cause the source server's
nfs_client struct to be left with a low reference count, resulting in
use-after-free. Also, adjust reference count handling for ELOOP.
NFS: state manager: migration failed on NFSv4 server nfsvmu10 with error 6
WARNING: CPU: 16 PID: 17960 at fs/nfs/client.c:281 nfs_put_client+0xfa/0x110 [nfs]()
nfs_put_client+0xfa/0x110 [nfs]
nfs4_run_state_manager+0x30/0x40 [nfsv4]
kthread+0xd8/0xf0
BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8
nfs4_xdr_enc_write+0x6b/0x160 [nfsv4]
rpcauth_wrap_req+0xac/0xf0 [sunrpc]
call_transmit+0x18c/0x2c0 [sunrpc]
__rpc_execute+0xa6/0x490 [sunrpc]
rpc_async_schedule+0x15/0x20 [sunrpc]
process_one_work+0x160/0x470
worker_thread+0x112/0x540
? rescuer_thread+0x3f0/0x3f0
kthread+0xd8/0xf0
This bug was introduced by 32e62b7c ("NFS: Add nfs4_update_server"),
but the fix applies cleanly to 52442f9b ("NFS4: Avoid migration loops")
Reported-by: Helen Chao <helen.chao@oracle.com>
Fixes: 52442f9b11b7 ("NFS4: Avoid migration loops")
Signed-off-by: Bill Baker <bill.baker@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream.
For anything NFS-exported we do _not_ want to unlock new inode
before it has grown an alias; original set of fixes got the
ordering right, but missed the nasty complication in case of
lockdep being enabled - unlock_new_inode() does
lockdep_annotate_inode_mutex_key(inode)
which can only be done before anyone gets a chance to touch
->i_mutex. Unfortunately, flipping the order and doing
unlock_new_inode() before d_instantiate() opens a window when
mkdir can race with open-by-fhandle on a guessed fhandle, leading
to multiple aliases for a directory inode and all the breakage
that follows from that.
Correct solution: a new primitive (d_instantiate_new())
combining these two in the right order - lockdep annotate, then
d_instantiate(), then the rest of unlock_new_inode(). All
combinations of d_instantiate() with unlock_new_inode() should
be converted to that.
Cc: stable@kernel.org # 2.6.29 and later
Tested-by: Mike Marshall <hubcap@omnibond.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit baf10564fbb66ea222cae66fbff11c444590ffd9 upstream.
kill_ioctx() used to have an explicit RCU delay between removing the
reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
At some point that delay had been removed, on the theory that
percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was
the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
by lookup_ioctx(). As the result, we could get ctx freed right under
lookup_ioctx(). Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit
RCU grace period when freeing kioctx"); however, that fix is not enough.
Suppose io_destroy() from one thread races with e.g. io_setup() from another;
CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the
refcount, getting it to 0 and triggering a call of free_ioctx_users(),
which proceeds to drop the secondary refcount and once that reaches zero
calls free_ioctx_reqs(). That does
INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
queue_rcu_work(system_wq, &ctx->free_rwork);
and schedules freeing the whole thing after RCU delay.
In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
refcount from 0 to 1 and returned the reference to io_setup().
Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
freed until after percpu_ref_get(). Sure, we'd increment the counter before
ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to
stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it
has grabbed the reference, ctx is *NOT* going away until it gets around to
dropping that reference.
The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss.
It's not costlier than what we currently do in normal case, it's safe to
call since freeing *is* delayed and it closes the race window - either
lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
the object in question at all.
Cc: stable@kernel.org
Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 79f546a696bff2590169fb5684e23d65f4d9f591 upstream.
We recently had an oops reported on a 4.14 kernel in
xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
and so the m_perag_tree lookup walked into lala land. It produces
an oops down this path during the failed mount:
radix_tree_gang_lookup_tag+0xc4/0x130
xfs_perag_get_tag+0x37/0xf0
xfs_reclaim_inodes_count+0x32/0x40
xfs_fs_nr_cached_objects+0x11/0x20
super_cache_count+0x35/0xc0
shrink_slab.part.66+0xb1/0x370
shrink_node+0x7e/0x1a0
try_to_free_pages+0x199/0x470
__alloc_pages_slowpath+0x3a1/0xd20
__alloc_pages_nodemask+0x1c3/0x200
cache_grow_begin+0x20b/0x2e0
fallback_alloc+0x160/0x200
kmem_cache_alloc+0x111/0x4e0
The problem is that the superblock shrinker is running before the
filesystem structures it depends on have been fully set up. i.e.
the shrinker is registered in sget(), before ->fill_super() has been
called, and the shrinker can call into the filesystem before
fill_super() does it's setup work. Essentially we are exposed to
both use-after-free and use-before-initialisation bugs here.
To fix this, add a check for the SB_BORN flag in super_cache_count.
In general, this flag is not set until ->fs_mount() completes
successfully, so we know that it is set after the filesystem
setup has completed. This matches the trylock_super() behaviour
which will not let super_cache_scan() run if SB_BORN is not set, and
hence will not allow the superblock shrinker from entering the
filesystem while it is being set up or after it has failed setup
and is being torn down.
Cc: stable@kernel.org
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 30da870ce4a4e007c901858a96e9e394a1daa74a upstream.
we unlock the directory hash too early - if we are looking at secondary
link and primary (in another directory) gets removed just as we unlock,
we could have the old primary moved in place of the secondary, leaving
us to look into freed entry (and leaving our dentry with ->d_fsdata
pointing to a freed entry).
Cc: stable@vger.kernel.org # 2.4.4+
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 66072c29328717072fd84aaff3e070e3f008ba77 upstream.
syzbot is reporting ODEBUG messages at hfsplus_fill_super() [1]. This
is because hfsplus_fill_super() forgot to call cancel_delayed_work_sync().
As far as I can see, it is hfsplus_mark_mdb_dirty() from
hfsplus_new_inode() in hfsplus_fill_super() that calls
queue_delayed_work(). Therefore, I assume that hfsplus_new_inode() does
not fail if queue_delayed_work() was called, and the out_put_hidden_dir
label is the appropriate location to call cancel_delayed_work_sync().
[1] https://syzkaller.appspot.com/bug?id=a66f45e96fdbeb76b796bf46eb25ea878c42a6c9
Link: http://lkml.kernel.org/r/964a8b27-cd69-357c-fe78-76b066056201@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+4f2e5f086147d543ab03@syzkaller.appspotmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 5aa1437d2d9a068c0334bd7c9dafa8ec4f97f13b upstream.
open file, unlink it, then use ioctl(2) to make it immutable or
append only. Now close it and watch the blocks *not* freed...
Immutable/append-only checks belong in ->setattr().
Note: the bug is old and backport to anything prior to 737f2e93b972
("ext2: convert to use the new truncate convention") will need
these checks lifted into ext2_setattr().
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
commit e96f46ee8587607a828f783daa6eb5b44d25004d upstream
The style for the 'status' file is CamelCase or this. _.
Fixes: fae1fa0fc ("proc: Provide details on speculation flaw mitigations")
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 356e4bfff2c5489e016fdb925adbf12a1e3950ee upstream
For certain use cases it is desired to enforce mitigations so they cannot
be undone afterwards. That's important for loader stubs which want to
prevent a child from disabling the mitigation again. Will also be used for
seccomp(). The extra state preserving of the prctl state for SSB is a
preparatory step for EBPF dymanic speculation control.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
commit fae1fa0fc6cca8beee3ab8ed71d54f9a78fa3f64 upstream
As done with seccomp and no_new_privs, also show speculation flaw
mitigation state in /proc/$pid/status.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 02a3307aa9c20b4f6626255b028f07f6cfa16feb upstream.
If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.
btrfs_search_slot()
read_block_for_search() # hold parent and its lock, go to read child
btrfs_release_path()
read_tree_block() # read child
Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block. It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.
A simple PoC is included in btrfs/124,
0. A two-disk raid1 btrfs,
1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
2. Mount this filesystem and put it in use, after a while, device tree's
root got COW but block A hasn't been allocated/overwritten yet.
3. Umount it and reload the btrfs module to remove both disks from the
global @fs_devices list.
4. mount -odegraded dev1 and write some data, so now block A is allocated
to be a leaf in checksum tree. Note that only dev1 has the latest
metadata of this filesystem.
5. Umount it and mount it again normally (with both disks), since raid1
can pick up one disk by the writer task's pid, if btrfs_search_slot()
needs to read block A, dev2 which does NOT have the latest metadata
might be read for block A, then we got a stale block A.
6. As parent transid is not checked, block A is marked as uptodate and
put into the extent buffer cache, so the future search won't bother
to read disk again, which means it'll make changes on this stale
one and make it dirty and flush it onto disk.
To avoid the problem, parent transid needs to be passed to
read_tree_block().
In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.
This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760415c4). The fix is to replace 0 by 'gen'.
Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit fe816d0f1d4c31c4c31d42ca78a87660565fc800 upstream.
When a transaction is aborted btrfs_cleanup_transaction is called to
cleanup all the various in-flight bits and pieces which migth be
active. One of those is delalloc inodes - inodes which have dirty
pages which haven't been persisted yet. Currently the process of
freeing such delalloc inodes in exceptional circumstances such as
transaction abort boiled down to calling btrfs_invalidate_inodes whose
sole job is to invalidate the dentries for all inodes related to a
root. This is in fact wrong and insufficient since such delalloc inodes
will likely have pending pages or ordered-extents and will be linked to
the sb->s_inode_list. This means that unmounting a btrfs instance with
an aborted transaction could potentially lead inodes/their pages
visible to the system long after their superblock has been freed. This
in turn leads to a "use-after-free" situation once page shrink is
triggered. This situation could be simulated by running generic/019
which would cause such inodes to be left hanging, followed by
generic/176 which causes memory pressure and page eviction which lead
to touching the freed super block instance. This situation is
additionally detected by the unmount code of VFS with the following
message:
"VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..."
Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
in free_fs_root for the same reason.
This patch aims to rectify the sitaution by doing the following:
1. Change btrfs_destroy_delalloc_inodes so that it calls
invalidate_inode_pages2 for every inode on the delalloc list, this
ensures that all the pages of the inode are released. This function
boils down to calling btrfs_releasepage. During test I observed cases
where inodes on the delalloc list were having an i_count of 0, so this
necessitates using igrab to be sure we are working on a non-freed inode.
2. Since calling btrfs_releasepage might queue delayed iputs move the
call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
calling run_delayed_iputs for the last time. This is necessary to ensure
that delayed iputs are run.
Note: this patch is tagged for 4.14 stable but the fix applies to older
versions too but needs to be backported manually due to conflicts.
CC: stable@vger.kernel.org # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions
CC: stable@vger.kernel.org # 4.14.x
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment to igrab ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 2b8773313494ede83a26fb372466e634564002ed upstream.
This is in preparation of fixing delalloc inodes leakage on transaction
abort. Also export the new function.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 02ee654d3a04563c67bfe658a05384548b9bb105 upstream.
We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance()
only, which isn't called during the remount. So when resuming from
the paused balance we hit the bug:
kernel: kernel BUG at fs/btrfs/volumes.c:3890!
::
kernel: balance_kthread+0x51/0x60 [btrfs]
kernel: kthread+0x111/0x130
::
kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8
Reproducer:
On a mounted filesystem:
btrfs balance start --full-balance /btrfs
btrfs balance pause /btrfs
mount -o remount,ro /dev/sdb /btrfs
mount -o remount,rw /dev/sdb /btrfs
To fix this set the BTRFS_BALANCE_RESUME flag in
btrfs_resume_balance_async().
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 1a63c198ddb810c790101d693c7071cca703b3c7 upstream.
Incompat flag of LZO/ZSTD compression should be set at:
1. mount time (-o compress/compress-force)
2. when defrag is done
3. when property is set
Currently 3. is missing and this commit adds this.
This could lead to a filesystem that uses ZSTD but is not marked as
such. If a kernel without a ZSTD support encounteres a ZSTD compressed
extent, it will handle that but this could be confusing to the user.
Typically the filesystem is mounted with the ZSTD option, but the
discrepancy can arise when a filesystem is never mounted with ZSTD and
then the property on some file is set (and some new extents are
written). A simple mount with -o compress=zstd will fix that up on an
unpatched kernel.
Same goes for LZO, but this has been around for a very long time
(2.6.37) so it's unlikely that a pre-LZO kernel would be used.
Fixes: 5c1aab1dd544 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add user visible impact ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 upstream.
[BUG]
btrfs incremental send BUG happens when creating a snapshot of snapshot
that is being used by send.
[REASON]
The problem can happen if while we are doing a send one of the snapshots
used (parent or send) is snapshotted, because snapshoting implies COWing
the root of the source subvolume/snapshot.
1. When doing an incremental send, the send process will get the commit
roots from the parent and send snapshots, and add references to them
through extent_buffer_get().
2. When a snapshot/subvolume is snapshotted, its root node is COWed
(transaction.c:create_pending_snapshot()).
3. COWing releases the space used by the node immediately, through:
__btrfs_cow_block()
--btrfs_free_tree_block()
----btrfs_add_free_space(bytenr of node)
4. Because send doesn't hold a transaction open, it's possible that
the transaction used to create the snapshot commits, switches the
commit root and the old space used by the previous root node gets
assigned to some other node allocation. Allocation of a new node will
use the existing extent buffer found in memory, which we previously
got a reference through extent_buffer_get(), and allow the extent
buffer's content (pages) to be modified:
btrfs_alloc_tree_block
--btrfs_reserve_extent
----find_free_extent (get bytenr of old node)
--btrfs_init_new_buffer (use bytenr of old node)
----btrfs_find_create_tree_block
------alloc_extent_buffer
--------find_extent_buffer (get old node)
5. So send can access invalid memory content and have unpredictable
behaviour.
[FIX]
So we fix the problem by copying the commit roots of the send and
parent snapshots and use those copies.
CallTrace looks like this:
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:1861!
invalid opcode: 0000 [#1] SMP
CPU: 6 PID: 24235 Comm: btrfs Tainted: P O 3.10.105 #23721
ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000
RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs]
RSP: 0018:ffff88041b723b68 EFLAGS: 00010246
RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000
RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000
RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66
R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890
R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001
FS: 00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000)
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
R2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880
ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50
ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400
Call Trace:
[<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
[<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs]
[<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
[<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs]
[<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
[<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
[<ffffffff81111133>] ? handle_pte_fault+0x373/0x990
[<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20
[<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0
[<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0
[<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500
[<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90
[<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990
[<ffffffff81034f83>] ? do_fork+0x113/0x3b0
[<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0
[<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b
---[ end trace 29576629ee80b2e1 ]---
Fixes: 7069830a9e38 ("Btrfs: add btrfs_compare_trees function")
CC: stable@vger.kernel.org # 3.6+
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 9a8fca62aacc1599fea8e813d01e1955513e4fad upstream.
If a file has xattrs, we fsync it, to ensure we clear the flags
BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
inode, the current transaction commits and then we fsync it (without
either of those bits being set in its inode), we end up not logging
all its xattrs. This results in deleting all xattrs when replying the
log after a power failure.
Trivial reproducer
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foobar
$ setfattr -n user.xa -v qwerty /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
$ sync
$ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
<power failure>
$ mount /dev/sdb /mnt
$ getfattr --absolute-names --dump /mnt/foobar
<empty output>
$
So fix this by making sure all xattrs are logged if we log a file's inode
item and neither the flags BTRFS_INODE_NEEDS_FULL_SYNC nor
BTRFS_INODE_COPY_EVERYTHING were set in the inode.
Fixes: 36283bf777d9 ("Btrfs: fix fsync xattr loss in the fast fsync path")
Cc: <stable@vger.kernel.org> # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 7f7ccc2ccc2e70c6054685f5e3522efa81556830 upstream.
proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.
Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.
This was assigned CVE-2018-1120.
Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 998ac6d21cfd6efd58f5edf420bae8839dda9f2a upstream.
In preivous patch:
Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
We avoid starting btrfs transaction and get this information from
fs_info->running_transaction directly.
When accessing running_transaction in check_delayed_ref, there's a
chance that current transaction will be freed by commit transaction
after the NULL pointer check of running_transaction is passed.
After looking all the other places using fs_info->running_transaction,
they are either protected by trans_lock or holding the transactions.
Fix this by using trans_lock and increasing the use_count.
Fixes: e4c3b2dcd144 ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 6e70c267e68d77679534dcf4aaf84e66f2cf1425 upstream.
As with NFS, which ignores sync on directory handles,
fsync on a directory handle is a noop for CIFS/SMB3.
Do not return an error on it. It breaks some database
apps otherwise.
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|