| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit cba23ac158db7f3cd48a923d6861bee2eb7a2978 ]
Corrupted metadata could warrant returning error from sm_ll_lookup_bitmap().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 85bca3c05b6cca31625437eedf2060e846c4bbad ]
Corrupt metadata could trigger an out of bounds write.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
| |
commit 1b8d2789dad0005fd5e7d35dab26a8e1203fb6da upstream.
Move dm_tm_unlock() after dm_tm_dec().
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 7df835a32a8bedf7ce88efcfa7c9b245b52ff139 ]
Commit b0140891a8cea3 ("md: Fix race when creating a new md device.")
not only moved assigning mddev->gendisk before calling add_disk, which
fixes the races described in the commit log, but also added a
mddev->open_mutex critical section over add_disk and creation of the
md kobj. Adding a kobject after add_disk is racy vs deleting the gendisk
right after adding it, but md already prevents against that by holding
a mddev->active reference.
On the other hand taking this lock added a lock order reversal with what
is not disk->open_mutex (used to be bdev->bd_mutex when the commit was
added) for partition devices, which need that lock for the internal open
for the partition scan, and a recent commit also takes it for
non-partitioned devices, leading to further lockdep splatter.
Fixes: b0140891a8ce ("md: Fix race when creating a new md device.")
Fixes: d62633873590 ("block: support delayed holder registration")
Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 528b16bfc3ae5f11638e71b3b63a81f9999df727 upstream.
On systems with many cores using dm-crypt, heavy spinlock contention in
percpu_counter_compare() can be observed when the page allocation limit
for a given device is reached or close to be reached. This is due
to percpu_counter_compare() taking a spinlock to compute an exact
result on potentially many CPUs at the same time.
Switch to non-exact comparison of allocated and allowed pages by using
the value returned by percpu_counter_read_positive() to avoid taking
the percpu_counter spinlock.
This may over/under estimate the actual number of allocated pages by at
most (batch-1) * num_online_cpus().
Currently, batch is bounded by 32. The system on which this issue was
first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
allocations, this seems an acceptable error. Certainly preferred over
running into the spinlock contention.
This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
and 192GB RAM as follows, but can be provoked on systems with less CPUs
as well.
* Disable swap
* Tune vm settings to promote regular writeback
$ echo 50 > /proc/sys/vm/dirty_expire_centisecs
$ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
$ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
* Create 8 dmcrypt devices based on files on a tmpfs
* Create and mount an ext4 filesystem on each crypt devices
* Run stress-ng --hdd 8 within one of above filesystems
Total %system usage collected from sysstat goes to ~35%. Write throughput
on the underlying loop device is ~2GB/s. perf profiling an individual
kworker kcryptd thread shows the following profile, indicating spinlock
contention in percpu_counter_compare():
99.98% 0.00% kworker/u193:46 [kernel.kallsyms] [k] ret_from_fork
|
--ret_from_fork
kthread
worker_thread
|
--99.92%--process_one_work
|
|--80.52%--kcryptd_crypt
| |
| |--62.58%--mempool_alloc
| | |
| | --62.24%--crypt_page_alloc
| | |
| | --61.51%--__percpu_counter_compare
| | |
| | --61.34%--__percpu_counter_sum
| | |
| | |--58.68%--_raw_spin_lock_irqsave
| | | |
| | | --58.30%--native_queued_spin_lock_slowpath
| | |
| | --0.69%--cpumask_next
| | |
| | --0.51%--_find_next_bit
| |
| |--10.61%--crypt_convert
| | |
| | |--6.05%--xts_crypt
...
After applying this patch and running the same test, %system usage is
lowered to ~7% and write throughput on the loop device increases
to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
in the profile and not hitting the percpu_counter() spinlock anymore.
|--8.15%--mempool_alloc
| |
| |--3.93%--crypt_page_alloc
| | |
| | --3.75%--__alloc_pages
| | |
| | --3.62%--get_page_from_freelist
| | |
| | --3.22%--rmqueue_bulk
| | |
| | --2.59%--_raw_spin_lock
| | |
| | --2.57%--native_queued_spin_lock_slowpath
| |
| --3.05%--_raw_spin_lock_irqsave
| |
| --2.49%--native_queued_spin_lock_slowpath
Suggested-by: DJ Gregor <dj@corelight.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
Fixes: 5059353df86e ("dm crypt: limit the number of allocated pages")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit b6e58b5466b2959f83034bead2e2e1395cca8aeb upstream.
remove_raw() in dm_btree_remove() may fail due to IO read error
(e.g. read the content of origin block fails during shadowing),
and the value of shadow_spine::root is uninitialized, but
the uninitialized value is still assign to new_root in the
end of dm_btree_remove().
For dm-thin, the value of pmd->details_root or pmd->root will become
an uninitialized value, so if trying to read details_info tree again
out-of-bound memory may occur as showed below:
general protection fault, probably for non-canonical address 0x3fdcb14c8d7520
CPU: 4 PID: 515 Comm: dmsetup Not tainted 5.13.0-rc6
Hardware name: QEMU Standard PC
RIP: 0010:metadata_ll_load_ie+0x14/0x30
Call Trace:
sm_metadata_count_is_more_than_one+0xb9/0xe0
dm_tm_shadow_block+0x52/0x1c0
shadow_step+0x59/0xf0
remove_raw+0xb2/0x170
dm_btree_remove+0xf4/0x1c0
dm_pool_delete_thin_device+0xc3/0x140
pool_message+0x218/0x2b0
target_message+0x251/0x290
ctl_ioctl+0x1c4/0x4d0
dm_ctl_ioctl+0xe/0x20
__x64_sys_ioctl+0x7b/0xb0
do_syscall_64+0x40/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixing it by only assign new_root when removal succeeds
Signed-off-by: Hou Tao <houtao1@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 5faafc77f7de69147d1e818026b9a0cbf036a7b2 ]
Current commit code resets the place where the search for free blocks
will begin back to the start of the metadata device. There are a couple
of repercussions to this:
- The first allocation after the commit is likely to take longer than
normal as it searches for a free block in an area that is likely to
have very few free blocks (if any).
- Any free blocks it finds will have been recently freed. Reusing them
means we have fewer old copies of the metadata to aid recovery from
hardware error.
Fix these issues by leaving the cursor alone, only resetting when the
search hits the end of the metadata device.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 7e768532b2396bcb7fbf6f82384b85c0f1d2f197 upstream.
If an origin target has no snapshots, o->split_boundary is set to 0.
This causes BUG_ON(sectors <= 0) in block/bio.c:bio_split().
Fix this by initializing chunk_size, and in turn split_boundary, to
rounddown_pow_of_two(UINT_MAX) -- the largest power of two that fits
into "unsigned" type.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit c699a0db2d62e3bbb7f0bf35c87edbc8d23e3062 upstream.
The following commands will crash the kernel:
modprobe brd rd_size=1048576
dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0"
dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0"
The reason is that when we test for zero chunk size, we jump to the label
bad_read_metadata without setting the "r" variable. The function
snapshot_ctr destroys all the structures and then exits with "r == 0". The
kernel then crashes because it falsely believes that snapshot_ctr
succeeded.
In order to fix the bug, we set the variable "r" to -EINVAL.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7 upstream.
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.
This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which
will break the infinitely retry when md_open enter racing area.
This patch is partly fix soft lockup issue, full fix needs mddev_find
is split into two functions: mddev_find & mddev_find_or_alloc. And
md_open should call new mddev_find (it only does searching job).
For more detail, please refer with Christoph's "split mddev_find" patch
in later commits.
*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt
*** script ***
about trigger every time with below script
```
1 node1="mdcluster1"
2 node2="mdcluster2"
3
4 mdadm -Ss
5 ssh ${node2} "mdadm -Ss"
6 wipefs -a /dev/sda /dev/sdb
7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
/dev/sdb --assume-clean
8
9 for i in {1..10}; do
10 echo ==== $i ====;
11
12 echo "test ...."
13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14 sleep 1
15
16 echo "clean ....."
17 ssh ${node2} "mdadm -Ss"
18 done
```
I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.
*** stack ***
```
[ 884.226509] mddev_put+0x1c/0xe0 [md_mod]
[ 884.226515] md_open+0x3c/0xe0 [md_mod]
[ 884.226518] __blkdev_get+0x30d/0x710
[ 884.226520] ? bd_acquire+0xd0/0xd0
[ 884.226522] blkdev_get+0x14/0x30
[ 884.226524] do_dentry_open+0x204/0x3a0
[ 884.226531] path_openat+0x2fc/0x1520
[ 884.226534] ? seq_printf+0x4e/0x70
[ 884.226536] do_filp_open+0x9b/0x110
[ 884.226542] ? md_release+0x20/0x20 [md_mod]
[ 884.226543] ? seq_read+0x1d8/0x3e0
[ 884.226545] ? kmem_cache_alloc+0x18a/0x270
[ 884.226547] ? do_sys_open+0x1bd/0x260
[ 884.226548] do_sys_open+0x1bd/0x260
[ 884.226551] do_syscall_64+0x5b/0x1e0
[ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9
```
*** rootcause ***
"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.
The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.
```
<thread 1>: "mdadm -Ss"
md_release
mddev_put deletes mddev from all_mddevs
queue_work for mddev_delayed_delete
at this time, "/dev/md0" is still available for opening
<thread 2>: "mdadm --monitor ..."
md_open
+ mddev_find can't find mddev of /dev/md0, and create a new mddev and
| return.
+ trigger "if (mddev->gendisk != bdev->bd_disk)" and return
-ERESTARTSYS.
```
In non-preempt kernel, <thread 2> is occupying on current CPU. and
mddev_delayed_delete which was created in <thread 1> also can't be
schedule.
In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after <thread 2> running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
the soft lockup is broken.
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 8b57251f9a91f5e5a599de7549915d2d226cc3af upstream.
Factor out a self-contained helper to just lookup a mddev by the dev_t
"unit".
Cc: stable@vger.kernel.org
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 65aa97c4d2bfd76677c211b9d03ef05a98c6d68e upstream.
Split mddev_find into a simple mddev_find that just finds an existing
mddev by the unit number, and a more complicated mddev_find that deals
with find or allocating a mddev.
This turns out to fix this bug reported by Zhao Heming.
----------------------------- snip ------------------------------
commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit f7c7a2f9a23e5b6e0f5251f29648d0238bb7757e upstream.
md_kick_rdev_from_array will remove rdev, so we should
use rdev_for_each_safe to search list.
How to trigger:
env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).
```
node2=192.168.0.3
for i in {1..20}; do
echo ==== $i `date` ====;
mdadm -Ss && ssh ${node2} "mdadm -Ss"
wipefs -a /dev/sda /dev/sdb
mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
/dev/sdb --assume-clean
ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
mdadm --wait /dev/md0
ssh ${node2} "mdadm --wait /dev/md0"
mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
sleep 1
done
```
Crash stack:
```
stack segment: 0000 [#1] SMP
... ...
RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
... ...
RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
FS: 0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
raid1d+0x5c/0xd40 [raid1]
? finish_task_switch+0x75/0x2a0
? lock_timer_base+0x67/0x80
? try_to_del_timer_sync+0x4d/0x80
? del_timer_sync+0x41/0x50
? schedule_timeout+0x254/0x2d0
? md_start_sync+0xe0/0xe0 [md_mod]
? md_thread+0x127/0x160 [md_mod]
md_thread+0x127/0x160 [md_mod]
? wait_woken+0x80/0x80
kthread+0x10d/0x130
? kthread_park+0xa0/0xa0
ret_from_fork+0x1f/0x40
```
Fixes: dbb64f8635f5d ("md-cluster: Fix adding of new disk with new reload code")
Fixes: 659b254fa7392 ("md-cluster: remove a disk asynchronously from cluster environment")
Cc: stable@vger.kernel.org
Reviewed-by: Gang He <ghe@suse.com>
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 8e947c8f4a5620df77e43c9c75310dc510250166 upstream.
When loading a device-mapper table for a request-based mapped device,
and the allocation/initialization of the blk_mq_tag_set for the device
fails, a following device remove will cause a double free.
E.g. (dmesg):
device-mapper: core: Cannot initialize queue for request-based dm-mq mapped device
device-mapper: ioctl: unable to set up device queue for new table.
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0305e098835de000 TEID: 0305e098835de803
Fault in home space mode while using kernel ASCE.
AS:000000025efe0007 R3:0000000000000024
Oops: 0038 ilc:3 [#1] SMP
Modules linked in: ... lots of modules ...
Supported: Yes, External
CPU: 0 PID: 7348 Comm: multipathd Kdump: loaded Tainted: G W X 5.3.18-53-default #1 SLE15-SP3
Hardware name: IBM 8561 T01 7I2 (LPAR)
Krnl PSW : 0704e00180000000 000000025e368eca (kfree+0x42/0x330)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 000000000000004a 000000025efe5230 c1773200d779968d 0000000000000000
000000025e520270 000000025e8d1b40 0000000000000003 00000007aae10000
000000025e5202a2 0000000000000001 c1773200d779968d 0305e098835de640
00000007a8170000 000003ff80138650 000000025e5202a2 000003e00396faa8
Krnl Code: 000000025e368eb8: c4180041e100 lgrl %r1,25eba50b8
000000025e368ebe: ecba06b93a55 risbg %r11,%r10,6,185,58
#000000025e368ec4: e3b010000008 ag %r11,0(%r1)
>000000025e368eca: e310b0080004 lg %r1,8(%r11)
000000025e368ed0: a7110001 tmll %r1,1
000000025e368ed4: a7740129 brc 7,25e369126
000000025e368ed8: e320b0080004 lg %r2,8(%r11)
000000025e368ede: b904001b lgr %r1,%r11
Call Trace:
[<000000025e368eca>] kfree+0x42/0x330
[<000000025e5202a2>] blk_mq_free_tag_set+0x72/0xb8
[<000003ff801316a8>] dm_mq_cleanup_mapped_device+0x38/0x50 [dm_mod]
[<000003ff80120082>] free_dev+0x52/0xd0 [dm_mod]
[<000003ff801233f0>] __dm_destroy+0x150/0x1d0 [dm_mod]
[<000003ff8012bb9a>] dev_remove+0x162/0x1c0 [dm_mod]
[<000003ff8012a988>] ctl_ioctl+0x198/0x478 [dm_mod]
[<000003ff8012ac8a>] dm_ctl_ioctl+0x22/0x38 [dm_mod]
[<000000025e3b11ee>] ksys_ioctl+0xbe/0xe0
[<000000025e3b127a>] __s390x_sys_ioctl+0x2a/0x40
[<000000025e8c15ac>] system_call+0xd8/0x2c8
Last Breaking-Event-Address:
[<000000025e52029c>] blk_mq_free_tag_set+0x6c/0xb8
Kernel panic - not syncing: Fatal exception: panic_on_oops
When allocation/initialization of the blk_mq_tag_set fails in
dm_mq_init_request_queue(), it is uninitialized/freed, but the pointer
is not reset to NULL; so when dev_remove() later gets into
dm_mq_cleanup_mapped_device() it sees the pointer and tries to
uninitialize and free it again.
Fix this by setting the pointer to NULL in dm_mq_init_request_queue()
error-handling. Also set it to NULL in dm_mq_cleanup_mapped_device().
Cc: <stable@vger.kernel.org> # 4.6+
Fixes: 1c357a1e86a4 ("dm: allocate blk_mq_tag_set rather than embed in mapped_device")
Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 5208692e80a1f3c8ce2063a22b675dd5589d1d80 upstream.
This division bug meant the search for free metadata space could skip
the final allocation bitmap's worth of entries. Fix affects DM thinp,
cache and era targets.
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Tested-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit a88b2358f1da2c9f9fcc432f2e0a79617fea397c upstream.
Otherwise most non-x86 architectures (e.g. riscv, arm) will resort to
byte-by-byte access.
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 2417b9869b81882ab90fd5ed1081a1cb2d4db1dd upstream.
This patch addresses a data corruption bug in raid1 arrays using bitmaps.
Without this fix, the bitmap bits for the failed I/O end up being cleared.
Since we are in the failure leg of raid1_end_write_request, the request
either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded).
Fixes: eeba6809d8d5 ("md/raid1: end bio when the device faulty")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Paul Clements <paul.clements@us.sios.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 4edbe1d7bcffcd6269f3b5eb63f710393ff2ec7a upstream.
If there are not any dm devices, we need to zero the "dev" argument in
the first structure dm_name_list. However, this can cause out of
bounds write, because the "needed" variable is zero and len may be
less than eight.
Fix this bug by reporting DM_BUFFER_FULL_FLAG if the result buffer is
too small to hold the "nl->dev" value.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 24f6b6036c9eec21191646930ad42808e6180510 upstream.
Fix dm_table_supports_zoned_model() and invert logic of both
iterate_devices_callout_fn so that all devices' zoned capabilities are
properly checked.
Add one more parameter to dm_table_any_dev_attr(), which is actually
used as the @data parameter of iterate_devices_callout_fn, so that
dm_table_matches_zone_sectors() can be replaced by
dm_table_any_dev_attr().
Fixes: dd88d313bef02 ("dm table: add zoned block devices validation")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[jeffle: also convert no_sg_merge check]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 5b0fab508992c2e120971da658ce80027acbc405 upstream.
Fix dm_table_supports_dax() and invert logic of both
iterate_devices_callout_fn so that all devices' DAX capabilities are
properly checked.
Fixes: 545ed20e6df6 ("dm: add infrastructure for DAX support")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[jeffle: no dax synchronous]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit a4c8dd9c2d0987cf542a2a0c42684c9c6d78a04e upstream.
According to the definition of dm_iterate_devices_fn:
* This function must iterate through each section of device used by the
* target until it encounters a non-zero return code, which it then returns.
* Returns zero if no callout returned non-zero.
For some target type (e.g. dm-stripe), one call of iterate_devices() may
iterate multiple underlying devices internally, in which case a non-zero
return code returned by iterate_devices_callout_fn will stop the iteration
in advance. No iterate_devices_callout_fn should return non-zero unless
device iteration should stop.
Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and
elevate it for reuse to stop iterating (and return non-zero) on the
first device that causes iterate_devices_callout_fn to return non-zero.
Use dm_table_any_dev_attr() to properly iterate through devices.
Rename device_is_nonrot() to device_is_rotational() and invert logic
accordingly to fix improper disposition.
[jeffle: backport notes]
Also convert the no_sg_merge capability check, which is introduced by
commit 200612ec33e5 ("dm table: propagate QUEUE_FLAG_NO_SG_MERGE"), and
removed since commit 2705c93742e9 ("block: kill QUEUE_FLAG_NO_SG_MERGE")
in v5.1.
Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have it set")
Fixes: 4693c9668fdc ("dm table: propagate non rotational flag")
Cc: stable@vger.kernel.org
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 2099b145d77c1d53f5711f029c37cc537897cee6 upstream.
In case of a system crash, dm-era might fail to mark blocks as written
in its metadata, although the corresponding writes to these blocks were
passed down to the origin device and completed successfully.
Consider the following sequence of events:
1. We write to a block that has not been yet written in the current era
2. era_map() checks the in-core bitmap for the current era and sees
that the block is not marked as written.
3. The write is deferred for submission after the metadata have been
updated and committed.
4. The worker thread processes the deferred write
(process_deferred_bios()) and marks the block as written in the
in-core bitmap, **before** committing the metadata.
5. The worker thread starts committing the metadata.
6. We do more writes that map to the same block as the write of step (1)
7. era_map() checks the in-core bitmap and sees that the block is marked
as written, **although the metadata have not been committed yet**.
8. These writes are passed down to the origin device immediately and the
device reports them as completed.
9. The system crashes, e.g., power failure, before the commit from step
(5) finishes.
When the system recovers and we query the dm-era target for the list of
written blocks it doesn't report the aforementioned block as written,
although the writes of step (6) completed successfully.
The issue is that era_map() decides whether to defer or not a write
based on non committed information. The root cause of the bug is that we
update the in-core bitmap, **before** committing the metadata.
Fix this by updating the in-core bitmap **after** successfully
committing the metadata.
Fixes: eec40579d84873 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit cca2c6aebe86f68103a8615074b3578e854b5016 upstream.
Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
(inactive) table that will only become active upon resume. That is why
resize should always be done in terms of resume. Otherwise a load (ctr)
whose inactive table never becomes active will incorrectly resize the
metadata.
Also, perform the resize directly in preresume, instead of using the
worker to do it.
The worker might run other metadata operations, e.g., it could start
digestion, before resizing the metadata. These operations will end up
using the old size.
This could lead to errors, like:
device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value failed
device-mapper: era: process_old_eras: digest step failed, stopping digestion
The reason of the above error is that the worker started the digestion
of the archived writeset using the old, larger size.
As a result, metadata_digest_transcribe_writeset tried to write beyond
the end of the era array.
Fixes: eec40579d84873 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 2524933307fd0036d5c32357c693c021ab09a0b0 upstream.
In case of devices with at most 64 blocks, the digestion of consecutive
eras uses the writeset of the first era as the writeset of all eras to
digest, leading to lost writes. That is, we lose the information about
what blocks were written during the affected eras.
The digestion code uses a dm_disk_bitset object to access the archived
writesets. This structure includes a one word (64-bit) cache to reduce
the number of array lookups.
This structure is initialized only once, in metadata_digest_start(),
when we kick off digestion.
But, when we insert a new writeset into the writeset tree, before the
digestion of the previous writeset is done, or equivalently when there
are multiple writesets in the writeset tree to digest, then all these
writesets are digested using the same cache and the cache is not
re-initialized when moving from one writeset to the next.
For devices with more than 64 blocks, i.e., the size of the cache, the
cache is indirectly invalidated when we move to a next set of blocks, so
we avoid the bug.
But for devices with at most 64 blocks we end up using the same cached
data for digesting all archived writesets, i.e., the cache is loaded
when digesting the first writeset and it never gets reloaded, until the
digestion is done.
As a result, the writeset of the first era to digest is used as the
writeset of all the following archived eras, leading to lost writes.
Fix this by reinitializing the dm_disk_bitset structure, and thus
invalidating the cache, every time the digestion code starts digesting a
new writeset.
Fixes: eec40579d84873 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 64f2d15afe7b336aafebdcd14cc835ecf856df4b upstream.
Fix the writeset tree equality test function to use the right value size
when comparing two btree values.
Fixes: eec40579d84873 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 904e6b266619c2da5c58b5dce14ae30629e39645 upstream.
Deallocate the memory allocated for the in-core bitsets when destroying
the target and in error paths.
Fixes: eec40579d84873 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit c8e846ff93d5eaa5384f6f325a1687ac5921aade upstream.
dm-era doesn't support changing the data block size of existing devices,
so check explicitly that the requested block size for a new target
matches the one stored in the metadata.
Fixes: eec40579d84873 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit de89afc1e40fdfa5f8b666e5d07c43d21a1d3be0 upstream.
Following a system crash, dm-era fails to recover the committed writeset
for the current era, leading to lost writes. That is, we lose the
information about what blocks were written during the affected era.
dm-era assumes that the writeset of the current era is archived when the
device is suspended. So, when resuming the device, it just moves on to
the next era, ignoring the committed writeset.
This assumption holds when the device is properly shut down. But, when
the system crashes, the code that suspends the target never runs, so the
writeset for the current era is not archived.
There are three issues that cause the committed writeset to get lost:
1. dm-era doesn't load the committed writeset when opening the metadata
2. The code that resizes the metadata wipes the information about the
committed writeset (assuming it was loaded at step 1)
3. era_preresume() starts a new era, without taking into account that
the current era might not have been archived, due to a system crash.
To fix this:
1. Load the committed writeset when opening the metadata
2. Fix the code that resizes the metadata to make sure it doesn't wipe
the loaded writeset
3. Fix era_preresume() to check for a loaded writeset and archive it,
before starting a new era.
Fixes: eec40579d84873 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 809b1e4945774c9ec5619a8f4e2189b7b3833c0c upstream.
This reverts commit
644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()")
dm_get_dev_t() is just used to convert an arbitrary 'path' string
into a dev_t. It doesn't presume that the device is present; that
check will be done later, as the only caller is dm_get_device(),
which does a dm_get_table_device() later on, which will properly
open the device.
So if the path string already _is_ in major:minor representation
we can convert it directly, avoiding a recursion into the filesystem
to lookup the block device.
This avoids a hang in multipath_message() when the filesystem is
inaccessible.
Fixes: 644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()")
Cc: stable@vger.kernel.org
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 0378c625afe80eb3f212adae42cc33c9f6f31abf upstream.
There wasn't ever a real need to log an error in the kernel log for
ioctls issued with insufficient permissions. Simply return an error
and if an admin/user is sufficiently motivated they can enable DM's
dynamic debugging to see an explanation for why the ioctls were
disallowed.
Reported-by: Nir Soffer <nsoffer@redhat.com>
Fixes: e980f62353c6 ("dm: don't allow ioctls to targets that don't map to whole devices")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit fcc42338375a1e67b8568dbb558f8b784d0f3b01 upstream.
If the origin device has a volatile write-back cache and the following
events occur:
1: After finishing merge operation of one set of exceptions,
merge_callback() is invoked.
2: Update the metadata in COW device tracking the merge completion.
This update to COW device is flushed cleanly.
3: System crashes and the origin device's cache where the recent
merge was completed has not been flushed.
During the next cycle when we read the metadata from the COW device,
we will skip reading those metadata whose merge was completed in
step (1). This will lead to data loss/corruption.
To address this, flush the origin device post merge IO before
updating the metadata.
Cc: stable@vger.kernel.org
Signed-off-by: Akilesh Kailash <akailash@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 252bd1256396cebc6fc3526127fdb0b317601318 ]
If emergency system shutdown is called, like by thermal shutdown,
a dm device could be alive when the block device couldn't process
I/O requests anymore. In this state, the handling of I/O errors
by new dm I/O requests or by those already in-flight can lead to
a verity corruption state, which is a misjudgment.
So, skip verity work in response to I/O error when system is shutting
down.
Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 93decc563637c4288380912eac0eb42fb246cc04 upstream.
In __make_request() a new r10bio is allocated and passed to
raid10_read_request(). The read_slot member of the bio is not
initialized, and the raid10_read_request() uses it to index an
array. This leads to occasional panics.
Fix by initializing the field to invalid value and checking for
valid value in raid10_read_request().
Cc: stable@vger.kernel.org
Signed-off-by: Kevin Vigor <kvigor@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit bca5b0658020be90b6b504ca514fd80110204f71 upstream.
md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
During sending msg, node can concurrently receive msg from another node.
When node does resync job, grab token_lockres:EX may trigger a deadlock:
```
nodeA nodeB
-------------------- --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
b.
md_do_sync
resync_info_update
send RESYNCING
+ set MD_CLUSTER_SEND_LOCK
+ wait for holding token_lockres:EX
c.
mdadm /dev/md0 --remove /dev/sdg
+ held reconfig_mutex
+ send REMOVE
+ wait_event(MD_CLUSTER_SEND_LOCK)
d.
recv_daemon //METADATA_UPDATED from A
process_metadata_update
+ (mddev_trylock(mddev) ||
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
//this time, both return false forever
```
Explaination:
a. A send METADATA_UPDATED
This will block another node to send msg
b. B does sync jobs, which will send RESYNCING at intervals.
This will be block for holding token_lockres:EX lock.
c. B do "mdadm --remove", which will send REMOVE.
This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
d. B recv METADATA_UPDATED msg, which send from A in step <a>.
This will be blocked by step <c>: holding mddev lock, it makes
wait_event can't hold mddev lock. (btw,
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)
There is a similar deadlock in commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg")
In that commit, step c is "update sb". This patch step c is
"mdadm --remove".
For fixing this issue, we can refer the solution of function:
metadata_update_start. Which does the same grab lock_token action.
lock_comm can use the same steps to avoid deadlock. By moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
but it is safe & can break deadlock.
Repro steps (I only triggered 3 times with hundreds tests):
two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
test script will hung when executing "mdadm --remove".
```
# dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D 0 5329 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? _cond_resched+0x2d/0x40
? schedule+0x4a/0xb0
? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
? wait_woken+0x80/0x80
? process_recvd_msg+0x113/0x1d0 [md_cluster]
? recv_daemon+0x9e/0x120 [md_cluster]
? md_thread+0x94/0x160 [md_mod]
? wait_woken+0x80/0x80
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
mdadm D 0 5423 1 0x00004004
Call Trace:
__schedule+0x1f6/0x560
? __schedule+0x1fe/0x560
? schedule+0x4a/0xb0
? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
? wait_woken+0x80/0x80
? remove_disk+0x4f/0x90 [md_cluster]
? hot_remove_disk+0xb1/0x1b0 [md_mod]
? md_ioctl+0x50c/0xba0 [md_mod]
? wait_woken+0x80/0x80
? blkdev_ioctl+0xa2/0x2a0
? block_ioctl+0x39/0x40
? ksys_ioctl+0x82/0xc0
? __x64_sys_ioctl+0x16/0x20
? do_syscall_64+0x5f/0x150
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
md0_resync D 0 5425 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? schedule+0x4a/0xb0
? dlm_lock_sync+0xa1/0xd0 [md_cluster]
? wait_woken+0x80/0x80
? lock_token+0x2d/0x90 [md_cluster]
? resync_info_update+0x95/0x100 [md_cluster]
? raid1_sync_request+0x7d3/0xa40 [raid1]
? md_do_sync.cold+0x737/0xc8f [md_mod]
? md_thread+0x94/0x160 [md_mod]
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
```
At last, thanks for Xiao's solution.
Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 4d7659bfbe277a43399a4a2d90fca141e70f29e1 ]
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Fixes: 2ca4c92f58f9 ("dm ioctl: prevent empty message")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit c731b84b51bf7fe83448bea8f56a6d55006b0615 upstream.
Syzkaller reports a warning as belows.
WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169
...
Call Trace:
...
RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169
RSP: 0018:ffff888096027950 EFLAGS: 00010293
RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2
RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007
RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482
R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932
R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408
__blkdev_driver_ioctl block/ioctl.c:304 [inline]
blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606
block_ioctl+0xee/0x130 fs/block_dev.c:1930
vfs_ioctl fs/ioctl.c:46 [inline]
file_ioctl fs/ioctl.c:509 [inline]
do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
__do_sys_ioctl fs/ioctl.c:720 [inline]
__se_sys_ioctl fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is caused by a race between two concurrenct md_ioctl()s closing
the array.
CPU1 (md_ioctl()) CPU2 (md_ioctl())
------ ------
set_bit(MD_CLOSING, &mddev->flags);
did_set_md_closing = true;
WARN_ON_ONCE(test_bit(MD_CLOSING,
&mddev->flags));
if(did_set_md_closing)
clear_bit(MD_CLOSING, &mddev->flags);
Fix the warning by returning immediately if the MD_CLOSING bit is set
in &mddev->flags which indicates that the array is being closed.
Fixes: 065e519e71b2 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dae R. Jeong <dae.r.jeong@kaist.ac.kr>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit e7b624183d921b49ef0a96329f21647d38865ee9 ]
The BUG_ON(in_interrupt()) in dm_table_event() is a historic leftover from
a rework of the dm table code which changed the calling context.
Issuing a BUG for a wrong calling context is frowned upon and
in_interrupt() is deprecated and only covering parts of the wrong
contexts. The sanity check for the context is covered by
CONFIG_DEBUG_ATOMIC_SLEEP and other debug facilities already.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit b44c018cdf748b96b676ba09fdbc5b34fc443ada upstream.
KoWei reported crash during raid5 reshape:
[ 1032.252932] Oops: 0002 [#1] SMP PTI
[...]
[ 1032.252943] RIP: 0010:memcpy_erms+0x6/0x10
[...]
[ 1032.252947] RSP: 0018:ffffba1ac0c03b78 EFLAGS: 00010286
[ 1032.252949] RAX: 0000784ac0000000 RBX: ffff91bec3d09740 RCX: 0000000000001000
[ 1032.252951] RDX: 0000000000001000 RSI: ffff91be6781c000 RDI: 0000784ac0000000
[ 1032.252953] RBP: ffffba1ac0c03bd8 R08: 0000000000001000 R09: ffffba1ac0c03bf8
[ 1032.252954] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba1ac0c03bf8
[ 1032.252955] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
[ 1032.252958] FS: 0000000000000000(0000) GS:ffff91becf500000(0000) knlGS:0000000000000000
[ 1032.252959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1032.252961] CR2: 0000784ac0000000 CR3: 000000031780a002 CR4: 00000000001606e0
[ 1032.252962] Call Trace:
[ 1032.252969] ? async_memcpy+0x179/0x1000 [async_memcpy]
[ 1032.252977] ? raid5_release_stripe+0x8e/0x110 [raid456]
[ 1032.252982] handle_stripe_expansion+0x15a/0x1f0 [raid456]
[ 1032.252988] handle_stripe+0x592/0x1270 [raid456]
[ 1032.252993] handle_active_stripes.isra.0+0x3cb/0x5a0 [raid456]
[ 1032.252999] raid5d+0x35c/0x550 [raid456]
[ 1032.253002] ? schedule+0x42/0xb0
[ 1032.253006] ? schedule_timeout+0x10e/0x160
[ 1032.253011] md_thread+0x97/0x160
[ 1032.253015] ? wait_woken+0x80/0x80
[ 1032.253019] kthread+0x104/0x140
[ 1032.253022] ? md_start_sync+0x60/0x60
[ 1032.253024] ? kthread_park+0x90/0x90
[ 1032.253027] ret_from_fork+0x35/0x40
This is because cache_size_mutex was unlocked too early in resize_stripes,
which races with grow_one_stripe() that grow_one_stripe() allocates a
stripe with wrong pool_size.
Fix this issue by unlocking cache_size_mutex after updating pool_size.
Cc: <stable@vger.kernel.org> # v4.4+
Reported-by: KoWei Sung <winders@amazon.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit d837f7277f56e70d82b3a4a037d744854e62f387 ]
md_bitmap_get_counter() has code:
```
if (bitmap->bp[page].hijacked ||
bitmap->bp[page].map == NULL)
csize = ((sector_t)1) << (bitmap->chunkshift +
PAGE_COUNTER_SHIFT - 1);
```
The minus 1 is wrong, this branch should report 2048 bits of space.
With "-1" action, this only report 1024 bit of space.
This bug code returns wrong blocks, but it doesn't inflence bitmap logic:
1. Most callers focus this function return value (the counter of offset),
not the parameter blocks.
2. The bug is only triggered when hijacked is true or map is NULL.
the hijacked true condition is very rare.
the "map == null" only true when array is creating or resizing.
3. Even the caller gets wrong blocks, current code makes caller just to
call md_bitmap_get_counter() one more time.
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 34cf78bf34d48dddddfeeadb44f9841d7864997a ]
This patch fix a lost wake-up problem caused by the race between
mca_cannibalize_lock and bch_cannibalize_unlock.
Consider two processes, A and B. Process A is executing
mca_cannibalize_lock, while process B takes c->btree_cache_alloc_lock
and is executing bch_cannibalize_unlock. The problem happens that after
process A executes cmpxchg and will execute prepare_to_wait. In this
timeslice process B executes wake_up, but after that process A executes
prepare_to_wait and set the state to TASK_INTERRUPTIBLE. Then process A
goes to sleep but no one will wake up it. This problem may cause bcache
device to dead.
Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 219403d7e56f9b716ad80ab87db85d29547ee73e upstream.
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit d16ff19e69ab57e08bf908faaacbceaf660249de upstream.
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 610b15c50e86eb1e4b77274fabcaea29ac72d6a8 upstream.
In preparation for replacing unchecked overflows for memory allocations,
this creates helpers for the 3 most common calculations:
array_size(a, b): 2-dimensional array
array3_size(a, b, c): 3-dimensional array
struct_size(ptr, member, n): struct followed by n-many trailing members
Each of these return SIZE_MAX on overflow instead of wrapping around.
(Additionally renames a variable named "array_size" to avoid future
collision.)
Co-developed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 9958f1d9a04efb3db19134482b3f4c6897e0e7b8 upstream.
Now that the writethrough code is much simpler there is no need to track
so much state or cascade bio submission (as was done, via
writethrough_endio(), to issue origin then cache IO in series).
As such the obsolete writethrough list and workqueue is also removed.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 2df3bae9a6543e90042291707b8db0cbfbae9ee9 upstream.
Discontinue issuing writethrough write IO in series to the origin and
then cache.
Use bio_clone_fast() to create a new origin clone bio that will be
mapped to the origin device and then bio_chain() it to the bio that gets
remapped to the cache device. The origin clone bio does _not_ have a
copy of the per_bio_data -- as such check_if_tick_bio_needed() will not
be called.
The cache bio (parent bio) will not complete until the origin bio has
completed -- this fulfills bio_clone_fast()'s requirements as well as
the requirement to not complete the original IO until the write IO has
completed to both the origin and cache device.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
commit 8e3c3827776fc93728c0c8d7c7b731226dc6ee23 upstream.
No functional changes, just a bit cleaner than passing cache_features
structure.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit e766668c6cd49d741cfb49eaeb38998ba34d27bc ]
dm_stop_queue() only uses blk_mq_quiesce_queue() so it doesn't
formally stop the blk-mq queue; therefore there is no point making the
blk_mq_queue_stopped() check -- it will never be stopped.
In addition, even though dm_stop_queue() actually tries to quiesce hw
queues via blk_mq_quiesce_queue(), checking with blk_queue_quiesced()
to avoid unnecessary queue quiesce isn't reliable because: the
QUEUE_FLAG_QUIESCED flag is set before synchronize_rcu() and
dm_stop_queue() may be called when synchronize_rcu() from another
blk_mq_quiesce_queue() is in-progress.
Fixes: 7b17c2f7292ba ("dm: Fix a race condition related to stopping and starting queues")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 5fe48867856367142d91a82f2cbf7a57a24cbb70 upstream.
There are some meta data of bcache are allocated by multiple pages,
and they are used as bio bv_page for I/Os to the cache device. for
example cache_set->uuids, cache->disk_buckets, journal_write->data,
bset_tree->data.
For such meta data memory, all the allocated pages should be treated
as a single memory block. Then the memory management and underlying I/O
code can treat them more clearly.
This patch adds __GFP_COMP flag to all the location allocating >0 order
pages for the above mentioned meta data. Then their pages are treated
as compound pages now.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit a1c6ae3d9f3dd6aa5981a332a6f700cf1c25edef upstream.
In degraded raid5, we need to read parity to do reconstruct-write when
data disks fail. However, we can not read parity from
handle_stripe_dirtying() in force reconstruct-write mode.
Reproducible Steps:
1. Create degraded raid5
mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
2. Set rmw_level to 0
echo 0 > /sys/block/md2/md/rmw_level
3. IO to raid5
Now some io may be stuck in raid5. We can use handle_stripe_fill() to read
the parity in this situation.
Cc: <stable@vger.kernel.org> # v4.4+
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Reviewed-by: Danny Shih <dannyshih@synology.com>
Signed-off-by: ChangSyun Peng <allenpeng@synology.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 117f636ea695270fe492d0c0c9dfadc7a662af47 ]
In register_cache_set(), c is pointer to struct cache_set, and ca is
pointer to struct cache, if ca->sb.seq > c->sb.seq, it means this
registering cache has up to date version and other members, the in-
memory version and other members should be updated to the newer value.
But current implementation makes a cache set only has a single cache
device, so the above assumption works well except for a special case.
The execption is when a cache device new created and both ca->sb.seq and
c->sb.seq are 0, because the super block is never flushed out yet. In
the location for the following if() check,
2156 if (ca->sb.seq > c->sb.seq) {
2157 c->sb.version = ca->sb.version;
2158 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16);
2159 c->sb.flags = ca->sb.flags;
2160 c->sb.seq = ca->sb.seq;
2161 pr_debug("set version = %llu\n", c->sb.version);
2162 }
c->sb.version is not initialized yet and valued 0. When ca->sb.seq is 0,
the if() check will fail (because both values are 0), and the cache set
version, set_uuid, flags and seq won't be updated.
The above problem is hiden for current code, because the bucket size is
compatible among different super block version. And the next time when
running cache set again, ca->sb.seq will be larger than 0 and cache set
super block version will be updated properly.
But if the large bucket feature is enabled, sb->bucket_size is the low
16bits of the bucket size. For a power of 2 value, when the actual
bucket size exceeds 16bit width, sb->bucket_size will always be 0. Then
read_super_common() will fail because the if() check to
is_power_of_2(sb->bucket_size) is false. This is how the long time
hidden bug is triggered.
This patch modifies the if() check to the following way,
2156 if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) {
Then cache set's version, set_uuid, flags and seq will always be updated
corectly including for a new created cache device.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|