| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 235b6003fb28f0dd8e7ed8fbdb088bb548291766 ]
When reshaping a fully degraded raid5/raid6 to a larger
nubmer of devices, the new device(s) are not in-sync
and so that can make the newly grown stripe appear to be
"failed".
To avoid this, we set the R5_Expanded flag to say "Even though
this device is not fully in-sync, this block is safe so
don't treat the device as failed for this stripe".
This flag is set for data devices, not not for parity devices.
Consequently, if you have a RAID6 with two devices that are partly
recovered and a spare, and start a reshape to include the spare,
then when the reshape gets past the point where the recovery was
up to, it will think the stripes are failed and will get into
an infinite loop, failing to make progress.
So when contructing parity on an EXPAND_READY stripe,
set R5_Expanded.
Reported-by: Curt <lightspd@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit c157313791a999646901b3e3c6888514ebc36d62 ]
Currently, Cache missed IOs are identified by s->cache_miss, but actually,
there are many situations that missed IOs are not assigned a value for
s->cache_miss in cached_dev_cache_miss(), for example, a bypassed IO
(s->iop.bypass = 1), or the cache_bio allocate failed. In these situations,
it will go to out_put or out_submit, and s->cache_miss is null, which leads
bch_mark_cache_accounting() to treat this IO as a hit IO.
[ML: applied by 3-way merge]
Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 330a4db89d39a6b43f36da16824eaa7a7509d34d ]
mutex_destroy does nothing most of time, but it's better to call
it to make the code future proof and it also has some meaning
for like mutex debug.
As Coly pointed out in a previous review, bcache_exit() may not be
able to handle all the references properly if userspace registers
cache and backing devices right before bch_debug_init runs and
bch_debug_init failes later. So not exposing userspace interface
until everything is ready to avoid that issue.
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 9c8043f337f14d1743006dfc59c03e80a42e3884 ]
To avoid memory leak, we need to free the cinfo which
is allocated when node join cluster.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 0868b99c214a3d55486c700de7c3f770b7243e7c ]
When bitmap is resized, the old kalloced chunks just are not released
once the resized bitmap starts to use new space.
This fixes in particular kmemleak reports like this one:
unreferenced object 0xffff8f4311e9c000 (size 4096):
comm "lvm", pid 19333, jiffies 4295263268 (age 528.265s)
hex dump (first 32 bytes):
02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................
02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80 ................
backtrace:
[<ffffffffa69471ca>] kmemleak_alloc+0x4a/0xa0
[<ffffffffa628c10e>] kmem_cache_alloc_trace+0x14e/0x2e0
[<ffffffffa676cfec>] bitmap_checkpage+0x7c/0x110
[<ffffffffa676d0c5>] bitmap_get_counter+0x45/0xd0
[<ffffffffa676d6b3>] bitmap_set_memory_bits+0x43/0xe0
[<ffffffffa676e41c>] bitmap_init_from_disk+0x23c/0x530
[<ffffffffa676f1ae>] bitmap_load+0xbe/0x160
[<ffffffffc04c47d3>] raid_preresume+0x203/0x2f0 [dm_raid]
[<ffffffffa677762f>] dm_table_resume_targets+0x4f/0xe0
[<ffffffffa6774b52>] dm_resume+0x122/0x140
[<ffffffffa6779b9f>] dev_suspend+0x18f/0x290
[<ffffffffa677a3a7>] ctl_ioctl+0x287/0x560
[<ffffffffa677a693>] dm_ctl_ioctl+0x13/0x20
[<ffffffffa62d6b46>] do_vfs_ioctl+0xa6/0x750
[<ffffffffa62d7269>] SyS_ioctl+0x79/0x90
[<ffffffffa6956d41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit e393aa2446150536929140739f09c6ecbcbea7f0 upstream.
When we send a read request and hit the clean data in cache device, there
is a situation called cache read race in bcache(see the commit in the tail
of cache_look_up(), the following explaination just copy from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)
It should be noted that cache read race happened under normal
circumstances, not the circumstance when SSD failed, it was counted
and shown in /sys/fs/bcache/XXX/internal/cache_read_races.
Without this patch, when we use writeback mode, we will never reread from
the backing device when cache read race happened, until the whole cache
device is clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR)
will be passed up, at last, user will receive -EINTR when it's bio end,
this is not suitable, and wield to up-application.
In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in cache device, it is safe to reread data from
the backing device when the read request hit clean data. This can not
only handle cache read race, but also recover data when failed read
request from cache device.
[edited by mlyle to fix up whitespace, commit log title, comment
spelling]
Fixes: d59b23795933 ("bcache: only permit to recovery read error when cache device is clean")
Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit d59b23795933678c9638fd20c942d2b4f3cd6185 upstream.
When bcache does read I/Os, for example in writeback or writethrough mode,
if a read request on cache device is failed, bcache will try to recovery
the request by reading from cached device. If the data on cached device is
not synced with cache device, then requester will get a stale data.
For critical storage system like database, providing stale data from
recovery may result an application level data corruption, which is
unacceptible.
With this patch, for a failed read request in writeback or writethrough
mode, recovery a recoverable read request only happens when cache device
is clean. That is to say, all data on cached device is up to update.
For other cache modes in bcache, read request will never hit
cached_dev_read_error(), they don't need this patch.
Please note, because cache mode can be switched arbitrarily in run time, a
writethrough mode might be switched from a writeback mode. Therefore
checking dc->has_data in writethrough mode still makes sense.
Changelog:
V4: Fix parens error pointed by Michael Lyle.
v3: By response from Kent Oversteet, he thinks recovering stale data is a
bug to fix, and option to permit it is unnecessary. So this version
the sysfs file is removed.
v2: rename sysfs entry from allow_stale_data_on_failure to
allow_stale_data_on_failure, and fix the confusing commit log.
v1: initial patch posted.
[small change to patch comment spelling by mlyle]
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Arne Wolf <awolf@lenovo.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Kai Krakow <hurikhan77@gmail.com>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: Junhui Tang <tang.junhui@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit cf33c1ee5254c6a430bc1538232b49c3ea13e613 upstream.
This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the PTR macro, which conflicts with the PTR macro
in include/uapi/linux/bcache.h.
[fixed by mlyle: corrected a line-length issue]
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 91af8300d9c1d7c6b6a2fd754109e08d4798b8d8 upstream.
In bcache code, sysfs entries are created before all resources get
allocated, e.g. allocation thread of a cache set.
There is posibility for NULL pointer deference if a resource is accessed
but which is not initialized yet. Indeed Jorg Bornschein catches one on
cache set allocation thread and gets a kernel oops.
The reason for this bug is, when bch_bucket_alloc() is called during
cache set registration and attaching, ca->alloc_thread is not properly
allocated and initialized yet, call wake_up_process() on ca->alloc_thread
triggers NULL pointer deference failure. A simple and fast fix is, before
waking up ca->alloc_thread, checking whether it is allocated, and only
wake up ca->alloc_thread when it is not NULL.
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Jorg Bornschein <jb@capsec.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit b9a41d21dceadf8104812626ef85dc56ee8a60ed upstream.
The following BUG_ON was hit when testing repeat creation and removal of
DM devices:
kernel BUG at drivers/md/dm.c:2919!
CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
Call Trace:
[<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
[<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
[<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
[<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
[<ffffffff811de257>] kernfs_seq_show+0x23/0x25
[<ffffffff81199118>] seq_read+0x16f/0x325
[<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
[<ffffffff8117b625>] __vfs_read+0x26/0x9d
[<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
[<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
[<ffffffff8117be9d>] vfs_read+0x8f/0xcf
[<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
[<ffffffff8117c686>] SyS_read+0x4b/0x76
[<ffffffff817b606e>] system_call_fastpath+0x12/0x71
The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
between the test of DMF_FREEING & DMF_DELETING and dm_get() in
dm_get_from_kobject().
To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
dm_get() are done in an atomic way, so _minor_lock is used.
The other callers of dm_get() have also been checked to be OK: some
callers invoke dm_get() under _minor_lock, some callers invoke it under
_hash_lock, and dm_start_request() invoke it after increasing
md->open_count.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 856eb0916d181da6d043cc33e03f54d5c5bbe54a upstream.
The structure srcu_struct can be very big, its size is proportional to the
value CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, the field
io_barrier in the struct mapped_device has 84kB in the debugging kernel
and 50kB in the non-debugging kernel. The large size may result in failure
of the function kzalloc_node.
In order to avoid the allocation failure, we use the function
kvzalloc_node, this function falls back to vmalloc if a large contiguous
chunk of memory is not available. This patch also moves the field
io_barrier to the last position of struct mapped_device - the reason is
that on many processor architectures, short memory offsets result in
smaller code than long memory offsets - on x86-64 it reduces code size by
320 bytes.
Note to stable kernel maintainers - the kernels 4.11 and older don't have
the function kvzalloc_node, you can use the function vzalloc_node instead.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 74d4108d9e681dbbe4a2940ed8fdff1f6868184c upstream.
The default max_cache_size_bytes for dm-bufio is meant to be the lesser
of 25% of the size of the vmalloc area and 2% of the size of lowmem.
However, on 32-bit systems the intermediate result in the expression
(VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100
overflows, causing the wrong result to be computed. For example, on a
32-bit system where the vmalloc area is 520093696 bytes, the result is
1174405 rather than the expected 130023424, which makes the maximum
cache size much too small (far less than 2% of lowmem). This causes
severe performance problems for dm-verity users on affected systems.
Fix this by using mult_frac() to correctly multiply by a percentage. Do
this for all places in dm-bufio that multiply by a percentage. Also
replace (VMALLOC_END - VMALLOC_START) with VMALLOC_TOTAL, which contrary
to the comment is now defined in include/linux/vmalloc.h.
Depends-on: 9993bc635 ("sched/x86: Fix overflow in cyc2ns_offset")
Fixes: 95d402f057f2 ("dm: add bufio")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit d939cdfde34f50b95254b375f498447c82190b3e ]
Commit 03a9e24(md linear: fix a race between linear_add() and
linear_congested()) introduces the warnning.
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[ Upstream commit 6d399783e9d4e9bd44931501948059d24ad96ff8 ]
Commit 57c67df(md/raid10: submit IO from originating thread instead of
md thread) submits bio directly for normal disks but not for replacement
disks. There is no point we shouldn't do this for replacement disks.
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 184a09eb9a2fe425e49c9538f1604b05ed33cfef upstream.
In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST
set, it indicates that this stripe_head is already in the raid5_plug_cb
list and release_stripe() would be called instead to drop a reference
count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this
stripe_head and it will get queued into the raid5_plug_cb list.
Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST,
A stripe could be re-added to plug list while it is still on that list
in the following situation. If stripe_head A is added to another
stripe_head B's batch list, in this case A will have its
batch_head != NULL and be added into the plug list. After that,
stripe_head B gets handled and called break_stripe_batch_list() to
reset all the batched stripe_head(including A which is still on
the plug list)'s state and reset their batch_head to NULL.
Before the plug list gets processed, if there is another write request
comes in and get stripe_head A, A will have its batch_head == NULL
(cleared by calling break_stripe_batch_list() on B) and be added to
plug list once again.
Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 3664847d95e60a9a943858b7800f8484669740fc upstream.
We have a race condition in below scenario, say have 3 continuous stripes, sh1,
sh2 and sh3, sh1 is the stripe_head of sh2 and sh3:
CPU1 CPU2 CPU3
handle_stripe(sh3)
stripe_add_to_batch_list(sh3)
-> lock(sh2, sh3)
-> lock batch_lock(sh1)
-> add sh3 to batch_list of sh1
-> unlock batch_lock(sh1)
clear_batch_ready(sh1)
-> lock(sh1) and batch_lock(sh1)
-> clear STRIPE_BATCH_READY for all stripes in batch_list
-> unlock(sh1) and batch_lock(sh1)
->clear_batch_ready(sh3)
-->test_and_clear_bit(STRIPE_BATCH_READY, sh3)
--->return 0 as sh->batch == NULL
-> sh3->batch_head = sh1
-> unlock (sh2, sh3)
In CPU1, handle_stripe will continue handle sh3 even it's in batch stripe list
of sh1. By moving sh3->batch_head assignment in to batch_lock, we make it
impossible to clear STRIPE_BATCH_READY before batch_head is set.
Thanks Stephane for helping debug this tricky issue.
Reported-and-tested-by: Stephane Thiell <sthiell@stanford.edu>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 9276717b9e297a62d1151a43d1cd286213f68eb7 upstream.
Most importantly, solve a crash where %llu was used to format signed
numbers. This would cause a buffer overflow when reading sysfs
writeback_rate_debug, as only 20 bytes were allocated for this and
%llu writes 20 characters plus a null.
Always use the units mechanism rather than having different output
paths for simplicity.
Also, correct problems with display output where 1.10 was a larger
number than 1.09, by multiplying by 10 and then dividing by 1024 instead
of dividing by 100. (Remainders of >= 1000 would print as .10).
Minor changes: Always display the decimal point instead of trying to
omit it based on number of digits shown. Decide what units to use
based on 1000 as a threshold, not 1024 (in other words, always print
at most 3 digits before the decimal point).
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Dmitry Yu Okunev <dyokunev@ut.mephi.ru>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 9baf30972b5568d8b5bc8b3c46a6ec5b58100463 upstream.
gc and write-back get raced (see the email "bcache get stucked" I sended
before):
gc thread write-back thread
| |bch_writeback_thread()
|bch_gc_thread() |
| |==>read_dirty()
|==>bch_btree_gc() |
|==>btree_root() //get btree root |
| //node write locker |
|==>bch_btree_gc_root() |
| |==>read_dirty_submit()
| |==>write_dirty()
| |==>continue_at(cl,
| | write_dirty_finish,
| | system_wq);
| |==>write_dirty_finish()//excute
| | //in system_wq
| |==>bch_btree_insert()
| |==>bch_btree_map_leaf_nodes()
| |==>__bch_btree_map_nodes()
| |==>btree_root //try to get btree
| | //root node read
| | //lock
| |-----stuck here
|==>bch_btree_set_root()
|==>bch_journal_meta()
|==>bch_journal()
|==>journal_try_write()
|==>journal_write_unlocked() //journal_full(&c->journal)
| //condition satisfied
|==>continue_at(cl, journal_write, system_wq); //try to excute
| //journal_write in system_wq
| //but work queue is excuting
| //write_dirty_finish()
|==>closure_sync(); //wait journal_write execute
| //over and wake up gc,
|-------------stuck here
|==>release root node write locker
This patch alloc a separate work-queue for write-back thread to avoid such
race.
(Commit log re-organized by Coly Li to pass checkpatch.pl checking)
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 77fa100f27475d08a569b9d51c17722130f089e7 upstream.
If you encounter any errors in bch_cached_dev_attach it will return
a negative error code. The variable 'v' which stores the result is
unsigned, thus user space sees a very large value returned for bytes
written which can cause incorrect user space behavior. Utilize 1
signed variable to use throughout the function to preserve error return
capability.
Signed-off-by: Tony Asleson <tasleson@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit a8394090a9129b40f9d90dcb7f4a49d60c727ca6 upstream.
__update_write_rate() uses a Proportion-Differentiation Controller
algorithm to control writeback rate. A dirty target number is used in
this PD controller to control writeback rate. A larger target number
will make the writeback rate smaller, on the versus, a smaller target
number will make the writeback rate larger.
bcache uses the following steps to calculate the target number,
1) cache_sectors = all-buckets-of-cache-set * buckets-size
2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
3) target = cache_dirty_target *
(sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)
The calculation at step 1) for cache_sectors is incorrect, which does
not consider dirty blocks occupied by flash only volume.
A flash only volume can be took as a bcache device without cached
device. All data sectors allocated for it are persistent on cache device
and marked dirty, they are not touched by bcache writeback and garbage
collection code. So data blocks of flash only volume should be ignore
when calculating cache_sectors of cache set.
Current code does not subtract dirty sectors of flash only volume, which
results a larger target number from the above 3 steps. And in sequence
the cache device's writeback rate is smaller then a correct value,
writeback speed is slower on all cached devices.
This patch fixes the incorrect slower writeback rate by subtracting
dirty sectors of flash only volumes in __update_writeback_rate().
(Commit log composed by Coly Li to pass checkpatch.pl checking)
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 69daf03adef5f7bc13e0ac86b4b8007df1767aab upstream.
Since bypassed IOs use no bucket, so do not subtract sectors_to_gc to
trigger gc thread.
Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 4b758df21ee7081ab41448d21d60367efaa625b3 upstream.
If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
block device using lookup_bdev() to detect which situation we are in to
properly report error. However we never drop the reference returned to
us from lookup_bdev(). Fix that.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 175206cf9ab63161dec74d9cd7f9992e062491f5 upstream.
bcache uses a Proportion-Differentiation Controller algorithm to control
writeback rate to cached devices. In the PD controller algorithm, dirty
stripes of thin flash device should not be counted in, because flash only
volumes never write back dirty data.
Currently dirty stripe counter for thin flash device is not initialized
when the thin flash device starts. Which means the following calculation
in PD controller will reference an undefined dirty stripes number, and
all cached devices attached to the same cache set where the thin flash
device lies on may have an inaccurate writeback rate.
This patch calles bch_sectors_dirty_init() in flash_dev_run(), to
correctly initialize dirty stripe counter when the thin flash device
starts to run. This patch also does following parameter data type change,
-void bch_sectors_dirty_init(struct cached_dev *dc);
+void bch_sectors_dirty_init(struct bcache_device *);
to call this function conveniently in flash_dev_run().
(Commit log is composed by Coly Li)
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit e8a27f836f165c26f867ece7f31eb5c811692319 upstream.
bitmap_resize() does not work for file-backed bitmaps.
The buffer_heads are allocated and initialized when
the bitmap is read from the file, but resize doesn't
read from the file, it loads from the internal bitmap.
When it comes time to write the new bitmap, the bh is
non-existent and we crash.
The common case when growing an array involves making the array larger,
and that normally means making the bitmap larger. Doing
that inside the kernel is possible, but would need more code.
It is probably easier to require people who use file-backed
bitmaps to remove them and re-add after a reshape.
So this patch disables the resizing of arrays which have
file-backed bitmaps. This is better than crashing.
Reported-by: Zhilong Liu <zlliu@suse.com>
Fixes: d60b479d177a ("md/bitmap: add bitmap_resize function to allow bitmap resizing.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 9c72a18e46ebe0f09484cce8ebf847abdab58498 upstream.
In raid5, there are scenarios where some ios are deferred to a later
time, and some IO need a flush to complete. To make sure we make
progress with these IOs, we need to call the following functions:
flush_deferred_bios(conf);
r5l_flush_stripe_to_raid(conf->log);
Both of these functions are called in raid5d(), but missing in
raid5_do_work(). As a result, these functions are not called
when multi-threading (group_thread_cnt > 0) is enabled. This patch
adds calls to these function to raid5_do_work().
Note for stable branches:
r5l_flush_stripe_to_raid(conf->log) is need for 4.4+
flush_deferred_bios(conf) is only needed for 4.11+
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 7e96d559634b73a8158ee99a7abece2eacec2668 upstream.
Since thread_group worker and raid5d kthread are not in sync, if
worker writes stripe before raid5d then requests will be waiting
for issue_pendig.
Issue observed when building raid5 with ext4, in some build runs
jbd2 would get hung and requests were waiting in the HW engine
waiting to be issued.
Fix this by adding a call to async_tx_issue_pending_all in the
raid5_do_work.
Signed-off-by: Ofer Heifetz <oferh@marvell.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit b5d27718f38843a74552e9a93d32e2391fd3999f upstream.
The raid5 md device is created by the disks which we don't use the total size. For example,
the size of the device is 5G and it just uses 3G of the devices to create one raid5 device.
Then change the chunksize and wait reshape to finish. After reshape finishing stop the raid
and assemble it again. It fails.
mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean
mdadm /dev/md0 --grow --chunk=64
wait reshape to finish
mdadm -S /dev/md0
mdadm -As
The error messages:
[197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing!
[197519.821686] md: md_import_device returned -22
After reshape the data offset is changed. It selects backwards direction in this condition.
In function super_1_load it compares the available space of the underlying device with
sb->data_size. The new data offset gets bigger after reshape. So super_1_load returns -EINVAL.
rdev->sectors is updated in md_finish_reshape. Then sb->data_size is set in super_1_sync based
on rdev->sectors. So add md_finish_reshape in end_reshape.
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit f9c79bc05a2a91f4fba8bfd653579e066714b1ec upstream.
The function flush_signals clears all pending signals for the process. It
may be used by kernel threads when we need to prepare a kernel thread for
responding to signals. However using this function for an userspaces
processes is incorrect - clearing signals without the program expecting it
can cause misbehavior.
The raid1 and raid5 code uses flush_signals in its request routine because
it wants to prepare for an interruptible wait. This patch drops
flush_signals and uses sigprocmask instead to block all signals (including
SIGKILL) around the schedule() call. The signals are not lost, but the
schedule() call won't respond to them.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
| |
commit d19a55ccad15a486ffe03030570744e5d5bd9f8e upstream.
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 3fb632e40d7667d8bedfabc28850ac06d5493f54 upstream.
The sb->super_offset should be big-endian, but the rdev->sb_start is in
host byte order, so fix this by adding cpu_to_le64.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
commit 1345921393ba23b60d3fcf15933e699232ad25ae upstream.
The sb->layout is of type __le32, so we shoud use le32_to_cpu.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 00a0ea33b495ee6149bf5a77ac5807ce87323abb upstream.
process_prepared_discard_passdown_pt1() should cleanup
dm_thin_new_mapping in cases of error.
dm_pool_inc_data_range() can fail trying to get a block reference:
metadata operation 'dm_pool_inc_data_range' failed: error = -61
When dm_pool_inc_data_range() fails, dm thin aborts current metadata
transaction and marks pool as PM_READ_ONLY. Memory for thin mapping
is released as well. However, current thin mapping will be queued
onto next stage as part of queue_passdown_pt2() or passdown_endio().
This dangling thin mapping memory when processed and accessed in
next stage will lead to device mapper crashing.
Code flow without fix:
-> process_prepared_discard_passdown_pt1(m)
-> dm_thin_remove_range()
-> discard passdown
--> passdown_endio(m) queues m onto next stage
-> dm_pool_inc_data_range() fails, frees memory m
but does not remove it from next stage queue
-> process_prepared_discard_passdown_pt2(m)
-> processes freed memory m and crashes
One such stack:
Call Trace:
[<ffffffffa037a46f>] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison]
[<ffffffffa039b6dc>] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool]
[<ffffffffa039b88b>] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool]
[<ffffffffa0399611>] process_prepared+0x81/0xa0 [dm_thin_pool]
[<ffffffffa039e735>] do_worker+0xc5/0x820 [dm_thin_pool]
[<ffffffff8152bf54>] ? __schedule+0x244/0x680
[<ffffffff81087e72>] ? pwq_activate_delayed_work+0x42/0xb0
[<ffffffff81089f53>] process_one_work+0x153/0x3f0
[<ffffffff8108a71b>] worker_thread+0x12b/0x4b0
[<ffffffff8108a5f0>] ? rescuer_thread+0x350/0x350
[<ffffffff8108fd6a>] kthread+0xca/0xe0
[<ffffffff8108fca0>] ? kthread_park+0x60/0x60
[<ffffffff81530b45>] ret_from_fork+0x25/0x30
The fix is to first take the block ref count for discarded block and
then do a passdown discard of this block. If block ref count fails,
then bail out aborting current metadata transaction, mark pool as
PM_READ_ONLY and also free current thin mapping memory (existing error
handling code) without queueing this thin mapping onto next stage of
processing. If block ref count succeeds, then passdown discard of this
block. Discard callback of passdown_endio() will queue this thin mapping
onto next stage of processing.
Code flow with fix:
-> process_prepared_discard_passdown_pt1(m)
-> dm_thin_remove_range()
-> dm_pool_inc_data_range()
--> if fails, free memory m and bail out
-> discard passdown
--> passdown_endio(m) queues m onto next stage
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Reviewed-by: Cristian Gafton <gafton@amazon.com>
Reviewed-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Vallish Vaidyeshwara <vallish@amazon.com>
Reviewed-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 065e519e71b2c1f41936cce75b46b5ab34adb588 upstream.
if called md_set_readonly and set MD_CLOSING bit, the mddev cannot
be opened any more due to the MD_CLOING bit wasn't cleared. Thus it
needs to be cleared in md_ioctl after any call to md_set_readonly()
or do_md_stop().
Signed-off-by: NeilBrown <neilb@suse.com>
Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag")
Signed-off-by: Zhilong Liu <zlliu@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 583da48e388f472e8818d9bb60ef6a1d40ee9f9d upstream.
When growing raid5 device on machine with small memory, there is chance that
mdadm will be killed and the following bug report can be observed. The same
bug could also be reproduced in linux-4.10.6.
[57600.075774] BUG: unable to handle kernel NULL pointer dereference at (null)
[57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
[57600.110378] PGD 421cf067 PUD 4442d067 PMD 0
[57600.114678] Oops: 0002 [#1] SMP
[57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P O 4.2.8 #1
[57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013
[57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000
[57600.204963] RIP: 0010:[<ffffffff81a6aa87>] [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
[57600.213057] RSP: 0018:ffff880043073810 EFLAGS: 00010046
[57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0
[57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000
[57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282
[57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838
[57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00
[57600.253999] FS: 00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000
[57600.262078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0
[57600.274942] Stack:
[57600.276949] ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f
[57600.284383] ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98
[57600.291820] ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968
[57600.299254] Call Trace:
[57600.301698] [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0
[57600.307523] [<ffffffff81119043>] ? __page_cache_release+0x23/0x110
[57600.313779] [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0
[57600.319344] [<ffffffff81579942>] drop_one_stripe+0x62/0x90
[57600.324915] [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0
[57600.330563] [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250
[57600.336650] [<ffffffff8111e38c>] shrink_zone+0x23c/0x250
[57600.342039] [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420
[57600.348210] [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0
[57600.353959] [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0
[57600.360303] [<ffffffff8157a30b>] check_reshape+0x62b/0x770
[57600.365866] [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0
[57600.371778] [<ffffffff81583df7>] update_raid_disks+0xc7/0x110
[57600.377604] [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10
[57600.382827] [<ffffffff81385380>] blkdev_ioctl+0x170/0x690
[57600.388307] [<ffffffff81195238>] block_ioctl+0x38/0x40
[57600.393525] [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480
[57600.399010] [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0
[57600.404400] [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70
[57600.409447] [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a
[57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d
[57600.435460] RIP [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20
[57600.441208] RSP <ffff880043073810>
[57600.444690] CR2: 0000000000000000
[57600.448000] ---[ end trace cbc6b5cc4bf9831d ]---
The problem is that resize_stripes() releases new stripe_heads before assigning new
slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called
after resize_stripes() starting releasing new stripes but right before new slab cache
being assigned, it is possible that these new stripe_heads will be freed with the old
slab_cache which was already been destoryed and that triggers this bug.
Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 0377a07c7a035e0d033cd8b29f0cb15244c0916a upstream.
When decrementing the reference count for a block, the free count wasn't
being updated if the reference count went to zero.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
commit 91bcdb92d39711d1adb40c26b653b7978d93eb98 upstream.
These calls were the wrong way round in __write_initial_superblock.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 13840d38016203f0095cd547b90352812d24b787 upstream.
Change the type of the parameter "retain_bytes" from unsigned to
unsigned long, so that on 64-bit machines the user can set more than
4GiB of data to be retained.
Also, change the type of the variable "count" in the function
"__evict_old_buffers" to unsigned long. The assignment
"count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];"
could result in unsigned long to unsigned overflow and that could result
in buffers not being freed when they should.
While at it, avoid division in get_retain_buffers(). Division is slow,
we can change it to shift because we have precalculated the log2 of
block size.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 10add84e276432d9dd8044679a1028dd4084117e upstream.
Otherwise it is possible to trigger crashes due to the metadata being
inaccessible yet these methods don't safely account for that possibility
without these checks.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 89bfce763e43fa4897e0d3af6b29ed909df64cfd upstream.
activate_path() is renamed to activate_path_work() which now calls
activate_or_offline_path(). activate_or_offline_path() will be used
by the next commit.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 390020ad2af9ca04844c4f3b1f299ad8746d84c8 upstream.
dm-bufio checks a watermark when it allocates a new buffer in
__bufio_new(). However, it doesn't check the watermark when the user
changes /sys/module/dm_bufio/parameters/max_cache_size_bytes.
This may result in a problem - if the watermark is high enough so that
all possible buffers are allocated and if the user lowers the value of
"max_cache_size_bytes", the watermark will never be checked against the
new value because no new buffer would be allocated.
To fix this, change __evict_old_buffers() so that it checks the
watermark. __evict_old_buffers() is called every 30 seconds, so if the
user reduces "max_cache_size_bytes", dm-bufio will react to this change
within 30 seconds and decrease memory consumption.
Depends-on: 1b0fb5a5b2 ("dm bufio: avoid a possible ABBA deadlock")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 1b0fb5a5b2dc0dddcfa575060441a7176ba7ac37 upstream.
__get_memory_limit() tests if dm_bufio_cache_size changed and calls
__cache_size_refresh() if it did. It takes dm_bufio_clients_lock while
it already holds the client lock. However, lock ordering is violated
because in cleanup_old_buffers() dm_bufio_clients_lock is taken before
the client lock.
This results in a possible deadlock and lockdep engine warning.
Fix this deadlock by changing mutex_lock() to mutex_trylock(). If the
lock can't be taken, it will be re-checked next time when a new buffer
is allocated.
Also add "unlikely" to the if condition, so that the optimizer assumes
that the condition is false.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 7b81ef8b14f80033e4a4168d199a0f5fd79b9426 upstream.
Since the commit 0cf4503174c1 ("dm raid: add support for the MD RAID0
personality"), the dm-raid subsystem can activate a RAID-0 array.
Therefore, add MD_RAID0 to the dependencies of DM_RAID, so that MD_RAID0
will be selected when DM_RAID is selected.
Fixes: 0cf4503174c1 ("dm raid: add support for the MD RAID0 personality")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 7d1fedb6e96a960aa91e4ff70714c3fb09195a5a upstream.
dm_btree_find_lowest_key() is giving incorrect results. find_key()
traverses the btree correctly for finding the highest key, but there is
an error in the way it traverses the btree for retrieving the lowest
key. dm_btree_find_lowest_key() fetches the first key of the rightmost
block of the btree instead of fetching the first key from the leftmost
block.
Fix this by conditionally passing the correct parameter to value64()
based on the @find_highest flag.
Signed-off-by: Erez Zadok <ezk@fsl.cs.sunysb.edu>
Signed-off-by: Vinothkumar Raja <vinraja@cs.stonybrook.edu>
Signed-off-by: Nidhi Panpalia <npanpalia@cs.stonybrook.edu>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 948f581a53b704b984aa20df009f0a2b4cf7f907 upstream.
dm-thin does not free the discard_parent bio after all chained sub
bios finished. The following kmemleak report could be observed after
pool with discard_passdown option processes discard bios in
linux v4.11-rc7. To fix this, we drop the discard_parent bio reference
when its endio (passdown_endio) called.
unreferenced object 0xffff8803d6b29700 (size 256):
comm "kworker/u8:0", pid 30349, jiffies 4379504020 (age 143002.776s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
01 00 00 00 00 00 00 f0 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff81a5efd9>] kmemleak_alloc+0x49/0xa0
[<ffffffff8114ec34>] kmem_cache_alloc+0xb4/0x100
[<ffffffff8110eec0>] mempool_alloc_slab+0x10/0x20
[<ffffffff8110efa5>] mempool_alloc+0x55/0x150
[<ffffffff81374939>] bio_alloc_bioset+0xb9/0x260
[<ffffffffa018fd20>] process_prepared_discard_passdown_pt1+0x40/0x1c0 [dm_thin_pool]
[<ffffffffa018b409>] break_up_discard_bio+0x1a9/0x200 [dm_thin_pool]
[<ffffffffa018b484>] process_discard_cell_passdown+0x24/0x40 [dm_thin_pool]
[<ffffffffa018b24d>] process_discard_bio+0xdd/0xf0 [dm_thin_pool]
[<ffffffffa018ecf6>] do_worker+0xa76/0xd50 [dm_thin_pool]
[<ffffffff81086239>] process_one_work+0x139/0x370
[<ffffffff810867b1>] worker_thread+0x61/0x450
[<ffffffff8108b316>] kthread+0xd6/0xf0
[<ffffffff81a6cd1f>] ret_from_fork+0x3f/0x70
[<ffffffffffffffff>] 0xffffffffffffffff
Signed-off-by: Dennis Yang <dennisyang@qnap.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 23a601248958fa4142d49294352fe8d1fdf3e509 upstream.
Otherwise the request-based DM blk-mq request_queue will be put into
service without being properly exported via sysfs.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 117aceb030307dcd431fdcff87ce988d3016c34a upstream.
When committing era metadata to disk, it doesn't always save the latest
spacemap metadata root in superblock. Due to this, metadata is getting
corrupted sometimes when reopening the device. The correct order of update
should be, pre-commit (shadows spacemap root), save the spacemap root
(newly shadowed block) to in-core superblock and then the final commit.
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 4617f564c06117c7d1b611be49521a4430042287 upstream.
When calling a dm ioctl that doesn't process any data
(IOCTL_FLAGS_NO_PARAMS), the contents of the data field in struct
dm_ioctl are left initialized. Current code is incorrectly extending
the size of data copied back to user, causing the contents of kernel
stack to be leaked to user. Fix by only copying contents before data
and allow the functions processing the ioctl to override.
Signed-off-by: Adrian Salido <salidoa@google.com>
Reviewed-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 7a0c5c5b834fb60764b494b0e39c239da3b0774b upstream.
Commit 4257e08 ("dm raid: support to change bitmap region size")
introduced a bitmap resize call during preresume phase. User can create
a DM device with "raid" target configured as raid1 with no metadata
devices to hold superblock/bitmap info. It can be achieved using the
following sequence:
truncate -s 32M /dev/shm/raid-test
LOOP=$(losetup --show -f /dev/shm/raid-test)
dmsetup create raid-test-linear0 --table "0 1024 linear $LOOP 0"
dmsetup create raid-test-linear1 --table "0 1024 linear $LOOP 1024"
dmsetup create raid-test --table "0 1024 raid raid1 1 2048 2 - /dev/mapper/raid-test-linear0 - /dev/mapper/raid-test-linear1"
This results in the following crash:
[ 4029.110216] device-mapper: raid: Ignoring chunk size parameter for RAID 1
[ 4029.110217] device-mapper: raid: Choosing default region size of 4MiB
[ 4029.111349] md/raid1:mdX: active with 2 out of 2 mirrors
[ 4029.114770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[ 4029.114802] IP: bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.114816] PGD 0
…
[ 4029.115059] Hardware name: Aquarius Pro P30 S85 BUY-866/B85M-E, BIOS 2304 05/25/2015
[ 4029.115079] task: ffff88015cc29a80 task.stack: ffffc90001a5c000
[ 4029.115097] RIP: 0010:bitmap_resize+0x25/0x7c0 [md_mod]
[ 4029.115112] RSP: 0018:ffffc90001a5fb68 EFLAGS: 00010246
[ 4029.115127] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
[ 4029.115146] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000
[ 4029.115166] RBP: ffffc90001a5fc28 R08: 0000000800000000 R09: 00000008ffffffff
[ 4029.115185] R10: ffffea0005661600 R11: ffff88015cc29a80 R12: ffff88021231f058
[ 4029.115204] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 4029.115223] FS: 00007fe73a6b4740(0000) GS:ffff88021ea80000(0000) knlGS:0000000000000000
[ 4029.115245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4029.115261] CR2: 0000000000000030 CR3: 0000000159a74000 CR4: 00000000001426e0
[ 4029.115281] Call Trace:
[ 4029.115291] ? raid_iterate_devices+0x63/0x80 [dm_raid]
[ 4029.115309] ? dm_table_all_devices_attribute.isra.23+0x41/0x70 [dm_mod]
[ 4029.115329] ? dm_table_set_restrictions+0x225/0x2d0 [dm_mod]
[ 4029.115346] raid_preresume+0x81/0x2e0 [dm_raid]
[ 4029.115361] dm_table_resume_targets+0x47/0xe0 [dm_mod]
[ 4029.115378] dm_resume+0xa8/0xd0 [dm_mod]
[ 4029.115391] dev_suspend+0x123/0x250 [dm_mod]
[ 4029.115405] ? table_load+0x350/0x350 [dm_mod]
[ 4029.115419] ctl_ioctl+0x1c2/0x490 [dm_mod]
[ 4029.115433] dm_ctl_ioctl+0xe/0x20 [dm_mod]
[ 4029.115447] do_vfs_ioctl+0x8d/0x5a0
[ 4029.115459] ? ____fput+0x9/0x10
[ 4029.115470] ? task_work_run+0x79/0xa0
[ 4029.115481] SyS_ioctl+0x3c/0x70
[ 4029.115493] entry_SYSCALL_64_fastpath+0x13/0x94
The raid_preresume() function incorrectly assumes that the raid_set has
a bitmap enabled if RT_FLAG_RS_BITMAP_LOADED is set. But
RT_FLAG_RS_BITMAP_LOADED is getting set in __load_dirty_region_bitmap()
even if there is no bitmap present (and bitmap_load() happily returns 0
even if a bitmap isn't present). So the only way forward in the
near-term is to check if the bitmap is present by seeing if
mddev->bitmap is not NULL after bitmap_load() has been called.
By doing so the above NULL pointer is avoided.
Fixes: 4257e08 ("dm raid: support to change bitmap region size")
Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru>
Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit 86e3e83b443669dd2bcc5c8a83b23e3aa0694c0d upstream.
Buffers read through dm_bufio_read() were not released in all code paths.
Fixes: a739ff3f543a ("dm verity: add support for forward error correction")
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
commit f1a880a93baaadb14c10a348fd199f1cdb6bcccd upstream.
If the hash tree itself is sufficiently corrupt in addition to data blocks,
it's possible for error correction to end up in a deep recursive loop,
which eventually causes a kernel panic. This change limits the
recursion to a reasonable level during a single I/O operation.
Fixes: a739ff3f543a ("dm verity: add support for forward error correction")
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|