summaryrefslogtreecommitdiffstats
path: root/drivers/md/md.c
Commit message (Collapse)AuthorAgeFilesLines
* dm raid: protect md_stop() with 'reconfig_mutex'Yu Kuai2023-07-251-0/+2
| | | | | | | | | | | | | __md_stop_writes() and __md_stop() will modify many fields that are protected by 'reconfig_mutex', and all the callers will grab 'reconfig_mutex' except for md_stop(). Also, update md_stop() to make certain 'reconfig_mutex' is held using lockdep_assert_held(). Fixes: 9d09e663d550 ("dm: raid456 basic support") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
* md: fix 'delete_mutex' deadlockYu Kuai2023-06-231-19/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 3ce94ce5d05a ("md: fix duplicate filename for rdev") introduce a new lock 'delete_mutex', and trigger a new deadlock: t1: remove rdev t2: sysfs writer rdev_attr_store rdev_attr_store mddev_lock state_store md_kick_rdev_from_array lock delete_mutex list_add mddev->deleting unlock delete_mutex mddev_unlock mddev_lock ... lock delete_mutex kobject_del // wait for sysfs writers to be done mddev_unlock lock delete_mutex // wait for delete_mutex, deadlock 'delete_mutex' is used to protect the list 'mddev->deleting', turns out that this list can be protected by 'reconfig_mutex' directly, and this lock can be removed. Fix this problem by removing the lock, and use 'reconfig_mutex' to protect the list. mddev_unlock() will move this list to a local list to be handled after 'reconfig_mutex' is dropped. Fixes: 3ce94ce5d05a ("md: fix duplicate filename for rdev") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230621142933.1395629-1-yukuai1@huaweicloud.com
* md: use mddev->external to select holder in export_rdev()Song Liu2023-06-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mdadm test "10ddf-create-fail-rebuild" triggers warnings like the following [ 215.526357] ------------[ cut here ]------------ [ 215.527243] WARNING: CPU: 18 PID: 1264 at block/bdev.c:617 blkdev_put+0x269/0x350 [ 215.528334] Modules linked in: [ 215.528806] CPU: 18 PID: 1264 Comm: mdmon Not tainted 6.4.0-rc2+ #768 [ 215.529863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS [ 215.531464] RIP: 0010:blkdev_put+0x269/0x350 [ 215.532167] Code: ff ff 49 8d 7d 10 e8 56 bf b8 ff 4d 8b 65 10 49 8d bc 24 58 05 00 00 e8 05 be b8 ff 41 83 ac 24 58 05 00 00 01 e9 44 ff ff ff <0f> 0b e9 52 fe ff ff 0f 0b e9 6b fe ff ff1 [ 215.534780] RSP: 0018:ffffc900040bfbf0 EFLAGS: 00010283 [ 215.535635] RAX: ffff888174001000 RBX: ffff88810b1c3b00 RCX: ffffffff819a4061 [ 215.536645] RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88810b1c3ba0 [ 215.537657] RBP: ffff88810dbde800 R08: fffffbfff0fca983 R09: fffffbfff0fca983 [ 215.538674] R10: ffffc900040bfbf0 R11: fffffbfff0fca982 R12: ffff88810b1c3b38 [ 215.539687] R13: ffff88810b1c3b10 R14: ffff88810dbdecb8 R15: ffff88810b1c3b00 [ 215.540833] FS: 00007f2aabdff700(0000) GS:ffff888dfb400000(0000) knlGS:0000000000000000 [ 215.541961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 215.542775] CR2: 00007fa19a85d934 CR3: 000000010c076006 CR4: 0000000000370ee0 [ 215.543814] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 215.544840] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 215.545885] Call Trace: [ 215.546257] <TASK> [ 215.546608] export_rdev.isra.63+0x71/0xe0 [ 215.547338] mddev_unlock+0x1b1/0x2d0 [ 215.547898] array_state_store+0x28d/0x450 [ 215.548519] md_attr_store+0xd7/0x150 [ 215.549059] ? __pfx_sysfs_kf_write+0x10/0x10 [ 215.549702] kernfs_fop_write_iter+0x1b9/0x260 [ 215.550351] vfs_write+0x491/0x760 [ 215.550863] ? __pfx_vfs_write+0x10/0x10 [ 215.551445] ? __fget_files+0x156/0x230 [ 215.552053] ksys_write+0xc0/0x160 [ 215.552570] ? __pfx_ksys_write+0x10/0x10 [ 215.553141] ? ktime_get_coarse_real_ts64+0xec/0x100 [ 215.553878] do_syscall_64+0x3a/0x90 [ 215.554403] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 215.555125] RIP: 0033:0x7f2aade11847 [ 215.555696] Code: c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 1b fd ff ff 4c 89 e2 48 89 ee 89 df 41 89 c0 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 448 [ 215.558398] RSP: 002b:00007f2aabdfeba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 [ 215.559516] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007f2aade11847 [ 215.560515] RDX: 0000000000000005 RSI: 0000000000438b8b RDI: 0000000000000010 [ 215.561512] RBP: 0000000000438b8b R08: 0000000000000000 R09: 00007f2aaecf0060 [ 215.562511] R10: 000000000e3ba40b R11: 0000000000000293 R12: 0000000000000005 [ 215.563647] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000c70750 [ 215.564693] </TASK> [ 215.565029] irq event stamp: 15979 [ 215.565584] hardirqs last enabled at (15991): [<ffffffff811a7432>] __up_console_sem+0x52/0x60 [ 215.566806] hardirqs last disabled at (16000): [<ffffffff811a7417>] __up_console_sem+0x37/0x60 [ 215.568022] softirqs last enabled at (15716): [<ffffffff8277a2db>] __do_softirq+0x3eb/0x531 [ 215.569239] softirqs last disabled at (15711): [<ffffffff810d8f45>] irq_exit_rcu+0x115/0x160 [ 215.570434] ---[ end trace 0000000000000000 ]--- This means export_rdev() calls blkdev_put with a different holder than the one used by blkdev_get_by_dev(). This is because mddev->major_version == -2 is not a good check for external metadata. Fix this by using mddev->external instead. Also, do not clear mddev->external in md_clean(), as the flag might be used later in export_rdev(). Fixes: 2736e8eeb0cc ("block: use the holder as indication for exclusive opens") Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230617052405.305871-1-song@kernel.org
* md/md-bitmap: add a new helper to unplug bitmap asynchrouslyYu Kuai2023-06-131-0/+9
| | | | | | | | | | | | | | | | | If bitmap is enabled, bitmap must update before submitting write io, this is why unplug callback must move these io to 'conf->pending_io_list' if 'current->bio_list' is not empty, which will suffer performance degradation. A new helper md_bitmap_unplug_async() is introduced to submit bitmap io in a kworker, so that submit bitmap io in raid10_unplug() doesn't require that 'current->bio_list' is empty. This patch prepare to limit the number of plugged bio. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com
* md/raid10: clean up md_add_new_disk()Li Nan2023-06-131-1/+0
| | | | | | | | | | | | | | | | | | | | Commit 1a855a060665 ("md: fix bug with re-adding of partially recovered device.") only add device which is set to In_sync. But it let devices without metadata cannot be added when they should be. Commit bf572541ab44 ("md: fix regression with re-adding devices to arrays with no metadata") fix the above issue, it set device without metadata to In_sync when add new disk. However, after commit f466722ca614 ("md: Change handling of save_raid_disk and metadata update during recovery.") deletes changes of the first patch, setting In_sync for devcie without metadata is meanless because the flag will be cleared soon and will not be used during this period. Clean it up. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527101851.3266500-2-linan666@huaweicloud.com
* md: protect md_thread with rcuYu Kuai2023-06-131-37/+32
| | | | | | | | | | | | | | | | | | | | | Currently, there are many places that md_thread can be accessed without protection, following are known scenarios that can cause null-ptr-dereference or uaf: 1) sync_thread that is allocated and started from md_start_sync() 2) mddev->thread can be accessed directly from timeout_store() and md_bitmap_daemon_work() 3) md_unregister_thread() from action_store(). Currently, a global spinlock 'pers_lock' is borrowed to protect 'mddev->thread' in some places, this problem can be fixed likewise, however, use a global lock for all the cases is not good. Fix this problem by protecting all md_thread with rcu. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
* md: factor out a helper to wake up md_thread directlyYu Kuai2023-06-131-8/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md_wakeup_thread() can't wakeup md_thread->tsk if md_thread->run is still in progress, and in some cases md_thread->tsk need to be woke up directly, like md_set_readonly() and do_md_stop(). Commit 9dfbdafda3b3 ("md: unlock mddev before reap sync_thread in action_store") introduce a new scenario where unregister sync_thread is not protected by 'reconfig_mutex', this can cause null-ptr-deference in theroy: t1: md_set_readonly t2: action_store md_unregister_thread // 'reconfig_mutex' is not held // 'reconfig_mutex' is held by caller if (mddev->sync_thread) thread = *threadp *threadp = NULL wake_up_process(mddev->sync_thread->tsk) // null-ptr-deference Fix this problem by factoring out a helper to wake up md_thread directly, so that 'sync_thread' won't be accessed multiple times from the reader side. This helper also prepare to protect md_thread with rcu. Noted that later patches is going to fix that unregister sync_thread is not protected by 'reconfig_mutex' from action_store(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-2-yukuai1@huaweicloud.com
* md: fix duplicate filename for rdevYu Kuai2023-06-131-42/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs") delays the deletion of rdev, however, this introduces a window that rdev can be added again while the deletion is not done yet, and sysfs will complain about duplicate filename. Follow up patches try to fix this problem by flushing workqueue, however, flush_rdev_wq() is just dead code, the progress in md_kick_rdev_from_array(): 1) list_del_rcu(&rdev->same_set); 2) synchronize_rcu(); 3) queue_work(md_rdev_misc_wq, &rdev->del_work); So in flush_rdev_wq(), if rdev is found in the list, work_pending() can never pass, in the meantime, if work is queued, then rdev can never be found in the list. flush_rdev_wq() can be replaced by flush_workqueue() directly, however, this approach is not good: - the workqueue is global, this synchronization for all raid disks is not necessary. - flush_workqueue can't be called under 'reconfig_mutex', there is still a small window between flush_workqueue() and mddev_lock() that other contexts can queue new work, hence the problem is not solved completely. sysfs already has apis to support delete itself through writer, and these apis, specifically sysfs_break/unbreak_active_protection(), is used to support deleting rdev synchronously. Therefore, the above commit can be reverted, and sysfs duplicate filename can be avoided. A new mdadm regression test is proposed as well([1]). [1] https://lore.kernel.org/linux-raid/20230428062845.1975462-1-yukuai1@huaweicloud.com/ Fixes: 5792a2856a63 ("[PATCH] md: avoid a deadlock when removing a device from an md array via sysfs") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523012727.3042247-1-yukuai1@huaweicloud.com
* md/raid10: fix wrong setting of max_corr_read_errorsLi Nan2023-06-131-0/+2
| | | | | | | | | | | There is no input check when echo md/max_read_errors and overflow might occur. Add check of input number. Fixes: 1e50915fe0bb ("raid: improve MD/raid10 handling of correctable read errors.") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230522072535.1523740-3-linan666@huaweicloud.com
* md/raid10: fix overflow of md/safe_mode_delayLi Nan2023-06-131-3/+4
| | | | | | | | | | | | There is no input check when echo md/safe_mode_delay in safe_delay_store(). And msec might also overflow when HZ < 1000 in safe_delay_show(), Fix it by checking overflow in safe_delay_store() and use unsigned long conversion in safe_delay_show(). Fixes: 72e02075a33f ("md: factor out parsing of fixed-point numbers") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230522072535.1523740-2-linan666@huaweicloud.com
* md/raid5: fix a deadlock in the case that reshape is interruptedYu Kuai2023-06-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If reshape is in progress and io across reshape_position is issued, such io will wait for reshape to make progress(see details in the case that make_stripe_request() return STRIPE_SCHEDULE_AND_RETRY). It has been reported several times that if system reboot while growing raid5 to raid6, array assemble will hang infinitely([1, 2]). This is because following deadlock is triggered: 1) a normal io is waiting for reshape to progress, this io can be from system-udevd or mdadm. 2) while assemble, mdadm tries to suspend the array, hence 'reconfig_mutex' is held and mddev_suspend() must wait for normal io to be done. 3) daemon thread can't start reshape because 'reconfig_mutex' can't be held. 1) and 3) is unbreakable because they're foundation design. In order to break 2), following is possible solutions that I can think of: a) Let mddev_suspend() fail is not a good option, because this will break many scenarios since mddev_suspend() doesn't fail before. b) Fail the io that is waiting for reshape to make progress from mddev_suspend(). c) Return false for the io that is waiting for reshape to make progress from raid5_make_request(), and these io will wait for suspend to be done in md_handle_request(), where 'active_io' is not grabbed. c) sounds better than b), however, b) is used because it's easy and straightforward, and it's verified that mdadm can assemble in this case. On the other hand, c) breaks the logic that mddev_suspend() will wait for submitted io to be completely handled. Fix the problem by checking reshape in mddev_suspend(), if reshape can't make progress and there are still some io waiting for reshape, fail those io. [1] https://lore.kernel.org/all/CAFig2csUV2QiomUhj_t3dPOgV300dbQ6XtM9ygKPdXJFSH__Nw@mail.gmail.com/ [2] https://lore.kernel.org/all/CAO2ABipzbw6QL5eNa44CQHjiVa-LTvS696Mh9QaTw+qsUKFUCw@mail.gmail.com/ Reported-by: Jove <jovetoo@gmail.com> Reported-by: David Gilmour <dgilmour76@gmail.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-6-yukuai1@huaweicloud.com
* md: add a new api prepare_suspend() in md_personalityYu Kuai2023-06-131-0/+4
| | | | | | | | | There are no functional changes, the new api will be used later to do special handling for raid456 in md_suspend(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-5-yukuai1@huaweicloud.com
* md: export md_is_rdwr() and is_md_suspended()Yu Kuai2023-06-131-16/+0
| | | | | | | | | The two apis will be used later to fix a deadlock in raid456, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-4-yukuai1@huaweicloud.com
* md: fix data corruption for raid456 when reshape restart while grow upYu Kuai2023-06-131-2/+12
| | | | | | | | | | | | | | | | | Currently, if reshape is interrupted, echo "reshape" to sync_action will restart reshape from scratch, for example: echo frozen > sync_action echo reshape > sync_action This will corrupt data before reshape_position if the array is growing, fix the problem by continue reshape from reshape_position. Reported-by: Peter Neuwirth <reddunur@online.de> Link: https://lore.kernel.org/linux-raid/e2f96772-bfbc-f43b-6da1-f520e5164536@online.de/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230512015610.821290-3-yukuai1@huaweicloud.com
* block: replace fmode_t with a block-specific type for block open flagsChristoph Hellwig2023-06-121-4/+4
| | | | | | | | | | | | | | The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: use the holder as indication for exclusive opensChristoph Hellwig2023-06-121-18/+20
| | | | | | | | | | | | | | | | | | The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove the unused mode argument to ->releaseChristoph Hellwig2023-06-121-1/+1
| | | | | | | | | | | | The mode argument to the ->release block_device_operation is never used, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: pass a gendisk to ->openChristoph Hellwig2023-06-121-3/+3
| | | | | | | | | | | | ->open is only called on the whole device. Make that explicit by passing a gendisk instead of the block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: pass a gendisk on bdev_check_media_changeChristoph Hellwig2023-06-121-1/+1
| | | | | | | | | | | | bdev_check_media_change should only ever be called for the whole device. Pass a gendisk to make that explicit and rename the function to disk_check_media_change. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: introduce holder opsChristoph Hellwig2023-06-051-1/+1
| | | | | | | | | | | | | | Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md: use __bio_add_page to add single pageJohannes Thumshirn2023-05-311-2/+2
| | | | | | | | | | | | | | | | | | | The md-raid superblock writing code uses bio_add_page() to add a page to a newly created bio. bio_add_page() can fail, but the return value is never checked. Use __bio_add_page() as adding a single page to a newly created bio is guaranteed to succeed. This brings us a step closer to marking bio_add_page() as __must_check. Signed-of_-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/ca196f5e650e318106dbb4496eb6cbac4bc800bd.1685532726.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'sysctl-6.4-rc1' of ↵Linus Torvalds2023-04-271-21/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "This only does a few sysctl moves from the kernel/sysctl.c file, the rest of the work has been put towards deprecating two API calls which incur recursion and prevent us from simplifying the registration process / saving memory per move. Most of the changes have been soaking on linux-next since v6.3-rc3. I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's feedback that we should see if we could *save* memory with these moves instead of incurring more memory. We currently incur more memory since when we move a syctl from kernel/sysclt.c out to its own file we end up having to add a new empty sysctl used to register it. To achieve saving memory we want to allow syctls to be passed without requiring the end element being empty, and just have our registration process rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls would make the sysctl registration pretty brittle, hard to read and maintain as can be seen from Meng Tang's efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations also implies doing the work to deprecate two API calls which use recursion in order to support sysctl declarations with subdirectories. And so during this development cycle quite a bit of effort went into this deprecation effort. I've annotated the following two APIs are deprecated and in few kernel releases we should be good to remove them: - register_sysctl_table() - register_sysctl_paths() During this merge window we should be able to deprecate and unexport register_sysctl_paths(), we can probably do that towards the end of this merge window. Deprecating register_sysctl_table() will take a bit more time but this pull request goes with a few example of how to do this. As it turns out each of the conversions to move away from either of these two API calls *also* saves memory. And so long term, all these changes *will* prove to have saved a bit of memory on boot. The way I see it then is if remove a user of one deprecated call, it gives us enough savings to move one kernel/sysctl.c out from the generic arrays as we end up with about the same amount of bytes. Since deprecating register_sysctl_table() and register_sysctl_paths() does not require maintainer coordination except the final unexport you'll see quite a bit of these changes from other pull requests, I've just kept the stragglers after rc3" Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0] * tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits) fs: fix sysctls.c built mm: compaction: remove incorrect #ifdef checks mm: compaction: move compaction sysctl to its own file mm: memory-failure: Move memory failure sysctls to its own file arm: simplify two-level sysctl registration for ctl_isa_vars ia64: simplify one-level sysctl registration for kdump_ctl_table utsname: simplify one-level sysctl registration for uts_kern_table ntfs: simplfy one-level sysctl registration for ntfs_sysctls coda: simplify one-level sysctl registration for coda_table fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls xfs: simplify two-level sysctl registration for xfs_table nfs: simplify two-level sysctl registration for nfs_cb_sysctls nfs: simplify two-level sysctl registration for nfs4_cb_sysctls lockd: simplify two-level sysctl registration for nlm_sysctls proc_sysctl: enhance documentation xen: simplify sysctl registration for balloon md: simplify sysctl registration hv: simplify sysctl registration scsi: simplify sysctl registration with register_sysctl() csky: simplify alignment sysctl registration ...
| * md: simplify sysctl registrationLuis Chamberlain2023-04-131-21/+1
| | | | | | | | | | | | | | | | | | register_sysctl_table() is a deprecated compatibility wrapper. register_sysctl() can do the directory creation for you so just use that. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Song Liu <song@kernel.org>
* | Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linuxLinus Torvalds2023-04-261-12/+15
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - drbd patches, bringing us closer to unifying the out-of-tree version and the in tree one (Andreas, Christoph) - support for auto-quiesce for the s390 dasd driver (Stefan) - MD pull request via Song: - md/bitmap: Optimal last page size (Jon Derrick) - Various raid10 fixes (Yu Kuai, Li Nan) - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk) - NVMe pull request via Christoph: - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - Validate nvmet module parameters (Chaitanya Kulkarni) - Fence TCP socket on receive error (Chris Leech) - Fix async event trace event (Keith Busch) - Minor cleanups (Chaitanya Kulkarni, zhenwei pi) - Fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - Fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - Fix irq locking in nvme-fcloop (Ming Lei) - Remove queue mapping helper for rdma devices (Sagi Grimberg) - use structured request attribute checks for nbd (Jakub) - fix blk-crypto race conditions between keyslot management (Eric) - add sed-opal support for reading read locking range attributes (Ondrej) - make fault injection configurable for null_blk (Akinobu) - clean up the request insertion API (Christoph) - clean up the queue running API (Christoph) - blkg config helper cleanups (Tejun) - lazy init support for blk-iolatency (Tejun) - various fixes and tweaks to ublk (Ming) - remove hybrid polling. It hasn't really been useful since we got async polled IO support, and these days we don't support sync polled IO at all (Keith) - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming, Chaitanya, me) * tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits) nbd: fix incomplete validation of ioctl arg ublk: don't return 0 in case of any failure sed-opal: geometry feature reporting command null_blk: Always check queue mode setting from configfs block: ublk: switch to ioctl command encoding blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush block, bfq: Fix division by zero error on zero wsum fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m block: store bdev->bd_disk->fops->submit_bio state in bdev block: re-arrange the struct block_device fields for better layout md/raid5: remove unused working_disks variable md/raid10: don't call bio_start_io_acct twice for bio which experienced read error md/raid10: fix memleak of md thread md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: don't BUG_ON() in raise_barrier() md: fix soft lockup in status_resync md: add error_handlers for raid0 and linear md: Use optimal I/O size for last bitmap page md: Fix types in sb writer ...
| * | md: fix soft lockup in status_resyncYu Kuai2023-04-131-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | status_resync() will calculate 'curr_resync - recovery_active' to show user a progress bar like following: [============>........] resync = 61.4% 'curr_resync' and 'recovery_active' is updated in md_do_sync(), and status_resync() can read them concurrently, hence it's possible that 'curr_resync - recovery_active' can overflow to a huge number. In this case status_resync() will be stuck in the loop to print a large amount of '=', which will end up soft lockup. Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case, this way resync in progress will be reported to user. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com
| * | md: add error_handlers for raid0 and linearMariusz Tkaczyk2023-04-131-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the commit 9631abdbf406c("md: Set MD_BROKEN for RAID1 and RAID10") MD_BROKEN must be set if array is failed because state_store() checks it. If it is set then -EBUSY is returned to userspace. For raid0 and linear MD_BROKEN is not set by error_handler(). As a result mdadm is unable to trigger clean-up actions. It is a regression. This patch adds appropriate error_handler for raid0 and linear. The error handler sets MD_BROKEN for this device. Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com
| * | md: make kobj_type structures constantThomas Weißschuh2023-04-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230214-kobj_type-md-v1-1-d6853f707f11@weissschuh.net
* | | md: fix regression for null-ptr-deference in __md_stop()Yu Kuai2023-03-291-1/+2
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 3e453522593d ("md: Free resources in __md_stop") tried to fix null-ptr-deference for 'active_io' by moving percpu_ref_exit() to __md_stop(), however, the commit also moving 'writes_pending' to __md_stop(), and this will cause mdadm tests broken: BUG: kernel NULL pointer dereference, address: 0000000000000038 Oops: 0000 [#1] PREEMPT SMP CPU: 15 PID: 17830 Comm: mdadm Not tainted 6.3.0-rc3-next-20230324-00009-g520d37 RIP: 0010:free_percpu+0x465/0x670 Call Trace: <TASK> __percpu_ref_exit+0x48/0x70 percpu_ref_exit+0x1a/0x90 __md_stop+0xe9/0x170 do_md_stop+0x1e1/0x7b0 md_ioctl+0x90c/0x1aa0 blkdev_ioctl+0x19b/0x400 vfs_ioctl+0x20/0x50 __x64_sys_ioctl+0xba/0xe0 do_syscall_64+0x6c/0xe0 entry_SYSCALL_64_after_hwframe+0x63/0xcd And the problem can be reporduced 100% by following test: mdadm -CR /dev/md0 -l1 -n1 /dev/sda --force echo inactive > /sys/block/md0/md/array_state echo read-auto > /sys/block/md0/md/array_state echo inactive > /sys/block/md0/md/array_state Root cause: // start raid raid1_run mddev_init_writes_pending percpu_ref_init // inactive raid array_state_store do_md_stop __md_stop percpu_ref_exit // start raid again array_state_store do_md_run raid1_run mddev_init_writes_pending if (mddev->writes_pending.percpu_count_ptr) // won't reinit // inactive raid again ... percpu_ref_exit -> null-ptr-deference Before the commit, 'writes_pending' is exited when mddev is freed, and it's safe to restart raid because mddev_init_writes_pending() already make sure that 'writes_pending' will only be initialized once. Fix the prblem by moving 'writes_pending' back, it's a litter hard to find the relationship between alloc memory and free memory, however, code changes is much less and we lived with this for a long time already. Fixes: 3e453522593d ("md: Free resources in __md_stop") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230328094400.1448955-1-yukuai1@huaweicloud.com
* | md: avoid signed overflow in slot_store()NeilBrown2023-03-131-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | slot_store() uses kstrtouint() to get a slot number, but stores the result in an "int" variable (by casting a pointer). This can result in a negative slot number if the unsigned int value is very large. A negative number means that the slot is empty, but setting a negative slot number this way will not remove the device from the array. I don't think this is a serious problem, but it could cause confusion and it is best to fix it. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Song Liu <song@kernel.org>
* | md: Free resources in __md_stopXiao Ni2023-03-131-9/+5
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If md_run() fails after ->active_io is initialized, then percpu_ref_exit is called in error path. However, later md_free_disk will call percpu_ref_exit again which leads to a panic because of null pointer dereference. It can also trigger this bug when resources are initialized but are freed in error path, then will be freed again in md_free_disk. BUG: kernel NULL pointer dereference, address: 0000000000000038 Oops: 0000 [#1] PREEMPT SMP Workqueue: md_misc mddev_delayed_delete RIP: 0010:free_percpu+0x110/0x630 Call Trace: <TASK> __percpu_ref_exit+0x44/0x70 percpu_ref_exit+0x16/0x90 md_free_disk+0x2f/0x80 disk_release+0x101/0x180 device_release+0x84/0x110 kobject_put+0x12a/0x380 kobject_put+0x160/0x380 mddev_delayed_delete+0x19/0x30 process_one_work+0x269/0x680 worker_thread+0x266/0x640 kthread+0x151/0x1b0 ret_from_fork+0x1f/0x30 For creating raid device, md raid calls do_md_run->md_run, dm raid calls md_run. We alloc those memory in md_run. For stopping raid device, md raid calls do_md_stop->__md_stop, dm raid calls md_stop->__md_stop. So we can free those memory resources in __md_stop. Fixes: 72adae23a72c ("md: Change active_io to percpu") Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
* md: account io_acct_set usage with active_ioXiao Ni2023-02-081-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_acct_set was enabled for raid0/raid5 io accounting. bios that contain md_io_acct are allocated in the i/o path. There isn't a good method to monitor if these bios are all finished and freed. In the takeover process, io_acct_set (which is used for bios with md_io_acct) need to be freed. However, if some bios finish after io_acct_set is freed, it may trigger the following panic: [ 6973.767999] RIP: 0010:mempool_free+0x52/0x80 [ 6973.786098] Call Trace: [ 6973.786549] md_end_io_acct+0x31/0x40 [ 6973.787227] blk_update_request+0x224/0x380 [ 6973.787994] blk_mq_end_request+0x1a/0x130 [ 6973.788739] blk_complete_reqs+0x35/0x50 [ 6973.789456] __do_softirq+0xd7/0x2c8 [ 6973.790114] ? sort_range+0x20/0x20 [ 6973.790763] run_ksoftirqd+0x2a/0x40 [ 6973.791400] smpboot_thread_fn+0xb5/0x150 [ 6973.792114] kthread+0x10b/0x130 [ 6973.792724] ? set_kthread_struct+0x50/0x50 [ 6973.793491] ret_from_fork+0x1f/0x40 Fix this by increasing and decreasing active_io for each bio with md_io_acct so that mddev_suspend() will wait until all bios from io_acct_set finish before freeing io_acct_set. Reported-by: Fine Fan <ffan@redhat.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
* md: use MD_RESYNC_* whenever possibleHou Tao2023-02-011-3/+3
| | | | | | | | Just replace magic numbers by MD_RESYNC_* enumerations. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
* md: Free writes_pending in md_stopXiao Ni2023-02-011-0/+1
| | | | | | | | dm raid calls md_stop to stop the raid device. It needs to free the writes_pending here. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
* md: Change active_io to percpuXiao Ni2023-02-011-19/+24
| | | | | | | | | | | | | | | | | Now the type of active_io is atomic. It's used to count how many ios are in the submitting process and it's added and decreased very time. But it only needs to check if it's zero when suspending the raid. So we can switch atomic to percpu to improve the performance. After switching active_io to percpu type, we use the state of active_io to judge if the raid device is suspended. And we don't need to wake up ->sb_wait in md_handle_request anymore. It's done in the callback function which is registered when initing active_io. The argument mddev->suspended is only used to count how many users are trying to set raid to suspend state. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
* md: Factor out is_md_suspended helperXiao Ni2023-02-011-5/+12
| | | | | | | | This helper function will be used in next patch. It's easy for understanding. Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
* md: don't update recovery_cp when curr_resync is ACTIVEHou Tao2023-02-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't update recovery_cp when curr_resync is MD_RESYNC_ACTIVE, otherwise md may skip the resync of the first 3 sectors if the resync procedure is interrupted before the first calling of ->sync_request() as shown below: md_do_sync thread control thread // setup resync mddev->recovery_cp = 0 j = 0 mddev->curr_resync = MD_RESYNC_ACTIVE // e.g., set array as idle set_bit(MD_RECOVERY_INTR, &&mddev_recovery) // resync loop // check INTR before calling sync_request !test_bit(MD_RECOVERY_INTR, &mddev->recovery // resync interrupted // update recovery_cp from 0 to 3 // the resync of three 3 sectors will be skipped mddev->recovery_cp = 3 Fixes: eac58d08d493 ("md: Use enum for overloaded magic numbers used by mddev->curr_resync") Cc: stable@vger.kernel.org # 6.0+ Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
* md: fix incorrect declaration about claim_rdev in md_import_deviceAdrian Huang2023-01-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit fb541ca4c365 ("md: remove lock_bdev / unlock_bdev") removes wrappers for blkdev_get/blkdev_put. However, the uninitialized local static variable of pointer type 'claim_rdev' in md_import_device() is NULL, which leads to the following warning call trace: WARNING: CPU: 22 PID: 1037 at block/bdev.c:577 bd_prepare_to_claim+0x131/0x150 CPU: 22 PID: 1037 Comm: mdadm Not tainted 6.2.0-rc3+ #69 .. RIP: 0010:bd_prepare_to_claim+0x131/0x150 .. Call Trace: <TASK> ? _raw_spin_unlock+0x15/0x30 ? iput+0x6a/0x220 blkdev_get_by_dev.part.0+0x4b/0x300 md_import_device+0x126/0x1d0 new_dev_store+0x184/0x240 md_attr_store+0x80/0xf0 kernfs_fop_write_iter+0x128/0x1c0 vfs_write+0x2be/0x3c0 ksys_write+0x5f/0xe0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc It turns out the md device cannot be used: md: could not open device unknown-block(259,0). md: md127 stopped. Fix the issue by declaring the local static variable of struct type and passing the pointer of the variable to blkdev_get_by_dev(). Fixes: fb541ca4c365 ("md: remove lock_bdev / unlock_bdev") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
* block: handle bio_split_to_limits() NULL returnJens Axboe2023-01-041-0/+2
| | | | | | | | This can't happen right now, but in preparation for allowing bio_split_to_limits() returning NULL if it ended the bio, check for it in all the callers. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md: fold unbind_rdev_from_array into md_kick_rdev_from_arrayChristoph Hellwig2022-12-021-21/+16
| | | | | | | | unbind_rdev_from_array is only called from md_kick_rdev_from_array, so merge it into its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
* md: mark md_kick_rdev_from_array staticChristoph Hellwig2022-12-021-2/+1
| | | | | | | | md_kick_rdev_from_array is only used in md.c, so unexport it and mark the symbol static. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
* md: remove lock_bdev / unlock_bdevChristoph Hellwig2022-12-021-41/+22
| | | | | | | | | These wrappers for blkdev_get / blkdev_put just horribly confuse the code with their odd naming. Remove them and improve the error unwinding in md_import_device with the now folded code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
* md: fix a crash in mempool_freeMikulas Patocka2022-11-141-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a crash in mempool_free when running the lvm test shell/lvchange-rebuild-raid.sh. The reason for the crash is this: * super_written calls atomic_dec_and_test(&mddev->pending_writes) and wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev) and bio_put(bio). * so, the process that waited on sb_wait and that is woken up is racing with bio_put(bio). * if the process wins the race, it calls bioset_exit before bio_put(bio) is executed. * bio_put(bio) attempts to free a bio into a destroyed bio set - causing a crash in mempool_free. We fix this bug by moving bio_put before atomic_dec_and_test. We also move rdev_dec_pending before atomic_dec_and_test as suggested by Neil Brown. The function md_end_flush has a similar bug - we must call bio_put before we decrement the number of in-progress bios. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 11557f0067 P4D 11557f0067 PUD 0 Oops: 0002 [#1] PREEMPT SMP CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Workqueue: kdelayd flush_expired_bios [dm_delay] RIP: 0010:mempool_free+0x47/0x80 Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00 RSP: 0018:ffff88910036bda8 EFLAGS: 00010093 RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8 RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900 R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000 R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05 FS: 0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0 Call Trace: <TASK> clone_endio+0xf4/0x1c0 [dm_mod] clone_endio+0xf4/0x1c0 [dm_mod] __submit_bio+0x76/0x120 submit_bio_noacct_nocheck+0xb6/0x2a0 flush_expired_bios+0x28/0x2f [dm_delay] process_one_work+0x1b4/0x300 worker_thread+0x45/0x3e0 ? rescuer_thread+0x380/0x380 kthread+0xc2/0x100 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd] CR2: 0000000000000000 ---[ end trace 0000000000000000 ]--- Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org>
* md: introduce md_ro_stateYe Bin2022-11-141-70/+82
| | | | | | | Introduce md_ro_state for mddev->ro, so it is easy to understand. Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Song Liu <song@kernel.org>
* md: factor out __md_set_array_info()Ye Bin2022-11-141-30/+35
| | | | | | | Factor out __md_set_array_info(). No functional change. Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Song Liu <song@kernel.org>
* Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds2022-10-071-3/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...
| * block: replace blk_queue_nowait with bdev_nowaitChristoph Hellwig2022-09-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | Replace blk_queue_nowait with a bdev_nowait helpers that takes the block_device given that the I/O submission path should not have to look into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * md: Remove extra mddev_get() in md_seq_start()Logan Gunthorpe2022-09-221-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A regression is seen where mddev devices stay permanently after they are stopped due to an elevated reference count. This was tracked down to an extra mddev_get() in md_seq_start(). It only happened rarely because most of the time the md_seq_start() is called with a zero offset. The path with an extra mddev_get() only happens when it starts with a non-zero offset. The commit noted below changed an mddev_get() to check its success but inadvertently left the original call in. Remove the extra call. Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
* | md: call __md_stop_writes in md_stopGuoqing Jiang2022-08-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From the link [1], we can see raid1d was running even after the path raid_dtr -> md_stop -> __md_stop. Let's stop write first in destructor to align with normal md-raid to fix the KASAN issue. [1]. https://lore.kernel.org/linux-raid/CAPhsuW5gc4AakdGNdF8ubpezAuDLFOYUO_sfMZcec6hQFm8nhg@mail.gmail.com/T/#m7f12bf90481c02c6d2da68c64aeed4779b7df74a Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop") Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
* | Revert "md-raid: destroy the bitmap after destroying the thread"Guoqing Jiang2022-08-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | This reverts commit e151db8ecfb019b7da31d076130a794574c89f6f. Because it obviously breaks clustered raid as noticed by Neil though it fixed KASAN issue for dm-raid, let's revert it and fix KASAN issue in next commit. [1]. https://lore.kernel.org/linux-raid/a6657e08-b6a7-358b-2d2a-0ac37d49d23a@linux.dev/T/#m95ac225cab7409f66c295772483d091084a6d470 Fixes: e151db8ecfb0 ("md-raid: destroy the bitmap after destroying the thread") Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <song@kernel.org>
* | md: Flush workqueue md_rdev_misc_wq in md_alloc()David Sloan2022-08-241-0/+1
|/ | | | | | | | | | | | | | | | | | | | A race condition still exists when removing and re-creating md devices in test cases. However, it is only seen on some setups. The race condition was tracked down to a reference still being held to the kobject by the rdev in the md_rdev_misc_wq which will be released in rdev_delayed_delete(). md_alloc() waits for previous deletions by waiting on the md_misc_wq, but the md_rdev_misc_wq may still be holding a reference to a recently removed device. To fix this, also flush the md_rdev_misc_wq in md_alloc(). Signed-off-by: David Sloan <david.sloan@eideticom.com> [logang@deltatee.com: rewrote commit message] Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>