summaryrefslogtreecommitdiffstats
path: root/block
Commit message (Collapse)AuthorAgeFilesLines
* block: Fix partition check for host-aware zoned block devicesShin'ichiro Kawasaki2021-11-021-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e0c60d0102a5ad3475401e1a2faa3d3623eefce4 upstream. Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") modified the method to check partition existence in host-aware zoned block devices from disk_has_partitions() helper function call to empty check of xarray disk->part_tbl. However, disk->part_tbl always has single entry for disk->part0 and never becomes empty. This resulted in the host-aware zoned devices always judged to have partitions, and it made the sysfs queue/zoned attribute to be "none" instead of "host-aware" regardless of partition existence in the devices. This also caused DEBUG_LOCKS_WARN_ON(lock->magic != lock) for sdkp->rev_mutex in scsi layer when the kernel detects host-aware zoned device. Since block layer handled the host-aware zoned devices as non- zoned devices, scsi layer did not have chance to initialize the mutex for zone revalidation. Therefore, the warning was triggered. To fix the issues, call the helper function disk_has_partitions() in place of disk->part_tbl empty check. Since the function was removed with the commit a33df75c6328, reimplement it to walk through entries in the xarray disk->part_tbl. Fixes: a33df75c6328 ("block: use an xarray for disk->part_tbl") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Cc: stable@vger.kernel.org # v5.14+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211026060115.753746-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on ↵Tejun Heo2021-10-271-2/+3
| | | | | | | | | | | | | | | | | | blkg->iostat_cpu commit 5370b0f49078203acf3c064b634a09707167a864 upstream. c3df5fb57fe8 ("cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync") made u64_stats updates irq-safe to avoid A-A deadlocks. Unfortunately, the conversion missed one in blk_cgroup_bio_start(). Fix it. Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat") Cc: stable@vger.kernel.org # v5.13+ Reported-by: syzbot+9738c8815b375ce482a1@syzkaller.appspotmail.com Signed-off-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/YWi7NrQdVlxD6J9W@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs outputJohannes Thumshirn2021-10-271-0/+1
| | | | | | | | | | | | | | | | [ Upstream commit 1dbdd99b511c966be9147ad72991a2856ac76f22 ] While debugging an issue we've found that $DEBUGFS/block/$disk/state doesn't decode QUEUE_FLAG_HCTX_ACTIVE but only displays its numerical value. Add QUEUE_FLAG(HCTX_ACTIVE) to the blk_queue_flag_name array so it'll get decoded properly. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/4351076388918075bd80ef07756f9d2ce63be12c.1633332053.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* block/mq-deadline: Move dd_queued() to fix defined but not used warningGeert Uytterhoeven2021-10-271-6/+6
| | | | | | | | | | | | | | | | | | | | | commit 55a51ea14094a1e7dd0d7f33237d246033dd39ab upstream. If CONFIG_BLK_DEBUG_FS=n: block/mq-deadline.c:274:12: warning: ‘dd_queued’ defined but not used [-Wunused-function] 274 | static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio) | ^~~~~~~~~ Fix this by moving dd_queued() just before the sole function that calls it. Fixes: 7b05bf771084ff78 ("Revert "block/mq-deadline: Prioritize high-priority requests"") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics") Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210830091128.1854266-1-geert@linux-m68k.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* block: don't call rq_qos_ops->done_bio if the bio isn't trackedMing Lei2021-10-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a647a524a46736786c95cdb553a070322ca096e3 ] rq_qos framework is only applied on request based driver, so: 1) rq_qos_done_bio() needn't to be called for bio based driver 2) rq_qos_done_bio() needn't to be called for bio which isn't tracked, such as bios ended from error handling code. Especially in bio_endio(): 1) request queue is referred via bio->bi_bdev->bd_disk->queue, which may be gone since request queue refcount may not be held in above two cases 2) q->rq_qos may be freed in blk_cleanup_queue() when calling into __rq_qos_done_bio() Fix the potential kernel panic by not calling rq_qos_ops->done_bio if the bio isn't tracked. This way is safe because both ioc_rqos_done_bio() and blkcg_iolatency_done_bio() are nop if the bio isn't tracked. Reported-by: Yu Kuai <yukuai3@huawei.com> Cc: tj@kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210924110704.1541818-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* Revert "block, bfq: honor already-setup queue merges"Jens Axboe2021-10-071-13/+3
| | | | | | | | | | | | | | | | [ Upstream commit ebc69e897e17373fbe1daaff1debaa77583a5284 ] This reverts commit 2d52c58b9c9bdae0ca3df6a1eab5745ab3f7d80b. We have had several folks complain that this causes hangs for them, which is especially problematic as the commit has also hit stable already. As no resolution seems to be forthcoming right now, revert the patch. Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503 Fixes: 2d52c58b9c9b ("block, bfq: honor already-setup queue merges") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* blk-cgroup: fix UAF by grabbing blkcg lock before destroying blkg pdLi Jinlin2021-09-301-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 858560b27645e7e97aca37ee8f232cccd658fbd2 ] KASAN reports a use-after-free report when doing fuzz test: [693354.104835] ================================================================== [693354.105094] BUG: KASAN: use-after-free in bfq_io_set_weight_legacy+0xd3/0x160 [693354.105336] Read of size 4 at addr ffff888be0a35664 by task sh/1453338 [693354.105607] CPU: 41 PID: 1453338 Comm: sh Kdump: loaded Not tainted 4.18.0-147 [693354.105610] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.81 07/02/2018 [693354.105612] Call Trace: [693354.105621] dump_stack+0xf1/0x19b [693354.105626] ? show_regs_print_info+0x5/0x5 [693354.105634] ? printk+0x9c/0xc3 [693354.105638] ? cpumask_weight+0x1f/0x1f [693354.105648] print_address_description+0x70/0x360 [693354.105654] kasan_report+0x1b2/0x330 [693354.105659] ? bfq_io_set_weight_legacy+0xd3/0x160 [693354.105665] ? bfq_io_set_weight_legacy+0xd3/0x160 [693354.105670] bfq_io_set_weight_legacy+0xd3/0x160 [693354.105675] ? bfq_cpd_init+0x20/0x20 [693354.105683] cgroup_file_write+0x3aa/0x510 [693354.105693] ? ___slab_alloc+0x507/0x540 [693354.105698] ? cgroup_file_poll+0x60/0x60 [693354.105702] ? 0xffffffff89600000 [693354.105708] ? usercopy_abort+0x90/0x90 [693354.105716] ? mutex_lock+0xef/0x180 [693354.105726] kernfs_fop_write+0x1ab/0x280 [693354.105732] ? cgroup_file_poll+0x60/0x60 [693354.105738] vfs_write+0xe7/0x230 [693354.105744] ksys_write+0xb0/0x140 [693354.105749] ? __ia32_sys_read+0x50/0x50 [693354.105760] do_syscall_64+0x112/0x370 [693354.105766] ? syscall_return_slowpath+0x260/0x260 [693354.105772] ? do_page_fault+0x9b/0x270 [693354.105779] ? prepare_exit_to_usermode+0xf9/0x1a0 [693354.105784] ? enter_from_user_mode+0x30/0x30 [693354.105793] entry_SYSCALL_64_after_hwframe+0x65/0xca [693354.105875] Allocated by task 1453337: [693354.106001] kasan_kmalloc+0xa0/0xd0 [693354.106006] kmem_cache_alloc_node_trace+0x108/0x220 [693354.106010] bfq_pd_alloc+0x96/0x120 [693354.106015] blkcg_activate_policy+0x1b7/0x2b0 [693354.106020] bfq_create_group_hierarchy+0x1e/0x80 [693354.106026] bfq_init_queue+0x678/0x8c0 [693354.106031] blk_mq_init_sched+0x1f8/0x460 [693354.106037] elevator_switch_mq+0xe1/0x240 [693354.106041] elevator_switch+0x25/0x40 [693354.106045] elv_iosched_store+0x1a1/0x230 [693354.106049] queue_attr_store+0x78/0xb0 [693354.106053] kernfs_fop_write+0x1ab/0x280 [693354.106056] vfs_write+0xe7/0x230 [693354.106060] ksys_write+0xb0/0x140 [693354.106064] do_syscall_64+0x112/0x370 [693354.106069] entry_SYSCALL_64_after_hwframe+0x65/0xca [693354.106114] Freed by task 1453336: [693354.106225] __kasan_slab_free+0x130/0x180 [693354.106229] kfree+0x90/0x1b0 [693354.106233] blkcg_deactivate_policy+0x12c/0x220 [693354.106238] bfq_exit_queue+0xf5/0x110 [693354.106241] blk_mq_exit_sched+0x104/0x130 [693354.106245] __elevator_exit+0x45/0x60 [693354.106249] elevator_switch_mq+0xd6/0x240 [693354.106253] elevator_switch+0x25/0x40 [693354.106257] elv_iosched_store+0x1a1/0x230 [693354.106261] queue_attr_store+0x78/0xb0 [693354.106264] kernfs_fop_write+0x1ab/0x280 [693354.106268] vfs_write+0xe7/0x230 [693354.106271] ksys_write+0xb0/0x140 [693354.106275] do_syscall_64+0x112/0x370 [693354.106280] entry_SYSCALL_64_after_hwframe+0x65/0xca [693354.106329] The buggy address belongs to the object at ffff888be0a35580 which belongs to the cache kmalloc-1k of size 1024 [693354.106736] The buggy address is located 228 bytes inside of 1024-byte region [ffff888be0a35580, ffff888be0a35980) [693354.107114] The buggy address belongs to the page: [693354.107273] page:ffffea002f828c00 count:1 mapcount:0 mapping:ffff888107c17080 index:0x0 compound_mapcount: 0 [693354.107606] flags: 0x17ffffc0008100(slab|head) [693354.107760] raw: 0017ffffc0008100 ffffea002fcbc808 ffffea0030bd3a08 ffff888107c17080 [693354.108020] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000 [693354.108278] page dumped because: kasan: bad access detected [693354.108511] Memory state around the buggy address: [693354.108671] ffff888be0a35500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [693354.116396] ffff888be0a35580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.124473] >ffff888be0a35600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.132421] ^ [693354.140284] ffff888be0a35680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.147912] ffff888be0a35700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [693354.155281] ================================================================== blkgs are protected by both queue and blkcg locks and holding either should stabilize them. However, the path of destroying blkg policy data is only protected by queue lock in blkcg_activate_policy()/blkcg_deactivate_policy(). Other tasks can get the blkg policy data before the blkg policy data is destroyed, and use it after destroyed, which will result in a use-after-free. CPU0 CPU1 blkcg_deactivate_policy spin_lock_irq(&q->queue_lock) bfq_io_set_weight_legacy spin_lock_irq(&blkcg->lock) blkg_to_bfqg(blkg) pd_to_bfqg(blkg->pd[pol->plid]) ^^^^^^blkg->pd[pol->plid] != NULL bfqg != NULL pol->pd_free_fn(blkg->pd[pol->plid]) pd_to_bfqg(blkg->pd[pol->plid]) bfqg_put(bfqg) kfree(bfqg) blkg->pd[pol->plid] = NULL spin_unlock_irq(q->queue_lock); bfq_group_set_weight(bfqg, val, 0) bfqg->entity.new_weight ^^^^^^trigger uaf here spin_unlock_irq(&blkcg->lock); Fix by grabbing the matching blkcg lock before trying to destroy blkg policy data. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Li Jinlin <lijinlin3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210914042605.3260596-1-lijinlin3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* block: flush the integrity workqueue in blk_integrity_unregisterLihong Kou2021-09-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3df49967f6f1d2121b0c27c381ca1c8386b1dab9 ] When the integrity profile is unregistered there can still be integrity reads queued up which could see a NULL verify_fn as shown by the race window below: CPU0 CPU1 process_one_work nvme_validate_ns bio_integrity_verify_fn nvme_update_ns_info nvme_update_disk_info blk_integrity_unregister ---set queue->integrity as 0 bio_integrity_process --access bi->profile->verify_fn(bi is a pointer of queue->integity) Before calling blk_integrity_unregister in nvme_update_disk_info, we must make sure that there is no work item in the kintegrityd_wq. Just call blk_flush_integrity to flush the work queue so the bug can be resolved. Signed-off-by: Lihong Kou <koulihong@huawei.com> [hch: split up and shortened the changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20210914070657.87677-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* block: check if a profile is actually registered in blk_integrity_unregisterChristoph Hellwig2021-09-301-1/+5
| | | | | | | | | | | | | | | [ Upstream commit 783a40a1b3ac7f3714d2776fa8ac8cce3535e4f6 ] While clearing the profile itself is harmless, we really should not clear the stable writes flag if it wasn't set due to a registered integrity profile. Reported-by: Lihong Kou <koulihong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20210914070657.87677-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* blk-mq: avoid to iterate over stale requestMing Lei2021-09-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 67f3b2f822b7e71cfc9b42dbd9f3144fa2933e0b ] blk-mq can't run allocating driver tag and updating ->rqs[tag] atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is released. So there is chance to iterating over one stale request just after the tag is allocated and before updating ->rqs[tag]. scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi in-flight requests after scsi host is blocked, so no new scsi command can be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request may have been kept in another slot of ->rqs[], meantime the slot can be allocated out but ->rqs[] isn't updated yet. Then this in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in handling scsi error. Fixes the issue by not iterating over stale request. Cc: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Reported-by: luojiaxing <luojiaxing@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210906065003.439019-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queuesSong Liu2021-09-261-1/+13
| | | | | | | | | | | | | | | | | [ Upstream commit 7f2a6a69f7ced6db8220298e0497cf60482a9d4b ] Limiting number of request to BLK_MAX_REQUEST_COUNT at blk_plug hurts performance for large md arrays. [1] shows resync speed of md array drops for md array with more than 16 HDDs. Fix this by allowing more request at plug queue. The multiple_queue flag is used to only apply higher limit to multiple queue cases. [1] https://lore.kernel.org/linux-raid/CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com/ Tested-by: Marcin Wanat <marcin.wanat@gmail.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()Li Jinlin2021-09-261-0/+1
| | | | | | | | | | | | | | | | [ Upstream commit 884f0e84f1e3195b801319c8ec3d5774e9bf2710 ] The pending timer has been set up in blk_throtl_init(). However, the timer is not deleted in blk_throtl_exit(). This means that the timer handler may still be running after freeing the timer, which would result in a use-after-free. Fix by calling del_timer_sync() to delete the timer in blk_throtl_exit(). Signed-off-by: Li Jinlin <lijinlin3@huawei.com> Link: https://lore.kernel.org/r/20210907121242.2885564-1-lijinlin3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* block: genhd: don't call blkdev_show() with major_names_lock heldTetsuo Handa2021-09-261-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit dfbb3409b27fa42b96f5727a80d3ceb6a8663991 ] If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other combinations), lockdep complains circular locking dependency at __loop_clr_fd(), for major_names_lock serves as a locking dependency aggregating hub across multiple block modules. ====================================================== WARNING: possible circular locking dependency detected 5.14.0+ #757 Tainted: G E ------------------------------------------------------ systemd-udevd/7568 is trying to acquire lock: ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560 but task is already holding lock: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #6 (&lo->lo_mutex){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_killable_nested+0x17/0x20 lo_open+0x23/0x50 [loop] blkdev_get_by_dev+0x199/0x540 blkdev_open+0x58/0x90 do_dentry_open+0x144/0x3a0 path_openat+0xa57/0xda0 do_filp_open+0x9f/0x140 do_sys_openat2+0x71/0x150 __x64_sys_openat+0x78/0xa0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #5 (&disk->open_mutex){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_nested+0x17/0x20 bd_register_pending_holders+0x20/0x100 device_add_disk+0x1ae/0x390 loop_add+0x29c/0x2d0 [loop] blk_request_module+0x5a/0xb0 blkdev_get_no_open+0x27/0xa0 blkdev_get_by_dev+0x5f/0x540 blkdev_open+0x58/0x90 do_dentry_open+0x144/0x3a0 path_openat+0xa57/0xda0 do_filp_open+0x9f/0x140 do_sys_openat2+0x71/0x150 __x64_sys_openat+0x78/0xa0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #4 (major_names_lock){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_nested+0x17/0x20 blkdev_show+0x19/0x80 devinfo_show+0x52/0x60 seq_read_iter+0x2d5/0x3e0 proc_reg_read_iter+0x41/0x80 vfs_read+0x2ac/0x330 ksys_read+0x6b/0xd0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&p->lock){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_nested+0x17/0x20 seq_read_iter+0x37/0x3e0 generic_file_splice_read+0xf3/0x170 splice_direct_to_actor+0x14e/0x350 do_splice_direct+0x84/0xd0 do_sendfile+0x263/0x430 __se_sys_sendfile64+0x96/0xc0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#3){.+.+}-{0:0}: lock_acquire+0xbe/0x1f0 lo_write_bvec+0x96/0x280 [loop] loop_process_work+0xa68/0xc10 [loop] process_one_work+0x293/0x480 worker_thread+0x23d/0x4b0 kthread+0x163/0x180 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: lock_acquire+0xbe/0x1f0 process_one_work+0x280/0x480 worker_thread+0x23d/0x4b0 kthread+0x163/0x180 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: validate_chain+0x1f0d/0x33e0 __lock_acquire+0x92d/0x1030 lock_acquire+0xbe/0x1f0 flush_workqueue+0x8c/0x560 drain_workqueue+0x80/0x140 destroy_workqueue+0x47/0x4f0 __loop_clr_fd+0xb4/0x400 [loop] blkdev_put+0x14a/0x1d0 blkdev_close+0x1c/0x20 __fput+0xfd/0x220 task_work_run+0x69/0xc0 exit_to_user_mode_prepare+0x1ce/0x1f0 syscall_exit_to_user_mode+0x26/0x60 do_syscall_64+0x4c/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 2 locks held by systemd-udevd/7568: #0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0 #1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop] stack backtrace: CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G E 5.14.0+ #757 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 Call Trace: dump_stack_lvl+0x79/0xbf print_circular_bug+0x5d6/0x5e0 ? stack_trace_save+0x42/0x60 ? save_trace+0x3d/0x2d0 check_noncircular+0x10b/0x120 validate_chain+0x1f0d/0x33e0 ? __lock_acquire+0x953/0x1030 ? __lock_acquire+0x953/0x1030 __lock_acquire+0x92d/0x1030 ? flush_workqueue+0x70/0x560 lock_acquire+0xbe/0x1f0 ? flush_workqueue+0x70/0x560 flush_workqueue+0x8c/0x560 ? flush_workqueue+0x70/0x560 ? sched_clock_cpu+0xe/0x1a0 ? drain_workqueue+0x41/0x140 drain_workqueue+0x80/0x140 destroy_workqueue+0x47/0x4f0 ? blk_mq_freeze_queue_wait+0xac/0xd0 __loop_clr_fd+0xb4/0x400 [loop] ? __mutex_unlock_slowpath+0x35/0x230 blkdev_put+0x14a/0x1d0 blkdev_close+0x1c/0x20 __fput+0xfd/0x220 task_work_run+0x69/0xc0 exit_to_user_mode_prepare+0x1ce/0x1f0 syscall_exit_to_user_mode+0x26/0x60 do_syscall_64+0x4c/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f0fd4c661f7 Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006 RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000 R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050 Commit 1c500ad706383f1a ("loop: reduce the loop_ctl_mutex scope") is for breaking "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling a different block module results in forming circular locking dependency due to shared major_names_lock mutex. The simplest fix is to call probe function without holding major_names_lock [1], but Christoph Hellwig does not like such idea. Therefore, instead of holding major_names_lock in blkdev_show(), introduce a different lock for blkdev_show() in order to break "sb_writers#$N => &p->lock => major_names_lock" dependency chain. Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* block, bfq: honor already-setup queue mergesPaolo Valente2021-09-221-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2d52c58b9c9bdae0ca3df6a1eab5745ab3f7d80b ] The function bfq_setup_merge prepares the merging between two bfq_queues, say bfqq and new_bfqq. To this goal, it assigns bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives, the process that generated that I/O is disassociated from bfqq and associated with new_bfqq (merging is actually a redirection). In this respect, bfq_setup_merge increases new_bfqq->ref in advance, adding the number of processes that are expected to be associated with new_bfqq. Unfortunately, the stable-merging mechanism interferes with this setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and before all the expected processes have been associated with bfqq->new_bfqq, bfqq may happen to be stably merged with a different queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq gets changed. So, some of the processes that have been already accounted for in the ref counter of the previous new_bfqq will not be associated with that queue. This creates an unbalance, because those references will never be decremented. This commit fixes this issue by reestablishing the previous, natural behaviour: once bfqq->new_bfqq has been set, it will not be changed until all expected redirections have occurred. Signed-off-by: Davide Zini <davidezini2@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* blkcg: fix memory leak in blk_iolatency_initYanfei Xu2021-09-221-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6f5ddde41069fcd1f0993ec76c9dbbf9d021fd4d ] BUG: memory leak unreferenced object 0xffff888129acdb80 (size 96): comm "syz-executor.1", pid 12661, jiffies 4294962682 (age 15.220s) hex dump (first 32 bytes): 20 47 c9 85 ff ff ff ff 20 d4 8e 29 81 88 ff ff G...... ..).... 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff82264ec8>] kmalloc include/linux/slab.h:591 [inline] [<ffffffff82264ec8>] kzalloc include/linux/slab.h:721 [inline] [<ffffffff82264ec8>] blk_iolatency_init+0x28/0x190 block/blk-iolatency.c:724 [<ffffffff8225b8c4>] blkcg_init_queue+0xb4/0x1c0 block/blk-cgroup.c:1185 [<ffffffff822253da>] blk_alloc_queue+0x22a/0x2e0 block/blk-core.c:566 [<ffffffff8223b175>] blk_mq_init_queue_data block/blk-mq.c:3100 [inline] [<ffffffff8223b175>] __blk_mq_alloc_disk+0x25/0xd0 block/blk-mq.c:3124 [<ffffffff826a9303>] loop_add+0x1c3/0x360 drivers/block/loop.c:2344 [<ffffffff826a966e>] loop_control_get_free drivers/block/loop.c:2501 [inline] [<ffffffff826a966e>] loop_control_ioctl+0x17e/0x2e0 drivers/block/loop.c:2516 [<ffffffff81597eec>] vfs_ioctl fs/ioctl.c:51 [inline] [<ffffffff81597eec>] __do_sys_ioctl fs/ioctl.c:874 [inline] [<ffffffff81597eec>] __se_sys_ioctl fs/ioctl.c:860 [inline] [<ffffffff81597eec>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860 [<ffffffff843fa745>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff843fa745>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae Once blk_throtl_init() queue init failed, blkcg_iolatency_exit() will not be invoked for cleanup. That leads a memory leak. Swap the blk_throtl_init() and blk_iolatency_init() calls can solve this. Reported-by: syzbot+01321b15cc98e6bf96d6@syzkaller.appspotmail.com Fixes: 19688d7f9592 (block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls) Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210915072426.4022924-1-yanfei.xu@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* scsi: bsg: Remove support for SCSI_IOCTL_SEND_COMMANDChristoph Hellwig2021-09-181-1/+4
| | | | | | | | | | | | | | | | [ Upstream commit beec64d0c9749afedf51c3c10cf52de1d9a89cc0 ] SCSI_IOCTL_SEND_COMMAND has been deprecated longer than bsg exists and has been warning for just as long. More importantly it harcodes SCSI CDBs and thus will do the wrong thing on non-SCSI bsg nodes. Link: https://lore.kernel.org/r/20210724072033.1284840-2-hch@lst.de Fixes: aa387cc89567 ("block: add bsg helper library") Reviewed-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* block: bfq: fix bfq_set_next_ioprio_data()Damien Le Moal2021-09-181-1/+1
| | | | | | | | | | | | | | | | | | | commit a680dd72ec336b81511e3bff48efac6dbfa563e7 upstream. For a request that has a priority level equal to or larger than IOPRIO_BE_NR, bfq_set_next_ioprio_data() prints a critical warning but defaults to setting the request new_ioprio field to IOPRIO_BE_NR. This is not consistent with the warning and the allowed values for priority levels. Fix this by setting the request new_ioprio field to IOPRIO_BE_NR - 1, the lowest priority level allowed. Cc: <stable@vger.kernel.org> Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210811033702.368488-2-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMINNiklas Cassel2021-09-181-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | commit 4d643b66089591b4769bcdb6fd1bfeff2fe301b8 upstream. A user space process should not need the CAP_SYS_ADMIN capability set in order to perform a BLKREPORTZONE ioctl. Getting the zone report is required in order to get the write pointer. Neither read() nor write() requires CAP_SYS_ADMIN, so it is reasonable that a user space process that can read/write from/to the device, also can get the write pointer. (Since e.g. writes have to be at the write pointer.) Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: stable@vger.kernel.org # v4.10+ Link: https://lore.kernel.org/r/20210811110505.29649-3-Niklas.Cassel@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* blk-zoned: allow zone management send operations without CAP_SYS_ADMINNiklas Cassel2021-09-181-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ead3b768bb51259e3a5f2287ff5fc9041eb6f450 upstream. Zone management send operations (BLKRESETZONE, BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE) should be allowed under the same permissions as write(). (write() does not require CAP_SYS_ADMIN). Additionally, other ioctls like BLKSECDISCARD and BLKZEROOUT only check if the fd was successfully opened with FMODE_WRITE. (They do not require CAP_SYS_ADMIN). Currently, zone management send operations require both CAP_SYS_ADMIN and that the fd was successfully opened with FMODE_WRITE. Remove the CAP_SYS_ADMIN requirement, so that zone management send operations match the access control requirement of write(), BLKSECDISCARD and BLKZEROOUT. Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Aravind Ramesh <aravind.ramesh@wdc.com> Reviewed-by: Adam Manzanares <a.manzanares@samsung.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: stable@vger.kernel.org # v4.10+ Link: https://lore.kernel.org/r/20210811110505.29649-2-Niklas.Cassel@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bio: fix page leak bio_add_hw_page failurePavel Begunkov2021-09-151-2/+13
| | | | | | | | | | | | | | | | commit d9cf3bd531844ffbfe94b16e417037a16efc988d upstream. __bio_iov_append_get_pages() doesn't put not appended pages on bio_add_hw_page() failure, so potentially leaking them, fix it. Also, do the same for __bio_iov_iter_get_pages(), even though it looks like it can't be triggered by userspace in this case. Fixes: 0512a75b98f8 ("block: Introduce REQ_OP_ZONE_APPEND") Cc: stable@vger.kernel.org # 5.8+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1edfa6a2ffd66d55e6345a477df5387d2c1415d0.1626653825.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* blk-crypto: fix check for too-large dun_bytesEric Biggers2021-09-151-1/+1
| | | | | | | | | | | | | | | | [ Upstream commit cc40b7225151f611ef837f6403cfaeadc7af214a ] dun_bytes needs to be less than or equal to the IV size of the encryption mode, not just less than or equal to BLK_CRYPTO_MAX_IV_SIZE. Currently this doesn't matter since blk_crypto_init_key() is never actually passed invalid values, but we might as well fix this. Fixes: a892c8d52c02 ("block: Inline encryption support for blk-mq") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20210825055918.51975-1-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* block: return ELEVATOR_DISCARD_MERGE if possibleMing Lei2021-09-154-16/+8
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 866663b7b52d2da267b28e12eed89ee781b8fed1 ] When merging one bio to request, if they are discard IO and the queue supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE because both block core and related drivers(nvme, virtio-blk) doesn't handle mixed discard io merge(traditional IO merge together with discard merge) well. Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation, so both blk-mq and drivers just need to handle multi-range discard. Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 2705dfb20947 ("block: fix discard request merge") Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* blk-throtl: optimize IOPS throttle for large IO scenariosChunguang Xu2021-09-153-0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4f1e9630afe6332de7286820fedd019f19eac057 ] After patch 54efd50 (block: make generic_make_request handle arbitrarily sized bios), the IO through io-throttle may be larger, and these IOs may be further split into more small IOs. However, IOPS throttle does not seem to be aware of this change, which makes the calculation of IOPS of large IOs incomplete, resulting in disk-side IOPS that does not meet expectations. Maybe we should fix this problem. We can reproduce it by set max_sectors_kb of disk to 128, set blkio.write_iops_throttle to 100, run a dd instance inside blkio and use iostat to watch IOPS: dd if=/dev/zero of=/dev/sdb bs=1M count=1000 oflag=direct As a result, without this change the average IOPS is 1995, with this change the IOPS is 98. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/65869aaad05475797d63b4c3fed4f529febe3c26.1627876014.git.brookxu@tencent.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* Merge tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-271-42/+16
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: - Revert the mq-deadline priority handling, it's causing serious performance regressions. While experimental patches exists to fix this up, it's too late to do so now. Revert it and re-do it properly for 5.15 instead. - Fix a NULL vs IS_ERR() regression in this release (Dan) - Fix a mq-deadline accounting regression in this release (Bart) - Mark cryptoloop as deprecated. It's broken and dm-crypt fully supports it, and it's actively intefering with loop. Plan on removal for 5.16 (Christoph) * tag 'block-5.14-2021-08-27' of git://git.kernel.dk/linux-block: cryptoloop: add a deprecation warning pd: fix a NULL vs IS_ERR() check Revert "block/mq-deadline: Prioritize high-priority requests" mq-deadline: Fix request accounting
| * Revert "block/mq-deadline: Prioritize high-priority requests"Jens Axboe2021-08-261-37/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit fb926032b3209300f9dc454a36b8299582ae545c. Zhen reports that this commit slows down mq-deadline on a 128 thread box, going from 258K IOPS to 170-180K. My testing shows that Optane gen2 IOPS goes from 2.3M IOPS to 1.2M IOPS on a 64 thread box. Looking in detail at the code, the main culprit here is needing to sum percpu counters in the dispatch hot path, leading to very high CPU utilization there. To make matters worse, the code currently needs to sum 2 percpu counters, and it does so in the most naive way of iterating possible CPUs _twice_. Since we're close to release, revert this commit and we can re-do it with regular per-priority counters instead for the 5.15 kernel. Link: https://lore.kernel.org/linux-block/20210826144039.2143-1-thunder.leizhen@huawei.com/ Reported-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * mq-deadline: Fix request accountingBart Van Assche2021-08-241-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The block layer may call the I/O scheduler .finish_request() callback without having called the .insert_requests() callback. Make sure that the mq-deadline I/O statistics are correct if the block layer inserts an I/O request that bypasses the I/O scheduler. This patch prevents that lower priority I/O is delayed longer than necessary for mixed I/O priority workloads. Cc: Niklas Cassel <Niklas.Cassel@wdc.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by: Niklas Cassel <Niklas.Cassel@wdc.com> Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210824170520.1659173-1-bvanassche@acm.org Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Tested-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-214-32/+20
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "Three fixes from Ming Lei that should go into 5.14: - Fix for a kernel panic when iterating over tags for some cases where a flush request is present, a regression in this cycle. - Request timeout fix - Fix flush request checking" * tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block: blk-mq: fix is_flush_rq blk-mq: fix kernel panic during iterating over flush request blk-mq: don't grab rq's refcount in blk_mq_check_expired()
| * blk-mq: fix is_flush_rqMing Lei2021-08-173-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | is_flush_rq() is called from bt_iter()/bt_tags_iter(), and runs the following check: hctx->fq->flush_rq == req but the passed hctx from bt_iter()/bt_tags_iter() may be NULL because: 1) memory re-order in blk_mq_rq_ctx_init(): rq->mq_hctx = data->hctx; ... refcount_set(&rq->ref, 1); OR 2) tag re-use and ->rqs[] isn't updated with new request. Fix the issue by re-writing is_flush_rq() as: return rq->end_io == flush_end_io; which turns out simpler to follow and immune to data race since we have ordered WRITE rq->end_io and refcount_set(&rq->ref, 1). Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Cc: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de> Cc: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210818010925.607383-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: fix kernel panic during iterating over flush requestMing Lei2021-08-172-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For fixing use-after-free during iterating over requests, we grabbed request's refcount before calling ->fn in commit 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter"). Turns out this way may cause kernel panic when iterating over one flush request: 1) old flush request's tag is just released, and this tag is reused by one new request, but ->rqs[] isn't updated yet 2) the flush request can be re-used for submitting one new flush command, so blk_rq_init() is called at the same time 3) meantime blk_mq_queue_tag_busy_iter() is called, and old flush request is retrieved from ->rqs[tag]; when blk_mq_put_rq_ref() is called, flush_rq->end_io may not be updated yet, so NULL pointer dereference is triggered in blk_mq_put_rq_ref(). Fix the issue by calling refcount_set(&flush_rq->ref, 1) after flush_rq->end_io is set. So far the only other caller of blk_rq_init() is scsi_ioctl_reset() in which the request doesn't enter block IO stack and the request reference count isn't used, so the change is safe. Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Reported-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de> Tested-by: "Blank-Burian, Markus, Dr." <blankburian@uni-muenster.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210811142624.618598-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: don't grab rq's refcount in blk_mq_check_expired()Ming Lei2021-08-171-25/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inside blk_mq_queue_tag_busy_iter() we already grabbed request's refcount before calling ->fn(), so needn't to grab it one more time in blk_mq_check_expired(). Meantime remove extra request expire check in blk_mq_check_expired(). Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210811155202.629575-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-137-313/+22
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A few fixes for block that should go into 5.14: - Revert the mq-deadline cgroup addition. More work is needed on this front, let's revert it for now and get it right before having it in a released kernel (Tejun) - blk-iocost lockdep fix (Ming) - nbd double completion fix (Xie) - Fix for non-idling when clearing the shared tag flag (Yu)" * tag 'block-5.14-2021-08-13' of git://git.kernel.dk/linux-block: nbd: Aovid double completion of a request blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHARED Revert "block/mq-deadline: Add cgroup support" blk-iocost: fix lockdep warning on blkcg->lock
| * blk-mq: clear active_queues before clearing BLK_MQ_F_TAG_QUEUE_SHAREDYu Kuai2021-08-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We run a test that delete and recover devcies frequently(two devices on the same host), and we found that 'active_queues' is super big after a period of time. If device a and device b share a tag set, and a is deleted, then blk_mq_exit_queue() will clear BLK_MQ_F_TAG_QUEUE_SHARED because there is only one queue that are using the tag set. However, if b is still active, the active_queues of b might never be cleared even if b is deleted. Thus clear active_queues before BLK_MQ_F_TAG_QUEUE_SHARED is cleared. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210731062130.1533893-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Revert "block/mq-deadline: Add cgroup support"Tejun Heo2021-08-115-307/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support") and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a WARN_ON_ONCE() call"). The added cgroup support has the following issues: * It breaks cgroup interface file format rule by adding custom elements to a nested key-value file. * It registers mq-deadline as a cgroup-aware policy even though all it's doing is collecting per-cgroup stats. Even if we need these stats, this isn't the right way to add them. * It hasn't been reviewed from cgroup side. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-iocost: fix lockdep warning on blkcg->lockMing Lei2021-08-091-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkcg->lock depends on q->queue_lock which may depend on another driver lock required in irq context, one example is dm-thin: Chain exists of: &pool->lock#3 --> &q->queue_lock --> &blkcg->lock Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&blkcg->lock); local_irq_disable(); lock(&pool->lock#3); lock(&q->queue_lock); <Interrupt> lock(&pool->lock#3); Fix the issue by using spin_lock_irq(&blkcg->lock) in ioc_weight_write(). Cc: Tejun Heo <tj@kernel.org> Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Link: https://lore.kernel.org/linux-block/CA+QYu4rzz6079ighEanS3Qq_Dmnczcf45ZoJoHKVLVATTo1e4Q@mail.gmail.com/T/#u Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210803070608.1766400-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge branch 'for-5.14-fixes' of ↵Linus Torvalds2021-08-091-6/+8
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "One commit to fix a possible A-A deadlock around u64_stats_sync on 32bit machines caused by updating it without disabling IRQ when it may be read from IRQ context" * 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync
| * | cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_syncTejun Heo2021-07-271-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") added cgroup_rstat_flush_irqsafe() allowing flushing to happen from the irq context. However, rstat paths use u64_stats_sync to synchronize access to 64bit stat counters on 32bit machines. u64_stats_sync is implemented using seq_lock and trying to read from an irq context can lead to A-A deadlock if the irq happens to interrupt the stat update. Fix it by using the irqsafe variants - u64_stats_update_begin_irqsave() and u64_stats_update_end_irqrestore() - in the update paths. Note that none of this matters on 64bit machines. All these are just for 32bit SMP setups. Note that the interface was introduced way back, its first and currently only use was recently added by 2d146aa3aa84 ("mm: memcontrol: switch to rstat"). Stable tagging targets this commit. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Rik van Riel <riel@surriel.com> Fixes: 2d146aa3aa84 ("mm: memcontrol: switch to rstat") Cc: stable@vger.kernel.org # v5.13+
* | | Merge tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-073-3/+7
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A few minor fixes: - Fix ldm kernel-doc warning (Bart) - Fix adding offset twice for DMA address in n64cart (Christoph) - Fix use-after-free in dasd path handling (Stefan) - Order kyber insert trace correctly (Vincent) - raid1 errored write handling fix (Wei) - Fix blk-iolatency queue get failure handling (Yu)" * tag 'block-5.14-2021-08-07' of git://git.kernel.dk/linux-block: kyber: make trace_block_rq call consistent with documentation block/partitions/ldm.c: Fix a kernel-doc warning blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit() n64cart: fix the dma address in n64cart_do_bvec s390/dasd: fix use after free in dasd path handling md/raid10: properly indicate failure when ending a failed write request
| * | kyber: make trace_block_rq call consistent with documentationVincent Fu2021-08-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kyber ioscheduler calls trace_block_rq_insert() *after* the request is added to the queue but the documentation for trace_block_rq_insert() says that the call should be made *before* the request is added to the queue. Move the tracepoint for the kyber ioscheduler so that it is consistent with the documentation. Signed-off-by: Vincent Fu <vincent.fu@samsung.com> Link: https://lore.kernel.org/r/20210804194913.10497-1-vincent.fu@samsung.com Reviewed by: Adam Manzanares <a.manzanares@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block/partitions/ldm.c: Fix a kernel-doc warningBart Van Assche2021-08-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following kernel-doc warning that appears when building with W=1: block/partitions/ldm.c:31: warning: expecting prototype for ldm(). Prototype was for ldm_debug() instead Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210805173447.3249906-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-iolatency: error out if blk_get_queue() failed in iolatency_set_limit()Yu Kuai2021-08-051-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If queue is dying while iolatency_set_limit() is in progress, blk_get_queue() won't increment the refcount of the queue. However, blk_put_queue() will still decrement the refcount later, which will cause the refcout to be unbalanced. Thus error out in such case to fix the problem. Fixes: 8c772a9bfc7c ("blk-iolatency: fix IO hang due to negative inflight counter") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210805124645.543797-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-blockLinus Torvalds2021-07-303-20/+11
|\| | | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: - gendisk freeing fix (Christoph) - blk-iocost wake ordering fix (Tejun) - tag allocation error handling fix (John) - loop locking fix. While this isn't the prettiest fix in the world, nobody has any good alternatives for 5.14. Something to likely revisit for 5.15. (Tetsuo) * tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block: block: delay freeing the gendisk blk-iocost: fix operation ordering in iocg_wake_fn() blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling loop: reintroduce global lock for safe loop_validate_file() traversal
| * block: delay freeing the gendiskChristoph Hellwig2021-07-271-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkdev_get_no_open acquires a reference to the block_device through the block device inode and then tries to acquire a device model reference to the gendisk. But at this point the disk migh already be freed (although the race is free). Fix this by only freeing the gendisk from the whole device bdevs ->free_inode callback as well. Fixes: 22ae8ce8b892 ("block: simplify bdev/disk lookup in blkdev_get") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210722075402.983367-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-iocost: fix operation ordering in iocg_wake_fn()Tejun Heo2021-07-271-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iocg_wake_fn() open-codes wait_queue_entry removal and wakeup because it wants the wq_entry to be always removed whether it ended up waking the task or not. finish_wait() tests whether wq_entry needs removal without grabbing the wait_queue lock and expects the waker to use list_del_init_careful() after all waking operations are complete, which iocg_wake_fn() didn't do. The operation order was wrong and the regular list_del_init() was used. The result is that if a waiter wakes up racing the waker, it can free pop the wq_entry off stack before the waker is still looking at it, which can lead to a backtrace like the following. [7312084.588951] general protection fault, probably for non-canonical address 0x586bf4005b2b88: 0000 [#1] SMP ... [7312084.647079] RIP: 0010:queued_spin_lock_slowpath+0x171/0x1b0 ... [7312084.858314] Call Trace: [7312084.863548] _raw_spin_lock_irqsave+0x22/0x30 [7312084.872605] try_to_wake_up+0x4c/0x4f0 [7312084.880444] iocg_wake_fn+0x71/0x80 [7312084.887763] __wake_up_common+0x71/0x140 [7312084.895951] iocg_kick_waitq+0xe8/0x2b0 [7312084.903964] ioc_rqos_throttle+0x275/0x650 [7312084.922423] __rq_qos_throttle+0x20/0x30 [7312084.930608] blk_mq_make_request+0x120/0x650 [7312084.939490] generic_make_request+0xca/0x310 [7312084.957600] submit_bio+0x173/0x200 [7312084.981806] swap_readpage+0x15c/0x240 [7312084.989646] read_swap_cache_async+0x58/0x60 [7312084.998527] swap_cluster_readahead+0x201/0x320 [7312085.023432] swapin_readahead+0x2df/0x450 [7312085.040672] do_swap_page+0x52f/0x820 [7312085.058259] handle_mm_fault+0xa16/0x1420 [7312085.066620] do_page_fault+0x2c6/0x5c0 [7312085.074459] page_fault+0x2f/0x40 Fix it by switching to list_del_init_careful() and putting it at the end. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Rik van Riel <riel@surriel.com> Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handlingJohn Garry2021-07-271-13/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the blk_mq_sched_alloc_tags() -> blk_mq_alloc_rqs() call fails, then we call blk_mq_sched_free_tags() -> blk_mq_free_rqs(). It is incorrect to do so, as any rqs would have already been freed in the blk_mq_alloc_rqs() call. Fix by calling blk_mq_free_rq_map() only directly. Fixes: 6917ff0b5bd41 ("blk-mq-sched: refactor scheduler initialization") Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1627378373-148090-1-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-blockLinus Torvalds2021-07-099-27/+63
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more block updates from Jens Axboe: "A combination of changes that ended up depending on both the driver and core branch (and/or the IDE removal), and a few late arriving fixes. In detail: - Fix io ticks wrap-around issue (Chunguang) - nvme-tcp sock locking fix (Maurizio) - s390-dasd fixes (Kees, Christoph) - blk_execute_rq polling support (Keith) - blk-cgroup RCU iteration fix (Yu) - nbd backend ID addition (Prasanna) - Partition deletion fix (Yufen) - Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph) - Removal of now dead block request types due to IDE removal (Christoph) - Loop probing and control device cleanups (Christoph) - Device uevent fix (Christoph) - Misc cleanups/fixes (Tetsuo, Christoph)" * tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits) blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs block: fix the problem of io_ticks becoming smaller nvme-tcp: can't set sk_user_data without write_lock loop: remove unused variable in loop_set_status() block: remove the bdgrab in blk_drop_partitions block: grab a device refcount in disk_uevent s390/dasd: Avoid field over-reading memcpy() dasd: unexport dasd_set_target_state block: check disk exist before trying to add partition ubd: remove dead code in ubd_setup_common nvme: use return value from blk_execute_rq() block: return errors from blk_execute_rq() nvme: use blk_execute_rq() for passthrough commands block: support polling through blk_execute_rq block: remove REQ_OP_SCSI_{IN,OUT} block: mark blk_mq_init_queue_data static loop: rewrite loop_exit using idr_for_each_entry loop: split loop_lookup loop: don't allow deleting an unspecified loop device loop: move loop_ctl_mutex locking into loop_add ...
| * blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgsYu Kuai2021-07-071-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We run a test that create millions of cgroups and blkgs, and then trigger blkg_destroy_all(). blkg_destroy_all() will hold spin lock for a long time in such situation. Thus release the lock when a batch of blkgs are destroyed. blkcg_activate_policy() and blkcg_deactivate_policy() might have the same problem, however, as they are basically only called from module init/exit paths, let's leave them alone for now. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210707015649.1929797-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: fix the problem of io_ticks becoming smallerChunguang Xu2021-07-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On the IO submission path, blk_account_io_start() may interrupt the system interruption. When the interruption returns, the value of part->stamp may have been updated by other cores, so the time value collected before the interruption may be less than part-> stamp. So when this happens, we should do nothing to make io_ticks more accurate? For kernels less than 5.0, this may cause io_ticks to become smaller, which in turn may cause abnormal ioutil values. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/1625521646-1069-1-git-send-email-brookxu.cn@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: remove the bdgrab in blk_drop_partitionsChristoph Hellwig2021-07-011-5/+1
| | | | | | | | | | | | | | | | There is no need to hold a bdev reference when removing the partition. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210701081638.246552-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: grab a device refcount in disk_ueventChristoph Hellwig2021-07-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | Sending uevents requires the struct device to be alive. To ensure that grab the device refcount instead of just an inode reference. Fixes: bc359d03c7ec ("block: add a disk_uevent helper") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210701081638.246552-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: check disk exist before trying to add partitionYufen Yu2021-06-301-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If disk have been deleted, we should return fail for ioctl BLKPG_DEL_PARTITION. Otherwise, the directory /sys/class/block may remain invalid symlinks file. The race as following: blkdev_open del_gendisk disk->flags &= ~GENHD_FL_UP; blk_drop_partitions blkpg_ioctl bdev_add_partition add_partition device_add device_add_class_symlinks ioctl may add_partition after del_gendisk() have tried to delete partitions. Then, symlinks file will be created. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Link: https://lore.kernel.org/r/20210610023241.3646241-1-yuyufen@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>