path: root/block/blk-mq.c
Commit message | Author | Age | Files | Lines
...
* | Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux (Linus Torvalds, 2023-02-20, 1 file, -83/+69)

    Pull block updates from Jens Axboe:

     - NVMe updates via Christoph:
         - Small improvements to the logging functionality (Amit Engel)
         - Authentication cleanups (Hannes Reinecke)
         - Cleanup and optimize the DMA mapping code in the PCIe driver
           (Keith Busch)
         - Work around the command effects for Format NVM (Keith Busch)
         - Misc cleanups (Keith Busch, Christoph Hellwig)
         - Fix and cleanup freeing single sgl (Keith Busch)

     - MD updates via Song:
         - Fix a rare crash during the takeover process
         - Don't update recovery_cp when curr_resync is ACTIVE
         - Free writes_pending in md_stop
         - Change active_io to percpu

     - Updates to drbd, inching us closer to unifying the out-of-tree
       driver with the in-tree one (Andreas, Christoph, Lars, Robert)

     - BFQ update adding support for multi-actuator drives (Paolo,
       Federico, Davide)

     - Make brd compliant with REQ_NOWAIT (me)

     - Fix for IOPOLL and queue entering, fixing stalled IO waiting on
       timeouts (me)

     - Fix for REQ_NOWAIT with multiple bios (me)

     - Fix memory leak in blktrace cleanup (Greg)

     - Clean up sbitmap and fix a potential hang (Kemeng)

     - Clean up some bits in BFQ, and fix a bug in the request injection
       (Kemeng)

     - Clean up the request allocation and issue code, and fix some bugs
       related to that (Kemeng)

     - ublk updates and fixes:
         - Add support for unprivileged ublk (Ming)
         - Improve device deletion handling (Ming)
         - Misc (Liu, Ziyang)

     - s390 dasd fixes (Alexander, Qiheng)

     - Improve utility of request caching and fixes (Anuj, Xiao)

     - zoned cleanups (Pankaj)

     - More constification for kobjs (Thomas)

     - blk-iocost cleanups (Yu)

     - Remove bio splitting from drivers that don't need it (Christoph)

     - Switch blk-cgroups to use struct gendisk. Some of this is now
       incomplete as select late reverts were done. (Christoph)

     - Add bvec initialization helpers, and convert callers to use that
       rather than open-coding it (Christoph)

     - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin,
       Matthew, Ulf, Zhong)

    * tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
      brd: use radix_tree_maybe_preload instead of radix_tree_preload
      block: use proper return value from bio_failfast()
      block: bio-integrity: Copy flags when bio_integrity_payload is cloned
      block: Fix io statistics for cgroup in throttle path
      brd: mark as nowait compatible
      brd: check for REQ_NOWAIT and set correct page allocation mask
      brd: return 0/-error from brd_insert_page()
      block: sync mixed merged request's failfast with 1st bio's
      Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
      Revert "blk-cgroup: pass a gendisk to blkg_lookup"
      Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
      Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
      Revert "blk-cgroup: move the cgroup information to struct gendisk"
      nvme-pci: remove iod use_sgls
      nvme-pci: fix freeing single sgl
      block: ublk: check IO buffer based on flag need_get_data
      s390/dasd: Fix potential memleak in dasd_eckd_init()
      s390/dasd: sort out physical vs virtual pointers usage
      block: Remove the ALLOC_CACHE_SLACK constant
      block: make kobj_type structures constant
      ...
| * block: Merge bio before checking ->cached_rq (Xiao Ni, 2023-02-09, 1 file, -3/+4)

    The code checks whether plug->cached_rq is empty before merging the
    bio. But the merge action has no relationship with plug->cached_rq;
    it tries to merge the bio with requests within plug->mq_list. As it
    stands, if ->cached_rq is empty, merge chances are missed. So move
    the merge ahead of the ->cached_rq check.

    Signed-off-by: Xiao Ni <xni@redhat.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20230209031930.27354-1-xni@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
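    For illustration, the reordered lookup roughly takes this shape (a
    sketch against blk_mq_get_cached_request(); not the verbatim patch):

        if (!plug)
                return NULL;
        /* attempt the merge before touching the cached request list */
        if (blk_mq_attempt_bio_merge(q, *bio, nsegs)) {
                *bio = NULL;
                return NULL;
        }
        rq = rq_list_peek(&plug->cached_rq);
        if (!rq || rq->q != q)
                return NULL;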
| * blk-mq: use switch/case to improve readability in blk_mq_try_issue_list_directly (Kemeng Shi, 2023-02-06, 1 file, -9/+13)

    Use switch/case to handle errors, as other functions do, to improve
    readability in blk_mq_try_issue_list_directly.

    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
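    The resulting structure, sketched (the status names are real
    blk_status_t values; the insert/commit details are abbreviated and
    illustrative, not the verbatim diff):

        switch (ret) {
        case BLK_STS_OK:
                queued++;
                break;
        case BLK_STS_RESOURCE:
        case BLK_STS_DEV_RESOURCE:
                /* out of tags/resources: re-insert and stop this batch */
                blk_mq_request_bypass_insert(rq, false, true);
                goto out;
        default:
                blk_mq_end_request(rq, ret);
                break;
        }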
| * blk-mq: remove set of bd->last when get driver tag for next request fails (Kemeng Shi, 2023-02-06, 1 file, -22/+2)

    Commit 113285b473824 ("blk-mq: ensure that bd->last is always set
    correctly") sets last if we failed to get a driver tag for the next
    request, to avoid a missed flush: we break the list walk and will not
    send the final request in the list, which would normally be sent with
    last set. This code is stale now, because the flush it introduces is
    always redundant:

    - If tags are really exhausted, we send an extra flush anyway when we
      find the list non-empty after the list walk.
    - If some tag is freed before the retry in blk_mq_prep_dispatch_rq
      for the next request, we get a tag for the next request on retry,
      and the flush notified earlier is unnecessary.

    Just remove this stale code.

    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: remove unnecessary error count and check in blk_mq_dispatch_rq_list (Kemeng Shi, 2023-02-06, 1 file, -6/+5)

    blk_mq_dispatch_rq_list returns a bool indicating whether the hctx is
    busy: true if we are not busy and can handle more, false otherwise.
    Inside blk_mq_dispatch_rq_list, errors is only used when the list is
    empty, and we return true if the list is empty and
    (errors + queued) != 0. There are three kinds of status returned for
    a request:

    - busy errors (BLK_STS*_RESOURCE): the failed request is added back
      to the list, so the list will not be empty;
    - BLK_STS_OK: counted in queued;
    - any other error: counted in errors.

    If the list ends up empty, no request hit a busy error, so
    (errors + queued) equals the total number of requests in the list,
    which was checked to be non-empty at the beginning of
    blk_mq_dispatch_rq_list. So (errors + queued) != 0 always holds when
    the list is empty, and both the check and the errors count are
    unneeded.

    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: simplify flush check in blk_mq_dispatch_rq_list (Kemeng Shi, 2023-02-06, 1 file, -3/+3)

    1. Remove the checks of needs_resource and
       ret == BLK_STS_DEV_RESOURCE. For busy errors (BLK_STS*_RESOURCE),
       the request is always added back to the list, so needs_resource
       cannot be true and ret cannot be BLK_STS_DEV_RESOURCE when the
       list is empty. These checks are dead and can be removed.

    2. Check the ret of the last request instead of errors. If the list
       is empty, we only need to explicitly commit_rqs if an error
       happened on the last request, which is stored in ret. Checking ret
       instead of errors removes unnecessary commit_rqs calls triggered
       by errors returned from earlier requests.

    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: use blk_mq_commit_rqs helper in blk_mq_try_issue_list_directly (Kemeng Shi, 2023-02-06, 1 file, -10/+3)

    Call blk_mq_commit_rqs instead of accessing ->commit_rqs directly. As
    the comment on blk_mq_commit_rqs explains, an explicit call is only
    needed in two cases:

    - we did not queue everything initially scheduled to queue;
    - the last attempt to queue a request failed.

    Both cases can be detected from the ret of the last request, which
    breaks the list walk. We can then drop the unnecessary error count
    and the unnecessary commits triggered by errors outside the cases
    described above.

    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: remove unnecessary error count and commit in blk_mq_plug_issue_direct (Kemeng Shi, 2023-02-06, 1 file, -10/+4)

    We only need to commit explicitly in two error cases:

    - we did not queue everything initially scheduled to queue;
    - the last attempt to queue a request failed

    (see the comment on blk_mq_commit_rqs for details). Both cases can be
    detected from the ret of the last request, which breaks the list
    walk. Remove the unnecessary error count and the unnecessary commit
    triggered by errors not covered by the cases described above.

    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: make blk_mq_commit_rqs a general function for all commits (Kemeng Shi, 2023-02-06, 1 file, -14/+23)

    1. Move blk_mq_commit_rqs ahead of the functions that need commits.
    2. Add a queued check so blk_mq_commit_rqs only commits if any
       request was actually queued, keeping commit behavior consistent
       and removing unnecessary commits.
    3. Split the queued clearing out of blk_mq_plug_commit_rqs, as it is
       not wanted in the general case.
    4. Sync the current caller of blk_mq_commit_rqs with the new general
       blk_mq_commit_rqs.
    5. Document the rule for the unusual cases that need an explicit
       commit_rqs. A sketch of the resulting helper follows below.

    Suggested-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
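    A sketch of the resulting general helper, reconstructed from the
    description above rather than copied from the tree:

        static void blk_mq_commit_rqs(struct blk_mq_hw_ctx *hctx, int queued,
                                      bool from_schedule)
        {
                /* only notify the driver if anything was actually queued */
                if (hctx->queue->mq_ops->commit_rqs && queued) {
                        trace_block_unplug(hctx->queue, queued, !from_schedule);
                        hctx->queue->mq_ops->commit_rqs(hctx);
                }
        }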
| * blk-mq: remove unnecessary from_schedule parameter in blk_mq_plug_issue_direct (Kemeng Shi, 2023-02-06, 1 file, -5/+5)

    blk_mq_plug_issue_direct tries to issue the batch of requests in the
    plug list to the driver directly. We only issue plug requests to the
    driver when not coming from the scheduler, so the from_schedule
    parameter of blk_mq_plug_issue_direct is always false. Remove the
    unnecessary parameter.

    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: remove unnecessary list_empty check in blk_mq_try_issue_list_directly (Kemeng Shi, 2023-02-06, 1 file, -2/+1)

    We only break the list walk on a BLK_STS_*RESOURCE error, and we also
    count errors for BLK_STS_*RESOURCE. So if the list is not empty,
    errors is always non-zero, and the list_empty check is redundant.
    This also removes the redundant list_empty check for the case where
    the error happened while sending the last request in the list.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: Fix potential io hang for shared sbitmap per tagset (Kemeng Shi, 2023-02-06, 1 file, -2/+4)

    Commit f906a6a0f4268 ("blk-mq: improve tag waiting setup for
    non-shared tags") marks restart only for unshared tags as an
    improvement. At that time, tags were only shared between queues, and
    sharing could be detected by testing BLK_MQ_F_TAG_SHARED. Later,
    commit 32bc15afed04b ("blk-mq: Facilitate a shared sbitmap per
    tagset") enabled tag sharing between hctxs inside a queue. We only
    mark restart for shared hctxs inside a queue, which can cause an I/O
    hang if no tag is currently allocated by the hctxs about to be marked
    for restart. Fix this by waiting on the sbitmap_queue instead of
    marking restart for the shared hctxs case.

    Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset")
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: wait on correct sbitmap_queue in blk_mq_mark_tag_wait (Kemeng Shi, 2023-02-06, 1 file, -1/+5)

    In the shared queues case, we only wait on bitmap_tags if we fail to
    get a driver tag. However, the rq could have come from
    breserved_tags, in which case two problems occur:

    1. an I/O hang if no tag is currently allocated from bitmap_tags;
    2. an unnecessary wakeup when a tag is freed to bitmap_tags while no
       tag is freed to breserved_tags.

    Fix this by waiting on the bitmap that the rq was allocated from.

    Fixes: f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared tags")
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
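    The idea, sketched (treat the details, including the
    blk_mq_tag_is_reserved() helper, as illustrative of the description
    above rather than the exact diff):

        struct sbitmap_queue *sbq;

        if (blk_mq_tag_is_reserved(rq->mq_hctx->tags, rq->internal_tag))
                sbq = &hctx->tags->breserved_tags;
        else
                sbq = &hctx->tags->bitmap_tags;
        /* wait on the bitmap the tag actually came from */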
| * blk-mq: avoid sleep in blk_mq_alloc_request_hctx (Kemeng Shi, 2023-02-06, 1 file, -1/+2)

    Commit 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx") added
    blk_mq_alloc_request_hctx to send commands to a specific queue. If
    BLK_MQ_REQ_NOWAIT is not set during tag allocation, we may migrate to
    a different hctx after sleeping and get a tag from an unexpected
    hctx, so BLK_MQ_REQ_NOWAIT must be set in the flags for
    blk_mq_alloc_request_hctx.

    After commit 600c3b0cea784 ("blk-mq: open code __blk_mq_alloc_request
    in blk_mq_alloc_request_hctx"), blk_mq_alloc_request_hctx returns
    -EINVAL only if both BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED are
    unset, instead of whenever BLK_MQ_REQ_NOWAIT is unset. So with
    BLK_MQ_REQ_NOWAIT unset but BLK_MQ_REQ_RESERVED set,
    blk_mq_alloc_request_hctx could still allocate a tag from an
    unexpected hctx. What is needed here is to return -EINVAL if either
    BLK_MQ_REQ_NOWAIT or BLK_MQ_REQ_RESERVED is not set.

    Currently both BLK_MQ_REQ_NOWAIT and BLK_MQ_REQ_RESERVED are set
    whenever a specific hctx is needed, in nvme_auth_submit,
    nvmf_connect_io_queue and nvmf_connect_admin_queue. This fixes the
    potential missed-BLK_MQ_REQ_NOWAIT case for the future.

    Fixes: 600c3b0cea78 ("blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx")
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
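    In other words, the check becomes "both flags must be set", roughly:

        /* a fixed-hctx allocation must neither sleep nor be unreserved */
        if ((flags & (BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED)) !=
            (BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED))
                return ERR_PTR(-EINVAL);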
* | block: Fix the blk_mq_destroy_queue() documentation (Bart Van Assche, 2023-01-31, 1 file, -2/+3)

    Commit 2b3f056f72e5 moved a blk_put_queue() call from
    blk_mq_destroy_queue() into its callers. Reflect this change in the
    documentation block above blk_mq_destroy_queue().

    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Sagi Grimberg <sagi@grimberg.me>
    Cc: Chaitanya Kulkarni <kch@nvidia.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Fixes: 2b3f056f72e5 ("blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue")
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20230130211233.831613-1-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: fix hctx checks for batch allocation (Pavel Begunkov, 2023-01-17, 1 file, -1/+5)

    When there are no read queues, read requests will be assigned a
    default queue on allocation. However, blk_mq_get_cached_request() is
    not prepared for that and will fail all attempts to grab read
    requests from the cache. Worst case it doubles the number of requests
    allocated, roughly half of which will be returned by
    blk_mq_free_plug_rqs(). It only affects batched allocations and so is
    io_uring specific. For reference, the QD8 t/io_uring benchmark
    improves by 20-35%.

    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/80d4511011d7d4751b4cf6375c4e38f237d935e3.1673955390.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: handle bio_split_to_limits() NULL return (Jens Axboe, 2023-01-04, 1 file, -1/+4)

    This can't happen right now, but in preparation for allowing
    bio_split_to_limits() to return NULL if it ended the bio, check for
    it in all the callers.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
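    The caller pattern this prepares for, sketched:

        bio = bio_split_to_limits(bio);
        if (!bio)
                return;         /* the split path already ended the bio */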
* Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux (Linus Torvalds, 2022-12-13, 1 file, -81/+148)

    Pull block updates from Jens Axboe:

     - NVMe pull requests via Christoph:
         - Support some passthrough commands without CAP_SYS_ADMIN
           (Kanchan Joshi)
         - Refactor PCIe probing and reset (Christoph Hellwig)
         - Various fabrics authentication fixes and improvements (Sagi
           Grimberg)
         - Avoid fallback to sequential scan due to transient issues
           (Uday Shankar)
         - Implement support for the DEAC bit in Write Zeroes (Christoph
           Hellwig)
         - Allow overriding the IEEE OUI and firmware revision in
           configfs for nvmet (Aleksandr Miloserdov)
         - Force reconnect when number of queue changes in nvmet (Daniel
           Wagner)
         - Minor fixes and improvements (Uros Bizjak, Joel Granados,
           Sagi Grimberg, Christoph Hellwig, Christophe JAILLET)
         - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
         - Use the common tagset helpers in nvme-pci driver (Christoph
           Hellwig)
         - Cleanup the nvme-pci removal path (Christoph Hellwig)
         - Use kstrtobool() instead of strtobool (Christophe JAILLET)
         - Allow unprivileged passthrough of Identify Controller (Joel
           Granados)
         - Support io stats on the mpath device (Sagi Grimberg)
         - Minor nvmet cleanup (Sagi Grimberg)

     - MD pull requests via Song:
         - Code cleanups (Christoph)
         - Various fixes

     - Floppy pull request from Denis:
         - Fix a memory leak in the init error path (Yuan)

     - Series fixing some batch wakeup issues with sbitmap (Gabriel)

     - Removal of the pktcdvd driver that was deprecated more than 5
       years ago, and subsequent removal of the devnode callback in
       struct block_device_operations as no users are now left (Greg)

     - Fix for partition read on an exclusively opened bdev (Jan)

     - Series of elevator API cleanups (Jinlong, Christoph)

     - Series of fixes and cleanups for blk-iocost (Kemeng)

     - Series of fixes and cleanups for blk-throttle (Kemeng)

     - Series adding concurrent support for sync queues in BFQ (Yu)

     - Series bringing drbd a bit closer to the out-of-tree maintained
       version (Christian, Joel, Lars, Philipp)

     - Misc drbd fixes (Wang)

     - blk-wbt fixes and tweaks for enable/disable (Yu)

     - Fixes for mq-deadline for zoned devices (Damien)

     - Add support for read-only and offline zones for null_blk
       (Shin'ichiro)

     - Series fixing the delayed holder tracking, as used by DM (Yu,
       Christoph)

     - Series enabling bio alloc caching for IRQ based IO (Pavel)

     - Series enabling userspace peer-to-peer DMA (Logan)

     - BFQ waker fixes (Khazhismel)

     - Series fixing elevator refcount issues (Christoph, Jinlong)

     - Series cleaning up references around queue destruction (Christoph)

     - Series doing quiesce by tagset, enabling cleanups in drivers
       (Christoph, Chao)

     - Series untangling the queue kobject and queue references
       (Christoph)

     - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
       Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)

    * tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
      blktrace: Fix output non-blktrace event when blk_classic option enabled
      block: sed-opal: Don't include <linux/kernel.h>
      sed-opal: allow using IOC_OPAL_SAVE for locking too
      blk-cgroup: Fix typo in comment
      block: remove bio_set_op_attrs
      nvmet: don't open-code NVME_NS_ATTR_RO enumeration
      nvme-pci: use the tagset alloc/free helpers
      nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
      nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
      nvme: consolidate setting the tagset flags
      nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
      block: bio_copy_data_iter
      nvme-pci: split out a nvme_pci_ctrl_is_dead helper
      nvme-pci: return early on ctrl state mismatch in nvme_reset_work
      nvme-pci: rename nvme_disable_io_queues
      nvme-pci: cleanup nvme_suspend_queue
      nvme-pci: remove nvme_pci_disable
      nvme-pci: remove nvme_disable_admin_queue
      nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
      nvme: use nvme_wait_ready in nvme_shutdown_ctrl
      ...
| * block: fix crash in 'blk_mq_elv_switch_none' (Ye Bin, 2022-11-24, 1 file, -1/+1)

    Syzbot found the following issue:

    general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
    CPU: 0 PID: 5234 Comm: syz-executor931 Not tainted 6.1.0-rc3-next-20221102-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
    RIP: 0010:__elevator_get block/elevator.h:94 [inline]
    RIP: 0010:blk_mq_elv_switch_none block/blk-mq.c:4593 [inline]
    RIP: 0010:__blk_mq_update_nr_hw_queues block/blk-mq.c:4658 [inline]
    RIP: 0010:blk_mq_update_nr_hw_queues+0x304/0xe40 block/blk-mq.c:4709
    RSP: 0018:ffffc90003cdfc08 EFLAGS: 00010206
    RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
    RDX: 000000000000001d RSI: 0000000000000002 RDI: 00000000000000e8
    RBP: ffff88801dbd0000 R08: ffff888027c89398 R09: ffffffff8de2e517
    R10: fffffbfff1bc5ca2 R11: 0000000000000000 R12: ffffc90003cdfc70
    R13: ffff88801dbd0008 R14: ffff88801dbd03f8 R15: ffff888027c89380
    FS:  0000555557259300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000005d84c8 CR3: 000000007a7cb000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     nbd_start_device+0x153/0xc30 drivers/block/nbd.c:1355
     nbd_start_device_ioctl drivers/block/nbd.c:1405 [inline]
     __nbd_ioctl drivers/block/nbd.c:1481 [inline]
     nbd_ioctl+0x5a1/0xbd0 drivers/block/nbd.c:1521
     blkdev_ioctl+0x36e/0x800 block/ioctl.c:614
     vfs_ioctl fs/ioctl.c:51 [inline]
     __do_sys_ioctl fs/ioctl.c:870 [inline]
     __se_sys_ioctl fs/ioctl.c:856 [inline]
     __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Commit dd6f7f17bf58 moved '__elevator_get(qe->type)' before
    'qe->type' is set, which leads to a wild pointer access. To solve
    this, get 'qe->type' after it is set.

    Reported-by: syzbot+746a4eece09f86bc39d7@syzkaller.appspotmail.com
    Fixes: dd6f7f17bf58 ("block: add proper helpers for elevator_type module refcount management")
    Signed-off-by: Ye Bin <yebin10@huawei.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20221107033956.3276891-1-yebin@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: fix missing nr_hw_queues update in blk_mq_realloc_tag_set_tags (Shin'ichiro Kawasaki, 2022-11-22, 1 file, -2/+2)

    Commit ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags")
    cleaned up the function blk_mq_realloc_tag_set_tags. After this
    change, the function no longer updates nr_hw_queues of struct
    blk_mq_tag_set when the new nr_hw_queues value is smaller than the
    original one. This results in failures when changing the queue count
    of block devices. To avoid the failure, add the missing nr_hw_queues
    update.

    Fixes: ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags")
    Reported-by: Chaitanya Kulkarni <chaitanyak@nvidia.com>
    Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Link: https://lore.kernel.org/linux-block/20221118140640.featvt3fxktfquwh@shindev/
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20221122084917.2034220-1-shinichiro.kawasaki@wdc.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
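    The substance of the fix, sketched (label name and surrounding
    structure are illustrative, not the verbatim diff):

        /* shrinking: nothing to reallocate, but the count must change */
        if (set->nr_hw_queues >= new_nr_hw_queues)
                goto done;
        /* ... grow the tags array ... */
        done:
        set->nr_hw_queues = new_nr_hw_queues;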
| * blk-mq: simplify blk_mq_realloc_tag_set_tags (Christoph Hellwig, 2022-11-10, 1 file, -6/+4)

    Use set->nr_hw_queues for the current number of tags, and remove the
    duplicate set->nr_hw_queues update in the caller.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20221109100811.2413423-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: remove blk_mq_alloc_tag_set_tags (Christoph Hellwig, 2022-11-10, 1 file, -9/+5)

    There is no point in trying to share any code with the realloc case
    when all that is needed by the initial tagset allocation is a simple
    kcalloc_node.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20221109100811.2413423-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: use if-else instead of goto in blk_mq_alloc_cached_request() (Jinlong Chen, 2022-11-02, 1 file, -13/+14)

    if-else is more readable than goto here.

    Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
    Link: https://lore.kernel.org/r/d3306fa4e92dc9cc614edc8f1802686096bafef2.1667356813.git.nickyc975@zju.edu.cn
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: improve error handling in blk_mq_alloc_rq_map() (Jinlong Chen, 2022-11-02, 1 file, -9/+10)

    Use goto-style error handling like we do elsewhere in the kernel.

    Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
    Link: https://lore.kernel.org/r/bbbc2d9b17b137798c7fb92042141ca4cbbc58cc.1667356813.git.nickyc975@zju.edu.cn
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: add tagset quiesce interface (Chao Leng, 2022-11-02, 1 file, -0/+27)

    Drivers that have shared tagsets may need to quiesce a potentially
    large number of request queues that all share a single tagset (e.g.
    nvme). Add an interface to quiesce all the queues on a given tagset.
    This interface is useful because it can speed up the quiesce by doing
    it in parallel.

    Because some queues should not be quiesced (e.g. the nvme connect_q)
    when quiescing the tagset, introduce a QUEUE_FLAG_SKIP_TAGSET_QUIESCE
    flag to allow this new interface to skip quiescing a particular
    queue.

    Signed-off-by: Chao Leng <lengchao@huawei.com>
    [hch: simplify for the per-tag_set srcu_struct]
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Chao Leng <lengchao@huawei.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20221101150050.3510-14-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
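    A sketch of what such an interface looks like, modeled on the commit
    message above (the blk_queue_skip_tagset_quiesce() spelling of the
    flag test is an assumption, not necessarily the committed helper):

        void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
        {
                struct request_queue *q;

                mutex_lock(&set->tag_list_lock);
                list_for_each_entry(q, &set->tag_list, tag_set_list) {
                        if (!blk_queue_skip_tagset_quiesce(q))
                                blk_mq_quiesce_queue_nowait(q);
                }
                /* one synchronization covers every queue in the set */
                blk_mq_wait_quiesce_done(set);
                mutex_unlock(&set->tag_list_lock);
        }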
| * blk-mq: pass a tagset to blk_mq_wait_quiesce_done (Christoph Hellwig, 2022-11-02, 1 file, -7/+9)

    Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so
    just pass the tagset, and move the non-mq check into the only caller
    that needs it.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Reviewed-by: Chao Leng <lengchao@huawei.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: move the srcu_struct used for quiescing to the tagset (Christoph Hellwig, 2022-11-02, 1 file, -8/+25)

    All I/O submissions have fairly similar latencies, and a tagset-wide
    quiesce is a fairly common operation.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Chao Leng <lengchao@huawei.com>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20221101150050.3510-12-hch@lst.de
    [axboe: fix whitespace]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: skip non-mq queues in blk_mq_quiesce_queue (Christoph Hellwig, 2022-11-02, 1 file, -1/+3)

    For submit_bio based queues there is no (S)RCU critical section
    during I/O submission and thus nothing to wait for in
    blk_mq_wait_quiesce_done, so skip doing any synchronization. No
    non-mq driver should be calling this, but for now we have core
    callers that unconditionally call into it.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20221101150050.3510-11-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: split elevator_switch (Christoph Hellwig, 2022-11-01, 1 file, -1/+1)

    Split an elevator_disable helper from elevator_switch for the case
    where we want to switch to no scheduler at all. This includes
    removing the pointless elevator_switch_mq helper and removing the
    switch-to-no-scheduler logic from blk_mq_init_sched.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20221030100714.876891-8-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: remove redundant call to blk_freeze_queue_start in blk_mq_destroy_queue (Jinlong Chen, 2022-10-31, 1 file, -1/+1)

    The calling relationship in blk_mq_destroy_queue() is as follows:

    blk_mq_destroy_queue()
      ...
      -> blk_queue_start_drain()
         -> blk_freeze_queue_start()    <- called
      ...
      -> blk_freeze_queue()
         -> blk_freeze_queue_start()    <- called again
         -> blk_mq_freeze_queue_wait()
      ...

    So there is a redundant call to blk_freeze_queue_start(). Replace
    blk_freeze_queue() with blk_mq_freeze_queue_wait() to avoid the
    redundant call.

    Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20221030083212.1251255-1-nickyc975@zju.edu.cn
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: move queue_is_mq out of blk_mq_cancel_work_sync (Jinlong Chen, 2022-10-31, 1 file, -7/+5)

    The only caller that needs the queue_is_mq check is del_gendisk, so
    move the check into it.

    Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20221030094730.1275463-1-nickyc975@zju.edu.cn
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: avoid double ->queue_rq() because of early timeout (David Jeffery, 2022-10-31, 1 file, -12/+44)

    David Jeffery found a double ->queue_rq() issue. So far it can be
    triggered in VM use cases because of long vmexit latency, preempt
    latency of the vCPU pthread, or a long page fault in the vCPU
    pthread: a block IO request can time out before being queued to
    hardware, but after blk_mq_start_request() was called during
    ->queue_rq(). The timeout handler may then handle it by requeueing,
    causing a double ->queue_rq() and a kernel panic.

    So far it is the driver's responsibility to cover the race between
    timeout and completion, so in theory this is supposed to be solved in
    the driver, given that the driver has enough knowledge. But it is
    really a common problem: lots of drivers could have a similar issue,
    it would be hard to fix all affected drivers, and it isn't easy for a
    driver to handle the race. So David suggests this patch, which solves
    the issue by draining in-progress ->queue_rq() calls.

    Cc: Stefan Hajnoczi <stefanha@redhat.com>
    Cc: Keith Busch <kbusch@kernel.org>
    Cc: virtualization@lists.linux-foundation.org
    Cc: Bart Van Assche <bvanassche@acm.org>
    Signed-off-by: David Jeffery <djeffery@redhat.com>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue (Christoph Hellwig, 2022-10-25, 1 file, -3/+1)

    The fact that blk_mq_destroy_queue also drops a queue reference leads
    to various places having to grab an extra reference. Move the call to
    blk_put_queue into the callers to allow removing the extra
    references.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de
    [axboe: fix fabrics_q vs admin_q conflict in nvme core.c]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: fix up elevator_type refcounting (Jinlong Chen, 2022-10-23, 1 file, -0/+2)

    The current reference management logic for io scheduler modules
    contains refcnt problems. For example, blk_mq_init_sched may fail
    before or after the call to e->ops.init_sched. If it fails before the
    call, it does nothing to the reference to the io scheduler module.
    But if it fails after the call, it releases the reference by calling
    kobject_put(&eq->kobj). As the callers of blk_mq_init_sched can't
    know exactly where the failure happens, they can't handle the
    reference to the io scheduler module properly: releasing the
    reference on failure results in a double release if blk_mq_init_sched
    has already released it, and not releasing the reference results in a
    ghost reference if blk_mq_init_sched did not release it either. The
    same problem also exists in the io schedulers' init_sched
    implementations.

    We could address the problem by adding releasing statements to the
    error handling procedures of blk_mq_init_sched and the init_sched
    implementations, but that is counterintuitive and requires
    modifications to existing io schedulers. Instead, we make
    elevator_alloc take the io scheduler module references that will be
    released by elevator_release. Then we match each elevator_get with an
    elevator_put. Each reference to an io scheduler module therefore
    explicitly has its own getter and releaser, and we no longer need to
    worry about the refcnt problems.

    The bugs and the patch can be validated with tools here:
    https://github.com/nickyc975/linux_elv_refcnt_bug.git

    [hch: split out a few bits into separate patches, use a non-try
     module_get in elevator_alloc]

    Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add proper helpers for elevator_type module refcount management (Christoph Hellwig, 2022-10-23, 1 file, -9/+2)

    Make sure we have helpers for all relevant module refcount operations
    on the elevator_type in elevator.h, and use them. Move the call to
    the get helper in blk_mq_elv_switch_none a bit so that it is obvious
    with a less verbose comment.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20221020064819.1469928-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: fix queue reference leak on blk_mq_alloc_disk_for_queue failure (Christoph Hellwig, 2022-11-22, 1 file, -1/+6)

    Drop the request queue reference just acquired when __alloc_disk_node
    failed.

    Fixes: 6f8191fdf41d ("block: simplify disk shutdown")
    Reported-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
    Link: https://lore.kernel.org/r/20221122072753.426077-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: blk_add_rq_to_plug(): clear stale 'last' after flush (Al Viro, 2022-10-31, 1 file, -0/+1)

    blk_mq_flush_plug_list() empties ->mq_list, and the request we'd
    peeked at there before that call is gone; in any case, we are not
    dealing with a mix of requests for different queues now, since there
    are no requests left in the plug.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
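    A sketch of the one-line fix in blk_add_rq_to_plug() (surrounding
    condition abbreviated; not the verbatim context):

        if (plug->rq_count >= blk_plug_max_rq_count(plug) || /* ... */) {
                blk_mq_flush_plug_list(plug, false);
                last = NULL;    /* the peeked request is gone after the flush */
                trace_block_plug(rq->q);
        }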
* | blk-mq: Fix kmemleak in blk_mq_init_allocated_queue (Chen Jun, 2022-10-31, 1 file, -3/+1)

    There is a kmemleak caused by modprobe null_blk.ko:

    unreferenced object 0xffff8881acb1f000 (size 1024):
      comm "modprobe", pid 836, jiffies 4294971190 (age 27.068s)
      hex dump (first 32 bytes):
        00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
        ff ff ff ff ff ff ff ff 00 53 99 9e ff ff ff ff  .........S......
      backtrace:
        [<000000004a10c249>] kmalloc_node_trace+0x22/0x60
        [<00000000648f7950>] blk_mq_alloc_and_init_hctx+0x289/0x350
        [<00000000af06de0e>] blk_mq_realloc_hw_ctxs+0x2fe/0x3d0
        [<00000000e00c1872>] blk_mq_init_allocated_queue+0x48c/0x1440
        [<00000000d16b4e68>] __blk_mq_alloc_disk+0xc8/0x1c0
        [<00000000d10c98c3>] 0xffffffffc450d69d
        [<00000000b9299f48>] 0xffffffffc4538392
        [<0000000061c39ed6>] do_one_initcall+0xd0/0x4f0
        [<00000000b389383b>] do_init_module+0x1a4/0x680
        [<0000000087cf3542>] load_module+0x6249/0x7110
        [<00000000beba61b8>] __do_sys_finit_module+0x140/0x200
        [<00000000fdcfff51>] do_syscall_64+0x35/0x80
        [<000000003c0f1f71>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

    That is because q->mq_ops is set to NULL before blk_release_queue is
    called:

    blk_mq_init_queue_data
      blk_mq_init_allocated_queue
        blk_mq_realloc_hw_ctxs
          for (i = 0; i < set->nr_hw_queues; i++) {
            old_hctx = xa_load(&q->hctx_table, i);
            if (!blk_mq_alloc_and_init_hctx(.., i, ..))     [1]
              if (!old_hctx)
                break;
          xa_for_each_start(&q->hctx_table, j, hctx, j)
            blk_mq_exit_hctx(q, set, hctx, j);              [2]
        if (!q->nr_hw_queues)                               [3]
          goto err_hctxs;
      err_exit:
        q->mq_ops = NULL;                                   [4]
        blk_put_queue
          blk_release_queue
            if (queue_is_mq(q))                             [5]
              blk_mq_release(q);

    [1]: blk_mq_alloc_and_init_hctx failed at i != 0.
    [2]: The hctxs allocated by [1] are moved to q->unused_hctx_list and
         will be cleaned up in blk_mq_release.
    [3]: q->nr_hw_queues is 0.
    [4]: Set q->mq_ops to NULL.
    [5]: queue_is_mq returns false due to [4], so blk_mq_release will not
         be called, and the hctxs in q->unused_hctx_list are leaked.

    To fix it, call blk_release_queue in the exception path.

    Fixes: 2f8f1336a48b ("blk-mq: always free hctx after request queue is freed")
    Signed-off-by: Yuan Can <yuancan@huawei.com>
    Signed-off-by: Chen Jun <chenjun102@huawei.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20221031031242.94107-1-chenjun102@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: Properly init requests from blk_mq_alloc_request_hctx() (John Garry, 2022-10-28, 1 file, -1/+6)

    Function blk_mq_alloc_request_hctx() is missing zeroing/init of
    rq->bio, biotail, __sector, and __data_len members, which
    blk_mq_alloc_request() has, so duplicate what we do in
    blk_mq_alloc_request().

    Fixes: 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx")
    Signed-off-by: John Garry <john.garry@huawei.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/1666780513-121650-1-git-send-email-john.garry@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping() (Yu Kuai, 2022-10-16, 1 file, -2/+5)

    Our syzkaller reported a null pointer dereference; the root cause is
    the following:

    __blk_mq_alloc_map_and_rqs
      set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs
        blk_mq_alloc_map_and_rqs
          blk_mq_alloc_rqs
            // failed due to oom
            alloc_pages_node
          // set->tags[hctx_idx] is still NULL
          blk_mq_free_rqs
            drv_tags = set->tags[hctx_idx];
            // null pointer dereference is triggered
            blk_mq_clear_rq_mapping(drv_tags, ...)

    This is because commit 63064be150e4 ("blk-mq: Add
    blk_mq_alloc_map_and_rqs()") merged the two steps:

    1) set->tags[hctx_idx] = blk_mq_alloc_rq_map()
    2) blk_mq_alloc_rqs(..., set->tags[hctx_idx])

    into one step:

    set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs()

    Since tags is not initialized yet in this case, fix the problem by
    checking whether tags is a NULL pointer in blk_mq_clear_rq_mapping().

    Fixes: 63064be150e4 ("blk-mq: Add blk_mq_alloc_map_and_rqs()")
    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Reviewed-by: John Garry <john.garry@huawei.com>
    Link: https://lore.kernel.org/r/20221011142253.4015966-1-yukuai1@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
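    The guard, sketched (the early return is the substance of the fix;
    the signature is reproduced from the call chain above):

        static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
                                            struct blk_mq_tags *tags)
        {
                /* driver tags were never set up for this hctx */
                if (!drv_tags)
                        return;
                /* ... clear the rq mapping as before ... */
        }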
* block: allow end_io based requests in the completion batch handling (Jens Axboe, 2022-09-30, 1 file, -2/+11)

    With end_io handlers now being able to potentially pass ownership of
    the request upon completion, we can allow requests with end_io
    handlers in the batch completion handling.

    Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Co-developed-by: Stefan Roesch <shr@fb.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: change request end_io handler to pass back a return value (Jens Axboe, 2022-09-30, 1 file, -5/+9)

    Everything is just converted to returning RQ_END_IO_NONE, and there
    should be no functional changes with this patch. In preparation for
    allowing the end_io handler to pass ownership back to the block
    layer, rather than retain ownership of the request.

    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
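    For reference, the return-value convention this introduces, sketched
    from the description (the comments are interpretation, not kernel-doc):

        enum rq_end_io_ret {
                RQ_END_IO_NONE,   /* handler keeps ownership of the request */
                RQ_END_IO_FREE,   /* block layer should free the request */
        };

        typedef enum rq_end_io_ret (rq_end_io_fn)(struct request *, blk_status_t);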
* block: enable batched allocation for blk_mq_alloc_request() (Jens Axboe, 2022-09-30, 1 file, -9/+71)

    The filesystem IO path can take advantage of allocating batches of
    requests, if the underlying submitter tells the block layer about it
    through the blk_plug. For passthrough IO, the exported API is the
    blk_mq_alloc_request() helper, and that one does not allow for
    request caching. Wire up request caching for blk_mq_alloc_request(),
    which is generally done without having a bio available upfront.

    Tested-by: Anuj Gupta <anuj20.g@samsung.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'for-6.1/io_uring' into for-6.1/passthrough (Jens Axboe, 2022-09-30, 1 file, -1/+2)

    * for-6.1/io_uring: (56 commits)
      io_uring/net: fix notif cqe reordering
      io_uring/net: don't update msg_name if not provided
      io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
      io_uring/rw: defer fsnotify calls to task context
      io_uring/net: fix fast_iov assignment in io_setup_async_msg()
      io_uring/net: fix non-zc send with address
      io_uring/net: don't skip notifs for failed requests
      io_uring/rw: don't lose short results on io_setup_async_rw()
      io_uring/rw: fix unexpected link breakage
      io_uring/net: fix cleanup double free free_iov init
      io_uring: fix CQE reordering
      io_uring/net: fix UAF in io_sendrecv_fail()
      selftest/net: adjust io_uring sendzc notif handling
      io_uring: ensure local task_work marks task as running
      io_uring/net: zerocopy sendmsg
      io_uring/net: combine fail handlers
      io_uring/net: rename io_sendzc()
      io_uring/net: support non-zerocopy sendto
      io_uring/net: refactor io_setup_async_addr
      io_uring/net: don't lose partial send_zc on fail
      ...
| * block: export blk_rq_is_poll (Kanchan Joshi, 2022-09-21, 1 file, -1/+2)

    This is in preparation to support iopoll for nvme passthrough.

    Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
    Link: https://lore.kernel.org/r/20220823161443.49436-4-joshi.k@samsung.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge branch 'for-6.1/block' into for-6.1/passthrough (Jens Axboe, 2022-09-30, 1 file, -13/+19)

    * for-6.1/block: (162 commits)
      sbitmap: fix lockup while swapping
      block: add rationale for not using blk_mq_plug() when applicable
      block: adapt blk_mq_plug() to not plug for writes that require a zone lock
      s390/dasd: use blk_mq_alloc_disk
      blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
      nvmet: don't look at the request_queue in nvmet_bdev_set_limits
      nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
      blk-mq: use quiesced elevator switch when reinitializing queues
      block: replace blk_queue_nowait with bdev_nowait
      nvme: remove nvme_ctrl_init_connect_q
      nvme-loop: use the tagset alloc/free helpers
      nvme-loop: store the generic nvme_ctrl in set->driver_data
      nvme-loop: initialize sqsize later
      nvme-fc: use the tagset alloc/free helpers
      nvme-fc: store the generic nvme_ctrl in set->driver_data
      nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
      nvme-rdma: use the tagset alloc/free helpers
      nvme-rdma: store the generic nvme_ctrl in set->driver_data
      nvme-tcp: use the tagset alloc/free helpers
      nvme-tcp: store the generic nvme_ctrl in set->driver_data
      ...

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add rationale for not using blk_mq_plug() when applicable (Pankaj Raghav, 2022-09-29, 1 file, -0/+6)

    There are two places in the block layer at the moment where the
    blk_mq_plug() helper could be used instead of directly accessing the
    plug from struct current. In both these cases, directly accessing the
    plug should not have any consequences for zoned devices. Make the
    intent explicit by adding comments instead of introducing unwanted
    checks with the blk_mq_plug() helper. [1]

    [1] https://lore.kernel.org/linux-block/f6e54907-1035-2b2c-6387-ed178be05ccb@kernel.dk/

    Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
    Suggested-by: Jens Axboe <axboe@kernel.dk>
    Link: https://lore.kernel.org/r/20220929144141.140077-1-p.raghav@samsung.com
    [axboe: fixup multi-line comment style]
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: use quiesced elevator switch when reinitializing queues (Keith Busch, 2022-09-27, 1 file, -3/+3)

    The hctx's run_work may be racing with the elevator switch when
    reinitializing hardware queues. The queue is merely frozen in this
    context, but that only prevents requests from allocating and doesn't
    stop the hctx work from running. The work may get an elevator pointer
    that's being torn down, and can result in use-after-free errors and
    kernel panics (example below). Use the quiesced elevator switch
    instead, and make the previous one static since it is now only used
    locally.

    nvme nvme0: resetting controller
    nvme nvme0: 32/0/0 default/read/poll queues
    BUG: kernel NULL pointer dereference, address: 0000000000000008
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0
    Oops: 0000 [#1] SMP PTI
    Workqueue: kblockd blk_mq_run_work_fn
    RIP: 0010:kyber_has_work+0x29/0x70
    ...
    Call Trace:
     __blk_mq_do_dispatch_sched+0x83/0x2b0
     __blk_mq_sched_dispatch_requests+0x12e/0x170
     blk_mq_sched_dispatch_requests+0x30/0x60
     __blk_mq_run_hw_queue+0x2b/0x50
     process_one_work+0x1ef/0x380
     worker_thread+0x2d/0x3e0

    Signed-off-by: Keith Busch <kbusch@kernel.org>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20220927155652.3260724-1-kbusch@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: don't redirect completion for hctx with only one ctx mapping (Liu Song, 2022-09-24, 1 file, -3/+5)

    High-performance NVMe devices usually support a large number of hw
    queues, which ensures a 1:1 mapping of hctx and ctx. In this case
    there will be no remote requests, so we don't need to care about
    them.

    Signed-off-by: Liu Song <liusong@linux.alibaba.com>
    Link: https://lore.kernel.org/r/1663731123-81536-1-git-send-email-liusong@linux.alibaba.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: remove unneeded needs_restart check (Miaohe Lin, 2022-09-05, 1 file, -1/+1)

    If code reaches here, needs_restart must be true. Remove this
    unneeded needs_restart check. No functional change intended.

    Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
    Link: https://lore.kernel.org/r/20220905101950.4606-1-linmiaohe@huawei.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>