summaryrefslogtreecommitdiffstats
path: root/block/blk-mq-tag.c
Commit message (Collapse)AuthorAgeFilesLines
* block: remove useless BUG_ON() in blk_mq_put_tag()Damien Le Moal2022-06-021-1/+0
| | | | | | | | | | Since the if condition in blk_mq_put_tag() checks that the tag to put is not a reserved one, the BUG_ON() check in the else branch checking if the tag is indeed a reserved one is useless. Remove it. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220602075159.1273366-1-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: manage hctx map via xarrayMing Lei2022-03-081-1/+1
| | | | | | | | | | | | | | | | | | First code becomes more clean by switching to xarray from plain array. Second use-after-free on q->queue_hw_ctx can be fixed because queue_for_each_hw_ctx() may be run when updating nr_hw_queues is in-progress. With this patch, q->hctx_table is defined as xarray, and this structure will share same lifetime with request queue, so queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably. Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com [axboe: fix blk_mq_hw_ctx forward declaration] Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: prepare for implementing hctx table via xarrayMing Lei2022-03-081-1/+1
| | | | | | | | | | | | | | | | It is inevitable to cause use-after-free on q->queue_hw_ctx between queue_for_each_hw_ctx() and blk_mq_update_nr_hw_queues(). And converting to xarray can fix the uaf, meantime code gets cleaner. Prepare for converting q->queue_hctx_ctx into xarray, one thing is that xa_for_each() can only accept 'unsigned long' as index, so changes type of hctx index of queue_for_each_hw_ctx() into 'unsigned long'. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20220308073219.91173-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* sbitmap: Delete old sbitmap_queue_get_shallow()John Garry2022-02-081-1/+1
| | | | | | | | | | | | | Since __sbitmap_queue_get_shallow() was introduced in commit c05e66733788 ("sbitmap: add sbitmap_get_shallow() operation"), it has not been used. Delete __sbitmap_queue_get_shallow() and rename public __sbitmap_queue_get_shallow() -> sbitmap_queue_get_shallow() as it is odd to have public __foo but no foo at all. Signed-off-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1644322024-105340-1-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: fix tag_get wait task can't be awakenedLaibin Qiu2022-01-131-7/+33
| | | | | | | | | | | | | | | | | | | | | | | In case of shared tags, there might be more than one hctx which allocates from the same tags, and each hctx is limited to allocate at most: hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U); tag idle detection is lazy, and may be delayed for 30sec, so there could be just one real active hctx(queue) but all others are actually idle and still accounted as active because of the lazy idle detection. Then if wake_batch is > hctx_max_depth, driver tag allocation may wait forever on this real active hctx. Fix this by recalculating wake_batch when inc or dec active_queues. Fixes: 0d2602ca30e41 ("blk-mq: improve support for shared tags maps") Suggested-by: Ming Lei <ming.lei@redhat.com> Suggested-by: John Garry <john.garry@huawei.com> Signed-off-by: Laibin Qiu <qiulaibin@huawei.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20220113025536.1479653-1-qiulaibin@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tagsJohn Garry2021-12-061-18/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kashyap reports high CPU usage in blk_mq_queue_tag_busy_iter() and callees using megaraid SAS RAID card since moving to shared tags [0]. Previously, when shared tags was shared sbitmap, this function was less than optimum since we would iter through all tags for all hctx's, yet only ever match upto tagset depth number of rqs. Since the change to shared tags, things are even less efficient if we have parallel callers of blk_mq_queue_tag_busy_iter(). This is because in bt_iter() -> blk_mq_find_and_get_req() there would be more contention on accessing each request ref and tags->lock since they are now shared among all HW queues. Optimise by having separate calls to bt_for_each() for when we're using shared tags. In this case no longer pass a hctx, as it is no longer relevant, and teach bt_iter() about this. Ming suggested something along the lines of this change, apart from a different implementation. [0] https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/ Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reported-and-tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support") Link: https://lore.kernel.org/r/1638794990-137490-4-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Delete busy_iter_fnJohn Garry2021-12-061-3/+3
| | | | | | | | | | | | | | | | | | | Typedefs busy_iter_fn and busy_tag_iter_fn are now identical, so delete busy_iter_fn to reduce duplication. It would be nicer to delete busy_tag_iter_fn, as the name busy_iter_fn is less specific. However busy_tag_iter_fn is used in many different parts of the tree, unlike busy_iter_fn which is just use in block/, so just take the straightforward path now, so that we could rename later treewide. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Link: https://lore.kernel.org/r/1638794990-137490-3-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argumentJohn Garry2021-12-061-1/+1
| | | | | | | | | | | | | | | | | | | | The only user of blk_mq_hw_ctx blk_mq_hw_ctx argument is blk_mq_rq_inflight(). Function blk_mq_rq_inflight() uses the hctx to find the associated request queue to match against the request. However this same check is already done in caller bt_iter(), so drop this check. With that change there are no more users of busy_iter_fn blk_mq_hw_ctx argument, so drop the argument. Reviewed-by Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Link: https://lore.kernel.org/r/1638794990-137490-2-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: switch to atomic_t for request referencesJens Axboe2021-12-031-1/+1
| | | | | | | | | | | | refcount_t is not as expensive as it used to be, but it's still more expensive than the io_uring method of using atomic_t and just checking for potential over/underflow. This borrows that same implementation, which in turn is based on the mm implementation from Linus. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Fix blk_mq_tagset_busy_iter() for shared tagsJohn Garry2021-10-211-2/+5
| | | | | | | | | | | | | | Since it is now possible for a tagset to share a single set of tags, the iter function should not re-iter the tags for the count of #hw queues in that case. Rather it should just iter once. Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support") Reported-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Link: https://lore.kernel.org/r/1634550083-202815-1-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: add support for blk_mq_end_request_batch()Jens Axboe2021-10-181-0/+6
| | | | | | | | | Instead of calling blk_mq_end_request() on a single request, add a helper that takes the new struct io_comp_batch and completes any request stored in there. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: improve batched tag allocationJens Axboe2021-10-181-0/+15
| | | | | | | | | | | | | Add a blk_mq_get_tags() helper, which uses the new sbitmap API for allocating a batch of tags all at once. This both simplifies the block code for batched allocation, and it is also more efficient than just doing repeated calls into __sbitmap_queue_get(). This reduces the sbitmap overhead in peak runs from ~3% to ~1% and yields a performanc increase from 6.6M IOPS to 6.8M IOPS for a single CPU core. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Change shared sbitmap naming to shared tagsJohn Garry2021-10-181-9/+9
| | | | | | | | | Now that shared sbitmap support really means shared tags, rename symbols to match that. Signed-off-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Stop using pointers for blk_mq_tags bitmap tagsJohn Garry2021-10-181-36/+20
| | | | | | | | | | | | | | | | | Now that we use shared tags for shared sbitmap support, we don't require the tags sbitmap pointers, so drop them. This essentially reverts commit 222a5ae03cdd ("blk-mq: Use pointers for blk_mq_tags bitmap tags"). Function blk_mq_init_bitmap_tags() is removed also, since it would be only a wrappper for blk_mq_init_bitmaps(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Use shared tags for shared sbitmap supportJohn Garry2021-10-181-44/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we use separate sbitmap pairs and active_queues atomic_t for shared sbitmap support. However a full sets of static requests are used per HW queue, which is quite wasteful, considering that the total number of requests usable at any given time across all HW queues is limited by the shared sbitmap depth. As such, it is considerably more memory efficient in the case of shared sbitmap to allocate a set of static rqs per tag set or request queue, and not per HW queue. So replace the sbitmap pairs and active_queues atomic_t with a shared tags per tagset and request queue, which will hold a set of shared static rqs. Since there is now no valid HW queue index to be passed to the blk_mq_ops .init and .exit_request callbacks, pass an invalid index token. This changes the semantics of the APIs, such that the callback would need to validate the HW queue index before using it. Currently no user of shared sbitmap actually uses the HW queue index (as would be expected). Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()John Garry2021-10-181-2/+1
| | | | | | | | | | | | | | Refactor blk_mq_free_map_and_requests() such that it can be used at many sites at which the tag map and rqs are freed. Also rename to blk_mq_free_map_and_rqs(), which is shorter and matches the alloc equivalent. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/1633429419-228500-12-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Add blk_mq_alloc_map_and_rqs()John Garry2021-10-181-8/+1
| | | | | | | | | | | | | | | | | | | | | Add a function to combine allocating tags and the associated requests, and factor out common patterns to use this new function. Some function only call blk_mq_alloc_map_and_rqs() now, but more functionality will be added later. Also make blk_mq_alloc_rq_map() and blk_mq_alloc_rqs() static since they are only used in blk-mq.c, and finally rename some functions for conciseness and consistency with other function names: - __blk_mq_alloc_map_and_{request -> rqs}() - blk_mq_alloc_{map_and_requests -> set_map_and_rqs}() Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1633429419-228500-11-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Add blk_mq_tag_update_sched_shared_sbitmap()John Garry2021-10-181-0/+6
| | | | | | | | | | | | | | Put the functionality to update the sched shared sbitmap size in a common function. Since the same formula is always used to resize, and it can be got from the request queue argument, so just pass the request queue pointer. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/1633429419-228500-10-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: avoid to iterate over stale requestMing Lei2021-09-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | blk-mq can't run allocating driver tag and updating ->rqs[tag] atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is released. So there is chance to iterating over one stale request just after the tag is allocated and before updating ->rqs[tag]. scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi in-flight requests after scsi host is blocked, so no new scsi command can be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request may have been kept in another slot of ->rqs[], meantime the slot can be allocated out but ->rqs[] isn't updated yet. Then this in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in handling scsi error. Fixes the issue by not iterating over stale request. Cc: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Reported-by: luojiaxing <luojiaxing@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210906065003.439019-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Use request queue-wide tags for tagset-wide sbitmapJohn Garry2021-05-241-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tags used for an IO scheduler are currently per hctx. As such, when q->nr_hw_queues grows, so does the request queue total IO scheduler tag depth. This may cause problems for SCSI MQ HBAs whose total driver depth is fixed. Ming and Yanhui report higher CPU usage and lower throughput in scenarios where the fixed total driver tag depth is appreciably lower than the total scheduler tag depth: https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b In that scenario, since the scheduler tag is got first, much contention is introduced since a driver tag may not be available after we have got the sched tag. Improve this scenario by introducing request queue-wide tags for when a tagset-wide sbitmap is used. The static sched requests are still allocated per hctx, as requests are initialised per hctx, as in blk_mq_init_request(..., hctx_idx, ...) -> set->ops->init_request(.., hctx_idx, ...). For simplicity of resizing the request queue sbitmap when updating the request queue depth, just init at the max possible size, so we don't need to deal with the possibly with swapping out a new sbitmap for old if we need to grow. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Some tag allocation code refactoringJohn Garry2021-05-241-21/+33
| | | | | | | | | | | | The tag allocation code to alloc the sbitmap pairs is common for regular bitmaps tags and shared sbitmap, so refactor into a common function. Also remove superfluous "flags" argument from blk_mq_init_shared_sbitmap(). Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/1620907258-30910-2-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: clear stale request in tags->rq[] before freeing one request poolMing Lei2021-05-241-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | refcount_inc_not_zero() in bt_tags_iter() still may read one freed request. Fix the issue by the following approach: 1) hold a per-tags spinlock when reading ->rqs[tag] and calling refcount_inc_not_zero in bt_tags_iter() 2) clearing stale request referred via ->rqs[tag] before freeing request pool, the per-tags spinlock is held for clearing stale ->rq[tag] So after we cleared stale requests, bt_tags_iter() won't observe freed request any more, also the clearing will wait for pending request reference. The idea of clearing ->rqs[] is borrowed from John Garry's previous patch and one recent David's patch. Tested-by: John Garry <john.garry@huawei.com> Reviewed-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210511152236.763464-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iterMing Lei2021-05-241-11/+33
| | | | | | | | | | | | | | | | | | | | | | | | | Grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter(), and this way will prevent the request from being re-used when ->fn is running. The approach is same as what we do during handling timeout. Fix request use-after-free(UAF) related with completion race or queue releasing: - If one rq is referred before rq->q is frozen, then queue won't be frozen before the request is released during iteration. - If one rq is referred after rq->q is frozen, refcount_inc_not_zero() will return false, and we won't iterate over this request. However, still one request UAF not covered: refcount_inc_not_zero() may read one freed request, and it will be handled in next patch. Tested-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210511152236.763464-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Always use blk_mq_is_sbitmap_sharedNikolay Borisov2021-04-061-2/+2
| | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20210311081713.2763171-1-nborisov@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Sentence reconstruct for better readabilityBhaskar Chowdhury2021-03-251-2/+2
| | | | | | | Sentence reconstruction for better readability. Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block-mq: fix comments in blk_mq_queue_tag_busy_iteryangerkun2020-09-291-3/+1
| | | | | | | | | | | | | 'f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")' introduce a bug what we may sleep between rcu lock. Then '530ca2c9bd69 ("blk-mq: Allow blocking queue tag iter callbacks")' fix it by get request_queue's ref. And 'a9a808084d6a ("block: Remove the synchronize_rcu() call from __blk_mq_update_nr_hw_queues()")' remove the synchronize_rcu in __blk_mq_update_nr_hw_queues. We need update the confused comments in blk_mq_queue_tag_busy_iter. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: always allow reserved allocation in hctx_may_queueMing Lei2020-09-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | NVMe shares tagset between fabric queue and admin queue or between connect_q and NS queue, so hctx_may_queue() can be called to allocate request for these queues. Tags can be reserved in these tagset. Before error recovery, there is often lots of in-flight requests which can't be completed, and new reserved request may be needed in error recovery path. However, hctx_may_queue() can always return false because there is too many in-flight requests which can't be completed during error handling. Finally, nothing can proceed. Fix this issue by always allowing reserved tag allocation in hctx_may_queue(). This is reasonable because reserved tags are supposed to always be available. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: David Milburn <dmilburn@redhat.com> Cc: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Record active_queues_shared_sbitmap per tag_set for when using ↵John Garry2020-09-031-8/+25
| | | | | | | | | | | | | | | | | shared sbitmap For when using a shared sbitmap, no longer should the number of active request queues per hctx be relied on for when judging how to share the tag bitmap. Instead maintain the number of active request queues per tag_set, and make the judgement based on that. Originally-from: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Facilitate a shared sbitmap per tagsetJohn Garry2020-09-031-4/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support multiple reply queues with single hostwide tags. In addition, these drivers want to use interrupt assignment in pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0], CPU hotplug may cause in-flight IO completion to not be serviced when an interrupt is shutdown. That problem is solved in commit bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline"). However, to take advantage of that blk-mq feature, the HBA HW queuess are required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW queues need to be exposed to the upper layer. In making that transition, the per-SCSI command request tags are no longer unique per Scsi host - they are just unique per hctx. As such, the HBA LLDD would have to generate this tag internally, which has a certain performance overhead. However another problem is that blk-mq assumes the host may accept (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy counter was removed, which would stop the LLDD being sent more than .can_queue commands; however, it should still be ensured that the block layer does not issue more than .can_queue commands to the Scsi host. To solve this problem, introduce a shared sbitmap per blk_mq_tag_set, which may be requested at init time. New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the tagset to indicate whether the shared sbitmap should be used. Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests are still allocated per hctx; the reason for this is that if tags and requests were only allocated for a single hctx - like hctx0 - it may break block drivers which expect a request be associated with a specific hctx, i.e. not always hctx0. This will introduce extra memory usage. This change is based on work originally from Ming Lei in [1] and from Bart's suggestion in [2]. [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/ [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/ [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Use pointers for blk_mq_tags bitmap tagsJohn Garry2020-09-031-19/+22
| | | | | | | | | | | | Introduce pointers for the blk_mq_tags regular and reserved bitmap tags, with the goal of later being able to use a common shared tag bitmap across all HW contexts in a set. Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Pass flags for tag init/freeJohn Garry2020-09-031-5/+7
| | | | | | | | | | | | Pass hctx/tagset flags argument down to blk_mq_init_tags() and blk_mq_free_tags() for selective init/free. For now, make it include the alloc policy flag, which can be evaluated when needed (in blk_mq_init_tags()). Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Free tags in blk_mq_init_tags() upon errorHannes Reinecke2020-09-031-8/+10
| | | | | | | | | | | | | Since the tags are allocated in blk_mq_init_tags(), it's better practice to free in that same function upon error, rather than a callee which is to init the bitmap tags (blk_mq_init_tags()). [jpg: Split from an earlier patch with a new commit message] Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: move blk_mq_get_driver_tag into blk-mq.cMing Lei2020-06-301-58/+0
| | | | | | | | | | | | | | | | | blk_mq_get_driver_tag() is only used by blk-mq.c and is supposed to stay in blk-mq.c, so move it and preparing for cleanup code of get/put driver tag. Meantime hctx_may_queue() is moved to header file and it is fine since it is defined as inline always. No functional change. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: remove the BLK_MQ_REQ_INTERNAL flagChristoph Hellwig2020-06-291-2/+2
| | | | | | | Just check for a non-NULL elevator directly to make the code more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Remove redundant 'return' statementBaolin Wang2020-06-151-1/+1
| | | | | | | | | The blk_mq_all_tag_iter() is a void function, thus remove the redundant 'return' statement in this function. Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: fix blk_mq_all_tag_iterMing Lei2020-06-071-3/+9
| | | | | | | | | | | | | | | | | blk_mq_all_tag_iter() is added to iterate all requests, so we should fetch the request from ->static_rqs][] instead of ->rqs[] which is for holding in-flight request only. Fix it by adding flag of BT_TAG_ITER_STATIC_RQS. Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: John Garry <john.garry@huawei.com> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Daniel Wagner <dwagner@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: split out a __blk_mq_get_driver_tag helperChristoph Hellwig2020-06-071-0/+27
| | | | | | | | | | | | | | | | | | | Allocation of the driver tag in the case of using a scheduler shares very little code with the "normal" tag allocation. Split out a new helper to streamline this path, and untangle it from the complex normal tag allocation. This way also avoids to fail driver tag allocation because of inactive hctx during cpu hotplug, and fixes potential hang risk. Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: John Garry <john.garry@huawei.com> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: drain I/O when all CPUs in a hctx are offlineMing Lei2020-05-291-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup up queue mapping. Thomas mentioned the following point[1]: "That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again." However, current blk-mq implementation doesn't quiesce hw queue before the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so there isn't any chance to quiesce the hctx before shutting down the CPU. Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs where the last CPU goes away, and wait for completion of in-flight requests. This guarantees that there is no inflight I/O before shutting down the managed IRQ. Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need to wait for completion of in-flight requests from these drivers to avoid a potential dead-lock. It is safe to do this for stacking drivers as those do not use interrupts at all and their I/O completions are triggered by underlying devices I/O completion. [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/ [hch: different retry mechanism, merged two patches, minor cleanups] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: add blk_mq_all_tag_iterMing Lei2020-05-291-18/+32
| | | | | | | | | | | | | | | | | | | | | Add a new blk_mq_all_tag_iter function to iterate over all allocated scheduler tags and driver tags. This is more flexible than the existing blk_mq_all_tag_busy_iter function as it allows the callers to do whatever they want on allocated request instead of being limited to started requests. It will be used to implement draining allocated requests on specified hctx in this patchset. [hch: switch from the two booleans to a more readable flags field and consolidate the tags iter functions] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Bart van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: use BLK_MQ_NO_TAG in more placesChristoph Hellwig2020-05-291-4/+4
| | | | | | | | | | | Replace various magic -1 constants for tags with BLK_MQ_NO_TAG. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAGChristoph Hellwig2020-05-291-2/+2
| | | | | | | | | | | To prepare for wider use of this constant give it a more applicable name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Remove some unused function argumentsJohn Garry2020-02-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | The struct blk_mq_hw_ctx pointer argument in blk_mq_put_tag(), blk_mq_poll_nsecs(), and blk_mq_poll_hybrid_sleep() is unused, so remove it. Overall obj code size shows a minor reduction, before: text data bss dec hex filename 27306 1312 0 28618 6fca block/blk-mq.o 4303 272 0 4575 11df block/blk-mq-tag.o after: 27282 1312 0 28594 6fb2 block/blk-mq.o 4311 272 0 4583 11e7 block/blk-mq-tag.o Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> -- This minor patch had been carried as part of the blk-mq shared tags RFC, I'd rather not carry it anymore as it required rebasing, so now or never.. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()John Garry2019-11-131-8/+0
| | | | | | | These functions are not referenced, so delete them. Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: introduce blk_mq_tagset_wait_completed_request()Ming Lei2019-08-041-0/+32
| | | | | | | | | | | | | | | | | | | | | | blk-mq may schedule to call queue's complete function on remote CPU via IPI, but doesn't provide any way to synchronize the request's complete fn. The current queue freeze interface can't provide the synchonization because aborted requests stay at blk-mq queues during EH. In some driver's EH(such as NVMe), hardware queue's resource may be freed & re-allocated. If the completed request's complete fn is run finally after the hardware queue's resource is released, kernel crash will be triggered. Prepare for fixing this kind of issue by introducing blk_mq_tagset_wait_completed_request(). Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: remove blk_mq_put_ctx()Bart Van Assche2019-07-021-8/+0
| | | | | | | | | | | | | | | No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends on preemption being disabled for its correctness. Since removing the CPU preemption calls does not measurably affect performance, simplify the blk-mq code by removing the blk_mq_put_ctx() function and also by not disabling preemption in blk_mq_get_ctx(). Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: add SPDX tags to block layer files missing licensing informationChristoph Hellwig2019-04-301-0/+1
| | | | | | | | | Various block layer files do not have any licensing information at all. Add SPDX tags for the default kernel GPLv2 license to those. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: save queue mapping result into ctx directlyJianchao Wang2019-02-011-1/+1
| | | | | | | | | | | | | | | Currently, the queue mapping result is saved in a two-dimensional array. In the hot path, to get a hctx, we need do following: q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]] This isn't very efficient. We could save the queue mapping result into ctx directly with different hctx type, like, ctx->hctxs[type] Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* sbitmap: optimize wakeup checkJens Axboe2018-11-301-6/+5
| | | | | | | | | | | | | | | | | Even if we have no waiters on any of the sbitmap_queue wait states, we still have to loop every entry to check. We do this for every IO, so the cost adds up. Shift a bit of the cost to the slow path, when we actually have waiters. Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain an internal count of how many are currently active. Then we can simply check this count in sbq_wake_ptr() and not have to loop if we don't have any sleepers. Convert the two users of sbitmap with waiting, blk-mq-tag and iSCSI. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq-tag: document tag iteration helper return valueJens Axboe2018-11-081-4/+8
| | | | | | | | Document the fact that the strategy function passed in can control whether to continue iterating or not. Suggested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq-tag: change busy_iter_fn to return whether to continue or notJens Axboe2018-11-081-2/+2
| | | | | | | | | | We have this functionality in sbitmap, but we don't export it in blk-mq for users of the tags busy iteration. This can be useful for stopping the iteration, if the caller doesn't need to find more requests. Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>