summaryrefslogtreecommitdiffstats
path: root/block/blk-mq.h
Commit message (Collapse)AuthorAgeFilesLines
* block: directly insert blk-mq request from blk_insert_cloned_request()Jens Axboe2017-09-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A NULL pointer crash was reported for the case of having the BFQ IO scheduler attached to the underlying blk-mq paths of a DM multipath device. The crash occured in blk_mq_sched_insert_request()'s call to e->type->ops.mq.insert_requests(). Paolo Valente correctly summarized why the crash occured with: "the call chain (dm_mq_queue_rq -> map_request -> setup_clone -> blk_rq_prep_clone) creates a cloned request without invoking e->type->ops.mq.prepare_request for the target elevator e. The cloned request is therefore not initialized for the scheduler, but it is however inserted into the scheduler by blk_mq_sched_insert_request." All said, a request-based DM multipath device's IO scheduler should be the only one used -- when the original requests are issued to the underlying paths as cloned requests they are inserted directly in the underlying dispatch queue(s) rather than through an additional elevator. But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO schedulers") switched blk_insert_cloned_request() from using blk_mq_insert_request() to blk_mq_sched_insert_request(). Which incorrectly added elevator machinery into a call chain that isn't supposed to have any. To fix this introduce a blk-mq private blk_mq_request_bypass_insert() that blk_insert_cloned_request() calls to insert the request without involving any elevator that may be attached to the cloned request's request_queue. Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers") Cc: stable@vger.kernel.org Reported-by: Bart Van Assche <Bart.VanAssche@wdc.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: provide internal in-flight variantJens Axboe2017-08-091-0/+3
| | | | | | | | | | We don't have to inc/dec some counter, since we can just iterate the tags. That makes inc/dec a noop, but means we have to iterate busy tags to get an in-flight count. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'irq-core-for-linus' of ↵Linus Torvalds2017-07-031-5/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The irq department delivers: - Expand the generic infrastructure handling the irq migration on CPU hotplug and convert X86 over to it. (Thomas Gleixner) Aside of consolidating code this is a preparatory change for: - Finalizing the affinity management for multi-queue devices. The main change here is to shut down interrupts which are affine to a outgoing CPU and reenabling them when the CPU comes online again. That avoids moving interrupts pointlessly around and breaking and reestablishing affinities for no value. (Christoph Hellwig) Note: This contains also the BLOCK-MQ and NVME changes which depend on the rework of the irq core infrastructure. Jens acked them and agreed that they should go with the irq changes. - Consolidation of irq domain code (Marc Zyngier) - State tracking consolidation in the core code (Jeffy Chen) - Add debug infrastructure for hierarchical irq domains (Thomas Gleixner) - Infrastructure enhancement for managing generic interrupt chips via devmem (Bartosz Golaszewski) - Constification work all over the place (Tobias Klauser) - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni) - The usual set of fixes, updates and enhancements all over the place" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits) irqchip/or1k-pic: Fix interrupt acknowledgement irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity nvme: Allocate queues for all possible CPUs blk-mq: Create hctx for each present CPU blk-mq: Include all present CPUs in the default queue mapping genirq: Avoid unnecessary low level irq function calls genirq: Set irq masked state when initializing irq_desc genirq/timings: Add infrastructure for estimating the next interrupt arrival time genirq/timings: Add infrastructure to track the interrupt timings genirq/debugfs: Remove pointless NULL pointer check irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID irqchip/gic-v3-its: Add ACPI NUMA node mapping irqchip/gic-v3-its-platform-msi: Make of_device_ids const irqchip/gic-v3-its: Make of_device_ids const irqchip/irq-mvebu-icu: Add new driver for Marvell ICU irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU genirq/irqdomain: Remove auto-recursive hierarchy support irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access ...
| * blk-mq: Create hctx for each present CPUChristoph Hellwig2017-06-281-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we only create hctx for online CPUs, which can lead to a lot of churn due to frequent soft offline / online operations. Instead allocate one for each present CPU to avoid this and dramatically simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <keith.busch@intel.com> Cc: linux-block@vger.kernel.org Cc: linux-nvme@lists.infradead.org Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | blk-mq: remove __blk_mq_alloc_requestChristoph Hellwig2017-06-181-6/+0
| | | | | | | | | | | | | | | | Move most code into blk_mq_rq_ctx_init, and the rest into blk_mq_get_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: simplify blk_mq_free_requestChristoph Hellwig2017-06-181-3/+0
| | | | | | | | | | | | | | | | Merge three functions only tail-called by blk_mq_free_request into blk_mq_free_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: mark blk_mq_rq_ctx_init staticChristoph Hellwig2017-06-181-2/+0
|/ | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: move debugfs declarations to a separate header fileOmar Sandoval2017-05-041-28/+0
| | | | | | | | Preparation for adding more declarations. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-debugfs: Rename functions for registering and unregistering the mq ↵Bart Van Assche2017-04-261-4/+4
| | | | | | | | | | | | | directory Since the blk_mq_debugfs_*register_hctxs() functions register and unregister all attributes under the "mq" directory, rename these into blk_mq_debugfs_*register_mq(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: Let blk_mq_debugfs_register() look up the queue nameBart Van Assche2017-04-261-3/+2
| | | | | | | | | | | | A later patch will move the call of blk_mq_debugfs_register() to a function to which the queue name is not passed as an argument. To avoid having to add a 'name' argument to multiple callers, let blk_mq_debugfs_register() look up the queue name. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: Register <dev>/queue/mq after having registered <dev>/queueBart Van Assche2017-04-261-0/+1
| | | | | | | | | | | | | | A later patch in this series will modify blk_mq_debugfs_register() such that it uses q->kobj.parent to determine the name of a request queue. Hence make sure that that pointer is initialized before blk_mq_debugfs_register() is called. To avoid lock inversion, protect sysfs / debugfs registration with the queue sysfs_lock instead of the global mutex all_q_mutex. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: add shallow depth option for blk_mq_get_tag()Omar Sandoval2017-04-141-0/+1
| | | | | | | | Wire up the sbitmap_get_shallow() operation to the tag code so that a caller can limit the number of tags available to it. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-linus' into for-4.12/blockJens Axboe2017-04-071-1/+1
|\ | | | | | | | | | | | | | | | | | | We've added a considerable amount of fixes for stalls and issues with the blk-mq scheduling in the 4.11 series since forking off the for-4.12/block branch. We need to do improvements on top of that for 4.12, so pull in the previous fixes to make our lives easier going forward. Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: use the right hctx when getting a driver tag failsOmar Sandoval2017-04-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While dispatching requests, if we fail to get a driver tag, we mark the hardware queue as waiting for a tag and put the requests on a hctx->dispatch list to be run later when a driver tag is freed. However, blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware queues if using a single-queue scheduler with a multiqueue device. If blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we are processing. This means we end up using the hardware queue of the previous request, which may or may not be the same as that of the current request. If it isn't, the wrong hardware queue will end up waiting for a tag, and the requests will be on the wrong dispatch list, leading to a hang. The fix is twofold: 1. Make sure we save which hardware queue we were trying to get a request for in blk_mq_get_driver_tag() regardless of whether it succeeds or not. 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a blk_mq_hw_queue to make it clear that it must handle multiple hardware queues, since I've already messed this up on a couple of occasions. This didn't appear in testing with nvme and mq-deadline because nvme has more driver tags than the default number of scheduler tags. However, with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-stat: convert to callback-based statistics reportingOmar Sandoval2017-03-211-1/+0
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, statistics are gathered in ~0.13s windows, and users grab the statistics whenever they need them. This is not ideal for both in-tree users: 1. Writeback throttling wants its own dynamically sized window of statistics. Since the blk-stats statistics are reset after every window and the wbt windows don't line up with the blk-stats windows, wbt doesn't see every I/O. 2. Polling currently grabs the statistics on every I/O. Again, depending on how the window lines up, we may miss some I/Os. It's also unnecessary overhead to get the statistics on every I/O; the hybrid polling heuristic would be just as happy with the statistics from the previous full window. This reworks the blk-stats infrastructure to be callback-based: users register a callback that they want called at a given time with all of the statistics from the window during which the callback was active. Users can dynamically bucketize the statistics. wbt and polling both currently use read vs. write, but polling can be extended to further subdivide based on request size. The callbacks are kept on an RCU list, and each callback has percpu stats buffers. There will only be a few users, so the overhead on the I/O completion side is low. The stats flushing is also simplified considerably: since the timer function is responsible for clearing the statistics, we don't have to worry about stale statistics. wbt is a trivial conversion. After the conversion, the windowing problem mentioned above is fixed. For polling, we register an extra callback that caches the previous window's statistics in the struct request_queue for the hybrid polling heuristic to use. Since we no longer have a single stats buffer for the request queue, this also removes the sysfs and debugfs stats entries. To replace those, we add a debugfs entry for the poll statistics. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: make lifetime consitent between q/ctx and its kobjectMing Lei2017-03-081-0/+1
| | | | | | | | | | | | | | | Currently from kobject view, both q->mq_kobj and ctx->kobj can be released during one cycle of blk_mq_register_dev() and blk_mq_unregister_dev(). Actually, sw queue's lifetime is same with its request queue's, which is covered by request_queue->kobj. So we don't need to call kobject_put() for the two kinds of kobject in __blk_mq_unregister_dev(), instead we do that in release handler of request queue. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue()Ming Lei2017-03-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both q->mq_kobj and sw queues' kobjects should have been initialized once, instead of doing that each add_disk context. Also this patch removes clearing of ctx in blk_mq_init_cpu_queues() because percpu allocator fills zero to allocated variable. This patch fixes one issue[1] reported from Omar. [1] kernel wearning when doing unbind/bind on one scsi-mq device [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong. [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34 [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014 [ 19.350920] Workqueue: events_unbound async_run_entry_fn [ 19.350920] Call Trace: [ 19.350920] dump_stack+0x63/0x83 [ 19.350920] kobject_init+0x77/0x90 [ 19.350920] blk_mq_register_dev+0x40/0x130 [ 19.350920] blk_register_queue+0xb6/0x190 [ 19.350920] device_add_disk+0x1ec/0x4b0 [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod] [ 19.350920] async_run_entry_fn+0x48/0x150 [ 19.350920] process_one_work+0x1d0/0x480 [ 19.350920] worker_thread+0x48/0x4e0 [ 19.350920] kthread+0x101/0x140 [ 19.350920] ? process_one_work+0x480/0x480 [ 19.350920] ? kthread_create_on_node+0x60/0x60 [ 19.350920] ret_from_fork+0x2c/0x40 Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: kill blk_mq_set_alloc_data()Omar Sandoval2017-03-021-10/+0
| | | | | | | | Nothing is using it anymore. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com> Tested-by: Sagi Grimberg <sagi@grimberg.me>
* block: use same block debugfs directory for blk-mq and blktraceOmar Sandoval2017-02-021-5/+0
| | | | | | | | | | When I added the blk-mq debugging information to debugfs, I didn't notice that blktrace also creates a "block" directory in debugfs. Make them use the same dentry, now created in the core block code. Based on a patch from Jens. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: fix debugfs compilation issuesOmar Sandoval2017-01-271-5/+6
| | | | | | | | | | | | | | | | | This fixes a couple of problems: 1. In the !CONFIG_DEBUG_FS case, the stub definitions were bogus. 2. In the !CONFIG_BLOCK case, blk-mq-debugfs.c shouldn't be compiled at all. Fix the stub definitions and add a CONFIG_BLK_DEBUG_FS Kconfig option. Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree") Signed-off-by: Omar Sandoval <osandov@fb.com> Augment Kconfig description. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: add flush insertion into blk_mq_sched_insert_request()Jens Axboe2017-01-271-0/+2
| | | | | | | | | | Instead of letting the caller check this and handle the details of inserting a flush request, put the logic in the scheduler insertion function. This fixes direct flush insertion outside of the usual make_request_fn calls, like from dm via blk_insert_cloned_request(). Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: fix starvation for multiple hardware queues and shared tagsJens Axboe2017-01-271-0/+1
| | | | | | | | | | | | | | If we have both multiple hardware queues and shared tag map between devices, we need to ensure that we propagate the hardware queue restart bit higher up. This is because we can get into a situation where we don't have any IO pending on a hardware queue, yet we fail getting a tag to start new IO. If that happens, it's not enough to mark the hardware queue as needing a restart, we need to bubble that up to the higher level queue as well. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Tested-by: Hannes Reinecke <hare@suse.com>
* blk-mq: create debugfs directory treeOmar Sandoval2017-01-271-0/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for putting blk-mq debugging information in debugfs, create a directory tree mirroring the one in sysfs: # tree -d /sys/kernel/debug/block /sys/kernel/debug/block |-- nvme0n1 | `-- mq | |-- 0 | | `-- cpu0 | |-- 1 | | `-- cpu1 | |-- 2 | | `-- cpu2 | `-- 3 | `-- cpu3 `-- vda `-- mq `-- 0 |-- cpu0 |-- cpu1 |-- cpu2 `-- cpu3 Also add the scaffolding for the actual files that will go in here, either under the hardware queue or software queue directories. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq-sched: add framework for MQ capable IO schedulersJens Axboe2017-01-171-1/+7
| | | | | | | | | | | | | | | | | | This adds a set of hooks that intercepts the blk-mq path of allocating/inserting/issuing/completing requests, allowing us to develop a scheduler within that framework. We reuse the existing elevator scheduler API on the registration side, but augment that with the scheduler flagging support for the blk-mq interfce, and with a separate set of ops hooks for MQ devices. We split driver and scheduler tags, so we can run the scheduling independently of device queue depth. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
* blk-mq: abstract out helpers for allocating/freeing tag mapsJens Axboe2017-01-171-5/+9
| | | | | | | | Prep patch for adding an extra tag map for scheduler requests. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
* blk-mq-tag: cleanup the normal/reserved tag allocationJens Axboe2017-01-171-0/+5
| | | | | | | | | | This is in preparation for having another tag set available. Cleanup the parameters, and allow passing in of tags for blk_mq_put_tag(). Signed-off-by: Jens Axboe <axboe@fb.com> [hch: even more cleanups] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
* blk-mq: export some helpers we need to the scheduling frameworkJens Axboe2017-01-171-0/+25
| | | | | | Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Omar Sandoval <osandov@fb.com>
* Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds2016-12-141-1/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull SCSI updates from James Bottomley: "This update includes the usual round of major driver updates (ncr5380, lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas). There's also an assortment of minor fixes, mostly in error legs or other not very user visible stuff. The major change is the pci_alloc_irq_vectors replacement for the old pci_msix_.. calls; this effectively makes IRQ mapping generic for the drivers and allows blk_mq to use the information" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (256 commits) scsi: qla4xxx: switch to pci_alloc_irq_vectors scsi: hisi_sas: support deferred probe for v2 hw scsi: megaraid_sas: switch to pci_alloc_irq_vectors scsi: scsi_devinfo: remove synchronous ALUA for NETAPP devices scsi: be2iscsi: set errno on error path scsi: be2iscsi: set errno on error path scsi: hpsa: fallback to use legacy REPORT PHYS command scsi: scsi_dh_alua: Fix RCU annotations scsi: hpsa: use %phN for short hex dumps scsi: hisi_sas: fix free'ing in probe and remove scsi: isci: switch to pci_alloc_irq_vectors scsi: ipr: Fix runaway IRQs when falling back from MSI to LSI scsi: dpt_i2o: double free on error path scsi: cxlflash: Migrate scsi command pointer to AFU command scsi: cxlflash: Migrate IOARRIN specific routines to function pointers scsi: cxlflash: Cleanup queuecommand() scsi: cxlflash: Cleanup send_tmf() scsi: cxlflash: Remove AFU command lock scsi: cxlflash: Wait for active AFU commands to timeout upon tear down scsi: cxlflash: Remove private command pool ...
| * blk-mq: export blk_mq_map_queuesChristoph Hellwig2016-11-081-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | This will allow SCSI to have a single blk_mq_ops structure that either lets the LLDD map the queues to PCIe MSIx vectors or use the default. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
* | blk-mq: abstract out blk_mq_dispatch_rq_list() helperJens Axboe2016-12-091-0/+1
| | | | | | | | | | | | | | | | Takes a list of requests, and dispatches it. Moves any residual requests to the dispatch list. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com>
* | block: add scalable completion tracking of requestsJens Axboe2016-11-101-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. The feature is off by default, to avoid any extra overhead. In-kernel users of it can turn it on by setting QUEUE_FLAG_STATS in the queue flags. We currently don't turn it on if someone just reads any of the stats files, that is something we could add as well. Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: Introduce blk_mq_hctx_stopped()Bart Van Assche2016-11-021-0/+5
|/ | | | | | | | | | | | | Multiple functions test the BLK_MQ_S_STOPPED bit so introduce a helper function that performs this test. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-4.9/block-smp' of git://git.kernel.dk/linux-blockLinus Torvalds2016-10-091-7/+0
|\ | | | | | | | | | | | | | | | | | | Pull blk-mq CPU hotplug update from Jens Axboe: "This is the conversion of blk-mq to the new hotplug state machine" * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block: blk-mq: fixup "Convert to new hotplug state machine" blk-mq: Convert to new hotplug state machine blk-mq/cpu-notif: Convert to new hotplug state machine
| * blk-mq/cpu-notif: Convert to new hotplug state machineThomas Gleixner2016-09-221-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | Replace the block-mq notifier list management with the multi instance facility in the cpu hotplug state machine. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-block@vger.kernel.org Cc: rt@linutronix.de Cc: Christoph Hellwing <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-blockLinus Torvalds2016-10-091-3/+7
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull blk-mq irq/cpu mapping updates from Jens Axboe: "This is the block-irq topic branch for 4.9-rc. It's mostly from Christoph, and it allows drivers to specify their own mappings, and more importantly, to share the blk-mq mappings with the IRQ affinity mappings. It's a good step towards making this work better out of the box" * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block: blk_mq: linux/blk-mq.h does not include all the headers it depends on blk-mq: kill unused blk_mq_create_mq_map() blk-mq: get rid of the cpumask in struct blk_mq_tags nvme: remove the post_scan callout nvme: switch to use pci_alloc_irq_vectors blk-mq: provide a default queue mapping for PCI device blk-mq: allow the driver to pass in a queue mapping blk-mq: remove ->map_queue blk-mq: only allocate a single mq_map per tag_set blk-mq: don't redistribute hardware queues on a CPU hotplug event
| * blk-mq: allow the driver to pass in a queue mappingChristoph Hellwig2016-09-151-3/+1
| | | | | | | | | | | | | | | | | | | | | | This allows drivers specify their own queue mapping by overriding the setup-time function that builds the mq_map. This can be used for example to build the map based on the MSI-X vector mapping provided by the core interrupt layer for PCI devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq: remove ->map_queueChristoph Hellwig2016-09-151-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | All drivers use the default, so provide an inline version of it. If we ever need other queue mapping we can add an optional method back, although supporting will also require major changes to the queue setup code. This provides better code generation, and better debugability as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | sbitmap: push per-cpu last_tag into sbitmap_queueOmar Sandoval2016-09-171-2/+0
| | | | | | | | | | | | | | | | | | | | Allocating your own per-cpu allocation hint separately makes for an awkward API. Instead, allocate the per-cpu hint as part of the struct sbitmap_queue. There's no point for a struct sbitmap_queue without the cache, but you can still use a bare struct sbitmap. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: abstract tag allocation out into sbitmap libraryOmar Sandoval2016-09-171-9/+0
|/ | | | | | | | | | | | | | This is a generally useful data structure, so make it available to anyone else who might want to use it. It's also a nice cleanup separating the allocation logic from the rest of the tag handling logic. The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only selected by CONFIG_BLOCK for now. This should be a complete noop functionality-wise. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: dynamic h/w context countKeith Busch2016-02-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The hardware's provided queue count may change at runtime with resource provisioning. This patch allows a block driver to alter the number of h/w queues available when its resource count changes. The main part is a new blk-mq API to request a new number of h/w queues for a given live tag set. The new API freezes all queues using that set, then adjusts the allocated count prior to remapping these to CPUs. The bulk of the rest just shifts where h/w contexts and all their artifacts are allocated and freed. The number of max h/w contexts is capped to the number of possible cpus since there is no use for more than that. As such, all pre-allocated memory for pointers need to account for the max possible rather than the initial number of queues. A side effect of this is that the blk-mq will proceed successfully as long as it can allocate at least one h/w context. Previously it would fail request queue initialization if less than the requested number was allocated. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: add a flags parameter to blk_mq_alloc_requestChristoph Hellwig2015-12-011-7/+4
| | | | | | | | | We already have the reserved flag, and a nowait flag awkwardly encoded as a gfp_t. Add a real flags argument to make the scheme more extensible and allow for a nicer calling convention. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: mark __blk_mq_complete_request() staticJens Axboe2015-11-111-1/+0
| | | | | | It's no longer used outside of blk-mq core. Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: remove unused blk_mq_clone_flush_request prototypeChristoph Hellwig2015-10-091-2/+0
| | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: avoid inserting requests before establishing new mappingAkinobu Mita2015-09-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Notifier callbacks for CPU_ONLINE action can be run on the other CPU than the CPU which was just onlined. So it is possible for the process running on the just onlined CPU to insert request and run hw queue before establishing new mapping which is done by blk_mq_queue_reinit_notify(). This can cause a problem when the CPU has just been onlined first time since the request queue was initialized. At this time ctx->index_hw for the CPU, which is the index in hctx->ctxs[] for this ctx, is still zero before blk_mq_queue_reinit_notify() is called by notifier callbacks for CPU_ONLINE action. For example, there is a single hw queue (hctx) and two CPU queues (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and a request is inserted into ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is still zero. And then while running hw queue, flush_busy_ctxs() finds bit0 is set in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list. But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is ignored. Fix it by ensuring that new mapping is established before onlined cpu starts running. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: release mq's kobjects in blk_release_queue()Ming Lei2015-01-291-0/+2
| | | | | | | | | | | | | | | | The kobject memory inside blk-mq hctx/ctx shouldn't have been freed before the kobject is released because driver core can access it freely before its release. We can't do that in all ctx/hctx/mq_kobj's release handler because it can be run before blk_cleanup_queue(). Given mq_kobj shouldn't have been introduced, this patch simply moves mq's release into blk_release_queue(). Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: wake up waiters when a queue is marked dyingJens Axboe2014-12-311-0/+1
| | | | | | | | | | | If it's dying, we can't expect new request to complete and come in an wake up other tasks waiting for requests. So after we have marked it as dying, wake up everybody currently waiting for a request. Once they wake, they will retry their allocation and fail appropriately due to the state of the queue. Tested-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: prevent unmapped hw queue from being scheduledMing Lei2014-12-081-0/+5
| | | | | | | | | | | | When one hardware queue has no mapped software queues, it shouldn't have been scheduled. Otherwise WARNING or OOPS can triggered. blk_mq_hw_queue_mapped() helper is introduce for fixing the problem. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: introduce blk_init_flush and its pairMing Lei2014-09-251-1/+0
| | | | | | | | | | | | These two temporary functions are introduced for holding flush initialization and de-initialization, so that we can introduce 'flush queue' easier in the following patch. And once 'flush queue' and its allocation/free functions are ready, they will be removed for sake of code readability. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: allocate flush_rq in blk_mq_init_flush()Ming Lei2014-09-251-1/+1
| | | | | | | | It is reasonable to allocate flush req in blk_mq_init_flush(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: fix blk_abort_request on blk-mqChristoph Hellwig2014-09-221-0/+2
| | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header. Signed-off-by: Jens Axboe <axboe@fb.com>