summaryrefslogtreecommitdiffstats
path: root/block/Makefile
Commit message (Collapse)AuthorAgeFilesLines
* block, bfq: split bfq-iosched.c into multiple source filesPaolo Valente2017-04-191-1/+2
| | | | | | | | | | | | | | | | | | | | | The BFQ I/O scheduler features an optimal fair-queuing (proportional-share) scheduling algorithm, enriched with several mechanisms to boost throughput and reduce latency for interactive and real-time applications. This makes BFQ a large and complex piece of code. This commit addresses this issue by splitting BFQ into three main, independent components, and by moving each component into a separate source file: 1. Main algorithm: handles the interaction with the kernel, and decides which requests to dispatch; it uses the following two further components to achieve its goals. 2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm): computes the schedule, using weights and budgets provided by the above component. 3. cgroups support: handles group operations (creation, destruction, move, ...). Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* block, bfq: introduce the BFQ-v0 I/O scheduler as an extra schedulerPaolo Valente2017-04-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We tag as v0 the version of BFQ containing only BFQ's engine plus hierarchical support. BFQ's engine is introduced by this commit, while hierarchical support is added by next commit. We use the v0 tag to distinguish this minimal version of BFQ from the versions containing also the features and the improvements added by next commits. BFQ-v0 coincides with the version of BFQ submitted a few years ago [1], apart from the introduction of preemption, described below. BFQ is a proportional-share I/O scheduler, whose general structure, plus a lot of code, are borrowed from CFQ. - Each process doing I/O on a device is associated with a weight and a (bfq_)queue. - BFQ grants exclusive access to the device, for a while, to one queue (process) at a time, and implements this service model by associating every queue with a budget, measured in number of sectors. - After a queue is granted access to the device, the budget of the queue is decremented, on each request dispatch, by the size of the request. - The in-service queue is expired, i.e., its service is suspended, only if one of the following events occurs: 1) the queue finishes its budget, 2) the queue empties, 3) a "budget timeout" fires. - The budget timeout prevents processes doing random I/O from holding the device for too long and dramatically reducing throughput. - Actually, as in CFQ, a queue associated with a process issuing sync requests may not be expired immediately when it empties. In contrast, BFQ may idle the device for a short time interval, giving the process the chance to go on being served if it issues a new request in time. Device idling typically boosts the throughput on rotational devices, if processes do synchronous and sequential I/O. In addition, under BFQ, device idling is also instrumental in guaranteeing the desired throughput fraction to processes issuing sync requests (see [2] for details). - With respect to idling for service guarantees, if several processes are competing for the device at the same time, but all processes (and groups, after the following commit) have the same weight, then BFQ guarantees the expected throughput distribution without ever idling the device. Throughput is thus as high as possible in this common scenario. - Queues are scheduled according to a variant of WF2Q+, named B-WF2Q+, and implemented using an augmented rb-tree to preserve an O(log N) overall complexity. See [2] for more details. B-WF2Q+ is also ready for hierarchical scheduling. However, for a cleaner logical breakdown, the code that enables and completes hierarchical support is provided in the next commit, which focuses exactly on this feature. - B-WF2Q+ guarantees a tight deviation with respect to an ideal, perfectly fair, and smooth service. In particular, B-WF2Q+ guarantees that each queue receives a fraction of the device throughput proportional to its weight, even if the throughput fluctuates, and regardless of: the device parameters, the current workload and the budgets assigned to the queue. - The last, budget-independence, property (although probably counterintuitive in the first place) is definitely beneficial, for the following reasons: - First, with any proportional-share scheduler, the maximum deviation with respect to an ideal service is proportional to the maximum budget (slice) assigned to queues. As a consequence, BFQ can keep this deviation tight not only because of the accurate service of B-WF2Q+, but also because BFQ *does not* need to assign a larger budget to a queue to let the queue receive a higher fraction of the device throughput. - Second, BFQ is free to choose, for every process (queue), the budget that best fits the needs of the process, or best leverages the I/O pattern of the process. In particular, BFQ updates queue budgets with a simple feedback-loop algorithm that allows a high throughput to be achieved, while still providing tight latency guarantees to time-sensitive applications. When the in-service queue expires, this algorithm computes the next budget of the queue so as to: - Let large budgets be eventually assigned to the queues associated with I/O-bound applications performing sequential I/O: in fact, the longer these applications are served once got access to the device, the higher the throughput is. - Let small budgets be eventually assigned to the queues associated with time-sensitive applications (which typically perform sporadic and short I/O), because, the smaller the budget assigned to a queue waiting for service is, the sooner B-WF2Q+ will serve that queue (Subsec 3.3 in [2]). - Weights can be assigned to processes only indirectly, through I/O priorities, and according to the relation: weight = 10 * (IOPRIO_BE_NR - ioprio). The next patch provides, instead, a cgroups interface through which weights can be assigned explicitly. - If several processes are competing for the device at the same time, but all processes and groups have the same weight, then BFQ guarantees the expected throughput distribution without ever idling the device. It uses preemption instead. Throughput is then much higher in this common scenario. - ioprio classes are served in strict priority order, i.e., lower-priority queues are not served as long as there are higher-priority queues. Among queues in the same class, the bandwidth is distributed in proportion to the weight of each queue. A very thin extra bandwidth is however guaranteed to the Idle class, to prevent it from starving. - If the strict_guarantees parameter is set (default: unset), then BFQ - always performs idling when the in-service queue becomes empty; - forces the device to serve one I/O request at a time, by dispatching a new request only if there is no outstanding request. In the presence of differentiated weights or I/O-request sizes, both the above conditions are needed to guarantee that every queue receives its allotted share of the bandwidth (see Documentation/block/bfq-iosched.txt for more details). Setting strict_guarantees may evidently affect throughput. [1] https://lkml.org/lkml/2008/4/1/234 https://lkml.org/lkml/2008/11/11/148 [2] P. Valente and M. Andreolini, "Improving Application Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR '12), June 2012. Slightly extended version: http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite- results.pdf Signed-off-by: Fabio Checconi <fchecconi@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: introduce Kyber multiqueue I/O schedulerOmar Sandoval2017-04-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Kyber I/O scheduler is an I/O scheduler for fast devices designed to scale to multiple queues. Users configure only two knobs, the target read and synchronous write latencies, and the scheduler tunes itself to achieve that latency goal. The implementation is based on "tokens", built on top of the scalable bitmap library. Tokens serve as a mechanism for limiting requests. There are two tiers of tokens: queueing tokens and dispatch tokens. A queueing token is required to allocate a request. In fact, these tokens are actually the blk-mq internal scheduler tags, but the scheduler manages the allocation directly in order to implement its policy. Dispatch tokens are device-wide and split up into two scheduling domains: reads vs. writes. Each hardware queue dispatches batches round-robin between the scheduling domains as long as tokens are available for that domain. These tokens can be used as the mechanism to enable various policies. The policy Kyber uses is inspired by active queue management techniques for network routing, similar to blk-wbt. The scheduler monitors latencies and scales the number of dispatch tokens accordingly. Queueing tokens are used to prevent starvation of synchronous requests by asynchronous requests. Various extensions are possible, including better heuristics and ionice support. The new scheduler isn't set as the default yet. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds2017-03-021-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull vhost updates from Michael Tsirkin: "virtio, vhost: optimizations, fixes Looks like a quiet cycle for vhost/virtio, just a couple of minor tweaks. Most notable is automatic interrupt affinity for blk and scsi. Hopefully other devices are not far behind" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio-console: avoid DMA from stack vhost: introduce O(1) vq metadata cache virtio_scsi: use virtio IRQ affinity virtio_blk: use virtio IRQ affinity blk-mq: provide a default queue mapping for virtio device virtio: provide a method to get the IRQ affinity mask for a virtqueue virtio: allow drivers to request IRQ affinity when creating VQs virtio_pci: simplify MSI-X setup virtio_pci: don't duplicate the msix_enable flag in struct pci_dev virtio_pci: use shared interrupts for virtqueues virtio_pci: remove struct virtio_pci_vq_info vhost: try avoiding avail index access when getting descriptor virtio_mmio: expose header to userspace
| * blk-mq: provide a default queue mapping for virtio deviceChristoph Hellwig2017-02-271-0/+1
| | | | | | | | | | | | | | Similar to the PCI version, just calling into virtio instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* | Merge branch 'for-4.11/next' into for-4.11/linus-mergeJens Axboe2017-02-171-2/+3
|\ \ | | | | | | | | | Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block: make scsi_request and scsi ioctl support optionalChristoph Hellwig2017-01-311-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We only need this code to support scsi, ide, cciss and virtio. And at least for virtio it's a deprecated feature to start with. This should shrink the kernel size for embedded device that only use, say eMMC a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | block: Add Sed-opal libraryScott Bauer2017-02-061-0/+1
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements the necessary logic to bring an Opal enabled drive out of a factory-enabled into a working Opal state. This patch set also enables logic to save a password to be replayed during a resume from suspend. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Rafael Antognolli <Rafael.Antognolli@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: fix debugfs compilation issuesOmar Sandoval2017-01-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a couple of problems: 1. In the !CONFIG_DEBUG_FS case, the stub definitions were bogus. 2. In the !CONFIG_BLOCK case, blk-mq-debugfs.c shouldn't be compiled at all. Fix the stub definitions and add a CONFIG_BLK_DEBUG_FS Kconfig option. Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree") Signed-off-by: Omar Sandoval <osandov@fb.com> Augment Kconfig description. Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: create debugfs directory treeOmar Sandoval2017-01-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for putting blk-mq debugging information in debugfs, create a directory tree mirroring the one in sysfs: # tree -d /sys/kernel/debug/block /sys/kernel/debug/block |-- nvme0n1 | `-- mq | |-- 0 | | `-- cpu0 | |-- 1 | | `-- cpu1 | |-- 2 | | `-- cpu2 | `-- 3 | `-- cpu3 `-- vda `-- mq `-- 0 |-- cpu0 |-- cpu1 |-- cpu2 `-- cpu3 Also add the scaffolding for the actual files that will go in here, either under the hardware queue or software queue directories. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | mq-deadline: add blk-mq adaptation of the deadline IO schedulerJens Axboe2017-01-171-0/+1
| | | | | | | | | | | | | | | | | | This is basically identical to deadline-iosched, except it registers as a MQ capable scheduler. This is still a single queue design. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
* | blk-mq-sched: add framework for MQ capable IO schedulersJens Axboe2017-01-171-1/+1
|/ | | | | | | | | | | | | | | | | | This adds a set of hooks that intercepts the blk-mq path of allocating/inserting/issuing/completing requests, allowing us to develop a scheduler within that framework. We reuse the existing elevator scheduler API on the registration side, but augment that with the scheduler flagging support for the blk-mq interfce, and with a separate set of ops hooks for MQ devices. We split driver and scheduler tags, so we can run the scheduling independently of device queue depth. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
* blk-wbt: add general throttling mechanismJens Axboe2016-11-101-0/+1
| | | | | | | | | | | | | | | | | | | | We can hook this up to the block layer, to help throttle buffered writes. wbt registers a few trace points that can be used to track what is happening in the system: wbt_lat: 259:0: latency 2446318 wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1, wmean=518866, wmin=15522, wmax=5330353, wsamples=57 wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32 This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat dumps the current read/write stats for that window, and wbt_step shows a step down event where we now scale back writes. Each trace includes the device, 259:0 in this case. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: add scalable completion tracking of requestsJens Axboe2016-11-101-1/+1
| | | | | | | | | | | | | | | | | | For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. The feature is off by default, to avoid any extra overhead. In-kernel users of it can turn it on by setting QUEUE_FLAG_STATS in the queue flags. We currently don't turn it on if someone just reads any of the stats files, that is something we could add as well. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Implement support for zoned block devicesHannes Reinecke2016-10-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Implement zoned block device zone information reporting and reset. Zone information are reported as struct blk_zone. This implementation does not differentiate between host-aware and host-managed device models and is valid for both. Two functions are provided: blkdev_report_zones for discovering the zone configuration of a zoned block device, and blkdev_reset_zones for resetting the write pointer of sequential zones. The helper function blk_queue_zone_size and bdev_zone_size are also provided for, as the name suggest, obtaining the zone size (in 512B sectors) of the zones of the device. Signed-off-by: Hannes Reinecke <hare@suse.de> [Damien: * Removed the zone cache * Implement report zones operation based on earlier proposal by Shaun Tancheff <shaun.tancheff@seagate.com>] Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-4.9/block-smp' of git://git.kernel.dk/linux-blockLinus Torvalds2016-10-091-1/+1
|\ | | | | | | | | | | | | | | | | | | Pull blk-mq CPU hotplug update from Jens Axboe: "This is the conversion of blk-mq to the new hotplug state machine" * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block: blk-mq: fixup "Convert to new hotplug state machine" blk-mq: Convert to new hotplug state machine blk-mq/cpu-notif: Convert to new hotplug state machine
| * blk-mq/cpu-notif: Convert to new hotplug state machineThomas Gleixner2016-09-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Replace the block-mq notifier list management with the multi instance facility in the cpu hotplug state machine. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-block@vger.kernel.org Cc: rt@linutronix.de Cc: Christoph Hellwing <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk_mq: linux/blk-mq.h does not include all the headers it depends onStephen Rothwell2016-09-191-1/+1
| | | | | | | | | | | | | | | | | | and building block/blk-mq-pci.o should depend on CONFIG_BLOCK Fixes: 973c4e372c8f ("blk-mq: provide a default queue mapping for PCI device") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: provide a default queue mapping for PCI deviceChristoph Hellwig2016-09-151-1/+1
|/ | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge tag 'for-linus' of ↵Linus Torvalds2016-01-231-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "Initial roundup of 4.5 merge window patches - Remove usage of ib_query_device and instead store attributes in ib_device struct - Move iopoll out of block and into lib, rename to irqpoll, and use in several places in the rdma stack as our new completion queue polling library mechanism. Update the other block drivers that already used iopoll to use the new mechanism too. - Replace the per-entry GID table locks with a single GID table lock - IPoIB multicast cleanup - Cleanups to the IB MR facility - Add support for 64bit extended IB counters - Fix for netlink oops while parsing RDMA nl messages - RoCEv2 support for the core IB code - mlx4 RoCEv2 support - mlx5 RoCEv2 support - Cross Channel support for mlx5 - Timestamp support for mlx5 - Atomic support for mlx5 - Raw QP support for mlx5 - MAINTAINERS update for mlx4/mlx5 - Misc ocrdma, qib, nes, usNIC, cxgb3, cxgb4, mlx4, mlx5 updates - Add support for remote invalidate to the iSER driver (pushed through the RDMA tree due to dependencies, acknowledged by nab) - Update to NFSoRDMA (pushed through the RDMA tree due to dependencies, acknowledged by Bruce)" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (169 commits) IB/mlx5: Unify CQ create flags check IB/mlx5: Expose Raw Packet QP to user space consumers {IB, net}/mlx5: Move the modify QP operation table to mlx5_ib IB/mlx5: Support setting Ethernet priority for Raw Packet QPs IB/mlx5: Add Raw Packet QP query functionality IB/mlx5: Add create and destroy functionality for Raw Packet QP IB/mlx5: Refactor mlx5_ib_qp to accommodate other QP types IB/mlx5: Allocate a Transport Domain for each ucontext net/mlx5_core: Warn on unsupported events of QP/RQ/SQ net/mlx5_core: Add RQ and SQ event handling net/mlx5_core: Export transport objects IB/mlx5: Expose CQE version to user-space IB/mlx5: Add CQE version 1 support to user QPs and SRQs IB/mlx5: Fix data validation in mlx5_ib_alloc_ucontext IB/sa: Fix netlink local service GFP crash IB/srpt: Remove redundant wc array IB/qib: Improve ipoib UD performance IB/mlx4: Advertise RoCE v2 support IB/mlx4: Create and use another QP1 for RoCEv2 IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers ...
| * irq_poll: make blk-iopoll available outside the block layerChristoph Hellwig2015-12-111-1/+1
| | | | | | | | | | | | | | | | The new name is irq_poll as iopoll is already taken. Better suggestions welcome. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
* | badblocks: Add core badblock management codeVishal Verma2016-01-091-1/+1
|/ | | | | | | | | | | | | | Take the core badblocks implementation from md, and make it generally available. This follows the same style as kernel implementations of linked lists, rb-trees etc, where you can have a structure that can be embedded anywhere, and accessor functions to manipulate the data. The only changes in this copy of the code are ones to generalize function/variable names from md-specific ones. Also add init and free functions. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* block: Add T10 Protection Information functionsMartin K. Petersen2014-09-271-2/+2
| | | | | | | | | | | The T10 Protection Information format is also used by some devices that do not go through the SCSI layer (virtual block devices, NVMe). Relocate the relevant functions to a block layer library that can be used without involving SCSI. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: move mm/bounce.c to block/Jens Axboe2014-05-191-0/+1
| | | | | | | | | Continue moving some of the block files that are scattered around. bounce.c contains only code for bouncing the contents of a bio. It's block proper code, not mm code. Suggested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: move ioprio.c from fs/ to block/Jens Axboe2014-05-191-1/+2
| | | | | | | Like commit f9c78b2b, move this block related file outside of fs/ and into the core block directory, block/. Signed-off-by: Jens Axboe <axboe@fb.com>
* block: move bio.c and bio-integrity.c from fs/ to block/Jens Axboe2014-05-191-1/+2
| | | | | | | | | | They really belong in block/, especially now since it's not in drivers/block/ anymore. Additionally, the get_maintainer script gets it wrong when in fs/. Suggested-by: Christoph Hellwig <hch@infradead.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jens Axboe <axboe@fb.com>
* blk-mq: new multi-queue block IO queueing mechanismJens Axboe2013-10-251-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linux currently has two models for block devices: - The classic request_fn based approach, where drivers use struct request units for IO. The block layer provides various helper functionalities to let drivers share code, things like tag management, timeout handling, queueing, etc. - The "stacked" approach, where a driver squeezes in between the block layer and IO submitter. Since this bypasses the IO stack, driver generally have to manage everything themselves. With drivers being written for new high IOPS devices, the classic request_fn based driver doesn't work well enough. The design dates back to when both SMP and high IOPS was rare. It has problems with scaling to bigger machines, and runs into scaling issues even on smaller machines when you have IOPS in the hundreds of thousands per device. The stacked approach is then most often selected as the model for the driver. But this means that everybody has to re-invent everything, and along with that we get all the problems again that the shared approach solved. This commit introduces blk-mq, block multi queue support. The design is centered around per-cpu queues for queueing IO, which then funnel down into x number of hardware submission queues. We might have a 1:1 mapping between the two, or it might be an N:M mapping. That all depends on what the hardware supports. blk-mq provides various helper functions, which include: - Scalable support for request tagging. Most devices need to be able to uniquely identify a request both in the driver and to the hardware. The tagging uses per-cpu caches for freed tags, to enable cache hot reuse. - Timeout handling without tracking request on a per-device basis. Basically the driver should be able to get a notification, if a request happens to fail. - Optional support for non 1:1 mappings between issue and submission queues. blk-mq can redirect IO completions to the desired location. - Support for per-request payloads. Drivers almost always need to associate a request structure with some driver private command structure. Drivers can tell blk-mq this at init time, and then any request handed to the driver will have the required size of memory associated with it. - Support for merging of IO, and plugging. The stacked model gets neither of these. Even for high IOPS devices, merging sequential IO reduces per-command overhead and thus increases bandwidth. For now, this is provided as a potential 3rd queueing model, with the hope being that, as it matures, it can replace both the classic and stacked model. That would get us back to having just 1 real model for block devices, leaving the stacked approach to dm/md devices (as it was originally intended). Contributions in this patch from the following people: Shaohua Li <shli@fusionio.com> Alexander Gordeev <agordeev@redhat.com> Christoph Hellwig <hch@infradead.org> Mike Christie <michaelc@cs.wisc.edu> Matias Bjorling <m@bjorling.me> Jeff Moyer <jmoyer@redhat.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: change config option name for cmdline partition parsingPaul Gortmaker2013-09-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Recently commit bab55417b10c ("block: support embedded device command line partition") introduced CONFIG_CMDLINE_PARSER. However, that name is too generic and sounds like it enables/disables generic kernel boot arg processing, when it really is block specific. Before this option becomes a part of a full/final release, add the BLK_ prefix to it so that it is clear in absence of any other context that it is block specific. In addition, fix up the following less critical items: - help text was not really at all helpful. - index file for Documentation was not updated - add the new arg to Documentation/kernel-parameters.txt - clarify wording in source comments Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Cai Zhiyong <caizhiyong@huawei.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* block: support embedded device command line partitionCai Zhiyong2013-09-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Read block device partition table from command line. The partition used for fixed block device (eMMC) embedded device. It is no MBR, save storage space. Bootloader can be easily accessed by absolute address of data on the block device. Users can easily change the partition. This code reference MTD partition, source "drivers/mtd/cmdlinepart.c" About the partition verbose reference "Documentation/block/cmdline-partition.txt" [akpm@linux-foundation.org: fix printk text] [yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()] Signed-off-by: Cai Zhiyong <caizhiyong@huawei.com> Cc: Karel Zak <kzak@redhat.com> Cc: "Wanglin (Albert)" <albert.wanglin@huawei.com> Cc: Marius Groeger <mag@sysgo.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Brian Norris <computersforpeace@gmail.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* separate partition format handling from generic codeAl Viro2012-01-031-1/+1
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* move fs/partitions to block/Al Viro2012-01-031-1/+2
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* block: add bsg helper libraryMike Christie2011-07-311-0/+1
| | | | | | | | | | | | | This moves the FC classes bsg code to the block layer and makes it a lib so that other classes like iscsi and SAS can use it. It is helpful because working with the request queue, bios, creating scatterlists, etc are a pain that the LLD does not have to worry about with normal IOs and should not have to worry about for bsg requests. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2010-10-221-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits) xen-blkfront: disable barrier/flush write support Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c block: remove BLKDEV_IFL_WAIT aic7xxx_old: removed unused 'req' variable block: remove the BH_Eopnotsupp flag block: remove the BLKDEV_IFL_BARRIER flag block: remove the WRITE_BARRIER flag swap: do not send discards as barriers fat: do not send discards as barriers ext4: do not send discards as barriers jbd2: replace barriers with explicit flush / FUA usage jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier jbd: replace barriers with explicit flush / FUA usage nilfs2: replace barriers with explicit flush / FUA usage reiserfs: replace barriers with explicit flush / FUA usage gfs2: replace barriers with explicit flush / FUA usage btrfs: replace barriers with explicit flush / FUA usage xfs: replace barriers with explicit flush / FUA usage block: pass gfp_mask and flags to sb_issue_discard dm: convey that all flushes are processed as empty ...
| * block: rename blk-barrier.c to blk-flush.cTejun Heo2010-09-101-1/+1
| | | | | | | | | | | | | | | | | | | | Without ordering requirements, barrier and ordering are minomers. Rename block/blk-barrier.c to block/blk-flush.c. Rename of symbols will follow. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | blkio: Core implementation of throttle policyVivek Goyal2010-09-161-0/+1
|/ | | | | | | | | o Actual implementation of throttling policy in block layer. Currently it implements READ and WRITE bytes per second throttling logic. IOPS throttling comes in later patches. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* blkdev: move blkdev_issue helper functions to separate fileDmitry Monakhov2010-04-281-1/+1
| | | | | | | | | Move blkdev_issue_discard from blk-barrier.c because it is not barrier related. Later the file will be populated by other helpers. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* blkio: Introduce blkio controller cgroup interfaceVivek Goyal2009-12-031-0/+1
| | | | | | | | | o This is basic implementation of blkio controller cgroup interface. This is the common interface visible to user space and should be used by different IO control policies as we implement those. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: remove the anticipatory IO schedulerJens Axboe2009-10-031-1/+0
| | | | | | | | | AS is mostly a subset of CFQ, so there's little point in still providing this separate IO scheduler. Hopefully at some point we can get down to one single IO scheduler again, at least this brings us closer by having only one intelligent IO scheduler. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: add blk-iopoll, a NAPI like approach for block devicesJens Axboe2009-09-111-1/+1
| | | | | | | | | | | | | | | | This borrows some code from NAPI and implements a polled completion mode for block devices. The idea is the same as NAPI - instead of doing the command completion when the irq occurs, schedule a dedicated softirq in the hopes that we will complete more IO when the iopoll handler is invoked. Devices have a budget of commands assigned, and will stay in polled mode as long as they continue to consume their budget from the iopoll softirq handler. If they do not, the device is set back to interrupt completion mode. This patch holds the core bits for blk-iopoll, device driver support sold separately. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: get rid of queue-private command filterJens Axboe2009-07-011-1/+1
| | | | | | | | | | The initial patches to support this through sysfs export were broken and have been if 0'ed out in any release. So lets just kill the code and reclaim some space in struct request_queue, if anyone would later like to fixup the sysfs bits, the git history can easily restore the removed bits. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* tracing/blktrace: move the tracing file to kernel/traceFrederic Weisbecker2009-02-091-1/+0
| | | | | | | | | | | Impact: cleanup Move blktrace.c to kernel/trace, also move its config entry. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* block: unify request timeout handlingJens Axboe2008-10-091-2/+2
| | | | | | | | | | | | | Right now SCSI and others do their own command timeout handling. Move those bits to the block layer. Instead of having a timer per command, we try to be a bit more clever and simply have one per-queue. This avoids the overhead of having to tear down and setup a timer for each command, so it will result in a lot less timer fiddling. Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: split softirq handling into blk-softirq.cJens Axboe2008-10-091-2/+2
| | | | Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* allow userspace to modify scsi command filter on per device basisAdel Gadllah2008-07-031-1/+2
| | | | | | | | | | | | This patch exports the per-gendisk command filter to user space through sysfs, so it can be changed by the system administrator. All users of the old cmd filter have been converted to use the new one. Original patch from Peter Jones. Signed-off-by: Adel Gadllah <adel.gadllah@gmail.com> Signed-off-by: Peter Jones <pjones@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: Block layer data integrity supportMartin K. Petersen2008-07-031-0/+1
| | | | | | | | | | | | | | | Some block devices support verifying the integrity of requests by way of checksums or other protection information that is submitted along with the I/O. This patch implements support for generating and verifying integrity metadata, as well as correctly merging, splitting and cloning bios and requests that have this extra information attached. See Documentation/block/data-integrity.txt for more information. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: ll_rw_blk.c split, add blk-merge.cJens Axboe2008-01-291-1/+1
| | | | Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: continue ll_rw_blk.c splitupJens Axboe2008-01-291-2/+3
| | | | | | | Adds files for barrier handling, rq execution, io context handling, mapping data to requests, and queue settings. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: split tag and sysfs handling from blk-core.cJens Axboe2008-01-291-1/+2
| | | | | | Seperates the tag and sysfs handling from ll_rw_blk. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* block: first step of splitting ll_rw_blk, rename itJens Axboe2008-01-291-1/+1
| | | | | | Then we retain history in blk-core.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* [BLOCK] Only include the compat ioctl code if CONFIG_BLOCK is setJens Axboe2007-10-121-1/+1
| | | | | | Add an extra CONFIG_BLOCK_COMPAT that we can use in the Makefile Signed-off-by: Jens Axboe <jens.axboe@oracle.com>