diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2017-05-01 10:39:57 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2017-05-01 10:39:57 -0700 |
commit | 694752922b12bd318aa80191bd9d8c3dcfb39055 (patch) | |
tree | 5afe83fd99100bea546dd5a1c1f778c58f41e5c0 /block | |
parent | a351e9b9fc24e982ec2f0e76379a49826036da12 (diff) | |
parent | 9438b3e080beccf6022138ea62192d55cc7dc4ed (diff) | |
download | linux-stable-694752922b12bd318aa80191bd9d8c3dcfb39055.tar.gz linux-stable-694752922b12bd318aa80191bd9d8c3dcfb39055.tar.bz2 linux-stable-694752922b12bd318aa80191bd9d8c3dcfb39055.zip |
Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
- Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
was initially a fork of CFQ, but subsequently changed to implement
fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
to be used on desktop type single drives, providing good fairness.
From Paolo.
- Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
using a scalable token based algorithm that throttles IO based on
live completion IO stats, similary to blk-wbt. From Omar.
- A series from Jan, moving users to separately allocated backing
devices. This continues the work of separating backing device life
times, solving various problems with hot removal.
- A series of updates for lightnvm, mostly from Javier. Includes a
'pblk' target that exposes an open channel SSD as a physical block
device.
- A series of fixes and improvements for nbd from Josef.
- A series from Omar, removing queue sharing between devices on mostly
legacy drivers. This helps us clean up other bits, if we know that a
queue only has a single device backing. This has been overdue for
more than a decade.
- Fixes for the blk-stats, and improvements to unify the stats and user
windows. This both improves blk-wbt, and enables other users to
register a need to receive IO stats for a device. From Omar.
- blk-throttle improvements from Shaohua. This provides a scalable
framework for implementing scalable priotization - particularly for
blk-mq, but applicable to any type of block device. The interface is
marked experimental for now.
- Bucketized IO stats for IO polling from Stephen Bates. This improves
efficiency of polled workloads in the presence of mixed block size
IO.
- A few fixes for opal, from Scott.
- A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
From a variety of folks, mostly Sagi and James Smart.
- A series from Bart, improving our exposed info and capabilities from
the blk-mq debugfs support.
- A series from Christoph, cleaning up how handle WRITE_ZEROES.
- A series from Christoph, cleaning up the block layer handling of how
we track errors in a request. On top of being a nice cleanup, it also
shrinks the size of struct request a bit.
- Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
never used by platforms, and the latter has outlived it's usefulness.
- Various little bug fixes and cleanups from a wide variety of folks.
* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
block: hide badblocks attribute by default
blk-mq: unify hctx delay_work and run_work
block: add kblock_mod_delayed_work_on()
blk-mq: unify hctx delayed_run_work and run_work
nbd: fix use after free on module unload
MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
blk-mq-sched: alloate reserved tags out of normal pool
mtip32xx: use runtime tag to initialize command header
scsi: Implement blk_mq_ops.show_rq()
blk-mq: Add blk_mq_ops.show_rq()
blk-mq: Show operation, cmd_flags and rq_flags names
blk-mq: Make blk_flags_show() callers append a newline character
blk-mq: Move the "state" debugfs attribute one level down
blk-mq: Unregister debugfs attributes earlier
blk-mq: Only unregister hctxs for which registration succeeded
blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
blk-mq: Let blk_mq_debugfs_register() look up the queue name
blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
ide-pm: always pass 0 error to __blk_end_request_all
..
Diffstat (limited to 'block')
45 files changed, 11837 insertions, 1171 deletions
diff --git a/block/Kconfig b/block/Kconfig index e9f780f815f5..89cd28f8d051 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -115,6 +115,18 @@ config BLK_DEV_THROTTLING See Documentation/cgroups/blkio-controller.txt for more information. +config BLK_DEV_THROTTLING_LOW + bool "Block throttling .low limit interface support (EXPERIMENTAL)" + depends on BLK_DEV_THROTTLING + default n + ---help--- + Add .low limit interface for block throttling. The low limit is a best + effort limit to prioritize cgroups. Depending on the setting, the limit + can be used to protect cgroups in terms of bandwidth/iops and better + utilize disk resource. + + Note, this is an experimental interface and could be changed someday. + config BLK_CMDLINE_PARSER bool "Block device command line partition parser" default n diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched index 58fc8684788d..fd2cefa47d35 100644 --- a/block/Kconfig.iosched +++ b/block/Kconfig.iosched @@ -40,6 +40,7 @@ config CFQ_GROUP_IOSCHED Enable group IO scheduling in CFQ. choice + prompt "Default I/O scheduler" default DEFAULT_CFQ help @@ -69,6 +70,35 @@ config MQ_IOSCHED_DEADLINE ---help--- MQ version of the deadline IO scheduler. +config MQ_IOSCHED_KYBER + tristate "Kyber I/O scheduler" + default y + ---help--- + The Kyber I/O scheduler is a low-overhead scheduler suitable for + multiqueue and other fast devices. Given target latencies for reads and + synchronous writes, it will self-tune queue depths to achieve that + goal. + +config IOSCHED_BFQ + tristate "BFQ I/O scheduler" + default n + ---help--- + BFQ I/O scheduler for BLK-MQ. BFQ distributes the bandwidth of + of the device among all processes according to their weights, + regardless of the device parameters and with any workload. It + also guarantees a low latency to interactive and soft + real-time applications. Details in + Documentation/block/bfq-iosched.txt + +config BFQ_GROUP_IOSCHED + bool "BFQ hierarchical scheduling support" + depends on IOSCHED_BFQ && BLK_CGROUP + default n + ---help--- + + Enable hierarchical scheduling in BFQ, using the blkio + (cgroups-v1) or io (cgroups-v2) controller. + endmenu endif diff --git a/block/Makefile b/block/Makefile index 081bb680789b..2b281cf258a0 100644 --- a/block/Makefile +++ b/block/Makefile @@ -20,6 +20,9 @@ obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o +obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o +bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o +obj-$(CONFIG_IOSCHED_BFQ) += bfq.o obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c new file mode 100644 index 000000000000..c8a32fb345cf --- /dev/null +++ b/block/bfq-cgroup.c @@ -0,0 +1,1139 @@ +/* + * cgroups support for the BFQ I/O scheduler. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 of the + * License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#include <linux/module.h> +#include <linux/slab.h> +#include <linux/blkdev.h> +#include <linux/cgroup.h> +#include <linux/elevator.h> +#include <linux/ktime.h> +#include <linux/rbtree.h> +#include <linux/ioprio.h> +#include <linux/sbitmap.h> +#include <linux/delay.h> + +#include "bfq-iosched.h" + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + +/* bfqg stats flags */ +enum bfqg_stats_flags { + BFQG_stats_waiting = 0, + BFQG_stats_idling, + BFQG_stats_empty, +}; + +#define BFQG_FLAG_FNS(name) \ +static void bfqg_stats_mark_##name(struct bfqg_stats *stats) \ +{ \ + stats->flags |= (1 << BFQG_stats_##name); \ +} \ +static void bfqg_stats_clear_##name(struct bfqg_stats *stats) \ +{ \ + stats->flags &= ~(1 << BFQG_stats_##name); \ +} \ +static int bfqg_stats_##name(struct bfqg_stats *stats) \ +{ \ + return (stats->flags & (1 << BFQG_stats_##name)) != 0; \ +} \ + +BFQG_FLAG_FNS(waiting) +BFQG_FLAG_FNS(idling) +BFQG_FLAG_FNS(empty) +#undef BFQG_FLAG_FNS + +/* This should be called with the queue_lock held. */ +static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats) +{ + unsigned long long now; + + if (!bfqg_stats_waiting(stats)) + return; + + now = sched_clock(); + if (time_after64(now, stats->start_group_wait_time)) + blkg_stat_add(&stats->group_wait_time, + now - stats->start_group_wait_time); + bfqg_stats_clear_waiting(stats); +} + +/* This should be called with the queue_lock held. */ +static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg, + struct bfq_group *curr_bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + if (bfqg_stats_waiting(stats)) + return; + if (bfqg == curr_bfqg) + return; + stats->start_group_wait_time = sched_clock(); + bfqg_stats_mark_waiting(stats); +} + +/* This should be called with the queue_lock held. */ +static void bfqg_stats_end_empty_time(struct bfqg_stats *stats) +{ + unsigned long long now; + + if (!bfqg_stats_empty(stats)) + return; + + now = sched_clock(); + if (time_after64(now, stats->start_empty_time)) + blkg_stat_add(&stats->empty_time, + now - stats->start_empty_time); + bfqg_stats_clear_empty(stats); +} + +void bfqg_stats_update_dequeue(struct bfq_group *bfqg) +{ + blkg_stat_add(&bfqg->stats.dequeue, 1); +} + +void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + if (blkg_rwstat_total(&stats->queued)) + return; + + /* + * group is already marked empty. This can happen if bfqq got new + * request in parent group and moved to this group while being added + * to service tree. Just ignore the event and move on. + */ + if (bfqg_stats_empty(stats)) + return; + + stats->start_empty_time = sched_clock(); + bfqg_stats_mark_empty(stats); +} + +void bfqg_stats_update_idle_time(struct bfq_group *bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + if (bfqg_stats_idling(stats)) { + unsigned long long now = sched_clock(); + + if (time_after64(now, stats->start_idle_time)) + blkg_stat_add(&stats->idle_time, + now - stats->start_idle_time); + bfqg_stats_clear_idling(stats); + } +} + +void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + stats->start_idle_time = sched_clock(); + bfqg_stats_mark_idling(stats); +} + +void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg) +{ + struct bfqg_stats *stats = &bfqg->stats; + + blkg_stat_add(&stats->avg_queue_size_sum, + blkg_rwstat_total(&stats->queued)); + blkg_stat_add(&stats->avg_queue_size_samples, 1); + bfqg_stats_update_group_wait_time(stats); +} + +/* + * blk-cgroup policy-related handlers + * The following functions help in converting between blk-cgroup + * internal structures and BFQ-specific structures. + */ + +static struct bfq_group *pd_to_bfqg(struct blkg_policy_data *pd) +{ + return pd ? container_of(pd, struct bfq_group, pd) : NULL; +} + +struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg) +{ + return pd_to_blkg(&bfqg->pd); +} + +static struct bfq_group *blkg_to_bfqg(struct blkcg_gq *blkg) +{ + return pd_to_bfqg(blkg_to_pd(blkg, &blkcg_policy_bfq)); +} + +/* + * bfq_group handlers + * The following functions help in navigating the bfq_group hierarchy + * by allowing to find the parent of a bfq_group or the bfq_group + * associated to a bfq_queue. + */ + +static struct bfq_group *bfqg_parent(struct bfq_group *bfqg) +{ + struct blkcg_gq *pblkg = bfqg_to_blkg(bfqg)->parent; + + return pblkg ? blkg_to_bfqg(pblkg) : NULL; +} + +struct bfq_group *bfqq_group(struct bfq_queue *bfqq) +{ + struct bfq_entity *group_entity = bfqq->entity.parent; + + return group_entity ? container_of(group_entity, struct bfq_group, + entity) : + bfqq->bfqd->root_group; +} + +/* + * The following two functions handle get and put of a bfq_group by + * wrapping the related blk-cgroup hooks. + */ + +static void bfqg_get(struct bfq_group *bfqg) +{ + return blkg_get(bfqg_to_blkg(bfqg)); +} + +void bfqg_put(struct bfq_group *bfqg) +{ + return blkg_put(bfqg_to_blkg(bfqg)); +} + +void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq, + unsigned int op) +{ + blkg_rwstat_add(&bfqg->stats.queued, op, 1); + bfqg_stats_end_empty_time(&bfqg->stats); + if (!(bfqq == ((struct bfq_data *)bfqg->bfqd)->in_service_queue)) + bfqg_stats_set_start_group_wait_time(bfqg, bfqq_group(bfqq)); +} + +void bfqg_stats_update_io_remove(struct bfq_group *bfqg, unsigned int op) +{ + blkg_rwstat_add(&bfqg->stats.queued, op, -1); +} + +void bfqg_stats_update_io_merged(struct bfq_group *bfqg, unsigned int op) +{ + blkg_rwstat_add(&bfqg->stats.merged, op, 1); +} + +void bfqg_stats_update_completion(struct bfq_group *bfqg, uint64_t start_time, + uint64_t io_start_time, unsigned int op) +{ + struct bfqg_stats *stats = &bfqg->stats; + unsigned long long now = sched_clock(); + + if (time_after64(now, io_start_time)) + blkg_rwstat_add(&stats->service_time, op, + now - io_start_time); + if (time_after64(io_start_time, start_time)) + blkg_rwstat_add(&stats->wait_time, op, + io_start_time - start_time); +} + +/* @stats = 0 */ +static void bfqg_stats_reset(struct bfqg_stats *stats) +{ + /* queued stats shouldn't be cleared */ + blkg_rwstat_reset(&stats->merged); + blkg_rwstat_reset(&stats->service_time); + blkg_rwstat_reset(&stats->wait_time); + blkg_stat_reset(&stats->time); + blkg_stat_reset(&stats->avg_queue_size_sum); + blkg_stat_reset(&stats->avg_queue_size_samples); + blkg_stat_reset(&stats->dequeue); + blkg_stat_reset(&stats->group_wait_time); + blkg_stat_reset(&stats->idle_time); + blkg_stat_reset(&stats->empty_time); +} + +/* @to += @from */ +static void bfqg_stats_add_aux(struct bfqg_stats *to, struct bfqg_stats *from) +{ + if (!to || !from) + return; + + /* queued stats shouldn't be cleared */ + blkg_rwstat_add_aux(&to->merged, &from->merged); + blkg_rwstat_add_aux(&to->service_time, &from->service_time); + blkg_rwstat_add_aux(&to->wait_time, &from->wait_time); + blkg_stat_add_aux(&from->time, &from->time); + blkg_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum); + blkg_stat_add_aux(&to->avg_queue_size_samples, + &from->avg_queue_size_samples); + blkg_stat_add_aux(&to->dequeue, &from->dequeue); + blkg_stat_add_aux(&to->group_wait_time, &from->group_wait_time); + blkg_stat_add_aux(&to->idle_time, &from->idle_time); + blkg_stat_add_aux(&to->empty_time, &from->empty_time); +} + +/* + * Transfer @bfqg's stats to its parent's aux counts so that the ancestors' + * recursive stats can still account for the amount used by this bfqg after + * it's gone. + */ +static void bfqg_stats_xfer_dead(struct bfq_group *bfqg) +{ + struct bfq_group *parent; + + if (!bfqg) /* root_group */ + return; + + parent = bfqg_parent(bfqg); + + lockdep_assert_held(bfqg_to_blkg(bfqg)->q->queue_lock); + + if (unlikely(!parent)) + return; + + bfqg_stats_add_aux(&parent->stats, &bfqg->stats); + bfqg_stats_reset(&bfqg->stats); +} + +void bfq_init_entity(struct bfq_entity *entity, struct bfq_group *bfqg) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + entity->weight = entity->new_weight; + entity->orig_weight = entity->new_weight; + if (bfqq) { + bfqq->ioprio = bfqq->new_ioprio; + bfqq->ioprio_class = bfqq->new_ioprio_class; + bfqg_get(bfqg); + } + entity->parent = bfqg->my_entity; /* NULL for root group */ + entity->sched_data = &bfqg->sched_data; +} + +static void bfqg_stats_exit(struct bfqg_stats *stats) +{ + blkg_rwstat_exit(&stats->merged); + blkg_rwstat_exit(&stats->service_time); + blkg_rwstat_exit(&stats->wait_time); + blkg_rwstat_exit(&stats->queued); + blkg_stat_exit(&stats->time); + blkg_stat_exit(&stats->avg_queue_size_sum); + blkg_stat_exit(&stats->avg_queue_size_samples); + blkg_stat_exit(&stats->dequeue); + blkg_stat_exit(&stats->group_wait_time); + blkg_stat_exit(&stats->idle_time); + blkg_stat_exit(&stats->empty_time); +} + +static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp) +{ + if (blkg_rwstat_init(&stats->merged, gfp) || + blkg_rwstat_init(&stats->service_time, gfp) || + blkg_rwstat_init(&stats->wait_time, gfp) || + blkg_rwstat_init(&stats->queued, gfp) || + blkg_stat_init(&stats->time, gfp) || + blkg_stat_init(&stats->avg_queue_size_sum, gfp) || + blkg_stat_init(&stats->avg_queue_size_samples, gfp) || + blkg_stat_init(&stats->dequeue, gfp) || + blkg_stat_init(&stats->group_wait_time, gfp) || + blkg_stat_init(&stats->idle_time, gfp) || + blkg_stat_init(&stats->empty_time, gfp)) { + bfqg_stats_exit(stats); + return -ENOMEM; + } + + return 0; +} + +static struct bfq_group_data *cpd_to_bfqgd(struct blkcg_policy_data *cpd) +{ + return cpd ? container_of(cpd, struct bfq_group_data, pd) : NULL; +} + +static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg) +{ + return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq)); +} + +struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp) +{ + struct bfq_group_data *bgd; + + bgd = kzalloc(sizeof(*bgd), gfp); + if (!bgd) + return NULL; + return &bgd->pd; +} + +void bfq_cpd_init(struct blkcg_policy_data *cpd) +{ + struct bfq_group_data *d = cpd_to_bfqgd(cpd); + + d->weight = cgroup_subsys_on_dfl(io_cgrp_subsys) ? + CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL; +} + +void bfq_cpd_free(struct blkcg_policy_data *cpd) +{ + kfree(cpd_to_bfqgd(cpd)); +} + +struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node) +{ + struct bfq_group *bfqg; + + bfqg = kzalloc_node(sizeof(*bfqg), gfp, node); + if (!bfqg) + return NULL; + + if (bfqg_stats_init(&bfqg->stats, gfp)) { + kfree(bfqg); + return NULL; + } + + return &bfqg->pd; +} + +void bfq_pd_init(struct blkg_policy_data *pd) +{ + struct blkcg_gq *blkg = pd_to_blkg(pd); + struct bfq_group *bfqg = blkg_to_bfqg(blkg); + struct bfq_data *bfqd = blkg->q->elevator->elevator_data; + struct bfq_entity *entity = &bfqg->entity; + struct bfq_group_data *d = blkcg_to_bfqgd(blkg->blkcg); + + entity->orig_weight = entity->weight = entity->new_weight = d->weight; + entity->my_sched_data = &bfqg->sched_data; + bfqg->my_entity = entity; /* + * the root_group's will be set to NULL + * in bfq_init_queue() + */ + bfqg->bfqd = bfqd; + bfqg->active_entities = 0; + bfqg->rq_pos_tree = RB_ROOT; +} + +void bfq_pd_free(struct blkg_policy_data *pd) +{ + struct bfq_group *bfqg = pd_to_bfqg(pd); + + bfqg_stats_exit(&bfqg->stats); + return kfree(bfqg); +} + +void bfq_pd_reset_stats(struct blkg_policy_data *pd) +{ + struct bfq_group *bfqg = pd_to_bfqg(pd); + + bfqg_stats_reset(&bfqg->stats); +} + +static void bfq_group_set_parent(struct bfq_group *bfqg, + struct bfq_group *parent) +{ + struct bfq_entity *entity; + + entity = &bfqg->entity; + entity->parent = parent->my_entity; + entity->sched_data = &parent->sched_data; +} + +static struct bfq_group *bfq_lookup_bfqg(struct bfq_data *bfqd, + struct blkcg *blkcg) +{ + struct blkcg_gq *blkg; + + blkg = blkg_lookup(blkcg, bfqd->queue); + if (likely(blkg)) + return blkg_to_bfqg(blkg); + return NULL; +} + +struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd, + struct blkcg *blkcg) +{ + struct bfq_group *bfqg, *parent; + struct bfq_entity *entity; + + bfqg = bfq_lookup_bfqg(bfqd, blkcg); + + if (unlikely(!bfqg)) + return NULL; + + /* + * Update chain of bfq_groups as we might be handling a leaf group + * which, along with some of its relatives, has not been hooked yet + * to the private hierarchy of BFQ. + */ + entity = &bfqg->entity; + for_each_entity(entity) { + bfqg = container_of(entity, struct bfq_group, entity); + if (bfqg != bfqd->root_group) { + parent = bfqg_parent(bfqg); + if (!parent) + parent = bfqd->root_group; + bfq_group_set_parent(bfqg, parent); + } + } + + return bfqg; +} + +/** + * bfq_bfqq_move - migrate @bfqq to @bfqg. + * @bfqd: queue descriptor. + * @bfqq: the queue to move. + * @bfqg: the group to move to. + * + * Move @bfqq to @bfqg, deactivating it from its old group and reactivating + * it on the new one. Avoid putting the entity on the old group idle tree. + * + * Must be called under the queue lock; the cgroup owning @bfqg must + * not disappear (by now this just means that we are called under + * rcu_read_lock()). + */ +void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, + struct bfq_group *bfqg) +{ + struct bfq_entity *entity = &bfqq->entity; + + /* If bfqq is empty, then bfq_bfqq_expire also invokes + * bfq_del_bfqq_busy, thereby removing bfqq and its entity + * from data structures related to current group. Otherwise we + * need to remove bfqq explicitly with bfq_deactivate_bfqq, as + * we do below. + */ + if (bfqq == bfqd->in_service_queue) + bfq_bfqq_expire(bfqd, bfqd->in_service_queue, + false, BFQQE_PREEMPTED); + + if (bfq_bfqq_busy(bfqq)) + bfq_deactivate_bfqq(bfqd, bfqq, false, false); + else if (entity->on_st) + bfq_put_idle_entity(bfq_entity_service_tree(entity), entity); + bfqg_put(bfqq_group(bfqq)); + + /* + * Here we use a reference to bfqg. We don't need a refcounter + * as the cgroup reference will not be dropped, so that its + * destroy() callback will not be invoked. + */ + entity->parent = bfqg->my_entity; + entity->sched_data = &bfqg->sched_data; + bfqg_get(bfqg); + + if (bfq_bfqq_busy(bfqq)) { + bfq_pos_tree_add_move(bfqd, bfqq); + bfq_activate_bfqq(bfqd, bfqq); + } + + if (!bfqd->in_service_queue && !bfqd->rq_in_driver) + bfq_schedule_dispatch(bfqd); +} + +/** + * __bfq_bic_change_cgroup - move @bic to @cgroup. + * @bfqd: the queue descriptor. + * @bic: the bic to move. + * @blkcg: the blk-cgroup to move to. + * + * Move bic to blkcg, assuming that bfqd->queue is locked; the caller + * has to make sure that the reference to cgroup is valid across the call. + * + * NOTE: an alternative approach might have been to store the current + * cgroup in bfqq and getting a reference to it, reducing the lookup + * time here, at the price of slightly more complex code. + */ +static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd, + struct bfq_io_cq *bic, + struct blkcg *blkcg) +{ + struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0); + struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1); + struct bfq_group *bfqg; + struct bfq_entity *entity; + + bfqg = bfq_find_set_group(bfqd, blkcg); + + if (unlikely(!bfqg)) + bfqg = bfqd->root_group; + + if (async_bfqq) { + entity = &async_bfqq->entity; + + if (entity->sched_data != &bfqg->sched_data) { + bic_set_bfqq(bic, NULL, 0); + bfq_log_bfqq(bfqd, async_bfqq, + "bic_change_group: %p %d", + async_bfqq, async_bfqq->ref); + bfq_put_queue(async_bfqq); + } + } + + if (sync_bfqq) { + entity = &sync_bfqq->entity; + if (entity->sched_data != &bfqg->sched_data) + bfq_bfqq_move(bfqd, sync_bfqq, bfqg); + } + + return bfqg; +} + +void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) +{ + struct bfq_data *bfqd = bic_to_bfqd(bic); + struct bfq_group *bfqg = NULL; + uint64_t serial_nr; + + rcu_read_lock(); + serial_nr = bio_blkcg(bio)->css.serial_nr; + + /* + * Check whether blkcg has changed. The condition may trigger + * spuriously on a newly created cic but there's no harm. + */ + if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr)) + goto out; + + bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio)); + bic->blkcg_serial_nr = serial_nr; +out: + rcu_read_unlock(); +} + +/** + * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st. + * @st: the service tree being flushed. + */ +static void bfq_flush_idle_tree(struct bfq_service_tree *st) +{ + struct bfq_entity *entity = st->first_idle; + + for (; entity ; entity = st->first_idle) + __bfq_deactivate_entity(entity, false); +} + +/** + * bfq_reparent_leaf_entity - move leaf entity to the root_group. + * @bfqd: the device data structure with the root group. + * @entity: the entity to move. + */ +static void bfq_reparent_leaf_entity(struct bfq_data *bfqd, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + bfq_bfqq_move(bfqd, bfqq, bfqd->root_group); +} + +/** + * bfq_reparent_active_entities - move to the root group all active + * entities. + * @bfqd: the device data structure with the root group. + * @bfqg: the group to move from. + * @st: the service tree with the entities. + * + * Needs queue_lock to be taken and reference to be valid over the call. + */ +static void bfq_reparent_active_entities(struct bfq_data *bfqd, + struct bfq_group *bfqg, + struct bfq_service_tree *st) +{ + struct rb_root *active = &st->active; + struct bfq_entity *entity = NULL; + + if (!RB_EMPTY_ROOT(&st->active)) + entity = bfq_entity_of(rb_first(active)); + + for (; entity ; entity = bfq_entity_of(rb_first(active))) + bfq_reparent_leaf_entity(bfqd, entity); + + if (bfqg->sched_data.in_service_entity) + bfq_reparent_leaf_entity(bfqd, + bfqg->sched_data.in_service_entity); +} + +/** + * bfq_pd_offline - deactivate the entity associated with @pd, + * and reparent its children entities. + * @pd: descriptor of the policy going offline. + * + * blkio already grabs the queue_lock for us, so no need to use + * RCU-based magic + */ +void bfq_pd_offline(struct blkg_policy_data *pd) +{ + struct bfq_service_tree *st; + struct bfq_group *bfqg = pd_to_bfqg(pd); + struct bfq_data *bfqd = bfqg->bfqd; + struct bfq_entity *entity = bfqg->my_entity; + unsigned long flags; + int i; + + if (!entity) /* root group */ + return; + + spin_lock_irqsave(&bfqd->lock, flags); + /* + * Empty all service_trees belonging to this group before + * deactivating the group itself. + */ + for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) { + st = bfqg->sched_data.service_tree + i; + + /* + * The idle tree may still contain bfq_queues belonging + * to exited task because they never migrated to a different + * cgroup from the one being destroyed now. No one else + * can access them so it's safe to act without any lock. + */ + bfq_flush_idle_tree(st); + + /* + * It may happen that some queues are still active + * (busy) upon group destruction (if the corresponding + * processes have been forced to terminate). We move + * all the leaf entities corresponding to these queues + * to the root_group. + * Also, it may happen that the group has an entity + * in service, which is disconnected from the active + * tree: it must be moved, too. + * There is no need to put the sync queues, as the + * scheduler has taken no reference. + */ + bfq_reparent_active_entities(bfqd, bfqg, st); + } + + __bfq_deactivate_entity(entity, false); + bfq_put_async_queues(bfqd, bfqg); + + spin_unlock_irqrestore(&bfqd->lock, flags); + /* + * @blkg is going offline and will be ignored by + * blkg_[rw]stat_recursive_sum(). Transfer stats to the parent so + * that they don't get lost. If IOs complete after this point, the + * stats for them will be lost. Oh well... + */ + bfqg_stats_xfer_dead(bfqg); +} + +void bfq_end_wr_async(struct bfq_data *bfqd) +{ + struct blkcg_gq *blkg; + + list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) { + struct bfq_group *bfqg = blkg_to_bfqg(blkg); + + bfq_end_wr_async_queues(bfqd, bfqg); + } + bfq_end_wr_async_queues(bfqd, bfqd->root_group); +} + +static int bfq_io_show_weight(struct seq_file *sf, void *v) +{ + struct blkcg *blkcg = css_to_blkcg(seq_css(sf)); + struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg); + unsigned int val = 0; + + if (bfqgd) + val = bfqgd->weight; + + seq_printf(sf, "%u\n", val); + + return 0; +} + +static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css, + struct cftype *cftype, + u64 val) +{ + struct blkcg *blkcg = css_to_blkcg(css); + struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg); + struct blkcg_gq *blkg; + int ret = -ERANGE; + + if (val < BFQ_MIN_WEIGHT || val > BFQ_MAX_WEIGHT) + return ret; + + ret = 0; + spin_lock_irq(&blkcg->lock); + bfqgd->weight = (unsigned short)val; + hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) { + struct bfq_group *bfqg = blkg_to_bfqg(blkg); + + if (!bfqg) + continue; + /* + * Setting the prio_changed flag of the entity + * to 1 with new_weight == weight would re-set + * the value of the weight to its ioprio mapping. + * Set the flag only if necessary. + */ + if ((unsigned short)val != bfqg->entity.new_weight) { + bfqg->entity.new_weight = (unsigned short)val; + /* + * Make sure that the above new value has been + * stored in bfqg->entity.new_weight before + * setting the prio_changed flag. In fact, + * this flag may be read asynchronously (in + * critical sections protected by a different + * lock than that held here), and finding this + * flag set may cause the execution of the code + * for updating parameters whose value may + * depend also on bfqg->entity.new_weight (in + * __bfq_entity_update_weight_prio). + * This barrier makes sure that the new value + * of bfqg->entity.new_weight is correctly + * seen in that code. + */ + smp_wmb(); + bfqg->entity.prio_changed = 1; + } + } + spin_unlock_irq(&blkcg->lock); + + return ret; +} + +static ssize_t bfq_io_set_weight(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + u64 weight; + /* First unsigned long found in the file is used */ + int ret = kstrtoull(strim(buf), 0, &weight); + + if (ret) + return ret; + + return bfq_io_set_weight_legacy(of_css(of), NULL, weight); +} + +static int bfqg_print_stat(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_stat, + &blkcg_policy_bfq, seq_cft(sf)->private, false); + return 0; +} + +static int bfqg_print_rwstat(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_rwstat, + &blkcg_policy_bfq, seq_cft(sf)->private, true); + return 0; +} + +static u64 bfqg_prfill_stat_recursive(struct seq_file *sf, + struct blkg_policy_data *pd, int off) +{ + u64 sum = blkg_stat_recursive_sum(pd_to_blkg(pd), + &blkcg_policy_bfq, off); + return __blkg_prfill_u64(sf, pd, sum); +} + +static u64 bfqg_prfill_rwstat_recursive(struct seq_file *sf, + struct blkg_policy_data *pd, int off) +{ + struct blkg_rwstat sum = blkg_rwstat_recursive_sum(pd_to_blkg(pd), + &blkcg_policy_bfq, + off); + return __blkg_prfill_rwstat(sf, pd, &sum); +} + +static int bfqg_print_stat_recursive(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), + bfqg_prfill_stat_recursive, &blkcg_policy_bfq, + seq_cft(sf)->private, false); + return 0; +} + +static int bfqg_print_rwstat_recursive(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), + bfqg_prfill_rwstat_recursive, &blkcg_policy_bfq, + seq_cft(sf)->private, true); + return 0; +} + +static u64 bfqg_prfill_sectors(struct seq_file *sf, struct blkg_policy_data *pd, + int off) +{ + u64 sum = blkg_rwstat_total(&pd->blkg->stat_bytes); + + return __blkg_prfill_u64(sf, pd, sum >> 9); +} + +static int bfqg_print_stat_sectors(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), + bfqg_prfill_sectors, &blkcg_policy_bfq, 0, false); + return 0; +} + +static u64 bfqg_prfill_sectors_recursive(struct seq_file *sf, + struct blkg_policy_data *pd, int off) +{ + struct blkg_rwstat tmp = blkg_rwstat_recursive_sum(pd->blkg, NULL, + offsetof(struct blkcg_gq, stat_bytes)); + u64 sum = atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_READ]) + + atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_WRITE]); + + return __blkg_prfill_u64(sf, pd, sum >> 9); +} + +static int bfqg_print_stat_sectors_recursive(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), + bfqg_prfill_sectors_recursive, &blkcg_policy_bfq, 0, + false); + return 0; +} + +static u64 bfqg_prfill_avg_queue_size(struct seq_file *sf, + struct blkg_policy_data *pd, int off) +{ + struct bfq_group *bfqg = pd_to_bfqg(pd); + u64 samples = blkg_stat_read(&bfqg->stats.avg_queue_size_samples); + u64 v = 0; + + if (samples) { + v = blkg_stat_read(&bfqg->stats.avg_queue_size_sum); + v = div64_u64(v, samples); + } + __blkg_prfill_u64(sf, pd, v); + return 0; +} + +/* print avg_queue_size */ +static int bfqg_print_avg_queue_size(struct seq_file *sf, void *v) +{ + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), + bfqg_prfill_avg_queue_size, &blkcg_policy_bfq, + 0, false); + return 0; +} + +struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node) +{ + int ret; + + ret = blkcg_activate_policy(bfqd->queue, &blkcg_policy_bfq); + if (ret) + return NULL; + + return blkg_to_bfqg(bfqd->queue->root_blkg); +} + +struct blkcg_policy blkcg_policy_bfq = { + .dfl_cftypes = bfq_blkg_files, + .legacy_cftypes = bfq_blkcg_legacy_files, + + .cpd_alloc_fn = bfq_cpd_alloc, + .cpd_init_fn = bfq_cpd_init, + .cpd_bind_fn = bfq_cpd_init, + .cpd_free_fn = bfq_cpd_free, + + .pd_alloc_fn = bfq_pd_alloc, + .pd_init_fn = bfq_pd_init, + .pd_offline_fn = bfq_pd_offline, + .pd_free_fn = bfq_pd_free, + .pd_reset_stats_fn = bfq_pd_reset_stats, +}; + +struct cftype bfq_blkcg_legacy_files[] = { + { + .name = "bfq.weight", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = bfq_io_show_weight, + .write_u64 = bfq_io_set_weight_legacy, + }, + + /* statistics, covers only the tasks in the bfqg */ + { + .name = "bfq.time", + .private = offsetof(struct bfq_group, stats.time), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.sectors", + .seq_show = bfqg_print_stat_sectors, + }, + { + .name = "bfq.io_service_bytes", + .private = (unsigned long)&blkcg_policy_bfq, + .seq_show = blkg_print_stat_bytes, + }, + { + .name = "bfq.io_serviced", + .private = (unsigned long)&blkcg_policy_bfq, + .seq_show = blkg_print_stat_ios, + }, + { + .name = "bfq.io_service_time", + .private = offsetof(struct bfq_group, stats.service_time), + .seq_show = bfqg_print_rwstat, + }, + { + .name = "bfq.io_wait_time", + .private = offsetof(struct bfq_group, stats.wait_time), + .seq_show = bfqg_print_rwstat, + }, + { + .name = "bfq.io_merged", + .private = offsetof(struct bfq_group, stats.merged), + .seq_show = bfqg_print_rwstat, + }, + { + .name = "bfq.io_queued", + .private = offsetof(struct bfq_group, stats.queued), + .seq_show = bfqg_print_rwstat, + }, + + /* the same statictics which cover the bfqg and its descendants */ + { + .name = "bfq.time_recursive", + .private = offsetof(struct bfq_group, stats.time), + .seq_show = bfqg_print_stat_recursive, + }, + { + .name = "bfq.sectors_recursive", + .seq_show = bfqg_print_stat_sectors_recursive, + }, + { + .name = "bfq.io_service_bytes_recursive", + .private = (unsigned long)&blkcg_policy_bfq, + .seq_show = blkg_print_stat_bytes_recursive, + }, + { + .name = "bfq.io_serviced_recursive", + .private = (unsigned long)&blkcg_policy_bfq, + .seq_show = blkg_print_stat_ios_recursive, + }, + { + .name = "bfq.io_service_time_recursive", + .private = offsetof(struct bfq_group, stats.service_time), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.io_wait_time_recursive", + .private = offsetof(struct bfq_group, stats.wait_time), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.io_merged_recursive", + .private = offsetof(struct bfq_group, stats.merged), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.io_queued_recursive", + .private = offsetof(struct bfq_group, stats.queued), + .seq_show = bfqg_print_rwstat_recursive, + }, + { + .name = "bfq.avg_queue_size", + .seq_show = bfqg_print_avg_queue_size, + }, + { + .name = "bfq.group_wait_time", + .private = offsetof(struct bfq_group, stats.group_wait_time), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.idle_time", + .private = offsetof(struct bfq_group, stats.idle_time), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.empty_time", + .private = offsetof(struct bfq_group, stats.empty_time), + .seq_show = bfqg_print_stat, + }, + { + .name = "bfq.dequeue", + .private = offsetof(struct bfq_group, stats.dequeue), + .seq_show = bfqg_print_stat, + }, + { } /* terminate */ +}; + +struct cftype bfq_blkg_files[] = { + { + .name = "bfq.weight", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = bfq_io_show_weight, + .write = bfq_io_set_weight, + }, + {} /* terminate */ +}; + +#else /* CONFIG_BFQ_GROUP_IOSCHED */ + +void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq, + unsigned int op) { } +void bfqg_stats_update_io_remove(struct bfq_group *bfqg, unsigned int op) { } +void bfqg_stats_update_io_merged(struct bfq_group *bfqg, unsigned int op) { } +void bfqg_stats_update_completion(struct bfq_group *bfqg, uint64_t start_time, + uint64_t io_start_time, unsigned int op) { } +void bfqg_stats_update_dequeue(struct bfq_group *bfqg) { } +void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg) { } +void bfqg_stats_update_idle_time(struct bfq_group *bfqg) { } +void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg) { } +void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg) { } + +void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, + struct bfq_group *bfqg) {} + +void bfq_init_entity(struct bfq_entity *entity, struct bfq_group *bfqg) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + entity->weight = entity->new_weight; + entity->orig_weight = entity->new_weight; + if (bfqq) { + bfqq->ioprio = bfqq->new_ioprio; + bfqq->ioprio_class = bfqq->new_ioprio_class; + } + entity->sched_data = &bfqg->sched_data; +} + +void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio) {} + +void bfq_end_wr_async(struct bfq_data *bfqd) +{ + bfq_end_wr_async_queues(bfqd, bfqd->root_group); +} + +struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd, struct blkcg *blkcg) +{ + return bfqd->root_group; +} + +struct bfq_group *bfqq_group(struct bfq_queue *bfqq) +{ + return bfqq->bfqd->root_group; +} + +struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node) +{ + struct bfq_group *bfqg; + int i; + + bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node); + if (!bfqg) + return NULL; + + for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) + bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT; + + return bfqg; +} +#endif /* CONFIG_BFQ_GROUP_IOSCHED */ diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c new file mode 100644 index 000000000000..bd8499ef157c --- /dev/null +++ b/block/bfq-iosched.c @@ -0,0 +1,5047 @@ +/* + * Budget Fair Queueing (BFQ) I/O scheduler. + * + * Based on ideas and code from CFQ: + * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk> + * + * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it> + * Paolo Valente <paolo.valente@unimore.it> + * + * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it> + * Arianna Avanzini <avanzini@google.com> + * + * Copyright (C) 2017 Paolo Valente <paolo.valente@linaro.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 of the + * License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * BFQ is a proportional-share I/O scheduler, with some extra + * low-latency capabilities. BFQ also supports full hierarchical + * scheduling through cgroups. Next paragraphs provide an introduction + * on BFQ inner workings. Details on BFQ benefits, usage and + * limitations can be found in Documentation/block/bfq-iosched.txt. + * + * BFQ is a proportional-share storage-I/O scheduling algorithm based + * on the slice-by-slice service scheme of CFQ. But BFQ assigns + * budgets, measured in number of sectors, to processes instead of + * time slices. The device is not granted to the in-service process + * for a given time slice, but until it has exhausted its assigned + * budget. This change from the time to the service domain enables BFQ + * to distribute the device throughput among processes as desired, + * without any distortion due to throughput fluctuations, or to device + * internal queueing. BFQ uses an ad hoc internal scheduler, called + * B-WF2Q+, to schedule processes according to their budgets. More + * precisely, BFQ schedules queues associated with processes. Each + * process/queue is assigned a user-configurable weight, and B-WF2Q+ + * guarantees that each queue receives a fraction of the throughput + * proportional to its weight. Thanks to the accurate policy of + * B-WF2Q+, BFQ can afford to assign high budgets to I/O-bound + * processes issuing sequential requests (to boost the throughput), + * and yet guarantee a low latency to interactive and soft real-time + * applications. + * + * In particular, to provide these low-latency guarantees, BFQ + * explicitly privileges the I/O of two classes of time-sensitive + * applications: interactive and soft real-time. This feature enables + * BFQ to provide applications in these classes with a very low + * latency. Finally, BFQ also features additional heuristics for + * preserving both a low latency and a high throughput on NCQ-capable, + * rotational or flash-based devices, and to get the job done quickly + * for applications consisting in many I/O-bound processes. + * + * BFQ is described in [1], where also a reference to the initial, more + * theoretical paper on BFQ can be found. The interested reader can find + * in the latter paper full details on the main algorithm, as well as + * formulas of the guarantees and formal proofs of all the properties. + * With respect to the version of BFQ presented in these papers, this + * implementation adds a few more heuristics, such as the one that + * guarantees a low latency to soft real-time applications, and a + * hierarchical extension based on H-WF2Q+. + * + * B-WF2Q+ is based on WF2Q+, which is described in [2], together with + * H-WF2Q+, while the augmented tree used here to implement B-WF2Q+ + * with O(log N) complexity derives from the one introduced with EEVDF + * in [3]. + * + * [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O + * Scheduler", Proceedings of the First Workshop on Mobile System + * Technologies (MST-2015), May 2015. + * http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf + * + * [2] Jon C.R. Bennett and H. Zhang, "Hierarchical Packet Fair Queueing + * Algorithms", IEEE/ACM Transactions on Networking, 5(5):675-689, + * Oct 1997. + * + * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz + * + * [3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline + * First: A Flexible and Accurate Mechanism for Proportional Share + * Resource Allocation", technical report. + * + * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf + */ +#include <linux/module.h> +#include <linux/slab.h> +#include <linux/blkdev.h> +#include <linux/cgroup.h> +#include <linux/elevator.h> +#include <linux/ktime.h> +#include <linux/rbtree.h> +#include <linux/ioprio.h> +#include <linux/sbitmap.h> +#include <linux/delay.h> + +#include "blk.h" +#include "blk-mq.h" +#include "blk-mq-tag.h" +#include "blk-mq-sched.h" +#include "bfq-iosched.h" + +#define BFQ_BFQQ_FNS(name) \ +void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \ +{ \ + __set_bit(BFQQF_##name, &(bfqq)->flags); \ +} \ +void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \ +{ \ + __clear_bit(BFQQF_##name, &(bfqq)->flags); \ +} \ +int bfq_bfqq_##name(const struct bfq_queue *bfqq) \ +{ \ + return test_bit(BFQQF_##name, &(bfqq)->flags); \ +} + +BFQ_BFQQ_FNS(just_created); +BFQ_BFQQ_FNS(busy); +BFQ_BFQQ_FNS(wait_request); +BFQ_BFQQ_FNS(non_blocking_wait_rq); +BFQ_BFQQ_FNS(fifo_expire); +BFQ_BFQQ_FNS(idle_window); +BFQ_BFQQ_FNS(sync); +BFQ_BFQQ_FNS(IO_bound); +BFQ_BFQQ_FNS(in_large_burst); +BFQ_BFQQ_FNS(coop); +BFQ_BFQQ_FNS(split_coop); +BFQ_BFQQ_FNS(softrt_update); +#undef BFQ_BFQQ_FNS \ + +/* Expiration time of sync (0) and async (1) requests, in ns. */ +static const u64 bfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 }; + +/* Maximum backwards seek (magic number lifted from CFQ), in KiB. */ +static const int bfq_back_max = 16 * 1024; + +/* Penalty of a backwards seek, in number of sectors. */ +static const int bfq_back_penalty = 2; + +/* Idling period duration, in ns. */ +static u64 bfq_slice_idle = NSEC_PER_SEC / 125; + +/* Minimum number of assigned budgets for which stats are safe to compute. */ +static const int bfq_stats_min_budgets = 194; + +/* Default maximum budget values, in sectors and number of requests. */ +static const int bfq_default_max_budget = 16 * 1024; + +/* + * Async to sync throughput distribution is controlled as follows: + * when an async request is served, the entity is charged the number + * of sectors of the request, multiplied by the factor below + */ +static const int bfq_async_charge_factor = 10; + +/* Default timeout values, in jiffies, approximating CFQ defaults. */ +const int bfq_timeout = HZ / 8; + +static struct kmem_cache *bfq_pool; + +/* Below this threshold (in ns), we consider thinktime immediate. */ +#define BFQ_MIN_TT (2 * NSEC_PER_MSEC) + +/* hw_tag detection: parallel requests threshold and min samples needed. */ +#define BFQ_HW_QUEUE_THRESHOLD 4 +#define BFQ_HW_QUEUE_SAMPLES 32 + +#define BFQQ_SEEK_THR (sector_t)(8 * 100) +#define BFQQ_SECT_THR_NONROT (sector_t)(2 * 32) +#define BFQQ_CLOSE_THR (sector_t)(8 * 1024) +#define BFQQ_SEEKY(bfqq) (hweight32(bfqq->seek_history) > 32/8) + +/* Min number of samples required to perform peak-rate update */ +#define BFQ_RATE_MIN_SAMPLES 32 +/* Min observation time interval required to perform a peak-rate update (ns) */ +#define BFQ_RATE_MIN_INTERVAL (300*NSEC_PER_MSEC) +/* Target observation time interval for a peak-rate update (ns) */ +#define BFQ_RATE_REF_INTERVAL NSEC_PER_SEC + +/* Shift used for peak rate fixed precision calculations. */ +#define BFQ_RATE_SHIFT 16 + +/* + * By default, BFQ computes the duration of the weight raising for + * interactive applications automatically, using the following formula: + * duration = (R / r) * T, where r is the peak rate of the device, and + * R and T are two reference parameters. + * In particular, R is the peak rate of the reference device (see below), + * and T is a reference time: given the systems that are likely to be + * installed on the reference device according to its speed class, T is + * about the maximum time needed, under BFQ and while reading two files in + * parallel, to load typical large applications on these systems. + * In practice, the slower/faster the device at hand is, the more/less it + * takes to load applications with respect to the reference device. + * Accordingly, the longer/shorter BFQ grants weight raising to interactive + * applications. + * + * BFQ uses four different reference pairs (R, T), depending on: + * . whether the device is rotational or non-rotational; + * . whether the device is slow, such as old or portable HDDs, as well as + * SD cards, or fast, such as newer HDDs and SSDs. + * + * The device's speed class is dynamically (re)detected in + * bfq_update_peak_rate() every time the estimated peak rate is updated. + * + * In the following definitions, R_slow[0]/R_fast[0] and + * T_slow[0]/T_fast[0] are the reference values for a slow/fast + * rotational device, whereas R_slow[1]/R_fast[1] and + * T_slow[1]/T_fast[1] are the reference values for a slow/fast + * non-rotational device. Finally, device_speed_thresh are the + * thresholds used to switch between speed classes. The reference + * rates are not the actual peak rates of the devices used as a + * reference, but slightly lower values. The reason for using these + * slightly lower values is that the peak-rate estimator tends to + * yield slightly lower values than the actual peak rate (it can yield + * the actual peak rate only if there is only one process doing I/O, + * and the process does sequential I/O). + * + * Both the reference peak rates and the thresholds are measured in + * sectors/usec, left-shifted by BFQ_RATE_SHIFT. + */ +static int R_slow[2] = {1000, 10700}; +static int R_fast[2] = {14000, 33000}; +/* + * To improve readability, a conversion function is used to initialize the + * following arrays, which entails that they can be initialized only in a + * function. + */ +static int T_slow[2]; +static int T_fast[2]; +static int device_speed_thresh[2]; + +#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0]) +#define RQ_BFQQ(rq) ((rq)->elv.priv[1]) + +struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync) +{ + return bic->bfqq[is_sync]; +} + +void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync) +{ + bic->bfqq[is_sync] = bfqq; +} + +struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic) +{ + return bic->icq.q->elevator->elevator_data; +} + +/** + * icq_to_bic - convert iocontext queue structure to bfq_io_cq. + * @icq: the iocontext queue. + */ +static struct bfq_io_cq *icq_to_bic(struct io_cq *icq) +{ + /* bic->icq is the first member, %NULL will convert to %NULL */ + return container_of(icq, struct bfq_io_cq, icq); +} + +/** + * bfq_bic_lookup - search into @ioc a bic associated to @bfqd. + * @bfqd: the lookup key. + * @ioc: the io_context of the process doing I/O. + * @q: the request queue. + */ +static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd, + struct io_context *ioc, + struct request_queue *q) +{ + if (ioc) { + unsigned long flags; + struct bfq_io_cq *icq; + + spin_lock_irqsave(q->queue_lock, flags); + icq = icq_to_bic(ioc_lookup_icq(ioc, q)); + spin_unlock_irqrestore(q->queue_lock, flags); + + return icq; + } + + return NULL; +} + +/* + * Scheduler run of queue, if there are requests pending and no one in the + * driver that will restart queueing. + */ +void bfq_schedule_dispatch(struct bfq_data *bfqd) +{ + if (bfqd->queued != 0) { + bfq_log(bfqd, "schedule dispatch"); + blk_mq_run_hw_queues(bfqd->queue, true); + } +} + +#define bfq_class_idle(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE) +#define bfq_class_rt(bfqq) ((bfqq)->ioprio_class == IOPRIO_CLASS_RT) + +#define bfq_sample_valid(samples) ((samples) > 80) + +/* + * Lifted from AS - choose which of rq1 and rq2 that is best served now. + * We choose the request that is closesr to the head right now. Distance + * behind the head is penalized and only allowed to a certain extent. + */ +static struct request *bfq_choose_req(struct bfq_data *bfqd, + struct request *rq1, + struct request *rq2, + sector_t last) +{ + sector_t s1, s2, d1 = 0, d2 = 0; + unsigned long back_max; +#define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */ +#define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */ + unsigned int wrap = 0; /* bit mask: requests behind the disk head? */ + + if (!rq1 || rq1 == rq2) + return rq2; + if (!rq2) + return rq1; + + if (rq_is_sync(rq1) && !rq_is_sync(rq2)) + return rq1; + else if (rq_is_sync(rq2) && !rq_is_sync(rq1)) + return rq2; + if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META)) + return rq1; + else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META)) + return rq2; + + s1 = blk_rq_pos(rq1); + s2 = blk_rq_pos(rq2); + + /* + * By definition, 1KiB is 2 sectors. + */ + back_max = bfqd->bfq_back_max * 2; + + /* + * Strict one way elevator _except_ in the case where we allow + * short backward seeks which are biased as twice the cost of a + * similar forward seek. + */ + if (s1 >= last) + d1 = s1 - last; + else if (s1 + back_max >= last) + d1 = (last - s1) * bfqd->bfq_back_penalty; + else + wrap |= BFQ_RQ1_WRAP; + + if (s2 >= last) + d2 = s2 - last; + else if (s2 + back_max >= last) + d2 = (last - s2) * bfqd->bfq_back_penalty; + else + wrap |= BFQ_RQ2_WRAP; + + /* Found required data */ + + /* + * By doing switch() on the bit mask "wrap" we avoid having to + * check two variables for all permutations: --> faster! + */ + switch (wrap) { + case 0: /* common case for CFQ: rq1 and rq2 not wrapped */ + if (d1 < d2) + return rq1; + else if (d2 < d1) + return rq2; + + if (s1 >= s2) + return rq1; + else + return rq2; + + case BFQ_RQ2_WRAP: + return rq1; + case BFQ_RQ1_WRAP: + return rq2; + case BFQ_RQ1_WRAP|BFQ_RQ2_WRAP: /* both rqs wrapped */ + default: + /* + * Since both rqs are wrapped, + * start with the one that's further behind head + * (--> only *one* back seek required), + * since back seek takes more time than forward. + */ + if (s1 <= s2) + return rq1; + else + return rq2; + } +} + +static struct bfq_queue * +bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root, + sector_t sector, struct rb_node **ret_parent, + struct rb_node ***rb_link) +{ + struct rb_node **p, *parent; + struct bfq_queue *bfqq = NULL; + + parent = NULL; + p = &root->rb_node; + while (*p) { + struct rb_node **n; + + parent = *p; + bfqq = rb_entry(parent, struct bfq_queue, pos_node); + + /* + * Sort strictly based on sector. Smallest to the left, + * largest to the right. + */ + if (sector > blk_rq_pos(bfqq->next_rq)) + n = &(*p)->rb_right; + else if (sector < blk_rq_pos(bfqq->next_rq)) + n = &(*p)->rb_left; + else + break; + p = n; + bfqq = NULL; + } + + *ret_parent = parent; + if (rb_link) + *rb_link = p; + + bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d", + (unsigned long long)sector, + bfqq ? bfqq->pid : 0); + + return bfqq; +} + +void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + struct rb_node **p, *parent; + struct bfq_queue *__bfqq; + + if (bfqq->pos_root) { + rb_erase(&bfqq->pos_node, bfqq->pos_root); + bfqq->pos_root = NULL; + } + + if (bfq_class_idle(bfqq)) + return; + if (!bfqq->next_rq) + return; + + bfqq->pos_root = &bfq_bfqq_to_bfqg(bfqq)->rq_pos_tree; + __bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root, + blk_rq_pos(bfqq->next_rq), &parent, &p); + if (!__bfqq) { + rb_link_node(&bfqq->pos_node, parent, p); + rb_insert_color(&bfqq->pos_node, bfqq->pos_root); + } else + bfqq->pos_root = NULL; +} + +/* + * Tell whether there are active queues or groups with differentiated weights. + */ +static bool bfq_differentiated_weights(struct bfq_data *bfqd) +{ + /* + * For weights to differ, at least one of the trees must contain + * at least two nodes. + */ + return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) && + (bfqd->queue_weights_tree.rb_node->rb_left || + bfqd->queue_weights_tree.rb_node->rb_right) +#ifdef CONFIG_BFQ_GROUP_IOSCHED + ) || + (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) && + (bfqd->group_weights_tree.rb_node->rb_left || + bfqd->group_weights_tree.rb_node->rb_right) +#endif + ); +} + +/* + * The following function returns true if every queue must receive the + * same share of the throughput (this condition is used when deciding + * whether idling may be disabled, see the comments in the function + * bfq_bfqq_may_idle()). + * + * Such a scenario occurs when: + * 1) all active queues have the same weight, + * 2) all active groups at the same level in the groups tree have the same + * weight, + * 3) all active groups at the same level in the groups tree have the same + * number of children. + * + * Unfortunately, keeping the necessary state for evaluating exactly the + * above symmetry conditions would be quite complex and time-consuming. + * Therefore this function evaluates, instead, the following stronger + * sub-conditions, for which it is much easier to maintain the needed + * state: + * 1) all active queues have the same weight, + * 2) all active groups have the same weight, + * 3) all active groups have at most one active child each. + * In particular, the last two conditions are always true if hierarchical + * support and the cgroups interface are not enabled, thus no state needs + * to be maintained in this case. + */ +static bool bfq_symmetric_scenario(struct bfq_data *bfqd) +{ + return !bfq_differentiated_weights(bfqd); +} + +/* + * If the weight-counter tree passed as input contains no counter for + * the weight of the input entity, then add that counter; otherwise just + * increment the existing counter. + * + * Note that weight-counter trees contain few nodes in mostly symmetric + * scenarios. For example, if all queues have the same weight, then the + * weight-counter tree for the queues may contain at most one node. + * This holds even if low_latency is on, because weight-raised queues + * are not inserted in the tree. + * In most scenarios, the rate at which nodes are created/destroyed + * should be low too. + */ +void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity, + struct rb_root *root) +{ + struct rb_node **new = &(root->rb_node), *parent = NULL; + + /* + * Do not insert if the entity is already associated with a + * counter, which happens if: + * 1) the entity is associated with a queue, + * 2) a request arrival has caused the queue to become both + * non-weight-raised, and hence change its weight, and + * backlogged; in this respect, each of the two events + * causes an invocation of this function, + * 3) this is the invocation of this function caused by the + * second event. This second invocation is actually useless, + * and we handle this fact by exiting immediately. More + * efficient or clearer solutions might possibly be adopted. + */ + if (entity->weight_counter) + return; + + while (*new) { + struct bfq_weight_counter *__counter = container_of(*new, + struct bfq_weight_counter, + weights_node); + parent = *new; + + if (entity->weight == __counter->weight) { + entity->weight_counter = __counter; + goto inc_counter; + } + if (entity->weight < __counter->weight) + new = &((*new)->rb_left); + else + new = &((*new)->rb_right); + } + + entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter), + GFP_ATOMIC); + + /* + * In the unlucky event of an allocation failure, we just + * exit. This will cause the weight of entity to not be + * considered in bfq_differentiated_weights, which, in its + * turn, causes the scenario to be deemed wrongly symmetric in + * case entity's weight would have been the only weight making + * the scenario asymmetric. On the bright side, no unbalance + * will however occur when entity becomes inactive again (the + * invocation of this function is triggered by an activation + * of entity). In fact, bfq_weights_tree_remove does nothing + * if !entity->weight_counter. + */ + if (unlikely(!entity->weight_counter)) + return; + + entity->weight_counter->weight = entity->weight; + rb_link_node(&entity->weight_counter->weights_node, parent, new); + rb_insert_color(&entity->weight_counter->weights_node, root); + +inc_counter: + entity->weight_counter->num_active++; +} + +/* + * Decrement the weight counter associated with the entity, and, if the + * counter reaches 0, remove the counter from the tree. + * See the comments to the function bfq_weights_tree_add() for considerations + * about overhead. + */ +void bfq_weights_tree_remove(struct bfq_data *bfqd, struct bfq_entity *entity, + struct rb_root *root) +{ + if (!entity->weight_counter) + return; + + entity->weight_counter->num_active--; + if (entity->weight_counter->num_active > 0) + goto reset_entity_pointer; + + rb_erase(&entity->weight_counter->weights_node, root); + kfree(entity->weight_counter); + +reset_entity_pointer: + entity->weight_counter = NULL; +} + +/* + * Return expired entry, or NULL to just start from scratch in rbtree. + */ +static struct request *bfq_check_fifo(struct bfq_queue *bfqq, + struct request *last) +{ + struct request *rq; + + if (bfq_bfqq_fifo_expire(bfqq)) + return NULL; + + bfq_mark_bfqq_fifo_expire(bfqq); + + rq = rq_entry_fifo(bfqq->fifo.next); + + if (rq == last || ktime_get_ns() < rq->fifo_time) + return NULL; + + bfq_log_bfqq(bfqq->bfqd, bfqq, "check_fifo: returned %p", rq); + return rq; +} + +static struct request *bfq_find_next_rq(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + struct request *last) +{ + struct rb_node *rbnext = rb_next(&last->rb_node); + struct rb_node *rbprev = rb_prev(&last->rb_node); + struct request *next, *prev = NULL; + + /* Follow expired path, else get first next available. */ + next = bfq_check_fifo(bfqq, last); + if (next) + return next; + + if (rbprev) + prev = rb_entry_rq(rbprev); + + if (rbnext) + next = rb_entry_rq(rbnext); + else { + rbnext = rb_first(&bfqq->sort_list); + if (rbnext && rbnext != &last->rb_node) + next = rb_entry_rq(rbnext); + } + + return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last)); +} + +/* see the definition of bfq_async_charge_factor for details */ +static unsigned long bfq_serv_to_charge(struct request *rq, + struct bfq_queue *bfqq) +{ + if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1) + return blk_rq_sectors(rq); + + /* + * If there are no weight-raised queues, then amplify service + * by just the async charge factor; otherwise amplify service + * by twice the async charge factor, to further reduce latency + * for weight-raised queues. + */ + if (bfqq->bfqd->wr_busy_queues == 0) + return blk_rq_sectors(rq) * bfq_async_charge_factor; + + return blk_rq_sectors(rq) * 2 * bfq_async_charge_factor; +} + +/** + * bfq_updated_next_req - update the queue after a new next_rq selection. + * @bfqd: the device data the queue belongs to. + * @bfqq: the queue to update. + * + * If the first request of a queue changes we make sure that the queue + * has enough budget to serve at least its first request (if the + * request has grown). We do this because if the queue has not enough + * budget for its first request, it has to go through two dispatch + * rounds to actually get it dispatched. + */ +static void bfq_updated_next_req(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + struct request *next_rq = bfqq->next_rq; + unsigned long new_budget; + + if (!next_rq) + return; + + if (bfqq == bfqd->in_service_queue) + /* + * In order not to break guarantees, budgets cannot be + * changed after an entity has been selected. + */ + return; + + new_budget = max_t(unsigned long, bfqq->max_budget, + bfq_serv_to_charge(next_rq, bfqq)); + if (entity->budget != new_budget) { + entity->budget = new_budget; + bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu", + new_budget); + bfq_requeue_bfqq(bfqd, bfqq); + } +} + +static void +bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic) +{ + if (bic->saved_idle_window) + bfq_mark_bfqq_idle_window(bfqq); + else + bfq_clear_bfqq_idle_window(bfqq); + + if (bic->saved_IO_bound) + bfq_mark_bfqq_IO_bound(bfqq); + else + bfq_clear_bfqq_IO_bound(bfqq); + + bfqq->ttime = bic->saved_ttime; + bfqq->wr_coeff = bic->saved_wr_coeff; + bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt; + bfqq->last_wr_start_finish = bic->saved_last_wr_start_finish; + bfqq->wr_cur_max_time = bic->saved_wr_cur_max_time; + + if (bfqq->wr_coeff > 1 && (bfq_bfqq_in_large_burst(bfqq) || + time_is_before_jiffies(bfqq->last_wr_start_finish + + bfqq->wr_cur_max_time))) { + bfq_log_bfqq(bfqq->bfqd, bfqq, + "resume state: switching off wr"); + + bfqq->wr_coeff = 1; + } + + /* make sure weight will be updated, however we got here */ + bfqq->entity.prio_changed = 1; +} + +static int bfqq_process_refs(struct bfq_queue *bfqq) +{ + return bfqq->ref - bfqq->allocated - bfqq->entity.on_st; +} + +/* Empty burst list and add just bfqq (see comments on bfq_handle_burst) */ +static void bfq_reset_burst_list(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + struct bfq_queue *item; + struct hlist_node *n; + + hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node) + hlist_del_init(&item->burst_list_node); + hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list); + bfqd->burst_size = 1; + bfqd->burst_parent_entity = bfqq->entity.parent; +} + +/* Add bfqq to the list of queues in current burst (see bfq_handle_burst) */ +static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + /* Increment burst size to take into account also bfqq */ + bfqd->burst_size++; + + if (bfqd->burst_size == bfqd->bfq_large_burst_thresh) { + struct bfq_queue *pos, *bfqq_item; + struct hlist_node *n; + + /* + * Enough queues have been activated shortly after each + * other to consider this burst as large. + */ + bfqd->large_burst = true; + + /* + * We can now mark all queues in the burst list as + * belonging to a large burst. + */ + hlist_for_each_entry(bfqq_item, &bfqd->burst_list, + burst_list_node) + bfq_mark_bfqq_in_large_burst(bfqq_item); + bfq_mark_bfqq_in_large_burst(bfqq); + + /* + * From now on, and until the current burst finishes, any + * new queue being activated shortly after the last queue + * was inserted in the burst can be immediately marked as + * belonging to a large burst. So the burst list is not + * needed any more. Remove it. + */ + hlist_for_each_entry_safe(pos, n, &bfqd->burst_list, + burst_list_node) + hlist_del_init(&pos->burst_list_node); + } else /* + * Burst not yet large: add bfqq to the burst list. Do + * not increment the ref counter for bfqq, because bfqq + * is removed from the burst list before freeing bfqq + * in put_queue. + */ + hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list); +} + +/* + * If many queues belonging to the same group happen to be created + * shortly after each other, then the processes associated with these + * queues have typically a common goal. In particular, bursts of queue + * creations are usually caused by services or applications that spawn + * many parallel threads/processes. Examples are systemd during boot, + * or git grep. To help these processes get their job done as soon as + * possible, it is usually better to not grant either weight-raising + * or device idling to their queues. + * + * In this comment we describe, firstly, the reasons why this fact + * holds, and, secondly, the next function, which implements the main + * steps needed to properly mark these queues so that they can then be + * treated in a different way. + * + * The above services or applications benefit mostly from a high + * throughput: the quicker the requests of the activated queues are + * cumulatively served, the sooner the target job of these queues gets + * completed. As a consequence, weight-raising any of these queues, + * which also implies idling the device for it, is almost always + * counterproductive. In most cases it just lowers throughput. + * + * On the other hand, a burst of queue creations may be caused also by + * the start of an application that does not consist of a lot of + * parallel I/O-bound threads. In fact, with a complex application, + * several short processes may need to be executed to start-up the + * application. In this respect, to start an application as quickly as + * possible, the best thing to do is in any case to privilege the I/O + * related to the application with respect to all other + * I/O. Therefore, the best strategy to start as quickly as possible + * an application that causes a burst of queue creations is to + * weight-raise all the queues created during the burst. This is the + * exact opposite of the best strategy for the other type of bursts. + * + * In the end, to take the best action for each of the two cases, the + * two types of bursts need to be distinguished. Fortunately, this + * seems relatively easy, by looking at the sizes of the bursts. In + * particular, we found a threshold such that only bursts with a + * larger size than that threshold are apparently caused by + * services or commands such as systemd or git grep. For brevity, + * hereafter we call just 'large' these bursts. BFQ *does not* + * weight-raise queues whose creation occurs in a large burst. In + * addition, for each of these queues BFQ performs or does not perform + * idling depending on which choice boosts the throughput more. The + * exact choice depends on the device and request pattern at + * hand. + * + * Unfortunately, false positives may occur while an interactive task + * is starting (e.g., an application is being started). The + * consequence is that the queues associated with the task do not + * enjoy weight raising as expected. Fortunately these false positives + * are very rare. They typically occur if some service happens to + * start doing I/O exactly when the interactive task starts. + * + * Turning back to the next function, it implements all the steps + * needed to detect the occurrence of a large burst and to properly + * mark all the queues belonging to it (so that they can then be + * treated in a different way). This goal is achieved by maintaining a + * "burst list" that holds, temporarily, the queues that belong to the + * burst in progress. The list is then used to mark these queues as + * belonging to a large burst if the burst does become large. The main + * steps are the following. + * + * . when the very first queue is created, the queue is inserted into the + * list (as it could be the first queue in a possible burst) + * + * . if the current burst has not yet become large, and a queue Q that does + * not yet belong to the burst is activated shortly after the last time + * at which a new queue entered the burst list, then the function appends + * Q to the burst list + * + * . if, as a consequence of the previous step, the burst size reaches + * the large-burst threshold, then + * + * . all the queues in the burst list are marked as belonging to a + * large burst + * + * . the burst list is deleted; in fact, the burst list already served + * its purpose (keeping temporarily track of the queues in a burst, + * so as to be able to mark them as belonging to a large burst in the + * previous sub-step), and now is not needed any more + * + * . the device enters a large-burst mode + * + * . if a queue Q that does not belong to the burst is created while + * the device is in large-burst mode and shortly after the last time + * at which a queue either entered the burst list or was marked as + * belonging to the current large burst, then Q is immediately marked + * as belonging to a large burst. + * + * . if a queue Q that does not belong to the burst is created a while + * later, i.e., not shortly after, than the last time at which a queue + * either entered the burst list or was marked as belonging to the + * current large burst, then the current burst is deemed as finished and: + * + * . the large-burst mode is reset if set + * + * . the burst list is emptied + * + * . Q is inserted in the burst list, as Q may be the first queue + * in a possible new burst (then the burst list contains just Q + * after this step). + */ +static void bfq_handle_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + /* + * If bfqq is already in the burst list or is part of a large + * burst, or finally has just been split, then there is + * nothing else to do. + */ + if (!hlist_unhashed(&bfqq->burst_list_node) || + bfq_bfqq_in_large_burst(bfqq) || + time_is_after_eq_jiffies(bfqq->split_time + + msecs_to_jiffies(10))) + return; + + /* + * If bfqq's creation happens late enough, or bfqq belongs to + * a different group than the burst group, then the current + * burst is finished, and related data structures must be + * reset. + * + * In this respect, consider the special case where bfqq is + * the very first queue created after BFQ is selected for this + * device. In this case, last_ins_in_burst and + * burst_parent_entity are not yet significant when we get + * here. But it is easy to verify that, whether or not the + * following condition is true, bfqq will end up being + * inserted into the burst list. In particular the list will + * happen to contain only bfqq. And this is exactly what has + * to happen, as bfqq may be the first queue of the first + * burst. + */ + if (time_is_before_jiffies(bfqd->last_ins_in_burst + + bfqd->bfq_burst_interval) || + bfqq->entity.parent != bfqd->burst_parent_entity) { + bfqd->large_burst = false; + bfq_reset_burst_list(bfqd, bfqq); + goto end; + } + + /* + * If we get here, then bfqq is being activated shortly after the + * last queue. So, if the current burst is also large, we can mark + * bfqq as belonging to this large burst immediately. + */ + if (bfqd->large_burst) { + bfq_mark_bfqq_in_large_burst(bfqq); + goto end; + } + + /* + * If we get here, then a large-burst state has not yet been + * reached, but bfqq is being activated shortly after the last + * queue. Then we add bfqq to the burst. + */ + bfq_add_to_burst(bfqd, bfqq); +end: + /* + * At this point, bfqq either has been added to the current + * burst or has caused the current burst to terminate and a + * possible new burst to start. In particular, in the second + * case, bfqq has become the first queue in the possible new + * burst. In both cases last_ins_in_burst needs to be moved + * forward. + */ + bfqd->last_ins_in_burst = jiffies; +} + +static int bfq_bfqq_budget_left(struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + + return entity->budget - entity->service; +} + +/* + * If enough samples have been computed, return the current max budget + * stored in bfqd, which is dynamically updated according to the + * estimated disk peak rate; otherwise return the default max budget + */ +static int bfq_max_budget(struct bfq_data *bfqd) +{ + if (bfqd->budgets_assigned < bfq_stats_min_budgets) + return bfq_default_max_budget; + else + return bfqd->bfq_max_budget; +} + +/* + * Return min budget, which is a fraction of the current or default + * max budget (trying with 1/32) + */ +static int bfq_min_budget(struct bfq_data *bfqd) +{ + if (bfqd->budgets_assigned < bfq_stats_min_budgets) + return bfq_default_max_budget / 32; + else + return bfqd->bfq_max_budget / 32; +} + +/* + * The next function, invoked after the input queue bfqq switches from + * idle to busy, updates the budget of bfqq. The function also tells + * whether the in-service queue should be expired, by returning + * true. The purpose of expiring the in-service queue is to give bfqq + * the chance to possibly preempt the in-service queue, and the reason + * for preempting the in-service queue is to achieve one of the two + * goals below. + * + * 1. Guarantee to bfqq its reserved bandwidth even if bfqq has + * expired because it has remained idle. In particular, bfqq may have + * expired for one of the following two reasons: + * + * - BFQQE_NO_MORE_REQUESTS bfqq did not enjoy any device idling + * and did not make it to issue a new request before its last + * request was served; + * + * - BFQQE_TOO_IDLE bfqq did enjoy device idling, but did not issue + * a new request before the expiration of the idling-time. + * + * Even if bfqq has expired for one of the above reasons, the process + * associated with the queue may be however issuing requests greedily, + * and thus be sensitive to the bandwidth it receives (bfqq may have + * remained idle for other reasons: CPU high load, bfqq not enjoying + * idling, I/O throttling somewhere in the path from the process to + * the I/O scheduler, ...). But if, after every expiration for one of + * the above two reasons, bfqq has to wait for the service of at least + * one full budget of another queue before being served again, then + * bfqq is likely to get a much lower bandwidth or resource time than + * its reserved ones. To address this issue, two countermeasures need + * to be taken. + * + * First, the budget and the timestamps of bfqq need to be updated in + * a special way on bfqq reactivation: they need to be updated as if + * bfqq did not remain idle and did not expire. In fact, if they are + * computed as if bfqq expired and remained idle until reactivation, + * then the process associated with bfqq is treated as if, instead of + * being greedy, it stopped issuing requests when bfqq remained idle, + * and restarts issuing requests only on this reactivation. In other + * words, the scheduler does not help the process recover the "service + * hole" between bfqq expiration and reactivation. As a consequence, + * the process receives a lower bandwidth than its reserved one. In + * contrast, to recover this hole, the budget must be updated as if + * bfqq was not expired at all before this reactivation, i.e., it must + * be set to the value of the remaining budget when bfqq was + * expired. Along the same line, timestamps need to be assigned the + * value they had the last time bfqq was selected for service, i.e., + * before last expiration. Thus timestamps need to be back-shifted + * with respect to their normal computation (see [1] for more details + * on this tricky aspect). + * + * Secondly, to allow the process to recover the hole, the in-service + * queue must be expired too, to give bfqq the chance to preempt it + * immediately. In fact, if bfqq has to wait for a full budget of the + * in-service queue to be completed, then it may become impossible to + * let the process recover the hole, even if the back-shifted + * timestamps of bfqq are lower than those of the in-service queue. If + * this happens for most or all of the holes, then the process may not + * receive its reserved bandwidth. In this respect, it is worth noting + * that, being the service of outstanding requests unpreemptible, a + * little fraction of the holes may however be unrecoverable, thereby + * causing a little loss of bandwidth. + * + * The last important point is detecting whether bfqq does need this + * bandwidth recovery. In this respect, the next function deems the + * process associated with bfqq greedy, and thus allows it to recover + * the hole, if: 1) the process is waiting for the arrival of a new + * request (which implies that bfqq expired for one of the above two + * reasons), and 2) such a request has arrived soon. The first + * condition is controlled through the flag non_blocking_wait_rq, + * while the second through the flag arrived_in_time. If both + * conditions hold, then the function computes the budget in the + * above-described special way, and signals that the in-service queue + * should be expired. Timestamp back-shifting is done later in + * __bfq_activate_entity. + * + * 2. Reduce latency. Even if timestamps are not backshifted to let + * the process associated with bfqq recover a service hole, bfqq may + * however happen to have, after being (re)activated, a lower finish + * timestamp than the in-service queue. That is, the next budget of + * bfqq may have to be completed before the one of the in-service + * queue. If this is the case, then preempting the in-service queue + * allows this goal to be achieved, apart from the unpreemptible, + * outstanding requests mentioned above. + * + * Unfortunately, regardless of which of the above two goals one wants + * to achieve, service trees need first to be updated to know whether + * the in-service queue must be preempted. To have service trees + * correctly updated, the in-service queue must be expired and + * rescheduled, and bfqq must be scheduled too. This is one of the + * most costly operations (in future versions, the scheduling + * mechanism may be re-designed in such a way to make it possible to + * know whether preemption is needed without needing to update service + * trees). In addition, queue preemptions almost always cause random + * I/O, and thus loss of throughput. Because of these facts, the next + * function adopts the following simple scheme to avoid both costly + * operations and too frequent preemptions: it requests the expiration + * of the in-service queue (unconditionally) only for queues that need + * to recover a hole, or that either are weight-raised or deserve to + * be weight-raised. + */ +static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + bool arrived_in_time, + bool wr_or_deserves_wr) +{ + struct bfq_entity *entity = &bfqq->entity; + + if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time) { + /* + * We do not clear the flag non_blocking_wait_rq here, as + * the latter is used in bfq_activate_bfqq to signal + * that timestamps need to be back-shifted (and is + * cleared right after). + */ + + /* + * In next assignment we rely on that either + * entity->service or entity->budget are not updated + * on expiration if bfqq is empty (see + * __bfq_bfqq_recalc_budget). Thus both quantities + * remain unchanged after such an expiration, and the + * following statement therefore assigns to + * entity->budget the remaining budget on such an + * expiration. For clarity, entity->service is not + * updated on expiration in any case, and, in normal + * operation, is reset only when bfqq is selected for + * service (see bfq_get_next_queue). + */ + entity->budget = min_t(unsigned long, + bfq_bfqq_budget_left(bfqq), + bfqq->max_budget); + + return true; + } + + entity->budget = max_t(unsigned long, bfqq->max_budget, + bfq_serv_to_charge(bfqq->next_rq, bfqq)); + bfq_clear_bfqq_non_blocking_wait_rq(bfqq); + return wr_or_deserves_wr; +} + +static unsigned int bfq_wr_duration(struct bfq_data *bfqd) +{ + u64 dur; + + if (bfqd->bfq_wr_max_time > 0) + return bfqd->bfq_wr_max_time; + + dur = bfqd->RT_prod; + do_div(dur, bfqd->peak_rate); + + /* + * Limit duration between 3 and 13 seconds. Tests show that + * higher values than 13 seconds often yield the opposite of + * the desired result, i.e., worsen responsiveness by letting + * non-interactive and non-soft-real-time applications + * preserve weight raising for a too long time interval. + * + * On the other end, lower values than 3 seconds make it + * difficult for most interactive tasks to complete their jobs + * before weight-raising finishes. + */ + if (dur > msecs_to_jiffies(13000)) + dur = msecs_to_jiffies(13000); + else if (dur < msecs_to_jiffies(3000)) + dur = msecs_to_jiffies(3000); + + return dur; +} + +static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + unsigned int old_wr_coeff, + bool wr_or_deserves_wr, + bool interactive, + bool in_burst, + bool soft_rt) +{ + if (old_wr_coeff == 1 && wr_or_deserves_wr) { + /* start a weight-raising period */ + if (interactive) { + bfqq->wr_coeff = bfqd->bfq_wr_coeff; + bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); + } else { + bfqq->wr_start_at_switch_to_srt = jiffies; + bfqq->wr_coeff = bfqd->bfq_wr_coeff * + BFQ_SOFTRT_WEIGHT_FACTOR; + bfqq->wr_cur_max_time = + bfqd->bfq_wr_rt_max_time; + } + + /* + * If needed, further reduce budget to make sure it is + * close to bfqq's backlog, so as to reduce the + * scheduling-error component due to a too large + * budget. Do not care about throughput consequences, + * but only about latency. Finally, do not assign a + * too small budget either, to avoid increasing + * latency by causing too frequent expirations. + */ + bfqq->entity.budget = min_t(unsigned long, + bfqq->entity.budget, + 2 * bfq_min_budget(bfqd)); + } else if (old_wr_coeff > 1) { + if (interactive) { /* update wr coeff and duration */ + bfqq->wr_coeff = bfqd->bfq_wr_coeff; + bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); + } else if (in_burst) + bfqq->wr_coeff = 1; + else if (soft_rt) { + /* + * The application is now or still meeting the + * requirements for being deemed soft rt. We + * can then correctly and safely (re)charge + * the weight-raising duration for the + * application with the weight-raising + * duration for soft rt applications. + * + * In particular, doing this recharge now, i.e., + * before the weight-raising period for the + * application finishes, reduces the probability + * of the following negative scenario: + * 1) the weight of a soft rt application is + * raised at startup (as for any newly + * created application), + * 2) since the application is not interactive, + * at a certain time weight-raising is + * stopped for the application, + * 3) at that time the application happens to + * still have pending requests, and hence + * is destined to not have a chance to be + * deemed soft rt before these requests are + * completed (see the comments to the + * function bfq_bfqq_softrt_next_start() + * for details on soft rt detection), + * 4) these pending requests experience a high + * latency because the application is not + * weight-raised while they are pending. + */ + if (bfqq->wr_cur_max_time != + bfqd->bfq_wr_rt_max_time) { + bfqq->wr_start_at_switch_to_srt = + bfqq->last_wr_start_finish; + + bfqq->wr_cur_max_time = + bfqd->bfq_wr_rt_max_time; + bfqq->wr_coeff = bfqd->bfq_wr_coeff * + BFQ_SOFTRT_WEIGHT_FACTOR; + } + bfqq->last_wr_start_finish = jiffies; + } + } +} + +static bool bfq_bfqq_idle_for_long_time(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + return bfqq->dispatched == 0 && + time_is_before_jiffies( + bfqq->budget_timeout + + bfqd->bfq_wr_min_idle_time); +} + +static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + int old_wr_coeff, + struct request *rq, + bool *interactive) +{ + bool soft_rt, in_burst, wr_or_deserves_wr, + bfqq_wants_to_preempt, + idle_for_long_time = bfq_bfqq_idle_for_long_time(bfqd, bfqq), + /* + * See the comments on + * bfq_bfqq_update_budg_for_activation for + * details on the usage of the next variable. + */ + arrived_in_time = ktime_get_ns() <= + bfqq->ttime.last_end_request + + bfqd->bfq_slice_idle * 3; + + bfqg_stats_update_io_add(bfqq_group(RQ_BFQQ(rq)), bfqq, rq->cmd_flags); + + /* + * bfqq deserves to be weight-raised if: + * - it is sync, + * - it does not belong to a large burst, + * - it has been idle for enough time or is soft real-time, + * - is linked to a bfq_io_cq (it is not shared in any sense). + */ + in_burst = bfq_bfqq_in_large_burst(bfqq); + soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 && + !in_burst && + time_is_before_jiffies(bfqq->soft_rt_next_start); + *interactive = !in_burst && idle_for_long_time; + wr_or_deserves_wr = bfqd->low_latency && + (bfqq->wr_coeff > 1 || + (bfq_bfqq_sync(bfqq) && + bfqq->bic && (*interactive || soft_rt))); + + /* + * Using the last flag, update budget and check whether bfqq + * may want to preempt the in-service queue. + */ + bfqq_wants_to_preempt = + bfq_bfqq_update_budg_for_activation(bfqd, bfqq, + arrived_in_time, + wr_or_deserves_wr); + + /* + * If bfqq happened to be activated in a burst, but has been + * idle for much more than an interactive queue, then we + * assume that, in the overall I/O initiated in the burst, the + * I/O associated with bfqq is finished. So bfqq does not need + * to be treated as a queue belonging to a burst + * anymore. Accordingly, we reset bfqq's in_large_burst flag + * if set, and remove bfqq from the burst list if it's + * there. We do not decrement burst_size, because the fact + * that bfqq does not need to belong to the burst list any + * more does not invalidate the fact that bfqq was created in + * a burst. + */ + if (likely(!bfq_bfqq_just_created(bfqq)) && + idle_for_long_time && + time_is_before_jiffies( + bfqq->budget_timeout + + msecs_to_jiffies(10000))) { + hlist_del_init(&bfqq->burst_list_node); + bfq_clear_bfqq_in_large_burst(bfqq); + } + + bfq_clear_bfqq_just_created(bfqq); + + + if (!bfq_bfqq_IO_bound(bfqq)) { + if (arrived_in_time) { + bfqq->requests_within_timer++; + if (bfqq->requests_within_timer >= + bfqd->bfq_requests_within_timer) + bfq_mark_bfqq_IO_bound(bfqq); + } else + bfqq->requests_within_timer = 0; + } + + if (bfqd->low_latency) { + if (unlikely(time_is_after_jiffies(bfqq->split_time))) + /* wraparound */ + bfqq->split_time = + jiffies - bfqd->bfq_wr_min_idle_time - 1; + + if (time_is_before_jiffies(bfqq->split_time + + bfqd->bfq_wr_min_idle_time)) { + bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq, + old_wr_coeff, + wr_or_deserves_wr, + *interactive, + in_burst, + soft_rt); + + if (old_wr_coeff != bfqq->wr_coeff) + bfqq->entity.prio_changed = 1; + } + } + + bfqq->last_idle_bklogged = jiffies; + bfqq->service_from_backlogged = 0; + bfq_clear_bfqq_softrt_update(bfqq); + + bfq_add_bfqq_busy(bfqd, bfqq); + + /* + * Expire in-service queue only if preemption may be needed + * for guarantees. In this respect, the function + * next_queue_may_preempt just checks a simple, necessary + * condition, and not a sufficient condition based on + * timestamps. In fact, for the latter condition to be + * evaluated, timestamps would need first to be updated, and + * this operation is quite costly (see the comments on the + * function bfq_bfqq_update_budg_for_activation). + */ + if (bfqd->in_service_queue && bfqq_wants_to_preempt && + bfqd->in_service_queue->wr_coeff < bfqq->wr_coeff && + next_queue_may_preempt(bfqd)) + bfq_bfqq_expire(bfqd, bfqd->in_service_queue, + false, BFQQE_PREEMPTED); +} + +static void bfq_add_request(struct request *rq) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq); + struct bfq_data *bfqd = bfqq->bfqd; + struct request *next_rq, *prev; + unsigned int old_wr_coeff = bfqq->wr_coeff; + bool interactive = false; + + bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq)); + bfqq->queued[rq_is_sync(rq)]++; + bfqd->queued++; + + elv_rb_add(&bfqq->sort_list, rq); + + /* + * Check if this request is a better next-serve candidate. + */ + prev = bfqq->next_rq; + next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position); + bfqq->next_rq = next_rq; + + /* + * Adjust priority tree position, if next_rq changes. + */ + if (prev != bfqq->next_rq) + bfq_pos_tree_add_move(bfqd, bfqq); + + if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */ + bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, old_wr_coeff, + rq, &interactive); + else { + if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) && + time_is_before_jiffies( + bfqq->last_wr_start_finish + + bfqd->bfq_wr_min_inter_arr_async)) { + bfqq->wr_coeff = bfqd->bfq_wr_coeff; + bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); + + bfqd->wr_busy_queues++; + bfqq->entity.prio_changed = 1; + } + if (prev != bfqq->next_rq) + bfq_updated_next_req(bfqd, bfqq); + } + + /* + * Assign jiffies to last_wr_start_finish in the following + * cases: + * + * . if bfqq is not going to be weight-raised, because, for + * non weight-raised queues, last_wr_start_finish stores the + * arrival time of the last request; as of now, this piece + * of information is used only for deciding whether to + * weight-raise async queues + * + * . if bfqq is not weight-raised, because, if bfqq is now + * switching to weight-raised, then last_wr_start_finish + * stores the time when weight-raising starts + * + * . if bfqq is interactive, because, regardless of whether + * bfqq is currently weight-raised, the weight-raising + * period must start or restart (this case is considered + * separately because it is not detected by the above + * conditions, if bfqq is already weight-raised) + * + * last_wr_start_finish has to be updated also if bfqq is soft + * real-time, because the weight-raising period is constantly + * restarted on idle-to-busy transitions for these queues, but + * this is already done in bfq_bfqq_handle_idle_busy_switch if + * needed. + */ + if (bfqd->low_latency && + (old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive)) + bfqq->last_wr_start_finish = jiffies; +} + +static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd, + struct bio *bio, + struct request_queue *q) +{ + struct bfq_queue *bfqq = bfqd->bio_bfqq; + + + if (bfqq) + return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio)); + + return NULL; +} + +static sector_t get_sdist(sector_t last_pos, struct request *rq) +{ + if (last_pos) + return abs(blk_rq_pos(rq) - last_pos); + + return 0; +} + +#if 0 /* Still not clear if we can do without next two functions */ +static void bfq_activate_request(struct request_queue *q, struct request *rq) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + + bfqd->rq_in_driver++; +} + +static void bfq_deactivate_request(struct request_queue *q, struct request *rq) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + + bfqd->rq_in_driver--; +} +#endif + +static void bfq_remove_request(struct request_queue *q, + struct request *rq) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq); + struct bfq_data *bfqd = bfqq->bfqd; + const int sync = rq_is_sync(rq); + + if (bfqq->next_rq == rq) { + bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq); + bfq_updated_next_req(bfqd, bfqq); + } + + if (rq->queuelist.prev != &rq->queuelist) + list_del_init(&rq->queuelist); + bfqq->queued[sync]--; + bfqd->queued--; + elv_rb_del(&bfqq->sort_list, rq); + + elv_rqhash_del(q, rq); + if (q->last_merge == rq) + q->last_merge = NULL; + + if (RB_EMPTY_ROOT(&bfqq->sort_list)) { + bfqq->next_rq = NULL; + + if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) { + bfq_del_bfqq_busy(bfqd, bfqq, false); + /* + * bfqq emptied. In normal operation, when + * bfqq is empty, bfqq->entity.service and + * bfqq->entity.budget must contain, + * respectively, the service received and the + * budget used last time bfqq emptied. These + * facts do not hold in this case, as at least + * this last removal occurred while bfqq is + * not in service. To avoid inconsistencies, + * reset both bfqq->entity.service and + * bfqq->entity.budget, if bfqq has still a + * process that may issue I/O requests to it. + */ + bfqq->entity.budget = bfqq->entity.service = 0; + } + + /* + * Remove queue from request-position tree as it is empty. + */ + if (bfqq->pos_root) { + rb_erase(&bfqq->pos_node, bfqq->pos_root); + bfqq->pos_root = NULL; + } + } + + if (rq->cmd_flags & REQ_META) + bfqq->meta_pending--; + + bfqg_stats_update_io_remove(bfqq_group(bfqq), rq->cmd_flags); +} + +static bool bfq_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio) +{ + struct request_queue *q = hctx->queue; + struct bfq_data *bfqd = q->elevator->elevator_data; + struct request *free = NULL; + /* + * bfq_bic_lookup grabs the queue_lock: invoke it now and + * store its return value for later use, to avoid nesting + * queue_lock inside the bfqd->lock. We assume that the bic + * returned by bfq_bic_lookup does not go away before + * bfqd->lock is taken. + */ + struct bfq_io_cq *bic = bfq_bic_lookup(bfqd, current->io_context, q); + bool ret; + + spin_lock_irq(&bfqd->lock); + + if (bic) + bfqd->bio_bfqq = bic_to_bfqq(bic, op_is_sync(bio->bi_opf)); + else + bfqd->bio_bfqq = NULL; + bfqd->bio_bic = bic; + + ret = blk_mq_sched_try_merge(q, bio, &free); + + if (free) + blk_mq_free_request(free); + spin_unlock_irq(&bfqd->lock); + + return ret; +} + +static int bfq_request_merge(struct request_queue *q, struct request **req, + struct bio *bio) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + struct request *__rq; + + __rq = bfq_find_rq_fmerge(bfqd, bio, q); + if (__rq && elv_bio_merge_ok(__rq, bio)) { + *req = __rq; + return ELEVATOR_FRONT_MERGE; + } + + return ELEVATOR_NO_MERGE; +} + +static void bfq_request_merged(struct request_queue *q, struct request *req, + enum elv_merge type) +{ + if (type == ELEVATOR_FRONT_MERGE && + rb_prev(&req->rb_node) && + blk_rq_pos(req) < + blk_rq_pos(container_of(rb_prev(&req->rb_node), + struct request, rb_node))) { + struct bfq_queue *bfqq = RQ_BFQQ(req); + struct bfq_data *bfqd = bfqq->bfqd; + struct request *prev, *next_rq; + + /* Reposition request in its sort_list */ + elv_rb_del(&bfqq->sort_list, req); + elv_rb_add(&bfqq->sort_list, req); + + /* Choose next request to be served for bfqq */ + prev = bfqq->next_rq; + next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req, + bfqd->last_position); + bfqq->next_rq = next_rq; + /* + * If next_rq changes, update both the queue's budget to + * fit the new request and the queue's position in its + * rq_pos_tree. + */ + if (prev != bfqq->next_rq) { + bfq_updated_next_req(bfqd, bfqq); + bfq_pos_tree_add_move(bfqd, bfqq); + } + } +} + +static void bfq_requests_merged(struct request_queue *q, struct request *rq, + struct request *next) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq), *next_bfqq = RQ_BFQQ(next); + + if (!RB_EMPTY_NODE(&rq->rb_node)) + goto end; + spin_lock_irq(&bfqq->bfqd->lock); + + /* + * If next and rq belong to the same bfq_queue and next is older + * than rq, then reposition rq in the fifo (by substituting next + * with rq). Otherwise, if next and rq belong to different + * bfq_queues, never reposition rq: in fact, we would have to + * reposition it with respect to next's position in its own fifo, + * which would most certainly be too expensive with respect to + * the benefits. + */ + if (bfqq == next_bfqq && + !list_empty(&rq->queuelist) && !list_empty(&next->queuelist) && + next->fifo_time < rq->fifo_time) { + list_del_init(&rq->queuelist); + list_replace_init(&next->queuelist, &rq->queuelist); + rq->fifo_time = next->fifo_time; + } + + if (bfqq->next_rq == next) + bfqq->next_rq = rq; + + bfq_remove_request(q, next); + + spin_unlock_irq(&bfqq->bfqd->lock); +end: + bfqg_stats_update_io_merged(bfqq_group(bfqq), next->cmd_flags); +} + +/* Must be called with bfqq != NULL */ +static void bfq_bfqq_end_wr(struct bfq_queue *bfqq) +{ + if (bfq_bfqq_busy(bfqq)) + bfqq->bfqd->wr_busy_queues--; + bfqq->wr_coeff = 1; + bfqq->wr_cur_max_time = 0; + bfqq->last_wr_start_finish = jiffies; + /* + * Trigger a weight change on the next invocation of + * __bfq_entity_update_weight_prio. + */ + bfqq->entity.prio_changed = 1; +} + +void bfq_end_wr_async_queues(struct bfq_data *bfqd, + struct bfq_group *bfqg) +{ + int i, j; + + for (i = 0; i < 2; i++) + for (j = 0; j < IOPRIO_BE_NR; j++) + if (bfqg->async_bfqq[i][j]) + bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]); + if (bfqg->async_idle_bfqq) + bfq_bfqq_end_wr(bfqg->async_idle_bfqq); +} + +static void bfq_end_wr(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq; + + spin_lock_irq(&bfqd->lock); + + list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) + bfq_bfqq_end_wr(bfqq); + list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) + bfq_bfqq_end_wr(bfqq); + bfq_end_wr_async(bfqd); + + spin_unlock_irq(&bfqd->lock); +} + +static sector_t bfq_io_struct_pos(void *io_struct, bool request) +{ + if (request) + return blk_rq_pos(io_struct); + else + return ((struct bio *)io_struct)->bi_iter.bi_sector; +} + +static int bfq_rq_close_to_sector(void *io_struct, bool request, + sector_t sector) +{ + return abs(bfq_io_struct_pos(io_struct, request) - sector) <= + BFQQ_CLOSE_THR; +} + +static struct bfq_queue *bfqq_find_close(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + sector_t sector) +{ + struct rb_root *root = &bfq_bfqq_to_bfqg(bfqq)->rq_pos_tree; + struct rb_node *parent, *node; + struct bfq_queue *__bfqq; + + if (RB_EMPTY_ROOT(root)) + return NULL; + + /* + * First, if we find a request starting at the end of the last + * request, choose it. + */ + __bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL); + if (__bfqq) + return __bfqq; + + /* + * If the exact sector wasn't found, the parent of the NULL leaf + * will contain the closest sector (rq_pos_tree sorted by + * next_request position). + */ + __bfqq = rb_entry(parent, struct bfq_queue, pos_node); + if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector)) + return __bfqq; + + if (blk_rq_pos(__bfqq->next_rq) < sector) + node = rb_next(&__bfqq->pos_node); + else + node = rb_prev(&__bfqq->pos_node); + if (!node) + return NULL; + + __bfqq = rb_entry(node, struct bfq_queue, pos_node); + if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector)) + return __bfqq; + + return NULL; +} + +static struct bfq_queue *bfq_find_close_cooperator(struct bfq_data *bfqd, + struct bfq_queue *cur_bfqq, + sector_t sector) +{ + struct bfq_queue *bfqq; + + /* + * We shall notice if some of the queues are cooperating, + * e.g., working closely on the same area of the device. In + * that case, we can group them together and: 1) don't waste + * time idling, and 2) serve the union of their requests in + * the best possible order for throughput. + */ + bfqq = bfqq_find_close(bfqd, cur_bfqq, sector); + if (!bfqq || bfqq == cur_bfqq) + return NULL; + + return bfqq; +} + +static struct bfq_queue * +bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq) +{ + int process_refs, new_process_refs; + struct bfq_queue *__bfqq; + + /* + * If there are no process references on the new_bfqq, then it is + * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain + * may have dropped their last reference (not just their last process + * reference). + */ + if (!bfqq_process_refs(new_bfqq)) + return NULL; + + /* Avoid a circular list and skip interim queue merges. */ + while ((__bfqq = new_bfqq->new_bfqq)) { + if (__bfqq == bfqq) + return NULL; + new_bfqq = __bfqq; + } + + process_refs = bfqq_process_refs(bfqq); + new_process_refs = bfqq_process_refs(new_bfqq); + /* + * If the process for the bfqq has gone away, there is no + * sense in merging the queues. + */ + if (process_refs == 0 || new_process_refs == 0) + return NULL; + + bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d", + new_bfqq->pid); + + /* + * Merging is just a redirection: the requests of the process + * owning one of the two queues are redirected to the other queue. + * The latter queue, in its turn, is set as shared if this is the + * first time that the requests of some process are redirected to + * it. + * + * We redirect bfqq to new_bfqq and not the opposite, because + * we are in the context of the process owning bfqq, thus we + * have the io_cq of this process. So we can immediately + * configure this io_cq to redirect the requests of the + * process to new_bfqq. In contrast, the io_cq of new_bfqq is + * not available any more (new_bfqq->bic == NULL). + * + * Anyway, even in case new_bfqq coincides with the in-service + * queue, redirecting requests the in-service queue is the + * best option, as we feed the in-service queue with new + * requests close to the last request served and, by doing so, + * are likely to increase the throughput. + */ + bfqq->new_bfqq = new_bfqq; + new_bfqq->ref += process_refs; + return new_bfqq; +} + +static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq, + struct bfq_queue *new_bfqq) +{ + if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq) || + (bfqq->ioprio_class != new_bfqq->ioprio_class)) + return false; + + /* + * If either of the queues has already been detected as seeky, + * then merging it with the other queue is unlikely to lead to + * sequential I/O. + */ + if (BFQQ_SEEKY(bfqq) || BFQQ_SEEKY(new_bfqq)) + return false; + + /* + * Interleaved I/O is known to be done by (some) applications + * only for reads, so it does not make sense to merge async + * queues. + */ + if (!bfq_bfqq_sync(bfqq) || !bfq_bfqq_sync(new_bfqq)) + return false; + + return true; +} + +/* + * If this function returns true, then bfqq cannot be merged. The idea + * is that true cooperation happens very early after processes start + * to do I/O. Usually, late cooperations are just accidental false + * positives. In case bfqq is weight-raised, such false positives + * would evidently degrade latency guarantees for bfqq. + */ +static bool wr_from_too_long(struct bfq_queue *bfqq) +{ + return bfqq->wr_coeff > 1 && + time_is_before_jiffies(bfqq->last_wr_start_finish + + msecs_to_jiffies(100)); +} + +/* + * Attempt to schedule a merge of bfqq with the currently in-service + * queue or with a close queue among the scheduled queues. Return + * NULL if no merge was scheduled, a pointer to the shared bfq_queue + * structure otherwise. + * + * The OOM queue is not allowed to participate to cooperation: in fact, since + * the requests temporarily redirected to the OOM queue could be redirected + * again to dedicated queues at any time, the state needed to correctly + * handle merging with the OOM queue would be quite complex and expensive + * to maintain. Besides, in such a critical condition as an out of memory, + * the benefits of queue merging may be little relevant, or even negligible. + * + * Weight-raised queues can be merged only if their weight-raising + * period has just started. In fact cooperating processes are usually + * started together. Thus, with this filter we avoid false positives + * that would jeopardize low-latency guarantees. + * + * WARNING: queue merging may impair fairness among non-weight raised + * queues, for at least two reasons: 1) the original weight of a + * merged queue may change during the merged state, 2) even being the + * weight the same, a merged queue may be bloated with many more + * requests than the ones produced by its originally-associated + * process. + */ +static struct bfq_queue * +bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq, + void *io_struct, bool request) +{ + struct bfq_queue *in_service_bfqq, *new_bfqq; + + if (bfqq->new_bfqq) + return bfqq->new_bfqq; + + if (!io_struct || + wr_from_too_long(bfqq) || + unlikely(bfqq == &bfqd->oom_bfqq)) + return NULL; + + /* If there is only one backlogged queue, don't search. */ + if (bfqd->busy_queues == 1) + return NULL; + + in_service_bfqq = bfqd->in_service_queue; + + if (!in_service_bfqq || in_service_bfqq == bfqq + || wr_from_too_long(in_service_bfqq) || + unlikely(in_service_bfqq == &bfqd->oom_bfqq)) + goto check_scheduled; + + if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) && + bfqq->entity.parent == in_service_bfqq->entity.parent && + bfq_may_be_close_cooperator(bfqq, in_service_bfqq)) { + new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq); + if (new_bfqq) + return new_bfqq; + } + /* + * Check whether there is a cooperator among currently scheduled + * queues. The only thing we need is that the bio/request is not + * NULL, as we need it to establish whether a cooperator exists. + */ +check_scheduled: + new_bfqq = bfq_find_close_cooperator(bfqd, bfqq, + bfq_io_struct_pos(io_struct, request)); + + if (new_bfqq && !wr_from_too_long(new_bfqq) && + likely(new_bfqq != &bfqd->oom_bfqq) && + bfq_may_be_close_cooperator(bfqq, new_bfqq)) + return bfq_setup_merge(bfqq, new_bfqq); + + return NULL; +} + +static void bfq_bfqq_save_state(struct bfq_queue *bfqq) +{ + struct bfq_io_cq *bic = bfqq->bic; + + /* + * If !bfqq->bic, the queue is already shared or its requests + * have already been redirected to a shared queue; both idle window + * and weight raising state have already been saved. Do nothing. + */ + if (!bic) + return; + + bic->saved_ttime = bfqq->ttime; + bic->saved_idle_window = bfq_bfqq_idle_window(bfqq); + bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq); + bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq); + bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node); + bic->saved_wr_coeff = bfqq->wr_coeff; + bic->saved_wr_start_at_switch_to_srt = bfqq->wr_start_at_switch_to_srt; + bic->saved_last_wr_start_finish = bfqq->last_wr_start_finish; + bic->saved_wr_cur_max_time = bfqq->wr_cur_max_time; +} + +static void +bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic, + struct bfq_queue *bfqq, struct bfq_queue *new_bfqq) +{ + bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu", + (unsigned long)new_bfqq->pid); + /* Save weight raising and idle window of the merged queues */ + bfq_bfqq_save_state(bfqq); + bfq_bfqq_save_state(new_bfqq); + if (bfq_bfqq_IO_bound(bfqq)) + bfq_mark_bfqq_IO_bound(new_bfqq); + bfq_clear_bfqq_IO_bound(bfqq); + + /* + * If bfqq is weight-raised, then let new_bfqq inherit + * weight-raising. To reduce false positives, neglect the case + * where bfqq has just been created, but has not yet made it + * to be weight-raised (which may happen because EQM may merge + * bfqq even before bfq_add_request is executed for the first + * time for bfqq). Handling this case would however be very + * easy, thanks to the flag just_created. + */ + if (new_bfqq->wr_coeff == 1 && bfqq->wr_coeff > 1) { + new_bfqq->wr_coeff = bfqq->wr_coeff; + new_bfqq->wr_cur_max_time = bfqq->wr_cur_max_time; + new_bfqq->last_wr_start_finish = bfqq->last_wr_start_finish; + new_bfqq->wr_start_at_switch_to_srt = + bfqq->wr_start_at_switch_to_srt; + if (bfq_bfqq_busy(new_bfqq)) + bfqd->wr_busy_queues++; + new_bfqq->entity.prio_changed = 1; + } + + if (bfqq->wr_coeff > 1) { /* bfqq has given its wr to new_bfqq */ + bfqq->wr_coeff = 1; + bfqq->entity.prio_changed = 1; + if (bfq_bfqq_busy(bfqq)) + bfqd->wr_busy_queues--; + } + + bfq_log_bfqq(bfqd, new_bfqq, "merge_bfqqs: wr_busy %d", + bfqd->wr_busy_queues); + + /* + * Merge queues (that is, let bic redirect its requests to new_bfqq) + */ + bic_set_bfqq(bic, new_bfqq, 1); + bfq_mark_bfqq_coop(new_bfqq); + /* + * new_bfqq now belongs to at least two bics (it is a shared queue): + * set new_bfqq->bic to NULL. bfqq either: + * - does not belong to any bic any more, and hence bfqq->bic must + * be set to NULL, or + * - is a queue whose owning bics have already been redirected to a + * different queue, hence the queue is destined to not belong to + * any bic soon and bfqq->bic is already NULL (therefore the next + * assignment causes no harm). + */ + new_bfqq->bic = NULL; + bfqq->bic = NULL; + /* release process reference to bfqq */ + bfq_put_queue(bfqq); +} + +static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq, + struct bio *bio) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + bool is_sync = op_is_sync(bio->bi_opf); + struct bfq_queue *bfqq = bfqd->bio_bfqq, *new_bfqq; + + /* + * Disallow merge of a sync bio into an async request. + */ + if (is_sync && !rq_is_sync(rq)) + return false; + + /* + * Lookup the bfqq that this bio will be queued with. Allow + * merge only if rq is queued there. + */ + if (!bfqq) + return false; + + /* + * We take advantage of this function to perform an early merge + * of the queues of possible cooperating processes. + */ + new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false); + if (new_bfqq) { + /* + * bic still points to bfqq, then it has not yet been + * redirected to some other bfq_queue, and a queue + * merge beween bfqq and new_bfqq can be safely + * fulfillled, i.e., bic can be redirected to new_bfqq + * and bfqq can be put. + */ + bfq_merge_bfqqs(bfqd, bfqd->bio_bic, bfqq, + new_bfqq); + /* + * If we get here, bio will be queued into new_queue, + * so use new_bfqq to decide whether bio and rq can be + * merged. + */ + bfqq = new_bfqq; + + /* + * Change also bqfd->bio_bfqq, as + * bfqd->bio_bic now points to new_bfqq, and + * this function may be invoked again (and then may + * use again bqfd->bio_bfqq). + */ + bfqd->bio_bfqq = bfqq; + } + + return bfqq == RQ_BFQQ(rq); +} + +/* + * Set the maximum time for the in-service queue to consume its + * budget. This prevents seeky processes from lowering the throughput. + * In practice, a time-slice service scheme is used with seeky + * processes. + */ +static void bfq_set_budget_timeout(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + unsigned int timeout_coeff; + + if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time) + timeout_coeff = 1; + else + timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight; + + bfqd->last_budget_start = ktime_get(); + + bfqq->budget_timeout = jiffies + + bfqd->bfq_timeout * timeout_coeff; +} + +static void __bfq_set_in_service_queue(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + if (bfqq) { + bfqg_stats_update_avg_queue_size(bfqq_group(bfqq)); + bfq_clear_bfqq_fifo_expire(bfqq); + + bfqd->budgets_assigned = (bfqd->budgets_assigned * 7 + 256) / 8; + + if (time_is_before_jiffies(bfqq->last_wr_start_finish) && + bfqq->wr_coeff > 1 && + bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time && + time_is_before_jiffies(bfqq->budget_timeout)) { + /* + * For soft real-time queues, move the start + * of the weight-raising period forward by the + * time the queue has not received any + * service. Otherwise, a relatively long + * service delay is likely to cause the + * weight-raising period of the queue to end, + * because of the short duration of the + * weight-raising period of a soft real-time + * queue. It is worth noting that this move + * is not so dangerous for the other queues, + * because soft real-time queues are not + * greedy. + * + * To not add a further variable, we use the + * overloaded field budget_timeout to + * determine for how long the queue has not + * received service, i.e., how much time has + * elapsed since the queue expired. However, + * this is a little imprecise, because + * budget_timeout is set to jiffies if bfqq + * not only expires, but also remains with no + * request. + */ + if (time_after(bfqq->budget_timeout, + bfqq->last_wr_start_finish)) + bfqq->last_wr_start_finish += + jiffies - bfqq->budget_timeout; + else + bfqq->last_wr_start_finish = jiffies; + } + + bfq_set_budget_timeout(bfqd, bfqq); + bfq_log_bfqq(bfqd, bfqq, + "set_in_service_queue, cur-budget = %d", + bfqq->entity.budget); + } + + bfqd->in_service_queue = bfqq; +} + +/* + * Get and set a new queue for service. + */ +static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq = bfq_get_next_queue(bfqd); + + __bfq_set_in_service_queue(bfqd, bfqq); + return bfqq; +} + +static void bfq_arm_slice_timer(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq = bfqd->in_service_queue; + u32 sl; + + bfq_mark_bfqq_wait_request(bfqq); + + /* + * We don't want to idle for seeks, but we do want to allow + * fair distribution of slice time for a process doing back-to-back + * seeks. So allow a little bit of time for him to submit a new rq. + */ + sl = bfqd->bfq_slice_idle; + /* + * Unless the queue is being weight-raised or the scenario is + * asymmetric, grant only minimum idle time if the queue + * is seeky. A long idling is preserved for a weight-raised + * queue, or, more in general, in an asymmetric scenario, + * because a long idling is needed for guaranteeing to a queue + * its reserved share of the throughput (in particular, it is + * needed if the queue has a higher weight than some other + * queue). + */ + if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 && + bfq_symmetric_scenario(bfqd)) + sl = min_t(u64, sl, BFQ_MIN_TT); + + bfqd->last_idling_start = ktime_get(); + hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl), + HRTIMER_MODE_REL); + bfqg_stats_set_start_idle_time(bfqq_group(bfqq)); +} + +/* + * In autotuning mode, max_budget is dynamically recomputed as the + * amount of sectors transferred in timeout at the estimated peak + * rate. This enables BFQ to utilize a full timeslice with a full + * budget, even if the in-service queue is served at peak rate. And + * this maximises throughput with sequential workloads. + */ +static unsigned long bfq_calc_max_budget(struct bfq_data *bfqd) +{ + return (u64)bfqd->peak_rate * USEC_PER_MSEC * + jiffies_to_msecs(bfqd->bfq_timeout)>>BFQ_RATE_SHIFT; +} + +/* + * Update parameters related to throughput and responsiveness, as a + * function of the estimated peak rate. See comments on + * bfq_calc_max_budget(), and on T_slow and T_fast arrays. + */ +static void update_thr_responsiveness_params(struct bfq_data *bfqd) +{ + int dev_type = blk_queue_nonrot(bfqd->queue); + + if (bfqd->bfq_user_max_budget == 0) + bfqd->bfq_max_budget = + bfq_calc_max_budget(bfqd); + + if (bfqd->device_speed == BFQ_BFQD_FAST && + bfqd->peak_rate < device_speed_thresh[dev_type]) { + bfqd->device_speed = BFQ_BFQD_SLOW; + bfqd->RT_prod = R_slow[dev_type] * + T_slow[dev_type]; + } else if (bfqd->device_speed == BFQ_BFQD_SLOW && + bfqd->peak_rate > device_speed_thresh[dev_type]) { + bfqd->device_speed = BFQ_BFQD_FAST; + bfqd->RT_prod = R_fast[dev_type] * + T_fast[dev_type]; + } + + bfq_log(bfqd, +"dev_type %s dev_speed_class = %s (%llu sects/sec), thresh %llu setcs/sec", + dev_type == 0 ? "ROT" : "NONROT", + bfqd->device_speed == BFQ_BFQD_FAST ? "FAST" : "SLOW", + bfqd->device_speed == BFQ_BFQD_FAST ? + (USEC_PER_SEC*(u64)R_fast[dev_type])>>BFQ_RATE_SHIFT : + (USEC_PER_SEC*(u64)R_slow[dev_type])>>BFQ_RATE_SHIFT, + (USEC_PER_SEC*(u64)device_speed_thresh[dev_type])>> + BFQ_RATE_SHIFT); +} + +static void bfq_reset_rate_computation(struct bfq_data *bfqd, + struct request *rq) +{ + if (rq != NULL) { /* new rq dispatch now, reset accordingly */ + bfqd->last_dispatch = bfqd->first_dispatch = ktime_get_ns(); + bfqd->peak_rate_samples = 1; + bfqd->sequential_samples = 0; + bfqd->tot_sectors_dispatched = bfqd->last_rq_max_size = + blk_rq_sectors(rq); + } else /* no new rq dispatched, just reset the number of samples */ + bfqd->peak_rate_samples = 0; /* full re-init on next disp. */ + + bfq_log(bfqd, + "reset_rate_computation at end, sample %u/%u tot_sects %llu", + bfqd->peak_rate_samples, bfqd->sequential_samples, + bfqd->tot_sectors_dispatched); +} + +static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq) +{ + u32 rate, weight, divisor; + + /* + * For the convergence property to hold (see comments on + * bfq_update_peak_rate()) and for the assessment to be + * reliable, a minimum number of samples must be present, and + * a minimum amount of time must have elapsed. If not so, do + * not compute new rate. Just reset parameters, to get ready + * for a new evaluation attempt. + */ + if (bfqd->peak_rate_samples < BFQ_RATE_MIN_SAMPLES || + bfqd->delta_from_first < BFQ_RATE_MIN_INTERVAL) + goto reset_computation; + + /* + * If a new request completion has occurred after last + * dispatch, then, to approximate the rate at which requests + * have been served by the device, it is more precise to + * extend the observation interval to the last completion. + */ + bfqd->delta_from_first = + max_t(u64, bfqd->delta_from_first, + bfqd->last_completion - bfqd->first_dispatch); + + /* + * Rate computed in sects/usec, and not sects/nsec, for + * precision issues. + */ + rate = div64_ul(bfqd->tot_sectors_dispatched<<BFQ_RATE_SHIFT, + div_u64(bfqd->delta_from_first, NSEC_PER_USEC)); + + /* + * Peak rate not updated if: + * - the percentage of sequential dispatches is below 3/4 of the + * total, and rate is below the current estimated peak rate + * - rate is unreasonably high (> 20M sectors/sec) + */ + if ((bfqd->sequential_samples < (3 * bfqd->peak_rate_samples)>>2 && + rate <= bfqd->peak_rate) || + rate > 20<<BFQ_RATE_SHIFT) + goto reset_computation; + + /* + * We have to update the peak rate, at last! To this purpose, + * we use a low-pass filter. We compute the smoothing constant + * of the filter as a function of the 'weight' of the new + * measured rate. + * + * As can be seen in next formulas, we define this weight as a + * quantity proportional to how sequential the workload is, + * and to how long the observation time interval is. + * + * The weight runs from 0 to 8. The maximum value of the + * weight, 8, yields the minimum value for the smoothing + * constant. At this minimum value for the smoothing constant, + * the measured rate contributes for half of the next value of + * the estimated peak rate. + * + * So, the first step is to compute the weight as a function + * of how sequential the workload is. Note that the weight + * cannot reach 9, because bfqd->sequential_samples cannot + * become equal to bfqd->peak_rate_samples, which, in its + * turn, holds true because bfqd->sequential_samples is not + * incremented for the first sample. + */ + weight = (9 * bfqd->sequential_samples) / bfqd->peak_rate_samples; + + /* + * Second step: further refine the weight as a function of the + * duration of the observation interval. + */ + weight = min_t(u32, 8, + div_u64(weight * bfqd->delta_from_first, + BFQ_RATE_REF_INTERVAL)); + + /* + * Divisor ranging from 10, for minimum weight, to 2, for + * maximum weight. + */ + divisor = 10 - weight; + + /* + * Finally, update peak rate: + * + * peak_rate = peak_rate * (divisor-1) / divisor + rate / divisor + */ + bfqd->peak_rate *= divisor-1; + bfqd->peak_rate /= divisor; + rate /= divisor; /* smoothing constant alpha = 1/divisor */ + + bfqd->peak_rate += rate; + update_thr_responsiveness_params(bfqd); + +reset_computation: + bfq_reset_rate_computation(bfqd, rq); +} + +/* + * Update the read/write peak rate (the main quantity used for + * auto-tuning, see update_thr_responsiveness_params()). + * + * It is not trivial to estimate the peak rate (correctly): because of + * the presence of sw and hw queues between the scheduler and the + * device components that finally serve I/O requests, it is hard to + * say exactly when a given dispatched request is served inside the + * device, and for how long. As a consequence, it is hard to know + * precisely at what rate a given set of requests is actually served + * by the device. + * + * On the opposite end, the dispatch time of any request is trivially + * available, and, from this piece of information, the "dispatch rate" + * of requests can be immediately computed. So, the idea in the next + * function is to use what is known, namely request dispatch times + * (plus, when useful, request completion times), to estimate what is + * unknown, namely in-device request service rate. + * + * The main issue is that, because of the above facts, the rate at + * which a certain set of requests is dispatched over a certain time + * interval can vary greatly with respect to the rate at which the + * same requests are then served. But, since the size of any + * intermediate queue is limited, and the service scheme is lossless + * (no request is silently dropped), the following obvious convergence + * property holds: the number of requests dispatched MUST become + * closer and closer to the number of requests completed as the + * observation interval grows. This is the key property used in + * the next function to estimate the peak service rate as a function + * of the observed dispatch rate. The function assumes to be invoked + * on every request dispatch. + */ +static void bfq_update_peak_rate(struct bfq_data *bfqd, struct request *rq) +{ + u64 now_ns = ktime_get_ns(); + + if (bfqd->peak_rate_samples == 0) { /* first dispatch */ + bfq_log(bfqd, "update_peak_rate: goto reset, samples %d", + bfqd->peak_rate_samples); + bfq_reset_rate_computation(bfqd, rq); + goto update_last_values; /* will add one sample */ + } + + /* + * Device idle for very long: the observation interval lasting + * up to this dispatch cannot be a valid observation interval + * for computing a new peak rate (similarly to the late- + * completion event in bfq_completed_request()). Go to + * update_rate_and_reset to have the following three steps + * taken: + * - close the observation interval at the last (previous) + * request dispatch or completion + * - compute rate, if possible, for that observation interval + * - start a new observation interval with this dispatch + */ + if (now_ns - bfqd->last_dispatch > 100*NSEC_PER_MSEC && + bfqd->rq_in_driver == 0) + goto update_rate_and_reset; + + /* Update sampling information */ + bfqd->peak_rate_samples++; + + if ((bfqd->rq_in_driver > 0 || + now_ns - bfqd->last_completion < BFQ_MIN_TT) + && get_sdist(bfqd->last_position, rq) < BFQQ_SEEK_THR) + bfqd->sequential_samples++; + + bfqd->tot_sectors_dispatched += blk_rq_sectors(rq); + + /* Reset max observed rq size every 32 dispatches */ + if (likely(bfqd->peak_rate_samples % 32)) + bfqd->last_rq_max_size = + max_t(u32, blk_rq_sectors(rq), bfqd->last_rq_max_size); + else + bfqd->last_rq_max_size = blk_rq_sectors(rq); + + bfqd->delta_from_first = now_ns - bfqd->first_dispatch; + + /* Target observation interval not yet reached, go on sampling */ + if (bfqd->delta_from_first < BFQ_RATE_REF_INTERVAL) + goto update_last_values; + +update_rate_and_reset: + bfq_update_rate_reset(bfqd, rq); +update_last_values: + bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq); + bfqd->last_dispatch = now_ns; +} + +/* + * Remove request from internal lists. + */ +static void bfq_dispatch_remove(struct request_queue *q, struct request *rq) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq); + + /* + * For consistency, the next instruction should have been + * executed after removing the request from the queue and + * dispatching it. We execute instead this instruction before + * bfq_remove_request() (and hence introduce a temporary + * inconsistency), for efficiency. In fact, should this + * dispatch occur for a non in-service bfqq, this anticipated + * increment prevents two counters related to bfqq->dispatched + * from risking to be, first, uselessly decremented, and then + * incremented again when the (new) value of bfqq->dispatched + * happens to be taken into account. + */ + bfqq->dispatched++; + bfq_update_peak_rate(q->elevator->elevator_data, rq); + + bfq_remove_request(q, rq); +} + +static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + /* + * If this bfqq is shared between multiple processes, check + * to make sure that those processes are still issuing I/Os + * within the mean seek distance. If not, it may be time to + * break the queues apart again. + */ + if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq)) + bfq_mark_bfqq_split_coop(bfqq); + + if (RB_EMPTY_ROOT(&bfqq->sort_list)) { + if (bfqq->dispatched == 0) + /* + * Overloading budget_timeout field to store + * the time at which the queue remains with no + * backlog and no outstanding request; used by + * the weight-raising mechanism. + */ + bfqq->budget_timeout = jiffies; + + bfq_del_bfqq_busy(bfqd, bfqq, true); + } else { + bfq_requeue_bfqq(bfqd, bfqq); + /* + * Resort priority tree of potential close cooperators. + */ + bfq_pos_tree_add_move(bfqd, bfqq); + } + + /* + * All in-service entities must have been properly deactivated + * or requeued before executing the next function, which + * resets all in-service entites as no more in service. + */ + __bfq_bfqd_reset_in_service(bfqd); +} + +/** + * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior. + * @bfqd: device data. + * @bfqq: queue to update. + * @reason: reason for expiration. + * + * Handle the feedback on @bfqq budget at queue expiration. + * See the body for detailed comments. + */ +static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + enum bfqq_expiration reason) +{ + struct request *next_rq; + int budget, min_budget; + + min_budget = bfq_min_budget(bfqd); + + if (bfqq->wr_coeff == 1) + budget = bfqq->max_budget; + else /* + * Use a constant, low budget for weight-raised queues, + * to help achieve a low latency. Keep it slightly higher + * than the minimum possible budget, to cause a little + * bit fewer expirations. + */ + budget = 2 * min_budget; + + bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d", + bfqq->entity.budget, bfq_bfqq_budget_left(bfqq)); + bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d", + budget, bfq_min_budget(bfqd)); + bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d", + bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue)); + + if (bfq_bfqq_sync(bfqq) && bfqq->wr_coeff == 1) { + switch (reason) { + /* + * Caveat: in all the following cases we trade latency + * for throughput. + */ + case BFQQE_TOO_IDLE: + /* + * This is the only case where we may reduce + * the budget: if there is no request of the + * process still waiting for completion, then + * we assume (tentatively) that the timer has + * expired because the batch of requests of + * the process could have been served with a + * smaller budget. Hence, betting that + * process will behave in the same way when it + * becomes backlogged again, we reduce its + * next budget. As long as we guess right, + * this budget cut reduces the latency + * experienced by the process. + * + * However, if there are still outstanding + * requests, then the process may have not yet + * issued its next request just because it is + * still waiting for the completion of some of + * the still outstanding ones. So in this + * subcase we do not reduce its budget, on the + * contrary we increase it to possibly boost + * the throughput, as discussed in the + * comments to the BUDGET_TIMEOUT case. + */ + if (bfqq->dispatched > 0) /* still outstanding reqs */ + budget = min(budget * 2, bfqd->bfq_max_budget); + else { + if (budget > 5 * min_budget) + budget -= 4 * min_budget; + else + budget = min_budget; + } + break; + case BFQQE_BUDGET_TIMEOUT: + /* + * We double the budget here because it gives + * the chance to boost the throughput if this + * is not a seeky process (and has bumped into + * this timeout because of, e.g., ZBR). + */ + budget = min(budget * 2, bfqd->bfq_max_budget); + break; + case BFQQE_BUDGET_EXHAUSTED: + /* + * The process still has backlog, and did not + * let either the budget timeout or the disk + * idling timeout expire. Hence it is not + * seeky, has a short thinktime and may be + * happy with a higher budget too. So + * definitely increase the budget of this good + * candidate to boost the disk throughput. + */ + budget = min(budget * 4, bfqd->bfq_max_budget); + break; + case BFQQE_NO_MORE_REQUESTS: + /* + * For queues that expire for this reason, it + * is particularly important to keep the + * budget close to the actual service they + * need. Doing so reduces the timestamp + * misalignment problem described in the + * comments in the body of + * __bfq_activate_entity. In fact, suppose + * that a queue systematically expires for + * BFQQE_NO_MORE_REQUESTS and presents a + * new request in time to enjoy timestamp + * back-shifting. The larger the budget of the + * queue is with respect to the service the + * queue actually requests in each service + * slot, the more times the queue can be + * reactivated with the same virtual finish + * time. It follows that, even if this finish + * time is pushed to the system virtual time + * to reduce the consequent timestamp + * misalignment, the queue unjustly enjoys for + * many re-activations a lower finish time + * than all newly activated queues. + * + * The service needed by bfqq is measured + * quite precisely by bfqq->entity.service. + * Since bfqq does not enjoy device idling, + * bfqq->entity.service is equal to the number + * of sectors that the process associated with + * bfqq requested to read/write before waiting + * for request completions, or blocking for + * other reasons. + */ + budget = max_t(int, bfqq->entity.service, min_budget); + break; + default: + return; + } + } else if (!bfq_bfqq_sync(bfqq)) { + /* + * Async queues get always the maximum possible + * budget, as for them we do not care about latency + * (in addition, their ability to dispatch is limited + * by the charging factor). + */ + budget = bfqd->bfq_max_budget; + } + + bfqq->max_budget = budget; + + if (bfqd->budgets_assigned >= bfq_stats_min_budgets && + !bfqd->bfq_user_max_budget) + bfqq->max_budget = min(bfqq->max_budget, bfqd->bfq_max_budget); + + /* + * If there is still backlog, then assign a new budget, making + * sure that it is large enough for the next request. Since + * the finish time of bfqq must be kept in sync with the + * budget, be sure to call __bfq_bfqq_expire() *after* this + * update. + * + * If there is no backlog, then no need to update the budget; + * it will be updated on the arrival of a new request. + */ + next_rq = bfqq->next_rq; + if (next_rq) + bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget, + bfq_serv_to_charge(next_rq, bfqq)); + + bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %d", + next_rq ? blk_rq_sectors(next_rq) : 0, + bfqq->entity.budget); +} + +/* + * Return true if the process associated with bfqq is "slow". The slow + * flag is used, in addition to the budget timeout, to reduce the + * amount of service provided to seeky processes, and thus reduce + * their chances to lower the throughput. More details in the comments + * on the function bfq_bfqq_expire(). + * + * An important observation is in order: as discussed in the comments + * on the function bfq_update_peak_rate(), with devices with internal + * queues, it is hard if ever possible to know when and for how long + * an I/O request is processed by the device (apart from the trivial + * I/O pattern where a new request is dispatched only after the + * previous one has been completed). This makes it hard to evaluate + * the real rate at which the I/O requests of each bfq_queue are + * served. In fact, for an I/O scheduler like BFQ, serving a + * bfq_queue means just dispatching its requests during its service + * slot (i.e., until the budget of the queue is exhausted, or the + * queue remains idle, or, finally, a timeout fires). But, during the + * service slot of a bfq_queue, around 100 ms at most, the device may + * be even still processing requests of bfq_queues served in previous + * service slots. On the opposite end, the requests of the in-service + * bfq_queue may be completed after the service slot of the queue + * finishes. + * + * Anyway, unless more sophisticated solutions are used + * (where possible), the sum of the sizes of the requests dispatched + * during the service slot of a bfq_queue is probably the only + * approximation available for the service received by the bfq_queue + * during its service slot. And this sum is the quantity used in this + * function to evaluate the I/O speed of a process. + */ +static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq, + bool compensate, enum bfqq_expiration reason, + unsigned long *delta_ms) +{ + ktime_t delta_ktime; + u32 delta_usecs; + bool slow = BFQQ_SEEKY(bfqq); /* if delta too short, use seekyness */ + + if (!bfq_bfqq_sync(bfqq)) + return false; + + if (compensate) + delta_ktime = bfqd->last_idling_start; + else + delta_ktime = ktime_get(); + delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start); + delta_usecs = ktime_to_us(delta_ktime); + + /* don't use too short time intervals */ + if (delta_usecs < 1000) { + if (blk_queue_nonrot(bfqd->queue)) + /* + * give same worst-case guarantees as idling + * for seeky + */ + *delta_ms = BFQ_MIN_TT / NSEC_PER_MSEC; + else /* charge at least one seek */ + *delta_ms = bfq_slice_idle / NSEC_PER_MSEC; + + return slow; + } + + *delta_ms = delta_usecs / USEC_PER_MSEC; + + /* + * Use only long (> 20ms) intervals to filter out excessive + * spikes in service rate estimation. + */ + if (delta_usecs > 20000) { + /* + * Caveat for rotational devices: processes doing I/O + * in the slower disk zones tend to be slow(er) even + * if not seeky. In this respect, the estimated peak + * rate is likely to be an average over the disk + * surface. Accordingly, to not be too harsh with + * unlucky processes, a process is deemed slow only if + * its rate has been lower than half of the estimated + * peak rate. + */ + slow = bfqq->entity.service < bfqd->bfq_max_budget / 2; + } + + bfq_log_bfqq(bfqd, bfqq, "bfq_bfqq_is_slow: slow %d", slow); + + return slow; +} + +/* + * To be deemed as soft real-time, an application must meet two + * requirements. First, the application must not require an average + * bandwidth higher than the approximate bandwidth required to playback or + * record a compressed high-definition video. + * The next function is invoked on the completion of the last request of a + * batch, to compute the next-start time instant, soft_rt_next_start, such + * that, if the next request of the application does not arrive before + * soft_rt_next_start, then the above requirement on the bandwidth is met. + * + * The second requirement is that the request pattern of the application is + * isochronous, i.e., that, after issuing a request or a batch of requests, + * the application stops issuing new requests until all its pending requests + * have been completed. After that, the application may issue a new batch, + * and so on. + * For this reason the next function is invoked to compute + * soft_rt_next_start only for applications that meet this requirement, + * whereas soft_rt_next_start is set to infinity for applications that do + * not. + * + * Unfortunately, even a greedy application may happen to behave in an + * isochronous way if the CPU load is high. In fact, the application may + * stop issuing requests while the CPUs are busy serving other processes, + * then restart, then stop again for a while, and so on. In addition, if + * the disk achieves a low enough throughput with the request pattern + * issued by the application (e.g., because the request pattern is random + * and/or the device is slow), then the application may meet the above + * bandwidth requirement too. To prevent such a greedy application to be + * deemed as soft real-time, a further rule is used in the computation of + * soft_rt_next_start: soft_rt_next_start must be higher than the current + * time plus the maximum time for which the arrival of a request is waited + * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle. + * This filters out greedy applications, as the latter issue instead their + * next request as soon as possible after the last one has been completed + * (in contrast, when a batch of requests is completed, a soft real-time + * application spends some time processing data). + * + * Unfortunately, the last filter may easily generate false positives if + * only bfqd->bfq_slice_idle is used as a reference time interval and one + * or both the following cases occur: + * 1) HZ is so low that the duration of a jiffy is comparable to or higher + * than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with + * HZ=100. + * 2) jiffies, instead of increasing at a constant rate, may stop increasing + * for a while, then suddenly 'jump' by several units to recover the lost + * increments. This seems to happen, e.g., inside virtual machines. + * To address this issue, we do not use as a reference time interval just + * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In + * particular we add the minimum number of jiffies for which the filter + * seems to be quite precise also in embedded systems and KVM/QEMU virtual + * machines. + */ +static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + return max(bfqq->last_idle_bklogged + + HZ * bfqq->service_from_backlogged / + bfqd->bfq_wr_max_softrt_rate, + jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4); +} + +/* + * Return the farthest future time instant according to jiffies + * macros. + */ +static unsigned long bfq_greatest_from_now(void) +{ + return jiffies + MAX_JIFFY_OFFSET; +} + +/* + * Return the farthest past time instant according to jiffies + * macros. + */ +static unsigned long bfq_smallest_from_now(void) +{ + return jiffies - MAX_JIFFY_OFFSET; +} + +/** + * bfq_bfqq_expire - expire a queue. + * @bfqd: device owning the queue. + * @bfqq: the queue to expire. + * @compensate: if true, compensate for the time spent idling. + * @reason: the reason causing the expiration. + * + * If the process associated with bfqq does slow I/O (e.g., because it + * issues random requests), we charge bfqq with the time it has been + * in service instead of the service it has received (see + * bfq_bfqq_charge_time for details on how this goal is achieved). As + * a consequence, bfqq will typically get higher timestamps upon + * reactivation, and hence it will be rescheduled as if it had + * received more service than what it has actually received. In the + * end, bfqq receives less service in proportion to how slowly its + * associated process consumes its budgets (and hence how seriously it + * tends to lower the throughput). In addition, this time-charging + * strategy guarantees time fairness among slow processes. In + * contrast, if the process associated with bfqq is not slow, we + * charge bfqq exactly with the service it has received. + * + * Charging time to the first type of queues and the exact service to + * the other has the effect of using the WF2Q+ policy to schedule the + * former on a timeslice basis, without violating service domain + * guarantees among the latter. + */ +void bfq_bfqq_expire(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + bool compensate, + enum bfqq_expiration reason) +{ + bool slow; + unsigned long delta = 0; + struct bfq_entity *entity = &bfqq->entity; + int ref; + + /* + * Check whether the process is slow (see bfq_bfqq_is_slow). + */ + slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, reason, &delta); + + /* + * Increase service_from_backlogged before next statement, + * because the possible next invocation of + * bfq_bfqq_charge_time would likely inflate + * entity->service. In contrast, service_from_backlogged must + * contain real service, to enable the soft real-time + * heuristic to correctly compute the bandwidth consumed by + * bfqq. + */ + bfqq->service_from_backlogged += entity->service; + + /* + * As above explained, charge slow (typically seeky) and + * timed-out queues with the time and not the service + * received, to favor sequential workloads. + * + * Processes doing I/O in the slower disk zones will tend to + * be slow(er) even if not seeky. Therefore, since the + * estimated peak rate is actually an average over the disk + * surface, these processes may timeout just for bad luck. To + * avoid punishing them, do not charge time to processes that + * succeeded in consuming at least 2/3 of their budget. This + * allows BFQ to preserve enough elasticity to still perform + * bandwidth, and not time, distribution with little unlucky + * or quasi-sequential processes. + */ + if (bfqq->wr_coeff == 1 && + (slow || + (reason == BFQQE_BUDGET_TIMEOUT && + bfq_bfqq_budget_left(bfqq) >= entity->budget / 3))) + bfq_bfqq_charge_time(bfqd, bfqq, delta); + + if (reason == BFQQE_TOO_IDLE && + entity->service <= 2 * entity->budget / 10) + bfq_clear_bfqq_IO_bound(bfqq); + + if (bfqd->low_latency && bfqq->wr_coeff == 1) + bfqq->last_wr_start_finish = jiffies; + + if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 && + RB_EMPTY_ROOT(&bfqq->sort_list)) { + /* + * If we get here, and there are no outstanding + * requests, then the request pattern is isochronous + * (see the comments on the function + * bfq_bfqq_softrt_next_start()). Thus we can compute + * soft_rt_next_start. If, instead, the queue still + * has outstanding requests, then we have to wait for + * the completion of all the outstanding requests to + * discover whether the request pattern is actually + * isochronous. + */ + if (bfqq->dispatched == 0) + bfqq->soft_rt_next_start = + bfq_bfqq_softrt_next_start(bfqd, bfqq); + else { + /* + * The application is still waiting for the + * completion of one or more requests: + * prevent it from possibly being incorrectly + * deemed as soft real-time by setting its + * soft_rt_next_start to infinity. In fact, + * without this assignment, the application + * would be incorrectly deemed as soft + * real-time if: + * 1) it issued a new request before the + * completion of all its in-flight + * requests, and + * 2) at that time, its soft_rt_next_start + * happened to be in the past. + */ + bfqq->soft_rt_next_start = + bfq_greatest_from_now(); + /* + * Schedule an update of soft_rt_next_start to when + * the task may be discovered to be isochronous. + */ + bfq_mark_bfqq_softrt_update(bfqq); + } + } + + bfq_log_bfqq(bfqd, bfqq, + "expire (%d, slow %d, num_disp %d, idle_win %d)", reason, + slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq)); + + /* + * Increase, decrease or leave budget unchanged according to + * reason. + */ + __bfq_bfqq_recalc_budget(bfqd, bfqq, reason); + ref = bfqq->ref; + __bfq_bfqq_expire(bfqd, bfqq); + + /* mark bfqq as waiting a request only if a bic still points to it */ + if (ref > 1 && !bfq_bfqq_busy(bfqq) && + reason != BFQQE_BUDGET_TIMEOUT && + reason != BFQQE_BUDGET_EXHAUSTED) + bfq_mark_bfqq_non_blocking_wait_rq(bfqq); +} + +/* + * Budget timeout is not implemented through a dedicated timer, but + * just checked on request arrivals and completions, as well as on + * idle timer expirations. + */ +static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq) +{ + return time_is_before_eq_jiffies(bfqq->budget_timeout); +} + +/* + * If we expire a queue that is actively waiting (i.e., with the + * device idled) for the arrival of a new request, then we may incur + * the timestamp misalignment problem described in the body of the + * function __bfq_activate_entity. Hence we return true only if this + * condition does not hold, or if the queue is slow enough to deserve + * only to be kicked off for preserving a high throughput. + */ +static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq) +{ + bfq_log_bfqq(bfqq->bfqd, bfqq, + "may_budget_timeout: wait_request %d left %d timeout %d", + bfq_bfqq_wait_request(bfqq), + bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3, + bfq_bfqq_budget_timeout(bfqq)); + + return (!bfq_bfqq_wait_request(bfqq) || + bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3) + && + bfq_bfqq_budget_timeout(bfqq); +} + +/* + * For a queue that becomes empty, device idling is allowed only if + * this function returns true for the queue. As a consequence, since + * device idling plays a critical role in both throughput boosting and + * service guarantees, the return value of this function plays a + * critical role in both these aspects as well. + * + * In a nutshell, this function returns true only if idling is + * beneficial for throughput or, even if detrimental for throughput, + * idling is however necessary to preserve service guarantees (low + * latency, desired throughput distribution, ...). In particular, on + * NCQ-capable devices, this function tries to return false, so as to + * help keep the drives' internal queues full, whenever this helps the + * device boost the throughput without causing any service-guarantee + * issue. + * + * In more detail, the return value of this function is obtained by, + * first, computing a number of boolean variables that take into + * account throughput and service-guarantee issues, and, then, + * combining these variables in a logical expression. Most of the + * issues taken into account are not trivial. We discuss these issues + * individually while introducing the variables. + */ +static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq) +{ + struct bfq_data *bfqd = bfqq->bfqd; + bool idling_boosts_thr, idling_boosts_thr_without_issues, + idling_needed_for_service_guarantees, + asymmetric_scenario; + + if (bfqd->strict_guarantees) + return true; + + /* + * The next variable takes into account the cases where idling + * boosts the throughput. + * + * The value of the variable is computed considering, first, that + * idling is virtually always beneficial for the throughput if: + * (a) the device is not NCQ-capable, or + * (b) regardless of the presence of NCQ, the device is rotational + * and the request pattern for bfqq is I/O-bound and sequential. + * + * Secondly, and in contrast to the above item (b), idling an + * NCQ-capable flash-based device would not boost the + * throughput even with sequential I/O; rather it would lower + * the throughput in proportion to how fast the device + * is. Accordingly, the next variable is true if any of the + * above conditions (a) and (b) is true, and, in particular, + * happens to be false if bfqd is an NCQ-capable flash-based + * device. + */ + idling_boosts_thr = !bfqd->hw_tag || + (!blk_queue_nonrot(bfqd->queue) && bfq_bfqq_IO_bound(bfqq) && + bfq_bfqq_idle_window(bfqq)); + + /* + * The value of the next variable, + * idling_boosts_thr_without_issues, is equal to that of + * idling_boosts_thr, unless a special case holds. In this + * special case, described below, idling may cause problems to + * weight-raised queues. + * + * When the request pool is saturated (e.g., in the presence + * of write hogs), if the processes associated with + * non-weight-raised queues ask for requests at a lower rate, + * then processes associated with weight-raised queues have a + * higher probability to get a request from the pool + * immediately (or at least soon) when they need one. Thus + * they have a higher probability to actually get a fraction + * of the device throughput proportional to their high + * weight. This is especially true with NCQ-capable drives, + * which enqueue several requests in advance, and further + * reorder internally-queued requests. + * + * For this reason, we force to false the value of + * idling_boosts_thr_without_issues if there are weight-raised + * busy queues. In this case, and if bfqq is not weight-raised, + * this guarantees that the device is not idled for bfqq (if, + * instead, bfqq is weight-raised, then idling will be + * guaranteed by another variable, see below). Combined with + * the timestamping rules of BFQ (see [1] for details), this + * behavior causes bfqq, and hence any sync non-weight-raised + * queue, to get a lower number of requests served, and thus + * to ask for a lower number of requests from the request + * pool, before the busy weight-raised queues get served + * again. This often mitigates starvation problems in the + * presence of heavy write workloads and NCQ, thereby + * guaranteeing a higher application and system responsiveness + * in these hostile scenarios. + */ + idling_boosts_thr_without_issues = idling_boosts_thr && + bfqd->wr_busy_queues == 0; + + /* + * There is then a case where idling must be performed not + * for throughput concerns, but to preserve service + * guarantees. + * + * To introduce this case, we can note that allowing the drive + * to enqueue more than one request at a time, and hence + * delegating de facto final scheduling decisions to the + * drive's internal scheduler, entails loss of control on the + * actual request service order. In particular, the critical + * situation is when requests from different processes happen + * to be present, at the same time, in the internal queue(s) + * of the drive. In such a situation, the drive, by deciding + * the service order of the internally-queued requests, does + * determine also the actual throughput distribution among + * these processes. But the drive typically has no notion or + * concern about per-process throughput distribution, and + * makes its decisions only on a per-request basis. Therefore, + * the service distribution enforced by the drive's internal + * scheduler is likely to coincide with the desired + * device-throughput distribution only in a completely + * symmetric scenario where: + * (i) each of these processes must get the same throughput as + * the others; + * (ii) all these processes have the same I/O pattern + (either sequential or random). + * In fact, in such a scenario, the drive will tend to treat + * the requests of each of these processes in about the same + * way as the requests of the others, and thus to provide + * each of these processes with about the same throughput + * (which is exactly the desired throughput distribution). In + * contrast, in any asymmetric scenario, device idling is + * certainly needed to guarantee that bfqq receives its + * assigned fraction of the device throughput (see [1] for + * details). + * + * We address this issue by controlling, actually, only the + * symmetry sub-condition (i), i.e., provided that + * sub-condition (i) holds, idling is not performed, + * regardless of whether sub-condition (ii) holds. In other + * words, only if sub-condition (i) holds, then idling is + * allowed, and the device tends to be prevented from queueing + * many requests, possibly of several processes. The reason + * for not controlling also sub-condition (ii) is that we + * exploit preemption to preserve guarantees in case of + * symmetric scenarios, even if (ii) does not hold, as + * explained in the next two paragraphs. + * + * Even if a queue, say Q, is expired when it remains idle, Q + * can still preempt the new in-service queue if the next + * request of Q arrives soon (see the comments on + * bfq_bfqq_update_budg_for_activation). If all queues and + * groups have the same weight, this form of preemption, + * combined with the hole-recovery heuristic described in the + * comments on function bfq_bfqq_update_budg_for_activation, + * are enough to preserve a correct bandwidth distribution in + * the mid term, even without idling. In fact, even if not + * idling allows the internal queues of the device to contain + * many requests, and thus to reorder requests, we can rather + * safely assume that the internal scheduler still preserves a + * minimum of mid-term fairness. The motivation for using + * preemption instead of idling is that, by not idling, + * service guarantees are preserved without minimally + * sacrificing throughput. In other words, both a high + * throughput and its desired distribution are obtained. + * + * More precisely, this preemption-based, idleless approach + * provides fairness in terms of IOPS, and not sectors per + * second. This can be seen with a simple example. Suppose + * that there are two queues with the same weight, but that + * the first queue receives requests of 8 sectors, while the + * second queue receives requests of 1024 sectors. In + * addition, suppose that each of the two queues contains at + * most one request at a time, which implies that each queue + * always remains idle after it is served. Finally, after + * remaining idle, each queue receives very quickly a new + * request. It follows that the two queues are served + * alternatively, preempting each other if needed. This + * implies that, although both queues have the same weight, + * the queue with large requests receives a service that is + * 1024/8 times as high as the service received by the other + * queue. + * + * On the other hand, device idling is performed, and thus + * pure sector-domain guarantees are provided, for the + * following queues, which are likely to need stronger + * throughput guarantees: weight-raised queues, and queues + * with a higher weight than other queues. When such queues + * are active, sub-condition (i) is false, which triggers + * device idling. + * + * According to the above considerations, the next variable is + * true (only) if sub-condition (i) holds. To compute the + * value of this variable, we not only use the return value of + * the function bfq_symmetric_scenario(), but also check + * whether bfqq is being weight-raised, because + * bfq_symmetric_scenario() does not take into account also + * weight-raised queues (see comments on + * bfq_weights_tree_add()). + * + * As a side note, it is worth considering that the above + * device-idling countermeasures may however fail in the + * following unlucky scenario: if idling is (correctly) + * disabled in a time period during which all symmetry + * sub-conditions hold, and hence the device is allowed to + * enqueue many requests, but at some later point in time some + * sub-condition stops to hold, then it may become impossible + * to let requests be served in the desired order until all + * the requests already queued in the device have been served. + */ + asymmetric_scenario = bfqq->wr_coeff > 1 || + !bfq_symmetric_scenario(bfqd); + + /* + * Finally, there is a case where maximizing throughput is the + * best choice even if it may cause unfairness toward + * bfqq. Such a case is when bfqq became active in a burst of + * queue activations. Queues that became active during a large + * burst benefit only from throughput, as discussed in the + * comments on bfq_handle_burst. Thus, if bfqq became active + * in a burst and not idling the device maximizes throughput, + * then the device must no be idled, because not idling the + * device provides bfqq and all other queues in the burst with + * maximum benefit. Combining this and the above case, we can + * now establish when idling is actually needed to preserve + * service guarantees. + */ + idling_needed_for_service_guarantees = + asymmetric_scenario && !bfq_bfqq_in_large_burst(bfqq); + + /* + * We have now all the components we need to compute the return + * value of the function, which is true only if both the following + * conditions hold: + * 1) bfqq is sync, because idling make sense only for sync queues; + * 2) idling either boosts the throughput (without issues), or + * is necessary to preserve service guarantees. + */ + return bfq_bfqq_sync(bfqq) && + (idling_boosts_thr_without_issues || + idling_needed_for_service_guarantees); +} + +/* + * If the in-service queue is empty but the function bfq_bfqq_may_idle + * returns true, then: + * 1) the queue must remain in service and cannot be expired, and + * 2) the device must be idled to wait for the possible arrival of a new + * request for the queue. + * See the comments on the function bfq_bfqq_may_idle for the reasons + * why performing device idling is the best choice to boost the throughput + * and preserve service guarantees when bfq_bfqq_may_idle itself + * returns true. + */ +static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq) +{ + struct bfq_data *bfqd = bfqq->bfqd; + + return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 && + bfq_bfqq_may_idle(bfqq); +} + +/* + * Select a queue for service. If we have a current queue in service, + * check whether to continue servicing it, or retrieve and set a new one. + */ +static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd) +{ + struct bfq_queue *bfqq; + struct request *next_rq; + enum bfqq_expiration reason = BFQQE_BUDGET_TIMEOUT; + + bfqq = bfqd->in_service_queue; + if (!bfqq) + goto new_queue; + + bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue"); + + if (bfq_may_expire_for_budg_timeout(bfqq) && + !bfq_bfqq_wait_request(bfqq) && + !bfq_bfqq_must_idle(bfqq)) + goto expire; + +check_queue: + /* + * This loop is rarely executed more than once. Even when it + * happens, it is much more convenient to re-execute this loop + * than to return NULL and trigger a new dispatch to get a + * request served. + */ + next_rq = bfqq->next_rq; + /* + * If bfqq has requests queued and it has enough budget left to + * serve them, keep the queue, otherwise expire it. + */ + if (next_rq) { + if (bfq_serv_to_charge(next_rq, bfqq) > + bfq_bfqq_budget_left(bfqq)) { + /* + * Expire the queue for budget exhaustion, + * which makes sure that the next budget is + * enough to serve the next request, even if + * it comes from the fifo expired path. + */ + reason = BFQQE_BUDGET_EXHAUSTED; + goto expire; + } else { + /* + * The idle timer may be pending because we may + * not disable disk idling even when a new request + * arrives. + */ + if (bfq_bfqq_wait_request(bfqq)) { + /* + * If we get here: 1) at least a new request + * has arrived but we have not disabled the + * timer because the request was too small, + * 2) then the block layer has unplugged + * the device, causing the dispatch to be + * invoked. + * + * Since the device is unplugged, now the + * requests are probably large enough to + * provide a reasonable throughput. + * So we disable idling. + */ + bfq_clear_bfqq_wait_request(bfqq); + hrtimer_try_to_cancel(&bfqd->idle_slice_timer); + bfqg_stats_update_idle_time(bfqq_group(bfqq)); + } + goto keep_queue; + } + } + + /* + * No requests pending. However, if the in-service queue is idling + * for a new request, or has requests waiting for a completion and + * may idle after their completion, then keep it anyway. + */ + if (bfq_bfqq_wait_request(bfqq) || + (bfqq->dispatched != 0 && bfq_bfqq_may_idle(bfqq))) { + bfqq = NULL; + goto keep_queue; + } + + reason = BFQQE_NO_MORE_REQUESTS; +expire: + bfq_bfqq_expire(bfqd, bfqq, false, reason); +new_queue: + bfqq = bfq_set_in_service_queue(bfqd); + if (bfqq) { + bfq_log_bfqq(bfqd, bfqq, "select_queue: checking new queue"); + goto check_queue; + } +keep_queue: + if (bfqq) + bfq_log_bfqq(bfqd, bfqq, "select_queue: returned this queue"); + else + bfq_log(bfqd, "select_queue: no queue returned"); + + return bfqq; +} + +static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + + if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */ + bfq_log_bfqq(bfqd, bfqq, + "raising period dur %u/%u msec, old coeff %u, w %d(%d)", + jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish), + jiffies_to_msecs(bfqq->wr_cur_max_time), + bfqq->wr_coeff, + bfqq->entity.weight, bfqq->entity.orig_weight); + + if (entity->prio_changed) + bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change"); + + /* + * If the queue was activated in a burst, or too much + * time has elapsed from the beginning of this + * weight-raising period, then end weight raising. + */ + if (bfq_bfqq_in_large_burst(bfqq)) + bfq_bfqq_end_wr(bfqq); + else if (time_is_before_jiffies(bfqq->last_wr_start_finish + + bfqq->wr_cur_max_time)) { + if (bfqq->wr_cur_max_time != bfqd->bfq_wr_rt_max_time || + time_is_before_jiffies(bfqq->wr_start_at_switch_to_srt + + bfq_wr_duration(bfqd))) + bfq_bfqq_end_wr(bfqq); + else { + /* switch back to interactive wr */ + bfqq->wr_coeff = bfqd->bfq_wr_coeff; + bfqq->wr_cur_max_time = bfq_wr_duration(bfqd); + bfqq->last_wr_start_finish = + bfqq->wr_start_at_switch_to_srt; + bfqq->entity.prio_changed = 1; + } + } + } + /* Update weight both if it must be raised and if it must be lowered */ + if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1)) + __bfq_entity_update_weight_prio( + bfq_entity_service_tree(entity), + entity); +} + +/* + * Dispatch next request from bfqq. + */ +static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + struct request *rq = bfqq->next_rq; + unsigned long service_to_charge; + + service_to_charge = bfq_serv_to_charge(rq, bfqq); + + bfq_bfqq_served(bfqq, service_to_charge); + + bfq_dispatch_remove(bfqd->queue, rq); + + /* + * If weight raising has to terminate for bfqq, then next + * function causes an immediate update of bfqq's weight, + * without waiting for next activation. As a consequence, on + * expiration, bfqq will be timestamped as if has never been + * weight-raised during this service slot, even if it has + * received part or even most of the service as a + * weight-raised queue. This inflates bfqq's timestamps, which + * is beneficial, as bfqq is then more willing to leave the + * device immediately to possible other weight-raised queues. + */ + bfq_update_wr_data(bfqd, bfqq); + + /* + * Expire bfqq, pretending that its budget expired, if bfqq + * belongs to CLASS_IDLE and other queues are waiting for + * service. + */ + if (bfqd->busy_queues > 1 && bfq_class_idle(bfqq)) + goto expire; + + return rq; + +expire: + bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED); + return rq; +} + +static bool bfq_has_work(struct blk_mq_hw_ctx *hctx) +{ + struct bfq_data *bfqd = hctx->queue->elevator->elevator_data; + + /* + * Avoiding lock: a race on bfqd->busy_queues should cause at + * most a call to dispatch for nothing + */ + return !list_empty_careful(&bfqd->dispatch) || + bfqd->busy_queues > 0; +} + +static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx) +{ + struct bfq_data *bfqd = hctx->queue->elevator->elevator_data; + struct request *rq = NULL; + struct bfq_queue *bfqq = NULL; + + if (!list_empty(&bfqd->dispatch)) { + rq = list_first_entry(&bfqd->dispatch, struct request, + queuelist); + list_del_init(&rq->queuelist); + + bfqq = RQ_BFQQ(rq); + + if (bfqq) { + /* + * Increment counters here, because this + * dispatch does not follow the standard + * dispatch flow (where counters are + * incremented) + */ + bfqq->dispatched++; + + goto inc_in_driver_start_rq; + } + + /* + * We exploit the put_rq_private hook to decrement + * rq_in_driver, but put_rq_private will not be + * invoked on this request. So, to avoid unbalance, + * just start this request, without incrementing + * rq_in_driver. As a negative consequence, + * rq_in_driver is deceptively lower than it should be + * while this request is in service. This may cause + * bfq_schedule_dispatch to be invoked uselessly. + * + * As for implementing an exact solution, the + * put_request hook, if defined, is probably invoked + * also on this request. So, by exploiting this hook, + * we could 1) increment rq_in_driver here, and 2) + * decrement it in put_request. Such a solution would + * let the value of the counter be always accurate, + * but it would entail using an extra interface + * function. This cost seems higher than the benefit, + * being the frequency of non-elevator-private + * requests very low. + */ + goto start_rq; + } + + bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues); + + if (bfqd->busy_queues == 0) + goto exit; + + /* + * Force device to serve one request at a time if + * strict_guarantees is true. Forcing this service scheme is + * currently the ONLY way to guarantee that the request + * service order enforced by the scheduler is respected by a + * queueing device. Otherwise the device is free even to make + * some unlucky request wait for as long as the device + * wishes. + * + * Of course, serving one request at at time may cause loss of + * throughput. + */ + if (bfqd->strict_guarantees && bfqd->rq_in_driver > 0) + goto exit; + + bfqq = bfq_select_queue(bfqd); + if (!bfqq) + goto exit; + + rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq); + + if (rq) { +inc_in_driver_start_rq: + bfqd->rq_in_driver++; +start_rq: + rq->rq_flags |= RQF_STARTED; + } +exit: + return rq; +} + +static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx) +{ + struct bfq_data *bfqd = hctx->queue->elevator->elevator_data; + struct request *rq; + + spin_lock_irq(&bfqd->lock); + + rq = __bfq_dispatch_request(hctx); + spin_unlock_irq(&bfqd->lock); + + return rq; +} + +/* + * Task holds one reference to the queue, dropped when task exits. Each rq + * in-flight on this queue also holds a reference, dropped when rq is freed. + * + * Scheduler lock must be held here. Recall not to use bfqq after calling + * this function on it. + */ +void bfq_put_queue(struct bfq_queue *bfqq) +{ +#ifdef CONFIG_BFQ_GROUP_IOSCHED + struct bfq_group *bfqg = bfqq_group(bfqq); +#endif + + if (bfqq->bfqd) + bfq_log_bfqq(bfqq->bfqd, bfqq, "put_queue: %p %d", + bfqq, bfqq->ref); + + bfqq->ref--; + if (bfqq->ref) + return; + + if (bfq_bfqq_sync(bfqq)) + /* + * The fact that this queue is being destroyed does not + * invalidate the fact that this queue may have been + * activated during the current burst. As a consequence, + * although the queue does not exist anymore, and hence + * needs to be removed from the burst list if there, + * the burst size has not to be decremented. + */ + hlist_del_init(&bfqq->burst_list_node); + + kmem_cache_free(bfq_pool, bfqq); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + bfqg_put(bfqg); +#endif +} + +static void bfq_put_cooperator(struct bfq_queue *bfqq) +{ + struct bfq_queue *__bfqq, *next; + + /* + * If this queue was scheduled to merge with another queue, be + * sure to drop the reference taken on that queue (and others in + * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs. + */ + __bfqq = bfqq->new_bfqq; + while (__bfqq) { + if (__bfqq == bfqq) + break; + next = __bfqq->new_bfqq; + bfq_put_queue(__bfqq); + __bfqq = next; + } +} + +static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + if (bfqq == bfqd->in_service_queue) { + __bfq_bfqq_expire(bfqd, bfqq); + bfq_schedule_dispatch(bfqd); + } + + bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, bfqq->ref); + + bfq_put_cooperator(bfqq); + + bfq_put_queue(bfqq); /* release process reference */ +} + +static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync) +{ + struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync); + struct bfq_data *bfqd; + + if (bfqq) + bfqd = bfqq->bfqd; /* NULL if scheduler already exited */ + + if (bfqq && bfqd) { + unsigned long flags; + + spin_lock_irqsave(&bfqd->lock, flags); + bfq_exit_bfqq(bfqd, bfqq); + bic_set_bfqq(bic, NULL, is_sync); + spin_unlock_irqrestore(&bfqd->lock, flags); + } +} + +static void bfq_exit_icq(struct io_cq *icq) +{ + struct bfq_io_cq *bic = icq_to_bic(icq); + + bfq_exit_icq_bfqq(bic, true); + bfq_exit_icq_bfqq(bic, false); +} + +/* + * Update the entity prio values; note that the new values will not + * be used until the next (re)activation. + */ +static void +bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic) +{ + struct task_struct *tsk = current; + int ioprio_class; + struct bfq_data *bfqd = bfqq->bfqd; + + if (!bfqd) + return; + + ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio); + switch (ioprio_class) { + default: + dev_err(bfqq->bfqd->queue->backing_dev_info->dev, + "bfq: bad prio class %d\n", ioprio_class); + case IOPRIO_CLASS_NONE: + /* + * No prio set, inherit CPU scheduling settings. + */ + bfqq->new_ioprio = task_nice_ioprio(tsk); + bfqq->new_ioprio_class = task_nice_ioclass(tsk); + break; + case IOPRIO_CLASS_RT: + bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio); + bfqq->new_ioprio_class = IOPRIO_CLASS_RT; + break; + case IOPRIO_CLASS_BE: + bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio); + bfqq->new_ioprio_class = IOPRIO_CLASS_BE; + break; + case IOPRIO_CLASS_IDLE: + bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE; + bfqq->new_ioprio = 7; + bfq_clear_bfqq_idle_window(bfqq); + break; + } + + if (bfqq->new_ioprio >= IOPRIO_BE_NR) { + pr_crit("bfq_set_next_ioprio_data: new_ioprio %d\n", + bfqq->new_ioprio); + bfqq->new_ioprio = IOPRIO_BE_NR; + } + + bfqq->entity.new_weight = bfq_ioprio_to_weight(bfqq->new_ioprio); + bfqq->entity.prio_changed = 1; +} + +static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, + struct bio *bio, bool is_sync, + struct bfq_io_cq *bic); + +static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio) +{ + struct bfq_data *bfqd = bic_to_bfqd(bic); + struct bfq_queue *bfqq; + int ioprio = bic->icq.ioc->ioprio; + + /* + * This condition may trigger on a newly created bic, be sure to + * drop the lock before returning. + */ + if (unlikely(!bfqd) || likely(bic->ioprio == ioprio)) + return; + + bic->ioprio = ioprio; + + bfqq = bic_to_bfqq(bic, false); + if (bfqq) { + /* release process reference on this queue */ + bfq_put_queue(bfqq); + bfqq = bfq_get_queue(bfqd, bio, BLK_RW_ASYNC, bic); + bic_set_bfqq(bic, bfqq, false); + } + + bfqq = bic_to_bfqq(bic, true); + if (bfqq) + bfq_set_next_ioprio_data(bfqq, bic); +} + +static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, + struct bfq_io_cq *bic, pid_t pid, int is_sync) +{ + RB_CLEAR_NODE(&bfqq->entity.rb_node); + INIT_LIST_HEAD(&bfqq->fifo); + INIT_HLIST_NODE(&bfqq->burst_list_node); + + bfqq->ref = 0; + bfqq->bfqd = bfqd; + + if (bic) + bfq_set_next_ioprio_data(bfqq, bic); + + if (is_sync) { + if (!bfq_class_idle(bfqq)) + bfq_mark_bfqq_idle_window(bfqq); + bfq_mark_bfqq_sync(bfqq); + bfq_mark_bfqq_just_created(bfqq); + } else + bfq_clear_bfqq_sync(bfqq); + + /* set end request to minus infinity from now */ + bfqq->ttime.last_end_request = ktime_get_ns() + 1; + + bfq_mark_bfqq_IO_bound(bfqq); + + bfqq->pid = pid; + + /* Tentative initial value to trade off between thr and lat */ + bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3; + bfqq->budget_timeout = bfq_smallest_from_now(); + + bfqq->wr_coeff = 1; + bfqq->last_wr_start_finish = jiffies; + bfqq->wr_start_at_switch_to_srt = bfq_smallest_from_now(); + bfqq->split_time = bfq_smallest_from_now(); + + /* + * Set to the value for which bfqq will not be deemed as + * soft rt when it becomes backlogged. + */ + bfqq->soft_rt_next_start = bfq_greatest_from_now(); + + /* first request is almost certainly seeky */ + bfqq->seek_history = 1; +} + +static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd, + struct bfq_group *bfqg, + int ioprio_class, int ioprio) +{ + switch (ioprio_class) { + case IOPRIO_CLASS_RT: + return &bfqg->async_bfqq[0][ioprio]; + case IOPRIO_CLASS_NONE: + ioprio = IOPRIO_NORM; + /* fall through */ + case IOPRIO_CLASS_BE: + return &bfqg->async_bfqq[1][ioprio]; + case IOPRIO_CLASS_IDLE: + return &bfqg->async_idle_bfqq; + default: + return NULL; + } +} + +static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, + struct bio *bio, bool is_sync, + struct bfq_io_cq *bic) +{ + const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio); + const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio); + struct bfq_queue **async_bfqq = NULL; + struct bfq_queue *bfqq; + struct bfq_group *bfqg; + + rcu_read_lock(); + + bfqg = bfq_find_set_group(bfqd, bio_blkcg(bio)); + if (!bfqg) { + bfqq = &bfqd->oom_bfqq; + goto out; + } + + if (!is_sync) { + async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class, + ioprio); + bfqq = *async_bfqq; + if (bfqq) + goto out; + } + + bfqq = kmem_cache_alloc_node(bfq_pool, + GFP_NOWAIT | __GFP_ZERO | __GFP_NOWARN, + bfqd->queue->node); + + if (bfqq) { + bfq_init_bfqq(bfqd, bfqq, bic, current->pid, + is_sync); + bfq_init_entity(&bfqq->entity, bfqg); + bfq_log_bfqq(bfqd, bfqq, "allocated"); + } else { + bfqq = &bfqd->oom_bfqq; + bfq_log_bfqq(bfqd, bfqq, "using oom bfqq"); + goto out; + } + + /* + * Pin the queue now that it's allocated, scheduler exit will + * prune it. + */ + if (async_bfqq) { + bfqq->ref++; /* + * Extra group reference, w.r.t. sync + * queue. This extra reference is removed + * only if bfqq->bfqg disappears, to + * guarantee that this queue is not freed + * until its group goes away. + */ + bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d", + bfqq, bfqq->ref); + *async_bfqq = bfqq; + } + +out: + bfqq->ref++; /* get a process reference to this queue */ + bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq, bfqq->ref); + rcu_read_unlock(); + return bfqq; +} + +static void bfq_update_io_thinktime(struct bfq_data *bfqd, + struct bfq_queue *bfqq) +{ + struct bfq_ttime *ttime = &bfqq->ttime; + u64 elapsed = ktime_get_ns() - bfqq->ttime.last_end_request; + + elapsed = min_t(u64, elapsed, 2ULL * bfqd->bfq_slice_idle); + + ttime->ttime_samples = (7*bfqq->ttime.ttime_samples + 256) / 8; + ttime->ttime_total = div_u64(7*ttime->ttime_total + 256*elapsed, 8); + ttime->ttime_mean = div64_ul(ttime->ttime_total + 128, + ttime->ttime_samples); +} + +static void +bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq, + struct request *rq) +{ + bfqq->seek_history <<= 1; + bfqq->seek_history |= + get_sdist(bfqq->last_request_pos, rq) > BFQQ_SEEK_THR && + (!blk_queue_nonrot(bfqd->queue) || + blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT); +} + +/* + * Disable idle window if the process thinks too long or seeks so much that + * it doesn't matter. + */ +static void bfq_update_idle_window(struct bfq_data *bfqd, + struct bfq_queue *bfqq, + struct bfq_io_cq *bic) +{ + int enable_idle; + + /* Don't idle for async or idle io prio class. */ + if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq)) + return; + + /* Idle window just restored, statistics are meaningless. */ + if (time_is_after_eq_jiffies(bfqq->split_time + + bfqd->bfq_wr_min_idle_time)) + return; + + enable_idle = bfq_bfqq_idle_window(bfqq); + + if (atomic_read(&bic->icq.ioc->active_ref) == 0 || + bfqd->bfq_slice_idle == 0 || + (bfqd->hw_tag && BFQQ_SEEKY(bfqq) && + bfqq->wr_coeff == 1)) + enable_idle = 0; + else if (bfq_sample_valid(bfqq->ttime.ttime_samples)) { + if (bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle && + bfqq->wr_coeff == 1) + enable_idle = 0; + else + enable_idle = 1; + } + bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d", + enable_idle); + + if (enable_idle) + bfq_mark_bfqq_idle_window(bfqq); + else + bfq_clear_bfqq_idle_window(bfqq); +} + +/* + * Called when a new fs request (rq) is added to bfqq. Check if there's + * something we should do about it. + */ +static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq, + struct request *rq) +{ + struct bfq_io_cq *bic = RQ_BIC(rq); + + if (rq->cmd_flags & REQ_META) + bfqq->meta_pending++; + + bfq_update_io_thinktime(bfqd, bfqq); + bfq_update_io_seektime(bfqd, bfqq, rq); + if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 || + !BFQQ_SEEKY(bfqq)) + bfq_update_idle_window(bfqd, bfqq, bic); + + bfq_log_bfqq(bfqd, bfqq, + "rq_enqueued: idle_window=%d (seeky %d)", + bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq)); + + bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq); + + if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) { + bool small_req = bfqq->queued[rq_is_sync(rq)] == 1 && + blk_rq_sectors(rq) < 32; + bool budget_timeout = bfq_bfqq_budget_timeout(bfqq); + + /* + * There is just this request queued: if the request + * is small and the queue is not to be expired, then + * just exit. + * + * In this way, if the device is being idled to wait + * for a new request from the in-service queue, we + * avoid unplugging the device and committing the + * device to serve just a small request. On the + * contrary, we wait for the block layer to decide + * when to unplug the device: hopefully, new requests + * will be merged to this one quickly, then the device + * will be unplugged and larger requests will be + * dispatched. + */ + if (small_req && !budget_timeout) + return; + + /* + * A large enough request arrived, or the queue is to + * be expired: in both cases disk idling is to be + * stopped, so clear wait_request flag and reset + * timer. + */ + bfq_clear_bfqq_wait_request(bfqq); + hrtimer_try_to_cancel(&bfqd->idle_slice_timer); + bfqg_stats_update_idle_time(bfqq_group(bfqq)); + + /* + * The queue is not empty, because a new request just + * arrived. Hence we can safely expire the queue, in + * case of budget timeout, without risking that the + * timestamps of the queue are not updated correctly. + * See [1] for more details. + */ + if (budget_timeout) + bfq_bfqq_expire(bfqd, bfqq, false, + BFQQE_BUDGET_TIMEOUT); + } +} + +static void __bfq_insert_request(struct bfq_data *bfqd, struct request *rq) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq), + *new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true); + + if (new_bfqq) { + if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq) + new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1); + /* + * Release the request's reference to the old bfqq + * and make sure one is taken to the shared queue. + */ + new_bfqq->allocated++; + bfqq->allocated--; + new_bfqq->ref++; + bfq_clear_bfqq_just_created(bfqq); + /* + * If the bic associated with the process + * issuing this request still points to bfqq + * (and thus has not been already redirected + * to new_bfqq or even some other bfq_queue), + * then complete the merge and redirect it to + * new_bfqq. + */ + if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq) + bfq_merge_bfqqs(bfqd, RQ_BIC(rq), + bfqq, new_bfqq); + /* + * rq is about to be enqueued into new_bfqq, + * release rq reference on bfqq + */ + bfq_put_queue(bfqq); + rq->elv.priv[1] = new_bfqq; + bfqq = new_bfqq; + } + + bfq_add_request(rq); + + rq->fifo_time = ktime_get_ns() + bfqd->bfq_fifo_expire[rq_is_sync(rq)]; + list_add_tail(&rq->queuelist, &bfqq->fifo); + + bfq_rq_enqueued(bfqd, bfqq, rq); +} + +static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, + bool at_head) +{ + struct request_queue *q = hctx->queue; + struct bfq_data *bfqd = q->elevator->elevator_data; + + spin_lock_irq(&bfqd->lock); + if (blk_mq_sched_try_insert_merge(q, rq)) { + spin_unlock_irq(&bfqd->lock); + return; + } + + spin_unlock_irq(&bfqd->lock); + + blk_mq_sched_request_inserted(rq); + + spin_lock_irq(&bfqd->lock); + if (at_head || blk_rq_is_passthrough(rq)) { + if (at_head) + list_add(&rq->queuelist, &bfqd->dispatch); + else + list_add_tail(&rq->queuelist, &bfqd->dispatch); + } else { + __bfq_insert_request(bfqd, rq); + + if (rq_mergeable(rq)) { + elv_rqhash_add(q, rq); + if (!q->last_merge) + q->last_merge = rq; + } + } + + spin_unlock_irq(&bfqd->lock); +} + +static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx, + struct list_head *list, bool at_head) +{ + while (!list_empty(list)) { + struct request *rq; + + rq = list_first_entry(list, struct request, queuelist); + list_del_init(&rq->queuelist); + bfq_insert_request(hctx, rq, at_head); + } +} + +static void bfq_update_hw_tag(struct bfq_data *bfqd) +{ + bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver, + bfqd->rq_in_driver); + + if (bfqd->hw_tag == 1) + return; + + /* + * This sample is valid if the number of outstanding requests + * is large enough to allow a queueing behavior. Note that the + * sum is not exact, as it's not taking into account deactivated + * requests. + */ + if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD) + return; + + if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES) + return; + + bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD; + bfqd->max_rq_in_driver = 0; + bfqd->hw_tag_samples = 0; +} + +static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd) +{ + u64 now_ns; + u32 delta_us; + + bfq_update_hw_tag(bfqd); + + bfqd->rq_in_driver--; + bfqq->dispatched--; + + if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) { + /* + * Set budget_timeout (which we overload to store the + * time at which the queue remains with no backlog and + * no outstanding request; used by the weight-raising + * mechanism). + */ + bfqq->budget_timeout = jiffies; + + bfq_weights_tree_remove(bfqd, &bfqq->entity, + &bfqd->queue_weights_tree); + } + + now_ns = ktime_get_ns(); + + bfqq->ttime.last_end_request = now_ns; + + /* + * Using us instead of ns, to get a reasonable precision in + * computing rate in next check. + */ + delta_us = div_u64(now_ns - bfqd->last_completion, NSEC_PER_USEC); + + /* + * If the request took rather long to complete, and, according + * to the maximum request size recorded, this completion latency + * implies that the request was certainly served at a very low + * rate (less than 1M sectors/sec), then the whole observation + * interval that lasts up to this time instant cannot be a + * valid time interval for computing a new peak rate. Invoke + * bfq_update_rate_reset to have the following three steps + * taken: + * - close the observation interval at the last (previous) + * request dispatch or completion + * - compute rate, if possible, for that observation interval + * - reset to zero samples, which will trigger a proper + * re-initialization of the observation interval on next + * dispatch + */ + if (delta_us > BFQ_MIN_TT/NSEC_PER_USEC && + (bfqd->last_rq_max_size<<BFQ_RATE_SHIFT)/delta_us < + 1UL<<(BFQ_RATE_SHIFT - 10)) + bfq_update_rate_reset(bfqd, NULL); + bfqd->last_completion = now_ns; + + /* + * If we are waiting to discover whether the request pattern + * of the task associated with the queue is actually + * isochronous, and both requisites for this condition to hold + * are now satisfied, then compute soft_rt_next_start (see the + * comments on the function bfq_bfqq_softrt_next_start()). We + * schedule this delayed check when bfqq expires, if it still + * has in-flight requests. + */ + if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 && + RB_EMPTY_ROOT(&bfqq->sort_list)) + bfqq->soft_rt_next_start = + bfq_bfqq_softrt_next_start(bfqd, bfqq); + + /* + * If this is the in-service queue, check if it needs to be expired, + * or if we want to idle in case it has no pending requests. + */ + if (bfqd->in_service_queue == bfqq) { + if (bfqq->dispatched == 0 && bfq_bfqq_must_idle(bfqq)) { + bfq_arm_slice_timer(bfqd); + return; + } else if (bfq_may_expire_for_budg_timeout(bfqq)) + bfq_bfqq_expire(bfqd, bfqq, false, + BFQQE_BUDGET_TIMEOUT); + else if (RB_EMPTY_ROOT(&bfqq->sort_list) && + (bfqq->dispatched == 0 || + !bfq_bfqq_may_idle(bfqq))) + bfq_bfqq_expire(bfqd, bfqq, false, + BFQQE_NO_MORE_REQUESTS); + } +} + +static void bfq_put_rq_priv_body(struct bfq_queue *bfqq) +{ + bfqq->allocated--; + + bfq_put_queue(bfqq); +} + +static void bfq_put_rq_private(struct request_queue *q, struct request *rq) +{ + struct bfq_queue *bfqq = RQ_BFQQ(rq); + struct bfq_data *bfqd = bfqq->bfqd; + + if (rq->rq_flags & RQF_STARTED) + bfqg_stats_update_completion(bfqq_group(bfqq), + rq_start_time_ns(rq), + rq_io_start_time_ns(rq), + rq->cmd_flags); + + if (likely(rq->rq_flags & RQF_STARTED)) { + unsigned long flags; + + spin_lock_irqsave(&bfqd->lock, flags); + + bfq_completed_request(bfqq, bfqd); + bfq_put_rq_priv_body(bfqq); + + spin_unlock_irqrestore(&bfqd->lock, flags); + } else { + /* + * Request rq may be still/already in the scheduler, + * in which case we need to remove it. And we cannot + * defer such a check and removal, to avoid + * inconsistencies in the time interval from the end + * of this function to the start of the deferred work. + * This situation seems to occur only in process + * context, as a consequence of a merge. In the + * current version of the code, this implies that the + * lock is held. + */ + + if (!RB_EMPTY_NODE(&rq->rb_node)) + bfq_remove_request(q, rq); + bfq_put_rq_priv_body(bfqq); + } + + rq->elv.priv[0] = NULL; + rq->elv.priv[1] = NULL; +} + +/* + * Returns NULL if a new bfqq should be allocated, or the old bfqq if this + * was the last process referring to that bfqq. + */ +static struct bfq_queue * +bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq) +{ + bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue"); + + if (bfqq_process_refs(bfqq) == 1) { + bfqq->pid = current->pid; + bfq_clear_bfqq_coop(bfqq); + bfq_clear_bfqq_split_coop(bfqq); + return bfqq; + } + + bic_set_bfqq(bic, NULL, 1); + + bfq_put_cooperator(bfqq); + + bfq_put_queue(bfqq); + return NULL; +} + +static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd, + struct bfq_io_cq *bic, + struct bio *bio, + bool split, bool is_sync, + bool *new_queue) +{ + struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync); + + if (likely(bfqq && bfqq != &bfqd->oom_bfqq)) + return bfqq; + + if (new_queue) + *new_queue = true; + + if (bfqq) + bfq_put_queue(bfqq); + bfqq = bfq_get_queue(bfqd, bio, is_sync, bic); + + bic_set_bfqq(bic, bfqq, is_sync); + if (split && is_sync) { + if ((bic->was_in_burst_list && bfqd->large_burst) || + bic->saved_in_large_burst) + bfq_mark_bfqq_in_large_burst(bfqq); + else { + bfq_clear_bfqq_in_large_burst(bfqq); + if (bic->was_in_burst_list) + hlist_add_head(&bfqq->burst_list_node, + &bfqd->burst_list); + } + bfqq->split_time = jiffies; + } + + return bfqq; +} + +/* + * Allocate bfq data structures associated with this request. + */ +static int bfq_get_rq_private(struct request_queue *q, struct request *rq, + struct bio *bio) +{ + struct bfq_data *bfqd = q->elevator->elevator_data; + struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq); + const int is_sync = rq_is_sync(rq); + struct bfq_queue *bfqq; + bool new_queue = false; + bool split = false; + + spin_lock_irq(&bfqd->lock); + + if (!bic) + goto queue_fail; + + bfq_check_ioprio_change(bic, bio); + + bfq_bic_update_cgroup(bic, bio); + + bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio, false, is_sync, + &new_queue); + + if (likely(!new_queue)) { + /* If the queue was seeky for too long, break it apart. */ + if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) { + bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq"); + + /* Update bic before losing reference to bfqq */ + if (bfq_bfqq_in_large_burst(bfqq)) + bic->saved_in_large_burst = true; + + bfqq = bfq_split_bfqq(bic, bfqq); + split = true; + + if (!bfqq) + bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio, + true, is_sync, + NULL); + } + } + + bfqq->allocated++; + bfqq->ref++; + bfq_log_bfqq(bfqd, bfqq, "get_request %p: bfqq %p, %d", + rq, bfqq, bfqq->ref); + + rq->elv.priv[0] = bic; + rq->elv.priv[1] = bfqq; + + /* + * If a bfq_queue has only one process reference, it is owned + * by only this bic: we can then set bfqq->bic = bic. in + * addition, if the queue has also just been split, we have to + * resume its state. + */ + if (likely(bfqq != &bfqd->oom_bfqq) && bfqq_process_refs(bfqq) == 1) { + bfqq->bic = bic; + if (split) { + /* + * The queue has just been split from a shared + * queue: restore the idle window and the + * possible weight raising period. + */ + bfq_bfqq_resume_state(bfqq, bic); + } + } + + if (unlikely(bfq_bfqq_just_created(bfqq))) + bfq_handle_burst(bfqd, bfqq); + + spin_unlock_irq(&bfqd->lock); + + return 0; + +queue_fail: + spin_unlock_irq(&bfqd->lock); + + return 1; +} + +static void bfq_idle_slice_timer_body(struct bfq_queue *bfqq) +{ + struct bfq_data *bfqd = bfqq->bfqd; + enum bfqq_expiration reason; + unsigned long flags; + + spin_lock_irqsave(&bfqd->lock, flags); + bfq_clear_bfqq_wait_request(bfqq); + + if (bfqq != bfqd->in_service_queue) { + spin_unlock_irqrestore(&bfqd->lock, flags); + return; + } + + if (bfq_bfqq_budget_timeout(bfqq)) + /* + * Also here the queue can be safely expired + * for budget timeout without wasting + * guarantees + */ + reason = BFQQE_BUDGET_TIMEOUT; + else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0) + /* + * The queue may not be empty upon timer expiration, + * because we may not disable the timer when the + * first request of the in-service queue arrives + * during disk idling. + */ + reason = BFQQE_TOO_IDLE; + else + goto schedule_dispatch; + + bfq_bfqq_expire(bfqd, bfqq, true, reason); + +schedule_dispatch: + spin_unlock_irqrestore(&bfqd->lock, flags); + bfq_schedule_dispatch(bfqd); +} + +/* + * Handler of the expiration of the timer running if the in-service queue + * is idling inside its time slice. + */ +static enum hrtimer_restart bfq_idle_slice_timer(struct hrtimer *timer) +{ + struct bfq_data *bfqd = container_of(timer, struct bfq_data, + idle_slice_timer); + struct bfq_queue *bfqq = bfqd->in_service_queue; + + /* + * Theoretical race here: the in-service queue can be NULL or + * different from the queue that was idling if a new request + * arrives for the current queue and there is a full dispatch + * cycle that changes the in-service queue. This can hardly + * happen, but in the worst case we just expire a queue too + * early. + */ + if (bfqq) + bfq_idle_slice_timer_body(bfqq); + + return HRTIMER_NORESTART; +} + +static void __bfq_put_async_bfqq(struct bfq_data *bfqd, + struct bfq_queue **bfqq_ptr) +{ + struct bfq_queue *bfqq = *bfqq_ptr; + + bfq_log(bfqd, "put_async_bfqq: %p", bfqq); + if (bfqq) { + bfq_bfqq_move(bfqd, bfqq, bfqd->root_group); + + bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d", + bfqq, bfqq->ref); + bfq_put_queue(bfqq); + *bfqq_ptr = NULL; + } +} + +/* + * Release all the bfqg references to its async queues. If we are + * deallocating the group these queues may still contain requests, so + * we reparent them to the root cgroup (i.e., the only one that will + * exist for sure until all the requests on a device are gone). + */ +void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg) +{ + int i, j; + + for (i = 0; i < 2; i++) + for (j = 0; j < IOPRIO_BE_NR; j++) + __bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]); + + __bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq); +} + +static void bfq_exit_queue(struct elevator_queue *e) +{ + struct bfq_data *bfqd = e->elevator_data; + struct bfq_queue *bfqq, *n; + + hrtimer_cancel(&bfqd->idle_slice_timer); + + spin_lock_irq(&bfqd->lock); + list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list) + bfq_deactivate_bfqq(bfqd, bfqq, false, false); + spin_unlock_irq(&bfqd->lock); + + hrtimer_cancel(&bfqd->idle_slice_timer); + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + blkcg_deactivate_policy(bfqd->queue, &blkcg_policy_bfq); +#else + spin_lock_irq(&bfqd->lock); + bfq_put_async_queues(bfqd, bfqd->root_group); + kfree(bfqd->root_group); + spin_unlock_irq(&bfqd->lock); +#endif + + kfree(bfqd); +} + +static void bfq_init_root_group(struct bfq_group *root_group, + struct bfq_data *bfqd) +{ + int i; + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + root_group->entity.parent = NULL; + root_group->my_entity = NULL; + root_group->bfqd = bfqd; +#endif + root_group->rq_pos_tree = RB_ROOT; + for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) + root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT; + root_group->sched_data.bfq_class_idle_last_service = jiffies; +} + +static int bfq_init_queue(struct request_queue *q, struct elevator_type *e) +{ + struct bfq_data *bfqd; + struct elevator_queue *eq; + + eq = elevator_alloc(q, e); + if (!eq) + return -ENOMEM; + + bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node); + if (!bfqd) { + kobject_put(&eq->kobj); + return -ENOMEM; + } + eq->elevator_data = bfqd; + + spin_lock_irq(q->queue_lock); + q->elevator = eq; + spin_unlock_irq(q->queue_lock); + + /* + * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues. + * Grab a permanent reference to it, so that the normal code flow + * will not attempt to free it. + */ + bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, NULL, 1, 0); + bfqd->oom_bfqq.ref++; + bfqd->oom_bfqq.new_ioprio = BFQ_DEFAULT_QUEUE_IOPRIO; + bfqd->oom_bfqq.new_ioprio_class = IOPRIO_CLASS_BE; + bfqd->oom_bfqq.entity.new_weight = + bfq_ioprio_to_weight(bfqd->oom_bfqq.new_ioprio); + + /* oom_bfqq does not participate to bursts */ + bfq_clear_bfqq_just_created(&bfqd->oom_bfqq); + + /* + * Trigger weight initialization, according to ioprio, at the + * oom_bfqq's first activation. The oom_bfqq's ioprio and ioprio + * class won't be changed any more. + */ + bfqd->oom_bfqq.entity.prio_changed = 1; + + bfqd->queue = q; + + INIT_LIST_HEAD(&bfqd->dispatch); + + hrtimer_init(&bfqd->idle_slice_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + bfqd->idle_slice_timer.function = bfq_idle_slice_timer; + + bfqd->queue_weights_tree = RB_ROOT; + bfqd->group_weights_tree = RB_ROOT; + + INIT_LIST_HEAD(&bfqd->active_list); + INIT_LIST_HEAD(&bfqd->idle_list); + INIT_HLIST_HEAD(&bfqd->burst_list); + + bfqd->hw_tag = -1; + + bfqd->bfq_max_budget = bfq_default_max_budget; + + bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0]; + bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1]; + bfqd->bfq_back_max = bfq_back_max; + bfqd->bfq_back_penalty = bfq_back_penalty; + bfqd->bfq_slice_idle = bfq_slice_idle; + bfqd->bfq_timeout = bfq_timeout; + + bfqd->bfq_requests_within_timer = 120; + + bfqd->bfq_large_burst_thresh = 8; + bfqd->bfq_burst_interval = msecs_to_jiffies(180); + + bfqd->low_latency = true; + + /* + * Trade-off between responsiveness and fairness. + */ + bfqd->bfq_wr_coeff = 30; + bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300); + bfqd->bfq_wr_max_time = 0; + bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000); + bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500); + bfqd->bfq_wr_max_softrt_rate = 7000; /* + * Approximate rate required + * to playback or record a + * high-definition compressed + * video. + */ + bfqd->wr_busy_queues = 0; + + /* + * Begin by assuming, optimistically, that the device is a + * high-speed one, and that its peak rate is equal to 2/3 of + * the highest reference rate. + */ + bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] * + T_fast[blk_queue_nonrot(bfqd->queue)]; + bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)] * 2 / 3; + bfqd->device_speed = BFQ_BFQD_FAST; + + spin_lock_init(&bfqd->lock); + + /* + * The invocation of the next bfq_create_group_hierarchy + * function is the head of a chain of function calls + * (bfq_create_group_hierarchy->blkcg_activate_policy-> + * blk_mq_freeze_queue) that may lead to the invocation of the + * has_work hook function. For this reason, + * bfq_create_group_hierarchy is invoked only after all + * scheduler data has been initialized, apart from the fields + * that can be initialized only after invoking + * bfq_create_group_hierarchy. This, in particular, enables + * has_work to correctly return false. Of course, to avoid + * other inconsistencies, the blk-mq stack must then refrain + * from invoking further scheduler hooks before this init + * function is finished. + */ + bfqd->root_group = bfq_create_group_hierarchy(bfqd, q->node); + if (!bfqd->root_group) + goto out_free; + bfq_init_root_group(bfqd->root_group, bfqd); + bfq_init_entity(&bfqd->oom_bfqq.entity, bfqd->root_group); + + + return 0; + +out_free: + kfree(bfqd); + kobject_put(&eq->kobj); + return -ENOMEM; +} + +static void bfq_slab_kill(void) +{ + kmem_cache_destroy(bfq_pool); +} + +static int __init bfq_slab_setup(void) +{ + bfq_pool = KMEM_CACHE(bfq_queue, 0); + if (!bfq_pool) + return -ENOMEM; + return 0; +} + +static ssize_t bfq_var_show(unsigned int var, char *page) +{ + return sprintf(page, "%u\n", var); +} + +static ssize_t bfq_var_store(unsigned long *var, const char *page, + size_t count) +{ + unsigned long new_val; + int ret = kstrtoul(page, 10, &new_val); + + if (ret == 0) + *var = new_val; + + return count; +} + +#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \ +static ssize_t __FUNC(struct elevator_queue *e, char *page) \ +{ \ + struct bfq_data *bfqd = e->elevator_data; \ + u64 __data = __VAR; \ + if (__CONV == 1) \ + __data = jiffies_to_msecs(__data); \ + else if (__CONV == 2) \ + __data = div_u64(__data, NSEC_PER_MSEC); \ + return bfq_var_show(__data, (page)); \ +} +SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 2); +SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 2); +SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0); +SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0); +SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 2); +SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0); +SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, 1); +SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, 0); +SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0); +#undef SHOW_FUNCTION + +#define USEC_SHOW_FUNCTION(__FUNC, __VAR) \ +static ssize_t __FUNC(struct elevator_queue *e, char *page) \ +{ \ + struct bfq_data *bfqd = e->elevator_data; \ + u64 __data = __VAR; \ + __data = div_u64(__data, NSEC_PER_USEC); \ + return bfq_var_show(__data, (page)); \ +} +USEC_SHOW_FUNCTION(bfq_slice_idle_us_show, bfqd->bfq_slice_idle); +#undef USEC_SHOW_FUNCTION + +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \ +static ssize_t \ +__FUNC(struct elevator_queue *e, const char *page, size_t count) \ +{ \ + struct bfq_data *bfqd = e->elevator_data; \ + unsigned long uninitialized_var(__data); \ + int ret = bfq_var_store(&__data, (page), count); \ + if (__data < (MIN)) \ + __data = (MIN); \ + else if (__data > (MAX)) \ + __data = (MAX); \ + if (__CONV == 1) \ + *(__PTR) = msecs_to_jiffies(__data); \ + else if (__CONV == 2) \ + *(__PTR) = (u64)__data * NSEC_PER_MSEC; \ + else \ + *(__PTR) = __data; \ + return ret; \ +} +STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1, + INT_MAX, 2); +STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1, + INT_MAX, 2); +STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0); +STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1, + INT_MAX, 0); +STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2); +#undef STORE_FUNCTION + +#define USEC_STORE_FUNCTION(__FUNC, __PTR, MIN, MAX) \ +static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\ +{ \ + struct bfq_data *bfqd = e->elevator_data; \ + unsigned long uninitialized_var(__data); \ + int ret = bfq_var_store(&__data, (page), count); \ + if (__data < (MIN)) \ + __data = (MIN); \ + else if (__data > (MAX)) \ + __data = (MAX); \ + *(__PTR) = (u64)__data * NSEC_PER_USEC; \ + return ret; \ +} +USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0, + UINT_MAX); +#undef USEC_STORE_FUNCTION + +static ssize_t bfq_max_budget_store(struct elevator_queue *e, + const char *page, size_t count) +{ + struct bfq_data *bfqd = e->elevator_data; + unsigned long uninitialized_var(__data); + int ret = bfq_var_store(&__data, (page), count); + + if (__data == 0) + bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd); + else { + if (__data > INT_MAX) + __data = INT_MAX; + bfqd->bfq_max_budget = __data; + } + + bfqd->bfq_user_max_budget = __data; + + return ret; +} + +/* + * Leaving this name to preserve name compatibility with cfq + * parameters, but this timeout is used for both sync and async. + */ +static ssize_t bfq_timeout_sync_store(struct elevator_queue *e, + const char *page, size_t count) +{ + struct bfq_data *bfqd = e->elevator_data; + unsigned long uninitialized_var(__data); + int ret = bfq_var_store(&__data, (page), count); + + if (__data < 1) + __data = 1; + else if (__data > INT_MAX) + __data = INT_MAX; + + bfqd->bfq_timeout = msecs_to_jiffies(__data); + if (bfqd->bfq_user_max_budget == 0) + bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd); + + return ret; +} + +static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e, + const char *page, size_t count) +{ + struct bfq_data *bfqd = e->elevator_data; + unsigned long uninitialized_var(__data); + int ret = bfq_var_store(&__data, (page), count); + + if (__data > 1) + __data = 1; + if (!bfqd->strict_guarantees && __data == 1 + && bfqd->bfq_slice_idle < 8 * NSEC_PER_MSEC) + bfqd->bfq_slice_idle = 8 * NSEC_PER_MSEC; + + bfqd->strict_guarantees = __data; + + return ret; +} + +static ssize_t bfq_low_latency_store(struct elevator_queue *e, + const char *page, size_t count) +{ + struct bfq_data *bfqd = e->elevator_data; + unsigned long uninitialized_var(__data); + int ret = bfq_var_store(&__data, (page), count); + + if (__data > 1) + __data = 1; + if (__data == 0 && bfqd->low_latency != 0) + bfq_end_wr(bfqd); + bfqd->low_latency = __data; + + return ret; +} + +#define BFQ_ATTR(name) \ + __ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store) + +static struct elv_fs_entry bfq_attrs[] = { + BFQ_ATTR(fifo_expire_sync), + BFQ_ATTR(fifo_expire_async), + BFQ_ATTR(back_seek_max), + BFQ_ATTR(back_seek_penalty), + BFQ_ATTR(slice_idle), + BFQ_ATTR(slice_idle_us), + BFQ_ATTR(max_budget), + BFQ_ATTR(timeout_sync), + BFQ_ATTR(strict_guarantees), + BFQ_ATTR(low_latency), + __ATTR_NULL +}; + +static struct elevator_type iosched_bfq_mq = { + .ops.mq = { + .get_rq_priv = bfq_get_rq_private, + .put_rq_priv = bfq_put_rq_private, + .exit_icq = bfq_exit_icq, + .insert_requests = bfq_insert_requests, + .dispatch_request = bfq_dispatch_request, + .next_request = elv_rb_latter_request, + .former_request = elv_rb_former_request, + .allow_merge = bfq_allow_bio_merge, + .bio_merge = bfq_bio_merge, + .request_merge = bfq_request_merge, + .requests_merged = bfq_requests_merged, + .request_merged = bfq_request_merged, + .has_work = bfq_has_work, + .init_sched = bfq_init_queue, + .exit_sched = bfq_exit_queue, + }, + + .uses_mq = true, + .icq_size = sizeof(struct bfq_io_cq), + .icq_align = __alignof__(struct bfq_io_cq), + .elevator_attrs = bfq_attrs, + .elevator_name = "bfq", + .elevator_owner = THIS_MODULE, +}; + +static int __init bfq_init(void) +{ + int ret; + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + ret = blkcg_policy_register(&blkcg_policy_bfq); + if (ret) + return ret; +#endif + + ret = -ENOMEM; + if (bfq_slab_setup()) + goto err_pol_unreg; + + /* + * Times to load large popular applications for the typical + * systems installed on the reference devices (see the + * comments before the definitions of the next two + * arrays). Actually, we use slightly slower values, as the + * estimated peak rate tends to be smaller than the actual + * peak rate. The reason for this last fact is that estimates + * are computed over much shorter time intervals than the long + * intervals typically used for benchmarking. Why? First, to + * adapt more quickly to variations. Second, because an I/O + * scheduler cannot rely on a peak-rate-evaluation workload to + * be run for a long time. + */ + T_slow[0] = msecs_to_jiffies(3500); /* actually 4 sec */ + T_slow[1] = msecs_to_jiffies(6000); /* actually 6.5 sec */ + T_fast[0] = msecs_to_jiffies(7000); /* actually 8 sec */ + T_fast[1] = msecs_to_jiffies(2500); /* actually 3 sec */ + + /* + * Thresholds that determine the switch between speed classes + * (see the comments before the definition of the array + * device_speed_thresh). These thresholds are biased towards + * transitions to the fast class. This is safer than the + * opposite bias. In fact, a wrong transition to the slow + * class results in short weight-raising periods, because the + * speed of the device then tends to be higher that the + * reference peak rate. On the opposite end, a wrong + * transition to the fast class tends to increase + * weight-raising periods, because of the opposite reason. + */ + device_speed_thresh[0] = (4 * R_slow[0]) / 3; + device_speed_thresh[1] = (4 * R_slow[1]) / 3; + + ret = elv_register(&iosched_bfq_mq); + if (ret) + goto err_pol_unreg; + + return 0; + +err_pol_unreg: +#ifdef CONFIG_BFQ_GROUP_IOSCHED + blkcg_policy_unregister(&blkcg_policy_bfq); +#endif + return ret; +} + +static void __exit bfq_exit(void) +{ + elv_unregister(&iosched_bfq_mq); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + blkcg_policy_unregister(&blkcg_policy_bfq); +#endif + bfq_slab_kill(); +} + +module_init(bfq_init); +module_exit(bfq_exit); + +MODULE_AUTHOR("Paolo Valente"); +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("MQ Budget Fair Queueing I/O Scheduler"); diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h new file mode 100644 index 000000000000..ae783c06dfd9 --- /dev/null +++ b/block/bfq-iosched.h @@ -0,0 +1,941 @@ +/* + * Header file for the BFQ I/O scheduler: data structures and + * prototypes of interface functions among BFQ components. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 of the + * License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#ifndef _BFQ_H +#define _BFQ_H + +#include <linux/blktrace_api.h> +#include <linux/hrtimer.h> +#include <linux/blk-cgroup.h> + +#define BFQ_IOPRIO_CLASSES 3 +#define BFQ_CL_IDLE_TIMEOUT (HZ/5) + +#define BFQ_MIN_WEIGHT 1 +#define BFQ_MAX_WEIGHT 1000 +#define BFQ_WEIGHT_CONVERSION_COEFF 10 + +#define BFQ_DEFAULT_QUEUE_IOPRIO 4 + +#define BFQ_WEIGHT_LEGACY_DFL 100 +#define BFQ_DEFAULT_GRP_IOPRIO 0 +#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE + +/* + * Soft real-time applications are extremely more latency sensitive + * than interactive ones. Over-raise the weight of the former to + * privilege them against the latter. + */ +#define BFQ_SOFTRT_WEIGHT_FACTOR 100 + +struct bfq_entity; + +/** + * struct bfq_service_tree - per ioprio_class service tree. + * + * Each service tree represents a B-WF2Q+ scheduler on its own. Each + * ioprio_class has its own independent scheduler, and so its own + * bfq_service_tree. All the fields are protected by the queue lock + * of the containing bfqd. + */ +struct bfq_service_tree { + /* tree for active entities (i.e., those backlogged) */ + struct rb_root active; + /* tree for idle entities (i.e., not backlogged, with V <= F_i)*/ + struct rb_root idle; + + /* idle entity with minimum F_i */ + struct bfq_entity *first_idle; + /* idle entity with maximum F_i */ + struct bfq_entity *last_idle; + + /* scheduler virtual time */ + u64 vtime; + /* scheduler weight sum; active and idle entities contribute to it */ + unsigned long wsum; +}; + +/** + * struct bfq_sched_data - multi-class scheduler. + * + * bfq_sched_data is the basic scheduler queue. It supports three + * ioprio_classes, and can be used either as a toplevel queue or as an + * intermediate queue on a hierarchical setup. @next_in_service + * points to the active entity of the sched_data service trees that + * will be scheduled next. It is used to reduce the number of steps + * needed for each hierarchical-schedule update. + * + * The supported ioprio_classes are the same as in CFQ, in descending + * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE. + * Requests from higher priority queues are served before all the + * requests from lower priority queues; among requests of the same + * queue requests are served according to B-WF2Q+. + * All the fields are protected by the queue lock of the containing bfqd. + */ +struct bfq_sched_data { + /* entity in service */ + struct bfq_entity *in_service_entity; + /* head-of-line entity (see comments above) */ + struct bfq_entity *next_in_service; + /* array of service trees, one per ioprio_class */ + struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES]; + /* last time CLASS_IDLE was served */ + unsigned long bfq_class_idle_last_service; + +}; + +/** + * struct bfq_weight_counter - counter of the number of all active entities + * with a given weight. + */ +struct bfq_weight_counter { + unsigned int weight; /* weight of the entities this counter refers to */ + unsigned int num_active; /* nr of active entities with this weight */ + /* + * Weights tree member (see bfq_data's @queue_weights_tree and + * @group_weights_tree) + */ + struct rb_node weights_node; +}; + +/** + * struct bfq_entity - schedulable entity. + * + * A bfq_entity is used to represent either a bfq_queue (leaf node in the + * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each + * entity belongs to the sched_data of the parent group in the cgroup + * hierarchy. Non-leaf entities have also their own sched_data, stored + * in @my_sched_data. + * + * Each entity stores independently its priority values; this would + * allow different weights on different devices, but this + * functionality is not exported to userspace by now. Priorities and + * weights are updated lazily, first storing the new values into the + * new_* fields, then setting the @prio_changed flag. As soon as + * there is a transition in the entity state that allows the priority + * update to take place the effective and the requested priority + * values are synchronized. + * + * Unless cgroups are used, the weight value is calculated from the + * ioprio to export the same interface as CFQ. When dealing with + * ``well-behaved'' queues (i.e., queues that do not spend too much + * time to consume their budget and have true sequential behavior, and + * when there are no external factors breaking anticipation) the + * relative weights at each level of the cgroups hierarchy should be + * guaranteed. All the fields are protected by the queue lock of the + * containing bfqd. + */ +struct bfq_entity { + /* service_tree member */ + struct rb_node rb_node; + /* pointer to the weight counter associated with this entity */ + struct bfq_weight_counter *weight_counter; + + /* + * Flag, true if the entity is on a tree (either the active or + * the idle one of its service_tree) or is in service. + */ + bool on_st; + + /* B-WF2Q+ start and finish timestamps [sectors/weight] */ + u64 start, finish; + + /* tree the entity is enqueued into; %NULL if not on a tree */ + struct rb_root *tree; + + /* + * minimum start time of the (active) subtree rooted at this + * entity; used for O(log N) lookups into active trees + */ + u64 min_start; + + /* amount of service received during the last service slot */ + int service; + + /* budget, used also to calculate F_i: F_i = S_i + @budget / @weight */ + int budget; + + /* weight of the queue */ + int weight; + /* next weight if a change is in progress */ + int new_weight; + + /* original weight, used to implement weight boosting */ + int orig_weight; + + /* parent entity, for hierarchical scheduling */ + struct bfq_entity *parent; + + /* + * For non-leaf nodes in the hierarchy, the associated + * scheduler queue, %NULL on leaf nodes. + */ + struct bfq_sched_data *my_sched_data; + /* the scheduler queue this entity belongs to */ + struct bfq_sched_data *sched_data; + + /* flag, set to request a weight, ioprio or ioprio_class change */ + int prio_changed; +}; + +struct bfq_group; + +/** + * struct bfq_ttime - per process thinktime stats. + */ +struct bfq_ttime { + /* completion time of the last request */ + u64 last_end_request; + + /* total process thinktime */ + u64 ttime_total; + /* number of thinktime samples */ + unsigned long ttime_samples; + /* average process thinktime */ + u64 ttime_mean; +}; + +/** + * struct bfq_queue - leaf schedulable entity. + * + * A bfq_queue is a leaf request queue; it can be associated with an + * io_context or more, if it is async or shared between cooperating + * processes. @cgroup holds a reference to the cgroup, to be sure that it + * does not disappear while a bfqq still references it (mostly to avoid + * races between request issuing and task migration followed by cgroup + * destruction). + * All the fields are protected by the queue lock of the containing bfqd. + */ +struct bfq_queue { + /* reference counter */ + int ref; + /* parent bfq_data */ + struct bfq_data *bfqd; + + /* current ioprio and ioprio class */ + unsigned short ioprio, ioprio_class; + /* next ioprio and ioprio class if a change is in progress */ + unsigned short new_ioprio, new_ioprio_class; + + /* + * Shared bfq_queue if queue is cooperating with one or more + * other queues. + */ + struct bfq_queue *new_bfqq; + /* request-position tree member (see bfq_group's @rq_pos_tree) */ + struct rb_node pos_node; + /* request-position tree root (see bfq_group's @rq_pos_tree) */ + struct rb_root *pos_root; + + /* sorted list of pending requests */ + struct rb_root sort_list; + /* if fifo isn't expired, next request to serve */ + struct request *next_rq; + /* number of sync and async requests queued */ + int queued[2]; + /* number of requests currently allocated */ + int allocated; + /* number of pending metadata requests */ + int meta_pending; + /* fifo list of requests in sort_list */ + struct list_head fifo; + + /* entity representing this queue in the scheduler */ + struct bfq_entity entity; + + /* maximum budget allowed from the feedback mechanism */ + int max_budget; + /* budget expiration (in jiffies) */ + unsigned long budget_timeout; + + /* number of requests on the dispatch list or inside driver */ + int dispatched; + + /* status flags */ + unsigned long flags; + + /* node for active/idle bfqq list inside parent bfqd */ + struct list_head bfqq_list; + + /* associated @bfq_ttime struct */ + struct bfq_ttime ttime; + + /* bit vector: a 1 for each seeky requests in history */ + u32 seek_history; + + /* node for the device's burst list */ + struct hlist_node burst_list_node; + + /* position of the last request enqueued */ + sector_t last_request_pos; + + /* Number of consecutive pairs of request completion and + * arrival, such that the queue becomes idle after the + * completion, but the next request arrives within an idle + * time slice; used only if the queue's IO_bound flag has been + * cleared. + */ + unsigned int requests_within_timer; + + /* pid of the process owning the queue, used for logging purposes */ + pid_t pid; + + /* + * Pointer to the bfq_io_cq owning the bfq_queue, set to %NULL + * if the queue is shared. + */ + struct bfq_io_cq *bic; + + /* current maximum weight-raising time for this queue */ + unsigned long wr_cur_max_time; + /* + * Minimum time instant such that, only if a new request is + * enqueued after this time instant in an idle @bfq_queue with + * no outstanding requests, then the task associated with the + * queue it is deemed as soft real-time (see the comments on + * the function bfq_bfqq_softrt_next_start()) + */ + unsigned long soft_rt_next_start; + /* + * Start time of the current weight-raising period if + * the @bfq-queue is being weight-raised, otherwise + * finish time of the last weight-raising period. + */ + unsigned long last_wr_start_finish; + /* factor by which the weight of this queue is multiplied */ + unsigned int wr_coeff; + /* + * Time of the last transition of the @bfq_queue from idle to + * backlogged. + */ + unsigned long last_idle_bklogged; + /* + * Cumulative service received from the @bfq_queue since the + * last transition from idle to backlogged. + */ + unsigned long service_from_backlogged; + + /* + * Value of wr start time when switching to soft rt + */ + unsigned long wr_start_at_switch_to_srt; + + unsigned long split_time; /* time of last split */ +}; + +/** + * struct bfq_io_cq - per (request_queue, io_context) structure. + */ +struct bfq_io_cq { + /* associated io_cq structure */ + struct io_cq icq; /* must be the first member */ + /* array of two process queues, the sync and the async */ + struct bfq_queue *bfqq[2]; + /* per (request_queue, blkcg) ioprio */ + int ioprio; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + uint64_t blkcg_serial_nr; /* the current blkcg serial */ +#endif + /* + * Snapshot of the idle window before merging; taken to + * remember this value while the queue is merged, so as to be + * able to restore it in case of split. + */ + bool saved_idle_window; + /* + * Same purpose as the previous two fields for the I/O bound + * classification of a queue. + */ + bool saved_IO_bound; + + /* + * Same purpose as the previous fields for the value of the + * field keeping the queue's belonging to a large burst + */ + bool saved_in_large_burst; + /* + * True if the queue belonged to a burst list before its merge + * with another cooperating queue. + */ + bool was_in_burst_list; + + /* + * Similar to previous fields: save wr information. + */ + unsigned long saved_wr_coeff; + unsigned long saved_last_wr_start_finish; + unsigned long saved_wr_start_at_switch_to_srt; + unsigned int saved_wr_cur_max_time; + struct bfq_ttime saved_ttime; +}; + +enum bfq_device_speed { + BFQ_BFQD_FAST, + BFQ_BFQD_SLOW, +}; + +/** + * struct bfq_data - per-device data structure. + * + * All the fields are protected by @lock. + */ +struct bfq_data { + /* device request queue */ + struct request_queue *queue; + /* dispatch queue */ + struct list_head dispatch; + + /* root bfq_group for the device */ + struct bfq_group *root_group; + + /* + * rbtree of weight counters of @bfq_queues, sorted by + * weight. Used to keep track of whether all @bfq_queues have + * the same weight. The tree contains one counter for each + * distinct weight associated to some active and not + * weight-raised @bfq_queue (see the comments to the functions + * bfq_weights_tree_[add|remove] for further details). + */ + struct rb_root queue_weights_tree; + /* + * rbtree of non-queue @bfq_entity weight counters, sorted by + * weight. Used to keep track of whether all @bfq_groups have + * the same weight. The tree contains one counter for each + * distinct weight associated to some active @bfq_group (see + * the comments to the functions bfq_weights_tree_[add|remove] + * for further details). + */ + struct rb_root group_weights_tree; + + /* + * Number of bfq_queues containing requests (including the + * queue in service, even if it is idling). + */ + int busy_queues; + /* number of weight-raised busy @bfq_queues */ + int wr_busy_queues; + /* number of queued requests */ + int queued; + /* number of requests dispatched and waiting for completion */ + int rq_in_driver; + + /* + * Maximum number of requests in driver in the last + * @hw_tag_samples completed requests. + */ + int max_rq_in_driver; + /* number of samples used to calculate hw_tag */ + int hw_tag_samples; + /* flag set to one if the driver is showing a queueing behavior */ + int hw_tag; + + /* number of budgets assigned */ + int budgets_assigned; + + /* + * Timer set when idling (waiting) for the next request from + * the queue in service. + */ + struct hrtimer idle_slice_timer; + + /* bfq_queue in service */ + struct bfq_queue *in_service_queue; + + /* on-disk position of the last served request */ + sector_t last_position; + + /* time of last request completion (ns) */ + u64 last_completion; + + /* time of first rq dispatch in current observation interval (ns) */ + u64 first_dispatch; + /* time of last rq dispatch in current observation interval (ns) */ + u64 last_dispatch; + + /* beginning of the last budget */ + ktime_t last_budget_start; + /* beginning of the last idle slice */ + ktime_t last_idling_start; + + /* number of samples in current observation interval */ + int peak_rate_samples; + /* num of samples of seq dispatches in current observation interval */ + u32 sequential_samples; + /* total num of sectors transferred in current observation interval */ + u64 tot_sectors_dispatched; + /* max rq size seen during current observation interval (sectors) */ + u32 last_rq_max_size; + /* time elapsed from first dispatch in current observ. interval (us) */ + u64 delta_from_first; + /* + * Current estimate of the device peak rate, measured in + * [BFQ_RATE_SHIFT * sectors/usec]. The left-shift by + * BFQ_RATE_SHIFT is performed to increase precision in + * fixed-point calculations. + */ + u32 peak_rate; + + /* maximum budget allotted to a bfq_queue before rescheduling */ + int bfq_max_budget; + + /* list of all the bfq_queues active on the device */ + struct list_head active_list; + /* list of all the bfq_queues idle on the device */ + struct list_head idle_list; + + /* + * Timeout for async/sync requests; when it fires, requests + * are served in fifo order. + */ + u64 bfq_fifo_expire[2]; + /* weight of backward seeks wrt forward ones */ + unsigned int bfq_back_penalty; + /* maximum allowed backward seek */ + unsigned int bfq_back_max; + /* maximum idling time */ + u32 bfq_slice_idle; + + /* user-configured max budget value (0 for auto-tuning) */ + int bfq_user_max_budget; + /* + * Timeout for bfq_queues to consume their budget; used to + * prevent seeky queues from imposing long latencies to + * sequential or quasi-sequential ones (this also implies that + * seeky queues cannot receive guarantees in the service + * domain; after a timeout they are charged for the time they + * have been in service, to preserve fairness among them, but + * without service-domain guarantees). + */ + unsigned int bfq_timeout; + + /* + * Number of consecutive requests that must be issued within + * the idle time slice to set again idling to a queue which + * was marked as non-I/O-bound (see the definition of the + * IO_bound flag for further details). + */ + unsigned int bfq_requests_within_timer; + + /* + * Force device idling whenever needed to provide accurate + * service guarantees, without caring about throughput + * issues. CAVEAT: this may even increase latencies, in case + * of useless idling for processes that did stop doing I/O. + */ + bool strict_guarantees; + + /* + * Last time at which a queue entered the current burst of + * queues being activated shortly after each other; for more + * details about this and the following parameters related to + * a burst of activations, see the comments on the function + * bfq_handle_burst. + */ + unsigned long last_ins_in_burst; + /* + * Reference time interval used to decide whether a queue has + * been activated shortly after @last_ins_in_burst. + */ + unsigned long bfq_burst_interval; + /* number of queues in the current burst of queue activations */ + int burst_size; + + /* common parent entity for the queues in the burst */ + struct bfq_entity *burst_parent_entity; + /* Maximum burst size above which the current queue-activation + * burst is deemed as 'large'. + */ + unsigned long bfq_large_burst_thresh; + /* true if a large queue-activation burst is in progress */ + bool large_burst; + /* + * Head of the burst list (as for the above fields, more + * details in the comments on the function bfq_handle_burst). + */ + struct hlist_head burst_list; + + /* if set to true, low-latency heuristics are enabled */ + bool low_latency; + /* + * Maximum factor by which the weight of a weight-raised queue + * is multiplied. + */ + unsigned int bfq_wr_coeff; + /* maximum duration of a weight-raising period (jiffies) */ + unsigned int bfq_wr_max_time; + + /* Maximum weight-raising duration for soft real-time processes */ + unsigned int bfq_wr_rt_max_time; + /* + * Minimum idle period after which weight-raising may be + * reactivated for a queue (in jiffies). + */ + unsigned int bfq_wr_min_idle_time; + /* + * Minimum period between request arrivals after which + * weight-raising may be reactivated for an already busy async + * queue (in jiffies). + */ + unsigned long bfq_wr_min_inter_arr_async; + + /* Max service-rate for a soft real-time queue, in sectors/sec */ + unsigned int bfq_wr_max_softrt_rate; + /* + * Cached value of the product R*T, used for computing the + * maximum duration of weight raising automatically. + */ + u64 RT_prod; + /* device-speed class for the low-latency heuristic */ + enum bfq_device_speed device_speed; + + /* fallback dummy bfqq for extreme OOM conditions */ + struct bfq_queue oom_bfqq; + + spinlock_t lock; + + /* + * bic associated with the task issuing current bio for + * merging. This and the next field are used as a support to + * be able to perform the bic lookup, needed by bio-merge + * functions, before the scheduler lock is taken, and thus + * avoid taking the request-queue lock while the scheduler + * lock is being held. + */ + struct bfq_io_cq *bio_bic; + /* bfqq associated with the task issuing current bio for merging */ + struct bfq_queue *bio_bfqq; +}; + +enum bfqq_state_flags { + BFQQF_just_created = 0, /* queue just allocated */ + BFQQF_busy, /* has requests or is in service */ + BFQQF_wait_request, /* waiting for a request */ + BFQQF_non_blocking_wait_rq, /* + * waiting for a request + * without idling the device + */ + BFQQF_fifo_expire, /* FIFO checked in this slice */ + BFQQF_idle_window, /* slice idling enabled */ + BFQQF_sync, /* synchronous queue */ + BFQQF_IO_bound, /* + * bfqq has timed-out at least once + * having consumed at most 2/10 of + * its budget + */ + BFQQF_in_large_burst, /* + * bfqq activated in a large burst, + * see comments to bfq_handle_burst. + */ + BFQQF_softrt_update, /* + * may need softrt-next-start + * update + */ + BFQQF_coop, /* bfqq is shared */ + BFQQF_split_coop /* shared bfqq will be split */ +}; + +#define BFQ_BFQQ_FNS(name) \ +void bfq_mark_bfqq_##name(struct bfq_queue *bfqq); \ +void bfq_clear_bfqq_##name(struct bfq_queue *bfqq); \ +int bfq_bfqq_##name(const struct bfq_queue *bfqq); + +BFQ_BFQQ_FNS(just_created); +BFQ_BFQQ_FNS(busy); +BFQ_BFQQ_FNS(wait_request); +BFQ_BFQQ_FNS(non_blocking_wait_rq); +BFQ_BFQQ_FNS(fifo_expire); +BFQ_BFQQ_FNS(idle_window); +BFQ_BFQQ_FNS(sync); +BFQ_BFQQ_FNS(IO_bound); +BFQ_BFQQ_FNS(in_large_burst); +BFQ_BFQQ_FNS(coop); +BFQ_BFQQ_FNS(split_coop); +BFQ_BFQQ_FNS(softrt_update); +#undef BFQ_BFQQ_FNS + +/* Expiration reasons. */ +enum bfqq_expiration { + BFQQE_TOO_IDLE = 0, /* + * queue has been idling for + * too long + */ + BFQQE_BUDGET_TIMEOUT, /* budget took too long to be used */ + BFQQE_BUDGET_EXHAUSTED, /* budget consumed */ + BFQQE_NO_MORE_REQUESTS, /* the queue has no more requests */ + BFQQE_PREEMPTED /* preemption in progress */ +}; + +struct bfqg_stats { +#ifdef CONFIG_BFQ_GROUP_IOSCHED + /* number of ios merged */ + struct blkg_rwstat merged; + /* total time spent on device in ns, may not be accurate w/ queueing */ + struct blkg_rwstat service_time; + /* total time spent waiting in scheduler queue in ns */ + struct blkg_rwstat wait_time; + /* number of IOs queued up */ + struct blkg_rwstat queued; + /* total disk time and nr sectors dispatched by this group */ + struct blkg_stat time; + /* sum of number of ios queued across all samples */ + struct blkg_stat avg_queue_size_sum; + /* count of samples taken for average */ + struct blkg_stat avg_queue_size_samples; + /* how many times this group has been removed from service tree */ + struct blkg_stat dequeue; + /* total time spent waiting for it to be assigned a timeslice. */ + struct blkg_stat group_wait_time; + /* time spent idling for this blkcg_gq */ + struct blkg_stat idle_time; + /* total time with empty current active q with other requests queued */ + struct blkg_stat empty_time; + /* fields after this shouldn't be cleared on stat reset */ + uint64_t start_group_wait_time; + uint64_t start_idle_time; + uint64_t start_empty_time; + uint16_t flags; +#endif /* CONFIG_BFQ_GROUP_IOSCHED */ +}; + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + +/* + * struct bfq_group_data - per-blkcg storage for the blkio subsystem. + * + * @ps: @blkcg_policy_storage that this structure inherits + * @weight: weight of the bfq_group + */ +struct bfq_group_data { + /* must be the first member */ + struct blkcg_policy_data pd; + + unsigned int weight; +}; + +/** + * struct bfq_group - per (device, cgroup) data structure. + * @entity: schedulable entity to insert into the parent group sched_data. + * @sched_data: own sched_data, to contain child entities (they may be + * both bfq_queues and bfq_groups). + * @bfqd: the bfq_data for the device this group acts upon. + * @async_bfqq: array of async queues for all the tasks belonging to + * the group, one queue per ioprio value per ioprio_class, + * except for the idle class that has only one queue. + * @async_idle_bfqq: async queue for the idle class (ioprio is ignored). + * @my_entity: pointer to @entity, %NULL for the toplevel group; used + * to avoid too many special cases during group creation/ + * migration. + * @stats: stats for this bfqg. + * @active_entities: number of active entities belonging to the group; + * unused for the root group. Used to know whether there + * are groups with more than one active @bfq_entity + * (see the comments to the function + * bfq_bfqq_may_idle()). + * @rq_pos_tree: rbtree sorted by next_request position, used when + * determining if two or more queues have interleaving + * requests (see bfq_find_close_cooperator()). + * + * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup + * there is a set of bfq_groups, each one collecting the lower-level + * entities belonging to the group that are acting on the same device. + * + * Locking works as follows: + * o @bfqd is protected by the queue lock, RCU is used to access it + * from the readers. + * o All the other fields are protected by the @bfqd queue lock. + */ +struct bfq_group { + /* must be the first member */ + struct blkg_policy_data pd; + + struct bfq_entity entity; + struct bfq_sched_data sched_data; + + void *bfqd; + + struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR]; + struct bfq_queue *async_idle_bfqq; + + struct bfq_entity *my_entity; + + int active_entities; + + struct rb_root rq_pos_tree; + + struct bfqg_stats stats; +}; + +#else +struct bfq_group { + struct bfq_sched_data sched_data; + + struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR]; + struct bfq_queue *async_idle_bfqq; + + struct rb_root rq_pos_tree; +}; +#endif + +struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity); + +/* --------------- main algorithm interface ----------------- */ + +#define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \ + { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 }) + +extern const int bfq_timeout; + +struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync); +void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync); +struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic); +void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq); +void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq); +void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity, + struct rb_root *root); +void bfq_weights_tree_remove(struct bfq_data *bfqd, struct bfq_entity *entity, + struct rb_root *root); +void bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq, + bool compensate, enum bfqq_expiration reason); +void bfq_put_queue(struct bfq_queue *bfqq); +void bfq_end_wr_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg); +void bfq_schedule_dispatch(struct bfq_data *bfqd); +void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg); + +/* ------------ end of main algorithm interface -------------- */ + +/* ---------------- cgroups-support interface ---------------- */ + +void bfqg_stats_update_io_add(struct bfq_group *bfqg, struct bfq_queue *bfqq, + unsigned int op); +void bfqg_stats_update_io_remove(struct bfq_group *bfqg, unsigned int op); +void bfqg_stats_update_io_merged(struct bfq_group *bfqg, unsigned int op); +void bfqg_stats_update_completion(struct bfq_group *bfqg, uint64_t start_time, + uint64_t io_start_time, unsigned int op); +void bfqg_stats_update_dequeue(struct bfq_group *bfqg); +void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg); +void bfqg_stats_update_idle_time(struct bfq_group *bfqg); +void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg); +void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg); +void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq, + struct bfq_group *bfqg); + +void bfq_init_entity(struct bfq_entity *entity, struct bfq_group *bfqg); +void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio); +void bfq_end_wr_async(struct bfq_data *bfqd); +struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd, + struct blkcg *blkcg); +struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg); +struct bfq_group *bfqq_group(struct bfq_queue *bfqq); +struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node); +void bfqg_put(struct bfq_group *bfqg); + +#ifdef CONFIG_BFQ_GROUP_IOSCHED +extern struct cftype bfq_blkcg_legacy_files[]; +extern struct cftype bfq_blkg_files[]; +extern struct blkcg_policy blkcg_policy_bfq; +#endif + +/* ------------- end of cgroups-support interface ------------- */ + +/* - interface of the internal hierarchical B-WF2Q+ scheduler - */ + +#ifdef CONFIG_BFQ_GROUP_IOSCHED +/* both next loops stop at one of the child entities of the root group */ +#define for_each_entity(entity) \ + for (; entity ; entity = entity->parent) + +/* + * For each iteration, compute parent in advance, so as to be safe if + * entity is deallocated during the iteration. Such a deallocation may + * happen as a consequence of a bfq_put_queue that frees the bfq_queue + * containing entity. + */ +#define for_each_entity_safe(entity, parent) \ + for (; entity && ({ parent = entity->parent; 1; }); entity = parent) + +#else /* CONFIG_BFQ_GROUP_IOSCHED */ +/* + * Next two macros are fake loops when cgroups support is not + * enabled. I fact, in such a case, there is only one level to go up + * (to reach the root group). + */ +#define for_each_entity(entity) \ + for (; entity ; entity = NULL) + +#define for_each_entity_safe(entity, parent) \ + for (parent = NULL; entity ; entity = parent) +#endif /* CONFIG_BFQ_GROUP_IOSCHED */ + +struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq); +struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity); +struct bfq_service_tree *bfq_entity_service_tree(struct bfq_entity *entity); +struct bfq_entity *bfq_entity_of(struct rb_node *node); +unsigned short bfq_ioprio_to_weight(int ioprio); +void bfq_put_idle_entity(struct bfq_service_tree *st, + struct bfq_entity *entity); +struct bfq_service_tree * +__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st, + struct bfq_entity *entity); +void bfq_bfqq_served(struct bfq_queue *bfqq, int served); +void bfq_bfqq_charge_time(struct bfq_data *bfqd, struct bfq_queue *bfqq, + unsigned long time_ms); +bool __bfq_deactivate_entity(struct bfq_entity *entity, + bool ins_into_idle_tree); +bool next_queue_may_preempt(struct bfq_data *bfqd); +struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd); +void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd); +void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, + bool ins_into_idle_tree, bool expiration); +void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq); +void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq); +void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq, + bool expiration); +void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq); + +/* --------------- end of interface of B-WF2Q+ ---------------- */ + +/* Logging facilities. */ +#ifdef CONFIG_BFQ_GROUP_IOSCHED +struct bfq_group *bfqq_group(struct bfq_queue *bfqq); + +#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \ + char __pbuf[128]; \ + \ + blkg_path(bfqg_to_blkg(bfqq_group(bfqq)), __pbuf, sizeof(__pbuf)); \ + blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid, \ + bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \ + __pbuf, ##args); \ +} while (0) + +#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do { \ + char __pbuf[128]; \ + \ + blkg_path(bfqg_to_blkg(bfqg), __pbuf, sizeof(__pbuf)); \ + blk_add_trace_msg((bfqd)->queue, "%s " fmt, __pbuf, ##args); \ +} while (0) + +#else /* CONFIG_BFQ_GROUP_IOSCHED */ + +#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \ + blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid, \ + bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \ + ##args) +#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {} while (0) + +#endif /* CONFIG_BFQ_GROUP_IOSCHED */ + +#define bfq_log(bfqd, fmt, args...) \ + blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args) + +#endif /* _BFQ_H */ diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c new file mode 100644 index 000000000000..b4fc3e4260b7 --- /dev/null +++ b/block/bfq-wf2q.c @@ -0,0 +1,1616 @@ +/* + * Hierarchical Budget Worst-case Fair Weighted Fair Queueing + * (B-WF2Q+): hierarchical scheduling algorithm by which the BFQ I/O + * scheduler schedules generic entities. The latter can represent + * either single bfq queues (associated with processes) or groups of + * bfq queues (associated with cgroups). + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 of the + * License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#include "bfq-iosched.h" + +/** + * bfq_gt - compare two timestamps. + * @a: first ts. + * @b: second ts. + * + * Return @a > @b, dealing with wrapping correctly. + */ +static int bfq_gt(u64 a, u64 b) +{ + return (s64)(a - b) > 0; +} + +static struct bfq_entity *bfq_root_active_entity(struct rb_root *tree) +{ + struct rb_node *node = tree->rb_node; + + return rb_entry(node, struct bfq_entity, rb_node); +} + +static unsigned int bfq_class_idx(struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + return bfqq ? bfqq->ioprio_class - 1 : + BFQ_DEFAULT_GRP_CLASS - 1; +} + +static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd); + +static bool bfq_update_parent_budget(struct bfq_entity *next_in_service); + +/** + * bfq_update_next_in_service - update sd->next_in_service + * @sd: sched_data for which to perform the update. + * @new_entity: if not NULL, pointer to the entity whose activation, + * requeueing or repositionig triggered the invocation of + * this function. + * + * This function is called to update sd->next_in_service, which, in + * its turn, may change as a consequence of the insertion or + * extraction of an entity into/from one of the active trees of + * sd. These insertions/extractions occur as a consequence of + * activations/deactivations of entities, with some activations being + * 'true' activations, and other activations being requeueings (i.e., + * implementing the second, requeueing phase of the mechanism used to + * reposition an entity in its active tree; see comments on + * __bfq_activate_entity and __bfq_requeue_entity for details). In + * both the last two activation sub-cases, new_entity points to the + * just activated or requeued entity. + * + * Returns true if sd->next_in_service changes in such a way that + * entity->parent may become the next_in_service for its parent + * entity. + */ +static bool bfq_update_next_in_service(struct bfq_sched_data *sd, + struct bfq_entity *new_entity) +{ + struct bfq_entity *next_in_service = sd->next_in_service; + bool parent_sched_may_change = false; + + /* + * If this update is triggered by the activation, requeueing + * or repositiong of an entity that does not coincide with + * sd->next_in_service, then a full lookup in the active tree + * can be avoided. In fact, it is enough to check whether the + * just-modified entity has a higher priority than + * sd->next_in_service, or, even if it has the same priority + * as sd->next_in_service, is eligible and has a lower virtual + * finish time than sd->next_in_service. If this compound + * condition holds, then the new entity becomes the new + * next_in_service. Otherwise no change is needed. + */ + if (new_entity && new_entity != sd->next_in_service) { + /* + * Flag used to decide whether to replace + * sd->next_in_service with new_entity. Tentatively + * set to true, and left as true if + * sd->next_in_service is NULL. + */ + bool replace_next = true; + + /* + * If there is already a next_in_service candidate + * entity, then compare class priorities or timestamps + * to decide whether to replace sd->service_tree with + * new_entity. + */ + if (next_in_service) { + unsigned int new_entity_class_idx = + bfq_class_idx(new_entity); + struct bfq_service_tree *st = + sd->service_tree + new_entity_class_idx; + + /* + * For efficiency, evaluate the most likely + * sub-condition first. + */ + replace_next = + (new_entity_class_idx == + bfq_class_idx(next_in_service) + && + !bfq_gt(new_entity->start, st->vtime) + && + bfq_gt(next_in_service->finish, + new_entity->finish)) + || + new_entity_class_idx < + bfq_class_idx(next_in_service); + } + + if (replace_next) + next_in_service = new_entity; + } else /* invoked because of a deactivation: lookup needed */ + next_in_service = bfq_lookup_next_entity(sd); + + if (next_in_service) { + parent_sched_may_change = !sd->next_in_service || + bfq_update_parent_budget(next_in_service); + } + + sd->next_in_service = next_in_service; + + if (!next_in_service) + return parent_sched_may_change; + + return parent_sched_may_change; +} + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + +struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq) +{ + struct bfq_entity *group_entity = bfqq->entity.parent; + + if (!group_entity) + group_entity = &bfqq->bfqd->root_group->entity; + + return container_of(group_entity, struct bfq_group, entity); +} + +/* + * Returns true if this budget changes may let next_in_service->parent + * become the next_in_service entity for its parent entity. + */ +static bool bfq_update_parent_budget(struct bfq_entity *next_in_service) +{ + struct bfq_entity *bfqg_entity; + struct bfq_group *bfqg; + struct bfq_sched_data *group_sd; + bool ret = false; + + group_sd = next_in_service->sched_data; + + bfqg = container_of(group_sd, struct bfq_group, sched_data); + /* + * bfq_group's my_entity field is not NULL only if the group + * is not the root group. We must not touch the root entity + * as it must never become an in-service entity. + */ + bfqg_entity = bfqg->my_entity; + if (bfqg_entity) { + if (bfqg_entity->budget > next_in_service->budget) + ret = true; + bfqg_entity->budget = next_in_service->budget; + } + + return ret; +} + +/* + * This function tells whether entity stops being a candidate for next + * service, according to the following logic. + * + * This function is invoked for an entity that is about to be set in + * service. If such an entity is a queue, then the entity is no longer + * a candidate for next service (i.e, a candidate entity to serve + * after the in-service entity is expired). The function then returns + * true. + * + * In contrast, the entity could stil be a candidate for next service + * if it is not a queue, and has more than one child. In fact, even if + * one of its children is about to be set in service, other children + * may still be the next to serve. As a consequence, a non-queue + * entity is not a candidate for next-service only if it has only one + * child. And only if this condition holds, then the function returns + * true for a non-queue entity. + */ +static bool bfq_no_longer_next_in_service(struct bfq_entity *entity) +{ + struct bfq_group *bfqg; + + if (bfq_entity_to_bfqq(entity)) + return true; + + bfqg = container_of(entity, struct bfq_group, entity); + + if (bfqg->active_entities == 1) + return true; + + return false; +} + +#else /* CONFIG_BFQ_GROUP_IOSCHED */ + +struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq) +{ + return bfqq->bfqd->root_group; +} + +static bool bfq_update_parent_budget(struct bfq_entity *next_in_service) +{ + return false; +} + +static bool bfq_no_longer_next_in_service(struct bfq_entity *entity) +{ + return true; +} + +#endif /* CONFIG_BFQ_GROUP_IOSCHED */ + +/* + * Shift for timestamp calculations. This actually limits the maximum + * service allowed in one timestamp delta (small shift values increase it), + * the maximum total weight that can be used for the queues in the system + * (big shift values increase it), and the period of virtual time + * wraparounds. + */ +#define WFQ_SERVICE_SHIFT 22 + +struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = NULL; + + if (!entity->my_sched_data) + bfqq = container_of(entity, struct bfq_queue, entity); + + return bfqq; +} + + +/** + * bfq_delta - map service into the virtual time domain. + * @service: amount of service. + * @weight: scale factor (weight of an entity or weight sum). + */ +static u64 bfq_delta(unsigned long service, unsigned long weight) +{ + u64 d = (u64)service << WFQ_SERVICE_SHIFT; + + do_div(d, weight); + return d; +} + +/** + * bfq_calc_finish - assign the finish time to an entity. + * @entity: the entity to act upon. + * @service: the service to be charged to the entity. + */ +static void bfq_calc_finish(struct bfq_entity *entity, unsigned long service) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + entity->finish = entity->start + + bfq_delta(service, entity->weight); + + if (bfqq) { + bfq_log_bfqq(bfqq->bfqd, bfqq, + "calc_finish: serv %lu, w %d", + service, entity->weight); + bfq_log_bfqq(bfqq->bfqd, bfqq, + "calc_finish: start %llu, finish %llu, delta %llu", + entity->start, entity->finish, + bfq_delta(service, entity->weight)); + } +} + +/** + * bfq_entity_of - get an entity from a node. + * @node: the node field of the entity. + * + * Convert a node pointer to the relative entity. This is used only + * to simplify the logic of some functions and not as the generic + * conversion mechanism because, e.g., in the tree walking functions, + * the check for a %NULL value would be redundant. + */ +struct bfq_entity *bfq_entity_of(struct rb_node *node) +{ + struct bfq_entity *entity = NULL; + + if (node) + entity = rb_entry(node, struct bfq_entity, rb_node); + + return entity; +} + +/** + * bfq_extract - remove an entity from a tree. + * @root: the tree root. + * @entity: the entity to remove. + */ +static void bfq_extract(struct rb_root *root, struct bfq_entity *entity) +{ + entity->tree = NULL; + rb_erase(&entity->rb_node, root); +} + +/** + * bfq_idle_extract - extract an entity from the idle tree. + * @st: the service tree of the owning @entity. + * @entity: the entity being removed. + */ +static void bfq_idle_extract(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + struct rb_node *next; + + if (entity == st->first_idle) { + next = rb_next(&entity->rb_node); + st->first_idle = bfq_entity_of(next); + } + + if (entity == st->last_idle) { + next = rb_prev(&entity->rb_node); + st->last_idle = bfq_entity_of(next); + } + + bfq_extract(&st->idle, entity); + + if (bfqq) + list_del(&bfqq->bfqq_list); +} + +/** + * bfq_insert - generic tree insertion. + * @root: tree root. + * @entity: entity to insert. + * + * This is used for the idle and the active tree, since they are both + * ordered by finish time. + */ +static void bfq_insert(struct rb_root *root, struct bfq_entity *entity) +{ + struct bfq_entity *entry; + struct rb_node **node = &root->rb_node; + struct rb_node *parent = NULL; + + while (*node) { + parent = *node; + entry = rb_entry(parent, struct bfq_entity, rb_node); + + if (bfq_gt(entry->finish, entity->finish)) + node = &parent->rb_left; + else + node = &parent->rb_right; + } + + rb_link_node(&entity->rb_node, parent, node); + rb_insert_color(&entity->rb_node, root); + + entity->tree = root; +} + +/** + * bfq_update_min - update the min_start field of a entity. + * @entity: the entity to update. + * @node: one of its children. + * + * This function is called when @entity may store an invalid value for + * min_start due to updates to the active tree. The function assumes + * that the subtree rooted at @node (which may be its left or its right + * child) has a valid min_start value. + */ +static void bfq_update_min(struct bfq_entity *entity, struct rb_node *node) +{ + struct bfq_entity *child; + + if (node) { + child = rb_entry(node, struct bfq_entity, rb_node); + if (bfq_gt(entity->min_start, child->min_start)) + entity->min_start = child->min_start; + } +} + +/** + * bfq_update_active_node - recalculate min_start. + * @node: the node to update. + * + * @node may have changed position or one of its children may have moved, + * this function updates its min_start value. The left and right subtrees + * are assumed to hold a correct min_start value. + */ +static void bfq_update_active_node(struct rb_node *node) +{ + struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node); + + entity->min_start = entity->start; + bfq_update_min(entity, node->rb_right); + bfq_update_min(entity, node->rb_left); +} + +/** + * bfq_update_active_tree - update min_start for the whole active tree. + * @node: the starting node. + * + * @node must be the deepest modified node after an update. This function + * updates its min_start using the values held by its children, assuming + * that they did not change, and then updates all the nodes that may have + * changed in the path to the root. The only nodes that may have changed + * are the ones in the path or their siblings. + */ +static void bfq_update_active_tree(struct rb_node *node) +{ + struct rb_node *parent; + +up: + bfq_update_active_node(node); + + parent = rb_parent(node); + if (!parent) + return; + + if (node == parent->rb_left && parent->rb_right) + bfq_update_active_node(parent->rb_right); + else if (parent->rb_left) + bfq_update_active_node(parent->rb_left); + + node = parent; + goto up; +} + +/** + * bfq_active_insert - insert an entity in the active tree of its + * group/device. + * @st: the service tree of the entity. + * @entity: the entity being inserted. + * + * The active tree is ordered by finish time, but an extra key is kept + * per each node, containing the minimum value for the start times of + * its children (and the node itself), so it's possible to search for + * the eligible node with the lowest finish time in logarithmic time. + */ +static void bfq_active_insert(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + struct rb_node *node = &entity->rb_node; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + struct bfq_sched_data *sd = NULL; + struct bfq_group *bfqg = NULL; + struct bfq_data *bfqd = NULL; +#endif + + bfq_insert(&st->active, entity); + + if (node->rb_left) + node = node->rb_left; + else if (node->rb_right) + node = node->rb_right; + + bfq_update_active_tree(node); + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + sd = entity->sched_data; + bfqg = container_of(sd, struct bfq_group, sched_data); + bfqd = (struct bfq_data *)bfqg->bfqd; +#endif + if (bfqq) + list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + else /* bfq_group */ + bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree); + + if (bfqg != bfqd->root_group) + bfqg->active_entities++; +#endif +} + +/** + * bfq_ioprio_to_weight - calc a weight from an ioprio. + * @ioprio: the ioprio value to convert. + */ +unsigned short bfq_ioprio_to_weight(int ioprio) +{ + return (IOPRIO_BE_NR - ioprio) * BFQ_WEIGHT_CONVERSION_COEFF; +} + +/** + * bfq_weight_to_ioprio - calc an ioprio from a weight. + * @weight: the weight value to convert. + * + * To preserve as much as possible the old only-ioprio user interface, + * 0 is used as an escape ioprio value for weights (numerically) equal or + * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF. + */ +static unsigned short bfq_weight_to_ioprio(int weight) +{ + return max_t(int, 0, + IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight); +} + +static void bfq_get_entity(struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + if (bfqq) { + bfqq->ref++; + bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d", + bfqq, bfqq->ref); + } +} + +/** + * bfq_find_deepest - find the deepest node that an extraction can modify. + * @node: the node being removed. + * + * Do the first step of an extraction in an rb tree, looking for the + * node that will replace @node, and returning the deepest node that + * the following modifications to the tree can touch. If @node is the + * last node in the tree return %NULL. + */ +static struct rb_node *bfq_find_deepest(struct rb_node *node) +{ + struct rb_node *deepest; + + if (!node->rb_right && !node->rb_left) + deepest = rb_parent(node); + else if (!node->rb_right) + deepest = node->rb_left; + else if (!node->rb_left) + deepest = node->rb_right; + else { + deepest = rb_next(node); + if (deepest->rb_right) + deepest = deepest->rb_right; + else if (rb_parent(deepest) != node) + deepest = rb_parent(deepest); + } + + return deepest; +} + +/** + * bfq_active_extract - remove an entity from the active tree. + * @st: the service_tree containing the tree. + * @entity: the entity being removed. + */ +static void bfq_active_extract(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + struct rb_node *node; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + struct bfq_sched_data *sd = NULL; + struct bfq_group *bfqg = NULL; + struct bfq_data *bfqd = NULL; +#endif + + node = bfq_find_deepest(&entity->rb_node); + bfq_extract(&st->active, entity); + + if (node) + bfq_update_active_tree(node); + +#ifdef CONFIG_BFQ_GROUP_IOSCHED + sd = entity->sched_data; + bfqg = container_of(sd, struct bfq_group, sched_data); + bfqd = (struct bfq_data *)bfqg->bfqd; +#endif + if (bfqq) + list_del(&bfqq->bfqq_list); +#ifdef CONFIG_BFQ_GROUP_IOSCHED + else /* bfq_group */ + bfq_weights_tree_remove(bfqd, entity, + &bfqd->group_weights_tree); + + if (bfqg != bfqd->root_group) + bfqg->active_entities--; +#endif +} + +/** + * bfq_idle_insert - insert an entity into the idle tree. + * @st: the service tree containing the tree. + * @entity: the entity to insert. + */ +static void bfq_idle_insert(struct bfq_service_tree *st, + struct bfq_entity *entity) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + struct bfq_entity *first_idle = st->first_idle; + struct bfq_entity *last_idle = st->last_idle; + + if (!first_idle || bfq_gt(first_idle->finish, entity->finish)) + st->first_idle = entity; + if (!last_idle || bfq_gt(entity->finish, last_idle->finish)) + st->last_idle = entity; + + bfq_insert(&st->idle, entity); + + if (bfqq) + list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list); +} + +/** + * bfq_forget_entity - do not consider entity any longer for scheduling + * @st: the service tree. + * @entity: the entity being removed. + * @is_in_service: true if entity is currently the in-service entity. + * + * Forget everything about @entity. In addition, if entity represents + * a queue, and the latter is not in service, then release the service + * reference to the queue (the one taken through bfq_get_entity). In + * fact, in this case, there is really no more service reference to + * the queue, as the latter is also outside any service tree. If, + * instead, the queue is in service, then __bfq_bfqd_reset_in_service + * will take care of putting the reference when the queue finally + * stops being served. + */ +static void bfq_forget_entity(struct bfq_service_tree *st, + struct bfq_entity *entity, + bool is_in_service) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + entity->on_st = false; + st->wsum -= entity->weight; + if (bfqq && !is_in_service) + bfq_put_queue(bfqq); +} + +/** + * bfq_put_idle_entity - release the idle tree ref of an entity. + * @st: service tree for the entity. + * @entity: the entity being released. + */ +void bfq_put_idle_entity(struct bfq_service_tree *st, struct bfq_entity *entity) +{ + bfq_idle_extract(st, entity); + bfq_forget_entity(st, entity, + entity == entity->sched_data->in_service_entity); +} + +/** + * bfq_forget_idle - update the idle tree if necessary. + * @st: the service tree to act upon. + * + * To preserve the global O(log N) complexity we only remove one entry here; + * as the idle tree will not grow indefinitely this can be done safely. + */ +static void bfq_forget_idle(struct bfq_service_tree *st) +{ + struct bfq_entity *first_idle = st->first_idle; + struct bfq_entity *last_idle = st->last_idle; + + if (RB_EMPTY_ROOT(&st->active) && last_idle && + !bfq_gt(last_idle->finish, st->vtime)) { + /* + * Forget the whole idle tree, increasing the vtime past + * the last finish time of idle entities. + */ + st->vtime = last_idle->finish; + } + + if (first_idle && !bfq_gt(first_idle->finish, st->vtime)) + bfq_put_idle_entity(st, first_idle); +} + +struct bfq_service_tree *bfq_entity_service_tree(struct bfq_entity *entity) +{ + struct bfq_sched_data *sched_data = entity->sched_data; + unsigned int idx = bfq_class_idx(entity); + + return sched_data->service_tree + idx; +} + + +struct bfq_service_tree * +__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st, + struct bfq_entity *entity) +{ + struct bfq_service_tree *new_st = old_st; + + if (entity->prio_changed) { + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + unsigned int prev_weight, new_weight; + struct bfq_data *bfqd = NULL; + struct rb_root *root; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + struct bfq_sched_data *sd; + struct bfq_group *bfqg; +#endif + + if (bfqq) + bfqd = bfqq->bfqd; +#ifdef CONFIG_BFQ_GROUP_IOSCHED + else { + sd = entity->my_sched_data; + bfqg = container_of(sd, struct bfq_group, sched_data); + bfqd = (struct bfq_data *)bfqg->bfqd; + } +#endif + + old_st->wsum -= entity->weight; + + if (entity->new_weight != entity->orig_weight) { + if (entity->new_weight < BFQ_MIN_WEIGHT || + entity->new_weight > BFQ_MAX_WEIGHT) { + pr_crit("update_weight_prio: new_weight %d\n", + entity->new_weight); + if (entity->new_weight < BFQ_MIN_WEIGHT) + entity->new_weight = BFQ_MIN_WEIGHT; + else + entity->new_weight = BFQ_MAX_WEIGHT; + } + entity->orig_weight = entity->new_weight; + if (bfqq) + bfqq->ioprio = + bfq_weight_to_ioprio(entity->orig_weight); + } + + if (bfqq) + bfqq->ioprio_class = bfqq->new_ioprio_class; + entity->prio_changed = 0; + + /* + * NOTE: here we may be changing the weight too early, + * this will cause unfairness. The correct approach + * would have required additional complexity to defer + * weight changes to the proper time instants (i.e., + * when entity->finish <= old_st->vtime). + */ + new_st = bfq_entity_service_tree(entity); + + prev_weight = entity->weight; + new_weight = entity->orig_weight * + (bfqq ? bfqq->wr_coeff : 1); + /* + * If the weight of the entity changes, remove the entity + * from its old weight counter (if there is a counter + * associated with the entity), and add it to the counter + * associated with its new weight. + */ + if (prev_weight != new_weight) { + root = bfqq ? &bfqd->queue_weights_tree : + &bfqd->group_weights_tree; + bfq_weights_tree_remove(bfqd, entity, root); + } + entity->weight = new_weight; + /* + * Add the entity to its weights tree only if it is + * not associated with a weight-raised queue. + */ + if (prev_weight != new_weight && + (bfqq ? bfqq->wr_coeff == 1 : 1)) + /* If we get here, root has been initialized. */ + bfq_weights_tree_add(bfqd, entity, root); + + new_st->wsum += entity->weight; + + if (new_st != old_st) + entity->start = new_st->vtime; + } + + return new_st; +} + +/** + * bfq_bfqq_served - update the scheduler status after selection for + * service. + * @bfqq: the queue being served. + * @served: bytes to transfer. + * + * NOTE: this can be optimized, as the timestamps of upper level entities + * are synchronized every time a new bfqq is selected for service. By now, + * we keep it to better check consistency. + */ +void bfq_bfqq_served(struct bfq_queue *bfqq, int served) +{ + struct bfq_entity *entity = &bfqq->entity; + struct bfq_service_tree *st; + + for_each_entity(entity) { + st = bfq_entity_service_tree(entity); + + entity->service += served; + + st->vtime += bfq_delta(served, st->wsum); + bfq_forget_idle(st); + } + bfqg_stats_set_start_empty_time(bfqq_group(bfqq)); + bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %d secs", served); +} + +/** + * bfq_bfqq_charge_time - charge an amount of service equivalent to the length + * of the time interval during which bfqq has been in + * service. + * @bfqd: the device + * @bfqq: the queue that needs a service update. + * @time_ms: the amount of time during which the queue has received service + * + * If a queue does not consume its budget fast enough, then providing + * the queue with service fairness may impair throughput, more or less + * severely. For this reason, queues that consume their budget slowly + * are provided with time fairness instead of service fairness. This + * goal is achieved through the BFQ scheduling engine, even if such an + * engine works in the service, and not in the time domain. The trick + * is charging these queues with an inflated amount of service, equal + * to the amount of service that they would have received during their + * service slot if they had been fast, i.e., if their requests had + * been dispatched at a rate equal to the estimated peak rate. + * + * It is worth noting that time fairness can cause important + * distortions in terms of bandwidth distribution, on devices with + * internal queueing. The reason is that I/O requests dispatched + * during the service slot of a queue may be served after that service + * slot is finished, and may have a total processing time loosely + * correlated with the duration of the service slot. This is + * especially true for short service slots. + */ +void bfq_bfqq_charge_time(struct bfq_data *bfqd, struct bfq_queue *bfqq, + unsigned long time_ms) +{ + struct bfq_entity *entity = &bfqq->entity; + int tot_serv_to_charge = entity->service; + unsigned int timeout_ms = jiffies_to_msecs(bfq_timeout); + + if (time_ms > 0 && time_ms < timeout_ms) + tot_serv_to_charge = + (bfqd->bfq_max_budget * time_ms) / timeout_ms; + + if (tot_serv_to_charge < entity->service) + tot_serv_to_charge = entity->service; + + /* Increase budget to avoid inconsistencies */ + if (tot_serv_to_charge > entity->budget) + entity->budget = tot_serv_to_charge; + + bfq_bfqq_served(bfqq, + max_t(int, 0, tot_serv_to_charge - entity->service)); +} + +static void bfq_update_fin_time_enqueue(struct bfq_entity *entity, + struct bfq_service_tree *st, + bool backshifted) +{ + struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity); + + st = __bfq_entity_update_weight_prio(st, entity); + bfq_calc_finish(entity, entity->budget); + + /* + * If some queues enjoy backshifting for a while, then their + * (virtual) finish timestamps may happen to become lower and + * lower than the system virtual time. In particular, if + * these queues often happen to be idle for short time + * periods, and during such time periods other queues with + * higher timestamps happen to be busy, then the backshifted + * timestamps of the former queues can become much lower than + * the system virtual time. In fact, to serve the queues with + * higher timestamps while the ones with lower timestamps are + * idle, the system virtual time may be pushed-up to much + * higher values than the finish timestamps of the idle + * queues. As a consequence, the finish timestamps of all new + * or newly activated queues may end up being much larger than + * those of lucky queues with backshifted timestamps. The + * latter queues may then monopolize the device for a lot of + * time. This would simply break service guarantees. + * + * To reduce this problem, push up a little bit the + * backshifted timestamps of the queue associated with this + * entity (only a queue can happen to have the backshifted + * flag set): just enough to let the finish timestamp of the + * queue be equal to the current value of the system virtual + * time. This may introduce a little unfairness among queues + * with backshifted timestamps, but it does not break + * worst-case fairness guarantees. + * + * As a special case, if bfqq is weight-raised, push up + * timestamps much less, to keep very low the probability that + * this push up causes the backshifted finish timestamps of + * weight-raised queues to become higher than the backshifted + * finish timestamps of non weight-raised queues. + */ + if (backshifted && bfq_gt(st->vtime, entity->finish)) { + unsigned long delta = st->vtime - entity->finish; + + if (bfqq) + delta /= bfqq->wr_coeff; + + entity->start += delta; + entity->finish += delta; + } + + bfq_active_insert(st, entity); +} + +/** + * __bfq_activate_entity - handle activation of entity. + * @entity: the entity being activated. + * @non_blocking_wait_rq: true if entity was waiting for a request + * + * Called for a 'true' activation, i.e., if entity is not active and + * one of its children receives a new request. + * + * Basically, this function updates the timestamps of entity and + * inserts entity into its active tree, ater possible extracting it + * from its idle tree. + */ +static void __bfq_activate_entity(struct bfq_entity *entity, + bool non_blocking_wait_rq) +{ + struct bfq_service_tree *st = bfq_entity_service_tree(entity); + bool backshifted = false; + unsigned long long min_vstart; + + /* See comments on bfq_fqq_update_budg_for_activation */ + if (non_blocking_wait_rq && bfq_gt(st->vtime, entity->finish)) { + backshifted = true; + min_vstart = entity->finish; + } else + min_vstart = st->vtime; + + if (entity->tree == &st->idle) { + /* + * Must be on the idle tree, bfq_idle_extract() will + * check for that. + */ + bfq_idle_extract(st, entity); + entity->start = bfq_gt(min_vstart, entity->finish) ? + min_vstart : entity->finish; + } else { + /* + * The finish time of the entity may be invalid, and + * it is in the past for sure, otherwise the queue + * would have been on the idle tree. + */ + entity->start = min_vstart; + st->wsum += entity->weight; + /* + * entity is about to be inserted into a service tree, + * and then set in service: get a reference to make + * sure entity does not disappear until it is no + * longer in service or scheduled for service. + */ + bfq_get_entity(entity); + + entity->on_st = true; + } + + bfq_update_fin_time_enqueue(entity, st, backshifted); +} + +/** + * __bfq_requeue_entity - handle requeueing or repositioning of an entity. + * @entity: the entity being requeued or repositioned. + * + * Requeueing is needed if this entity stops being served, which + * happens if a leaf descendant entity has expired. On the other hand, + * repositioning is needed if the next_inservice_entity for the child + * entity has changed. See the comments inside the function for + * details. + * + * Basically, this function: 1) removes entity from its active tree if + * present there, 2) updates the timestamps of entity and 3) inserts + * entity back into its active tree (in the new, right position for + * the new values of the timestamps). + */ +static void __bfq_requeue_entity(struct bfq_entity *entity) +{ + struct bfq_sched_data *sd = entity->sched_data; + struct bfq_service_tree *st = bfq_entity_service_tree(entity); + + if (entity == sd->in_service_entity) { + /* + * We are requeueing the current in-service entity, + * which may have to be done for one of the following + * reasons: + * - entity represents the in-service queue, and the + * in-service queue is being requeued after an + * expiration; + * - entity represents a group, and its budget has + * changed because one of its child entities has + * just been either activated or requeued for some + * reason; the timestamps of the entity need then to + * be updated, and the entity needs to be enqueued + * or repositioned accordingly. + * + * In particular, before requeueing, the start time of + * the entity must be moved forward to account for the + * service that the entity has received while in + * service. This is done by the next instructions. The + * finish time will then be updated according to this + * new value of the start time, and to the budget of + * the entity. + */ + bfq_calc_finish(entity, entity->service); + entity->start = entity->finish; + /* + * In addition, if the entity had more than one child + * when set in service, then was not extracted from + * the active tree. This implies that the position of + * the entity in the active tree may need to be + * changed now, because we have just updated the start + * time of the entity, and we will update its finish + * time in a moment (the requeueing is then, more + * precisely, a repositioning in this case). To + * implement this repositioning, we: 1) dequeue the + * entity here, 2) update the finish time and + * requeue the entity according to the new + * timestamps below. + */ + if (entity->tree) + bfq_active_extract(st, entity); + } else { /* The entity is already active, and not in service */ + /* + * In this case, this function gets called only if the + * next_in_service entity below this entity has + * changed, and this change has caused the budget of + * this entity to change, which, finally implies that + * the finish time of this entity must be + * updated. Such an update may cause the scheduling, + * i.e., the position in the active tree, of this + * entity to change. We handle this change by: 1) + * dequeueing the entity here, 2) updating the finish + * time and requeueing the entity according to the new + * timestamps below. This is the same approach as the + * non-extracted-entity sub-case above. + */ + bfq_active_extract(st, entity); + } + + bfq_update_fin_time_enqueue(entity, st, false); +} + +static void __bfq_activate_requeue_entity(struct bfq_entity *entity, + struct bfq_sched_data *sd, + bool non_blocking_wait_rq) +{ + struct bfq_service_tree *st = bfq_entity_service_tree(entity); + + if (sd->in_service_entity == entity || entity->tree == &st->active) + /* + * in service or already queued on the active tree, + * requeue or reposition + */ + __bfq_requeue_entity(entity); + else + /* + * Not in service and not queued on its active tree: + * the activity is idle and this is a true activation. + */ + __bfq_activate_entity(entity, non_blocking_wait_rq); +} + + +/** + * bfq_activate_entity - activate or requeue an entity representing a bfq_queue, + * and activate, requeue or reposition all ancestors + * for which such an update becomes necessary. + * @entity: the entity to activate. + * @non_blocking_wait_rq: true if this entity was waiting for a request + * @requeue: true if this is a requeue, which implies that bfqq is + * being expired; thus ALL its ancestors stop being served and must + * therefore be requeued + */ +static void bfq_activate_requeue_entity(struct bfq_entity *entity, + bool non_blocking_wait_rq, + bool requeue) +{ + struct bfq_sched_data *sd; + + for_each_entity(entity) { + sd = entity->sched_data; + __bfq_activate_requeue_entity(entity, sd, non_blocking_wait_rq); + + if (!bfq_update_next_in_service(sd, entity) && !requeue) + break; + } +} + +/** + * __bfq_deactivate_entity - deactivate an entity from its service tree. + * @entity: the entity to deactivate. + * @ins_into_idle_tree: if false, the entity will not be put into the + * idle tree. + * + * Deactivates an entity, independently from its previous state. Must + * be invoked only if entity is on a service tree. Extracts the entity + * from that tree, and if necessary and allowed, puts it on the idle + * tree. + */ +bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree) +{ + struct bfq_sched_data *sd = entity->sched_data; + struct bfq_service_tree *st = bfq_entity_service_tree(entity); + int is_in_service = entity == sd->in_service_entity; + + if (!entity->on_st) /* entity never activated, or already inactive */ + return false; + + if (is_in_service) + bfq_calc_finish(entity, entity->service); + + if (entity->tree == &st->active) + bfq_active_extract(st, entity); + else if (!is_in_service && entity->tree == &st->idle) + bfq_idle_extract(st, entity); + + if (!ins_into_idle_tree || !bfq_gt(entity->finish, st->vtime)) + bfq_forget_entity(st, entity, is_in_service); + else + bfq_idle_insert(st, entity); + + return true; +} + +/** + * bfq_deactivate_entity - deactivate an entity representing a bfq_queue. + * @entity: the entity to deactivate. + * @ins_into_idle_tree: true if the entity can be put on the idle tree + */ +static void bfq_deactivate_entity(struct bfq_entity *entity, + bool ins_into_idle_tree, + bool expiration) +{ + struct bfq_sched_data *sd; + struct bfq_entity *parent = NULL; + + for_each_entity_safe(entity, parent) { + sd = entity->sched_data; + + if (!__bfq_deactivate_entity(entity, ins_into_idle_tree)) { + /* + * entity is not in any tree any more, so + * this deactivation is a no-op, and there is + * nothing to change for upper-level entities + * (in case of expiration, this can never + * happen). + */ + return; + } + + if (sd->next_in_service == entity) + /* + * entity was the next_in_service entity, + * then, since entity has just been + * deactivated, a new one must be found. + */ + bfq_update_next_in_service(sd, NULL); + + if (sd->next_in_service) + /* + * The parent entity is still backlogged, + * because next_in_service is not NULL. So, no + * further upwards deactivation must be + * performed. Yet, next_in_service has + * changed. Then the schedule does need to be + * updated upwards. + */ + break; + + /* + * If we get here, then the parent is no more + * backlogged and we need to propagate the + * deactivation upwards. Thus let the loop go on. + */ + + /* + * Also let parent be queued into the idle tree on + * deactivation, to preserve service guarantees, and + * assuming that who invoked this function does not + * need parent entities too to be removed completely. + */ + ins_into_idle_tree = true; + } + + /* + * If the deactivation loop is fully executed, then there are + * no more entities to touch and next loop is not executed at + * all. Otherwise, requeue remaining entities if they are + * about to stop receiving service, or reposition them if this + * is not the case. + */ + entity = parent; + for_each_entity(entity) { + /* + * Invoke __bfq_requeue_entity on entity, even if + * already active, to requeue/reposition it in the + * active tree (because sd->next_in_service has + * changed) + */ + __bfq_requeue_entity(entity); + + sd = entity->sched_data; + if (!bfq_update_next_in_service(sd, entity) && + !expiration) + /* + * next_in_service unchanged or not causing + * any change in entity->parent->sd, and no + * requeueing needed for expiration: stop + * here. + */ + break; + } +} + +/** + * bfq_calc_vtime_jump - compute the value to which the vtime should jump, + * if needed, to have at least one entity eligible. + * @st: the service tree to act upon. + * + * Assumes that st is not empty. + */ +static u64 bfq_calc_vtime_jump(struct bfq_service_tree *st) +{ + struct bfq_entity *root_entity = bfq_root_active_entity(&st->active); + + if (bfq_gt(root_entity->min_start, st->vtime)) + return root_entity->min_start; + + return st->vtime; +} + +static void bfq_update_vtime(struct bfq_service_tree *st, u64 new_value) +{ + if (new_value > st->vtime) { + st->vtime = new_value; + bfq_forget_idle(st); + } +} + +/** + * bfq_first_active_entity - find the eligible entity with + * the smallest finish time + * @st: the service tree to select from. + * @vtime: the system virtual to use as a reference for eligibility + * + * This function searches the first schedulable entity, starting from the + * root of the tree and going on the left every time on this side there is + * a subtree with at least one eligible (start >= vtime) entity. The path on + * the right is followed only if a) the left subtree contains no eligible + * entities and b) no eligible entity has been found yet. + */ +static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st, + u64 vtime) +{ + struct bfq_entity *entry, *first = NULL; + struct rb_node *node = st->active.rb_node; + + while (node) { + entry = rb_entry(node, struct bfq_entity, rb_node); +left: + if (!bfq_gt(entry->start, vtime)) + first = entry; + + if (node->rb_left) { + entry = rb_entry(node->rb_left, + struct bfq_entity, rb_node); + if (!bfq_gt(entry->min_start, vtime)) { + node = node->rb_left; + goto left; + } + } + if (first) + break; + node = node->rb_right; + } + + return first; +} + +/** + * __bfq_lookup_next_entity - return the first eligible entity in @st. + * @st: the service tree. + * + * If there is no in-service entity for the sched_data st belongs to, + * then return the entity that will be set in service if: + * 1) the parent entity this st belongs to is set in service; + * 2) no entity belonging to such parent entity undergoes a state change + * that would influence the timestamps of the entity (e.g., becomes idle, + * becomes backlogged, changes its budget, ...). + * + * In this first case, update the virtual time in @st too (see the + * comments on this update inside the function). + * + * In constrast, if there is an in-service entity, then return the + * entity that would be set in service if not only the above + * conditions, but also the next one held true: the currently + * in-service entity, on expiration, + * 1) gets a finish time equal to the current one, or + * 2) is not eligible any more, or + * 3) is idle. + */ +static struct bfq_entity * +__bfq_lookup_next_entity(struct bfq_service_tree *st, bool in_service) +{ + struct bfq_entity *entity; + u64 new_vtime; + + if (RB_EMPTY_ROOT(&st->active)) + return NULL; + + /* + * Get the value of the system virtual time for which at + * least one entity is eligible. + */ + new_vtime = bfq_calc_vtime_jump(st); + + /* + * If there is no in-service entity for the sched_data this + * active tree belongs to, then push the system virtual time + * up to the value that guarantees that at least one entity is + * eligible. If, instead, there is an in-service entity, then + * do not make any such update, because there is already an + * eligible entity, namely the in-service one (even if the + * entity is not on st, because it was extracted when set in + * service). + */ + if (!in_service) + bfq_update_vtime(st, new_vtime); + + entity = bfq_first_active_entity(st, new_vtime); + + return entity; +} + +/** + * bfq_lookup_next_entity - return the first eligible entity in @sd. + * @sd: the sched_data. + * + * This function is invoked when there has been a change in the trees + * for sd, and we need know what is the new next entity after this + * change. + */ +static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd) +{ + struct bfq_service_tree *st = sd->service_tree; + struct bfq_service_tree *idle_class_st = st + (BFQ_IOPRIO_CLASSES - 1); + struct bfq_entity *entity = NULL; + int class_idx = 0; + + /* + * Choose from idle class, if needed to guarantee a minimum + * bandwidth to this class (and if there is some active entity + * in idle class). This should also mitigate + * priority-inversion problems in case a low priority task is + * holding file system resources. + */ + if (time_is_before_jiffies(sd->bfq_class_idle_last_service + + BFQ_CL_IDLE_TIMEOUT)) { + if (!RB_EMPTY_ROOT(&idle_class_st->active)) + class_idx = BFQ_IOPRIO_CLASSES - 1; + /* About to be served if backlogged, or not yet backlogged */ + sd->bfq_class_idle_last_service = jiffies; + } + + /* + * Find the next entity to serve for the highest-priority + * class, unless the idle class needs to be served. + */ + for (; class_idx < BFQ_IOPRIO_CLASSES; class_idx++) { + entity = __bfq_lookup_next_entity(st + class_idx, + sd->in_service_entity); + + if (entity) + break; + } + + if (!entity) + return NULL; + + return entity; +} + +bool next_queue_may_preempt(struct bfq_data *bfqd) +{ + struct bfq_sched_data *sd = &bfqd->root_group->sched_data; + + return sd->next_in_service != sd->in_service_entity; +} + +/* + * Get next queue for service. + */ +struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd) +{ + struct bfq_entity *entity = NULL; + struct bfq_sched_data *sd; + struct bfq_queue *bfqq; + + if (bfqd->busy_queues == 0) + return NULL; + + /* + * Traverse the path from the root to the leaf entity to + * serve. Set in service all the entities visited along the + * way. + */ + sd = &bfqd->root_group->sched_data; + for (; sd ; sd = entity->my_sched_data) { + /* + * WARNING. We are about to set the in-service entity + * to sd->next_in_service, i.e., to the (cached) value + * returned by bfq_lookup_next_entity(sd) the last + * time it was invoked, i.e., the last time when the + * service order in sd changed as a consequence of the + * activation or deactivation of an entity. In this + * respect, if we execute bfq_lookup_next_entity(sd) + * in this very moment, it may, although with low + * probability, yield a different entity than that + * pointed to by sd->next_in_service. This rare event + * happens in case there was no CLASS_IDLE entity to + * serve for sd when bfq_lookup_next_entity(sd) was + * invoked for the last time, while there is now one + * such entity. + * + * If the above event happens, then the scheduling of + * such entity in CLASS_IDLE is postponed until the + * service of the sd->next_in_service entity + * finishes. In fact, when the latter is expired, + * bfq_lookup_next_entity(sd) gets called again, + * exactly to update sd->next_in_service. + */ + + /* Make next_in_service entity become in_service_entity */ + entity = sd->next_in_service; + sd->in_service_entity = entity; + + /* + * Reset the accumulator of the amount of service that + * the entity is about to receive. + */ + entity->service = 0; + + /* + * If entity is no longer a candidate for next + * service, then we extract it from its active tree, + * for the following reason. To further boost the + * throughput in some special case, BFQ needs to know + * which is the next candidate entity to serve, while + * there is already an entity in service. In this + * respect, to make it easy to compute/update the next + * candidate entity to serve after the current + * candidate has been set in service, there is a case + * where it is necessary to extract the current + * candidate from its service tree. Such a case is + * when the entity just set in service cannot be also + * a candidate for next service. Details about when + * this conditions holds are reported in the comments + * on the function bfq_no_longer_next_in_service() + * invoked below. + */ + if (bfq_no_longer_next_in_service(entity)) + bfq_active_extract(bfq_entity_service_tree(entity), + entity); + + /* + * For the same reason why we may have just extracted + * entity from its active tree, we may need to update + * next_in_service for the sched_data of entity too, + * regardless of whether entity has been extracted. + * In fact, even if entity has not been extracted, a + * descendant entity may get extracted. Such an event + * would cause a change in next_in_service for the + * level of the descendant entity, and thus possibly + * back to upper levels. + * + * We cannot perform the resulting needed update + * before the end of this loop, because, to know which + * is the correct next-to-serve candidate entity for + * each level, we need first to find the leaf entity + * to set in service. In fact, only after we know + * which is the next-to-serve leaf entity, we can + * discover whether the parent entity of the leaf + * entity becomes the next-to-serve, and so on. + */ + + } + + bfqq = bfq_entity_to_bfqq(entity); + + /* + * We can finally update all next-to-serve entities along the + * path from the leaf entity just set in service to the root. + */ + for_each_entity(entity) { + struct bfq_sched_data *sd = entity->sched_data; + + if (!bfq_update_next_in_service(sd, NULL)) + break; + } + + return bfqq; +} + +void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd) +{ + struct bfq_queue *in_serv_bfqq = bfqd->in_service_queue; + struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity; + struct bfq_entity *entity = in_serv_entity; + + bfq_clear_bfqq_wait_request(in_serv_bfqq); + hrtimer_try_to_cancel(&bfqd->idle_slice_timer); + bfqd->in_service_queue = NULL; + + /* + * When this function is called, all in-service entities have + * been properly deactivated or requeued, so we can safely + * execute the final step: reset in_service_entity along the + * path from entity to the root. + */ + for_each_entity(entity) + entity->sched_data->in_service_entity = NULL; + + /* + * in_serv_entity is no longer in service, so, if it is in no + * service tree either, then release the service reference to + * the queue it represents (taken with bfq_get_entity). + */ + if (!in_serv_entity->on_st) + bfq_put_queue(in_serv_bfqq); +} + +void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq, + bool ins_into_idle_tree, bool expiration) +{ + struct bfq_entity *entity = &bfqq->entity; + + bfq_deactivate_entity(entity, ins_into_idle_tree, expiration); +} + +void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + + bfq_activate_requeue_entity(entity, bfq_bfqq_non_blocking_wait_rq(bfqq), + false); + bfq_clear_bfqq_non_blocking_wait_rq(bfqq); +} + +void bfq_requeue_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + struct bfq_entity *entity = &bfqq->entity; + + bfq_activate_requeue_entity(entity, false, + bfqq == bfqd->in_service_queue); +} + +/* + * Called when the bfqq no longer has requests pending, remove it from + * the service tree. As a special case, it can be invoked during an + * expiration. + */ +void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq, + bool expiration) +{ + bfq_log_bfqq(bfqd, bfqq, "del from busy"); + + bfq_clear_bfqq_busy(bfqq); + + bfqd->busy_queues--; + + if (!bfqq->dispatched) + bfq_weights_tree_remove(bfqd, &bfqq->entity, + &bfqd->queue_weights_tree); + + if (bfqq->wr_coeff > 1) + bfqd->wr_busy_queues--; + + bfqg_stats_update_dequeue(bfqq_group(bfqq)); + + bfq_deactivate_bfqq(bfqd, bfqq, true, expiration); +} + +/* + * Called when an inactive queue receives a new request. + */ +void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq) +{ + bfq_log_bfqq(bfqd, bfqq, "add to busy"); + + bfq_activate_bfqq(bfqd, bfqq); + + bfq_mark_bfqq_busy(bfqq); + bfqd->busy_queues++; + + if (!bfqq->dispatched) + if (bfqq->wr_coeff == 1) + bfq_weights_tree_add(bfqd, &bfqq->entity, + &bfqd->queue_weights_tree); + + if (bfqq->wr_coeff > 1) + bfqd->wr_busy_queues++; +} diff --git a/block/bio.c b/block/bio.c index e75878f8b14a..f4d207180266 100644 --- a/block/bio.c +++ b/block/bio.c @@ -30,6 +30,7 @@ #include <linux/cgroup.h> #include <trace/events/block.h> +#include "blk.h" /* * Test patch to inline a certain number of bi_io_vec's inside the bio @@ -427,7 +428,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs) * RETURNS: * Pointer to new bio on success, NULL on failure. */ -struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs) +struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs, + struct bio_set *bs) { gfp_t saved_gfp = gfp_mask; unsigned front_pad; @@ -1824,6 +1826,11 @@ static inline bool bio_remaining_done(struct bio *bio) * bio_endio() will end I/O on the whole bio. bio_endio() is the preferred * way to end I/O on a bio. No one should call bi_end_io() directly on a * bio unless they own it and thus know that it has an end_io function. + * + * bio_endio() can be called several times on a bio that has been chained + * using bio_chain(). The ->bi_end_io() function will only be called the + * last time. At this point the BLK_TA_COMPLETE tracing event will be + * generated if BIO_TRACE_COMPLETION is set. **/ void bio_endio(struct bio *bio) { @@ -1844,6 +1851,13 @@ again: goto again; } + if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) { + trace_block_bio_complete(bdev_get_queue(bio->bi_bdev), + bio, bio->bi_error); + bio_clear_flag(bio, BIO_TRACE_COMPLETION); + } + + blk_throtl_bio_endio(bio); if (bio->bi_end_io) bio->bi_end_io(bio); } @@ -1882,6 +1896,9 @@ struct bio *bio_split(struct bio *bio, int sectors, bio_advance(bio, split->bi_iter.bi_size); + if (bio_flagged(bio, BIO_TRACE_COMPLETION)) + bio_set_flag(bio, BIO_TRACE_COMPLETION); + return split; } EXPORT_SYMBOL(bio_split); diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index bbe7ee00bd3d..7c2947128f58 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -772,6 +772,27 @@ struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkcg_gq *blkg, } EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum); +/* Performs queue bypass and policy enabled checks then looks up blkg. */ +static struct blkcg_gq *blkg_lookup_check(struct blkcg *blkcg, + const struct blkcg_policy *pol, + struct request_queue *q) +{ + WARN_ON_ONCE(!rcu_read_lock_held()); + lockdep_assert_held(q->queue_lock); + + if (!blkcg_policy_enabled(q, pol)) + return ERR_PTR(-EOPNOTSUPP); + + /* + * This could be the first entry point of blkcg implementation and + * we shouldn't allow anything to go through for a bypassing queue. + */ + if (unlikely(blk_queue_bypass(q))) + return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY); + + return __blkg_lookup(blkcg, q, true /* update_hint */); +} + /** * blkg_conf_prep - parse and prepare for per-blkg config update * @blkcg: target block cgroup @@ -789,6 +810,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, __acquires(rcu) __acquires(disk->queue->queue_lock) { struct gendisk *disk; + struct request_queue *q; struct blkcg_gq *blkg; struct module *owner; unsigned int major, minor; @@ -807,44 +829,95 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, if (!disk) return -ENODEV; if (part) { - owner = disk->fops->owner; - put_disk(disk); - module_put(owner); - return -ENODEV; + ret = -ENODEV; + goto fail; } - rcu_read_lock(); - spin_lock_irq(disk->queue->queue_lock); + q = disk->queue; - if (blkcg_policy_enabled(disk->queue, pol)) - blkg = blkg_lookup_create(blkcg, disk->queue); - else - blkg = ERR_PTR(-EOPNOTSUPP); + rcu_read_lock(); + spin_lock_irq(q->queue_lock); + blkg = blkg_lookup_check(blkcg, pol, q); if (IS_ERR(blkg)) { ret = PTR_ERR(blkg); + goto fail_unlock; + } + + if (blkg) + goto success; + + /* + * Create blkgs walking down from blkcg_root to @blkcg, so that all + * non-root blkgs have access to their parents. + */ + while (true) { + struct blkcg *pos = blkcg; + struct blkcg *parent; + struct blkcg_gq *new_blkg; + + parent = blkcg_parent(blkcg); + while (parent && !__blkg_lookup(parent, q, false)) { + pos = parent; + parent = blkcg_parent(parent); + } + + /* Drop locks to do new blkg allocation with GFP_KERNEL. */ + spin_unlock_irq(q->queue_lock); rcu_read_unlock(); - spin_unlock_irq(disk->queue->queue_lock); - owner = disk->fops->owner; - put_disk(disk); - module_put(owner); - /* - * If queue was bypassing, we should retry. Do so after a - * short msleep(). It isn't strictly necessary but queue - * can be bypassing for some time and it's always nice to - * avoid busy looping. - */ - if (ret == -EBUSY) { - msleep(10); - ret = restart_syscall(); + + new_blkg = blkg_alloc(pos, q, GFP_KERNEL); + if (unlikely(!new_blkg)) { + ret = -ENOMEM; + goto fail; } - return ret; - } + rcu_read_lock(); + spin_lock_irq(q->queue_lock); + + blkg = blkg_lookup_check(pos, pol, q); + if (IS_ERR(blkg)) { + ret = PTR_ERR(blkg); + goto fail_unlock; + } + + if (blkg) { + blkg_free(new_blkg); + } else { + blkg = blkg_create(pos, q, new_blkg); + if (unlikely(IS_ERR(blkg))) { + ret = PTR_ERR(blkg); + goto fail_unlock; + } + } + + if (pos == blkcg) + goto success; + } +success: ctx->disk = disk; ctx->blkg = blkg; ctx->body = body; return 0; + +fail_unlock: + spin_unlock_irq(q->queue_lock); + rcu_read_unlock(); +fail: + owner = disk->fops->owner; + put_disk(disk); + module_put(owner); + /* + * If queue was bypassing, we should retry. Do so after a + * short msleep(). It isn't strictly necessary but queue + * can be bypassing for some time and it's always nice to + * avoid busy looping. + */ + if (ret == -EBUSY) { + msleep(10); + ret = restart_syscall(); + } + return ret; } EXPORT_SYMBOL_GPL(blkg_conf_prep); diff --git a/block/blk-core.c b/block/blk-core.c index d772c221cc17..24886b69690f 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -268,10 +268,8 @@ void blk_sync_queue(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i; - queue_for_each_hw_ctx(q, hctx, i) { - cancel_work_sync(&hctx->run_work); - cancel_delayed_work_sync(&hctx->delay_work); - } + queue_for_each_hw_ctx(q, hctx, i) + cancel_delayed_work_sync(&hctx->run_work); } else { cancel_delayed_work_sync(&q->delay_work); } @@ -500,6 +498,13 @@ void blk_set_queue_dying(struct request_queue *q) queue_flag_set(QUEUE_FLAG_DYING, q); spin_unlock_irq(q->queue_lock); + /* + * When queue DYING flag is set, we need to block new req + * entering queue, so we call blk_freeze_queue_start() to + * prevent I/O from crossing blk_queue_enter(). + */ + blk_freeze_queue_start(q); + if (q->mq_ops) blk_mq_wake_waiters(q); else { @@ -556,9 +561,13 @@ void blk_cleanup_queue(struct request_queue *q) * prevent that q->request_fn() gets invoked after draining finished. */ blk_freeze_queue(q); - spin_lock_irq(lock); - if (!q->mq_ops) + if (!q->mq_ops) { + spin_lock_irq(lock); __blk_drain_queue(q, true); + } else { + blk_mq_debugfs_unregister_mq(q); + spin_lock_irq(lock); + } queue_flag_set(QUEUE_FLAG_DEAD, q); spin_unlock_irq(lock); @@ -669,6 +678,15 @@ int blk_queue_enter(struct request_queue *q, bool nowait) if (nowait) return -EBUSY; + /* + * read pair of barrier in blk_freeze_queue_start(), + * we need to order reading __PERCPU_REF_DEAD flag of + * .q_usage_counter and reading .mq_freeze_depth or + * queue dying flag, otherwise the following wait may + * never return if the two reads are reordered. + */ + smp_rmb(); + ret = wait_event_interruptible(q->mq_freeze_wq, !atomic_read(&q->mq_freeze_depth) || blk_queue_dying(q)); @@ -720,6 +738,10 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) if (!q->backing_dev_info) goto fail_split; + q->stats = blk_alloc_queue_stats(); + if (!q->stats) + goto fail_stats; + q->backing_dev_info->ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_SIZE; q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK; @@ -776,6 +798,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) fail_ref: percpu_ref_exit(&q->q_usage_counter); fail_bdi: + blk_free_queue_stats(q->stats); +fail_stats: bdi_put(q->backing_dev_info); fail_split: bioset_free(q->bio_split); @@ -889,7 +913,6 @@ out_exit_flush_rq: q->exit_rq_fn(q, q->fq->flush_rq); out_free_flush_queue: blk_free_flush_queue(q->fq); - wbt_exit(q); return -ENOMEM; } EXPORT_SYMBOL(blk_init_allocated_queue); @@ -1128,7 +1151,6 @@ static struct request *__get_request(struct request_list *rl, unsigned int op, blk_rq_init(q, rq); blk_rq_set_rl(rq, rl); - blk_rq_set_prio(rq, ioc); rq->cmd_flags = op; rq->rq_flags = rq_flags; @@ -1608,17 +1630,23 @@ out: return ret; } -void init_request_from_bio(struct request *req, struct bio *bio) +void blk_init_request_from_bio(struct request *req, struct bio *bio) { + struct io_context *ioc = rq_ioc(bio); + if (bio->bi_opf & REQ_RAHEAD) req->cmd_flags |= REQ_FAILFAST_MASK; - req->errors = 0; req->__sector = bio->bi_iter.bi_sector; if (ioprio_valid(bio_prio(bio))) req->ioprio = bio_prio(bio); + else if (ioc) + req->ioprio = ioc->ioprio; + else + req->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); blk_rq_bio_prep(req->q, req, bio); } +EXPORT_SYMBOL_GPL(blk_init_request_from_bio); static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) { @@ -1709,7 +1737,7 @@ get_rq: * We don't worry about that case for efficiency. It won't happen * often, and the elevators are able to handle it. */ - init_request_from_bio(req, bio); + blk_init_request_from_bio(req, bio); if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags)) req->cpu = raw_smp_processor_id(); @@ -1936,7 +1964,13 @@ generic_make_request_checks(struct bio *bio) if (!blkcg_bio_issue_check(q, bio)) return false; - trace_block_bio_queue(q, bio); + if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) { + trace_block_bio_queue(q, bio); + /* Now that enqueuing has been traced, we need to trace + * completion as well. + */ + bio_set_flag(bio, BIO_TRACE_COMPLETION); + } return true; not_supported: @@ -2478,7 +2512,7 @@ void blk_start_request(struct request *req) blk_dequeue_request(req); if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) { - blk_stat_set_issue_time(&req->issue_stat); + blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req)); req->rq_flags |= RQF_STATS; wbt_issue(req->q->rq_wb, &req->issue_stat); } @@ -2540,22 +2574,11 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes) { int total_bytes; - trace_block_rq_complete(req->q, req, nr_bytes); + trace_block_rq_complete(req, error, nr_bytes); if (!req->bio) return false; - /* - * For fs requests, rq is just carrier of independent bio's - * and each partial completion should be handled separately. - * Reset per-request error on each partial completion. - * - * TODO: tj: This is too subtle. It would be better to let - * low level drivers do what they see fit. - */ - if (!blk_rq_is_passthrough(req)) - req->errors = 0; - if (error && !blk_rq_is_passthrough(req) && !(req->rq_flags & RQF_QUIET)) { char *error_type; @@ -2601,6 +2624,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes) if (bio_bytes == bio->bi_iter.bi_size) req->bio = bio->bi_next; + /* Completion has already been traced */ + bio_clear_flag(bio, BIO_TRACE_COMPLETION); req_bio_endio(req, bio, bio_bytes, error); total_bytes += bio_bytes; @@ -2699,7 +2724,7 @@ void blk_finish_request(struct request *req, int error) struct request_queue *q = req->q; if (req->rq_flags & RQF_STATS) - blk_stat_add(&q->rq_stats[rq_data_dir(req)], req); + blk_stat_add(req); if (req->rq_flags & RQF_QUEUED) blk_queue_end_tag(q, req); @@ -2776,7 +2801,7 @@ static bool blk_end_bidi_request(struct request *rq, int error, * %false - we are done with this request * %true - still buffers pending for this request **/ -bool __blk_end_bidi_request(struct request *rq, int error, +static bool __blk_end_bidi_request(struct request *rq, int error, unsigned int nr_bytes, unsigned int bidi_bytes) { if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes)) @@ -2829,43 +2854,6 @@ void blk_end_request_all(struct request *rq, int error) EXPORT_SYMBOL(blk_end_request_all); /** - * blk_end_request_cur - Helper function to finish the current request chunk. - * @rq: the request to finish the current chunk for - * @error: %0 for success, < %0 for error - * - * Description: - * Complete the current consecutively mapped chunk from @rq. - * - * Return: - * %false - we are done with this request - * %true - still buffers pending for this request - */ -bool blk_end_request_cur(struct request *rq, int error) -{ - return blk_end_request(rq, error, blk_rq_cur_bytes(rq)); -} -EXPORT_SYMBOL(blk_end_request_cur); - -/** - * blk_end_request_err - Finish a request till the next failure boundary. - * @rq: the request to finish till the next failure boundary for - * @error: must be negative errno - * - * Description: - * Complete @rq till the next failure boundary. - * - * Return: - * %false - we are done with this request - * %true - still buffers pending for this request - */ -bool blk_end_request_err(struct request *rq, int error) -{ - WARN_ON(error >= 0); - return blk_end_request(rq, error, blk_rq_err_bytes(rq)); -} -EXPORT_SYMBOL_GPL(blk_end_request_err); - -/** * __blk_end_request - Helper function for drivers to complete the request. * @rq: the request being processed * @error: %0 for success, < %0 for error @@ -2924,26 +2912,6 @@ bool __blk_end_request_cur(struct request *rq, int error) } EXPORT_SYMBOL(__blk_end_request_cur); -/** - * __blk_end_request_err - Finish a request till the next failure boundary. - * @rq: the request to finish till the next failure boundary for - * @error: must be negative errno - * - * Description: - * Complete @rq till the next failure boundary. Must be called - * with queue lock held. - * - * Return: - * %false - we are done with this request - * %true - still buffers pending for this request - */ -bool __blk_end_request_err(struct request *rq, int error) -{ - WARN_ON(error >= 0); - return __blk_end_request(rq, error, blk_rq_err_bytes(rq)); -} -EXPORT_SYMBOL_GPL(__blk_end_request_err); - void blk_rq_bio_prep(struct request_queue *q, struct request *rq, struct bio *bio) { @@ -3106,6 +3074,13 @@ int kblockd_schedule_work_on(int cpu, struct work_struct *work) } EXPORT_SYMBOL(kblockd_schedule_work_on); +int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, + unsigned long delay) +{ + return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay); +} +EXPORT_SYMBOL(kblockd_mod_delayed_work_on); + int kblockd_schedule_delayed_work(struct delayed_work *dwork, unsigned long delay) { diff --git a/block/blk-exec.c b/block/blk-exec.c index 8cd0e9bc8dc8..a9451e3b8587 100644 --- a/block/blk-exec.c +++ b/block/blk-exec.c @@ -69,8 +69,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct gendisk *bd_disk, if (unlikely(blk_queue_dying(q))) { rq->rq_flags |= RQF_QUIET; - rq->errors = -ENXIO; - __blk_end_request_all(rq, rq->errors); + __blk_end_request_all(rq, -ENXIO); spin_unlock_irq(q->queue_lock); return; } @@ -92,11 +91,10 @@ EXPORT_SYMBOL_GPL(blk_execute_rq_nowait); * Insert a fully prepared request at the back of the I/O scheduler queue * for execution and wait for completion. */ -int blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk, +void blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk, struct request *rq, int at_head) { DECLARE_COMPLETION_ONSTACK(wait); - int err = 0; unsigned long hang_check; rq->end_io_data = &wait; @@ -108,10 +106,5 @@ int blk_execute_rq(struct request_queue *q, struct gendisk *bd_disk, while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2))); else wait_for_completion_io(&wait); - - if (rq->errors) - err = -EIO; - - return err; } EXPORT_SYMBOL(blk_execute_rq); diff --git a/block/blk-flush.c b/block/blk-flush.c index 0d5a9c1da1fc..c4e0880b54bb 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -447,7 +447,7 @@ void blk_insert_flush(struct request *rq) if (q->mq_ops) blk_mq_end_request(rq, 0); else - __blk_end_bidi_request(rq, 0, 0, 0); + __blk_end_request(rq, 0, 0); return; } @@ -497,8 +497,7 @@ void blk_insert_flush(struct request *rq) * Description: * Issue a flush for the block device in question. Caller can supply * room for storing the error offset in case of a flush error, if they - * wish to. If WAIT flag is not passed then caller may check only what - * request was pushed in some internal queue for later handling. + * wish to. */ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask, sector_t *error_sector) diff --git a/block/blk-integrity.c b/block/blk-integrity.c index 9f0ff5ba4f84..0f891a9aff4d 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -389,7 +389,7 @@ static int blk_integrity_nop_fn(struct blk_integrity_iter *iter) return 0; } -static struct blk_integrity_profile nop_profile = { +static const struct blk_integrity_profile nop_profile = { .name = "nop", .generate_fn = blk_integrity_nop_fn, .verify_fn = blk_integrity_nop_fn, @@ -412,12 +412,13 @@ void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template bi->flags = BLK_INTEGRITY_VERIFY | BLK_INTEGRITY_GENERATE | template->flags; - bi->interval_exp = ilog2(queue_logical_block_size(disk->queue)); + bi->interval_exp = template->interval_exp ? : + ilog2(queue_logical_block_size(disk->queue)); bi->profile = template->profile ? template->profile : &nop_profile; bi->tuple_size = template->tuple_size; bi->tag_size = template->tag_size; - blk_integrity_revalidate(disk); + disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; } EXPORT_SYMBOL(blk_integrity_register); @@ -430,26 +431,11 @@ EXPORT_SYMBOL(blk_integrity_register); */ void blk_integrity_unregister(struct gendisk *disk) { - blk_integrity_revalidate(disk); + disk->queue->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES; memset(&disk->queue->integrity, 0, sizeof(struct blk_integrity)); } EXPORT_SYMBOL(blk_integrity_unregister); -void blk_integrity_revalidate(struct gendisk *disk) -{ - struct blk_integrity *bi = &disk->queue->integrity; - - if (!(disk->flags & GENHD_FL_UP)) - return; - - if (bi->profile) - disk->queue->backing_dev_info->capabilities |= - BDI_CAP_STABLE_WRITES; - else - disk->queue->backing_dev_info->capabilities &= - ~BDI_CAP_STABLE_WRITES; -} - void blk_integrity_add(struct gendisk *disk) { if (kobject_init_and_add(&disk->integrity_kobj, &integrity_ktype, diff --git a/block/blk-lib.c b/block/blk-lib.c index ed1e78e24db0..e8caecd71688 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -37,17 +37,12 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, return -ENXIO; if (flags & BLKDEV_DISCARD_SECURE) { - if (flags & BLKDEV_DISCARD_ZERO) - return -EOPNOTSUPP; if (!blk_queue_secure_erase(q)) return -EOPNOTSUPP; op = REQ_OP_SECURE_ERASE; } else { if (!blk_queue_discard(q)) return -EOPNOTSUPP; - if ((flags & BLKDEV_DISCARD_ZERO) && - !q->limits.discard_zeroes_data) - return -EOPNOTSUPP; op = REQ_OP_DISCARD; } @@ -109,7 +104,7 @@ EXPORT_SYMBOL(__blkdev_issue_discard); * @sector: start sector * @nr_sects: number of sectors to discard * @gfp_mask: memory allocation flags (for bio_alloc) - * @flags: BLKDEV_IFL_* flags to control behaviour + * @flags: BLKDEV_DISCARD_* flags to control behaviour * * Description: * Issue a discard request for the sectors in question. @@ -126,7 +121,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector, &bio); if (!ret && bio) { ret = submit_bio_wait(bio); - if (ret == -EOPNOTSUPP && !(flags & BLKDEV_DISCARD_ZERO)) + if (ret == -EOPNOTSUPP) ret = 0; bio_put(bio); } @@ -226,20 +221,9 @@ int blkdev_issue_write_same(struct block_device *bdev, sector_t sector, } EXPORT_SYMBOL(blkdev_issue_write_same); -/** - * __blkdev_issue_write_zeroes - generate number of bios with WRITE ZEROES - * @bdev: blockdev to issue - * @sector: start sector - * @nr_sects: number of sectors to write - * @gfp_mask: memory allocation flags (for bio_alloc) - * @biop: pointer to anchor bio - * - * Description: - * Generate and issue number of bios(REQ_OP_WRITE_ZEROES) with zerofiled pages. - */ static int __blkdev_issue_write_zeroes(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, - struct bio **biop) + struct bio **biop, unsigned flags) { struct bio *bio = *biop; unsigned int max_write_zeroes_sectors; @@ -258,7 +242,9 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, bio = next_bio(bio, 0, gfp_mask); bio->bi_iter.bi_sector = sector; bio->bi_bdev = bdev; - bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0); + bio->bi_opf = REQ_OP_WRITE_ZEROES; + if (flags & BLKDEV_ZERO_NOUNMAP) + bio->bi_opf |= REQ_NOUNMAP; if (nr_sects > max_write_zeroes_sectors) { bio->bi_iter.bi_size = max_write_zeroes_sectors << 9; @@ -282,14 +268,27 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev, * @nr_sects: number of sectors to write * @gfp_mask: memory allocation flags (for bio_alloc) * @biop: pointer to anchor bio - * @discard: discard flag + * @flags: controls detailed behavior * * Description: - * Generate and issue number of bios with zerofiled pages. + * Zero-fill a block range, either using hardware offload or by explicitly + * writing zeroes to the device. + * + * Note that this function may fail with -EOPNOTSUPP if the driver signals + * zeroing offload support, but the device fails to process the command (for + * some devices there is no non-destructive way to verify whether this + * operation is actually supported). In this case the caller should call + * retry the call to blkdev_issue_zeroout() and the fallback path will be used. + * + * If a device is using logical block provisioning, the underlying space will + * not be released if %flags contains BLKDEV_ZERO_NOUNMAP. + * + * If %flags contains BLKDEV_ZERO_NOFALLBACK, the function will return + * -EOPNOTSUPP if no explicit hardware offload for zeroing is provided. */ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, - bool discard) + unsigned flags) { int ret; int bi_size = 0; @@ -302,8 +301,8 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, return -EINVAL; ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask, - biop); - if (ret == 0 || (ret && ret != -EOPNOTSUPP)) + biop, flags); + if (ret != -EOPNOTSUPP || (flags & BLKDEV_ZERO_NOFALLBACK)) goto out; ret = 0; @@ -337,40 +336,23 @@ EXPORT_SYMBOL(__blkdev_issue_zeroout); * @sector: start sector * @nr_sects: number of sectors to write * @gfp_mask: memory allocation flags (for bio_alloc) - * @discard: whether to discard the block range + * @flags: controls detailed behavior * * Description: - * Zero-fill a block range. If the discard flag is set and the block - * device guarantees that subsequent READ operations to the block range - * in question will return zeroes, the blocks will be discarded. Should - * the discard request fail, if the discard flag is not set, or if - * discard_zeroes_data is not supported, this function will resort to - * zeroing the blocks manually, thus provisioning (allocating, - * anchoring) them. If the block device supports WRITE ZEROES or WRITE SAME - * command(s), blkdev_issue_zeroout() will use it to optimize the process of - * clearing the block range. Otherwise the zeroing will be performed - * using regular WRITE calls. + * Zero-fill a block range, either using hardware offload or by explicitly + * writing zeroes to the device. See __blkdev_issue_zeroout() for the + * valid values for %flags. */ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, - sector_t nr_sects, gfp_t gfp_mask, bool discard) + sector_t nr_sects, gfp_t gfp_mask, unsigned flags) { int ret; struct bio *bio = NULL; struct blk_plug plug; - if (discard) { - if (!blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, - BLKDEV_DISCARD_ZERO)) - return 0; - } - - if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask, - ZERO_PAGE(0))) - return 0; - blk_start_plug(&plug); ret = __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask, - &bio, discard); + &bio, flags); if (ret == 0 && bio) { ret = submit_bio_wait(bio); bio_put(bio); diff --git a/block/blk-merge.c b/block/blk-merge.c index 2afa262425d1..3990ae406341 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -54,6 +54,20 @@ static struct bio *blk_bio_discard_split(struct request_queue *q, return bio_split(bio, split_sectors, GFP_NOIO, bs); } +static struct bio *blk_bio_write_zeroes_split(struct request_queue *q, + struct bio *bio, struct bio_set *bs, unsigned *nsegs) +{ + *nsegs = 1; + + if (!q->limits.max_write_zeroes_sectors) + return NULL; + + if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors) + return NULL; + + return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs); +} + static struct bio *blk_bio_write_same_split(struct request_queue *q, struct bio *bio, struct bio_set *bs, @@ -200,8 +214,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio, split = blk_bio_discard_split(q, *bio, bs, &nsegs); break; case REQ_OP_WRITE_ZEROES: - split = NULL; - nsegs = (*bio)->bi_phys_segments; + split = blk_bio_write_zeroes_split(q, *bio, bs, &nsegs); break; case REQ_OP_WRITE_SAME: split = blk_bio_write_same_split(q, *bio, bs, &nsegs); diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index f6d917977b33..bcd2a7d4a3a5 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -43,11 +43,157 @@ static int blk_mq_debugfs_seq_open(struct inode *inode, struct file *file, return ret; } +static int blk_flags_show(struct seq_file *m, const unsigned long flags, + const char *const *flag_name, int flag_name_count) +{ + bool sep = false; + int i; + + for (i = 0; i < sizeof(flags) * BITS_PER_BYTE; i++) { + if (!(flags & BIT(i))) + continue; + if (sep) + seq_puts(m, " "); + sep = true; + if (i < flag_name_count && flag_name[i]) + seq_puts(m, flag_name[i]); + else + seq_printf(m, "%d", i); + } + return 0; +} + +static const char *const blk_queue_flag_name[] = { + [QUEUE_FLAG_QUEUED] = "QUEUED", + [QUEUE_FLAG_STOPPED] = "STOPPED", + [QUEUE_FLAG_SYNCFULL] = "SYNCFULL", + [QUEUE_FLAG_ASYNCFULL] = "ASYNCFULL", + [QUEUE_FLAG_DYING] = "DYING", + [QUEUE_FLAG_BYPASS] = "BYPASS", + [QUEUE_FLAG_BIDI] = "BIDI", + [QUEUE_FLAG_NOMERGES] = "NOMERGES", + [QUEUE_FLAG_SAME_COMP] = "SAME_COMP", + [QUEUE_FLAG_FAIL_IO] = "FAIL_IO", + [QUEUE_FLAG_STACKABLE] = "STACKABLE", + [QUEUE_FLAG_NONROT] = "NONROT", + [QUEUE_FLAG_IO_STAT] = "IO_STAT", + [QUEUE_FLAG_DISCARD] = "DISCARD", + [QUEUE_FLAG_NOXMERGES] = "NOXMERGES", + [QUEUE_FLAG_ADD_RANDOM] = "ADD_RANDOM", + [QUEUE_FLAG_SECERASE] = "SECERASE", + [QUEUE_FLAG_SAME_FORCE] = "SAME_FORCE", + [QUEUE_FLAG_DEAD] = "DEAD", + [QUEUE_FLAG_INIT_DONE] = "INIT_DONE", + [QUEUE_FLAG_NO_SG_MERGE] = "NO_SG_MERGE", + [QUEUE_FLAG_POLL] = "POLL", + [QUEUE_FLAG_WC] = "WC", + [QUEUE_FLAG_FUA] = "FUA", + [QUEUE_FLAG_FLUSH_NQ] = "FLUSH_NQ", + [QUEUE_FLAG_DAX] = "DAX", + [QUEUE_FLAG_STATS] = "STATS", + [QUEUE_FLAG_POLL_STATS] = "POLL_STATS", + [QUEUE_FLAG_REGISTERED] = "REGISTERED", +}; + +static int blk_queue_flags_show(struct seq_file *m, void *v) +{ + struct request_queue *q = m->private; + + blk_flags_show(m, q->queue_flags, blk_queue_flag_name, + ARRAY_SIZE(blk_queue_flag_name)); + seq_puts(m, "\n"); + return 0; +} + +static ssize_t blk_queue_flags_store(struct file *file, const char __user *ubuf, + size_t len, loff_t *offp) +{ + struct request_queue *q = file_inode(file)->i_private; + char op[16] = { }, *s; + + len = min(len, sizeof(op) - 1); + if (copy_from_user(op, ubuf, len)) + return -EFAULT; + s = op; + strsep(&s, " \t\n"); /* strip trailing whitespace */ + if (strcmp(op, "run") == 0) { + blk_mq_run_hw_queues(q, true); + } else if (strcmp(op, "start") == 0) { + blk_mq_start_stopped_hw_queues(q, true); + } else { + pr_err("%s: unsupported operation %s. Use either 'run' or 'start'\n", + __func__, op); + return -EINVAL; + } + return len; +} + +static int blk_queue_flags_open(struct inode *inode, struct file *file) +{ + return single_open(file, blk_queue_flags_show, inode->i_private); +} + +static const struct file_operations blk_queue_flags_fops = { + .open = blk_queue_flags_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, + .write = blk_queue_flags_store, +}; + +static void print_stat(struct seq_file *m, struct blk_rq_stat *stat) +{ + if (stat->nr_samples) { + seq_printf(m, "samples=%d, mean=%lld, min=%llu, max=%llu", + stat->nr_samples, stat->mean, stat->min, stat->max); + } else { + seq_puts(m, "samples=0"); + } +} + +static int queue_poll_stat_show(struct seq_file *m, void *v) +{ + struct request_queue *q = m->private; + int bucket; + + for (bucket = 0; bucket < BLK_MQ_POLL_STATS_BKTS/2; bucket++) { + seq_printf(m, "read (%d Bytes): ", 1 << (9+bucket)); + print_stat(m, &q->poll_stat[2*bucket]); + seq_puts(m, "\n"); + + seq_printf(m, "write (%d Bytes): ", 1 << (9+bucket)); + print_stat(m, &q->poll_stat[2*bucket+1]); + seq_puts(m, "\n"); + } + return 0; +} + +static int queue_poll_stat_open(struct inode *inode, struct file *file) +{ + return single_open(file, queue_poll_stat_show, inode->i_private); +} + +static const struct file_operations queue_poll_stat_fops = { + .open = queue_poll_stat_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +static const char *const hctx_state_name[] = { + [BLK_MQ_S_STOPPED] = "STOPPED", + [BLK_MQ_S_TAG_ACTIVE] = "TAG_ACTIVE", + [BLK_MQ_S_SCHED_RESTART] = "SCHED_RESTART", + [BLK_MQ_S_TAG_WAITING] = "TAG_WAITING", + +}; static int hctx_state_show(struct seq_file *m, void *v) { struct blk_mq_hw_ctx *hctx = m->private; - seq_printf(m, "0x%lx\n", hctx->state); + blk_flags_show(m, hctx->state, hctx_state_name, + ARRAY_SIZE(hctx_state_name)); + seq_puts(m, "\n"); return 0; } @@ -63,11 +209,35 @@ static const struct file_operations hctx_state_fops = { .release = single_release, }; +static const char *const alloc_policy_name[] = { + [BLK_TAG_ALLOC_FIFO] = "fifo", + [BLK_TAG_ALLOC_RR] = "rr", +}; + +static const char *const hctx_flag_name[] = { + [ilog2(BLK_MQ_F_SHOULD_MERGE)] = "SHOULD_MERGE", + [ilog2(BLK_MQ_F_TAG_SHARED)] = "TAG_SHARED", + [ilog2(BLK_MQ_F_SG_MERGE)] = "SG_MERGE", + [ilog2(BLK_MQ_F_BLOCKING)] = "BLOCKING", + [ilog2(BLK_MQ_F_NO_SCHED)] = "NO_SCHED", +}; + static int hctx_flags_show(struct seq_file *m, void *v) { struct blk_mq_hw_ctx *hctx = m->private; - - seq_printf(m, "0x%lx\n", hctx->flags); + const int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(hctx->flags); + + seq_puts(m, "alloc_policy="); + if (alloc_policy < ARRAY_SIZE(alloc_policy_name) && + alloc_policy_name[alloc_policy]) + seq_puts(m, alloc_policy_name[alloc_policy]); + else + seq_printf(m, "%d", alloc_policy); + seq_puts(m, " "); + blk_flags_show(m, + hctx->flags ^ BLK_ALLOC_POLICY_TO_MQ_FLAG(alloc_policy), + hctx_flag_name, ARRAY_SIZE(hctx_flag_name)); + seq_puts(m, "\n"); return 0; } @@ -83,13 +253,83 @@ static const struct file_operations hctx_flags_fops = { .release = single_release, }; +static const char *const op_name[] = { + [REQ_OP_READ] = "READ", + [REQ_OP_WRITE] = "WRITE", + [REQ_OP_FLUSH] = "FLUSH", + [REQ_OP_DISCARD] = "DISCARD", + [REQ_OP_ZONE_REPORT] = "ZONE_REPORT", + [REQ_OP_SECURE_ERASE] = "SECURE_ERASE", + [REQ_OP_ZONE_RESET] = "ZONE_RESET", + [REQ_OP_WRITE_SAME] = "WRITE_SAME", + [REQ_OP_WRITE_ZEROES] = "WRITE_ZEROES", + [REQ_OP_SCSI_IN] = "SCSI_IN", + [REQ_OP_SCSI_OUT] = "SCSI_OUT", + [REQ_OP_DRV_IN] = "DRV_IN", + [REQ_OP_DRV_OUT] = "DRV_OUT", +}; + +static const char *const cmd_flag_name[] = { + [__REQ_FAILFAST_DEV] = "FAILFAST_DEV", + [__REQ_FAILFAST_TRANSPORT] = "FAILFAST_TRANSPORT", + [__REQ_FAILFAST_DRIVER] = "FAILFAST_DRIVER", + [__REQ_SYNC] = "SYNC", + [__REQ_META] = "META", + [__REQ_PRIO] = "PRIO", + [__REQ_NOMERGE] = "NOMERGE", + [__REQ_IDLE] = "IDLE", + [__REQ_INTEGRITY] = "INTEGRITY", + [__REQ_FUA] = "FUA", + [__REQ_PREFLUSH] = "PREFLUSH", + [__REQ_RAHEAD] = "RAHEAD", + [__REQ_BACKGROUND] = "BACKGROUND", + [__REQ_NR_BITS] = "NR_BITS", +}; + +static const char *const rqf_name[] = { + [ilog2((__force u32)RQF_SORTED)] = "SORTED", + [ilog2((__force u32)RQF_STARTED)] = "STARTED", + [ilog2((__force u32)RQF_QUEUED)] = "QUEUED", + [ilog2((__force u32)RQF_SOFTBARRIER)] = "SOFTBARRIER", + [ilog2((__force u32)RQF_FLUSH_SEQ)] = "FLUSH_SEQ", + [ilog2((__force u32)RQF_MIXED_MERGE)] = "MIXED_MERGE", + [ilog2((__force u32)RQF_MQ_INFLIGHT)] = "MQ_INFLIGHT", + [ilog2((__force u32)RQF_DONTPREP)] = "DONTPREP", + [ilog2((__force u32)RQF_PREEMPT)] = "PREEMPT", + [ilog2((__force u32)RQF_COPY_USER)] = "COPY_USER", + [ilog2((__force u32)RQF_FAILED)] = "FAILED", + [ilog2((__force u32)RQF_QUIET)] = "QUIET", + [ilog2((__force u32)RQF_ELVPRIV)] = "ELVPRIV", + [ilog2((__force u32)RQF_IO_STAT)] = "IO_STAT", + [ilog2((__force u32)RQF_ALLOCED)] = "ALLOCED", + [ilog2((__force u32)RQF_PM)] = "PM", + [ilog2((__force u32)RQF_HASHED)] = "HASHED", + [ilog2((__force u32)RQF_STATS)] = "STATS", + [ilog2((__force u32)RQF_SPECIAL_PAYLOAD)] = "SPECIAL_PAYLOAD", +}; + static int blk_mq_debugfs_rq_show(struct seq_file *m, void *v) { struct request *rq = list_entry_rq(v); - - seq_printf(m, "%p {.cmd_flags=0x%x, .rq_flags=0x%x, .tag=%d, .internal_tag=%d}\n", - rq, rq->cmd_flags, (__force unsigned int)rq->rq_flags, - rq->tag, rq->internal_tag); + const struct blk_mq_ops *const mq_ops = rq->q->mq_ops; + const unsigned int op = rq->cmd_flags & REQ_OP_MASK; + + seq_printf(m, "%p {.op=", rq); + if (op < ARRAY_SIZE(op_name) && op_name[op]) + seq_printf(m, "%s", op_name[op]); + else + seq_printf(m, "%d", op); + seq_puts(m, ", .cmd_flags="); + blk_flags_show(m, rq->cmd_flags & ~REQ_OP_MASK, cmd_flag_name, + ARRAY_SIZE(cmd_flag_name)); + seq_puts(m, ", .rq_flags="); + blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name, + ARRAY_SIZE(rqf_name)); + seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag, + rq->internal_tag); + if (mq_ops->show_rq) + mq_ops->show_rq(m, rq); + seq_puts(m, "}\n"); return 0; } @@ -322,60 +562,6 @@ static const struct file_operations hctx_io_poll_fops = { .release = single_release, }; -static void print_stat(struct seq_file *m, struct blk_rq_stat *stat) -{ - seq_printf(m, "samples=%d, mean=%lld, min=%llu, max=%llu", - stat->nr_samples, stat->mean, stat->min, stat->max); -} - -static int hctx_stats_show(struct seq_file *m, void *v) -{ - struct blk_mq_hw_ctx *hctx = m->private; - struct blk_rq_stat stat[2]; - - blk_stat_init(&stat[BLK_STAT_READ]); - blk_stat_init(&stat[BLK_STAT_WRITE]); - - blk_hctx_stat_get(hctx, stat); - - seq_puts(m, "read: "); - print_stat(m, &stat[BLK_STAT_READ]); - seq_puts(m, "\n"); - - seq_puts(m, "write: "); - print_stat(m, &stat[BLK_STAT_WRITE]); - seq_puts(m, "\n"); - return 0; -} - -static int hctx_stats_open(struct inode *inode, struct file *file) -{ - return single_open(file, hctx_stats_show, inode->i_private); -} - -static ssize_t hctx_stats_write(struct file *file, const char __user *buf, - size_t count, loff_t *ppos) -{ - struct seq_file *m = file->private_data; - struct blk_mq_hw_ctx *hctx = m->private; - struct blk_mq_ctx *ctx; - int i; - - hctx_for_each_ctx(hctx, ctx, i) { - blk_stat_init(&ctx->stat[BLK_STAT_READ]); - blk_stat_init(&ctx->stat[BLK_STAT_WRITE]); - } - return count; -} - -static const struct file_operations hctx_stats_fops = { - .open = hctx_stats_open, - .read = seq_read, - .write = hctx_stats_write, - .llseek = seq_lseek, - .release = single_release, -}; - static int hctx_dispatched_show(struct seq_file *m, void *v) { struct blk_mq_hw_ctx *hctx = m->private; @@ -636,6 +822,12 @@ static const struct file_operations ctx_completed_fops = { .release = single_release, }; +static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = { + {"poll_stat", 0400, &queue_poll_stat_fops}, + {"state", 0600, &blk_queue_flags_fops}, + {}, +}; + static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = { {"state", 0400, &hctx_state_fops}, {"flags", 0400, &hctx_flags_fops}, @@ -646,7 +838,6 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = { {"sched_tags", 0400, &hctx_sched_tags_fops}, {"sched_tags_bitmap", 0400, &hctx_sched_tags_bitmap_fops}, {"io_poll", 0600, &hctx_io_poll_fops}, - {"stats", 0600, &hctx_stats_fops}, {"dispatched", 0600, &hctx_dispatched_fops}, {"queued", 0600, &hctx_queued_fops}, {"run", 0600, &hctx_run_fops}, @@ -662,16 +853,17 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = { {}, }; -int blk_mq_debugfs_register(struct request_queue *q, const char *name) +int blk_mq_debugfs_register(struct request_queue *q) { if (!blk_debugfs_root) return -ENOENT; - q->debugfs_dir = debugfs_create_dir(name, blk_debugfs_root); + q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent), + blk_debugfs_root); if (!q->debugfs_dir) goto err; - if (blk_mq_debugfs_register_hctxs(q)) + if (blk_mq_debugfs_register_mq(q)) goto err; return 0; @@ -741,7 +933,7 @@ static int blk_mq_debugfs_register_hctx(struct request_queue *q, return 0; } -int blk_mq_debugfs_register_hctxs(struct request_queue *q) +int blk_mq_debugfs_register_mq(struct request_queue *q) { struct blk_mq_hw_ctx *hctx; int i; @@ -753,6 +945,9 @@ int blk_mq_debugfs_register_hctxs(struct request_queue *q) if (!q->mq_debugfs_dir) goto err; + if (!debugfs_create_files(q->mq_debugfs_dir, q, blk_mq_debugfs_queue_attrs)) + goto err; + queue_for_each_hw_ctx(q, hctx, i) { if (blk_mq_debugfs_register_hctx(q, hctx)) goto err; @@ -761,11 +956,11 @@ int blk_mq_debugfs_register_hctxs(struct request_queue *q) return 0; err: - blk_mq_debugfs_unregister_hctxs(q); + blk_mq_debugfs_unregister_mq(q); return -ENOMEM; } -void blk_mq_debugfs_unregister_hctxs(struct request_queue *q) +void blk_mq_debugfs_unregister_mq(struct request_queue *q) { debugfs_remove_recursive(q->mq_debugfs_dir); q->mq_debugfs_dir = NULL; diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c index 966c2169762e..0c3354cf3552 100644 --- a/block/blk-mq-pci.c +++ b/block/blk-mq-pci.c @@ -23,7 +23,7 @@ * @pdev: PCI device associated with @set. * * This function assumes the PCI device @pdev has at least as many available - * interrupt vetors as @set has queues. It will then queuery the vector + * interrupt vectors as @set has queues. It will then query the vector * corresponding to each queue for it's affinity mask and built queue mapping * that maps a queue to the CPUs that have irq affinity for the corresponding * vector. diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index c974a1bbf4cb..8b361e192e8a 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -30,43 +30,6 @@ void blk_mq_sched_free_hctx_data(struct request_queue *q, } EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data); -int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size, - int (*init)(struct blk_mq_hw_ctx *), - void (*exit)(struct blk_mq_hw_ctx *)) -{ - struct blk_mq_hw_ctx *hctx; - int ret; - int i; - - queue_for_each_hw_ctx(q, hctx, i) { - hctx->sched_data = kmalloc_node(size, GFP_KERNEL, hctx->numa_node); - if (!hctx->sched_data) { - ret = -ENOMEM; - goto error; - } - - if (init) { - ret = init(hctx); - if (ret) { - /* - * We don't want to give exit() a partially - * initialized sched_data. init() must clean up - * if it fails. - */ - kfree(hctx->sched_data); - hctx->sched_data = NULL; - goto error; - } - } - } - - return 0; -error: - blk_mq_sched_free_hctx_data(q, exit); - return ret; -} -EXPORT_SYMBOL_GPL(blk_mq_sched_init_hctx_data); - static void __blk_mq_sched_assign_ioc(struct request_queue *q, struct request *rq, struct bio *bio, @@ -119,7 +82,11 @@ struct request *blk_mq_sched_get_request(struct request_queue *q, if (likely(!data->hctx)) data->hctx = blk_mq_map_queue(q, data->ctx->cpu); - if (e) { + /* + * For a reserved tag, allocate a normal request since we might + * have driver dependencies on the value of the internal tag. + */ + if (e && !(data->flags & BLK_MQ_REQ_RESERVED)) { data->flags |= BLK_MQ_REQ_INTERNAL; /* @@ -227,22 +194,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) } } -void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx, - struct list_head *rq_list, - struct request *(*get_rq)(struct blk_mq_hw_ctx *)) -{ - do { - struct request *rq; - - rq = get_rq(hctx); - if (!rq) - break; - - list_add_tail(&rq->queuelist, rq_list); - } while (1); -} -EXPORT_SYMBOL_GPL(blk_mq_sched_move_to_dispatch); - bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio, struct request **merged_request) { @@ -508,11 +459,24 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx) { struct elevator_queue *e = q->elevator; + int ret; if (!e) return 0; - return blk_mq_sched_alloc_tags(q, hctx, hctx_idx); + ret = blk_mq_sched_alloc_tags(q, hctx, hctx_idx); + if (ret) + return ret; + + if (e->type->ops.mq.init_hctx) { + ret = e->type->ops.mq.init_hctx(hctx, hctx_idx); + if (ret) { + blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx); + return ret; + } + } + + return 0; } void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx, @@ -523,12 +487,18 @@ void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx, if (!e) return; + if (e->type->ops.mq.exit_hctx && hctx->sched_data) { + e->type->ops.mq.exit_hctx(hctx, hctx_idx); + hctx->sched_data = NULL; + } + blk_mq_sched_free_tags(q->tag_set, hctx, hctx_idx); } int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e) { struct blk_mq_hw_ctx *hctx; + struct elevator_queue *eq; unsigned int i; int ret; @@ -553,6 +523,18 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e) if (ret) goto err; + if (e->ops.mq.init_hctx) { + queue_for_each_hw_ctx(q, hctx, i) { + ret = e->ops.mq.init_hctx(hctx, i); + if (ret) { + eq = q->elevator; + blk_mq_exit_sched(q, eq); + kobject_put(&eq->kobj); + return ret; + } + } + } + return 0; err: @@ -563,6 +545,17 @@ err: void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e) { + struct blk_mq_hw_ctx *hctx; + unsigned int i; + + if (e->type->ops.mq.exit_hctx) { + queue_for_each_hw_ctx(q, hctx, i) { + if (hctx->sched_data) { + e->type->ops.mq.exit_hctx(hctx, i); + hctx->sched_data = NULL; + } + } + } if (e->type->ops.mq.exit_sched) e->type->ops.mq.exit_sched(e); blk_mq_sched_tags_teardown(q); diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index 3a9e6e40558b..edafb5383b7b 100644 --- a/block/blk-mq-sched.h +++ b/block/blk-mq-sched.h @@ -4,10 +4,6 @@ #include "blk-mq.h" #include "blk-mq-tag.h" -int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size, - int (*init)(struct blk_mq_hw_ctx *), - void (*exit)(struct blk_mq_hw_ctx *)); - void blk_mq_sched_free_hctx_data(struct request_queue *q, void (*exit)(struct blk_mq_hw_ctx *)); @@ -28,9 +24,6 @@ void blk_mq_sched_insert_requests(struct request_queue *q, struct list_head *list, bool run_queue_async); void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx); -void blk_mq_sched_move_to_dispatch(struct blk_mq_hw_ctx *hctx, - struct list_head *rq_list, - struct request *(*get_rq)(struct blk_mq_hw_ctx *)); int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e); void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e); @@ -86,17 +79,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq, return true; } -static inline void -blk_mq_sched_completed_request(struct blk_mq_hw_ctx *hctx, struct request *rq) +static inline void blk_mq_sched_completed_request(struct request *rq) { - struct elevator_queue *e = hctx->queue->elevator; + struct elevator_queue *e = rq->q->elevator; if (e && e->type->ops.mq.completed_request) - e->type->ops.mq.completed_request(hctx, rq); - - BUG_ON(rq->internal_tag == -1); - - blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, rq->internal_tag); + e->type->ops.mq.completed_request(rq); } static inline void blk_mq_sched_started_request(struct request *rq) diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index d745ab81033a..ec0afdf765e3 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -253,10 +253,12 @@ static void __blk_mq_unregister_dev(struct device *dev, struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i; + lockdep_assert_held(&q->sysfs_lock); + queue_for_each_hw_ctx(q, hctx, i) blk_mq_unregister_hctx(hctx); - blk_mq_debugfs_unregister_hctxs(q); + blk_mq_debugfs_unregister_mq(q); kobject_uevent(&q->mq_kobj, KOBJ_REMOVE); kobject_del(&q->mq_kobj); @@ -267,9 +269,9 @@ static void __blk_mq_unregister_dev(struct device *dev, struct request_queue *q) void blk_mq_unregister_dev(struct device *dev, struct request_queue *q) { - blk_mq_disable_hotplug(); + mutex_lock(&q->sysfs_lock); __blk_mq_unregister_dev(dev, q); - blk_mq_enable_hotplug(); + mutex_unlock(&q->sysfs_lock); } void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx) @@ -302,12 +304,13 @@ void blk_mq_sysfs_init(struct request_queue *q) } } -int blk_mq_register_dev(struct device *dev, struct request_queue *q) +int __blk_mq_register_dev(struct device *dev, struct request_queue *q) { struct blk_mq_hw_ctx *hctx; int ret, i; - blk_mq_disable_hotplug(); + WARN_ON_ONCE(!q->kobj.parent); + lockdep_assert_held(&q->sysfs_lock); ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq"); if (ret < 0) @@ -315,20 +318,38 @@ int blk_mq_register_dev(struct device *dev, struct request_queue *q) kobject_uevent(&q->mq_kobj, KOBJ_ADD); - blk_mq_debugfs_register(q, kobject_name(&dev->kobj)); + blk_mq_debugfs_register(q); queue_for_each_hw_ctx(q, hctx, i) { ret = blk_mq_register_hctx(hctx); if (ret) - break; + goto unreg; } - if (ret) - __blk_mq_unregister_dev(dev, q); - else - q->mq_sysfs_init_done = true; + q->mq_sysfs_init_done = true; + out: - blk_mq_enable_hotplug(); + return ret; + +unreg: + while (--i >= 0) + blk_mq_unregister_hctx(q->queue_hw_ctx[i]); + + blk_mq_debugfs_unregister_mq(q); + + kobject_uevent(&q->mq_kobj, KOBJ_REMOVE); + kobject_del(&q->mq_kobj); + kobject_put(&dev->kobj); + return ret; +} + +int blk_mq_register_dev(struct device *dev, struct request_queue *q) +{ + int ret; + + mutex_lock(&q->sysfs_lock); + ret = __blk_mq_register_dev(dev, q); + mutex_unlock(&q->sysfs_lock); return ret; } @@ -339,13 +360,17 @@ void blk_mq_sysfs_unregister(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i; + mutex_lock(&q->sysfs_lock); if (!q->mq_sysfs_init_done) - return; + goto unlock; - blk_mq_debugfs_unregister_hctxs(q); + blk_mq_debugfs_unregister_mq(q); queue_for_each_hw_ctx(q, hctx, i) blk_mq_unregister_hctx(hctx); + +unlock: + mutex_unlock(&q->sysfs_lock); } int blk_mq_sysfs_register(struct request_queue *q) @@ -353,10 +378,11 @@ int blk_mq_sysfs_register(struct request_queue *q) struct blk_mq_hw_ctx *hctx; int i, ret = 0; + mutex_lock(&q->sysfs_lock); if (!q->mq_sysfs_init_done) - return ret; + goto unlock; - blk_mq_debugfs_register_hctxs(q); + blk_mq_debugfs_register_mq(q); queue_for_each_hw_ctx(q, hctx, i) { ret = blk_mq_register_hctx(hctx); @@ -364,5 +390,8 @@ int blk_mq_sysfs_register(struct request_queue *q) break; } +unlock: + mutex_unlock(&q->sysfs_lock); + return ret; } diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 9d97bfc4d465..d0be72ccb091 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -96,7 +96,10 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data, if (!(data->flags & BLK_MQ_REQ_INTERNAL) && !hctx_may_queue(data->hctx, bt)) return -1; - return __sbitmap_queue_get(bt); + if (data->shallow_depth) + return __sbitmap_queue_get_shallow(bt, data->shallow_depth); + else + return __sbitmap_queue_get(bt); } unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data) diff --git a/block/blk-mq.c b/block/blk-mq.c index c7836a1ded97..bf90684a007a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -39,6 +39,26 @@ static DEFINE_MUTEX(all_q_mutex); static LIST_HEAD(all_q_list); +static void blk_mq_poll_stats_start(struct request_queue *q); +static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb); + +static int blk_mq_poll_stats_bkt(const struct request *rq) +{ + int ddir, bytes, bucket; + + ddir = rq_data_dir(rq); + bytes = blk_rq_bytes(rq); + + bucket = ddir + 2*(ilog2(bytes) - 9); + + if (bucket < 0) + return -1; + else if (bucket >= BLK_MQ_POLL_STATS_BKTS) + return ddir + BLK_MQ_POLL_STATS_BKTS - 2; + + return bucket; +} + /* * Check if any of the ctx's have pending work in this hardware queue */ @@ -65,7 +85,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx, sbitmap_clear_bit(&hctx->ctx_map, ctx->index_hw); } -void blk_mq_freeze_queue_start(struct request_queue *q) +void blk_freeze_queue_start(struct request_queue *q) { int freeze_depth; @@ -75,7 +95,7 @@ void blk_mq_freeze_queue_start(struct request_queue *q) blk_mq_run_hw_queues(q, false); } } -EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start); +EXPORT_SYMBOL_GPL(blk_freeze_queue_start); void blk_mq_freeze_queue_wait(struct request_queue *q) { @@ -105,7 +125,7 @@ void blk_freeze_queue(struct request_queue *q) * no blk_unfreeze_queue(), and blk_freeze_queue() is not * exported to drivers as the only user for unfreeze is blk_mq. */ - blk_mq_freeze_queue_start(q); + blk_freeze_queue_start(q); blk_mq_freeze_queue_wait(q); } @@ -210,7 +230,6 @@ void blk_mq_rq_ctx_init(struct request_queue *q, struct blk_mq_ctx *ctx, #endif rq->special = NULL; /* tag was already set */ - rq->errors = 0; rq->extra_len = 0; INIT_LIST_HEAD(&rq->timeout_list); @@ -347,7 +366,7 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx, if (rq->tag != -1) blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag); if (sched_tag != -1) - blk_mq_sched_completed_request(hctx, rq); + blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag); blk_mq_sched_restart(hctx); blk_queue_exit(q); } @@ -365,6 +384,7 @@ void blk_mq_finish_request(struct request *rq) { blk_mq_finish_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq); } +EXPORT_SYMBOL_GPL(blk_mq_finish_request); void blk_mq_free_request(struct request *rq) { @@ -402,12 +422,19 @@ static void __blk_mq_complete_request_remote(void *data) rq->q->softirq_done_fn(rq); } -static void blk_mq_ipi_complete_request(struct request *rq) +static void __blk_mq_complete_request(struct request *rq) { struct blk_mq_ctx *ctx = rq->mq_ctx; bool shared = false; int cpu; + if (rq->internal_tag != -1) + blk_mq_sched_completed_request(rq); + if (rq->rq_flags & RQF_STATS) { + blk_mq_poll_stats_start(rq->q); + blk_stat_add(rq); + } + if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) { rq->q->softirq_done_fn(rq); return; @@ -428,33 +455,6 @@ static void blk_mq_ipi_complete_request(struct request *rq) put_cpu(); } -static void blk_mq_stat_add(struct request *rq) -{ - if (rq->rq_flags & RQF_STATS) { - /* - * We could rq->mq_ctx here, but there's less of a risk - * of races if we have the completion event add the stats - * to the local software queue. - */ - struct blk_mq_ctx *ctx; - - ctx = __blk_mq_get_ctx(rq->q, raw_smp_processor_id()); - blk_stat_add(&ctx->stat[rq_data_dir(rq)], rq); - } -} - -static void __blk_mq_complete_request(struct request *rq) -{ - struct request_queue *q = rq->q; - - blk_mq_stat_add(rq); - - if (!q->softirq_done_fn) - blk_mq_end_request(rq, rq->errors); - else - blk_mq_ipi_complete_request(rq); -} - /** * blk_mq_complete_request - end I/O on a request * @rq: the request being processed @@ -463,16 +463,14 @@ static void __blk_mq_complete_request(struct request *rq) * Ends all I/O on a request. It does not handle partial completions. * The actual completion happens out-of-order, through a IPI handler. **/ -void blk_mq_complete_request(struct request *rq, int error) +void blk_mq_complete_request(struct request *rq) { struct request_queue *q = rq->q; if (unlikely(blk_should_fake_timeout(q))) return; - if (!blk_mark_rq_complete(rq)) { - rq->errors = error; + if (!blk_mark_rq_complete(rq)) __blk_mq_complete_request(rq); - } } EXPORT_SYMBOL(blk_mq_complete_request); @@ -491,7 +489,7 @@ void blk_mq_start_request(struct request *rq) trace_block_rq_issue(q, rq); if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) { - blk_stat_set_issue_time(&rq->issue_stat); + blk_stat_set_issue(&rq->issue_stat, blk_rq_sectors(rq)); rq->rq_flags |= RQF_STATS; wbt_issue(q->rq_wb, &rq->issue_stat); } @@ -526,6 +524,15 @@ void blk_mq_start_request(struct request *rq) } EXPORT_SYMBOL(blk_mq_start_request); +/* + * When we reach here because queue is busy, REQ_ATOM_COMPLETE + * flag isn't set yet, so there may be race with timeout handler, + * but given rq->deadline is just set in .queue_rq() under + * this situation, the race won't be possible in reality because + * rq->timeout should be set as big enough to cover the window + * between blk_mq_start_request() called from .queue_rq() and + * clearing REQ_ATOM_STARTED here. + */ static void __blk_mq_requeue_request(struct request *rq) { struct request_queue *q = rq->q; @@ -633,8 +640,7 @@ void blk_mq_abort_requeue_list(struct request_queue *q) rq = list_first_entry(&rq_list, struct request, queuelist); list_del_init(&rq->queuelist); - rq->errors = -EIO; - blk_mq_end_request(rq, rq->errors); + blk_mq_end_request(rq, -EIO); } } EXPORT_SYMBOL(blk_mq_abort_requeue_list); @@ -666,7 +672,7 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved) * just be ignored. This can happen due to the bitflag ordering. * Timeout first checks if STARTED is set, and if it is, assumes * the request is active. But if we race with completion, then - * we both flags will get cleared. So check here again, and ignore + * both flags will get cleared. So check here again, and ignore * a timeout event with a request that isn't active. */ if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags)) @@ -699,6 +705,19 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) return; + /* + * The rq being checked may have been freed and reallocated + * out already here, we avoid this race by checking rq->deadline + * and REQ_ATOM_COMPLETE flag together: + * + * - if rq->deadline is observed as new value because of + * reusing, the rq won't be timed out because of timing. + * - if rq->deadline is observed as previous value, + * REQ_ATOM_COMPLETE flag won't be cleared in reuse path + * because we put a barrier between setting rq->deadline + * and clearing the flag in blk_mq_start_request(), so + * this rq won't be timed out too. + */ if (time_after_eq(jiffies, rq->deadline)) { if (!blk_mark_rq_complete(rq)) blk_mq_rq_timed_out(rq, reserved); @@ -727,7 +746,7 @@ static void blk_mq_timeout_work(struct work_struct *work) * percpu_ref_tryget directly, because we need to be able to * obtain a reference even in the short window between the queue * starting to freeze, by dropping the first reference in - * blk_mq_freeze_queue_start, and the moment the last request is + * blk_freeze_queue_start, and the moment the last request is * consumed, marked by the instant q_usage_counter reaches * zero. */ @@ -845,6 +864,8 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, .flags = wait ? 0 : BLK_MQ_REQ_NOWAIT, }; + might_sleep_if(wait); + if (rq->tag != -1) goto done; @@ -964,20 +985,12 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) { struct blk_mq_hw_ctx *hctx; struct request *rq; - LIST_HEAD(driver_list); - struct list_head *dptr; int errors, queued, ret = BLK_MQ_RQ_QUEUE_OK; if (list_empty(list)) return false; /* - * Start off with dptr being NULL, so we start the first request - * immediately, even if we have more pending. - */ - dptr = NULL; - - /* * Now process all the entries, sending them to the driver. */ errors = queued = 0; @@ -993,23 +1006,21 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) * The initial allocation attempt failed, so we need to * rerun the hardware queue when a tag is freed. */ - if (blk_mq_dispatch_wait_add(hctx)) { - /* - * It's possible that a tag was freed in the - * window between the allocation failure and - * adding the hardware queue to the wait queue. - */ - if (!blk_mq_get_driver_tag(rq, &hctx, false)) - break; - } else { + if (!blk_mq_dispatch_wait_add(hctx)) + break; + + /* + * It's possible that a tag was freed in the window + * between the allocation failure and adding the + * hardware queue to the wait queue. + */ + if (!blk_mq_get_driver_tag(rq, &hctx, false)) break; - } } list_del_init(&rq->queuelist); bd.rq = rq; - bd.list = dptr; /* * Flag last if we have no more requests, or if we have more @@ -1038,20 +1049,12 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) pr_err("blk-mq: bad return on queue: %d\n", ret); case BLK_MQ_RQ_QUEUE_ERROR: errors++; - rq->errors = -EIO; - blk_mq_end_request(rq, rq->errors); + blk_mq_end_request(rq, -EIO); break; } if (ret == BLK_MQ_RQ_QUEUE_BUSY) break; - - /* - * We've done the first request. If we have more than 1 - * left in the list, set dptr to defer issue. - */ - if (!dptr && list->next != list->prev) - dptr = &driver_list; } while (!list_empty(list)); hctx->dispatched[queued_to_index(queued)]++; @@ -1062,8 +1065,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) */ if (!list_empty(list)) { /* - * If we got a driver tag for the next request already, - * free it again. + * If an I/O scheduler has been configured and we got a driver + * tag for the next request already, free it again. */ rq = list_first_entry(list, struct request, queuelist); blk_mq_put_driver_tag(rq); @@ -1073,16 +1076,24 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) spin_unlock(&hctx->lock); /* - * the queue is expected stopped with BLK_MQ_RQ_QUEUE_BUSY, but - * it's possible the queue is stopped and restarted again - * before this. Queue restart will dispatch requests. And since - * requests in rq_list aren't added into hctx->dispatch yet, - * the requests in rq_list might get lost. + * If SCHED_RESTART was set by the caller of this function and + * it is no longer set that means that it was cleared by another + * thread and hence that a queue rerun is needed. * - * blk_mq_run_hw_queue() already checks the STOPPED bit + * If TAG_WAITING is set that means that an I/O scheduler has + * been configured and another thread is waiting for a driver + * tag. To guarantee fairness, do not rerun this hardware queue + * but let the other thread grab the driver tag. * - * If RESTART or TAG_WAITING is set, then let completion restart - * the queue instead of potentially looping here. + * If no I/O scheduler has been configured it is possible that + * the hardware queue got stopped and restarted before requests + * were pushed back onto the dispatch list. Rerun the queue to + * avoid starvation. Notes: + * - blk_mq_run_hw_queue() checks whether or not a queue has + * been stopped before rerunning a queue. + * - Some but not all block drivers stop a queue before + * returning BLK_MQ_RQ_QUEUE_BUSY. Two exceptions are scsi-mq + * and dm-rq. */ if (!blk_mq_sched_needs_restart(hctx) && !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) @@ -1104,6 +1115,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) blk_mq_sched_dispatch_requests(hctx); rcu_read_unlock(); } else { + might_sleep(); + srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu); blk_mq_sched_dispatch_requests(hctx); srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx); @@ -1153,13 +1166,9 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async, put_cpu(); } - if (msecs == 0) - kblockd_schedule_work_on(blk_mq_hctx_next_cpu(hctx), - &hctx->run_work); - else - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx), - &hctx->delayed_run_work, - msecs_to_jiffies(msecs)); + kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx), + &hctx->run_work, + msecs_to_jiffies(msecs)); } void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs) @@ -1172,6 +1181,7 @@ void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) { __blk_mq_delay_run_hw_queue(hctx, async, 0); } +EXPORT_SYMBOL(blk_mq_run_hw_queue); void blk_mq_run_hw_queues(struct request_queue *q, bool async) { @@ -1210,8 +1220,7 @@ EXPORT_SYMBOL(blk_mq_queue_stopped); void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx) { - cancel_work(&hctx->run_work); - cancel_delayed_work(&hctx->delay_work); + cancel_delayed_work_sync(&hctx->run_work); set_bit(BLK_MQ_S_STOPPED, &hctx->state); } EXPORT_SYMBOL(blk_mq_stop_hw_queue); @@ -1268,38 +1277,40 @@ static void blk_mq_run_work_fn(struct work_struct *work) { struct blk_mq_hw_ctx *hctx; - hctx = container_of(work, struct blk_mq_hw_ctx, run_work); - - __blk_mq_run_hw_queue(hctx); -} + hctx = container_of(work, struct blk_mq_hw_ctx, run_work.work); -static void blk_mq_delayed_run_work_fn(struct work_struct *work) -{ - struct blk_mq_hw_ctx *hctx; + /* + * If we are stopped, don't run the queue. The exception is if + * BLK_MQ_S_START_ON_RUN is set. For that case, we auto-clear + * the STOPPED bit and run it. + */ + if (test_bit(BLK_MQ_S_STOPPED, &hctx->state)) { + if (!test_bit(BLK_MQ_S_START_ON_RUN, &hctx->state)) + return; - hctx = container_of(work, struct blk_mq_hw_ctx, delayed_run_work.work); + clear_bit(BLK_MQ_S_START_ON_RUN, &hctx->state); + clear_bit(BLK_MQ_S_STOPPED, &hctx->state); + } __blk_mq_run_hw_queue(hctx); } -static void blk_mq_delay_work_fn(struct work_struct *work) -{ - struct blk_mq_hw_ctx *hctx; - - hctx = container_of(work, struct blk_mq_hw_ctx, delay_work.work); - - if (test_and_clear_bit(BLK_MQ_S_STOPPED, &hctx->state)) - __blk_mq_run_hw_queue(hctx); -} void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs) { if (unlikely(!blk_mq_hw_queue_mapped(hctx))) return; + /* + * Stop the hw queue, then modify currently delayed work. + * This should prevent us from running the queue prematurely. + * Mark the queue as auto-clearing STOPPED when it runs. + */ blk_mq_stop_hw_queue(hctx); - kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx), - &hctx->delay_work, msecs_to_jiffies(msecs)); + set_bit(BLK_MQ_S_START_ON_RUN, &hctx->state); + kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), + &hctx->run_work, + msecs_to_jiffies(msecs)); } EXPORT_SYMBOL(blk_mq_delay_queue); @@ -1408,7 +1419,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) static void blk_mq_bio_to_request(struct request *rq, struct bio *bio) { - init_request_from_bio(rq, bio); + blk_init_request_from_bio(rq, bio); blk_account_io_start(rq, true); } @@ -1453,14 +1464,13 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq) return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true); } -static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie, +static void __blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie, bool may_sleep) { struct request_queue *q = rq->q; struct blk_mq_queue_data bd = { .rq = rq, - .list = NULL, - .last = 1 + .last = true, }; struct blk_mq_hw_ctx *hctx; blk_qc_t new_cookie; @@ -1485,31 +1495,42 @@ static void blk_mq_try_issue_directly(struct request *rq, blk_qc_t *cookie, return; } - __blk_mq_requeue_request(rq); - if (ret == BLK_MQ_RQ_QUEUE_ERROR) { *cookie = BLK_QC_T_NONE; - rq->errors = -EIO; - blk_mq_end_request(rq, rq->errors); + blk_mq_end_request(rq, -EIO); return; } + __blk_mq_requeue_request(rq); insert: blk_mq_sched_insert_request(rq, false, true, false, may_sleep); } -/* - * Multiple hardware queue variant. This will not use per-process plugs, - * but will attempt to bypass the hctx queueing if we can go straight to - * hardware for SYNC IO. - */ +static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, + struct request *rq, blk_qc_t *cookie) +{ + if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { + rcu_read_lock(); + __blk_mq_try_issue_directly(rq, cookie, false); + rcu_read_unlock(); + } else { + unsigned int srcu_idx; + + might_sleep(); + + srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu); + __blk_mq_try_issue_directly(rq, cookie, true); + srcu_read_unlock(&hctx->queue_rq_srcu, srcu_idx); + } +} + static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) { const int is_sync = op_is_sync(bio->bi_opf); const int is_flush_fua = op_is_flush(bio->bi_opf); struct blk_mq_alloc_data data = { .flags = 0 }; struct request *rq; - unsigned int request_count = 0, srcu_idx; + unsigned int request_count = 0; struct blk_plug *plug; struct request *same_queue_rq = NULL; blk_qc_t cookie; @@ -1545,147 +1566,21 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) cookie = request_to_qc_t(data.hctx, rq); - if (unlikely(is_flush_fua)) { - if (q->elevator) - goto elv_insert; - blk_mq_bio_to_request(rq, bio); - blk_insert_flush(rq); - goto run_queue; - } - plug = current->plug; - /* - * If the driver supports defer issued based on 'last', then - * queue it up like normal since we can potentially save some - * CPU this way. - */ - if (((plug && !blk_queue_nomerges(q)) || is_sync) && - !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) { - struct request *old_rq = NULL; - - blk_mq_bio_to_request(rq, bio); - - /* - * We do limited plugging. If the bio can be merged, do that. - * Otherwise the existing request in the plug list will be - * issued. So the plug list will have one request at most - */ - if (plug) { - /* - * The plug list might get flushed before this. If that - * happens, same_queue_rq is invalid and plug list is - * empty - */ - if (same_queue_rq && !list_empty(&plug->mq_list)) { - old_rq = same_queue_rq; - list_del_init(&old_rq->queuelist); - } - list_add_tail(&rq->queuelist, &plug->mq_list); - } else /* is_sync */ - old_rq = rq; + if (unlikely(is_flush_fua)) { blk_mq_put_ctx(data.ctx); - if (!old_rq) - goto done; - - if (!(data.hctx->flags & BLK_MQ_F_BLOCKING)) { - rcu_read_lock(); - blk_mq_try_issue_directly(old_rq, &cookie, false); - rcu_read_unlock(); + blk_mq_bio_to_request(rq, bio); + if (q->elevator) { + blk_mq_sched_insert_request(rq, false, true, true, + true); } else { - srcu_idx = srcu_read_lock(&data.hctx->queue_rq_srcu); - blk_mq_try_issue_directly(old_rq, &cookie, true); - srcu_read_unlock(&data.hctx->queue_rq_srcu, srcu_idx); + blk_insert_flush(rq); + blk_mq_run_hw_queue(data.hctx, true); } - goto done; - } - - if (q->elevator) { -elv_insert: - blk_mq_put_ctx(data.ctx); - blk_mq_bio_to_request(rq, bio); - blk_mq_sched_insert_request(rq, false, true, - !is_sync || is_flush_fua, true); - goto done; - } - if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) { - /* - * For a SYNC request, send it to the hardware immediately. For - * an ASYNC request, just ensure that we run it later on. The - * latter allows for merging opportunities and more efficient - * dispatching. - */ -run_queue: - blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua); - } - blk_mq_put_ctx(data.ctx); -done: - return cookie; -} - -/* - * Single hardware queue variant. This will attempt to use any per-process - * plug for merging and IO deferral. - */ -static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) -{ - const int is_sync = op_is_sync(bio->bi_opf); - const int is_flush_fua = op_is_flush(bio->bi_opf); - struct blk_plug *plug; - unsigned int request_count = 0; - struct blk_mq_alloc_data data = { .flags = 0 }; - struct request *rq; - blk_qc_t cookie; - unsigned int wb_acct; - - blk_queue_bounce(q, &bio); - - if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) { - bio_io_error(bio); - return BLK_QC_T_NONE; - } - - blk_queue_split(q, &bio, q->bio_split); - - if (!is_flush_fua && !blk_queue_nomerges(q)) { - if (blk_attempt_plug_merge(q, bio, &request_count, NULL)) - return BLK_QC_T_NONE; - } else - request_count = blk_plug_queued_count(q); - - if (blk_mq_sched_bio_merge(q, bio)) - return BLK_QC_T_NONE; - - wb_acct = wbt_wait(q->rq_wb, bio, NULL); - - trace_block_getrq(q, bio, bio->bi_opf); - - rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data); - if (unlikely(!rq)) { - __wbt_done(q->rq_wb, wb_acct); - return BLK_QC_T_NONE; - } - - wbt_track(&rq->issue_stat, wb_acct); - - cookie = request_to_qc_t(data.hctx, rq); - - if (unlikely(is_flush_fua)) { - if (q->elevator) - goto elv_insert; - blk_mq_bio_to_request(rq, bio); - blk_insert_flush(rq); - goto run_queue; - } - - /* - * A task plug currently exists. Since this is completely lockless, - * utilize that to temporarily store requests until the task is - * either done or scheduled away. - */ - plug = current->plug; - if (plug) { + } else if (plug && q->nr_hw_queues == 1) { struct request *last = NULL; + blk_mq_put_ctx(data.ctx); blk_mq_bio_to_request(rq, bio); /* @@ -1694,13 +1589,14 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) */ if (list_empty(&plug->mq_list)) request_count = 0; + else if (blk_queue_nomerges(q)) + request_count = blk_plug_queued_count(q); + if (!request_count) trace_block_plug(q); else last = list_entry_rq(plug->mq_list.prev); - blk_mq_put_ctx(data.ctx); - if (request_count >= BLK_MAX_REQUEST_COUNT || (last && blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) { blk_flush_plug_list(plug, false); @@ -1708,30 +1604,41 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) } list_add_tail(&rq->queuelist, &plug->mq_list); - return cookie; - } - - if (q->elevator) { -elv_insert: - blk_mq_put_ctx(data.ctx); + } else if (plug && !blk_queue_nomerges(q)) { blk_mq_bio_to_request(rq, bio); - blk_mq_sched_insert_request(rq, false, true, - !is_sync || is_flush_fua, true); - goto done; - } - if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) { + /* - * For a SYNC request, send it to the hardware immediately. For - * an ASYNC request, just ensure that we run it later on. The - * latter allows for merging opportunities and more efficient - * dispatching. + * We do limited plugging. If the bio can be merged, do that. + * Otherwise the existing request in the plug list will be + * issued. So the plug list will have one request at most + * The plug list might get flushed before this. If that happens, + * the plug list is empty, and same_queue_rq is invalid. */ -run_queue: - blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua); - } + if (list_empty(&plug->mq_list)) + same_queue_rq = NULL; + if (same_queue_rq) + list_del_init(&same_queue_rq->queuelist); + list_add_tail(&rq->queuelist, &plug->mq_list); + + blk_mq_put_ctx(data.ctx); + + if (same_queue_rq) + blk_mq_try_issue_directly(data.hctx, same_queue_rq, + &cookie); + } else if (q->nr_hw_queues > 1 && is_sync) { + blk_mq_put_ctx(data.ctx); + blk_mq_bio_to_request(rq, bio); + blk_mq_try_issue_directly(data.hctx, rq, &cookie); + } else if (q->elevator) { + blk_mq_put_ctx(data.ctx); + blk_mq_bio_to_request(rq, bio); + blk_mq_sched_insert_request(rq, false, true, true, true); + } else if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) { + blk_mq_put_ctx(data.ctx); + blk_mq_run_hw_queue(data.hctx, true); + } else + blk_mq_put_ctx(data.ctx); - blk_mq_put_ctx(data.ctx); -done: return cookie; } @@ -1988,9 +1895,7 @@ static int blk_mq_init_hctx(struct request_queue *q, if (node == NUMA_NO_NODE) node = hctx->numa_node = set->numa_node; - INIT_WORK(&hctx->run_work, blk_mq_run_work_fn); - INIT_DELAYED_WORK(&hctx->delayed_run_work, blk_mq_delayed_run_work_fn); - INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn); + INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn); spin_lock_init(&hctx->lock); INIT_LIST_HEAD(&hctx->dispatch); hctx->queue = q; @@ -2067,8 +1972,6 @@ static void blk_mq_init_cpu_queues(struct request_queue *q, spin_lock_init(&__ctx->lock); INIT_LIST_HEAD(&__ctx->rq_list); __ctx->queue = q; - blk_stat_init(&__ctx->stat[BLK_STAT_READ]); - blk_stat_init(&__ctx->stat[BLK_STAT_WRITE]); /* If the cpu isn't online, the cpu is mapped to first hctx */ if (!cpu_online(i)) @@ -2215,6 +2118,8 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set, bool shared) { struct request_queue *q; + lockdep_assert_held(&set->tag_list_lock); + list_for_each_entry(q, &set->tag_list, tag_set_list) { blk_mq_freeze_queue(q); queue_set_hctx_shared(q, shared); @@ -2227,7 +2132,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q) struct blk_mq_tag_set *set = q->tag_set; mutex_lock(&set->tag_list_lock); - list_del_init(&q->tag_set_list); + list_del_rcu(&q->tag_set_list); + INIT_LIST_HEAD(&q->tag_set_list); if (list_is_singular(&set->tag_list)) { /* just transitioned to unshared */ set->flags &= ~BLK_MQ_F_TAG_SHARED; @@ -2235,6 +2141,8 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q) blk_mq_update_tag_set_depth(set, false); } mutex_unlock(&set->tag_list_lock); + + synchronize_rcu(); } static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set, @@ -2252,7 +2160,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set, } if (set->flags & BLK_MQ_F_TAG_SHARED) queue_set_hctx_shared(q, true); - list_add_tail(&q->tag_set_list, &set->tag_list); + list_add_tail_rcu(&q->tag_set_list, &set->tag_list); mutex_unlock(&set->tag_list_lock); } @@ -2364,6 +2272,12 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, /* mark the queue as mq asap */ q->mq_ops = set->ops; + q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn, + blk_mq_poll_stats_bkt, + BLK_MQ_POLL_STATS_BKTS, q); + if (!q->poll_cb) + goto err_exit; + q->queue_ctx = alloc_percpu(struct blk_mq_ctx); if (!q->queue_ctx) goto err_exit; @@ -2398,10 +2312,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, INIT_LIST_HEAD(&q->requeue_list); spin_lock_init(&q->requeue_lock); - if (q->nr_hw_queues > 1) - blk_queue_make_request(q, blk_mq_make_request); - else - blk_queue_make_request(q, blk_sq_make_request); + blk_queue_make_request(q, blk_mq_make_request); /* * Do this after blk_queue_make_request() overrides it... @@ -2456,8 +2367,6 @@ void blk_mq_free_queue(struct request_queue *q) list_del_init(&q->all_q_node); mutex_unlock(&all_q_mutex); - wbt_exit(q); - blk_mq_del_queue_tag_set(q); blk_mq_exit_hw_queues(q, set, set->nr_hw_queues); @@ -2502,7 +2411,7 @@ static void blk_mq_queue_reinit_work(void) * take place in parallel. */ list_for_each_entry(q, &all_q_list, all_q_node) - blk_mq_freeze_queue_start(q); + blk_freeze_queue_start(q); list_for_each_entry(q, &all_q_list, all_q_node) blk_mq_freeze_queue_wait(q); @@ -2743,6 +2652,8 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) { struct request_queue *q; + lockdep_assert_held(&set->tag_list_lock); + if (nr_hw_queues > nr_cpu_ids) nr_hw_queues = nr_cpu_ids; if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues) @@ -2755,16 +2666,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) blk_mq_update_queue_map(set); list_for_each_entry(q, &set->tag_list, tag_set_list) { blk_mq_realloc_hw_ctxs(set, q); - - /* - * Manually set the make_request_fn as blk_queue_make_request - * resets a lot of the queue settings. - */ - if (q->nr_hw_queues > 1) - q->make_request_fn = blk_mq_make_request; - else - q->make_request_fn = blk_sq_make_request; - blk_mq_queue_reinit(q, cpu_online_mask); } @@ -2773,39 +2674,69 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues) } EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues); +/* Enable polling stats and return whether they were already enabled. */ +static bool blk_poll_stats_enable(struct request_queue *q) +{ + if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags) || + test_and_set_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags)) + return true; + blk_stat_add_callback(q, q->poll_cb); + return false; +} + +static void blk_mq_poll_stats_start(struct request_queue *q) +{ + /* + * We don't arm the callback if polling stats are not enabled or the + * callback is already active. + */ + if (!test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags) || + blk_stat_is_active(q->poll_cb)) + return; + + blk_stat_activate_msecs(q->poll_cb, 100); +} + +static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb) +{ + struct request_queue *q = cb->data; + int bucket; + + for (bucket = 0; bucket < BLK_MQ_POLL_STATS_BKTS; bucket++) { + if (cb->stat[bucket].nr_samples) + q->poll_stat[bucket] = cb->stat[bucket]; + } +} + static unsigned long blk_mq_poll_nsecs(struct request_queue *q, struct blk_mq_hw_ctx *hctx, struct request *rq) { - struct blk_rq_stat stat[2]; unsigned long ret = 0; + int bucket; /* * If stats collection isn't on, don't sleep but turn it on for * future users */ - if (!blk_stat_enable(q)) + if (!blk_poll_stats_enable(q)) return 0; /* - * We don't have to do this once per IO, should optimize this - * to just use the current window of stats until it changes - */ - memset(&stat, 0, sizeof(stat)); - blk_hctx_stat_get(hctx, stat); - - /* * As an optimistic guess, use half of the mean service time * for this type of request. We can (and should) make this smarter. * For instance, if the completion latencies are tight, we can * get closer than just half the mean. This is especially * important on devices where the completion latencies are longer - * than ~10 usec. + * than ~10 usec. We do use the stats for the relevant IO size + * if available which does lead to better estimates. */ - if (req_op(rq) == REQ_OP_READ && stat[BLK_STAT_READ].nr_samples) - ret = (stat[BLK_STAT_READ].mean + 1) / 2; - else if (req_op(rq) == REQ_OP_WRITE && stat[BLK_STAT_WRITE].nr_samples) - ret = (stat[BLK_STAT_WRITE].mean + 1) / 2; + bucket = blk_mq_poll_stats_bkt(rq); + if (bucket < 0) + return ret; + + if (q->poll_stat[bucket].nr_samples) + ret = (q->poll_stat[bucket].mean + 1) / 2; return ret; } diff --git a/block/blk-mq.h b/block/blk-mq.h index 660a17e1d033..2814a14e529c 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -20,7 +20,6 @@ struct blk_mq_ctx { /* incremented at completion time */ unsigned long ____cacheline_aligned_in_smp rq_completed[2]; - struct blk_rq_stat stat[2]; struct request_queue *queue; struct kobject kobj; @@ -79,6 +78,7 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q, */ extern void blk_mq_sysfs_init(struct request_queue *q); extern void blk_mq_sysfs_deinit(struct request_queue *q); +extern int __blk_mq_register_dev(struct device *dev, struct request_queue *q); extern int blk_mq_sysfs_register(struct request_queue *q); extern void blk_mq_sysfs_unregister(struct request_queue *q); extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx); @@ -87,13 +87,12 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx); * debugfs helpers */ #ifdef CONFIG_BLK_DEBUG_FS -int blk_mq_debugfs_register(struct request_queue *q, const char *name); +int blk_mq_debugfs_register(struct request_queue *q); void blk_mq_debugfs_unregister(struct request_queue *q); -int blk_mq_debugfs_register_hctxs(struct request_queue *q); -void blk_mq_debugfs_unregister_hctxs(struct request_queue *q); +int blk_mq_debugfs_register_mq(struct request_queue *q); +void blk_mq_debugfs_unregister_mq(struct request_queue *q); #else -static inline int blk_mq_debugfs_register(struct request_queue *q, - const char *name) +static inline int blk_mq_debugfs_register(struct request_queue *q) { return 0; } @@ -102,12 +101,12 @@ static inline void blk_mq_debugfs_unregister(struct request_queue *q) { } -static inline int blk_mq_debugfs_register_hctxs(struct request_queue *q) +static inline int blk_mq_debugfs_register_mq(struct request_queue *q) { return 0; } -static inline void blk_mq_debugfs_unregister_hctxs(struct request_queue *q) +static inline void blk_mq_debugfs_unregister_mq(struct request_queue *q) { } #endif @@ -142,6 +141,7 @@ struct blk_mq_alloc_data { /* input parameter */ struct request_queue *q; unsigned int flags; + unsigned int shallow_depth; /* input & output parameter */ struct blk_mq_ctx *ctx; diff --git a/block/blk-settings.c b/block/blk-settings.c index 1e7174ffc9d4..4fa81ed383ca 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -103,7 +103,6 @@ void blk_set_default_limits(struct queue_limits *lim) lim->discard_granularity = 0; lim->discard_alignment = 0; lim->discard_misaligned = 0; - lim->discard_zeroes_data = 0; lim->logical_block_size = lim->physical_block_size = lim->io_min = 512; lim->bounce_pfn = (unsigned long)(BLK_BOUNCE_ANY >> PAGE_SHIFT); lim->alignment_offset = 0; @@ -127,7 +126,6 @@ void blk_set_stacking_limits(struct queue_limits *lim) blk_set_default_limits(lim); /* Inherit limits from component devices */ - lim->discard_zeroes_data = 1; lim->max_segments = USHRT_MAX; lim->max_discard_segments = 1; lim->max_hw_sectors = UINT_MAX; @@ -609,7 +607,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, t->io_opt = lcm_not_zero(t->io_opt, b->io_opt); t->cluster &= b->cluster; - t->discard_zeroes_data &= b->discard_zeroes_data; /* Physical block size a multiple of the logical block size? */ if (t->physical_block_size & (t->logical_block_size - 1)) { diff --git a/block/blk-stat.c b/block/blk-stat.c index 186fcb981e9b..6c2f40940439 100644 --- a/block/blk-stat.c +++ b/block/blk-stat.c @@ -4,10 +4,27 @@ * Copyright (C) 2016 Jens Axboe */ #include <linux/kernel.h> +#include <linux/rculist.h> #include <linux/blk-mq.h> #include "blk-stat.h" #include "blk-mq.h" +#include "blk.h" + +#define BLK_RQ_STAT_BATCH 64 + +struct blk_queue_stats { + struct list_head callbacks; + spinlock_t lock; + bool enable_accounting; +}; + +static void blk_stat_init(struct blk_rq_stat *stat) +{ + stat->min = -1ULL; + stat->max = stat->nr_samples = stat->mean = 0; + stat->batch = stat->nr_batch = 0; +} static void blk_stat_flush_batch(struct blk_rq_stat *stat) { @@ -48,209 +65,185 @@ static void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src) dst->nr_samples += src->nr_samples; } -static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst) +static void __blk_stat_add(struct blk_rq_stat *stat, u64 value) { - struct blk_mq_hw_ctx *hctx; - struct blk_mq_ctx *ctx; - uint64_t latest = 0; - int i, j, nr; - - blk_stat_init(&dst[BLK_STAT_READ]); - blk_stat_init(&dst[BLK_STAT_WRITE]); - - nr = 0; - do { - uint64_t newest = 0; - - queue_for_each_hw_ctx(q, hctx, i) { - hctx_for_each_ctx(hctx, ctx, j) { - blk_stat_flush_batch(&ctx->stat[BLK_STAT_READ]); - blk_stat_flush_batch(&ctx->stat[BLK_STAT_WRITE]); - - if (!ctx->stat[BLK_STAT_READ].nr_samples && - !ctx->stat[BLK_STAT_WRITE].nr_samples) - continue; - if (ctx->stat[BLK_STAT_READ].time > newest) - newest = ctx->stat[BLK_STAT_READ].time; - if (ctx->stat[BLK_STAT_WRITE].time > newest) - newest = ctx->stat[BLK_STAT_WRITE].time; - } - } + stat->min = min(stat->min, value); + stat->max = max(stat->max, value); - /* - * No samples - */ - if (!newest) - break; - - if (newest > latest) - latest = newest; - - queue_for_each_hw_ctx(q, hctx, i) { - hctx_for_each_ctx(hctx, ctx, j) { - if (ctx->stat[BLK_STAT_READ].time == newest) { - blk_stat_sum(&dst[BLK_STAT_READ], - &ctx->stat[BLK_STAT_READ]); - nr++; - } - if (ctx->stat[BLK_STAT_WRITE].time == newest) { - blk_stat_sum(&dst[BLK_STAT_WRITE], - &ctx->stat[BLK_STAT_WRITE]); - nr++; - } - } - } - /* - * If we race on finding an entry, just loop back again. - * Should be very rare. - */ - } while (!nr); + if (stat->batch + value < stat->batch || + stat->nr_batch + 1 == BLK_RQ_STAT_BATCH) + blk_stat_flush_batch(stat); - dst[BLK_STAT_READ].time = dst[BLK_STAT_WRITE].time = latest; + stat->batch += value; + stat->nr_batch++; } -void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst) +void blk_stat_add(struct request *rq) { - if (q->mq_ops) - blk_mq_stat_get(q, dst); - else { - blk_stat_flush_batch(&q->rq_stats[BLK_STAT_READ]); - blk_stat_flush_batch(&q->rq_stats[BLK_STAT_WRITE]); - memcpy(&dst[BLK_STAT_READ], &q->rq_stats[BLK_STAT_READ], - sizeof(struct blk_rq_stat)); - memcpy(&dst[BLK_STAT_WRITE], &q->rq_stats[BLK_STAT_WRITE], - sizeof(struct blk_rq_stat)); + struct request_queue *q = rq->q; + struct blk_stat_callback *cb; + struct blk_rq_stat *stat; + int bucket; + s64 now, value; + + now = __blk_stat_time(ktime_to_ns(ktime_get())); + if (now < blk_stat_time(&rq->issue_stat)) + return; + + value = now - blk_stat_time(&rq->issue_stat); + + blk_throtl_stat_add(rq, value); + + rcu_read_lock(); + list_for_each_entry_rcu(cb, &q->stats->callbacks, list) { + if (blk_stat_is_active(cb)) { + bucket = cb->bucket_fn(rq); + if (bucket < 0) + continue; + stat = &this_cpu_ptr(cb->cpu_stat)[bucket]; + __blk_stat_add(stat, value); + } } + rcu_read_unlock(); } -void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst) +static void blk_stat_timer_fn(unsigned long data) { - struct blk_mq_ctx *ctx; - unsigned int i, nr; + struct blk_stat_callback *cb = (void *)data; + unsigned int bucket; + int cpu; - nr = 0; - do { - uint64_t newest = 0; + for (bucket = 0; bucket < cb->buckets; bucket++) + blk_stat_init(&cb->stat[bucket]); - hctx_for_each_ctx(hctx, ctx, i) { - blk_stat_flush_batch(&ctx->stat[BLK_STAT_READ]); - blk_stat_flush_batch(&ctx->stat[BLK_STAT_WRITE]); + for_each_online_cpu(cpu) { + struct blk_rq_stat *cpu_stat; - if (!ctx->stat[BLK_STAT_READ].nr_samples && - !ctx->stat[BLK_STAT_WRITE].nr_samples) - continue; - - if (ctx->stat[BLK_STAT_READ].time > newest) - newest = ctx->stat[BLK_STAT_READ].time; - if (ctx->stat[BLK_STAT_WRITE].time > newest) - newest = ctx->stat[BLK_STAT_WRITE].time; + cpu_stat = per_cpu_ptr(cb->cpu_stat, cpu); + for (bucket = 0; bucket < cb->buckets; bucket++) { + blk_stat_sum(&cb->stat[bucket], &cpu_stat[bucket]); + blk_stat_init(&cpu_stat[bucket]); } + } - if (!newest) - break; - - hctx_for_each_ctx(hctx, ctx, i) { - if (ctx->stat[BLK_STAT_READ].time == newest) { - blk_stat_sum(&dst[BLK_STAT_READ], - &ctx->stat[BLK_STAT_READ]); - nr++; - } - if (ctx->stat[BLK_STAT_WRITE].time == newest) { - blk_stat_sum(&dst[BLK_STAT_WRITE], - &ctx->stat[BLK_STAT_WRITE]); - nr++; - } - } - /* - * If we race on finding an entry, just loop back again. - * Should be very rare, as the window is only updated - * occasionally - */ - } while (!nr); + cb->timer_fn(cb); } -static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now) +struct blk_stat_callback * +blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *), + int (*bucket_fn)(const struct request *), + unsigned int buckets, void *data) { - stat->min = -1ULL; - stat->max = stat->nr_samples = stat->mean = 0; - stat->batch = stat->nr_batch = 0; - stat->time = time_now & BLK_STAT_NSEC_MASK; -} + struct blk_stat_callback *cb; -void blk_stat_init(struct blk_rq_stat *stat) -{ - __blk_stat_init(stat, ktime_to_ns(ktime_get())); -} + cb = kmalloc(sizeof(*cb), GFP_KERNEL); + if (!cb) + return NULL; -static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now) -{ - return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK); + cb->stat = kmalloc_array(buckets, sizeof(struct blk_rq_stat), + GFP_KERNEL); + if (!cb->stat) { + kfree(cb); + return NULL; + } + cb->cpu_stat = __alloc_percpu(buckets * sizeof(struct blk_rq_stat), + __alignof__(struct blk_rq_stat)); + if (!cb->cpu_stat) { + kfree(cb->stat); + kfree(cb); + return NULL; + } + + cb->timer_fn = timer_fn; + cb->bucket_fn = bucket_fn; + cb->data = data; + cb->buckets = buckets; + setup_timer(&cb->timer, blk_stat_timer_fn, (unsigned long)cb); + + return cb; } +EXPORT_SYMBOL_GPL(blk_stat_alloc_callback); -bool blk_stat_is_current(struct blk_rq_stat *stat) +void blk_stat_add_callback(struct request_queue *q, + struct blk_stat_callback *cb) { - return __blk_stat_is_current(stat, ktime_to_ns(ktime_get())); + unsigned int bucket; + int cpu; + + for_each_possible_cpu(cpu) { + struct blk_rq_stat *cpu_stat; + + cpu_stat = per_cpu_ptr(cb->cpu_stat, cpu); + for (bucket = 0; bucket < cb->buckets; bucket++) + blk_stat_init(&cpu_stat[bucket]); + } + + spin_lock(&q->stats->lock); + list_add_tail_rcu(&cb->list, &q->stats->callbacks); + set_bit(QUEUE_FLAG_STATS, &q->queue_flags); + spin_unlock(&q->stats->lock); } +EXPORT_SYMBOL_GPL(blk_stat_add_callback); -void blk_stat_add(struct blk_rq_stat *stat, struct request *rq) +void blk_stat_remove_callback(struct request_queue *q, + struct blk_stat_callback *cb) { - s64 now, value; + spin_lock(&q->stats->lock); + list_del_rcu(&cb->list); + if (list_empty(&q->stats->callbacks) && !q->stats->enable_accounting) + clear_bit(QUEUE_FLAG_STATS, &q->queue_flags); + spin_unlock(&q->stats->lock); - now = __blk_stat_time(ktime_to_ns(ktime_get())); - if (now < blk_stat_time(&rq->issue_stat)) - return; - - if (!__blk_stat_is_current(stat, now)) - __blk_stat_init(stat, now); + del_timer_sync(&cb->timer); +} +EXPORT_SYMBOL_GPL(blk_stat_remove_callback); - value = now - blk_stat_time(&rq->issue_stat); - if (value > stat->max) - stat->max = value; - if (value < stat->min) - stat->min = value; +static void blk_stat_free_callback_rcu(struct rcu_head *head) +{ + struct blk_stat_callback *cb; - if (stat->batch + value < stat->batch || - stat->nr_batch + 1 == BLK_RQ_STAT_BATCH) - blk_stat_flush_batch(stat); + cb = container_of(head, struct blk_stat_callback, rcu); + free_percpu(cb->cpu_stat); + kfree(cb->stat); + kfree(cb); +} - stat->batch += value; - stat->nr_batch++; +void blk_stat_free_callback(struct blk_stat_callback *cb) +{ + if (cb) + call_rcu(&cb->rcu, blk_stat_free_callback_rcu); } +EXPORT_SYMBOL_GPL(blk_stat_free_callback); -void blk_stat_clear(struct request_queue *q) +void blk_stat_enable_accounting(struct request_queue *q) { - if (q->mq_ops) { - struct blk_mq_hw_ctx *hctx; - struct blk_mq_ctx *ctx; - int i, j; - - queue_for_each_hw_ctx(q, hctx, i) { - hctx_for_each_ctx(hctx, ctx, j) { - blk_stat_init(&ctx->stat[BLK_STAT_READ]); - blk_stat_init(&ctx->stat[BLK_STAT_WRITE]); - } - } - } else { - blk_stat_init(&q->rq_stats[BLK_STAT_READ]); - blk_stat_init(&q->rq_stats[BLK_STAT_WRITE]); - } + spin_lock(&q->stats->lock); + q->stats->enable_accounting = true; + set_bit(QUEUE_FLAG_STATS, &q->queue_flags); + spin_unlock(&q->stats->lock); } -void blk_stat_set_issue_time(struct blk_issue_stat *stat) +struct blk_queue_stats *blk_alloc_queue_stats(void) { - stat->time = (stat->time & BLK_STAT_MASK) | - (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK); + struct blk_queue_stats *stats; + + stats = kmalloc(sizeof(*stats), GFP_KERNEL); + if (!stats) + return NULL; + + INIT_LIST_HEAD(&stats->callbacks); + spin_lock_init(&stats->lock); + stats->enable_accounting = false; + + return stats; } -/* - * Enable stat tracking, return whether it was enabled - */ -bool blk_stat_enable(struct request_queue *q) +void blk_free_queue_stats(struct blk_queue_stats *stats) { - if (!test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) { - set_bit(QUEUE_FLAG_STATS, &q->queue_flags); - return false; - } + if (!stats) + return; + + WARN_ON(!list_empty(&stats->callbacks)); - return true; + kfree(stats); } diff --git a/block/blk-stat.h b/block/blk-stat.h index a2050a0a5314..2fb20d1a341a 100644 --- a/block/blk-stat.h +++ b/block/blk-stat.h @@ -1,33 +1,85 @@ #ifndef BLK_STAT_H #define BLK_STAT_H -/* - * ~0.13s window as a power-of-2 (2^27 nsecs) - */ -#define BLK_STAT_NSEC 134217728ULL -#define BLK_STAT_NSEC_MASK ~(BLK_STAT_NSEC - 1) +#include <linux/kernel.h> +#include <linux/blkdev.h> +#include <linux/ktime.h> +#include <linux/rcupdate.h> +#include <linux/timer.h> /* - * Upper 3 bits can be used elsewhere + * from upper: + * 3 bits: reserved for other usage + * 12 bits: size + * 49 bits: time */ #define BLK_STAT_RES_BITS 3 -#define BLK_STAT_SHIFT (64 - BLK_STAT_RES_BITS) -#define BLK_STAT_TIME_MASK ((1ULL << BLK_STAT_SHIFT) - 1) -#define BLK_STAT_MASK ~BLK_STAT_TIME_MASK +#define BLK_STAT_SIZE_BITS 12 +#define BLK_STAT_RES_SHIFT (64 - BLK_STAT_RES_BITS) +#define BLK_STAT_SIZE_SHIFT (BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS) +#define BLK_STAT_TIME_MASK ((1ULL << BLK_STAT_SIZE_SHIFT) - 1) +#define BLK_STAT_SIZE_MASK \ + (((1ULL << BLK_STAT_SIZE_BITS) - 1) << BLK_STAT_SIZE_SHIFT) +#define BLK_STAT_RES_MASK (~((1ULL << BLK_STAT_RES_SHIFT) - 1)) + +/** + * struct blk_stat_callback - Block statistics callback. + * + * A &struct blk_stat_callback is associated with a &struct request_queue. While + * @timer is active, that queue's request completion latencies are sorted into + * buckets by @bucket_fn and added to a per-cpu buffer, @cpu_stat. When the + * timer fires, @cpu_stat is flushed to @stat and @timer_fn is invoked. + */ +struct blk_stat_callback { + /* + * @list: RCU list of callbacks for a &struct request_queue. + */ + struct list_head list; + + /** + * @timer: Timer for the next callback invocation. + */ + struct timer_list timer; + + /** + * @cpu_stat: Per-cpu statistics buckets. + */ + struct blk_rq_stat __percpu *cpu_stat; + + /** + * @bucket_fn: Given a request, returns which statistics bucket it + * should be accounted under. Return -1 for no bucket for this + * request. + */ + int (*bucket_fn)(const struct request *); + + /** + * @buckets: Number of statistics buckets. + */ + unsigned int buckets; + + /** + * @stat: Array of statistics buckets. + */ + struct blk_rq_stat *stat; + + /** + * @fn: Callback function. + */ + void (*timer_fn)(struct blk_stat_callback *); + + /** + * @data: Private pointer for the user. + */ + void *data; -enum { - BLK_STAT_READ = 0, - BLK_STAT_WRITE, + struct rcu_head rcu; }; -void blk_stat_add(struct blk_rq_stat *, struct request *); -void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *); -void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *); -void blk_stat_clear(struct request_queue *); -void blk_stat_init(struct blk_rq_stat *); -bool blk_stat_is_current(struct blk_rq_stat *); -void blk_stat_set_issue_time(struct blk_issue_stat *); -bool blk_stat_enable(struct request_queue *); +struct blk_queue_stats *blk_alloc_queue_stats(void); +void blk_free_queue_stats(struct blk_queue_stats *); + +void blk_stat_add(struct request *); static inline u64 __blk_stat_time(u64 time) { @@ -36,7 +88,117 @@ static inline u64 __blk_stat_time(u64 time) static inline u64 blk_stat_time(struct blk_issue_stat *stat) { - return __blk_stat_time(stat->time); + return __blk_stat_time(stat->stat); +} + +static inline sector_t blk_capped_size(sector_t size) +{ + return size & ((1ULL << BLK_STAT_SIZE_BITS) - 1); +} + +static inline sector_t blk_stat_size(struct blk_issue_stat *stat) +{ + return (stat->stat & BLK_STAT_SIZE_MASK) >> BLK_STAT_SIZE_SHIFT; +} + +static inline void blk_stat_set_issue(struct blk_issue_stat *stat, + sector_t size) +{ + stat->stat = (stat->stat & BLK_STAT_RES_MASK) | + (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) | + (((u64)blk_capped_size(size)) << BLK_STAT_SIZE_SHIFT); +} + +/* record time/size info in request but not add a callback */ +void blk_stat_enable_accounting(struct request_queue *q); + +/** + * blk_stat_alloc_callback() - Allocate a block statistics callback. + * @timer_fn: Timer callback function. + * @bucket_fn: Bucket callback function. + * @buckets: Number of statistics buckets. + * @data: Value for the @data field of the &struct blk_stat_callback. + * + * See &struct blk_stat_callback for details on the callback functions. + * + * Return: &struct blk_stat_callback on success or NULL on ENOMEM. + */ +struct blk_stat_callback * +blk_stat_alloc_callback(void (*timer_fn)(struct blk_stat_callback *), + int (*bucket_fn)(const struct request *), + unsigned int buckets, void *data); + +/** + * blk_stat_add_callback() - Add a block statistics callback to be run on a + * request queue. + * @q: The request queue. + * @cb: The callback. + * + * Note that a single &struct blk_stat_callback can only be added to a single + * &struct request_queue. + */ +void blk_stat_add_callback(struct request_queue *q, + struct blk_stat_callback *cb); + +/** + * blk_stat_remove_callback() - Remove a block statistics callback from a + * request queue. + * @q: The request queue. + * @cb: The callback. + * + * When this returns, the callback is not running on any CPUs and will not be + * called again unless readded. + */ +void blk_stat_remove_callback(struct request_queue *q, + struct blk_stat_callback *cb); + +/** + * blk_stat_free_callback() - Free a block statistics callback. + * @cb: The callback. + * + * @cb may be NULL, in which case this does nothing. If it is not NULL, @cb must + * not be associated with a request queue. I.e., if it was previously added with + * blk_stat_add_callback(), it must also have been removed since then with + * blk_stat_remove_callback(). + */ +void blk_stat_free_callback(struct blk_stat_callback *cb); + +/** + * blk_stat_is_active() - Check if a block statistics callback is currently + * gathering statistics. + * @cb: The callback. + */ +static inline bool blk_stat_is_active(struct blk_stat_callback *cb) +{ + return timer_pending(&cb->timer); +} + +/** + * blk_stat_activate_nsecs() - Gather block statistics during a time window in + * nanoseconds. + * @cb: The callback. + * @nsecs: Number of nanoseconds to gather statistics for. + * + * The timer callback will be called when the window expires. + */ +static inline void blk_stat_activate_nsecs(struct blk_stat_callback *cb, + u64 nsecs) +{ + mod_timer(&cb->timer, jiffies + nsecs_to_jiffies(nsecs)); +} + +/** + * blk_stat_activate_msecs() - Gather block statistics during a time window in + * milliseconds. + * @cb: The callback. + * @msecs: Number of milliseconds to gather statistics for. + * + * The timer callback will be called when the window expires. + */ +static inline void blk_stat_activate_msecs(struct blk_stat_callback *cb, + unsigned int msecs) +{ + mod_timer(&cb->timer, jiffies + msecs_to_jiffies(msecs)); } #endif diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 37f0b3ad635e..3f37813ccbaf 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -208,7 +208,7 @@ static ssize_t queue_discard_max_store(struct request_queue *q, static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *page) { - return queue_var_show(queue_discard_zeroes_data(q), page); + return queue_var_show(0, page); } static ssize_t queue_write_same_max_show(struct request_queue *q, char *page) @@ -503,26 +503,6 @@ static ssize_t queue_dax_show(struct request_queue *q, char *page) return queue_var_show(blk_queue_dax(q), page); } -static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre) -{ - return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n", - pre, (long long) stat->nr_samples, - (long long) stat->mean, (long long) stat->min, - (long long) stat->max); -} - -static ssize_t queue_stats_show(struct request_queue *q, char *page) -{ - struct blk_rq_stat stat[2]; - ssize_t ret; - - blk_queue_stat_get(q, stat); - - ret = print_stat(page, &stat[BLK_STAT_READ], "read :"); - ret += print_stat(page + ret, &stat[BLK_STAT_WRITE], "write:"); - return ret; -} - static struct queue_sysfs_entry queue_requests_entry = { .attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR }, .show = queue_requests_show, @@ -691,17 +671,20 @@ static struct queue_sysfs_entry queue_dax_entry = { .show = queue_dax_show, }; -static struct queue_sysfs_entry queue_stats_entry = { - .attr = {.name = "stats", .mode = S_IRUGO }, - .show = queue_stats_show, -}; - static struct queue_sysfs_entry queue_wb_lat_entry = { .attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR }, .show = queue_wb_lat_show, .store = queue_wb_lat_store, }; +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW +static struct queue_sysfs_entry throtl_sample_time_entry = { + .attr = {.name = "throttle_sample_time", .mode = S_IRUGO | S_IWUSR }, + .show = blk_throtl_sample_time_show, + .store = blk_throtl_sample_time_store, +}; +#endif + static struct attribute *default_attrs[] = { &queue_requests_entry.attr, &queue_ra_entry.attr, @@ -733,9 +716,11 @@ static struct attribute *default_attrs[] = { &queue_poll_entry.attr, &queue_wc_entry.attr, &queue_dax_entry.attr, - &queue_stats_entry.attr, &queue_wb_lat_entry.attr, &queue_poll_delay_entry.attr, +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW + &throtl_sample_time_entry.attr, +#endif NULL, }; @@ -810,7 +795,9 @@ static void blk_release_queue(struct kobject *kobj) struct request_queue *q = container_of(kobj, struct request_queue, kobj); - wbt_exit(q); + if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags)) + blk_stat_remove_callback(q, q->poll_cb); + blk_stat_free_callback(q->poll_cb); bdi_put(q->backing_dev_info); blkcg_exit_queue(q); @@ -819,6 +806,8 @@ static void blk_release_queue(struct kobject *kobj) elevator_exit(q, q->elevator); } + blk_free_queue_stats(q->stats); + blk_exit_rl(&q->root_rl); if (q->queue_tags) @@ -855,23 +844,6 @@ struct kobj_type blk_queue_ktype = { .release = blk_release_queue, }; -static void blk_wb_init(struct request_queue *q) -{ -#ifndef CONFIG_BLK_WBT_MQ - if (q->mq_ops) - return; -#endif -#ifndef CONFIG_BLK_WBT_SQ - if (q->request_fn) - return; -#endif - - /* - * If this fails, we don't get throttling - */ - wbt_init(q); -} - int blk_register_queue(struct gendisk *disk) { int ret; @@ -881,6 +853,11 @@ int blk_register_queue(struct gendisk *disk) if (WARN_ON(!q)) return -ENXIO; + WARN_ONCE(test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags), + "%s is registering an already registered queue\n", + kobject_name(&dev->kobj)); + queue_flag_set_unlocked(QUEUE_FLAG_REGISTERED, q); + /* * SCSI probing may synchronously create and destroy a lot of * request_queues for non-existent devices. Shutting down a fully @@ -900,9 +877,6 @@ int blk_register_queue(struct gendisk *disk) if (ret) return ret; - if (q->mq_ops) - blk_mq_register_dev(dev, q); - /* Prevent changes through sysfs until registration is completed. */ mutex_lock(&q->sysfs_lock); @@ -912,9 +886,14 @@ int blk_register_queue(struct gendisk *disk) goto unlock; } + if (q->mq_ops) + __blk_mq_register_dev(dev, q); + kobject_uevent(&q->kobj, KOBJ_ADD); - blk_wb_init(q); + wbt_enable_default(q); + + blk_throtl_register_queue(q); if (q->request_fn || (q->mq_ops && q->elevator)) { ret = elv_register_queue(q); @@ -939,6 +918,11 @@ void blk_unregister_queue(struct gendisk *disk) if (WARN_ON(!q)) return; + queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q); + + wbt_exit(q); + + if (q->mq_ops) blk_mq_unregister_dev(disk_to_dev(disk), q); diff --git a/block/blk-throttle.c b/block/blk-throttle.c index 8fab716e4059..b78db2e5fdff 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -18,8 +18,17 @@ static int throtl_grp_quantum = 8; /* Total max dispatch from all groups in one round */ static int throtl_quantum = 32; -/* Throttling is performed over 100ms slice and after that slice is renewed */ -static unsigned long throtl_slice = HZ/10; /* 100 ms */ +/* Throttling is performed over a slice and after that slice is renewed */ +#define DFL_THROTL_SLICE_HD (HZ / 10) +#define DFL_THROTL_SLICE_SSD (HZ / 50) +#define MAX_THROTL_SLICE (HZ) +#define DFL_IDLE_THRESHOLD_SSD (1000L) /* 1 ms */ +#define DFL_IDLE_THRESHOLD_HD (100L * 1000) /* 100 ms */ +#define MAX_IDLE_TIME (5L * 1000 * 1000) /* 5 s */ +/* default latency target is 0, eg, guarantee IO latency by default */ +#define DFL_LATENCY_TARGET (0) + +#define SKIP_LATENCY (((u64)1) << BLK_STAT_RES_SHIFT) static struct blkcg_policy blkcg_policy_throtl; @@ -83,6 +92,12 @@ enum tg_state_flags { #define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node) +enum { + LIMIT_LOW, + LIMIT_MAX, + LIMIT_CNT, +}; + struct throtl_grp { /* must be the first member */ struct blkg_policy_data pd; @@ -119,20 +134,54 @@ struct throtl_grp { /* are there any throtl rules between this group and td? */ bool has_rules[2]; - /* bytes per second rate limits */ - uint64_t bps[2]; + /* internally used bytes per second rate limits */ + uint64_t bps[2][LIMIT_CNT]; + /* user configured bps limits */ + uint64_t bps_conf[2][LIMIT_CNT]; - /* IOPS limits */ - unsigned int iops[2]; + /* internally used IOPS limits */ + unsigned int iops[2][LIMIT_CNT]; + /* user configured IOPS limits */ + unsigned int iops_conf[2][LIMIT_CNT]; /* Number of bytes disptached in current slice */ uint64_t bytes_disp[2]; /* Number of bio's dispatched in current slice */ unsigned int io_disp[2]; + unsigned long last_low_overflow_time[2]; + + uint64_t last_bytes_disp[2]; + unsigned int last_io_disp[2]; + + unsigned long last_check_time; + + unsigned long latency_target; /* us */ /* When did we start a new slice */ unsigned long slice_start[2]; unsigned long slice_end[2]; + + unsigned long last_finish_time; /* ns / 1024 */ + unsigned long checked_last_finish_time; /* ns / 1024 */ + unsigned long avg_idletime; /* ns / 1024 */ + unsigned long idletime_threshold; /* us */ + + unsigned int bio_cnt; /* total bios */ + unsigned int bad_bio_cnt; /* bios exceeding latency threshold */ + unsigned long bio_cnt_reset_time; +}; + +/* We measure latency for request size from <= 4k to >= 1M */ +#define LATENCY_BUCKET_SIZE 9 + +struct latency_bucket { + unsigned long total_latency; /* ns / 1024 */ + int samples; +}; + +struct avg_latency_bucket { + unsigned long latency; /* ns / 1024 */ + bool valid; }; struct throtl_data @@ -145,8 +194,26 @@ struct throtl_data /* Total Number of queued bios on READ and WRITE lists */ unsigned int nr_queued[2]; + unsigned int throtl_slice; + /* Work for dispatching throttled bios */ struct work_struct dispatch_work; + unsigned int limit_index; + bool limit_valid[LIMIT_CNT]; + + unsigned long dft_idletime_threshold; /* us */ + + unsigned long low_upgrade_time; + unsigned long low_downgrade_time; + + unsigned int scale; + + struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE]; + struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE]; + struct latency_bucket __percpu *latency_buckets; + unsigned long last_calculate_time; + + bool track_bio_latency; }; static void throtl_pending_timer_fn(unsigned long arg); @@ -198,6 +265,76 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq) return container_of(sq, struct throtl_data, service_queue); } +/* + * cgroup's limit in LIMIT_MAX is scaled if low limit is set. This scale is to + * make the IO dispatch more smooth. + * Scale up: linearly scale up according to lapsed time since upgrade. For + * every throtl_slice, the limit scales up 1/2 .low limit till the + * limit hits .max limit + * Scale down: exponentially scale down if a cgroup doesn't hit its .low limit + */ +static uint64_t throtl_adjusted_limit(uint64_t low, struct throtl_data *td) +{ + /* arbitrary value to avoid too big scale */ + if (td->scale < 4096 && time_after_eq(jiffies, + td->low_upgrade_time + td->scale * td->throtl_slice)) + td->scale = (jiffies - td->low_upgrade_time) / td->throtl_slice; + + return low + (low >> 1) * td->scale; +} + +static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw) +{ + struct blkcg_gq *blkg = tg_to_blkg(tg); + struct throtl_data *td; + uint64_t ret; + + if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent) + return U64_MAX; + + td = tg->td; + ret = tg->bps[rw][td->limit_index]; + if (ret == 0 && td->limit_index == LIMIT_LOW) + return tg->bps[rw][LIMIT_MAX]; + + if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] && + tg->bps[rw][LIMIT_LOW] != tg->bps[rw][LIMIT_MAX]) { + uint64_t adjusted; + + adjusted = throtl_adjusted_limit(tg->bps[rw][LIMIT_LOW], td); + ret = min(tg->bps[rw][LIMIT_MAX], adjusted); + } + return ret; +} + +static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw) +{ + struct blkcg_gq *blkg = tg_to_blkg(tg); + struct throtl_data *td; + unsigned int ret; + + if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent) + return UINT_MAX; + td = tg->td; + ret = tg->iops[rw][td->limit_index]; + if (ret == 0 && tg->td->limit_index == LIMIT_LOW) + return tg->iops[rw][LIMIT_MAX]; + + if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_LOW] && + tg->iops[rw][LIMIT_LOW] != tg->iops[rw][LIMIT_MAX]) { + uint64_t adjusted; + + adjusted = throtl_adjusted_limit(tg->iops[rw][LIMIT_LOW], td); + if (adjusted > UINT_MAX) + adjusted = UINT_MAX; + ret = min_t(unsigned int, tg->iops[rw][LIMIT_MAX], adjusted); + } + return ret; +} + +#define request_bucket_index(sectors) \ + clamp_t(int, order_base_2(sectors) - 3, 0, LATENCY_BUCKET_SIZE - 1) + /** * throtl_log - log debug message via blktrace * @sq: the service_queue being reported @@ -334,10 +471,17 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node) } RB_CLEAR_NODE(&tg->rb_node); - tg->bps[READ] = -1; - tg->bps[WRITE] = -1; - tg->iops[READ] = -1; - tg->iops[WRITE] = -1; + tg->bps[READ][LIMIT_MAX] = U64_MAX; + tg->bps[WRITE][LIMIT_MAX] = U64_MAX; + tg->iops[READ][LIMIT_MAX] = UINT_MAX; + tg->iops[WRITE][LIMIT_MAX] = UINT_MAX; + tg->bps_conf[READ][LIMIT_MAX] = U64_MAX; + tg->bps_conf[WRITE][LIMIT_MAX] = U64_MAX; + tg->iops_conf[READ][LIMIT_MAX] = UINT_MAX; + tg->iops_conf[WRITE][LIMIT_MAX] = UINT_MAX; + /* LIMIT_LOW will have default value 0 */ + + tg->latency_target = DFL_LATENCY_TARGET; return &tg->pd; } @@ -366,6 +510,8 @@ static void throtl_pd_init(struct blkg_policy_data *pd) if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent) sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue; tg->td = td; + + tg->idletime_threshold = td->dft_idletime_threshold; } /* @@ -376,20 +522,59 @@ static void throtl_pd_init(struct blkg_policy_data *pd) static void tg_update_has_rules(struct throtl_grp *tg) { struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq); + struct throtl_data *td = tg->td; int rw; for (rw = READ; rw <= WRITE; rw++) tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) || - (tg->bps[rw] != -1 || tg->iops[rw] != -1); + (td->limit_valid[td->limit_index] && + (tg_bps_limit(tg, rw) != U64_MAX || + tg_iops_limit(tg, rw) != UINT_MAX)); } static void throtl_pd_online(struct blkg_policy_data *pd) { + struct throtl_grp *tg = pd_to_tg(pd); /* * We don't want new groups to escape the limits of its ancestors. * Update has_rules[] after a new group is brought online. */ - tg_update_has_rules(pd_to_tg(pd)); + tg_update_has_rules(tg); +} + +static void blk_throtl_update_limit_valid(struct throtl_data *td) +{ + struct cgroup_subsys_state *pos_css; + struct blkcg_gq *blkg; + bool low_valid = false; + + rcu_read_lock(); + blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) { + struct throtl_grp *tg = blkg_to_tg(blkg); + + if (tg->bps[READ][LIMIT_LOW] || tg->bps[WRITE][LIMIT_LOW] || + tg->iops[READ][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW]) + low_valid = true; + } + rcu_read_unlock(); + + td->limit_valid[LIMIT_LOW] = low_valid; +} + +static void throtl_upgrade_state(struct throtl_data *td); +static void throtl_pd_offline(struct blkg_policy_data *pd) +{ + struct throtl_grp *tg = pd_to_tg(pd); + + tg->bps[READ][LIMIT_LOW] = 0; + tg->bps[WRITE][LIMIT_LOW] = 0; + tg->iops[READ][LIMIT_LOW] = 0; + tg->iops[WRITE][LIMIT_LOW] = 0; + + blk_throtl_update_limit_valid(tg->td); + + if (!tg->td->limit_valid[tg->td->limit_index]) + throtl_upgrade_state(tg->td); } static void throtl_pd_free(struct blkg_policy_data *pd) @@ -499,6 +684,17 @@ static void throtl_dequeue_tg(struct throtl_grp *tg) static void throtl_schedule_pending_timer(struct throtl_service_queue *sq, unsigned long expires) { + unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice; + + /* + * Since we are adjusting the throttle limit dynamically, the sleep + * time calculated according to previous limit might be invalid. It's + * possible the cgroup sleep time is very long and no other cgroups + * have IO running so notify the limit changes. Make sure the cgroup + * doesn't sleep too long to avoid the missed notification. + */ + if (time_after(expires, max_expire)) + expires = max_expire; mod_timer(&sq->pending_timer, expires); throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu", expires - jiffies, jiffies); @@ -556,7 +752,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg, if (time_after_eq(start, tg->slice_start[rw])) tg->slice_start[rw] = start; - tg->slice_end[rw] = jiffies + throtl_slice; + tg->slice_end[rw] = jiffies + tg->td->throtl_slice; throtl_log(&tg->service_queue, "[%c] new slice with credit start=%lu end=%lu jiffies=%lu", rw == READ ? 'R' : 'W', tg->slice_start[rw], @@ -568,7 +764,7 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw) tg->bytes_disp[rw] = 0; tg->io_disp[rw] = 0; tg->slice_start[rw] = jiffies; - tg->slice_end[rw] = jiffies + throtl_slice; + tg->slice_end[rw] = jiffies + tg->td->throtl_slice; throtl_log(&tg->service_queue, "[%c] new slice start=%lu end=%lu jiffies=%lu", rw == READ ? 'R' : 'W', tg->slice_start[rw], @@ -578,13 +774,13 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw) static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw, unsigned long jiffy_end) { - tg->slice_end[rw] = roundup(jiffy_end, throtl_slice); + tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice); } static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw, unsigned long jiffy_end) { - tg->slice_end[rw] = roundup(jiffy_end, throtl_slice); + tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice); throtl_log(&tg->service_queue, "[%c] extend slice start=%lu end=%lu jiffies=%lu", rw == READ ? 'R' : 'W', tg->slice_start[rw], @@ -624,19 +820,20 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw) * is bad because it does not allow new slice to start. */ - throtl_set_slice_end(tg, rw, jiffies + throtl_slice); + throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice); time_elapsed = jiffies - tg->slice_start[rw]; - nr_slices = time_elapsed / throtl_slice; + nr_slices = time_elapsed / tg->td->throtl_slice; if (!nr_slices) return; - tmp = tg->bps[rw] * throtl_slice * nr_slices; + tmp = tg_bps_limit(tg, rw) * tg->td->throtl_slice * nr_slices; do_div(tmp, HZ); bytes_trim = tmp; - io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ; + io_trim = (tg_iops_limit(tg, rw) * tg->td->throtl_slice * nr_slices) / + HZ; if (!bytes_trim && !io_trim) return; @@ -651,7 +848,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw) else tg->io_disp[rw] = 0; - tg->slice_start[rw] += nr_slices * throtl_slice; + tg->slice_start[rw] += nr_slices * tg->td->throtl_slice; throtl_log(&tg->service_queue, "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu", @@ -671,9 +868,9 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio, /* Slice has just started. Consider one slice interval */ if (!jiffy_elapsed) - jiffy_elapsed_rnd = throtl_slice; + jiffy_elapsed_rnd = tg->td->throtl_slice; - jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice); + jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice); /* * jiffy_elapsed_rnd should not be a big value as minimum iops can be @@ -682,7 +879,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio, * have been trimmed. */ - tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd; + tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd; do_div(tmp, HZ); if (tmp > UINT_MAX) @@ -697,7 +894,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio, } /* Calc approx time to dispatch */ - jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1; + jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1; if (jiffy_wait > jiffy_elapsed) jiffy_wait = jiffy_wait - jiffy_elapsed; @@ -720,11 +917,11 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio, /* Slice has just started. Consider one slice interval */ if (!jiffy_elapsed) - jiffy_elapsed_rnd = throtl_slice; + jiffy_elapsed_rnd = tg->td->throtl_slice; - jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice); + jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice); - tmp = tg->bps[rw] * jiffy_elapsed_rnd; + tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd; do_div(tmp, HZ); bytes_allowed = tmp; @@ -736,7 +933,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio, /* Calc approx time to dispatch */ extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed; - jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]); + jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw)); if (!jiffy_wait) jiffy_wait = 1; @@ -771,7 +968,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio, bio != throtl_peek_queued(&tg->service_queue.queued[rw])); /* If tg->bps = -1, then BW is unlimited */ - if (tg->bps[rw] == -1 && tg->iops[rw] == -1) { + if (tg_bps_limit(tg, rw) == U64_MAX && + tg_iops_limit(tg, rw) == UINT_MAX) { if (wait) *wait = 0; return true; @@ -787,8 +985,10 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio, if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw])) throtl_start_new_slice(tg, rw); else { - if (time_before(tg->slice_end[rw], jiffies + throtl_slice)) - throtl_extend_slice(tg, rw, jiffies + throtl_slice); + if (time_before(tg->slice_end[rw], + jiffies + tg->td->throtl_slice)) + throtl_extend_slice(tg, rw, + jiffies + tg->td->throtl_slice); } if (tg_with_in_bps_limit(tg, bio, &bps_wait) && @@ -816,6 +1016,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio) /* Charge the bio to the group */ tg->bytes_disp[rw] += bio->bi_iter.bi_size; tg->io_disp[rw]++; + tg->last_bytes_disp[rw] += bio->bi_iter.bi_size; + tg->last_io_disp[rw]++; /* * BIO_THROTTLED is used to prevent the same bio to be throttled @@ -999,6 +1201,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq) return nr_disp; } +static bool throtl_can_upgrade(struct throtl_data *td, + struct throtl_grp *this_tg); /** * throtl_pending_timer_fn - timer function for service_queue->pending_timer * @arg: the throtl_service_queue being serviced @@ -1025,6 +1229,9 @@ static void throtl_pending_timer_fn(unsigned long arg) int ret; spin_lock_irq(q->queue_lock); + if (throtl_can_upgrade(td, NULL)) + throtl_upgrade_state(td); + again: parent_sq = sq->parent_sq; dispatched = false; @@ -1112,7 +1319,7 @@ static u64 tg_prfill_conf_u64(struct seq_file *sf, struct blkg_policy_data *pd, struct throtl_grp *tg = pd_to_tg(pd); u64 v = *(u64 *)((void *)tg + off); - if (v == -1) + if (v == U64_MAX) return 0; return __blkg_prfill_u64(sf, pd, v); } @@ -1123,7 +1330,7 @@ static u64 tg_prfill_conf_uint(struct seq_file *sf, struct blkg_policy_data *pd, struct throtl_grp *tg = pd_to_tg(pd); unsigned int v = *(unsigned int *)((void *)tg + off); - if (v == -1) + if (v == UINT_MAX) return 0; return __blkg_prfill_u64(sf, pd, v); } @@ -1150,8 +1357,8 @@ static void tg_conf_updated(struct throtl_grp *tg) throtl_log(&tg->service_queue, "limit change rbps=%llu wbps=%llu riops=%u wiops=%u", - tg->bps[READ], tg->bps[WRITE], - tg->iops[READ], tg->iops[WRITE]); + tg_bps_limit(tg, READ), tg_bps_limit(tg, WRITE), + tg_iops_limit(tg, READ), tg_iops_limit(tg, WRITE)); /* * Update has_rules[] flags for the updated tg's subtree. A tg is @@ -1197,7 +1404,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of, if (sscanf(ctx.body, "%llu", &v) != 1) goto out_finish; if (!v) - v = -1; + v = U64_MAX; tg = blkg_to_tg(ctx.blkg); @@ -1228,25 +1435,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of, static struct cftype throtl_legacy_files[] = { { .name = "throttle.read_bps_device", - .private = offsetof(struct throtl_grp, bps[READ]), + .private = offsetof(struct throtl_grp, bps[READ][LIMIT_MAX]), .seq_show = tg_print_conf_u64, .write = tg_set_conf_u64, }, { .name = "throttle.write_bps_device", - .private = offsetof(struct throtl_grp, bps[WRITE]), + .private = offsetof(struct throtl_grp, bps[WRITE][LIMIT_MAX]), .seq_show = tg_print_conf_u64, .write = tg_set_conf_u64, }, { .name = "throttle.read_iops_device", - .private = offsetof(struct throtl_grp, iops[READ]), + .private = offsetof(struct throtl_grp, iops[READ][LIMIT_MAX]), .seq_show = tg_print_conf_uint, .write = tg_set_conf_uint, }, { .name = "throttle.write_iops_device", - .private = offsetof(struct throtl_grp, iops[WRITE]), + .private = offsetof(struct throtl_grp, iops[WRITE][LIMIT_MAX]), .seq_show = tg_print_conf_uint, .write = tg_set_conf_uint, }, @@ -1263,48 +1470,87 @@ static struct cftype throtl_legacy_files[] = { { } /* terminate */ }; -static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd, +static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd, int off) { struct throtl_grp *tg = pd_to_tg(pd); const char *dname = blkg_dev_name(pd->blkg); char bufs[4][21] = { "max", "max", "max", "max" }; + u64 bps_dft; + unsigned int iops_dft; + char idle_time[26] = ""; + char latency_time[26] = ""; if (!dname) return 0; - if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 && - tg->iops[READ] == -1 && tg->iops[WRITE] == -1) + + if (off == LIMIT_LOW) { + bps_dft = 0; + iops_dft = 0; + } else { + bps_dft = U64_MAX; + iops_dft = UINT_MAX; + } + + if (tg->bps_conf[READ][off] == bps_dft && + tg->bps_conf[WRITE][off] == bps_dft && + tg->iops_conf[READ][off] == iops_dft && + tg->iops_conf[WRITE][off] == iops_dft && + (off != LIMIT_LOW || + (tg->idletime_threshold == tg->td->dft_idletime_threshold && + tg->latency_target == DFL_LATENCY_TARGET))) return 0; - if (tg->bps[READ] != -1) - snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]); - if (tg->bps[WRITE] != -1) - snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]); - if (tg->iops[READ] != -1) - snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]); - if (tg->iops[WRITE] != -1) - snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]); - - seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n", - dname, bufs[0], bufs[1], bufs[2], bufs[3]); + if (tg->bps_conf[READ][off] != bps_dft) + snprintf(bufs[0], sizeof(bufs[0]), "%llu", + tg->bps_conf[READ][off]); + if (tg->bps_conf[WRITE][off] != bps_dft) + snprintf(bufs[1], sizeof(bufs[1]), "%llu", + tg->bps_conf[WRITE][off]); + if (tg->iops_conf[READ][off] != iops_dft) + snprintf(bufs[2], sizeof(bufs[2]), "%u", + tg->iops_conf[READ][off]); + if (tg->iops_conf[WRITE][off] != iops_dft) + snprintf(bufs[3], sizeof(bufs[3]), "%u", + tg->iops_conf[WRITE][off]); + if (off == LIMIT_LOW) { + if (tg->idletime_threshold == ULONG_MAX) + strcpy(idle_time, " idle=max"); + else + snprintf(idle_time, sizeof(idle_time), " idle=%lu", + tg->idletime_threshold); + + if (tg->latency_target == ULONG_MAX) + strcpy(latency_time, " latency=max"); + else + snprintf(latency_time, sizeof(latency_time), + " latency=%lu", tg->latency_target); + } + + seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s%s\n", + dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time, + latency_time); return 0; } -static int tg_print_max(struct seq_file *sf, void *v) +static int tg_print_limit(struct seq_file *sf, void *v) { - blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max, + blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit, &blkcg_policy_throtl, seq_cft(sf)->private, false); return 0; } -static ssize_t tg_set_max(struct kernfs_open_file *of, +static ssize_t tg_set_limit(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct blkcg *blkcg = css_to_blkcg(of_css(of)); struct blkg_conf_ctx ctx; struct throtl_grp *tg; u64 v[4]; + unsigned long idle_time; + unsigned long latency_time; int ret; + int index = of_cft(of)->private; ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx); if (ret) @@ -1312,15 +1558,17 @@ static ssize_t tg_set_max(struct kernfs_open_file *of, tg = blkg_to_tg(ctx.blkg); - v[0] = tg->bps[READ]; - v[1] = tg->bps[WRITE]; - v[2] = tg->iops[READ]; - v[3] = tg->iops[WRITE]; + v[0] = tg->bps_conf[READ][index]; + v[1] = tg->bps_conf[WRITE][index]; + v[2] = tg->iops_conf[READ][index]; + v[3] = tg->iops_conf[WRITE][index]; + idle_time = tg->idletime_threshold; + latency_time = tg->latency_target; while (true) { char tok[27]; /* wiops=18446744073709551616 */ char *p; - u64 val = -1; + u64 val = U64_MAX; int len; if (sscanf(ctx.body, "%26s%n", tok, &len) != 1) @@ -1348,15 +1596,43 @@ static ssize_t tg_set_max(struct kernfs_open_file *of, v[2] = min_t(u64, val, UINT_MAX); else if (!strcmp(tok, "wiops")) v[3] = min_t(u64, val, UINT_MAX); + else if (off == LIMIT_LOW && !strcmp(tok, "idle")) + idle_time = val; + else if (off == LIMIT_LOW && !strcmp(tok, "latency")) + latency_time = val; else goto out_finish; } - tg->bps[READ] = v[0]; - tg->bps[WRITE] = v[1]; - tg->iops[READ] = v[2]; - tg->iops[WRITE] = v[3]; + tg->bps_conf[READ][index] = v[0]; + tg->bps_conf[WRITE][index] = v[1]; + tg->iops_conf[READ][index] = v[2]; + tg->iops_conf[WRITE][index] = v[3]; + if (index == LIMIT_MAX) { + tg->bps[READ][index] = v[0]; + tg->bps[WRITE][index] = v[1]; + tg->iops[READ][index] = v[2]; + tg->iops[WRITE][index] = v[3]; + } + tg->bps[READ][LIMIT_LOW] = min(tg->bps_conf[READ][LIMIT_LOW], + tg->bps_conf[READ][LIMIT_MAX]); + tg->bps[WRITE][LIMIT_LOW] = min(tg->bps_conf[WRITE][LIMIT_LOW], + tg->bps_conf[WRITE][LIMIT_MAX]); + tg->iops[READ][LIMIT_LOW] = min(tg->iops_conf[READ][LIMIT_LOW], + tg->iops_conf[READ][LIMIT_MAX]); + tg->iops[WRITE][LIMIT_LOW] = min(tg->iops_conf[WRITE][LIMIT_LOW], + tg->iops_conf[WRITE][LIMIT_MAX]); + + if (index == LIMIT_LOW) { + blk_throtl_update_limit_valid(tg->td); + if (tg->td->limit_valid[LIMIT_LOW]) + tg->td->limit_index = LIMIT_LOW; + tg->idletime_threshold = (idle_time == ULONG_MAX) ? + ULONG_MAX : idle_time; + tg->latency_target = (latency_time == ULONG_MAX) ? + ULONG_MAX : latency_time; + } tg_conf_updated(tg); ret = 0; out_finish: @@ -1365,11 +1641,21 @@ out_finish: } static struct cftype throtl_files[] = { +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW + { + .name = "low", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = tg_print_limit, + .write = tg_set_limit, + .private = LIMIT_LOW, + }, +#endif { .name = "max", .flags = CFTYPE_NOT_ON_ROOT, - .seq_show = tg_print_max, - .write = tg_set_max, + .seq_show = tg_print_limit, + .write = tg_set_limit, + .private = LIMIT_MAX, }, { } /* terminate */ }; @@ -1388,9 +1674,376 @@ static struct blkcg_policy blkcg_policy_throtl = { .pd_alloc_fn = throtl_pd_alloc, .pd_init_fn = throtl_pd_init, .pd_online_fn = throtl_pd_online, + .pd_offline_fn = throtl_pd_offline, .pd_free_fn = throtl_pd_free, }; +static unsigned long __tg_last_low_overflow_time(struct throtl_grp *tg) +{ + unsigned long rtime = jiffies, wtime = jiffies; + + if (tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW]) + rtime = tg->last_low_overflow_time[READ]; + if (tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW]) + wtime = tg->last_low_overflow_time[WRITE]; + return min(rtime, wtime); +} + +/* tg should not be an intermediate node */ +static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg) +{ + struct throtl_service_queue *parent_sq; + struct throtl_grp *parent = tg; + unsigned long ret = __tg_last_low_overflow_time(tg); + + while (true) { + parent_sq = parent->service_queue.parent_sq; + parent = sq_to_tg(parent_sq); + if (!parent) + break; + + /* + * The parent doesn't have low limit, it always reaches low + * limit. Its overflow time is useless for children + */ + if (!parent->bps[READ][LIMIT_LOW] && + !parent->iops[READ][LIMIT_LOW] && + !parent->bps[WRITE][LIMIT_LOW] && + !parent->iops[WRITE][LIMIT_LOW]) + continue; + if (time_after(__tg_last_low_overflow_time(parent), ret)) + ret = __tg_last_low_overflow_time(parent); + } + return ret; +} + +static bool throtl_tg_is_idle(struct throtl_grp *tg) +{ + /* + * cgroup is idle if: + * - single idle is too long, longer than a fixed value (in case user + * configure a too big threshold) or 4 times of slice + * - average think time is more than threshold + * - IO latency is largely below threshold + */ + unsigned long time = jiffies_to_usecs(4 * tg->td->throtl_slice); + + time = min_t(unsigned long, MAX_IDLE_TIME, time); + return (ktime_get_ns() >> 10) - tg->last_finish_time > time || + tg->avg_idletime > tg->idletime_threshold || + (tg->latency_target && tg->bio_cnt && + tg->bad_bio_cnt * 5 < tg->bio_cnt); +} + +static bool throtl_tg_can_upgrade(struct throtl_grp *tg) +{ + struct throtl_service_queue *sq = &tg->service_queue; + bool read_limit, write_limit; + + /* + * if cgroup reaches low limit (if low limit is 0, the cgroup always + * reaches), it's ok to upgrade to next limit + */ + read_limit = tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW]; + write_limit = tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW]; + if (!read_limit && !write_limit) + return true; + if (read_limit && sq->nr_queued[READ] && + (!write_limit || sq->nr_queued[WRITE])) + return true; + if (write_limit && sq->nr_queued[WRITE] && + (!read_limit || sq->nr_queued[READ])) + return true; + + if (time_after_eq(jiffies, + tg_last_low_overflow_time(tg) + tg->td->throtl_slice) && + throtl_tg_is_idle(tg)) + return true; + return false; +} + +static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg) +{ + while (true) { + if (throtl_tg_can_upgrade(tg)) + return true; + tg = sq_to_tg(tg->service_queue.parent_sq); + if (!tg || !tg_to_blkg(tg)->parent) + return false; + } + return false; +} + +static bool throtl_can_upgrade(struct throtl_data *td, + struct throtl_grp *this_tg) +{ + struct cgroup_subsys_state *pos_css; + struct blkcg_gq *blkg; + + if (td->limit_index != LIMIT_LOW) + return false; + + if (time_before(jiffies, td->low_downgrade_time + td->throtl_slice)) + return false; + + rcu_read_lock(); + blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) { + struct throtl_grp *tg = blkg_to_tg(blkg); + + if (tg == this_tg) + continue; + if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children)) + continue; + if (!throtl_hierarchy_can_upgrade(tg)) { + rcu_read_unlock(); + return false; + } + } + rcu_read_unlock(); + return true; +} + +static void throtl_upgrade_check(struct throtl_grp *tg) +{ + unsigned long now = jiffies; + + if (tg->td->limit_index != LIMIT_LOW) + return; + + if (time_after(tg->last_check_time + tg->td->throtl_slice, now)) + return; + + tg->last_check_time = now; + + if (!time_after_eq(now, + __tg_last_low_overflow_time(tg) + tg->td->throtl_slice)) + return; + + if (throtl_can_upgrade(tg->td, NULL)) + throtl_upgrade_state(tg->td); +} + +static void throtl_upgrade_state(struct throtl_data *td) +{ + struct cgroup_subsys_state *pos_css; + struct blkcg_gq *blkg; + + td->limit_index = LIMIT_MAX; + td->low_upgrade_time = jiffies; + td->scale = 0; + rcu_read_lock(); + blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) { + struct throtl_grp *tg = blkg_to_tg(blkg); + struct throtl_service_queue *sq = &tg->service_queue; + + tg->disptime = jiffies - 1; + throtl_select_dispatch(sq); + throtl_schedule_next_dispatch(sq, false); + } + rcu_read_unlock(); + throtl_select_dispatch(&td->service_queue); + throtl_schedule_next_dispatch(&td->service_queue, false); + queue_work(kthrotld_workqueue, &td->dispatch_work); +} + +static void throtl_downgrade_state(struct throtl_data *td, int new) +{ + td->scale /= 2; + + if (td->scale) { + td->low_upgrade_time = jiffies - td->scale * td->throtl_slice; + return; + } + + td->limit_index = new; + td->low_downgrade_time = jiffies; +} + +static bool throtl_tg_can_downgrade(struct throtl_grp *tg) +{ + struct throtl_data *td = tg->td; + unsigned long now = jiffies; + + /* + * If cgroup is below low limit, consider downgrade and throttle other + * cgroups + */ + if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) && + time_after_eq(now, tg_last_low_overflow_time(tg) + + td->throtl_slice) && + (!throtl_tg_is_idle(tg) || + !list_empty(&tg_to_blkg(tg)->blkcg->css.children))) + return true; + return false; +} + +static bool throtl_hierarchy_can_downgrade(struct throtl_grp *tg) +{ + while (true) { + if (!throtl_tg_can_downgrade(tg)) + return false; + tg = sq_to_tg(tg->service_queue.parent_sq); + if (!tg || !tg_to_blkg(tg)->parent) + break; + } + return true; +} + +static void throtl_downgrade_check(struct throtl_grp *tg) +{ + uint64_t bps; + unsigned int iops; + unsigned long elapsed_time; + unsigned long now = jiffies; + + if (tg->td->limit_index != LIMIT_MAX || + !tg->td->limit_valid[LIMIT_LOW]) + return; + if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children)) + return; + if (time_after(tg->last_check_time + tg->td->throtl_slice, now)) + return; + + elapsed_time = now - tg->last_check_time; + tg->last_check_time = now; + + if (time_before(now, tg_last_low_overflow_time(tg) + + tg->td->throtl_slice)) + return; + + if (tg->bps[READ][LIMIT_LOW]) { + bps = tg->last_bytes_disp[READ] * HZ; + do_div(bps, elapsed_time); + if (bps >= tg->bps[READ][LIMIT_LOW]) + tg->last_low_overflow_time[READ] = now; + } + + if (tg->bps[WRITE][LIMIT_LOW]) { + bps = tg->last_bytes_disp[WRITE] * HZ; + do_div(bps, elapsed_time); + if (bps >= tg->bps[WRITE][LIMIT_LOW]) + tg->last_low_overflow_time[WRITE] = now; + } + + if (tg->iops[READ][LIMIT_LOW]) { + iops = tg->last_io_disp[READ] * HZ / elapsed_time; + if (iops >= tg->iops[READ][LIMIT_LOW]) + tg->last_low_overflow_time[READ] = now; + } + + if (tg->iops[WRITE][LIMIT_LOW]) { + iops = tg->last_io_disp[WRITE] * HZ / elapsed_time; + if (iops >= tg->iops[WRITE][LIMIT_LOW]) + tg->last_low_overflow_time[WRITE] = now; + } + + /* + * If cgroup is below low limit, consider downgrade and throttle other + * cgroups + */ + if (throtl_hierarchy_can_downgrade(tg)) + throtl_downgrade_state(tg->td, LIMIT_LOW); + + tg->last_bytes_disp[READ] = 0; + tg->last_bytes_disp[WRITE] = 0; + tg->last_io_disp[READ] = 0; + tg->last_io_disp[WRITE] = 0; +} + +static void blk_throtl_update_idletime(struct throtl_grp *tg) +{ + unsigned long now = ktime_get_ns() >> 10; + unsigned long last_finish_time = tg->last_finish_time; + + if (now <= last_finish_time || last_finish_time == 0 || + last_finish_time == tg->checked_last_finish_time) + return; + + tg->avg_idletime = (tg->avg_idletime * 7 + now - last_finish_time) >> 3; + tg->checked_last_finish_time = last_finish_time; +} + +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW +static void throtl_update_latency_buckets(struct throtl_data *td) +{ + struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE]; + int i, cpu; + unsigned long last_latency = 0; + unsigned long latency; + + if (!blk_queue_nonrot(td->queue)) + return; + if (time_before(jiffies, td->last_calculate_time + HZ)) + return; + td->last_calculate_time = jiffies; + + memset(avg_latency, 0, sizeof(avg_latency)); + for (i = 0; i < LATENCY_BUCKET_SIZE; i++) { + struct latency_bucket *tmp = &td->tmp_buckets[i]; + + for_each_possible_cpu(cpu) { + struct latency_bucket *bucket; + + /* this isn't race free, but ok in practice */ + bucket = per_cpu_ptr(td->latency_buckets, cpu); + tmp->total_latency += bucket[i].total_latency; + tmp->samples += bucket[i].samples; + bucket[i].total_latency = 0; + bucket[i].samples = 0; + } + + if (tmp->samples >= 32) { + int samples = tmp->samples; + + latency = tmp->total_latency; + + tmp->total_latency = 0; + tmp->samples = 0; + latency /= samples; + if (latency == 0) + continue; + avg_latency[i].latency = latency; + } + } + + for (i = 0; i < LATENCY_BUCKET_SIZE; i++) { + if (!avg_latency[i].latency) { + if (td->avg_buckets[i].latency < last_latency) + td->avg_buckets[i].latency = last_latency; + continue; + } + + if (!td->avg_buckets[i].valid) + latency = avg_latency[i].latency; + else + latency = (td->avg_buckets[i].latency * 7 + + avg_latency[i].latency) >> 3; + + td->avg_buckets[i].latency = max(latency, last_latency); + td->avg_buckets[i].valid = true; + last_latency = td->avg_buckets[i].latency; + } +} +#else +static inline void throtl_update_latency_buckets(struct throtl_data *td) +{ +} +#endif + +static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio) +{ +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW + int ret; + + ret = bio_associate_current(bio); + if (ret == 0 || ret == -EBUSY) + bio->bi_cg_private = tg; + blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio)); +#else + bio_associate_current(bio); +#endif +} + bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg, struct bio *bio) { @@ -1399,6 +2052,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg, struct throtl_service_queue *sq; bool rw = bio_data_dir(bio); bool throttled = false; + struct throtl_data *td = tg->td; WARN_ON_ONCE(!rcu_read_lock_held()); @@ -1408,19 +2062,35 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg, spin_lock_irq(q->queue_lock); + throtl_update_latency_buckets(td); + if (unlikely(blk_queue_bypass(q))) goto out_unlock; + blk_throtl_assoc_bio(tg, bio); + blk_throtl_update_idletime(tg); + sq = &tg->service_queue; +again: while (true) { + if (tg->last_low_overflow_time[rw] == 0) + tg->last_low_overflow_time[rw] = jiffies; + throtl_downgrade_check(tg); + throtl_upgrade_check(tg); /* throtl is FIFO - if bios are already queued, should queue */ if (sq->nr_queued[rw]) break; /* if above limits, break to queue */ - if (!tg_may_dispatch(tg, bio, NULL)) + if (!tg_may_dispatch(tg, bio, NULL)) { + tg->last_low_overflow_time[rw] = jiffies; + if (throtl_can_upgrade(td, tg)) { + throtl_upgrade_state(td); + goto again; + } break; + } /* within limits, let's charge and dispatch directly */ throtl_charge_bio(tg, bio); @@ -1453,12 +2123,14 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg, /* out-of-limit, queue to @tg */ throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d", rw == READ ? 'R' : 'W', - tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw], - tg->io_disp[rw], tg->iops[rw], + tg->bytes_disp[rw], bio->bi_iter.bi_size, + tg_bps_limit(tg, rw), + tg->io_disp[rw], tg_iops_limit(tg, rw), sq->nr_queued[READ], sq->nr_queued[WRITE]); - bio_associate_current(bio); - tg->td->nr_queued[rw]++; + tg->last_low_overflow_time[rw] = jiffies; + + td->nr_queued[rw]++; throtl_add_bio_tg(bio, qn, tg); throttled = true; @@ -1483,9 +2155,94 @@ out: */ if (!throttled) bio_clear_flag(bio, BIO_THROTTLED); + +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW + if (throttled || !td->track_bio_latency) + bio->bi_issue_stat.stat |= SKIP_LATENCY; +#endif return throttled; } +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW +static void throtl_track_latency(struct throtl_data *td, sector_t size, + int op, unsigned long time) +{ + struct latency_bucket *latency; + int index; + + if (!td || td->limit_index != LIMIT_LOW || op != REQ_OP_READ || + !blk_queue_nonrot(td->queue)) + return; + + index = request_bucket_index(size); + + latency = get_cpu_ptr(td->latency_buckets); + latency[index].total_latency += time; + latency[index].samples++; + put_cpu_ptr(td->latency_buckets); +} + +void blk_throtl_stat_add(struct request *rq, u64 time_ns) +{ + struct request_queue *q = rq->q; + struct throtl_data *td = q->td; + + throtl_track_latency(td, blk_stat_size(&rq->issue_stat), + req_op(rq), time_ns >> 10); +} + +void blk_throtl_bio_endio(struct bio *bio) +{ + struct throtl_grp *tg; + u64 finish_time_ns; + unsigned long finish_time; + unsigned long start_time; + unsigned long lat; + + tg = bio->bi_cg_private; + if (!tg) + return; + bio->bi_cg_private = NULL; + + finish_time_ns = ktime_get_ns(); + tg->last_finish_time = finish_time_ns >> 10; + + start_time = blk_stat_time(&bio->bi_issue_stat) >> 10; + finish_time = __blk_stat_time(finish_time_ns) >> 10; + if (!start_time || finish_time <= start_time) + return; + + lat = finish_time - start_time; + /* this is only for bio based driver */ + if (!(bio->bi_issue_stat.stat & SKIP_LATENCY)) + throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat), + bio_op(bio), lat); + + if (tg->latency_target) { + int bucket; + unsigned int threshold; + + bucket = request_bucket_index( + blk_stat_size(&bio->bi_issue_stat)); + threshold = tg->td->avg_buckets[bucket].latency + + tg->latency_target; + if (lat > threshold) + tg->bad_bio_cnt++; + /* + * Not race free, could get wrong count, which means cgroups + * will be throttled + */ + tg->bio_cnt++; + } + + if (time_after(jiffies, tg->bio_cnt_reset_time) || tg->bio_cnt > 1024) { + tg->bio_cnt_reset_time = tg->td->throtl_slice + jiffies; + tg->bio_cnt /= 2; + tg->bad_bio_cnt /= 2; + } +} +#endif + /* * Dispatch all bios from all children tg's queued on @parent_sq. On * return, @parent_sq is guaranteed to not have any active children tg's @@ -1558,6 +2315,12 @@ int blk_throtl_init(struct request_queue *q) td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node); if (!td) return -ENOMEM; + td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) * + LATENCY_BUCKET_SIZE, __alignof__(u64)); + if (!td->latency_buckets) { + kfree(td); + return -ENOMEM; + } INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn); throtl_service_queue_init(&td->service_queue); @@ -1565,10 +2328,17 @@ int blk_throtl_init(struct request_queue *q) q->td = td; td->queue = q; + td->limit_valid[LIMIT_MAX] = true; + td->limit_index = LIMIT_MAX; + td->low_upgrade_time = jiffies; + td->low_downgrade_time = jiffies; + /* activate policy */ ret = blkcg_activate_policy(q, &blkcg_policy_throtl); - if (ret) + if (ret) { + free_percpu(td->latency_buckets); kfree(td); + } return ret; } @@ -1577,9 +2347,74 @@ void blk_throtl_exit(struct request_queue *q) BUG_ON(!q->td); throtl_shutdown_wq(q); blkcg_deactivate_policy(q, &blkcg_policy_throtl); + free_percpu(q->td->latency_buckets); kfree(q->td); } +void blk_throtl_register_queue(struct request_queue *q) +{ + struct throtl_data *td; + struct cgroup_subsys_state *pos_css; + struct blkcg_gq *blkg; + + td = q->td; + BUG_ON(!td); + + if (blk_queue_nonrot(q)) { + td->throtl_slice = DFL_THROTL_SLICE_SSD; + td->dft_idletime_threshold = DFL_IDLE_THRESHOLD_SSD; + } else { + td->throtl_slice = DFL_THROTL_SLICE_HD; + td->dft_idletime_threshold = DFL_IDLE_THRESHOLD_HD; + } +#ifndef CONFIG_BLK_DEV_THROTTLING_LOW + /* if no low limit, use previous default */ + td->throtl_slice = DFL_THROTL_SLICE_HD; +#endif + + td->track_bio_latency = !q->mq_ops && !q->request_fn; + if (!td->track_bio_latency) + blk_stat_enable_accounting(q); + + /* + * some tg are created before queue is fully initialized, eg, nonrot + * isn't initialized yet + */ + rcu_read_lock(); + blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) { + struct throtl_grp *tg = blkg_to_tg(blkg); + + tg->idletime_threshold = td->dft_idletime_threshold; + } + rcu_read_unlock(); +} + +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW +ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page) +{ + if (!q->td) + return -EINVAL; + return sprintf(page, "%u\n", jiffies_to_msecs(q->td->throtl_slice)); +} + +ssize_t blk_throtl_sample_time_store(struct request_queue *q, + const char *page, size_t count) +{ + unsigned long v; + unsigned long t; + + if (!q->td) + return -EINVAL; + if (kstrtoul(page, 10, &v)) + return -EINVAL; + t = msecs_to_jiffies(v); + if (t == 0 || t > MAX_THROTL_SLICE) + return -EINVAL; + q->td->throtl_slice = t; + return count; +} +#endif + static int __init throtl_init(void) { kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0); diff --git a/block/blk-timeout.c b/block/blk-timeout.c index a30441a200c0..cbff183f3d9f 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -89,7 +89,6 @@ static void blk_rq_timed_out(struct request *req) ret = q->rq_timed_out_fn(req); switch (ret) { case BLK_EH_HANDLED: - /* Can we use req->errors here? */ __blk_complete_request(req); break; case BLK_EH_RESET_TIMER: diff --git a/block/blk-wbt.c b/block/blk-wbt.c index 1aedb1f7ee0c..17676f4d7fd1 100644 --- a/block/blk-wbt.c +++ b/block/blk-wbt.c @@ -255,8 +255,8 @@ static inline bool stat_sample_valid(struct blk_rq_stat *stat) * that it's writes impacting us, and not just some sole read on * a device that is in a lower power state. */ - return stat[BLK_STAT_READ].nr_samples >= 1 && - stat[BLK_STAT_WRITE].nr_samples >= RWB_MIN_WRITE_SAMPLES; + return (stat[READ].nr_samples >= 1 && + stat[WRITE].nr_samples >= RWB_MIN_WRITE_SAMPLES); } static u64 rwb_sync_issue_lat(struct rq_wb *rwb) @@ -277,7 +277,7 @@ enum { LAT_EXCEEDED, }; -static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat) +static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat) { struct backing_dev_info *bdi = rwb->queue->backing_dev_info; u64 thislat; @@ -293,7 +293,7 @@ static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat) */ thislat = rwb_sync_issue_lat(rwb); if (thislat > rwb->cur_win_nsec || - (thislat > rwb->min_lat_nsec && !stat[BLK_STAT_READ].nr_samples)) { + (thislat > rwb->min_lat_nsec && !stat[READ].nr_samples)) { trace_wbt_lat(bdi, thislat); return LAT_EXCEEDED; } @@ -308,8 +308,8 @@ static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat) * waited or still has writes in flights, consider us doing * just writes as well. */ - if ((stat[BLK_STAT_WRITE].nr_samples && blk_stat_is_current(stat)) || - wb_recent_wait(rwb) || wbt_inflight(rwb)) + if (stat[WRITE].nr_samples || wb_recent_wait(rwb) || + wbt_inflight(rwb)) return LAT_UNKNOWN_WRITES; return LAT_UNKNOWN; } @@ -317,8 +317,8 @@ static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat) /* * If the 'min' latency exceeds our target, step down. */ - if (stat[BLK_STAT_READ].min > rwb->min_lat_nsec) { - trace_wbt_lat(bdi, stat[BLK_STAT_READ].min); + if (stat[READ].min > rwb->min_lat_nsec) { + trace_wbt_lat(bdi, stat[READ].min); trace_wbt_stat(bdi, stat); return LAT_EXCEEDED; } @@ -329,14 +329,6 @@ static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat) return LAT_OK; } -static int latency_exceeded(struct rq_wb *rwb) -{ - struct blk_rq_stat stat[2]; - - blk_queue_stat_get(rwb->queue, stat); - return __latency_exceeded(rwb, stat); -} - static void rwb_trace_step(struct rq_wb *rwb, const char *msg) { struct backing_dev_info *bdi = rwb->queue->backing_dev_info; @@ -355,7 +347,6 @@ static void scale_up(struct rq_wb *rwb) rwb->scale_step--; rwb->unknown_cnt = 0; - blk_stat_clear(rwb->queue); rwb->scaled_max = calc_wb_limits(rwb); @@ -385,15 +376,12 @@ static void scale_down(struct rq_wb *rwb, bool hard_throttle) rwb->scaled_max = false; rwb->unknown_cnt = 0; - blk_stat_clear(rwb->queue); calc_wb_limits(rwb); rwb_trace_step(rwb, "step down"); } static void rwb_arm_timer(struct rq_wb *rwb) { - unsigned long expires; - if (rwb->scale_step > 0) { /* * We should speed this up, using some variant of a fast @@ -411,17 +399,16 @@ static void rwb_arm_timer(struct rq_wb *rwb) rwb->cur_win_nsec = rwb->win_nsec; } - expires = jiffies + nsecs_to_jiffies(rwb->cur_win_nsec); - mod_timer(&rwb->window_timer, expires); + blk_stat_activate_nsecs(rwb->cb, rwb->cur_win_nsec); } -static void wb_timer_fn(unsigned long data) +static void wb_timer_fn(struct blk_stat_callback *cb) { - struct rq_wb *rwb = (struct rq_wb *) data; + struct rq_wb *rwb = cb->data; unsigned int inflight = wbt_inflight(rwb); int status; - status = latency_exceeded(rwb); + status = latency_exceeded(rwb, cb->stat); trace_wbt_timer(rwb->queue->backing_dev_info, status, rwb->scale_step, inflight); @@ -614,7 +601,7 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock) __wbt_wait(rwb, bio->bi_opf, lock); - if (!timer_pending(&rwb->window_timer)) + if (!blk_stat_is_active(rwb->cb)) rwb_arm_timer(rwb); if (current_is_kswapd()) @@ -666,22 +653,37 @@ void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on) rwb->wc = write_cache_on; } - /* - * Disable wbt, if enabled by default. Only called from CFQ, if we have - * cgroups enabled +/* + * Disable wbt, if enabled by default. Only called from CFQ. */ void wbt_disable_default(struct request_queue *q) { struct rq_wb *rwb = q->rq_wb; - if (rwb && rwb->enable_state == WBT_STATE_ON_DEFAULT) { - del_timer_sync(&rwb->window_timer); - rwb->win_nsec = rwb->min_lat_nsec = 0; - wbt_update_limits(rwb); - } + if (rwb && rwb->enable_state == WBT_STATE_ON_DEFAULT) + wbt_exit(q); } EXPORT_SYMBOL_GPL(wbt_disable_default); +/* + * Enable wbt if defaults are configured that way + */ +void wbt_enable_default(struct request_queue *q) +{ + /* Throttling already enabled? */ + if (q->rq_wb) + return; + + /* Queue not registered? Maybe shutting down... */ + if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) + return; + + if ((q->mq_ops && IS_ENABLED(CONFIG_BLK_WBT_MQ)) || + (q->request_fn && IS_ENABLED(CONFIG_BLK_WBT_SQ))) + wbt_init(q); +} +EXPORT_SYMBOL_GPL(wbt_enable_default); + u64 wbt_default_latency_nsec(struct request_queue *q) { /* @@ -694,29 +696,33 @@ u64 wbt_default_latency_nsec(struct request_queue *q) return 75000000ULL; } +static int wbt_data_dir(const struct request *rq) +{ + return rq_data_dir(rq); +} + int wbt_init(struct request_queue *q) { struct rq_wb *rwb; int i; - /* - * For now, we depend on the stats window being larger than - * our monitoring window. Ensure that this isn't inadvertently - * violated. - */ - BUILD_BUG_ON(RWB_WINDOW_NSEC > BLK_STAT_NSEC); BUILD_BUG_ON(WBT_NR_BITS > BLK_STAT_RES_BITS); rwb = kzalloc(sizeof(*rwb), GFP_KERNEL); if (!rwb) return -ENOMEM; + rwb->cb = blk_stat_alloc_callback(wb_timer_fn, wbt_data_dir, 2, rwb); + if (!rwb->cb) { + kfree(rwb); + return -ENOMEM; + } + for (i = 0; i < WBT_NUM_RWQ; i++) { atomic_set(&rwb->rq_wait[i].inflight, 0); init_waitqueue_head(&rwb->rq_wait[i].wait); } - setup_timer(&rwb->window_timer, wb_timer_fn, (unsigned long) rwb); rwb->wc = 1; rwb->queue_depth = RWB_DEF_DEPTH; rwb->last_comp = rwb->last_issue = jiffies; @@ -726,10 +732,10 @@ int wbt_init(struct request_queue *q) wbt_update_limits(rwb); /* - * Assign rwb, and turn on stats tracking for this queue + * Assign rwb and add the stats callback. */ q->rq_wb = rwb; - blk_stat_enable(q); + blk_stat_add_callback(q, rwb->cb); rwb->min_lat_nsec = wbt_default_latency_nsec(q); @@ -744,7 +750,8 @@ void wbt_exit(struct request_queue *q) struct rq_wb *rwb = q->rq_wb; if (rwb) { - del_timer_sync(&rwb->window_timer); + blk_stat_remove_callback(q, rwb->cb); + blk_stat_free_callback(rwb->cb); q->rq_wb = NULL; kfree(rwb); } diff --git a/block/blk-wbt.h b/block/blk-wbt.h index 65f1de519f67..df6de50c5d59 100644 --- a/block/blk-wbt.h +++ b/block/blk-wbt.h @@ -32,27 +32,27 @@ enum { static inline void wbt_clear_state(struct blk_issue_stat *stat) { - stat->time &= BLK_STAT_TIME_MASK; + stat->stat &= ~BLK_STAT_RES_MASK; } static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat) { - return (stat->time & BLK_STAT_MASK) >> BLK_STAT_SHIFT; + return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT; } static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct) { - stat->time |= ((u64) wb_acct) << BLK_STAT_SHIFT; + stat->stat |= ((u64) wb_acct) << BLK_STAT_RES_SHIFT; } static inline bool wbt_is_tracked(struct blk_issue_stat *stat) { - return (stat->time >> BLK_STAT_SHIFT) & WBT_TRACKED; + return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED; } static inline bool wbt_is_read(struct blk_issue_stat *stat) { - return (stat->time >> BLK_STAT_SHIFT) & WBT_READ; + return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_READ; } struct rq_wait { @@ -81,7 +81,7 @@ struct rq_wb { u64 win_nsec; /* default window size */ u64 cur_win_nsec; /* current window size */ - struct timer_list window_timer; + struct blk_stat_callback *cb; s64 sync_issue; void *sync_cookie; @@ -117,6 +117,7 @@ void wbt_update_limits(struct rq_wb *); void wbt_requeue(struct rq_wb *, struct blk_issue_stat *); void wbt_issue(struct rq_wb *, struct blk_issue_stat *); void wbt_disable_default(struct request_queue *); +void wbt_enable_default(struct request_queue *); void wbt_set_queue_depth(struct rq_wb *, unsigned int); void wbt_set_write_cache(struct rq_wb *, bool); @@ -155,6 +156,9 @@ static inline void wbt_issue(struct rq_wb *rwb, struct blk_issue_stat *stat) static inline void wbt_disable_default(struct request_queue *q) { } +static inline void wbt_enable_default(struct request_queue *q) +{ +} static inline void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth) { } diff --git a/block/blk.h b/block/blk.h index d1ea4bd9b9a3..2ed70228e44f 100644 --- a/block/blk.h +++ b/block/blk.h @@ -60,15 +60,12 @@ void blk_free_flush_queue(struct blk_flush_queue *q); int blk_init_rl(struct request_list *rl, struct request_queue *q, gfp_t gfp_mask); void blk_exit_rl(struct request_list *rl); -void init_request_from_bio(struct request *req, struct bio *bio); void blk_rq_bio_prep(struct request_queue *q, struct request *rq, struct bio *bio); void blk_queue_bypass_start(struct request_queue *q); void blk_queue_bypass_end(struct request_queue *q); void blk_dequeue_request(struct request *rq); void __blk_queue_free_tags(struct request_queue *q); -bool __blk_end_bidi_request(struct request *rq, int error, - unsigned int nr_bytes, unsigned int bidi_bytes); void blk_freeze_queue(struct request_queue *q); static inline void blk_queue_enter_live(struct request_queue *q) @@ -319,10 +316,22 @@ static inline struct io_context *create_io_context(gfp_t gfp_mask, int node) extern void blk_throtl_drain(struct request_queue *q); extern int blk_throtl_init(struct request_queue *q); extern void blk_throtl_exit(struct request_queue *q); +extern void blk_throtl_register_queue(struct request_queue *q); #else /* CONFIG_BLK_DEV_THROTTLING */ static inline void blk_throtl_drain(struct request_queue *q) { } static inline int blk_throtl_init(struct request_queue *q) { return 0; } static inline void blk_throtl_exit(struct request_queue *q) { } +static inline void blk_throtl_register_queue(struct request_queue *q) { } #endif /* CONFIG_BLK_DEV_THROTTLING */ +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW +extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page); +extern ssize_t blk_throtl_sample_time_store(struct request_queue *q, + const char *page, size_t count); +extern void blk_throtl_bio_endio(struct bio *bio); +extern void blk_throtl_stat_add(struct request *rq, u64 time); +#else +static inline void blk_throtl_bio_endio(struct bio *bio) { } +static inline void blk_throtl_stat_add(struct request *rq, u64 time) { } +#endif #endif /* BLK_INTERNAL_H */ diff --git a/block/bsg-lib.c b/block/bsg-lib.c index cd15f9dbb147..0a23dbba2d30 100644 --- a/block/bsg-lib.c +++ b/block/bsg-lib.c @@ -37,7 +37,7 @@ static void bsg_destroy_job(struct kref *kref) struct bsg_job *job = container_of(kref, struct bsg_job, kref); struct request *rq = job->req; - blk_end_request_all(rq, rq->errors); + blk_end_request_all(rq, scsi_req(rq)->result); put_device(job->dev); /* release reference for the request */ @@ -74,7 +74,7 @@ void bsg_job_done(struct bsg_job *job, int result, struct scsi_request *rq = scsi_req(req); int err; - err = job->req->errors = result; + err = scsi_req(job->req)->result = result; if (err < 0) /* we're only returning the result field in the reply */ rq->sense_len = sizeof(u32); @@ -177,7 +177,7 @@ failjob_rls_job: * @q: request queue to manage * * On error the create_bsg_job function should return a -Exyz error value - * that will be set to the req->errors. + * that will be set to ->result. * * Drivers/subsys should pass this to the queue init function. */ @@ -201,7 +201,7 @@ static void bsg_request_fn(struct request_queue *q) ret = bsg_create_job(dev, req); if (ret) { - req->errors = ret; + scsi_req(req)->result = ret; blk_end_request_all(req, ret); spin_lock_irq(q->queue_lock); continue; diff --git a/block/bsg.c b/block/bsg.c index 74835dbf0c47..d9da1b613ced 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -391,13 +391,13 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr, struct scsi_request *req = scsi_req(rq); int ret = 0; - dprintk("rq %p bio %p 0x%x\n", rq, bio, rq->errors); + dprintk("rq %p bio %p 0x%x\n", rq, bio, req->result); /* * fill in all the output members */ - hdr->device_status = rq->errors & 0xff; - hdr->transport_status = host_byte(rq->errors); - hdr->driver_status = driver_byte(rq->errors); + hdr->device_status = req->result & 0xff; + hdr->transport_status = host_byte(req->result); + hdr->driver_status = driver_byte(req->result); hdr->info = 0; if (hdr->device_status || hdr->transport_status || hdr->driver_status) hdr->info |= SG_INFO_CHECK; @@ -431,8 +431,8 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr, * just a protocol response (i.e. non negative), that gets * processed above. */ - if (!ret && rq->errors < 0) - ret = rq->errors; + if (!ret && req->result < 0) + ret = req->result; blk_rq_unmap_user(bio); scsi_req_free_cmd(req); diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 440b95ee593c..da69b079725f 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -3761,16 +3761,14 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq, } #ifdef CONFIG_CFQ_GROUP_IOSCHED -static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) +static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) { struct cfq_data *cfqd = cic_to_cfqd(cic); struct cfq_queue *cfqq; uint64_t serial_nr; - bool nonroot_cg; rcu_read_lock(); serial_nr = bio_blkcg(bio)->css.serial_nr; - nonroot_cg = bio_blkcg(bio) != &blkcg_root; rcu_read_unlock(); /* @@ -3778,7 +3776,7 @@ static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) * spuriously on a newly created cic but there's no harm. */ if (unlikely(!cfqd) || likely(cic->blkcg_serial_nr == serial_nr)) - return nonroot_cg; + return; /* * Drop reference to queues. New queues will be assigned in new @@ -3799,12 +3797,10 @@ static bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) } cic->blkcg_serial_nr = serial_nr; - return nonroot_cg; } #else -static inline bool check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) +static inline void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) { - return false; } #endif /* CONFIG_CFQ_GROUP_IOSCHED */ @@ -4449,12 +4445,11 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio, const int rw = rq_data_dir(rq); const bool is_sync = rq_is_sync(rq); struct cfq_queue *cfqq; - bool disable_wbt; spin_lock_irq(q->queue_lock); check_ioprio_changed(cic, bio); - disable_wbt = check_blkcg_changed(cic, bio); + check_blkcg_changed(cic, bio); new_queue: cfqq = cic_to_cfqq(cic, is_sync); if (!cfqq || cfqq == &cfqd->oom_cfqq) { @@ -4491,9 +4486,6 @@ new_queue: rq->elv.priv[1] = cfqq->cfqg; spin_unlock_irq(q->queue_lock); - if (disable_wbt) - wbt_disable_default(q); - return 0; } @@ -4706,6 +4698,7 @@ static void cfq_registered_queue(struct request_queue *q) */ if (blk_queue_nonrot(q)) cfqd->cfq_slice_idle = 0; + wbt_disable_default(q); } /* diff --git a/block/compat_ioctl.c b/block/compat_ioctl.c index 570021a0dc1c..04325b81c2b4 100644 --- a/block/compat_ioctl.c +++ b/block/compat_ioctl.c @@ -685,7 +685,7 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg) case BLKALIGNOFF: return compat_put_int(arg, bdev_alignment_offset(bdev)); case BLKDISCARDZEROES: - return compat_put_uint(arg, bdev_discard_zeroes_data(bdev)); + return compat_put_uint(arg, 0); case BLKFLSBUF: case BLKROSET: case BLKDISCARD: diff --git a/block/elevator.c b/block/elevator.c index 4d9084a14c10..bf11e70f008b 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -41,6 +41,7 @@ #include "blk.h" #include "blk-mq-sched.h" +#include "blk-wbt.h" static DEFINE_SPINLOCK(elv_list_lock); static LIST_HEAD(elv_list); @@ -877,6 +878,8 @@ void elv_unregister_queue(struct request_queue *q) kobject_uevent(&e->kobj, KOBJ_REMOVE); kobject_del(&e->kobj); e->registered = 0; + /* Re-enable throttling in case elevator disabled it */ + wbt_enable_default(q); } } EXPORT_SYMBOL(elv_unregister_queue); diff --git a/block/genhd.c b/block/genhd.c index a9c516a8b37d..9a2d01abfa3b 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -1060,8 +1060,19 @@ static struct attribute *disk_attrs[] = { NULL }; +static umode_t disk_visible(struct kobject *kobj, struct attribute *a, int n) +{ + struct device *dev = container_of(kobj, typeof(*dev), kobj); + struct gendisk *disk = dev_to_disk(dev); + + if (a == &dev_attr_badblocks.attr && !disk->bb) + return 0; + return a->mode; +} + static struct attribute_group disk_attr_group = { .attrs = disk_attrs, + .is_visible = disk_visible, }; static const struct attribute_group *disk_attr_groups[] = { @@ -1352,7 +1363,7 @@ struct kobject *get_disk(struct gendisk *disk) owner = disk->fops->owner; if (owner && !try_module_get(owner)) return NULL; - kobj = kobject_get(&disk_to_dev(disk)->kobj); + kobj = kobject_get_unless_zero(&disk_to_dev(disk)->kobj); if (kobj == NULL) { module_put(owner); return NULL; diff --git a/block/ioctl.c b/block/ioctl.c index 7b88820b93d9..0de02ee67eed 100644 --- a/block/ioctl.c +++ b/block/ioctl.c @@ -255,7 +255,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode, truncate_inode_pages_range(mapping, start, end); return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL, - false); + BLKDEV_ZERO_NOUNMAP); } static int put_ushort(unsigned long arg, unsigned short val) @@ -547,7 +547,7 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd, case BLKALIGNOFF: return put_int(arg, bdev_alignment_offset(bdev)); case BLKDISCARDZEROES: - return put_uint(arg, bdev_discard_zeroes_data(bdev)); + return put_uint(arg, 0); case BLKSECTGET: max_sectors = min_t(unsigned int, USHRT_MAX, queue_max_sectors(bdev_get_queue(bdev))); diff --git a/block/ioprio.c b/block/ioprio.c index 0c47a00f92a8..4b120c9cf7e8 100644 --- a/block/ioprio.c +++ b/block/ioprio.c @@ -163,22 +163,12 @@ out: int ioprio_best(unsigned short aprio, unsigned short bprio) { - unsigned short aclass; - unsigned short bclass; - if (!ioprio_valid(aprio)) aprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM); if (!ioprio_valid(bprio)) bprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM); - aclass = IOPRIO_PRIO_CLASS(aprio); - bclass = IOPRIO_PRIO_CLASS(bprio); - if (aclass == bclass) - return min(aprio, bprio); - if (aclass > bclass) - return bprio; - else - return aprio; + return min(aprio, bprio); } SYSCALL_DEFINE2(ioprio_get, int, which, int, who) diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c new file mode 100644 index 000000000000..3b0090bc5dd1 --- /dev/null +++ b/block/kyber-iosched.c @@ -0,0 +1,719 @@ +/* + * The Kyber I/O scheduler. Controls latency by throttling queue depths using + * scalable techniques. + * + * Copyright (C) 2017 Facebook + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <https://www.gnu.org/licenses/>. + */ + +#include <linux/kernel.h> +#include <linux/blkdev.h> +#include <linux/blk-mq.h> +#include <linux/elevator.h> +#include <linux/module.h> +#include <linux/sbitmap.h> + +#include "blk.h" +#include "blk-mq.h" +#include "blk-mq-sched.h" +#include "blk-mq-tag.h" +#include "blk-stat.h" + +/* Scheduling domains. */ +enum { + KYBER_READ, + KYBER_SYNC_WRITE, + KYBER_OTHER, /* Async writes, discard, etc. */ + KYBER_NUM_DOMAINS, +}; + +enum { + KYBER_MIN_DEPTH = 256, + + /* + * In order to prevent starvation of synchronous requests by a flood of + * asynchronous requests, we reserve 25% of requests for synchronous + * operations. + */ + KYBER_ASYNC_PERCENT = 75, +}; + +/* + * Initial device-wide depths for each scheduling domain. + * + * Even for fast devices with lots of tags like NVMe, you can saturate + * the device with only a fraction of the maximum possible queue depth. + * So, we cap these to a reasonable value. + */ +static const unsigned int kyber_depth[] = { + [KYBER_READ] = 256, + [KYBER_SYNC_WRITE] = 128, + [KYBER_OTHER] = 64, +}; + +/* + * Scheduling domain batch sizes. We favor reads. + */ +static const unsigned int kyber_batch_size[] = { + [KYBER_READ] = 16, + [KYBER_SYNC_WRITE] = 8, + [KYBER_OTHER] = 8, +}; + +struct kyber_queue_data { + struct request_queue *q; + + struct blk_stat_callback *cb; + + /* + * The device is divided into multiple scheduling domains based on the + * request type. Each domain has a fixed number of in-flight requests of + * that type device-wide, limited by these tokens. + */ + struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS]; + + /* + * Async request percentage, converted to per-word depth for + * sbitmap_get_shallow(). + */ + unsigned int async_depth; + + /* Target latencies in nanoseconds. */ + u64 read_lat_nsec, write_lat_nsec; +}; + +struct kyber_hctx_data { + spinlock_t lock; + struct list_head rqs[KYBER_NUM_DOMAINS]; + unsigned int cur_domain; + unsigned int batching; + wait_queue_t domain_wait[KYBER_NUM_DOMAINS]; + atomic_t wait_index[KYBER_NUM_DOMAINS]; +}; + +static int rq_sched_domain(const struct request *rq) +{ + unsigned int op = rq->cmd_flags; + + if ((op & REQ_OP_MASK) == REQ_OP_READ) + return KYBER_READ; + else if ((op & REQ_OP_MASK) == REQ_OP_WRITE && op_is_sync(op)) + return KYBER_SYNC_WRITE; + else + return KYBER_OTHER; +} + +enum { + NONE = 0, + GOOD = 1, + GREAT = 2, + BAD = -1, + AWFUL = -2, +}; + +#define IS_GOOD(status) ((status) > 0) +#define IS_BAD(status) ((status) < 0) + +static int kyber_lat_status(struct blk_stat_callback *cb, + unsigned int sched_domain, u64 target) +{ + u64 latency; + + if (!cb->stat[sched_domain].nr_samples) + return NONE; + + latency = cb->stat[sched_domain].mean; + if (latency >= 2 * target) + return AWFUL; + else if (latency > target) + return BAD; + else if (latency <= target / 2) + return GREAT; + else /* (latency <= target) */ + return GOOD; +} + +/* + * Adjust the read or synchronous write depth given the status of reads and + * writes. The goal is that the latencies of the two domains are fair (i.e., if + * one is good, then the other is good). + */ +static void kyber_adjust_rw_depth(struct kyber_queue_data *kqd, + unsigned int sched_domain, int this_status, + int other_status) +{ + unsigned int orig_depth, depth; + + /* + * If this domain had no samples, or reads and writes are both good or + * both bad, don't adjust the depth. + */ + if (this_status == NONE || + (IS_GOOD(this_status) && IS_GOOD(other_status)) || + (IS_BAD(this_status) && IS_BAD(other_status))) + return; + + orig_depth = depth = kqd->domain_tokens[sched_domain].sb.depth; + + if (other_status == NONE) { + depth++; + } else { + switch (this_status) { + case GOOD: + if (other_status == AWFUL) + depth -= max(depth / 4, 1U); + else + depth -= max(depth / 8, 1U); + break; + case GREAT: + if (other_status == AWFUL) + depth /= 2; + else + depth -= max(depth / 4, 1U); + break; + case BAD: + depth++; + break; + case AWFUL: + if (other_status == GREAT) + depth += 2; + else + depth++; + break; + } + } + + depth = clamp(depth, 1U, kyber_depth[sched_domain]); + if (depth != orig_depth) + sbitmap_queue_resize(&kqd->domain_tokens[sched_domain], depth); +} + +/* + * Adjust the depth of other requests given the status of reads and synchronous + * writes. As long as either domain is doing fine, we don't throttle, but if + * both domains are doing badly, we throttle heavily. + */ +static void kyber_adjust_other_depth(struct kyber_queue_data *kqd, + int read_status, int write_status, + bool have_samples) +{ + unsigned int orig_depth, depth; + int status; + + orig_depth = depth = kqd->domain_tokens[KYBER_OTHER].sb.depth; + + if (read_status == NONE && write_status == NONE) { + depth += 2; + } else if (have_samples) { + if (read_status == NONE) + status = write_status; + else if (write_status == NONE) + status = read_status; + else + status = max(read_status, write_status); + switch (status) { + case GREAT: + depth += 2; + break; + case GOOD: + depth++; + break; + case BAD: + depth -= max(depth / 4, 1U); + break; + case AWFUL: + depth /= 2; + break; + } + } + + depth = clamp(depth, 1U, kyber_depth[KYBER_OTHER]); + if (depth != orig_depth) + sbitmap_queue_resize(&kqd->domain_tokens[KYBER_OTHER], depth); +} + +/* + * Apply heuristics for limiting queue depths based on gathered latency + * statistics. + */ +static void kyber_stat_timer_fn(struct blk_stat_callback *cb) +{ + struct kyber_queue_data *kqd = cb->data; + int read_status, write_status; + + read_status = kyber_lat_status(cb, KYBER_READ, kqd->read_lat_nsec); + write_status = kyber_lat_status(cb, KYBER_SYNC_WRITE, kqd->write_lat_nsec); + + kyber_adjust_rw_depth(kqd, KYBER_READ, read_status, write_status); + kyber_adjust_rw_depth(kqd, KYBER_SYNC_WRITE, write_status, read_status); + kyber_adjust_other_depth(kqd, read_status, write_status, + cb->stat[KYBER_OTHER].nr_samples != 0); + + /* + * Continue monitoring latencies if we aren't hitting the targets or + * we're still throttling other requests. + */ + if (!blk_stat_is_active(kqd->cb) && + ((IS_BAD(read_status) || IS_BAD(write_status) || + kqd->domain_tokens[KYBER_OTHER].sb.depth < kyber_depth[KYBER_OTHER]))) + blk_stat_activate_msecs(kqd->cb, 100); +} + +static unsigned int kyber_sched_tags_shift(struct kyber_queue_data *kqd) +{ + /* + * All of the hardware queues have the same depth, so we can just grab + * the shift of the first one. + */ + return kqd->q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift; +} + +static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q) +{ + struct kyber_queue_data *kqd; + unsigned int max_tokens; + unsigned int shift; + int ret = -ENOMEM; + int i; + + kqd = kmalloc_node(sizeof(*kqd), GFP_KERNEL, q->node); + if (!kqd) + goto err; + kqd->q = q; + + kqd->cb = blk_stat_alloc_callback(kyber_stat_timer_fn, rq_sched_domain, + KYBER_NUM_DOMAINS, kqd); + if (!kqd->cb) + goto err_kqd; + + /* + * The maximum number of tokens for any scheduling domain is at least + * the queue depth of a single hardware queue. If the hardware doesn't + * have many tags, still provide a reasonable number. + */ + max_tokens = max_t(unsigned int, q->tag_set->queue_depth, + KYBER_MIN_DEPTH); + for (i = 0; i < KYBER_NUM_DOMAINS; i++) { + WARN_ON(!kyber_depth[i]); + WARN_ON(!kyber_batch_size[i]); + ret = sbitmap_queue_init_node(&kqd->domain_tokens[i], + max_tokens, -1, false, GFP_KERNEL, + q->node); + if (ret) { + while (--i >= 0) + sbitmap_queue_free(&kqd->domain_tokens[i]); + goto err_cb; + } + sbitmap_queue_resize(&kqd->domain_tokens[i], kyber_depth[i]); + } + + shift = kyber_sched_tags_shift(kqd); + kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U; + + kqd->read_lat_nsec = 2000000ULL; + kqd->write_lat_nsec = 10000000ULL; + + return kqd; + +err_cb: + blk_stat_free_callback(kqd->cb); +err_kqd: + kfree(kqd); +err: + return ERR_PTR(ret); +} + +static int kyber_init_sched(struct request_queue *q, struct elevator_type *e) +{ + struct kyber_queue_data *kqd; + struct elevator_queue *eq; + + eq = elevator_alloc(q, e); + if (!eq) + return -ENOMEM; + + kqd = kyber_queue_data_alloc(q); + if (IS_ERR(kqd)) { + kobject_put(&eq->kobj); + return PTR_ERR(kqd); + } + + eq->elevator_data = kqd; + q->elevator = eq; + + blk_stat_add_callback(q, kqd->cb); + + return 0; +} + +static void kyber_exit_sched(struct elevator_queue *e) +{ + struct kyber_queue_data *kqd = e->elevator_data; + struct request_queue *q = kqd->q; + int i; + + blk_stat_remove_callback(q, kqd->cb); + + for (i = 0; i < KYBER_NUM_DOMAINS; i++) + sbitmap_queue_free(&kqd->domain_tokens[i]); + blk_stat_free_callback(kqd->cb); + kfree(kqd); +} + +static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx) +{ + struct kyber_hctx_data *khd; + int i; + + khd = kmalloc_node(sizeof(*khd), GFP_KERNEL, hctx->numa_node); + if (!khd) + return -ENOMEM; + + spin_lock_init(&khd->lock); + + for (i = 0; i < KYBER_NUM_DOMAINS; i++) { + INIT_LIST_HEAD(&khd->rqs[i]); + INIT_LIST_HEAD(&khd->domain_wait[i].task_list); + atomic_set(&khd->wait_index[i], 0); + } + + khd->cur_domain = 0; + khd->batching = 0; + + hctx->sched_data = khd; + + return 0; +} + +static void kyber_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx) +{ + kfree(hctx->sched_data); +} + +static int rq_get_domain_token(struct request *rq) +{ + return (long)rq->elv.priv[0]; +} + +static void rq_set_domain_token(struct request *rq, int token) +{ + rq->elv.priv[0] = (void *)(long)token; +} + +static void rq_clear_domain_token(struct kyber_queue_data *kqd, + struct request *rq) +{ + unsigned int sched_domain; + int nr; + + nr = rq_get_domain_token(rq); + if (nr != -1) { + sched_domain = rq_sched_domain(rq); + sbitmap_queue_clear(&kqd->domain_tokens[sched_domain], nr, + rq->mq_ctx->cpu); + } +} + +static struct request *kyber_get_request(struct request_queue *q, + unsigned int op, + struct blk_mq_alloc_data *data) +{ + struct kyber_queue_data *kqd = q->elevator->elevator_data; + struct request *rq; + + /* + * We use the scheduler tags as per-hardware queue queueing tokens. + * Async requests can be limited at this stage. + */ + if (!op_is_sync(op)) + data->shallow_depth = kqd->async_depth; + + rq = __blk_mq_alloc_request(data, op); + if (rq) + rq_set_domain_token(rq, -1); + return rq; +} + +static void kyber_put_request(struct request *rq) +{ + struct request_queue *q = rq->q; + struct kyber_queue_data *kqd = q->elevator->elevator_data; + + rq_clear_domain_token(kqd, rq); + blk_mq_finish_request(rq); +} + +static void kyber_completed_request(struct request *rq) +{ + struct request_queue *q = rq->q; + struct kyber_queue_data *kqd = q->elevator->elevator_data; + unsigned int sched_domain; + u64 now, latency, target; + + /* + * Check if this request met our latency goal. If not, quickly gather + * some statistics and start throttling. + */ + sched_domain = rq_sched_domain(rq); + switch (sched_domain) { + case KYBER_READ: + target = kqd->read_lat_nsec; + break; + case KYBER_SYNC_WRITE: + target = kqd->write_lat_nsec; + break; + default: + return; + } + + /* If we are already monitoring latencies, don't check again. */ + if (blk_stat_is_active(kqd->cb)) + return; + + now = __blk_stat_time(ktime_to_ns(ktime_get())); + if (now < blk_stat_time(&rq->issue_stat)) + return; + + latency = now - blk_stat_time(&rq->issue_stat); + + if (latency > target) + blk_stat_activate_msecs(kqd->cb, 10); +} + +static void kyber_flush_busy_ctxs(struct kyber_hctx_data *khd, + struct blk_mq_hw_ctx *hctx) +{ + LIST_HEAD(rq_list); + struct request *rq, *next; + + blk_mq_flush_busy_ctxs(hctx, &rq_list); + list_for_each_entry_safe(rq, next, &rq_list, queuelist) { + unsigned int sched_domain; + + sched_domain = rq_sched_domain(rq); + list_move_tail(&rq->queuelist, &khd->rqs[sched_domain]); + } +} + +static int kyber_domain_wake(wait_queue_t *wait, unsigned mode, int flags, + void *key) +{ + struct blk_mq_hw_ctx *hctx = READ_ONCE(wait->private); + + list_del_init(&wait->task_list); + blk_mq_run_hw_queue(hctx, true); + return 1; +} + +static int kyber_get_domain_token(struct kyber_queue_data *kqd, + struct kyber_hctx_data *khd, + struct blk_mq_hw_ctx *hctx) +{ + unsigned int sched_domain = khd->cur_domain; + struct sbitmap_queue *domain_tokens = &kqd->domain_tokens[sched_domain]; + wait_queue_t *wait = &khd->domain_wait[sched_domain]; + struct sbq_wait_state *ws; + int nr; + + nr = __sbitmap_queue_get(domain_tokens); + if (nr >= 0) + return nr; + + /* + * If we failed to get a domain token, make sure the hardware queue is + * run when one becomes available. Note that this is serialized on + * khd->lock, but we still need to be careful about the waker. + */ + if (list_empty_careful(&wait->task_list)) { + init_waitqueue_func_entry(wait, kyber_domain_wake); + wait->private = hctx; + ws = sbq_wait_ptr(domain_tokens, + &khd->wait_index[sched_domain]); + add_wait_queue(&ws->wait, wait); + + /* + * Try again in case a token was freed before we got on the wait + * queue. + */ + nr = __sbitmap_queue_get(domain_tokens); + } + return nr; +} + +static struct request * +kyber_dispatch_cur_domain(struct kyber_queue_data *kqd, + struct kyber_hctx_data *khd, + struct blk_mq_hw_ctx *hctx, + bool *flushed) +{ + struct list_head *rqs; + struct request *rq; + int nr; + + rqs = &khd->rqs[khd->cur_domain]; + rq = list_first_entry_or_null(rqs, struct request, queuelist); + + /* + * If there wasn't already a pending request and we haven't flushed the + * software queues yet, flush the software queues and check again. + */ + if (!rq && !*flushed) { + kyber_flush_busy_ctxs(khd, hctx); + *flushed = true; + rq = list_first_entry_or_null(rqs, struct request, queuelist); + } + + if (rq) { + nr = kyber_get_domain_token(kqd, khd, hctx); + if (nr >= 0) { + khd->batching++; + rq_set_domain_token(rq, nr); + list_del_init(&rq->queuelist); + return rq; + } + } + + /* There were either no pending requests or no tokens. */ + return NULL; +} + +static struct request *kyber_dispatch_request(struct blk_mq_hw_ctx *hctx) +{ + struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data; + struct kyber_hctx_data *khd = hctx->sched_data; + bool flushed = false; + struct request *rq; + int i; + + spin_lock(&khd->lock); + + /* + * First, if we are still entitled to batch, try to dispatch a request + * from the batch. + */ + if (khd->batching < kyber_batch_size[khd->cur_domain]) { + rq = kyber_dispatch_cur_domain(kqd, khd, hctx, &flushed); + if (rq) + goto out; + } + + /* + * Either, + * 1. We were no longer entitled to a batch. + * 2. The domain we were batching didn't have any requests. + * 3. The domain we were batching was out of tokens. + * + * Start another batch. Note that this wraps back around to the original + * domain if no other domains have requests or tokens. + */ + khd->batching = 0; + for (i = 0; i < KYBER_NUM_DOMAINS; i++) { + if (khd->cur_domain == KYBER_NUM_DOMAINS - 1) + khd->cur_domain = 0; + else + khd->cur_domain++; + + rq = kyber_dispatch_cur_domain(kqd, khd, hctx, &flushed); + if (rq) + goto out; + } + + rq = NULL; +out: + spin_unlock(&khd->lock); + return rq; +} + +static bool kyber_has_work(struct blk_mq_hw_ctx *hctx) +{ + struct kyber_hctx_data *khd = hctx->sched_data; + int i; + + for (i = 0; i < KYBER_NUM_DOMAINS; i++) { + if (!list_empty_careful(&khd->rqs[i])) + return true; + } + return false; +} + +#define KYBER_LAT_SHOW_STORE(op) \ +static ssize_t kyber_##op##_lat_show(struct elevator_queue *e, \ + char *page) \ +{ \ + struct kyber_queue_data *kqd = e->elevator_data; \ + \ + return sprintf(page, "%llu\n", kqd->op##_lat_nsec); \ +} \ + \ +static ssize_t kyber_##op##_lat_store(struct elevator_queue *e, \ + const char *page, size_t count) \ +{ \ + struct kyber_queue_data *kqd = e->elevator_data; \ + unsigned long long nsec; \ + int ret; \ + \ + ret = kstrtoull(page, 10, &nsec); \ + if (ret) \ + return ret; \ + \ + kqd->op##_lat_nsec = nsec; \ + \ + return count; \ +} +KYBER_LAT_SHOW_STORE(read); +KYBER_LAT_SHOW_STORE(write); +#undef KYBER_LAT_SHOW_STORE + +#define KYBER_LAT_ATTR(op) __ATTR(op##_lat_nsec, 0644, kyber_##op##_lat_show, kyber_##op##_lat_store) +static struct elv_fs_entry kyber_sched_attrs[] = { + KYBER_LAT_ATTR(read), + KYBER_LAT_ATTR(write), + __ATTR_NULL +}; +#undef KYBER_LAT_ATTR + +static struct elevator_type kyber_sched = { + .ops.mq = { + .init_sched = kyber_init_sched, + .exit_sched = kyber_exit_sched, + .init_hctx = kyber_init_hctx, + .exit_hctx = kyber_exit_hctx, + .get_request = kyber_get_request, + .put_request = kyber_put_request, + .completed_request = kyber_completed_request, + .dispatch_request = kyber_dispatch_request, + .has_work = kyber_has_work, + }, + .uses_mq = true, + .elevator_attrs = kyber_sched_attrs, + .elevator_name = "kyber", + .elevator_owner = THIS_MODULE, +}; + +static int __init kyber_init(void) +{ + return elv_register(&kyber_sched); +} + +static void __exit kyber_exit(void) +{ + elv_unregister(&kyber_sched); +} + +module_init(kyber_init); +module_exit(kyber_exit); + +MODULE_AUTHOR("Omar Sandoval"); +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("Kyber I/O scheduler"); diff --git a/block/partition-generic.c b/block/partition-generic.c index 7afb9907821f..0171a2faad68 100644 --- a/block/partition-generic.c +++ b/block/partition-generic.c @@ -497,7 +497,6 @@ rescan: if (disk->fops->revalidate_disk) disk->fops->revalidate_disk(disk); - blk_integrity_revalidate(disk); check_disk_size_change(disk, bdev); bdev->bd_invalidated = 0; if (!get_capacity(disk) || !(state = check_partition(disk, bdev))) diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c index 2a2fc768b27a..4a294a5f7fab 100644 --- a/block/scsi_ioctl.c +++ b/block/scsi_ioctl.c @@ -262,11 +262,11 @@ static int blk_complete_sghdr_rq(struct request *rq, struct sg_io_hdr *hdr, /* * fill in all the output members */ - hdr->status = rq->errors & 0xff; - hdr->masked_status = status_byte(rq->errors); - hdr->msg_status = msg_byte(rq->errors); - hdr->host_status = host_byte(rq->errors); - hdr->driver_status = driver_byte(rq->errors); + hdr->status = req->result & 0xff; + hdr->masked_status = status_byte(req->result); + hdr->msg_status = msg_byte(req->result); + hdr->host_status = host_byte(req->result); + hdr->driver_status = driver_byte(req->result); hdr->info = 0; if (hdr->masked_status || hdr->host_status || hdr->driver_status) hdr->info |= SG_INFO_CHECK; @@ -362,7 +362,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk, goto out_free_cdb; bio = rq->bio; - rq->retries = 0; + req->retries = 0; start_time = jiffies; @@ -476,13 +476,13 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode, goto error; /* default. possible overriden later */ - rq->retries = 5; + req->retries = 5; switch (opcode) { case SEND_DIAGNOSTIC: case FORMAT_UNIT: rq->timeout = FORMAT_UNIT_TIMEOUT; - rq->retries = 1; + req->retries = 1; break; case START_STOP: rq->timeout = START_STOP_TIMEOUT; @@ -495,7 +495,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode, break; case READ_DEFECT_DATA: rq->timeout = READ_DEFECT_DATA_TIMEOUT; - rq->retries = 1; + req->retries = 1; break; default: rq->timeout = BLK_DEFAULT_SG_TIMEOUT; @@ -509,7 +509,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode, blk_execute_rq(q, disk, rq, 0); - err = rq->errors & 0xff; /* only 8 bit SCSI status */ + err = req->result & 0xff; /* only 8 bit SCSI status */ if (err) { if (req->sense_len && req->sense) { bytes = (OMAX_SB_LEN > req->sense_len) ? @@ -547,7 +547,8 @@ static int __blk_send_generic(struct request_queue *q, struct gendisk *bd_disk, scsi_req(rq)->cmd[0] = cmd; scsi_req(rq)->cmd[4] = data; scsi_req(rq)->cmd_len = 6; - err = blk_execute_rq(q, bd_disk, rq, 0); + blk_execute_rq(q, bd_disk, rq, 0); + err = scsi_req(rq)->result ? -EIO : 0; blk_put_request(rq); return err; diff --git a/block/sed-opal.c b/block/sed-opal.c index 14035f826b5e..9b30ae5ab843 100644 --- a/block/sed-opal.c +++ b/block/sed-opal.c @@ -275,8 +275,8 @@ static bool check_tper(const void *data) u8 flags = tper->supported_features; if (!(flags & TPER_SYNC_SUPPORTED)) { - pr_err("TPer sync not supported. flags = %d\n", - tper->supported_features); + pr_debug("TPer sync not supported. flags = %d\n", + tper->supported_features); return false; } @@ -289,7 +289,7 @@ static bool check_sum(const void *data) u32 nlo = be32_to_cpu(sum->num_locking_objects); if (nlo == 0) { - pr_err("Need at least one locking object.\n"); + pr_debug("Need at least one locking object.\n"); return false; } @@ -385,9 +385,9 @@ static int next(struct opal_dev *dev) error = step->fn(dev, step->data); if (error) { - pr_err("Error on step function: %d with error %d: %s\n", - state, error, - opal_error_to_human(error)); + pr_debug("Error on step function: %d with error %d: %s\n", + state, error, + opal_error_to_human(error)); /* For each OPAL command we do a discovery0 then we * start some sort of session. @@ -419,8 +419,8 @@ static int opal_discovery0_end(struct opal_dev *dev) print_buffer(dev->resp, hlen); if (hlen > IO_BUFFER_LENGTH - sizeof(*hdr)) { - pr_warn("Discovery length overflows buffer (%zu+%u)/%u\n", - sizeof(*hdr), hlen, IO_BUFFER_LENGTH); + pr_debug("Discovery length overflows buffer (%zu+%u)/%u\n", + sizeof(*hdr), hlen, IO_BUFFER_LENGTH); return -EFAULT; } @@ -503,7 +503,7 @@ static void add_token_u8(int *err, struct opal_dev *cmd, u8 tok) if (*err) return; if (cmd->pos >= IO_BUFFER_LENGTH - 1) { - pr_err("Error adding u8: end of buffer.\n"); + pr_debug("Error adding u8: end of buffer.\n"); *err = -ERANGE; return; } @@ -553,7 +553,7 @@ static void add_token_u64(int *err, struct opal_dev *cmd, u64 number) len = DIV_ROUND_UP(msb, 4); if (cmd->pos >= IO_BUFFER_LENGTH - len - 1) { - pr_err("Error adding u64: end of buffer.\n"); + pr_debug("Error adding u64: end of buffer.\n"); *err = -ERANGE; return; } @@ -579,7 +579,7 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd, } if (len >= IO_BUFFER_LENGTH - cmd->pos - header_len) { - pr_err("Error adding bytestring: end of buffer.\n"); + pr_debug("Error adding bytestring: end of buffer.\n"); *err = -ERANGE; return; } @@ -597,7 +597,7 @@ static void add_token_bytestring(int *err, struct opal_dev *cmd, static int build_locking_range(u8 *buffer, size_t length, u8 lr) { if (length > OPAL_UID_LENGTH) { - pr_err("Can't build locking range. Length OOB\n"); + pr_debug("Can't build locking range. Length OOB\n"); return -ERANGE; } @@ -614,7 +614,7 @@ static int build_locking_range(u8 *buffer, size_t length, u8 lr) static int build_locking_user(u8 *buffer, size_t length, u8 lr) { if (length > OPAL_UID_LENGTH) { - pr_err("Can't build locking range user, Length OOB\n"); + pr_debug("Can't build locking range user, Length OOB\n"); return -ERANGE; } @@ -648,7 +648,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn) add_token_u8(&err, cmd, OPAL_ENDLIST); if (err) { - pr_err("Error finalizing command.\n"); + pr_debug("Error finalizing command.\n"); return -EFAULT; } @@ -660,7 +660,7 @@ static int cmd_finalize(struct opal_dev *cmd, u32 hsn, u32 tsn) hdr->subpkt.length = cpu_to_be32(cmd->pos - sizeof(*hdr)); while (cmd->pos % 4) { if (cmd->pos >= IO_BUFFER_LENGTH) { - pr_err("Error: Buffer overrun\n"); + pr_debug("Error: Buffer overrun\n"); return -ERANGE; } cmd->cmd[cmd->pos++] = 0; @@ -679,14 +679,14 @@ static const struct opal_resp_tok *response_get_token( const struct opal_resp_tok *tok; if (n >= resp->num) { - pr_err("Token number doesn't exist: %d, resp: %d\n", - n, resp->num); + pr_debug("Token number doesn't exist: %d, resp: %d\n", + n, resp->num); return ERR_PTR(-EINVAL); } tok = &resp->toks[n]; if (tok->len == 0) { - pr_err("Token length must be non-zero\n"); + pr_debug("Token length must be non-zero\n"); return ERR_PTR(-EINVAL); } @@ -727,7 +727,7 @@ static ssize_t response_parse_short(struct opal_resp_tok *tok, tok->type = OPAL_DTA_TOKENID_UINT; if (tok->len > 9) { - pr_warn("uint64 with more than 8 bytes\n"); + pr_debug("uint64 with more than 8 bytes\n"); return -EINVAL; } for (i = tok->len - 1; i > 0; i--) { @@ -814,8 +814,8 @@ static int response_parse(const u8 *buf, size_t length, if (clen == 0 || plen == 0 || slen == 0 || slen > IO_BUFFER_LENGTH - sizeof(*hdr)) { - pr_err("Bad header length. cp: %u, pkt: %u, subpkt: %u\n", - clen, plen, slen); + pr_debug("Bad header length. cp: %u, pkt: %u, subpkt: %u\n", + clen, plen, slen); print_buffer(pos, sizeof(*hdr)); return -EINVAL; } @@ -848,7 +848,7 @@ static int response_parse(const u8 *buf, size_t length, } if (num_entries == 0) { - pr_err("Couldn't parse response.\n"); + pr_debug("Couldn't parse response.\n"); return -EINVAL; } resp->num = num_entries; @@ -861,18 +861,18 @@ static size_t response_get_string(const struct parsed_resp *resp, int n, { *store = NULL; if (!resp) { - pr_err("Response is NULL\n"); + pr_debug("Response is NULL\n"); return 0; } if (n > resp->num) { - pr_err("Response has %d tokens. Can't access %d\n", - resp->num, n); + pr_debug("Response has %d tokens. Can't access %d\n", + resp->num, n); return 0; } if (resp->toks[n].type != OPAL_DTA_TOKENID_BYTESTRING) { - pr_err("Token is not a byte string!\n"); + pr_debug("Token is not a byte string!\n"); return 0; } @@ -883,26 +883,26 @@ static size_t response_get_string(const struct parsed_resp *resp, int n, static u64 response_get_u64(const struct parsed_resp *resp, int n) { if (!resp) { - pr_err("Response is NULL\n"); + pr_debug("Response is NULL\n"); return 0; } if (n > resp->num) { - pr_err("Response has %d tokens. Can't access %d\n", - resp->num, n); + pr_debug("Response has %d tokens. Can't access %d\n", + resp->num, n); return 0; } if (resp->toks[n].type != OPAL_DTA_TOKENID_UINT) { - pr_err("Token is not unsigned it: %d\n", - resp->toks[n].type); + pr_debug("Token is not unsigned it: %d\n", + resp->toks[n].type); return 0; } if (!(resp->toks[n].width == OPAL_WIDTH_TINY || resp->toks[n].width == OPAL_WIDTH_SHORT)) { - pr_err("Atom is not short or tiny: %d\n", - resp->toks[n].width); + pr_debug("Atom is not short or tiny: %d\n", + resp->toks[n].width); return 0; } @@ -949,7 +949,7 @@ static int parse_and_check_status(struct opal_dev *dev) error = response_parse(dev->resp, IO_BUFFER_LENGTH, &dev->parsed); if (error) { - pr_err("Couldn't parse response.\n"); + pr_debug("Couldn't parse response.\n"); return error; } @@ -975,7 +975,7 @@ static int start_opal_session_cont(struct opal_dev *dev) tsn = response_get_u64(&dev->parsed, 5); if (hsn == 0 && tsn == 0) { - pr_err("Couldn't authenticate session\n"); + pr_debug("Couldn't authenticate session\n"); return -EPERM; } @@ -1012,7 +1012,7 @@ static int finalize_and_send(struct opal_dev *dev, cont_fn cont) ret = cmd_finalize(dev, dev->hsn, dev->tsn); if (ret) { - pr_err("Error finalizing command buffer: %d\n", ret); + pr_debug("Error finalizing command buffer: %d\n", ret); return ret; } @@ -1041,7 +1041,7 @@ static int gen_key(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error building gen key command\n"); + pr_debug("Error building gen key command\n"); return err; } @@ -1059,8 +1059,8 @@ static int get_active_key_cont(struct opal_dev *dev) return error; keylen = response_get_string(&dev->parsed, 4, &activekey); if (!activekey) { - pr_err("%s: Couldn't extract the Activekey from the response\n", - __func__); + pr_debug("%s: Couldn't extract the Activekey from the response\n", + __func__); return OPAL_INVAL_PARAM; } dev->prev_data = kmemdup(activekey, keylen, GFP_KERNEL); @@ -1103,7 +1103,7 @@ static int get_active_key(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error building get active key command\n"); + pr_debug("Error building get active key command\n"); return err; } @@ -1159,7 +1159,7 @@ static inline int enable_global_lr(struct opal_dev *dev, u8 *uid, err = generic_lr_enable_disable(dev, uid, !!setup->RLE, !!setup->WLE, 0, 0); if (err) - pr_err("Failed to create enable global lr command\n"); + pr_debug("Failed to create enable global lr command\n"); return err; } @@ -1217,7 +1217,7 @@ static int setup_locking_range(struct opal_dev *dev, void *data) } if (err) { - pr_err("Error building Setup Locking range command.\n"); + pr_debug("Error building Setup Locking range command.\n"); return err; } @@ -1234,11 +1234,8 @@ static int start_generic_opal_session(struct opal_dev *dev, u32 hsn; int err = 0; - if (key == NULL && auth != OPAL_ANYBODY_UID) { - pr_err("%s: Attempted to open ADMIN_SP Session without a Host" \ - "Challenge, and not as the Anybody UID\n", __func__); + if (key == NULL && auth != OPAL_ANYBODY_UID) return OPAL_INVAL_PARAM; - } clear_opal_cmd(dev); @@ -1273,12 +1270,12 @@ static int start_generic_opal_session(struct opal_dev *dev, add_token_u8(&err, dev, OPAL_ENDLIST); break; default: - pr_err("Cannot start Admin SP session with auth %d\n", auth); + pr_debug("Cannot start Admin SP session with auth %d\n", auth); return OPAL_INVAL_PARAM; } if (err) { - pr_err("Error building start adminsp session command.\n"); + pr_debug("Error building start adminsp session command.\n"); return err; } @@ -1369,7 +1366,7 @@ static int start_auth_opal_session(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error building STARTSESSION command.\n"); + pr_debug("Error building STARTSESSION command.\n"); return err; } @@ -1391,7 +1388,7 @@ static int revert_tper(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_STARTLIST); add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error building REVERT TPER command.\n"); + pr_debug("Error building REVERT TPER command.\n"); return err; } @@ -1426,7 +1423,7 @@ static int internal_activate_user(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error building Activate UserN command.\n"); + pr_debug("Error building Activate UserN command.\n"); return err; } @@ -1453,7 +1450,7 @@ static int erase_locking_range(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error building Erase Locking Range Command.\n"); + pr_debug("Error building Erase Locking Range Command.\n"); return err; } return finalize_and_send(dev, parse_and_check_status); @@ -1484,7 +1481,7 @@ static int set_mbr_done(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error Building set MBR Done command\n"); + pr_debug("Error Building set MBR Done command\n"); return err; } @@ -1516,7 +1513,7 @@ static int set_mbr_enable_disable(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error Building set MBR done command\n"); + pr_debug("Error Building set MBR done command\n"); return err; } @@ -1567,7 +1564,7 @@ static int set_new_pw(struct opal_dev *dev, void *data) if (generic_pw_cmd(usr->opal_key.key, usr->opal_key.key_len, cpin_uid, dev)) { - pr_err("Error building set password command.\n"); + pr_debug("Error building set password command.\n"); return -ERANGE; } @@ -1582,7 +1579,7 @@ static int set_sid_cpin_pin(struct opal_dev *dev, void *data) memcpy(cpin_uid, opaluid[OPAL_C_PIN_SID], OPAL_UID_LENGTH); if (generic_pw_cmd(key->key, key->key_len, cpin_uid, dev)) { - pr_err("Error building Set SID cpin\n"); + pr_debug("Error building Set SID cpin\n"); return -ERANGE; } return finalize_and_send(dev, parse_and_check_status); @@ -1657,7 +1654,7 @@ static int add_user_to_lr(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error building add user to locking range command.\n"); + pr_debug("Error building add user to locking range command.\n"); return err; } @@ -1691,7 +1688,7 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data) /* vars are initalized to locked */ break; default: - pr_err("Tried to set an invalid locking state... returning to uland\n"); + pr_debug("Tried to set an invalid locking state... returning to uland\n"); return OPAL_INVAL_PARAM; } @@ -1718,7 +1715,7 @@ static int lock_unlock_locking_range(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error building SET command.\n"); + pr_debug("Error building SET command.\n"); return err; } return finalize_and_send(dev, parse_and_check_status); @@ -1752,14 +1749,14 @@ static int lock_unlock_locking_range_sum(struct opal_dev *dev, void *data) /* vars are initalized to locked */ break; default: - pr_err("Tried to set an invalid locking state.\n"); + pr_debug("Tried to set an invalid locking state.\n"); return OPAL_INVAL_PARAM; } ret = generic_lr_enable_disable(dev, lr_buffer, 1, 1, read_locked, write_locked); if (ret < 0) { - pr_err("Error building SET command.\n"); + pr_debug("Error building SET command.\n"); return ret; } return finalize_and_send(dev, parse_and_check_status); @@ -1811,7 +1808,7 @@ static int activate_lsp(struct opal_dev *dev, void *data) } if (err) { - pr_err("Error building Activate LockingSP command.\n"); + pr_debug("Error building Activate LockingSP command.\n"); return err; } @@ -1831,7 +1828,7 @@ static int get_lsp_lifecycle_cont(struct opal_dev *dev) /* 0x08 is Manufacured Inactive */ /* 0x09 is Manufactured */ if (lc_status != OPAL_MANUFACTURED_INACTIVE) { - pr_err("Couldn't determine the status of the Lifcycle state\n"); + pr_debug("Couldn't determine the status of the Lifecycle state\n"); return -ENODEV; } @@ -1868,7 +1865,7 @@ static int get_lsp_lifecycle(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error Building GET Lifecycle Status command\n"); + pr_debug("Error Building GET Lifecycle Status command\n"); return err; } @@ -1887,7 +1884,7 @@ static int get_msid_cpin_pin_cont(struct opal_dev *dev) strlen = response_get_string(&dev->parsed, 4, &msid_pin); if (!msid_pin) { - pr_err("%s: Couldn't extract PIN from response\n", __func__); + pr_debug("%s: Couldn't extract PIN from response\n", __func__); return OPAL_INVAL_PARAM; } @@ -1929,7 +1926,7 @@ static int get_msid_cpin_pin(struct opal_dev *dev, void *data) add_token_u8(&err, dev, OPAL_ENDLIST); if (err) { - pr_err("Error building Get MSID CPIN PIN command.\n"); + pr_debug("Error building Get MSID CPIN PIN command.\n"); return err; } @@ -2124,18 +2121,18 @@ static int opal_add_user_to_lr(struct opal_dev *dev, if (lk_unlk->l_state != OPAL_RO && lk_unlk->l_state != OPAL_RW) { - pr_err("Locking state was not RO or RW\n"); + pr_debug("Locking state was not RO or RW\n"); return -EINVAL; } if (lk_unlk->session.who < OPAL_USER1 || lk_unlk->session.who > OPAL_USER9) { - pr_err("Authority was not within the range of users: %d\n", - lk_unlk->session.who); + pr_debug("Authority was not within the range of users: %d\n", + lk_unlk->session.who); return -EINVAL; } if (lk_unlk->session.sum) { - pr_err("%s not supported in sum. Use setup locking range\n", - __func__); + pr_debug("%s not supported in sum. Use setup locking range\n", + __func__); return -EINVAL; } @@ -2312,7 +2309,7 @@ static int opal_activate_user(struct opal_dev *dev, /* We can't activate Admin1 it's active as manufactured */ if (opal_session->who < OPAL_USER1 || opal_session->who > OPAL_USER9) { - pr_err("Who was not a valid user: %d\n", opal_session->who); + pr_debug("Who was not a valid user: %d\n", opal_session->who); return -EINVAL; } @@ -2343,9 +2340,9 @@ bool opal_unlock_from_suspend(struct opal_dev *dev) ret = __opal_lock_unlock(dev, &suspend->unlk); if (ret) { - pr_warn("Failed to unlock LR %hhu with sum %d\n", - suspend->unlk.session.opal_key.lr, - suspend->unlk.session.sum); + pr_debug("Failed to unlock LR %hhu with sum %d\n", + suspend->unlk.session.opal_key.lr, + suspend->unlk.session.sum); was_failure = true; } } @@ -2363,10 +2360,8 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg) return -EACCES; if (!dev) return -ENOTSUPP; - if (!dev->supported) { - pr_err("Not supported\n"); + if (!dev->supported) return -ENOTSUPP; - } p = memdup_user(arg, _IOC_SIZE(cmd)); if (IS_ERR(p)) @@ -2410,7 +2405,7 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, void __user *arg) ret = opal_secure_erase_locking_range(dev, p); break; default: - pr_warn("No such Opal Ioctl %u\n", cmd); + break; } kfree(p); diff --git a/block/t10-pi.c b/block/t10-pi.c index 2c97912335a9..680c6d636298 100644 --- a/block/t10-pi.c +++ b/block/t10-pi.c @@ -160,28 +160,28 @@ static int t10_pi_type3_verify_ip(struct blk_integrity_iter *iter) return t10_pi_verify(iter, t10_pi_ip_fn, 3); } -struct blk_integrity_profile t10_pi_type1_crc = { +const struct blk_integrity_profile t10_pi_type1_crc = { .name = "T10-DIF-TYPE1-CRC", .generate_fn = t10_pi_type1_generate_crc, .verify_fn = t10_pi_type1_verify_crc, }; EXPORT_SYMBOL(t10_pi_type1_crc); -struct blk_integrity_profile t10_pi_type1_ip = { +const struct blk_integrity_profile t10_pi_type1_ip = { .name = "T10-DIF-TYPE1-IP", .generate_fn = t10_pi_type1_generate_ip, .verify_fn = t10_pi_type1_verify_ip, }; EXPORT_SYMBOL(t10_pi_type1_ip); -struct blk_integrity_profile t10_pi_type3_crc = { +const struct blk_integrity_profile t10_pi_type3_crc = { .name = "T10-DIF-TYPE3-CRC", .generate_fn = t10_pi_type3_generate_crc, .verify_fn = t10_pi_type3_verify_crc, }; EXPORT_SYMBOL(t10_pi_type3_crc); -struct blk_integrity_profile t10_pi_type3_ip = { +const struct blk_integrity_profile t10_pi_type3_ip = { .name = "T10-DIF-TYPE3-IP", .generate_fn = t10_pi_type3_generate_ip, .verify_fn = t10_pi_type3_verify_ip, |