summaryrefslogtreecommitdiffstats
path: root/block
Commit message (Collapse)AuthorAgeFilesLines
* block: avoid blk_bio_segment_split for small I/O operationsChristoph Hellwig2019-11-041-1/+15
| | | | | | | | | | | __blk_queue_split() adds significant overhead for small I/O operations. Add a shortcut to avoid it for cases where we know we never need to split. Based on a patch from Ming Lei. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: make sure that line break can be printedMing Lei2019-11-041-1/+1
| | | | | | | | | | | 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") avoids sysfs buffer overflow, and reserves one character for line break. However, the last snprintf() doesn't get correct 'size' parameter passed in, so fixed it. Fixes: 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: sed-opal: Introduce Opal Datastore UIDRevanth Rajashekar2019-11-042-0/+3
| | | | | | | | | | | This patch introduces Opal Datastore UID. The generic read/write table ioctl can use this UID to access the Opal Datastore. Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: sed-opal: Add support to read/write opal tables genericallyRevanth Rajashekar2019-11-042-1/+172
| | | | | | | | | | | | | | | | | | This feature gives the user RW access to any opal table with admin1 authority. The flags described in the new structure determines if the user wants to read/write the data. Flags are checked for valid values in order to allow future features to be added to the ioctl. The user can provide the desired table's UID. Also, the ioctl provides a size and offset field and internally will loop data accesses to return the full data block. Read overrun is prevented by the initiator's sec_send_recv() backend. The ioctl provides a private field with the intention to accommodate any future expansions to the ioctl. Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: sed-opal: Generalizing write data to any opal tableRevanth Rajashekar2019-11-041-64/+74
| | | | | | | | | | | | This patch refactors the existing "write_shadowmbr" func and creates a new generalized function "generic_table_write_data", to write data to any opal table. Also, a few cleanups are included in this patch. Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: avoid sysfs buffer overflow with too many CPU coresMing Lei2019-11-021-5/+10
| | | | | | | | | | | | | | | | It is reported that sysfs buffer overflow can be triggered if the system has too many CPU cores(>841 on 4K PAGE_SIZE) when showing CPUs of hctx via /sys/block/$DEV/mq/$N/cpu_list. Use snprintf to avoid the potential buffer overflow. This version doesn't change the attribute format, and simply stops showing CPU numbers if the buffer is going to overflow. Cc: stable@vger.kernel.org Fixes: 676141e48af7("blk-mq: don't dump CPU -> hw queue map on driver load") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Make blk_mq_run_hw_queue() return voidJohn Garry2019-11-011-6/+2
| | | | | | | | | | | Since commit 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()"), the return value of blk_mq_run_hw_queue() is never checked, so make it return void, which very marginally simplifies the code. Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: remove needless goto from blk_mq_get_driver_tagAndré Almeida2019-10-251-2/+1
| | | | | | | | | | | | | The only usage of the label "done" is when (rq->tag != -1) at the beginning of the function. Rather than jumping to label, we can just remove this label and execute the code at the "if". Besides that, the code that would be executed after the label "done" is the return of the logical expression (rq->tag != -1) but since we are already inside the if, we now that this is true. Remove the label and replace the goto with the proper result of the label. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Reduce the amount of memory used for tag setsBart Van Assche2019-10-251-17/+30
| | | | | | | | | | | | | | | | | | | | | Instead of allocating an array of size nr_cpu_ids for set->tags, allocate an array of size set->nr_hw_queues. This patch improves behavior that was introduced by commit 868f2f0b7206 ("blk-mq: dynamic h/w context count"). Reallocating tag sets from inside __blk_mq_update_nr_hw_queues() is safe because: - All request queues that share the tag sets are frozen before the tag sets are reallocated. - blk_mq_queue_tag_busy_iter() holds q->q_usage_counter while active and hence is serialized against __blk_mq_update_nr_hw_queues(). Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Reduce the amount of memory required per request queueBart Van Assche2019-10-251-7/+17
| | | | | | | | | | | | | | | Instead of always allocating at least nr_cpu_ids hardware queues per request queue, reallocate q->queue_hw_ctx if it has to grow. This patch improves behavior that was introduced by commit 868f2f0b7206 ("blk-mq: dynamic h/w context count"). Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Remove the synchronize_rcu() call from __blk_mq_update_nr_hw_queues()Bart Van Assche2019-10-251-4/+0
| | | | | | | | | | | | | | | | | | | | | Since the blk_mq_{,un}freeze_queue() calls in __blk_mq_update_nr_hw_queues() already serialize __blk_mq_update_nr_hw_queues() against blk_mq_queue_tag_busy_iter(), the synchronize_rcu() call in __blk_mq_update_nr_hw_queues() is not necessary. Hence remove it. Note: the synchronize_rcu() call in __blk_mq_update_nr_hw_queues() was introduced by commit f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter"). Commit 530ca2c9bd69 ("blk-mq: Allow blocking queue tag iter callbacks") removed the rcu_read_{,un}lock() calls that correspond to the synchronize_rcu() call in __blk_mq_update_nr_hw_queues(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: account statistics for passthrough requestsLogan Gunthorpe2019-10-102-5/+4
| | | | | | | | | | | | | | | | | Presently, passthrough requests are not accounted for because blk_do_io_stat() expressly rejects them. Based on some digging in the history, this doesn't seem like a concious decision but one that evolved from the change from blk_fs_request() to blk_rq_is_passthrough(). To support this, call blk_account_io_start() in blk_execute_rq_nowait() and remove the passthrough check in blk_do_io_stat(). Link: https://lore.kernel.org/linux-block/20191010100526.GA27209@lst.de/ Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-stat: Optimise blk_stat_add()Pavel Begunkov2019-10-071-3/+4
| | | | | | | | | blk_stat_add() calls {get,put}_cpu_ptr() in a loop, which entails overhead of disabling/enabling preemption. The loop is under RCU (i.e.short) anyway, so do get_cpu() in advance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Embed counters into struct mq_inflightPavel Begunkov2019-10-071-7/+6
| | | | | | | | Store inflight counters immediately in struct mq_inflight. That's type-safer and removes extra indirection. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Reuse callback in blk_mq_in_flight*()Pavel Begunkov2019-10-071-18/+3
| | | | | | | | Reuse a more generic callback in both blk_mq_in_flight() and blk_mq_in_flight_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Inline status checkersPavel Begunkov2019-10-072-21/+0
| | | | | | | | blk_mq_request_completed() and blk_mq_request_started() are short, inline it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Reduce sysfs_lock locking inside blk_cleanup_queue()Bart Van Assche2019-10-071-2/+0
| | | | | | | | | | | | | | | Since blk_cleanup_queue() is called after blk_unregister_queue() and since that last function removes all sysfs attributes, serializing any code in blk_cleanup_queue() against sysfs callback methods nor against I/O scheduler changes is necessary. Hence remove the syfs_lock locking calls from the start of blk_cleanup_queue(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Remove "dying" checks from sysfs callbacksBart Van Assche2019-10-073-20/+6
| | | | | | | | | | | | | | | | Block drivers must call del_gendisk() before blk_cleanup_queue(). del_gendisk() calls kobject_del() and kobject_del() waits until any ongoing sysfs callback functions have finished. In other words, the sysfs callback functions won't be called for a queue in the dying state. Hence remove the "dying" checks from the sysfs callback functions. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Remove request_queue.nr_queuesBart Van Assche2019-10-071-3/+3
| | | | | | | | | | | | | | | Commit 897bb0c7f1ea ("blk-mq: Use proper cpumask iterator"; v4.6) removed the last use of request_queue.nr_queues from outside blk_mq_init_allocate_queue(). Remove this member variable to make struct request_queue smaller. This patch does not change any functionality. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Fix three kernel-doc warningsBart Van Assche2019-10-071-6/+2
| | | | | | | | | | | | | | | | | Fix the following kernel-doc warnings: block/t10-pi.c:242: warning: Function parameter or member 'rq' not described in 't10_pi_type3_prepare' block/t10-pi.c:249: warning: Function parameter or member 'rq' not described in 't10_pi_type3_complete' block/t10-pi.c:249: warning: Function parameter or member 'nr_bytes' not described in 't10_pi_type3_complete' Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Fixes: 54d4e6ab91eb ("block: centralize PI remapping logic to the block layer") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: sed-opal: fix sparse warning: convert __be64 dataRandy Dunlap2019-10-031-2/+2
| | | | | | | | | | | | | | | | | | | | | | sparse warns about incorrect type when using __be64 data. It is not being converted to CPU-endian but it should be. Fixes these sparse warnings: ../block/sed-opal.c:375:20: warning: incorrect type in assignment (different base types) ../block/sed-opal.c:375:20: expected unsigned long long [usertype] align ../block/sed-opal.c:375:20: got restricted __be64 const [usertype] alignment_granularity ../block/sed-opal.c:376:25: warning: incorrect type in assignment (different base types) ../block/sed-opal.c:376:25: expected unsigned long long [usertype] lowest_lba ../block/sed-opal.c:376:25: got restricted __be64 const [usertype] lowest_aligned_lba Fixes: 455a7b238cd6 ("block: Add Sed-opal library") Cc: Scott Bauer <scott.bauer@intel.com> Cc: Rafael Antognolli <rafael.antognolli@intel.com> Cc: linux-block@vger.kernel.org Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: sed-opal: fix sparse warning: obsolete array init.Randy Dunlap2019-10-031-1/+1
| | | | | | | | | | | | | | Fix sparse warning: (missing '=') ../block/sed-opal.c:133:17: warning: obsolete array initializer, use C99 syntax Fixes: ff91064ea37c ("block: sed-opal: check size of shadow mbr") Cc: linux-block@vger.kernel.org Cc: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Cc: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: apply normal plugging for HDDMing Lei2019-09-271-1/+5
| | | | | | | | | | | | | Some HDD drive may expose multiple hardware queues, such as MegraRaid. Let's apply the normal plugging for such devices because sequential IO may benefit a lot from plug merging. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Dave Chinner <dchinner@redhat.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: honor IO scheduler for multiqueue devicesMing Lei2019-09-271-2/+4
| | | | | | | | | | | | | | | | | | | | If a device is using multiple queues, the IO scheduler may be bypassed. This may hurt performance for some slow MQ devices, and it also breaks zoned devices which depend on mq-deadline for respecting the write order in one zone. Don't bypass io scheduler if we have one setup. This patch can double sequential write performance basically on MQ scsi_debug when mq-deadline is applied. Cc: Bart Van Assche <bvanassche@acm.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Dave Chinner <dchinner@redhat.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: fix null pointer dereference in blk_mq_rq_timed_out()Yufen Yu2019-09-273-1/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We got a null pointer deference BUG_ON in blk_mq_rq_timed_out() as following: [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040 [ 108.827059] PGD 0 P4D 0 [ 108.827313] Oops: 0000 [#1] SMP PTI [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431 [ 108.829503] Workqueue: kblockd blk_mq_timeout_work [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330 [ 108.838191] Call Trace: [ 108.838406] bt_iter+0x74/0x80 [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450 [ 108.839074] ? __switch_to_asm+0x34/0x70 [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f [ 108.840732] blk_mq_timeout_work+0x74/0x200 [ 108.841151] process_one_work+0x297/0x680 [ 108.841550] worker_thread+0x29c/0x6f0 [ 108.841926] ? rescuer_thread+0x580/0x580 [ 108.842344] kthread+0x16a/0x1a0 [ 108.842666] ? kthread_flush_work+0x170/0x170 [ 108.843100] ret_from_fork+0x35/0x40 The bug is caused by the race between timeout handle and completion for flush request. When timeout handle function blk_mq_rq_timed_out() try to read 'req->q->mq_ops', the 'req' have completed and reinitiated by next flush request, which would call blk_rq_init() to clear 'req' as 0. After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"), normal requests lifetime are protected by refcount. Until 'rq->ref' drop to zero, the request can really be free. Thus, these requests cannot been reused before timeout handle finish. However, flush request has defined .end_io and rq->end_io() is still called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq' can be reused by the next flush request handle, resulting in null pointer deference BUG ON. We fix this problem by covering flush request with 'rq->ref'. If the refcount is not zero, flush_end_io() return and wait the last holder recall it. To record the request status, we add a new entry 'rq_status', which will be used in flush_end_io(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: stable@vger.kernel.org # v4.18+ Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> ------- v2: - move rq_status from struct request to struct blk_flush_queue v3: - remove unnecessary '{}' pair. v4: - let spinlock to protect 'fq->rq_status' v5: - move rq_status after flush_running_idx member of struct blk_flush_queue Signed-off-by: Jens Axboe <axboe@kernel.dk>
* rq-qos: get rid of redundant wbt_update_limits()Yufen Yu2019-09-271-1/+0
| | | | | | | | | We have updated limits after calling wbt_set_min_lat(). No need to update again. Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* iocost: bump up default latency targets for hard disksTejun Heo2019-09-261-2/+2
| | | | | | | | | | | | | | | | | | The default hard disk param sets latency targets at 50ms. As the default target percentiles are zero, these don't directly regulate vrate; however, they're still used to calculate the period length - 100ms in this case. This is excessively low. A SATA drive with QD32 saturated with random IOs can easily reach avg completion latency of several hundred msecs. A period duration which is substantially lower than avg completion latency can lead to wildly fluctuating vrate. Let's bump up the default latency targets to 250ms so that the period duration is sufficiently long. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* iocost: improve nr_lagging handlingTejun Heo2019-09-261-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some IOs may span multiple periods. As latencies are collected on completion, the inbetween periods won't register them and may incorrectly decide to increase vrate. nr_lagging tracks these IOs to avoid those situations. Currently, whenever there are IOs which are spanning from the previous period, busy_level is reset to 0 if negative thus suppressing vrate increase. This has the following two problems. * When latency target percentiles aren't set, vrate adjustment should only be governed by queue depth depletion; however, the current code keeps nr_lagging active which pulls in latency results and can keep down vrate unexpectedly. * When lagging condition is detected, it resets the entire negative busy_level. This turned out to be way too aggressive on some devices which sometimes experience extended latencies on a small subset of commands. In addition, a lagging IO will be accounted as latency target miss on completion anyway and resetting busy_level amplifies its impact unnecessarily. This patch fixes the above two problems by disabling nr_lagging counting when latency target percentiles aren't set and blocking vrate increases when there are lagging IOs while leaving busy_level as-is. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* iocost: better trace vrate changesTejun Heo2019-09-261-1/+6
| | | | | | | | | | | vrate_adj tracepoint traces vrate changes; however, it does so only when busy_level is non-zero. busy_level turning to zero can sometimes be as interesting an event. This patch also enables vrate_adj tracepoint on other vrate related events - busy_level changes and non-zero nr_lagging. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: don't release queue's sysfs lock during switching elevatorMing Lei2019-09-262-39/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to release & acquire sysfs_lock before registering/un-registering elevator queue during switching elevator for avoiding potential deadlock from showing & storing 'queue/iosched' attributes and removing elevator's kobject. Turns out there isn't such deadlock because 'q->sysfs_lock' isn't required in .show & .store of queue/iosched's attributes, and just elevator's sysfs lock is acquired in elv_iosched_store() and elv_iosched_show(). So it is safe to hold queue's sysfs lock when registering/un-registering elevator queue. The biggest issue is that commit cecf5d87ff20 assumes that concurrent write on 'queue/scheduler' can't happen. However, this assumption isn't true, because kernfs_fop_write() only guarantees that concurrent write aren't called on the same open file, but the write could be from different open on the file. So we can't release & re-acquire queue's sysfs lock during switching elevator, otherwise use-after-free on elevator could be triggered. Fixes the issue by not releasing queue's sysfs lock during switching elevator. Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks") Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: move lockdep_assert_held() into elevator_exitMing Lei2019-09-262-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Commit c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq") removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with lockdep_assert_held() called in blk_mq_sched_free_requests() which is run in failure path of elevator_init_mq(). blk_mq_sched_free_requests() is called in the following 3 functions: elevator_init_mq() elevator_exit() blk_cleanup_queue() In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly by 'mutex_lock(&q->sysfs_lock)'. So moving the lockdep_assert_held() from blk_mq_sched_free_requests() into elevator_exit() for fixing the report by syzbot. Reported-by: syzbot+da3b7677bb913dc1b737@syzkaller.appspotmail.com Fixed: c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq") Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-blockLinus Torvalds2019-09-246-94/+144
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more block updates from Jens Axboe: "Some later additions that weren't quite done for the first pull request, and also a few fixes that have arrived since. This contains: - Kill silly pktcdvd warning on attempting to register a non-scsi passthrough device (me) - Use symbolic constants for the block t10 protection types, and switch to handling it in core rather than in the drivers (Max) - libahci platform missing node put fix (Nishka) - Small series of fixes for BFQ (Paolo) - Fix possible nbd crash (Xiubo)" * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block: block: drop device references in bsg_queue_rq() block: t10-pi: fix -Wswitch warning pktcdvd: remove warning on attempting to register non-passthrough dev ata: libahci_platform: Add of_node_put() before loop exit nbd: fix possible page fault for nbd disk nbd: rename the runtime flags as NBD_RT_ prefixed block, bfq: push up injection only after setting service time block, bfq: increase update frequency of inject limit block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1 block, bfq: update inject limit only after injection occurred block: centralize PI remapping logic to the block layer block: use symbolic constants for t10_pi type
| * block: drop device references in bsg_queue_rq()Martin Wilck2019-09-231-4/+6
| | | | | | | | | | | | | | | | | | Make sure that bsg_queue_rq() calls put_device() if an error is encountered after get_device() was successful. Fixes: cd2f076f1d7a ("bsg: convert to use blk-mq") Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: t10-pi: fix -Wswitch warningMax Gurtovoy2019-09-231-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changing the switch() statement to symbolic constants made the compiler (at least clang-9, did not check gcc) notice that there is one enum value that is not handled here: block/t10-pi.c:62:11: error: enumeration value 'T10_PI_TYPE0_PROTECTION' not handled in switch [-Werror,-Wswitch] Add a BUG_ON statement if we ever get to t10_pi_verify function with TYPE0 and replace the switch() statement with if/else clause for the valid types. Fixes: 9b2061b1a262 ("block: use symbolic constants for t10_pi type") Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block, bfq: push up injection only after setting service timePaolo Valente2019-09-171-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | If equal to 0, the injection limit for a bfq_queue is pushed to 1 after a first sample of the total service time of the I/O requests of the queue is computed (to allow injection to start). Yet, because of a mistake in the branch that performs this action, the push may happen also in some other case. This commit fixes this issue. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block, bfq: increase update frequency of inject limitPaolo Valente2019-09-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | The update period of the injection limit has been tentatively set to 100 ms, to reduce fluctuations. This value however proved to cause, occasionally, the limit to be decremented for some bfq_queue only after the queue underwent excessive injection for a lot of time. This commit reduces the period to 10 ms. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1Paolo Valente2019-09-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Upon an increment attempt of the injection limit, the latter is constrained not to become higher than twice the maximum number max_rq_in_driver of I/O requests that have happened to be in service in the drive. This high bound allows the injection limit to grow beyond max_rq_in_driver, which may then cause max_rq_in_driver itself to grow. However, since the limit is incremented by only one unit at a time, there is no need for such a high bound, and just max_rq_in_driver+1 is enough. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block, bfq: update inject limit only after injection occurredPaolo Valente2019-09-171-2/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BFQ updates the injection limit of each bfq_queue as a function of how much the limit inflates the service times experienced by the I/O requests of the queue. So only service times affected by injection must be taken into account. Unfortunately, in the current implementation of this update scheme, the service time of an I/O request rq not affected by injection may happen to be considered in the following case: there is no I/O request in service when rq arrives. This commit fixes this issue by making sure that only service times affected by injection are considered for updating the injection limit. In particular, the service time of an I/O request rq is now considered only if at least one of the following two conditions holds: - the destination bfq_queue for rq underwent injection before rq arrival, and there is still I/O in service in the drive on rq arrival (the service of such unfinished I/O may delay the service of rq); - injection occurs between the arrival and the completion time of rq. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: centralize PI remapping logic to the block layerMax Gurtovoy2019-09-174-68/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently t10_pi_prepare/t10_pi_complete functions are called during the NVMe and SCSi layers command preparetion/completion, but their actual place should be the block layer since T10-PI is a general data integrity feature that is used by block storage protocols. Introduce .prepare_fn and .complete_fn callbacks within the integrity profile that each type can implement according to its needs. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY isn't defined in the config. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: use symbolic constants for t10_pi typeMax Gurtovoy2019-09-171-14/+14
| | | | | | | | | | | | | | | | | | | | Replace all hard-coded values with T10_PI_TYPES to make the code more readable. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds2019-09-191-0/+23
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull dma-mapping updates from Christoph Hellwig: - add dma-mapping and block layer helpers to take care of IOMMU merging for mmc plus subsequent fixups (Yoshihiro Shimoda) - rework handling of the pgprot bits for remapping (me) - take care of the dma direct infrastructure for swiotlb-xen (me) - improve the dma noncoherent remapping infrastructure (me) - better defaults for ->mmap, ->get_sgtable and ->get_required_mask (me) - cleanup mmaping of coherent DMA allocations (me) - various misc cleanups (Andy Shevchenko, me) * tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping: (41 commits) mmc: renesas_sdhi_internal_dmac: Add MMC_CAP2_MERGE_CAPABLE mmc: queue: Fix bigger segments usage arm64: use asm-generic/dma-mapping.h swiotlb-xen: merge xen_unmap_single into xen_swiotlb_unmap_page swiotlb-xen: simplify cache maintainance swiotlb-xen: use the same foreign page check everywhere swiotlb-xen: remove xen_swiotlb_dma_mmap and xen_swiotlb_dma_get_sgtable xen: remove the exports for xen_{create,destroy}_contiguous_region xen/arm: remove xen_dma_ops xen/arm: simplify dma_cache_maint xen/arm: use dev_is_dma_coherent xen/arm: consolidate page-coherent.h xen/arm: use dma-noncoherent.h calls for xen-swiotlb cache maintainance arm: remove wrappers for the generic dma remap helpers dma-mapping: introduce a dma_common_find_pages helper dma-mapping: always use VM_DMA_COHERENT for generic DMA remap vmalloc: lift the arm flag for coherent mappings to common code dma-mapping: provide a better default ->get_required_mask dma-mapping: remove the dma_declare_coherent_memory export remoteproc: don't allow modular build ...
| * block: add a helper function to merge the segmentsYoshihiro Shimoda2019-09-031-0/+23
| | | | | | | | | | | | | | | | | | | | This patch adds a helper function whether a queue can merge the segments by the DMA MAP layer (e.g. via IOMMU). Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Simon Horman <horms+renesas@verge.net.au Signed-off-by: Christoph Hellwig <hch@lst.de>
* | Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-blockLinus Torvalds2019-09-1730-342/+3274
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - Two NVMe pull requests: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." - controller reset and namespace scan races fixes - nvme discovery log change uevent support - naming improvements from Keith - multiple discovery controllers reject fix from James - some regular cleanups from various people - Series fixing (and re-fixing) null_blk debug printing and nr_devices checks (André) - A few pull requests from Song, with fixes from Andy, Guoqing, Guilherme, Neil, Nigel, and Yufen. - REQ_OP_ZONE_RESET_ALL support (Chaitanya) - Bio merge handling unification (Christoph) - Pick default elevator correctly for devices with special needs (Damien) - Block stats fixes (Hou) - Timeout and support devices nbd fixes (Mike) - Series fixing races around elevator switching and device add/remove (Ming) - sed-opal cleanups (Revanth) - Per device weight support for BFQ (Fam) - Support for blk-iocost, a new model that can properly account cost of IO workloads. (Tejun) - blk-cgroup writeback fixes (Tejun) - paride queue init fixes (zhengbin) - blk_set_runtime_active() cleanup (Stanley) - Block segment mapping optimizations (Bart) - lightnvm fixes (Hans/Minwoo/YueHaibing) - Various little fixes and cleanups * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits) null_blk: format pr_* logs with pr_fmt null_blk: match the type of parameter nr_devices null_blk: do not fail the module load with zero devices block: also check RQF_STATS in blk_mq_need_time_stamp() block: make rq sector size accessible for block stats bfq: Fix bfq linkage error raid5: use bio_end_sector in r5_next_bio raid5: remove STRIPE_OPS_REQ_PENDING md: add feature flag MD_FEATURE_RAID0_LAYOUT md/raid0: avoid RAID0 data corruption due to layout confusion. raid5: don't set STRIPE_HANDLE to stripe which is in batch list raid5: don't increment read_errors on EILSEQ return nvmet: fix a wrong error status returned in error log page nvme: send discovery log page change events to userspace nvme: add uevent variables for controller devices nvme: enable aen regardless of the presence of I/O queues nvme-fabrics: allow discovery subsystems accept a kato nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() nvme: Remove redundant assignment of cq vector nvme: Assign subsys instance from first ctrl ...
| * | block: also check RQF_STATS in blk_mq_need_time_stamp()Hou Tao2019-09-151-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | In __blk_mq_end_request() if block stats needs update, we should ensure now is valid instead of 0 even when iostat is disabled. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: make rq sector size accessible for block statsHou Tao2019-09-152-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently rq->data_len will be decreased by partial completion or zeroed by completion, so when blk_stat_add() is invoked, data_len will be zero and there will never be samples in poll_cb because blk_mq_poll_stats_bkt() will return -1 if data_len is zero. We could move blk_stat_add() back to __blk_mq_complete_request(), but that would make the effort of trying to call ktime_get_ns() once in vain. Instead we can reuse throtl_size field, and use it for both block stats and block throttle, and adjust the logic in blk_mq_poll_stats_bkt() accordingly. Fixes: 4bc6339a583c ("block: move blk_stat_add() to __blk_mq_end_request()") Tested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bfq: Fix bfq linkage errorPavel Begunkov2019-09-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 795fe54c2a828099e ("bfq: Add per-device weight"), bfq uses blkg_conf_prep() and blkg_conf_finish(), which are not exported. So, it causes linkage error if bfq compiled as a module. Fixes: 795fe54c2a828099e ("bfq: Add per-device weight") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: fix race between switching elevator and removing queuesMing Lei2019-09-121-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to release & actuire sysfs_lock again during switching elevator. So it isn't enough to prevent switching elevator from happening by simply clearing QUEUE_FLAG_REGISTERED with holding sysfs_lock, because in-progress switch still can move on after re-acquiring the lock, meantime the flag of QUEUE_FLAG_REGISTERED won't get checked. Fixes this issue by checking 'q->elevator' directly & locklessly after q->kobj is removed in blk_unregister_queue(), this way is safe because q->elevator can't be changed at that time. Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks") Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: bypass blk_set_runtime_active for uninitialized q->devStanley Chu2019-09-121-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some devices may skip blk_pm_runtime_init() and have null pointer in its request_queue->dev. For example, SCSI devices of UFS Well-Known LUNs. Currently the null pointer is checked by the user of blk_set_runtime_active(), i.e., scsi_dev_type_resume(). It is better to check it by blk_set_runtime_active() itself instead of by its users. Signed-off-by: Stanley Chu <stanley.chu@mediatek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | iocost_monitor: Report debtTejun Heo2019-09-101-3/+3
| | | | | | | | | | | | | | | | | | | | | Report debt and rename del_ms row to delay for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-iocost: Don't let merges push vtime into the futureTejun Heo2019-09-101-7/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merges have the same problem that forced-bios had which is fixed by the previous patch. The cost of a merge is calculated at the time of issue and force-advances vtime into the future. Until global vtime catches up, how the cgroup's hweight changes in the meantime doesn't matter and it often leads to situations where the cost is calculated at one hweight and paid at a very different one. See the previous patch for more details. Fix it by never advancing vtime into the future for merges. If budget is available, vtime is advanced. Otherwise, the cost is charged as debt. This brings merge cost handling in line with issue cost handling in ioc_rqos_throttle(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>