linux.git - Linux kernel mainline tree

	Commit message (Collapse)	Author	Age	Files	Lines
*	bsg: Add sparse annotations to bsg_request_fn()	Bart Van Assche	2016-11-14	1	-0/+2
\| \| \| \| \| \| \|	Avoid that sparse complains about unbalanced lock actions. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-wbt: use BLK_STAT_{READ,WRITE} instead of 0/1	Jens Axboe	2016-11-11	1	-6/+6
\| \| \| \| \| \|	Since we have proper enums for the stats directions, use them. Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-wbt: remove stat ops	Jens Axboe	2016-11-11	3	-43/+8
\| \| \| \| \| \| \|	Again a leftover from when the throttling code was generic. Now that we just have the block user, get rid of the stat ops and indirections. Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-wbt: store queue instead of bdi	Jens Axboe	2016-11-11	2	-11/+13
\| \| \| \| \| \| \|	The bdi was a leftover from when the code was block layer agnostic. Now that we just support a block layer user, store the queue directly. Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: move poll code to blk-mq	Jens Axboe	2016-11-11	5	-49/+57
\| \| \| \| \| \| \| \|	The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
*	Block: mtip32xx: Improvement in code readability when memdup_user() fails.	Sachin Shukla	2016-11-11	1	-9/+5
\| \| \| \| \| \| \| \|	There is no need to call kfree() if memdup_user() fails, as no memory was allocated and the error in the error-valued pointer should be returned. Signed-off-by: Sachin Shukla <sachin.s5@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: blk_mq_try_issue_directly() should lookup hardware queue	Jens Axboe	2016-11-11	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \|	A previous commit changed this to pass in the hardware queue, but it was using the wrong hardware queue. Hence a request that was allocated on one hardware queue ended up being issued on another one, and that caused IO timeouts and oopses on some drivers. Since the request holds hardware queue private resources, like a tag, we can't just issue it on a different hardware queue. Fixes: 2253efc850c4 ("blk-mq: Move more code into blk_mq_direct_issue_request()") Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: hook up writeback throttling	Jens Axboe	2016-11-10	8	-4/+181
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Enable throttling of buffered writeback to make it a lot more smooth, and has way less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at the time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it. The algorithm for when to throttle takes its inspiration in the CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply increment the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior. Unlike CoDel, blk-wb allows the scale count to to negative. This happens if we primarily have writes going on. Unlike positive scale counts, this doesn't change the size of the monitoring window. When the heavy writers finish, blk-bw quickly snaps back to it's stable state of a zero scale count. The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency target to me met. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch this setting. We don't enable WBT on devices that are managed with CFQ, and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us. Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-wbt: add general throttling mechanism	Jens Axboe	2016-11-10	4	-0/+1054
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We can hook this up to the block layer, to help throttle buffered writes. wbt registers a few trace points that can be used to track what is happening in the system: wbt_lat: 259:0: latency 2446318 wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1, wmean=518866, wmin=15522, wmax=5330353, wsamples=57 wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32 This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat dumps the current read/write stats for that window, and wbt_step shows a step down event where we now scale back writes. Each trace includes the device, 259:0 in this case. Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: add scalable completion tracking of requests	Jens Axboe	2016-11-10	10	-3/+427
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. The feature is off by default, to avoid any extra overhead. In-kernel users of it can turn it on by setting QUEUE_FLAG_STATS in the queue flags. We currently don't turn it on if someone just reads any of the stats files, that is something we could add as well. Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: cfq_cpd_alloc() should use @gfp	Tejun Heo	2016-11-10	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was incorrectly hard coding GFP_KERNEL instead of using the mask specified through the @gfp parameter. This currently doesn't cause any actual issues because all current callers specify GFP_KERNEL. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods") Signed-off-by: Jens Axboe <axboe@fb.com>
*	nvme: don't pass the full CQE to nvme_complete_async_event	Christoph Hellwig	2016-11-10	5	-11/+21
\| \| \| \| \| \| \| \| \| \|	We only need the status and result fields, and passing them explicitly makes life a lot easier for the Fibre Channel transport which doesn't have a full CQE for the fast path case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	nvme: introduce struct nvme_request	Christoph Hellwig	2016-11-10	11	-86/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a shared per-request structure for all NVMe I/O. This structure is embedded as the first member in all NVMe transport drivers request private data and allows to implement common functionality between the drivers. The first use is to replace the current abuse of the SCSI command passthrough fields in struct request for the NVMe command passthrough, but it will grow a field more fields to allow implementing things like common abort handlers in the future. The passthrough commands are handled by having a pointer to the SQE (struct nvme_command) in struct nvme_request, and the union of the possible result fields, which had to be turned from an anonymous into a named union for that purpose. This avoids having to pass a reference to a full CQE around and thus makes checking the result a lot more lightweight. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	skd: fix function prototype	Arnd Bergmann	2016-11-09	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	Building with W=1 shows a harmless warning for the skd driver: drivers/block/skd_main.c:2959:1: error: ‘static’ is not at beginning of declaration [-Werror=old-style-declaration] This changes the prototype to the expected formatting. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	skd: fix msix error handling	Arnd Bergmann	2016-11-09	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As reported by gcc -Wmaybe-uninitialized, the cleanup path for skd_acquire_msix tries to free the already allocated msi-x vectors in reverse order, but the index variable may not have been used yet: drivers/block/skd_main.c: In function ‘skd_acquire_irq’: drivers/block/skd_main.c:3890:8: error: ‘i’ may be used uninitialized in this function [-Werror=maybe-uninitialized] This changes the failure path to skip releasing the interrupts if we have not started requesting them yet. Fixes: 180b0ae77d49 ("skd: use pci_alloc_irq_vectors") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: set REQ_SYNC if we clear REQ_FUA\|REQ_PREFLUSH	Jens Axboe	2016-11-08	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \|	If we insert a flush request, we clear REQ_PREFLUSH and/or REQ_FUA, depending on flush settings. Since op_is_sync() factors those flags in for deciding whether this request is sync or not, we should set REQ_SYNC to avoid screwing up this accounting. This should be less fragile. Reported-by: Logan Gunthorpe <logang@deltatee.com> Fixes: b685d3d65ac ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous") Signed-off-by: Jens Axboe <axboe@fb.com>
*	writeback: track if we're sleeping on progress in balance_dirty_pages()	Jens Axboe	2016-11-08	3	-0/+4
\| \| \| \| \| \| \| \| \|	Note in the bdi_writeback structure whenever a task ends up sleeping waiting for progress. We can use that information in the lower layers to increase the priority of writes. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz>
*	skd_main: use %*ph to dump small buffers	Andy Shevchenko	2016-11-07	1	-12/+4
\| \| \| \| \| \| \| \| \| \|	Replace custom approach by %*ph specifier to dump small buffers in hex format. Unfortunately we can't use print_hex_dump_bytes() here since tha gap is present, though one familiar with the code may change this. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	skd: use pci_alloc_irq_vectors	Christoph Hellwig	2016-11-07	1	-146/+66
\| \| \| \| \| \| \| \| \| \| \|	Switch the skd driver to use pci_alloc_irq_vectors. We need to two calls to pci_alloc_irq_vectors as skd only supports multiple MSI-X vectors, but not multiple MSI vectors. Otherwise this cleans up a lot of cruft and allows to a lot more common code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	pktcdvd: don't scribble over the bvec array	Christoph Hellwig	2016-11-07	1	-41/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Hi Peter, hi Jens, I've been looking over the multi page bio vec work again recently, and one of the stumbling blocks is raw biovec access in the pktcdvd. The first issue is that it directly sets up the page and offset pointers in the biovec just before calling bio_add_page. As bio_add_page already does the setup it's trivial to just switch it to stack variables for the arguments. The second issue is the copy code in pkt_make_local_copy, which effectively is an opencoded version of bio_copy_data except that it skips pages that already are the same in the ѕource and destination. But we look at the only calleer we just set up the bio using bio_add_page to point exactly to the page array that pkt_make_local_copy compares, so the pages will always be the same and we can just remove this function. Note that all this is done based on code inspection, I don't have any packet writing hardware myself. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: Always schedule hctx->next_cpu	Gabriel Krisman Bertazi	2016-11-06	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU") attempts to avoid triggering the WARN_ON in __blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the last batch execution before round robin, blk_mq_hctx_next_cpu can schedule a dead CPU and also update next_cpu to the next alive CPU in the mask, which will trigger the WARN_ON despite the previous workaround. The following patch fixes this scenario by always scheduling the value in hctx->next_cpu. This changes the moment when we round-robin the CPU running the hctx, but it really doesn't matter, since it still executes BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU. Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU") Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: add code to track actual device queue depth	Jens Axboe	2016-11-05	3	-0/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	For blk-mq, ->nr_requests does track queue depth, at least at init time. But for the older queue paths, it's simply a soft setting. On top of that, it's generally larger than the hardware setting on purpose, to allow backup of requests for merging. Fill a hole in struct request with a 'queue_depth' member, that drivers can call to more closely inform the block layer of the real queue depth. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz>
*	blk-mq: immediately dispatch big size request	Shaohua Li	2016-11-03	1	-1/+6
\| \| \| \| \| \| \| \|	This is corresponding part for blk-mq. Disk with multiple hardware queues doesn't need this as we only hold 1 request at most. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: immediately dispatch big size request	Shaohua Li	2016-11-03	2	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently block plug holds up to 16 non-mergeable requests. This makes sense if the request size is small, eg, reduce lock contention. But if request size is big enough, we don't need to worry about lock contention. Holding such request makes no sense and it lows the disk utilization. In practice, this improves 10% throughput for my raid5 sequential write workload. The size (128k) is arbitrary right now, but it makes sure lock contention is small. This probably could be more intelligent, eg, check average request size holded. Since this is mainly for sequential IO, probably not worthy. V2: check the last request instead of the first request, so as long as there is one big size request we flush the plug. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: drop q argument from bsg_validate_sgv4_hdr	Johannes Thumshirn	2016-11-03	1	-2/+2
\| \| \| \| \| \| \| \|	bsg_validate_sgv4_hdr() doesn't care about the request_queue, so drop it from it's arguments. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code	Bart Van Assche	2016-11-02	1	-14/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make nvme_requeue_req() check BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED. Remove the QUEUE_FLAG_STOPPED manipulations that became superfluous because of this change. Change blk_queue_stopped() tests into blk_mq_queue_stopped(). This patch fixes a race condition: using queue_flag_clear_unlocked() is not safe if any other function that manipulates the queue flags can be called concurrently, e.g. blk_cleanup_queue(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	nvme: Fix a race condition related to stopping queues	Bart Van Assche	2016-11-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	Avoid that nvme_queue_rq() is still running when nvme_stop_queues() returns. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	dm: Fix a race condition related to stopping and starting queues	Bart Van Assche	2016-11-02	1	-12/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request() calls have stopped before setting the "queue stopped" flag. This allows to remove the "queue stopped" test from dm_mq_queue_rq() and dm_mq_requeue_request(). This patch fixes a race condition because dm_mq_queue_rq() is called without holding the queue lock and hence BLK_MQ_S_STOPPED can be set at any time while dm_mq_queue_rq() is in progress. This patch prevents that the following hang occurs sporadically when using dm-mq: INFO: task systemd-udevd:10111 blocked for more than 480 seconds. Call Trace: [<ffffffff8161f397>] schedule+0x37/0x90 [<ffffffff816239ef>] schedule_timeout+0x27f/0x470 [<ffffffff8161e76f>] io_schedule_timeout+0x9f/0x110 [<ffffffff8161fb36>] bit_wait_io+0x16/0x60 [<ffffffff8161f929>] __wait_on_bit_lock+0x49/0xa0 [<ffffffff8114fe69>] __lock_page+0xb9/0xc0 [<ffffffff81165d90>] truncate_inode_pages_range+0x3e0/0x760 [<ffffffff81166120>] truncate_inode_pages+0x10/0x20 [<ffffffff81212a20>] kill_bdev+0x30/0x40 [<ffffffff81213d41>] __blkdev_put+0x71/0x360 [<ffffffff81214079>] blkdev_put+0x49/0x170 [<ffffffff812141c0>] blkdev_close+0x20/0x30 [<ffffffff811d48e8>] __fput+0xe8/0x1f0 [<ffffffff811d4a29>] ____fput+0x9/0x10 [<ffffffff810842d3>] task_work_run+0x83/0xb0 [<ffffffff8106606e>] do_exit+0x3ee/0xc40 [<ffffffff8106694b>] do_group_exit+0x4b/0xc0 [<ffffffff81073d9a>] get_signal+0x2ca/0x940 [<ffffffff8101bf43>] do_signal+0x23/0x660 [<ffffffff810022b3>] exit_to_usermode_loop+0x73/0xb0 [<ffffffff81002cb0>] syscall_return_slowpath+0xb0/0xc0 [<ffffffff81624e33>] entry_SYSCALL_64_fastpath+0xa6/0xa8 Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code	Bart Van Assche	2016-11-02	1	-15/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED in the dm start and stop queue functions, only manipulate the latter flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()	Bart Van Assche	2016-11-02	7	-15/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls are followed by kicking the requeue list. Hence add an argument to these two functions that allows to kick the requeue list. This was proposed by Christoph Hellwig. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: Introduce blk_mq_quiesce_queue()	Bart Van Assche	2016-11-02	4	-7/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations have finished. This function does not wait until all outstanding requests have finished (this means invocation of request.end_io()). The algorithm used by blk_mq_quiesce_queue() is as follows: * Hold either an RCU read lock or an SRCU read lock around .queue_rq() calls. The former is used if .queue_rq() does not block and the latter if .queue_rq() may block. * blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues() followed by synchronize_srcu() or synchronize_rcu(). The latter call waits for .queue_rq() invocations that started before blk_mq_quiesce_queue() was called. * The blk_mq_hctx_stopped() calls that control whether or not .queue_rq() will be called are called with the (S)RCU read lock held. This is necessary to avoid race conditions against blk_mq_quiesce_queue(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: Remove blk_mq_cancel_requeue_work()	Bart Van Assche	2016-11-02	4	-10/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since blk_mq_requeue_work() no longer restarts stopped queues canceling requeue work is no longer needed to prevent that a stopped queue would be restarted. Hence remove this function. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: Avoid that requeueing starts stopped queues	Bart Van Assche	2016-11-02	4	-12/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since blk_mq_requeue_work() starts stopped queues and since execution of this function can be scheduled after a queue has been stopped it is not possible to stop queues without using an additional state variable to track whether or not the queue has been stopped. Hence modify blk_mq_requeue_work() such that it does not start stopped queues. My conclusion after a review of the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list() callers is as follows: * In the dm driver starting and stopping queues should only happen if __dm_suspend() or __dm_resume() is called and not if the requeue list is processed. * In the SCSI core queue stopping and starting should only be performed by the scsi_internal_device_block() and scsi_internal_device_unblock() functions but not by any other function. Although the blk_mq_stop_hw_queue() call in scsi_queue_rq() may help to reduce CPU load if a LLD queue is full, figuring out whether or not a queue should be restarted when requeueing a command would require to introduce additional locking in scsi_mq_requeue_cmd() to avoid a race with scsi_internal_device_block(). Avoid this complexity by removing the blk_mq_stop_hw_queue() call from scsi_queue_rq(). * In the NVMe core only the functions that call blk_mq_start_stopped_hw_queues() explicitly should start stopped queues. * A blk_mq_start_stopped_hwqueues() call must be added in the xen-blkfront driver in its blkif_recover() function. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <jejb@linux.vnet.ibm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: Move more code into blk_mq_direct_issue_request()	Bart Van Assche	2016-11-02	1	-8/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the "hctx stopped" test and the insert request calls into blk_mq_direct_issue_request(). Rename that function into blk_mq_try_issue_directly() to reflect its new semantics. Pass the hctx pointer to that function instead of looking it up a second time. These changes avoid that code has to be duplicated in the next patch. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: Introduce blk_mq_queue_stopped()	Bart Van Assche	2016-11-02	2	-0/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The function blk_queue_stopped() allows to test whether or not a traditional request queue has been stopped. Introduce a helper function that allows block drivers to query easily whether or not one or more hardware contexts of a blk-mq queue have been stopped. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: Introduce blk_mq_hctx_stopped()	Bart Van Assche	2016-11-02	2	-6/+11
\| \| \| \| \| \| \| \| \| \| \| \| \|	Multiple functions test the BLK_MQ_S_STOPPED bit so introduce a helper function that performs this test. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	blk-mq: Do not invoke .queue_rq() for a stopped queue	Bart Van Assche	2016-11-02	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The meaning of the BLK_MQ_S_STOPPED flag is "do not call .queue_rq()". Hence modify blk_mq_make_request() such that requests are queued instead of issued if a queue has been stopped. Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: add bio_iov_iter_get_pages()	Kent Overstreet	2016-11-02	2	-0/+50
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a helper that pins down a range from an iov_iter and adds it to a bio without requiring a separate memory allocation for the page array. It will be used for upcoming direct I/O implementations for block devices and iomap based file systems. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: ported to the iov_iter interface, renamed and added comments. All blame should be directed to me and all fame should go to Kent after this!] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	writeback: mark background writeback as such	Jens Axboe	2016-11-02	1	-0/+2
\| \| \| \| \| \| \| \|	If we're doing background type writes, then use the appropriate background write flags for that. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
*	writeback: add wbc_to_write_flags()	Jens Axboe	2016-11-02	8	-16/+17
\| \| \| \| \| \| \| \| \| \|	Add wbc_to_write_flags(), which returns the write modifier flags to use, based on a struct writeback_control. No functional changes in this patch, but it prepares us for factoring other wbc fields for write type. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de>
*	block: add REQ_BACKGROUND	Jens Axboe	2016-11-02	1	-0/+2
\| \| \| \| \| \| \| \|	This adds a new request flag, REQ_BACKGROUND, that callers can use to tell the block layer that this is background (non-urgent) IO. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
*	block: remove the CONFIG_BLOCK ifdef in blk_types.h	Christoph Hellwig	2016-11-01	1	-3/+0
\| \| \| \| \| \| \| \|	Now that we have a separate header for struct bio_vec there is absolutely no excuse for including this header from non-block I/O code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	mm: only include blk_types in swap.h if CONFIG_SWAP is enabled	Christoph Hellwig	2016-11-01	7	-1/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	It's only needed for the CONFIG_SWAP-only use of bio_end_io_t. Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some ifdefs in blk_types.h. Instead we'll need to add a few explicit includes that were implicit before, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	ceph: don't include blk_types.h in messenger.h	Christoph Hellwig	2016-11-01	1	-1/+1
\| \| \| \| \| \| \| \|	The file only needs the struct bvec_iter delcaration, which is available from bvec.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	arm, arm64: don't include blk_types.h in <asm/io.h>	Christoph Hellwig	2016-11-01	2	-2/+0
\| \| \| \| \| \| \| \|	No need for it - we only use struct bio_vec in prototypes and already have forward declarations for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	block,fs: untangle fs.h and blk_types.h	Christoph Hellwig	2016-11-01	18	-1/+19
\| \| \| \| \| \| \|	Nothing in fs.h should require blk_types.h to be included. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	block, fs: move submit_bio to bio.h	Christoph Hellwig	2016-11-01	2	-1/+2
\| \| \| \| \| \| \| \|	This is where all the other bio operations live, so users must include bio.h anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	fs: decouple READ and WRITE from the block layer ops	Christoph Hellwig	2016-11-01	4	-14/+11
\| \| \| \| \| \| \| \| \|	Move READ and WRITE to kernel.h and don't define them in terms of block layer ops; they are our generic data direction indicators these days and have no more resemblance with the block layer ops. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	block,fs: use REQ_* flags directly	Christoph Hellwig	2016-11-01	53	-182/+133
\| \| \| \| \| \| \| \| \|	Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
*	block: replace REQ_NOIDLE with REQ_IDLE	Christoph Hellwig	2016-11-01	6	-28/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Noidle should be the default for writes as seen by all the compounds definitions in fs.h using it. In fact only direct I/O really should be using NODILE, so turn the whole flag around to get the defaults right, which will make our life much easier especially onces the WRITE_* defines go away. This assumes all the existing "raw" users of REQ_SYNC for writes want noidle behavior, which seems to be spot on from a quick audit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>