summaryrefslogtreecommitdiffstats
path: root/drivers
Commit message (Collapse)AuthorAgeFilesLines
* lightnvm: if LUNs are already allocated fix returnRakesh Pandit2017-06-271-2/+3
| | | | | | | | | | | While creating new device with NVM_DEV_CREATE if LUNs are already allocated ioctl would return -ENOMEM which is wrong. This patch propagates -EBUSY from nvm_reserve_luns which is correct response. Fixes: ade69e243 ("lightnvm: merge gennvm with core") Reviewed-by: Frans Klaver <fransklaver@gmail.com> Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: fail gracefully on irrec. errorJavier González2017-06-2611-114/+332
| | | | | | | | | | | | | | | | | | | | | | Due to user writes being decoupled from media writes because of the need of an intermediate write buffer, irrecoverable media write errors lead to pblk stalling; user writes fill up the buffer and end up in an infinite retry loop. In order to let user writes fail gracefully, it is necessary for pblk to keep track of its own internal state and prevent further writes from being placed into the write buffer. This patch implements a state machine to keep track of internal errors and, in case of failure, fail further user writes in an standard way. Depending on the type of error, pblk will do its best to persist buffered writes (which are already acknowledged) and close down on a graceful manner. This way, data might be recovered by re-instantiating pblk. Such state machine paves out the way for a state-based FTL log. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: set mempool and workqueue params.Javier González2017-06-264-20/+44
| | | | | | | | | | | | Make constants to define sizes for internal mempools and workqueues. In this process, adjust the values to be more meaningful given the internal constrains of the FTL. In order to do this for workqueues, separate the current auxiliary workqueue into two dedicated workqueues to manage lines being closed and bad blocks. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: redesign GC algorithmJavier González2017-06-266-278/+368
| | | | | | | | | | | | | | | | | | | | At the moment, in order to get enough read parallelism, we have recycled several lines at the same time. This approach has proven not to work well when reaching capacity, since we end up mixing valid data from all lines, thus not maintaining a sustainable free/recycled line ratio. The new design, relies on a two level workqueue mechanism. In the first level, we read the metadata for a number of lines based on the GC list they reside on (this is governed by the number of valid sectors in each line). In the second level, we recycle a single line at a time. Here, we issue reads in parallel, while a single GC write thread places data in the write buffer. This design allows to (i) only move data from one line at a time, thus maintaining a sane free/recycled ration and (ii) maintain the GC writer busy with recycled data. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: add lock assertions on helpersJavier González2017-06-261-0/+4
| | | | | | | | Add lockdep assertions on helper functions. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: cleanup unnecessary codeJavier González2017-06-262-7/+0
| | | | | | | | Cleanup unnecessary headers and code lines. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: set metadata list for all I/OsJavier González2017-06-262-38/+54
| | | | | | | | | Set a dma area for all I/Os in order to read/write from/to the metadata stored on the per-sector out-of-bound area. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: choose optimal victim GC lineJavier González2017-06-261-1/+15
| | | | | | | | | | | | | | | | | | At the moment, we separate the closed lines on three different list based on their number of valid sectors. GC recycles lines from each list based on capacity. Lines from each list are taken in a FIFO fashion. Since the number of lines is limited (it corresponds to the number of blocks in a LUN, which is somewhere between 1000-2000), we can afford scanning the lists to choose the optimal line to be recycled. This helps specially in lines with a high number of valid sectors. If the number of blocks per LUN increases, we will consider a more efficient policy. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: decouple bad block from line allocJavier González2017-06-261-16/+37
| | | | | | | | | Decouple bad block discovery from line allocation logic. This allows to return meaningful error codes in case of bad block discovery failure. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: simplify meta. memory allocationJavier González2017-06-264-8/+8
| | | | | | | | | | smeta size will always be suitable for a kmalloc allocation. Simplify the code and leave the vmalloc fallback only for emeta, where the pblk configuration has an impact. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: issue multiplane reads if possibleJavier González2017-06-264-12/+51
| | | | | | | | | | If a read request is sequential and its size aligns with a multi-plane page size, use the multi-plane hint to process the I/O in parallel in the controller. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: delete redundant buffer pointerJavier González2017-06-267-41/+11
| | | | | | | | | | | After refactoring the metadata path, the backpointer controlling synced I/Os in a line becomes unnecessary; metadata is scheduled on the write thread, thus we know when the end of the line is reached and act on it directly. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: delete redundant debug line statJavier González2017-06-261-5/+3
| | | | | | | | | | Remove a legacy variable that helped verifying the consistency of the run-time metadata for the free line list. With the new metadata layout, this check is no longer necessary. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: sched. metadata on write threadJavier González2017-06-268-285/+673
| | | | | | | | | | | | | | | | | | | | | | | | At the moment, line metadata is persisted on a separate work queue, that is kicked each time that a line is closed. The assumption when designing this was that freeing the write thread from creating a new write request was better than the potential impact of writes colliding on the media (user I/O and metadata I/O). Experimentation has proven that this assumption is wrong; collision can cause up to 25% of bandwidth and introduce long tail latencies on the write thread, which potentially cause user write threads to spend more time spinning to get a free entry on the write buffer. This patch moves the metadata logic to the write thread. When a line is closed, remaining metadata is written in memory and is placed on a metadata queue. The write thread then takes the metadata corresponding to the previous line, creates the write request and schedules it to minimize collisions on the media. Using this approach, we see that we can saturate the media's bandwidth, which helps reducing both write latencies and the spinning time for user writer threads. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: rename read request poolJavier González2017-06-265-37/+38
| | | | | | | | | | Read requests allocate some extra memory to store its per I/O context. Instead of requiring yet another memory pool for other type of requests, generalize this context allocation (and change naming accordingly). Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: generalize erase pathJavier González2017-06-266-90/+116
| | | | | | | | | | | | | | | | Erase I/Os are scheduled with the following goals in mind: (i) minimize LUNs collisions with write I/Os, and (ii) even out the price of erasing on every write, instead of putting all the burden on when garbage collection runs. This works well on the current design, but is specific to the default mapping algorithm. This patch generalizes the erase path so that other mapping algorithms can select an arbitrary line to be erased instead. It also gets rid of the erase semaphore since it creates jittering for user writes. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: expose max sec per write on sysfsJavier González2017-06-264-1/+48
| | | | | | | | | | Allow to configure the number of maximum sectors per write command through sysfs. This makes it easier to tune write command sizes for different controller configurations. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: add debug stat for read cache hitsJavier González2017-06-264-1/+10
| | | | | | | | Add a new debug counter to measure cache hits on the read path Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: spare double cpu_to_le64 calc.Javier González2017-06-262-4/+5
| | | | | | | | Spare a double calculation on the fast write path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: propagate right error code to targetJavier González2017-06-261-1/+1
| | | | | | | | | If nvme_alloc_request fails, propagate the right error, instead of assuming ENOMEM. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: re-convert ppa format on I/O failureJavier González2017-06-261-1/+7
| | | | | | | | | | In case of a failure when submitting a request, convert the ppa_list addresses to the target format so that it can interpret ppas for recovery Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* mtip32xx: fix up the checking for internal command failureJens Axboe2017-06-231-17/+4
| | | | | | | | | | | | | | | This fixes up two commits that have touched this driver. The command status field is now a blk_status_t, so we can't check for < 0 and we definitely can't assume it's holding -Exxxx error values. All we care about here is whether ->status is zero or not. Check for that, and remove the various attempts at smart error reporting. Just log to dmesg what command failed, and the blk_status_t value. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 2a842acab109 ("block: introduce new block status code type") Fixes: 3f5e6a35774c ("mtip32xx: convert internal command issue to block IO path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge commit '8e8320c9315c' into for-4.13/blockJens Axboe2017-06-229-65/+74
|\ | | | | | | | | | | | | | | | | Pull in the fix for shared tags, as it conflicts with the pending changes in for-4.13/block. We already pulled in v4.12-rc5 to solve other conflicts or get fixes that went into 4.12, so not a lot of changes in this merge. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Merge branch 'stable/for-jens-4.12' of ↵Jens Axboe2017-06-203-41/+26
| |\ | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus Pull xen-blkback fixes from Konrad: "Security and memory leak fixes in xen block driver."
| | * xen-blkback: don't leak stack data via response ringJan Beulich2017-06-132-31/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than constructing a local structure instance on the stack, fill the fields directly on the shared ring, just like other backends do. Build on the fact that all response structure flavors are actually identical (the old code did make this assumption too). This is XSA-216. Cc: stable@vger.kernel.org Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: don't use xen_blkif_get() in xen-blkback kthreadJuergen Gross2017-06-132-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to use xen_blkif_get()/xen_blkif_put() in the kthread of xen-blkback. Thread stopping is synchronous and using the blkif reference counting in the kthread will avoid to ever let the reference count drop to zero at the end of an I/O running concurrent to disconnecting and multiple rings. Setting ring->xenblkd to NULL after stopping the kthread isn't needed as the kthread does this already. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: don't free be structure too earlyJuergen Gross2017-06-131-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The be structure must not be freed when freeing the blkif structure isn't done. Otherwise a use-after-free of be when unmapping the ring used for communicating with the frontend will occur in case of a late call of xenblk_disconnect() (e.g. due to an I/O still active when trying to disconnect). Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: fix disconnect while I/Os in flightJuergen Gross2017-06-132-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today disconnecting xen-blkback is broken in case there are still I/Os in flight: xen_blkif_disconnect() will bail out early without releasing all resources in the hope it will be called again when the last request has terminated. This, however, won't happen as xen_blkif_free() won't be called on termination of the last running request: xen_blkif_put() won't decrement the blkif refcnt to 0 as xen_blkif_disconnect() didn't finish before thus some xen_blkif_put() calls in xen_blkif_disconnect() didn't happen. To solve this deadlock xen_blkif_disconnect() and xen_blkif_alloc_rings() shouldn't use xen_blkif_put() and xen_blkif_get() but use some other way to do their accounting of resources. This at once fixes another error in xen_blkif_disconnect(): when it returned early with -EBUSY for another ring than 0 it would call xen_blkif_put() again for already handled rings on a subsequent call. This will lead to inconsistencies in the refcnt handling. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | Merge tag 'xtensa-20170612' of git://github.com/jcmvbkbc/linux-xtensaLinus Torvalds2017-06-132-2/+2
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull Xtensa fixes from Max Filippov: - don't use linux IRQ #0 in legacy irq domains: fixes timer interrupt assignment when it's hardware IRQ # is 0 and the kernel is built w/o device tree support - reduce reservation size for double exception vector literals from 48 to 20 bytes: fixes build on cores with small user exception vector - cleanups: use kmalloc_array instead of kmalloc in simdisk_init and seq_puts instead of seq_printf in c_show. * tag 'xtensa-20170612' of git://github.com/jcmvbkbc/linux-xtensa: xtensa: don't use linux IRQ #0 xtensa: reduce double exception literal reservation xtensa: ISS: Use kmalloc_array() in simdisk_init() xtensa: Use seq_puts() in c_show()
| | * | xtensa: don't use linux IRQ #0Max Filippov2017-06-052-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linux IRQ #0 is reserved for error reporting and may not be used. Increase NR_IRQS for one additional slot and increase irq_domain_add_legacy parameter first_irq value to 1, so that linux IRQ #0 is not associated with hardware IRQ #0 in legacy IRQ domains. Introduce macro XTENSA_PIC_LINUX_IRQ for static translation of xtensa PIC hardware IRQ # to linux IRQ #. Use this macro in XTFPGA platform data definitions. This fixes inability to use hardware IRQ #0 in configurations that don't use device tree and allows for non-identity mapping between linux IRQ # and hardware IRQ #. Cc: stable@vger.kernel.org Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
| * | | Merge branch 'for-linus' of ↵Linus Torvalds2017-06-134-22/+46
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: - A fix for KVM to avoid kernel oopses in case of host protection faults due to runtime instrumentation - A fix for the AP bus to avoid dead devices after unbind / bind - A fix for a compile warning merged from the vfio_ccw tree - Updated default configurations * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: update defconfig s390/zcrypt: Fix blocking queue device after unbind/bind. s390/vfio_ccw: make some symbols static s390/kvm: do not rely on the ILC on kvm host protection fauls
| | * | | s390/zcrypt: Fix blocking queue device after unbind/bind.Harald Freudenberger2017-06-023-16/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the association between a queue device and the driver is released via unbind and later re-associated the queue device was not operational any more. Reason was a wrong administration of the card/queue lists within the ap device driver. This patch introduces revised card/queue list handling within the ap device driver: when an ap device is detected it is initial not added to the card/queue list any more. With driver probe the card device is added to the card list/the queue device is added to the queue list within a card. With driver remove the device is removed from the card/queue list. Additionally there are some situations within the ap device live where the lists need update upon card/queue device release (for example device hot unplug or suspend/resume). Signed-off-by: Harald Freudenberger <freude@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
| | * | | Merge tag 'vfio-ccw-20170522' of ↵Martin Schwidefsky2017-05-231-6/+6
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/vfio-ccw into fixes Pull vfio-ccw fix from Conelia Huck: "vfio-ccw: one patch" * Make some symbols in vfio-ccw static, as detected by sparse.
| | | * | | s390/vfio_ccw: make some symbols staticSebastian Ott2017-05-221-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make some symbols static to fix sparse warnings like: drivers/s390/cio/vfio_ccw_ops.c:73:1: warning: symbol 'mdev_type_attr_name' was not declared. Should it be static? Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
* | | | | | block: Change argument type of scsi_req_init()Bart Van Assche2017-06-203-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since scsi_req_init() works on a struct scsi_request, change the argument type into struct scsi_request *. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | block: Make most scsi_req_init() calls implicitBart Van Assche2017-06-2021-28/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of explicitly calling scsi_req_init() after blk_get_request(), call that function from inside blk_get_request(). Add an .initialize_rq_fn() callback function to the block drivers that need it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn() because it is too small to keep it as a separate function. Keep the scsi_req_init() call in ide_prep_sense() because it follows a blk_rq_init() call. References: commit 82ed4db499b8 ("block: split scsi_request out of struct request") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | null_blk: add support for shared tagsJens Axboe2017-06-201-42/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some storage drivers need to share tag sets between devices. It's useful to be able to model that with null_blk, to find hangs or performance issues. Add a 'shared_tags' bool module parameter that. If that is set to true and nr_devices is bigger than 1, all devices allocated will share the same tag set. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | nvme: host: unquiesce queue in nvme_kill_queues()Ming Lei2017-06-181-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When nvme_kill_queues() is run, queues may be in quiesced state, so we forcibly unquiesce queues to avoid blocking dispatch, and I/O hang can be avoided in remove path. Peviously we use blk_mq_start_stopped_hw_queues() as counterpart of blk_mq_quiesce_queue(), now we have introduced blk_mq_unquiesce_queue(), so use it explicitly. Cc: linux-nvme@lists.infradead.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | blk-mq: use the introduced blk_mq_unquiesce_queue()Ming Lei2017-06-183-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_unquiesce_queue() is used for unquiescing the queue explicitly, so replace blk_mq_start_stopped_hw_queues() with it. For the scsi part, this patch takes Bart's suggestion to switch to block quiesce/unquiesce API completely. Cc: linux-nvme@lists.infradead.org Cc: linux-scsi@vger.kernel.org Cc: dm-devel@redhat.com Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | block: remove bio_clone() and all references.NeilBrown2017-06-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bio_clone() is no longer used. Only bio_clone_bioset() or bio_clone_fast(). This is for the best, as bio_clone() used fs_bio_set, and filesystems are unlikely to want to use bio_clone(). So remove bio_clone() and all references. This includes a fix to some incorrect documentation. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | bcache: use kmalloc to allocate bio in bch_data_verify()NeilBrown2017-06-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function allocates a bio, then a collection of pages. It copes with failure. It currently uses a mempool() to allocate the bio, but alloc_page() to allocate the pages. These fail in different ways, so the usage is inconsistent. Change the bio_clone() to bio_clone_kmalloc() so that no pool is used either for the bio or the pages. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by : Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | xen-blkfront: remove bio splitting.NeilBrown2017-06-181-51/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bios that are re-submitted will pass through blk_queue_split() when blk_queue_bio() is called, and this will split the bio if necessary. There is no longer any need to do this splitting in xen-blkfront. Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | lightnvm/pblk-read: use bio_clone_fast()NeilBrown2017-06-183-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pblk_submit_read() uses bio_clone_bioset() but doesn't change the io_vec, so bio_clone_fast() is a better choice. It also uses fs_bio_set which is intended for filesystems. Using it in a device driver can deadlock. So allocate a new bioset, and and use bio_clone_fast(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | pktcdvd: use bio_clone_fast() instead of bio_clone()NeilBrown2017-06-181-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pktcdvd doesn't change the bi_io_vec of the clone bio, so it is more efficient to use bio_clone_fast(), and not clone the bi_io_vec. This requires providing a bio_set, and it is safest to provide a dedicated bio_set rather than sharing fs_bio_set, which filesytems use. This new bio_set, pkt_bio_set, can also be use for the bio_split() call as the two allocations (bio_clone_fast, and bio_split) are independent, neither can block a bio allocated by the other. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | drbd: use bio_clone_fast() instead of bio_clone()NeilBrown2017-06-183-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | drbd does not modify the bi_io_vec of the cloned bio, so there is no need to clone that part. So bio_clone_fast() is the better choice. For bio_clone_fast() we need to specify a bio_set. We could use fs_bio_set, which bio_clone() uses, or drbd_md_io_bio_set, which drbd uses for metadata, but it is generally best to avoid sharing bio_sets unless you can be certain that there are no interdependencies. So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast(). Also remove a "XXX cannot fail ???" comment because it definitely cannot fail - bio_clone_fast() doesn't fail if the GFP flags allow for sleeping. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | rbd: use bio_clone_fast() instead of bio_clone()NeilBrown2017-06-181-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bio_clone() makes a copy of the bi_io_vec, but rbd never changes that, so there is no need for a copy. bio_clone_fast() can be used instead, which avoids making the copy. This requires that we provide a bio_set. bio_clone() uses fs_bio_set, but it isn't, in general, safe to use the same bio_set at different levels of the stack, as that can lead to deadlocks. As filesystems use fs_bio_set, block devices shouldn't. As rbd never stacks, it is safe to have a single global bio_set for all rbd devices to use. So allocate that when the module is initialised, and use it with bio_clone_fast(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | blk: make the bioset rescue_workqueue optional.NeilBrown2017-06-185-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch converts bioset_create() to not create a workqueue by default, so alloctions will never trigger punt_bios_to_rescuer(). It also introduces a new flag BIOSET_NEED_RESCUER which tells bioset_create() to preserve the old behavior. All callers of bioset_create() that are inside block device drivers, are given the BIOSET_NEED_RESCUER flag. biosets used by filesystems or other top-level users do not need rescuing as the bio can never be queued behind other bios. This includes fs_bio_set, blkdev_dio_pool, btrfs_bioset, xfs_ioend_bioset, and one allocated by target_core_iblock.c. biosets used by md/raid do not need rescuing as their usage was recently audited and revised to never risk deadlock. It is hoped that most, if not all, of the remaining biosets can end up being the non-rescued version. Reviewed-by: Christoph Hellwig <hch@lst.de> Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes) Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | blk: replace bioset_create_nobvec() with a flags arg to bioset_create()NeilBrown2017-06-1812-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "flags" arguments are often seen as good API design as they allow easy extensibility. bioset_create_nobvec() is implemented internally as a variation in flags passed to __bioset_create(). To support future extension, make the internal structure part of the API. i.e. add a 'flags' argument to bioset_create() and discard bioset_create_nobvec(). Note that the bio_split allocations in drivers/md/raid* do not need the bvec mempool - they should have used bioset_create_nobvec(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | blk: remove bio_set arg from blk_queue_split()NeilBrown2017-06-1810-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_queue_split() is always called with the last arg being q->bio_split, where 'q' is the first arg. Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses q->bio_split. This is inconsistent and unnecessary. Remove the last arg and always use q->bio_split inside blk_queue_split() Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed) Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | loop: Add PF_LESS_THROTTLE to block/loop device thread.NeilBrown2017-06-181-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a filesystem is mounted from a loop device, writes are throttled by balance_dirty_pages() twice: once when writing to the filesystem and once when the loop_handle_cmd() writes to the backing file. This double-throttling can trigger positive feedback loops that create significant delays. The throttling at the lower level is seen by the upper level as a slow device, so it throttles extra hard. The PF_LESS_THROTTLE flag was created to handle exactly this circumstance, though with an NFS filesystem mounted from a local NFS server. It reduces the throttling on the lower layer so that it can proceed largely unthrottled. To demonstrate this, create a filesystem on a loop device and write (e.g. with dd) several large files which combine to consume significantly more than the limit set by /proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total time taken. When I do this directly on a device (no loop device) the total time for several runs (mkfs, mount, write 200 files, umount) is fairly stable: 28-35 seconds. When I do this over a loop device the times are much worse and less stable. 52-460 seconds. Half below 100seconds, half above. When I apply this patch, the times become stable again, though not as fast as the no-loop-back case: 53-72 seconds. There may be room for further improvement as the total overhead still seems too high, but this is a big improvement. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>