summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* block: remove the queue_bounce_pfn helperChristoph Hellwig2017-06-272-8/+3
| | | | | | | | Only used inside the bounce code, and opencoding it makes it more obvious what is going on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move bounce declarations to block/blk.hChristoph Hellwig2017-06-273-13/+14
| | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-map: call blk_queue_bounce from blk_rq_append_bioChristoph Hellwig2017-06-273-10/+4
| | | | | | | | | This makes moves the knowledge about bouncing out of the callers into the block core (just like we do for the normal I/O path), and allows to unexport blk_queue_bounce. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* pktcdvd: remove the call to blk_queue_bounceChristoph Hellwig2017-06-271-2/+0
| | | | | | | | | pktcdvd is a make_request based stacking driver and thus doesn't have any addressing limits on it's own. It also doesn't use bio_data() or page_address(), so it doesn't need a lowmem bounce either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme: add support for streams and directivesJens Axboe2017-06-273-4/+199
| | | | | | | | | | | | | | This adds support for Directives in NVMe, particular for the Streams directive. Support for Directives is a new feature in NVMe 1.3. It allows a user to pass in information about where to store the data, so that it the device can do so most effiently. If an application is managing and writing data with different life times, mixing differently retentioned data onto the same locations on flash can cause write amplification to grow. This, in turn, will reduce performance and life time of the device. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* btrfs: add support for passing in write hints for buffered writesJens Axboe2017-06-271-0/+1
| | | | | | | Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* xfs: add support for passing in write hints for buffered writesJens Axboe2017-06-271-0/+2
| | | | | | | Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* ext4: add support for passing in write hints for buffered writesJens Axboe2017-06-271-0/+2
| | | | | | Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* fs: add support for buffered writeback to pass down write hintsJens Axboe2017-06-272-5/+9
| | | | | | Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* fs: add O_DIRECT and aio support for sending down write life time hintsJens Axboe2017-06-274-0/+6
| | | | | | Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: expose write hints through debugfsJens Axboe2017-06-272-0/+27
| | | | | | | | | | | | | | | Useful to verify that things are working the way they should. Reading the file will return number of kb written with each write hint. Writing the file will reset the statistics. No care is taken to ensure that we don't race on updates. Drivers will write to q->write_hints[] if they handle a given write hint. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: add support for write hints in a bioJens Axboe2017-06-275-0/+20
| | | | | | | | | | | | | | No functional changes in this patch, we just use up some holes in the bio and request structures to define a write hint that we psas down the stack. Ensure that we don't merge requests that have different life time hints assigned to them, and that we inherit the write hint when cloning a bio. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* fs: add fcntl() interface for setting/getting write life time hintsJens Axboe2017-06-275-12/+120
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Define a set of write life time hints: RWH_WRITE_LIFE_NOT_SET No hint information set RWH_WRITE_LIFE_NONE No hints about write life time RWH_WRITE_LIFE_SHORT Data written has a short life time RWH_WRITE_LIFE_MEDIUM Data written has a medium life time RWH_WRITE_LIFE_LONG Data written has a long life time RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time The intent is for these values to be relative to each other, no absolute meaning should be attached to these flag names. Add an fcntl interface for querying these flags, and also for setting them as well: F_GET_RW_HINT Returns the read/write hint set on the underlying inode. F_SET_RW_HINT Set one of the above write hints on the underlying inode. F_GET_FILE_RW_HINT Returns the read/write hint set on the file descriptor. F_SET_FILE_RW_HINT Set one of the above write hints on the file descriptor. The user passes in a 64-bit pointer to get/set these values, and the interface returns 0/-1 on success/error. Sample program testing/implementing basic setting/getting of write hints is below. Add support for storing the write life time hint in the inode flags and in struct file as well, and pass them to the kiocb flags. If both a file and its corresponding inode has a write hint, then we use the one in the file, if available. The file hint can be used for sync/direct IO, for buffered writeback only the inode hint is available. This is in preparation for utilizing these hints in the block layer, to guide on-media data placement. /* * writehint.c: get or set an inode write hint */ #include <stdio.h> #include <fcntl.h> #include <stdlib.h> #include <unistd.h> #include <stdbool.h> #include <inttypes.h> #ifndef F_GET_RW_HINT #define F_LINUX_SPECIFIC_BASE 1024 #define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11) #define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12) #endif static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE", "RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM", "RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" }; int main(int argc, char *argv[]) { uint64_t hint; int fd, ret; if (argc < 2) { fprintf(stderr, "%s: file <hint>\n", argv[0]); return 1; } fd = open(argv[1], O_RDONLY); if (fd < 0) { perror("open"); return 2; } if (argc > 2) { hint = atoi(argv[2]); ret = fcntl(fd, F_SET_RW_HINT, &hint); if (ret < 0) { perror("fcntl: F_SET_RW_HINT"); return 4; } } ret = fcntl(fd, F_GET_RW_HINT, &hint); if (ret < 0) { perror("fcntl: F_GET_RW_HINT"); return 3; } printf("%s: hint %s\n", argv[1], str[hint]); close(fd); return 0; } Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: if LUNs are already allocated fix returnRakesh Pandit2017-06-271-2/+3
| | | | | | | | | | | While creating new device with NVM_DEV_CREATE if LUNs are already allocated ioctl would return -ENOMEM which is wrong. This patch propagates -EBUSY from nvm_reserve_luns which is correct response. Fixes: ade69e243 ("lightnvm: merge gennvm with core") Reviewed-by: Frans Klaver <fransklaver@gmail.com> Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: fail gracefully on irrec. errorJavier González2017-06-2611-114/+332
| | | | | | | | | | | | | | | | | | | | | | Due to user writes being decoupled from media writes because of the need of an intermediate write buffer, irrecoverable media write errors lead to pblk stalling; user writes fill up the buffer and end up in an infinite retry loop. In order to let user writes fail gracefully, it is necessary for pblk to keep track of its own internal state and prevent further writes from being placed into the write buffer. This patch implements a state machine to keep track of internal errors and, in case of failure, fail further user writes in an standard way. Depending on the type of error, pblk will do its best to persist buffered writes (which are already acknowledged) and close down on a graceful manner. This way, data might be recovered by re-instantiating pblk. Such state machine paves out the way for a state-based FTL log. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: set mempool and workqueue params.Javier González2017-06-264-20/+44
| | | | | | | | | | | | Make constants to define sizes for internal mempools and workqueues. In this process, adjust the values to be more meaningful given the internal constrains of the FTL. In order to do this for workqueues, separate the current auxiliary workqueue into two dedicated workqueues to manage lines being closed and bad blocks. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: redesign GC algorithmJavier González2017-06-266-278/+368
| | | | | | | | | | | | | | | | | | | | At the moment, in order to get enough read parallelism, we have recycled several lines at the same time. This approach has proven not to work well when reaching capacity, since we end up mixing valid data from all lines, thus not maintaining a sustainable free/recycled line ratio. The new design, relies on a two level workqueue mechanism. In the first level, we read the metadata for a number of lines based on the GC list they reside on (this is governed by the number of valid sectors in each line). In the second level, we recycle a single line at a time. Here, we issue reads in parallel, while a single GC write thread places data in the write buffer. This design allows to (i) only move data from one line at a time, thus maintaining a sane free/recycled ration and (ii) maintain the GC writer busy with recycled data. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: add lock assertions on helpersJavier González2017-06-261-0/+4
| | | | | | | | Add lockdep assertions on helper functions. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: cleanup unnecessary codeJavier González2017-06-262-7/+0
| | | | | | | | Cleanup unnecessary headers and code lines. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: set metadata list for all I/OsJavier González2017-06-262-38/+54
| | | | | | | | | Set a dma area for all I/Os in order to read/write from/to the metadata stored on the per-sector out-of-bound area. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: choose optimal victim GC lineJavier González2017-06-261-1/+15
| | | | | | | | | | | | | | | | | | At the moment, we separate the closed lines on three different list based on their number of valid sectors. GC recycles lines from each list based on capacity. Lines from each list are taken in a FIFO fashion. Since the number of lines is limited (it corresponds to the number of blocks in a LUN, which is somewhere between 1000-2000), we can afford scanning the lists to choose the optimal line to be recycled. This helps specially in lines with a high number of valid sectors. If the number of blocks per LUN increases, we will consider a more efficient policy. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: decouple bad block from line allocJavier González2017-06-261-16/+37
| | | | | | | | | Decouple bad block discovery from line allocation logic. This allows to return meaningful error codes in case of bad block discovery failure. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: simplify meta. memory allocationJavier González2017-06-264-8/+8
| | | | | | | | | | smeta size will always be suitable for a kmalloc allocation. Simplify the code and leave the vmalloc fallback only for emeta, where the pblk configuration has an impact. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: issue multiplane reads if possibleJavier González2017-06-264-12/+51
| | | | | | | | | | If a read request is sequential and its size aligns with a multi-plane page size, use the multi-plane hint to process the I/O in parallel in the controller. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: delete redundant buffer pointerJavier González2017-06-267-41/+11
| | | | | | | | | | | After refactoring the metadata path, the backpointer controlling synced I/Os in a line becomes unnecessary; metadata is scheduled on the write thread, thus we know when the end of the line is reached and act on it directly. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: delete redundant debug line statJavier González2017-06-261-5/+3
| | | | | | | | | | Remove a legacy variable that helped verifying the consistency of the run-time metadata for the free line list. With the new metadata layout, this check is no longer necessary. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: sched. metadata on write threadJavier González2017-06-268-285/+673
| | | | | | | | | | | | | | | | | | | | | | | | At the moment, line metadata is persisted on a separate work queue, that is kicked each time that a line is closed. The assumption when designing this was that freeing the write thread from creating a new write request was better than the potential impact of writes colliding on the media (user I/O and metadata I/O). Experimentation has proven that this assumption is wrong; collision can cause up to 25% of bandwidth and introduce long tail latencies on the write thread, which potentially cause user write threads to spend more time spinning to get a free entry on the write buffer. This patch moves the metadata logic to the write thread. When a line is closed, remaining metadata is written in memory and is placed on a metadata queue. The write thread then takes the metadata corresponding to the previous line, creates the write request and schedules it to minimize collisions on the media. Using this approach, we see that we can saturate the media's bandwidth, which helps reducing both write latencies and the spinning time for user writer threads. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: rename read request poolJavier González2017-06-265-37/+38
| | | | | | | | | | Read requests allocate some extra memory to store its per I/O context. Instead of requiring yet another memory pool for other type of requests, generalize this context allocation (and change naming accordingly). Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: generalize erase pathJavier González2017-06-266-90/+116
| | | | | | | | | | | | | | | | Erase I/Os are scheduled with the following goals in mind: (i) minimize LUNs collisions with write I/Os, and (ii) even out the price of erasing on every write, instead of putting all the burden on when garbage collection runs. This works well on the current design, but is specific to the default mapping algorithm. This patch generalizes the erase path so that other mapping algorithms can select an arbitrary line to be erased instead. It also gets rid of the erase semaphore since it creates jittering for user writes. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: expose max sec per write on sysfsJavier González2017-06-264-1/+48
| | | | | | | | | | Allow to configure the number of maximum sectors per write command through sysfs. This makes it easier to tune write command sizes for different controller configurations. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: add debug stat for read cache hitsJavier González2017-06-264-1/+10
| | | | | | | | Add a new debug counter to measure cache hits on the read path Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: pblk: spare double cpu_to_le64 calc.Javier González2017-06-262-4/+5
| | | | | | | | Spare a double calculation on the fast write path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: propagate right error code to targetJavier González2017-06-261-1/+1
| | | | | | | | | If nvme_alloc_request fails, propagate the right error, instead of assuming ENOMEM. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* lightnvm: re-convert ppa format on I/O failureJavier González2017-06-261-1/+7
| | | | | | | | | | In case of a failure when submitting a request, convert the ppa_list addresses to the target format so that it can interpret ppas for recovery Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <matias@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* mtip32xx: fix up the checking for internal command failureJens Axboe2017-06-231-17/+4
| | | | | | | | | | | | | | | This fixes up two commits that have touched this driver. The command status field is now a blk_status_t, so we can't check for < 0 and we definitely can't assume it's holding -Exxxx error values. All we care about here is whether ->status is zero or not. Check for that, and remove the various attempts at smart error reporting. Just log to dmesg what command failed, and the blk_status_t value. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 2a842acab109 ("block: introduce new block status code type") Fixes: 3f5e6a35774c ("mtip32xx: convert internal command issue to block IO path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge commit '8e8320c9315c' into for-4.13/blockJens Axboe2017-06-2227-148/+275
|\ | | | | | | | | | | | | | | | | Pull in the fix for shared tags, as it conflicts with the pending changes in for-4.13/block. We already pulled in v4.12-rc5 to solve other conflicts or get fixes that went into 4.12, so not a lot of changes in this merge. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: fix performance regression with shared tagsJens Axboe2017-06-214-24/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have shared tags enabled, then every IO completion will trigger a full loop of every queue belonging to a tag set, and every hardware queue for each of those queues, even if nothing needs to be done. This causes a massive performance regression if you have a lot of shared devices. Instead of doing this huge full scan on every IO, add an atomic counter to the main queue that tracks how many hardware queues have been marked as needing a restart. With that, we can avoid looking for restartable queues, if we don't have to. Max reports that this restores performance. Before this patch, 4K IOPS was limited to 22-23K IOPS. With the patch, we are running at 950-970K IOPS. Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared") Reported-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Tested-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Merge branch 'stable/for-jens-4.12' of ↵Jens Axboe2017-06-203-41/+26
| |\ | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus Pull xen-blkback fixes from Konrad: "Security and memory leak fixes in xen block driver."
| | * xen-blkback: don't leak stack data via response ringJan Beulich2017-06-132-31/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than constructing a local structure instance on the stack, fill the fields directly on the shared ring, just like other backends do. Build on the fact that all response structure flavors are actually identical (the old code did make this assumption too). This is XSA-216. Cc: stable@vger.kernel.org Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: don't use xen_blkif_get() in xen-blkback kthreadJuergen Gross2017-06-132-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to use xen_blkif_get()/xen_blkif_put() in the kthread of xen-blkback. Thread stopping is synchronous and using the blkif reference counting in the kthread will avoid to ever let the reference count drop to zero at the end of an I/O running concurrent to disconnecting and multiple rings. Setting ring->xenblkd to NULL after stopping the kthread isn't needed as the kthread does this already. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: don't free be structure too earlyJuergen Gross2017-06-131-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The be structure must not be freed when freeing the blkif structure isn't done. Otherwise a use-after-free of be when unmapping the ring used for communicating with the frontend will occur in case of a late call of xenblk_disconnect() (e.g. due to an I/O still active when trying to disconnect). Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/blkback: fix disconnect while I/Os in flightJuergen Gross2017-06-132-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today disconnecting xen-blkback is broken in case there are still I/Os in flight: xen_blkif_disconnect() will bail out early without releasing all resources in the hope it will be called again when the last request has terminated. This, however, won't happen as xen_blkif_free() won't be called on termination of the last running request: xen_blkif_put() won't decrement the blkif refcnt to 0 as xen_blkif_disconnect() didn't finish before thus some xen_blkif_put() calls in xen_blkif_disconnect() didn't happen. To solve this deadlock xen_blkif_disconnect() and xen_blkif_alloc_rings() shouldn't use xen_blkif_put() and xen_blkif_get() but use some other way to do their accounting of resources. This at once fixes another error in xen_blkif_disconnect(): when it returned early with -EBUSY for another ring than 0 it would call xen_blkif_put() again for already handled rings on a subsequent call. This will lead to inconsistencies in the refcnt handling. Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Steven Haigh <netwiz@crc.id.au> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | block: Fix a blk_exit_rl() regressionBart Van Assche2017-06-142-12/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid that the following complaint is reported: BUG: sleeping function called from invalid context at kernel/workqueue.c:2790 in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3 1 lock held by rcuop/3/41: #0: (rcu_callback){......}, at: [<ffffffff8111f9a2>] rcu_nocb_kthread+0x282/0x500 Call Trace: dump_stack+0x86/0xcf ___might_sleep+0x174/0x260 __might_sleep+0x4a/0x80 flush_work+0x7e/0x2e0 __cancel_work_timer+0x143/0x1c0 cancel_work_sync+0x10/0x20 blk_throtl_exit+0x25/0x60 blkcg_exit_queue+0x35/0x40 blk_release_queue+0x42/0x130 kobject_put+0xa9/0x190 This happens since we invoke callbacks that need to block from the queue release handler. Fix this by pushing the final release to a workqueue. Reported-by: Ross Zwisler <zwisler@gmail.com> Fixes: commit b425e5049258 ("block: Avoid that blk_exit_rl() triggers a use-after-free") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Updated changelog Signed-off-by: Jens Axboe <axboe@fb.com>
| * | Merge tag 'xtensa-20170612' of git://github.com/jcmvbkbc/linux-xtensaLinus Torvalds2017-06-139-22/+18
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull Xtensa fixes from Max Filippov: - don't use linux IRQ #0 in legacy irq domains: fixes timer interrupt assignment when it's hardware IRQ # is 0 and the kernel is built w/o device tree support - reduce reservation size for double exception vector literals from 48 to 20 bytes: fixes build on cores with small user exception vector - cleanups: use kmalloc_array instead of kmalloc in simdisk_init and seq_puts instead of seq_printf in c_show. * tag 'xtensa-20170612' of git://github.com/jcmvbkbc/linux-xtensa: xtensa: don't use linux IRQ #0 xtensa: reduce double exception literal reservation xtensa: ISS: Use kmalloc_array() in simdisk_init() xtensa: Use seq_puts() in c_show()
| | * | xtensa: don't use linux IRQ #0Max Filippov2017-06-056-15/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linux IRQ #0 is reserved for error reporting and may not be used. Increase NR_IRQS for one additional slot and increase irq_domain_add_legacy parameter first_irq value to 1, so that linux IRQ #0 is not associated with hardware IRQ #0 in legacy IRQ domains. Introduce macro XTENSA_PIC_LINUX_IRQ for static translation of xtensa PIC hardware IRQ # to linux IRQ #. Use this macro in XTFPGA platform data definitions. This fixes inability to use hardware IRQ #0 in configurations that don't use device tree and allows for non-identity mapping between linux IRQ # and hardware IRQ #. Cc: stable@vger.kernel.org Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
| | * | xtensa: reduce double exception literal reservationMax Filippov2017-06-051-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Double exception vector only needs 20 bytes of space for 5 literals, not 48. Reduce the reservation for double exception vector literals accordingly. This fixes build for configurations with small user exception vector size. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
| | * | xtensa: ISS: Use kmalloc_array() in simdisk_init()Markus Elfring2017-05-081-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * A multiplication for the size determination of a memory allocation indicated that an array data structure should be processed. Thus use the corresponding function "kmalloc_array". This issue was detected by using the Coccinelle software. * Replace the specification of a data type by a pointer dereference to make the corresponding size determination a bit safer according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
| | * | xtensa: Use seq_puts() in c_show()Markus Elfring2017-05-081-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A string which did not contain a data format specification should be put into a sequence. Thus use the corresponding function "seq_puts". This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
| * | | Merge branch 'for-linus' of ↵Linus Torvalds2017-06-1310-49/+146
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: - A fix for KVM to avoid kernel oopses in case of host protection faults due to runtime instrumentation - A fix for the AP bus to avoid dead devices after unbind / bind - A fix for a compile warning merged from the vfio_ccw tree - Updated default configurations * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: update defconfig s390/zcrypt: Fix blocking queue device after unbind/bind. s390/vfio_ccw: make some symbols static s390/kvm: do not rely on the ILC on kvm host protection fauls
| | * | | s390: update defconfigMartin Schwidefsky2017-06-085-21/+87
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>