summaryrefslogtreecommitdiffstats
path: root/block
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2016-10-211-2/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A set of fixes that missed the merge window, mostly due to me being away around that time. Nothing major here, a mix of nvme cleanups and fixes, and one fix for the badblocks handling" * 'for-linus' of git://git.kernel.dk/linux-block: nvmet: use symbolic constants for CNS values nvme: use symbolic constants for CNS values nvme.h: add an enum for cns values nvme.h: don't use uuid_be nvme.h: resync with nvme-cli nvme: Add tertiary number to NVME_VS nvme : Add sysfs entry for NVMe CMBs when appropriate nvme: don't schedule multiple resets nvme: Delete created IO queues on reset nvme: Stop probing a removed device badblocks: fix overlapping check for clearing
| * badblocks: fix overlapping check for clearingTomasz Majchrzak2016-10-121-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Current bad block clear implementation assumes the range to clear overlaps with at least one bad block already stored. If given range to clear precedes first bad block in a list, the first entry is incorrectly updated. Check not only if stored block end is past clear block end but also if stored block start is before clear block end. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge tag 'gcc-plugins-v4.9-rc1' of ↵Linus Torvalds2016-10-151-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull gcc plugins update from Kees Cook: "This adds a new gcc plugin named "latent_entropy". It is designed to extract as much possible uncertainty from a running system at boot time as possible, hoping to capitalize on any possible variation in CPU operation (due to runtime data differences, hardware differences, SMP ordering, thermal timing variation, cache behavior, etc). At the very least, this plugin is a much more comprehensive example for how to manipulate kernel code using the gcc plugin internals" * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: latent_entropy: Mark functions with __latent_entropy gcc-plugins: Add latent_entropy plugin
| * | latent_entropy: Mark functions with __latent_entropyEmese Revfy2016-10-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The __latent_entropy gcc attribute can be used only on functions and variables. If it is on a function then the plugin will instrument it for gathering control-flow entropy. If the attribute is on a variable then the plugin will initialize it with random contents. The variable must be an integer, an integer array type or a structure with integer fields. These specific functions have been selected because they are init functions (to help gather boot-time entropy), are called at unpredictable times, or they have variable loops, each of which provide some level of latent entropy. Signed-off-by: Emese Revfy <re.emese@gmail.com> [kees: expanded commit message] Signed-off-by: Kees Cook <keescook@chromium.org>
* | | Merge branch 'for-4.9' of ↵Linus Torvalds2016-10-141-3/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - tracepoints for basic cgroup management operations added - kernfs and cgroup path formatting functions updated to behave in the style of strlcpy() - non-critical bug fixes * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() cpuset: fix error handling regression in proc_cpuset_show() cgroup: add tracepoints for basic operations cgroup: make cgroup_path() and friends behave in the style of strlcpy() kernfs: remove kernfs_path_len() kernfs: make kernfs_path*() behave in the style of strlcpy() kernfs: add dummy implementation of kernfs_path_from_node()
| * | | blkcg: Unlock blkcg_pol_mutex only once when cpd == NULLBart Van Assche2016-09-301-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register() such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch avoids that smatch reports the following error: block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock 'mutex:&blkcg_pol_mutex' Fixes: 06b285bd1125 ("blkcg: fix blkcg_policy_data allocation bug") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> # v4.2+ Signed-off-by: Tejun Heo <tj@kernel.org>
* | | | block: require write_same and discard requests align to logical block sizeDarrick J. Wong2016-10-111-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure that the offset and length arguments that we're using to construct WRITE SAME and DISCARD requests are actually aligned to the logical block size. Failure to do this causes other errors in other parts of the block layer or the SCSI layer because disks don't support partial logical block writes. Link: http://lkml.kernel.org/r/147518379026.22791.4437508871355153928.stgit@birch.djwong.org Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mike Snitzer <snitzer@redhat.com> # tweaked header Cc: Brian Foster <bfoster@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | block: invalidate the page cache when issuing BLKZEROOUTDarrick J. Wong2016-10-111-6/+12
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "fallocate for block devices", v11. This is a patchset to fix page cache coherency with BLKZEROOUT and implement fallocate for block devices. The first patch is a fix to the existing BLKZEROOUT ioctl to invalidate the page cache if the zeroing command to the underlying device succeeds. Without this patch we still have the pagecache coherence bug that's been in the kernel forever. The second patch changes the internal block device functions to reject attempts to discard or zeroout that are not aligned to the logical block size. Previously, we only checked that the start/len parameters were 512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA devices. The third patch creates an fallocate handler for block devices, wires up the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent fallocate interface between files and block devices. It also allows the combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard. Test cases for the new block device fallocate are now in xfstests as generic/349-351. This patch (of 3): Invalidate the page cache (as a regular O_DIRECT write would do) to avoid returning stale cache contents at a later time. Link: http://lkml.kernel.org/r/147518378313.22791.16649519283678515021.stgit@birch.djwong.org Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for-4.9/block-smp' of git://git.kernel.dk/linux-blockLinus Torvalds2016-10-094-142/+57
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull blk-mq CPU hotplug update from Jens Axboe: "This is the conversion of blk-mq to the new hotplug state machine" * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block: blk-mq: fixup "Convert to new hotplug state machine" blk-mq: Convert to new hotplug state machine blk-mq/cpu-notif: Convert to new hotplug state machine
| * | | blk-mq: fixup "Convert to new hotplug state machine"Sebastian Andrzej Siewior2016-09-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "blk_mq_queue_reinit_dead()" just cleared the cpumask instead doing a copy. Since we might never had an online callback we could end up with a ZERO mask which in turn leads to crash as test robot demonstarted. Fixes: 65d5291eee66 ("blk-mq: Convert to new hotplug state machine") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | blk-mq: Convert to new hotplug state machineSebastian Andrzej Siewior2016-09-221-44/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Install the callbacks via the state machine so we can phase out the cpu hotplug notifiers mess. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-block@vger.kernel.org Cc: rt@linutronix.de Cc: Christoph Hellwing <hch@lst.de> Link: http://lkml.kernel.org/r/20160919212601.180033814@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | blk-mq/cpu-notif: Convert to new hotplug state machineThomas Gleixner2016-09-224-98/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the block-mq notifier list management with the multi instance facility in the cpu hotplug state machine. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-block@vger.kernel.org Cc: rt@linutronix.de Cc: Christoph Hellwing <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-blockLinus Torvalds2016-10-0910-92/+110
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull blk-mq irq/cpu mapping updates from Jens Axboe: "This is the block-irq topic branch for 4.9-rc. It's mostly from Christoph, and it allows drivers to specify their own mappings, and more importantly, to share the blk-mq mappings with the IRQ affinity mappings. It's a good step towards making this work better out of the box" * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block: blk_mq: linux/blk-mq.h does not include all the headers it depends on blk-mq: kill unused blk_mq_create_mq_map() blk-mq: get rid of the cpumask in struct blk_mq_tags nvme: remove the post_scan callout nvme: switch to use pci_alloc_irq_vectors blk-mq: provide a default queue mapping for PCI device blk-mq: allow the driver to pass in a queue mapping blk-mq: remove ->map_queue blk-mq: only allocate a single mq_map per tag_set blk-mq: don't redistribute hardware queues on a CPU hotplug event
| * | | | blk_mq: linux/blk-mq.h does not include all the headers it depends onStephen Rothwell2016-09-193-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and building block/blk-mq-pci.o should depend on CONFIG_BLOCK Fixes: 973c4e372c8f ("blk-mq: provide a default queue mapping for PCI device") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: kill unused blk_mq_create_mq_map()Jens Axboe2016-09-151-24/+0
| | | | | | | | | | | | | | | | | | | | | | | | | Fixes 1b157939f92a ("blk-mq: get rid of the cpumask in struct blk_mq_tags") Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: get rid of the cpumask in struct blk_mq_tagsChristoph Hellwig2016-09-153-11/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unused now that NVMe sets up irq affinity before calling into blk-mq. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: provide a default queue mapping for PCI deviceChristoph Hellwig2016-09-152-1/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: allow the driver to pass in a queue mappingChristoph Hellwig2016-09-153-26/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows drivers specify their own queue mapping by overriding the setup-time function that builds the mq_map. This can be used for example to build the map based on the MSI-X vector mapping provided by the core interrupt layer for PCI devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: remove ->map_queueChristoph Hellwig2016-09-155-43/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All drivers use the default, so provide an inline version of it. If we ever need other queue mapping we can add an optional method back, although supporting will also require major changes to the queue setup code. This provides better code generation, and better debugability as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: only allocate a single mq_map per tag_setChristoph Hellwig2016-09-151-8/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The mapping is identical for all queues in a tag_set, so stop wasting memory for building multiple. Note that for now I've kept the mq_map pointer in the request_queue, but we'll need to investigate if we can remove it without suffering too much from the additional pointer chasing. The same would apply to the mq_ops pointer as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: don't redistribute hardware queues on a CPU hotplug eventChristoph Hellwig2016-09-151-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently blk-mq will totally remap hardware context when a CPU hotplug even happened, which causes major havoc for drivers, as they are never told about this remapping. E.g. any carefully sorted out CPU affinity will just be completely messed up. The rebuild also doesn't really help for the common case of cpu hotplug, which is soft onlining / offlining of cpus - in this case we should just leave the queue and irq mapping as is. If it actually worked it would have helped in the case of physical cpu hotplug, although for that we'd need a way to actually notify the driver. Note that drivers may already be able to accommodate such a topology change on their own, e.g. using the reset_controller sysfs file in NVMe will cause the driver to get things right for this case. With the rebuild removed we will simplify retain the queue mapping for a soft offlined CPU that will work when it comes back online, and will map any newly onlined CPU to queue 0 until the driver initiates a rebuild of the queue map. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-blockLinus Torvalds2016-10-0710-574/+244
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer updates from Jens Axboe: "This is the main pull request for block layer changes in 4.9. As mentioned at the last merge window, I've changed things up and now do just one branch for core block layer changes, and driver changes. This avoids dependencies between the two branches. Outside of this main pull request, there are two topical branches coming as well. This pull request contains: - A set of fixes, and a conversion to blk-mq, of nbd. From Josef. - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd. Followup dependency fix from Geert. - General fixes from Bart, Baoyou, Guoqing, and Linus W. - CFQ async write starvation fix from Glauber. - Add supprot for delayed kick of the requeue list, from Mike. - Pull out the scalable bitmap code from blk-mq-tag.c and make it generally available under the name of sbitmap. Only blk-mq-tag uses it for now, but the blk-mq scheduling bits will use it as well. From Omar. - bdev thaw error progagation from Pierre. - Improve the blk polling statistics, and allow the user to clear them. From Stephen. - Set of minor cleanups from Christoph in block/blk-mq. - Set of cleanups and optimizations from me for block/blk-mq. - Various nvme/nvmet/nvmeof fixes from the various folks" * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits) fs/block_dev.c: return the right error in thaw_bdev() nvme: Pass pointers, not dma addresses, to nvme_get/set_features() nvme/scsi: Remove power management support nvmet: Make dsm number of ranges zero based nvmet: Use direct IO for writes admin-cmd: Added smart-log command support. nvme-fabrics: Add host_traddr options field to host infrastructure nvme-fabrics: revise host transport option descriptions nvme-fabrics: rework nvmf_get_address() for variable options nbd: use BLK_MQ_F_BLOCKING blkcg: Annotate blkg_hint correctly cfq: fix starvation of asynchronous writes blk-mq: add flag for drivers wanting blocking ->queue_rq() blk-mq: remove non-blocking pass in blk_mq_map_request blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue() block: export bio_free_pages to other modules lightnvm: propagate device_add() error code lightnvm: expose device geometry through sysfs lightnvm: control life of nvm_dev in driver blk-mq: register device instead of disk ...
| * | | | | cfq: fix starvation of asynchronous writesGlauber Costa2016-09-231-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While debugging timeouts happening in my application workload (ScyllaDB), I have observed calls to open() taking a long time, ranging everywhere from 2 seconds - the first ones that are enough to time out my application - to more than 30 seconds. The problem seems to happen because XFS may block on pending metadata updates under certain circumnstances, and that's confirmed with the following backtrace taken by the offcputime tool (iovisor/bcc): ffffffffb90c57b1 finish_task_switch ffffffffb97dffb5 schedule ffffffffb97e310c schedule_timeout ffffffffb97e1f12 __down ffffffffb90ea821 down ffffffffc046a9dc xfs_buf_lock ffffffffc046abfb _xfs_buf_find ffffffffc046ae4a xfs_buf_get_map ffffffffc046babd xfs_buf_read_map ffffffffc0499931 xfs_trans_read_buf_map ffffffffc044a561 xfs_da_read_buf ffffffffc0451390 xfs_dir3_leaf_read.constprop.16 ffffffffc0452b90 xfs_dir2_leaf_lookup_int ffffffffc0452e0f xfs_dir2_leaf_lookup ffffffffc044d9d3 xfs_dir_lookup ffffffffc047d1d9 xfs_lookup ffffffffc0479e53 xfs_vn_lookup ffffffffb925347a path_openat ffffffffb9254a71 do_filp_open ffffffffb9242a94 do_sys_open ffffffffb9242b9e sys_open ffffffffb97e42b2 entry_SYSCALL_64_fastpath 00007fb0698162ed [unknown] Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very high "Dispatch wait" times, on the dozens of seconds range and consistent with the open() times I have saw in that run. Still from the blktrace output, we can after searching a bit, identify the request that wasn't dispatched: 8,0 11 152 81.092472813 804 A WM 141698288 + 8 <- (8,1) 141696240 8,0 11 153 81.092472889 804 Q WM 141698288 + 8 [xfsaild/sda1] 8,0 11 154 81.092473207 804 G WM 141698288 + 8 [xfsaild/sda1] 8,0 11 206 81.092496118 804 I WM 141698288 + 8 ( 22911) [xfsaild/sda1] <==== 'I' means Inserted (into the IO scheduler) ===================================> 8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0] <==== Only 15s later the CFQ scheduler dispatches the request ======================> As we can see above, in this particular example CFQ took 15 seconds to dispatch this request. Going back to the full trace, we can see that the xfsaild queue had plenty of opportunity to run, and it was selected as the active queue many times. It would just always be preempted by something else (example): 8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request 8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr 8,0 1 0 81.117914044 0 m N cfq1618SN / preempt 8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1 8,0 1 0 81.117914755 0 m N cfq767A / resid=40 8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448 8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0 where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB IO dispatchers. The requests preempting the xfsaild queue are synchronous requests. That's a characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests. While it can be argued that preempting ASYNC requests in favor of SYNC is part of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's goal. Moreover, unless I am misunderstanding something, that breaks the expectation set by the "fifo_expire_async" tunable, which in my system is set to the default. Looking at the code, it seems to me that the issue is that after we make an async queue active, there is no guarantee that it will execute any request. When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC requests in flight. An incoming request from another queue can also preempt it in such situation before we have the chance to execute anything (as seen in the trace above). This patch sets the must_dispatch flag if we notice that we have requests that are already fifo_expired. This flag is always cleared after cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin the queue for subsequent requests (unless they are themselves expired) Care is taken during preempt to still allow rt requests to preempt us regardless. Testing my workload with this patch applied produces much better results. From the application side I see no timeouts, and the open() latency histogram generated by systemtap looks much better, with the worst outlier at 131ms: Latency histogram of xfs_buf_lock acquisition (microseconds): value |-------------------------------------------------- count 0 | 11 1 |@@@@ 161 2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966 4 |@ 54 8 | 36 16 | 7 32 | 0 64 | 0 ~ 1024 | 0 2048 | 0 4096 | 1 8192 | 1 16384 | 2 32768 | 0 65536 | 0 131072 | 1 262144 | 0 524288 | 0 Signed-off-by: Glauber Costa <glauber@scylladb.com> CC: Jens Axboe <axboe@kernel.dk> CC: linux-block@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Glauber Costa <glauber@scylladb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | blk-mq: add flag for drivers wanting blocking ->queue_rq()Jens Axboe2016-09-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its ->queue_rq() handler. For that case, blk-mq ensures that we always calls it from a safe context. Signed-off-by: Jens Axboe <axboe@fb.com> Tested-by: Josef Bacik <jbacik@fb.com>
| * | | | | blk-mq: remove non-blocking pass in blk_mq_map_requestChristoph Hellwig2016-09-221-13/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bt_get already does a non-blocking pass as well as running the queue when scheduling internally, no need to duplicate it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | | blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()Jens Axboe2016-09-221-14/+2
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Two cases: 1) blk_mq_alloc_request() needlessly re-runs the queue, after calling into the tag allocation without NOWAIT set. We don't need to do that. 2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with the async flag set to false. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | | | block: export bio_free_pages to other modulesGuoqing Jiang2016-09-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bio_free_pages is introduced in commit 1dfa0f68c040 ("block: add a helper to free bio bounce buffer pages"), we can reuse the func in other modules after it was imported. Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@fb.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: register device instead of diskMatias Bjørling2016-09-212-12/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable devices without a gendisk instance to register itself with blk-mq and expose the associated multi-queue sysfs entries. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | sbitmap: randomize initial alloc_hint valuesOmar Sandoval2016-09-172-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to get good cache behavior from a sbitmap, we want each CPU to stick to its own cacheline(s) as much as possible. This might happen naturally as the bitmap gets filled up and the alloc_hint values spread out, but we really want this behavior from the start. blk-mq apparently intended to do this, but the code to do this was never wired up. Get rid of the dead code and make it part of the sbitmap library. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | sbitmap: push alloc policy into sbitmap_queueOmar Sandoval2016-09-172-19/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Again, there's no point in passing this in every time. Make it part of struct sbitmap_queue and clean up the API. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | sbitmap: push per-cpu last_tag into sbitmap_queueOmar Sandoval2016-09-174-40/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allocating your own per-cpu allocation hint separately makes for an awkward API. Instead, allocate the per-cpu hint as part of the struct sbitmap_queue. There's no point for a struct sbitmap_queue without the cache, but you can still use a bare struct sbitmap. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: abstract tag allocation out into sbitmap libraryOmar Sandoval2016-09-175-469/+153
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a generally useful data structure, so make it available to anyone else who might want to use it. It's also a nice cleanup separating the allocation logic from the rest of the tag handling logic. The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only selected by CONFIG_BLOCK for now. This should be a complete noop functionality-wise. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | | blk-mq: account higher order dispatchJens Axboe2016-09-162-7/+14
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently account a '0' dispatch, and anything above that still falls below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more, we don't account it. Change the last bucket to be inclusive of anything above the range we track, and have the sysfs file reflect that by including a '+' in the output: $ cat /sys/block/nvme0n1/mq/0/dispatched 0 1006 1 20229 2 1 4 0 8 0 16 0 32+ 0 Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
| * | | blk-mq: introduce blk_mq_delay_kick_requeue_list()Mike Snitzer2016-09-141-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_delay_kick_requeue_list() provides the ability to kick the q->requeue_list after a specified time. To do this the request_queue's 'requeue_work' member was changed to a delayed_work. blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued requests while it doesn't make sense to immediately requeue them (e.g. when all paths in a DM multipath have failed). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: remove remnant refs to hardsectLinus Walleij2016-09-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e1defc4ff0cf57aca6c5e3ff99fa503f5943c1f1 "block: Do away with the notion of hardsect_size" removed the notion of "hardware sector size" from the kernel in favor of logical block size, but references remain in comments and documentation. Update the remaining sites mentioning hardsect. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: enable zeroing of io_poll statisticsStephen Bates2016-09-141-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow the io_poll statistics to be zeroed to make for easier logging of polling event. Signed-off-by: Stephen Bates <sbates@raithlin.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: add poll_considered statisticStephen Bates2016-09-142-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to help determine the effectiveness of polling in a running system it is usful to determine the ratio of how often the poll function is called vs how often the completion is checked. For this reason we add a poll_considered variable and add it to the sysfs entry for io_poll. Signed-off-by: Stephen Bates <sbates@raithlin.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | blk-mq: prefetch request in blk_mq_tag_to_rq()Jens Axboe2016-08-291-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When drivers or the core calls this function, they usually dereference the request shortly there after. Prefetch the first cache line. Profiling IO workloads shows that this is the most common cache miss on the block side of things. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | blk-mq: turn hctx->run_work into a regular work structJens Axboe2016-08-292-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't need the larger delayed work struct, since we always run it immediately. Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: add kblockd_schedule_work_on()Jens Axboe2016-08-291-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | Add a helper to schedule a regular struct work on a particular CPU. Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | Merge branch 'smp-hotplug-for-linus' of ↵Linus Torvalds2016-10-031-17/+10
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug updates from Thomas Gleixner: "Yet another batch of cpu hotplug core updates and conversions: - Provide core infrastructure for multi instance drivers so the drivers do not have to keep custom lists. - Convert custom lists to the new infrastructure. The block-mq custom list conversion comes through the block tree and makes the diffstat tip over to more lines removed than added. - Handle unbalanced hotplug enable/disable calls more gracefully. - Remove the obsolete CPU_STARTING/DYING notifier support. - Convert another batch of notifier users. The relayfs changes which conflicted with the conversion have been shipped to me by Andrew. The remaining lot is targeted for 4.10 so that we finally can remove the rest of the notifiers" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) cpufreq: Fix up conversion to hotplug state machine blk/mq: Reserve hotplug states for block multiqueue x86/apic/uv: Convert to hotplug state machine s390/mm/pfault: Convert to hotplug state machine mips/loongson/smp: Convert to hotplug state machine mips/octeon/smp: Convert to hotplug state machine fault-injection/cpu: Convert to hotplug state machine padata: Convert to hotplug state machine cpufreq: Convert to hotplug state machine ACPI/processor: Convert to hotplug state machine virtio scsi: Convert to hotplug state machine oprofile/timer: Convert to hotplug state machine block/softirq: Convert to hotplug state machine lib/irq_poll: Convert to hotplug state machine x86/microcode: Convert to hotplug state machine sh/SH-X3 SMP: Convert to hotplug state machine ia64/mca: Convert to hotplug state machine ARM/OMAP/wakeupgen: Convert to hotplug state machine ARM/shmobile: Convert to hotplug state machine arm64/FP/SIMD: Convert to hotplug state machine ...
| * | | block/softirq: Convert to hotplug state machineSebastian Andrzej Siewior2016-09-191-17/+10
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160906170457.32393-9-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | blk-mq: skip unmapped queues in blk_mq_alloc_request_hctxChristoph Hellwig2016-09-231-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | This provides the caller a feedback that a given hctx is not mapped and thus no command can be sent on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | blk-throttle: Extend slice if throttle group is not emptyVivek Goyal2016-09-191-2/+4
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now, if slice is expired, we start a new slice. If a bio is queued, we keep on extending slice by throtle_slice interval (100ms). This worked well as long as pending timer function got executed with-in few milli seconds of scheduled time. But looks like with recent changes in timer subsystem, slack can be much longer depending on the expiry time of the scheduled timer. commit 500462a9de65 ("timers: Switch to a non-cascading wheel") This means, by the time timer function gets executed, it is possible the delay from scheduled time is more than 100ms. That means current code will conclude that existing slice has expired and a new one needs to be started. New slice will be 100ms by default and that will not be sufficient to meet rate requirement of group given the bio size and bio will not be dispatched and we will start a new timer function to wait. And when that timer expires, same process will repeat and we will wait again and this can easily be an infinite loop. Solve this issue by starting a new slice only if throttle gropup is empty. If it is not empty, that means there should be an active slice going on. Ideally it should not be expired but given the slack, it is possible that it has expired. Reported-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: improve warning for running a queue on the wrong CPUJens Axboe2016-08-241-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | __blk_mq_run_hw_queue() currently warns if we are running the queue on a CPU that isn't set in its mask. However, this can happen if a CPU is being offlined, and the workqueue handling will place the work on CPU0 instead. Improve the warning so that it only triggers if the batch cpu in the hardware queue is currently online. If it triggers for that case, then it's indicative of a flow problem in blk-mq, so we want to retain it for that case. Signed-off-by: Jens Axboe <axboe@fb.com>
* | blk-mq: don't overwrite rq->mq_ctxJens Axboe2016-08-241-39/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We do this in a few places, if the CPU is offline. This isn't allowed, though, since on multi queue hardware, we can't just move a request from one software queue to another, if they map to different hardware queues. The request and tag isn't valid on another hardware queue. This can happen if plugging races with CPU offlining. But it does no harm, since it can only happen in the window where we are currently busy freezing the queue and flushing IO, in preparation for redoing the software <-> hardware queue mappings. Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: make sure a big bio is split into at most 256 bvecsMing Lei2016-08-241-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After arbitrary bio size was introduced, the incoming bio may be very big. We have to split the bio into small bios so that each holds at most BIO_MAX_PAGES bvecs for safety reason, such as bio_clone(). This patch fixes the following kernel crash: > [ 172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 > [ 172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a > [ 172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0 > [ 172.660399] Oops: 0000 [#1] SMP > [...] > [ 172.664780] Call Trace: > [ 172.664813] [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1] > [ 172.664846] [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4 > [ 172.664880] [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod] > [ 172.664912] [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155 > [ 172.664947] [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache] > [ 172.664981] [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache] > [ 172.665016] [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache] The issue can be reproduced by the following steps: - create one raid1 over two virtio-blk - build bcache device over the above raid1 and another cache device and bucket size is set as 2Mbytes - set cache mode as writeback - run random write over ext4 on the bcache device Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios) Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net> Cc: stable@vger.kernel.org (4.3+) Cc: Shaohua Li <shli@fb.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: Fix race triggered by blk_set_queue_dying()Bart Van Assche2016-08-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | blk_set_queue_dying() can be called while another thread is submitting I/O or changing queue flags, e.g. through dm_stop_queue(). Hence protect the QUEUE_FLAG_DYING flag change with locking. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* | block: Fix secure eraseAdrian Hunter2016-08-163-25/+31
|/ | | | | | | | | | Commit 288dab8a35a0 ("block: add a separate operation type for secure erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering all the places REQ_OP_DISCARD was being used to mean either. Fix those. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase") Signed-off-by: Jens Axboe <axboe@fb.com>
* block: rename bio bi_rw to bi_opfJens Axboe2016-08-077-32/+32
| | | | | | | | | | | | | Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>