path: root/drivers/md
Commit message (Author, Age, Files, Lines)
* dm snapshot: fix data corruption (Mikulas Patocka, 2013-10-16, 1 file, -6/+12)

    This patch fixes a particular type of data corruption that has been encountered when loading a snapshot's metadata from disk.

    When we allocate a new chunk in persistent_prepare_exception, we increment ps->next_free and we make sure that it doesn't point to a metadata area by further incrementing it if necessary.

    When we load metadata from disk on device activation, ps->next_free is positioned after the last used data chunk. However, if this last used data chunk is followed by a metadata area, ps->next_free is erroneously positioned on the metadata area. A newly allocated chunk is then placed at the same location as the metadata area, resulting in data or metadata corruption.

    This patch changes the code so that ps->next_free skips the metadata area when metadata are loaded in function read_exceptions.

    The patch also moves a piece of code from persistent_prepare_exception to a separate function skip_metadata to avoid code duplication.

    CVE-2013-4299

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
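The allocation rule described in this commit can be modelled in a few lines of userspace C. This is only a sketch, not the kernel code: the on-disk layout (one header chunk, then areas of one metadata chunk followed by `exceptions_per_area` data chunks) is an assumption, and `pstore_model`, `HDR_CHUNKS`, `is_metadata_chunk`, `allocate_chunk` and `load_position` are hypothetical names. Only `skip_metadata`, `next_free` and the idea of calling the skip on both paths come from the commit message.

```c
#include <stdint.h>

/* Assumed layout: chunk 0 is the header; after it, each area is one
 * metadata chunk followed by exceptions_per_area data chunks. */
#define HDR_CHUNKS 1u /* hypothetical name */

struct pstore_model {
    uint64_t next_free;           /* next chunk to hand out */
    uint32_t exceptions_per_area; /* data chunks per metadata chunk */
};

/* Returns 1 if `chunk` holds metadata under the assumed layout. */
static int is_metadata_chunk(const struct pstore_model *ps, uint64_t chunk)
{
    uint64_t stride = (uint64_t)ps->exceptions_per_area + 1;
    return chunk % stride == HDR_CHUNKS;
}

/* Step over a metadata chunk if next_free landed on one. */
static void skip_metadata(struct pstore_model *ps)
{
    if (is_metadata_chunk(ps, ps->next_free))
        ps->next_free++;
}

/* Allocation path: hand out next_free, then advance past metadata. */
static uint64_t allocate_chunk(struct pstore_model *ps)
{
    uint64_t chunk = ps->next_free++;
    skip_metadata(ps);
    return chunk;
}

/* Load path: position after the last used data chunk, then apply the
 * same skip. The second step is what the fix adds. */
static void load_position(struct pstore_model *ps, uint64_t last_used)
{
    ps->next_free = last_used + 1;
    skip_metadata(ps);
}
```

With `exceptions_per_area = 3`, chunks 1, 5, 9 hold metadata in this model. Loading with chunk 4 as the last used data chunk would leave `next_free` at 5 without the skip; with it, `next_free` becomes 6.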
* bcache: Fix a null ptr deref regression (Kent Overstreet, 2013-10-10, 1 file, -2/+1)

    Commit c0f04d88e46d ("bcache: Fix flushes in writeback mode") was fixing a reported data corruption bug, but it seems some last minute refactoring or rebasing introduced a null pointer deref.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm (Linus Torvalds, 2013-09-25, 8 files, -25/+118)

    Pull device-mapper fixes from Mike Snitzer:
     "A few fixes for dm-snapshot, a 32 bit fix for dm-stats, a couple error handling fixes for dm-multipath. A fix for the thin provisioning target to not expose non-zero discard limits if discards are disabled.

      Lastly, add two DM module parameters which allow users to tune the emergency memory reserves that DM maintains per device -- this helps fix a long-standing issue for dm-multipath. The conservative default reserve for request-based dm-multipath devices (256) has proven problematic for users with many multipathed SCSI devices but relatively little memory. To responsibly select a smaller value users should use the new nr_bios tracepoint info (via commit 75afb352 "block: Add nr_bios to block_rq_remap tracepoint") to determine the peak number of bios their workloads create"

    * tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
      dm: add reserved_bio_based_ios module parameter
      dm: add reserved_rq_based_ios module parameter
      dm: lower bio-based mempool reservation
      dm thin: do not expose non-zero discard limits if discards disabled
      dm mpath: disable WRITE SAME if it fails
      dm-snapshot: fix performance degradation due to small hash size
      dm snapshot: workaround for a false positive lockdep warning
      dm stats: fix possible counter corruption on 32-bit systems
      dm mpath: do not fail path on -ENOSPC
| * dm: add reserved_bio_based_ios module parameter (Mike Snitzer, 2013-09-23, 3 files, -5/+20)

    Allow user to change the number of IOs that are reserved by bio-based DM's mempools by writing to this file:
    /sys/module/dm_mod/parameters/reserved_bio_based_ios

    The default value is RESERVED_BIO_BASED_IOS (16). The maximum allowed value is RESERVED_MAX_IOS (1024).

    Export dm_get_reserved_bio_based_ios() for use by DM targets and core code. Switch to sizing dm-io's mempool and bioset using DM core's configurable 'reserved_bio_based_ios'.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Frank Mayhar <fmayhar@google.com>
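The default and maximum stated above suggest a simple clamp when the parameter is read. The sketch below is a userspace model under assumptions: `effective_reserved_ios` is a hypothetical helper, and treating a written value of zero as "fall back to the default" is an assumption, not something the commit message states. Only the two constants and their values come from the message.

```c
#define RESERVED_BIO_BASED_IOS 16u   /* default, from the commit message */
#define RESERVED_MAX_IOS       1024u /* maximum, from the commit message */

/* Hypothetical helper: turn the raw module-parameter value into the
 * reserve size that would actually be used. */
static unsigned effective_reserved_ios(unsigned requested,
                                       unsigned def, unsigned max)
{
    if (requested == 0) /* assumption: zero means "use the default" */
        return def;
    if (requested > max) /* never exceed the documented maximum */
        return max;
    return requested;
}
```

For example, a written value of 4096 would be capped to 1024, while 64 would be used unchanged.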
| * dm: add reserved_rq_based_ios module parameter (Mike Snitzer, 2013-09-23, 3 files, -4/+42)

    Allow user to change the number of IOs that are reserved by request-based DM's mempools by writing to this file:
    /sys/module/dm_mod/parameters/reserved_rq_based_ios

    The default value is RESERVED_REQUEST_BASED_IOS (256). The maximum allowed value is RESERVED_MAX_IOS (1024).

    Export dm_get_reserved_rq_based_ios() for use by DM targets and core code. Switch to sizing dm-mpath's mempool using DM core's configurable 'reserved_rq_based_ios'.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Frank Mayhar <fmayhar@google.com>
    Acked-by: Mikulas Patocka <mpatocka@redhat.com>
| * dm: lower bio-based mempool reservation (Mike Snitzer, 2013-09-23, 1 file, -4/+5)

    Bio-based device mapper processing doesn't need larger mempools (like request-based DM does), so lower the number of reserved entries for bio-based operation. 16 was already used for bio-based DM's bioset but mistakenly wasn't used for its _io_cache.

    Formalize difference between bio-based and request-based defaults by introducing RESERVED_BIO_BASED_IOS and RESERVED_REQUEST_BASED_IOS.

    (based on older code from Mikulas Patocka)

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Frank Mayhar <fmayhar@google.com>
    Acked-by: Mikulas Patocka <mpatocka@redhat.com>
| * dm thin: do not expose non-zero discard limits if discards disabled (Mike Snitzer, 2013-09-23, 1 file, -3/+11)

    Fix issue where the block layer would stack the discard limits of the pool's data device even if the "ignore_discard" pool feature was specified.

    The pool and thin device(s) still had discards disabled because the QUEUE_FLAG_DISCARD request_queue flag wasn't set. But to avoid user confusion when "ignore_discard" is used: both the pool device and the thin device(s) have zeroes for all discard limits.

    Also, always set discard_zeroes_data_unsupported in targets because they should never advertise the 'discard_zeroes_data' capability (even if the pool's data device supports it).

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Acked-by: Joe Thornber <ejt@redhat.com>
| * dm mpath: disable WRITE SAME if it fails (Mike Snitzer, 2013-09-20, 2 files, -1/+21)

    Work around the SCSI layer's problematic WRITE SAME heuristics by disabling WRITE SAME in the DM multipath device's queue_limits if an underlying device disabled it.

    The WRITE SAME heuristics, with both the original commit 5db44863b6eb ("[SCSI] sd: Implement support for WRITE SAME") and the updated commit 66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling WRITE SAME(10) even without successfully determining it is supported. After the first failed WRITE SAME the SCSI layer will disable WRITE SAME for the device (by setting sdkp->device->no_write_same, which results in 'max_write_same_sectors' in the device's queue_limits being set to 0).

    When a device is stacked on top of such a SCSI device, any changes to that SCSI device's queue_limits do not automatically propagate up the stack. As such, a DM multipath device will not have its WRITE SAME support disabled. This causes the block layer to continue to issue WRITE SAME requests to the mpath device, which causes paths to fail and (if mpath IO isn't configured to queue when no paths are available) results in actual IO errors to the upper layers.

    This fix doesn't help configurations that have additional devices stacked on top of the mpath device (e.g. LVM-created linear DM devices on top). A proper fix that restacks all the queue_limits from the bottom of the device stack up will need to be explored if SCSI will continue to use this model of optimistically allowing op codes and then disabling them after they fail for the first time.

    Before this patch:

      EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
      device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
      device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
      end_request: critical target error, dev dm-6, sector 528
      dm-6: WRITE SAME failed. Manually zeroing.
      device-mapper: multipath: Failing path 8:112.
      end_request: I/O error, dev dm-6, sector 4616
      dm-6: WRITE SAME failed. Manually zeroing.
      end_request: I/O error, dev dm-6, sector 4616
      end_request: I/O error, dev dm-6, sector 5640
      end_request: I/O error, dev dm-6, sector 6664
      end_request: I/O error, dev dm-6, sector 7688
      end_request: I/O error, dev dm-6, sector 524288
      Buffer I/O error on device dm-6, logical block 65536
      lost page write due to I/O error on dm-6
      JBD2: Error -5 detected when updating journal superblock for dm-6-8.
      end_request: I/O error, dev dm-6, sector 524296
      Aborting journal on device dm-6-8.
      end_request: I/O error, dev dm-6, sector 524288
      Buffer I/O error on device dm-6, logical block 65536
      lost page write due to I/O error on dm-6
      JBD2: Error -5 detected when updating journal superblock for dm-6-8.

      # cat /sys/block/sdh/queue/write_same_max_bytes
      0
      # cat /sys/block/dm-6/queue/write_same_max_bytes
      33553920

    After this patch:

      EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
      device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
      device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
      end_request: critical target error, dev dm-6, sector 528
      dm-6: WRITE SAME failed. Manually zeroing.

      # cat /sys/block/sdh/queue/write_same_max_bytes
      0
      # cat /sys/block/dm-6/queue/write_same_max_bytes
      0

    It should be noted that WRITE SAME support wasn't enabled in DM multipath until v3.10.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: Martin K. Petersen <martin.petersen@oracle.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Cc: stable@vger.kernel.org # 3.10+
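The propagation problem can be illustrated with a toy model of stacked limits. This is a sketch under assumptions, not the dm-mpath code: the struct names, `end_io`, and the exact trigger (a failed WRITE SAME whose error is -121 while the underlying limit is already 0) are hypothetical choices made for illustration. The field name `max_write_same_sectors` and the error value come from the message and its log excerpt.

```c
#define MODEL_EREMOTEIO 121 /* value shown in the log excerpt above */

struct queue_limits_model {
    unsigned max_write_same_sectors; /* 0 means WRITE SAME is disabled */
};

struct mpath_model {
    struct queue_limits_model limits;            /* stacked device */
    const struct queue_limits_model *underlying; /* failing path device */
};

/* Completion hook for the stacked device. Limits do not propagate up on
 * their own, so copy the "disabled" state by hand when a WRITE SAME
 * fails and the lower device has already turned it off. */
static int end_io(struct mpath_model *m, int is_write_same, int error)
{
    if (is_write_same && error == -MODEL_EREMOTEIO &&
        m->underlying->max_write_same_sectors == 0)
        m->limits.max_write_same_sectors = 0;
    return error; /* the error is still reported to the caller */
}
```

In this model the first failed WRITE SAME is still reported, which matches the "After this patch" log where one failure is followed by manual zeroing and both limits read 0.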
| * dm-snapshot: fix performance degradation due to small hash size (Mikulas Patocka, 2013-09-20, 1 file, -3/+2)

    LVM2, since version 2.02.96, creates the origin with zero size, then loads the snapshot driver and then loads the origin. Consequently, the snapshot driver sees an origin size of zero and sets the hash size to the lower bound of 64. Such a small hash table causes performance degradation.

    This patch changes it so that the hash size is determined by the size of the snapshot volume, not the minimum of origin and snapshot size. It doesn't make sense to set the snapshot size significantly larger than the origin size, so we do not need to take origin size into account when calculating the hash size.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: stable@vger.kernel.org
| * dm snapshot: workaround for a false positive lockdep warning (Mikulas Patocka, 2013-09-20, 1 file, -1/+1)

    The kernel reports a lockdep warning if a snapshot is invalidated because it runs out of space.

    The lockdep warning was triggered by commit 0976dfc1d0cd80a4e9dfaf87bd87 ("workqueue: Catch more locking problems with flush_work()") in v3.5.

    The warning is a false positive. The real cause for the warning is that the lockdep engine treats different instances of md->lock as a single lock.

    This patch is a workaround - we use flush_workqueue instead of flush_work. This code path is not performance sensitive (it is called only on initialization or invalidation), thus it doesn't matter that we flush the whole workqueue.

    The real fix for the problem would be to teach the lockdep engine to treat different instances of md->lock as separate locks.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Acked-by: Alasdair G Kergon <agk@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Cc: stable@vger.kernel.org # 3.5+
| * dm stats: fix possible counter corruption on 32-bit systems (Mikulas Patocka, 2013-09-18, 1 file, -6/+17)

    There was a deliberate race condition in dm_stat_for_entry() to avoid the overhead of disabling and enabling interrupts. The race could result in some events not being counted on 64-bit architectures.

    However, on 32-bit architectures, operations on long long variables are not atomic, so the race condition could cause the counter to jump by 2^32. Such jumps could be disruptive, so we need to do proper locking on 32-bit architectures.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Cc: Alasdair G. Kergon <agk@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
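The 2^32 jump can be reproduced deterministically by replaying one bad interleaving by hand. The sketch below is a single-threaded userspace model, not the dm-stats code: `split_counter` stands in for a 64-bit counter that a 32-bit CPU loads and stores as two halves, and `racy_double_increment` (a hypothetical name) replays a task-level increment that is interrupted between its two loads.

```c
#include <stdint.h>

/* A 64-bit counter as a 32-bit CPU sees it: two separate words. */
struct split_counter {
    uint32_t lo, hi;
};

static uint64_t value(const struct split_counter *c)
{
    return ((uint64_t)c->hi << 32) | c->lo;
}

static void store(struct split_counter *c, uint64_t v)
{
    c->lo = (uint32_t)v;
    c->hi = (uint32_t)(v >> 32);
}

/* Replays one interleaving of two increments: the task loads the low
 * word, an interrupt performs a complete increment, then the task loads
 * the (now changed) high word, adds one and writes the result back. */
static uint64_t racy_double_increment(struct split_counter *c)
{
    uint32_t lo = c->lo;                       /* task: load low half  */
    store(c, value(c) + 1);                    /* interrupt: full +1   */
    uint32_t hi = c->hi;                       /* task: load high half */
    store(c, (((uint64_t)hi << 32) | lo) + 1); /* task: write back     */
    return value(c);
}
```

Starting from 0xFFFFFFFF, two correct increments would give 0x100000001, but this interleaving yields 0x200000000, off by almost 2^32. Starting from a small value such as 5, the same interleaving gives 6, i.e. one event is lost, which is the benign outcome the message describes for 64-bit architectures.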
| * dm mpath: do not fail path on -ENOSPC (Jun'ichi Nomura, 2013-09-18, 1 file, -0/+1)

    Since ENOSPC is a target-side error, dm-mpath should just pass the error information to upper layer instead of retrying itself with path failover. Otherwise it will end up failing all paths down while path checkers find all paths ok.

    ENOSPC can now be returned from SCSI device after commit a9d6ceb8 ("[SCSI] return ENOSPC on thin provisioning failure").

    Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
    Acked-by: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
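The decision described here amounts to classifying an error before choosing between failover and pass-through. The following is a minimal sketch, assuming a simplified two-way decision; `is_target_side_error`, `on_io_error` and the `action` enum are hypothetical names, and the real driver recognises more error codes than the single one shown.

```c
#include <errno.h>

/* Returns 1 when the error originates at the target, so retrying the
 * same I/O on another path cannot succeed. Only the code named in the
 * commit message is listed here. */
static int is_target_side_error(int error)
{
    switch (error) {
    case -ENOSPC:
        return 1;
    default:
        return 0;
    }
}

enum action { PASS_UP, FAIL_PATH_AND_RETRY };

/* Choose what the multipath layer does with a failed I/O. */
static enum action on_io_error(int error)
{
    return is_target_side_error(error) ? PASS_UP : FAIL_PATH_AND_RETRY;
}
```

Without the -ENOSPC case, every path would be failed in turn even though each path is healthy, which is the situation the message describes.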
* | bcache: Fix flushes in writeback mode (Kent Overstreet, 2013-09-24, 1 file, -6/+9)

    In writeback mode, when we get a cache flush we need to make sure we issue a flush to the backing device.

    The code for sending down an extra flush was wrong - by cloning the bio we were probably getting flags that didn't make sense for a bare flush, and also the old code was firing for FUA bios, for which we don't need to send a flush to the backing device.

    This was causing data corruption somehow - the mechanism was never determined, but this patch fixes it for the users that were seeing it.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bcache: Fix for handling overlapping extents when reading in a btree node (Kent Overstreet, 2013-09-24, 1 file, -11/+28)

    btree_sort_fixup() was overly clever, because it was trying to avoid pulling a key off the btree iterator in more than one place.

    This led to a really obscure bug where we'd break early from the loop in btree_sort_fixup() if the current key overlapped with keys in more than one older set, and the next key it overlapped with was zero size.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bcache: Fix a shrinker deadlock (Kent Overstreet, 2013-09-24, 1 file, -1/+1)

    GFP_NOIO means we could be getting called recursively - mca_alloc() -> mca_data_alloc() - definitely can't use mutex_lock(bucket_lock) then. Whoops.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bcache: Fix a dumb CPU spinning bug in writeback (Kent Overstreet, 2013-09-24, 1 file, -2/+1)

    schedule_timeout() != schedule_timeout_uninterruptible()

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bcache: Fix a flush/fua performance bug (Kent Overstreet, 2013-09-24, 1 file, -0/+1)

    bch_journal_meta() was missing the flush to make the journal write actually go down (instead of waiting up to journal_delay_ms)... Whoops

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bcache: Fix a writeback performance regression (Kent Overstreet, 2013-09-24, 4 files, -30/+43)

    Background writeback works by scanning the btree for dirty data and adding those keys into a fixed size buffer, then for each dirty key in the keybuf writing it to the backing device.

    When read_dirty() finishes and it's time to scan for more dirty data, we need to wait for the outstanding writeback IO to finish - the keys still take up slots in the keybuf (so that foreground writes can check for them to avoid races) - without that wait, we'll continually rescan when we'll be able to add at most a key or two to the keybuf, and that takes locks that starve foreground IO. Doh.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bcache: Correct printf()-style format length modifier (Geert Uytterhoeven, 2013-09-24, 1 file, -1/+1)

    Fix

      drivers/md/bcache/btree.c: In function ‘bch_btree_node_read’:
      drivers/md/bcache/btree.c:259: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘size_t’

    Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
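The warning arises because `size_t` is not `unsigned long` on every architecture. In standard C the portable length modifier for `size_t` is `z`, so a format of `%zu` matches the argument everywhere. The commit message does not quote the new format string, so the sketch below shows the general technique in userspace rather than the patched line; `format_size` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a size_t portably: %zu matches size_t on both 32-bit and
 * 64-bit targets, whereas %lu only matches where size_t happens to be
 * unsigned long. Returns the number of characters snprintf reports. */
static int format_size(char *buf, size_t buflen, size_t n)
{
    return snprintf(buf, buflen, "%zu", n);
}
```

Calling `format_size(buf, sizeof(buf), 4096)` writes the text 4096 into `buf` without any cast.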
* | bcache: Fix for when no journal entries are found (Kent Overstreet, 2013-09-24, 1 file, -12/+18)

    The journal replay code didn't handle this case, causing it to go into an infinite loop...

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | bcache: Strip endline when writing the label through sysfs (Gabriel de Perthuis, 2013-09-24, 1 file, -2/+7)

    sysfs attributes with unusual characters have crappy failure modes in Squeeze (udev 164); later versions of udev are unaffected.

    This should make these characters more unusual.

    Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
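A write such as `echo name > label` delivers the trailing newline along with the text, which is how the unusual character ends up in the stored label. The sketch below shows one way to strip it in userspace C. It is an illustration only: `set_label` and `LABEL_SIZE` are hypothetical, and the buffer size is an assumed value rather than one taken from the commit.

```c
#include <stddef.h>
#include <string.h>

#define LABEL_SIZE 32 /* hypothetical fixed label buffer size */

/* Copy at most LABEL_SIZE - 1 bytes of the written data into the label,
 * terminate it, and drop a single trailing newline if present. */
static void set_label(char dst[LABEL_SIZE], const char *buf, size_t len)
{
    if (len >= LABEL_SIZE)
        len = LABEL_SIZE - 1;
    memcpy(dst, buf, len);
    dst[len] = '\0';
    if (len > 0 && dst[len - 1] == '\n')
        dst[len - 1] = '\0';
}
```

Only the final newline is removed; any other characters in the written data are stored as given.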
* | bcache: Fix a dumb journal discard bug (Kent Overstreet, 2013-09-24, 1 file, -1/+1)
|/
    That switch statement was obviously wrong, leading to some sort of weird spinning on rare occasion with discards enabled...

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs (Linus Torvalds, 2013-09-12, 3 files, -43/+66)

    Pull vfs pile 4 from Al Viro:
     "list_lru pile, mostly"

    This came out of Andrew's pile, Al ended up doing the merge work so that Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
      super: fix for destroy lrus
      list_lru: dynamically adjust node arrays
      shrinker: Kill old ->shrink API.
      shrinker: convert remaining shrinkers to count/scan API
      staging/lustre/libcfs: cleanup linux-mem.h
      staging/lustre/ptlrpc: convert to new shrinker API
      staging/lustre/obdclass: convert lu_object shrinker to count/scan API
      staging/lustre/ldlm: convert to shrinkers to count/scan API
      hugepage: convert huge zero page shrinker to new shrinker API
      i915: bail out earlier when shrinker cannot acquire mutex
      drivers: convert shrinkers to new count/scan API
      fs: convert fs shrinkers to new scan/count API
      xfs: fix dquot isolation hang
      xfs-convert-dquot-cache-lru-to-list_lru-fix
      xfs: convert dquot cache lru to list_lru
      xfs: rework buffer dispose list tracking
      xfs-convert-buftarg-lru-to-generic-code-fix
      xfs: convert buftarg LRU to generic code
      fs: convert inode and dentry shrinking to be node aware
      vmscan: per-node deferred work
      ...
| * drivers: convert shrinkers to new count/scan API (Dave Chinner, 2013-09-10, 3 files, -43/+66)

    Convert the driver shrinkers to the new API. Most changes are compile tested only because I either don't have the hardware or it's staging stuff.

    FWIW, the md and android code is pretty good, but the rest of it makes me want to claw my eyes out. The amount of broken code I just encountered is mind boggling. I've added comments explaining what is broken, but I fear that some of the code would be best dealt with by being dragged behind the bike shed, buried in mud up to its neck and then run over repeatedly with a blunt lawn mower.

    Special mention goes to the zcache/zcache2 drivers. They can't co-exist in the build at the same time, they are under different menu options in menuconfig, they only show up when you've got the right set of mm subsystem options configured and so even compile testing is an exercise in pulling teeth. And that doesn't even take into account the horrible, broken code...

    [glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Glauber Costa <glommer@openvz.org>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Cc: Kent Overstreet <koverstreet@google.com>
    Cc: John Stultz <john.stultz@linaro.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: Thomas Hellstrom <thellstrom@vmware.com>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
    Cc: Arve Hjønnevåg <arve@android.com>
    Cc: Carlos Maiolino <cmaiolino@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Chuck Lever <chuck.lever@oracle.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Gleb Natapov <gleb@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: J. Bruce Fields <bfields@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Stultz <john.stultz@linaro.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Kent Overstreet <koverstreet@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Steven Whitehouse <swhiteho@redhat.com>
    Cc: Thomas Hellstrom <thellstrom@vmware.com>
    Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm (Linus Torvalds, 2013-09-10, 18 files, -151/+1359)

    Pull device-mapper updates from Mike Snitzer:
     "Add the ability to collect I/O statistics on user-defined regions of a device-mapper device. This dm-stats code required the reintroduction of a div64_u64_rem() helper, but as a separate method that doesn't slow down div64_u64() -- especially on 32-bit systems.

      Allow the error target to replace request-based DM devices (e.g. multipath) in addition to bio-based DM devices.

      Various other small code fixes and improvements to thin-provisioning, DM cache and the DM ioctl interface"

    * tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
      dm stripe: silence a couple sparse warnings
      dm: add statistics support
      dm thin: always return -ENOSPC if no_free_space is set
      dm ioctl: cleanup error handling in table_load
      dm ioctl: increase granularity of type_lock when loading table
      dm ioctl: prevent rename to empty name or uuid
      dm thin: set pool read-only if breaking_sharing fails block allocation
      dm thin: prefix pool error messages with pool device name
      dm: allow error target to replace bio-based and request-based targets
      math64: New separate div64_u64_rem helper
      dm space map: optimise sm_ll_dec and sm_ll_inc
      dm btree: prefetch child nodes when walking tree for a dm_btree_del
      dm btree: use pop_frame in dm_btree_del to cleanup code
      dm cache: eliminate holes in cache structure
      dm cache: fix stacking of geometry limits
      dm thin: fix stacking of geometry limits
      dm thin: add data block size limits to Documentation
      dm cache: add data block size limits to code and Documentation
      dm cache: document metadata device is exclussive to a cache
      dm: stop using WQ_NON_REENTRANT
| * | dm stripe: silence a couple sparse warnings (Mike Snitzer, 2013-09-06, 1 file, -0/+1)

    Eliminate the following sparse warnings:

      drivers/md/dm-stripe.c:443:12: warning: symbol 'dm_stripe_init' was not declared. Should it be static?
      drivers/md/dm-stripe.c:456:6: warning: symbol 'dm_stripe_exit' was not declared. Should it be static?

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm: add statistics support (Mikulas Patocka, 2013-09-05, 6 files, -12/+1102)

    Support the collection of I/O statistics on user-defined regions of a DM device. If no regions are defined no statistics are collected so there isn't any performance impact. Only bio-based DM devices are currently supported.

    Each user-defined region specifies a starting sector, length and step. Individual statistics will be collected for each step-sized area within the range specified.

    The I/O statistics counters for each step-sized area of a region are in the same format as /sys/block/*/stat or /proc/diskstats but extra counters (12 and 13) are provided: total time spent reading and writing in milliseconds. All these counters may be accessed by sending the @stats_print message to the appropriate DM device via dmsetup.

    The creation of DM statistics will allocate memory via kmalloc or fallback to using vmalloc space. At most, 1/4 of the overall system memory may be allocated by DM statistics. The admin can see how much memory is used by reading
    /sys/module/dm_mod/parameters/stats_current_allocated_bytes

    See Documentation/device-mapper/statistics.txt for more details.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
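The region geometry (start, length, step) maps each sector to one step-sized area. The sketch below works through that arithmetic in userspace C. It is not the dm-stats implementation: `region`, `n_areas` and `area_index` are hypothetical names, and letting the last area be shorter when the length is not a multiple of the step is an assumption.

```c
#include <stdint.h>

/* A user-defined region, in sectors. */
struct region {
    uint64_t start; /* first sector covered */
    uint64_t len;   /* number of sectors covered */
    uint64_t step;  /* size of each statistics area */
};

/* Number of step-sized areas; the last one may be partial (assumption). */
static uint64_t n_areas(const struct region *r)
{
    return (r->len + r->step - 1) / r->step;
}

/* Index of the area that counts I/O to `sector`, or -1 if the sector
 * lies outside the region. */
static int64_t area_index(const struct region *r, uint64_t sector)
{
    if (sector < r->start || sector - r->start >= r->len)
        return -1;
    return (int64_t)((sector - r->start) / r->step);
}
```

For a region starting at sector 1000 with length 250 and step 100, this gives three areas: sectors 1000 to 1099, 1100 to 1199, and 1200 to 1249.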
| * | dm thin: always return -ENOSPC if no_free_space is set (Mike Snitzer, 2013-09-05, 1 file, -25/+31)

    If the pool has 'no_free_space' set it means a previous allocation already determined the pool has no free space (and failed that allocation with -ENOSPC). By always returning -ENOSPC if 'no_free_space' is set, we do not allow the pool to oscillate between allocating blocks and then not.

    But a side-effect of this determinism is that if a user wants to be able to allocate new blocks they'll need to reload the pool's table (to clear the 'no_free_space' flag). This reload will happen automatically if the pool's data volume is resized. But if the user takes action to free a lot of space by deleting snapshot volumes, etc., the pool will no longer allow data allocations to continue without an intervening table reload.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Acked-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
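The sticky behaviour of the flag can be shown with a small state model. This is a sketch only: `pool_model`, `reload_table` and the free-block counter are hypothetical, and the model ignores everything about the pool except the flag. The flag name `no_free_space` and the -ENOSPC result come from the commit message; `alloc_data_block` borrows its name from a function mentioned elsewhere in this log.

```c
#include <errno.h>

struct pool_model {
    unsigned long free_blocks; /* hypothetical free-space counter */
    int no_free_space;         /* sticky flag from the commit message */
};

/* Allocate one block. Once the flag is set, fail immediately without
 * looking at free_blocks again. */
static int alloc_data_block(struct pool_model *p)
{
    if (p->no_free_space)
        return -ENOSPC;
    if (p->free_blocks == 0) {
        p->no_free_space = 1;
        return -ENOSPC;
    }
    p->free_blocks--;
    return 0;
}

/* A table reload clears the flag so allocation is attempted again. */
static void reload_table(struct pool_model *p)
{
    p->no_free_space = 0;
}
```

In this model, freeing space after the flag is set does not help until `reload_table` runs, which mirrors the side-effect the message describes.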
| * | dm ioctl: cleanup error handling in table_load (Mike Snitzer, 2013-09-05, 1 file, -18/+16)

    Make use of common cleanup code.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm ioctl: increase granularity of type_lock when loading table (Mike Snitzer, 2013-09-05, 3 files, -4/+6)

    Hold the mapped device's type_lock before calling populate_table() since it is where the table's type is determined based on the specified targets. There is no need to allow concurrent table loads to race to establish the table's targets or type.

    This eliminates the need to grab the lock in dm_table_set_type().

    Also verify that the type_lock is held in both dm_set_md_type() and dm_get_md_type().

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm ioctl: prevent rename to empty name or uuid (Alasdair Kergon, 2013-09-05, 1 file, -1/+1)

    A device-mapper device must always have a name consisting of a non-empty string. If the device also has a uuid, this similarly must not be an empty string.

    The DM_DEV_CREATE ioctl enforces these rules when the device is created, but this patch is needed to enforce them when DM_DEV_RENAME is used to change the name or uuid.

    Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Acked-by: Mikulas Patocka <mpatocka@redhat.com>
| * | dm thin: set pool read-only if breaking_sharing fails block allocation (Mike Snitzer, 2013-09-05, 1 file, -2/+4)

    break_sharing() now handles an arbitrary alloc_data_block() error the same way as provision_block(): marks pool read-only and errors the cell.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Acked-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm thin: prefix pool error messages with pool device name (Mike Snitzer, 2013-09-05, 1 file, -16/+32)

    Useful to know which pool is experiencing the error.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Acked-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm: allow error target to replace bio-based and request-based targets (Mike Snitzer, 2013-09-05, 3 files, -3/+39)

    It may be useful to switch a request-based table to the "error" target. Enhance the DM core to allow a hybrid target_type which is capable of handling either bios (via .map) or requests (via .map_rq).

    Add a request-based map function (.map_rq) to the "error" target_type, making it DM's first hybrid target. Train dm_table_set_type() to prefer the mapped device's established type (request-based or bio-based). If the mapped device doesn't have an established type, default to making the table with the hybrid target(s) bio-based.

    Tested 'dmsetup wipe_table' to work on both bio-based and request-based devices.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Joe Jin <joe.jin@oracle.com>
    Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
    Acked-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm space map: optimise sm_ll_dec and sm_ll_inc (Joe Thornber, 2013-08-23, 1 file, -28/+49)

    Prior to this patch these methods did a lookup followed by an insert. Instead they now call a common mutate function that adjusts the value according to a callback function. This avoids traversing the data structures twice and hence improves performance.

    Also factor out sm_ll_lookup_big_ref_count() for use by both sm_ll_lookup() and sm_ll_mutate().

    Signed-off-by: Joe Thornber <ejt@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Alasdair G Kergon <agk@redhat.com>
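The "mutate with a callback" pattern can be sketched on a flat array of reference counts. This is an illustration of the pattern only, not the space-map code: the array stands in for the on-disk structures, and `find_slot`, `mutate`, `ref_inc`, `ref_dec` and the `traversals` counter are hypothetical names added to make the single traversal visible.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*mutate_fn)(uint32_t old);

static unsigned traversals; /* counts lookups, for illustration */

/* Stand-in for walking the data structure to the entry for `block`. */
static uint32_t *find_slot(uint32_t *counts, size_t n, size_t block)
{
    traversals++;
    return block < n ? &counts[block] : NULL;
}

/* Locate the entry once and let the callback compute the new value,
 * instead of doing a lookup followed by a separate insert. */
static int mutate(uint32_t *counts, size_t n, size_t block, mutate_fn fn)
{
    uint32_t *slot = find_slot(counts, n, block);

    if (!slot)
        return -1;
    *slot = fn(*slot);
    return 0;
}

static uint32_t inc_one(uint32_t v) { return v + 1; }
static uint32_t dec_one(uint32_t v) { return v - 1; }

static int ref_inc(uint32_t *counts, size_t n, size_t block)
{
    return mutate(counts, n, block, inc_one);
}

static int ref_dec(uint32_t *counts, size_t n, size_t block)
{
    return mutate(counts, n, block, dec_one);
}
```

Each `ref_inc` or `ref_dec` call bumps `traversals` by one, whereas a lookup-then-insert version would bump it twice.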
| * | dm btree: prefetch child nodes when walking tree for a dm_btree_del  Joe Thornber  2013-08-23  3  -5/+31
| | |     dm-btree now takes advantage of dm-bufio's ability to prefetch data via
| | |     dm_bm_prefetch(). Prior to this change many btree node visits were
| | |     causing a synchronous read.
| | |
| | |     Signed-off-by: Joe Thornber <ejt@redhat.com>
| | |     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | |     Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm btree: use pop_frame in dm_btree_del to cleanup code  Joe Thornber  2013-08-23  1  -1/+1
| | |     Remove a visited leaf straight away from the stack, rather than
| | |     marking all its children as visited and letting it get removed on the
| | |     next iteration. May also offer a micro optimisation in dm_btree_del.
| | |
| | |     Signed-off-by: Joe Thornber <ejt@redhat.com>
| | |     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | |     Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm cache: eliminate holes in cache structure  Mike Snitzer  2013-08-23  1  -15/+15
| | |     Reorder members in the cache structure to eliminate 6 out of 7 holes
| | |     (reclaiming 24 bytes). Also, the 'worker' and 'waker' members no longer
| | |     straddle cachelines.
| | |
| | |     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | |     Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| | |     Acked-by: Joe Thornber <ejt@redhat.com>
| * | dm cache: fix stacking of geometry limits  Mike Snitzer  2013-08-23  1  -2/+10
| | |     Do not blindly override the queue limits (specifically io_min and
| | |     io_opt). Allow traditional stacking of these limits if io_opt is a
| | |     factor of the cache's data block size.
| | |
| | |     Without this patch mkfs.xfs does not recognize the cache device's
| | |     provided limits as a useful geometry (e.g. raid) so these hints are
| | |     ignored. This was due to setting io_min to a useless value.
| | |
| | |     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | |     Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| | |     Acked-by: Joe Thornber <ejt@redhat.com>
| * | dm thin: fix stacking of geometry limits  Mike Snitzer  2013-08-23  1  -2/+10
| | |     Do not blindly override the queue limits (specifically io_min and
| | |     io_opt). Allow traditional stacking of these limits if io_opt is a
| | |     factor of the thin-pool's data block size.
| | |
| | |     Without this patch mkfs.xfs does not recognize the thin device's
| | |     provided limits as a useful geometry (e.g. raid) so these hints are
| | |     ignored. This was due to setting io_min to a useless value.
| | |
| | |     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | |     Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| | |     Acked-by: Joe Thornber <ejt@redhat.com>
| * | dm cache: add data block size limits to code and Documentation  Mike Snitzer  2013-08-23  1  -7/+10
| | |     Place upper bound on the cache's data block size (1GB).
| | |
| | |     Inform users that the data block size can't be any arbitrary number,
| | |     i.e. its value must be between 32KB and 1GB. Also, it should be a
| | |     multiple of 32KB.
| | |
| | |     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | |     Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| | |     Acked-by: Joe Thornber <ejt@redhat.com>
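The limits stated in the message above (between 32KB and 1GB, a multiple of 32KB) can be expressed as a small check. This is a sketch under assumptions: the macro and function names are hypothetical, sizes are given in 512-byte sectors, and the "multiple of 32KB" recommendation is treated here as a hard requirement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SECTOR_SHIFT 9 /* 512-byte sectors */

/* 32KB and 1GB expressed in sectors (hypothetical names). */
#define TOY_BLOCK_SIZE_MIN_SECTORS ((32u * 1024u) >> TOY_SECTOR_SHIFT)
#define TOY_BLOCK_SIZE_MAX_SECTORS ((1024u * 1024u * 1024u) >> TOY_SECTOR_SHIFT)

/* Return true if 'sectors' is an acceptable data block size. */
static bool toy_block_size_valid(uint64_t sectors)
{
	/* The minimum must itself be a sensible, non-zero unit. */
	assert(TOY_BLOCK_SIZE_MIN_SECTORS > 0);

	if (sectors < TOY_BLOCK_SIZE_MIN_SECTORS ||
	    sectors > TOY_BLOCK_SIZE_MAX_SECTORS)
		return false;

	/* Must be a whole number of 32KB units. */
	return (sectors % TOY_BLOCK_SIZE_MIN_SECTORS) == 0;
}
```

For example, 64 sectors (32KB) and 128 sectors (64KB) pass, while 96 sectors (48KB) fails the multiple-of-32KB test.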
| * | dm: stop using WQ_NON_REENTRANT  Tejun Heo  2013-08-23  4  -14/+5
| | |     dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
| | |     WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.
| | |
| | |     This patch doesn't introduce any behavior changes.
| | |
| | |     Signed-off-by: Tejun Heo <tj@kernel.org>
| | |     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| | |     Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| | |     Acked-by: Joe Thornber <ejt@redhat.com>
* | |   Merge tag 'md/3.12' of git://neil.brown.name/md  Linus Torvalds  2013-09-10  4  -40/+406
|\ \ \
| |_|/
|/| |
| | |     Pull md update from Neil Brown:
| | |      "Headline item is multithreading for RAID5 so that more IO/sec can be
| | |       supported on fast (SSD) devices. Also TILE-Gx SIMD support for RAID6
| | |       calculations and an assortment of bug fixes"
| | |
| | |     * tag 'md/3.12' of git://neil.brown.name/md:
| | |       raid5: only wakeup necessary threads
| | |       md/raid5: flush out all pending requests before proceeding with reshape.
| | |       md/raid5: use seqcount to protect access to shape in make_request.
| | |       raid5: sysfs entry to control worker thread number
| | |       raid5: offload stripe handle to workqueue
| | |       raid5: fix stripe release order
| | |       raid5: make release_stripe lockless
| | |       md: avoid deadlock when dirty buffers during md_stop.
| | |       md: Don't test all of mddev->flags at once.
| | |       md: Fix apparent cut-and-paste error in super_90_validate
| | |       raid6/test: replace echo -e with printf
| | |       RAID: add tilegx SIMD implementation of raid6
| | |       md: fix safe_mode buglet.
| | |       md: don't call md_allow_write in get_bitmap_file.
| * | raid5: only wakeup necessary threads  Shaohua Li  2013-09-02  2  -6/+38
| | |     If there are not enough stripes to handle, we'd better not always
| | |     queue all available work_structs. If one worker can only handle a few
| | |     or even no stripes, it will impact request merging and create lock
| | |     contention. With this patch, the number of running work_structs will
| | |     depend on the number of pending stripes.
| | |
| | |     Note: some statistics info used in the patch is accessed without
| | |     locking protection. This shouldn't matter; we just try our best to
| | |     avoid queuing unnecessary work_structs.
| | |
| | |     Signed-off-by: Shaohua Li <shli@fusionio.com>
| | |     Signed-off-by: NeilBrown <neilb@suse.de>
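One way to make the number of woken workers depend on the number of pending stripes is sketched below. The policy (one worker per batch of pending stripes, capped by the workers available) and every name in it are assumptions made for illustration; the message above does not spell out the heuristic the patch uses.

```c
#include <assert.h>

/*
 * Hypothetical policy: wake one worker per 'batch' pending stripes,
 * rounding up, but never more than 'max_workers'.
 */
static unsigned toy_workers_to_wake(unsigned pending, unsigned batch,
				    unsigned max_workers)
{
	unsigned want;

	assert(batch > 0);
	if (pending == 0 || max_workers == 0)
		return 0;

	/* Round up without risking overflow in 'pending + batch - 1'. */
	want = pending / batch + (pending % batch != 0);

	return want < max_workers ? want : max_workers;
}
```

With a batch of 8 and 4 workers available, 1 to 8 pending stripes wake a single worker, 9 wake two, and anything above 24 wakes all four.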
| * | md/raid5: flush out all pending requests before proceeding with reshape.  NeilBrown  2013-08-28  1  -0/+7
| | |     Some requests - particularly 'discard' and 'read' are handled
| | |     differently depending on whether a reshape is active or not.
| | |
| | |     It is harmless to assume reshape is active if it isn't but wrong to act
| | |     as though reshape is not active when it is.
| | |
| | |     So when we start reshape - after making clear to all requests that
| | |     reshape has started - use mddev_suspend/mddev_resume to flush out all
| | |     requests. This will ensure that no requests will be assuming the
| | |     absence of reshape once it really starts.
| | |
| | |     Signed-off-by: NeilBrown <neilb@suse.de>
| * | md/raid5: use seqcount to protect access to shape in make_request.  NeilBrown  2013-08-28  2  -1/+14
| | |     make_request() accesses various shape parameters (raid_disks,
| | |     chunk_size etc) which might be changed by raid5_start_reshape().
| | |
| | |     If the latter is called at an awkward time during the former, the
| | |     wrong stripe_head might be used.
| | |
| | |     So introduce a 'seqcount' and after finding a stripe_head make sure
| | |     there is no reason to expect that we got the wrong one.
| | |
| | |     Signed-off-by: NeilBrown <neilb@suse.de>
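The sequence-counter pattern named in the message above can be sketched in userspace with C11 atomics. This is not the kernel's seqcount API or the raid5 code: `toy_conf` and the `toy_*` helpers are hypothetical, a single writer is assumed, and the "sectors per stripe" calculation is just a stand-in for work that depends on the shape parameters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical array configuration guarded by a generation counter. */
struct toy_conf {
	atomic_uint gen;          /* even: stable, odd: update in progress */
	atomic_int raid_disks;
	atomic_int chunk_sectors;
};

/* Writer side (single writer assumed): make the counter odd. */
static void toy_write_begin(struct toy_conf *c)
{
	unsigned s = atomic_load_explicit(&c->gen, memory_order_relaxed);

	assert((s & 1u) == 0);
	atomic_store_explicit(&c->gen, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

/* Writer side: make the counter even again, publishing the new values. */
static void toy_write_end(struct toy_conf *c)
{
	unsigned s = atomic_load_explicit(&c->gen, memory_order_relaxed);

	atomic_store_explicit(&c->gen, s + 1, memory_order_release);
}

/* Reader side: remember the generation seen before reading the shape. */
static unsigned toy_read_begin(struct toy_conf *c)
{
	return atomic_load_explicit(&c->gen, memory_order_acquire);
}

/* Reader side: true if the shape may have changed since toy_read_begin(). */
static bool toy_read_retry(struct toy_conf *c, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return (start & 1u) ||
	       atomic_load_explicit(&c->gen, memory_order_relaxed) != start;
}

/* Change the shape, as a reshape start would. */
static void toy_reshape(struct toy_conf *c, int disks, int chunk)
{
	toy_write_begin(c);
	atomic_store_explicit(&c->raid_disks, disks, memory_order_relaxed);
	atomic_store_explicit(&c->chunk_sectors, chunk, memory_order_relaxed);
	toy_write_end(c);
}

/* Compute a value from the shape, retrying if it changed underneath us. */
static long toy_sectors_per_stripe(struct toy_conf *c)
{
	unsigned start;
	long result;

	do {
		int disks, chunk;

		start = toy_read_begin(c);
		disks = atomic_load_explicit(&c->raid_disks, memory_order_relaxed);
		chunk = atomic_load_explicit(&c->chunk_sectors, memory_order_relaxed);
		result = (long)(disks - 1) * chunk; /* one parity disk assumed */
	} while (toy_read_retry(c, start));

	return result;
}
```

The reader never blocks the writer; it only pays for a retry when a reshape happened to overlap its calculation, which matches the "make sure we did not get the wrong one" check described in the message.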
| * | raid5: sysfs entry to control worker thread number  Shaohua Li  2013-08-28  1  -0/+60
| | |     Add a sysfs entry to control running workqueue thread number. If
| | |     group_thread_cnt is set to 0, we will disable workqueue offload
| | |     handling of stripes.
| | |
| | |     Signed-off-by: Shaohua Li <shli@fusionio.com>
| | |     Signed-off-by: NeilBrown <neilb@suse.de>
| * | raid5: offload stripe handle to workqueue  Shaohua Li  2013-08-28  2  -15/+186
| | |     This is another attempt to create multiple threads to handle raid5
| | |     stripes. This time I use workqueue.
| | |
| | |     raid5 handles requests (especially writes) in stripe units. A stripe
| | |     is page-size aligned/long and spans all disks. When writing to any
| | |     disk sector, raid5 runs a state machine for the corresponding stripe,
| | |     which includes reading some disks of the stripe, calculating parity,
| | |     and writing some disks of the stripe. The state machine currently runs
| | |     in the raid5d thread. Since there is only one thread, it doesn't scale
| | |     well for high speed storage. An obvious solution is multi-threading.
| | |
| | |     To get better performance, we have some requirements:
| | |     a. locality. A stripe corresponding to a request submitted from one
| | |        cpu is better handled by a thread on the local cpu or local node.
| | |        The local cpu is preferred but sometimes could be a bottleneck, for
| | |        example, when parity calculation is too heavy. Running on the local
| | |        node has wide adaptability.
| | |     b. configurability. Different setups of raid5 arrays might need
| | |        different configurations, especially the thread number. More
| | |        threads don't always mean better performance because of lock
| | |        contention.
| | |
| | |     My original implementation created some kernel threads. There are
| | |     interfaces to control which cpu's stripes each thread should handle,
| | |     and userspace can set the affinity of the threads. This provides the
| | |     biggest flexibility and configurability. But it's hard to use, and
| | |     apparently a new thread pool implementation is disfavored.
| | |
| | |     Recent workqueue improvement is quite promising. An unbound workqueue
| | |     will be bound to a numa node. If WQ_SYSFS is set on the workqueue,
| | |     there are sysfs options to do affinity setting. For example, we can
| | |     include only one HT sibling in the affinity. Work is non-reentrant by
| | |     default, and we can control the running thread number by limiting the
| | |     dispatched work_struct number.
| | |
| | |     In this patch, I created several stripe worker groups. A group is a
| | |     numa node. Stripes from cpus of one node will be added to a group
| | |     list. The workqueue thread of one node will only handle stripes of the
| | |     worker group of that node. In this way, stripe handling has numa node
| | |     locality. And as I said, we can control the thread number by limiting
| | |     the dispatched work_struct number.
| | |
| | |     The work_struct callback function handles several stripes in one run.
| | |     A typical workqueue usage is to run one unit in each work_struct. In
| | |     the raid5 case, the unit is a stripe. But we can't do that:
| | |     a. Though handling a stripe doesn't need a lock because of reference
| | |        accounting and the stripe isn't in any list, queuing a work_struct
| | |        for each stripe will make the workqueue lock contended very
| | |        heavily.
| | |     b. blk_start_plug()/blk_finish_plug() should surround stripe
| | |        handling, as we might dispatch requests. If each work_struct only
| | |        handles one stripe, such a block plug is meaningless.
| | |
| | |     This implementation can't do very fine grained configuration. But
| | |     numa binding is the most popular usage model and should be enough for
| | |     most workloads.
| | |
| | |     Note: since we have only one stripe queue, switching to multi-thread
| | |     might decrease the request size dispatched down to the low level
| | |     layer. The impact depends on thread number, raid configuration and
| | |     workload. So multi-thread raid5 might not be proper for all setups.
| | |
| | |     Changes V1 -> V2:
| | |     1. remove WQ_NON_REENTRANT
| | |     2. disabling multi-threading by default
| | |     3. Add more descriptions in changelog
| | |
| | |     Signed-off-by: Shaohua Li <shli@fusionio.com>
| | |     Signed-off-by: NeilBrown <neilb@suse.de>
| * | raid5: fix stripe release order  Shaohua Li  2013-08-28  1  -0/+15
| | |     Patch "make release_stripe lockless" changes the order in which
| | |     stripes are released. Originally I thought the block layer can take
| | |     care of request merging, but it appears there are still some requests
| | |     not merged. It's easy to fix the order.
| | |
| | |     Signed-off-by: Shaohua Li <shli@fusionio.com>
| | |     Signed-off-by: NeilBrown <neilb@suse.de>
| * | raid5: make release_stripe lockless  Shaohua Li  2013-08-28  2  -3/+49
| | |     release_stripe still has big lock contention. We just add the stripe
| | |     to a llist without taking device_lock. We let the raid5d thread do the
| | |     real stripe release, which must hold device_lock anyway. In this way,
| | |     release_stripe doesn't hold any locks.
| | |
| | |     The side effect is that the order of released stripes is changed. But
| | |     that doesn't sound like a big deal, as stripes are never handled in
| | |     order. And I thought the block layer can already do nice request
| | |     merging, which means order isn't that important.
| | |
| | |     I kept the unplug release batch, which is unnecessary with this patch
| | |     from a lock contention avoidance point of view, and actually if we
| | |     delete it, the stripe_head release_list and lru can share storage. But
| | |     the unplug release batch is also helpful for request merging. We could
| | |     probably delay waking raid5d until unplug, but I'm still afraid of the
| | |     case in which raid5d is running.
| | |
| | |     Signed-off-by: Shaohua Li <shli@fusionio.com>
| | |     Signed-off-by: NeilBrown <neilb@suse.de>
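The lock-free hand-off described in the message above, and the ordering side effect it mentions, can be demonstrated with a userspace sketch using C11 atomics. This is not the kernel llist API or the raid5 code: all `toy_*` names are hypothetical. Producers push onto a singly linked list with compare-and-swap; a single consumer detaches the whole list at once. Because pushes go to the head, the detached list is in reverse submission order, and reversing it is one possible way to restore the original order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a stripe queued for release. */
struct toy_node {
	struct toy_node *next;
	int id;
};

struct toy_list {
	_Atomic(struct toy_node *) first;
};

static void toy_list_init(struct toy_list *l)
{
	atomic_init(&l->first, (struct toy_node *)NULL);
}

/* Producer: push without taking any lock (CAS retry loop). */
static void toy_push(struct toy_list *l, struct toy_node *n)
{
	struct toy_node *old;

	assert(l != NULL && n != NULL);
	old = atomic_load_explicit(&l->first, memory_order_relaxed);
	do {
		n->next = old; /* 'old' is refreshed by a failed CAS */
	} while (!atomic_compare_exchange_weak_explicit(&l->first, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/* Consumer: detach the entire list in one atomic exchange. */
static struct toy_node *toy_take_all(struct toy_list *l)
{
	return atomic_exchange_explicit(&l->first, (struct toy_node *)NULL,
					memory_order_acquire);
}

/* Reverse a detached (now private) list to recover push order. */
static struct toy_node *toy_reverse(struct toy_node *head)
{
	struct toy_node *prev = NULL;

	while (head) {
		struct toy_node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

The consumer can then walk the detached list under whatever lock the real release path needs, so producers never contend on that lock.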