summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* dm: fix missed error code if .end_io isn't implemented by target_typezhendong chen2014-12-171-1/+1
| | | | | | | | | | | | | In bio-based DM's clone_endio(), when target_type doesn't implement .end_io (e.g. linear) r will be always be initialized 0. So if a WRITE SAME bio fails WRITE SAME will not be disabled as intended. Fix this by initializing r to error, rather than 0, in clone_endio(). Signed-off-by: Alex Chen <alex.chen@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: 7eee4ae2db ("dm: disable WRITE SAME if it fails") Cc: stable@vger.kernel.org
* dm thin: fix crash by initializing thin device's refcount and completion earlierMarc Dionne2014-12-171-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 80e96c5484be ("dm thin: do not allow thin device activation while pool is suspended") delayed the initialization of a new thin device's refcount and completion until after this new thin was added to the pool's active_thins list and the pool lock is released. This opens a race with a worker thread that walks the list and calls thin_get/put, noticing that the refcount goes to 0 and calling complete, freezing up the system and giving the oops below: kernel: BUG: unable to handle kernel NULL pointer dereference at (null) kernel: IP: [<ffffffff810d360b>] __wake_up_common+0x2b/0x90 kernel: Call Trace: kernel: [<ffffffff810d3683>] __wake_up_locked+0x13/0x20 kernel: [<ffffffff810d3dc7>] complete+0x37/0x50 kernel: [<ffffffffa0595c50>] thin_put+0x20/0x30 [dm_thin_pool] kernel: [<ffffffffa059aab7>] do_worker+0x667/0x870 [dm_thin_pool] kernel: [<ffffffff816a8a4c>] ? __schedule+0x3ac/0x9a0 kernel: [<ffffffff810b1aef>] process_one_work+0x14f/0x400 kernel: [<ffffffff810b206b>] worker_thread+0x6b/0x490 kernel: [<ffffffff810b2000>] ? rescuer_thread+0x260/0x260 kernel: [<ffffffff810b6a7b>] kthread+0xdb/0x100 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 kernel: [<ffffffff816ad7ec>] ret_from_fork+0x7c/0xb0 kernel: [<ffffffff810b69a0>] ? kthread_create_on_node+0x170/0x170 Set the thin device's initial refcount and initialize the completion before adding it to the pool's active_thins list in thin_ctr(). Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* dm thin: fix missing out-of-data-space to write mode transition if blocks ↵Joe Thornber2014-12-171-2/+20
| | | | | | | | | | | | | | | | | are released Discard bios and thin device deletion have the potential to release data blocks. If the thin-pool is in out-of-data-space mode, and blocks were released, transition the thin-pool back to full write mode. The correct time to do this is just after the thin-pool metadata commit. It cannot be done before the commit because the space maps will not allow immediate reuse of the data blocks in case there's a rollback following power failure. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
* dm thin: fix inability to discard blocks when in out-of-data-space modeJoe Thornber2014-12-171-1/+1
| | | | | | | | | | | | | | | When the pool was in PM_OUT_OF_SPACE mode its process_prepared_discard function pointer was incorrectly being set to process_prepared_discard_passdown rather than process_prepared_discard. This incorrect function pointer meant the discard was being passed down, but not effecting the mapping. As such any discard that was issued, in an attempt to reclaim blocks, would not successfully free data space. Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
* Merge tag 'md/3.19' of git://neil.brown.name/mdLinus Torvalds2014-12-142-13/+32
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md updates from Neil Brown: "Three fixes for md. I did have a largish set of locking changes queued, but late testing showed they weren't quite as stable as I thought and while I fixed what I found, I decided it safer to delay them a release ... particularly as I'll be AFK for a few weeks. So expect a larger batch next time :-)" * tag 'md/3.19' of git://neil.brown.name/md: md: Check MD_RECOVERY_RUNNING as well as ->sync_thread. md: fix semicolon.cocci warnings md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
| * md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.NeilBrown2014-12-111-10/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent change to md started the ->sync_thread from a asynchronously from a work_queue rather than synchronously. This means that there can be a small window between the time when MD_RECOVERY_RUNNING is set and when ->sync_thread is set. So code that checks ->sync_thread might now conclude that the thread has not been started and (because a lock is held) will not be started. That is no longer the case. Most of those places are best fixed by testing MD_RECOVERY_RUNNING as well. To make this completely reliable, we wake_up(&resync_wait) after clearing that flag as well as after clearing ->sync_thread. Other places are better served by flushing the relevant workqueue to ensure that that if the sync thread was starting, it has now started. This is particularly best if we are about to stop the sync thread. Fixes: ac05f256691fe427a3e84c19261adb0b67dd73c0 Signed-off-by: NeilBrown <neilb@suse.de>
| * md: fix semicolon.cocci warningskbuild test robot2014-12-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | drivers/md/md.c:7175:43-44: Unneeded semicolon Removes unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.NeilBrown2014-12-031-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is critical that fetch_block() and handle_stripe_dirtying() are consistent in their analysis of what needs to be loaded. Otherwise raid5 can wait forever for a block that won't be loaded. Currently when writing to a RAID5 that is resyncing, to a location beyond the resync offset, handle_stripe_dirtying chooses a reconstruct-write cycle, but fetch_block() assumes a read-modify-write, and a lockup can happen. So treat that case just like RAID6, just as we do in handle_stripe_dirtying. RAID6 always does reconstruct-write. This bug was introduced when the behaviour of handle_stripe_dirtying was changed in 3.7, so the patch is suitable for any kernel since, though it will need careful merging for some versions. Cc: stable@vger.kernel.org (v3.7+) Fixes: a7854487cd7128a30a7f4f5259de9f67d5efb95f Reported-by: Henry Cai <henryplusplus@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2014-12-133-32/+10
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer driver updates from Jens Axboe: - NVMe updates: - The blk-mq conversion from Matias (and others) - A stack of NVMe bug fixes from the nvme tree, mostly from Keith. - Various bug fixes from me, fixing issues in both the blk-mq conversion and generic bugs. - Abort and CPU online fix from Sam. - Hot add/remove fix from Indraneel. - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp) - With the generic IO stat accounting from 3.19/core, converting md, bcache, and rsxx to use those. From Gu Zheng. - Boundary check for queue/irq mode for null_blk from Matias. Fixes cases where invalid values could be given, causing the device to hang. - The xen blkfront pull request, with two bug fixes from Vitaly. * 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits) NVMe: fix race condition in nvme_submit_sync_cmd() NVMe: fix retry/error logic in nvme_queue_rq() NVMe: Fix FS mount issue (hot-remove followed by hot-add) NVMe: fix error return checking from blk_mq_alloc_request() NVMe: fix freeing of wrong request in abort path xen/blkfront: remove redundant flush_op xen/blkfront: improve protection against issuing unsupported REQ_FUA NVMe: Fix command setup on IO retry null_blk: boundary check queue_mode and irqmode block/rsxx: use generic io stats accounting functions to simplify io stat accounting md: use generic io stats accounting functions to simplify io stat accounting drbd: use generic io stats accounting functions to simplify io stat accounting md/bcache: use generic io stats accounting functions to simplify io stat accounting NVMe: Update module version major number NVMe: fail pci initialization if the device doesn't have any BARs NVMe: add ->exit_hctx() hook NVMe: make setup work for devices that don't do INTx NVMe: enable IO stats by default NVMe: nvme_submit_async_admin_req() must use atomic rq allocation NVMe: replace blk_put_request() with blk_mq_free_request() ...
| * | md: use generic io stats accounting functions to simplify io stat accountingGu Zheng2014-11-242-15/+4
| | | | | | | | | | | | | | | | | | | | | | | | Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | md/bcache: use generic io stats accounting functions to simplify io stat ↵Gu Zheng2014-11-241-17/+6
| |/ | | | | | | | | | | | | | | | | | | | | accounting Use generic io stats accounting help functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Kent Overstreet <kmo@datera.io> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge tag 'dm-3.19-changes' of ↵Linus Torvalds2014-12-0821-569/+1610
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Significant DM thin-provisioning performance improvements to meet performance requirements that were requested by the Gluster distributed filesystem. Specifically, dm-thinp now takes care to aggregate IO that will be issued to the same thinp block before issuing IO to the underlying devices. This really helps improve performance on HW RAID6 devices that have a writeback cache because it avoids RMW in the HW RAID controller. - Some stable fixes: fix leak in DM bufio if integrity profiles were enabled, use memzero_explicit in DM crypt to avoid any potential for information leak, and a DM cache fix to properly mark a cache block dirty if it was promoted to the cache via the overwrite optimization. - A few simple DM persistent data library fixes - DM cache multiqueue policy block promotion improvements. - DM cache discard improvements that take advantage of range (multiblock) discard support in the DM bio-prison. This allows for much more efficient bulk discard processing (e.g. when mkfs.xfs discards the entire device). - Some small optimizations in DM core and RCU deference cleanups - DM core changes to suspend/resume code to introduce the new internal suspend/resume interface that the DM thin-pool target now uses to suspend/resume active thin devices when the thin-pool must suspend/resume. This avoids forcing userspace to track all active thin volumes in a thin-pool when the thin-pool is suspended for the purposes of metadata or data space resize. * tag 'dm-3.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (49 commits) dm crypt: use memzero_explicit for on-stack buffer dm space map metadata: fix sm_bootstrap_get_count() dm space map metadata: fix sm_bootstrap_get_nr_blocks() dm bufio: fix memleak when using a dm_buffer's inline bio dm cache: fix spurious cell_defer when dealing with partial block at end of device dm cache: dirty flag was mistakenly being cleared when promoting via overwrite dm cache: only use overwrite optimisation for promotion when in writeback mode dm cache: discard block size must be a multiple of cache block size dm cache: fix a harmless race when working out if a block is discarded dm cache: when reloading a discard bitset allow for a different discard block size dm cache: fix some issues with the new discard range support dm array: if resizing the array is a noop set the new root to the old one dm: use rcu_dereference_protected instead of rcu_dereference dm thin: fix pool_io_hints to avoid looking at max_hw_sectors dm thin: suspend/resume active thin devices when reloading thin-pool dm: enhance internal suspend and resume interface dm thin: do not allow thin device activation while pool is suspended dm: add presuspend_undo hook to target_type dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctl dm thin: remove stale 'trim' message in block comment above pool_message ...
| * dm crypt: use memzero_explicit for on-stack bufferMilan Broz2014-12-021-1/+1
| | | | | | | | | | | | | | | | | | Use memzero_explicit to cleanup sensitive data allocated on stack to prevent the compiler from optimizing and removing memset() calls. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
| * dm space map metadata: fix sm_bootstrap_get_count()Joe Thornber2014-12-021-1/+3
| | | | | | | | | | | | | | Must set 'result' accordingly rather than return it. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm space map metadata: fix sm_bootstrap_get_nr_blocks()Dan Carpenter2014-12-011-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function isn't right and it causes a static checker warning: drivers/md/dm-thin.c:3016 maybe_resize_data_dev() error: potentially using uninitialized 'sb_data_size'. It should set "*count" and return zero on success the same as the sm_metadata_get_nr_blocks() function does earlier. Fixes: 3241b1d3e0aa ('dm: add persistent data library') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm bufio: fix memleak when using a dm_buffer's inline bioDarrick J. Wong2014-12-011-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When dm-bufio sets out to use the bio built into a struct dm_buffer to issue an IO, it needs to call bio_reset after it's done with the bio so that we can free things attached to the bio such as the integrity payload. Therefore, inject our own endio callback to take care of the bio_reset after calling submit_io's end_io callback. Test case: 1. modprobe scsi_debug delay=0 dif=1 dix=199 ato=1 dev_size_mb=300 2. Set up a dm-bufio client, e.g. dm-verity, on the scsi_debug device 3. Repeatedly read metadata and watch kmalloc-192 leak! Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
| * dm cache: fix spurious cell_defer when dealing with partial block at end of ↵Joe Thornber2014-12-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | device We never bother caching a partial block that is at the back end of the origin device. No cell ever gets locked, but the calling code was assuming it was and trying to release it. Now the code only releases if the cell has been set to a non NULL value. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
| * dm cache: dirty flag was mistakenly being cleared when promoting via overwriteJoe Thornber2014-12-011-3/+7
| | | | | | | | | | | | | | | | | | | | | | If the incoming bio is a WRITE and completely covers a block then we don't bother to do any copying for a promotion operation. Once this is done the cache block and origin block will be different, so we need to set it to 'dirty'. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
| * dm cache: only use overwrite optimisation for promotion when in writeback modeJoe Thornber2014-12-011-1/+2
| | | | | | | | | | | | | | | | | | Overwrite causes the cache block and origin blocks to diverge, which is only allowed in writeback mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
| * dm cache: discard block size must be a multiple of cache block sizeJoe Thornber2014-12-011-6/+3
| | | | | | | | | | | | | | | | Otherwise the cache blocks may span two discard blocks, which we don't handle when doing the discard lookup. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: fix a harmless race when working out if a block is discardedJoe Thornber2014-12-011-2/+4
| | | | | | | | | | | | | | | | | | It is more correct to hold the cell before checking the discard state. These flags are only used as hints to the policy so this change will have negligable effect. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: when reloading a discard bitset allow for a different discard ↵Joe Thornber2014-12-011-7/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | block size The discard block size can change if the origin changes size or if an old DM cache is upgraded from using a discard block size that was equal to cache block size. To fix this an extent of discarded blocks is established for the purpose of translating the old discard block size to the new in-core discard block size and set bits. The old (potentially huge) discard bitset is left ondisk until it is re-written using the new in-core information on the next successful DM cache shutdown. Fixes: 7ae34e777896 ("dm cache: improve discard support") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: fix some issues with the new discard range supportJoe Thornber2014-12-011-3/+3
| | | | | | | | | | | | | | | | | | | | | | Commit 7ae34e777 ("dm cache: improve discard support") needed to also: - discontinue having DM core split the discard bios on cache block boundaries - calculate the cache's discard_nr_blocks relative to the determined discard_block_size rather than using oblock_to_dblock() Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm array: if resizing the array is a noop set the new root to the old oneJoe Thornber2014-12-011-1/+3
| | | | | | | | | | | | | | | | | | | | This could've been quite bad (to return success but not update the new root to point at the old) but in practice the only known consumer of the dm array code is the DM cache target. And the DM cache target passes in the same old root to array_resize() anyway. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: use rcu_dereference_protected instead of rcu_dereferenceEric Dumazet2014-11-231-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rcu_dereference() should be used in sections protected by rcu_read_lock. For writers, holding some kind of mutex or lock, rcu_dereference_protected() is the way to go, adding explicit lockdep bits. In __unbind(), we are the last user of this mapped device, so can use the constant '1' instead of a lockdep_is_held(), not consistent with other uses of rcu_dereference_protected() which use md->suspend_lock mutex. Reported-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 33423974bfc1 ("dm: Use rcu_dereference() for accessing rcu pointer") Cc: Pranith Kumar <bobby.prani@gmail.com> [snitzer: allow lines longer than 80 columns, refine subject] Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: fix pool_io_hints to avoid looking at max_hw_sectorsMike Snitzer2014-11-211-14/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simplify the pool_io_hints code that works to establish a max_sectors value that is a power-of-2 factor of the thin-pool's blocksize. The biggest associated improvement is that the DM thin-pool is no longer concerning itself with the data device's max_hw_sectors when adjusting max_sectors. This fixes the relative fragility of the original "dm thin: adjust max_sectors_kb based on thinp blocksize" commit that only became apparent when testing was performed using a DM thin-pool ontop of a virtio_blk device. One proposed upstream patch detailed the problems inherent in virtio_blk: https://lkml.org/lkml/2014/11/20/611 So even though virtio_blk incorrectly set its max_hw_sectors it actually helped make it clear that we need DM thinp to be tolerant of any future Linux driver that incorrectly sets max_hw_sectors. We only need to be concerned with modifying the thin-pool device's max_sectors limit if it is smaller than the thin-pool's blocksize. In this case the value of max_sectors does become a limiting factor when upper layers (e.g. filesystems) construct their bios. But if the hardware can support IOs larger than the thin-pool's blocksize the user is encouraged to adjust the thin-pool's data device's max_sectors accordingly -- doing so will enable the thin-pool to inherit the established user-defined max_sectors. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: suspend/resume active thin devices when reloading thin-poolMike Snitzer2014-11-191-2/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Before this change it was expected that userspace would first suspend all active thin devices, reload/resize the thin-pool target, then resume all active thin devices. Now the thin-pool suspend/resume will trigger the suspend/resume of all active thins via appropriate calls to dm_internal_suspend and dm_internal_resume. Store the mapped_device for each thin device in struct thin_c to make these calls possible. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
| * dm: enhance internal suspend and resume interfaceMike Snitzer2014-11-194-58/+187
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename dm_internal_{suspend,resume} to dm_internal_{suspend,resume}_fast -- dm-stats will continue using these methods to avoid all the extra suspend/resume logic that is not needed in order to quickly flush IO. Introduce dm_internal_suspend_noflush() variant that actually calls the mapped_device's target callbacks -- otherwise target-specific hooks are avoided (e.g. dm-thin's thin_presuspend and thin_postsuspend). Common code between dm_internal_{suspend_noflush,resume} and dm_{suspend,resume} was factored out as __dm_{suspend,resume}. Update dm_internal_{suspend_noflush,resume} to always take and release the mapped_device's suspend_lock. Also update dm_{suspend,resume} to be aware of potential for DM_INTERNAL_SUSPEND_FLAG to be set and respond accordingly by interruptibly waiting for the DM_INTERNAL_SUSPEND_FLAG to be cleared. Add lockdep annotation to dm_suspend() and dm_resume(). The existing DM_SUSPEND_FLAG remains unchanged. DM_INTERNAL_SUSPEND_FLAG is set by dm_internal_suspend_noflush() and cleared by dm_internal_resume(). Both DM_SUSPEND_FLAG and DM_INTERNAL_SUSPEND_FLAG may be set if a device was already suspended when dm_internal_suspend_noflush() was called -- this can be thought of as a "nested suspend". A "nested suspend" can occur with legacy userspace dm-thin code that might suspend all active thin volumes before suspending the pool for resize. But otherwise, in the normal dm-thin-pool suspend case moving forward: the thin-pool will have DM_SUSPEND_FLAG set and all active thins from that thin-pool will have DM_INTERNAL_SUSPEND_FLAG set. Also add DM_INTERNAL_SUSPEND_FLAG to status report. This new DM_INTERNAL_SUSPEND_FLAG state is being reported to assist with debugging (e.g. 'dmsetup info' will report an internally suspended device accordingly). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
| * dm thin: do not allow thin device activation while pool is suspendedMike Snitzer2014-11-191-10/+45
| | | | | | | | | | | | | | | | | | | | | | | | Otherwise IO could be issued to the pool while it is suspended. Care was taken to properly interlock between the thin and thin-pool targets when accessing the pool's 'suspended' flag. The thin_ctr will not add a new thin device to the pool's active_thins list if the pool is susepended. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
| * dm: add presuspend_undo hook to target_typeMike Snitzer2014-11-193-9/+38
| | | | | | | | | | | | | | | | The DM thin-pool target now must undo the changes performed during pool_presuspend() so introduce presuspend_undo hook in target_type. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
| * dm: return earlier from dm_blk_ioctl if target doesn't implement .ioctlMike Snitzer2014-11-191-2/+3
| | | | | | | | | | | | | | No point checking if the device is suspended if the current target doesn't even implement .ioctl Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: remove stale 'trim' message in block comment above pool_messageMike Snitzer2014-11-121-1/+0
| | | | | | | | Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: fix a race in thin_dtrMikulas Patocka2014-11-121-3/+3
| | | | | | | | | | | | | | | | | | As long as struct thin_c is in the list, anyone can grab a reference of it. Consequently, we must wait for the reference count to drop to zero *after* we remove the structure from the list, not before. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: emit a warning message if there are a lot of cache blocksJoe Thornber2014-11-121-3/+16
| | | | | | | | | | | | | | | | | | Loading and saving millions of block mappings takes time. We may as well explain what's going on, and encourage people to use a larger cache block size. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: improve discard supportJoe Thornber2014-11-101-45/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Safely allow the discard blocksize to be larger than the cache blocksize by using the bio prison's range locking support. This also improves discard performance considerly because larger discards are issued to the dm-cache device. The discard blocksize was always intended to be greater than the cache blocksize. But until now it wasn't implemented safely. Also, by safely restoring the ability to have discard blocksize larger than cache blocksize we're able to significantly reduce the memory used for the cache's discard bitset. Before, with a small discard blocksize, the discard bitset could get quite large because its size is a function of the discard blocksize and the origin device's size. For example, previously, using a 32KB cache blocksize with a 40TB origin resulted in 1280MB of incore memory use for the discard bitset! Now, the discard blocksize is scaled up accordingly to ensure the discard bitset is capped at 2**14 bits, or 16KB. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: revert "prevent corruption caused by discard_block_size > ↵Joe Thornber2014-11-101-3/+34
| | | | | | | | | | | | | | | | | | | | | | | | cache_block_size" This reverts commit d132cc6d9e92424bb9d4fd35f5bd0e55d583f4be because we actually do want to allow the discard blocksize to be larger than the cache blocksize. Further dm-cache discard changes will make this possible. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: revert "remove remainder of distinct discard block size"Joe Thornber2014-11-104-46/+77
| | | | | | | | | | | | | | | | | | | | This reverts commit 64ab346a360a4b15c28fb8531918d4a01f4eabd9 because we actually do want to allow the discard blocksize to be larger than the cache blocksize. Further dm-cache discard changes will make this possible. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm bio prison: introduce support for locking ranges of blocksJoe Thornber2014-11-104-9/+16
| | | | | | | | | | | | | | | | | | | | Ranges will be placed in the same cell if they overlap. Range locking is a prerequisite for more efficient multi-block discard support in both the cache and thin-provisioning targets. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache policy mq: simplify ability to promote sequential IO to the cacheMike Snitzer2014-11-101-3/+4
| | | | | | | | | | | | | | | | | | | | Before, if the user wanted sequential IO to be promoted to the cache they'd have to set sequential_threshold to some nebulous large value. Now, the user may easily disable sequential IO detection (and sequential IO's implicit bypass of the cache) by setting sequential_threshold to 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache policy mq: tweak algorithm that decides when to promote a blockJoe Thornber2014-11-101-25/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than maintaining a separate promote_threshold variable that we periodically update we now use the hit count of the oldest clean block. Also add a fudge factor to discourage demoting dirty blocks. With some tests this has a sizeable difference, because the old code was too eager to demote blocks. For example, device-mapper-test-suite's git_extract_cache_quick test goes from taking 190 seconds, to 142 (linear on spindle takes 250). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: do not call dm_sync_table() when creating new devicesHannes Reinecke2014-11-101-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When creating new devices dm_sync_table() calls synchronize_rcu_expedited(), causing _all_ pending RCU pointers to be flushed. This causes a latency overhead that is especially noticeable when creating lots of devices. And all of this is pointless as there are no old maps to be disconnected, and hence no stale pointers which would need to be cleared up. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: sparse: Annotate field with __rcu for checkingPranith Kumar2014-11-101-1/+1
| | | | | | | | | | | | | | | | | | Annotate the map field with __rcu since this is a rcu pointer which is checked by sparse. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: Use rcu_dereference() for accessing rcu pointerPranith Kumar2014-11-101-4/+4
| | | | | | | | | | | | | | | | | | The map field in 'struct mapped_device' is an rcu pointer. Use rcu_dereference() while accessing it. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: refactor requeue_io to eliminate spinlock bouncingMike Snitzer2014-11-101-20/+23
| | | | | | | | | | | | Also refactor some other bio_list erroring helpers. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: optimize retry_bios_on_resumeMike Snitzer2014-11-101-7/+2
| | | | | | | | | | | | | | Eliminate redundant should_error_unserviceable_bio check and error loop. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: sort the deferred cellsJoe Thornber2014-11-101-20/+68
| | | | | | | | | | | | | | | | | | Sort the cells in logical block order before processing each cell in process_thin_deferred_cells(). This significantly improves the ondisk layout on rotational storage, whereby improving read performance. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: direct dispatch when breaking sharingJoe Thornber2014-11-101-13/+57
| | | | | | | | | | | | | | | | | | This use of direct submission in process_shared_bio() reduces latency for submitting bios in the shared cell by avoiding adding those bios to the deferred list and waiting for the next iteration of the worker. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: remap the bios in a cell immediatelyJoe Thornber2014-11-103-29/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This use of direct submission in process_prepared_mapping() reduces latency for submitting bios in a cell by avoiding adding those bios to the deferred list and waiting for the next iteration of the worker. But this direct submission exposes the potential for a race between releasing a cell and incrementing deferred set. Fix this by introducing dm_cell_visit_release() and refactoring inc_remap_and_issue_cell() accordingly. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: defer whole cells rather than individual biosJoe Thornber2014-11-102-47/+208
| | | | | | | | | | | | | | | | | | | | | | | | This avoids dropping the cell, so increases the probability that other bios will collect within the cell, rather than being passed individually to the worker. Also add required process_cell and process_discard_cell error handling wrappers and set associated pool-mode function pointers accordingly. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm thin: factor out remap_and_issue_overwriteMike Snitzer2014-11-101-18/+20
| | | | | | | | | | | | Purely cleanup of duplicated code, no functional change. Signed-off-by: Mike Snitzer <snitzer@redhat.com>