summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* FIX: md: process hangs at wait_barrier after 0->10 takeoverKrzysztof Wojcik2011-02-081-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Following symptoms were observed: 1. After raid0->raid10 takeover operation we have array with 2 missing disks. When we add disk for rebuild, recovery process starts as expected but it does not finish- it stops at about 90%, md126_resync process hangs in "D" state. 2. Similar behavior is when we have mounted raid0 array and we execute takeover to raid10. After this when we try to unmount array- it causes process umount hangs in "D" In scenarios above processes hang at the same function- wait_barrier in raid10.c. Process waits in macro "wait_event_lock_irq" until the "!conf->barrier" condition will be true. In scenarios above it never happens. Reason was that at the end of level_store, after calling pers->run, we call mddev_resume. This calls pers->quiesce(mddev, 0) with RAID10, that calls lower_barrier. However raise_barrier hadn't been called on that 'conf' yet, so conf->barrier becomes negative, which is bad. This patch introduces setting conf->barrier=1 after takeover operation. It prevents to become barrier negative after call lower_barrier(). Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md_make_request: don't touch the bio after calling make_requestChris Mason2011-02-081-2/+7
| | | | | | | | | | | | | | | | md_make_request was calling bio_sectors() for part_stat_add after it was calling the make_request function. This is bad because the make_request function can free the bio and because the bi_size field can change around. The fix here was suggested by Jens Axboe. It saves the sector count before the make_request call. I hit this with CONFIG_DEBUG_PAGEALLOC turned on while trying to break his pretty fusionio card. Cc: <stable@kernel.org> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: Don't allow slot_store while resync/recovery is happening.NeilBrown2011-02-021-0/+3
| | | | | | | | | | Activating a spare in an array while resync/recovery is already happening can lead the that spare being marked in-sync when it isn't really. So don't allow the 'slot' to be set (this activating the device) while resync/recovery is happening. Signed-off-by: NeilBrown <neilb@suse.de>
* md: don't clear curr_resync_completed at end of resync.NeilBrown2011-01-311-3/+0
| | | | | | | | | | | | | | There is no need to set this to zero at this point. It will be set to zero by remove_and_add_spares or at the start of md_do_sync at the latest. And setting it to zero before MD_RECOVERY_RUNNING is cleared can make a 'zero' appear briefly in the 'sync_completed' sysfs attribute just as resync is finishing. So simply remove this setting to zero. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Don't use remove_and_add_spares to remove failed devices from a ↵NeilBrown2011-01-311-2/+15
| | | | | | | | | | | | | | | | read-only array remove_and_add_spares is called in two places where the needs really are very different. remove_and_add_spares should not be called on an array which is about to be reshaped as some extra devices might have been manually added and that would remove them. However if the array is 'read-auto', that will currently happen, which is bad. So in the 'ro != 0' case don't call remove_and_add_spares but simply remove the failed devices as the comment suggests is needed. Signed-off-by: NeilBrown <neilb@suse.de>
* Add raid1->raid0 takeover supportKrzysztof Wojcik2011-01-311-0/+40
| | | | | | | | This patch introduces raid 1 to raid0 takeover operation in kernel space. Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: Neil Brown <neilb@nbeee.brown>
* md: Remove the AllReserved flag for component devices.NeilBrown2011-01-312-10/+5
| | | | | | | | | | | | | | | | | | | | | | | | | This flag is not needed and is used badly. Devices that are included in a native-metadata array are reserved exclusively for that array - and currently have AllReserved set. They all are bd_claimed for the rdev and so cannot be shared. Devices that are included in external-metadata arrays can be shared among multiple arrays - providing there is no overlap. These are bd_claimed for md in general - not for a particular rdev. When changing the amount of a device that is used in an array we need to check for overlap. This currently includes a check on AllReserved So even without overlap, sharing with an AllReserved device is not allowed. However the bd_claim usage already precludes sharing with these devices, so the test on AllReserved is not needed. And in fact it is wrong. As this is the only use of AllReserved, simply remove all usage and definition of AllReserved. Signed-off-by: NeilBrown <neilb@suse.de>
* md: don't abort checking spares as soon as one cannot be added.NeilBrown2011-01-311-2/+1
| | | | | | | | | | | | As spares can be added manually before a reshape starts, we need to find them all to mark some of them as in_sync. Previously we would abort looking for spares when we found an unallocated spare what could not be added to the array (implying there was no room for new spares). However already-added spares could be later in the list, so we need to keep searching. Signed-off-by: NeilBrown <neilb@suse.de>
* md: fix the test for finding spares in raid5_start_reshape.NeilBrown2011-01-311-2/+2
| | | | | | | | | | | | | As spares can be added to the array before the reshape is started, we need to find and count them when checking there are enough. The array could have been degraded, so we need to check all devices, no just those out side of the range of devices in the array before the reshape. So instead of checking the index, check the In_sync flag as that reliably tells if the device is a spare or this purpose. Signed-off-by: NeilBrown <neilb@suse.de>
* md: simplify some 'if' conditionals in raid5_start_reshape.NeilBrown2011-01-311-27/+28
| | | | | | | | | | | | | | | | | | There are two consecutive 'if' statements. if (mddev->delta_disks >= 0) .... if (mddev->delta_disks > 0) The code in the second is equally valid if delta_disks == 0, and these two statements are the only place that 'added_devices' is used. So make them a single if statement, make added_devices a local variable, and re-indent it all. No functional change. Signed-off-by: NeilBrown <neilb@suse.de>
* md: revert change to raid_disks on failure.NeilBrown2011-01-311-0/+2
| | | | | | | | If we try to update_raid_disks and it fails, we should put 'delta_disks' back to zero. This is important because some code, such as slot_store, assumes that delta_disks has been validated. Signed-off-by: NeilBrown <neilb@suse.de>
* block: restore multiple bd_link_disk_holder() supportTejun Heo2011-01-142-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Commit e09b457b (block: simplify holder symlink handling) incorrectly assumed that there is only one link at maximum. dm may use multiple links and expects block layer to track reference count for each link, which is different from and unrelated to the exclusive device holder identified by @holder when the device is opened. Remove the single holder assumption and automatic removal of the link and revive the per-link reference count tracking. The code essentially behaves the same as before commit e09b457b sans the unnecessary kobject reference count dancing. While at it, note that this facility should not be used by anyone else than the current ones. Sysfs symlinks shouldn't be abused like this and the whole thing doesn't belong in the block layer at all. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Milan Broz <mbroz@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dmLinus Torvalds2011-01-1317-292/+1581
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: (32 commits) dm: raid456 basic support dm: per target unplug callback support dm: introduce target callbacks and congestion callback dm mpath: delay activate_path retry on SCSI_DH_RETRY dm: remove superfluous irq disablement in dm_request_fn dm log: use PTR_ERR value instead of ENOMEM dm snapshot: avoid storing private suspended state dm snapshot: persistent make metadata_wq multithreaded dm: use non reentrant workqueues if equivalent dm: convert workqueues to alloc_ordered dm stripe: switch from local workqueue to system_wq dm: dont use flush_scheduled_work dm snapshot: remove unused dm_snapshot queued_bios_work dm ioctl: suppress needless warning messages dm crypt: add loop aes iv generator dm crypt: add multi key capability dm crypt: add post iv call to iv generator dm crypt: use io thread for reads only if mempool exhausted dm crypt: scale to multiple cpus dm crypt: simplify compatible table output ...
| * dm: raid456 basic supportNeilBrown2011-01-133-0/+722
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is the skeleton for the DM target that will be the bridge from DM to MD (initially RAID456 and later RAID1). It provides a way to use device-mapper interfaces to the MD RAID456 drivers. As with all device-mapper targets, the nominal public interfaces are the constructor (CTR) tables and the status outputs (both STATUSTYPE_INFO and STATUSTYPE_TABLE). The CTR table looks like the following: 1: <s> <l> raid \ 2: <raid_type> <#raid_params> <raid_params> \ 3: <#raid_devs> <meta_dev1> <dev1> .. <meta_devN> <devN> Line 1 contains the standard first three arguments to any device-mapper target - the start, length, and target type fields. The target type in this case is "raid". Line 2 contains the arguments that define the particular raid type/personality/level, the required arguments for that raid type, and any optional arguments. Possible raid types include: raid4, raid5_la, raid5_ls, raid5_rs, raid6_zr, raid6_nr, and raid6_nc. (again, raid1 is planned for the future.) The list of required and optional parameters is the same for all the current raid types. The required parameters are positional, while the optional parameters are given as key/value pairs. The possible parameters are as follows: <chunk_size> Chunk size in sectors. [[no]sync] Force/Prevent RAID initialization [rebuild <idx>] Rebuild the drive indicated by the index [daemon_sleep <ms>] Time between bitmap daemon work to clear bits [min_recovery_rate <kB/sec/disk>] Throttle RAID initialization [max_recovery_rate <kB/sec/disk>] Throttle RAID initialization [max_write_behind <value>] See '-write-behind=' (man mdadm) [stripe_cache <sectors>] Stripe cache size for higher RAIDs Line 3 contains the list of devices that compose the array in metadata/data device pairs. If the metadata is stored separately, a '-' is given for the metadata device position. If a drive has failed or is missing at creation time, a '-' can be given for both the metadata and data drives for a given position. Examples: # RAID4 - 4 data drives, 1 parity # No metadata devices specified to hold superblock/bitmap info # Chunk size of 1MiB # (Lines separated for easy reading) 0 1960893648 raid \ raid4 1 2048 \ 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81 # RAID4 - 4 data drives, 1 parity (no metadata devices) # Chunk size of 1MiB, force RAID initialization, # min recovery rate at 20 kiB/sec/disk 0 1960893648 raid \ raid4 4 2048 min_recovery_rate 20 sync\ 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81 Performing a 'dmsetup table' should display the CTR table used to construct the mapping (with possible reordering of optional parameters). Performing a 'dmsetup status' will yield information on the state and health of the array. The output is as follows: 1: <s> <l> raid \ 2: <raid_type> <#devices> <1 health char for each dev> <resync_ratio> Line 1 is standard DM output. Line 2 is best shown by example: 0 1960893648 raid raid4 5 AAAAA 2/490221568 Here we can see the RAID type is raid4, there are 5 devices - all of which are 'A'live, and the array is 2/490221568 complete with recovery. Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: per target unplug callback supportNeilBrown2011-01-131-0/+5
| | | | | | | | | | | | | | | | | | | | Add per-target unplug callback support. Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: introduce target callbacks and congestion callbackNeilBrown2011-01-131-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DM currently implements congestion checking by checking on congestion in each component device. For raid456 we need to also check if the stripe cache is congested. Add per-target congestion checker callback support. Extending the target_callbacks structure with additional callback functions allows for establishing multiple callbacks per-target (a callback is also needed for unplug). Cc: linux-raid@vger.kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm mpath: delay activate_path retry on SCSI_DH_RETRYChandra Seetharaman2011-01-131-10/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a user-configurable 'pg_init_delay_msecs' feature. Use this feature to specify the number of milliseconds to delay before retrying scsi_dh_activate, when SCSI_DH_RETRY is returned. SCSI Device Handlers return SCSI_DH_IMM_RETRY if we could retry activation immediately and SCSI_DH_RETRY in cases where it is better to retry after some delay. Currently we immediately retry scsi_dh_activate irrespective of SCSI_DH_IMM_RETRY and SCSI_DH_RETRY. The 'pg_init_delay_msecs' feature may be provided during table create or load, e.g.: dmsetup create --table "0 20971520 multipath 3 queue_if_no_path \ pg_init_delay_msecs 2500 ..." mpatha The default for 'pg_init_delay_msecs' is 2000 milliseconds. Maximum configurable delay is 60000 milliseconds. Specifying a 'pg_init_delay_msecs' of 0 will cause immediate retry. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Acked-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: remove superfluous irq disablement in dm_request_fnKiyoshi Ueda2011-01-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes spin_lock_irq() to spin_lock() in dm_request_fn(). This patch is just a clean-up and no functional change. The spin_lock_irq() was leftover from the early request-based dm code, where map_request() used to enable interrupts. Since current map_request() never enables interrupts, we can change it to spin_lock() to match the prior spin_unlock(). Auditing through the dm and block-layer code called from map_request(), I confirmed all functions save/restore interrupt status, so no function returning with interrupts enabled. Also I haven't observed any problem on my test environment which uses scsi and lpfc driver after heavy I/O testing with occasional path down/up. Added BUG_ON() to detect breakage in future. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm log: use PTR_ERR value instead of ENOMEMDan Carpenter2011-01-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's nicer to return the PTR_ERR() value instead of just returning -ENOMEM. In the current code the PTR_ERR() value is always equal to -ENOMEM so this doesn't actually affect anything, but still... In addition, dm_dirty_log_create() doesn't check for a specific -ENOMEM return. So this change is safe relative to potential for a non -ENOMEM return in the future. Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: avoid storing private suspended stateMike Snitzer2011-01-131-20/+4
| | | | | | | | | | | | | | | | Use dm_suspended() rather than having each snapshot target maintain a private 'suspended' flag in struct dm_snapshot. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: persistent make metadata_wq multithreadedTejun Heo2011-01-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | metadata_wq serves on-stack work items from chunk_io(). Even if multiple chunk_io() are simultaneously in progress, each is independent and queued only once, so multithreaded workqueue can be safely used. Switch metadata_wq to multithread and flush the work item instead of the workqueue in chunk_io(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: use non reentrant workqueues if equivalentTejun Heo2011-01-133-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | kmirrord_wq, kcopyd_work and md->wq are created per dm instance and serve only a single work item from the dm instance, so non-reentrant workqueues would provide the same ordering guarantees as ordered ones while allowing CPU affinity and use of the workqueues for other purposes. Switch them to non-reentrant workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: convert workqueues to alloc_orderedTejun Heo2011-01-136-7/+8
| | | | | | | | | | | | | | | | | | | | Convert all create[_singlethread]_work() users to the new alloc[_ordered]_workqueue(). This conversion is mechanical and doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm stripe: switch from local workqueue to system_wqTejun Heo2011-01-131-20/+7
| | | | | | | | | | | | | | | | | | | | | | | | kstriped only serves sc->kstriped_ws which runs dm_table_event(). This doesn't need to be executed from an ordered workqueue w/ rescuer. Drop kstriped and use the system_wq instead. While at it, rename kstriped_ws to trigger_event so that it's consistent with other dm modules. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: dont use flush_scheduled_workTejun Heo2011-01-132-2/+2
| | | | | | | | | | | | | | | | | | | | flush_scheduled_work() is being deprecated. Flush the used work directly instead. In all dm targets, the only work which uses system_wq is ->trigger_event. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm snapshot: remove unused dm_snapshot queued_bios_workTejun Heo2011-01-131-38/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | dm_snapshot->queued_bios_work isn't used. Remove ->queued_bios[_work] from dm_snapshot structure, the flush_queued_bios work function and ksnapd workqueue. The DM snapshot changes that were going to use the ksnapd workqueue were either superseded (fix for origin write races) or never completed (deallocation of invalid snapshot's memory via workqueue). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm ioctl: suppress needless warning messagesMilan Broz2011-01-131-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | The device-mapper should not send warning messages to syslog if a device is not found. This can be done by userspace according to the returned dm-ioctl error code. So move these messages to debug level and use rate limiting to not flood syslog. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: add loop aes iv generatorMilan Broz2011-01-131-1/+192
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a compatible implementation of the block chaining mode used by the Loop-AES block device encryption system (http://loop-aes.sourceforge.net/) designed by Jari Ruusu. It operates on full 512 byte sectors and uses CBC with an IV derived from the sector number, the data and optionally extra IV seed. This means that after CBC decryption the first block of sector must be tweaked according to decrypted data. Loop-AES can use three encryption schemes: version 1: is plain aes-cbc mode (already compatible) version 2: uses 64 multikey scheme with own IV generator version 3: the same as version 2 with additional IV seed (it uses 65 keys, last key is used as IV seed) The IV generator is here named lmk (Loop-AES multikey) and for the cipher specification looks like: aes:64-cbc-lmk Version 2 and 3 is recognised according to length of provided multi-key string (which is just hexa encoded "raw key" used in original Loop-AES ioctl). Configuration of the device and decoding key string will be done in userspace (cryptsetup). (Loop-AES stores keys in gpg encrypted file, raw keys are output of simple hashing of lines in this file). Based on an implementation by Max Vozeler: http://article.gmane.org/gmane.linux.kernel.cryptoapi/3752/ Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> CC: Max Vozeler <max@hinterhof.net>
| * dm crypt: add multi key capabilityMilan Broz2011-01-131-21/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds generic multikey handling to be used in following patch for Loop-AES mode compatibility. This patch extends mapping table to optional keycount and implements generic multi-key capability. With more keys defined the <key> string is divided into several <keycount> sections and these are used for tfms. The tfm is used according to sector offset (sector 0->tfm[0], sector 1->tfm[1], sector N->tfm[N modulo keycount]) (only power of two values supported for keycount here). Because of tfms per-cpu allocation, this mode can be take a lot of memory on large smp systems. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Max Vozeler <max@hinterhof.net>
| * dm crypt: add post iv call to iv generatorMilan Broz2011-01-131-13/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IV (initialisation vector) can in principle depend not only on sector but also on plaintext data (or other attributes). Change IV generator interface to work directly with dmreq structure to allow such dependence in generator. Also add post() function which is called after the crypto operation. This allows tricky modification of decrypted data or IV internals. In asynchronous mode the post() can be called after ctx->sector count was increased so it is needed to add iv_sector copy directly to dmreq structure. (N.B. dmreq always include only one sector in scatterlists) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: use io thread for reads only if mempool exhaustedMilan Broz2011-01-131-14/+23
| | | | | | | | | | | | | | | | | | | | | | If there is enough memory, code can directly submit bio instead queing this operation in separate thread. Try to alloc bio clone with GFP_NOWAIT and only if it fails use separate queue (map function cannot block here). Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: scale to multiple cpusAndi Kleen2011-01-131-58/+196
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently dm-crypt does all the encryption work for a single dm-crypt mapping in a single workqueue. This does not scale well when multiple CPUs are submitting IO at a high rate. The single CPU running the single thread cannot keep up with the encryption and encrypted IO performance tanks. This patch changes the crypto workqueue to be per CPU. This means that as long as the IO submitter (or the interrupt target CPUs for reads) runs on different CPUs the encryption work will be also parallel. To avoid a bottleneck on the IO worker I also changed those to be per-CPU threads. There is still some shared data, so I suspect some bouncing cache lines. But I haven't done a detailed study on that yet. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: simplify compatible table outputMilan Broz2011-01-131-16/+12
| | | | | | | | | | | | | | | | Rename cc->cipher_mode to cc->cipher_string and store the whole of the cipher information so it can easily be printed when processing the DM_DEV_STATUS ioctl. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm log userspace: add version number to commsJonathan Brassow2011-01-132-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a 'version' field to the 'dm_ulog_request' structure. The 'version' field is taken from a portion of the unused 'padding' field in the 'dm_ulog_request' structure. This was done to avoid changing the size of the structure and possibly disrupting backwards compatibility. The version number will help notify user-space daemons when a change has been made to the kernel/userspace log API. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm log userspace: group clear and mark requestsJonathan Brassow2011-01-131-23/+79
| | | | | | | | | | | | | | | | | | | | | | Allow the device-mapper log's 'mark' and 'clear' requests to be grouped and processed in a batch. This can significantly reduce the amount of traffic going between the kernel and userspace (where the processing daemon resides). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm log userspace: split flush queueJonathan Brassow2011-01-131-9/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split the 'flush_list', which contained a mix of both 'mark' and 'clear' requests, into two distinct lists ('mark_list' and 'clear_list'). The device mapper log implementations (used by various DM targets) are allowed to cache 'mark' and 'clear' requests until a 'flush' is received. Until now, these cached requests were kept in the same list. They will now be put into distinct lists to facilitate group processing of these requests (in the next patch). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm kcopyd: delay unpluggingMikulas Patocka2011-01-131-3/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make kcopyd merge more I/O requests by using device unplugging. Without this patch, each I/O request is dispatched separately to the device. If the device supports tagged queuing, there are many small requests sent to the device. To improve performance, this patch will batch as many requests as possible, allowing the queue to merge consecutive requests, and send them to the device at once. In my tests (15k SCSI disk), this patch improves sequential write throughput: Sequential write throughput (chunksize of 4k, 32k, 512k) unpatched: 15.2, 18.5, 17.5 MB/s patched: 14.4, 22.6, 23.0 MB/s In most common uses (snapshot or two-way mirror), kcopyd is only used for two devices, one for reading and the other for writing, thus this optimization is implemented only for two devices. The optimization may be extended to n-way mirrors with some code complexity increase. We keep track of two block devices to unplug (one for read and the other for write) and unplug them when exiting "do_work" thread. If there are more devices used (in theory it could happen, in practice it is rare), we unplug immediately. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm log userspace: trap all failed log construction errorsJonathan Brassow2011-01-131-4/+6
| | | | | | | | | | | | | | | | | | When constructing a mirror log, it is possible for the initial request to fail for other reasons besides -ESRCH. These must be handled too. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm crypt: set key size earlyMilan Broz2011-01-131-6/+7
| | | | | | | | | | | | | | | | | | | | Simplify key size verification (hexadecimal string) and set key size early in constructor. (Patch required by later changes.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm: remove dm_mutex after bkl conversionMilan Broz2011-01-131-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch replaces dm_mutex with _minor_lock in dm_blk_close() and then removes it. During the BKL conversion, commit 6e9624b8caec290d28b4c6d9ec75749df6372b87 (block: push down BKL into .open and .release) pushed lock_kernel() down into dm_blk_open/close calls. Commit 2a48fc0ab24241755dc93bfd4f01d68efab47f5a (block: autoconvert trivial BKL users to private mutex) converted it to a local mutex, but _minor_lock is sufficient. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm raid1: support discardMike Snitzer2011-01-131-2/+10
| | | | | | | | | | | | | | | | Enable discard support in the DM mirror target. Also change an existing use of 'bvec' to 'addr' in the union. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm ioctl: allow rename to fill empty uuidPeter Jones2011-01-131-25/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow the uuid of a mapped device to be set after device creation. Previously the uuid (which is optional) could only be set by DM_DEV_CREATE. If no uuid was supplied it could not be set later. Sometimes it's necessary to create the device before the uuid is known, and in such cases the uuid must be filled in after the creation. This patch extends DM_DEV_RENAME to accept a uuid accompanied by a new flag DM_UUID_FLAG. This can only be done once and if no uuid was previously supplied. It cannot be used to change an existing uuid. DM_VERSION_MINOR is also bumped to 19 to indicate this interface extension is available. Signed-off-by: Peter Jones <pjones@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm io: remove BIO_RW_SYNCIO flag from kcopydMikulas Patocka2011-01-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the REQ_SYNC flag to improve write throughput when writing to the origin with a snapshot on the same device (using the CFQ I/O scheduler). Sequential write throughput (chunksize of 4k, 32k, 512k) unpatched: 8.5, 8.6, 9.3 MB/s patched: 15.2, 18.5, 17.5 MB/s Snapshot exception reallocations are triggered by writes that are usually async, so mark the associated dm_io_request as async as well. This helps when using the CFQ I/O scheduler because it has separate queues for sync and async I/O. Async is optimized for throughput; sync for latency. With this change we're consciously favoring throughput over latency. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * dm mpath: disable blk_abort_queueMike Snitzer2011-01-131-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert commit 224cb3e981f1b2f9f93dbd49eaef505d17d894c2 dm: Call blk_abort_queue on failed paths Multipath began to use blk_abort_queue() to allow for lower latency path deactivation. This was found to cause list corruption: the cmd gets blk_abort_queued/timedout run on it and the scsi eh somehow is able to complete and run scsi_queue_insert while scsi_request_fn is still trying to process the request. https://www.redhat.com/archives/dm-devel/2010-November/msg00085.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: stable@kernel.org
| * dm: dont take i_mutex to change device sizeMike Snitzer2011-01-131-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No longer needlessly hold md->bdev->bd_inode->i_mutex when changing the size of a DM device. This additional locking is unnecessary because i_size_write() is already protected by the existing critical section in dm_swap_table(). DM already has a reference on md->bdev so the associated bd_inode may be changed without lifetime concerns. A negative side-effect of having held md->bdev->bd_inode->i_mutex was that a concurrent DM device resize and flush (via fsync) would deadlock. Dropping md->bdev->bd_inode->i_mutex eliminates this potential for deadlock. The following reproducer no longer deadlocks: https://www.redhat.com/archives/dm-devel/2009-July/msg00284.html Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
* | Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds2011-01-136-116/+172
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://neil.brown.name/md: md: Fix removal of extra drives when converting RAID6 to RAID5 md: range check slot number when manually adding a spare. md/raid5: handle manually-added spares in start_reshape. md: fix sync_completed reporting for very large drives (>2TB) md: allow suspend_lo and suspend_hi to decrease as well as increase. md: Don't let implementation detail of curr_resync leak out through sysfs. md: separate meta and data devs md-new-param-to_sync_page_io md-new-param-to-calc_dev_sboffset md: Be more careful about clearing flags bit in ->recovery md: md_stop_writes requires mddev_lock. md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer md: Ensure no IO request to get md device before it is properly initialised. md: Fix single printks with multiple KERN_<level>s md: fix regression resulting in delays in clearing bits in a bitmap md: fix regression with re-adding devices to arrays with no metadata
| * md: Fix removal of extra drives when converting RAID6 to RAID5NeilBrown2011-01-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When a RAID6 is converted to a RAID5, the extra drive should be discarded. However it isn't due to a typo in a comparison. This bug was introduced in commit e93f68a1fc6 in 2.6.35-rc4 and is suitable for any -stable since than. As the extra drive is not removed, the 'degraded' counter is wrong and so the RAID5 will not respond correctly to a subsequent failure. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
| * md: range check slot number when manually adding a spare.NeilBrown2011-01-141-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | When adding a spare to an active array, we should check the slot number, but allow it to be larger than raid_disks if a reshape is being prepared. Apply the same test when adding a device to an array-under-construction. It already had most of the test in place, but not quite all. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5: handle manually-added spares in start_reshape.NeilBrown2011-01-141-2/+7
| | | | | | | | | | | | | | | | | | It is possible to manually add spares to specific slots before starting a reshape. raid5_start_reshape should recognised this possibility and include it in the accounting. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: fix sync_completed reporting for very large drives (>2TB)Rémi Rérolle2011-01-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | The values exported in the sync_completed file are unsigned long, which overflows with very large drives, resulting in wrong values reported. Since sync_completed uses sectors as unit, we'll start getting wrong values with components larger than 2TB. This patch simply replaces the use of unsigned long by unsigned long long. Signed-off-by: Rémi Rérolle <rrerolle@lacie.com> Signed-off-by: NeilBrown <neilb@suse.de>