summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* md: cancel check/repair requests when recovery is neededDan Williams2008-08-071-1/+3
| | | | | | | | | | If a 'repair' is requested when an array is in a position to 'recover' raid1 will perform the repair while md believes a recovery is happening. Address this at both ends, i.e. cancel check/repair requests upon detecting a recover condition and do not call ->spare_active after completing a check/repair. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Allow raid10 resync to happening in larger chunks.NeilBrown2008-08-051-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | The raid10 resync/recovery code currently limits the amount of in-flight resync IO to 2Meg. This was copied from raid1 where it seems quite adequate. However for raid10, some layouts require a bit of seeking to perform a resync, and allowing a larger buffer size means that the seeking can be significantly reduced. There is probably no real need to limit the amount of in-flight IO at all. Any shortage of memory will naturally reduce the amount of buffer space available down to a set minimum, and any concurrent normal IO will quickly cause resync IO to back off. The only problem would be that normal IO has to wait for all resync IO to finish, so a very large amount of resync IO could cause unpleasant latency when normal IO starts up. So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is available) which seems to be a good amount. Also reduce the amount of memory reserved as there is no need to keep 2Meg just for resync if memory is tight. Thanks to Keld for the suggestion. Cc: Keld Jørn Simonsen <keld@dkuug.dk> Signed-off-by: NeilBrown <neilb@suse.de>
* Allow faulty devices to be removed from a readonly array.NeilBrown2008-08-051-1/+12
| | | | | | | | | | | | | | | Removing faulty devices from an array is a two stage process. First the device is moved from being a part of the active array to being similar to a spare device. Then it can be removed by a request from user space. The first step is currently not performed for read-only arrays, so the second step can never succeed. So allow readonly arrays to remove failed devices (which aren't blocked). Signed-off-by: NeilBrown <neilb@suse.de>
* Don't let a blocked_rdev interfere with read request in raid5/6NeilBrown2008-08-051-8/+21
| | | | | | | | | | | | | | | | | | | When we have externally managed metadata, we need to mark a failed device as 'Blocked' and not allow any writes until that device have been marked as faulty in the metadata and the Blocked flag has been removed. However it is perfectly OK to allow read requests when there is a Blocked device, and with a readonly array, there may not be any metadata-handler watching for blocked devices. So in raid5/raid6 only allow a Blocked device to interfere with Write request or resync. Read requests go through untouched. raid1 and raid10 already differentiate between read and write properly. Signed-off-by: NeilBrown <neilb@suse.de>
* Fail safely when trying to grow an array with a write-intent bitmap.NeilBrown2008-08-052-0/+8
| | | | | | | | | | | | We cannot currently change the size of a write-intent bitmap. So if we change the size of an array which has such a bitmap, it tries to set bits beyond the end of the bitmap. For now, simply reject any request to change the size of an array which has a bitmap. mdadm can remove the bitmap and add a new one after the array has changed size. Signed-off-by: NeilBrown <neilb@suse.de>
* Restore force switch of md array to readonly at reboot time.NeilBrown2008-08-051-1/+5
| | | | | | | | | | | | | | | A recent patch allowed do_md_stop to know whether it was being called via an ioctl or not, and thus where to allow for an extra open file descriptor when checking if it is in use. This broke then switch to readonly performed by the shutdown notifier, which needs to work even when the array is still (apparently) active (as md doesn't get told when the filesystem becomes readonly). So restore this feature by pretending that there can be lots of file descriptors open, but we still want do_md_stop to switch to readonly. Signed-off-by: NeilBrown <neilb@suse.de>
* Make writes to md/safe_mode_delay immediately effective.NeilBrown2008-08-051-0/+5
| | | | | | | | | | | | If we reduce the 'safe_mode_delay', it could still wait for the old delay to completely expire before doing anything about safe_mode. Thus the effect if the change is delayed. To make the effect more immediate, run the timeout function immediately if the delay was reduced. This may cause it to run slightly earlier that required, but that is the safer option. Signed-off-by: NeilBrown <neilb@suse.de>
* Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds2008-08-013-13/+27
|\ | | | | | | | | | | | | | | | | | | * 'for-linus' of git://neil.brown.name/md: md: raid10: wake up frozen array md: do not count blocked devices as spares md: do not progress the resync process if the stripe was blocked md: delay notification of 'active_idle' to the recovery thread md: fix merge error md: move async_tx_issue_pending_all outside spin_lock_irq
| * md: raid10: wake up frozen arrayArthur Jones2008-08-011-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | When rescheduling a bio in raid10, we wake up the md thread, but if the array is frozen, this will have no effect. This causes the array to remain frozen for eternity. We add a wake_up to allow the array to de-freeze. This code is nearly identical to the raid1 code, which has this fix already. Signed-off-by: Arthur Jones <ajones@riverbed.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: do not count blocked devices as sparesDan Williams2008-07-281-1/+2
| | | | | | | | | | | | | | | | remove_and_add_spares() assumes that failed devices have been hot-removed from the array. Removal is skipped in the 'blocked' case so do not count a device in this state as 'spare'. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * md: do not progress the resync process if the stripe was blockedDan Williams2008-07-281-6/+13
| | | | | | | | | | | | | | handle_stripe will take no action on a stripe when waiting for userspace to unblock the array, so do not report completed sectors. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * md: delay notification of 'active_idle' to the recovery threadDan Williams2008-07-231-1/+4
| | | | | | | | | | | | sysfs_notify might sleep, so do not call it from md_safemode_timeout. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * md: fix merge errorDan Williams2008-07-231-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original STRIPE_OP_IO removal patch had the following hunk: - for (i = conf->raid_disks; i--; ) { + for (i = conf->raid_disks; i--; ) set_bit(R5_Wantwrite, &sh->dev[i].flags); - if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending)) - sh->ops.count++; - } However it appears the hunk became broken after merging: - for (i = conf->raid_disks; i--; ) { + for (i = conf->raid_disks; i--; ) set_bit(R5_Wantwrite, &sh->dev[i].flags); set_bit(R5_LOCKED, &dev->flags); s.locked++; - if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending)) - sh->ops.count++; - } Signed-off-by: Dan Williams <dan.j.williams@intel.com>
| * md: move async_tx_issue_pending_all outside spin_lock_irqDan Williams2008-07-231-3/+2
| | | | | | | | | | | | | | | | | | Some dma drivers need to call spin_lock_bh in their device_issue_pending routines. This change avoids: WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3a/0x85() Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2008-08-011-1/+1
|\ \ | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-2.6-block: md: the bitmap code needs to use blk_plug_device_unlocked() block: add a blk_plug_device_unlocked() that grabs the queue lock
| * | md: the bitmap code needs to use blk_plug_device_unlocked()Jens Axboe2008-08-011-1/+1
| | | | | | | | | | | | | | | | | | | | | It doesn't hold the queue lock, so it's both racey on the queue flags and thus spews a warning. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* | | [PATCH] switch mtd and dm-table to lookup_bdev()Al Viro2008-08-011-23/+6
|/ / | | | | | | | | | | No need to open-code it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | [SCSI] scsi_dh: attach to hardware handler from dm-mpathHannes Reinecke2008-07-261-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | multipath keeps a separate device table which may be more current than the built-in one. So we should make sure to always call ->attach whenever a multipath map with hardware handler is instantiated. And we should call ->detach on removal, too. [sekharan: update as per comments from agk] Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dmLinus Torvalds2008-07-219-47/+262
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: dm crypt: add merge dm table: remove merge_bvec sector restriction dm: linear add merge dm: introduce merge_bvec_fn dm snapshot: use per device mempools dm snapshot: fix race during exception creation dm snapshot: track snapshot reads dm mpath: fix test for reinstate_path dm mpath: return parameter error dm io: remove struct padding dm log: make dm_dirty_log init and exit static dm mpath: free path selector on invalid args
| * | dm crypt: add mergeMilan Broz2008-07-211-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements biovec merge function for crypt target. If the underlying device has merge function defined, call it. If not, keep precomputed value. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm table: remove merge_bvec sector restrictionMilan Broz2008-07-211-7/+6
| | | | | | | | | | | | | | | | | | | | | Remove max_sector restriction - merge function replaced it. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm: linear add mergeMilan Broz2008-07-211-5/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements biovec merge function for linear target. If the underlying device has merge function defined, call it. If not, keep precomputed value. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm: introduce merge_bvec_fnMilan Broz2008-07-211-0/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a bvec merge function for device mapper devices for dynamic size restrictions. This code ensures the requested biovec lies within a single target and then calls a target-specific function to check against any constraints imposed by underlying devices. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm snapshot: use per device mempoolsMikulas Patocka2008-07-212-18/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change snapshot per-module mempool to per-device mempool. Per-module mempools could cause a deadlock if multiple snapshot devices are stacked above each other. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm snapshot: fix race during exception creationMikulas Patocka2008-07-211-0/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a race condition that returns incorrect data when a write causes an exception to be allocated whilst a read is still in flight. The race condition happens as follows: * A read to non-reallocated sector in the snapshot is submitted so that the read is routed to the original device. * A write to the original device is submitted. The write causes an exception that reallocates the block. The write proceeds. * The original read is dequeued and reads the wrong data. This race can be triggered with CFQ scheduler and one thread writing and multiple threads reading simultaneously. (This patch relies upon the earlier dm-kcopyd-per-device.patch to avoid a deadlock.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm snapshot: track snapshot readsMikulas Patocka2008-07-212-10/+106
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever a snapshot read gets mapped through to the origin, track it in a per-snapshot hash table indexed by chunk number, using memory allocated from a new per-snapshot mempool. We need to track these reads to avoid race conditions which will be fixed by patches that follow. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm mpath: fix test for reinstate_pathAlasdair G Kergon2008-07-211-1/+1
| | | | | | | | | | | | | | | | | | | | | Fix test for reinstate_path method before attempting to use it. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Julia Lawall <julia@diku.dk>
| * | dm mpath: return parameter errorMikulas Patocka2008-07-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return a specific error message if there are an invalid number of multipath arguments. This invalid command returns an "Unknown error" because the ti->error field is not set dmsetup create --table '0 2 multipath 0 0 1 1 round-robin 0 1 1 /dev/sdh' mpath0 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm io: remove struct paddingRichard Kennedy2008-07-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Rearrange struct dm_io. Shrinks size from 40 -> 32 allowing more objects/slab. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm log: make dm_dirty_log init and exit staticAdrian Bunk2008-07-212-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | dm_dirty_log_{init,exit}() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
| * | dm mpath: free path selector on invalid argsMikulas Patocka2008-07-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Free path selector if the arguments are invalid. This command (note that it is invalid) causes reference leak on module "dm_round_robin" and prevents the module from being removed. dmsetup create --table '0 2 multipath 0 0 1 1 round-robin /dev/sdh' mpath0 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds2008-07-219-761/+752
|\ \ \ | |/ / |/| / | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://neil.brown.name/md: (52 commits) md: Protect access to mddev->disks list using RCU md: only count actual openers as access which prevent a 'stop' md: linear: Make array_size sector-based and rename it to array_sectors. md: Make mddev->array_size sector-based. md: Make super_type->rdev_size_change() take sector-based sizes. md: Fix check for overlapping devices. md: Tidy up rdev_size_store a bit: md: Remove some unused macros. md: Turn rdev->sb_offset into a sector-based quantity. md: Make calc_dev_sboffset() return a sector count. md: Replace calc_dev_size() by calc_num_sectors(). md: Make update_size() take the number of sectors. md: Better control of when do_md_stop is allowed to stop the array. md: get_disk_info(): Don't convert between signed and unsigned and back. md: Simplify restart_array(). md: alloc_disk_sb(): Return proper error value. md: Simplify sb_equal(). md: Simplify uuid_equal(). md: sb_equal(): Fix misleading printk. md: Fix a typo in the comment to cmd_match(). ...
| * md: Protect access to mddev->disks list using RCUNeilBrown2008-07-212-17/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All modifications and most access to the mddev->disks list are made under the reconfig_mutex lock. However there are three places where the list is walked without any locking. If a reconfig happens at this time, havoc (and oops) can ensue. So use RCU to protect these accesses: - wrap them in rcu_read_{,un}lock() - use list_for_each_entry_rcu - add to the list with list_add_rcu - delete from the list with list_del_rcu - delay the 'free' with call_rcu rather than schedule_work Note that export_rdev did a list_del_init on this list. In almost all cases the entry was not in the list anymore so it was a no-op and so safe. It is no longer safe as after list_del_rcu we may not touch the list_head. An audit shows that export_rdev is called: - after unbind_rdev_from_array, in which case the delete has already been done, - after bind_rdev_to_array fails, in which case the delete isn't needed. - before the device has been put on a list at all (e.g. in add_new_disk where reading the superblock fails). - and in autorun devices after a failure when the device is on a different list. So remove the list_del_init call from export_rdev, and add it back immediately before the called to export_rdev for that last case. Note also that ->same_set is sometimes used for lists other than mddev->list (e.g. candidates). In these cases rcu is not needed. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: only count actual openers as access which prevent a 'stop'NeilBrown2008-07-211-3/+6
| | | | | | | | | | | | | | | | | | Open isn't the only thing that increments ->active. e.g. reading /proc/mdstat will increment it briefly. So to avoid false positives in testing for concurrent access, introduce a new counter that counts just the number of times the md device it open. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: linear: Make array_size sector-based and rename it to array_sectors.Andre Noll2008-07-211-8/+8
| | | | | | | | | | Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: Make mddev->array_size sector-based.Andre Noll2008-07-218-27/+32
| | | | | | | | | | | | | | | | | | This patch renames the array_size field of struct mddev_s to array_sectors and converts all instances to use units of 512 byte sectors instead of 1k blocks. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: Make super_type->rdev_size_change() take sector-based sizes.Andre Noll2008-07-211-21/+19
| | | | | | | | | | | | | | | | Also, change the type of the size parameter from unsigned long long to sector_t and rename it to num_sectors. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: Fix check for overlapping devices.Andre Noll2008-07-211-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | The checks in overlaps() expect all parameters either in block-based or sector-based quantities. However, its single caller passes two rdev->data_offset arguments as well as two rdev->size arguments, the former being sector counts while the latter are measured in 1K blocks. This could cause rdev_size_store() to accept an invalid size from user space. Fix it by passing only sector-based quantities to overlaps(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: Tidy up rdev_size_store a bit:Neil Brown2008-07-211-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - used strict_strtoull in place of simple_strtoull - use my_mddev in place of rdev->mddev (they have the same value) and more significantly, - don't adjust mddev->size to fit, rather reject changes which make rdev->size smaller than mddev->size Adjusting mddev->size is a hangover from bind_rdev_to_array which does a similar thing. But it really is a better design to insist that mddev->size is set as required, then the rdev->sizes are set to allow for that. The previous way invites confusion. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: Turn rdev->sb_offset into a sector-based quantity.Andre Noll2008-07-112-48/+43
| | | | | | | | | | | | | | Rename it to sb_start to make sure all users have been converted. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
| * md: Make calc_dev_sboffset() return a sector count.Andre Noll2008-07-111-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | As BLOCK_SIZE_BITS is 10 and MD_NEW_SIZE_SECTORS(2 * x) = 2 * NEW_SIZE_BLOCKS(x), the return value of calc_dev_sboffset() doubles. Fix up all three callers accordingly. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
| * md: Replace calc_dev_size() by calc_num_sectors().Andre Noll2008-07-111-11/+7
| | | | | | | | | | | | | | | | | | Number of sectors is the preferred unit for sizes of raid devices, so change calc_dev_size() so that it returns this unit instead of the number of 1K blocks. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
| * md: Make update_size() take the number of sectors.Andre Noll2008-07-111-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Changing the internal representations of sizes of raid devices from 1K blocks to sector counts (512B units) is desirable because it allows to get rid of many divisions/multiplications and unnecessary casts that are present in the current code. This patch is a first step in this direction. It replaces the old 1K-based "size" argument of update_size() by "num_sectors" and fixes up its two callers. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
| * md: Better control of when do_md_stop is allowed to stop the array.Neil Brown2008-07-111-14/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | do_md_stop check the number of active users before allowing the array to be stopped. Two problems: 1/ it assumes the request is coming through an open file descriptor (via ioctl) so it allows for that. This is not always the case. 2/ it doesn't do the check it the array hasn't been activated. This is not good for cases when we use an inactive array to hold some devices in a container. Signed-off-by: Neil Brown <neilb@suse.de>
| * md: get_disk_info(): Don't convert between signed and unsigned and back.Andre Noll2008-07-111-4/+1
| | | | | | | | | | | | | | | | | | The current code copies a signed int from user space, converts it to unsigned and passes the unsigned value to find_rdev_nr() which expects a signed value. Simply pass the signed value from user space directly. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
| * md: Simplify restart_array().Andre Noll2008-07-111-32/+17
| | | | | | | | | | Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
| * md: alloc_disk_sb(): Return proper error value.Andre Noll2008-07-111-1/+1
| | | | | | | | | | | | | | | | If alloc_page() fails, ENOMEM is a more suitable error value than EINVAL. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
| * md: Simplify sb_equal().Andre Noll2008-07-111-5/+1
| | | | | | | | | | | | | | | | The only caller of sb_equal() tests the return value against zero, so it's OK to return the negated return value of memcmp(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
| * md: Simplify uuid_equal().Andre Noll2008-07-111-9/+4
| | | | | | | | | | Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>
| * md: sb_equal(): Fix misleading printk.Andre Noll2008-07-081-1/+1
| | | | | | | | | | Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Neil Brown <neilb@suse.de>