From 0ba959774e93911caff596de6391f085fb640ac4 Mon Sep 17 00:00:00 2001
From: Guoqing Jiang
Date: Wed, 1 Mar 2017 17:30:29 +0800
Subject: md-cluster: use sync way to handle METADATA_UPDATED msg

Previously, when a node received a METADATA_UPDATED msg, it only needed
to wake up mddev->thread, and md_reload_sb would eventually be called.
We took this asynchronous approach to avoid a deadlock, which could
happen when one node is receiving the METADATA_UPDATED msg (wants
reconfig_mutex) while trying to run the path:

md_check_recovery -> mddev_trylock (holds reconfig_mutex)
                  -> md_update_sb -> metadata_update_start
                     (wants EX on token, but token is held
                      by the sending node)

Since we will support resizing for clustered raid, and the initiating
node must be able to detect failure, the metadata update handling needs
to become synchronous, so we have to change the way the METADATA_UPDATED
msg is handled. But we obviously still need to avoid the deadlock above
with the sync way.

To make this happen, we considered calling md_reload_sb without holding
reconfig_mutex: if some other thread has already taken reconfig_mutex
and is waiting for the 'token', then process_recvd_msg() can safely call
md_reload_sb() without taking the mutex. This is because we can be
certain that no other thread will take the mutex, and we are also
certain that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit

        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

which is set in lock_token() just before dlm_lock_sync() is called, and
cleared just after. As lock_token() is always called with reconfig_mutex
held (the exception is resync_info_update, which was distinguished in
the previous patch), if process_recvd_msg() finds that the new bit is
set, then the mutex must be held by some other thread, which will keep
waiting for the token.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
set. The tricky bit is what to do if neither of these applies: we need
to wait. Fortunately mddev_unlock() always calls wake_up() on
mddev->thread->wqueue, so we can get lock_token() to call wake_up() on
that when it sets the bit.

There are also some related changes in this commit:
1. Remove the RELOAD_SB related code since it is no longer valid.
2. Add mddev to md_cluster_info so we can get at the mddev inside
   lock_token.
3. Add a new parameter to lock_token to indicate whether reconfig_mutex
   is held or not.

And we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in the places
below:

1. Before unregistering the recv thread, otherwise a deadlock could
   appear when stopping a resyncing array, because
   md_unregister_thread(&cinfo->recv_thread) is blocked by
   recv_daemon -> process_recvd_msg -> process_metadata_update.

2. In metadata_update_start, to fix another deadlock:
   a. Node A sends a METADATA_UPDATED msg (holds the Token lock).
   b. Node B wants to do a resync, and is blocked since it can't get
      the Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is not set
      since the call chain
      (md_do_sync -> sync_request -> resync_info_update
                  -> sendmsg -> lock_comm -> lock_token)
      doesn't hold reconfig_mutex.
   c. Node B tries to update the sb (holds reconfig_mutex), but stops
      at the wait_event() in metadata_update_start since we set the
      MD_CLUSTER_SEND_LOCK flag in lock_comm (step b).
   d. Node B then receives the METADATA_UPDATED msg from A, and of
      course recv_daemon is blocked forever.

   Since metadata_update_start always calls lock_token with
   reconfig_mutex held, we need to set
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and lock_token
   doesn't need to set it a second time unless it is invoked from
   lock_comm.

Finally, thanks to Neil for his great idea and help!
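As a rough illustration, the resulting wait-then-reload logic in
process_metadata_update() might look like the sketch below (hedged: the
local variable names and the msg handling are assumptions, not
necessarily the exact code of this patch):

	static void process_metadata_update(struct mddev *mddev,
					    struct cluster_msg *msg)
	{
		struct md_cluster_info *cinfo = mddev->cluster_info;
		int got_lock = 0;

		/* Wait until we either hold reconfig_mutex ourselves, or
		 * we know its holder is stuck waiting for the token, in
		 * which case md_reload_sb() cannot interfere with it.
		 */
		wait_event(mddev->thread->wqueue,
			   (got_lock = mddev_trylock(mddev)) ||
			   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
				    &cinfo->state));
		md_reload_sb(mddev, le32_to_cpu(msg->raid_slot));
		if (got_lock)
			mddev_unlock(mddev);
	}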
Reviewed-by: NeilBrown
Signed-off-by: Guoqing Jiang
Signed-off-by: Shaohua Li
---
 drivers/md/md.h | 3 ---
 1 file changed, 3 deletions(-)

(limited to 'drivers/md/md.h')

diff --git a/drivers/md/md.h b/drivers/md/md.h
index dde8ecb760c8..1c00160b09f9 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -219,9 +219,6 @@ enum mddev_flags {
 				 * it then */
 	MD_JOURNAL_CLEAN,	/* A raid with journal is already clean */
 	MD_HAS_JOURNAL,		/* The raid array has journal feature set */
-	MD_RELOAD_SB,		/* Reload the superblock because another node
-				 * updated it.
-				 */
 	MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
 				 * already took resync lock, need to
 				 * release the lock */
--
cgit v1.2.3


From ea0213e0c7cc1c1b52badf27bd7db4f50a67baaa Mon Sep 17 00:00:00 2001
From: Artur Paszkiewicz
Date: Thu, 9 Mar 2017 09:59:57 +0100
Subject: md: superblock changes for PPL

Include information about the PPL location and size in mdp_superblock_1
and copy it to/from the rdev. Because PPL is mutually exclusive with
bitmap, put it in place of 'bitmap_offset'. Add a new flag
MD_FEATURE_PPL for 'feature_map', analogous to
MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL to mddev->flags to indicate
that PPL is enabled on an array.

Signed-off-by: Artur Paszkiewicz
Signed-off-by: Shaohua Li
---
 drivers/md/md.h | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'drivers/md/md.h')

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1c00160b09f9..a7b2f16452c4 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -122,6 +122,13 @@ struct md_rdev {
 					 * sysfs entry */

 	struct badblocks badblocks;
+
+	struct {
+		short offset;	/* Offset from superblock to start of PPL.
+				 * Not used by external metadata. */
+		unsigned int size;	/* Size in sectors of the PPL space */
+		sector_t sector;	/* First sector of the PPL space */
+	} ppl;
 };
 enum flag_bits {
 	Faulty,			/* device is known to have a fault */
@@ -226,6 +233,7 @@ enum mddev_flags {
 				 * supported as calls to md_error() will
 				 * never cause the array to become failed.
 				 */
+	MD_HAS_PPL,		/* The raid array has PPL feature set */
 };

 enum mddev_sb_flags {
--
cgit v1.2.3


From ba903a3ea465bd2f2bb9316054b295e79a7a518e Mon Sep 17 00:00:00 2001
From: Artur Paszkiewicz
Date: Thu, 9 Mar 2017 10:00:03 +0100
Subject: raid5-ppl: runtime PPL enabling or disabling

Allow writing to the 'consistency_policy' attribute when the array is
active. Add a new function 'change_consistency_policy' to the
md_personality operations structure to handle the change in the
personality code. The values "ppl" and "resync" are accepted and turn
PPL on and off respectively. When enabling PPL, its location and size
should first be set using the 'ppl_sector' and 'ppl_size' attributes,
and a valid PPL header should be written at this location on each
member device.

Enabling or disabling PPL is performed with the array suspended. The
raid5_reset_stripe_cache function frees the stripe cache and allocates
it again in order to allocate or free the ppl_pages for the stripes in
the stripe cache.
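For illustration, the sysfs side of this could dispatch to the new hook
roughly as follows (a hedged sketch; the store function shown and its
error choices are assumptions, not the exact md.c code):

	static ssize_t
	consistency_policy_store(struct mddev *mddev, const char *buf,
				 size_t len)
	{
		int err = -EINVAL;

		if (mddev->pers) {
			/* Active array: let the personality (e.g. raid5)
			 * switch between "ppl" and "resync".
			 */
			if (mddev->pers->change_consistency_policy)
				err = mddev->pers->change_consistency_policy(mddev, buf);
			else
				err = -EBUSY;
		}

		return err ? err : len;
	}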
Signed-off-by: Artur Paszkiewicz
Signed-off-by: Shaohua Li
---
 drivers/md/md.h | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'drivers/md/md.h')

diff --git a/drivers/md/md.h b/drivers/md/md.h
index a7b2f16452c4..e0940064c3ec 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -545,6 +545,8 @@ struct md_personality
 	/* congested implements bdi.congested_fn().
 	 * Will not be called while array is 'suspended' */
 	int (*congested)(struct mddev *mddev, int bits);
+	/* Changes the consistency policy of an active array. */
+	int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
 };

 struct md_sysfs_entry {
--
cgit v1.2.3


From 497280509f32340d90feac030bce18006a3e3605 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Wed, 15 Mar 2017 14:05:12 +1100
Subject: md/raid5: use md_write_start to count stripes, not bios

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count. We currently count bios
submitted to md/raid5. Change it to count stripe_heads that a WRITE
bio has been attached to.

So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.

add_stripe_bio() calls md_write_inc() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().

The parallel between md_write_{start,end} and the use of
bi_phys_segments can be seen in that:
 Whenever we set bi_phys_segments to 1, we now call md_write_start.
 Whenever we increment it on non-read requests with
   raid5_inc_bi_active_stripes(), we now call md_write_inc().
 Whenever we decrement bi_phys_segments on non-read requests with
   raid5_dec_bi_active_stripes(), we now call md_write_end().

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.

md_write_inc() is added, which parallels md_write_start() but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
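At this stage writes_pending is still an atomic_t, so md_write_inc()
can be sketched as little more than the following (a hedged sketch; the
defensive WARN_ON is an assumption, not quoted code):

	void md_write_inc(struct mddev *mddev, struct bio *bi)
	{
		if (bio_data_dir(bi) != WRITE)
			return;
		/* A write must already be pending, so the array cannot be
		 * in_sync or read-only; nothing here can sleep, which makes
		 * this safe under a spinlock.
		 */
		WARN_ON_ONCE(mddev->in_sync || mddev->ro);
		atomic_inc(&mddev->writes_pending);
	}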
Signed-off-by: NeilBrown
Signed-off-by: Shaohua Li
---
 drivers/md/md.h | 1 +
 1 file changed, 1 insertion(+)

(limited to 'drivers/md/md.h')

diff --git a/drivers/md/md.h b/drivers/md/md.h
index e0940064c3ec..0cd12721a536 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -648,6 +648,7 @@ extern void md_wakeup_thread(struct md_thread *thread);
 extern void md_check_recovery(struct mddev *mddev);
 extern void md_reap_sync_thread(struct mddev *mddev);
 extern void md_write_start(struct mddev *mddev, struct bio *bi);
+extern void md_write_inc(struct mddev *mddev, struct bio *bi);
 extern void md_write_end(struct mddev *mddev);
 extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
 extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
--
cgit v1.2.3


From 4ad23a976413aa57fe5ba7a25953dc35ccca5b71 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Wed, 15 Mar 2017 14:05:14 +1100
Subject: MD: use per-cpu counter for writes_pending

The 'writes_pending' counter is used to determine when the array is
stable so that it can be marked in the superblock as "Clean".
Consequently it needs to be updated frequently but only checked for
zero occasionally. Recent changes to raid5 cause the count to be
updated even more often - once per 4K rather than once per bio. This
provided justification for making the updates more efficient.

So we replace the atomic counter with a percpu-refcount. This can be
incremented and decremented cheaply most of the time, and can be
switched to "atomic" mode when more precise counting is needed. As it
is possible for multiple threads to want a precise count, we introduce
a "sync_checkers" counter to count the number of threads in
"set_in_sync()", and only switch the refcount back to percpu mode when
that is zero.

We need to be careful about races between set_in_sync() setting
->in_sync to 1, and md_write_start() setting it to zero.
md_write_start() holds the rcu_read_lock() while checking if the
refcount is in percpu mode. If it is, then we know a switch to
'atomic' will not happen until after we call rcu_read_unlock(), in
which case set_in_sync() will see the elevated count, and not set
in_sync to 1. If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.

It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread. So
now we do these every time we decrement the counter, but make sure
they are fast.

mod_timer() already optimizes the case where the timeout value doesn't
actually change. We leverage that further by always rounding off the
jiffies to the timeout value. This may delay the marking of 'clean'
slightly, but ensures we only perform an atomic operation here when
absolutely needed.

md_wakeup_thread() currently always calls wake_up(), even if
THREAD_WAKEUP is already set. That too can be optimised to avoid calls
to wake_up().
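The rcu-based fast path described above might be sketched like this
(hedged: a simplified illustration of the scheme, not the exact
upstream code; the precise condition and barrier placement are
assumptions):

	void md_write_start(struct mddev *mddev, struct bio *bi)
	{
		if (bio_data_dir(bi) != WRITE)
			return;

		rcu_read_lock();
		percpu_ref_get(&mddev->writes_pending);
		smp_mb(); /* pairs with a barrier in set_in_sync() */
		/* sync_checkers is only non-zero while set_in_sync() runs,
		 * and writes_pending is then in atomic mode; otherwise the
		 * elevated ref taken above is enough to keep in_sync at 0.
		 */
		if (mddev->in_sync || mddev->sync_checkers) {
			spin_lock(&mddev->lock);
			if (mddev->in_sync) {
				mddev->in_sync = 0;
				set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
				md_wakeup_thread(mddev->thread);
			}
			spin_unlock(&mddev->lock);
		}
		rcu_read_unlock();
	}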
Signed-off-by: NeilBrown
Signed-off-by: Shaohua Li
---
 drivers/md/md.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'drivers/md/md.h')

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 0cd12721a536..7a7847d1cc39 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -409,7 +409,8 @@ struct mddev {
 							 */
 	unsigned int			safemode_delay;
 	struct timer_list		safemode_timer;
-	atomic_t			writes_pending;
+	struct percpu_ref		writes_pending;
+	int				sync_checkers;	/* # of threads checking writes_pending */
 	struct request_queue		*queue;	/* for plugging ... */

 	struct bitmap			*bitmap; /* the bitmap for the device */
--
cgit v1.2.3


From d8e29fbc3bed181f2653fb89ac8c34e40db39c30 Mon Sep 17 00:00:00 2001
From: Ming Lei
Date: Fri, 17 Mar 2017 00:12:23 +0800
Subject: md: move two macros into md.h

Both raid1 and raid10 share a common resync block size and page count,
so move the two macros into md.h.

Signed-off-by: Ming Lei
Signed-off-by: Shaohua Li
---
 drivers/md/md.h | 5 +++++
 1 file changed, 5 insertions(+)

(limited to 'drivers/md/md.h')

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 7a7847d1cc39..31d2d70849b6 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -718,4 +718,9 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
 	    !bdev_get_queue(bio->bi_bdev)->limits.max_write_same_sectors)
 		mddev->queue->limits.max_write_same_sectors = 0;
 }
+
+/* Maximum size of each resync request */
+#define RESYNC_BLOCK_SIZE (64*1024)
+#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
+
 #endif /* _MD_MD_H */
--
cgit v1.2.3


From 513e2faa0138462ce014e1b0e226ca45c83bc6c1 Mon Sep 17 00:00:00 2001
From: Ming Lei
Date: Fri, 17 Mar 2017 00:12:24 +0800
Subject: md: prepare for managing resync I/O pages in clean way

Resync I/O currently uses the bio's bvec table to manage its pages.
This is very hacky and may not work any more once multipage bvecs are
introduced. So introduce helpers and a new data structure for managing
resync I/O pages more cleanly.

Signed-off-by: Ming Lei
Signed-off-by: Shaohua Li
---
 drivers/md/md.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

(limited to 'drivers/md/md.h')

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 31d2d70849b6..0418b29945e7 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -723,4 +723,54 @@ static inline void mddev_check_writesame(struct mddev *mddev, struct bio *bio)
 #define RESYNC_BLOCK_SIZE (64*1024)
 #define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)

+/* for managing resync I/O pages */
+struct resync_pages {
+	unsigned	idx;	/* for get/put page from the pool */
+	void		*raid_bio;
+	struct page	*pages[RESYNC_PAGES];
+};
+
+static inline int resync_alloc_pages(struct resync_pages *rp,
+				     gfp_t gfp_flags)
+{
+	int i;
+
+	for (i = 0; i < RESYNC_PAGES; i++) {
+		rp->pages[i] = alloc_page(gfp_flags);
+		if (!rp->pages[i])
+			goto out_free;
+	}
+
+	return 0;
+
+out_free:
+	while (--i >= 0)
+		put_page(rp->pages[i]);
+	return -ENOMEM;
+}
+
+static inline void resync_free_pages(struct resync_pages *rp)
+{
+	int i;
+
+	for (i = 0; i < RESYNC_PAGES; i++)
+		put_page(rp->pages[i]);
+}
+
+static inline void resync_get_all_pages(struct resync_pages *rp)
+{
+	int i;
+
+	for (i = 0; i < RESYNC_PAGES; i++)
+		get_page(rp->pages[i]);
+}
+
+static inline struct page *resync_fetch_page(struct resync_pages *rp,
+					     unsigned idx)
+{
+	if (WARN_ON_ONCE(idx >= RESYNC_PAGES))
+		return NULL;
+	return rp->pages[idx];
+}
+
 #endif /* _MD_MD_H */
--
cgit v1.2.3
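To show how the helpers above fit together, here is a hypothetical
caller (an illustration only -- setup_resync_bio and its bio wiring are
assumptions, not part of this patch):

	/* Hypothetical example: back a resync bio with a resync_pages set. */
	static int setup_resync_bio(struct bio *bio, struct resync_pages *rp,
				    void *raid_bio, gfp_t gfp)
	{
		unsigned i;

		if (resync_alloc_pages(rp, gfp))
			return -ENOMEM;
		rp->raid_bio = raid_bio;	/* remember the owning request */

		for (i = 0; i < RESYNC_PAGES; i++) {
			struct page *page = resync_fetch_page(rp, i);

			/* attach each pool page as one PAGE_SIZE segment;
			 * error handling is simplified for the sketch */
			if (!bio_add_page(bio, page, PAGE_SIZE, 0)) {
				resync_free_pages(rp);
				return -EIO;
			}
		}
		return 0;
	}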