summaryrefslogtreecommitdiffstats
path: root/drivers/md/md-cluster.c
Commit message (Collapse)AuthorAgeFilesLines
* md-cluster: check for timeout while a new disk addingDenis Plotnikov2023-10-121-4/+11
| | | | | | | | | | | A new disk adding may end up with timeout and a new disk won't be added. Add returning the error in that case. Found by Linux Verification Center (linuxtesting.org) with SVACE Signed-off-by: Denis Plotnikov <den-plotnikov@yandex-team.ru> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230925125940.1542506-1-den-plotnikov@yandex-team.ru
* md: Hold mddev->reconfig_mutex when trying to get mddev->sync_threadLi Lingfeng2023-08-151-4/+4
| | | | | | | | | | | | Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in action_store"") removed the scenario of calling md_unregister_thread() without holding mddev->reconfig_mutex, so add a lock holding check before acquiring mddev->sync_thread by passing mdev to md_unregister_thread(). Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
* md: protect md_thread with rcuYu Kuai2023-06-131-6/+11
| | | | | | | | | | | | | | | | | | | | | Currently, there are many places that md_thread can be accessed without protection, following are known scenarios that can cause null-ptr-dereference or uaf: 1) sync_thread that is allocated and started from md_start_sync() 2) mddev->thread can be accessed directly from timeout_store() and md_bitmap_daemon_work() 3) md_unregister_thread() from action_store(). Currently, a global spinlock 'pers_lock' is borrowed to protect 'mddev->thread' in some places, this problem can be fixed likewise, however, use a global lock for all the cases is not good. Fix this problem by protecting all md_thread with rcu. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
* fs: dlm: remove DLM_LSFL_FS from uapiAlexander Aring2022-08-231-2/+2
| | | | | | | | | | | | The DLM_LSFL_FS flag is set in lockspaces created directly for a kernel user, as opposed to those lockspaces created for user space applications. The user space libdlm allowed this flag to be set for lockspaces created from user space, but then used by a kernel user. No kernel user has ever used this method, so remove the ability to do it. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
* md: Fix spelling mistake in commentsZhang Jiaming2022-08-021-2/+2
| | | | | | | | There are 2 spelling mistakes in comments. Fix it. Signed-off-by: Zhang Jiaming <jiaming@nfschina.com> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md: replace deprecated strlcpy & remove duplicated lineHeming Zhao2022-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit includes two topics: 1> replace deprecated strlcpy change strlcpy to strscpy for strlcpy is marked as deprecated in Documentation/process/deprecated.rst 2> remove duplicated strlcpy line in md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy(), the history: - commit cf921cc19cf7 ("Add node recovery callbacks") introduced the first usage of strlcpy(). - commit b97e92574c0b ("Use separate bitmaps for each nodes in the cluster") introduced the second strlcpy(). this time, the two strlcpy() are same, we can remove anyone safely. - commit d3b178adb3a3 ("md: Skip cluster setup for dm-raid") added dm-raid special handling. And the "nodes" value is the key of this patch. but from this patch, strlcpy() which was introduced by b97e92574c0bf become necessary. - commit 3c462c880b52 ("md: Increment version for clustered bitmaps") used clustered major version to only handle in clustered env. this patch could look a polishment for clustered code logic. So cf921cc19cf7 became useless after d3b178adb3a3a, we could remove it safely. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>
* md: fix spelling of "its"Randy Dunlap2022-01-061-1/+1
| | | | | | | | | | Use the possessive "its" instead of the contraction "it's" in printed messages. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Song Liu <song@kernel.org> Cc: linux-raid@vger.kernel.org Signed-off-by: Song Liu <song@kernel.org>
* Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-blockLinus Torvalds2020-12-161-29/+38
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "Nothing major in here: - NVMe pull request from Christoph: - nvmet passthrough improvements (Chaitanya Kulkarni) - fcloop error injection support (James Smart) - read-only support for zoned namespaces without Zone Append (Javier González) - improve some error message (Minwoo Im) - reject I/O to offline fabrics namespaces (Victor Gladkov) - PCI queue allocation cleanups (Niklas Schnelle) - remove an unused allocation in nvmet (Amit Engel) - a Kconfig spelling fix (Colin Ian King) - nvme_req_qid simplication (Baolin Wang) - MD pull request from Song: - Fix race condition in md_ioctl() (Dae R. Jeong) - Initialize read_slot properly for raid10 (Kevin Vigor) - Code cleanup (Pankaj Gupta) - md-cluster resync/reshape fix (Zhao Heming) - Move null_blk into its own directory (Damien Le Moal) - null_blk zone and discard improvements (Damien Le Moal) - bcache race fix (Dongsheng Yang) - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang, Lutz Pogrell, Md Haris Iqbal) - lightnvm NULL pointer deref fix (tangzhenhao) - sr in_interrupt() removal (Sebastian Andrzej Siewior) - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included as it made it easier for them to funnel the feature through the block driver tree. - Follow up fixes (Colin Ian King)" * tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits) block: drop dead assignments in loop_init() sr: Remove in_interrupt() usage in sr_init_command(). sr: Switch the sector size back to 2048 if sr_read_sector() changed it. cdrom: Reset sector_size back it is not 2048. drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c null_blk: Move driver into its own directory null_blk: Allow controlling max_hw_sectors limit null_blk: discard zones on reset null_blk: cleanup discard handling null_blk: Improve implicit zone close null_blk: improve zone locking block: Align max_hw_sectors to logical blocksize null_blk: Fail zone append to conventional zones null_blk: Fix zone size initialization bcache: fix race between setting bdev state to none and new write request direct to backing block/rnbd: fix a null pointer dereference on dev->blk_symlink_name block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name block/rnbd: call kobject_put in the failure path Documentation/ABI/rnbd-srv: add document for force_close block/rnbd-srv: close a mapped device from server side. ...
| * md/cluster: fix deadlock when node is doing resync jobZhao Heming2020-11-301-29/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg. During sending msg, node can concurrently receive msg from another node. When node does resync job, grab token_lockres:EX may trigger a deadlock: ``` nodeA nodeB -------------------- -------------------- a. send METADATA_UPDATED held token_lockres:EX b. md_do_sync resync_info_update send RESYNCING + set MD_CLUSTER_SEND_LOCK + wait for holding token_lockres:EX c. mdadm /dev/md0 --remove /dev/sdg + held reconfig_mutex + send REMOVE + wait_event(MD_CLUSTER_SEND_LOCK) d. recv_daemon //METADATA_UPDATED from A process_metadata_update + (mddev_trylock(mddev) || MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD) //this time, both return false forever ``` Explaination: a. A send METADATA_UPDATED This will block another node to send msg b. B does sync jobs, which will send RESYNCING at intervals. This will be block for holding token_lockres:EX lock. c. B do "mdadm --remove", which will send REMOVE. This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1. d. B recv METADATA_UPDATED msg, which send from A in step <a>. This will be blocked by step <c>: holding mddev lock, it makes wait_event can't hold mddev lock. (btw, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.) There is a similar deadlock in commit 0ba959774e93 ("md-cluster: use sync way to handle METADATA_UPDATED msg") In that commit, step c is "update sb". This patch step c is "mdadm --remove". For fixing this issue, we can refer the solution of function: metadata_update_start. Which does the same grab lock_token action. lock_comm can use the same steps to avoid deadlock. By moving MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm. It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, but it is safe & can break deadlock. Repro steps (I only triggered 3 times with hundreds tests): two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB. ``` ssh root@node2 "mdadm -S --scan" mdadm -S --scan for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \ count=20; done mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \ --bitmap-chunk=1M ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh" sleep 5 mkfs.xfs /dev/md0 mdadm --manage --add /dev/md0 /dev/sdi mdadm --wait /dev/md0 mdadm --grow --raid-devices=3 /dev/md0 mdadm /dev/md0 --fail /dev/sdg mdadm /dev/md0 --remove /dev/sdg mdadm --grow --raid-devices=2 /dev/md0 ``` test script will hung when executing "mdadm --remove". ``` # dump stacks by "echo t > /proc/sysrq-trigger" md0_cluster_rec D 0 5329 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? _cond_resched+0x2d/0x40 ? schedule+0x4a/0xb0 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster] ? wait_woken+0x80/0x80 ? process_recvd_msg+0x113/0x1d0 [md_cluster] ? recv_daemon+0x9e/0x120 [md_cluster] ? md_thread+0x94/0x160 [md_mod] ? wait_woken+0x80/0x80 ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 mdadm D 0 5423 1 0x00004004 Call Trace: __schedule+0x1f6/0x560 ? __schedule+0x1fe/0x560 ? schedule+0x4a/0xb0 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster] ? wait_woken+0x80/0x80 ? remove_disk+0x4f/0x90 [md_cluster] ? hot_remove_disk+0xb1/0x1b0 [md_mod] ? md_ioctl+0x50c/0xba0 [md_mod] ? wait_woken+0x80/0x80 ? blkdev_ioctl+0xa2/0x2a0 ? block_ioctl+0x39/0x40 ? ksys_ioctl+0x82/0xc0 ? __x64_sys_ioctl+0x16/0x20 ? do_syscall_64+0x5f/0x150 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 md0_resync D 0 5425 2 0x80004000 Call Trace: __schedule+0x1f6/0x560 ? schedule+0x4a/0xb0 ? dlm_lock_sync+0xa1/0xd0 [md_cluster] ? wait_woken+0x80/0x80 ? lock_token+0x2d/0x90 [md_cluster] ? resync_info_update+0x95/0x100 [md_cluster] ? raid1_sync_request+0x7d3/0xa40 [raid1] ? md_do_sync.cold+0x737/0xc8f [md_mod] ? md_thread+0x94/0x160 [md_mod] ? md_congested+0x30/0x30 [md_mod] ? kthread+0x115/0x140 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x40 ``` At last, thanks for Xiao's solution. Cc: stable@vger.kernel.org Signed-off-by: Zhao Heming <heming.zhao@suse.com> Suggested-by: Xiao Ni <xni@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* | md: remove a spurious call to revalidate_disk_size in update_sizeChristoph Hellwig2020-11-161-2/+0
| | | | | | | | | | | | | | | | | | None of the ->resize methods updates the disk size, so calling revalidate_disk_size here won't do anything. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | md: use set_capacity_and_notifyChristoph Hellwig2020-11-161-4/+2
|/ | | | | | | | | Use set_capacity_and_notify to set the size of both the disk and block device. This also gets the uevent notifications for the resize for free. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md/bitmap: fix memory leak of temporary bitmapZhao Heming2020-10-081-0/+1
| | | | | | | | Callers of get_bitmap_from_slot() are responsible to free the bitmap. Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* block: add a new revalidate_disk_size helperChristoph Hellwig2020-09-021-3/+3
| | | | | | | | | | | | | | | | | | revalidate_disk is a relative awkward helper for driver use, as it first calls an optional driver method and then updates the block device size, while most callers either don't need the method call at all, or want to keep state between the caller and the called method. Add a revalidate_disk_size helper that just performs the update of the block device size from the gendisk one, and switch all drivers that do not implement ->revalidate_disk to use the new helper instead of revalidate_disk() Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md-cluster: Fix potential error pointer dereference in resize_bitmaps()Dan Carpenter2020-08-051-0/+1
| | | | | | | | | | | The error handling calls md_bitmap_free(bitmap) which checks for NULL but will Oops if we pass an error pointer. Let's set "bitmap" to NULL on this error path. Fixes: afd756286083 ("md-cluster/raid10: resize all the bitmaps before start reshape") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md-cluster: fix wild pointer of unlock_all_bitmaps()Zhao Heming2020-07-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | reproduction steps: ``` node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb node1 # mdadm -G /dev/md0 -b none mdadm: failed to remove clustered bitmap. node1 # mdadm -S --scan ^C <==== mdadm hung & kernel crash ``` kernel stack: ``` [ 335.230657] general protection fault: 0000 [#1] SMP NOPTI [...] [ 335.230848] Call Trace: [ 335.230873] ? unlock_all_bitmaps+0x5/0x70 [md_cluster] [ 335.230886] unlock_all_bitmaps+0x3d/0x70 [md_cluster] [ 335.230899] leave+0x10f/0x190 [md_cluster] [ 335.230932] ? md_super_wait+0x93/0xa0 [md_mod] [ 335.230947] ? leave+0x5/0x190 [md_cluster] [ 335.230973] md_cluster_stop+0x1a/0x30 [md_mod] [ 335.230999] md_bitmap_free+0x142/0x150 [md_mod] [ 335.231013] ? _cond_resched+0x15/0x40 [ 335.231025] ? mutex_lock+0xe/0x30 [ 335.231056] __md_stop+0x1c/0xa0 [md_mod] [ 335.231083] do_md_stop+0x160/0x580 [md_mod] [ 335.231119] ? 0xffffffffc05fb078 [ 335.231148] md_ioctl+0xa04/0x1930 [md_mod] [ 335.231165] ? filename_lookup+0xf2/0x190 [ 335.231179] blkdev_ioctl+0x93c/0xa10 [ 335.231205] ? _cond_resched+0x15/0x40 [ 335.231214] ? __check_object_size+0xd4/0x1a0 [ 335.231224] block_ioctl+0x39/0x40 [ 335.231243] do_vfs_ioctl+0xa0/0x680 [ 335.231253] ksys_ioctl+0x70/0x80 [ 335.231261] __x64_sys_ioctl+0x16/0x20 [ 335.231271] do_syscall_64+0x65/0x1f0 [ 335.231278] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 45Thomas Gleixner2019-05-241-6/+1
| | | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 11 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190520170858.370933192@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md-cluster: remove suspend_infoGuoqing Jiang2018-10-181-71/+32
| | | | | | | | | | | | | Previously, we allow multiple nodes can resync device, but we had changed it to only support one node can do resync at one time, but suspend_info is still used. Now, let's remove the structure and use suspend_lo/hi to record the range. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interruptedGuoqing Jiang2018-10-181-3/+16
| | | | | | | | | | | | | | We need to continue the reshaping if it was interrupted in original node. So original node should call resync_bitmap in case reshaping is aborted. Then BITMAP_NEEDS_SYNC message is broadcasted to other nodes, node which continues the reshaping should restart reshape from mddev->reshape_position instead of from the first beginning. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stageGuoqing Jiang2018-10-181-7/+20
| | | | | | | | | | | | | When reshape is happening in one node, other nodes could receive lots of RESYNCING messages, so md_bitmap_sync_with_cluster is called. Since the resyncing window is typically small in these RESYNCING messages, so WARN is always triggered, so we should not call the func when reshape is happening. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: introduce resync_info_get interface for sanity checkGuoqing Jiang2018-10-181-0/+14
| | | | | | | | | | Since the resync region from suspend_info means one node is reshaping this area, so the position of reshape_progress should be included in the area. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster/raid10: resize all the bitmaps before start reshapeGuoqing Jiang2018-10-181-0/+81
| | | | | | | | | | | | | | | | | | | | | To support add disk under grow mode, we need to resize all the bitmaps of each node before reshape, so that we can ensure all nodes have the same view of the bitmap of the clustered raid. So after the master node resized the bitmap, it broadcast a message to other slave nodes, and it checks the size of each bitmap are same or not by compare pages. We can only continue the reshaping after all nodes update the bitmap to the same size (by checking the pages), otherwise revert bitmap size to previous value. The resize_bitmaps interface and BITMAP_RESIZE message are introduced in md-cluster.c for the purpose. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: release RESYNC lock after the last resync messageGuoqing Jiang2018-08-311-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All the RESYNC messages are sent with resync lock held, the only exception is resync_finish which releases resync_lockres before send the last resync message, this should be changed as well. Otherwise, we can see deadlock issue as follows: clustermd2-gqjiang2:~ # cat /proc/mdstat Personalities : [raid10] [raid1] md0 : active raid1 sdg[0] sdf[1] 134144 blocks super 1.2 [2/2] [UU] [===================>.] resync = 99.6% (134144/134144) finish=0.0min speed=26K/sec bitmap: 1/1 pages [4KB], 65536KB chunk unused devices: <none> clustermd2-gqjiang2:~ # ps aux|grep md|grep D root 20497 0.0 0.0 0 0 ? D 16:00 0:00 [md0_raid1] clustermd2-gqjiang2:~ # cat /proc/20497/stack [<ffffffffc05ff51e>] dlm_lock_sync+0x8e/0xc0 [md_cluster] [<ffffffffc05ff7e8>] __sendmsg+0x98/0x130 [md_cluster] [<ffffffffc05ff900>] sendmsg+0x20/0x30 [md_cluster] [<ffffffffc05ffc35>] resync_info_update+0xb5/0xc0 [md_cluster] [<ffffffffc0593e84>] md_reap_sync_thread+0x134/0x170 [md_mod] [<ffffffffc059514c>] md_check_recovery+0x28c/0x510 [md_mod] [<ffffffffc060c882>] raid1d+0x42/0x800 [raid1] [<ffffffffc058ab61>] md_thread+0x121/0x150 [md_mod] [<ffffffff9a0a5b3f>] kthread+0xff/0x140 [<ffffffff9a800235>] ret_from_fork+0x35/0x40 [<ffffffffffffffff>] 0xffffffffffffffff clustermd-gqjiang1:~ # ps aux|grep md|grep D root 20531 0.0 0.0 0 0 ? D 16:00 0:00 [md0_raid1] root 20537 0.0 0.0 0 0 ? D 16:00 0:00 [md0_cluster_rec] root 20676 0.0 0.0 0 0 ? D 16:01 0:00 [md0_resync] clustermd-gqjiang1:~ # cat /proc/mdstat Personalities : [raid10] [raid1] md0 : active raid1 sdf[1] sdg[0] 134144 blocks super 1.2 [2/2] [UU] [===================>.] resync = 97.3% (131072/134144) finish=8076.8min speed=0K/sec bitmap: 1/1 pages [4KB], 65536KB chunk unused devices: <none> clustermd-gqjiang1:~ # cat /proc/20531/stack [<ffffffffc080974d>] metadata_update_start+0xcd/0xd0 [md_cluster] [<ffffffffc079c897>] md_update_sb.part.61+0x97/0x820 [md_mod] [<ffffffffc079f15b>] md_check_recovery+0x29b/0x510 [md_mod] [<ffffffffc0816882>] raid1d+0x42/0x800 [raid1] [<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod] [<ffffffff9e0a5b3f>] kthread+0xff/0x140 [<ffffffff9e800235>] ret_from_fork+0x35/0x40 [<ffffffffffffffff>] 0xffffffffffffffff clustermd-gqjiang1:~ # cat /proc/20537/stack [<ffffffffc0813222>] freeze_array+0xf2/0x140 [raid1] [<ffffffffc080a56e>] recv_daemon+0x41e/0x580 [md_cluster] [<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod] [<ffffffff9e0a5b3f>] kthread+0xff/0x140 [<ffffffff9e800235>] ret_from_fork+0x35/0x40 [<ffffffffffffffff>] 0xffffffffffffffff clustermd-gqjiang1:~ # cat /proc/20676/stack [<ffffffffc080951e>] dlm_lock_sync+0x8e/0xc0 [md_cluster] [<ffffffffc080957f>] lock_token+0x2f/0xa0 [md_cluster] [<ffffffffc0809622>] lock_comm+0x32/0x90 [md_cluster] [<ffffffffc08098f5>] sendmsg+0x15/0x30 [md_cluster] [<ffffffffc0809c0a>] resync_info_update+0x8a/0xc0 [md_cluster] [<ffffffffc08130ba>] raid1_sync_request+0xa9a/0xb10 [raid1] [<ffffffffc079b8ea>] md_do_sync+0xbaa/0xf90 [md_mod] [<ffffffffc0794b61>] md_thread+0x121/0x150 [md_mod] [<ffffffff9e0a5b3f>] kthread+0xff/0x140 [<ffffffff9e800235>] ret_from_fork+0x35/0x40 [<ffffffffffffffff>] 0xffffffffffffffff Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2018-08-181-10/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: - a new driver for Rohm BU21029 touch controller - new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free - updates to Atmel, eeti. pxrc and iforce drivers - assorted driver cleanups and fixes. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits) MAINTAINERS: Add PhoenixRC Flight Controller Adapter Input: do not use WARN() in input_alloc_absinfo() Input: mark expected switch fall-throughs Input: raydium_i2c_ts - use true and false for boolean values Input: evdev - switch to bitmap API Input: gpio-keys - switch to bitmap_zalloc() Input: elan_i2c_smbus - cast sizeof to int for comparison bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free() md: Avoid namespace collision with bitmap API dm: Avoid namespace collision with bitmap API Input: pm8941-pwrkey - add resin entry Input: pm8941-pwrkey - abstract register offsets and event code Input: iforce - reorganize joystick configuration lists Input: atmel_mxt_ts - move completion to after config crc is updated Input: atmel_mxt_ts - don't report zero pressure from T9 Input: atmel_mxt_ts - zero terminate config firmware file Input: atmel_mxt_ts - refactor config update code to add context struct Input: atmel_mxt_ts - config CRC may start at T71 Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE ...
| * md: Avoid namespace collision with bitmap APIAndy Shevchenko2018-08-011-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods. On the other hand MD bitmap API is special case. Adding 'md' prefix to it to avoid name space collision. No functional changes intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Shaohua Li <shli@kernel.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
* | md-cluster: don't send msg if array is closingGuoqing Jiang2018-07-051-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | If we close an array which resync thread is running, then we don't need the node to send msg since another node would launch the resync thread to continue the rest works. Also send a message is time consuming, we should avoid it. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | md-cluster: show array's status more accurateGuoqing Jiang2018-07-051-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When resync or recovery is happening in one node, other nodes don't show the appropriate info now. For example, when create an array in master node without "--assume-clean", then assemble the array in slave nodes, you can see "resync=PENDING" when read /proc/mdstat in slave nodes. However, the info is confusing since "PENDING" status is introduced for start array in read-only mode. We introduce RESYNCING_REMOTE flag to indicate that resync thread is running in remote node. The flags is set when node receive RESYNCING msg. And we clear the REMOTE flag in following cases: 1. resync or recover is finished in master node, which means slaves receive msg with both lo and hi are set to 0. 2. node continues resync/recovery in recover_bitmaps. 3. when resync_finish is called. Then we show accurate information in status_resync by check REMOTE flags and with other conditions. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | md-cluster: clear another node's suspend_area after the copy is finishedGuoqing Jiang2018-07-051-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When one node leaves cluster or stops the resyncing (resync or recovery) array, then other nodes need to call recover_bitmaps to continue the unfinished task. But we need to clear suspend_area later after other nodes copy the resync information to their bitmap (by call bitmap_copy_from_slot). Otherwise, all nodes could write to the suspend_area even the suspend_area is not handled by any node, because area_resyncing returns 0 at the beginning of raid1_write_request. Which means one node could write suspend_area while another node is resyncing the same area, then data could be inconsistent. So let's clear suspend_area later to avoid above issue with the protection of bm lock. Also it is straightforward to clear suspend_area after nodes have copied the resync info to bitmap. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* | treewide: kzalloc() -> kcalloc()Kees Cook2018-06-121-3/+3
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
* md-cluster: update document for raid10Guoqing Jiang2017-11-011-1/+1
| | | | | Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: remove special meaning of ->quiesce(.., 2)NeilBrown2017-11-011-3/+3
| | | | | | | | | | | | | | | | | | | | | The '2' argument means "wake up anything that is waiting". This is an inelegant part of the design and was added to help support management of suspend_lo/suspend_hi setting. Now that suspend_lo/hi is managed in mddev_suspend/resume, that need is gone. These is still a couple of places where we call 'quiesce' with an argument of '2', but they can safely be changed to call ->quiesce(.., 1); ->quiesce(.., 0) which achieve the same result at the small cost of pausing IO briefly. This removes a small "optimization" from suspend_{hi,lo}_store, but it isn't clear that optimization served a useful purpose. The code now is a lot clearer. Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md: rename some drivers/md/ files to have an "md-" prefixMike Snitzer2017-10-161-1/+1
| | | | | | | | | | | | | | | Motivated by the desire to illiminate the imprecise nature of DM-specific patches being unnecessarily sent to both the MD maintainer and mailing-list. Which is born out of the fact that DM files also reside in drivers/md/ Now all MD-specific files in drivers/md/ start with either "raid" or "md-" and the MAINTAINERS file has been updated accordingly. Shaohua: don't change module name Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: make function cluster_check_sync_size staticColin Ian King2017-10-161-1/+1
| | | | | | | | | | | The function cluster_check_sync_size is local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: symbol 'cluster_check_sync_size' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: fix potential lock issue in add_new_diskGuoqing Jiang2017-05-211-1/+3
| | | | | | | | | | | The add_new_disk returns with communication locked if __sendmsg returns failure, fix it with call unlock_comm before return. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> CC: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: Fix a memleak in an error handling pathChristophe JAILLET2017-04-141-1/+1
| | | | | | | | | | | | We know that 'bm_lockres' is NULL here, so 'lockres_free(bm_lockres)' is a no-op. According to resource handling in case of error a few lines below, it is likely that 'bitmap_free(bitmap)' was expected instead. Fixes: b98938d16a10 ("md-cluster: introduce cluster_check_sync_size") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: add the support for resizeGuoqing Jiang2017-03-161-0/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | To update size for cluster raid, we need to make sure all nodes can perform the change successfully. However, it is possible that some of them can't do it due to failure (bitmap_resize could fail). So we need to consider the issue before we set the capacity unconditionally, and we use below steps to perform sanity check. 1. A change the size, then broadcast METADATA_UPDATED msg. 2. B and C receive METADATA_UPDATED change the size excepts call set_capacity, sync_size is not update if the change failed. Also call bitmap_update_sb to sync sb to disk. 3. A checks other node's sync_size, if sync_size has been updated in all nodes, then send CHANGE_CAPACITY msg otherwise send msg to revert previous change. 4. B and C call set_capacity if receive CHANGE_CAPACITY msg, otherwise pers->resize will be called to restore the old value. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: introduce cluster_check_sync_sizeGuoqing Jiang2017-03-161-0/+60
| | | | | | | | | | | | | | | | | | | | | Support resize is a little complex for clustered raid, since we need to ensure all the nodes share the same knowledge about the size of raid. We achieve the goal by check the sync_size which is in each node's bitmap, we can only change the capacity after cluster_check_sync_size returns 0. Also, get_bitmap_from_slot is added to get a slot's bitmap. And we exported some funcs since they are used in cluster_check_sync_size(). We can also reuse get_bitmap_from_slot to remove redundant code existed in bitmap_copy_from_slot. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: add CHANGE_CAPACITY message typeGuoqing Jiang2017-03-161-0/+5
| | | | | | | | | | | | The msg type CHANGE_CAPACITY is introduced to support resize clustered raid in later patch, and it is sent after all the nodes have the same sync_size, receiver node just need to set new capacity once received this msg. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: use sync way to handle METADATA_UPDATED msgGuoqing Jiang2017-03-161-16/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, when node received METADATA_UPDATED msg, it just need to wakeup mddev->thread, then md_reload_sb will be called eventually. We taken the asynchronous way to avoid a deadlock issue, the deadlock issue could happen when one node is receiving the METADATA_UPDATED msg (wants reconfig_mutex) and trying to run the path: md_check_recovery -> mddev_trylock(hold reconfig_mutex) -> md_update_sb-metadata_update_start (want EX on token however token is got by the sending node) Since we will support resizing for clustered raid, and we need the metadata update handling to be synchronous so that the initiating node can detect failure, so we need to change the way for handling METADATA_UPDATED msg. But, we obviously need to avoid above deadlock with the sync way. To make this happen, we considered to not hold reconfig_mutex to call md_reload_sb, if some other thread has already taken reconfig_mutex and waiting for the 'token', then process_recvd_msg() can safely call md_reload_sb() without taking the mutex. This is because we can be certain that no other thread will take the mutex, and we also certain that the actions performed by md_reload_sb() won't interfere with anything that the other thread is in the middle of. To make this more concrete, we added a new cinfo->state bit MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD Which is set in lock_token() just before dlm_lock_sync() is called, and cleared just after. As lock_token() is always called with reconfig_mutex() held (the specific case is the resync_info_update which is distinguished well in previous patch), if process_recvd_msg() finds that the new bit is set, then the mutex must be held by some other thread, and it will keep waiting. So process_metadata_update() can call md_reload_sb() if either mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is set. The tricky bit is what to do if neither of these apply. We need to wait. Fortunately mddev_unlock() always calls wake_up() on mddev->thread->wqueue. So we can get lock_token() to call wake_up() on that when it sets the bit. There are also some related changes inside this commit: 1. remove RELOAD_SB related codes since there are not valid anymore. 2. mddev is added into md_cluster_info then we can get mddev inside lock_token. 3. add new parameter for lock_token to distinguish reconfig_mutex is held or not. And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below: 1. set it before unregister thread, otherwise a deadlock could appear if stop a resyncing array. This is because md_unregister_thread(&cinfo->recv_thread) is blocked by recv_daemon -> process_recvd_msg -> process_metadata_update. To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is also need to be set before unregister thread. 2. set it in metadata_update_start to fix another deadlock. a. Node A sends METADATA_UPDATED msg (held Token lock). b. Node B wants to do resync, and is blocked since it can't get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is not set since the callchain (md_do_sync -> sync_request -> resync_info_update -> sendmsg -> lock_comm -> lock_token) doesn't hold reconfig_mutex. c. Node B trys to update sb (held reconfig_mutex), but stopped at wait_event() in metadata_update_start since we have set MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2). d. Then Node B receives METADATA_UPDATED msg from A, of course recv_daemon is blocked forever. Since metadata_update_start always calls lock_token with reconfig_mutex, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and lock_token don't need to set it twice unless lock_token is invoked from lock_comm. Finally, thanks to Neil for his great idea and help! Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: remove useless memset from gather_all_resync_infoGuoqing Jiang2017-03-091-1/+0
| | | | | | | | | | This memset is not needed. The lvb is already zeroed because it was recently allocated by lockres_init, which uses kzalloc(), and read_resync_info() doesn't need it to be zero anyway. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: free md_cluster_info if node leave clusterGuoqing Jiang2017-03-091-0/+1
| | | | | | | | | To avoid memory leak, we need to free the cinfo which is allocated when node join cluster. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: make resync lock also could be interrupttedGuoqing Jiang2016-09-211-2/+3
| | | | | | | | | | | | | | | When one node is perform resync or recovery, other nodes can't get resync lock and could block for a while before it holds the lock, so we can't stop array immediately for this scenario. To make array could be stop quickly, we check MD_CLOSING in dlm_lock_sync_interruptible to make us can interrupt the lock request. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hangGuoqing Jiang2016-09-211-1/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When some node leaves cluster, then it's bitmap need to be synced by another node, so "md*_recover" thread is triggered for the purpose. However, with below steps. we can find tasks hang happened either in B or C. 1. Node A create a resyncing cluster raid1, assemble it in other two nodes (B and C). 2. stop array in B and C. 3. stop array in A. linux44:~ # ps aux|grep md|grep D root 5938 0.0 0.1 19852 1964 pts/0 D+ 14:52 0:00 mdadm -S md0 root 5939 0.0 0.0 0 0 ? D 14:52 0:00 [md0_recover] linux44:~ # cat /proc/5939/stack [<ffffffffa04cf321>] dlm_lock_sync+0x71/0x90 [md_cluster] [<ffffffffa04d0705>] recover_bitmaps+0x125/0x220 [md_cluster] [<ffffffffa052105d>] md_thread+0x16d/0x180 [md_mod] [<ffffffff8107ad94>] kthread+0xb4/0xc0 [<ffffffff8152a518>] ret_from_fork+0x58/0x90 linux44:~ # cat /proc/5938/stack [<ffffffff8107afde>] kthread_stop+0x6e/0x120 [<ffffffffa0519da0>] md_unregister_thread+0x40/0x80 [md_mod] [<ffffffffa04cfd20>] leave+0x70/0x120 [md_cluster] [<ffffffffa0525e24>] md_cluster_stop+0x14/0x30 [md_mod] [<ffffffffa05269ab>] bitmap_free+0x14b/0x150 [md_mod] [<ffffffffa0523f3b>] do_md_stop+0x35b/0x5a0 [md_mod] [<ffffffffa0524e83>] md_ioctl+0x873/0x1590 [md_mod] [<ffffffff81288464>] blkdev_ioctl+0x214/0x7d0 [<ffffffff811dd3dd>] block_ioctl+0x3d/0x40 [<ffffffff811b92d4>] do_vfs_ioctl+0x2d4/0x4b0 [<ffffffff811b9538>] SyS_ioctl+0x88/0xa0 [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b The problem is caused by recover_bitmaps can't reliably abort when the thread is unregistered. So dlm_lock_sync_interruptible is introduced to detect the thread's situation to fix the problem. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: convert the completion to wait queueGuoqing Jiang2016-09-211-5/+9
| | | | | | | | | | | | Previously, we used completion to sync between require dlm lock and sync_ast, however we will have to expose completion.wait and completion.done in dlm_lock_sync_interruptible (introduced later), it is not a common usage for completion, so convert related things to wait queue. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: protect md_find_rdev_nr_rcu with rcu lockGuoqing Jiang2016-09-211-4/+8
| | | | | | | | | | We need to use rcu_read_lock/unlock to avoid potential race. Reported-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: remove some unnecessary dlm_unlock_syncGuoqing Jiang2016-09-211-5/+1
| | | | | | | | | | Since DLM_LKF_FORCEUNLOCK is used in lockres_free, we don't need to call dlm_unlock_sync before free lock resource. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: use FORCEUNLOCK in lockres_freeGuoqing Jiang2016-09-211-15/+11
| | | | | | | | | | | | For dlm_unlock, we need to pass flag to dlm_unlock as the third parameter instead of set res->flags. Also, DLM_LKF_FORCEUNLOCK is more suitable for dlm_unlock since it works even the lock is on waiting or convert queue. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: fix error return code in join()Wei Yongjun2016-08-241-3/+9
| | | | | | | | Fix to return error code -ENOMEM from the lockres_init() error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: check the return value of process_recvd_msgGuoqing Jiang2016-05-091-3/+10
| | | | | | | | | We don't need to run the full path of recv_daemon if process_recvd_msg doesn't return 0. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: gather resync infos and enable recv_thread after bitmap is readyGuoqing Jiang2016-05-091-6/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The in-memory bitmap is not ready when node joins cluster, so it doesn't make sense to make gather_all_resync_info() called so earlier, we need to call it after the node's bitmap is setup. Also, recv_thread could be wake up after node joins cluster, but it could cause problem if node receives RESYNCING message without persionality since mddev->pers->quiesce is called in process_suspend_info. This commit introduces a new cluster interface load_bitmaps to fix above problems, load_bitmaps is called in bitmap_load where bitmap and persionality are ready, and load_bitmaps does the following tasks: 1. call gather_all_resync_info to load all the node's bitmap info. 2. set MD_CLUSTER_ALREADY_IN_CLUSTER bit to recv_thread could be wake up, and wake up recv_thread if there is pending recv event. Then ack_bast only wakes up recv_thread after IN_CLUSTER bit is ready otherwise MD_CLUSTER_PENDING_RESYNC_EVENT is set. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
* md-cluster: sync bitmap when node received RESYNCING msgGuoqing Jiang2016-05-041-0/+27
| | | | | | | | | | | | | | | | If the node received RESYNCING message which means another node will perform resync with the area, then we don't want to do it again in another node. Let's set RESYNC_MASK and clear NEEDED_MASK for the region from old-low to new-low which has finished syncing, and the region from old-hi to new-hi is about to syncing, bitmap_sync_with_cluste is introduced for the purpose. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>