path: root/fs/btrfs/dev-replace.c
Commit message | Author | Age | Files | Lines
* btrfs: sysfs, rename device_link add/remove functions | Anand Jain | 2020-03-23 | 1 | -2/+2
  Since commit 668e48af7a94 ("btrfs: sysfs, add devid/dev_state kobject and device attributes"), the functions btrfs_sysfs_add_device_link() and btrfs_sysfs_rm_device_link() do more than just add and remove the device link, as their names suggest. Rename them to make clear that they operate on the directory with the attributes.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Add overview of device replace | Qu Wenruo | 2020-03-23 | 1 | -0/+40
  Add an overview of btrfs dev-replace. It mentions some corner cases caused by the write duplication and the scrub-based data copy.

  Reviewed-by: Anand Jain <anand.jain@oracle.com>
  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  [ adjust wording ]
  Signed-off-by: David Sterba <dsterba@suse.com>
* Merge tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds | 2020-01-28 | 1 | -0/+1
|\
  Pull btrfs updates from David Sterba:

  "Features, highlights:

   - async discard
       - "mount -o discard=async" to enable it
       - freed extents are not discarded immediately, but grouped together and trimmed later, with IO rate limiting
       - the "sync" mode submits short extents that could have been ignored completely by the device, for SATA prior to 3.1 the requests are unqueued and have a big impact on performance
       - the actual discard IO requests have been moved out of transaction commit to a worker thread, improving commit latency
       - IO rate and request size can be tuned by sysfs files, for now enabled only with CONFIG_BTRFS_DEBUG as we might need to add/delete the files and don't have a stable-ish ABI for general use, defaults are conservative

   - export device state info in sysfs, eg. missing, writeable

   - no discard of extents known to be untouched on disk (eg. after reservation)

   - device stats reset is logged with process name and PID that called the ioctl

  Fixes:

   - fix missing hole after hole punching and fsync when using NO_HOLES

   - writeback: range cyclic mode could miss some dirty pages and lead to OOM

   - two more corner cases for metadata_uuid change after power loss during the change

   - fix infinite loop during fsync after mix of rename operations

  Core changes:

   - qgroup assign returns ENOTCONN when quotas not enabled, used to return EINVAL that was confusing

   - device closing does not need to allocate memory anymore

   - snapshot aware code got removed, disabled for years due to performance problems, reimplementation will allow to select whether defrag breaks or does not break COW on shared extents

   - tree-checker:
       - check leaf chunk item size, cross check against number of stripes
       - verify location keys for DIR_ITEM, DIR_INDEX and XATTR items

   - new self test for physical -> logical mapping code, used for super block range exclusion

   - assertion helpers/macros updated to avoid objtool "unreachable code" reports on older compilers or config option combinations"

  * tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (84 commits)
    btrfs: free block groups after free'ing fs trees
    btrfs: Fix split-brain handling when changing FSID to metadata uuid
    btrfs: Handle another split brain scenario with metadata uuid feature
    btrfs: Factor out metadata_uuid code from find_fsid.
    btrfs: Call find_fsid from find_fsid_inprogress
    Btrfs: fix infinite loop during fsync after rename operations
    btrfs: set trans->drity in btrfs_commit_transaction
    btrfs: drop log root for dropped roots
    btrfs: sysfs, add devid/dev_state kobject and device attributes
    btrfs: Refactor btrfs_rmap_block to improve readability
    btrfs: Add self-tests for btrfs_rmap_block
    btrfs: selftests: Add support for dummy devices
    btrfs: Move and unexport btrfs_rmap_block
    btrfs: separate definition of assertion failure handlers
    btrfs: device stats, log when stats are zeroed
    btrfs: fix improper setting of scanned for range cyclic write cache pages
    btrfs: safely advance counter when looking up bio csums
    btrfs: remove unused member btrfs_device::work
    btrfs: remove unnecessary wrapper get_alloc_profile
    btrfs: add correction to handle -1 edge case in async discard
    ...
| * btrfs: sysfs, add devid/dev_state kobject and device attributes | Anand Jain | 2020-01-23 | 1 | -0/+1
    New sysfs attributes that track the filesystem status of devices, stored in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's a directory for each device, with name corresponding to the numerical device id.

      in_fs_metadata - device is in the list of fs metadata
      missing        - device is missing (no device node or block device)
      replace_target - device is target of replace
      writeable      - writes from fs are allowed

    These attributes reflect the state of device::dev_state and are created at mount time.

    Sample output:
      $ pwd
      /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
      $ ls
      in_fs_metadata missing replace_target writeable
      $ cat missing
      0

    The output of each attribute is 0 or 1: 0 indicates unset and 1 indicates set. These attributes are read-only.

    It is observed that the device delete thread and sysfs read thread will not race because the delete thread calls sysfs kobject_put() which in turn waits for existing sysfs reads to complete.

    Note for device replace devid swap: During the replace the target device temporarily assumes devid 0 before being assigned the devid of the source device.

    In btrfs_dev_replace_finishing() we remove the source sysfs devid using the function btrfs_sysfs_remove_devices_attr(), so after that call kobject_rename() to update the devid in sysfs. This adds and calls the btrfs_sysfs_update_devid() helper function to update the device id.

    Signed-off-by: Anand Jain <anand.jain@oracle.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    [ update changelog ]
    Signed-off-by: David Sterba <dsterba@suse.com>
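The attributes described above are plain text files holding 0 or 1, so they can be read from user space with ordinary file I/O. The following is a minimal sketch under that assumption; `read_devinfo_flag` is a hypothetical helper name used only for illustration and is not part of the kernel or btrfs-progs.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read a devinfo attribute such as
 * /sys/fs/btrfs/<FSID>/devinfo/<devid>/missing.
 * Returns 0 or 1 as reported by the file, or -1 if the file cannot
 * be opened or does not contain 0 or 1.
 */
int read_devinfo_flag(const char *path)
{
	FILE *f = fopen(path, "r");
	int value;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1 || (value != 0 && value != 1))
		value = -1;
	fclose(f);
	return value;
}
```

A caller would pass the full path of one attribute, for example the `missing` file of devid 1, and treat -1 as "state unknown" rather than as unset.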
* | btrfs: dev-replace: remove warning for unknown return codes when finished | David Sterba | 2020-01-25 | 1 | -4/+1
|/
  The fstests btrfs/011 triggered a warning at the end of device replace,

    [ 1891.998975] BTRFS warning (device vdd): failed setting block group ro: -28
    [ 1892.038338] BTRFS error (device vdd): btrfs_scrub_dev(/dev/vdd, 1, /dev/vdb) failed -28
    [ 1892.059993] ------------[ cut here ]------------
    [ 1892.063032] WARNING: CPU: 2 PID: 2244 at fs/btrfs/dev-replace.c:506 btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
    [ 1892.074346] CPU: 2 PID: 2244 Comm: btrfs Not tainted 5.5.0-rc7-default+ #942
    [ 1892.079956] RIP: 0010:btrfs_dev_replace_start.cold+0xf9/0x140 [btrfs]
    [ 1892.096576] RSP: 0018:ffffbb58c7b3fd10 EFLAGS: 00010286
    [ 1892.098311] RAX: 00000000ffffffe4 RBX: 0000000000000001 RCX: 8888888888888889
    [ 1892.100342] RDX: 0000000000000001 RSI: ffff9e889645f5d8 RDI: ffffffff92821080
    [ 1892.102291] RBP: ffff9e889645c000 R08: 000001b8878fe1f6 R09: 0000000000000000
    [ 1892.104239] R10: ffffbb58c7b3fd08 R11: 0000000000000000 R12: ffff9e88a0017000
    [ 1892.106434] R13: ffff9e889645f608 R14: ffff9e88794e1000 R15: ffff9e88a07b5200
    [ 1892.108642] FS: 00007fcaed3f18c0(0000) GS:ffff9e88bda00000(0000) knlGS:0000000000000000
    [ 1892.111558] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1892.113492] CR2: 00007f52509ff420 CR3: 00000000603dd002 CR4: 0000000000160ee0
    [ 1892.115814] Call Trace:
    [ 1892.116896] btrfs_dev_replace_by_ioctl+0x35/0x60 [btrfs]
    [ 1892.118962] btrfs_ioctl+0x1d62/0x2550 [btrfs]

  caused by the previous patch ("btrfs: scrub: Require mandatory block group RO for dev-replace").

  Hitting ENOSPC is possible and could happen when the block group is set read-only, preventing NOCOW writes to the area that's being accessed by dev-replace.

  This has happened with scratch devices of size 12G but not with 5G and 20G, so it depends on timing and other activity on the filesystem. The whole replace operation is restartable, and the space state should be examined by the user in any case.

  The error code is propagated back to the ioctl caller, so the kernel warning is causing false alerts.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: add __pure attribute to functions | David Sterba | 2019-11-18 | 1 | -1/+1
  The attribute is more relaxed than const and the functions could dereference pointers, as long as the observable state is not changed. We do have such functions, based on -Wsuggest-attribute=pure .

  The visible effects of this patch are negligible, there are differences in the assembly but hard to summarize.

  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
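To illustrate the distinction drawn above, here is a small sketch of a function that qualifies as pure: it dereferences a pointer but leaves observable state unchanged. The struct and function names are made up for this example and do not come from the btrfs sources; the kernel's `__pure` macro expands to the GCC attribute used here.

```c
/* Illustrative type only; not a btrfs structure. */
struct sample_device {
	unsigned long long total_bytes;
	unsigned long long bytes_used;
};

/*
 * Reads memory through a pointer, so it cannot be "const", but it has
 * no side effects, so it may be marked "pure". The compiler is then
 * free to merge repeated calls when the pointed-to data is unchanged.
 */
__attribute__((pure))
unsigned long long sample_free_bytes(const struct sample_device *dev)
{
	return dev->total_bytes - dev->bytes_used;
}
```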
* btrfs: move cond_wake_up functions out of ctree | David Sterba | 2019-09-09 | 1 | -0/+1
  The file ctree.h serves as a header for everything and has become quite bloated. Split some helpers that are generic and create a new file that should be the catch-all for code that's not btrfs-specific.

  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: replace: BTRFS_DEV_REPLACE_ITEM_STATE_x defines should go | Anand Jain | 2019-09-09 | 1 | -1/+1
  The BTRFS_DEV_REPLACE_ITEM_STATE_x defines, as shown in [1], are unused in both kernel and btrfs-progs (except for one instance of BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED in kernel).

  [1]
    btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED 2
    btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED 3
    btrfs.h:#define BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED 4

  Further, these define values are different from their counterparts in the BTRFS_IOCTL_DEV_REPLACE_STATE_x series, as shown in [2].

  [2]
    btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED 2
    btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED 3
    btrfs_tree.h:#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED 4

  So this patch deletes the BTRFS_DEV_REPLACE_ITEM_STATE_x defines altogether, and the one instance of BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED is replaced with BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED in the kernel.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
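The hazard of keeping two series with different numbering can be shown with a short sketch. The define values below are copied from the two listings quoted in the commit message; the helper functions are hypothetical and exist only to show that the same stored number decodes to a different state depending on which series is used.

```c
#include <string.h>

/* Values as quoted in the commit message above. */
#define BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED	2
#define BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED	3
#define BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED	4

#define BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED	2
#define BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED	3
#define BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED	4

/* Hypothetical decoder for the IOCTL series. */
const char *ioctl_state_name(int v)
{
	switch (v) {
	case BTRFS_IOCTL_DEV_REPLACE_STATE_FINISHED:	return "FINISHED";
	case BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED:	return "CANCELED";
	case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:	return "SUSPENDED";
	default:					return "UNKNOWN";
	}
}

/* Hypothetical decoder for the (now removed) ITEM series. */
const char *item_state_name(int v)
{
	switch (v) {
	case BTRFS_DEV_REPLACE_ITEM_STATE_SUSPENDED:	return "SUSPENDED";
	case BTRFS_DEV_REPLACE_ITEM_STATE_FINISHED:	return "FINISHED";
	case BTRFS_DEV_REPLACE_ITEM_STATE_CANCELED:	return "CANCELED";
	default:					return "UNKNOWN";
	}
}

/* Returns 1 when the two series decode the same number differently. */
int state_series_disagree(int v)
{
	return strcmp(ioctl_state_name(v), item_state_name(v)) != 0;
}
```

With these values, every one of 2, 3 and 4 decodes differently in the two series, which is why keeping only one of them avoids confusion.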
* btrfs: remove mapping tree structures indirection | David Sterba | 2019-07-01 | 1 | -1/+1
  fs_info::mapping_tree is the physical<->logical mapping tree and uses the same underlying structure as extents, but is embedded to another structure. There are no other members and this indirection is useless. No functional change.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove redundant assignment of tgt_device->commit_total_bytes | Nikolay Borisov | 2019-07-01 | 1 | -1/+0
  This is already done in btrfs_init_dev_replace_tgtdev which is the first phase of device replace, called before doing scrub. During that time exclusive lock is held. Additionally btrfs_fs_device::commit_total_bytes is always set based on the size of the underlying block device which shouldn't change once set. This makes the 2nd assignment of the variable in the finishing phase redundant.

  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Explicitly reserve space for devreplace item | Nikolay Borisov | 2019-07-01 | 1 | -2/+2
  Part of device replace involves writing an item to the device root containing information about pending replace operations. Currently space for this item is not explicitly reserved, so this works only thanks to the presence of the global reserve. While not fatal, it's not good practice. Let's be explicit about the space requirement of device replace and reserve space when starting the transaction.

  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Streamline replace sem unlock in btrfs_dev_replace_start | Nikolay Borisov | 2019-07-01 | 1 | -6/+2
  There are only 2 branches which go to the leave label with need_unlock set to true. Essentially need_unlock is used as a substitute for directly calling up_write. Since only 2 branches need this and their context is not that big, it's clearer to just call up_write where required. No functional changes.

  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Ensure btrfs_init_dev_replace_tgtdev sees up to date values | Nikolay Borisov | 2019-07-01 | 1 | -5/+5
  btrfs_init_dev_replace_tgtdev reads certain values from the source device (such as commit_total_bytes) which are updated during transaction commit. Currently this function is called before committing any pending transaction, leading to possibly reading outdated values. Fix this by moving the function below the transaction commit. At this point the EXCL_OP bit is set, hence once the transaction is complete the total size of the device cannot be changed (it's usually changed by resize/remove ops, which are blocked).

  Fixes: 9e271ae27e44 ("Btrfs: kernel operation should come after user input has been verified")
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: Remove impossible WARN_ON | Nikolay Borisov | 2019-07-01 | 1 | -1/+0
  This WARN_ON can never trigger because src_device cannot be null. btrfs_find_device_by_devspec always returns either an error or a valid pointer to the device. Just remove it.

  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Reduce critical section in btrfs_init_dev_replace_tgtdev | Nikolay Borisov | 2019-07-01 | 1 | -1/+2
  There is no point in holding btrfs_fs_devices::device_list_mutex while initialising fields of the not-yet-published device. Instead, hold the mutex only when the newly initialised device is being published. I think holding device_list_mutex here is redundant altogether, because at this point BTRFS_FS_EXCL_OP is set, which prevents device removal/addition/balance/resize from occurring.

  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Don't opencode sync_blockdev in btrfs_init_dev_replace_tgtdev | Nikolay Borisov | 2019-07-01 | 1 | -1/+1
  Using sync_blockdev makes it plain obvious what's happening. No functional changes.

  Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Ensure replaced device doesn't have pending chunk allocation | Nikolay Borisov | 2019-05-28 | 1 | -10/+25
  Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update operations during transaction commit") combined the way certain operations are recorded in a transaction. As a result an ASSERT was added in dev_replace_finish to ensure the new code works correctly. Unfortunately I got reports that it's possible to trigger the assert, meaning that during a device replace it's possible to have an unfinished chunk allocation on the source device.

  This is supposed to be prevented by the fact that a transaction is committed before finishing the replace operation and later acquiring the chunk mutex. This is not sufficient since by the time the transaction is committed and the chunk mutex acquired it's possible to allocate a chunk depending on the workload being executed on the replaced device. This bug has been present ever since device replace was introduced but there was never code which checks for it.

  The correct way to fix it is to ensure that there is no pending device modification operation when the chunk mutex is acquired and, if there is, repeat the transaction commit. Unfortunately it's not possible to just exclude the source device from btrfs_fs_devices::dev_alloc_list since this causes ENOSPC to be hit in transaction commit.

  Fixing that in another way would need to add special cases to handle the last writes and forbid new ones. The looped transaction fix is more obvious, and can be easily backported. The runtime of dev-replace is long so there's no noticeable delay caused by that.

  Reported-by: David Sterba <dsterba@suse.com>
  Fixes: 391cd9df81ac ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
  CC: stable@vger.kernel.org # 4.4+
  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
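The "looped transaction" idea described above can be sketched as a small user-space simulation: commit, take the lock, and retry if a new chunk allocation slipped in between the two steps. Everything here is a hypothetical stand-in (the struct, its fields and the function names are invented for illustration); it is not the kernel code and only models the retry logic.

```c
#include <stdbool.h>

/* Hypothetical stand-in for filesystem state; not a kernel structure. */
struct sim_fs {
	int racing_allocs;	/* allocations that will sneak in after a commit */
	int pending_allocs;	/* allocations not yet covered by a commit */
	int commits;		/* number of transaction commits performed */
	bool chunk_mutex_held;
};

/* A commit flushes everything pending at the time it runs. */
static void sim_commit_transaction(struct sim_fs *fs)
{
	fs->pending_allocs = 0;
	fs->commits++;
}

/* Models the workload allocating a chunk between commit and lock. */
static void sim_workload_runs(struct sim_fs *fs)
{
	if (fs->racing_allocs > 0) {
		fs->racing_allocs--;
		fs->pending_allocs++;
	}
}

/*
 * Repeat the commit until the lock is taken with nothing pending.
 * Returns with the (simulated) chunk mutex held; the return value is
 * the number of commits that were needed.
 */
int sim_finish_replace(struct sim_fs *fs)
{
	for (;;) {
		sim_commit_transaction(fs);
		sim_workload_runs(fs);
		fs->chunk_mutex_held = true;		/* lock */
		if (fs->pending_allocs == 0)
			break;				/* safe to finish */
		fs->chunk_mutex_held = false;		/* unlock, retry */
	}
	return fs->commits;
}
```

In this model the loop ends once a commit is not followed by a racing allocation, which mirrors the commit message's point that the extra commits cost little compared with the overall runtime of a replace.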
* btrfs: get fs_info from device in btrfs_rm_dev_replace_free_srcdev | David Sterba | 2019-04-29 | 1 | -1/+1
  We can read fs_info from the device and can drop it from the parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in btrfs_run_dev_replace | David Sterba | 2019-04-29 | 1 | -2/+2
  We can read fs_info from the transaction and can drop it from the parameters.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: combine device update operations during transaction commit | Nikolay Borisov | 2019-04-29 | 1 | -1/+1
  We currently overload the pending_chunks list to handle updating btrfs_device->commit_bytes_used. We don't actually care about the extent mapping or even the device mapping for the chunk - we just need the device, and we can end up processing it multiple times. The fs_devices->resized_list does more or less the same thing, but with the disk size. They are called consecutively during commit and have more or less the same purpose.

  We can combine the two lists into a single list that attaches to the transaction and contains a list of devices that need updating. Since we always add the device to a list when we change bytes_used or disk_total_size, there's no harm in copying both values at once.

  Signed-off-by: Nikolay Borisov <nborisov@suse.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: drop the lock on error in btrfs_dev_replace_cancel | Dan Carpenter | 2019-02-25 | 1 | -0/+1
  We should drop the lock on this error path. This has been found by a static tool.

  The lock needs to be released, it's there to protect access to the dev_replace members and is not supposed to be left locked. The value of state that's being switched would need to be artificially changed to an invalid value so the default: branch is taken.

  Fixes: d189dd70e255 ("btrfs: fix use-after-free due to race between replace start and cancel")
  CC: stable@vger.kernel.org # 5.0+
  Reviewed-by: Anand Jain <anand.jain@oracle.com>
  Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: merge btrfs_find_device and find_device | Anand Jain | 2019-02-25 | 1 | -2/+2
  Both btrfs_find_device() and find_device() do the same thing, except that the latter does not take the seed device into account in the device scanning context. We can merge them.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: refactor btrfs_find_device() take fs_devices as argument | Anand Jain | 2019-02-25 | 1 | -3/+3
  btrfs_find_device() accepts fs_info as an argument and retrieves fs_devices from fs_info.

  Instead use fs_devices, so that this function can be used in non-mount (during device scanning) context as well.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Fix typos in comments and strings | Andrea Gelmini | 2018-12-17 | 1 | -1/+1
  The typos accumulate over time so once in a while they get fixed in a large patch.

  Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: open code trivial locking helpers | David Sterba | 2018-12-17 | 1 | -50/+31
  The dev-replace locking functions are now trivial wrappers around rw semaphore that can be used directly everywhere. No functional change.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: remove custom read/write blocking scheme | David Sterba | 2018-12-17 | 1 | -16/+0
  After the rw semaphore has been added, the custom blocking using ::blocking_readers and ::read_lock_wq is redundant.

  The blocking logic in __btrfs_map_block is replaced by extending the time the semaphore is held, that has the same blocking effect on writes as the previous custom scheme that waited until ::blocking_readers was zero.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: swich locking to rw semaphore | David Sterba | 2018-12-17 | 1 | -6/+6
  This is the first part of removing the custom locking and waiting scheme used for device replace. It was probably copied from extent buffer locking, but there's nothing that would require more than is provided by the common locking primitives.

  The rw spinlock protects waiting tasks counter in case of incompatible locks and the waitqueue. Same as rw semaphore.

  This patch only switches the locking primitive, for better bisectability. There should be no functional change other than the overhead of the locking and potential sleeping instead of spinning when the lock is contended.

  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: don't report user-requested cancel as an error | Anand Jain | 2018-12-17 | 1 | -1/+2
  As of now only a user-requested replace cancel can cancel the replace-scrub, so there is no need to log the error.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: silence warning if replace is canceled | Anand Jain | 2018-12-17 | 1 | -2/+2
  When we successfully cancel the device replace, its scrub worker returns -ECANCELED, which is then passed to btrfs_dev_replace_finishing. It cleans up based on the returned status and propagates the same -ECANCELED back to the parent function. As of now only the user can cancel the replace-scrub, so it's OK to silence the warning here.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: add explicit check for replace result "no error" | Anand Jain | 2018-12-17 | 1 | -2/+3
  We recast the replace return status BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to 0, to indicate no error.

  BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR should also return 0, and since it is itself declared as 0, we just return it as is. Instead, add it to the if statement explicitly so that the code is clearer to read.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: replace's scrub must not be running in suspended state | Anand Jain | 2018-12-17 | 1 | -1/+3
  When the replace is in the suspended state, btrfs_scrub_cancel() should fail with -ENOTCONN as there is no scrub running. As a safety catch, check if btrfs_scrub_cancel() returns -ENOTCONN and assert if it doesn't.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: set result code of cancel by status of scrub | Anand Jain | 2018-12-17 | 1 | -7/+14
  The device-replace needs to check the result code of the scrub workers in btrfs_dev_replace_cancel and distinguish between a successful cancel operation and the case where there was no operation running.

  If btrfs_scrub_cancel() fails, return BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED so that the user can try to cancel the replace again.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  [ update changelog ]
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix use-after-free due to race between replace start and cancel | Anand Jain | 2018-12-17 | 1 | -22/+41
  The device replace cancel thread can race with the replace start thread and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will fail to stop the scrub thread.

  The scrub thread continues with the scrub for replace, which then tries to write to the target device, which has already been freed by the cancel thread.

  scrub_setup_ctx() warns as tgtdev is NULL.

    struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
    {
    ...
        if (is_dev_replace) {
            WARN_ON(!fs_info->dev_replace.tgtdev); <===
            sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO;
            sctx->wr_tgtdev = fs_info->dev_replace.tgtdev;
            sctx->flush_all_writes = false;
        }

    [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started
    [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled
    [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
    ...
    [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs]
    ...
    [ 6852.432970] Call Trace:
    [ 6852.433202] btrfs_scrub_dev+0x19b/0x5c0 [btrfs]
    [ 6852.433471] btrfs_dev_replace_start+0x48c/0x6a0 [btrfs]
    [ 6852.433800] btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs]
    [ 6852.434097] btrfs_ioctl+0x2476/0x2d20 [btrfs]
    [ 6852.434365] ? do_sigaction+0x7d/0x1e0
    [ 6852.434623] do_vfs_ioctl+0xa9/0x6c0
    [ 6852.434865] ? syscall_trace_enter+0x1c8/0x310
    [ 6852.435124] ? syscall_trace_enter+0x1c8/0x310
    [ 6852.435387] ksys_ioctl+0x60/0x90
    [ 6852.435663] __x64_sys_ioctl+0x16/0x20
    [ 6852.435907] do_syscall_64+0x50/0x180
    [ 6852.436150] entry_SYSCALL_64_after_hwframe+0x49/0xbe

  Further, as the replace thread enters scrub_write_page_to_dev_replace() without the target device it panics:

    static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, struct scrub_page *spage)
    {
    ...
        bio_set_dev(bio, sbio->dev->bdev); <======

    [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0
    ..
    [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs]
    [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260 [btrfs]
    ..
    [ 6929.721430] Call Trace:
    [ 6929.721663] scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs]
    [ 6929.721975] scrub_bio_end_io_worker+0x1af/0x490 [btrfs]
    [ 6929.722277] normal_work_helper+0xf0/0x4c0 [btrfs]
    [ 6929.722552] process_one_work+0x1f4/0x520
    [ 6929.722805] ? process_one_work+0x16e/0x520
    [ 6929.723063] worker_thread+0x46/0x3d0
    [ 6929.723313] kthread+0xf8/0x130
    [ 6929.723544] ? process_one_work+0x520/0x520
    [ 6929.723800] ? kthread_delayed_work_timer_fn+0x80/0x80
    [ 6929.724081] ret_from_fork+0x3a/0x50

  Fix this by letting btrfs_dev_replace_finishing() do the job of cleaning up after the cancel, including freeing the target device. btrfs_dev_replace_finishing() is called when btrfs_scrub_dev() returns, along with the scrub return status.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: go back to suspend state if another EXCL_OP is running | Anand Jain | 2018-12-17 | 1 | -0/+4
  In a scenario where balance and replace co-exist as below,

    - start balance
    - pause balance
    - start replace
    - reboot

  and when the system restarts, balance resumes first. Then the replace is attempted to restart but will fail as the EXCL_OP lock is already held by the balance. If so, place the replace state back to the BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state.

  Fixes: 010a47bde9420 ("btrfs: add proper safety check before resuming dev-replace")
  CC: stable@vger.kernel.org # 4.18+
  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
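The behaviour described above (try to claim the exclusive-operation slot, and fall back to the suspended state if someone else holds it) can be sketched in user space with a C11 atomic flag. This is only a model under stated assumptions: the struct, enum and function names are hypothetical, and the atomic flag merely stands in for the kernel's exclusive-operation bit.

```c
#include <stdatomic.h>

/* Hypothetical states for this sketch; not the kernel's constants. */
enum sim_replace_state { SIM_STATE_STARTED, SIM_STATE_SUSPENDED };

/* Hypothetical stand-in for the relevant filesystem state. */
struct sim_fs_info {
	atomic_flag excl_op;	/* set while an exclusive operation runs */
	enum sim_replace_state replace_state;
};

/*
 * Try to resume a replace after mount. If another exclusive operation
 * (for example a resumed balance) already owns the flag, leave the
 * replace in the suspended state and report failure.
 */
int sim_resume_replace(struct sim_fs_info *fs)
{
	if (atomic_flag_test_and_set(&fs->excl_op)) {
		fs->replace_state = SIM_STATE_SUSPENDED;
		return -1;
	}
	fs->replace_state = SIM_STATE_STARTED;
	return 0;
}
```

The point of the fallback is that the recorded state always matches reality: a replace that could not take the exclusive slot has no scrub running, so it must not be reported as started.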
* btrfs: dev-replace: go back to suspended state if target device is missing | Anand Jain | 2018-12-17 | 1 | -0/+2
  At the time of a forced unmount we place the running replace in the BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes back the target device may turn out to be missing. In that case let the replace state remain BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED instead of BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED, as there isn't any matching scrub running as part of the replace.

  Fixes: e93c89c1aaaa ("Btrfs: add new sources for device replace code")
  CC: stable@vger.kernel.org # 4.4+
  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: mark btrfs_dev_replace_start as static | Anand Jain | 2018-12-17 | 1 | -1/+1
  There is no consumer other than its own file, dev-replace.c.

  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove redundant replace_state init | Anand Jain | 2018-12-17 | 1 | -1/+0
  dev_replace::replace_state has been set to BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED (0) in the same function, so delete the line which sets replace_state = 0.

  Reviewed-by: Nikolay Borisov <nborisov@suse.com>
  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: prevent ioctls from interfering with a swap file | Omar Sandoval | 2018-12-17 | 1 | -0/+7
  A later patch will implement swap file support for Btrfs, but before we do that, we need to make sure that the various Btrfs ioctls cannot change a swap file.

  When a swap file is active, we must make sure that the extents of the file are not moved and that they don't become shared. That means that the following are not safe:

    - chattr +c (enable compression)
    - reflink
    - dedupe
    - snapshot
    - defrag

  Don't allow those to happen on an active swap file.

  Additionally, balance, resize, device remove, and device replace are also unsafe if they affect an active swapfile. Add a red-black tree of block groups and devices which contain an active swapfile. Relocation checks each block group against this tree and skips it or errors out for balance or resize, respectively. Device remove and device replace check the tree for the device they will operate on.

  Note that we don't have to worry about chattr -C (disable nocow), which we ignore for non-empty files, because an active swapfile must be non-empty and can't be truncated. We also don't have to worry about autodefrag because it's only done on COW files. Truncate and fallocate are already taken care of by the generic code. Device add doesn't do relocation so it's not an issue, either.

  Signed-off-by: Omar Sandoval <osandov@fb.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: remove pointless assert in write unlockDavid Sterba2018-10-151-1/+0
    The value of blocking_readers is increased only when the lock is taken for read, so the condition cannot fail while the write lock is held.

    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: move replace members out of fs_infoDavid Sterba2018-10-151-8/+8
    The replace_wait and bio_counter were mistakenly added to fs_info in commit c404e0dc2c843b154f ("Btrfs: fix use-after-free in the finishing procedure of the device replace"), but they logically belong to fs_info::dev_replace. Besides, bio_counter is a very generic name and is confusing in bare fs_info context.

    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: dev-replace: avoid useless lock on error handling pathDavid Sterba2018-10-151-1/+6
    The exit sequence in btrfs_dev_replace_start does not allow simply adding a label in the right place for the error handling after a transaction start failure to jump to. Currently a lock is taken only to pair with the unlock in that section, which is unnecessary and only raises questions. Add a variable to track the locking status and avoid the extra locking.

    Reviewed-by: Omar Sandoval <osandov@fb.com>
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
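A minimal userspace sketch of the "track the locking status" pattern follows. It is not the kernel code: the lock is a hypothetical counter-based stand-in, and demo_replace_start, its parameters, and its return values are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the dev_replace lock; counts acquisitions. */
struct demo_lock {
	int held;
	int acquisitions;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	l->held = 1;
	l->acquisitions++;
}

static void demo_lock_release(struct demo_lock *l)
{
	l->held = 0;
}

/*
 * Two error paths share one exit label. One fails while the lock is
 * held, the other after it was dropped. need_unlock records which case
 * applies, so the late path does not re-take the lock just to pair
 * with the unlock at the label.
 */
static int demo_replace_start(struct demo_lock *lock, bool already_running,
			      bool trans_fails)
{
	bool need_unlock = false;
	int ret = 0;

	demo_lock_acquire(lock);
	if (already_running) {
		ret = -2;		/* assumed error value */
		need_unlock = true;	/* still holding the lock */
		goto leave;
	}
	demo_lock_release(lock);

	if (trans_fails) {
		ret = -1;		/* assumed error value */
		goto leave;		/* lock already dropped */
	}
	return 0;

leave:
	if (need_unlock)
		demo_lock_release(lock);
	return ret;
}
```

With the flag, the transaction-failure path acquires the lock exactly once in total, which is what makes the exit sequence easier to reason about.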
* btrfs: open code btrfs_after_dev_replace_commitDavid Sterba2018-10-151-8/+0
    Too trivial, the purpose can be simply documented in a comment.

    Reviewed-by: Omar Sandoval <osandov@fb.com>
    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: open code btrfs_dev_replace_clear_lock_blockingDavid Sterba2018-10-151-12/+0
    There's a single caller and the function name does not say it's actually taking the lock, so open coding makes it more explicit.

    For now, btrfs_dev_replace_read_lock is used instead of read_lock so it's paired with the unlocking wrapper in the same block.

    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove btrfs_dev_replace::read_locksDavid Sterba2018-10-151-5/+0
    This member seems to be copied from the extent_buffer locking scheme and is at least used to assert, in some way, that the read lock/unlock is properly nested.

    While the _inc/_dec are called inside the read lock section, the asserts are both inside and outside, so the ordering is not guaranteed and we can see read/inc/dec ordered in any way (theoretically).

    A missing call of btrfs_dev_replace_clear_lock_blocking could cause an unexpected read_locks count, so this at least looks like a valid assertion, but it will become unnecessary with later updates.

    Reviewed-by: Anand Jain <anand.jain@oracle.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix error handling in btrfs_dev_replace_startJeff Mahoney2018-10-151-2/+5
    When we fail to start a transaction in btrfs_dev_replace_start, we leave dev_replace->replace_state set to STARTED but clear ->srcdev and ->tgtdev. Later, that can result in an Oops in btrfs_dev_replace_progress, because having state set to STARTED or SUSPENDED implies that ->srcdev is valid.

    Also fix error handling when the state is already STARTED or SUSPENDED while starting. That, too, will clear ->srcdev and ->tgtdev even though it doesn't own them. This should be an impossible case to hit since we should be protected by the BTRFS_FS_EXCL_OP bit being set. Let's add an ASSERT there while we're at it.

    Fixes: e93c89c1aaaaa (Btrfs: add new sources for device replace code)
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Jeff Mahoney <jeffm@suse.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
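The invariant behind this fix (a STARTED or SUSPENDED state implies a valid srcdev) can be illustrated with a small userspace sketch. The types and the demo_* functions are hypothetical simplifications, not the kernel structures; the sketch only shows the state being rolled back together with the pointers.

```c
#include <stddef.h>

enum demo_state { DEMO_NEVER_STARTED, DEMO_STARTED, DEMO_SUSPENDED };

struct demo_device {
	unsigned long long bytes_done;
};

struct demo_replace {
	enum demo_state state;
	struct demo_device *srcdev;
	struct demo_device *tgtdev;
};

/*
 * Hypothetical start routine. trans_err simulates a failure to start
 * the transaction (0 means success).
 */
static int demo_start(struct demo_replace *r, struct demo_device *src,
		      struct demo_device *tgt, int trans_err)
{
	r->state = DEMO_STARTED;
	r->srcdev = src;
	r->tgtdev = tgt;

	if (trans_err) {
		/* Roll back the state together with the pointers. */
		r->state = DEMO_NEVER_STARTED;
		r->srcdev = NULL;
		r->tgtdev = NULL;
		return trans_err;
	}
	return 0;
}

/* Dereferences srcdev only in states that guarantee it is valid. */
static unsigned long long demo_progress(const struct demo_replace *r)
{
	switch (r->state) {
	case DEMO_STARTED:
	case DEMO_SUSPENDED:
		return r->srcdev->bytes_done;
	default:
		return 0;
	}
}
```

If the rollback cleared only the pointers and left the state at DEMO_STARTED, demo_progress would dereference NULL, which mirrors the Oops described in the commit message.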
* btrfs: Make btrfs_find_device_by_devspec return btrfs_device directlyNikolay Borisov2018-10-151-4/+4
    Instead of returning an error value and using one of the parameters for returning the actual object we are interested in, just refactor the function to directly return btrfs_device *. Also bubble up the error handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into btrfs_rm_device. No functional changes.

    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
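Returning the object directly usually relies on encoding an errno value in the returned pointer. The sketch below is a userspace approximation of that idiom under stated assumptions: the demo_err_ptr, demo_ptr_err and demo_is_err helpers are hypothetical re-implementations written for this example (the kernel provides its own helpers in linux/err.h), and demo_find_device is not the btrfs function.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size of the errno range encoded at the top of the address space. */
#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer. */
static inline long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if the pointer value falls in the reserved error range. */
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_device {
	unsigned long long devid;
};

/*
 * Hypothetical lookup that returns the device itself, or an encoded
 * -ENODEV, instead of returning an int and filling an out-parameter.
 */
static struct demo_device *demo_find_device(struct demo_device *devs,
					    size_t n, unsigned long long devid)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].devid == devid)
			return &devs[i];
	return demo_err_ptr(-ENODEV);
}
```

Callers then test the result once with demo_is_err() rather than checking a return code and a separately filled pointer.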
* btrfs: replace: Reset on-disk dev stats value after replaceMisono Tomohiro2018-08-061-0/+6
    The on-disk dev stats value is updated in btrfs_run_dev_stats(), which is called during transaction commit, if device->dev_stats_ccnt is not zero.

    Since the current replace operation does not touch dev_stats_ccnt, the on-disk dev stats value is not updated. Therefore "btrfs device stats" may return the old device's value after umount/mount (Example: See "btrfs ins dump-t -t DEV $DEV" after btrfs/100 finishes).

    Fix this by just incrementing dev_stats_ccnt in btrfs_dev_replace_finishing() when the replace succeeds, which will update the values.

    Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
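The change-counter mechanism described above can be modelled in a few lines of userspace C. This is a simplified sketch: the struct, the single error counter, and the demo_* functions are hypothetical, and a plain int stands in for whatever counter type the kernel uses.

```c
/*
 * Hypothetical model of a device with in-memory stats, an on-disk
 * copy, and a change counter that gates the write-back.
 */
struct demo_device {
	int dev_stats_ccnt;		/* non-zero: stats need flushing */
	unsigned long long mem_errors;	/* in-memory value */
	unsigned long long disk_errors;	/* value persisted on disk */
};

/* Commit-time flush: writes stats only if the counter is non-zero. */
static void demo_run_dev_stats(struct demo_device *dev)
{
	if (dev->dev_stats_ccnt == 0)
		return;
	dev->disk_errors = dev->mem_errors;
	dev->dev_stats_ccnt = 0;
}

/*
 * End of a successful replace: bump the counter so the next commit
 * overwrites the stale on-disk values left by the old device.
 */
static void demo_replace_finish(struct demo_device *tgt)
{
	tgt->dev_stats_ccnt++;
}
```

In this model, a target device that starts with fresh in-memory stats but inherits the old device's on-disk value keeps the stale value across commits until the counter is bumped.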
* btrfs: Remove fs_info from btrfs_destroy_dev_replace_tgtdevNikolay Borisov2018-08-061-3/+3
    This function is always passed a well-formed tgtdevice so the fs_info can be referenced from there.

    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove fs_info from btrfs_assign_next_active_deviceNikolay Borisov2018-08-061-1/+1
    It can be referenced from the passed 'device' argument which is always a well-formed device.

    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
    Reviewed-by: David Sterba <dsterba@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove fs_info from btrfs_rm_dev_replace_remove_srcdevNikolay Borisov2018-08-061-1/+1
    It can be referenced from the passed srcdev argument.

    Signed-off-by: Nikolay Borisov <nborisov@suse.com>
    Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
    Signed-off-by: David Sterba <dsterba@suse.com>