summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds2017-05-1013-86/+255
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph updates from Ilya Dryomov: "The two main items are support for disabling automatic rbd exclusive lock transfers from myself and the long awaited -ENOSPC handling series from Jeff. The former will allow rbd users to take advantage of exclusive lock's built-in blacklist/break-lock functionality while staying in control of who owns the lock. With the latter in place, we will abort filesystem writes on -ENOSPC instead of having them block indefinitely. Beyond that we've got the usual pile of filesystem fixes from Zheng, some refcount_t conversion patches from Elena and a patch for an ancient open() flags handling bug from Alexander" * tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits) ceph: fix memory leak in __ceph_setxattr() ceph: fix file open flags on ppc64 ceph: choose readdir frag based on previous readdir reply rbd: exclusive map option rbd: return ResponseMessage result from rbd_handle_request_lock() rbd: kill rbd_is_lock_supported() rbd: support updating the lock cookie without releasing the lock rbd: store lock cookie rbd: ignore unlock errors rbd: fix error handling around rbd_init_disk() rbd: move rbd_unregister_watch() call into rbd_dev_image_release() rbd: move rbd_dev_destroy() call out of rbd_dev_image_release() ceph: when seeing write errors on an inode, switch to sync writes Revert "ceph: SetPageError() for writeback pages if writepages fails" ceph: handle epoch barriers in cap messages libceph: add an epoch_barrier field to struct ceph_osd_client libceph: abort already submitted but abortable requests when map or pool goes full libceph: allow requests to return immediately on full conditions if caller wishes libceph: remove req->r_replay_version ceph: make seeky readdir more efficient ...
| * ceph: fix memory leak in __ceph_setxattr()Luis Henriques2017-05-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ceph_inode_xattr needs to be released when removing an xattr. Easily reproducible running the 'generic/020' test from xfstests or simply by doing: attr -s attr0 -V 0 /mnt/test && attr -r attr0 /mnt/test While there, also fix the error path. Here's the kmemleak splat: unreferenced object 0xffff88001f86fbc0 (size 64): comm "attr", pid 244, jiffies 4294904246 (age 98.464s) hex dump (first 32 bytes): 40 fa 86 1f 00 88 ff ff 80 32 38 1f 00 88 ff ff @........28..... 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................ backtrace: [<ffffffff81560199>] kmemleak_alloc+0x49/0xa0 [<ffffffff810f3e5b>] kmem_cache_alloc+0x9b/0xf0 [<ffffffff812b157e>] __ceph_setxattr+0x17e/0x820 [<ffffffff812b1c57>] ceph_set_xattr_handler+0x37/0x40 [<ffffffff8111fb4b>] __vfs_removexattr+0x4b/0x60 [<ffffffff8111fd37>] vfs_removexattr+0x77/0xd0 [<ffffffff8111fdd1>] removexattr+0x41/0x60 [<ffffffff8111fe65>] path_removexattr+0x75/0xa0 [<ffffffff81120aeb>] SyS_lremovexattr+0xb/0x10 [<ffffffff81564b20>] entry_SYSCALL_64_fastpath+0x13/0x94 [<ffffffffffffffff>] 0xffffffffffffffff Cc: stable@vger.kernel.org Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: fix file open flags on ppc64Alexander Graf2017-05-041-1/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The file open flags (O_foo) are platform specific and should never go out to an interface that is not local to the system. Unfortunately these flags have leaked out onto the wire in the cephfs implementation. That lead to bogus flags getting transmitted on ppc64. This patch converts the kernel view of flags to the ceph view of file open flags. Fixes: 124e68e74 ("ceph: file operations") Signed-off-by: Alexander Graf <agraf@suse.de> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: choose readdir frag based on previous readdir replyYan, Zheng2017-05-041-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The dirfragtree is lazily updated, it's not always accurate. Infinite loops happens in following circumstance. - client send request to read frag A - frag A has been fragmented into frag B and C. So mds fills the reply with contents of frag B - client wants to read next frag C. ceph_choose_frag(frag value of C) return frag A. The fix is using previous readdir reply to calculate next readdir frag when possible. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: when seeing write errors on an inode, switch to sync writesJeff Layton2017-05-043-14/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we don't have a real feedback mechanism in place for when we start seeing buffered writeback errors. If writeback is failing, there is nothing that prevents an application from continuing to dirty pages that aren't being cleaned. In the event that we're seeing write errors of any sort occur on an inode, have the callback set a flag to force further writes to be synchronous. When the next write succeeds, clear the flag to allow buffered writeback to continue. Since this is just a hint to the write submission mechanism, we only take the i_ceph_lock when a lockless check shows that the flag needs to be changed. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng” <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * Revert "ceph: SetPageError() for writeback pages if writepages fails"Jeff Layton2017-05-041-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit b109eec6f4332bd517e2f41e207037c4b9065094. If I'm filling up a filesystem with this sort of command: $ dd if=/dev/urandom of=/mnt/cephfs/fillfile bs=2M oflag=sync ...then I'll eventually get back EIO on a write. Further calls will give us ENOSPC. I'm not sure what prompted this change, but I don't think it's what we want to do. If writepages failed, we will have already set the mapping error appropriately, and that's what gets reported by fsync() or close(). __filemap_fdatawait_range however, does this: wait_on_page_writeback(page); if (TestClearPageError(page)) ret = -EIO; ...and that -EIO ends up trumping the mapping's error if one exists. When writepages fails, we only want to set the error in the mapping, and not flag the individual pages. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng” <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: handle epoch barriers in cap messagesJeff Layton2017-05-043-7/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | Have the client store and update the osdc epoch_barrier when a cap message comes in with one. When sending cap messages, send the epoch barrier as well. This allows clients to inform servers that their released caps may not be used until a particular OSD map epoch. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: "Yan, Zheng” <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: allow requests to return immediately on full conditions if caller ↵Jeff Layton2017-05-042-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wishes Usually, when the osd map is flagged as full or the pool is at quota, write requests just hang. This is not what we want for cephfs, where it would be better to simply report -ENOSPC back to userland instead of stalling. If the caller knows that it will want an immediate error return instead of blocking on a full or at-quota error condition then allow it to set a flag to request that behavior. Set that flag in ceph_osdc_new_request (since ceph.ko is the only caller), and on any other write request from ceph.ko. A later patch will deal with requests that were submitted before the new map showing the full condition came in. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: make seeky readdir more efficientYan, Zheng2017-05-044-6/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current cephfs client uses string to indicate start position of readdir. The string is last entry of previous readdir reply. This approach does not work for seeky readdir because we can not easily convert the new postion to a string. For seeky readdir, mds needs to return dentries from the beginning. Client keeps retrying if the reply does not contain the dentry it wants. In current version of ceph, mds sorts CDentry in its cache in hash order. Client also uses dentry hash to compose dir postion. For seeky readdir, if client passes the hash part of dir postion to mds. mds can avoid replying useless dentries. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: close stopped mds' sessionYan, Zheng2017-05-041-0/+16
| | | | | | | | | | | | | | | | | | If a mds has stopped, close its session and clean up its session requests/caps. The process is similar to handling SESSION_CLOSE initiated by mds. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: fix potential use-after-freeYan, Zheng2017-05-041-2/+8
| | | | | | | | | | | | | | | | | | | | __unregister_session() free the session if it drops the last reference. We should grab an extra reference if we want to use session after __unregister_session(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: allow connecting to mds whose rank >= mdsmap::m_max_mdsYan, Zheng2017-05-043-23/+52
| | | | | | | | | | | | | | | | | | mdsmap::m_max_mds is the expected count of active mds. It's not the max rank of active mds. User can decrease mdsmap::m_max_mds, but does not stop mds whose rank >= mdsmap::m_max_mds. Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: fix wrong check in ceph_renew_caps()Yan, Zheng2017-05-041-1/+1
| | | | | | | | | | Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: convert ceph_pagelist.refcnt from atomic_t to refcount_tElena Reshetova2017-05-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: convert ceph_cap_snap.nref from atomic_t to refcount_tElena Reshetova2017-05-043-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: convert ceph_mds_session.s_ref from atomic_t to refcount_tElena Reshetova2017-05-042-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph, ceph: always advertise all supported featuresIlya Dryomov2017-05-041-6/+1
| | | | | | | | | | | | | | No reason to hide CephFS-specific features in the rbd case. Recent feature bits mix RADOS and CephFS-specific stuff together anyway. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | Merge branch 'for-linus-4.12' of ↵Linus Torvalds2017-05-1037-827/+1437
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has fixes and cleanups Dave Sterba collected for the merge window. The biggest functional fixes are between btrfs raid5/6 and scrub, and raid5/6 and device replacement. Some of our pending qgroup fixes are included as well while I bash on the rest in testing. We also have the usual set of cleanups, including one that makes __btrfs_map_block() much more maintainable, and conversions from atomic_t to refcount_t" * 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits) btrfs: fix the gfp_mask for the reada_zones radix tree Btrfs: fix reported number of inode blocks Btrfs: send, fix file hole not being preserved due to inline extent Btrfs: fix extent map leak during fallocate error path Btrfs: fix incorrect space accounting after failure to insert inline extent Btrfs: fix invalid attempt to free reserved space on failure to cow range btrfs: Handle delalloc error correctly to avoid ordered extent hang btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error btrfs: check if the device is flush capable btrfs: delete unused member nobarriers btrfs: scrub: Fix RAID56 recovery race condition btrfs: scrub: Introduce full stripe lock for RAID56 btrfs: Use ktime_get_real_ts for root ctime Btrfs: handle only applicable errors returned by btrfs_get_extent btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option btrfs: use q which is already obtained from bdev_get_queue Btrfs: switch to div64_u64 if with a u64 divisor Btrfs: update scrub_parity to use u64 stripe_len Btrfs: enable repair during read for raid56 profile btrfs: use clear_page where appropriate ...
| * | btrfs: fix the gfp_mask for the reada_zones radix treeChris Mason2017-05-042-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commits cc8385b59e17 and 7ef70b4d9987a7 added preallocation for the reada radix trees and also switched them over to GFP_KERNEL for the default gfp mask. Since we're doing radix tree insertions under spinlocks, we need to make sure the mask doesn't allow sleeping. This fix keeps the radix preallocation but switches back to the original gfp_mask. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | Merge branch 'for-chris-4.12' of ↵Chris Mason2017-04-275-66/+277
| |\ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.12
| | * | Btrfs: fix reported number of inode blocksFilipe Manana2017-04-264-18/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when there are buffered writes that were not yet flushed and they fall within allocated ranges of the file (that is, not in holes or beyond eof assuming there are no prealloc extents beyond eof), btrfs simply reports an incorrect number of used blocks through the stat(2) system call (or any of its variants), regardless of mount options or inode flags (compress, compress-force, nodatacow). This is because the number of blocks used that is reported is based on the current number of bytes in the vfs inode plus the number of dealloc bytes in the btrfs inode. The later covers bytes that both fall within allocated regions of the file and holes. Example scenarios where the number of reported blocks is wrong while the buffered writes are not flushed: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec) # The following should have reported 64K... $ du -h /mnt/sdc/foo1 128K /mnt/sdc/foo1 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo1 64K /mnt/sdc/foo1 $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec) $ sync $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 65536 64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec) # The following should have reported 128K... $ du -h /mnt/sdc/foo2 192K /mnt/sdc/foo2 $ sync # After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo2 128K /mnt/sdc/foo2 So the number of used file blocks is simply incorrect, unlike in other filesystems such as ext4 and xfs for example, but only while the buffered writes are not flushed. Fix this by tracking the number of delalloc bytes that fall within holes and beyond eof of a file, and use instead this new counter when reporting the number of used blocks for an inode. Another different problem that exists is that the delalloc bytes counter is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from the respective range in the inode's iotree) and the vfs inode's bytes counter is only incremented when writeback finishes (through insert_reserved_file_extent()). Therefore while writeback is ongoing we simply report a wrong number of blocks used by an inode if the write operation covers a range previously unallocated. While this change does not fix this problem, it does minimizes it a lot by shortening that time window, as the new dealloc bytes counter (new_delalloc_bytes) is only decremented when writeback finishes right before updating the vfs inode's bytes counter. Fully fixing this second problem is not trivial and will be addressed later by a different patch. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * | Btrfs: send, fix file hole not being preserved due to inline extentFilipe Manana2017-04-261-2/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Normally we don't have inline extents followed by regular extents, but there's currently at least one harmless case where this happens. For example, when the page size is 4Kb and compression is enabled: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar In this case we get a compressed inline extent, representing 4Kb of data, followed by a hole extent and then a regular data extent. The inline extent was not expanded/converted to a regular extent exactly because it represents 4Kb of data. This does not cause any apparent problem (such as the issue solved by commit e1699d2d7bf6 ("btrfs: add missing memset while reading compressed inline extents")) except trigger an unexpected case in the incremental send code path that makes us issue an operation to write a hole when it's not needed, resulting in more writes at the receiver and wasting space at the receiver. So teach the incremental send code to deal with this particular case. The issue can be currently triggered by running fstests btrfs/137 with compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| | * | Btrfs: fix extent map leak during fallocate error pathFilipe Manana2017-04-261-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the call to btrfs_qgroup_reserve_data() failed, we were leaking an extent map structure. The failure can happen either due to an -ENOMEM condition or, when quotas are enabled, due to -EDQUOT for example. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
| | * | Btrfs: fix incorrect space accounting after failure to insert inline extentFilipe Manana2017-04-261-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using compression, if we fail to insert an inline extent we incorrectly end up attempting to free the reserved data space twice, once through extent_clear_unlock_delalloc(), because we pass it the flag EXTENT_DO_ACCOUNTING, and once through a direct call to btrfs_free_reserved_data_space_noquota(). This results in a trace like the following: [ 834.576240] ------------[ cut here ]------------ [ 834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs] [ 834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2 [ 834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs] [ 834.596103] Call Trace: [ 834.596103] dump_stack+0x67/0x90 [ 834.596103] __warn+0xc2/0xdd [ 834.596103] warn_slowpath_null+0x1d/0x1f [ 834.596103] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 834.596103] compress_file_range.constprop.42+0x2fa/0x3fc [btrfs] [ 834.596103] ? submit_compressed_extents+0x3a7/0x3a7 [btrfs] [ 834.596103] async_cow_start+0x32/0x4d [btrfs] [ 834.596103] btrfs_scrubparity_helper+0x187/0x3e7 [btrfs] [ 834.596103] btrfs_delalloc_helper+0xe/0x10 [btrfs] [ 834.596103] process_one_work+0x273/0x4e4 [ 834.596103] worker_thread+0x1eb/0x2ca [ 834.596103] ? rescuer_thread+0x2b6/0x2b6 [ 834.596103] kthread+0x100/0x108 [ 834.596103] ? __list_del_entry+0x22/0x22 [ 834.596103] ret_from_fork+0x2e/0x40 [ 834.611656] ---[ end trace 719902fe6bdef08f ]--- So fix this by not calling directly btrfs_free_reserved_data_space_noquota() if an error happened. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * | Btrfs: fix invalid attempt to free reserved space on failure to cow rangeFilipe Manana2017-04-262-18/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When attempting to COW a file range (we are starting writeback and doing COW), if we manage to reserve an extent for the range we will write into but fail after reserving it and before creating the respective ordered extent, we end up in an error path where we attempt to decrement the data space's bytes_may_use counter after we already did it while reserving the extent, leading to a warning/trace like the following: [ 847.621524] ------------[ cut here ]------------ [ 847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg [ 847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2 [ 847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 847.648601] Call Trace: [ 847.648601] dump_stack+0x67/0x90 [ 847.648601] __warn+0xc2/0xdd [ 847.648601] warn_slowpath_null+0x1d/0x1f [ 847.648601] btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs] [ 847.648601] btrfs_clear_bit_hook+0x140/0x258 [btrfs] [ 847.648601] clear_state_bit+0x87/0x128 [btrfs] [ 847.648601] __clear_extent_bit+0x222/0x2b7 [btrfs] [ 847.648601] clear_extent_bit+0x17/0x19 [btrfs] [ 847.648601] extent_clear_unlock_delalloc+0x3b/0x6b [btrfs] [ 847.648601] cow_file_range.isra.39+0x387/0x39a [btrfs] [ 847.648601] run_delalloc_nocow+0x4d7/0x70e [btrfs] [ 847.648601] ? arch_local_irq_save+0x9/0xc [ 847.648601] run_delalloc_range+0xa7/0x2b5 [btrfs] [ 847.648601] writepage_delalloc.isra.31+0xb9/0x15c [btrfs] [ 847.648601] __extent_writepage+0x249/0x2e8 [btrfs] [ 847.648601] extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs] [ 847.648601] ? arch_local_irq_save+0x9/0xc [ 847.648601] ? mark_lock+0x24/0x201 [ 847.648601] extent_writepages+0x4b/0x5c [btrfs] [ 847.648601] ? btrfs_writepage_start_hook+0xed/0xed [btrfs] [ 847.648601] btrfs_writepages+0x28/0x2a [btrfs] [ 847.648601] do_writepages+0x23/0x2c [ 847.648601] __filemap_fdatawrite_range+0x5a/0x61 [ 847.648601] filemap_fdatawrite_range+0x13/0x15 [ 847.648601] btrfs_fdatawrite_range+0x20/0x46 [btrfs] [ 847.648601] start_ordered_ops+0x19/0x23 [btrfs] [ 847.648601] btrfs_sync_file+0x136/0x42c [btrfs] [ 847.648601] vfs_fsync_range+0x8c/0x9e [ 847.648601] vfs_fsync+0x1c/0x1e [ 847.648601] do_fsync+0x31/0x4a [ 847.648601] SyS_fsync+0x10/0x14 [ 847.648601] entry_SYSCALL_64_fastpath+0x18/0xad [ 847.648601] RIP: 0033:0x7f5b05200800 [ 847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [ 847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800 [ 847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003 [ 847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f [ 847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046 [ 847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740 [ 847.648601] ? trace_hardirqs_off_caller+0x3f/0xaa [ 847.685787] ---[ end trace 2a4a3e15382508e8 ]--- So fix this by not attempting to decrement the data space info's bytes_may_use counter if we already reserved the extent and an error happened before creating the ordered extent. We are already correctly freeing the reserved extent if an error happens, so there's no additional measure needed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| | * | btrfs: Handle delalloc error correctly to avoid ordered extent hangQu Wenruo2017-04-261-15/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] If run_delalloc_range() returns error and there is already some ordered extents created, btrfs will be hanged with the following backtrace: Call Trace: __schedule+0x2d4/0xae0 schedule+0x3d/0x90 btrfs_start_ordered_extent+0x160/0x200 [btrfs] ? wake_atomic_t_function+0x60/0x60 btrfs_run_ordered_extent_work+0x25/0x40 [btrfs] btrfs_scrubparity_helper+0x1c1/0x620 [btrfs] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] process_one_work+0x2af/0x720 ? process_one_work+0x22b/0x720 worker_thread+0x4b/0x4f0 kthread+0x10f/0x150 ? process_one_work+0x720/0x720 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 [CAUSE] |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<>| |<---------- cleanup range --------->| || \_=> First page handled by end_extent_writepage() in __extent_writepage() The problem is caused by error handler of run_delalloc_range(), which doesn't handle any created ordered extents, leaving them waiting on btrfs_finish_ordered_io() to finish. However after run_delalloc_range() returns error, __extent_writepage() won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered except the first page, and btrfs_finish_ordered_io() won't be triggered for created ordered extents either. So OE 2~n will hang forever, and if OE 1 is larger than one page, it will also hang. [FIX] Introduce btrfs_cleanup_ordered_extents() function to cleanup created ordered extents and finish them manually. The function is based on existing btrfs_endio_direct_write_update_ordered() function, and modify it to act just like btrfs_writepage_endio_hook() but handles specified range other than one page. After fix, delalloc error will be handled like: |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<>|<-------- ----------->|<------ old error handler --------->| || || || \_=> Cleaned up by cleanup_ordered_extents() \_=> First page handled by end_extent_writepage() in __extent_writepage() Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * | btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum errorQu Wenruo2017-04-261-12/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] When btrfs_reloc_clone_csum() reports error, it can underflow metadata and leads to kernel assertion on outstanding extents in run_delalloc_nocow() and cow_file_range(). BTRFS info (device vdb5): relocating block group 12582912 flags data BTRFS info (device vdb5): found 1 extents assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858 Currently, due to another bug blocking ordered extents, the bug is only reproducible under certain block group layout and using error injection. a) Create one data block group with one 4K extent in it. To avoid the bug that hangs btrfs due to ordered extent which never finishes b) Make btrfs_reloc_clone_csum() always fail c) Relocate that block group [CAUSE] run_delalloc_nocow() and cow_file_range() handles error from btrfs_reloc_clone_csum() wrongly: (The ascii chart shows a more generic case of this bug other than the bug mentioned above) |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- cleanup range --------------->| |<----------- ----------->| \/ btrfs_finish_ordered_io() range So error handler, which calls extent_clear_unlock_delalloc() with EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() will both cover OE n, and free its metadata, causing metadata under flow. [Fix] The fix is to ensure after calling btrfs_add_ordered_extent(), we only call error handler after increasing the iteration offset, so that cleanup range won't cover any created ordered extent. |<------------------ delalloc range --------------------------->| | OE 1 | OE 2 | ... | OE n | |<----------- ----------->|<---------- cleanup range --------->| \/ btrfs_finish_ordered_io() range Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| * | | btrfs: check if the device is flush capableAnand Jain2017-04-181-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The block layer call chain from submit_bio will check if the write cache is enabled for the given queue before submitting the flush. This will add a code to fail fast if its not. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ updated changelog to reflect current code stat, blkdev_issue_flush is not used yet ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: delete unused member nobarriersAnand Jain2017-04-182-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The last consumer of nobarriers is removed by the commit [1] and sync won't fail with EOPNOTSUPP anymore. Thus, now when write cache is write through it just return success without actually transpiring such a request to the block device/lun. [1] commit b25de9d6da49b1a8760a89672283128aa8c78345 block: remove BIO_EOPNOTSUPP And, as the device/lun write cache state may change dynamically saving such as state won't help either. So deleting the member nobarriers. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: scrub: Fix RAID56 recovery race conditionQu Wenruo2017-04-181-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When scrubbing a RAID5 which has recoverable data corruption (only one data stripe is corrupted), sometimes scrub will report more csum errors than expected. Sometimes even unrecoverable error will be reported. The problem can be easily reproduced by the following steps: 1) Create a btrfs with RAID5 data profile with 3 devs 2) Mount it with nospace_cache or space_cache=v2 To avoid extra data space usage. 3) Create a 128K file and sync the fs, unmount it Now the 128K file lies at the beginning of the data chunk 4) Locate the physical bytenr of data chunk on dev3 Dev3 is the 1st data stripe. 5) Corrupt the first 64K of the data chunk stripe on dev3 6) Mount the fs and scrub it The correct csum error number should be 16 (assuming using x86_64). Larger csum error number can be reported in a 1/3 chance. And unrecoverable error can also be reported in a 1/10 chance. The root cause of the problem is RAID5/6 recover code has race condition, due to the fact that full scrub is initiated per device. While for other mirror based profiles, each mirror is independent with each other, so race won't cause any big problem. For example: Corrupted | Correct | Correct | | Scrub dev3 (D1) | Scrub dev2 (D2) | Scrub dev1(P) | ------------------------------------------------------------------------ Read out D1 |Read out D2 |Read full stripe | Check csum |Check csum |Check parity | Csum mismatch |Csum match, continue |Parity mismatch | handle_errored_block | |handle_errored_block | Read out full stripe | | Read out full stripe| D1 csum error(err++) | | D1 csum error(err++)| Recover D1 | | Recover D1 | So D1's csum error is accounted twice, just because handle_errored_block() doesn't have enough protection, and race can happen. On even worse case, for example D1's recovery code is re-writing D1/D2/P, and P's recovery code is just reading out full stripe, then we can cause unrecoverable error. This patch will use previously introduced lock_full_stripe() and unlock_full_stripe() to protect the whole scrub_handle_errored_block() function for RAID56 recovery. So no extra csum error nor unrecoverable error. Reported-by: Goffredo Baroncelli <kreijack@libero.it> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: scrub: Introduce full stripe lock for RAID56Qu Wenruo2017-04-183-0/+251
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike mirror based profiles, RAID5/6 recovery needs to read out the whole full stripe. And if we don't do proper protection, it can easily cause race condition. Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe() for RAID5/6. Which store a rb_tree of mutexes for full stripes, so scrub callers can use them to lock a full stripe to avoid race. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: Use ktime_get_real_ts for root ctimeDeepa Dinamani2017-04-181-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_root_item maintains the ctime for root updates. This is not part of vfs_inode. Since current_time() uses struct inode* as an argument as Linus suggested, this cannot be used to update root times unless, we modify the signature to use inode. Since btrfs uses nanosecond time granularity, it can also use ktime_get_real_ts directly to obtain timestamp for the root. It is necessary to use the timespec time api here because the same btrfs_set_stack_timespec_*() apis are used for vfs inode times as well. These can be transitioned to using timespec64 when btrfs internally changes to use timespec64 as well. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: handle only applicable errors returned by btrfs_get_extentDan Carpenter2017-04-182-26/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_get_extent() never returns NULL pointers, so this code introduces a static checker warning. The btrfs_get_extent() is a bit complex, but trust me that it doesn't return NULLs and also if it did we would trigger the BUG_ON(!em) before the last return statement. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> [ updated subject ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount optionQu Wenruo2017-04-181-7/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] The easist way to reproduce the bug is: ------ # mkfs.btrfs -f $dev -n 16K # mount $dev $mnt -o inode_cache # btrfs quota enable $mnt # btrfs quota rescan -w $mnt # btrfs qgroup show $mnt qgroupid rfer excl -------- ---- ---- 0/5 32.00KiB 32.00KiB ^^ Twice the correct value ------ And fstests/btrfs qgroup test group can easily detect them with inode_cache mount option. Although some of them are false alerts since old test cases are using fixed golden output. While new test cases will use "btrfs check" to detect qgroup mismatch. [CAUSE] Inode_cache mount option will make commit_fs_roots() to call btrfs_save_ino_cache() to update fs/subvol trees, and generate new delayed refs. However we call btrfs_qgroup_prepare_account_extents() too early, before commit_fs_roots(). This makes the "old_roots" for newly generated extents are always NULL. For freeing extent case, this makes both new_roots and old_roots to be empty, while correct old_roots should not be empty. This causing qgroup numbers not decreased correctly. [FIX] Modify the timing of calling btrfs_qgroup_prepare_account_extents() to just before btrfs_qgroup_account_extents(), and add needed delayed_refs handler. So qgroup can handle inode_map mount options correctly. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: use q which is already obtained from bdev_get_queueAnand Jain2017-04-181-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have already assigned q from bdev_get_queue() so use it. And rearrange the code for better view. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: switch to div64_u64 if with a u64 divisorLiu Bo2017-04-183-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is fixing code pieces where we use div_u64 when passing a u64 divisor. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: update scrub_parity to use u64 stripe_lenLiu Bo2017-04-181-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 3d8da6781760 ("Btrfs: fix divide error upon chunk's stripe_len") changed stripe_len in struct map_lookup to u64, but didn't update stripe_len in struct scrub_parity. This updates the type and switches to div64_u64_rem to match u64 divisor. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: enable repair during read for raid56 profileLiu Bo2017-04-181-13/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that scrub can fix data errors with the help of parity for raid56 profile, repair during read is able to as well. Although the mirror num in raid56 scenario has different meanings, i.e. 0 or 1: read data directly > 1: do recover with parity, it could be fit into how we repair bad block during read. The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the device and position on it. Cc: David Sterba <dsterba@suse.cz> Tested-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: use clear_page where appropriateDavid Sterba2017-04-182-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a helper to clear whole page, with a arch-specific optimized code. The replaced cases do not seem to be in performace critical code, but we still might get some percent gain. Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: Prevent scrub recheck from racing with dev replaceQu Wenruo2017-04-181-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses bbio without protection of bio_counter. This can lead to use-after-free if racing with dev replace cancel. Fix it by increasing bio_counter before calling btrfs_map_sblock() and decreasing the bio_counter when corresponding recover is finished. Cc: Liu Bo <bo.li.liu@oracle.com> Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: Wait for in-flight bios before freeing target device for raid56Qu Wenruo2017-04-183-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When raid56 dev-replace is cancelled by running scrub, we will free target device without waiting for in-flight bios, causing the following NULL pointer deference or general protection failure. BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0 IP: generic_make_request_checks+0x4d/0x610 CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G O 4.11.0-rc2 #72 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs] task: ffff88002875b4c0 task.stack: ffffc90001334000 RIP: 0010:generic_make_request_checks+0x4d/0x610 Call Trace: ? generic_make_request+0xc7/0x360 generic_make_request+0x24/0x360 ? generic_make_request+0xc7/0x360 submit_bio+0x64/0x120 ? page_in_rbio+0x4d/0x80 [btrfs] ? rbio_orig_end_io+0x80/0x80 [btrfs] finish_rmw+0x3f4/0x540 [btrfs] validate_rbio_for_rmw+0x36/0x40 [btrfs] raid_rmw_end_io+0x7a/0x90 [btrfs] bio_endio+0x56/0x60 end_workqueue_fn+0x3c/0x40 [btrfs] btrfs_scrubparity_helper+0xef/0x620 [btrfs] btrfs_endio_raid56_helper+0xe/0x10 [btrfs] process_one_work+0x2af/0x720 ? process_one_work+0x22b/0x720 worker_thread+0x4b/0x4f0 kthread+0x10f/0x150 ? process_one_work+0x720/0x720 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8 In btrfs_dev_replace_finishing(), we will call btrfs_rm_dev_replace_blocked() to wait bios before destroying the target device when scrub is finished normally. However when dev-replace is aborted, either due to error or cancelled by scrub, we didn't wait for bios, this can lead to use-after-free if there are bios holding the target device. Furthermore, for raid56 scrub, at least 2 places are calling btrfs_map_sblock() without protection of bio_counter, leading to the problem. This patch fixes the problem: 1) Wait for bio_counter before freeing target device when canceling replace 2) When calling btrfs_map_sblock() for raid56, use bio_counter to protect the call. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: scrub: Don't append on-disk pages for raid56 scrubQu Wenruo2017-04-181-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the following situation, scrub will calculate wrong parity to overwrite the correct one: RAID5 full stripe: Before | Dev 1 | Dev 2 | Dev 3 | | Data stripe 1 | Data stripe 2 | Parity Stripe | --------------------------------------------------- 0 | 0x0000 (Bad) | 0xcdcd | 0x0000 | --------------------------------------------------- 4K | 0xcdcd | 0xcdcd | 0x0000 | ... | 0xcdcd | 0xcdcd | 0x0000 | --------------------------------------------------- 64K After scrubbing dev3 only: | Dev 1 | Dev 2 | Dev 3 | | Data stripe 1 | Data stripe 2 | Parity Stripe | --------------------------------------------------- 0 | 0xcdcd (Good) | 0xcdcd | 0xcdcd (Bad) | --------------------------------------------------- 4K | 0xcdcd | 0xcdcd | 0x0000 | ... | 0xcdcd | 0xcdcd | 0x0000 | --------------------------------------------------- 64K The reason is that after raid56 read rebuild rbio->stripe_pages are all correctly recovered (0xcd for data stripes). However when we check and repair parity in scrub_parity_check_and_repair(), we will append pages in sparity->spages list to rbio->bio_pages[], which contains old on-disk data. And when we submit parity data to disk, we calculate parity using rbio->bio_pages[] first, if rbio->bio_pages[] not found, then fallback to rbio->stripe_pages[]. The patch fix it by not appending pages from sparity->spages. So finish_parity_scrub() will use rbio->stripe_pages[] which is correct. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: qgroup: Re-arrange tracepoint timing to co-operate with reserved ↵Qu Wenruo2017-04-182-10/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | space tracepoint Newly introduced qgroup reserved space trace points are normally nested into several common qgroup operations. While some other trace points are not well placed to co-operate with them, causing confusing output. This patch re-arrange trace_btrfs_qgroup_release_data() and trace_btrfs_qgroup_free_delayed_ref() trace points so they are triggered before reserved space ones. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: qgroup: Add trace point for qgroup reserved spaceQu Wenruo2017-04-182-44/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce the following trace points: qgroup_update_reserve qgroup_meta_reserve These trace points are handy to trace qgroup reserve space related problems. Also export btrfs_qgroup structure, as now we directly pass btrfs_qgroup structure to trace points, so that structure needs to be exported. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: drop redundant parameters from btrfs_map_sblockDavid Sterba2017-04-183-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | All callers pass 0 for mirror_num and 1 for need_raid_map. Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: sink GFP flags parameter to tree_mod_log_insert_rootDavid Sterba2017-04-181-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | All (1) callers pass the same value. Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: sink GFP flags parameter to tree_mod_log_insert_moveDavid Sterba2017-04-181-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | All (1) callers pass the same value. Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: fix wrong failed mirror_num of read-repair on raid56Liu Bo2017-04-181-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In raid56 scenario, after trying parity recovery, we didn't set mirror_num for btrfs_bio with failed mirror_num, hence end_bio_extent_readpage() will report a random mirror_num in dmesg log. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: set scrub page's io_error if failing to submit ioLiu Bo2017-04-181-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Scrub repairs data by the unit called scrub_block, which may contain several pages. Scrub always tries to look up a good copy of a whole block, but if there's no such copy, it tries to do repair page by page. If we don't set page's io_error when checking this bad copy, in the last step, we may skip this page when repairing bad copy from good copy. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: track exclusive filesystem operation in flagsDavid Sterba2017-04-184-26/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several operations, usually started from ioctls, that cannot run concurrently. The status is tracked in mutually_exclusive_operation_running as an atomic_t. We can easily track the status as one of the per-filesystem flag bits with same synchronization guarantees. The conversion replaces: * atomic_xchg(..., 1) -> test_and_set_bit(FLAG, ...) * atomic_set(..., 0) -> clear_bit(FLAG, ...) Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>