summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
...
| | * | | | | | | | btrfs: Make btrfs_orphan_add take btrfs_inodeNikolay Borisov2017-02-284-22/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make btrfs_orphan_del take btrfs_inodeNikolay Borisov2017-02-281-20/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make btrfs_free_io_failure_record take btrfs_inodeNikolay Borisov2017-02-283-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make clean_io_failure take btrfs_inodeNikolay Borisov2017-02-283-14/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make repair_io_failure take btrfs_inodeNikolay Borisov2017-02-283-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make check_compressed_csum take btrfs_inodeNikolay Borisov2017-02-281-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make btrfs_print_data_csum_error take btrfs_inodeNikolay Borisov2017-02-283-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make free_io_failure take btrfs_inodeNikolay Borisov2017-02-283-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make lock_and_cleanup_extent_if_need take btrfs_inodeNikolay Borisov2017-02-281-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make check_can_nocow take btrfs_inodeNikolay Borisov2017-02-281-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_lookup_ordered_range take btrfs_inodeNikolay Borisov2017-02-286-18/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_mark_extent_written take btrfs_inodeNikolay Borisov2017-02-283-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make fill_holes take btrfs_inodeNikolay Borisov2017-02-281-19/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make hole_mergeable take btrfs_inodeNikolay Borisov2017-02-281-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_drop_extent_cache take btrfs_inodeNikolay Borisov2017-02-286-26/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_requeue_inode_defrag take btrfs_inodeNikolay Borisov2017-02-281-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inodeNikolay Borisov2017-02-283-16/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: all btrfs_delalloc_release_metadata take btrfs_inodeNikolay Borisov2017-02-287-19/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inodeNikolay Borisov2017-02-284-38/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_orphan_release_metadata take btrfs_inodeNikolay Borisov2017-02-283-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inodeNikolay Borisov2017-02-283-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make calc_csum_metadata_size take btrfs_inodeNikolay Borisov2017-02-281-15/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make drop_outstanding_extent take btrfs_inodeNikolay Borisov2017-02-281-12/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inodeNikolay Borisov2017-02-283-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make btrfs_is_free_space_inode take btrfs_inodeNikolay Borisov2017-02-285-19/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_i_size_write take btrfs_inodeNikolay Borisov2017-02-287-19/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_set_inode_index take btrfs_inodeNikolay Borisov2017-02-283-13/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: make btrfs_set_inode_index_count take btrfs_inodeNikolay Borisov2017-02-281-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_insert_dir_item take btrfs_inodeNikolay Borisov2017-02-285-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | | | | | | | btrfs: Make btrfs_log_all_parents take btrfs_inodeNikolay Borisov2017-02-281-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | | | | | Btrfs: try harder to migrate items to left sibling before splitting a leafFilipe Manana2017-02-241-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before attempting to split a leaf we try to migrate items from the leaf to its right and left siblings. We start by trying to move items into the rigth sibling and, if the new item is meant to be inserted at the end of our leaf, we try to free from our leaf an amount of bytes equal to the number of bytes used by the new item, by setting the variable space_needed to the byte size of that new item. However if we fail to move enough items to the right sibling due to lack of space in that sibling, we then try to move items into the left sibling, and in that case we try to free an amount equal to the size of the new item from our leaf, when we need only to free an amount corresponding to the size of the new item minus the current free space of our leaf. So make sure that before we try to move items to the left sibling we do set the variable space_needed with a value corresponding to the new item's size minus the leaf's current free space. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| * | | | | | | | | Btrfs: fix data loss after truncate when using the no-holes featureFilipe Manana2017-02-241-13/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have a file with an implicit hole (NO_HOLES feature enabled) that has an extent following the hole, delayed writes against regions of the file behind the hole happened before but were not yet flushed and then we truncate the file to a smaller size that lies inside the hole, we end up persisting a wrong disk_i_size value for our inode that leads to data loss after umounting and mounting again the filesystem or after the inode is evicted and loaded again. This happens because at inode.c:btrfs_truncate_inode_items() we end up setting last_size to the offset of the extent that we deleted and that followed the hole. We then pass that value to btrfs_ordered_update_i_size() which updates the inode's disk_i_size to a value smaller then the offset of the buffered (delayed) writes. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x01 0K 32K" /mnt/foo $ xfs_io -d -c "pwrite -S 0x02 -b 32K 64K 32K" /mnt/foo $ xfs_io -c "truncate 60K" /mnt/foo --> inode's disk_i_size updated to 0 $ md5sum /mnt/foo 3c5ca3c3ab42f4b04d7e7eb0b0d4d806 /mnt/foo $ umount /dev/sdb $ mount /dev/sdb /mnt $ md5sum /mnt/foo d41d8cd98f00b204e9800998ecf8427e /mnt/foo --> Empty file, all data lost! Cc: <stable@vger.kernel.org> # 3.14+ Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| * | | | | | | | | Btrfs: incremental send, fix unnecessary hole writes for sparse filesFilipe Manana2017-02-241-2/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using the NO_HOLES feature, during an incremental send we often issue write operations for holes when we should not, because that range is already a hole in the destination snapshot. While that does not change the contents of the file at the receiver, it avoids preservation of file holes, leading to wasted disk space and extra IO during send/receive. A couple examples where the holes are not preserved follows. $ mkfs.btrfs -O no-holes -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xaa 0 4K" /mnt/foo $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "pwrite -S 0xbb 1028K 4K" /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap1 # Now add one new extent to our first test file, increasing its size and # leaving a 1Mb hole between the first extent and this new extent. $ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo # Now overwrite the last extent of our second test file. $ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo /mnt/snap2/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 25088..25095 8 0x2000 1: [8..2055]: hole 2048 2: [2056..2063]: 24576..24583 8 0x2001 $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar /mnt/snap2/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 25096..25103 8 0x2000 1: [8..2055]: hole 2048 2: [2056..2063]: 24584..24591 8 0x2001 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt # It's not relevant to enable no-holes in the new filesystem. $ mkfs.btrfs -O no-holes -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive /mnt -f /tmp/1.snap $ btrfs receive /mnt -f /tmp/2.snap $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo /mnt/snap2/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 24576..24583 8 0x2000 1: [8..2063]: 25624..27679 2056 0x1 $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar /mnt/snap2/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 24584..24591 8 0x2000 1: [8..2063]: 27680..29735 2056 0x1 The holes do not exist in the second filesystem and they were replaced with extents filled with the byte 0x00, making each file take 1032Kb of space instead of 8Kb. So fix this by not issuing the write operations consisting of buffers filled with the byte 0x00 when the destination snapshot already has a hole for the respective range. A test case for fstests will follow soon. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | | | | | | | | Btrfs: fix use-after-free due to wrong order of destroying work queuesFilipe Manana2017-02-241-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before we destroy all work queues (and wait for their tasks to complete) we were destroying the work queues used for metadata I/O operations, which can result in a use-after-free problem because most tasks from all work queues do metadata I/O operations. For example, the tasks from the caching workers work queue (fs_info->caching_workers), which is destroyed only after the work queue used for metadata reads (fs_info->endio_meta_workers) is destroyed, do metadata reads, which result in attempts to queue tasks into the later work queue, triggering a use-after-free with a trace like the following: [23114.613543] general protection fault: 0000 [#1] PREEMPT SMP [23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug] [23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 4.9.0-rc7-btrfs-next-36+ #1 [23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs] [23114.616932] task: ffff880221d45780 task.stack: ffffc9000bc50000 [23114.616932] RIP: 0010:[<ffffffffa037c1bf>] [<ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs] [23114.616932] RSP: 0018:ffff88023f443d60 EFLAGS: 00010246 [23114.616932] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000102 [23114.616932] RDX: ffffffffa0419000 RSI: ffff88011df534f0 RDI: ffff880101f01c00 [23114.616932] RBP: ffff88023f443d80 R08: 00000000000f7000 R09: 000000000000ffff [23114.616932] R10: ffff88023f443d48 R11: 0000000000001000 R12: ffff88011df534f0 [23114.616932] R13: ffff880135963868 R14: 0000000000001000 R15: 0000000000001000 [23114.616932] FS: 0000000000000000(0000) GS:ffff88023f440000(0000) knlGS:0000000000000000 [23114.616932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [23114.616932] CR2: 00007f0fb9f8e520 CR3: 0000000001a0b000 CR4: 00000000000006e0 [23114.616932] Stack: [23114.616932] ffff880101f01c00 ffff88011df534f0 ffff880135963868 0000000000001000 [23114.616932] ffff88023f443da0 ffffffffa03470af ffff880149b37200 ffff880135963868 [23114.616932] ffff88023f443db8 ffffffff8125293c ffff880149b37200 ffff88023f443de0 [23114.616932] Call Trace: [23114.616932] <IRQ> [23114.616932] [<ffffffffa03470af>] end_workqueue_bio+0xd5/0xda [btrfs] [23114.616932] [<ffffffff8125293c>] bio_endio+0x54/0x57 [23114.616932] [<ffffffffa0377929>] btrfs_end_bio+0xf7/0x106 [btrfs] [23114.616932] [<ffffffff8125293c>] bio_endio+0x54/0x57 [23114.616932] [<ffffffff8125955f>] blk_update_request+0x21a/0x30f [23114.616932] [<ffffffffa0022316>] scsi_end_request+0x31/0x182 [scsi_mod] [23114.616932] [<ffffffffa00235fc>] scsi_io_completion+0x1ce/0x4c8 [scsi_mod] [23114.616932] [<ffffffffa001ba9d>] scsi_finish_command+0x104/0x10d [scsi_mod] [23114.616932] [<ffffffffa002311f>] scsi_softirq_done+0x101/0x10a [scsi_mod] [23114.616932] [<ffffffff8125fbd9>] blk_done_softirq+0x82/0x8d [23114.616932] [<ffffffff814c8a4b>] __do_softirq+0x1ab/0x412 [23114.616932] [<ffffffff8105b01d>] irq_exit+0x49/0x99 [23114.616932] [<ffffffff81035135>] smp_call_function_single_interrupt+0x24/0x26 [23114.616932] [<ffffffff814c7ec9>] call_function_single_interrupt+0x89/0x90 [23114.616932] <EOI> [23114.616932] [<ffffffffa0023262>] ? scsi_request_fn+0x13a/0x2a1 [scsi_mod] [23114.616932] [<ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a [23114.616932] [<ffffffff814c596c>] ? _raw_spin_unlock_irq+0x32/0x4a [23114.616932] [<ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a [23114.616932] [<ffffffffa0023262>] scsi_request_fn+0x13a/0x2a1 [scsi_mod] [23114.616932] [<ffffffff8125590e>] __blk_run_queue_uncond+0x22/0x2b [23114.616932] [<ffffffff81255930>] __blk_run_queue+0x19/0x1b [23114.616932] [<ffffffff8125ab01>] blk_queue_bio+0x268/0x282 [23114.616932] [<ffffffff81258f44>] generic_make_request+0xbd/0x160 [23114.616932] [<ffffffff812590e7>] submit_bio+0x100/0x11d [23114.616932] [<ffffffff81298603>] ? __this_cpu_preempt_check+0x13/0x15 [23114.616932] [<ffffffff812a1805>] ? __percpu_counter_add+0x8e/0xa7 [23114.616932] [<ffffffffa03bfd47>] btrfsic_submit_bio+0x1a/0x1d [btrfs] [23114.616932] [<ffffffffa0377db2>] btrfs_map_bio+0x1f4/0x26d [btrfs] [23114.616932] [<ffffffffa0348a33>] btree_submit_bio_hook+0x74/0xbf [btrfs] [23114.616932] [<ffffffffa03489bf>] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs] [23114.616932] [<ffffffffa03697a9>] submit_one_bio+0x6b/0x89 [btrfs] [23114.616932] [<ffffffffa036f5be>] read_extent_buffer_pages+0x170/0x1ec [btrfs] [23114.616932] [<ffffffffa03471fa>] ? free_root_pointers+0x64/0x64 [btrfs] [23114.616932] [<ffffffffa0348adf>] readahead_tree_block+0x3f/0x4c [btrfs] [23114.616932] [<ffffffffa032e115>] read_block_for_search.isra.20+0x1ce/0x23d [btrfs] [23114.616932] [<ffffffffa032fab8>] btrfs_search_slot+0x65f/0x774 [btrfs] [23114.616932] [<ffffffffa036eff1>] ? free_extent_buffer+0x73/0x7e [btrfs] [23114.616932] [<ffffffffa0331ba4>] btrfs_next_old_leaf+0xa1/0x33c [btrfs] [23114.616932] [<ffffffffa0331e4f>] btrfs_next_leaf+0x10/0x12 [btrfs] [23114.616932] [<ffffffffa0336aa6>] caching_thread+0x22d/0x416 [btrfs] [23114.616932] [<ffffffffa037bce9>] btrfs_scrubparity_helper+0x187/0x3b6 [btrfs] [23114.616932] [<ffffffffa037c036>] btrfs_cache_helper+0xe/0x10 [btrfs] [23114.616932] [<ffffffff8106cf96>] process_one_work+0x273/0x4e4 [23114.616932] [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca [23114.616932] [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6 [23114.616932] [<ffffffff81072a81>] kthread+0xd5/0xdd [23114.616932] [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a [23114.616932] [<ffffffff814c6257>] ret_from_fork+0x27/0x40 [23114.616932] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b 64 ff 74 04 f0 ff 43 58 49 83 7c 24 08 00 74 2c 4c 8d 6b [23114.616932] RIP [<ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs] [23114.616932] RSP <ffff88023f443d60> [23114.689493] ---[ end trace 6e48b6bc707ca34b ]--- [23114.690166] Kernel panic - not syncing: Fatal exception in interrupt [23114.691283] Kernel Offset: disabled [23114.691918] ---[ end Kernel panic - not syncing: Fatal exception in interrupt The following diagram shows the sequence of operations that lead to the use-after-free problem from the above trace: CPU 1 CPU 2 CPU 3 caching_thread() close_ctree() btrfs_stop_all_workers() btrfs_destroy_workqueue( fs_info->endio_meta_workers) btrfs_search_slot() read_block_for_search() readahead_tree_block() read_extent_buffer_pages() submit_one_bio() btree_submit_bio_hook() btrfs_bio_wq_end_io() --> sets the bio's bi_end_io callback to end_workqueue_bio() --> bio is submitted bio completes and its bi_end_io callback is invoked --> end_workqueue_bio() --> attempts to queue a task on fs_info->endio_meta_workers btrfs_destroy_workqueue( fs_info->caching_workers) So fix this by destroying the queues used for metadata I/O tasks only after destroying all the other queues. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| * | | | | | | | | Btrfs: fix assertion failure when freeing block groups at close_ctree()Filipe Manana2017-02-242-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At close_ctree() we free the block groups and then only after we wait for any running worker kthreads to finish and shutdown the workqueues. This behaviour is racy and it triggers an assertion failure when freeing block groups because while we are doing it we can have for example a block group caching kthread running, and in that case the block group's reference count can still be greater than 1 by the time we assert its reference count is 1, leading to an assertion failure: [19041.198004] assertion failed: atomic_read(&block_group->count) == 1, file: fs/btrfs/extent-tree.c, line: 9799 [19041.200584] ------------[ cut here ]------------ [19041.201692] kernel BUG at fs/btrfs/ctree.h:3418! [19041.202830] invalid opcode: 0000 [#1] PREEMPT SMP [19041.203929] Modules linked in: btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic ppdev sg psmouse acpi_cpufreq pcspkr parport_pc evdev tpm_tis parport tpm_tis_core i2c_piix4 i2c_core tpm serio_raw processor button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs] [19041.208082] CPU: 6 PID: 29051 Comm: umount Not tainted 4.9.0-rc7-btrfs-next-36+ #1 [19041.208082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [19041.208082] task: ffff88015f028980 task.stack: ffffc9000ad34000 [19041.208082] RIP: 0010:[<ffffffffa03e319e>] [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs] [19041.208082] RSP: 0018:ffffc9000ad37d60 EFLAGS: 00010286 [19041.208082] RAX: 0000000000000061 RBX: ffff88015ecb4000 RCX: 0000000000000001 [19041.208082] RDX: ffff88023f392fb8 RSI: ffffffff817ef7ba RDI: 00000000ffffffff [19041.208082] RBP: ffffc9000ad37d60 R08: 0000000000000001 R09: 0000000000000000 [19041.208082] R10: ffffc9000ad37cb0 R11: ffffffff82f2b66d R12: ffff88023431d170 [19041.208082] R13: ffff88015ecb40c0 R14: ffff88023431d000 R15: ffff88015ecb4100 [19041.208082] FS: 00007f44f3d42840(0000) GS:ffff88023f380000(0000) knlGS:0000000000000000 [19041.208082] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [19041.208082] CR2: 00007f65d623b000 CR3: 00000002166f2000 CR4: 00000000000006e0 [19041.208082] Stack: [19041.208082] ffffc9000ad37d98 ffffffffa035989f ffff88015ecb4000 ffff88015ecb5630 [19041.208082] ffff88014f6be000 0000000000000000 00007ffcf0ba6a10 ffffc9000ad37df8 [19041.208082] ffffffffa0368cd4 ffff88014e9658e0 ffffc9000ad37e08 ffffffff811a634d [19041.208082] Call Trace: [19041.208082] [<ffffffffa035989f>] btrfs_free_block_groups+0x17f/0x392 [btrfs] [19041.208082] [<ffffffffa0368cd4>] close_ctree+0x1c5/0x2e1 [btrfs] [19041.208082] [<ffffffff811a634d>] ? evict_inodes+0x132/0x141 [19041.208082] [<ffffffffa034356d>] btrfs_put_super+0x15/0x17 [btrfs] [19041.208082] [<ffffffff8118fc32>] generic_shutdown_super+0x6a/0xeb [19041.208082] [<ffffffff8119004f>] kill_anon_super+0x12/0x1c [19041.208082] [<ffffffffa0343370>] btrfs_kill_super+0x16/0x21 [btrfs] [19041.208082] [<ffffffff8118fad1>] deactivate_locked_super+0x3b/0x68 [19041.208082] [<ffffffff8118fb34>] deactivate_super+0x36/0x39 [19041.208082] [<ffffffff811a9946>] cleanup_mnt+0x58/0x76 [19041.208082] [<ffffffff811a99a2>] __cleanup_mnt+0x12/0x14 [19041.208082] [<ffffffff81071573>] task_work_run+0x6f/0x95 [19041.208082] [<ffffffff81001897>] prepare_exit_to_usermode+0xa3/0xc1 [19041.208082] [<ffffffff81001a23>] syscall_return_slowpath+0x16e/0x1d2 [19041.208082] [<ffffffff814c607d>] entry_SYSCALL_64_fastpath+0xab/0xad [19041.208082] Code: c7 ae a0 3e a0 48 89 e5 e8 4e 74 d4 e0 0f 0b 55 89 f1 48 c7 c2 0b a4 3e a0 48 89 fe 48 c7 c7 a4 a6 3e a0 48 89 e5 e8 30 74 d4 e0 <0f> 0b 55 31 d2 48 89 e5 e8 d5 b9 f7 ff 5d c3 48 63 f6 55 31 c9 [19041.208082] RIP [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs] [19041.208082] RSP <ffffc9000ad37d60> [19041.279264] ---[ end trace 23330586f16f064d ]--- This started happening as of kernel 4.8, since commit f3bca8028bd9 ("Btrfs: add ASSERT for block group's memory leak") introduced these assertions. So fix this by freeing the block groups only after waiting for all worker kthreads to complete and shutdown the workqueues. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| * | | | | | | | | Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabledFilipe Manana2017-02-241-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We log holes explicitly by using file extent items, however when replaying a log tree, if a logged file extent item corresponds to a hole and the NO_HOLES feature is enabled we do not need to copy the file extent item into the fs/subvolume tree, as the absence of such file extent items is the purpose of the NO_HOLES feature. So skip the copying of file extent items representing holes when the NO_HOLES feature is enabled. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | | | | | | | | Btrfs: fix leak of subvolume writers counterRobbie Ko2017-02-241-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When falling back from a nocow write to a regular cow write, we were leaking the subvolume writers counter in 2 situations, preventing snapshot creation from ever completing in the future, as it waits for that counter to go down to zero before the snapshot creation starts. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Improved changelog and subject] Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | | | | | | | | Btrfs: bulk delete checksum items in the same leafFilipe Manana2017-02-241-1/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Very often we have the checksums for an extent spread in multiple items in the checksums tree, and currently the algorithm to delete them starts by looking for them one by one and then deleting them one by one, which is not optimal since each deletion involves shifting all the other items in the leaf and when the leaf reaches some low threshold, to move items off the leaf into its left and right neighbor leafs. Also, after each item deletion we release our search path and start a new search for other checksums items. So optimize this by deleting in bulk all the items in the same leaf that contain checksums for the extent being freed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| * | | | | | | | | Btrfs: incremental send, do not issue invalid rmdir operationsRobbie Ko2017-02-241-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When both the parent and send snapshots have a directory inode with the same number but different generations (therefore they are different inodes) and both have an entry with the same name, an incremental send stream will contain an invalid rmdir operation that refers to the orphanized name of the inode from the parent snapshot. The following example scenario shows how this happens. Parent snapshot: . |---- d259_old/ (ino 259, gen 9) | |---- d1/ (ino 258, gen 9) | |---- f (ino 257, gen 9) Send snapshot: . |---- d258/ (ino 258, gen 7) |---- d259/ (ino 259, gen 7) |---- d1/ (ino 257, gen 7) When the kernel is processing inode 258 it notices that in both snapshots there is an inode numbered 259 that is a parent of an inode 258. However it ignores the fact that the inodes numbered 259 have different generations in both snapshots, which means they are effectively different inodes. Then it checks that both inodes 259 have a dentry named "d1" and because of that it issues a rmdir operation with orphanized name of the inode 258 from the parent snapshot. This happens at send.c:process_record_refs(), which calls send.c:did_overwrite_first_ref() that returns true and because of that later on at process_recorded_refs() such rmdir operation is issued because the inode being currently processed (258) is a directory and it was deleted in the send snapshot (and replaced with another inode that has the same number and is a directory too). Fix this issue by comparing the generations of parent directory inodes that have the same number and make send.c:did_overwrite_first_ref() when the generations are different. The following steps reproduce the problem. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/f $ mkdir /mnt/d1 $ mkdir /mnt/d259_old $ mv /mnt/d1 /mnt/d259_old/d1 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/d1 $ mkdir /mnt/dir258 $ mkdir /mnt/dir259 $ mv /mnt/d1 /mnt/dir259/d1 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt/ -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inodes from the second snapshot all have # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. This means all the inodes # in the first snapshot (snap1) have a generation value of 9. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt -f /tmp/1.snap $ btrfs receive -vv /mnt -f /tmp/2.snap receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9 utimes unlink f mkdir o257-7-0 mkdir o259-7-0 rename o257-7-0 -> o259-7-0/d1 chown o259-7-0/d1 - uid=0, gid=0 chmod o259-7-0/d1 - mode=0755 utimes o259-7-0/d1 rmdir o258-9-0 ERROR: rmdir o258-9-0 failed: No such file or directory Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | | | | | | | | Btrfs: incremental send, do not delay rename when parent inode is newFilipe Manana2017-02-241-3/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we are checking if we need to delay the rename operation for an inode we not checking if a parent inode that exists in the send and parent snapshots is really the same inode or not, that is, we are not comparing the generation number of the parent inode in the send and parent snapshots. Not only this results in unnecessarily delaying a rename operation but also can later on make us generate an incorrect name for a new inode in the send snapshot that has the same number as another inode in the parent snapshot but a different generation. Here follows an example where this happens. Parent snapshot: . (ino 256, gen 3) |--- dir258/ (ino 258, gen 7) | |--- dir257/ (ino 257, gen 7) | |--- dir259/ (ino 259, gen 7) Send snapshot: . (ino 256, gen 3) |--- file258 (ino 258, gen 10) | |--- new_dir259/ (ino 259, gen 10) |--- dir257/ (ino 257, gen 7) The following steps happen when computing the incremental send stream: 1) When processing inode 257, its new parent is created using its orphan name (o257-21-0), and the rename operation for inode 257 is delayed because its new parent (inode 259) was not yet processed - this decision to delay the rename operation does not make much sense because the inode 259 in the send snapshot is a new inode, it's not the same as inode 259 in the parent snapshot. 2) When processing inode 258 we end up delaying its rmdir operation, because inode 257 was not yet renamed (moved away from the directory inode 258 represents). We also create the new inode 258 using its orphan name "o258-10-0", then rename it to its final name of "file258" and then issue a truncate operation for it. However this truncate operation contains an incorrect name, which corresponds to the orphan name and not to the final name, which makes the receiver fail. This happens because when we attempt to compute the inode's current name we verify that there's another inode with the same number (258) that has its rmdir operation pending and because of that we generate an orphan name for the new inode 258 (we do this in the function get_cur_path()). Fix this by not delayed the rename operation of an inode if it has parents with the same number but different generations in both snapshots. The following steps reproduce this example scenario. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir257 $ mkdir /mnt/dir258 $ mkdir /mnt/dir259 $ mv /mnt/dir257 /mnt/dir258/dir257 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/dir258/dir257 /mnt/dir257 $ rmdir /mnt/dir258 $ rmdir /mnt/dir259 # Remount the filesystem so that the next created inodes will have the # numbers 258 and 259. This is because when a filesystem is mounted, # btrfs sets the subvolume's inode counter to a value corresponding to # the highest inode number in the subvolume plus 1. This inode counter # is used to assign a unique number to each new inode and it's # incremented by 1 after very inode creation. # Note: we unmount and then mount instead of doing a mount with # "-o remount" because otherwise the inode counter remains at value 260. $ umount /mnt $ mount /dev/sdb /mnt $ touch /mnt/file258 $ mkdir /mnt/new_dir259 $ mv /mnt/dir257 /mnt/new_dir259/dir257 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive /mnt -f /tmo/1.snap $ btrfs receive /mnt -f /tmo/2.snap -vv receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7 utimes mkdir o259-10-0 rename dir258 -> o258-7-0 utimes mkfile o258-10-0 rename o258-10-0 -> file258 utimes truncate o258-10-0 size=0 ERROR: truncate o258-10-0 failed: No such file or directory Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | | | | | | | | Btrfs: send, fix failure to rename top level inode due to name collisionRobbie Ko2017-02-241-1/+4
| |/ / / / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under certain situations, an incremental send operation can fail due to a premature attempt to create a new top level inode (a direct child of the subvolume/snapshot root) whose name collides with another inode that was removed from the send snapshot. Consider the following example scenario. Parent snapshot: . (ino 256, gen 8) |---- a1/ (ino 257, gen 9) |---- a2/ (ino 258, gen 9) Send snapshot: . (ino 256, gen 3) |---- a2/ (ino 257, gen 7) In this scenario, when receiving the incremental send stream, the btrfs receive command fails like this (ran in verbose mode, -vv argument): rmdir a1 mkfile o257-7-0 rename o257-7-0 -> a2 ERROR: rename o257-7-0 -> a2 failed: Is a directory What happens when computing the incremental send stream is: 1) An operation to remove the directory with inode number 257 and generation 9 is issued. 2) An operation to create the inode with number 257 and generation 7 is issued. This creates the inode with an orphanized name of "o257-7-0". 3) An operation rename the new inode 257 to its final name, "a2", is issued. This is incorrect because inode 258, which has the same name and it's a child of the same parent (root inode 256), was not yet processed and therefore no rmdir operation for it was yet issued. The rename operation is issued because we fail to detect that the name of the new inode 257 collides with inode 258, because their parent, a subvolume/snapshot root (inode 256) has a different generation in both snapshots. So fix this by ignoring the generation value of a parent directory that matches a root inode (number 256) when we are checking if the name of the inode currently being processed collides with the name of some other inode that was not yet processed. We can achieve this scenario of different inodes with the same number but different generation values either by mounting a filesystem with the inode cache option (-o inode_cache) or by creating and sending snapshots across different filesystems, like in the following example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a1 $ mkdir /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inode from the second snapshot has # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt /tmp/1.snap # Receive of snapshot snap2 used to fail. $ btrfs receive /mnt /tmp/2.snap Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com>
* | | | | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2017-03-028-201/+213
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile two from Al Viro: - orangefs fix - series of fs/namei.c cleanups from me - VFS stuff coming from overlayfs tree * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: orangefs: Use RCU for destroy_inode vfs: use helper for calling f_op->fsync() mm: use helper for calling f_op->mmap() vfs: use helpers for calling f_op->{read,write}_iter() vfs: pass type instead of fn to do_{loop,iter}_readv_writev() vfs: extract common parts of {compat_,}do_readv_writev() vfs: wrap write f_ops with file_{start,end}_write() vfs: deny copy_file_range() for non regular files vfs: deny fallocate() on directory vfs: create vfs helper vfs_tmpfile() namei.c: split unlazy_walk() namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate() lookup_fast(): clean up the logics around the fallback to non-rcu mode namei: fold unlazy_link() into its sole caller
| * \ \ \ \ \ \ \ \ Merge remote-tracking branch 'ovl/for-viro' into for-linusAl Viro2017-03-027-112/+110
| |\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Overlayfs-related series from Miklos and Amir
| | * | | | | | | | | vfs: use helper for calling f_op->fsync()Miklos Szeredi2017-02-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| | * | | | | | | | | mm: use helper for calling f_op->mmap()Miklos Szeredi2017-02-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| | * | | | | | | | | vfs: use helpers for calling f_op->{read,write}_iter()Miklos Szeredi2017-02-203-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| | * | | | | | | | | vfs: pass type instead of fn to do_{loop,iter}_readv_writev()Miklos Szeredi2017-02-201-19/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| | * | | | | | | | | vfs: extract common parts of {compat_,}do_readv_writev()Miklos Szeredi2017-02-201-51/+29
| | | |_|/ / / / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| | * | | | | | | | vfs: wrap write f_ops with file_{start,end}_write()Amir Goldstein2017-02-072-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before calling write f_ops, call file_start_write() instead of sb_start_write(). Replace {sb,file}_start_write() for {copy,clone}_file_range() and for fallocate(). Beyond correct semantics, this avoids freeze protection to sb when operating on special inodes, such as fallocate() on a blockdev. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| | * | | | | | | | vfs: deny copy_file_range() for non regular filesAmir Goldstein2017-02-071-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no in-tree file system that implements copy_file_range() for non regular files. Deny an attempt to copy_file_range() a directory with EISDIR and any other non regualr file with EINVAL to conform with behavior of vfs_{clone,dedup}_file_range(). This change is needed prior to converting sb_start_write() to file_start_write() in the vfs helper. Cc: linux-api@vger.kernel.org Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>