summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'for-4.18-tag' of ↵Linus Torvalds2018-06-0447-2746/+3248
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "User visible features: - added support for the ioctl FS_IOC_FSGETXATTR, per-inode flags, successor of GET/SETFLAGS; now supports only existing flags: append, immutable, noatime, nodump, sync - 3 new unprivileged ioctls to allow users to enumerate subvolumes - dedupe syscall implementation does not restrict the range to 16MiB, though it still splits the whole range to 16MiB chunks - on user demand, rmdir() is able to delete an empty subvolume, export the capability in sysfs - fix inode number types in tracepoints, other cleanups - send: improved speed when dealing with a large removed directory, measurements show decrease from 2000 minutes to 2 minutes on a directory with 2 million entries - pre-commit check of superblock to detect a mysterious in-memory corruption - log message updates Other changes: - orphan inode cleanup improved, does no keep long-standing reservations that could lead up to early ENOSPC in some cases - slight improvement of handling snapshotted NOCOW files by avoiding some unnecessary tree searches - avoid OOM when dealing with many unmergeable small extents at flush time - speedup conversion of free space tree representations from/to bitmap/tree - code refactoring, deletion, cleanups: + delayed refs + delayed iput + redundant argument removals + memory barrier cleanups + remove a redundant mutex supposedly excluding several ioctls to run in parallel - new tracepoints for blockgroup manipulation - more sanity checks of compressed headers" * tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (183 commits) btrfs: Add unprivileged version of ino_lookup ioctl btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF btrfs: Add unprivileged ioctl which returns subvolume information Btrfs: clean up error handling in btrfs_truncate() btrfs: Factor out write portion of btrfs_get_blocks_direct btrfs: Factor out read portion of btrfs_get_blocks_direct btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist btrfs: raid56: Remove VLA usage btrfs: return error value if create_io_em failed in cow_file_range btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot btrfs: drop unused parameter qgroup_reserved btrfs: balance dirty metadata pages in btrfs_finish_ordered_io btrfs: lift some btrfs_cross_ref_exist checks in nocow path btrfs: Remove fs_info argument from btrfs_uuid_tree_rem btrfs: Remove fs_info argument from btrfs_uuid_tree_add Btrfs: remove unused check of skip_locking Btrfs: remove always true check in unlock_up Btrfs: grab write lock directly if write_lock_level is the max level Btrfs: move get root out of btrfs_search_slot to a helper Btrfs: use more straightforward extent_buffer_uptodate check ...
| * btrfs: Add unprivileged version of ino_lookup ioctlTomohiro Misono2018-05-311-0/+204
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER to allow normal users to call "btrfs subvolume list/show" etc. in combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF. This can be used like BTRFS_IOC_INO_LOOKUP but the argument is different. This is because it always searches the fs/file tree correspoinding to the fd with which this ioctl is called and also returns the name of bottom subvolume. The main differences from original ino_lookup ioctl are: 1. Read + Exec permission will be checked using inode_permission() during path construction. -EACCES will be returned in case of failure. 2. Path construction will be stopped at the inode number which corresponds to the fd with which this ioctl is called. If constructed path does not exist under fd's inode, -EACCES will be returned. 3. The name of bottom subvolume is also searched and filled. Note that the maximum length of path is shorter 256 (BTRFS_VOL_NAME_MAX+1) bytes than ino_lookup ioctl because of space of subvolume's name. Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes ] Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REFTomohiro Misono2018-05-311-0/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns ROOT_REF information of the subvolume containing this inode except the subvolume name (this is because to prevent potential name leak). The subvolume name will be gained by user version of ino_lookup ioctl (BTRFS_IOC_INO_LOOKUP_USER) which also performs permission check. The min id of root ref's subvolume to be searched is specified by @min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search ends, @min_id is set to the last searched root ref's subvolid + 1. Also, if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM, -EOVERFLOW is returned. Therefore the caller can just call this ioctl again without changing the argument to continue search. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ style fixes and struct item renames ] Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Add unprivileged ioctl which returns subvolume informationTomohiro Misono2018-05-311-0/+121
| | | | | | | | | | | | | | | | | | | | | | | | Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns the information of subvolume containing this inode. (i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.) Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> [ minor style fixes, update struct comments ] Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: clean up error handling in btrfs_truncate()Omar Sandoval2018-05-301-19/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_truncate() uses two variables for error handling, ret and err (if this sounds familiar, it's because btrfs_truncate_inode_items() did something similar). This is error prone, as was made evident by "Btrfs: fix error handling in btrfs_truncate()". We only have err because we don't want to mask an error if we call btrfs_update_inode() and btrfs_end_transaction(), so let's make that its own scoped return variable and use ret everywhere else. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Factor out write portion of btrfs_get_blocks_directNikolay Borisov2018-05-301-99/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | Now that the read side is extracted into its own function, do the same to the write side. This leaves btrfs_get_blocks_direct_write with the sole purpose of handling common locking required. Also flip the condition in btrfs_get_blocks_direct_write so that the write case comes first and we check for if (Create) rather than if (!create). This is purely subjective but I believe makes reading a bit more "linear". No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Factor out read portion of btrfs_get_blocks_directNikolay Borisov2018-05-301-13/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently this function handles both the READ and WRITE dio cases. This is facilitated by a bunch of 'if' statements, a goto short-circuit statement and a very perverse aliasing of "!created"(READ) case by setting lockstart = lockend and checking for lockstart < lockend for detecting the write. Let's simplify this mess by extracting the READ-only code into a separate __btrfs_get_block_direct_read function. This is only the first step, the next one will be to factor out the write side as well. The end goal will be to have the common locking/ unlocking code in btrfs_get_blocks_direct and then it will call either the read|write subvariants. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_existSu Yue2018-05-301-1/+1
| | | | | | | | | | | | | | | | | | | | The error code does not match the reason of failure and may confuse the callers. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: raid56: Remove VLA usageKees Cook2018-05-301-10/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | In the quest to remove all stack VLA usage from the kernel[1], this allocates the working buffers during regular init, instead of using stack space. This refactors the allocation code a bit to make it easier to review. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: return error value if create_io_em failed in cow_file_rangeSu Yue2018-05-301-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In cow_file_range(), create_io_em() may fail, but its return value is not recorded. Then return value may be 0 even it failed which is a wrong behavior. Let cow_file_range() return PTR_ERR(em) if create_io_em() failed. Fixes: 6f9994dbabe5 ("Btrfs: create a helper to create em for IO") CC: stable@vger.kernel.org # 4.11+ Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshotGu JinXiang2018-05-301-1/+0
| | | | | | | | | | | | | | | | Since there is no more use of qgroup_reserved member in struct btrfs_pending_snapshot, remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: drop unused parameter qgroup_reservedGu JinXiang2018-05-304-14/+5
| | | | | | | | | | | | | | | | | | | | Since commit 7775c8184ec0 ("btrfs: remove unused parameter from btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used by caller of function btrfs_subvolume_reserve_metadata. So remove it. Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: balance dirty metadata pages in btrfs_finish_ordered_ioEthan Lien2018-05-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [Problem description and how we fix it] We should balance dirty metadata pages at the end of btrfs_finish_ordered_io, since a small, unmergeable random write can potentially produce dirty metadata which is multiple times larger than the data itself. For example, a small, unmergeable 4KiB write may produce: 16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree 16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree 16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree Although we do call balance dirty pages in write side, but in the buffered write path, most metadata are dirtied only after we reach the dirty background limit (which by far only counts dirty data pages) and wakeup the flusher thread. If there are many small, unmergeable random writes spread in a large btree, we'll find a burst of dirty pages exceeds the dirty_bytes limit after we wakeup the flusher thread - which is not what we expect. In our machine, it caused out-of-memory problem since a page cannot be dropped if it is marked dirty. Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay, but since we do btrfs_finish_ordered_io in a separate worker, it will not stop the flusher consuming dirty pages. Also, we use different worker for metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle the size of dirty metadata pages. [Reproduce steps] To reproduce the problem, we need to do 4KiB write randomly spread in a large btree. In our 2GiB RAM machine: 1) Create 4 subvolumes. 2) Run fio on each subvolume: [global] direct=0 rw=randwrite ioengine=libaio bs=4k iodepth=16 numjobs=1 group_reporting size=128G runtime=1800 norandommap time_based randrepeat=0 3) Take snapshot on each subvolume and repeat fio on existing files. 4) Repeat step (3) until we get large btrees. In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of metadata in each subvolume tree and 12GiB of metadata in extent tree. 5) Stop all fio, take snapshot again, and wait until all delayed work is completed. 6) Start all fio. Few seconds later we hit OOM when the flusher starts to work. It can be reproduced even when using nocow write. Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: lift some btrfs_cross_ref_exist checks in nocow pathEthan Lien2018-05-301-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In nocow path, we check if the extent is snapshotted in btrfs_cross_ref_exist(). We can do the similar check earlier and avoid unnecessary search into extent tree. A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows: [global] group_reporting time_based thread=1 ioengine=libaio bs=4k iodepth=32 size=64G runtime=180 numjobs=8 rw=randwrite [file1] filename=/mnt/nocow/testfile IOPS result: unpatched patched 1 fio round: 46670 46958 snapshot 2 fio round: 51826 54498 3 fio round: 59767 61289 After snapshot, the first fio get about 5% performance gain. As we continually write to the same file, all writes will resume to nocow mode and eventually we have no performance gain. Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Remove fs_info argument from btrfs_uuid_tree_remLu Fengqi2018-05-304-9/+7
| | | | | | | | | | | | | | | | | | | | This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ rename the function ] Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Remove fs_info argument from btrfs_uuid_tree_addLu Fengqi2018-05-305-13/+10
| | | | | | | | | | | | | | | | | | This function always takes a transaction handle which contains a reference to the fs_info. Use that and remove the extra argument. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: remove unused check of skip_lockingLiu Bo2018-05-301-2/+5
| | | | | | | | | | | | | | | | | | | | | | The check is superfluous since all callers who set search_for_commit also have skip_locking set. ASSERT() is put in place to ensure skip_locking is set by new callers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: remove always true check in unlock_upLiu Bo2018-05-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As unlock_up() is written as for () { if (!path->locks[i]) break; ... if (... && path->locks[i]) { } } Apparently, @path->locks[i] is always true at this 'if'. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: grab write lock directly if write_lock_level is the max levelLiu Bo2018-05-301-11/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Typically, when acquiring root node's lock, btrfs tries its best to get read lock and trade for write lock if @write_lock_level implies to do so. In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's write lock directly. In this particular case, the dance of acquiring read lock and then trading for write lock can be saved. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: move get root out of btrfs_search_slot to a helperLiu Bo2018-05-301-45/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It's good to have a helper instead of having all get-root details open-coded. The new helper locks (if necessary) and sets root node of the path. Also invert the checks to make the code flow easier to read. There is no functional change in this commit. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: use more straightforward extent_buffer_uptodate checkLiu Bo2018-05-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | If parent_transid "0" is passed to btrfs_buffer_uptodate(), btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but extent_buffer_uptodate() is preferred since we don't have to look into verify_parent_transid(). Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: remove superfluous free_extent_buffer in read_block_for_searchLiu Bo2018-05-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | read_block_for_search() can be simplified as: tmp = find_extent_buffer(); if (tmp) return; ... free_extent_buffer(); read_tree_block(); Apparently, @tmp must be NULL at this point, free_extent_buffer() is not needed. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: drop unused space_info parameter from create_space_infoLu Fengqi2018-05-301-8/+5
| | | | | | | | | | | | | | | | | | | | Since commit dc2d3005d27d ("btrfs: remove dead create_space_info calls"), there is only one caller btrfs_init_space_info. However, it doesn't need create_space_info to return space_info at all. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: add parent_transid parameter to veirfy_level_keyLiu Bo2018-05-301-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As verify_level_key() is checked after verify_parent_transid(), i.e. if (verify_parent_transid()) ret = -EIO; else if (verify_level_key()) ret = -EUCLEAN; if parent_transid is 0, verify_parent_transid() skips verifying parent_transid and considers eb as valid, and if verify_level_key() reports something wrong, we're not going to know if it's caused by corrupted metadata or non-checkecd eb (e.g. stale eb). The stale eb can be from an outdated raid1 mirror after a degraded mount, see eg "btrfs: fix reading stale metadata blocks after degraded raid1 mounts" (02a3307aa9c20b4f66262) for more details. @parent_transid is able to tell whether the eb's generation has been verified by the caller. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: qgroup: show more meaningful qgroup_rescan_init error messageQu Wenruo2018-05-301-15/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Error message from qgroup_rescan_init() mostly looks like: BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115 Which is far from meaningful, and sometimes confusing as for above -EINPROGRESS it's mostly (despite the init race) harmless, but sometimes it can also indicate problem if the return value is -EINVAL. Change it to some more meaningful messages like: BTRFS info (device nvme0n1p1): qgroup rescan is already in progress And BTRFS err(device nvme0n1p1): qgroup rescan init failed, qgroup is not enabled Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> [ update the messages and level ] Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix memory and mount leak in btrfs_ioctl_rm_dev_v2()Omar Sandoval2018-05-301-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have invalid flags set, when we error out we must drop our writer counter and free the buffer we allocated for the arguments. This bug is trivially reproduced with the following program on 4.7+: #include <fcntl.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <linux/btrfs.h> #include <linux/btrfs_tree.h> int main(int argc, char **argv) { struct btrfs_ioctl_vol_args_v2 vol_args = { .flags = UINT64_MAX, }; int ret; int fd; if (argc != 2) { fprintf(stderr, "usage: %s PATH\n", argv[0]); return EXIT_FAILURE; } fd = open(argv[1], O_WRONLY); if (fd == -1) { perror("open"); return EXIT_FAILURE; } ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args); if (ret == -1) perror("ioctl"); close(fd); return EXIT_SUCCESS; } When unmounting the filesystem, we'll hit the WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the filesystem to be remounted read-only as the writer count will stay lifted. Fixes: 6b526ed70cf1 ("btrfs: introduce device delete by devid") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: lzo: Harden inline lzo compressed extent decompressionQu Wenruo2018-05-301-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | For inlined extent, we only have one segment, thus less things to check. And further more, inlined extent always has the csum in its leaf header, it's less probable to have corrupted data. Anyway, still check header and segment header. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: lzo: Add header length check to avoid potential out-of-bounds accessQu Wenruo2018-05-301-2/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | James Harvey reported that some corrupted compressed extent data can lead to various kernel memory corruption. Such corrupted extent data belongs to inode with NODATASUM flags, thus data csum won't help us detecting such bug. If lucky enough, KASAN could catch it like: BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs] Write of size 4096 at addr ffff8800606cb0f8 by task kworker/u16:0/2338 CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G O 4.17.0-rc5-custom+ #50 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: btrfs-endio btrfs_endio_helper [btrfs] Call Trace: dump_stack+0xc2/0x16b print_address_description+0x6a/0x270 kasan_report+0x260/0x380 memcpy+0x34/0x50 lzo_decompress_bio+0x384/0x7a0 [btrfs] end_compressed_bio_read+0x99f/0x10b0 [btrfs] bio_endio+0x32e/0x640 normal_work_helper+0x15a/0xea0 [btrfs] process_one_work+0x7e3/0x1470 worker_thread+0x1b0/0x1170 kthread+0x2db/0x390 ret_from_fork+0x22/0x40 ... The offending compressed data has the following info: Header: length 32768 (looks completely valid) Segment 0 Header: length 3472882419 (obviously out of bounds) Then when handling segment 0, since it's over the current page, we need the copy the compressed data to temporary buffer in workspace, then such large size would trigger out-of-bounds memory access, screwing up the whole kernel. Fix it by adding extra checks on header and segment headers to ensure we won't access out-of-bounds, and even checks the decompressed data won't be out-of-bounds. Reported-by: James Harvey <jamespharvey20@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ updated comments ] Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: lzo: document the compressed data formatQu Wenruo2018-05-291-0/+37
| | | | | | | | | | | | | | | | | | | | | | Although it's not that complex, but such comment could still save several minutes for newer reader/reviewer instead of inferring that from the code. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor wording updates ] Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: compression: Add linux/sizes.h for compression.hQu Wenruo2018-05-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | Since compression.h is using the SZ_* macros, and if some file includes only compression.h without linux/sizes.h, it will cause compile error. One example is lzo.c, if it uses BTRFS_MAX_COMPRESSED. Fix it by adding linux/sizes.h in compression.h Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix clone vs chattr NODATASUM raceOmar Sandoval2018-05-291-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In btrfs_clone_files(), we must check the NODATASUM flag while the inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags() will change the flags after we check and we can end up with a party checksummed file. The race window is only a few instructions in size, between the if and the locks which is: 3834 if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode)) 3835 return -EISDIR; where the setflags must be run and toggle the NODATASUM flag (provided the file size is 0). The clone will block on the inode lock, segflags takes the inode lock, changes flags, releases log and clone continues. Not impossible but still needs a lot of bad luck to hit unintentionally. Fixes: 0e7b824c4ef9 ("Btrfs: don't make a file partly checksummed through file clone") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: propagate failures of __exclude_logged_extent to upper callerGu Jinxiang2018-05-291-2/+5
| | | | | | | | | | | | | | | | | | | | | | Function btrfs_exclude_logged_extents may call __exclude_logged_extent which may fail. Propagate the failures of __exclude_logged_extent to upper caller. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Streamline shared ref check in alloc_reserved_tree_blockNikolay Borisov2018-05-291-8/+2
| | | | | | | | | | | | | | | | | | | | | | Instead of setting "parent" to ref->parent only when dealing with a shared ref and subsequently performing another check to see if (parent > 0), check the "node->type" directly and act accordingly. This makes the code more streamline. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Pass btrfs_delayed_extent_op to alloc_reserved_tree_blockNikolay Borisov2018-05-291-6/+5
| | | | | | | | | | | | | | | | | | | | Instead of taking only specific member of this structure, which results in 2 extra arguments, just take the delayed_extent_op struct and reference the arguments inside the functions. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Simplify alloc_reserved_tree_block interfaceNikolay Borisov2018-05-291-32/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function currently takes 7 parameters, most of which are proxies for values from btrfs_delayed_ref_node struct which is not passed. This patch simplifies the interface of the function by simply passing said delayed ref node struct to the function. This enables us to: 1. Move locals variables and init code related to them from run_delayed_tree_ref which should only be used inside alloc_reserved_tree_block, such as skinny_metadata and the btrfs_key, representing the extent being inserted. This removes the need for the "ins" argument. Instead, it's replaced by a local var with a more verbose name - extent_key. 2. Now that we have a reference to the node in alloc_reserved_tree_block the delayed_tree_ref struct can be referenced inside the function and this enable removing the "ref->level", "parent" and "ref_root" arguments. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Remove fs_info argument from alloc_reserved_tree_blockNikolay Borisov2018-05-291-4/+2
| | | | | | | | | | | | | | | | | | This function already takes a transaction handle which contains a reference to the fs_info. So use this and remove the extra argument. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: tests: drop newline from test_msg stringsDavid Sterba2018-05-298-22/+22
| | | | | | | | | | | | | | Now that test_err strings do not need the newline, remove them also from the test_msg. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: tests: add helper for error messages and update themDavid Sterba2018-05-298-401/+412
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The test failures are not clearly visible in the system log as they're printed at INFO level. Add a new helper that is level ERROR. As this touches almost all strings, I took the opportunity to unify them: - decapitalize the first letter as there's a prefix and the text continues after ":" - glue strings split to more lines and un-indent so they fit to 80 columns - use %llu instead of %Lu - drop \n from the modified messages (test_msg is left untouched) Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: use error code returned by btrfs_read_fs_root_no_name in search ioctlMisono Tomohiro2018-05-281-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_read_fs_root_no_name() may return ERR_PTR(-ENOENT) or ERR_PTR(-ENOMEM) and therefore search_ioctl() and btrfs_search_path_in_tree() should use PTR_ERR() instead of -ENOENT, which all other callers of btrfs_read_fs_root_no_name() do. Drop the error message as it would be confusing, the caller of ioctl will likely interpret the error code and not look into the syslog. Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: allow empty subvol= againOmar Sandoval2018-05-281-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got a report that after upgrading to 4.16, someone's filesystems weren't mounting: [ 23.845852] BTRFS info (device loop0): unrecognized mount option 'subvol=' Before 4.16, this mounted the default subvolume. It turns out that this empty "subvol=" is actually an application bug, but it was causing the application to fail, so it's an ABI break if you squint. The generic parsing code we use for mount options (match_token()) doesn't match an empty string as "%s". Previously, setup_root_args() removed the "subvol=" string, but the mount path was cleaned up to not need that. Add a dummy Opt_subvol_empty to fix this. The simple workaround is to use / or . for the value of 'subvol=' . Fixes: 312c89fbca06 ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()") CC: stable@vger.kernel.org # 4.16+ Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix describe_relocation when printing unknown flagsAnand Jain2018-05-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Looks like the original idea was to print the hex of the flags which is not coded with their flag name. So use the current buf pointer bp instead of buf. Reaching the uknown flags should never happen, it's there just in case. Fixes: ebce0e01b930b ("btrfs: make block group flags in balance printks human-readable") Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: use kvzalloc for EXTENT_SAME temporary dataDavid Sterba2018-05-281-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | The dedupe range is 16 MiB, with 4 KiB pages and 8 byte pointers, the arrays can be 32KiB large. To avoid allocation failures due to fragmented memory, use the allocation with fallback to vmalloc. The arrays are allocated and freed only inside btrfs_extent_same and reused for all the ranges. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: reuse cmp workspace in EXTENT_SAME ioctlTimofey Titovets2018-05-281-39/+40
| | | | | | | | | | | | | | | | | | | | | | | | We support big dedup requests by splitting range to smaller parts, and call dedupe logic on each of them. Instead of repeated allocation and deallocation, allocate once at the beginning and reuse in the iteration. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: dedupe_file_range ioctl: remove 16MiB restrictionTimofey Titovets2018-05-281-6/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently btrfs_dedupe_file_range silently restricts the dedupe range to to 16MiB to limit locking and working memory size and is documented in manual page as implementation specific. Let's remove that restriction by iterating over the dedup range in 16MiB steps. This is backward compatible and will not change anything for requests smaller then 16MiB. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: split btrfs_extent_sameTimofey Titovets2018-05-281-28/+36
| | | | | | | | | | | | | | | | | | | | Split btrfs_extent_same() to two parts where one is the main EXTENT_SAME entry and a helper that can be repeatedly called on a range. This will be used in following patches. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: reserve space for O_TMPFILE orphan item deletionOmar Sandoval2018-05-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but it doesn't reserve space to do so. Even before the removal of the orphan_block_rsv it wasn't using it. Fixes: ef3b9af50bfa ("Btrfs: implement inode_operations callback tmpfile") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: renumber BTRFS_INODE_ runtime flags and switch to enumsOmar Sandoval2018-05-281-9/+11
| | | | | | | | | | | | | | | | | | | | We got rid of BTRFS_INODE_HAS_ORPHAN_ITEM and BTRFS_INODE_ORPHAN_META_RESERVED, so we can renumber the flags to make them consecutive again. Signed-off-by: Omar Sandoval <osandov@fb.com> [ switch them enums so we don't have to do that again ] Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: get rid of unused orphan infrastructureOmar Sandoval2018-05-286-99/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we don't keep long-standing reservations for orphan items, root->orphan_block_rsv isn't used. We can git rid of it, along with: - root->orphan_lock, which was used to protect root->orphan_block_rsv - root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv - BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations in root->orphan_block_rsv - btrfs_orphan_commit_root(), which was the last user of any of these and does nothing else Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: fix ENOSPC caused by orphan items reservationsOmar Sandoval2018-05-281-120/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we keep space reserved for all inode orphan items until the inode is evicted (i.e., all references to it are dropped). We hit an issue where an application would keep a bunch of deleted files open (by design) and thus keep a large amount of space reserved, causing ENOSPC errors when other operations tried to reserve space. This long-standing reservation isn't absolutely necessary for a couple of reasons: - We can almost always make the reservation we need or steal from the global reserve for the orphan item - If we can't, it's not the end of the world if we drop the orphan item on the floor and let the next mount clean it up So, get rid of persistent reservation and just reserve space in btrfs_evict_inode(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * Btrfs: refactor btrfs_evict_inode() reserve refill danceOmar Sandoval2018-05-281-71/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The truncate loop in btrfs_evict_inode() does two things at once: - It refills the temporary block reserve, potentially stealing from the global reserve or committing - It calls btrfs_truncate_inode_items() The tangle of continues hides the fact that these two steps are actually separate. Split the first step out into a separate function both for clarity and so that we can reuse it in a later patch. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>