summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
* btrfs: add missing memset while reading compressed inline extentsZygo Blaxell2017-12-201-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e1699d2d7bf6e6cce3e1baff19f9dd4595a58664 ] This is a story about 4 distinct (and very old) btrfs bugs. Commit c8b978188c ("Btrfs: Add zlib compression support") added three data corruption bugs for inline extents (bugs #1-3). Commit 93c82d5750 ("Btrfs: zero page past end of inline file items") fixed bug #1: uncompressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read. The fix was to add a memset in btrfs_get_extent to zero out the hole. Commit 166ae5a418 ("btrfs: fix inline compressed read err corruption") fixed bug #2: compressed inline extents which contained non-zero bytes might be replaced with zero bytes in some cases. This patch removed an unhelpful memset from uncompress_inline, but the case where memset is required was missed. There is also a memset in the decompression code, but this only covers decompressed data that is shorter than the ram_bytes from the extent ref record. This memset doesn't cover the region between the end of the decompressed data and the end of the page. It has also moved around a few times over the years, so there's no single patch to refer to. This patch fixes bug #3: compressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/). The fix is the same: zero out the hole in the compressed case too, by putting a memset back in uncompress_inline, but this time with correct parameters. The last and oldest bug, bug #0, is the cause of the offending inline extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk of behavior somewhere in the btrfs write code. In a few special cases, an inline extent and hole are allowed to persist where they normally would be combined with later extents in the file. A fast reproducer for bug #0 is presented below. A few offending extents are also created in the wild during large rsync transfers with the -S flag. A Linux kernel build (git checkout; make allyesconfig; make -j8) will produce a handful of offending files as well. Once an offending file is created, it can present different content to userspace each time it is read. Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y kernel back to v3.5 has this behavior. There are fossil records of this bug's effects in commits all the way back to v2.6.32. I have no reason to believe bug #0 wasn't present at the beginning of btrfs compression support in v2.6.29, but I can't easily test kernels that old to be sure. It is not clear whether bug #0 is worth fixing. A fix would likely require injecting extra reads into currently write-only paths, and most of the exceptional cases caused by bug #0 are already handled now. Whether we like them or not, bug #0's inline extents followed by holes are part of the btrfs de-facto disk format now, and we need to be able to read them without data corruption or an infoleak. So enough about bug #0, let's get back to bug #3 (this patch). An example of on-disk structure leading to data corruption found in the wild: item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160 inode generation 50 transid 50 size 47424 nbytes 49141 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 flags 0x0(none) item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20 inode ref index 3 namelen 10 name: DB_File.so item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362 inline extent data size 1341 ram 4085 compress(zlib) item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53 extent data disk byte 5367308288 nr 20480 extent data offset 0 nr 45056 ram 45056 extent compression(zlib) Different data appears in userspace during each read of the 11 bytes between 4085 and 4096. The extent in item 63 is not long enough to fill the first page of the file, so a memset is required to fill the space between item 63 (ending at 4085) and item 64 (beginning at 4096) with zero. Here is a reproducer from Liu Bo, which demonstrates another method of creating the same inline extent and hole pattern: Using 'page_poison=on' kernel command line (or enable CONFIG_PAGE_POISONING) run the following: # touch foo # chattr +c foo # xfs_io -f -c "pwrite -W 0 1000" foo # xfs_io -f -c "falloc 4 8188" foo # od -x foo # echo 3 >/proc/sys/vm/drop_caches # od -x foo This produce the following on my box: Correct output: file contains 1000 data bytes followed by zeros: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000 0001760 0000 0000 0000 0000 0000 0000 0000 0000 * 0020000 Actual output: the data after the first 1000 bytes will be different each run: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d 0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400 0002000 435f 0056 5f74 6164 7400 645f 0062 5f74 (...) Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: clear space cache inode generation alwaysJosef Bacik2017-12-051-7/+7
| | | | | | | | | | | | | | | | | | | | | commit 8e138e0d92c6c9d3d481674fb14e3439b495be37 upstream. We discovered a box that had double allocations, and suspected the space cache may be to blame. While auditing the write out path I noticed that if we've already setup the space cache we will just carry on. This means that any error we hit after cache_save_setup before we go to actually write the cache out we won't reset the inode generation, so whatever was already written will be considered correct, except it'll be stale. Fix this by _always_ resetting the generation on the block group inode, this way we only ever have valid or invalid cache. With this patch I was no longer able to reproduce cache corruption with dm-log-writes and my bpf error injection tool. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: return the actual error value from from btrfs_uuid_tree_iteratePan Bian2017-11-301-3/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit 73ba39ab9307340dc98ec3622891314bbc09cc2e ] In function btrfs_uuid_tree_iterate(), errno is assigned to variable ret on errors. However, it directly returns 0. It may be better to return ret. This patch also removes the warning, because the caller already prints a warning. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188731 Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Omar Sandoval <osandov@fb.com> [ edited subject ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: send, fix failure to rename top level inode due to name collisionRobbie Ko2017-10-211-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 4dd9920d991745c4a16f53a8f615f706fbe4b3f7 ] Under certain situations, an incremental send operation can fail due to a premature attempt to create a new top level inode (a direct child of the subvolume/snapshot root) whose name collides with another inode that was removed from the send snapshot. Consider the following example scenario. Parent snapshot: . (ino 256, gen 8) |---- a1/ (ino 257, gen 9) |---- a2/ (ino 258, gen 9) Send snapshot: . (ino 256, gen 3) |---- a2/ (ino 257, gen 7) In this scenario, when receiving the incremental send stream, the btrfs receive command fails like this (ran in verbose mode, -vv argument): rmdir a1 mkfile o257-7-0 rename o257-7-0 -> a2 ERROR: rename o257-7-0 -> a2 failed: Is a directory What happens when computing the incremental send stream is: 1) An operation to remove the directory with inode number 257 and generation 9 is issued. 2) An operation to create the inode with number 257 and generation 7 is issued. This creates the inode with an orphanized name of "o257-7-0". 3) An operation rename the new inode 257 to its final name, "a2", is issued. This is incorrect because inode 258, which has the same name and it's a child of the same parent (root inode 256), was not yet processed and therefore no rmdir operation for it was yet issued. The rename operation is issued because we fail to detect that the name of the new inode 257 collides with inode 258, because their parent, a subvolume/snapshot root (inode 256) has a different generation in both snapshots. So fix this by ignoring the generation value of a parent directory that matches a root inode (number 256) when we are checking if the name of the inode currently being processed collides with the name of some other inode that was not yet processed. We can achieve this scenario of different inodes with the same number but different generation values either by mounting a filesystem with the inode cache option (-o inode_cache) or by creating and sending snapshots across different filesystems, like in the following example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a1 $ mkdir /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inode from the second snapshot has # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt /tmp/1.snap # Receive of snapshot snap2 used to fail. $ btrfs receive /mnt /tmp/2.snap Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: prevent to set invalid default subvolidsatoru takeuchi2017-10-051-0/+4
| | | | | | | | | | | | | | | | commit 6d6d282932d1a609e60dc4467677e0e863682f57 upstream. `btrfs sub set-default` succeeds to set an ID which isn't corresponding to any fs/file tree. If such the bad ID is set to a filesystem, we can't mount this filesystem without specifying `subvol` or `subvolid` mount options. Fixes: 6ef5ed0d386b ("Btrfs: add ioctl and incompat flag to set the default mount subvol") Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: resume qgroup rescan on rw remountAleksa Sarai2017-09-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | commit 6c6b5a39c4bf3dbd8cf629c9f5450e983c19dbb9 upstream. Several distributions mount the "proper root" as ro during initrd and then remount it as rw before pivot_root(2). Thus, if a rescan had been aborted by a previous shutdown, the rescan would never be resumed. This issue would manifest itself as several btrfs ioctl(2)s causing the entire machine to hang when btrfs_qgroup_wait_for_completion was hit (due to the fs_info->qgroup_rescan_running flag being set but the rescan itself not being resumed). Notably, Docker's btrfs storage driver makes regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT (causing this problem to be manifested on boot for some machines). Cc: Jeff Mahoney <jeffm@suse.com> Fixes: b382a324b60f ("Btrfs: fix qgroup rescan resume on mount") Signed-off-by: Aleksa Sarai <asarai@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: fix memory leak in update_space_info failure pathJeff Mahoney2017-06-141-0/+1
| | | | | | | | | | | | | | commit 896533a7da929136d0432713f02a3edffece2826 upstream. If we fail to add the space_info kobject, we'll leak the memory for the percpu counter. Fixes: 6ab0a2029c (btrfs: publish allocation data in sysfs) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: use correct types for page indices in btrfs_page_exists_in_rangeDavid Sterba2017-06-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit cc2b702c52094b637a351d7491ac5200331d0445 upstream. Variables start_idx and end_idx are supposed to hold a page index derived from the file offsets. The int type is not the right one though, offsets larger than 1 << 44 will get silently trimmed off the high bits. (1 << 44 is 16TiB) What can go wrong, if start is below the boundary and end gets trimmed: - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot) - the final check "if (page->index <= end_idx)" will unexpectedly fail The function will return false, ie. "there's no page in the range", although there is at least one. btrfs_page_exists_in_range is used to prevent races in: * in hole punching, where we make sure there are not pages in the truncated range, otherwise we'll wait for them to finish and redo truncation, but we're going to replace the pages with holes anyway so the only problem is the intermediate state * lock_extent_direct: we want to make sure there are no pages before we lock and start DIO, to prevent stale data reads For practical occurence of the bug, there are several constaints. The file must be quite large, the affected range must cross the 16TiB boundary and the internal state of the file pages and pending operations must match. Also, we must not have started any ordered data in the range, otherwise we don't even reach the buggy function check. DIO locking tries hard in several places to avoid deadlocks with buffered IO and avoids waiting for ranges. The worst consequence seems to be stale data read. CC: Liu Bo <bo.li.liu@oracle.com> Fixes: fc4adbff823f7 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking") Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* posix_acl: Clear SGID bit when setting file permissionsJan Kara2017-05-081-4/+2
| | | | | | | | | | | | | | | | | | commit 073931017b49d9458aa351605b43a7e34598caef upstream. When file permissions are modified via chmod(2) and the user is not in the owning group or capable of CAP_FSETID, the setgid bit is cleared in inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file permissions as well as the new ACL, but doesn't clear the setgid bit in a similar way; this allows to bypass the check in chmod(2). Fix that. References: CVE-2016-7097 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: fix state->private cast on 32 bit machinesSatoru Takeuchi2017-04-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | commit 6e1103a6e9b19dbdc348077d04a546b626911fc5 upstream. Suppress the following warning displayed on building 32bit (i686) kernel. =============================================================================== ... CC [M] fs/btrfs/extent_io.o fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’: fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] failrec = (struct io_failure_record *)state->private; ... =============================================================================== Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Reported-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Chris Mason <clm@fb.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Btrfs: fix tree search logic when replaying directory entry deletesRobbie Ko2017-01-151-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2a7bf53f577e49c43de4ffa7776056de26db65d9 ] If a log tree has a layout like the following: leaf N: ... item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8 dir log end 1275809046 leaf N + 1: item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8 dir log end 18446744073709551615 ... When we pass the value 1275809046 + 1 as the parameter start_ret to the function tree-log.c:find_dir_range() (done by replay_dir_deletes()), we end up with path->slots[0] having the value 239 (points to the last item of leaf N, item 240). Because the dir log item in that position has an offset value smaller than *start_ret (1275809046 + 1) we need to move on to the next leaf, however the logic for that is wrong since it compares the current slot to the number of items in the leaf, which is smaller and therefore we don't lookup for the next leaf but instead we set the slot to point to an item that does not exist, at slot 240, and we later operate on that slot which has unexpected content or in the worst case can result in an invalid memory access (accessing beyond the last page of leaf N's extent buffer). So fix the logic that checks when we need to lookup at the next leaf by first incrementing the slot and only after to check if that slot is beyond the last item of the current leaf. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Fixes: e02119d5a7b4 (Btrfs: Add a write ahead tree log to optimize synchronous operations) Cc: stable@vger.kernel.org # 2.6.29+ Signed-off-by: Filipe Manana <fdmanana@suse.com> [Modified changelog for clarity and correctness] Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
* btrfs: fix races on root_log_ctx listsChris Mason2016-11-231-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 570dd45042a7c8a7aba1ee029c5dd0f5ccf41b9b ] btrfs_remove_all_log_ctxs takes a shortcut where it avoids walking the list because it knows all of the waiters are patiently waiting for the commit to finish. But, there's a small race where btrfs_sync_log can remove itself from the list if it finds a log commit is already done. Also, it uses list_del_init() to remove itself from the list, but there's no way to know if btrfs_remove_all_log_ctxs has already run, so we don't know for sure if it is safe to call list_del_init(). This gets rid of all the shortcuts for btrfs_remove_all_log_ctxs(), and just calls it with the proper locking. This is part two of the corruption fixed by cbd60aa7cd1. I should have done this in the first place, but convinced myself the optimizations were safe. A 12 hour run of dbench 2048 will eventually trigger a list debug WARN_ON for the list_del_init() in btrfs_sync_log(). Fixes: d1433debe7f4346cf9fc0dafc71c3137d2a97bc4 Reported-by: Dave Jones <davej@codemonkey.org.uk> cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
* btrfs: ensure that file descriptor used with subvol ioctls is a dirJeff Mahoney2016-10-051-0/+12
| | | | | | | | | | | | | | | | | [ Upstream commit 325c50e3cebb9208009083e841550f98a863bfa0 ] If the subvol/snapshot create/destroy ioctls are passed a regular file with execute permissions set, we'll eventually Oops while trying to do inode->i_op->lookup via lookup_one_len. This patch ensures that the file descriptor refers to a directory. Fixes: cb8e70901d (Btrfs: Fix subvolume creation locking rules) Fixes: 76dda93c6a (Btrfs: add snapshot/subvolume destroy ioctl) Cc: <stable@vger.kernel.org> #v2.6.29+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
* Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returnsChris Mason2016-10-051-0/+1
| | | | | | | | | | | | | | | | | [ Upstream commit cbd60aa7cd17d81a434234268c55192862147439 ] We use a btrfs_log_ctx structure to pass information into the tree log commit, and get error values out. It gets added to a per log-transaction list which we walk when things go bad. Commit d1433debe added an optimization to skip waiting for the log commit, but didn't take root_log_ctx out of the list. This patch makes sure we remove things before exiting. Signed-off-by: Chris Mason <clm@fb.com> Fixes: d1433debe7f4346cf9fc0dafc71c3137d2a97bc4 cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
* Btrfs: add missing blk_finish_plug in btrfs_sync_log()Forrest Liu2016-10-051-0/+1
| | | | | | | | | | | [ Upstream commit 3da5ab56482f322a9736c484db8773899c5c731b ] Add missing blk_finish_plug in btrfs_sync_log() Signed-off-by: Forrest Liu <forrestl@synology.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
* Btrfs: don't use src fd for printkJosef Bacik2016-07-121-1/+1
| | | | | | | | | | | | [ Upstream commit c79b4713304f812d3d6c95826fc3e5fc2c0b0c14 ] The fd we pass in may not be on a btrfs file system, so don't try to do BTRFS_I() on it. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: account for non-CoW'd blocks in btrfs_abort_transactionJeff Mahoney2016-07-125-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 64c12921e11b3a0c10d088606e328c58e29274d8 ] The test for !trans->blocks_used in btrfs_abort_transaction is insufficient to determine whether it's safe to drop the transaction handle on the floor. btrfs_cow_block, informed by should_cow_block, can return blocks that have already been CoW'd in the current transaction. trans->blocks_used is only incremented for new block allocations. If an operation overlaps the blocks in the current transaction entirely and must abort the transaction, we'll happily let it clean up the trans handle even though it may have modified the blocks and will commit an incomplete operation. In the long-term, I'd like to do closer tracking of when the fs is actually modified so we can still recover as gracefully as possible, but that approach will need some discussion. In the short term, since this is the only code using trans->blocks_used, let's just switch it to a bool indicating whether any blocks were used and set it when should_cow_block returns false. Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: make btrfs_abort_transaction consider existence of new block groupsFilipe Manana2016-07-122-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c92f6be34c501406daf5e61f3569a1813f985393 ] If the transaction handle doesn't have used blocks but has created new block groups make sure we turn the fs into readonly mode too. This is because the new block groups didn't get all their metadata persisted into the chunk and device trees, and therefore if a subsequent transaction starts, allocates space from the new block groups, writes data or metadata into that space, commits successfully and then after we unmount and mount the filesystem again, the same space can be allocated again for a new block group, resulting in file data or metadata corruption. Example where we don't abort the transaction when we fail to finish the chunk allocation (add items to the chunk and device trees) and later a future transaction where the block group is removed fails because it can't find the chunk item in the chunk tree: [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]() [25230.404301] BTRFS: Transaction aborted (error -28) [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs] [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25230.404328] 0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50 [25230.404330] ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4 [25230.404332] ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8 [25230.404334] Call Trace: [25230.404338] [<ffffffff813e7a13>] dump_stack+0x45/0x56 [25230.404342] [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98 [25230.404351] [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404353] [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50 [25230.404362] [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404374] [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs] [25230.404387] [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs] [25230.404398] [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs] [25230.404408] [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs] [25230.404421] [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs] [25230.404425] [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37 [25230.404429] [<ffffffff810f6501>] ? get_page+0x1a/0x2b [25230.404441] [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs] [25230.404443] [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846 [25230.404446] [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18 [25230.404449] [<ffffffff81138d68>] new_sync_write+0x7c/0xa0 [25230.404450] [<ffffffff81139401>] vfs_write+0xb0/0x112 [25230.404452] [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84 [25230.404454] [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]--- [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left). [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.) Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctlLuke Dashjr2016-06-034-2/+24
| | | | | | | | | | | | | | | [ Upstream commit 4c63c2454eff996c5e27991221106eb511f7db38 ] 32-bit ioctl uses these rather than the regular FS_IOC_* versions. They can be handled in btrfs using the same code. Without this, 32-bit {ch,ls}attr fail. Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org> Cc: stable@vger.kernel.org Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix list transaction->pending_ordered corruptionFilipe Manana2016-04-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d3efe08400317888f559bbedf0e42cd31575d0ef ] When we call btrfs_commit_transaction(), we splice the list "ordered" of our transaction handle into the transaction's "pending_ordered" list, but we don't re-initialize the "ordered" list of our transaction handle, this means it still points to the same elements it used to before the splice. Then we check if the current transaction's state is >= TRANS_STATE_COMMIT_START and if it is we end up calling btrfs_end_transaction() which simply splices again the "ordered" list of our handle into the transaction's "pending_ordered" list, leaving multiple pointers to the same ordered extents which results in list corruption when we are iterating, removing and freeing ordered extents at btrfs_wait_pending_ordered(), resulting in access to dangling pointers / use-after-free issues. Similarly, btrfs_end_transaction() can end up in some cases calling btrfs_commit_transaction(), and both did a list splice of the transaction handle's "ordered" list into the transaction's "pending_ordered" without re-initializing the handle's "ordered" list, resulting in exactly the same problem. This produces the following warning on a kernel with linked list debugging enabled: [109749.265416] ------------[ cut here ]------------ [109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98() [109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d (...) [109749.287505] Call Trace: [109749.288135] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [109749.298080] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2 [109749.331605] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [109749.334849] [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98 [109749.337093] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 [109749.337847] [<ffffffff81260642>] __list_del_entry+0x5a/0x98 [109749.338678] [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs] [109749.340145] [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs] [109749.348313] [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs] [109749.349745] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf [109749.350819] [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs] [109749.351976] [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e [109749.360341] [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e [109749.368828] [<ffffffff8118ee1d>] do_fsync+0x34/0x4e [109749.369790] [<ffffffff8118f045>] SyS_fsync+0x10/0x14 [109749.370925] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [109749.382274] ---[ end trace 48e0d07f7c03d95a ]--- On a non-debug kernel this leads to invalid memory accesses, causing a crash. Fix this by using list_splice_init() instead of list_splice() in btrfs_commit_transaction() and btrfs_end_transaction(). Cc: stable@vger.kernel.org Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3" Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix file/data loss caused by fsync after rename and new inodeFilipe Manana2016-04-201-0/+137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 56f23fdbb600e6087db7b009775b95ce07cc3195 ] If we rename an inode A (be it a file or a directory), create a new inode B with the old name of inode A and under the same parent directory, fsync inode B and then power fail, at log tree replay time we end up removing inode A completely. If inode A is a directory then all its files are gone too. Example scenarios where this happens: This is reproducible with the following steps, taken from a couple of test cases written for fstests which are going to be submitted upstream soon: # Scenario 1 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir -p /mnt/a/x echo "hello" > /mnt/a/x/foo echo "world" > /mnt/a/x/bar sync mv /mnt/a/x /mnt/a/y mkdir /mnt/a/x xfs_io -c fsync /mnt/a/x <power failure happens> The next time the fs is mounted, log tree replay happens and the directory "y" does not exist nor do the files "foo" and "bar" exist anywhere (neither in "y" nor in "x", nor the root nor anywhere). # Scenario 2 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/a echo "hello" > /mnt/a/foo sync mv /mnt/a/foo /mnt/a/bar echo "world" > /mnt/a/foo xfs_io -c fsync /mnt/a/foo <power failure happens> The next time the fs is mounted, log tree replay happens and the file "bar" does not exists anymore. A file with the name "foo" exists and it matches the second file we created. Another related problem that does not involve file/data loss is when a new inode is created with the name of a deleted snapshot and we fsync it: mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/testdir btrfs subvolume snapshot /mnt /mnt/testdir/snap btrfs subvolume delete /mnt/testdir/snap rmdir /mnt/testdir mkdir /mnt/testdir xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir <power failure> The next time the fs is mounted the log replay procedure fails because it attempts to delete the snapshot entry (which has dir item key type of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry, resulting in the following error that causes mount to fail: [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257 [52174.512570] ------------[ cut here ]------------ [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]() [52174.514681] BTRFS: Transaction aborted (error -2) [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G W 4.5.0-rc6-btrfs-next-27+ #1 [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [52174.524053] 0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758 [52174.524053] 0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd [52174.524053] 00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88 [52174.524053] Call Trace: [52174.524053] [<ffffffff81264e93>] dump_stack+0x67/0x90 [52174.524053] [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2 [52174.524053] [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50 [52174.524053] [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff8118f5e9>] ? iput+0xb0/0x284 [52174.524053] [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs] [52174.524053] [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs] [52174.524053] [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs] [52174.524053] [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs] [52174.524053] [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs] [52174.524053] [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs] [52174.524053] [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs] [52174.524053] [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs] [52174.524053] [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3 [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffff8119590f>] do_mount+0x8a6/0x9e8 [52174.524053] [<ffffffff811358dd>] ? strndup_user+0x3f/0x59 [52174.524053] [<ffffffff81195c65>] SyS_mount+0x77/0x9f [52174.524053] [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]--- Fix this by forcing a transaction commit when such cases happen. This means we check in the commit root of the subvolume tree if there was any other inode with the same reference when the inode we are fsync'ing is a new inode (created in the current transaction). Test cases for fstests, covering all the scenarios given above, were submitted upstream for fstests: * fstests: generic test for fsync after renaming directory https://patchwork.kernel.org/patch/8694281/ * fstests: generic test for fsync after renaming file https://patchwork.kernel.org/patch/8694301/ * fstests: add btrfs test for fsync after snapshot deletion https://patchwork.kernel.org/patch/8670671/ Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix fsync after truncate when no_holes feature is enabledFilipe Manana2016-04-201-0/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a89ca6f24ffe435edad57de02eaabd37a2c6bff6 ] When we have the no_holes feature enabled, if a we truncate a file to a smaller size, truncate it again but to a size greater than or equals to its original size and fsync it, the log tree will not have any information about the hole covering the range [truncate_1_offset, new_file_size[. Which means if the fsync log is replayed, the file will remain with the state it had before both truncate operations. Without the no_holes feature this does not happen, since when the inode is logged (full sync flag is set) it will find in the fs/subvol tree a leaf with a generation matching the current transaction id that has an explicit extent item representing the hole. Fix this by adding an explicit extent item representing a hole between the last extent and the inode's i_size if we are doing a full sync. The issue is easy to reproduce with the following test case for fstests: . ./common/rc . ./common/filter . ./common/dmflakey _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_dm_flakey # This test was motivated by an issue found in btrfs when the btrfs # no-holes feature is enabled (introduced in kernel 3.14). So enable # the feature if the fs being tested is btrfs. if [ $FSTYP == "btrfs" ]; then _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes" fi rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create our test files and make sure everything is durably persisted. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \ -c "pwrite -S 0xbb 64K 61K" \ $SCRATCH_MNT/foo | _filter_xfs_io $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \ -c "pwrite -S 0xff 64K 61K" \ $SCRATCH_MNT/bar | _filter_xfs_io sync # Now truncate our file foo to a smaller size (64Kb) and then truncate # it to the size it had before the shrinking truncate (125Kb). Then # fsync our file. If a power failure happens after the fsync, we expect # our file to have a size of 125Kb, with the first 64Kb of data having # the value 0xaa and the second 61Kb of data having the value 0x00. $XFS_IO_PROG -c "truncate 64K" \ -c "truncate 125K" \ -c "fsync" \ $SCRATCH_MNT/foo # Do something similar to our file bar, but the first truncation sets # the file size to 0 and the second truncation expands the size to the # double of what it was initially. $XFS_IO_PROG -c "truncate 0" \ -c "truncate 253K" \ -c "fsync" \ $SCRATCH_MNT/bar _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again, mount to trigger log replay and validate file # contents. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # We expect foo to have a size of 125Kb, the first 64Kb of data all # having the value 0xaa and the remaining 61Kb to be a hole (all bytes # with value 0x00). echo "File foo content after log replay:" od -t x1 $SCRATCH_MNT/foo # We expect bar to have a size of 253Kb and no extents (any byte read # from bar has the value 0x00). echo "File bar content after log replay:" od -t x1 $SCRATCH_MNT/bar status=0 exit The expected file contents in the golden output are: File foo content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0372000 File bar content after log replay: 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0772000 Without this fix, their contents are: File foo content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb * 0372000 File bar content after log replay: 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0772000 A test case submission for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix fsync xattr loss in the fast fsync pathFilipe Manana2016-04-201-0/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 36283bf777d963fac099213297e155d071096994 ] After commit 4f764e515361 ("Btrfs: remove deleted xattrs on fsync log replay"), we can end up in a situation where during log replay we end up deleting xattrs that were never deleted when their file was last fsynced. This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING set, the xattr was added in a past transaction and the leaf where the xattr is located was not updated (COWed or created) in the current transaction. In this scenario the xattr item never ends up in the log tree and therefore at log replay time, which makes the replay code delete the xattr from the fs/subvol tree as it thinks that xattr was deleted prior to the last fsync. Fix this by always logging all xattrs, which is the simplest and most reliable way to detect deleted xattrs and replay the deletes at log replay time. This issue is reproducible with the following test case for fstests: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" here=`pwd` tmp=/tmp/$$ status=1 # failure is the default! _cleanup() { _cleanup_flakey rm -f $tmp.* } trap "_cleanup; exit \$status" 0 1 2 3 15 # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey . ./common/attr # real QA test starts here # We create a lot of xattrs for a single file. Only btrfs and xfs are currently # able to store such a large mount of xattrs per file, other filesystems such # as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add # less than 1000 xattrs with very small values. _supported_fs btrfs xfs _supported_os Linux _need_to_be_root _require_scratch _require_dm_flakey _require_attrs _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create the test file with some initial data and make sure everything is # durably persisted. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io sync # Add many small xattrs to our file. # We create such a large amount because it's needed to trigger the issue found # in btrfs - we need to have an amount that causes the fs to have at least 3 # btree leafs with xattrs stored in them, and it must work on any leaf size # (maximum leaf/node size is 64Kb). num_xattrs=2000 for ((i = 1; i <= $num_xattrs; i++)); do name="user.attr_$(printf "%04d" $i)" $SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo done # Sync the filesystem to force a commit of the current btrfs transaction, this # is a necessary condition to trigger the bug on btrfs. sync # Now update our file's data and fsync the file. # After a successful fsync, if the fsync log/journal is replayed we expect to # see all the xattrs we added before with the same values (and the updated file # data of course). Btrfs used to delete some of these xattrs when it replayed # its fsync log/journal. $XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \ -c "fsync" \ $SCRATCH_MNT/foo | _filter_xfs_io # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again and mount. This makes the fs replay its fsync log. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey echo "File content after crash and log replay:" od -t x1 $SCRATCH_MNT/foo echo "File xattrs after crash and log replay:" for ((i = 1; i <= $num_xattrs; i++)); do name="user.attr_$(printf "%04d" $i)" echo -n "$name=" $GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo echo done status=0 exit The golden output expects all xattrs to be available, and with the correct values, after the fsync log is replayed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix fsync data loss after append writeFilipe Manana2016-04-201-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e4545de5b035c7debb73d260c78377dbb69cbfb5 ] If we do an append write to a file (which increases its inode's i_size) that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, and the previous transaction added a new hard link to the file, which sets the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync the file, the inode's new i_size isn't logged. This has the consequence that after the fsync log is replayed, the file size remains what it was before the append write operation, which means users/applications will not be able to read the data that was successsfully fsync'ed before. This happens because neither the inode item nor the delayed inode get their i_size updated when the append write is made - doing so would require starting a transaction in the buffered write path, something that we do not do intentionally for performance reasons. Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is set the inode is logged with its current i_size (log the in-memory inode into the log tree). This issue is not a recent regression and is easy to reproduce with the following test case for fstests: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" here=`pwd` tmp=/tmp/$$ status=1 # failure is the default! _cleanup() { _cleanup_flakey rm -f $tmp.* } trap "_cleanup; exit \$status" 0 1 2 3 15 # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _supported_fs generic _supported_os Linux _need_to_be_root _require_scratch _require_dm_flakey _require_metadata_journaling $SCRATCH_DEV _crash_and_mount() { # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again and mount. This makes the fs replay its fsync log. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey } rm -f $seqres.full _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create the test file with some initial data and then fsync it. # The fsync here is only needed to trigger the issue in btrfs, as it causes the # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \ -c "fsync" \ $SCRATCH_MNT/foo | _filter_xfs_io sync # Add a hard link to our file. # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode, # which is a necessary condition to trigger the issue. ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar # Sync the filesystem to force a commit of the current btrfs transaction, this # is a necessary condition to trigger the bug on btrfs. sync # Now append more data to our file, increasing its size, and fsync the file. # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the # write path did not update the inode item in the btree nor the delayed inode # item (in memory struture) in the current transaction (created by the fsync # handler), the fsync did not record the inode's new i_size in the fsync # log/journal. This made the data unavailable after the fsync log/journal is # replayed. $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \ -c "fsync" \ $SCRATCH_MNT/foo | _filter_xfs_io echo "File content after fsync and before crash:" od -t x1 $SCRATCH_MNT/foo _crash_and_mount echo "File content after crash and log replay:" od -t x1 $SCRATCH_MNT/foo status=0 exit The expected file output before and after the crash/power failure expects the appended data to be available, which is: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb * 0200000 Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: async-thread: Fix a use-after-free error for traceQu Wenruo2016-04-131-1/+1
| | | | | | | | | | | | | | | [ Upstream commit 0a95b851370b84a4b9d92ee6d1fa0926901d0454 ] Parameter of trace_btrfs_work_queued() can be freed in its workqueue. So no one use use that pointer after queue_work(). Fix the user-after-free bug by move the trace line before queue_work(). Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: initialize the seq counter in struct btrfs_deviceSebastian Andrzej Siewior2016-04-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 546bed631203344611f42b2af1d224d2eedb4e6b ] I managed to trigger this: | INFO: trying to register non-static key. | the code is fine but needs lockdep annotation. | turning off the locking correctness validator. | CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14 | Hardware name: ARM-Versatile Express | [<80307cec>] (dump_stack) | [<80070e98>] (__lock_acquire) | [<8007184c>] (lock_acquire) | [<80287800>] (btrfs_ioctl) | [<8012a8d4>] (do_vfs_ioctl) | [<8012ac14>] (SyS_ioctl) so I think that btrfs_device_data_ordered_init() is not invoked behind a macro somewhere. Fixes: 7cc8e58d53cd ("Btrfs: fix unprotected device's variants on 32bits machine") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix transaction handle leak on failure to create hard linkFilipe Manana2016-04-131-2/+4
| | | | | | | | | | | | | | [ Upstream commit 271dba4521aed0c37c063548f876b49f5cd64b2e ] If we failed to create a hard link we were not always releasing the the transaction handle we got before, resulting in a memory leak and preventing any other tasks from being able to commit the current transaction. Fix this by always releasing our transaction handle. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix number of transaction units required to create symlinkFilipe Manana2016-04-131-1/+3
| | | | | | | | | | | | [ Upstream commit 9269d12b2d57d9e3d13036bb750762d1110d425c ] We weren't accounting for the insertion of an inline extent item for the symlink inode nor that we need to update the parent inode item (through the call to btrfs_add_nondir()). So fix this by including two more transaction units. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: send, don't BUG_ON() when an empty symlink is foundFilipe Manana2016-04-131-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a879719b8c90e15c9e7fa7266d5e3c0ca962f9df ] When a symlink is successfully created it always has an inline extent containing the source path. However if an error happens when creating the symlink, we can leave in the subvolume's tree a symlink inode without any such inline extent item - this happens if after btrfs_symlink() calls btrfs_end_transaction() and before it calls the inode eviction handler (through the final iput() call), the transaction gets committed and a crash happens before the eviction handler gets called, or if a snapshot of the subvolume is made before the eviction handler gets called. Sadly we can't just avoid this by making btrfs_symlink() call btrfs_end_transaction() after it calls the eviction handler, because the later can commit the current transaction before it removes any items from the subvolume tree (if it encounters ENOSPC errors while reserving space for removing all the items). So make send fail more gracefully, with an -EIO error, and print a message to dmesg/syslog informing that there's an empty symlink inode, so that the user can delete the empty symlink or do something else about it. Reported-by: Stephen R. van den Berg <srb@cuci.nl> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: statfs: report zero available if metadata are exhaustedDavid Sterba2016-04-131-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ca8a51b3a979d57b082b14eda38602b7f52d81d1 ] There is one ENOSPC case that's very confusing. There's Available greater than zero but no file operation succeds (besides removing files). This happens when the metadata are exhausted and there's no possibility to allocate another chunk. In this scenario it's normal that there's still some space in the data chunk and the calculation in df reflects that in the Avail value. To at least give some clue about the ENOSPC situation, let statfs report zero value in Avail, even if there's still data space available. Current: /dev/sdb1 4.0G 3.3G 719M 83% /mnt/test New: /dev/sdb1 4.0G 3.3G 0 100% /mnt/test We calculate the remaining metadata space minus global reserve. If this is (supposedly) smaller than zero, there's no space. But this does not hold in practice, the exhausted state happens where's still some positive delta. So we apply some guesswork and compare the delta to a 4M threshold. (Practically observed delta was 2M.) We probably cannot calculate the exact threshold value because this depends on the internal reservations requested by various operations, so some operations that consume a few metadata will succeed even if the Avail is zero. But this is better than the other way around. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: igrab inode in writepageJosef Bacik2016-04-131-2/+15
| | | | | | | | | | | | | | | | | [ Upstream commit be7bd730841e69fe8f70120098596f648cd1f3ff ] We hit this panic on a few of our boxes this week where we have an ordered_extent with an NULL inode. We do an igrab() of the inode in writepages, but weren't doing it in writepage which can be called directly from the VM on dirty pages. If the inode has been unlinked then we could have I_FREEING set which means igrab() would return NULL and we get this panic. Fix this by trying to igrab in btrfs_writepage, and if it returns NULL then just redirty the page and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: add missing brelse when superblock checksum failsAnand Jain2016-04-131-0/+1
| | | | | | | | | | | | | [ Upstream commit b2acdddfad13c38a1e8b927d83c3cf321f63601a ] Looks like oversight, call brelse() when checksum fails. Further down the code, in the non error path, we do call brelse() and so we don't see brelse() in the goto error paths. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: properly set the termination value of ctx->pos in readdirDavid Sterba2016-03-043-3/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit bc4ef7592f657ae81b017207a1098817126ad4cb ] The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 Reported-by: Victor <services@swwu.com> CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.com> Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctlFilipe Manana2016-03-021-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 0c0fe3b0fa45082cd752553fdb3a4b42503a118e ] While doing some tests I ran into an hang on an extent buffer's rwlock that produced the following trace: [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166] [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165] [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800016] irq event stamp: 0 [39389.800016] hardirqs last enabled at (0): [< (null)>] (null) [39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800016] softirqs last disabled at (0): [< (null)>] (null) [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000 [39389.800016] RIP: 0010:[<ffffffff810902af>] [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158 [39389.800016] RSP: 0018:ffff8800a185fb80 EFLAGS: 00000202 [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101 [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001 [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000 [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98 [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40 [39389.800016] FS: 00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000 [39389.800016] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800016] Stack: [39389.800016] ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0 [39389.800016] ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895 [39389.800016] ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c [39389.800016] Call Trace: [39389.800016] [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60 [39389.800016] [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41 [39389.800016] [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44 [39389.800016] [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs] [39389.800016] [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs] [39389.800016] [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs] [39389.800016] [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs] [39389.800016] [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs] [39389.800016] [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs] [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15 [39389.800016] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800016] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [39389.800016] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800016] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800016] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800016] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8 [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [39389.800012] irq event stamp: 0 [39389.800012] hardirqs last enabled at (0): [< (null)>] (null) [39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35 [39389.800012] softirqs last disabled at (0): [< (null)>] (null) [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G L 4.4.0-rc6-btrfs-next-18+ #1 [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000 [39389.800012] RIP: 0010:[<ffffffff81091e8d>] [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72 [39389.800012] RSP: 0018:ffff880034a639f0 EFLAGS: 00000206 [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000 [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000 [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98 [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00 [39389.800012] FS: 00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000 [39389.800012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0 [39389.800012] Stack: [39389.800012] ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98 [39389.800012] ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00 [39389.800012] ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58 [39389.800012] Call Trace: [39389.800012] [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c [39389.800012] [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41 [39389.800012] [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs] [39389.800012] [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs] [39389.800012] [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs] [39389.800012] [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs] [39389.800012] [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs] [39389.800012] [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28 [39389.800012] [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs] [39389.800012] [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf [39389.800012] [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc [39389.800012] [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs] [39389.800012] [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs] [39389.800012] [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44 [39389.800012] [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs] [39389.800012] [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs] [39389.800012] [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs] [39389.800012] [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs] [39389.800012] [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs] [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff81140261>] ? __might_fault+0x4c/0xa7 [39389.800012] [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc [39389.800012] [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d [39389.800012] [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea [39389.800012] [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71 [39389.800012] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [39389.800012] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00 This happens because in the code path executed by the inode_paths ioctl we end up nesting two calls to read lock a leaf's rwlock when after the first call to read_lock() and before the second call to read_lock(), another task (running the delayed items as part of a transaction commit) has already called write_lock() against the leaf's rwlock. This situation is illustrated by the following diagram: Task A Task B btrfs_ref_to_path() btrfs_commit_transaction() read_lock(&eb->lock); btrfs_run_delayed_items() __btrfs_commit_inode_delayed_items() __btrfs_update_delayed_inode() btrfs_lookup_inode() write_lock(&eb->lock); --> task waits for lock read_lock(&eb->lock); --> makes this task hang forever (and task B too of course) So fix this by avoiding doing the nested read lock, which is easily avoidable. This issue does not happen if task B calls write_lock() after task A does the second call to read_lock(), however there does not seem to exist anything in the documentation that mentions what is the expected behaviour for recursive locking of rwlocks (leaving the idea that doing so is not a good usage of rwlocks). Also, as a side effect necessary for this fix, make sure we do not needlessly read lock extent buffers when the input path has skip_locking set (used when called from send). Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix race when listing an inode's xattrsFilipe Manana2016-01-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f1cd1f0b7d1b5d4aaa5711e8f4e4898b0045cb6d ] When listing a inode's xattrs we have a time window where we race against a concurrent operation for adding a new hard link for our inode that makes us not return any xattr to user space. In order for this to happen, the first xattr of our inode needs to be at slot 0 of a leaf and the previous leaf must still have room for an inode ref (or extref) item, and this can happen because an inode's listxattrs callback does not lock the inode's i_mutex (nor does the VFS does it for us), but adding a hard link to an inode makes the VFS lock the inode's i_mutex before calling the inode's link callback. If we have the following leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 XATTR_ITEM 12345), ... ] slot N - 2 slot N - 1 slot 0 The race illustrated by the following sequence diagram is possible: CPU 1 CPU 2 btrfs_listxattr() searches for key (257 XATTR_ITEM 0) gets path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() releases the path adds key (257 INODE_REF 666) to the end of leaf X (slot N), and leaf X now has N + 1 items searches for the key (257 INODE_REF 256), with path->keep_locks == 1, because that is the last key it saw in leaf X before releasing the path ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in leaf X, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 666) btrfs_listxattr's loop iteration sees that the type of the key pointed by the path is different from the type BTRFS_XATTR_ITEM_KEY and so it breaks the loop and stops looking for more xattr items --> the application doesn't get any xattr listed for our inode So fix this by breaking the loop only if the key's type is greater than BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix race leading to BUG_ON when running delalloc for nodatacowFilipe Manana2016-01-211-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d ] If we are using the NO_HOLES feature, we have a tiny time window when running delalloc for a nodatacow inode where we can race with a concurrent link or xattr add operation leading to a BUG_ON. This happens because at run_delalloc_nocow() we end up casting a leaf item of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a file extent item (struct btrfs_file_extent_item) and then analyse its extent type field, which won't match any of the expected extent types (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an explicit BUG_ON(1). The following sequence diagram shows how the race happens when running a no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 (Note the implicit hole for inode 257 regarding the [0, 8K[ range) CPU 1 CPU 2 run_dealloc_nocow() btrfs_lookup_file_extent() --> searches for a key with value (257 EXTENT_DATA 4096) in the fs/subvol tree --> returns us a path with path->nodes[0] == leaf X and path->slots[0] == N because path->slots[0] is >= btrfs_header_nritems(leaf X), it calls btrfs_next_leaf() btrfs_next_leaf() --> releases the path hard link added to our inode, with key (257 INODE_REF 500) added to the end of leaf X, so leaf X now has N + 1 keys --> searches for the key (257 INODE_REF 256), because it was the last key in leaf X before it released the path, with path->keep_locks set to 1 --> ends up at leaf X again and it verifies that the key (257 INODE_REF 256) is no longer the last key in the leaf, so it returns with path->nodes[0] == leaf X and path->slots[0] == N, pointing to the new item with key (257 INODE_REF 500) the loop iteration of run_dealloc_nocow() does not break out the loop and continues because the key referenced in the path at path->nodes[0] and path->slots[0] is for inode 257, its type is < BTRFS_EXTENT_DATA_KEY and its offset (500) is less then our delalloc range's end (8192) the item pointed by the path, an inode reference item, is (incorrectly) interpreted as a file extent item and we get an invalid extent type, leading to the BUG_ON(1): if (extent_type == BTRFS_FILE_EXTENT_REG || extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) } else { BUG_ON(1) } The same can happen if a xattr is added concurrently and ends up having a key with an offset smaller then the delalloc's range end. So fix this by skipping keys with a type smaller than BTRFS_EXTENT_DATA_KEY. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix race leading to incorrect item deletion when dropping extentsFilipe Manana2016-01-211-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit aeafbf8486c9e2bd53f5cc3c10c0b7fd7149d69c ] While running a stress test I got the following warning triggered: [191627.672810] ------------[ cut here ]------------ [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]() (...) [191627.701485] Call Trace: [191627.702037] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [191627.702992] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2 [191627.704091] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [191627.705380] [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.706637] [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c [191627.707789] [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs] [191627.709155] [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0 [191627.712444] [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18 [191627.714162] [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs] [191627.715887] [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs] [191627.717287] [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs] [191627.728865] [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs] [191627.730045] [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs] [191627.731256] [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs] [191627.732661] [<ffffffff81061119>] process_one_work+0x24c/0x4ae [191627.733822] [<ffffffff810615b0>] worker_thread+0x206/0x2c2 [191627.734857] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f [191627.736052] [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f [191627.737349] [<ffffffff810669a6>] kthread+0xef/0xf7 [191627.738267] [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28 [191627.739330] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad [191627.741976] [<ffffffff81465592>] ret_from_fork+0x42/0x70 [191627.743080] [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad [191627.744206] ---[ end trace bbfddacb7aaada8d ]--- $ cat -n fs/btrfs/file.c 691 int __btrfs_drop_extents(struct btrfs_trans_handle *trans, (...) 758 btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); 759 if (key.objectid > ino || 760 key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end) 761 break; 762 763 fi = btrfs_item_ptr(leaf, path->slots[0], 764 struct btrfs_file_extent_item); 765 extent_type = btrfs_file_extent_type(leaf, fi); 766 767 if (extent_type == BTRFS_FILE_EXTENT_REG || 768 extent_type == BTRFS_FILE_EXTENT_PREALLOC) { (...) 774 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) { (...) 778 } else { 779 WARN_ON(1); 780 extent_end = search_start; 781 } (...) This happened because the item we were processing did not match a file extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this case we cast the item to a struct btrfs_file_extent_item pointer and then find a type field value that does not match any of the expected values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens due to a tiny time window where a race can happen as exemplified below. For example, consider the following scenario where we're using the NO_HOLES feature and we have the following two neighbour leafs: Leaf X (has N items) Leaf Y [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ] [ (257 EXTENT_DATA 8192), ... ] slot N - 2 slot N - 1 slot 0 Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather than explicit because NO_HOLES is enabled). Now if our inode has an ordered extent for the range [4K, 8K[ that is finishing, the following can happen: CPU 1 CPU 2 btrfs_finish_ordered_io() insert_reserved_file_extent() __btrfs_drop_extents() Searches for the key (257 EXTENT_DATA 4096) through btrfs_lookup_file_extent() Key not found and we get a path where path->nodes[0] == leaf X and path->slots[0] == N Because path->slots[0] is >= btrfs_header_nritems(leaf X), we call btrfs_next_leaf() btrfs_next_leaf() releases the path inserts key (257 INODE_REF 4096) at the end of leaf X, leaf X now has N + 1 keys, and the new key is at slot N btrfs_next_leaf() searches for key (257 INODE_REF 256), with path->keep_locks set to 1, because it was the last key it saw in leaf X finds it in leaf X again and notices it's no longer the last key of the leaf, so it returns 0 with path->nodes[0] == leaf X and path->slots[0] == N (which is now < btrfs_header_nritems(leaf X)), pointing to the new key (257 INODE_REF 4096) __btrfs_drop_extents() casts the item at path->nodes[0], slot path->slots[0], to a struct btrfs_file_extent_item - it does not skip keys for the target inode with a type less than BTRFS_EXTENT_DATA_KEY (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY) sees a bogus value for the type field triggering the WARN_ON in the trace shown above, and sets extent_end = search_start (4096) does the if-then-else logic to fixup 0 length extent items created by a past bug from hole punching: if (extent_end == key.offset && extent_end >= search_start) goto delete_extent_item; that evaluates to true and it ends up deleting the key pointed to by path->slots[0], (257 INODE_REF 4096), from leaf X The same could happen for example for a xattr that ends up having a key with an offset value that matches search_start (very unlikely but not impossible). So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are skipped, never casted to struct btrfs_file_extent_item and never deleted by accident. Also protect against the unexpected case of getting a key for a lower inode number by skipping that key and issuing a warning. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix file corruption and data loss after cloning inline extentsFilipe Manana2016-01-211-29/+152
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8039d87d9e473aeb740d4fdbd59b9d2f89b2ced9 ] Currently the clone ioctl allows to clone an inline extent from one file to another that already has other (non-inlined) extents. This is a problem because btrfs is not designed to deal with files having inline and regular extents, if a file has an inline extent then it must be the only extent in the file and must start at file offset 0. Having a file with an inline extent followed by regular extents results in EIO errors when doing reads or writes against the first 4K of the file. Also, the clone ioctl allows one to lose data if the source file consists of a single inline extent, with a size of N bytes, and the destination file consists of a single inline extent with a size of M bytes, where we have M > N. In this case the clone operation removes the inline extent from the destination file and then copies the inline extent from the source file into the destination file - we lose the M - N bytes from the destination file, a read operation will get the value 0x00 for any bytes in the the range [N, M] (the destination inode's i_size remained as M, that's why we can read past N bytes). So fix this by not allowing such destructive operations to happen and return errno EOPNOTSUPP to user space. Currently the fstest btrfs/035 tests the data loss case but it totally ignores this - i.e. expects the operation to succeed and does not check the we got data loss. The following test case for fstests exercises all these cases that result in file corruption and data loss: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" rm -f $seqres.full test_cloning_inline_extents() { local mkfs_opts=$1 local mount_opts=$2 _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1 _scratch_mount $mount_opts # File bar, the source for all the following clone operations, consists # of a single inline extent (50 bytes). $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \ | _filter_xfs_io # Test cloning into a file with an extent (non-inlined) where the # destination offset overlaps that extent. It should not be possible to # clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo data after clone operation:" # All bytes should have the value 0xaa (clone operation failed and did # not modify our file). od -t x1 $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io # Test cloning the inline extent against a file which has a hole in its # first 4K followed by a non-inlined extent. It should not be possible # as well to clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo2 data after clone operation:" # All bytes should have the value 0x00 (clone operation failed and did # not modify our file). od -t x1 $SCRATCH_MNT/foo2 $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io # Test cloning the inline extent against a file which has a size of zero # but has a prealloc extent. It should not be possible as well to clone # the inline extent from file bar into this file. $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "First 50 bytes of foo3 after clone operation:" # Should not be able to read any bytes, file has 0 bytes i_size (the # clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo3 $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io # Test cloning the inline extent against a file which consists of a # single inline extent that has a size not greater than the size of # bar's inline extent (40 < 50). # It should be possible to do the extent cloning from bar to this file. $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4 # Doing IO against any range in the first 4K of the file should work. echo "File foo4 data after clone operation:" # Must match file bar's content. od -t x1 $SCRATCH_MNT/foo4 $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io # Test cloning the inline extent against a file which consists of a # single inline extent that has a size greater than the size of bar's # inline extent (60 > 50). # It should not be possible to clone the inline extent from file bar # into this file. $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5 # Reading the file should not fail. echo "File foo5 data after clone operation:" # Must have a size of 60 bytes, with all bytes having a value of 0x03 # (the clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo5 # Test cloning the inline extent against a file which has no extents but # has a size greater than bar's inline extent (16K > 50). # It should not be possible to clone the inline extent from file bar # into this file. $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6 # Reading the file should not fail. echo "File foo6 data after clone operation:" # Must have a size of 16K, with all bytes having a value of 0x00 (the # clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo6 # Test cloning the inline extent against a file which has no extents but # has a size not greater than bar's inline extent (30 < 50). # It should be possible to clone the inline extent from file bar into # this file. $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7 # Reading the file should not fail. echo "File foo7 data after clone operation:" # Must have a size of 50 bytes, with all bytes having a value of 0xbb. od -t x1 $SCRATCH_MNT/foo7 # Test cloning the inline extent against a file which has a size not # greater than the size of bar's inline extent (20 < 50) but has # a prealloc extent that goes beyond the file's size. It should not be # possible to clone the inline extent from bar into this file. $XFS_IO_PROG -f -c "falloc -k 0 1M" \ -c "pwrite -S 0x88 0 20" \ $SCRATCH_MNT/foo8 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8 echo "File foo8 data after clone operation:" # Must have a size of 20 bytes, with all bytes having a value of 0x88 # (the clone operation did not modify our file). od -t x1 $SCRATCH_MNT/foo8 _scratch_unmount } echo -e "\nTesting without compression and without the no-holes feature...\n" test_cloning_inline_extents echo -e "\nTesting with compression and without the no-holes feature...\n" test_cloning_inline_extents "" "-o compress" echo -e "\nTesting without compression and with the no-holes feature...\n" test_cloning_inline_extents "-O no-holes" "" echo -e "\nTesting with compression and with the no-holes feature...\n" test_cloning_inline_extents "-O no-holes" "-o compress" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: fix use after free iterating extrefsChris Mason2015-11-131-5/+3
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit dc6c5fb3b514221f2e9d21ee626a9d95d3418dff ] The code for btrfs inode-resolve has never worked properly for files with enough hard links to trigger extrefs. It was trying to get the leaf out of a path after freeing the path: btrfs_release_path(path); leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, slot); The fix here is to use the extent buffer we cloned just a little higher up to avoid deadlocks caused by using the leaf in the path. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v3.7+ cc: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: update fix for read corruption of compressed and shared extentsFilipe Manana2015-10-271-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 808f80b46790f27e145c72112189d6a3be2bc884 ] My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of compressed and shared extents") was effective only if the compressed extents cover a file range with a length that is not a multiple of 16 pages. That's because the detection of when we reached a different range of the file that shares the same compressed extent as the previously processed range was done at extent_io.c:__do_contiguous_readpages(), which covers subranges with a length up to 16 pages, because extent_readpages() groups the pages in clusters no larger than 16 pages. So fix this by tracking the start of the previously processed file range's extent map at extent_readpages(). The following test case for fstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create our test file with a single extent of 64Kb that is going to # be compressed no matter which compression algo is used (zlib/lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone the compressed extent into an adjacent file offset. $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo echo "File digest before unmount:" md5sum $SCRATCH_MNT/foo | _filter_scratch # Remount the fs or clear the page cache to trigger the bug in # btrfs. Because the extent has an uncompressed length that is a # multiple of 16 pages, all the pages belonging to the second range # of the file (64K to 128K), which points to the same extent as the # first range (0K to 64K), had their contents full of zeroes instead # of the byte 0xaa. This was a bug exclusively in the read path of # compressed extents, the correct data was stored on disk, btrfs # just failed to fill in the pages correctly. _scratch_remount echo "File digest after remount:" # Must match the digest we got before. md5sum $SCRATCH_MNT/foo | _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Timofey Titovets <nefelim4ag@gmail.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix read corruption of compressed and shared extentsFilipe Manana2015-10-271-8/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 005efedf2c7d0a270ffbe28d8997b03844f3e3e7 ] If a file has a range pointing to a compressed extent, followed by another range that points to the same compressed extent and a read operation attempts to read both ranges (either completely or part of them), the pages that correspond to the second range are incorrectly filled with zeroes. Consider the following example: File layout [0 - 8K] [8K - 24K] | | | | points to extent X, points to extent X, offset 4K, length of 8K offset 0, length 16K [extent X, compressed length = 4K uncompressed length = 16K] If a readpages() call spans the 2 ranges, a single bio to read the extent is submitted - extent_io.c:submit_extent_page() would only create a new bio to cover the second range pointing to the extent if the extent it points to had a different logical address than the extent associated with the first range. This has a consequence of the compressed read end io handler (compression.c:end_compressed_bio_read()) finish once the extent is decompressed into the pages covering the first range, leaving the remaining pages (belonging to the second range) filled with zeroes (done by compression.c:btrfs_clear_biovec_end()). So fix this by submitting the current bio whenever we find a range pointing to a compressed extent that was preceded by a range with a different extent map. This is the simplest solution for this corner case. Making the end io callback populate both ranges (or more, if we have multiple pointing to the same extent) is a much more complex solution since each bio is tightly coupled with a single extent map and the extent maps associated to the ranges pointing to the shared extent can have different offsets and lengths. The following test case for fstests triggers the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create a test file with a single extent that is compressed (the # data we write into it is highly compressible no matter which # compression algorithm is used, zlib or lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \ -c "pwrite -S 0xbb 4K 8K" \ -c "pwrite -S 0xcc 12K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone our extent into an adjacent offset. $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo # Same as before but for this file we clone the extent into a lower # file offset. $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \ -c "pwrite -S 0xbb 12K 8K" \ -c "pwrite -S 0xcc 20K 4K" \ $SCRATCH_MNT/bar | _filter_xfs_io $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/bar echo "File digests before unmounting filesystem:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch # Evicting the inode or clearing the page cache before reading # again the file would also trigger the bug - reads were returning # all bytes in the range corresponding to the second reference to # the extent with a value of 0, but the correct data was persisted # (it was a bug exclusively in the read path). The issue happened # only if the same readpages() call targeted pages belonging to the # first and second ranges that point to the same compressed extent. _scratch_remount echo "File digests after mounting filesystem again:" # Must match the same digests we got before. md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: skip waiting on ordered range for special filesJeff Mahoney2015-10-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a30e577c96f59b1e1678ea5462432b09bf7d5cbc ] In btrfs_evict_inode, we properly truncate the page cache for evicted inodes but then we call btrfs_wait_ordered_range for every inode as well. It's the right thing to do for regular files but results in incorrect behavior for device inodes for block devices. filemap_fdatawrite_range gets called with inode->i_mapping which gets resolved to the block device inode before getting passed to wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens next depends on whether there's an open file handle associated with the inode. If there is, we write to the block device, which is unexpected behavior. If there isn't, we through normally and inode->i_data is used. We can also end up racing against open/close which can result in crashes when i_mapping points to a block device inode that has been closed. Since there can't be any page cache associated with special file inodes, it's safe to skip the btrfs_wait_ordered_range call entirely and avoid the problem. Cc: <stable@vger.kernel.org> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911 Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: check if previous transaction aborted to avoid fs corruptionFilipe Manana2015-10-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 1f9b8c8fbc9a4d029760b16f477b9d15500e3a34 ] While we are committing a transaction, it's possible the previous one is still finishing its commit and therefore we wait for it to finish first. However we were not checking if that previous transaction ended up getting aborted after we waited for it to commit, so we ended up committing the current transaction which can lead to fs corruption because the new superblock can point to trees that have had one or more nodes/leafs that were never durably persisted. The following sequence diagram exemplifies how this is possible: CPU 0 CPU 1 transaction N starts (...) btrfs_commit_transaction(N) cur_trans->state = TRANS_STATE_COMMIT_START; (...) cur_trans->state = TRANS_STATE_COMMIT_DOING; (...) cur_trans->state = TRANS_STATE_UNBLOCKED; root->fs_info->running_transaction = NULL; btrfs_start_transaction() --> starts transaction N + 1 btrfs_write_and_wait_transaction(trans, root); --> starts writing all new or COWed ebs created at transaction N creates some new ebs, COWs some existing ebs but doesn't COW or deletes eb X btrfs_commit_transaction(N + 1) (...) cur_trans->state = TRANS_STATE_COMMIT_START; (...) wait_for_commit(root, prev_trans); --> prev_trans == transaction N btrfs_write_and_wait_transaction() continues writing ebs --> fails writing eb X, we abort transaction N and set bit BTRFS_FS_STATE_ERROR on fs_info->fs_state, so no new transactions can start after setting that bit cleanup_transaction() btrfs_cleanup_one_transaction() wakes up task at CPU 1 continues, doesn't abort because cur_trans->aborted (transaction N + 1) is zero, and no checks for bit BTRFS_FS_STATE_ERROR in fs_info->fs_state are made btrfs_write_and_wait_transaction(trans, root); --> succeeds, no errors during writeback write_ctree_super(trans, root, 0); --> succeeds --> we have now a superblock that points us to some root that uses eb X, which was never written to disk In this scenario future attempts to read eb X from disk results in an error message like "parent transid verify failed on X wanted Y found Z". So fix this by aborting the current transaction if after waiting for the previous transaction we verify that it was aborted. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: cleanup orphans while looking up default subvolumeJeff Mahoney2015-08-271-0/+9
| | | | | | | | | | | | | | | | [ Upstream commit 727b9784b6085c99c2f836bf4fcc2848dc9cf904 ] Orphans in the fs tree are cleaned up via open_ctree and subvolume orphans are cleaned via btrfs_lookup_dentry -- except when a default subvolume is in use. The name for the default subvolume uses a manual lookup that doesn't trigger orphan cleanup and needs to trigger it manually as well. This doesn't apply to the remount case since the subvolumes are cleaned up by walking the root radix tree. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: incorrect handling for fiemap_fill_next_extent returnChengyu Song2015-08-271-1/+4
| | | | | | | | | | | | | | | [ Upstream commit 26e726afe01c1c82072cf23a5ed89ce25f39d9f2 ] fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was the last extent that will fit in user array. If 1 is returned, the return value may eventually returned to user space, which should not happen, according to manpage of ioctl. Signed-off-by: Chengyu Song <csong84@gatech.edu> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: fix memory leak in the extent_same ioctlFilipe Manana2015-08-041-1/+3
| | | | | | | | | | | | | | | | | | [ Upstream commit 497b4050e0eacd4c746dd396d14916b1e669849d ] We were allocating memory with memdup_user() but we were never releasing that memory. This affected pretty much every call to the ioctl, whether it deduplicated extents or not. This issue was reported on IRC by Julian Taylor and on the mailing list by Marcel Ritter, credit goes to them for finding the issue. Reported-by: Julian Taylor <jtaylor.debian@googlemail.com> Reported-by: Marcel Ritter <ritter.marcel@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: use kmem_cache_free when freeing entry in inode cacheFilipe Manana2015-08-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c3f4a1685bb87e59c886ee68f7967eae07d4dffa ] The free space entries are allocated using kmem_cache_zalloc(), through __btrfs_add_free_space(), therefore we should use kmem_cache_free() and not kfree() to avoid any confusion and any potential problem. Looking at the kfree() definition at mm/slab.c it has the following comment: /* * (...) * * Don't free memory not originally allocated by kmalloc() * or you will run into trouble. */ So better be safe and use kmem_cache_free(). Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: make xattr replace operations atomicFilipe Manana2015-07-134-65/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 02590fd855d1690568b2fa439c942e933221b57a ] commit 5f5bc6b1e2d5a6f827bc860ef2dc5b6f365d1339 upstream. Replacing a xattr consists of doing a lookup for its existing value, delete the current value from the respective leaf, release the search path and then finally insert the new value. This leaves a time window where readers (getxattr, listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs, so this has security implications. This change also fixes 2 other existing issues which were: *) Deleting the old xattr value without verifying first if the new xattr will fit in the existing leaf item (in case multiple xattrs are packed in the same item due to name hash collision); *) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't exist but we have have an existing item that packs muliple xattrs with the same name hash as the input xattr. In this case we should return ENOSPC. A test case for xfstests follows soon. Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace implementation. Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* Btrfs: don't invalidate root dentry when subvolume deletion failsOmar Sandoval2015-07-031-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 64ad6c488975d7516230cf7849190a991fd615ae ] Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"), mounted subvolumes can be deleted because d_invalidate() won't fail. However, we run into problems when we attempt to delete the default subvolume while it is mounted as the root filesystem: # btrfs subvol list / ID 257 gen 306 top level 5 path rootvol ID 267 gen 334 top level 5 path snap1 # btrfs subvol get-default / ID 267 gen 334 top level 5 path snap1 # btrfs inspect-internal rootid / 267 # mount -o subvol=/ /dev/vda1 /mnt # btrfs subvol del /mnt/snap1 Delete subvolume (no-commit): '/mnt/snap1' ERROR: cannot delete '/mnt/snap1' - Operation not permitted # findmnt / findmnt: can't read /proc/mounts: No such file or directory # ls /proc # Markus reported that this same scenario simply led to a kernel oops. This happens because in btrfs_ioctl_snap_destroy(), we call d_invalidate() before we check may_destroy_subvol(), which means that we detach the submounts and drop the dentry before erroring out. Instead, we should only invalidate the dentry once the deletion has succeeded. Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate() will prune the dcache for the deleted subvolume. Cc: <stable@vger.kernel.org> Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate") Reported-by: Markus Schauler <mschauler@gmail.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
* btrfs: unlock i_mutex after attempting to delete subvolume during sendOmar Sandoval2015-05-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 909e26dce3f7600f5e293ac0522c28790a0c8c9c ] Whenever the check for a send in progress introduced in commit 521e0546c970 (btrfs: protect snapshots from deleting during send) is hit, we return without unlocking inode->i_mutex. This is easy to see with lockdep enabled: [ +0.000059] ================================================ [ +0.000028] [ BUG: lock held when returning to user space! ] [ +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted [ +0.000026] ------------------------------------------------ [ +0.000029] btrfs/211 is leaving the kernel with locks still held! [ +0.000029] 1 lock held by btrfs/211: [ +0.000023] #0: (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0 Make sure we unlock it in the error path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>