summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus-4.7' of ↵Linus Torvalds2016-05-2128-1788/+1530
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has our merge window series of cleanups and fixes. These target a wide range of issues, but do include some important fixes for qgroups, O_DIRECT, and fsync handling. Jeff Mahoney moved around a few definitions to make them easier for userland to consume. Also whiteout support is included now that issues with overlayfs have been cleared up. I have one more fix pending for page faults during btrfs_copy_from_user, but I wanted to get this bulk out the door first" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (90 commits) btrfs: fix memory leak during RAID 5/6 device replacement Btrfs: add semaphore to synchronize direct IO writes with fsync Btrfs: fix race between block group relocation and nocow writes Btrfs: fix race between fsync and direct IO writes for prealloc extents Btrfs: fix number of transaction units for renames with whiteout Btrfs: pin logs earlier when doing a rename exchange operation Btrfs: unpin logs if rename exchange operation fails Btrfs: fix inode leak on failure to setup whiteout inode in rename btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT Btrfs: pin log earlier when renaming Btrfs: unpin log if rename operation fails Btrfs: don't do unnecessary delalloc flushes when relocating Btrfs: don't wait for unrelated IO to finish before relocation Btrfs: fix empty symlink after creating symlink and fsync parent dir Btrfs: fix for incorrect directory entries after fsync log replay btrfs: build fixup for qgroup_account_snapshot btrfs: qgroup: Fix qgroup accounting when creating snapshot Btrfs: fix fspath error deallocation btrfs: make find_workspace warn if there are no workspaces btrfs: make find_workspace always succeed ...
| * Merge branch 'for-chris-4.7' of ↵Chris Mason2016-05-1712-134/+608
| |\ | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.7 Signed-off-by: Chris Mason <clm@fb.com>
| | * Btrfs: add semaphore to synchronize direct IO writes with fsyncFilipe Manana2016-05-133-118/+77
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to the optimization of lockless direct IO writes (the inode's i_mutex is not held) introduced in commit 38851cc19adb ("Btrfs: implement unlocked dio write"), we started having races between such writes with concurrent fsync operations that use the fast fsync path. These races were addressed in the patches titled "Btrfs: fix race between fsync and lockless direct IO writes" and "Btrfs: fix race between fsync and direct IO writes for prealloc extents". The races happened because the direct IO path, like every other write path, does create extent maps followed by the corresponding ordered extents while the fast fsync path collected first ordered extents and then it collected extent maps. This made it possible to log file extent items (based on the collected extent maps) without waiting for the corresponding ordered extents to complete (get their IO done). The two fixes mentioned before added a solution that consists of making the direct IO path create first the ordered extents and then the extent maps, while the fsync path attempts to collect any new ordered extents once it collects the extent maps. This was simple and did not require adding any synchonization primitive to any data structure (struct btrfs_inode for example) but it makes things more fragile for future development endeavours and adds an exceptional approach compared to the other write paths. This change adds a read-write semaphore to the btrfs inode structure and makes the direct IO path create the extent maps and the ordered extents while holding read access on that semaphore, while the fast fsync path collects extent maps and ordered extents while holding write access on that semaphore. The logic for direct IO write path is encapsulated in a new helper function that is used both for cow and nocow direct IO writes. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
| | * Btrfs: fix race between block group relocation and nocow writesFilipe Manana2016-05-134-1/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Relocation of a block group waits for all existing tasks flushing dellaloc, starting direct IO writes and any ordered extents before starting the relocation process. However for direct IO writes that end up doing nocow (inode either has the flag nodatacow set or the write is against a prealloc extent) we have a short time window that allows for a race that makes relocation proceed without waiting for the direct IO write to complete first, resulting in data loss after the relocation finishes. This is illustrated by the following diagram: CPU 1 CPU 2 btrfs_relocate_block_group(bg X) direct IO write starts against an extent in block group X using nocow mode (inode has the nodatacow flag or the write is for a prealloc extent) btrfs_direct_IO() btrfs_get_blocks_direct() --> can_nocow_extent() returns 1 btrfs_inc_block_group_ro(bg X) --> turns block group into RO mode btrfs_wait_ordered_roots() --> returns and does not know about the DIO write happening at CPU 2 (the task there has not created yet an ordered extent) relocate_block_group(bg X) --> rc->stage == MOVE_DATA_EXTENTS find_next_extent() --> returns extent that the DIO write is going to write to relocate_data_extent() relocate_file_extent_cluster() --> reads the extent from disk into pages belonging to the relocation inode and dirties them --> creates DIO ordered extent btrfs_submit_direct() --> submits bio against a location on disk obtained from an extent map before the relocation started btrfs_wait_ordered_range() --> writes all the pages read before to disk (belonging to the relocation inode) relocation finishes bio completes and wrote new data to the old location of the block group So fix this by tracking the number of nocow writers for a block group and make sure relocation waits for that number to go down to 0 before starting to move the extents. The same race can also happen with buffered writes in nocow mode since the patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes when relocating", because we are no longer flushing all delalloc which served as a synchonization mechanism (due to page locking) and ensured the ordered extents for nocow buffered writes were created before we called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow mode existed before that patch (no pages are locked or used during direct IO) and that fixed only races with direct IO writes that do cow. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
| | * Btrfs: fix race between fsync and direct IO writes for prealloc extentsFilipe Manana2016-05-131-6/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we do a direct IO write against a preallocated extent (fallocate) that does not go beyond the i_size of the inode, we do the write operation without holding the inode's i_mutex (an optimization that landed in commit 38851cc19adb ("Btrfs: implement unlocked dio write")). This allows for a very tiny time window where a race can happen with a concurrent fsync using the fast code path, as the direct IO write path creates first a new extent map (no longer flagged as a prealloc extent) and then it creates the ordered extent, while the fast fsync path first collects ordered extents and then it collects extent maps. This allows for the possibility of the fast fsync path to collect the new extent map without collecting the new ordered extent, and therefore logging an extent item based on the extent map without waiting for the ordered extent to be created and complete. This can result in a situation where after a log replay we end up with an extent not marked anymore as prealloc but it was only partially written (or not written at all), exposing random, stale or garbage data corresponding to the unwritten pages and without any checksums in the csum tree covering the extent's range. This is an extension of what was done in commit de0ee0edb21f ("Btrfs: fix race between fsync and lockless direct IO writes"). So fix this by creating first the ordered extent and then the extent map, so that this way if the fast fsync patch collects the new extent map it also collects the corresponding ordered extent. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
| | * Btrfs: fix number of transaction units for renames with whiteoutFilipe Manana2016-05-131-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we do a rename with the whiteout flag, we need to create the whiteout inode, which in the worst case requires 5 transaction units (1 inode item, 1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the number of transaction units from 11 to 16 if the whiteout flag is set. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * Btrfs: pin logs earlier when doing a rename exchange operationFilipe Manana2016-05-131-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(), which had a race fixed by my previous patch titled "Btrfs: pin log earlier when renaming", and so it suffers from the same problem. We pin the logs of the affected roots after we insert the new inode references, leaving a time window where concurrent tasks logging the inodes can end up logging both the new and old references, resulting in log trees that when replayed can turn the metadata into inconsistent states. This behaviour was added to btrfs_rename() in 2009 without any explanation about why not pinning the logs earlier, just leaving a comment about the posibility for the race. As of today it's perfectly safe and sane to pin the logs before we start doing any of the steps involved in the rename operation. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * Btrfs: unpin logs if rename exchange operation failsFilipe Manana2016-05-131-2/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If rename exchange operations fail at some point after we pinned any of the logs, we end up aborting the current transaction but never unpin the logs, which leaves concurrent tasks that are trying to sync the logs (as part of an fsync request from user space) blocked forever and preventing the filesystem from being unmountable. Fix this by safely unpinning the log. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * Btrfs: fix inode leak on failure to setup whiteout inode in renameFilipe Manana2016-05-131-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | If we failed to fully setup the whiteout inode during a rename operation with the whiteout flag, we ended up leaking the inode, not decrementing its link count nor removing all its items from the fs/subvol tree. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUTDan Fuhry2016-05-131-7/+257
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new behavior in the renameat2() syscall. This behavior is primarily used by overlayfs. This patch adds support for these flags to btrfs, enabling it to be used as a fully functional upper layer for overlayfs. RENAME_EXCHANGE support was written by Davide Italiano originally submitted on 2 April 2015. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Signed-off-by: Dan Fuhry <dfuhry@datto.com> [ remove unlikely ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * Btrfs: pin log earlier when renamingFilipe Manana2016-05-131-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We were pinning the log right after the first step in the rename operation (inserting inode ref for the new name in the destination directory) instead of doing it before. This behaviour was introduced in 2009 for some reason that was not mentioned neither on the changelog nor any comment, with the drawback of a small time window where concurrent log writers can end up logging the new inode reference for the inode we are renaming while the rename operation is in progress (so that we can end up with a log containing both the new and old references). As of today there's no reason to not pin the log before that first step anymore, so just fix this. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * Btrfs: unpin log if rename operation failsFilipe Manana2016-05-131-1/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If rename operations fail at some point after we pinned the log, we end up aborting the current transaction but never unpin the log, which leaves concurrent tasks that are trying to sync the log (as part of an fsync request from user space) blocked forever and preventing the filesystem from being unmountable. Fix this by safely unpinning the log. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * Btrfs: don't do unnecessary delalloc flushes when relocatingFilipe Manana2016-05-134-7/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before we start the actual relocation process of a block group, we do calls to flush delalloc of all inodes and then wait for ordered extents to complete. However we do these flush calls just to make sure we don't race with concurrent tasks that have actually already started to run delalloc and have allocated an extent from the block group we want to relocate, right before we set it to readonly mode, but have not yet created the respective ordered extents. The flush calls make us wait for such concurrent tasks because they end up calling filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() -> __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() -> btrfs_run_delalloc_work()) which ends up serializing us with those tasks due to attempts to lock the same pages (and the delalloc flush procedure calls the allocator and creates the ordered extents before unlocking the pages). These flushing calls not only make us waste time (cpu, IO) but also reduce the chances of writing larger extents (applications might be writing to contiguous ranges and we flush before they finish dirtying the whole ranges). So make sure we don't flush delalloc and just wait for concurrent tasks that have already started flushing delalloc and have allocated an extent from the block group we are about to relocate. This change also ends up fixing a race with direct IO writes that makes relocation not wait for direct IO ordered extents. This race is illustrated by the following diagram: CPU 1 CPU 2 btrfs_relocate_block_group(bg X) starts direct IO write, target inode currently has no ordered extents ongoing nor dirty pages (delalloc regions), therefore the root for our inode is not in the list fs_info->ordered_roots btrfs_direct_IO() __blockdev_direct_IO() btrfs_get_blocks_direct() btrfs_lock_extent_direct() locks range in the io tree btrfs_new_extent_direct() btrfs_reserve_extent() --> extent allocated from bg X btrfs_inc_block_group_ro(bg X) btrfs_start_delalloc_roots() __start_delalloc_inodes() --> does nothing, no dealloc ranges in the inode's io tree so the inode's root is not in the list fs_info->delalloc_roots btrfs_wait_ordered_roots() --> does not find the inode's root in the list fs_info->ordered_roots --> ends up not waiting for the direct IO write started by the task at CPU 2 relocate_block_group(rc->stage == MOVE_DATA_EXTENTS) prepare_to_relocate() btrfs_commit_transaction() iterates the extent tree, using its commit root and moves extents into new locations btrfs_add_ordered_extent_dio() --> now a ordered extent is created and added to the list root->ordered_extents and the root added to the list fs_info->ordered_roots --> this is too late and the task at CPU 1 already started the relocation btrfs_commit_transaction() btrfs_finish_ordered_io() btrfs_alloc_reserved_file_extent() --> adds delayed data reference for the extent allocated from bg X relocate_block_group(rc->stage == UPDATE_DATA_PTRS) prepare_to_relocate() btrfs_commit_transaction() --> delayed refs are run, so an extent item for the allocated extent from bg X is added to extent tree --> commit roots are switched, so the next scan in the extent tree will see the extent item sees the extent in the extent tree When this happens the relocation produces the following warning when it finishes: [ 7260.832836] ------------[ cut here ]------------ [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]() [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1 [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000 [ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d [ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000 [ 7260.852998] Call Trace: [ 7260.852998] [<ffffffff812648b3>] dump_stack+0x67/0x90 [ 7260.852998] [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2 [ 7260.852998] [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs] [ 7260.852998] [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c [ 7260.852998] [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs] [ 7260.852998] [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs] [ 7260.852998] [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs] [ 7260.852998] [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19 [ 7260.852998] [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs] [ 7260.852998] [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs] [ 7260.852998] [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63 [ 7260.852998] [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44 [ 7260.852998] [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc [ 7260.852998] [<ffffffff811876ab>] vfs_ioctl+0x18/0x34 [ 7260.852998] [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be [ 7260.852998] [<ffffffff81190c30>] ? __fget_light+0x4d/0x71 [ 7260.852998] [<ffffffff81187d77>] SyS_ioctl+0x57/0x79 [ 7260.852998] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]--- This is because at the end of the first stage, in relocate_block_group(), we commit the current transaction, which makes delayed refs run, the commit roots are switched and so the second stage will find the extent item that the ordered extent added to the delayed refs. But this extent was not moved (ordered extent completed after first stage finished), so at the end of the relocation our block group item still has a positive used bytes counter, triggering a warning at the end of btrfs_relocate_block_group(). Later on when trying to read the extent contents from disk we hit a BUG_ON() due to the inability to map a block with a logical address that belongs to the block group we relocated and is no longer valid, resulting in the following trace: [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096 [ 7344.887518] ------------[ cut here ]------------ [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833! [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1 [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000 [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>] [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] [ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282 [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001 [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000 [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000 [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0 [ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000 [ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0 [ 7344.888431] Stack: [ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950 [ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000 [ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000 [ 7344.888431] Call Trace: [ 7344.888431] [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs] [ 7344.888431] [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs] [ 7344.888431] [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs] [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf [ 7344.888431] [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs] [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs] [ 7344.888431] [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10 [ 7344.888431] [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs] [ 7344.888431] [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd [ 7344.888431] [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs] [ 7344.888431] [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc [ 7344.888431] [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207 [ 7344.888431] [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207 [ 7344.888431] [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154 [ 7344.888431] [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f [ 7344.888431] [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1 [ 7344.888431] [<ffffffff8117773a>] __vfs_read+0x79/0x9d [ 7344.888431] [<ffffffff81178050>] vfs_read+0x8f/0xd2 [ 7344.888431] [<ffffffff81178a38>] SyS_read+0x50/0x7e [ 7344.888431] [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0 [ 7344.888431] RIP [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs] [ 7344.888431] RSP <ffff8802046878f0> [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]--- Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| | * Btrfs: don't wait for unrelated IO to finish before relocationFilipe Manana2016-05-138-19/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before the relocation process of a block group starts, it sets the block group to readonly mode, then flushes all delalloc writes and then finally it waits for all ordered extents to complete. This last step includes waiting for ordered extents destinated at extents allocated in other block groups, making us waste unecessary time. So improve this by waiting only for ordered extents that fall into the block group's range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
| | * Btrfs: fix empty symlink after creating symlink and fsync parent dirFilipe Manana2016-05-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we create a symlink, fsync its parent directory, crash/power fail and mount the filesystem, we end up with an empty symlink, which not only is useless it's also not allowed in linux (the man page symlink(2) is well explicit about that). So we just need to make sure to fully log an inode if it's a symlink, to ensure its inline extent gets logged, ensuring the same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/testdir $ sync $ ln -s /mnt/foo /mnt/testdir/bar $ xfs_io -c fsync /mnt/testdir <power fail> $ mount /dev/sdb /mnt $ readlink /mnt/testdir/bar <empty string> A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| | * Btrfs: fix for incorrect directory entries after fsync log replayFilipe Manana2016-05-131-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we move a directory to a new parent and later log that parent and don't explicitly log the old parent, when we replay the log we can end up with entries for the moved directory in both the old and new parent directories. Besides being ilegal to have directories with multiple hard links in linux, it also resulted in the leaving the inode item with a link count of 1. A similar issue also happens if we move a regular file - after the log tree is replayed the file has a link in both the old and new parent directories, when it should be only at the new directory. Sample reproducer: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/x $ mkdir /mnt/y $ touch /mnt/x/foo $ mkdir /mnt/y/z $ sync $ ln /mnt/x/foo /mnt/x/bar $ mv /mnt/y/z /mnt/x/z < power fail > $ mount /dev/sdc /mnt $ ls -1Ri /mnt /mnt: 257 x 258 y /mnt/x: 259 bar 259 foo 260 z /mnt/x/z: /mnt/y: 260 z /mnt/y/z: $ umount /dev/sdc $ btrfs check /dev/sdc Checking filesystem on /dev/sdc UUID: a67e2c4a-a4b4-4fdc-b015-9d9af1e344be checking extents checking free space cache checking fs roots root 5 inode 260 errors 2000, link count wrong unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0 unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0 (...) Attempting to remove the directory becomes impossible: $ mount /dev/sdc /mnt $ rmdir /mnt/y/z $ ls -lh /mnt/y ls: cannot access /mnt/y/z: No such file or directory total 0 d????????? ? ? ? ? ? z $ rmdir /mnt/x/z rmdir: failed to remove ‘/mnt/x/z’: Stale file handle $ ls -lh /mnt/x ls: cannot access /mnt/x/z: Stale file handle total 0 -rw-r--r-- 2 root root 0 Apr 6 18:06 bar -rw-r--r-- 2 root root 0 Apr 6 18:06 foo d????????? ? ? ? ? ? z So make sure that on rename we set the last_unlink_trans value for our inode, even if it's a directory, to the value of the current transaction's ID and that if the new parent directory is logged that we fallback to a transaction commit. A test case for fstests is being submitted as well. Signed-off-by: Filipe Manana <fdmanana@suse.com>
| * | Merge branch 'foreign/jeffm/uapi' into for-chris-4.7-20160516David Sterba2016-05-162-1059/+1
| |\ \ | | | | | | | | | | | | | | | | # Conflicts: # include/uapi/linux/btrfs.h
| | * | btrfs: uapi/linux/btrfs_tree.h migration, item types and definesJeff Mahoney2016-04-281-948/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The BTRFS_IOC_SEARCH_TREE ioctl returns file system items directly to userspace. In order to decode them, full type information is required. Create a new header, btrfs_tree to contain these since most users won't need them. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: uapi/linux/btrfs.h migration, move struct btrfs_ioctl_defrag_range_argsJeff Mahoney2016-04-281-31/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct btrfs_ioctl_defrag_range_args is used by the BTRFS_IOC_DEFRAG_RANGE ioctl. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: uapi/linux/btrfs.h migration, move balance flagsJeff Mahoney2016-04-281-46/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The BTRFS_BALANCE_* flags are used by struct btrfs_ioctl_balance_args.flags and btrfs_ioctl_balance_args.{data,meta,sys}.flags in the BTRFS_IOC_BALANCE ioctl. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: uapi/linux/btrfs.h migration, move feature flagsJeff Mahoney2016-04-281-25/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The compat/compat_ro/incompat feature flags are used by the feature set/get ioctls. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: uapi/linux/btrfs.h migration, qgroup limit flagsJeff Mahoney2016-04-281-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The BTRFS_QGROUP_LIMIT_* flags are required to tell the kernel which fields are valid when using the BTRFS_IOC_QGROUP_LIMIT ioctl. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZEJeff Mahoney2016-04-281-1/+0
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | BTRFS_LABEL_SIZE is required to define the BTRFS_IOC_GET_FSLABEL and BTRFS_IOC_SET_FSLABEL ioctls. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Merge branch 'foreign/anand/dev-del-by-id-ext' into for-chris-4.7-20160516David Sterba2016-05-165-241/+310
| |\ \
| | * | btrfs: cleanup assigning next active device with a checkAnand Jain2016-05-043-22/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Creates helper fucntion as needed by the device delete and replace operations. Also now it checks if the next device being assigned is an active device. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: s_bdev is not null after missing replaceAnand Jain2016-05-042-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Yauhen reported in the ML that s_bdev is null at mount, and s_bdev gets updated to some device when missing device is replaced, as because bdev is null for missing device, things gets matched up. Fix this by checking if s_bdev is set. I didn't want to completely remove updating s_bdev because the future multi device support at vfs layer may need it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: refactor btrfs_dev_replace_start for reuseAnand Jain2016-04-283-23/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A refactor patch, and avoids user input verification in the btrfs_dev_replace_start(), and so this function can be reused. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: use fs_info directlyAnand Jain2016-04-281-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Local variable fs_info, contains root->fs_info, use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: rename flags for vol args v2David Sterba2016-04-281-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename BTRFS_DEVICE_BY_ID so it's more descriptive that we specify the device by id, it'll be part of the public API. The mask of supported flags is also renamed, only for internal use. The error code for unknown flags is EOPNOTSUPP, fixed. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: rename btrfs_find_device_by_user_inputDavid Sterba2016-04-283-10/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For clarity how we are going to find the device, let's call it a device specifier, devspec for short. Also rename the arguments that are a leftover from previous function purpose. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: use existing device constraints table btrfs_raid_arrayDavid Sterba2016-04-281-14/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should avoid duplicating the device constraints, let's use the btrfs_raid_array in btrfs_check_raid_min_devices. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: introduce raid-type to error-code table, for minimum device constraintDavid Sterba2016-04-282-1/+16
| | | | | | | | | | | | | | | | | | | | Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: pass number of devices to btrfs_check_raid_min_devicesDavid Sterba2016-04-281-15/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before this patch, btrfs_check_raid_min_devices would do an off-by-one check of the constraints and not the miminmum check, as its name suggests. This is not a problem if the only caller is device remove, but would be confusing for others. Add an argument with the exact number and let the caller(s) decide if this needs any adjustments, like when device replace is running. Reviewed-by: Anand Jain <anand.jain@oracle.com> Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: rename __check_raid_min_devicesDavid Sterba2016-04-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Underscores are for special functions, use the full prefix for better stacktrace recognition. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: optimize check for stale deviceAnand Jain2016-04-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimize check for stale device to only be checked when there is device added or changed. If there is no update to the device, there is no need to call btrfs_free_stale_device(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: introduce device delete by devidAnand Jain2016-04-283-4/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct btrfs_ioctl_vol_args_v2 to carry devid as an user argument. The patch won't delete the old ioctl interface and so kernel remains backward compatible with user land progs. Test case/script: echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk mount /dev/sdd /btrfs dmsetup suspend bad_disk echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk dmsetup resume bad_disk echo "bad disk failed. now deleting/replacing" btrfs dev del 3 /btrfs echo $? btrfs fi show /btrfs umount /btrfs btrfs-show-super /dev/sdd | egrep num_device dmsetup remove bad_disk wipefs -a /dev/sdf Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Martin <m_btrfs@ml1.co.uk> [ adjust messages, s/disk/device/ ] Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: make use of btrfs_scratch_superblocks() in btrfs_rm_device()Anand Jain2016-04-281-64/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the previous patches now the btrfs_scratch_superblocks() is ready to be used in btrfs_rm_device() so use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ use GFP_KERNEL ] Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: enhance btrfs_find_device_by_user_input() to check device pathAnand Jain2016-04-282-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The operation of device replace and device delete follows same steps upto some depth with in btrfs kernel, however they don't share codes. This enhancement will help replace and delete to share codes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: make use of btrfs_find_device_by_user_input()Anand Jain2016-04-281-63/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_rm_device() has a section of the code which can be replaced btrfs_find_device_by_user_input() Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: create helper btrfs_find_device_by_user_input()Anand Jain2016-04-283-23/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch renames btrfs_dev_replace_find_srcdev() to btrfs_find_device_by_user_input() and moves it to volumes.c, so that delete device can use it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: clean up and optimize __check_raid_min_device()Anand Jain2016-04-281-24/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __check_raid_min_device() which was pealed from btrfs_rm_device() maintianed its original code to show the block move. This patch cleans up __check_raid_min_device(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: create helper function __check_raid_min_devices()Anand Jain2016-04-281-19/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | move a section of btrfs_rm_device() code to check for min number of the devices into the function __check_raid_min_devices() Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: create a helper function to read the disk superAnand Jain2016-04-281-35/+52
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A part of code from btrfs_scan_one_device() is moved to a new function btrfs_read_disk_super(), so that former function looks cleaner. (In this process it also moves the code which ensures null terminating label). So this creates easy opportunity to merge various duplicate codes on read disk super. Earlier attempt to merge duplicate codes highlighted that there were some issues for which there are duplicate codes (to read disk super), however it was not clear what was the issue. So until we figure that out, its better to keep them in a separate functions. Signed-off-by: Anand Jain <anand.jain@oracle.com> [ use GFP_KERNEL, PAGE_CACHE_ removal related fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | Merge branch 'cleanups-4.7' into for-chris-4.7-20160516David Sterba2016-05-1615-190/+176
| |\ \
| | * | btrfs: GFP_NOFS does not GFP_HIGHMEMDavid Sterba2016-05-103-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | Masking HIGHMEM out of NOFS does not make sense. Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: switch to common message helpers in open_ctree, adjust messagesDavid Sterba2016-05-101-52/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we lack the identification of the filesystem in most if not all mount messages, done via printk/pr_* functions. We can use the btrfs_* helpers in open_ctree, as the fs_info <-> sb link is established at the beginning of the function. The messages have been updated at the same time to be more consistent: * dropped sb->s_id, as it's not available via btrfs_* * added %d for return code where appropriate * wording changed * %Lx replaced by %llx Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: ioctl: reorder exclusive op check in RM_DEVDavid Sterba2016-05-061-12/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the op exclusivity check before the other code (same as in ADD_DEV). Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: kill unused writepage_io_hook callbackDavid Sterba2016-05-062-24/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems to be long time unused, since 2008 and 6885f308b5570 ("Btrfs: Misc 2.6.25 updates"). Propagating the removal touches some code but has no functional effect. Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: pass the right error code to the btrfs_std_errorAnand Jain2016-05-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Also drop the newline from the message. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| | * | btrfs: fix typos in commentsLuis de Bethencourt2016-04-281-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Correct a typo in the chunk_mutex name to make it grepable. Since it is better to fix several typos at once, fixing the 2 more in the same file. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: David Sterba <dsterba@suse.com>