summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* simple local filesystems: switch to ->iterate_shared()Al Viro2016-05-026-6/+6
| | | | | | | no changes needed (XFS isn't simple, but it has the same parallelism in the interesting parts exercised from CXFS). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* dcache_{readdir,dir_lseek}() users: switch to ->iterate_sharedAl Viro2016-05-022-6/+3
| | | | | | | no need to lock directory in dcache_dir_lseek(), while we are at it - per-struct file exclusion is enough. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* cifs: switch to ->iterate_shared()Al Viro2016-05-022-27/+30
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fuse: switch to ->iterate_shared()Al Viro2016-05-021-49/+45
| | | | | | | Switch dcache pre-seeding on readdir to d_alloc_parallel(); nothing else is needed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* switch all procfs directories ->iterate_shared()Al Viro2016-05-027-20/+21
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* proc_sys_fill_cache(): switch to d_alloc_parallel()Al Viro2016-05-021-7/+8
| | | | | | make it usable with directory locked shared Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* proc_fill_cache(): switch to d_alloc_parallel()Al Viro2016-05-021-5/+10
| | | | | | ... making it usable with directory locked shared Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* introduce a parallel variant of ->iterate()Al Viro2016-05-023-11/+29
| | | | | | | | New method: ->iterate_shared(). Same arguments as in ->iterate(), called with the directory locked only shared. Once all filesystems switch, the old one will be gone. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* give readdir(2)/getdents(2)/etc. uniform exclusion with lseek()Al Viro2016-05-025-25/+18
| | | | | | same as read() on regular files has, and for the same reason. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* parallel lookups: actual switch to rwsemAl Viro2016-05-029-26/+34
| | | | | | | | | | | | ta-da! The main issue is the lack of down_write_killable(), so the places like readdir.c switched to plain inode_lock(); once killable variants of rwsem primitives appear, that'll be dealt with. lockdep side also might need more work Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* parallel lookups machinery, part 4 (and last)Al Viro2016-05-022-21/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | If we *do* run into an in-lookup match, we need to wait for it to cease being in-lookup. Fortunately, we do have unused space in in-lookup dentries - d_lru is never looked at until it stops being in-lookup. So we can stash a pointer to wait_queue_head from stack frame of the caller of ->lookup(). Some precautions are needed while waiting, but it's not that hard - we do hold a reference to dentry we are waiting for, so it can't go away. If it's found to be in-lookup the wait_queue_head is still alive and will remain so at least while ->d_lock is held. Moreover, the condition we are waiting for becomes true at the same point where everything on that wq gets woken up, so we can just add ourselves to the queue once. d_alloc_parallel() gets a pointer to wait_queue_head_t from its caller; lookup_slow() adjusted, d_add_ci() taught to use d_alloc_parallel() if the dentry passed to it happens to be in-lookup one (i.e. if it's been called from the parallel lookup). That's pretty much it - all that remains is to switch ->i_mutex to rwsem and have lookup_slow() take it shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* parallel lookups machinery, part 3Al Viro2016-05-022-25/+123
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We will need to be able to check if there is an in-lookup dentry with matching parent/name. Right now it's impossible, but as soon as start locking directories shared such beasts will appear. Add a secondary hash for locating those. Hash chains go through the same space where d_alias will be once it's not in-lookup anymore. Search is done under the same bitlock we use for modifications - with the primary hash we can rely on d_rehash() into the wrong chain being the worst that could happen, but here the pointers are buggered once it's removed from the chain. On the other hand, the chains are not going to be long and normally we'll end up adding to the chain anyway. That allows us to avoid bothering with ->d_lock when doing the comparisons - everything is stable until removed from chain. New helper: d_alloc_parallel(). Right now it allocates, verifies that no hashed and in-lookup matches exist and adds to in-lookup hash. Returns ERR_PTR() for error, hashed match (in the unlikely case it's been found) or new dentry. In-lookup matches trigger BUG() for now; that will change in the next commit when we introduce waiting for ongoing lookup to finish. Note that in-lookup matches won't be possible until we actually go for shared locking. lookup_slow() switched to use of d_alloc_parallel(). Again, these commits are separated only for making it easier to review. All this machinery will start doing something useful only when we go for shared locking; it's just that the combination is too large for my taste. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* parallel lookups machinery, part 2Al Viro2016-05-022-2/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We'll need to verify that there's neither a hashed nor in-lookup dentry with desired parent/name before adding to in-lookup set. One possible solution would be to hold the parent's ->d_lock through both checks, but while the in-lookup set is relatively small at any time, dcache is not. And holding the parent's ->d_lock through something like __d_lookup_rcu() would suck too badly. So we leave the parent's ->d_lock alone, which means that we watch out for the following scenario: * we verify that there's no hashed match * existing in-lookup match gets hashed by another process * we verify that there's no in-lookup matches and decide that everything's fine. Solution: per-directory kinda-sorta seqlock, bumped around the times we hash something that used to be in-lookup or move (and hash) something in place of in-lookup. Then the above would turn into * read the counter * do dcache lookup * if no matches found, check for in-lookup matches * if there had been none of those either, check if the counter has changed; repeat if it has. The "kinda-sorta" part is due to the fact that we don't have much spare space in inode. There is a spare word (shared with i_bdev/i_cdev/i_pipe), so the counter part is not a problem, but spinlock is a different story. We could use the parent's ->d_lock, and it would be less painful in terms of contention, for __d_add() it would be rather inconvenient to grab; we could do that (using lock_parent()), but... Fortunately, we can get serialization on the counter itself, and it might be a good idea in general; we can use cmpxchg() in a loop to get from even to odd and smp_store_release() from odd to even. This commit adds the counter and updating logics; the readers will be added in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* beginning of transition to parallel lookups - marking in-lookup dentriesAl Viro2016-05-022-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | marked as such when (would be) parallel lookup is about to pass them to actual ->lookup(); unmarked when * __d_add() is about to make it hashed, positive or not. * __d_move() (from d_splice_alias(), directly or via __d_unalias()) puts a preexisting dentry in its place * in caller of ->lookup() if it has escaped all of the above. Bug (WARN_ON, actually) if it reaches the final dput() or d_instantiate() while still marked such. As the result, we are guaranteed that for as long as the flag is set, dentry will * remain negative unhashed with positive refcount * never have its ->d_alias looked at * never have its ->d_lru looked at * never have its ->d_parent and ->d_name changed Right now we have at most one such for any given parent directory. With parallel lookups that restriction will weaken to * only exist when parent is locked shared * at most one with given (parent,name) pair (comparison of names is according to ->d_compare()) * only exist when there's no hashed dentry with the same (parent,name) Transition will take the next several commits; unfortunately, we'll only be able to switch to rwsem at the end of this series. The reason for not making it a single patch is to simplify review. New primitives: d_in_lookup() (a predicate checking if dentry is in the in-lookup state) and d_lookup_done() (tells the system that we are done with lookup and if it's still marked as in-lookup, it should cease to be such). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* __d_add(): don't drop/regain ->d_lockAl Viro2016-05-021-3/+11
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* lookup_slow(): bugger off on IS_DEADDIR() from the very beginningAl Viro2016-05-021-6/+17
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* nfs: missing wakeup in nfs_unblock_sillyrename()Al Viro2016-05-021-0/+1
| | | | | | will be needed as soon as lookups are not serialized by ->i_mutex Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* make ext2_get_page() and friends work without external serializationAl Viro2016-05-025-35/+35
| | | | | | | | | | | | | | | | | Right now ext2_get_page() (and its analogues in a bunch of other filesystems) relies upon the directory being locked - the way it sets and tests Checked and Error bits would be racy without that. Switch to a slightly different scheme, _not_ setting Checked in case of failure. That way the logics becomes if Checked => OK else if Error => fail else if !validate => fail else => OK with validation setting Checked or Error on success and failure resp. and returning which one had happened. Equivalent to the current logics, but unlike the current logics not sensitive to the order of set_bit, test_bit getting reordered by CPU, etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ovl_lookup_real(): use lookup_one_len_unlocked()Al Viro2016-05-021-3/+1
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* reconnect_one(): use lookup_one_len_unlocked()Al Viro2016-05-021-3/+7
| | | | | | | ... and explain the non-obvious logics in case when lookup yields a different dentry. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()Al Viro2016-05-021-1/+5
| | | | | | ... and have it use inode_lock() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* orangefs: don't open-code inode_lock/inode_unlockAl Viro2016-05-022-4/+4
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ocfs2: don't open-code inode_lock/inode_unlockAl Viro2016-05-021-2/+2
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* configfs_detach_prep(): make sure that wait_mutex won't go awayAl Viro2016-05-021-8/+9
| | | | | | | | grab a reference to dentry we'd got the sucker from, and return that dentry via *wait, rather than just returning the address of ->i_mutex. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* kernfs: use lookup_one_len_unlocked()Al Viro2016-05-021-3/+2
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* security_d_instantiate(): move to the point prior to attaching dentry to inodeAl Viro2016-05-021-8/+7
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge getxattr prototype change into work.lookupsAl Viro2016-05-0293-426/+364
|\ | | | | | | The rest of work.xattr stuff isn't needed for this branch
| * ->getxattr(): pass dentry and inode as separate argumentsAl Viro2016-04-1124-66/+69
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * xattr_handler: pass dentry and inode as separate arguments of ->get()Al Viro2016-04-1029-110/+108
| | | | | | | | | | | | ... and do not assume they are already attached to each other Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * reiserfs: switch to generic_{get,set,remove}xattr()Al Viro2016-04-107-98/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | reiserfs_xattr_[sg]et() will fail with -EOPNOTSUPP for V1 inodes anyway, and all reiserfs instances of ->[sg]et() call it and so does ->set_acl(). Checks for name length in the instances had been bogus; they should've been "bugger off if it's _exactly_ the prefix" (as generic would do on its own) and not "bugger off if it's shorter than the prefix" - that can't happen. xattr_full_name() is needed to adjust for the fact that generic instances will skip the prefix in the name passed to ->[gs]et(); reiserfs homegrown analogues didn't. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * cifs: kill more bogus checks in ->...xattr() methodsAl Viro2016-04-101-36/+6
| | | | | | | | | | | | | | none of that stuff can ever be called for NULL or negative dentry. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * don't bother with ->d_inode->i_sb - it's always equal to ->d_sbAl Viro2016-04-1025-42/+33
| | | | | | | | | | | | ... and neither can ever be NULL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * posix_acl: Unexport acl_by_type and make it staticAndreas Gruenbacher2016-03-311-2/+1
| | | | | | | | | | | | | | | | | | | | acl_by_type(inode, type) returns a pointer to either inode->i_acl or inode->i_default_acl depending on type. This is useful in fs/posix_acl.c, but should never have been visible outside that file. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * posix_acl: Inode acl caching fixesAndreas Gruenbacher2016-03-3116-78/+126
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When get_acl() is called for an inode whose ACL is not cached yet, the get_acl inode operation is called to fetch the ACL from the filesystem. The inode operation is responsible for updating the cached acl with set_cached_acl(). This is done without locking at the VFS level, so another task can call set_cached_acl() or forget_cached_acl() before the get_acl inode operation gets to calling set_cached_acl(), and then get_acl's call to set_cached_acl() results in caching an outdate ACL. Prevent this from happening by setting the cached ACL pointer to a task-specific sentinel value before calling the get_acl inode operation. Move the responsibility for updating the cached ACL from the get_acl inode operations to get_acl(). There, only set the cached ACL if the sentinel value hasn't changed. The sentinel values are chosen to have odd values. Likewise, the value of ACL_NOT_CACHED is odd. In contrast, ACL object pointers always have an even value (ACLs are aligned in memory). This allows to distinguish uncached ACLs values from ACL objects. In addition, switch from guarding inode->i_acl and inode->i_default_acl upates by the inode->i_lock spinlock to using xchg() and cmpxchg(). Filesystems that do not want ACLs returned from their get_acl inode operations to be cached must call forget_cached_acl() to prevent the VFS from doing so. (Patch written by Al Viro and Andreas Gruenbacher.) Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * jfs: Remove unnecessary code in jfs_get_aclAndreas Gruenbacher2016-03-281-4/+0
| | | | | | | | | | | | | | | | | | The get_acl inode operation is called only when no ACL is cached. It makes no sense to check for a cached ACL as the first thing inside such inode operations. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * reiserfs_cache_default_acl(): use get_acl()Al Viro2016-03-281-1/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Revert "ext4: allow readdir()'s of large empty directories to be interrupted"Linus Torvalds2016-04-102-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 1028b55bafb7611dda1d8fed2aeca16a436b7dff. It's broken: it makes ext4 return an error at an invalid point, causing the readdir wrappers to write the the position of the last successful directory entry into the position field, which means that the next readdir will now return that last successful entry _again_. You can only return fatal errors (that terminate the readdir directory walk) from within the filesystem readdir functions, the "normal" errors (that happen when the readdir buffer fills up, for example) happen in the iterorator where we know the position of the actual failing entry. I do have a very different patch that does the "signal_pending()" handling inside the iterator function where it is allowable, but while that one passes all the sanity checks, I screwed up something like four times while emailing it out, so I'm not going to commit it today. So my track record is not good enough, and the stars will have to align better before that one gets committed. And it would be good to get some review too, of course, since celestial alignments are always an iffy debugging model. IOW, let's just revert the commit that caused the problem for now. Reported-by: Greg Thelen <gthelen@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus-4.6' of ↵Linus Torvalds2016-04-098-32/+215
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are bug fixes, including a really old fsync bug, and a few trace points to help us track down problems in the quota code" * 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix file/data loss caused by fsync after rename and new inode btrfs: Reset IO error counters before start of device replacing btrfs: Add qgroup tracing Btrfs: don't use src fd for printk btrfs: fallback to vmalloc in btrfs_compare_tree btrfs: handle non-fatal errors in btrfs_qgroup_inherit() btrfs: Output more info for enospc_debug mount option Btrfs: fix invalid reference in replace_path Btrfs: Improve FL_KEEP_SIZE handling in fallocate
| * | Btrfs: fix file/data loss caused by fsync after rename and new inodeFilipe Manana2016-04-061-0/+137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we rename an inode A (be it a file or a directory), create a new inode B with the old name of inode A and under the same parent directory, fsync inode B and then power fail, at log tree replay time we end up removing inode A completely. If inode A is a directory then all its files are gone too. Example scenarios where this happens: This is reproducible with the following steps, taken from a couple of test cases written for fstests which are going to be submitted upstream soon: # Scenario 1 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir -p /mnt/a/x echo "hello" > /mnt/a/x/foo echo "world" > /mnt/a/x/bar sync mv /mnt/a/x /mnt/a/y mkdir /mnt/a/x xfs_io -c fsync /mnt/a/x <power failure happens> The next time the fs is mounted, log tree replay happens and the directory "y" does not exist nor do the files "foo" and "bar" exist anywhere (neither in "y" nor in "x", nor the root nor anywhere). # Scenario 2 mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/a echo "hello" > /mnt/a/foo sync mv /mnt/a/foo /mnt/a/bar echo "world" > /mnt/a/foo xfs_io -c fsync /mnt/a/foo <power failure happens> The next time the fs is mounted, log tree replay happens and the file "bar" does not exists anymore. A file with the name "foo" exists and it matches the second file we created. Another related problem that does not involve file/data loss is when a new inode is created with the name of a deleted snapshot and we fsync it: mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt mkdir /mnt/testdir btrfs subvolume snapshot /mnt /mnt/testdir/snap btrfs subvolume delete /mnt/testdir/snap rmdir /mnt/testdir mkdir /mnt/testdir xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir <power failure> The next time the fs is mounted the log replay procedure fails because it attempts to delete the snapshot entry (which has dir item key type of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry, resulting in the following error that causes mount to fail: [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257 [52174.512570] ------------[ cut here ]------------ [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]() [52174.514681] BTRFS: Transaction aborted (error -2) [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G W 4.5.0-rc6-btrfs-next-27+ #1 [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [52174.524053] 0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758 [52174.524053] 0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd [52174.524053] 00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88 [52174.524053] Call Trace: [52174.524053] [<ffffffff81264e93>] dump_stack+0x67/0x90 [52174.524053] [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2 [52174.524053] [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50 [52174.524053] [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs] [52174.524053] [<ffffffff8118f5e9>] ? iput+0xb0/0x284 [52174.524053] [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs] [52174.524053] [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs] [52174.524053] [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs] [52174.524053] [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs] [52174.524053] [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs] [52174.524053] [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs] [52174.524053] [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs] [52174.524053] [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs] [52174.524053] [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs] [52174.524053] [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf [52174.524053] [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3 [52174.524053] [<ffffffff8117bafa>] mount_fs+0x67/0x131 [52174.524053] [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde [52174.524053] [<ffffffff8119590f>] do_mount+0x8a6/0x9e8 [52174.524053] [<ffffffff811358dd>] ? strndup_user+0x3f/0x59 [52174.524053] [<ffffffff81195c65>] SyS_mount+0x77/0x9f [52174.524053] [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]--- Fix this by forcing a transaction commit when such cases happen. This means we check in the commit root of the subvolume tree if there was any other inode with the same reference when the inode we are fsync'ing is a new inode (created in the current transaction). Test cases for fstests, covering all the scenarios given above, were submitted upstream for fstests: * fstests: generic test for fsync after renaming directory https://patchwork.kernel.org/patch/8694281/ * fstests: generic test for fsync after renaming file https://patchwork.kernel.org/patch/8694301/ * fstests: add btrfs test for fsync after snapshot deletion https://patchwork.kernel.org/patch/8670671/ Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | btrfs: Reset IO error counters before start of device replacingYauhen Kharuzhy2016-04-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If device replace entry was found on disk at mounting and its num_write_errors stats counter has non-NULL value, then replace operation will never be finished and -EIO error will be reported by btrfs_scrub_dev() because this counter is never reset. # mount -o degraded /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/ # btrfs replace status /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/ Started on 25.Mar 07:28:00, canceled on 25.Mar 07:28:01 at 0.0%, 40 write errs, 0 uncorr. read errs # btrfs replace start -B 4 /dev/sdg /media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/ ERROR: ioctl(DEV_REPLACE_START) failed on "/media/a4fb5c0a-21c5-4fe7-8d0e-fdd87d5f71ee/": Input/output error, no error Reset num_write_errors and num_uncorrectable_read_errors counters in the dev_replace structure before start of replacing. Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: Add qgroup tracingMark Fasheh2016-04-041-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds tracepoints to the qgroup code on both the reporting side (insert_dirty_extents) and the accounting side. Taken together it allows us to see what qgroup operations have happened, and what their result was. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: don't use src fd for printkJosef Bacik2016-04-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | The fd we pass in may not be on a btrfs file system, so don't try to do BTRFS_I() on it. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fallback to vmalloc in btrfs_compare_treeDavid Sterba2016-04-041-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The allocation of node could fail if the memory is too fragmented for a given node size, practically observed with 64k. http://article.gmane.org/gmane.comp.file-systems.btrfs/54689 Reported-and-tested-by: Jean-Denis Girard <jd.girard@sysnux.pf> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: handle non-fatal errors in btrfs_qgroup_inherit()Mark Fasheh2016-04-041-22/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | create_pending_snapshot() will go readonly on _any_ error return from btrfs_qgroup_inherit(). If qgroups are enabled, a user can crash their fs by just making a snapshot and asking it to inherit from an invalid qgroup. For example: $ btrfs sub snap -i 1/10 /btrfs/ /btrfs/foo Will cause a transaction abort. Fix this by only throwing errors in btrfs_qgroup_inherit() when we know going readonly is acceptable. The following xfstests test case reproduces this bug: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" here=`pwd` tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # remove previous $seqres.full before test rm -f $seqres.full # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch rm -f $seqres.full _scratch_mkfs _scratch_mount _run_btrfs_util_prog quota enable $SCRATCH_MNT # The qgroup '1/10' does not exist and should be silently ignored _run_btrfs_util_prog subvolume snapshot -i 1/10 $SCRATCH_MNT $SCRATCH_MNT/snap1 _scratch_unmount echo "Silence is golden" status=0 exit Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: Output more info for enospc_debug mount optionQu Wenruo2016-04-041-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | As one user in mail list report reproducible balance ENOSPC error, it's better to add more debug info for enospc_debug mount option. Reported-by: Marc Haber <mh+linux-btrfs@zugschlus.de> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fix invalid reference in replace_pathLiu Bo2016-04-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dan Carpenter's static checker has found this error, it's introduced by commit 64c043de466d ("Btrfs: fix up read_tree_block to return proper error") It's really supposed to 'break' the loop on error like others. Cc: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: Improve FL_KEEP_SIZE handling in fallocateDavide Italiano2016-04-041-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - We call inode_size_ok() only if FL_KEEP_SIZE isn't specified. - As an optimisation we can skip the call if (off + len) isn't greater than the current size of the file. This operation is called under the lock so the less work we do, the better. - If we call inode_size_ok() pass to it the correct value rather than a more conservative estimation. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | Merge tag 'for-linus-4.6-ofs1' of ↵Linus Torvalds2016-04-096-53/+28
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs fixes from Mike Marshall: "Orangefs cleanups and a strncpy vulnerability fix. Cleanups: - remove an unused variable from orangefs_readdir. - clean up printk wrapper used for ofs "gossip" debugging. - clean up truncate ctime and mtime setting in inode.c - remove a useless null check found by coccinelle. - optimize some memcpy/memset boilerplate code. - remove some useless sanity checks from xattr.c Fix: - fix a potential strncpy vulnerability" * tag 'for-linus-4.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: remove unused variable orangefs: Add KERN_<LEVEL> to gossip_<level> macros orangefs: strncpy -> strscpy orangefs: clean up truncate ctime and mtime setting Orangefs: fix ifnullfree.cocci warnings Orangefs: optimize boilerplate code. Orangefs: xattr.c cleanup
| * | | orangefs: remove unused variableMartin Brandenburg2016-04-081-3/+1
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
| * | | orangefs: Add KERN_<LEVEL> to gossip_<level> macrosJoe Perches2016-04-081-14/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Emit the logging messages at the appropriate levels. Miscellanea: o Change format to fmt o Use the more common ##__VA_ARGS__ Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>