summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* fs: Give dentry to inode_change_ok() instead of inodeJan Kara2016-09-2247-63/+62
| | | | | | | | | | | inode_change_ok() will be resposible for clearing capabilities and IMA extended attributes and as such will need dentry. Give it as an argument to inode_change_ok() instead of an inode. Also rename inode_change_ok() to setattr_prepare() to better relect that it does also some modifications in addition to checks. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* fuse: Propagate dentry down to inode_change_ok()Jan Kara2016-09-223-5/+6
| | | | | | | | | | To avoid clearing of capabilities or security related extended attributes too early, inode_change_ok() will need to take dentry instead of inode. Propagate it down to fuse_do_setattr(). Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* ceph: Propagate dentry down to inode_change_ok()Jan Kara2016-09-222-8/+16
| | | | | | | | | | | | | | To avoid clearing of capabilities or security related extended attributes too early, inode_change_ok() will need to take dentry instead of inode. ceph_setattr() has the dentry easily available but __ceph_setattr() is also called from ceph_set_acl() where dentry is not easily available. Luckily that call path does not need inode_change_ok() to be called anyway. So reorganize functions a bit so that inode_change_ok() is called only from paths where dentry is available. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
* xfs: Propagate dentry down to inode_change_ok()Jan Kara2016-09-225-35/+68
| | | | | | | | | | | | | | To avoid clearing of capabilities or security related extended attributes too early, inode_change_ok() will need to take dentry instead of inode. Propagate dentry down to functions calling inode_change_ok(). This is rather straightforward except for xfs_set_mode() function which does not have dentry easily available. Luckily that function does not call inode_change_ok() anyway so we just have to do a little dance with function prototypes. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* posix_acl: Clear SGID bit when setting file permissionsJan Kara2016-09-2215-102/+88
| | | | | | | | | | | | | | When file permissions are modified via chmod(2) and the user is not in the owning group or capable of CAP_FSETID, the setgid bit is cleared in inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file permissions as well as the new ACL, but doesn't clear the setgid bit in a similar way; this allows to bypass the check in chmod(2). Fix that. References: CVE-2016-7097 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
* udf: don't bother with full-page write optimisations in adinicb caseAl Viro2016-09-191-1/+15
| | | | | | | ... it would get converted to regular if such had been attempted Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz>
* reiserfs: Unlock superblock before calling reiserfs_quota_on_mount()Mike Galbraith2016-09-161-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we hold the superblock lock while calling reiserfs_quota_on_mount(), we can deadlock our own worker - mount blocks kworker/3:2, sleeps forever more. crash> ps|grep UN 715 2 3 ffff880220734d30 UN 0.0 0 0 [kworker/3:2] 9369 9341 2 ffff88021ffb7560 UN 1.3 493404 123184 Xorg 9665 9664 3 ffff880225b92ab0 UN 0.0 47368 812 udisks-daemon 10635 10403 3 ffff880222f22c70 UN 0.0 14904 936 mount crash> bt ffff880220734d30 PID: 715 TASK: ffff880220734d30 CPU: 3 COMMAND: "kworker/3:2" #0 [ffff8802244c3c20] schedule at ffffffff8144584b #1 [ffff8802244c3cc8] __rt_mutex_slowlock at ffffffff814472b3 #2 [ffff8802244c3d28] rt_mutex_slowlock at ffffffff814473f5 #3 [ffff8802244c3dc8] reiserfs_write_lock at ffffffffa05f28fd [reiserfs] #4 [ffff8802244c3de8] flush_async_commits at ffffffffa05ec91d [reiserfs] #5 [ffff8802244c3e08] process_one_work at ffffffff81073726 #6 [ffff8802244c3e68] worker_thread at ffffffff81073eba #7 [ffff8802244c3ec8] kthread at ffffffff810782e0 #8 [ffff8802244c3f48] kernel_thread_helper at ffffffff81450064 crash> rd ffff8802244c3cc8 10 ffff8802244c3cc8: ffffffff814472b3 ffff880222f23250 .rD.....P2.".... ffff8802244c3cd8: 0000000000000000 0000000000000286 ................ ffff8802244c3ce8: ffff8802244c3d30 ffff880220734d80 0=L$.....Ms .... ffff8802244c3cf8: ffff880222e8f628 0000000000000000 (.."............ ffff8802244c3d08: 0000000000000000 0000000000000002 ................ crash> struct rt_mutex ffff880222e8f628 struct rt_mutex { wait_lock = { raw_lock = { slock = 65537 } }, wait_list = { node_list = { next = 0xffff8802244c3d48, prev = 0xffff8802244c3d48 } }, owner = 0xffff880222f22c71, save_state = 0 } crash> bt 0xffff880222f22c70 PID: 10635 TASK: ffff880222f22c70 CPU: 3 COMMAND: "mount" #0 [ffff8802216a9868] schedule at ffffffff8144584b #1 [ffff8802216a9910] schedule_timeout at ffffffff81446865 #2 [ffff8802216a99a0] wait_for_common at ffffffff81445f74 #3 [ffff8802216a9a30] flush_work at ffffffff810712d3 #4 [ffff8802216a9ab0] schedule_on_each_cpu at ffffffff81074463 #5 [ffff8802216a9ae0] invalidate_bdev at ffffffff81178aba #6 [ffff8802216a9af0] vfs_load_quota_inode at ffffffff811a3632 #7 [ffff8802216a9b50] dquot_quota_on_mount at ffffffff811a375c #8 [ffff8802216a9b80] finish_unfinished at ffffffffa05dd8b0 [reiserfs] #9 [ffff8802216a9cc0] reiserfs_fill_super at ffffffffa05de825 [reiserfs] RIP: 00007f7b9303997a RSP: 00007ffff443c7a8 RFLAGS: 00010202 RAX: 00000000000000a5 RBX: ffffffff8144ef12 RCX: 00007f7b932e9ee0 RDX: 00007f7b93d9a400 RSI: 00007f7b93d9a3e0 RDI: 00007f7b93d9a3c0 RBP: 00007f7b93d9a2c0 R8: 00007f7b93d9a550 R9: 0000000000000001 R10: ffffffffc0ed040e R11: 0000000000000202 R12: 000000000000040e R13: 0000000000000000 R14: 00000000c0ed040e R15: 00007ffff443ca20 ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mike Galbraith <mgalbraith@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>
* udf: Remove useless check in udf_adinicb_write_begin()Jan Kara2016-09-061-1/+1
| | | | | | | | | As Al properly points out, len is guaranteed to be smaller than PAGE_SIZE when we reach udf_adinicb_write_begin() as otherwise we would have converted the file to the normal format. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz>
* quota: fill in Q_XGETQSTAT inode information for inactive quotasEric Sandeen2016-08-151-6/+12
| | | | | | | | | | | | | | | | | | | The manpage for quotactl says that the Q_XGETQSTAT command is "useful in finding out how much space is spent to store quota information," but the current implementation does not report this info if the inode is allocated, but its quota type is not enabled. This is a change from the earlier XFS implementation, which reported information about allocated quota inodes even if their quota type was not currently active. Change quota_getstate() and quota_getstatev() to copy out the inode information if the filesystem has provided it, even if the quota type for that inode is not currently active. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
* ext2: Check return value from ext2_get_group_desc()Jan Kara2016-08-091-0/+5
| | | | | | | | | ext2_get_group_desc() can return NULL if there is some error. This usually means there is some programming error in the ext2 driver itself but let's be defensive and handle that case. Coverity-id: 115628 Signed-off-by: Jan Kara <jack@suse.cz>
* block: rename bio bi_rw to bi_opfJens Axboe2016-08-074-12/+12
| | | | | | | | | | | | | Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
* block/mm: make bdev_ops->rw_page() take a bool for read/writeJens Axboe2016-08-072-5/+3
| | | | | | | | | | | | | | Commit abf545484d31 changed it from an 'rw' flags type to the newer ops based interface, but now we're effectively leaking some bdev internals to the rest of the kernel. Since we only care about whether it's a read or a write at that level, just pass in a bool 'is_write' parameter instead. Then we can also move op_is_write() and friends back under CONFIG_BLOCK protection. Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge tag 'binfmt-for-linus' of ↵Linus Torvalds2016-08-073-2/+60
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc Pull binfmt_misc update from James Bottomley: "This update is to allow architecture emulation containers to function such that the emulation binary can be housed outside the container itself. The container and fs parts both have acks from relevant experts. To use the new feature you have to add an F option to your binfmt_misc configuration" From the docs: "The usual behaviour of binfmt_misc is to spawn the binary lazily when the misc format file is invoked. However, this doesn't work very well in the face of mount namespaces and changeroots, so the F mode opens the binary as soon as the emulation is installed and uses the opened image to spawn the emulator, meaning it is always available once installed, regardless of how the environment changes" * tag 'binfmt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/binfmt_misc: binfmt_misc: add F option description to documentation binfmt_misc: add persistent opened binary handler for containers fs: add filp_clone_open API
| * binfmt_misc: add persistent opened binary handler for containersJames Bottomley2016-03-301-2/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new flag 'F' to the binfmt handlers. If you pass in 'F' the binary that runs the emulation will be opened immediately and in future, will be cloned from the open file. The net effect is that the handler survives both changeroots and mount namespace changes, making it easy to work with foreign architecture containers without contaminating the container image with the emulator. Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
| * fs: add filp_clone_open APIJames Bottomley2016-03-302-0/+21
| | | | | | | | | | | | | | | | | | | | | | I need an API that allows me to obtain a clone of the current file pointer to pass in to an exec handler. I've labelled this as an internal API because I can't see how it would be useful outside of the fs subsystem. The use case will be a persistent binfmt_misc handler. Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.cz>
* | fs: return EPERM on immutable inodeEryu Guan2016-08-074-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In most cases, EPERM is returned on immutable inode, and there're only a few places returning EACCES. I noticed this when running LTP on overlayfs, setxattr03 failed due to unexpected EACCES on immutable inode. So converting all EACCES to EPERM on immutable inode. Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus-2' of ↵Linus Torvalds2016-08-0726-121/+71
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted cleanups and fixes. In the "trivial API change" department - ->d_compare() losing 'parent' argument" * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: cachefiles: Fix race between inactivating and culling a cache object 9p: use clone_fid() 9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()" vfs: make dentry_needs_remove_privs() internal vfs: remove file_needs_remove_privs() vfs: fix deadlock in file_remove_privs() on overlayfs get rid of 'parent' argument of ->d_compare() cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare() affs ->d_compare(): don't bother with ->d_inode fold _d_rehash() and __d_rehash() together fold dentry_rcuwalk_invalidate() into its only remaining caller
| * | cachefiles: Fix race between inactivating and culling a cache objectDavid Howells2016-08-031-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a race between cachefiles_mark_object_inactive() and cachefiles_cull(): (1) cachefiles_cull() can't delete a backing file until the cache object is marked inactive, but as soon as that's the case it's fair game. (2) cachefiles_mark_object_inactive() marks the object as being inactive and *only then* reads the i_blocks on the backing inode - but cachefiles_cull() might've managed to delete it by this point. Fix this by making sure cachefiles_mark_object_inactive() gets any data it needs from the backing inode before deactivating the object. Without this, the following oops may occur: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 IP: [<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles] ... CPU: 11 PID: 527 Comm: kworker/u64:4 Tainted: G I ------------ 3.10.0-470.el7.x86_64 #1 Hardware name: Hewlett-Packard HP Z600 Workstation/0B54h, BIOS 786G4 v03.19 03/11/2011 Workqueue: fscache_object fscache_object_work_func [fscache] task: ffff880035edaf10 ti: ffff8800b77c0000 task.ti: ffff8800b77c0000 RIP: 0010:[<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles] RSP: 0018:ffff8800b77c3d70 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8800bf6cc400 RCX: 0000000000000034 RDX: 0000000000000000 RSI: ffff880090ffc710 RDI: ffff8800bf761ef8 RBP: ffff8800b77c3d88 R08: 2000000000000000 R09: 0090ffc710000000 R10: ff51005d2ff1c400 R11: 0000000000000000 R12: ffff880090ffc600 R13: ffff8800bf6cc520 R14: ffff8800bf6cc400 R15: ffff8800bf6cc498 FS: 0000000000000000(0000) GS:ffff8800bb8c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000098 CR3: 00000000019ba000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff880090ffc600 ffff8800bf6cc400 ffff8800867df140 ffff8800b77c3db0 ffffffffa06c48cb ffff880090ffc600 ffff880090ffc180 ffff880090ffc658 ffff8800b77c3df0 ffffffffa085d846 ffff8800a96b8150 ffff880090ffc600 Call Trace: [<ffffffffa06c48cb>] cachefiles_drop_object+0x6b/0xf0 [cachefiles] [<ffffffffa085d846>] fscache_drop_object+0xd6/0x1e0 [fscache] [<ffffffffa085d615>] fscache_object_work_func+0xa5/0x200 [fscache] [<ffffffff810a605b>] process_one_work+0x17b/0x470 [<ffffffff810a6e96>] worker_thread+0x126/0x410 [<ffffffff810a6d70>] ? rescuer_thread+0x460/0x460 [<ffffffff810ae64f>] kthread+0xcf/0xe0 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140 [<ffffffff81695418>] ret_from_fork+0x58/0x90 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140 The oopsing code shows: callq 0xffffffff810af6a0 <wake_up_bit> mov 0xf8(%r12),%rax mov 0x30(%rax),%rax mov 0x98(%rax),%rax <---- oops here lock add %rax,0x130(%rbx) where this is: d_backing_inode(object->dentry)->i_blocks Fixes: a5b3a80b899bda0f456f1246c4c5a1191ea01519 (CacheFiles: Provide read-and-reset release counters for cachefilesd) Reported-by: Jianhong Yin <jiyin@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | Merge branch 'for-viro' of ↵Al Viro2016-08-032-4/+4
| |\ \ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into for-linus
| | * | vfs: make dentry_needs_remove_privs() internalMiklos Szeredi2016-08-032-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Only used by the vfs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| | * | vfs: fix deadlock in file_remove_privs() on overlayfsMiklos Szeredi2016-08-031-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | file_remove_privs() is called with inode lock on file_inode(), which proceeds to calling notify_change() on file->f_path.dentry. Which triggers the WARN_ON_ONCE(!inode_is_locked(inode)) in addition to deadlocking later when ovl_setattr tries to lock the underlying inode again. Fix this mess by not mixing the layers, but doing everything on underlying dentry/inode. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 07a2daab49c5 ("ovl: Copy up underlying inode's ->i_mode to overlay inode") Cc: <stable@vger.kernel.org>
| * | | 9p: use clone_fid()Al Viro2016-08-035-31/+8
| | | | | | | | | | | | | | | | | | | | | | | | in a bunch of places it cleans the things up Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | 9p: fix braino introduced in "9p: new helper - v9fs_parent_fid()"Al Viro2016-08-032-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In v9fs_vfs_rename() we need to clone the parents' fids, not just find them. Spotted-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | get rid of 'parent' argument of ->d_compare()Al Viro2016-07-3117-34/+29
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | cifs, msdos, vfat, hfs+: don't bother with parent in ->d_compare()Al Viro2016-07-295-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | dentry->d_sb is just as good as parent->d_sb Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | affs ->d_compare(): don't bother with ->d_inodeAl Viro2016-07-292-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | Use ->d_sb directly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fold _d_rehash() and __d_rehash() togetherAl Viro2016-07-291-23/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only place where we feed to __d_rehash() something other than d_hash(dentry->d_name.hash) is __d_move(), where we give it d_hash of another dentry. Postpone rehashing until we'd switched the names and we are rid of that exception, along with the need to keep _d_rehash() and __d_rehash() separate. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fold dentry_rcuwalk_invalidate() into its only remaining callerAl Viro2016-07-291-15/+2
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | Merge tag 'xfs-rmap-for-linus-4.8-rc1' of ↵Linus Torvalds2016-08-0664-915/+6267
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull more xfs updates from Dave Chinner: "This is the second part of the XFS updates for this merge cycle, and contains the new reverse block mapping feature for XFS. Reverse mapping allows us to track the owner of a specific block on disk precisely. It is implemented as a set of btrees (one per allocation group) that track the owners of allocated extents. Effectively it is a "used space tree" that is updated when we allocate or free extents. i.e. it is coherent with the free space btrees we already maintain and never overlaps with them. This reverse mapping infrastructure is the building block of several upcoming features - reflink, copy-on-write data, dedupe, online metadata and data scrubbing, highly accurate bad sector/data loss reporting to users, and significantly improved reconstruction of damaged and corrupted filesystems. There's a lot of new stuff coming along in the next couple of cycles,a nd it all builds in the rmap infrastructure. As such, it's a huge chunk of new code with new on-disk format features and internal infrastructure. It warns at mount time as an experimental feature and that it may eat data (as we do with all new on-disk features until they stabilise). We have not released userspace suport for it yet - userspace support currently requires download from Darrick's xfsprogs repo and build from source, so the access to this feature is really developer/tester only at this point. Initial userspace support will be released at the same time kernel with this code in it is released. The new rmap enabled code regresses 3 xfstests - all are ENOSPC related corner cases, one of which Darrick posted a fix for a few hours ago. The other two are fixed by infrastructure that is part of the upcoming reflink patchset. This new ENOSPC infrastructure requires a on-disk format tweak required to keep mount times in check - we need to keep an on-disk count of allocated rmapbt blocks so we don't have to scan the entire btrees at mount time to count them. This is currently being tested and will be part of the fixes sent in the next week or two so users will not be exposed to this change" * tag 'xfs-rmap-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (52 commits) xfs: move (and rename) the deferred bmap-free tracepoints xfs: collapse single use static functions xfs: remove unnecessary parentheses from log redo item recovery functions xfs: remove the extents array from the rmap update done log item xfs: in btree_lshift, only allocate temporary cursor when needed xfs: remove unnecesary lshift/rshift key initialization xfs: remove the get*keys and update_keys btree ops pointers xfs: enable the rmap btree functionality xfs: don't update rmapbt when fixing agfl xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabled xfs: add rmap btree block detection to log recovery xfs: add rmap btree geometry feature flag xfs: propagate bmap updates to rmapbt xfs: enable the xfs_defer mechanism to process rmaps to update xfs: log rmap intent items xfs: create rmap update intent log items xfs: add rmap btree insert and delete helpers xfs: convert unwritten status of reverse mappings xfs: remove an extent from the rmap btree xfs: add an extent to the rmap btree ...
| * | | | xfs: move (and rename) the deferred bmap-free tracepointsDarrick J. Wong2016-08-032-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename the deferred bmap-free to extent_free and make them only trigger when we're really running deferred ops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: collapse single use static functionsDarrick J. Wong2016-08-032-128/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: remove unnecessary parentheses from log redo item recovery functionsDarrick J. Wong2016-08-032-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: remove the extents array from the rmap update done log itemDarrick J. Wong2016-08-037-89/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nothing ever uses the extent array in the rmap update done redo item, so remove it before it is fixed in the on-disk log format. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: in btree_lshift, only allocate temporary cursor when neededDarrick J. Wong2016-08-031-17/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We only need the temporary cursor in _btree_lshift if we're shifting in an overlapped btree. Therefore, factor that into a single block of code so we avoid unnecessary cursor duplication. Also fix use of the wrong cursor when checking for corruption in xfs_btree_rshift(). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: remove unnecesary lshift/rshift key initializationDarrick J. Wong2016-08-031-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the lshift/rshift functions we don't use the key variable for anything now, so remove the variable and its initializer. The update_keys functions figure out the key for a block on their own. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: remove the get*keys and update_keys btree ops pointersDarrick J. Wong2016-08-036-136/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These are internal btree functions; we don't need them to be dispatched via function pointers. Make them static again and just check the overlapped flag to figure out what we need to do. The strategy behind this patch was suggested by Christoph. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: enable the rmap btree functionalityDarrick J. Wong2016-08-032-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally-From: Dave Chinner <dchinner@redhat.com> Add the feature flag to the supported matrix so that the kernel can mount and use rmap btree enabled filesystems Signed-off-by: Dave Chinner <dchinner@redhat.com> [darrick.wong@oracle.com: move the experimental tag] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: don't update rmapbt when fixing agflDarrick J. Wong2016-08-032-4/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow a caller of xfs_alloc_fix_freelist to disable rmapbt updates when fixing the AG freelist. xfs_repair needs this during phase 5 to be able to adjust the freelist while it's reconstructing the rmap btree; the missing entries will be added back at the very end of phase 5 once the AGFL contents settle down. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: disable XFS_IOC_SWAPEXT when rmap btree is enabledDarrick J. Wong2016-08-031-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Swapping extents between two inodes requires the owner to be updated in the rmap tree for all the extents that are swapped. This code does not yet exist, so switch off the XFS_IOC_SWAPEXT ioctl until support has been implemented. This will need to be done before the rmap btree code can have the experimental tag removed. This functionality will be provided in a (much) later patch, using some of the reflink deferred block remapping functionality to accomlish extent swapping with rmap updates. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: add rmap btree block detection to log recoveryDarrick J. Wong2016-08-031-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally-From: Dave Chinner <dchinner@redhat.com> So such blocks can be correctly identified and have their operations structures attached to validate recovery has not resulted in a correct block. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: add rmap btree geometry feature flagDarrick J. Wong2016-08-032-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally-From: Dave Chinner <dchinner@redhat.com> So xfs_info and other userspace utilities know the filesystem is using this feature. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: propagate bmap updates to rmapbtDarrick J. Wong2016-08-038-16/+400
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we map, unmap, or convert an extent in a file's data or attr fork, schedule a respective update in the rmapbt. Previous versions of this patch required a 1:1 correspondence between bmap and rmap, but this is no longer true as we now have ability to make interval queries against the rmapbt. We use the deferred operations code to handle redo operations atomically and deadlock free. This plumbs in all five rmap actions (map, unmap, convert extent, alloc, free); we'll use the first three now for file data, and reflink will want the last two. We also add an error injection site to test log recovery. Finally, we need to fix the bmap shift extent code to adjust the rmaps correctly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: enable the xfs_defer mechanism to process rmaps to updateDarrick J. Wong2016-08-034-10/+134
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Connect the xfs_defer mechanism with the pieces that we'll need to handle deferred rmap updates. We'll wire up the existing code to our new deferred mechanism later. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: log rmap intent itemsDarrick J. Wong2016-08-036-0/+438
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a mechanism for higher levels to create RUI/RUD items, submit them to the log, and a stub function to deal with recovered RUI items. These parts will be connected to the rmapbt in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: create rmap update intent log itemsDarrick J. Wong2016-08-036-2/+662
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create rmap update intent/done log items to record redo information in the log. Because we need to roll transactions between updating the bmbt mapping and updating the reverse mapping, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions, just in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: add rmap btree insert and delete helpersDarrick J. Wong2016-08-032-1/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a couple of helper functions to encapsulate rmap btree insert and delete operations. Add tracepoints to the update function. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: convert unwritten status of reverse mappingsDarrick J. Wong2016-08-032-0/+446
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a function to convert an unwritten rmap extent to a real one and vice versa. [ dchinner: Note that this algorithm and code was derived from the existing bmapbt unwritten extent conversion code in xfs_bmap_add_extent_unwritten_real(). ] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: remove an extent from the rmap btreeDarrick J. Wong2016-08-031-5/+215
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally-From: Dave Chinner <dchinner@redhat.com> Now that we have records in the rmap btree, we need to remove them when extents are freed. This needs to find the relevant record in the btree and remove/trim/split it accordingly. [darrick.wong@oracle.com: make rmap routines handle the enlarged keyspace] [dchinner: remove remaining unused debug printks] [darrick: fix a bug when growfs in an AG with an rmap ending at EOFS] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: add an extent to the rmap btreeDarrick J. Wong2016-08-032-5/+223
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally-From: Dave Chinner <dchinner@redhat.com> Now all the btree, free space and transaction infrastructure is in place, we can finally add the code to insert reverse mappings to the rmap btree. Freeing will be done in a separate patch, so just the addition operation can be focussed on here. [darrick: handle owner offsets when adding rmaps] [dchinner: remove remaining debug printk statements] [darrick: move unwritten bit to rm_offset] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
| * | | | xfs: add tracepoints for the rmap functionsDarrick J. Wong2016-08-031-2/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>