summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'overlayfs-linus' of ↵Linus Torvalds2018-03-096-68/+228
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "This fixes a corner case for NFS exporting (introduced in this cycle) as well as fixing miscellaneous bugs" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: update Kconfig texts ovl: redirect_dir=nofollow should not follow redirect for opaque lower ovl: fix ptr_ret.cocci warnings ovl: check ERR_PTR() return value from ovl_lookup_real() ovl: check lower ancestry on encode of lower dir file handle ovl: hash non-dir by lower inode for fsnotify
| * ovl: update Kconfig textsMiklos Szeredi2018-03-071-0/+14
| | | | | | | | | | | | | | | | | | | | Add some hints about overlayfs kernel config options. Enabling NFS export by default is especially recommended against, as it incurs a performance penalty even if the filesystem is not actually exported. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * ovl: redirect_dir=nofollow should not follow redirect for opaque lowerVivek Goyal2018-02-261-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | redirect_dir=nofollow should not follow a redirect. But in a specific configuration it can still follow it. For example try this. $ mkdir -p lower0 lower1/foo upper work merged $ touch lower1/foo/lower-file.txt $ setfattr -n "trusted.overlay.opaque" -v "y" lower1/foo $ mount -t overlay -o lowerdir=lower1:lower0,workdir=work,upperdir=upper,redirect_dir=on none merged $ cd merged $ mv foo foo-renamed $ umount merged # mount again. This time with redirect_dir=nofollow $ mount -t overlay -o lowerdir=lower1:lower0,workdir=work,upperdir=upper,redirect_dir=nofollow none merged $ ls merged/foo-renamed/ # This lists lower-file.txt, while it should not have. Basically, we are doing redirect check after we check for d.stop. And if this is not last lower, and we find an opaque lower, d.stop will be set. ovl_lookup_single() if (!d->last && ovl_is_opaquedir(this)) { d->stop = d->opaque = true; goto out; } To fix this, first check redirect is allowed. And after that check if d.stop has been set or not. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Fixes: 438c84c2f0c7 ("ovl: don't follow redirects if redirect_dir=off") Cc: <stable@vger.kernel.org> #v4.15 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * ovl: fix ptr_ret.cocci warningsFengguang Wu2018-02-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | fs/overlayfs/export.c:459:10-16: WARNING: PTR_ERR_OR_ZERO can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: scripts/coccinelle/api/ptr_ret.cocci Fixes: 4b91c30a5a19 ("ovl: lookup connected ancestor of dir in inode cache") CC: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * ovl: check ERR_PTR() return value from ovl_lookup_real()Amir Goldstein2018-02-161-2/+2
| | | | | | | | | | | | | | Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 061701540349 ("ovl: lookup indexed ancestor of lower dir") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * ovl: check lower ancestry on encode of lower dir file handleAmir Goldstein2018-02-163-44/+168
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change relaxes copy up on encode of merge dir with lower layer > 1 and handles the case of encoding a merge dir with lower layer 1, where an ancestor is a non-indexed merge dir. In that case, decode of the lower file handle will not have been possible if the non-indexed ancestor is redirected before or after encode. Before encoding a non-upper directory file handle from real layer N, we need to check if it will be possible to reconnect an overlay dentry from the real lower decoded dentry. This is done by following the overlay ancestry up to a "layer N connected" ancestor and verifying that all parents along the way are "layer N connectable". If an ancestor that is NOT "layer N connectable" is found, we need to copy up an ancestor, which is "layer N connectable", thus making that ancestor "layer N connected". For example: layer 1: /a layer 2: /a/b/c The overlay dentry /a is NOT "layer 2 connectable", because if dir /a is copied up and renamed, upper dir /a will be indexed by lower dir /a from layer 1. The dir /a from layer 2 will never be indexed, so the algorithm in ovl_lookup_real_ancestor() (*) will not be able to lookup a connected overlay dentry from the connected lower dentry /a/b/c. To avoid this problem on decode time, we need to copy up an ancestor of /a/b/c, which is "layer 2 connectable", on encode time. That ancestor is /a/b. After copy up (and index) of /a/b, it will become "layer 2 connected" and when the time comes to decode the file handle from lower dentry /a/b/c, ovl_lookup_real_ancestor() will find the indexed ancestor /a/b and decoding a connected overlay dentry will be accomplished. (*) the algorithm in ovl_lookup_real_ancestor() can be improved to lookup an entry /a in the lower layers above layer N and find the indexed dir /a from layer 1. If that improvement is made, then the check for "layer N connected" will need to verify there are no redirects in lower layers above layer N. In the example above, /a will be "layer 2 connectable". However, if layer 2 dir /a is a target of a layer 1 redirect, then /a will NOT be "layer 2 connectable": layer 1: /A (redirect = /a) layer 2: /a/b/c Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * ovl: hash non-dir by lower inode for fsnotifyAmir Goldstein2018-02-161-18/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 31747eda41ef ("ovl: hash directory inodes for fsnotify") fixed an issue of inotify watch on directory that stops getting events after dropping dentry caches. A similar issue exists for non-dir non-upper files, for example: $ mkdir -p lower upper work merged $ touch lower/foo $ mount -t overlay -o lowerdir=lower,workdir=work,upperdir=upper none merged $ inotifywait merged/foo & $ echo 2 > /proc/sys/vm/drop_caches $ cat merged/foo inotifywait doesn't get the OPEN event, because ovl_lookup() called from 'cat' allocates a new overlay inode and does not reuse the watched inode. Fix this by hashing non-dir overlay inodes by lower real inode in the following cases that were not hashed before this change: - A non-upper overlay mount - A lower non-hardlink when index=off A helper ovl_hash_bylower() was added to put all the logic and documentation about which real inode an overlay inode is hashed by into one place. The issue dates back to initial version of overlayfs, but this patch depends on ovl_inode code that was introduced in kernel v4.13. Cc: <stable@vger.kernel.org> #v4.13 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | Merge tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2018-03-091-12/+30
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Darrick Wong: - Fix some iomap locking problems - Don't allocate cow blocks when we're zeroing file data * tag 'xfs-4.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't block on the ilock for RWF_NOWAIT xfs: don't start out with the exclusive ilock for direct I/O xfs: don't allocate COW blocks for zeroing holes or unwritten extents
| * | xfs: don't block on the ilock for RWF_NOWAITChristoph Hellwig2018-03-011-8/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix xfs_file_iomap_begin to trylock the ilock if IOMAP_NOWAIT is passed, so that we don't block io_submit callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | xfs: don't start out with the exclusive ilock for direct I/OChristoph Hellwig2018-03-011-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no reason to take the ilock exclusively at the start of xfs_file_iomap_begin for direct I/O, given that it will be demoted just before calling xfs_iomap_write_direct anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | xfs: don't allocate COW blocks for zeroing holes or unwritten extentsChristoph Hellwig2018-03-011-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The iomap zeroing interface is smart enough to skip zeroing holes or unwritten extents. Don't subvert this logic for reflink files. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | gfs2: Fixes to "Implement iomap for block_map" (2)Andreas Gruenbacher2018-03-071-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for block_map"' introduced another bug in gfs2_iomap_begin that can cause gfs2_block_map to set bh->b_size of an actual buffer to 0. This can lead to arbitrary incorrect behavior including crashes or disk corruption. Revert the incorrect part of that commit. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
* | | Merge tag 'for-4.16-rc3-tag' of ↵Linus Torvalds2018-03-0410-47/+191
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - when NR_CPUS is large, a SRCU structure can significantly inflate size of the main filesystem structure that would not be possible to allocate by kmalloc, so the kvalloc fallback is used - improved error handling - fix endiannes when printing some filesystem attributes via sysfs, this is could happen when a filesystem is moved between different endianity hosts - send fixes: the NO_HOLE mode should not send a write operation for a file hole - fix log replay for for special files followed by file hardlinks - fix log replay failure after unlink and link combination - fix max chunk size calculation for DUP allocation * tag 'for-4.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix log replay failure after unlink and link combination Btrfs: fix log replay failure after linking special file and fsync Btrfs: send, fix issuing write op when processing hole in no data mode btrfs: use proper endianness accessors for super_copy btrfs: alloc_chunk: fix DUP stripe size handling btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster btrfs: handle failure of add_pending_csums btrfs: use kvzalloc to allocate btrfs_fs_info
| * | | Btrfs: fix log replay failure after unlink and link combinationFilipe Manana2018-03-013-22/+139
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have a file with 2 (or more) hard links in the same directory, remove one of the hard links, create a new file (or link an existing file) in the same directory with the name of the removed hard link, and then finally fsync the new file, we end up with a log that fails to replay, causing a mount failure. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/testdir $ touch /mnt/testdir/foo $ ln /mnt/testdir/foo /mnt/testdir/bar $ sync $ unlink /mnt/testdir/bar $ touch /mnt/testdir/bar $ xfs_io -c "fsync" /mnt/testdir/bar <power failure> $ mount /dev/sdb /mnt mount: mount(2) failed: /mnt: No such file or directory When replaying the log, for that example, we also see the following in dmesg/syslog: [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257 [71813.674204] ------------[ cut here ]------------ [71813.675694] BTRFS: Transaction aborted (error -2) [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs] [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs] [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G W 4.15.0-rc9-btrfs-next-56+ #1 [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs] [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286 [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001 [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001 [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8 [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101 [71813.679669] FS: 00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000 [71813.679669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0 [71813.679669] Call Trace: [71813.679669] btrfs_unlink_inode+0x17/0x41 [btrfs] [71813.679669] drop_one_dir_item+0xfa/0x131 [btrfs] [71813.679669] add_inode_ref+0x71e/0x851 [btrfs] [71813.679669] ? __lock_is_held+0x39/0x71 [71813.679669] ? replay_one_buffer+0x53/0x53a [btrfs] [71813.679669] replay_one_buffer+0x4a4/0x53a [btrfs] [71813.679669] ? rcu_read_unlock+0x3a/0x57 [71813.679669] ? __lock_is_held+0x39/0x71 [71813.679669] walk_up_log_tree+0x101/0x1d2 [btrfs] [71813.679669] walk_log_tree+0xad/0x188 [btrfs] [71813.679669] btrfs_recover_log_trees+0x1fa/0x31e [btrfs] [71813.679669] ? replay_one_extent+0x544/0x544 [btrfs] [71813.679669] open_ctree+0x1cf6/0x2209 [btrfs] [71813.679669] btrfs_mount_root+0x368/0x482 [btrfs] [71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6 [71813.679669] ? __lockdep_init_map+0x176/0x1c2 [71813.679669] ? mount_fs+0x64/0x10b [71813.679669] mount_fs+0x64/0x10b [71813.679669] vfs_kern_mount+0x68/0xce [71813.679669] btrfs_mount+0x13e/0x772 [btrfs] [71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6 [71813.679669] ? __lockdep_init_map+0x176/0x1c2 [71813.679669] ? mount_fs+0x64/0x10b [71813.679669] mount_fs+0x64/0x10b [71813.679669] vfs_kern_mount+0x68/0xce [71813.679669] do_mount+0x6e5/0x973 [71813.679669] ? memdup_user+0x3e/0x5c [71813.679669] SyS_mount+0x72/0x98 [71813.679669] entry_SYSCALL_64_fastpath+0x1e/0x8b [71813.679669] RIP: 0033:0x7f7cedf150ba [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206 [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c [71813.679669] ---[ end trace 83bd473fc5b4663b ]--- [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree) [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30 [71814.128078] BTRFS error (device dm-0): open_ctree failed This happens because the log has inode reference items for both inode 258 (the first file we created) and inode 259 (the second file created), and when processing the reference item for inode 258, we replace the corresponding item in the subvolume tree (which has two names, "foo" and "bar") witht he one in the log (which only has one name, "foo") without removing the corresponding dir index keys from the parent directory. Later, when processing the inode reference item for inode 259, which has a name of "bar" associated to it, we notice that dir index entries exist for that name and for a different inode, so we attempt to unlink that name, which fails because the inode reference item for inode 258 no longer has the name "bar" associated to it, making a call to btrfs_unlink_inode() fail with a -ENOENT error. Fix this by unlinking all the names in an inode reference item from a subvolume tree that are not present in the inode reference item found in the log tree, before overwriting it with the item from the log tree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: fix log replay failure after linking special file and fsyncFilipe Manana2018-03-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If in the same transaction we rename a special file (fifo, character/block device or symbolic link), create a hard link for it having its old name then sync the log, we will end up with a log that can not be replayed and at when attempting to replay it, an EEXIST error is returned and mounting the filesystem fails. Example scenario: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/testdir $ mkfifo /mnt/testdir/foo # Make sure everything done so far is durably persisted. $ sync # Create some unrelated file and fsync it, this is just to create a log # tree. The file must be in the same directory as our special file. $ touch /mnt/testdir/f1 $ xfs_io -c "fsync" /mnt/testdir/f1 # Rename our special file and then create a hard link with its old name. $ mv /mnt/testdir/foo /mnt/testdir/bar $ ln /mnt/testdir/bar /mnt/testdir/foo # Create some other unrelated file and fsync it, this is just to persist # the log tree which was modified by the previous rename and link # operations. Alternatively we could have modified file f1 and fsync it. $ touch /mnt/f2 $ xfs_io -c "fsync" /mnt/f2 <power failure> $ mount /dev/sdc /mnt mount: mount /dev/sdc on /mnt failed: File exists This happens because when both the log tree and the subvolume's tree have an entry in the directory "testdir" with the same name, that is, there is one key (258 INODE_REF 257) in the subvolume tree and another one in the log tree (where 258 is the inode number of our special file and 257 is the inode for directory "testdir"). Only the data of those two keys differs, in the subvolume tree the index field for inode reference has a value of 3 while the log tree it has a value of 5. Because the same key exists in both trees, but have different index, the log replay fails with an -EEXIST error when attempting to replay the inode reference from the log tree. Fix this by setting the last_unlink_trans field of the inode (our special file) to the current transaction id when a hard link is created, as this forces logging the parent directory inode, solving the conflict at log replay time. A new generic test case for fstests was also submitted. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | Btrfs: send, fix issuing write op when processing hole in no data modeFilipe Manana2018-03-011-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing an incremental send of a filesystem with the no-holes feature enabled, we end up issuing a write operation when using the no data mode send flag, instead of issuing an update extent operation. Fix this by issuing the update extent operation instead. Trivial reproducer: $ mkfs.btrfs -f -O no-holes /dev/sdc $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdc /mnt/sdc $ mount /dev/sdd /mnt/sdd $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1 $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2 $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \ | btrfs receive -vv /mnt/sdd Before this change the output of the second receive command is: receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447... utimes write foobar, offset 8192, len 8192 utimes foobar BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-... After this change it is: receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64... utimes update_extent foobar: offset=8192, len=8192 utimes foobar BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64... Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: use proper endianness accessors for super_copyAnand Jain2018-03-012-13/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fs_info::super_copy is a byte copy of the on-disk structure and all members must use the accessor macros/functions to obtain the right value. This was missing in update_super_roots and in sysfs readers. Moving between opposite endianness hosts will report bogus numbers in sysfs, and mount may fail as the root will not be restored correctly. If the filesystem is always used on a same endian host, this will not be a problem. Fix this by using the btrfs_set_super...() functions to set fs_info::super_copy values, and for the sysfs, use the cached fs_info::nodesize/sectorsize values. CC: stable@vger.kernel.org Fixes: df93589a17378 ("btrfs: export more from FS_INFO to sysfs") Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: alloc_chunk: fix DUP stripe size handlingHans van Kranenburg2018-03-011-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case of using DUP, we search for enough unallocated disk space on a device to hold two stripes. The devices_info[ndevs-1].max_avail that holds the amount of unallocated space found is directly assigned to stripe_size, while it's actually twice the stripe size. Later on in the code, an unconditional division of stripe_size by dev_stripes corrects the value, but in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size. The unconditional division later tries to correct stripe_size, but will actually make sure we can't allocate more than half the max_chunk_size. Fix this by moving the division by dev_stripes before the max chunk size check, so it always contains the right value, instead of putting a duct tape division in further on to get it fixed again. Since in all other cases than DUP, dev_stripes is 1, this change only affects DUP. Other attempts in the past were made to fix this: * 37db63a400 "Btrfs: fix max chunk size check in chunk allocator" tried to fix the same problem, but still resulted in part of the code acting on a wrongly doubled stripe_size value. * 86db25785a "Btrfs: fix max chunk size on raid5/6" unintentionally broke this fix again. The real problem was already introduced with the rest of the code in 73c5de0051. The user visible result however will be that the max chunk size for DUP will suddenly double, while it's actually acting according to the limits in the code again like it was 5 years ago. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html Fixes: 73c5de0051 ("btrfs: quasi-round-robin for chunk allocation") Fixes: 86db25785a ("Btrfs: fix max chunk size on raid5/6") Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_clusterNikolay Borisov2018-03-011-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Essentially duplicate the error handling from the above block which handles the !PageUptodate(page) case and additionally clear EXTENT_BOUNDARY. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: handle failure of add_pending_csumsNikolay Borisov2018-03-011-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | add_pending_csums was added as part of the new data=ordered implementation in e6dcd2dc9c48 ("Btrfs: New data=ordered implementation"). Even back then it called the btrfs_csum_file_blocks which can fail but it never bothered handling the failure. In ENOMEM situation this could lead to the filesystem failing to write the checksums for a particular extent and not detect this. On read this could lead to the filesystem erroring out due to crc mismatch. Fix it by propagating failure from add_pending_csums and handling them. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: use kvzalloc to allocate btrfs_fs_infoJeff Mahoney2018-03-012-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On kernels built with NR_CPUS=8192, this can result in kmalloc failures that prevent mounting. There is work in progress to try to resolve this for every user of srcu_struct but using kvzalloc will work around the failures until that is complete. As an example with NR_CPUS=512 on x86_64: the overall size of subvol_srcu is 3460 bytes, fs_info is 6496. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | Merge tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-clientLinus Torvalds2018-03-024-38/+45
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph fixes from Ilya Dryomov: "A cap handling fix from Zhi that ensures that metadata writeback isn't delayed and three error path memory leak fixups from Chengguang" * tag 'ceph-for-4.16-rc4' of git://github.com/ceph/ceph-client: ceph: fix potential memory leak in init_caches() ceph: fix dentry leak when failing to init debugfs libceph, ceph: avoid memory leak when specifying same option several times ceph: flush dirty caps of unlinked inode ASAP
| * | | | ceph: fix potential memory leak in init_caches()Chengguang Xu2018-03-011-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is lack of cache destroy operation for ceph_file_cachep when failing from fscache register. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | ceph: fix dentry leak when failing to init debugfsChengguang Xu2018-02-261-11/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When failing from ceph_fs_debugfs_init() in ceph_real_mount(), there is lack of dput of root_dentry and it causes slab errors, so change the calling order of ceph_fs_debugfs_init() and open_root_dentry() and do some cleanups to avoid this issue. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | libceph, ceph: avoid memory leak when specifying same option several timesChengguang Xu2018-02-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When parsing string option, in order to avoid memory leak we need to carefully free it first in case of specifying same option several times. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | ceph: flush dirty caps of unlinked inode ASAPZhi Zhang2018-02-263-24/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Client should release unlinked inode from its cache ASAP. But client can't release inode with dirty caps. Link: http://tracker.ceph.com/issues/22886 Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | | | | Merge tag 'for-linus-20180302' of git://git.kernel.dk/linux-blockLinus Torvalds2018-03-022-19/+33
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A collection of fixes for this series. This is a little larger than usual at this time, but that's mainly because I was out on vacation last week. Nothing in here is major in any way, it's just two weeks of fixes. This contains: - NVMe pull from Keith, with a set of fixes from the usual suspects. - mq-deadline zone unlock fix from Damien, fixing an issue with the SMR zone locking added for 4.16. - two bcache fixes sent in by Michael, with changes from Coly and Tang. - comment typo fix from Eric for blktrace. - return-value error handling fix for nbd, from Gustavo. - fix a direct-io case where we don't defer to a completion handler, making us sleep from IRQ device completion. From Jan. - a small series from Jan fixing up holes around handling of bdev references. - small set of regression fixes from Jiufei, mostly fixing problems around the gendisk pointer -> partition index change. - regression fix from Ming, fixing a boundary issue with the discard page cache invalidation. - two-patch series from Ming, fixing both a core blk-mq-sched and kyber issue around token freeing on a requeue condition" * tag 'for-linus-20180302' of git://git.kernel.dk/linux-block: (24 commits) block: fix a typo block: display the correct diskname for bio block: fix the count of PGPGOUT for WRITE_SAME mq-deadline: Make sure to always unlock zones nvmet: fix PSDT field check in command format nvme-multipath: fix sysfs dangerously created links nbd: fix return value in error handling path bcache: fix kcrashes with fio in RAID5 backend dev bcache: correct flash only vols (check all uuids) blktrace_api.h: fix comment for struct blk_user_trace_setup blockdev: Avoid two active bdev inodes for one device genhd: Fix BUG in blkdev_open() genhd: Fix use after free in __blkdev_get() genhd: Add helper put_disk_and_module() genhd: Rename get_disk() to get_disk_and_module() genhd: Fix leaked module reference for NVME devices direct-io: Fix sleep in atomic due to sync AIO nvme-pci: Fix nvme queue cleanup if IRQ setup fails block: kyber: fix domain token leak during requeue blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch ...
| * | | | | blockdev: Avoid two active bdev inodes for one deviceJan Kara2018-02-261-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When blkdev_open() races with device removal and creation it can happen that unhashed bdev inode gets associated with newly created gendisk like: CPU0 CPU1 blkdev_open() bdev = bd_acquire() del_gendisk() bdev_unhash_inode(bdev); remove device create new device with the same number __blkdev_get() disk = get_gendisk() - gets reference to gendisk of the new device Now another blkdev_open() will not find original 'bdev' as it got unhashed, create a new one and associate it with the same 'disk' at which point problems start as we have two independent page caches for one device. Fix the problem by verifying that the bdev inode didn't get unhashed before we acquired gendisk reference. That way we make sure gendisk can get associated only with visible bdev inodes. Tested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | | genhd: Fix use after free in __blkdev_get()Jan Kara2018-02-261-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When two blkdev_open() calls race with device removal and recreation, __blkdev_get() can use looked up gendisk after it is freed: CPU0 CPU1 CPU2 del_gendisk(disk); bdev_unhash_inode(inode); blkdev_open() blkdev_open() bdev = bd_acquire(inode); - creates and returns new inode bdev = bd_acquire(inode); - returns the same inode __blkdev_get(devt) __blkdev_get(devt) disk = get_gendisk(devt); - got structure of device going away <finish device removal> <new device gets created under the same device number> disk = get_gendisk(devt); - got new device structure if (!bdev->bd_openers) { does the first open } if (!bdev->bd_openers) - false } else { put_disk_and_module(disk) - remember this was old device - this was last ref and disk is now freed } disk_unblock_events(disk); -> oops Fix the problem by making sure we drop reference to disk in __blkdev_get() only after we are really done with it. Reported-by: Hou Tao <houtao1@huawei.com> Tested-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | | genhd: Add helper put_disk_and_module()Jan Kara2018-02-261-14/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a proper counterpart to get_disk_and_module() - put_disk_and_module(). Currently it is opencoded in several places. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | | direct-io: Fix sleep in atomic due to sync AIOJan Kara2018-02-261-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e864f39569f4 "fs: add RWF_DSYNC aand RWF_SYNC" added additional way for direct IO to become synchronous and thus trigger fsync from the IO completion handler. Then commit 9830f4be159b "fs: Use RWF_* flags for AIO operations" allowed these flags to be set for AIO as well. However that commit forgot to update the condition checking whether the IO completion handling should be defered to a workqueue and thus AIO DIO with RWF_[D]SYNC set will call fsync() from IRQ context resulting in sleep in atomic. Fix the problem by checking directly iocb flags (the same way as it is done in dio_complete()) instead of checking all conditions that could lead to IO being synchronous. CC: Christoph Hellwig <hch@lst.de> CC: Goldwyn Rodrigues <rgoldwyn@suse.com> CC: stable@vger.kernel.org Reported-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Fixes: 9830f4be159b29399d107bffb99e0132bc5aedd4 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | | Merge tag 'xfs-4.16-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2018-02-284-5/+13
|\ \ \ \ \ \ | |_|/ / / / |/| | | / / | | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Darrick Wong: - fix some compiler warnings - fix block reservations for transactions created during log recovery - fix resource leaks when respecifying mount options * tag 'xfs-4.16-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix potential memory leak in mount option parsing xfs: reserve blocks for refcount / rmap log item recovery xfs: use memset to initialize xfs_scrub_agfl_info
| * | | | xfs: fix potential memory leak in mount option parsingChengguang Xu2018-02-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When specifying string type mount option (e.g., logdev) several times in a mount, current option parsing may cause memory leak. Hence, call kfree for previous one in this case. Signed-off-by: Chengguang Xu <cgxu519@icloud.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | | | xfs: reserve blocks for refcount / rmap log item recoveryDarrick J. Wong2018-02-222-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During log recovery, the per-AG reservations aren't yet set up, so log recovery has to reserve enough blocks to handle all possible btree splits. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | | | xfs: use memset to initialize xfs_scrub_agfl_infoEric Sandeen2018-02-221-1/+2
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Apparently different gcc versions have competing and incompatible notions of how to initialize at declaration, so just give up and fall back to the time-tested memset(). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | | Merge tag 'nfs-for-4.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2018-02-253-11/+11
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: - fix a broken cast in nfs4_callback_recallany() - fix an Oops during NFSv4 migration events - make struct nlmclnt_fl_close_lock_ops static * tag 'nfs-for-4.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: make struct nlmclnt_fl_close_lock_ops static nfs: system crashes after NFS4ERR_MOVED recovery NFSv4: Fix broken cast in nfs4_callback_recallany()
| * | | NFS: make struct nlmclnt_fl_close_lock_ops staticColin Ian King2018-02-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The structure nlmclnt_fl_close_lock_ops s local to the source and does not need to be in global scope, so make it static. Cleans up sparse warning: fs/nfs/nfs3proc.c:876:33: warning: symbol 'nlmclnt_fl_close_lock_ops' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | nfs: system crashes after NFS4ERR_MOVED recoveryBill.Baker@oracle.com2018-02-221-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfs4_update_server unconditionally releases the nfs_client for the source server. If migration fails, this can cause the source server's nfs_client struct to be left with a low reference count, resulting in use-after-free. Also, adjust reference count handling for ELOOP. NFS: state manager: migration failed on NFSv4 server nfsvmu10 with error 6 WARNING: CPU: 16 PID: 17960 at fs/nfs/client.c:281 nfs_put_client+0xfa/0x110 [nfs]() nfs_put_client+0xfa/0x110 [nfs] nfs4_run_state_manager+0x30/0x40 [nfsv4] kthread+0xd8/0xf0 BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8 nfs4_xdr_enc_write+0x6b/0x160 [nfsv4] rpcauth_wrap_req+0xac/0xf0 [sunrpc] call_transmit+0x18c/0x2c0 [sunrpc] __rpc_execute+0xa6/0x490 [sunrpc] rpc_async_schedule+0x15/0x20 [sunrpc] process_one_work+0x160/0x470 worker_thread+0x112/0x540 ? rescuer_thread+0x3f0/0x3f0 kthread+0xd8/0xf0 This bug was introduced by 32e62b7c ("NFS: Add nfs4_update_server"), but the fix applies cleanly to 52442f9b ("NFS4: Avoid migration loops") Reported-by: Helen Chao <helen.chao@oracle.com> Fixes: 52442f9b11b7 ("NFS4: Avoid migration loops") Signed-off-by: Bill Baker <bill.baker@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | NFSv4: Fix broken cast in nfs4_callback_recallany()Trond Myklebust2018-02-211-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | Passing a pointer to a unsigned integer to test_bit() is broken. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | | | Merge branch 'siginfo-linus' of ↵Linus Torvalds2018-02-221-3/+12
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo fix from Eric Biederman: "This fixes a build error that only shows up on blackfin" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: fs/signalfd: fix build error for BUS_MCEERR_AR
| * | | | fs/signalfd: fix build error for BUS_MCEERR_ARRandy Dunlap2018-02-221-3/+12
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix build error in fs/signalfd.c by using same method that is used in kernel/signal.c: separate blocks for different signal si_code values. ./fs/signalfd.c: error: 'BUS_MCEERR_AR' undeclared (first use in this function) Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* | | | efivarfs: Limit the rate for non-root to read filesLuck, Tony2018-02-221-0/+6
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each read from a file in efivarfs results in two calls to EFI (one to get the file size, another to get the actual data). On X86 these EFI calls result in broadcast system management interrupts (SMI) which affect performance of the whole system. A malicious user can loop performing reads from efivarfs bringing the system to its knees. Linus suggested per-user rate limit to solve this. So we add a ratelimit structure to "user_struct" and initialize it for the root user for no limit. When allocating user_struct for other users we set the limit to 100 per second. This could be used for other places that want to limit the rate of some detrimental user action. In efivarfs if the limit is exceeded when reading, we take an interruptible nap for 50ms and check the rate limit again. Signed-off-by: Tony Luck <tony.luck@intel.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'for-4.16-rc1-tag' of ↵Linus Torvalds2018-02-167-21/+80
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We have a few assorted fixes, some of them show up during fstests so I gave them more testing" * tag 'for-4.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device Btrfs: fix null pointer dereference when replacing missing device btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes btrfs: Ignore errors from btrfs_qgroup_trace_extent_post Btrfs: fix unexpected -EEXIST when creating new inode Btrfs: fix use-after-free on root->orphan_block_rsv Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly Btrfs: fix extent state leak from tree log Btrfs: fix crash due to not cleaning up tree log block's dirty bits Btrfs: fix deadlock in run_delalloc_nocow
| * | btrfs: Fix use-after-free when cleaning up fs_devs with a single stale deviceNikolay Borisov2018-02-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4fde46f0cc71 ("Btrfs: free the stale device") introduced btrfs_free_stale_device which iterates the device lists for all registered btrfs filesystems and deletes those devices which aren't mounted. In a btrfs_devices structure has only 1 device attached to it and it is unused then btrfs_free_stale_devices will proceed to also free the btrfs_fs_devices struct itself. Currently this leads to a use after free since list_for_each_entry will try to perform a check on the already freed memory to see if it has to terminate the loop. The fix is to use 'break' when we know we are freeing the current fs_devs. Fixes: 4fde46f0cc71 ("Btrfs: free the stale device") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fix null pointer dereference when replacing missing deviceFilipe Manana2018-02-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we are replacing a missing device we mount the filesystem with the degraded mode option in which case we are allowed to have a btrfs device structure without a backing device member (its bdev member is NULL) and therefore we can't dereference that member. Commit 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly") started to dereference that member when discarding extents, resulting in a null pointer dereference: [ 3145.322257] BTRFS warning (device sdf): devid 2 uuid 4d922414-58eb-4880-8fed-9c3840f6c5d5 is missing [ 3145.364116] BTRFS info (device sdf): dev_replace from <missing disk> (devid 2) to /dev/sdg started [ 3145.413489] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0 [ 3145.415085] IP: btrfs_discard_extent+0x6a/0xf8 [btrfs] [ 3145.415085] PGD 0 P4D 0 [ 3145.415085] Oops: 0000 [#1] PREEMPT SMP PTI [ 3145.415085] Modules linked in: ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse parport_pc serio_raw i2c_piix4 i2 [ 3145.415085] CPU: 0 PID: 11989 Comm: btrfs Tainted: G W 4.15.0-rc9-btrfs-next-55+ #1 [ 3145.415085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [ 3145.415085] RIP: 0010:btrfs_discard_extent+0x6a/0xf8 [btrfs] [ 3145.415085] RSP: 0018:ffffc90004813c60 EFLAGS: 00010293 [ 3145.415085] RAX: ffff88020d39cc00 RBX: ffff88020c4ea2a0 RCX: 0000000000000002 [ 3145.415085] RDX: 0000000000000000 RSI: ffff88020c4ea240 RDI: 0000000000000000 [ 3145.415085] RBP: 0000000000000000 R08: 0000000000004000 R09: 0000000000000000 [ 3145.415085] R10: ffffc90004813ae8 R11: 0000000000000000 R12: 0000000000000000 [ 3145.415085] R13: ffff88020c418000 R14: 0000000000000000 R15: 0000000000000000 [ 3145.415085] FS: 00007f565681f8c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [ 3145.415085] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3145.415085] CR2: 00000000000000e0 CR3: 000000020d208006 CR4: 00000000001606f0 [ 3145.415085] Call Trace: [ 3145.415085] btrfs_finish_extent_commit+0x9a/0x1be [btrfs] [ 3145.415085] btrfs_commit_transaction+0x649/0x7a0 [btrfs] [ 3145.415085] ? start_transaction+0x2b0/0x3b3 [btrfs] [ 3145.415085] btrfs_dev_replace_start+0x274/0x30c [btrfs] [ 3145.415085] btrfs_dev_replace_by_ioctl+0x45/0x59 [btrfs] [ 3145.415085] btrfs_ioctl+0x1a91/0x1d62 [btrfs] [ 3145.415085] ? lock_acquire+0x16a/0x1af [ 3145.415085] ? vfs_ioctl+0x1b/0x28 [ 3145.415085] ? trace_hardirqs_on_caller+0x14c/0x1a6 [ 3145.415085] vfs_ioctl+0x1b/0x28 [ 3145.415085] do_vfs_ioctl+0x5a9/0x5e0 [ 3145.415085] ? _raw_spin_unlock_irq+0x34/0x46 [ 3145.415085] ? entry_SYSCALL_64_fastpath+0x5/0x8b [ 3145.415085] ? trace_hardirqs_on_caller+0x14c/0x1a6 [ 3145.415085] SyS_ioctl+0x52/0x76 [ 3145.415085] entry_SYSCALL_64_fastpath+0x1e/0x8b [ 3145.415085] RIP: 0033:0x7f56558b3c47 [ 3145.415085] RSP: 002b:00007ffdcfac4c58 EFLAGS: 00000202 [ 3145.415085] Code: be 02 00 00 00 4c 89 ef e8 b9 e7 03 00 85 c0 89 c5 75 75 48 8b 44 24 08 45 31 f6 48 8d 58 60 eb 52 48 8b 03 48 8b b8 a0 00 00 00 <48> 8b 87 e0 00 [ 3145.415085] RIP: btrfs_discard_extent+0x6a/0xf8 [btrfs] RSP: ffffc90004813c60 [ 3145.415085] CR2: 00000000000000e0 [ 3145.458185] ---[ end trace 06302e7ac31902bf ]--- This is trivially reproduced by running the test btrfs/027 from fstests like this: $ MOUNT_OPTIONS="-o discard" ./check btrfs/027 Fix this by skipping devices without a backing device before attempting to discard. Fixes: 38b5f68e9811 ("btrfs: drop btrfs_device::can_discard to query directly") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodesZygo Blaxell2018-02-021-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Until v4.14, this warning was very infrequent: WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0 Modules linked in: [...] CPU: 3 PID: 18172 Comm: bees Tainted: G D W L 4.11.9-zb64+ #1 Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014 Call Trace: dump_stack+0x85/0xc2 __warn+0xd1/0xf0 warn_slowpath_null+0x1d/0x20 find_parent_nodes+0xc41/0x14e0 __btrfs_find_all_roots+0xad/0x120 ? extent_same_check_offsets+0x70/0x70 iterate_extent_inodes+0x168/0x300 iterate_inodes_from_logical+0x87/0xb0 ? iterate_inodes_from_logical+0x87/0xb0 ? extent_same_check_offsets+0x70/0x70 btrfs_ioctl+0x8ac/0x2820 ? lock_acquire+0xc2/0x200 do_vfs_ioctl+0x91/0x700 ? __fget+0x112/0x200 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc6 ? trace_hardirqs_off_caller+0x1f/0x140 Starting with v4.14 (specifically 86d5f9944252 ("btrfs: convert prelimary reference tracking to use rbtrees")) the WARN_ON occurs three orders of magnitude more frequently--almost once per second while running workloads like bees. Replace the WARN_ON() with a comment rationale for its removal. The rationale is paraphrased from an explanation by Edmund Nadolski <enadolski@suse.de> on the linux-btrfs mailing list. Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()") Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: Ignore errors from btrfs_qgroup_trace_extent_postNikolay Borisov2018-02-022-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running generic/019 with qgroups on the scratch device enabled is almost guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed to trigger only on -ENOMEM, in reality, however, it's possible to get -EIO from btrfs_qgroup_trace_extent_post. This function just finds the roots of the extent being tracked and sets the qrecord->old_roots list. If this operation fails nothing critical happens except the quota accounting can be considered wrong. In such case just set the INCONSISTENT flag for the quota and print a warning, rather than killing off the system. Additionally, it's possible to trigger a BUG_ON in btrfs_truncate_inode_items as well. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ error message adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fix unexpected -EEXIST when creating new inodeLiu Bo2018-02-021-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The highest objectid, which is assigned to new inode, is decided at the time of initializing fs roots. However, in cases where log replay gets processed, the btree which fs root owns might be changed, so we have to search it again for the highest objectid, otherwise creating new inode would end up with -EEXIST. cc: <stable@vger.kernel.org> v4.4-rc6+ Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fix use-after-free on root->orphan_block_rsvLiu Bo2018-02-021-13/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got these from running generic/475, WARNING: CPU: 0 PID: 26384 at fs/btrfs/inode.c:3326 btrfs_orphan_commit_root+0x1ac/0x2b0 [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: btrfs_block_rsv_release+0x1c/0x70 [btrfs] Call Trace: btrfs_orphan_release_metadata+0x9f/0x200 [btrfs] btrfs_orphan_del+0x10d/0x170 [btrfs] btrfs_setattr+0x500/0x640 [btrfs] notify_change+0x7ae/0x870 do_truncate+0xca/0x130 vfs_truncate+0x2ee/0x3d0 do_sys_truncate+0xaf/0xf0 SyS_truncate+0xe/0x10 entry_SYSCALL_64_fastpath+0x1f/0x96 The race is between btrfs_orphan_commit_root and btrfs_orphan_del, t1 t2 btrfs_orphan_commit_root btrfs_orphan_del spin_lock check (&root->orphan_inodes) root->orphan_block_rsv = NULL; spin_unlock atomic_dec(&root->orphan_inodes); access root->orphan_block_rsv Accessing root->orphan_block_rsv must be done before decreasing root->orphan_inodes. cc: <stable@vger.kernel.org> v3.12+ Fixes: 703c88e03524 ("Btrfs: fix tracking of orphan inode count") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctlyLiu Bo2018-02-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This regression is introduced in commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction"). There are two problems, a) it is ->destroy_inode() that does the final free on inode, not ->evict_inode(), b) clear_inode() must be called before ->evict_inode() returns. This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); in evict() because I_CLEAR is set in clear_inode(). Fixes: commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction") Cc: <stable@vger.kernel.org> # v4.7-rc6+ Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>