summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'nfs-for-5.1-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2019-03-264-7/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fixes: - Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data() - fix mount/umount race in nlmclnt. - NFSv4.1 don't free interrupted slot on open Bugfixes: - Don't let RPC_SOFTCONN tasks time out if the transport is connected - Fix a typo in nfs_init_timeout_values() - Fix layoutstats handling during read failovers - fix uninitialized variable warning" * tag 'nfs-for-5.1-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: fix uninitialized variable warning pNFS/flexfiles: Fix layoutstats handling during read failovers NFS: Fix a typo in nfs_init_timeout_values() SUNRPC: Don't let RPC_SOFTCONN tasks time out if the transport is connected NFSv4.1 don't free interrupted slot on open NFS: fix mount/umount race in nlmclnt. NFS: Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data()
| * pNFS/flexfiles: Fix layoutstats handling during read failoversTrond Myklebust2019-03-231-1/+4
| | | | | | | | | | | | | | | | During a read failover, we may end up changing the value of the pgio_mirror_idx, so make sure that we record the layout stats before that update. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| * NFS: Fix a typo in nfs_init_timeout_values()Trond Myklebust2019-03-231-1/+1
| | | | | | | | | | | | | | | | | | Specifying a retrans=0 mount parameter to a NFS/TCP mount, is inadvertently causing the NFS client to rewrite any specified timeout parameter to the default of 60 seconds. Fixes: a956beda19a6 ("NFS: Allow the mount option retrans=0") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| * NFSv4.1 don't free interrupted slot on openOlga Kornievskaia2019-03-191-1/+2
| | | | | | | | | | | | | | | | | | | | Allow the async rpc task for finish and update the open state if needed, then free the slot. Otherwise, the async rpc unable to decode the reply. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Fixes: ae55e59da0e4 ("pnfs: Don't release the sequence slot...") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| * NFS: fix mount/umount race in nlmclnt.NeilBrown2019-03-181-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the last NFSv3 unmount from a given host races with a mount from the same host, we can destroy an nlm_host that is still in use. Specifically nlmclnt_lookup_host() can increment h_count on an nlm_host that nlmclnt_release_host() has just successfully called refcount_dec_and_test() on. Once nlmclnt_lookup_host() drops the mutex, nlm_destroy_host_lock() will be called to destroy the nlmclnt which is now in use again. The cause of the problem is that the dec_and_test happens outside the locked region. This is easily fixed by using refcount_dec_and_mutex_lock(). Fixes: 8ea6ecc8b075 ("lockd: Create client-side nlm_host cache") Cc: stable@vger.kernel.org (v2.6.38+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| * NFS: Fix nfs4_lock_state refcounting in nfs4_alloc_{lock,unlock}data()Catalin Marinas2019-03-181-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7b587e1a5a6c ("NFS: use locks_copy_lock() to copy locks.") changed the lock copying from memcpy() to the dedicated locks_copy_lock() function. The latter correctly increments the nfs4_lock_state.ls_count via nfs4_fl_copy_lock(), however, this refcount has already been incremented in the nfs4_alloc_{lock,unlock}data(). Kmemleak subsequently reports an unreferenced nfs4_lock_state object as below (arm64 platform): unreferenced object 0xffff8000fce0b000 (size 256): comm "systemd-sysuser", pid 1608, jiffies 4294892825 (age 32.348s) hex dump (first 32 bytes): 20 57 4c fb 00 80 ff ff 20 57 4c fb 00 80 ff ff WL..... WL..... 00 57 4c fb 00 80 ff ff 01 00 00 00 00 00 00 00 .WL............. backtrace: [<000000000d15010d>] kmem_cache_alloc+0x178/0x208 [<00000000d7c1d264>] nfs4_set_lock_state+0x124/0x1f0 [<000000009c867628>] nfs4_proc_lock+0x90/0x478 [<000000001686bd74>] do_setlk+0x64/0xe8 [<00000000e01500d4>] nfs_lock+0xe8/0x1f0 [<000000004f387d8d>] vfs_lock_file+0x18/0x40 [<00000000656ab79b>] do_lock_file_wait+0x68/0xf8 [<00000000f17c4a4b>] fcntl_setlk+0x224/0x280 [<0000000052a242c6>] do_fcntl+0x418/0x730 [<000000004f47291a>] __arm64_sys_fcntl+0x84/0xd0 [<00000000d6856e01>] el0_svc_common+0x80/0xf0 [<000000009c4bd1df>] el0_svc_handler+0x2c/0x80 [<00000000b1a0d479>] el0_svc+0x8/0xc [<0000000056c62a0f>] 0xffffffffffffffff This patch removes the original refcount_inc(&lsp->ls_count) that was paired with the memcpy() lock copying. Fixes: 7b587e1a5a6c ("NFS: use locks_copy_lock() to copy locks.") Cc: <stable@vger.kernel.org> # 5.0.x- Cc: NeilBrown <neilb@suse.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* | Merge tag 'for-5.1-rc2-tag' of ↵Linus Torvalds2019-03-266-21/+72
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fsync fixes: i_size for truncate vs fsync, dio vs buffered during snapshotting, remove complicated but incomplete assertion - removed excessive warnigs, misreported device stats updates - fix raid56 page mapping for 32bit arch - fixes reported by static analyzer * tag 'for-5.1-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix assertion failure on fsync with NO_HOLES enabled btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size btrfs: Fix bound checking in qgroup_trace_new_subtree_blocks btrfs: raid56: properly unmap parity page in finish_parity_scrub() btrfs: don't report readahead errors and don't update statistics Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes btrfs: remove WARN_ON in log_dir_items Btrfs: fix incorrect file size after shrinking truncate and fsync
| * | Btrfs: fix assertion failure on fsync with NO_HOLES enabledFilipe Manana2019-03-201-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Back in commit a89ca6f24ffe4 ("Btrfs: fix fsync after truncate when no_holes feature is enabled") I added an assertion that is triggered when an inline extent is found to assert that the length of the (uncompressed) data the extent represents is the same as the i_size of the inode, since that is true most of the time I couldn't find or didn't remembered about any exception at that time. Later on the assertion was expanded twice to deal with a case of a compressed inline extent representing a range that matches the sector size followed by an expanding truncate, and another case where fallocate can update the i_size of the inode without adding or updating existing extents (if the fallocate range falls entirely within the first block of the file). These two expansion/fixes of the assertion were done by commit 7ed586d0a8241 ("Btrfs: fix assertion on fsync of regular file when using no-holes feature") and commit 6399fb5a0b69a ("Btrfs: fix assertion failure during fsync in no-holes mode"). These however missed the case where an falloc expands the i_size of an inode to exactly the sector size and inline extent exists, for example: $ mkfs.btrfs -f -O no-holes /dev/sdc $ mount /dev/sdc /mnt $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar wrote 1096/1096 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec) $ xfs_io -c "falloc 1096 3000" /mnt/foobar $ xfs_io -c "fsync" /mnt/foobar Segmentation fault $ dmesg [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727 [701253.602962] ------------[ cut here ]------------ [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533! [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G W 5.0.0-rc8-btrfs-next-45 #1 [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs] (...) [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286 [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000 [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868 [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000 [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18 [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80 [701253.607769] FS: 00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000 [701253.608163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0 [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [701253.609608] Call Trace: [701253.609994] btrfs_log_inode+0xdfb/0xe40 [btrfs] [701253.610383] btrfs_log_inode_parent+0x2be/0xa60 [btrfs] [701253.610770] ? do_raw_spin_unlock+0x49/0xc0 [701253.611150] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [701253.611537] btrfs_sync_file+0x3b2/0x440 [btrfs] [701253.612010] ? do_sysinfo+0xb0/0xf0 [701253.612552] do_fsync+0x38/0x60 [701253.612988] __x64_sys_fsync+0x10/0x20 [701253.613360] do_syscall_64+0x60/0x1b0 [701253.613733] entry_SYSCALL_64_after_hwframe+0x49/0xbe [701253.614103] RIP: 0033:0x7f14da4e66d0 (...) [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0 [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003 [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010 [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260 [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10 (...) [701253.619941] ---[ end trace e088d74f132b6da5 ]--- Updating the assertion again to allow for this particular case would result in a meaningless assertion, plus there is currently no risk of logging content that would result in any corruption after a log replay if the size of the data encoded in an inline extent is greater than the inode's i_size (which is not currently possibe either with or without compression), therefore just remove the assertion. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: Avoid possible qgroup_rsv_size overflow in ↵Nikolay Borisov2019-03-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_calculate_inode_block_rsv_size qgroup_rsv_size is calculated as the product of outstanding_extent * fs_info->nodesize. The product is calculated with 32 bit precision since both variables are defined as u32. Yet qgroup_rsv_size expects a 64 bit result. Avoid possible multiplication overflow by casting outstanding_extent to u64. Such overflow would in the worst case (64K nodesize) require more than 65536 extents, which is quite large and i'ts not likely that it would happen in practice. Fixes-coverity-id: 1435101 Fixes: ff6bc37eb7f6 ("btrfs: qgroup: Use independent and accurate per inode qgroup rsv") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: Fix bound checking in qgroup_trace_new_subtree_blocksNikolay Borisov2019-03-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If 'cur_level' is 7 then the bound checking at the top of the function will actually pass. Later on, it's possible to dereference ds_path->nodes[cur_level+1] which will be an out of bounds. The correct check will be cur_level >= BTRFS_MAX_LEVEL - 1 . Fixes-coverty-id: 1440918 Fixes-coverty-id: 1440911 Fixes: ea49f3e73c4b ("btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree") CC: stable@vger.kernel.org # 4.20+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: raid56: properly unmap parity page in finish_parity_scrub()Andrea Righi2019-03-181-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Parity page is incorrectly unmapped in finish_parity_scrub(), triggering a reference counter bug on i386, i.e.: [ 157.662401] kernel BUG at mm/highmem.c:349! [ 157.666725] invalid opcode: 0000 [#1] SMP PTI The reason is that kunmap(p_page) was completely left out, so we never did an unmap for the p_page and the loop unmapping the rbio page was iterating over the wrong number of stripes: unmapping should be done with nr_data instead of rbio->real_stripes. Test case to reproduce the bug: - create a raid5 btrfs filesystem: # mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde - mount it: # mount /dev/sdb /mnt - run btrfs scrub in a loop: # while :; do btrfs scrub start -BR /mnt; done BugLink: https://bugs.launchpad.net/bugs/1812845 Fixes: 5a6ac9eacb49 ("Btrfs, raid56: support parity scrub on raid56") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: don't report readahead errors and don't update statisticsDavid Sterba2019-03-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As readahead is an optimization, all errors are usually filtered out, but still properly handled when the real read call is done. The commit 5e9d398240b2 ("btrfs: readpages() should submit IO as read-ahead") added REQ_RAHEAD to readpages() because that's only used for readahead (despite what one would expect from the callback name). This causes a flood of messages and inflated read error stats, so skip reporting in case it's readahead. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403 Reported-by: LimeTech <tomm@lime-technology.com> Fixes: 5e9d398240b2 ("btrfs: readpages() should submit IO as read-ahead") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writesFilipe Manana2019-03-131-6/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we are mixing buffered writes with direct IO writes against the same file and snapshotting is happening concurrently, we can end up with a corrupt file content in the snapshot. Example: 1) Inode/file is empty. 2) Snapshotting starts. 2) Buffered write at offset 0 length 256Kb. This updates the i_size of the inode to 256Kb, disk_i_size remains zero. This happens after the task doing the snapshot flushes all existing delalloc. 3) DIO write at offset 256Kb length 768Kb. Once the ordered extent completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and updates the inode item in the fs tree with a size of 1Mb (which is the value of disk_i_size). 4) The dealloc for the range [0, 256Kb[ did not start yet. 5) The transaction used in the DIO ordered extent completion, which updated the inode item, is committed by the snapshotting task. 6) Snapshot creation completes. 7) Dealloc for the range [0, 256Kb[ is flushed. After that when reading the file from the snapshot we always get zeroes for the range [0, 256Kb[, the file has a size of 1Mb and the data written by the direct IO write is found. From an application's point of view this is a corruption, since in the source subvolume it could never read a version of the file that included the data from the direct IO write without the data from the buffered write included as well. In the snapshot's tree, file extent items are missing for the range [0, 256Kb[. The issue, obviously, does not happen when using the -o flushoncommit mount option. Fix this by flushing delalloc for all the roots that are about to be snapshotted when committing a transaction. This guarantees total ordering when updating the disk_i_size of an inode since the flush for dealloc is done when a transaction is in the TRANS_STATE_COMMIT_START state and wait is done once no more external writers exist. This is similar to what we do when using the flushoncommit mount option, but we do it only if the transaction has snapshots to create and only for the roots of the subvolumes to be snapshotted. The bulk of the dealloc is flushed in the snapshot creation ioctl, so the flush work we do inside the transaction is minimized. This issue, involving buffered and direct IO writes with snapshotting, is often triggered by fstest btrfs/078, and got reported by fsck when not using the NO_HOLES features, for example: $ cat results/btrfs/078.full (...) _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent *** fsck.btrfs output *** [1/7] checking root items [2/7] checking extents [3/7] checking free space cache [4/7] checking fs roots root 258 inode 264 errors 100, file extent discount Found file extent holes: start: 524288, len: 65536 ERROR: errors found in fs roots Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: remove WARN_ON in log_dir_itemsJosef Bacik2019-03-131-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When Filipe added the recursive directory logging stuff in 2f2ff0ee5e430 ("Btrfs: fix metadata inconsistencies after directory fsync") he specifically didn't take the directory i_mutex for the children directories that we need to log because of lockdep. This is generally fine, but can lead to this WARN_ON() tripping if we happen to run delayed deletion's in between our first search and our second search of dir_item/dir_indexes for this directory. We expect this to happen, so the WARN_ON() isn't necessary. Drop the WARN_ON() and add a comment so we know why this case can happen. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fix incorrect file size after shrinking truncate and fsyncFilipe Manana2019-03-131-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we do a shrinking truncate against an inode which is already present in the respective log tree and then rename it, as part of logging the new name we end up logging an inode item that reflects the old size of the file (the one which we previously logged) and not the new smaller size. The decision to preserve the size previously logged was added by commit 1a4bcf470c886b ("Btrfs: fix fsync data loss after adding hard link to inode") in order to avoid data loss after replaying the log. However that decision is only needed for the case the logged inode size is smaller then the current size of the inode, as explained in that commit's change log. If the current size of the inode is smaller then the previously logged size, we know a shrinking truncate happened and therefore need to use that smaller size. Example to trigger the problem: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo $ xfs_io -c "fsync" /mnt/foo $ xfs_io -c "truncate 3000" /mnt/foo $ mv /mnt/foo /mnt/bar $ xfs_io -c "fsync" /mnt/bar <power failure> $ mount /dev/sdb /mnt $ od -t x1 -A d /mnt/bar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 0008000 Once we rename the file, we log its name (and inode item), and because the inode was already logged before in the current transaction, we log it with a size of 8000 bytes because that is the size we previously logged (with the first fsync). As part of the rename, besides logging the inode, we do also sync the log, which is done since commit d4682ba03ef618 ("Btrfs: sync log after logging new name"), so the next fsync against our inode is effectively a no-op, since no new changes happened since the rename operation. Even if did not sync the log during the rename operation, the same problem (fize size of 8000 bytes instead of 3000 bytes) would be visible after replaying the log if the log ended up getting synced to disk through some other means, such as for example by fsyncing some other modified file. In the example above the fsync after the rename operation is there just because not every filesystem may guarantee logging/journalling the inode (and syncing the log/journal) during the rename operation, for example it is needed for f2fs, but not for ext4 and xfs. Fix this scenario by, when logging a new name (which is triggered by rename and link operations), using the current size of the inode instead of the previously logged inode size. A test case for fstests follows soon. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695 CC: stable@vger.kernel.org # 4.4+ Reported-by: Seulbae Kim <seulbae@gatech.edu> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | locks: wake any locks blocked on request before deadlock checkJeff Layton2019-03-251-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andreas reported that he was seeing the tdbtorture test fail in some cases with -EDEADLCK when it wasn't before. Some debugging showed that deadlock detection was sometimes discovering the caller's lock request itself in a dependency chain. While we remove the request from the blocked_lock_hash prior to reattempting to acquire it, any locks that are blocked on that request will still be present in the hash and will still have their fl_blocker pointer set to the current request. This causes posix_locks_deadlock to find a deadlock dependency chain when it shouldn't, as a lock request cannot block itself. We are going to end up waking all of those blocked locks anyway when we go to reinsert the request back into the blocked_lock_hash, so just do it prior to checking for deadlocks. This ensures that any lock blocked on the current request will no longer be part of any blocked request chain. URL: https://bugzilla.kernel.org/show_bug.cgi?id=202975 Fixes: 5946c4319ebb ("fs/locks: allow a lock request to block other requests.") Cc: stable@vger.kernel.org Reported-by: Andreas Schneider <asn@redhat.com> Signed-off-by: Neil Brown <neilb@suse.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
* | | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2019-03-247-59/+58
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous ext4 bug fixes for 5.1" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: prohibit fstrim in norecovery mode ext4: cleanup bh release code in ext4_ind_remove_space() ext4: brelse all indirect buffer in ext4_ind_remove_space() ext4: report real fs size after failed resize ext4: add missing brelse() in add_new_gdb_meta_bg() ext4: remove useless ext4_pin_inode() ext4: avoid panic during forced reboot ext4: fix data corruption caused by unaligned direct AIO ext4: fix NULL pointer dereference while journal is aborted
| * | | ext4: prohibit fstrim in norecovery modeDarrick J. Wong2019-03-231-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ext4 fstrim implementation uses the block bitmaps to find free space that can be discarded. If we haven't replayed the journal, the bitmaps will be stale and we absolutely *cannot* use stale metadata to zap the underlying storage. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: cleanup bh release code in ext4_ind_remove_space()zhangyi (F)2019-03-231-25/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we are releasing the indirect buffer where we are done with it in ext4_ind_remove_space(), so we can see the brelse() and BUFFER_TRACE() everywhere. It seems fragile and hard to read, and we may probably forget to release the buffer some day. This patch cleans up the code by putting of the code which releases the buffers to the end of the function. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
| * | | ext4: brelse all indirect buffer in ext4_ind_remove_space()zhangyi (F)2019-03-231-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All indirect buffers get by ext4_find_shared() should be released no mater the branch should be freed or not. But now, we forget to release the lower depth indirect buffers when removing space from the same higher depth indirect block. It will lead to buffer leak and futher more, it may lead to quota information corruption when using old quota, consider the following case. - Create and mount an empty ext4 filesystem without extent and quota features, - quotacheck and enable the user & group quota, - Create some files and write some data to them, and then punch hole to some files of them, it may trigger the buffer leak problem mentioned above. - Disable quota and run quotacheck again, it will create two new aquota files and write the checked quota information to them, which probably may reuse the freed indirect block(the buffer and page cache was not freed) as data block. - Enable quota again, it will invoke vfs_load_quota_inode()->invalidate_bdev() to try to clean unused buffers and pagecache. Unfortunately, because of the buffer of quota data block is still referenced, quota code cannot read the up to date quota info from the device and lead to quota information corruption. This problem can be reproduced by xfstests generic/231 on ext3 file system or ext4 file system without extent and quota features. This patch fix this problem by releasing the missing indirect buffers, in ext4_ind_remove_space(). Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
| * | | ext4: report real fs size after failed resizeLukas Czerner2019-03-151-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when the file system resize using ext4_resize_fs() fails it will report into log that "resized filesystem to <requested block count>". However this may not be true in the case of failure. Use the current block count as returned by ext4_blocks_count() to report the block count. Additionally, report a warning that "error occurred during file system resize" Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: add missing brelse() in add_new_gdb_meta_bg()Lukas Czerner2019-03-151-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently in add_new_gdb_meta_bg() there is a missing brelse of gdb_bh in case ext4_journal_get_write_access() fails. Additionally kvfree() is missing in the same error path. Fix it by moving the ext4_journal_get_write_access() before the ext4 sb update as Ted suggested and release n_group_desc and gdb_bh in case it fails. Fixes: 61a9c11e5e7a ("ext4: add missing brelse() add_new_gdb_meta_bg()'s error path") Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: remove useless ext4_pin_inode()Jason Yan2019-03-141-30/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function is never used from the beginning (and is commented out); let's remove it. Signed-off-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: avoid panic during forced rebootJan Kara2019-03-141-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When admin calls "reboot -f" - i.e., does a hard system reboot by directly calling reboot(2) - ext4 filesystem mounted with errors=panic can panic the system. This happens because the underlying device gets disabled without unmounting the filesystem and thus some syscall running in parallel to reboot(2) can result in the filesystem getting IO errors. This is somewhat surprising to the users so try improve the behavior by switching to errors=remount-ro behavior when the system is running reboot(2). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: fix data corruption caused by unaligned direct AIOLukas Czerner2019-03-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ext4 needs to serialize unaligned direct AIO because the zeroing of partial blocks of two competing unaligned AIOs can result in data corruption. However it decides not to serialize if the potentially unaligned aio is past i_size with the rationale that no pending writes are possible past i_size. Unfortunately if the i_size is not block aligned and the second unaligned write lands past i_size, but still into the same block, it has the potential of corrupting the previous unaligned write to the same block. This is (very simplified) reproducer from Frank // 41472 = (10 * 4096) + 512 // 37376 = 41472 - 4096 ftruncate(fd, 41472); io_prep_pwrite(iocbs[0], fd, buf[0], 4096, 37376); io_prep_pwrite(iocbs[1], fd, buf[1], 4096, 41472); io_submit(io_ctx, 1, &iocbs[1]); io_submit(io_ctx, 1, &iocbs[2]); io_getevents(io_ctx, 2, 2, events, NULL); Without this patch the 512B range from 40960 up to the start of the second unaligned write (41472) is going to be zeroed overwriting the data written by the first write. This is a data corruption. 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 * 0000a000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 With this patch the data corruption is avoided because we will recognize the unaligned_aio and wait for the unwritten extent conversion. 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 00009200 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 * 0000a200 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 * 0000b200 Reported-by: Frank Sorenson <fsorenso@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Fixes: e9e3bcecf44c ("ext4: serialize unaligned asynchronous DIO") Cc: stable@vger.kernel.org
| * | | ext4: fix NULL pointer dereference while journal is abortedJiufei Xue2019-03-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We see the following NULL pointer dereference while running xfstests generic/475: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 PGD 8000000c84bad067 P4D 8000000c84bad067 PUD c84e62067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 7 PID: 9886 Comm: fsstress Kdump: loaded Not tainted 5.0.0-rc8 #10 RIP: 0010:ext4_do_update_inode+0x4ec/0x760 ... Call Trace: ? jbd2_journal_get_write_access+0x42/0x50 ? __ext4_journal_get_write_access+0x2c/0x70 ? ext4_truncate+0x186/0x3f0 ext4_mark_iloc_dirty+0x61/0x80 ext4_mark_inode_dirty+0x62/0x1b0 ext4_truncate+0x186/0x3f0 ? unmap_mapping_pages+0x56/0x100 ext4_setattr+0x817/0x8b0 notify_change+0x1df/0x430 do_truncate+0x5e/0x90 ? generic_permission+0x12b/0x1a0 This is triggered because the NULL pointer handle->h_transaction was dereferenced in function ext4_update_inode_fsync_trans(). I found that the h_transaction was set to NULL in jbd2__journal_restart but failed to attached to a new transaction while the journal is aborted. Fix this by checking the handle before updating the inode. Fixes: b436b9bef84d ("ext4: Wait for proper transaction commit on fsync") Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: stable@kernel.org
* | | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2019-03-241-0/+27
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of x86 fixes: - Prevent potential NULL pointer dereferences in the HPET and HyperV code - Exclude the GART aperture from /proc/kcore to prevent kernel crashes on access - Use the correct macros for Cyrix I/O on Geode processors - Remove yet another kernel address printk leak - Announce microcode reload completion as requested by quite some people. Microcode loading has become popular recently. - Some 'Make Clang' happy fixlets - A few cleanups for recently added code" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/gart: Exclude GART aperture from kcore x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error x86/mm/pti: Make local symbols static x86/cpu/cyrix: Remove {get,set}Cx86_old macros used for Cyrix processors x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors x86/microcode: Announce reload operation's completion x86/hyperv: Prevent potential NULL pointer dereference x86/hpet: Prevent potential NULL pointer dereference x86/lib: Fix indentation issue, remove extra tab x86/boot: Restrict header scope to make Clang happy x86/mm: Don't leak kernel addresses x86/cpufeature: Fix various quality problems in the <asm/cpu_device_hd.h> header
| * | | | x86/gart: Exclude GART aperture from kcoreKairui Song2019-03-231-0/+27
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On machines where the GART aperture is mapped over physical RAM, /proc/kcore contains the GART aperture range. Accessing the GART range via /proc/kcore results in a kernel crash. vmcore used to have the same issue, until it was fixed with commit 2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore")', leveraging existing hook infrastructure in vmcore to let /proc/vmcore return zeroes when attempting to read the aperture region, and so it won't read from the actual memory. Apply the same workaround for kcore. First implement the same hook infrastructure for kcore, then reuse the hook functions introduced in the previous vmcore fix. Just with some minor adjustment, rename some functions for more general usage, and simplify the hook infrastructure a bit as there is no module usage yet. Suggested-by: Baoquan He <bhe@redhat.com> Signed-off-by: Kairui Song <kasong@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jiri Bohac <jbohac@suse.cz> Acked-by: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Omar Sandoval <osandov@fb.com> Cc: Dave Young <dyoung@redhat.com> Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
* | | | Merge tag '5.1-rc1-cifs-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2019-03-246-70/+102
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull smb3 fixes from Steve French: - two fixes for stable for guest mount problems with smb3.1.1 - two fixes for crediting (SMB3 flow control) on resent requests - a byte range lock leak fix - two fixes for incorrect rc mappings * tag '5.1-rc1-cifs-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module version number SMB3: Fix SMB3.1.1 guest mounts to Samba cifs: Fix slab-out-of-bounds when tracing SMB tcon cifs: allow guest mounts to work for smb3.11 fix incorrect error code mapping for OBJECTID_NOT_FOUND cifs: fix that return -EINVAL when do dedupe operation CIFS: Fix an issue with re-sending rdata when transport returning -EAGAIN CIFS: Fix an issue with re-sending wdata when transport returning -EAGAIN
| * | | | cifs: update internal module version numberSteve French2019-03-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To 2.19 Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | SMB3: Fix SMB3.1.1 guest mounts to SambaSteve French2019-03-221-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Workaround problem with Samba responses to SMB3.1.1 null user (guest) mounts. The server doesn't set the expected flag in the session setup response so we have to do a similar check to what is done in smb3_validate_negotiate where we also check if the user is a null user (but not sec=krb5 since username might not be passed in on mount for Kerberos case). Note that the commit below tightened the conditions and forced signing for the SMB2-TreeConnect commands as per MS-SMB2. However, this should only apply to normal user sessions and not for cases where there is no user (even if server forgets to set the flag in the response) since we don't have anything useful to sign with. This is especially important now that the more secure SMB3.1.1 protocol is in the default dialect list. An earlier patch ("cifs: allow guest mounts to work for smb3.11") fixed the guest mounts to Windows. Fixes: 6188f28bf608 ("Tree connect for SMB3.1.1 must be signed for non-encrypted shares") Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara <palcantara@suse.de> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | cifs: Fix slab-out-of-bounds when tracing SMB tconPaulo Alcantara (SUSE)2019-03-221-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the following KASAN report: [ 779.044746] BUG: KASAN: slab-out-of-bounds in string+0xab/0x180 [ 779.044750] Read of size 1 at addr ffff88814f327968 by task trace-cmd/2812 [ 779.044756] CPU: 1 PID: 2812 Comm: trace-cmd Not tainted 5.1.0-rc1+ #62 [ 779.044760] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014 [ 779.044761] Call Trace: [ 779.044769] dump_stack+0x5b/0x90 [ 779.044775] ? string+0xab/0x180 [ 779.044781] print_address_description+0x6c/0x23c [ 779.044787] ? string+0xab/0x180 [ 779.044792] ? string+0xab/0x180 [ 779.044797] kasan_report.cold.3+0x1a/0x32 [ 779.044803] ? string+0xab/0x180 [ 779.044809] string+0xab/0x180 [ 779.044816] ? widen_string+0x160/0x160 [ 779.044822] ? vsnprintf+0x5bf/0x7f0 [ 779.044829] vsnprintf+0x4e7/0x7f0 [ 779.044836] ? pointer+0x4a0/0x4a0 [ 779.044841] ? seq_buf_vprintf+0x79/0xc0 [ 779.044848] seq_buf_vprintf+0x62/0xc0 [ 779.044855] trace_seq_printf+0x113/0x210 [ 779.044861] ? trace_seq_puts+0x110/0x110 [ 779.044867] ? trace_raw_output_prep+0xd8/0x110 [ 779.044876] trace_raw_output_smb3_tcon_class+0x9f/0xc0 [ 779.044882] print_trace_line+0x377/0x890 [ 779.044888] ? tracing_buffers_read+0x300/0x300 [ 779.044893] ? ring_buffer_read+0x58/0x70 [ 779.044899] s_show+0x6e/0x140 [ 779.044906] seq_read+0x505/0x6a0 [ 779.044913] vfs_read+0xaf/0x1b0 [ 779.044919] ksys_read+0xa1/0x130 [ 779.044925] ? kernel_write+0xa0/0xa0 [ 779.044931] ? __do_page_fault+0x3d5/0x620 [ 779.044938] do_syscall_64+0x63/0x150 [ 779.044944] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 779.044949] RIP: 0033:0x7f62c2c2db31 [ 779.044955] Code: fe ff ff 48 8d 3d 17 9e 09 00 48 83 ec 08 e8 96 02 02 00 66 0f 1f 44 00 00 8b 05 fa fc 2c 00 48 63 ff 85 c0 75 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 55 53 48 89 d5 48 89 [ 779.044958] RSP: 002b:00007ffd6e116678 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 779.044964] RAX: ffffffffffffffda RBX: 0000560a38be9260 RCX: 00007f62c2c2db31 [ 779.044966] RDX: 0000000000002000 RSI: 00007ffd6e116710 RDI: 0000000000000003 [ 779.044966] RDX: 0000000000002000 RSI: 00007ffd6e116710 RDI: 0000000000000003 [ 779.044969] RBP: 00007f62c2ef5420 R08: 0000000000000000 R09: 0000000000000003 [ 779.044972] R10: ffffffffffffffa8 R11: 0000000000000246 R12: 00007ffd6e116710 [ 779.044975] R13: 0000000000002000 R14: 0000000000000d68 R15: 0000000000002000 [ 779.044981] Allocated by task 1257: [ 779.044987] __kasan_kmalloc.constprop.5+0xc1/0xd0 [ 779.044992] kmem_cache_alloc+0xad/0x1a0 [ 779.044997] getname_flags+0x6c/0x2a0 [ 779.045003] user_path_at_empty+0x1d/0x40 [ 779.045008] do_faccessat+0x12a/0x330 [ 779.045012] do_syscall_64+0x63/0x150 [ 779.045017] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 779.045019] Freed by task 1257: [ 779.045023] __kasan_slab_free+0x12e/0x180 [ 779.045029] kmem_cache_free+0x85/0x1b0 [ 779.045034] filename_lookup.part.70+0x176/0x250 [ 779.045039] do_faccessat+0x12a/0x330 [ 779.045043] do_syscall_64+0x63/0x150 [ 779.045048] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 779.045052] The buggy address belongs to the object at ffff88814f326600 which belongs to the cache names_cache of size 4096 [ 779.045057] The buggy address is located 872 bytes to the right of 4096-byte region [ffff88814f326600, ffff88814f327600) [ 779.045058] The buggy address belongs to the page: [ 779.045062] page:ffffea00053cc800 count:1 mapcount:0 mapping:ffff88815b191b40 index:0x0 compound_mapcount: 0 [ 779.045067] flags: 0x200000000010200(slab|head) [ 779.045075] raw: 0200000000010200 dead000000000100 dead000000000200 ffff88815b191b40 [ 779.045081] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000 [ 779.045083] page dumped because: kasan: bad access detected [ 779.045085] Memory state around the buggy address: [ 779.045089] ffff88814f327800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 779.045093] ffff88814f327880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 779.045097] >ffff88814f327900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 779.045099] ^ [ 779.045103] ffff88814f327980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 779.045107] ffff88814f327a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 779.045109] ================================================================== [ 779.045110] Disabling lock debugging due to kernel taint Correctly assign tree name str for smb3_tcon event. Signed-off-by: Paulo Alcantara (SUSE) <paulo@paulo.ac> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | cifs: allow guest mounts to work for smb3.11Ronnie Sahlberg2019-03-221-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix Guest/Anonymous sessions so that they work with SMB 3.11. The commit noted below tightened the conditions and forced signing for the SMB2-TreeConnect commands as per MS-SMB2. However, this should only apply to normal user sessions and not for Guest/Anonumous sessions. Fixes: 6188f28bf608 ("Tree connect for SMB3.1.1 must be signed for non-encrypted shares") Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | fix incorrect error code mapping for OBJECTID_NOT_FOUNDSteve French2019-03-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It was mapped to EIO which can be confusing when user space queries for an object GUID for an object for which the server file system doesn't support (or hasn't saved one). As Amir Goldstein suggested this is similar to ENOATTR (equivalently ENODATA in Linux errno definitions) so changing NT STATUS code mapping for OBJECTID_NOT_FOUND to ENODATA. Signed-off-by: Steve French <stfrench@microsoft.com> CC: Amir Goldstein <amir73il@gmail.com>
| * | | | cifs: fix that return -EINVAL when do dedupe operationXiaoli Feng2019-03-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dedupe_file_range operations is combiled into remap_file_range. But it's always skipped for dedupe operations in function cifs_remap_file_range. Example to test: Before this patch: # dd if=/dev/zero of=cifs/file bs=1M count=1 # xfs_io -c "dedupe cifs/file 4k 64k 4k" cifs/file XFS_IOC_FILE_EXTENT_SAME: Invalid argument After this patch: # dd if=/dev/zero of=cifs/file bs=1M count=1 # xfs_io -c "dedupe cifs/file 4k 64k 4k" cifs/file XFS_IOC_FILE_EXTENT_SAME: Operation not supported Influence for xfstests: generic/091 generic/112 generic/127 generic/263 These tests report this error "do_copy_range:: Invalid argument" instead of "FIDEDUPERANGE: Invalid argument". Because there are still two bugs cause these test failed. https://bugzilla.kernel.org/show_bug.cgi?id=202935 https://bugzilla.kernel.org/show_bug.cgi?id=202785 Signed-off-by: Xiaoli Feng <fengxiaoli0714@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | CIFS: Fix an issue with re-sending rdata when transport returning -EAGAINLong Li2019-03-221-30/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When sending a rdata, transport may return -EAGAIN. In this case we should re-obtain credits because the session may have been reconnected. Change in v2: adjust_credits before re-sending Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
| * | | | CIFS: Fix an issue with re-sending wdata when transport returning -EAGAINLong Li2019-03-221-32/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When sending a wdata, transport may return -EAGAIN. In this case we should re-obtain credits because the session may have been reconnected. Change in v2: adjust_credits before re-sending Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
* | | | | Merge tag 'io_uring-20190323' of git://git.kernel.dk/linux-blockLinus Torvalds2019-03-233-233/+230
|\ \ \ \ \ | |/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fixes and improvements from Jens Axboe: "The first five in this series are heavily inspired by the work Al did on the aio side to fix the races there. The last two re-introduce a feature that was in io_uring before it got merged, but which I pulled since we didn't have a good way to have BVEC iters that already have a stable reference. These aren't necessarily related to block, it's just how io_uring pins fixed buffers" * tag 'io_uring-20190323' of git://git.kernel.dk/linux-block: block: add BIO_NO_PAGE_REF flag iov_iter: add ITER_BVEC_FLAG_NO_REF flag io_uring: mark me as the maintainer io_uring: retry bulk slab allocs as single allocs io_uring: fix poll races io_uring: fix fget/fput handling io_uring: add prepped flag io_uring: make io_read/write return an integer io_uring: use regular request ref counts
| * | | | block: add BIO_NO_PAGE_REF flagJens Axboe2019-03-182-10/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If bio_iov_iter_get_pages() is called on an iov_iter that is flagged with NO_REF, then we don't need to add a page reference for the pages that we add. Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows not to drop a reference to these pages. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | iov_iter: add ITER_BVEC_FLAG_NO_REF flagJens Axboe2019-03-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For ITER_BVEC, if we're holding on to kernel pages, the caller doesn't need to grab a reference to the bvec pages, and drop that same reference on IO completion. This is essentially safe for any ITER_BVEC, but some use cases end up reusing pages and uncondtionally dropping a page reference on completion. And example of that is sendfile(2), that ends up being a splice_in + splice_out on the pipe pages. Add a flag that tells us it's fine to not grab a page reference to the bvec pages, since that caller knows not to drop a reference when it's done with the pages. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: retry bulk slab allocs as single allocsJens Axboe2019-03-181-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I've seen cases where bulk alloc fails, since the bulk alloc API is all-or-nothing - either we get the number we ask for, or it returns 0 as number of entries. If we fail a batch bulk alloc, retry a "normal" kmem_cache_alloc() and just use that instead of failing with -EAGAIN. While in there, ensure we use GFP_KERNEL. That was an oversight in the original code, when we switched away from GFP_ATOMIC. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: fix poll racesJens Axboe2019-03-151-55/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a straight port of Al's fix for the aio poll implementation, since the io_uring version is heavily based on that. The below description is almost straight from that patch, just modified to fit the io_uring situation. io_poll() has to cope with several unpleasant problems: * requests that might stay around indefinitely need to be made visible for io_cancel(2); that must not be done to a request already completed, though. * in cases when ->poll() has placed us on a waitqueue, wakeup might have happened (and request completed) before ->poll() returns. * worse, in some early wakeup cases request might end up re-added into the queue later - we can't treat "woken up and currently not in the queue" as "it's not going to stick around indefinitely" * ... moreover, ->poll() might have decided not to put it on any queues to start with, and that needs to be distinguished from the previous case * ->poll() might have tried to put us on more than one queue. Only the first will succeed for io poll, so we might end up missing wakeups. OTOH, we might very well notice that only after the wakeup hits and request gets completed (all before ->poll() gets around to the second poll_wait()). In that case it's too late to decide that we have an error. req->woken was an attempt to deal with that. Unfortunately, it was broken. What we need to keep track of is not that wakeup has happened - the thing might come back after that. It's that async reference is already gone and won't come back, so we can't (and needn't) put the request on the list of cancellables. The easiest case is "request hadn't been put on any waitqueues"; we can tell by seeing NULL apt.head, and in that case there won't be anything async. We should either complete the request ourselves (if vfs_poll() reports anything of interest) or return an error. In all other cases we get exclusion with wakeups by grabbing the queue lock. If request is currently on queue and we have something interesting from vfs_poll(), we can steal it and complete the request ourselves. If it's on queue and vfs_poll() has not reported anything interesting, we either put it on the cancellable list, or, if we know that it hadn't been put on all queues ->poll() wanted it on, we steal it and return an error. If it's _not_ on queue, it's either been already dealt with (in which case we do nothing), or there's io_poll_complete_work() about to be executed. In that case we either put it on the cancellable list, or, if we know it hadn't been put on all queues ->poll() wanted it on, simulate what cancel would've done. Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: fix fget/fput handlingJens Axboe2019-03-151-133/+97
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This isn't a straight port of commit 84c4e1f89fef for aio.c, since io_uring doesn't use files in exactly the same way. But it's pretty close. See the commit message for that commit. This essentially fixes a use-after-free with the poll command handling, but it takes cue from Linus's approach to just simplifying the file handling. We move the setup of the file into a higher level location, so the individual commands don't have to deal with it. And then we release the reference when we free the associated io_kiocb. Fixes: 221c5eb23382 ("io_uring: add support for IORING_OP_POLL") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: add prepped flagJens Axboe2019-03-141-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently use the fact that if ->ki_filp is already set, then we've done the prep. In preparation for moving the file assignment earlier, use a separate flag to tell whether the request has been prepped for IO or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: make io_read/write return an integerJens Axboe2019-03-141-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The callers all convert to an integer, and we only return 0/-ERROR anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | io_uring: use regular request ref countsJens Axboe2019-03-141-19/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Get rid of the special casing of "normal" requests not having any references to the io_kiocb. We initialize the ref count to 2, one for the submission side, and one or the completion side. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | | Merge tag 'fixes_for_v5.1-rc2' of ↵Linus Torvalds2019-03-213-4/+10
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull udf fixes from Jan Kara: "Two udf error handling fixes" * tag 'fixes_for_v5.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Propagate errors from udf_truncate_extents() udf: Fix crash on IO error during truncate
| * | | | | udf: Propagate errors from udf_truncate_extents()Jan Kara2019-03-183-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make udf_truncate_extents() properly propagate errors to its callers and let udf_setsize() handle the error properly as well. This lets userspace know in case there's some error when truncating blocks. Signed-off-by: Jan Kara <jack@suse.cz>
| * | | | | udf: Fix crash on IO error during truncateJan Kara2019-03-181-0/+3
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When truncate(2) hits IO error when reading indirect extent block the code just bugs with: kernel BUG at linux-4.15.0/fs/udf/truncate.c:249! ... Fix the problem by bailing out cleanly in case of IO error. CC: stable@vger.kernel.org Reported-by: jean-luc malet <jeanluc.malet@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
* | | | | Merge tag 'fsnotify_for_v5.1-rc2' of ↵Linus Torvalds2019-03-212-3/+16
|\ \ \ \ \ | |/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fixes from Jan Kara: "One inotify and one fanotify fix" * tag 'fsnotify_for_v5.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: Allow copying of file handle to userspace inotify: Fix fsnotify_mark refcount leak in inotify_update_existing_watch()