summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'timers-urgent-2020-04-19' of ↵Linus Torvalds2020-04-191-1/+13
|\ | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull time namespace fix from Thomas Gleixner: "An update for the proc interface of time namespaces: Use symbolic names instead of clockid numbers. The usability nuisance of numbers was noticed by Michael when polishing the man page" * tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets
| * proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsetsAndrei Vagin2020-04-161-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Michael Kerrisk suggested to replace numeric clock IDs with symbolic names. Now the content of these files looks like this: $ cat /proc/774/timens_offsets monotonic 864000 0 boottime 1728000 0 For setting offsets, both representations of clocks (numeric and symbolic) can be used. As for compatibility, it is acceptable to change things as long as userspace doesn't care. The format of timens_offsets files is very new and there are no userspace tools yet which rely on this format. But three projects crun, util-linux and criu rely on the interface of setting time offsets and this is why it's required to continue supporting the numeric clock IDs on write. Fixes: 04a8682a71be ("fs/proc: Introduce /proc/pid/timens_offsets") Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com
* | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2020-04-198-18/+26
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous bug fixes and cleanups for ext4, including a fix for generic/388 in data=journal mode, removing some BUG_ON's, and cleaning up some compiler warnings" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: convert BUG_ON's to WARN_ON's in mballoc.c ext4: increase wait time needed before reuse of deleted inode numbers ext4: remove set but not used variable 'es' in ext4_jbd2.c ext4: remove set but not used variable 'es' ext4: do not zeroout extents beyond i_disksize ext4: fix return-value types in several function comments ext4: use non-movable memory for superblock readahead ext4: use matching invalidatepage in ext4_writepage
| * | ext4: convert BUG_ON's to WARN_ON's in mballoc.cTheodore Ts'o2020-04-151-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the in-core buddy bitmap gets corrupted (or out of sync with the block bitmap), issue a WARN_ON and try to recover. In most cases this involves skipping trying to allocate out of a particular block group. We can end up declaring the file system corrupted, which is fair, since the file system probably should be checked before we proceed any further. Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu Google-Bug-Id: 34811296 Google-Bug-Id: 34639169 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: increase wait time needed before reuse of deleted inode numbersTheodore Ts'o2020-04-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current wait times have proven to be too short to protect against inode reuses that lead to metadata inconsistencies. Now that we will retry the inode allocation if we can't find any recently deleted inodes, it's a lot safer to increase the recently deleted time from 5 seconds to a minute. Link: https://lore.kernel.org/r/20200414023925.273867-1-tytso@mit.edu Google-Bug-Id: 36602237 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: remove set but not used variable 'es' in ext4_jbd2.cJason Yan2020-04-151-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following gcc warning: fs/ext4/ext4_jbd2.c:341:30: warning: variable 'es' set but not used [-Wunused-but-set-variable] struct ext4_super_block *es; ^~ Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20200402034759.29957-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: remove set but not used variable 'es'Jason Yan2020-04-151-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following gcc warning: fs/ext4/super.c:599:27: warning: variable 'es' set but not used [-Wunused-but-set-variable] struct ext4_super_block *es; ^~ Fixes: 2ea2fc775321 ("ext4: save all error info in save_error_info() and drop ext4_set_errno()") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20200402033939.25303-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: do not zeroout extents beyond i_disksizeJan Kara2020-04-151-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We do not want to create initialized extents beyond end of file because for e2fsck it is impossible to distinguish them from a case of corrupted file size / extent tree and so it complains like: Inode 12, i_size is 147456, should be 163840. Fix? no Code in ext4_ext_convert_to_initialized() and ext4_split_convert_extents() try to make sure it does not create initialized extents beyond inode size however they check against inode->i_size which is wrong. They should instead check against EXT4_I(inode)->i_disksize which is the current inode size on disk. That's what e2fsck is going to see in case of crash before all dirty data is written. This bug manifests as generic/456 test failure (with recent enough fstests where fsx got fixed to properly pass FALLOC_KEEP_SIZE_FL flags to the kernel) when run with dioread_lock mount option. CC: stable@vger.kernel.org Fixes: 21ca087a3891 ("ext4: Do not zero out uninitialized extents beyond i_size") Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200331105016.8674-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: fix return-value types in several function commentsJosh Triplett2020-04-152-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The documentation comments for ext4_read_block_bitmap_nowait and ext4_read_inode_bitmap describe them as returning NULL on error, but they return an ERR_PTR on error; update the documentation to match. The documentation comment for ext4_wait_block_bitmap describes it as returning 1 on error, but it returns -errno on error; update the documentation to match. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Ritesh Harani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/60a3f4996f4932c45515aaa6b75ca42f2a78ec9b.1585512514.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: use non-movable memory for superblock readaheadRoman Gushchin2020-04-153-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit a8ac900b8163 ("ext4: use non-movable memory for the superblock") buffers for ext4 superblock were allocated using the sb_bread_unmovable() helper which allocated buffer heads out of non-movable memory blocks. It was necessarily to not block page migrations and do not cause cma allocation failures. However commit 85c8f176a611 ("ext4: preload block group descriptors") broke this by introducing pre-reading of the ext4 superblock. The problem is that __breadahead() is using __getblk() underneath, which allocates buffer heads out of movable memory. It resulted in page migration failures I've seen on a machine with an ext4 partition and a preallocated cma area. Fix this by introducing sb_breadahead_unmovable() and __breadahead_gfp() helpers which use non-movable memory for buffer head allocations and use them for the ext4 superblock readahead. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Fixes: 85c8f176a611 ("ext4: preload block group descriptors") Signed-off-by: Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: use matching invalidatepage in ext4_writepageyangerkun2020-04-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Run generic/388 with journal data mode sometimes may trigger the warning in ext4_invalidatepage. Actually, we should use the matching invalidatepage in ext4_writepage. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200226041002.13914-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | Merge tag '5.7-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2020-04-194-3/+22
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull cifs fixes from Steve French: "Three small smb3 fixes: two debug related (helping network tracing for SMB2 mounts, and the other removing an unintended debug line on signing failures), and one fixing a performance problem with 64K pages" * tag '5.7-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: remove overly noisy debug line in signing errors cifs: improve read performance for page size 64KB & cache=strict & vers=2.1+ cifs: dump the session id and keys also for SMB2 sessions
| * | | smb3: remove overly noisy debug line in signing errorsSteve French2020-04-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A dump_stack call for signature related errors can be too noisy and not of much value in debugging such problems. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
| * | | cifs: improve read performance for page size 64KB & cache=strict & vers=2.1+Jones Syue2020-04-152-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Found a read performance issue when linux kernel page size is 64KB. If linux kernel page size is 64KB and mount options cache=strict & vers=2.1+, it does not support cifs_readpages(). Instead, it is using cifs_readpage() and cifs_read() with maximum read IO size 16KB, which is much slower than read IO size 1MB when negotiated SMB 2.1+. Since modern SMB server supported SMB 2.1+ and Max Read Size can reach more than 64KB (for example 1MB ~ 8MB), this patch check max_read instead of maxBuf to determine whether server support readpages() and improve read performance for page size 64KB & cache=strict & vers=2.1+, and for SMB1 it is more cleaner to initialize server->max_read to server->maxBuf. The client is a linux box with linux kernel 4.2.8, page size 64KB (CONFIG_ARM64_64K_PAGES=y), cpu arm 1.7GHz, and use mount.cifs as smb client. The server is another linux box with linux kernel 4.2.8, share a file '10G.img' with size 10GB, and use samba-4.7.12 as smb server. The client mount a share from the server with different cache options: cache=strict and cache=none, mount -tcifs //<server_ip>/Public /cache_strict -overs=3.0,cache=strict,username=<xxx>,password=<yyy> mount -tcifs //<server_ip>/Public /cache_none -overs=3.0,cache=none,username=<xxx>,password=<yyy> The client download a 10GbE file from the server across 1GbE network, dd if=/cache_strict/10G.img of=/dev/null bs=1M count=10240 dd if=/cache_none/10G.img of=/dev/null bs=1M count=10240 Found that cache=strict (without patch) is slower read throughput and smaller read IO size than cache=none. cache=strict (without patch): read throughput 40MB/s, read IO size is 16KB cache=strict (with patch): read throughput 113MB/s, read IO size is 1MB cache=none: read throughput 109MB/s, read IO size is 1MB Looks like if page size is 64KB, cifs_set_ops() would use cifs_addr_ops_smallbuf instead of cifs_addr_ops, /* check if server can support readpages */ if (cifs_sb_master_tcon(cifs_sb)->ses->server->maxBuf < PAGE_SIZE + MAX_CIFS_HDR_SIZE) inode->i_data.a_ops = &cifs_addr_ops_smallbuf; else inode->i_data.a_ops = &cifs_addr_ops; maxBuf is came from 2 places, SMB2_negotiate() and CIFSSMBNegotiate(), (SMB2_MAX_BUFFER_SIZE is 64KB) SMB2_negotiate(): /* set it to the maximum buffer size value we can send with 1 credit */ server->maxBuf = min_t(unsigned int, le32_to_cpu(rsp->MaxTransactSize),       SMB2_MAX_BUFFER_SIZE); CIFSSMBNegotiate(): server->maxBuf = le32_to_cpu(pSMBr->MaxBufferSize); Page size 64KB and cache=strict lead to read_pages() use cifs_readpage() instead of cifs_readpages(), and then cifs_read() using maximum read IO size 16KB, which is much slower than maximum read IO size 1MB. (CIFSMaxBufSize is 16KB by default) /* FIXME: set up handlers for larger reads and/or convert to async */ rsize = min_t(unsigned int, cifs_sb->rsize, CIFSMaxBufSize); Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Jones Syue <jonessyue@qnap.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | cifs: dump the session id and keys also for SMB2 sessionsRonnie Sahlberg2020-04-151-0/+15
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | We already dump these keys for SMB3, lets also dump it for SMB2 sessions so that we can use the session key in wireshark to check and validate that the signatures are correct. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
* | | Merge tag 'xfs-5.7-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2020-04-185-20/+42
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Darrick Wong: "The three commits here fix some livelocks and other clashes with fsfreeze, a potential corruption problem, and a minor race between processes freeing and allocating space when the filesystem is near ENOSPC. Summary: - Fix a partially uninitialized variable. - Teach the background gc threads to apply for fsfreeze protection. - Fix some scaling problems when multiple threads try to flush the filesystem when we're about to hit ENOSPC" * tag 'xfs-5.7-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: move inode flush to the sync workqueue xfs: fix partially uninitialized structure in xfs_reflink_remap_extent xfs: acquire superblock freeze protection on eofblocks scans
| * | | xfs: move inode flush to the sync workqueueDarrick J. Wong2020-04-162-19/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the inode dirty data flushing to a workqueue so that multiple threads can take advantage of a single thread's flushing work. The ratelimiting technique used in bdd4ee4 was not successful, because threads that skipped the inode flush scan due to ratelimiting would ENOSPC early, which caused occasional (but noticeable) changes in behavior and sporadic fstest regressions. Therefore, make all the writer threads wait on a single inode flush, which eliminates both the stampeding hordes of flushers and the small window in which a write could fail with ENOSPC because it lost the ratelimit race after even another thread freed space. Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | xfs: fix partially uninitialized structure in xfs_reflink_remap_extentDarrick J. Wong2020-04-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the reflink extent remap function, it turns out that uirec (the block mapping corresponding only to the part of the passed-in mapping that got unmapped) was not fully initialized. Specifically, br_state was not being copied from the passed-in struct to the uirec. This could lead to unpredictable results such as the reflinked mapping being marked unwritten in the destination file. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
| * | | xfs: acquire superblock freeze protection on eofblocks scansBrian Foster2020-04-132-1/+14
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The filesystem freeze sequence in XFS waits on any background eofblocks or cowblocks scans to complete before the filesystem is quiesced. At this point, the freezer has already stopped the transaction subsystem, however, which means a truncate or cowblock cancellation in progress is likely blocked in transaction allocation. This results in a deadlock between freeze and the associated scanner. Fix this problem by holding superblock write protection across calls into the block reapers. Since protection for background scans is acquired from the workqueue task context, trylock to avoid a similar deadlock between freeze and blocking on the write lock. Fixes: d6b636ebb1c9f ("xfs: halt auto-reclamation activities while rebuilding rmap") Reported-by: Paul Furtado <paulfurtado91@gmail.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Rajendra <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2020-04-171-0/+7
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull proc fix from Eric Biederman: "While running syzbot happened to spot one more oversight in my rework of proc_flush_task. The fields proc_self and proc_thread_self were not being reinitialized when proc was unmounted, which could cause problems if the mount of proc fails" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Handle umounts cleanly
| * | | proc: Handle umounts cleanlyEric W. Biederman2020-04-151-0/+7
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot writes: > KASAN: use-after-free Read in dput (2) > > proc_fill_super: allocate dentry failed > ================================================================== > BUG: KASAN: use-after-free in fast_dput fs/dcache.c:727 [inline] > BUG: KASAN: use-after-free in dput+0x53e/0xdf0 fs/dcache.c:846 > Read of size 4 at addr ffff88808a618cf0 by task syz-executor.0/8426 > > CPU: 0 PID: 8426 Comm: syz-executor.0 Not tainted 5.6.0-next-20200412-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x188/0x20d lib/dump_stack.c:118 > print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382 > __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511 > kasan_report+0x33/0x50 mm/kasan/common.c:625 > fast_dput fs/dcache.c:727 [inline] > dput+0x53e/0xdf0 fs/dcache.c:846 > proc_kill_sb+0x73/0xf0 fs/proc/root.c:195 > deactivate_locked_super+0x8c/0xf0 fs/super.c:335 > vfs_get_super+0x258/0x2d0 fs/super.c:1212 > vfs_get_tree+0x89/0x2f0 fs/super.c:1547 > do_new_mount fs/namespace.c:2813 [inline] > do_mount+0x1306/0x1b30 fs/namespace.c:3138 > __do_sys_mount fs/namespace.c:3347 [inline] > __se_sys_mount fs/namespace.c:3324 [inline] > __x64_sys_mount+0x18f/0x230 fs/namespace.c:3324 > do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 > entry_SYSCALL_64_after_hwframe+0x49/0xb3 > RIP: 0033:0x45c889 > Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 > RSP: 002b:00007ffc1930ec48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 > RAX: ffffffffffffffda RBX: 0000000001324914 RCX: 000000000045c889 > RDX: 0000000020000140 RSI: 0000000020000040 RDI: 0000000000000000 > RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000 > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 > R13: 0000000000000749 R14: 00000000004ca15a R15: 0000000000000013 Looking at the code now that it the internal mount of proc is no longer used it is possible to unmount proc. If proc is unmounted the fields of the pid namespace that were used for filesystem specific state are not reinitialized. Which means that proc_self and proc_thread_self can be pointers to already freed dentries. The reported user after free appears to be from mounting and unmounting proc followed by mounting proc again and using error injection to cause the new root dentry allocation to fail. This in turn results in proc_kill_sb running with proc_self and proc_thread_self still retaining their values from the previous mount of proc. Then calling dput on either proc_self of proc_thread_self will result in double put. Which KASAN sees as a use after free. Solve this by always reinitializing the filesystem state stored in the struct pid_namespace, when proc is unmounted. Reported-by: syzbot+72868dd424eb66c6b95f@syzkaller.appspotmail.com Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Fixes: 69879c01a0c3 ("proc: Remove the now unnecessary internal mount of proc") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | | Merge tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-blockLinus Torvalds2020-04-171-139/+162
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fixes from Jens Axboe: - wrap up the init/setup cleanup (Pavel) - fix some issues around deferral sequences (Pavel) - fix splice punt check using the wrong struct file member - apply poll re-arm logic for pollable retry too - pollable retry should honor cancelation - fix setup time error handling syzbot reported crash - restore work state when poll is canceled * tag 'io_uring-5.7-2020-04-17' of git://git.kernel.dk/linux-block: io_uring: don't count rqs failed after current one io_uring: kill already cached timeout.seq_offset io_uring: fix cached_sq_head in io_timeout() io_uring: only post events in io_poll_remove_all() if we completed some io_uring: io_async_task_func() should check and honor cancelation io_uring: check for need to re-wait in polled async handling io_uring: correct O_NONBLOCK check for splice punt io_uring: restore req->work when canceling poll request io_uring: move all request init code in one place io_uring: keep all sqe->flags in req->flags io_uring: early submission req fail code io_uring: track mm through current->mm io_uring: remove obsolete @mm_fault
| * | | io_uring: don't count rqs failed after current onePavel Begunkov2020-04-141-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When checking for draining with __req_need_defer(), it tries to match how many requests were sent before a current one with number of already completed. Dropped SQEs are included in req->sequence, and they won't ever appear in CQ. To compensate for that, __req_need_defer() substracts ctx->cached_sq_dropped. However, what it should really use is number of SQEs dropped __before__ the current one. In other words, any submitted request shouldn't shouldn't affect dequeueing from the drain queue of previously submitted ones. Instead of saving proper ctx->cached_sq_dropped in each request, substract from req->sequence it at initialisation, so it includes number of properly submitted requests. note: it also changes behaviour of timeouts, but 1. it's already diverge from the description because of using SQ 2. the description is ambiguous regarding dropped SQEs Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: kill already cached timeout.seq_offsetPavel Begunkov2020-04-141-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | req->timeout.count and req->io->timeout.seq_offset store the same value, which is sqe->off. Kill the second one Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: fix cached_sq_head in io_timeout()Pavel Begunkov2020-04-141-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_timeout() can be executed asynchronously by a worker and without holding ctx->uring_lock 1. using ctx->cached_sq_head there is racy there 2. it should count events from a moment of timeout's submission, but not execution Use req->sequence. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: only post events in io_poll_remove_all() if we completed someJens Axboe2020-04-131-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot reports this crash: BUG: unable to handle page fault for address: ffffffffffffffe8 PGD f96e17067 P4D f96e17067 PUD f96e19067 PMD 0 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI CPU: 55 PID: 211750 Comm: trinity-c127 Tainted: G B L 5.7.0-rc1-next-20200413 #4 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 04/12/2017 RIP: 0010:__wake_up_common+0x98/0x290 el/sched/wait.c:87 Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10 RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8 RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8 RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000 R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8 FS: 00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0 Call Trace: __wake_up_common_lock+0xea/0x150 ommon_lock at kernel/sched/wait.c:124 ? __wake_up_common+0x290/0x290 ? lockdep_hardirqs_on+0x16/0x2c0 __wake_up+0x13/0x20 io_cqring_ev_posted+0x75/0xe0 v_posted at fs/io_uring.c:1160 io_ring_ctx_wait_and_kill+0x1c0/0x2f0 l at fs/io_uring.c:7305 io_uring_create+0xa8d/0x13b0 ? io_req_defer_prep+0x990/0x990 ? __kasan_check_write+0x14/0x20 io_uring_setup+0xb8/0x130 ? io_uring_create+0x13b0/0x13b0 ? check_flags.part.28+0x220/0x220 ? lockdep_hardirqs_on+0x16/0x2c0 __x64_sys_io_uring_setup+0x31/0x40 do_syscall_64+0xcc/0xaf0 ? syscall_return_slowpath+0x580/0x580 ? lockdep_hardirqs_off+0x1f/0x140 ? entry_SYSCALL_64_after_hwframe+0x3e/0xb3 ? trace_hardirqs_off_caller+0x3a/0x150 ? trace_hardirqs_off_thunk+0x1a/0x1c entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7fdcb9dd76ed Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 57 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe7fd4e4f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 00000000000001a9 RCX: 00007fdcb9dd76ed RDX: fffffffffffffffc RSI: 0000000000000000 RDI: 0000000000005d54 RBP: 00000000000001a9 R08: 0000000e31d3caa7 R09: 0082400004004000 R10: ffffffffffffffff R11: 0000000000000246 R12: 0000000000000002 R13: 00007fdcb842e058 R14: 00007fdcba4c46c0 R15: 00007fdcb842e000 Modules linked in: bridge stp llc nfnetlink cn brd vfat fat ext4 crc16 mbcache jbd2 loop kvm_intel kvm irqbypass intel_cstate intel_uncore dax_pmem intel_rapl_perf dax_pmem_core ip_tables x_tables xfs sd_mod tg3 firmware_class libphy hpsa scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: binfmt_misc] CR2: ffffffffffffffe8 ---[ end trace f9502383d57e0e22 ]--- RIP: 0010:__wake_up_common+0x98/0x290 Code: 40 4d 8d 78 e8 49 8d 7f 18 49 39 fd 0f 84 80 00 00 00 e8 6b bd 2b 00 49 8b 5f 18 45 31 e4 48 83 eb 18 4c 89 ff e8 08 bc 2b 00 <45> 8b 37 41 f6 c6 04 75 71 49 8d 7f 10 e8 46 bd 2b 00 49 8b 47 10 RSP: 0018:ffffc9000adbfaf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: ffffffffaa9636b8 RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffffffffe8 RBP: ffffc9000adbfb40 R08: fffffbfff582c5fd R09: fffffbfff582c5fd R10: ffffffffac162fe3 R11: fffffbfff582c5fc R12: 0000000000000000 R13: ffff888ef82b0960 R14: ffffc9000adbfb80 R15: ffffffffffffffe8 FS: 00007fdcba4c4740(0000) GS:ffff889033780000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffe8 CR3: 0000000f776a0004 CR4: 00000000001606e0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x29800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]— which is due to error injection (or allocation failure) preventing the rings from being setup. On shutdown, we attempt to remove any pending requests, and for poll request, we call io_cqring_ev_posted() when we've killed poll requests. However, since the rings aren't setup, we won't find any poll requests. Make the calling of io_cqring_ev_posted() dependent on actually having completed requests. This fixes this setup corner case, and removes spurious calls if we remove poll requests and don't find any. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: io_async_task_func() should check and honor cancelationJens Axboe2020-04-131-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If the request has been marked as canceled, don't try and issue it. Instead just fill a canceled event and finish the request. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: check for need to re-wait in polled async handlingJens Axboe2020-04-131-14/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We added this for just the regular poll requests in commit a6ba632d2c24 ("io_uring: retry poll if we got woken with non-matching mask"), we should do the same for the poll handler used pollable async requests. Move the re-wait check and arm into a helper, and call it from io_async_task_func() as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: correct O_NONBLOCK check for splice puntJens Axboe2020-04-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The splice file punt check uses file->f_mode to check for O_NONBLOCK, but it should be checking file->f_flags. This leads to punting even for files that have O_NONBLOCK set, which isn't necessary. This equates to checking for FMODE_PATH, which will never be set on the fd in question. Fixes: 7d67af2c0134 ("io_uring: add splice(2) support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: restore req->work when canceling poll requestXiaoguang Wang2020-04-121-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running liburing test case 'accept', I got below warning: RED: Invalid credentials RED: At include/linux/cred.h:285 RED: Specified credentials: 00000000d02474a0 RED: ->magic=4b, put_addr=000000005b4f46e9 RED: ->usage=-1699227648, subscr=-25693 RED: ->*uid = { 256,-25693,-25693,65534 } RED: ->*gid = { 0,-1925859360,-1789740800,-1827028688 } RED: ->security is 00000000258c136e eneral protection fault, probably for non-canonical address 0xdead4ead00000000: 0000 [#1] SMP PTI PU: 21 PID: 2037 Comm: accept Not tainted 5.6.0+ #318 ardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014 IP: 0010:dump_invalid_creds+0x16f/0x184 ode: 48 8b 83 88 00 00 00 48 3d ff 0f 00 00 76 29 48 89 c2 81 e2 00 ff ff ff 48 81 fa 00 6b 6b 6b 74 17 5b 48 c7 c7 4b b1 10 8e 5d <8b> 50 04 41 5c 8b 30 41 5d e9 67 e3 04 00 5b 5d 41 5c 41 5d c3 0f SP: 0018:ffffacc1039dfb38 EFLAGS: 00010087 AX: dead4ead00000000 RBX: ffff9ba39319c100 RCX: 0000000000000007 DX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8e10b14b BP: ffffffff8e108476 R08: 0000000000000000 R09: 0000000000000001 10: 0000000000000000 R11: ffffacc1039df9e5 R12: 000000009552b900 13: 000000009319c130 R14: ffff9ba39319c100 R15: 0000000000000246 S: 00007f96b2bfc4c0(0000) GS:ffff9ba39f340000(0000) knlGS:0000000000000000 S: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 R2: 0000000000401870 CR3: 00000007db7a4000 CR4: 00000000000006e0 all Trace: __invalid_creds+0x48/0x4a __io_req_aux_free+0x2e8/0x3b0 ? io_poll_remove_one+0x2a/0x1d0 __io_free_req+0x18/0x200 io_free_req+0x31/0x350 io_poll_remove_one+0x17f/0x1d0 io_poll_cancel.isra.80+0x6c/0x80 io_async_find_and_cancel+0x111/0x120 io_issue_sqe+0x181/0x10e0 ? __lock_acquire+0x552/0xae0 ? lock_acquire+0x8e/0x310 ? fs_reclaim_acquire.part.97+0x5/0x30 __io_queue_sqe.part.100+0xc4/0x580 ? io_submit_sqes+0x751/0xbd0 ? rcu_read_lock_sched_held+0x32/0x40 io_submit_sqes+0x9ba/0xbd0 ? __x64_sys_io_uring_enter+0x2b2/0x460 ? __x64_sys_io_uring_enter+0xaf/0x460 ? find_held_lock+0x2d/0x90 ? __x64_sys_io_uring_enter+0x111/0x460 __x64_sys_io_uring_enter+0x2d7/0x460 do_syscall_64+0x5a/0x230 entry_SYSCALL_64_after_hwframe+0x49/0xb3 After looking into codes, it turns out that this issue is because we didn't restore the req->work, which is changed in io_arm_poll_handler(), req->work is a union with below struct: struct { struct callback_head task_work; struct hlist_node hash_node; struct async_poll *apoll; }; If we forget to restore, members in struct io_wq_work would be invalid, restore the req->work to fix this issue. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Get rid of not needed 'need_restore' variable. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: move all request init code in one placePavel Begunkov2020-04-121-52/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Requests initialisation is scattered across several functions, namely io_init_req(), io_submit_sqes(), io_submit_sqe(). Put it in io_init_req() for better data locality and code clarity. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: keep all sqe->flags in req->flagsPavel Begunkov2020-04-121-10/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's a good idea to not read sqe->flags twice, as it's prone to security bugs. Instead of passing it around, embeed them in req->flags. It's already so except for IOSQE_IO_LINK. 1. rename former REQ_F_LINK -> REQ_F_LINK_HEAD 2. introduce and copy REQ_F_LINK, which mimics IO_IOSQE_LINK And leave req_set_fail_links() using new REQ_F_LINK, because it's more sensible. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: early submission req fail codePavel Begunkov2020-04-121-31/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Having only one place for cleaning up a request after a link assembly/ submission failure will play handy in the future. At least it allows to remove duplicated cleanup sequence. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: track mm through current->mmPavel Begunkov2020-04-121-21/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As a preparation for extracting request init bits, remove self-coded mm tracking from io_submit_sqes(), but rely on current->mm. It's more convenient, than passing this piece of state in other functions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: remove obsolete @mm_faultPavel Begunkov2020-04-121-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If io_submit_sqes() can't grab an mm, it fails and exits right away. There is no need to track the fact of the failure. Remove @mm_fault. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge tag 'for-5.7-rc1-tag' of ↵Linus Torvalds2020-04-171-2/+17
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A regression fix for a warning caused by running balance and snapshot creation in parallel" * tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix setting last_trans for reloc roots
| * | | | btrfs: fix setting last_trans for reloc rootsJosef Bacik2020-04-171-2/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I made a mistake with my previous fix, I assumed that we didn't need to mess with the reloc roots once we were out of the part of relocation where we are actually moving the extents. The subtle thing that I missed is that btrfs_init_reloc_root() also updates the last_trans for the reloc root when we do btrfs_record_root_in_trans() for the corresponding fs_root. I've added a comment to make sure future me doesn't make this mistake again. This showed up as a WARN_ON() in btrfs_copy_root() because our last_trans didn't == the current transid. This could happen if we snapshotted a fs root with a reloc root after we set rc->create_reloc_tree = 0, but before we actually merge the reloc root. Worth mentioning that the regression produced the following warning when running snapshot creation and balance in parallel: BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup ------------[ cut here ]------------ WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs] CPU: 0 PID: 12823 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs] RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202 RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48 RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818 RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002 R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818 R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00 FS: 00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? create_reloc_root+0x49/0x2b0 [btrfs] ? kmem_cache_alloc_trace+0xe5/0x200 create_reloc_root+0x8b/0x2b0 [btrfs] btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs] create_pending_snapshot+0x610/0x1010 [btrfs] create_pending_snapshots+0xa8/0xd0 [btrfs] btrfs_commit_transaction+0x4c7/0xc50 [btrfs] ? btrfs_mksubvol+0x3cd/0x560 [btrfs] btrfs_mksubvol+0x455/0x560 [btrfs] __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs] btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs] ? mem_cgroup_commit_charge+0x6e/0x540 btrfs_ioctl+0x12d8/0x3760 [btrfs] ? do_raw_spin_unlock+0x49/0xc0 ? _raw_spin_unlock+0x29/0x40 ? __handle_mm_fault+0x11b3/0x14b0 ? ksys_ioctl+0x92/0xb0 ksys_ioctl+0x92/0xb0 ? trace_hardirqs_off_thunk+0x1a/0x1c __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fdeabd3bdd7 Fixes: 2abc726ab4b8 ("btrfs: do not init a reloc root if we aren't relocating") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | | Merge tag 'nfs-for-5.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2020-04-161-1/+2
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfix from Trond Myklebust: "Fix an ABBA spinlock issue in pnfs_update_layout()" * tag 'nfs-for-5.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix an ABBA spinlock issue in pnfs_update_layout()
| * | | | | NFS: Fix an ABBA spinlock issue in pnfs_update_layout()Trond Myklebust2020-04-131-1/+2
| | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to drop the inode spinlock while calling nfs4_select_rw_stateid(), since nfs4_copy_delegation_stateid() could take the delegation lock. Note that it is safe to do this, since all other calls to pnfs_update_layout() for that inode will find themselves blocked by the lock we hold on NFS_LAYOUT_FIRST_LAYOUTGET. Fixes: fc51b1cf391d ("NFS: Beware when dereferencing the delegation cred") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* | | | | Merge tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-clientLinus Torvalds2020-04-163-5/+5
|\ \ \ \ \ | |_|_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph fixes from Ilya Dryomov: - a set of patches for a deadlock on "rbd map" error path - a fix for invalid pointer dereference and uninitialized variable use on asynchronous create and unlink error paths. * tag 'ceph-for-5.7-rc2' of git://github.com/ceph/ceph-client: ceph: fix potential bad pointer deref in async dirops cb's rbd: don't mess with a page vector in rbd_notify_op_lock() rbd: don't test rbd_dev->opts in rbd_dev_image_release() rbd: call rbd_dev_unprobe() after unwatching and flushing notifies rbd: avoid a deadlock on header_rwsem when flushing notifies
| * | | | ceph: fix potential bad pointer deref in async dirops cb'sJeff Layton2020-04-133-5/+5
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new async dirops callback routines can pass ERR_PTR values to ceph_mdsc_free_path, which could cause an oops. Make ceph_mdsc_free_path ignore ERR_PTR values. Also, ensure that the pr_warn messages look sane even if ceph_mdsc_build_path fails. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | | | Merge tag 'for-5.7-rc1-tag' of ↵Linus Torvalds2020-04-146-87/+47
|\ \ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We have a few regressions and one fix for stable: - revert fsync optimization - fix lost i_size update - fix a space accounting leak - build fix, add back definition of a deprecated ioctl flag - fix search condition for old roots in relocation" * tag 'for-5.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: re-instantiate the removed BTRFS_SUBVOL_CREATE_ASYNC definition btrfs: fix reclaim counter leak of space_info objects btrfs: make full fsyncs always operate on the entire file again btrfs: fix lost i_size update after cloning inline extent btrfs: check commit root generation in should_ignore_root
| * | | btrfs: fix reclaim counter leak of space_info objectsFilipe Manana2020-04-082-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever we add a ticket to a space_info object we increment the object's reclaim_size counter witht the ticket's bytes, and we decrement it with the corresponding amount only when we are able to grant the requested space to the ticket. When we are not able to grant the space to a ticket, or when the ticket is removed due to a signal (e.g. an application has received sigterm from the terminal) we never decrement the counter with the corresponding bytes from the ticket. This leak can result in the space reclaim code to later do much more work than necessary. So fix it by decrementing the counter when those two cases happen as well. Fixes: db161806dc5615 ("btrfs: account ticket size at add/delete time") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: make full fsyncs always operate on the entire file againFilipe Manana2020-04-082-79/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a revert of commit 0a8068a3dd4294 ("btrfs: make ranged full fsyncs more efficient"), with updated comment in btrfs_sync_file. Commit 0a8068a3dd4294 ("btrfs: make ranged full fsyncs more efficient") made full fsyncs operate on the given range only as it assumed it was safe when using the NO_HOLES feature, since the hole detection was simplified some time ago and no longer was a source for races with ordered extent completion of adjacent file ranges. However it's still not safe to have a full fsync only operate on the given range, because extent maps for new extents might not be present in memory due to inode eviction or extent cloning. Consider the following example: 1) We are currently at transaction N; 2) We write to the file range [0, 1MiB); 3) Writeback finishes for the whole range and ordered extents complete, while we are still at transaction N; 4) The inode is evicted; 5) We open the file for writing, causing the inode to be loaded to memory again, which sets the 'full sync' bit on its flags. At this point the inode's list of modified extent maps is empty (figuring out which extents were created in the current transaction and were not yet logged by an fsync is expensive, that's why we set the 'full sync' bit when loading an inode); 6) We write to the file range [512KiB, 768KiB); 7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB). This correctly flushes this range and logs its extent into the log tree. When the writeback started an extent map for range [512KiB, 768KiB) was added to the inode's list of modified extents, and when the fsync() finishes logging it removes that extent map from the list of modified extent maps. This fsync also clears the 'full sync' bit; 8) We do a regular fsync() (full ranged). This fsync() ends up doing nothing because the inode's list of modified extents is empty and no other changes happened since the previous ranged fsync(), so it just returns success (0) and we end up never logging extents for the file ranges [0, 512KiB) and [768KiB, 1MiB). Another scenario where this can happen is if we replace steps 2 to 4 with cloning from another file into our test file, as that sets the 'full sync' bit in our inode's flags and does not populate its list of modified extent maps. This was causing test case generic/457 to fail sporadically when using the NO_HOLES feature, as it exercised this later case where the inode has the 'full sync' bit set and has no extent maps in memory to represent the new extents due to extent cloning. Fix this by reverting commit 0a8068a3dd4294 ("btrfs: make ranged full fsyncs more efficient") since there is no easy way to work around it. Fixes: 0a8068a3dd4294 ("btrfs: make ranged full fsyncs more efficient") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: fix lost i_size update after cloning inline extentFilipe Manana2020-04-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When not using the NO_HOLES feature we were not marking the destination's file range as written after cloning an inline extent into it. This can lead to a data loss if the current destination file size is smaller than the source file's size. Example: $ mkfs.btrfs -f -O ^no-holes /dev/sdc $ mount /mnt/sdc /mnt $ echo "hello world" > /mnt/foo $ cp --reflink=always /mnt/foo /mnt/bar $ rm -f /mnt/foo $ umount /mnt $ mount /mnt/sdc /mnt $ cat /mnt/bar $ $ stat -c %s /mnt/bar 0 # -> the file is empty, since we deleted foo, the data lost is forever Fix that by calling btrfs_inode_set_file_extent_range() after cloning an inline extent. A test case for fstests will follow soon. Link: https://lore.kernel.org/linux-btrfs/20200404193846.GA432065@latitude/ Reported-by: Johannes Hirte <johannes.hirte@datenkhaos.de> Fixes: 9ddc959e802bf ("btrfs: use the file extent tree infrastructure") Tested-by: Johannes Hirte <johannes.hirte@datenkhaos.de> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: check commit root generation in should_ignore_rootJosef Bacik2020-04-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously we would set the reloc root's last snapshot to transid - 1. However there was a problem with doing this, and we changed it to setting the last snapshot to the generation of the commit node of the fs root. This however broke should_ignore_root(). The assumption is that if we are in a generation newer than when the reloc root was created, then we would find the reloc root through normal backref lookups, and thus can ignore any fs roots we find with an old enough reloc root. Now that the last snapshot could be considerably further in the past than before, we'd end up incorrectly ignoring an fs root. Thus we'd find no nodes for the bytenr we were searching for, and we'd fail to relocate anything. We'd loop through the relocate code again and see that there were still used space in that block group, attempt to relocate those bytenr's again, fail in the same way, and just loop like this forever. This is tricky in that we have to not modify the fs root at all during this time, so we need to have a block group that has data in this fs root that is not shared by any other root, which is why this has been difficult to reproduce. Fixes: 054570a1dc94 ("Btrfs: fix relocation incorrectly dropping data references") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | afs: Fix afs_d_validate() to set the right directory versionDavid Howells2020-04-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a dentry's version is somewhere between invalid_before and the current directory version, we should be setting it forward to the current version, not backwards to the invalid_before version. Note that we're only doing this at all because dentry::d_fsdata isn't large enough on a 32-bit system. Fix this by using a separate variable for invalid_before so that we don't accidentally clobber the current dir version. Fixes: a4ff7401fbfa ("afs: Keep track of invalid-before version for dentry coherency") Signed-off-by: David Howells <dhowells@redhat.com>
* | | | afs: Fix race between post-modification dir edit and readdir/d_revalidateDavid Howells2020-04-132-35/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AFS directories are retained locally as a structured file, with lookup being effected by a local search of the file contents. When a modification (such as mkdir) happens, the dir file content is modified locally rather than redownloading the directory. The directory contents are accessed in a number of ways, with a number of different locks schemes: (1) Download of contents - dvnode->validate_lock/write in afs_read_dir(). (2) Lookup and readdir - dvnode->validate_lock/read in afs_dir_iterate(), downgrading from (1) if necessary. (3) d_revalidate of child dentry - dvnode->validate_lock/read in afs_do_lookup_one() downgrading from (1) if necessary. (4) Edit of dir after modification - page locks on individual dir pages. Unfortunately, because (4) uses different locking scheme to (1) - (3), nothing protects against the page being scanned whilst the edit is underway. Even download is not safe as it doesn't lock the pages - relying instead on the validate_lock to serialise as a whole (the theory being that directory contents are treated as a block and always downloaded as a block). Fix this by write-locking dvnode->validate_lock around the edits. Care must be taken in the rename case as there may be two different dirs - but they need not be locked at the same time. In any case, once the lock is taken, the directory version must be rechecked, and the edit skipped if a later version has been downloaded by revalidation (there can't have been any local changes because the VFS holds the inode lock, but there can have been remote changes). Fixes: 63a4681ff39c ("afs: Locally edit directory data for mkdir/create/unlink/...") Signed-off-by: David Howells <dhowells@redhat.com>
* | | | afs: Fix length of dump of bad YFSFetchStatus recordDavid Howells2020-04-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the length of the dump of a bad YFSFetchStatus record. The function was copied from the AFS version, but the YFS variant contains bigger fields and extra information, so expand the dump to match. Signed-off-by: David Howells <dhowells@redhat.com>
* | | | afs: Fix rename operation status deliveryDavid Howells2020-04-133-21/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The afs_deliver_fs_rename() and yfs_deliver_fs_rename() functions both only decode the second file status returned unless the parent directories are different - unfortunately, this means that the xdr pointer isn't advanced and the volsync record will be read incorrectly in such an instance. Fix this by always decoding the second status into the second status/callback block which wasn't being used if the dirs were the same. The afs_update_dentry_version() calls that update the directory data version numbers on the dentries can then unconditionally use the second status record as this will always reflect the state of the destination dir (the two records will be identical if the destination dir is the same as the source dir) Fixes: 260a980317da ("[AFS]: Add "directory write" support.") Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client") Signed-off-by: David Howells <dhowells@redhat.com>