summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()Alexander Potapenko2020-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | commit 1d605416fb7175e1adf094251466caa52093b413 upstream. KMSAN reported uninitialized data being written to disk when dumping core. As a result, several kilobytes of kmalloc memory may be written to the core file and then read by a non-privileged user. Reported-by: sam <sunhaoyl@outlook.com> Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Link: http://lkml.kernel.org/r/20200419100848.63472-1-glider@google.com Link: https://github.com/google/kmsan/issues/76 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: add cond_resched() to ext4_protect_reserved_inodeShijie Luo2020-06-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit af133ade9a40794a37104ecbcc2827c0ea373a3c upstream. When journal size is set too big by "mkfs.ext4 -J size=", or when we mount a crafted image to make journal inode->i_size too big, the loop, "while (i < num)", holds cpu too long. This could cause soft lockup. [ 529.357541] Call trace: [ 529.357551] dump_backtrace+0x0/0x198 [ 529.357555] show_stack+0x24/0x30 [ 529.357562] dump_stack+0xa4/0xcc [ 529.357568] watchdog_timer_fn+0x300/0x3e8 [ 529.357574] __hrtimer_run_queues+0x114/0x358 [ 529.357576] hrtimer_interrupt+0x104/0x2d8 [ 529.357580] arch_timer_handler_virt+0x38/0x58 [ 529.357584] handle_percpu_devid_irq+0x90/0x248 [ 529.357588] generic_handle_irq+0x34/0x50 [ 529.357590] __handle_domain_irq+0x68/0xc0 [ 529.357593] gic_handle_irq+0x6c/0x150 [ 529.357595] el1_irq+0xb8/0x140 [ 529.357599] __ll_sc_atomic_add_return_acquire+0x14/0x20 [ 529.357668] ext4_map_blocks+0x64/0x5c0 [ext4] [ 529.357693] ext4_setup_system_zone+0x330/0x458 [ext4] [ 529.357717] ext4_fill_super+0x2170/0x2ba8 [ext4] [ 529.357722] mount_bdev+0x1a8/0x1e8 [ 529.357746] ext4_mount+0x44/0x58 [ext4] [ 529.357748] mount_fs+0x50/0x170 [ 529.357752] vfs_kern_mount.part.9+0x54/0x188 [ 529.357755] do_mount+0x5ac/0xd78 [ 529.357758] ksys_mount+0x9c/0x118 [ 529.357760] __arm64_sys_mount+0x28/0x38 [ 529.357764] el0_svc_common+0x78/0x130 [ 529.357766] el0_svc_handler+0x38/0x78 [ 529.357769] el0_svc+0x8/0xc [ 541.356516] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mount:18674] Link: https://lore.kernel.org/r/20200211011752.29242-1-luoshijie1@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: don't perform block validity checks on the journal inodeTheodore Ts'o2020-06-111-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 0a944e8a6c66ca04c7afbaa17e22bf208a8b37f0 upstream. Since the journal inode is already checked when we added it to the block validity's system zone, if we check it again, we'll just trigger a failure. This was causing failures like this: [ 53.897001] EXT4-fs error (device sda): ext4_find_extent:909: inode #8: comm jbd2/sda-8: pblk 121667583 bad header/extent: invalid extent entries - magic f30a, entries 8, max 340(340), depth 0(0) [ 53.931430] jbd2_journal_bmap: journal block not found at offset 49 on sda-8 [ 53.938480] Aborting journal on device sda-8. ... but only if the system was under enough memory pressure that logical->physical mapping for the journal inode gets pushed out of the extent cache. (This is why it wasn't noticed earlier.) Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Reported-by: Dan Rue <dan.rue@linaro.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> [bwh: Backported to 3.16: Use EXT4_HAS_COMPAT_FEATURE()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix block validity checks for journal inodes using indirect blocksTheodore Ts'o2020-06-111-0/+6
| | | | | | | | | | | | | | | | | | | | commit 170417c8c7bb2cbbdd949bf5c443c0c8f24a203b upstream. Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") failed to add an exception for the journal inode in ext4_check_blockref(), which is the function used by ext4_get_branch() for indirect blocks. This caused attempts to read from the ext3-style journals to fail with: [ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block Fix this by adding the missing exception check. Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Reported-by: Arthur Marsh <arthur.marsh@internode.on.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu> [bwh: Backported to 3.16: Use EXT4_HAS_COMPAT_FEATURE] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: unsigned int compared against zeroColin Ian King2020-06-111-1/+2
| | | | | | | | | | | | | | commit fbbbbd2f28aec991f3fbc248df211550fbdfd58c upstream. There are two cases where u32 variables n and err are being checked for less than zero error values, the checks is always false because the variables are not signed. Fix this by making the variables ints. Addresses-Coverity: ("Unsigned compared against 0") Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: protect journal inode's blocks using block_validityTheodore Ts'o2020-06-112-0/+54
| | | | | | | | | | | | | | | | | commit 345c0dbf3a30872d9b204db96b5857cd00808cae upstream. Add the blocks which belong to the journal inode to block_validity's system zone so attempts to deallocate or overwrite the journal due a corrupted file system where the journal blocks are also claimed by another inode. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202879 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org [bwh: Backported to 3.16: - Use EXT4_HAS_COMPAT_FEATURE() - Use EIO instead of EFSCORRUPTED] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: Make checks for metadata_csum feature saferTahsin Erdogan2020-06-111-8/+11
| | | | | | | | | | | This is just a small part of commit dec214d00e0d7 "ext4: xattr inode deduplication" that makes checks for metadata_csum feature safer and is actually needed by following fixes. Signed-off-by: Tahsin Erdogan <tahsin@google.com> Acked-by: Jan Kara <jack@suse.cz> [bwh: Ported to 3.16: Use EXT4_HAS_RO_COMPAT_FEATURE()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* signal: Extend exec_id to 64bitsEric W. Biederman2020-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d1e7fd6462ca9fc76650fbe6ca800e35b24267da upstream. Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can been seen that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that is is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains expoiltable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> [bwh: Backported to 3.16: - Use ACCESS_ONCE() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* cifs: fail i/o on soft mounts if sessionsetup errors outRonnie Sahlberg2020-05-221-2/+8
| | | | | | | | | | | | | | commit b0dd940e582b6a60296b9847a54012a4b080dc72 upstream. RHBZ: 1579050 If we have a soft mount we should fail commands for session-setup failures (such as the password having changed/ account being deleted/ ...) and return an error back to the application. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* NFS: Directory page cache pages need to be locked when readTrond Myklebust2020-05-221-11/+19
| | | | | | | | | | | | | | | | commit 114de38225d9b300f027e2aec9afbb6e0def154b upstream. When a NFS directory page cache page is removed from the page cache, its contents are freed through a call to nfs_readdir_clear_array(). To prevent the removal of the page cache entry until after we've finished reading it, we must take the page lock. Fixes: 11de3b11e08c ("NFS: Fix a memory leak in nfs_readdir") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* NFS: Fix memory leaks and corruption in readdirTrond Myklebust2020-05-221-2/+15
| | | | | | | | | | | | | | | | | commit 4b310319c6a8ce708f1033d57145e2aa027a883c upstream. nfs_readdir_xdr_to_array() must not exit without having initialised the array, so that the page cache deletion routines can safely call nfs_readdir_clear_array(). Furthermore, we should ensure that if we exit nfs_readdir_filler() with an error, we free up any page contents to prevent a leak if we try to fill the page again. Fixes: 11de3b11e08c ("NFS: Fix a memory leak in nfs_readdir") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* nfs: use kmap/kunmap directlyFabian Frederick2020-05-221-55/+12
| | | | | | | | | | | | | | | | | | commit 0795bf8357c1887e2a95e6e4f5b89d0896a0d929 upstream. This patch removes useless nfs_readdir_get_array() and nfs_readdir_release_array() as suggested by Trond Myklebust nfs_readdir() calls nfs_revalidate_mapping() before readdir_search_pagecache() , nfs_do_filldir(), uncached_readdir() so mapping should be correct. While kmap() can't fail, all subsequent error checks were removed as well as unused labels. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Btrfs: fix race between adding and putting tree mod seq elements and nodesFilipe Manana2020-05-225-16/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7227ff4de55d931bbdc156c8ef0ce4f100c78a5b upstream. There is a race between adding and removing elements to the tree mod log list and rbtree that can lead to use-after-free problems. Consider the following example that explains how/why the problems happens: 1) Task A has mod log element with sequence number 200. It currently is the only element in the mod log list; 2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to access the tree mod log. When it enters the function, it initializes 'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock' before checking if there are other elements in the mod seq list. Since the list it empty, 'min_seq' remains set to (u64)-1. Then it unlocks the lock 'tree_mod_seq_lock'; 3) Before task A acquires the lock 'tree_mod_log_lock', task B adds itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a sequence number of 201; 4) Some other task, name it task C, modifies a btree and because there elements in the mod seq list, it adds a tree mod elem to the tree mod log rbtree. That node added to the mod log rbtree is assigned a sequence number of 202; 5) Task B, which is doing fiemap and resolving indirect back references, calls btrfs get_old_root(), with 'time_seq' == 201, which in turn calls tree_mod_log_search() - the search returns the mod log node from the rbtree with sequence number 202, created by task C; 6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating the mod log rbtree and finds the node with sequence number 202. Since 202 is less than the previously computed 'min_seq', (u64)-1, it removes the node and frees it; 7) Task B still has a pointer to the node with sequence number 202, and it dereferences the pointer itself and through the call to __tree_mod_log_rewind(), resulting in a use-after-free problem. This issue can be triggered sporadically with the test case generic/561 from fstests, and it happens more frequently with a higher number of duperemove processes. When it happens to me, it either freezes the VM or it produces a trace like the following before crashing: [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1 [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [ 1245.321287] RIP: 0010:rb_next+0x16/0x50 [ 1245.321307] Code: .... [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202 [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80 [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000 [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038 [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8 [ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000 [ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0 [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1245.321706] Call Trace: [ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs] [ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs] [ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs] [ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs] [ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs] [ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs] [ 1245.322066] do_vfs_ioctl+0x45a/0x750 [ 1245.322081] ksys_ioctl+0x70/0x80 [ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 1245.322113] __x64_sys_ioctl+0x16/0x20 [ 1245.322126] do_syscall_64+0x5c/0x280 [ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1245.322155] RIP: 0033:0x7fdee3942dd7 [ 1245.322177] Code: .... [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7 [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004 [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44 [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48 [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50 [ 1245.322423] Modules linked in: .... [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]--- Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum sequence number and iterates the rbtree while holding the lock 'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock' lock, since it is now redundant. Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - Use tree_mod_log_write_{,un}lock() in ctree.c for consistency - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* CIFS: Fix task struct use-after-free on reconnectVincent Whitchurch2020-05-223-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f1f27ad74557e39f67a8331a808b860f89254f2d upstream. The task which created the MID may be gone by the time cifsd attempts to call the callbacks on MIDs from cifs_reconnect(). This leads to a use-after-free of the task struct in cifs_wake_up_task: ================================================================== BUG: KASAN: use-after-free in __lock_acquire+0x31a0/0x3270 Read of size 8 at addr ffff8880103e3a68 by task cifsd/630 CPU: 0 PID: 630 Comm: cifsd Not tainted 5.5.0-rc6+ #119 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x8e/0xcb print_address_description.constprop.5+0x1d3/0x3c0 ? __lock_acquire+0x31a0/0x3270 __kasan_report+0x152/0x1aa ? __lock_acquire+0x31a0/0x3270 ? __lock_acquire+0x31a0/0x3270 kasan_report+0xe/0x20 __lock_acquire+0x31a0/0x3270 ? __wake_up_common+0x1dc/0x630 ? _raw_spin_unlock_irqrestore+0x4c/0x60 ? mark_held_locks+0xf0/0xf0 ? _raw_spin_unlock_irqrestore+0x39/0x60 ? __wake_up_common_lock+0xd5/0x130 ? __wake_up_common+0x630/0x630 lock_acquire+0x13f/0x330 ? try_to_wake_up+0xa3/0x19e0 _raw_spin_lock_irqsave+0x38/0x50 ? try_to_wake_up+0xa3/0x19e0 try_to_wake_up+0xa3/0x19e0 ? cifs_compound_callback+0x178/0x210 ? set_cpus_allowed_ptr+0x10/0x10 cifs_reconnect+0xa1c/0x15d0 ? generic_ip_connect+0x1860/0x1860 ? rwlock_bug.part.0+0x90/0x90 cifs_readv_from_socket+0x479/0x690 cifs_read_from_socket+0x9d/0xe0 ? cifs_readv_from_socket+0x690/0x690 ? mempool_resize+0x690/0x690 ? rwlock_bug.part.0+0x90/0x90 ? memset+0x1f/0x40 ? allocate_buffers+0xff/0x340 cifs_demultiplex_thread+0x388/0x2a50 ? cifs_handle_standard+0x610/0x610 ? rcu_read_lock_held_common+0x120/0x120 ? mark_lock+0x11b/0xc00 ? __lock_acquire+0x14ed/0x3270 ? __kthread_parkme+0x78/0x100 ? lockdep_hardirqs_on+0x3e8/0x560 ? lock_downgrade+0x6a0/0x6a0 ? lockdep_hardirqs_on+0x3e8/0x560 ? _raw_spin_unlock_irqrestore+0x39/0x60 ? cifs_handle_standard+0x610/0x610 kthread+0x2bb/0x3a0 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x50 Allocated by task 649: save_stack+0x19/0x70 __kasan_kmalloc.constprop.5+0xa6/0xf0 kmem_cache_alloc+0x107/0x320 copy_process+0x17bc/0x5370 _do_fork+0x103/0xbf0 __x64_sys_clone+0x168/0x1e0 do_syscall_64+0x9b/0xec0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 0: save_stack+0x19/0x70 __kasan_slab_free+0x11d/0x160 kmem_cache_free+0xb5/0x3d0 rcu_core+0x52f/0x1230 __do_softirq+0x24d/0x962 The buggy address belongs to the object at ffff8880103e32c0 which belongs to the cache task_struct of size 6016 The buggy address is located 1960 bytes inside of 6016-byte region [ffff8880103e32c0, ffff8880103e4a40) The buggy address belongs to the page: page:ffffea000040f800 refcount:1 mapcount:0 mapping:ffff8880108da5c0 index:0xffff8880103e4c00 compound_mapcount: 0 raw: 4000000000010200 ffffea00001f2208 ffffea00001e3408 ffff8880108da5c0 raw: ffff8880103e4c00 0000000000050003 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880103e3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880103e3980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880103e3a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880103e3a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880103e3b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== This can be reliably reproduced by adding the below delay to cifs_reconnect(), running find(1) on the mount, restarting the samba server while find is running, and killing find during the delay: spin_unlock(&GlobalMid_Lock); mutex_unlock(&server->srv_mutex); + msleep(10000); + cifs_dbg(FYI, "%s: issuing mid callbacks\n", __func__); list_for_each_safe(tmp, tmp2, &retry_list) { mid_entry = list_entry(tmp, struct mid_q_entry, qhead); Fix this by holding a reference to the task struct until the MID is freed. Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> [bwh: Backported to 3.16: - In _cifs_mid_q_entry_release(), use mid instead of midEntry - Adjust context, indentation] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4, jbd2: ensure panic when aborting with zero errnozhangyi (F)2020-05-222-12/+5
| | | | | | | | | | | | | | | | | | commit 51f57b01e4a3c7d7bdceffd84de35144e8c538e7 upstream. JBD2_REC_ERR flag used to indicate the errno has been updated when jbd2 aborted, and then __ext4_abort() and ext4_handle_error() can invoke panic if ERRORS_PANIC is specified. But if the journal has been aborted with zero errno, jbd2_journal_abort() didn't set this flag so we can no longer panic. Fix this by always record the proper errno in the journal superblock. Fixes: 4327ba52afd03 ("ext4, jbd2: ensure entering into panic after recording an error in superblock") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* jbd2: switch to use jbd2_journal_abort() when failed to submit the commit recordzhangyi (F)2020-05-221-2/+2
| | | | | | | | | | | | | | | | | commit d0a186e0d3e7ac05cc77da7c157dae5aa59f95d9 upstream. We invoke jbd2_journal_abort() to abort the journal and record errno in the jbd2 superblock when committing journal transaction besides the failure on submitting the commit record. But there is no need for the case and we can also invoke jbd2_journal_abort() instead of __jbd2_journal_abort_hard(). Fixes: 818d276ceb83a ("ext4: Add the journal checksum feature") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* jbd2: clear JBD2_ABORT flag before journal_reset to update log tail info ↵Kai Li2020-05-221-1/+5
| | | | | | | | | | | | | | | | | | | | | | when load journal commit a09decff5c32060639a685581c380f51b14e1fc2 upstream. If the journal is dirty when the filesystem is mounted, jbd2 will replay the journal but the journal superblock will not be updated by journal_reset() because JBD2_ABORT flag is still set (it was set in journal_init_common()). This is problematic because when a new transaction is then committed, it will be recorded in block 1 (journal->j_tail was set to 1 in journal_reset()). If unclean shutdown happens again before the journal superblock is updated, the new recorded transaction will not be replayed during the next mount (because of stale sb->s_start and sb->s_sequence values) which can lead to filesystem corruption. Fixes: 85e0c4e89c1b ("jbd2: if the journal is aborted then don't allow update of the log tail") Signed-off-by: Kai Li <li.kai4@h3c.com> Link: https://lore.kernel.org/r/20200111022542.5008-1-li.kai4@h3c.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ubifs: Fix deadlock in concurrent bulk-read and writepageZhihao Cheng2020-05-221-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f5de5b83303e61b1f3fb09bd77ce3ac2d7a475f2 upstream. In ubifs, concurrent execution of writepage and bulk read on the same file may cause ABBA deadlock, for example (Reproduce method see Link): Process A(Bulk-read starts from page4) Process B(write page4 back) vfs_read wb_workfn or fsync ... ... generic_file_buffered_read write_cache_pages ubifs_readpage LOCK(page4) ubifs_bulk_read ubifs_writepage LOCK(ui->ui_mutex) ubifs_write_inode ubifs_do_bulk_read LOCK(ui->ui_mutex) find_or_create_page(alloc page4) ↑ LOCK(page4) <-- ABBA deadlock occurs! In order to ensure the serialization execution of bulk read, we can't remove the big lock 'ui->ui_mutex' in ubifs_bulk_read(). Instead, we allow ubifs_do_bulk_read() to lock page failed by replacing find_or_create_page(FGP_LOCK) with pagecache_get_page(FGP_LOCK | FGP_NOWAIT). Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Suggested-by: zhangyi (F) <yi.zhang@huawei.com> Fixes: 4793e7c5e1c ("UBIFS: add bulk-read facility") Link: https://bugzilla.kernel.org/show_bug.cgi?id=206153 Signed-off-by: Richard Weinberger <richard@nod.at> [bwh: Backported to 3.16: Keep using constant GFP flags parameter.] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* nfs: NFS_SWAP should depend on SWAPGeert Uytterhoeven2020-05-221-1/+1
| | | | | | | | | | | | | | | | | | commit 474c4f306eefbb21b67ebd1de802d005c7d7ecdc upstream. If CONFIG_SWAP=n, it does not make much sense to offer the user the option to enable support for swapping over NFS, as that will still fail at run time: # swapon /swap swapon: /swap: swapon failed: Function not implemented Fix this by adding a dependency on CONFIG_SWAP. Fixes: a564b8f0398636ba ("nfs: enable swap on NFS") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* reiserfs: Fix spurious unlock in reiserfs_fill_super() error handlingJan Kara2020-05-221-1/+1
| | | | | | | | | | | | commit 4d5c1adaf893b8aa52525d2b81995e949bcb3239 upstream. When we fail to allocate string for journal device name we jump to 'error' label which tries to unlock reiserfs write lock which is not held. Jump to 'error_unlocked' instead. Fixes: f32485be8397 ("reiserfs: delay reiserfs lock until journal initialization") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* reiserfs: Fix memory leak of journal device stringJan Kara2020-05-221-0/+2
| | | | | | | | | | | | | commit 5474ca7da6f34fa95e82edc747d5faa19cbdfb5c upstream. When a filesystem is mounted with jdev mount option, we store the journal device name in an allocated string in superblock. However we fail to ever free that string. Fix it. Reported-by: syzbot+1c6756baf4b16b94d2a6@syzkaller.appspotmail.com Fixes: c3aa077648e1 ("reiserfs: Properly display mount options in /proc/mounts") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* propagate_one(): mnt_set_mountpoint() needs mount_lockAl Viro2020-05-221-5/+4
| | | | | | | | | | | | | | | | | | commit b0d3869ce9eeacbb1bbd541909beeef4126426d5 upstream. ... to protect the modification of mp->m_count done by it. Most of the places that modify that thing also have namespace_lock held, but not all of them can do so, so we really need mount_lock here. Kudos to Piotr Krysiuk <piotras@gmail.com>, who'd spotted a related bug in pivot_root(2) (fixed unnoticed in 5.3); search for other similar turds has caught out this one. Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* fs/namespace.c: fix mountpoint reference counter racePiotr Krysiuk2020-05-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | A race condition between threads updating mountpoint reference counter affects longterm releases 4.4.220, 4.9.220, 4.14.177 and 4.19.118. The mountpoint reference counter corruption may occur when: * one thread increments m_count member of struct mountpoint [under namespace_sem, but not holding mount_lock] pivot_root() * another thread simultaneously decrements the same m_count [under mount_lock, but not holding namespace_sem] put_mountpoint() unhash_mnt() umount_mnt() mntput_no_expire() To fix this race condition, grab mount_lock before updating m_count in pivot_root(). Reference: CVE-2020-12114 Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* futex: Fix inode life-time issuePeter Zijlstra2020-04-281-0/+1
| | | | | | | | | | | | | | | | | | | commit 8019ad13ef7f64be44d4f892af9c840179009254 upstream. As reported by Jann, ihold() does not in fact guarantee inode persistence. And instead of making it so, replace the usage of inode pointers with a per boot, machine wide, unique inode identifier. This sequence number is global, but shared (file backed) futexes are rare enough that this should not become a performance issue. Reported-by: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 3.16: Use atomic64_cmpxchg() instead of the _relaxed() variant which we don't have] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* chardev: add helper function to register char devs with a struct deviceLogan Gunthorpe2020-04-281-0/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 233ed09d7fdacf592ee91e6c97ce5f4364fbe7c0 upstream. Credit for this patch goes is shared with Dan Williams [1]. I've taken things one step further to make the helper function more useful and clean up calling code. There's a common pattern in the kernel whereby a struct cdev is placed in a structure along side a struct device which manages the life-cycle of both. In the naive approach, the reference counting is broken and the struct device can free everything before the chardev code is entirely released. Many developers have solved this problem by linking the internal kobjs in this fashion: cdev.kobj.parent = &parent_dev.kobj; The cdev code explicitly gets and puts a reference to it's kobj parent. So this seems like it was intended to be used this way. Dmitrty Torokhov first put this in place in 2012 with this commit: 2f0157f char_dev: pin parent kobject and the first instance of the fix was then done in the input subsystem in the following commit: 4a215aa Input: fix use-after-free introduced with dynamic minor changes Subsequently over the years, however, this issue seems to have tripped up multiple developers independently. For example, see these commits: 0d5b7da iio: Prevent race between IIO chardev opening and IIO device (by Lars-Peter Clausen in 2013) ba0ef85 tpm: Fix initialization of the cdev (by Jason Gunthorpe in 2015) 5b28dde [media] media: fix use-after-free in cdev_put() when app exits after driver unbind (by Shauh Khan in 2016) This technique is similarly done in at least 15 places within the kernel and probably should have been done so in another, at least, 5 places. The kobj line also looks very suspect in that one would not expect drivers to have to mess with kobject internals in this way. Even highly experienced kernel developers can be surprised by this code, as seen in [2]. To help alleviate this situation, and hopefully prevent future wasted effort on this problem, this patch introduces a helper function to register a char device along with its parent struct device. This creates a more regular API for tying a char device to its parent without the developer having to set members in the underlying kobject. This patch introduce cdev_device_add and cdev_device_del which replaces a common pattern including setting the kobj parent, calling cdev_add and then calling device_add. It also introduces cdev_set_parent for the few cases that set the kobject parent without using device_add. [1] https://lkml.org/lkml/2017/2/13/700 [2] https://lkml.org/lkml/2017/2/10/370 Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Hans Verkuil <hans.verkuil@cisco.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* vfs: fix do_last() regressionAl Viro2020-04-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6404674acd596de41fd3ad5f267b4525494a891a upstream. Brown paperbag time: fetching ->i_uid/->i_mode really should've been done from nd->inode. I even suggested that, but the reason for that has slipped through the cracks and I went for dir->d_inode instead - made for more "obvious" patch. Analysis: - at the entry into do_last() and all the way to step_into(): dir (aka nd->path.dentry) is known not to have been freed; so's nd->inode and it's equal to dir->d_inode unless we are already doomed to -ECHILD. inode of the file to get opened is not known. - after step_into(): inode of the file to get opened is known; dir might be pointing to freed memory/be negative/etc. - at the call of may_create_in_sticky(): guaranteed to be out of RCU mode; inode of the file to get opened is known and pinned; dir might be garbage. The last was the reason for the original patch. Except that at the do_last() entry we can be in RCU mode and it is possible that nd->path.dentry->d_inode has already changed under us. In that case we are going to fail with -ECHILD, but we need to be careful; nd->inode is pointing to valid struct inode and it's the same as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we should use that. Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com> Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk> Fixes: d0cb50185ae9 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* do_last(): fetch directory ->i_mode and ->i_uid before it's too lateAl Viro2020-04-281-7/+10
| | | | | | | | | | | commit d0cb50185ae942b03c4327be322055d622dc79f6 upstream. may_create_in_sticky() call is done when we already have dropped the reference to dir. Fixes: 30aba6656f61e (namei: allow restricted O_CREAT of FIFOs and regular files) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* namei: allow restricted O_CREAT of FIFOs and regular filesSalvatore Mesoraca2020-04-281-3/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 30aba6656f61ed44cba445a3c0d38b296fa9e8f5 upstream. Disallows open of FIFOs or regular files not owned by the user in world writable sticky directories, unless the owner is the same as that of the directory or the file is opened without the O_CREAT flag. The purpose is to make data spoofing attacks harder. This protection can be turned on and off separately for FIFOs and regular files via sysctl, just like the symlinks/hardlinks protection. This patch is based on Openwall's "HARDEN_FIFO" feature by Solar Designer. This is a brief list of old vulnerabilities that could have been prevented by this feature, some of them even allow for privilege escalation: CVE-2000-1134 CVE-2007-3852 CVE-2008-0525 CVE-2009-0416 CVE-2011-4834 CVE-2015-1838 CVE-2015-7442 CVE-2016-7489 This list is not meant to be complete. It's difficult to track down all vulnerabilities of this kind because they were often reported without any mention of this particular attack vector. In fact, before hardlinks/symlinks restrictions, fifos/regular files weren't the favorite vehicle to exploit them. [s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter] Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com [keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future] [keescook@chromium.org: adjust commit subjet] Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Solar Designer <solar@openwall.com> Suggested-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* chardev: Avoid potential use-after-free in 'chrdev_open()'Will Deacon2020-04-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 68faa679b8be1a74e6663c21c3a9d25d32f1c079 upstream. 'chrdev_open()' calls 'cdev_get()' to obtain a reference to the 'struct cdev *' stashed in the 'i_cdev' field of the target inode structure. If the pointer is NULL, then it is initialised lazily by looking up the kobject in the 'cdev_map' and so the whole procedure is protected by the 'cdev_lock' spinlock to serialise initialisation of the shared pointer. Unfortunately, it is possible for the initialising thread to fail *after* installing the new pointer, for example if the subsequent '->open()' call on the file fails. In this case, 'cdev_put()' is called, the reference count on the kobject is dropped and, if nobody else has taken a reference, the release function is called which finally clears 'inode->i_cdev' from 'cdev_purge()' before potentially freeing the object. The problem here is that a racing thread can happily take the 'cdev_lock' and see the non-NULL pointer in the inode, which can result in a refcount increment from zero and a warning: | ------------[ cut here ]------------ | refcount_t: addition on 0; use-after-free. | WARNING: CPU: 2 PID: 6385 at lib/refcount.c:25 refcount_warn_saturate+0x6d/0xf0 | Modules linked in: | CPU: 2 PID: 6385 Comm: repro Not tainted 5.5.0-rc2+ #22 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 | RIP: 0010:refcount_warn_saturate+0x6d/0xf0 | Code: 05 55 9a 15 01 01 e8 9d aa c8 ff 0f 0b c3 80 3d 45 9a 15 01 00 75 ce 48 c7 c7 00 9c 62 b3 c6 08 | RSP: 0018:ffffb524c1b9bc70 EFLAGS: 00010282 | RAX: 0000000000000000 RBX: ffff9e9da1f71390 RCX: 0000000000000000 | RDX: ffff9e9dbbd27618 RSI: ffff9e9dbbd18798 RDI: ffff9e9dbbd18798 | RBP: 0000000000000000 R08: 000000000000095f R09: 0000000000000039 | R10: 0000000000000000 R11: ffffb524c1b9bb20 R12: ffff9e9da1e8c700 | R13: ffffffffb25ee8b0 R14: 0000000000000000 R15: ffff9e9da1e8c700 | FS: 00007f3b87d26700(0000) GS:ffff9e9dbbd00000(0000) knlGS:0000000000000000 | CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 | CR2: 00007fc16909c000 CR3: 000000012df9c000 CR4: 00000000000006e0 | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 | Call Trace: | kobject_get+0x5c/0x60 | cdev_get+0x2b/0x60 | chrdev_open+0x55/0x220 | ? cdev_put.part.3+0x20/0x20 | do_dentry_open+0x13a/0x390 | path_openat+0x2c8/0x1470 | do_filp_open+0x93/0x100 | ? selinux_file_ioctl+0x17f/0x220 | do_sys_open+0x186/0x220 | do_syscall_64+0x48/0x150 | entry_SYSCALL_64_after_hwframe+0x44/0xa9 | RIP: 0033:0x7f3b87efcd0e | Code: 89 54 24 08 e8 a3 f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f4 | RSP: 002b:00007f3b87d259f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101 | RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3b87efcd0e | RDX: 0000000000000000 RSI: 00007f3b87d25a80 RDI: 00000000ffffff9c | RBP: 00007f3b87d25e90 R08: 0000000000000000 R09: 0000000000000000 | R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffe188f504e | R13: 00007ffe188f504f R14: 00007f3b87d26700 R15: 0000000000000000 | ---[ end trace 24f53ca58db8180a ]--- Since 'cdev_get()' can already fail to obtain a reference, simply move it over to use 'kobject_get_unless_zero()' instead of 'kobject_get()', which will cause the racing thread to return -ENXIO if the initialising thread fails unexpectedly. Cc: Hillf Danton <hdanton@sina.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+82defefbbd8527e1c2cb@syzkaller.appspotmail.com Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20191219120203.32691-1-will@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* locks: print unsigned ino in /proc/locksAmir Goldstein2020-04-281-1/+1
| | | | | | | | | | | commit 98ca480a8f22fdbd768e3dad07024c8d4856576c upstream. An ino is unsigned, so display it as such in /proc/locks. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: check for directory entries too close to block endJan Kara2020-04-281-0/+5
| | | | | | | | | | | | | | | commit 109ba779d6cca2d519c5dd624a3276d03e21948e upstream. ext4_check_dir_entry() currently does not catch a case when a directory entry ends so close to the block end that the header of the next directory entry would not fit in the remaining space. This can lead to directory iteration code trying to access address beyond end of current buffer head leading to oops. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: check rw_devices, not num_devices for balanceJosef Bacik2020-04-281-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b35cf1f0bf1f2b0b193093338414b9bd63b29015 upstream. The fstest btrfs/154 reports [ 8675.381709] BTRFS: Transaction aborted (error -28) [ 8675.383302] WARNING: CPU: 1 PID: 31900 at fs/btrfs/block-group.c:2038 btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.390925] CPU: 1 PID: 31900 Comm: btrfs Not tainted 5.5.0-rc6-default+ #935 [ 8675.392780] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 [ 8675.395452] RIP: 0010:btrfs_create_pending_block_groups+0x1e0/0x1f0 [btrfs] [ 8675.402672] RSP: 0018:ffffb2090888fb00 EFLAGS: 00010286 [ 8675.404413] RAX: 0000000000000000 RBX: ffff92026dfa91c8 RCX: 0000000000000001 [ 8675.406609] RDX: 0000000000000000 RSI: ffffffff8e100899 RDI: ffffffff8e100971 [ 8675.408775] RBP: ffff920247c61660 R08: 0000000000000000 R09: 0000000000000000 [ 8675.410978] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000ffffffe4 [ 8675.412647] R13: ffff92026db74000 R14: ffff920247c616b8 R15: ffff92026dfbc000 [ 8675.413994] FS: 00007fd5e57248c0(0000) GS:ffff92027d800000(0000) knlGS:0000000000000000 [ 8675.416146] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8675.417833] CR2: 0000564aa51682d8 CR3: 000000006dcbc004 CR4: 0000000000160ee0 [ 8675.419801] Call Trace: [ 8675.420742] btrfs_start_dirty_block_groups+0x355/0x480 [btrfs] [ 8675.422600] btrfs_commit_transaction+0xc8/0xaf0 [btrfs] [ 8675.424335] reset_balance_state+0x14a/0x190 [btrfs] [ 8675.425824] btrfs_balance.cold+0xe7/0x154 [btrfs] [ 8675.427313] ? kmem_cache_alloc_trace+0x235/0x2c0 [ 8675.428663] btrfs_ioctl_balance+0x298/0x350 [btrfs] [ 8675.430285] btrfs_ioctl+0x466/0x2550 [btrfs] [ 8675.431788] ? mem_cgroup_charge_statistics+0x51/0xf0 [ 8675.433487] ? mem_cgroup_commit_charge+0x56/0x400 [ 8675.435122] ? do_raw_spin_unlock+0x4b/0xc0 [ 8675.436618] ? _raw_spin_unlock+0x1f/0x30 [ 8675.438093] ? __handle_mm_fault+0x499/0x740 [ 8675.439619] ? do_vfs_ioctl+0x56e/0x770 [ 8675.441034] do_vfs_ioctl+0x56e/0x770 [ 8675.442411] ksys_ioctl+0x3a/0x70 [ 8675.443718] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 8675.445333] __x64_sys_ioctl+0x16/0x20 [ 8675.446705] do_syscall_64+0x50/0x210 [ 8675.448059] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 8675.479187] BTRFS: error (device vdb) in btrfs_create_pending_block_groups:2038: errno=-28 No space left We now use btrfs_can_overcommit() to see if we can flip a block group read only. Before this would fail because we weren't taking into account the usable un-allocated space for allocating chunks. With my patches we were allowed to do the balance, which is technically correct. The test is trying to start balance on degraded mount. So now we're trying to allocate a chunk and cannot because we want to allocate a RAID1 chunk, but there's only 1 device that's available for usage. This results in an ENOSPC. But we shouldn't even be making it this far, we don't have enough devices to restripe. The problem is we're using btrfs_num_devices(), that also includes missing devices. That's not actually what we want, we need to use rw_devices. The chunk_mutex is not needed here, rw_devices changes only in device add, remove or replace, all are excluded by EXCL_OP mechanism. Fixes: e4d8ec0f65b9 ("Btrfs: implement online profile changing") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add stacktrace, update changelog, drop chunk_mutex ] Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: do not delete mismatched root refsJosef Bacik2020-04-281-4/+6
| | | | | | | | | | | | | | | commit 423a716cd7be16fb08690760691befe3be97d3fc upstream. btrfs_del_root_ref() will simply WARN_ON() if the ref doesn't match in any way, and then continue to delete the reference. This shouldn't happen, we have these values because there's more to the reference than the original root and the sub root. If any of these checks fail, return -ENOENT. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: Remove redundant btrfs_release_path from btrfs_unlink_subvolLu Fengqi2020-04-281-1/+0
| | | | | | | | | | | | | | | commit 5b7d687ad5913a56b6a8788435d7a53990b4176d upstream. Although it is safe to call this on already released paths with no locks held or extent buffers, removing the redundant btrfs_release_path is reasonable. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16 as dependency of commit d49d3287e74f "btrfs: fix invalid removal of root ref"] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Btrfs: fix infinite loop during nocow writeback due to raceFilipe Manana2020-04-281-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit de7999afedff02c6631feab3ea726a0e8f8c3d40 upstream. When starting writeback for a range that covers part of a preallocated extent, due to a race with writeback for another range that also covers another part of the same preallocated extent, we can end up in an infinite loop. Consider the following example where for inode 280 we have two dirty ranges: range A, from 294912 to 303103, 8192 bytes range B, from 348160 to 438271, 90112 bytes and we have the following file extent item layout for our inode: leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5 (...) item 27 key (280 108 200704) itemoff 14598 itemsize 53 extent data disk bytenr 0 nr 0 type 1 (regular) extent data offset 0 nr 94208 ram 94208 item 28 key (280 108 294912) itemoff 14545 itemsize 53 extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc) extent data offset 0 nr 81920 ram 81920 Then the following happens: 1) Writeback starts for range B (from 348160 to 438271), execution of run_delalloc_nocow() starts; 2) The first iteration of run_delalloc_nocow()'s whil loop leaves us at the extent item at slot 28, pointing to the prealloc extent item covering the range from 294912 to 376831. This extent covers part of our range; 3) An ordered extent is created against that extent, covering the file range from 348160 to 376831 (28672 bytes); 4) We adjust 'cur_offset' to 376832 and move on to the next iteration of the while loop; 5) The call to btrfs_lookup_file_extent() leaves us at the same leaf, pointing to slot 29, 1 slot after the last item (the extent item we processed in the previous iteration); 6) Because we are a slot beyond the last item, we call btrfs_next_leaf(), which releases the search path before doing a another search for the last key of the leaf (280 108 294912); 7) Right after btrfs_next_leaf() released the path, and before it did another search for the last key of the leaf, writeback for the range A (from 294912 to 303103) completes (it was previously started at some point); 8) Upon completion of the ordered extent for range A, the prealloc extent we previously found got split into two extent items, one covering the range from 294912 to 303103 (8192 bytes), with a type of regular extent (and no longer prealloc) and another covering the range from 303104 to 376831 (73728 bytes), with a type of prealloc and an offset of 8192 bytes. So our leaf now has the following layout: leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5 (...) item 27 key (280 108 200704) itemoff 14598 itemsize 53 extent data disk bytenr 0 nr 0 type 1 extent data offset 0 nr 8192 ram 94208 item 28 key (280 108 208896) itemoff 14545 itemsize 53 extent data disk bytenr 10433142784 nr 86016 type 1 extent data offset 0 nr 86016 ram 86016 item 29 key (280 108 294912) itemoff 14492 itemsize 53 extent data disk bytenr 10433052672 nr 81920 type 1 extent data offset 0 nr 8192 ram 81920 item 30 key (280 108 303104) itemoff 14439 itemsize 53 extent data disk bytenr 10433052672 nr 81920 type 2 extent data offset 8192 nr 73728 ram 81920 9) After btrfs_next_leaf() returns, we have our path pointing to that same leaf and at slot 30, since it has a key we didn't have before and it's the first key greater then the key that was previously the last key of the leaf (key (280 108 294912)); 10) The extent item at slot 30 covers the range from 303104 to 376831 which is in our target range, so we process it, despite having already created an ordered extent against this extent for the file range from 348160 to 376831. This is because we skip to the next extent item only if its end is less than or equals to the start of our delalloc range, and not less than or equals to the current offset ('cur_offset'); 11) As a result we compute 'num_bytes' as: num_bytes = min(end + 1, extent_end) - cur_offset; = min(438271 + 1, 376832) - 376832 = 0 12) We then call create_io_em() for a 0 bytes range starting at offset 376832; 13) Then create_io_em() enters an infinite loop because its calls to btrfs_drop_extent_cache() do nothing due to the 0 length range passed to it. So no existing extent maps that cover the offset 376832 get removed, and therefore calls to add_extent_mapping() return -EEXIST, resulting in an infinite loop. This loop from create_io_em() is the following: do { btrfs_drop_extent_cache(BTRFS_I(inode), em->start, em->start + em->len - 1, 0); write_lock(&em_tree->lock); ret = add_extent_mapping(em_tree, em, 1); write_unlock(&em_tree->lock); /* * The caller has taken lock_extent(), who could race with us * to add em? */ } while (ret == -EEXIST); Also, each call to btrfs_drop_extent_cache() triggers a warning because the start offset passed to it (376832) is smaller then the end offset (376832 - 1) passed to it by -1, due to the 0 length: [258532.052621] ------------[ cut here ]------------ [258532.052643] WARNING: CPU: 0 PID: 9987 at fs/btrfs/file.c:602 btrfs_drop_extent_cache+0x3f4/0x590 [btrfs] (...) [258532.052672] CPU: 0 PID: 9987 Comm: fsx Tainted: G W 5.4.0-rc7-btrfs-next-64 #1 [258532.052673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [258532.052691] RIP: 0010:btrfs_drop_extent_cache+0x3f4/0x590 [btrfs] (...) [258532.052695] RSP: 0018:ffffb4be0153f860 EFLAGS: 00010287 [258532.052700] RAX: ffff975b445ee360 RBX: ffff975b44eb3e08 RCX: 0000000000000000 [258532.052700] RDX: 0000000000038fff RSI: 0000000000039000 RDI: ffff975b445ee308 [258532.052700] RBP: 0000000000038fff R08: 0000000000000000 R09: 0000000000000001 [258532.052701] R10: ffff975b513c5c10 R11: 00000000e3c0cfa9 R12: 0000000000039000 [258532.052703] R13: ffff975b445ee360 R14: 00000000ffffffef R15: ffff975b445ee308 [258532.052705] FS: 00007f86a821de80(0000) GS:ffff975b76a00000(0000) knlGS:0000000000000000 [258532.052707] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [258532.052708] CR2: 00007fdacf0f3ab4 CR3: 00000001f9d26002 CR4: 00000000003606f0 [258532.052712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [258532.052717] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [258532.052717] Call Trace: [258532.052718] ? preempt_schedule_common+0x32/0x70 [258532.052722] ? ___preempt_schedule+0x16/0x20 [258532.052741] create_io_em+0xff/0x180 [btrfs] [258532.052767] run_delalloc_nocow+0x942/0xb10 [btrfs] [258532.052791] btrfs_run_delalloc_range+0x30b/0x520 [btrfs] [258532.052812] ? find_lock_delalloc_range+0x221/0x250 [btrfs] [258532.052834] writepage_delalloc+0xe4/0x140 [btrfs] [258532.052855] __extent_writepage+0x110/0x4e0 [btrfs] [258532.052876] extent_write_cache_pages+0x21c/0x480 [btrfs] [258532.052906] extent_writepages+0x52/0xb0 [btrfs] [258532.052911] do_writepages+0x23/0x80 [258532.052915] __filemap_fdatawrite_range+0xd2/0x110 [258532.052938] btrfs_fdatawrite_range+0x1b/0x50 [btrfs] [258532.052954] start_ordered_ops+0x57/0xa0 [btrfs] [258532.052973] ? btrfs_sync_file+0x225/0x490 [btrfs] [258532.052988] btrfs_sync_file+0x225/0x490 [btrfs] [258532.052997] __x64_sys_msync+0x199/0x200 [258532.053004] do_syscall_64+0x5c/0x250 [258532.053007] entry_SYSCALL_64_after_hwframe+0x49/0xbe [258532.053010] RIP: 0033:0x7f86a7dfd760 (...) [258532.053014] RSP: 002b:00007ffd99af0368 EFLAGS: 00000246 ORIG_RAX: 000000000000001a [258532.053016] RAX: ffffffffffffffda RBX: 0000000000000ec9 RCX: 00007f86a7dfd760 [258532.053017] RDX: 0000000000000004 RSI: 000000000000836c RDI: 00007f86a8221000 [258532.053019] RBP: 0000000000021ec9 R08: 0000000000000003 R09: 00007f86a812037c [258532.053020] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000074a3 [258532.053021] R13: 00007f86a8221000 R14: 000000000000836c R15: 0000000000000001 [258532.053032] irq event stamp: 1653450494 [258532.053035] hardirqs last enabled at (1653450493): [<ffffffff9dec69f9>] _raw_spin_unlock_irq+0x29/0x50 [258532.053037] hardirqs last disabled at (1653450494): [<ffffffff9d4048ea>] trace_hardirqs_off_thunk+0x1a/0x20 [258532.053039] softirqs last enabled at (1653449852): [<ffffffff9e200466>] __do_softirq+0x466/0x6bd [258532.053042] softirqs last disabled at (1653449845): [<ffffffff9d4c8a0c>] irq_exit+0xec/0x120 [258532.053043] ---[ end trace 8476fce13d9ce20a ]--- Which results in flooding dmesg/syslog since btrfs_drop_extent_cache() uses WARN_ON() and not WARN_ON_ONCE(). So fix this issue by changing run_delalloc_nocow()'s loop to move to the next extent item when the current extent item ends at at offset less than or equals to the current offset instead of the start offset. Fixes: 80ff385665b7fc ("Btrfs: update nodatacow code v2") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: do not leak reloc root if we fail to read the fs rootJosef Bacik2020-04-281-0/+1
| | | | | | | | | | | | | | | | | commit ca1aa2818a53875cfdd175fb5e9a2984e997cce9 upstream. If we fail to read the fs root corresponding with a reloc root we'll just break out and free the reloc roots. But we remove our current reloc_root from this list higher up, which means we'll leak this reloc_root. Fix this by adding ourselves back to the reloc_roots list so we are properly cleaned up. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: skip log replay on orphaned rootsJosef Bacik2020-04-281-2/+22
| | | | | | | | | | | | | | | | | | | | | commit 9bc574de590510eff899c3ca8dbaf013566b5efe upstream. My fsstress modifications coupled with generic/475 uncovered a failure to mount and replay the log if we hit a orphaned root. We do not want to replay the log for an orphan root, but it's completely legitimate to have an orphaned root with a log attached. Fix this by simply skipping replaying the log. We still need to pin it's root node so that we do not overwrite it while replaying other logs, as we re-read the log root at every stage of the replay. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - Pass fs_info->extent_root, instead of fs_info, as first argument to btrfs_pin_extent_for_log_replay() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: handle ENOENT in btrfs_uuid_tree_iterateJosef Bacik2020-04-281-0/+2
| | | | | | | | | | | | | | | | commit 714cd3e8cba6841220dce9063a7388a81de03825 upstream. If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the uuid tree we'll just continue and do btrfs_next_item(). However we've done a btrfs_release_path() at this point and no longer have a valid path. So increment the key and go back and do a normal search. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: abort transaction after failed inode updates in create_subvolJosef Bacik2020-04-281-2/+8
| | | | | | | | | | | | | | | | | commit c7e54b5102bf3614cadb9ca32d7be73bad6cecf0 upstream. We can just abort the transaction here, and in fact do that for every other failure in this function except these two cases. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: - Add root as second argument to btrfs_abort_transaction() - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* Btrfs: fix removal logic of the tree mod log that leads to use-after-free issuesFilipe Manana2020-04-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6609fee8897ac475378388238456c84298bff802 upstream. When a tree mod log user no longer needs to use the tree it calls btrfs_put_tree_mod_seq() to remove itself from the list of users and delete all no longer used elements of the tree's red black tree, which should be all elements with a sequence number less then our equals to the caller's sequence number. However the logic is broken because it can delete and free elements from the red black tree that have a sequence number greater then the caller's sequence number: 1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the tree mod log; 2) The task which got assigned the sequence number 1 calls btrfs_put_tree_mod_seq(); 3) Sequence number 1 is deleted from the list of sequence numbers; 4) The current minimum sequence number is computed to be the sequence number 2; 5) A task using sequence number 2 is at tree_mod_log_rewind() and gets a pointer to one of its elements from the red black tree through a call to tree_mod_log_search(); 6) The task with sequence number 1 iterates the red black tree of tree modification elements and deletes (and frees) all elements with a sequence number less then or equals to 2 (the computed minimum sequence number) - it ends up only leaving elements with sequence numbers of 3 and 4; 7) The task with sequence number 2 now uses the pointer to its element, already freed by the other task, at __tree_mod_log_rewind(), resulting in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces a trace like the following: [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1 [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [16804.548666] RIP: 0010:rb_next+0x16/0x50 (...) [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202 [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600 [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000 [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8 [16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000 [16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0 [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [16804.557583] Call Trace: [16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs] [16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs] [16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs] [16804.560700] find_parent_nodes+0x388/0x1120 [btrfs] [16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs] [16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs] [16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs] [16804.563112] ? __might_fault+0x11/0x90 [16804.563706] do_vfs_ioctl+0x45a/0x700 [16804.564299] ksys_ioctl+0x70/0x80 [16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20 [16804.565461] __x64_sys_ioctl+0x16/0x20 [16804.566020] do_syscall_64+0x5c/0x250 [16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe [16804.567153] RIP: 0033:0x7f4b1ba2add7 (...) [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7 [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003 [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44 [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48 [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50 (...) [16804.575623] ---[ end trace 87317359aad4ba50 ]--- Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that have a sequence number equals to the computed minimum sequence number, and not just elements with a sequence number greater then that minimum. Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: do not call synchronize_srcu() in inode_tree_delJosef Bacik2020-04-281-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f72ff01df9cf5db25c76674cac16605992d15467 upstream. Testing with the new fsstress uncovered a pretty nasty deadlock with lookup and snapshot deletion. Process A unlink -> final iput -> inode_tree_del -> synchronize_srcu(subvol_srcu) Process B btrfs_lookup <- srcu_read_lock() acquired here -> btrfs_iget -> find inode that has I_FREEING set -> __wait_on_freeing_inode() We're holding the srcu_read_lock() while doing the iget in order to make sure our fs root doesn't go away, and then we are waiting for the inode to finish freeing. However because the free'ing process is doing a synchronize_srcu() we deadlock. Fix this by dropping the synchronize_srcu() in inode_tree_del(). We don't need people to stop accessing the fs root at this point, we're only adding our empty root to the dead roots list. A larger much more invasive fix is forthcoming to address how we deal with fs roots, but this fixes the immediate problem. Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 3.16: No fs_info variable was used here] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: ensure that a DUP or RAID1 block group has exactly two stripesJohannes Thumshirn2020-04-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 349ae63f40638a28c6fce52e8447c2d14b84cc0c upstream. We recently had a customer issue with a corrupted filesystem. When trying to mount this image btrfs panicked with a division by zero in calc_stripe_length(). The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length() takes this value and divides it by the number of copies the RAID profile is expected to have to calculate the amount of data stripes. As a DUP profile is expected to have 2 copies this division resulted in 1/2 = 0. Later then the 'data_stripes' variable is used as a divisor in the stripe length calculation which results in a division by 0 and thus a kernel panic. When encountering a filesystem with a DUP block group and a 'num_stripes' value unequal to 2, refuse mounting as the image is corrupted and will lead to unexpected behaviour. Code inspection showed a RAID1 block group has the same issues. Fixes: e06cd3dd7cea ("Btrfs: add validadtion checks for chunk loading") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: tree-checker: Fix misleading group system informationShaokun Zhang2020-04-281-1/+1
| | | | | | | | | | | | | | | | | | | commit 761333f2f50ccc887aa9957ae829300262c0d15b upstream. block_group_err shows the group system as a decimal value with a '0x' prefix, which is somewhat misleading. Fix it to print hexadecimal, as was intended. Fixes: fce466eab7ac6 ("btrfs: tree-checker: Verify block_group_item") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: tree-checker: Check level for leaves and nodesQu Wenruo2020-04-281-0/+14
| | | | | | | | | | | | | | | | | | | | | | | commit f556faa46eb4e96d0d0772e74ecf66781e132f72 upstream. Although we have tree level check at tree read runtime, it's completely based on its parent level. We still need to do accurate level check to avoid invalid tree blocks sneak into kernel space. The check itself is simple, for leaf its level should always be 0. For nodes its level should be in range [1, BTRFS_MAX_LEVEL - 1]. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 4.4: - Pass root instead of fs_info to generic_err() - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: Verify that every chunk has corresponding block group at mount timeQu Wenruo2020-04-281-1/+57
| | | | | | | | | | | | | | | | | | | | | | | | | commit 7ef49515fa6727cb4b6f2f5b0ffbc5fc20a9f8c6 upstream. If a crafted image has missing block group items, it could cause unexpected behavior and breaks the assumption of 1:1 chunk<->block group mapping. Although we have the block group -> chunk mapping check, we still need chunk -> block group mapping check. This patch will do extra check to ensure each chunk has its corresponding block group. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847 Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 4.4: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: Check that each block group has corresponding chunk at mount timeQu Wenruo2020-04-281-1/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 514c7dca85a0bf40be984dab0b477403a6db901f upstream. A crafted btrfs image with incorrect chunk<->block group mapping will trigger a lot of unexpected things as the mapping is essential. Although the problem can be caught by block group item checker added in "btrfs: tree-checker: Verify block_group_item", it's still not sufficient. A sufficiently valid block group item can pass the check added by the mentioned patch but could fail to match the existing chunk. This patch will add extra block group -> chunk mapping check, to ensure we have a completely matching (start, len, flags) chunk for each block group at mount time. Here we reuse the original helper find_first_block_group(), which is already doing the basic bg -> chunk checks, adding further checks of the start/len and type flags. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837 Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 4.4: Use root->fs_info instead of fs_info] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: validate type when reading a chunkGu Jinxiang2020-04-281-0/+28
| | | | | | | | | | | | | | | | | | | | commit 315409b0098fb2651d86553f0436b70502b29bb2 upstream. Reported in https://bugzilla.kernel.org/show_bug.cgi?id=199839, with an image that has an invalid chunk type but does not return an error. Add chunk type check in btrfs_check_chunk_valid, to detect the wrong type combinations. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199839 Reported-by: Xu Wen <wen.xu@gatech.edu> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 4.4: Use root->fs_info instead of fs_info] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: tree-checker: Detect invalid and empty essential treesQu Wenruo2020-04-281-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ba480dd4db9f1798541eb2d1c423fc95feee8d36 upstream. A crafted image has empty root tree block, which will later cause NULL pointer dereference. The following trees should never be empty: 1) Tree root Must contain at least root items for extent tree, device tree and fs tree 2) Chunk tree Or we can't even bootstrap as it contains the mapping. 3) Fs tree At least inode item for top level inode (.). 4) Device tree Dev extents for chunks 5) Extent tree Must have corresponding extent for each chunk. If any of them is empty, we are sure the fs is corrupted and no need to mount it. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847 Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 4.4: Pass root instead of fs_info to generic_err()] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: tree-checker: Verify block_group_itemQu Wenruo2020-04-284-1/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit fce466eab7ac6baa9d2dcd88abcf945be3d4a089 upstream. A crafted image with invalid block group items could make free space cache code to cause panic. We could detect such invalid block group item by checking: 1) Item size Known fixed value. 2) Block group size (key.offset) We have an upper limit on block group item (10G) 3) Chunk objectid Known fixed value. 4) Type Only 4 valid type values, DATA, METADATA, SYSTEM and DATA|METADATA. No more than 1 bit set for profile type. 5) Used space No more than the block group size. This should allow btrfs to detect and refuse to mount the crafted image. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849 Reported-by: Xu Wen <wen.xu@gatech.edu> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> [bwh: Backported to 4.4: - In check_leaf_item(), pass root->fs_info to check_block_group_item() - Include <linux/sizes.h> (in ctree.h, to match upstream) - Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* btrfs: tree-check: reduce stack consumption in check_dir_itemDavid Sterba2020-04-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | | commit e2683fc9d219430f5b78889b50cde7f40efeba7b upstream. I've noticed that the updated item checker stack consumption increased dramatically in 542f5385e20cf97447 ("btrfs: tree-checker: Add checker for dir item") tree-checker.c:check_leaf +552 (176 -> 728) The array is 255 bytes long, dynamic allocation would slow down the sanity checks so it's more reasonable to keep it on-stack. Moving the variable to the scope of use reduces the stack usage again tree-checker.c:check_leaf -264 (728 -> 464) Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>