summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* fs/ntfs3: Fix slab-out-of-bounds read in run_unpackHawkins Jiawei2022-09-301-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzkaller reports slab-out-of-bounds bug as follows: ================================================================== BUG: KASAN: slab-out-of-bounds in run_unpack+0x8b7/0x970 fs/ntfs3/run.c:944 Read of size 1 at addr ffff88801bbdff02 by task syz-executor131/3611 [...] Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:317 [inline] print_report.cold+0x2ba/0x719 mm/kasan/report.c:433 kasan_report+0xb1/0x1e0 mm/kasan/report.c:495 run_unpack+0x8b7/0x970 fs/ntfs3/run.c:944 run_unpack_ex+0xb0/0x7c0 fs/ntfs3/run.c:1057 ntfs_read_mft fs/ntfs3/inode.c:368 [inline] ntfs_iget5+0xc20/0x3280 fs/ntfs3/inode.c:501 ntfs_loadlog_and_replay+0x124/0x5d0 fs/ntfs3/fsntfs.c:272 ntfs_fill_super+0x1eff/0x37f0 fs/ntfs3/super.c:1018 get_tree_bdev+0x440/0x760 fs/super.c:1323 vfs_get_tree+0x89/0x2f0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x1326/0x1e20 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] </TASK> The buggy address belongs to the physical page: page:ffffea00006ef600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1bbd8 head:ffffea00006ef600 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88801bbdfe00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88801bbdfe80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88801bbdff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff88801bbdff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88801bbe0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Kernel will tries to read record and parse MFT from disk in ntfs_read_mft(). Yet the problem is that during enumerating attributes in record, kernel doesn't check whether run_off field loading from the disk is a valid value. To be more specific, if attr->nres.run_off is larger than attr->size, kernel will passes an invalid argument run_buf_size in run_unpack_ex(), which having an integer overflow. Then this invalid argument will triggers the slab-out-of-bounds Read bug as above. This patch solves it by adding the sanity check between the offset to packed runs and attribute size. link: https://lore.kernel.org/all/0000000000009145fc05e94bd5c3@google.com/#t Reported-and-tested-by: syzbot+8d6fbb27a6aded64b25b@syzkaller.appspotmail.com Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Validate resident attribute nameEdward Lo2022-09-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Though we already have some sanity checks while enumerating attributes, resident attribute names aren't included. This patch checks the resident attribute names are in the valid ranges. [ 259.209031] BUG: KASAN: slab-out-of-bounds in ni_create_attr_list+0x1e1/0x850 [ 259.210770] Write of size 426 at addr ffff88800632f2b2 by task exp/255 [ 259.211551] [ 259.212035] CPU: 0 PID: 255 Comm: exp Not tainted 6.0.0-rc6 #37 [ 259.212955] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 259.214387] Call Trace: [ 259.214640] <TASK> [ 259.214895] dump_stack_lvl+0x49/0x63 [ 259.215284] print_report.cold+0xf5/0x689 [ 259.215565] ? kasan_poison+0x3c/0x50 [ 259.215778] ? kasan_unpoison+0x28/0x60 [ 259.215991] ? ni_create_attr_list+0x1e1/0x850 [ 259.216270] kasan_report+0xa7/0x130 [ 259.216481] ? ni_create_attr_list+0x1e1/0x850 [ 259.216719] kasan_check_range+0x15a/0x1d0 [ 259.216939] memcpy+0x3c/0x70 [ 259.217136] ni_create_attr_list+0x1e1/0x850 [ 259.217945] ? __rcu_read_unlock+0x5b/0x280 [ 259.218384] ? ni_remove_attr+0x2e0/0x2e0 [ 259.218712] ? kernel_text_address+0xcf/0xe0 [ 259.219064] ? __kernel_text_address+0x12/0x40 [ 259.219434] ? arch_stack_walk+0x9e/0xf0 [ 259.219668] ? __this_cpu_preempt_check+0x13/0x20 [ 259.219904] ? sysvec_apic_timer_interrupt+0x57/0xc0 [ 259.220140] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 [ 259.220561] ni_ins_attr_ext+0x52c/0x5c0 [ 259.220984] ? ni_create_attr_list+0x850/0x850 [ 259.221532] ? run_deallocate+0x120/0x120 [ 259.221972] ? vfs_setxattr+0x128/0x300 [ 259.222688] ? setxattr+0x126/0x140 [ 259.222921] ? path_setxattr+0x164/0x180 [ 259.223431] ? __x64_sys_setxattr+0x6d/0x80 [ 259.223828] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 259.224417] ? mi_find_attr+0x3c/0xf0 [ 259.224772] ni_insert_attr+0x1ba/0x420 [ 259.225216] ? ni_ins_attr_ext+0x5c0/0x5c0 [ 259.225504] ? ntfs_read_ea+0x119/0x450 [ 259.225775] ni_insert_resident+0xc0/0x1c0 [ 259.226316] ? ni_insert_nonresident+0x400/0x400 [ 259.227001] ? __kasan_kmalloc+0x88/0xb0 [ 259.227468] ? __kmalloc+0x192/0x320 [ 259.227773] ntfs_set_ea+0x6bf/0xb30 [ 259.228216] ? ftrace_graph_ret_addr+0x2a/0xb0 [ 259.228494] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 259.228838] ? ntfs_read_ea+0x450/0x450 [ 259.229098] ? is_bpf_text_address+0x24/0x40 [ 259.229418] ? kernel_text_address+0xcf/0xe0 [ 259.229681] ? __kernel_text_address+0x12/0x40 [ 259.229948] ? unwind_get_return_address+0x3a/0x60 [ 259.230271] ? write_profile+0x270/0x270 [ 259.230537] ? arch_stack_walk+0x9e/0xf0 [ 259.230836] ntfs_setxattr+0x114/0x5c0 [ 259.231099] ? ntfs_set_acl_ex+0x2e0/0x2e0 [ 259.231529] ? evm_protected_xattr_common+0x6d/0x100 [ 259.231817] ? posix_xattr_acl+0x13/0x80 [ 259.232073] ? evm_protect_xattr+0x1f7/0x440 [ 259.232351] __vfs_setxattr+0xda/0x120 [ 259.232635] ? xattr_resolve_name+0x180/0x180 [ 259.232912] __vfs_setxattr_noperm+0x93/0x300 [ 259.233219] __vfs_setxattr_locked+0x141/0x160 [ 259.233492] ? kasan_poison+0x3c/0x50 [ 259.233744] vfs_setxattr+0x128/0x300 [ 259.234002] ? __vfs_setxattr_locked+0x160/0x160 [ 259.234837] do_setxattr+0xb8/0x170 [ 259.235567] ? vmemdup_user+0x53/0x90 [ 259.236212] setxattr+0x126/0x140 [ 259.236491] ? do_setxattr+0x170/0x170 [ 259.236791] ? debug_smp_processor_id+0x17/0x20 [ 259.237232] ? kasan_quarantine_put+0x57/0x180 [ 259.237605] ? putname+0x80/0xa0 [ 259.237870] ? __kasan_slab_free+0x11c/0x1b0 [ 259.238234] ? putname+0x80/0xa0 [ 259.238500] ? preempt_count_sub+0x18/0xc0 [ 259.238775] ? __mnt_want_write+0xaa/0x100 [ 259.238990] ? mnt_want_write+0x8b/0x150 [ 259.239290] path_setxattr+0x164/0x180 [ 259.239605] ? setxattr+0x140/0x140 [ 259.239849] ? debug_smp_processor_id+0x17/0x20 [ 259.240174] ? fpregs_assert_state_consistent+0x67/0x80 [ 259.240411] __x64_sys_setxattr+0x6d/0x80 [ 259.240715] do_syscall_64+0x3b/0x90 [ 259.240934] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 259.241697] RIP: 0033:0x7fc6b26e4469 [ 259.242647] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 088 [ 259.244512] RSP: 002b:00007ffc3c7841f8 EFLAGS: 00000217 ORIG_RAX: 00000000000000bc [ 259.245086] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc6b26e4469 [ 259.246025] RDX: 00007ffc3c784380 RSI: 00007ffc3c7842e0 RDI: 00007ffc3c784238 [ 259.246961] RBP: 00007ffc3c788410 R08: 0000000000000001 R09: 00007ffc3c7884f8 [ 259.247775] R10: 000000000000007f R11: 0000000000000217 R12: 00000000004004e0 [ 259.248534] R13: 00007ffc3c7884f0 R14: 0000000000000000 R15: 0000000000000000 [ 259.249368] </TASK> [ 259.249644] [ 259.249888] Allocated by task 255: [ 259.250283] kasan_save_stack+0x26/0x50 [ 259.250957] __kasan_kmalloc+0x88/0xb0 [ 259.251826] __kmalloc+0x192/0x320 [ 259.252745] ni_create_attr_list+0x11e/0x850 [ 259.253298] ni_ins_attr_ext+0x52c/0x5c0 [ 259.253685] ni_insert_attr+0x1ba/0x420 [ 259.253974] ni_insert_resident+0xc0/0x1c0 [ 259.254311] ntfs_set_ea+0x6bf/0xb30 [ 259.254629] ntfs_setxattr+0x114/0x5c0 [ 259.254859] __vfs_setxattr+0xda/0x120 [ 259.255155] __vfs_setxattr_noperm+0x93/0x300 [ 259.255445] __vfs_setxattr_locked+0x141/0x160 [ 259.255862] vfs_setxattr+0x128/0x300 [ 259.256251] do_setxattr+0xb8/0x170 [ 259.256522] setxattr+0x126/0x140 [ 259.256911] path_setxattr+0x164/0x180 [ 259.257308] __x64_sys_setxattr+0x6d/0x80 [ 259.257637] do_syscall_64+0x3b/0x90 [ 259.257970] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 259.258550] [ 259.258772] The buggy address belongs to the object at ffff88800632f000 [ 259.258772] which belongs to the cache kmalloc-1k of size 1024 [ 259.260190] The buggy address is located 690 bytes inside of [ 259.260190] 1024-byte region [ffff88800632f000, ffff88800632f400) [ 259.261412] [ 259.261743] The buggy address belongs to the physical page: [ 259.262354] page:0000000081e8cac9 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x632c [ 259.263722] head:0000000081e8cac9 order:2 compound_mapcount:0 compound_pincount:0 [ 259.264284] flags: 0xfffffc0010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff) [ 259.265312] raw: 000fffffc0010200 ffffea0000060d00 dead000000000004 ffff888001041dc0 [ 259.265772] raw: 0000000000000000 0000000080080008 00000001ffffffff 0000000000000000 [ 259.266305] page dumped because: kasan: bad access detected [ 259.266588] [ 259.266728] Memory state around the buggy address: [ 259.267225] ffff88800632f300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 259.267841] ffff88800632f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 259.269111] >ffff88800632f400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 259.269626] ^ [ 259.270162] ffff88800632f480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 259.270810] ffff88800632f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Signed-off-by: Edward Lo <edward.lo@ambergroup.io> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Validate buffer length while parsing indexEdward Lo2022-09-301-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | indx_read is called when we have some NTFS directory operations that need more information from the index buffers. This adds a sanity check to make sure the returned index buffer length is legit, or we may have some out-of-bound memory accesses. [ 560.897595] BUG: KASAN: slab-out-of-bounds in hdr_find_e.isra.0+0x10c/0x320 [ 560.898321] Read of size 2 at addr ffff888009497238 by task exp/245 [ 560.898760] [ 560.899129] CPU: 0 PID: 245 Comm: exp Not tainted 6.0.0-rc6 #37 [ 560.899505] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 560.900170] Call Trace: [ 560.900407] <TASK> [ 560.900732] dump_stack_lvl+0x49/0x63 [ 560.901108] print_report.cold+0xf5/0x689 [ 560.901395] ? hdr_find_e.isra.0+0x10c/0x320 [ 560.901716] kasan_report+0xa7/0x130 [ 560.901950] ? hdr_find_e.isra.0+0x10c/0x320 [ 560.902208] __asan_load2+0x68/0x90 [ 560.902427] hdr_find_e.isra.0+0x10c/0x320 [ 560.902846] ? cmp_uints+0xe0/0xe0 [ 560.903363] ? cmp_sdh+0x90/0x90 [ 560.903883] ? ntfs_bread_run+0x190/0x190 [ 560.904196] ? rwsem_down_read_slowpath+0x750/0x750 [ 560.904969] ? ntfs_fix_post_read+0xe0/0x130 [ 560.905259] ? __kasan_check_write+0x14/0x20 [ 560.905599] ? up_read+0x1a/0x90 [ 560.905853] ? indx_read+0x22c/0x380 [ 560.906096] indx_find+0x2ef/0x470 [ 560.906352] ? indx_find_buffer+0x2d0/0x2d0 [ 560.906692] ? __kasan_kmalloc+0x88/0xb0 [ 560.906977] dir_search_u+0x196/0x2f0 [ 560.907220] ? ntfs_nls_to_utf16+0x450/0x450 [ 560.907464] ? __kasan_check_write+0x14/0x20 [ 560.907747] ? mutex_lock+0x8f/0xe0 [ 560.907970] ? __mutex_lock_slowpath+0x20/0x20 [ 560.908214] ? kmem_cache_alloc+0x143/0x4b0 [ 560.908459] ntfs_lookup+0xe0/0x100 [ 560.908788] __lookup_slow+0x116/0x220 [ 560.909050] ? lookup_fast+0x1b0/0x1b0 [ 560.909309] ? lookup_fast+0x13f/0x1b0 [ 560.909601] walk_component+0x187/0x230 [ 560.909944] link_path_walk.part.0+0x3f0/0x660 [ 560.910285] ? handle_lookup_down+0x90/0x90 [ 560.910618] ? path_init+0x642/0x6e0 [ 560.911084] ? percpu_counter_add_batch+0x6e/0xf0 [ 560.912559] ? __alloc_file+0x114/0x170 [ 560.913008] path_openat+0x19c/0x1d10 [ 560.913419] ? getname_flags+0x73/0x2b0 [ 560.913815] ? kasan_save_stack+0x3a/0x50 [ 560.914125] ? kasan_save_stack+0x26/0x50 [ 560.914542] ? __kasan_slab_alloc+0x6d/0x90 [ 560.914924] ? kmem_cache_alloc+0x143/0x4b0 [ 560.915339] ? getname_flags+0x73/0x2b0 [ 560.915647] ? getname+0x12/0x20 [ 560.916114] ? __x64_sys_open+0x4c/0x60 [ 560.916460] ? path_lookupat.isra.0+0x230/0x230 [ 560.916867] ? __isolate_free_page+0x2e0/0x2e0 [ 560.917194] do_filp_open+0x15c/0x1f0 [ 560.917448] ? may_open_dev+0x60/0x60 [ 560.917696] ? expand_files+0xa4/0x3a0 [ 560.917923] ? __kasan_check_write+0x14/0x20 [ 560.918185] ? _raw_spin_lock+0x88/0xdb [ 560.918409] ? _raw_spin_lock_irqsave+0x100/0x100 [ 560.918783] ? _find_next_bit+0x4a/0x130 [ 560.919026] ? _raw_spin_unlock+0x19/0x40 [ 560.919276] ? alloc_fd+0x14b/0x2d0 [ 560.919635] do_sys_openat2+0x32a/0x4b0 [ 560.920035] ? file_open_root+0x230/0x230 [ 560.920336] ? __rcu_read_unlock+0x5b/0x280 [ 560.920813] do_sys_open+0x99/0xf0 [ 560.921208] ? filp_open+0x60/0x60 [ 560.921482] ? exit_to_user_mode_prepare+0x49/0x180 [ 560.921867] __x64_sys_open+0x4c/0x60 [ 560.922128] do_syscall_64+0x3b/0x90 [ 560.922369] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 560.923030] RIP: 0033:0x7f7dff2e4469 [ 560.923681] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 088 [ 560.924451] RSP: 002b:00007ffd41a210b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000002 [ 560.925168] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7dff2e4469 [ 560.925655] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 00007ffd41a211f0 [ 560.926085] RBP: 00007ffd41a252a0 R08: 00007f7dff60fba0 R09: 00007ffd41a25388 [ 560.926405] R10: 0000000000400b80 R11: 0000000000000206 R12: 00000000004004e0 [ 560.926867] R13: 00007ffd41a25380 R14: 0000000000000000 R15: 0000000000000000 [ 560.927241] </TASK> [ 560.927491] [ 560.927755] Allocated by task 245: [ 560.928409] kasan_save_stack+0x26/0x50 [ 560.929271] __kasan_kmalloc+0x88/0xb0 [ 560.929778] __kmalloc+0x192/0x320 [ 560.930023] indx_read+0x249/0x380 [ 560.930224] indx_find+0x2a2/0x470 [ 560.930695] dir_search_u+0x196/0x2f0 [ 560.930892] ntfs_lookup+0xe0/0x100 [ 560.931115] __lookup_slow+0x116/0x220 [ 560.931323] walk_component+0x187/0x230 [ 560.931570] link_path_walk.part.0+0x3f0/0x660 [ 560.931791] path_openat+0x19c/0x1d10 [ 560.932008] do_filp_open+0x15c/0x1f0 [ 560.932226] do_sys_openat2+0x32a/0x4b0 [ 560.932413] do_sys_open+0x99/0xf0 [ 560.932709] __x64_sys_open+0x4c/0x60 [ 560.933417] do_syscall_64+0x3b/0x90 [ 560.933776] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 560.934235] [ 560.934486] The buggy address belongs to the object at ffff888009497000 [ 560.934486] which belongs to the cache kmalloc-512 of size 512 [ 560.935239] The buggy address is located 56 bytes to the right of [ 560.935239] 512-byte region [ffff888009497000, ffff888009497200) [ 560.936153] [ 560.937326] The buggy address belongs to the physical page: [ 560.938228] page:0000000062a3dfae refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9496 [ 560.939616] head:0000000062a3dfae order:1 compound_mapcount:0 compound_pincount:0 [ 560.940219] flags: 0xfffffc0010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff) [ 560.942702] raw: 000fffffc0010200 ffffea0000164f80 dead000000000005 ffff888001041c80 [ 560.943932] raw: 0000000000000000 0000000080080008 00000001ffffffff 0000000000000000 [ 560.944568] page dumped because: kasan: bad access detected [ 560.945735] [ 560.946112] Memory state around the buggy address: [ 560.946870] ffff888009497100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 560.947242] ffff888009497180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 560.947611] >ffff888009497200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 560.947915] ^ [ 560.948249] ffff888009497280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 560.948687] ffff888009497300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Signed-off-by: Edward Lo <edward.lo@ambergroup.io> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Harden against integer overflowsDan Carpenter2022-09-301-1/+1
| | | | | | | | | Smatch complains that the "add_bytes" is not to be trusted. Use size_add() to prevent an integer overflow. Fixes: be71b5cba2e6 ("fs/ntfs3: Add attrib operations") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Validate attribute name offsetEdward Lo2022-09-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although the attribute name length is checked before comparing it to some common names (e.g., $I30), the offset isn't. This adds a sanity check for the attribute name offset, guarantee the validity and prevent possible out-of-bound memory accesses. [ 191.720056] BUG: unable to handle page fault for address: ffffebde00000008 [ 191.721060] #PF: supervisor read access in kernel mode [ 191.721586] #PF: error_code(0x0000) - not-present page [ 191.722079] PGD 0 P4D 0 [ 191.722571] Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 191.723179] CPU: 0 PID: 244 Comm: mount Not tainted 6.0.0-rc4 #28 [ 191.723749] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 191.724832] RIP: 0010:kfree+0x56/0x3b0 [ 191.725870] Code: 80 48 01 d8 0f 82 65 03 00 00 48 c7 c2 00 00 00 80 48 2b 15 2c 06 dd 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 0a 069 [ 191.727375] RSP: 0018:ffff8880076f7878 EFLAGS: 00000286 [ 191.727897] RAX: ffffebde00000000 RBX: 0000000000000040 RCX: ffffffff8528d5b9 [ 191.728531] RDX: 0000777f80000000 RSI: ffffffff8522d49c RDI: 0000000000000040 [ 191.729183] RBP: ffff8880076f78a0 R08: 0000000000000000 R09: 0000000000000000 [ 191.729628] R10: ffff888008949fd8 R11: ffffed10011293fd R12: 0000000000000040 [ 191.730158] R13: ffff888008949f98 R14: ffff888008949ec0 R15: ffff888008949fb0 [ 191.730645] FS: 00007f3520cd7e40(0000) GS:ffff88805ba00000(0000) knlGS:0000000000000000 [ 191.731328] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 191.731667] CR2: ffffebde00000008 CR3: 0000000009704000 CR4: 00000000000006f0 [ 191.732568] Call Trace: [ 191.733231] <TASK> [ 191.733860] kvfree+0x2c/0x40 [ 191.734632] ni_clear+0x180/0x290 [ 191.735085] ntfs_evict_inode+0x45/0x70 [ 191.735495] evict+0x199/0x280 [ 191.735996] iput.part.0+0x286/0x320 [ 191.736438] iput+0x32/0x50 [ 191.736811] iget_failed+0x23/0x30 [ 191.737270] ntfs_iget5+0x337/0x1890 [ 191.737629] ? ntfs_clear_mft_tail+0x20/0x260 [ 191.738201] ? ntfs_get_block_bmap+0x70/0x70 [ 191.738482] ? ntfs_objid_init+0xf6/0x140 [ 191.738779] ? ntfs_reparse_init+0x140/0x140 [ 191.739266] ntfs_fill_super+0x121b/0x1b50 [ 191.739623] ? put_ntfs+0x1d0/0x1d0 [ 191.739984] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 [ 191.740466] ? put_ntfs+0x1d0/0x1d0 [ 191.740787] ? sb_set_blocksize+0x6a/0x80 [ 191.741272] get_tree_bdev+0x232/0x370 [ 191.741829] ? put_ntfs+0x1d0/0x1d0 [ 191.742669] ntfs_fs_get_tree+0x15/0x20 [ 191.743132] vfs_get_tree+0x4c/0x130 [ 191.743457] path_mount+0x654/0xfe0 [ 191.743938] ? putname+0x80/0xa0 [ 191.744271] ? finish_automount+0x2e0/0x2e0 [ 191.744582] ? putname+0x80/0xa0 [ 191.745053] ? kmem_cache_free+0x1c4/0x440 [ 191.745403] ? putname+0x80/0xa0 [ 191.745616] do_mount+0xd6/0xf0 [ 191.745887] ? path_mount+0xfe0/0xfe0 [ 191.746287] ? __kasan_check_write+0x14/0x20 [ 191.746582] __x64_sys_mount+0xca/0x110 [ 191.746850] do_syscall_64+0x3b/0x90 [ 191.747122] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 191.747517] RIP: 0033:0x7f351fee948a [ 191.748332] Code: 48 8b 0d 11 fa 2a 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 008 [ 191.749341] RSP: 002b:00007ffd51cf3af8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 [ 191.749960] RAX: ffffffffffffffda RBX: 000055b903733060 RCX: 00007f351fee948a [ 191.750589] RDX: 000055b903733260 RSI: 000055b9037332e0 RDI: 000055b90373bce0 [ 191.751115] RBP: 0000000000000000 R08: 000055b903733280 R09: 0000000000000020 [ 191.751537] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 000055b90373bce0 [ 191.751946] R13: 000055b903733260 R14: 0000000000000000 R15: 00000000ffffffff [ 191.752519] </TASK> [ 191.752782] Modules linked in: [ 191.753785] CR2: ffffebde00000008 [ 191.754937] ---[ end trace 0000000000000000 ]--- [ 191.755429] RIP: 0010:kfree+0x56/0x3b0 [ 191.755725] Code: 80 48 01 d8 0f 82 65 03 00 00 48 c7 c2 00 00 00 80 48 2b 15 2c 06 dd 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 0a 069 [ 191.756744] RSP: 0018:ffff8880076f7878 EFLAGS: 00000286 [ 191.757218] RAX: ffffebde00000000 RBX: 0000000000000040 RCX: ffffffff8528d5b9 [ 191.757580] RDX: 0000777f80000000 RSI: ffffffff8522d49c RDI: 0000000000000040 [ 191.758016] RBP: ffff8880076f78a0 R08: 0000000000000000 R09: 0000000000000000 [ 191.758570] R10: ffff888008949fd8 R11: ffffed10011293fd R12: 0000000000000040 [ 191.758957] R13: ffff888008949f98 R14: ffff888008949ec0 R15: ffff888008949fb0 [ 191.759317] FS: 00007f3520cd7e40(0000) GS:ffff88805ba00000(0000) knlGS:0000000000000000 [ 191.759711] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 191.760118] CR2: ffffebde00000008 CR3: 0000000009704000 CR4: 00000000000006f0 Signed-off-by: Edward Lo <edward.lo@ambergroup.io> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Add null pointer check for inode operationsEdward Lo2022-09-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a sanity check for the i_op pointer of the inode which is returned after reading Root directory MFT record. We should check the i_op is valid before trying to create the root dentry, otherwise we may encounter a NPD while mounting a image with a funny Root directory MFT record. [ 114.484325] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 114.484811] #PF: supervisor read access in kernel mode [ 114.485084] #PF: error_code(0x0000) - not-present page [ 114.485606] PGD 0 P4D 0 [ 114.485975] Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 114.486570] CPU: 0 PID: 237 Comm: mount Tainted: G B 6.0.0-rc4 #28 [ 114.486977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 114.488169] RIP: 0010:d_flags_for_inode+0xe0/0x110 [ 114.488816] Code: 24 f7 ff 49 83 3e 00 74 41 41 83 cd 02 66 44 89 6b 02 eb 92 48 8d 7b 20 e8 6d 24 f7 ff 4c 8b 73 20 49 8d 7e 08 e8 60 241 [ 114.490326] RSP: 0018:ffff8880065e7aa8 EFLAGS: 00000296 [ 114.490695] RAX: 0000000000000001 RBX: ffff888008ccd750 RCX: ffffffff84af2aea [ 114.490986] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffff87abd020 [ 114.491364] RBP: ffff8880065e7ac8 R08: 0000000000000001 R09: fffffbfff0f57a05 [ 114.491675] R10: ffffffff87abd027 R11: fffffbfff0f57a04 R12: 0000000000000000 [ 114.491954] R13: 0000000000000008 R14: 0000000000000000 R15: ffff888008ccd750 [ 114.492397] FS: 00007fdc8a627e40(0000) GS:ffff888058200000(0000) knlGS:0000000000000000 [ 114.492797] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 114.493150] CR2: 0000000000000008 CR3: 00000000013ba000 CR4: 00000000000006f0 [ 114.493671] Call Trace: [ 114.493890] <TASK> [ 114.494075] __d_instantiate+0x24/0x1c0 [ 114.494505] d_instantiate.part.0+0x35/0x50 [ 114.494754] d_make_root+0x53/0x80 [ 114.494998] ntfs_fill_super+0x1232/0x1b50 [ 114.495260] ? put_ntfs+0x1d0/0x1d0 [ 114.495499] ? vsprintf+0x20/0x20 [ 114.495723] ? set_blocksize+0x95/0x150 [ 114.495964] get_tree_bdev+0x232/0x370 [ 114.496272] ? put_ntfs+0x1d0/0x1d0 [ 114.496502] ntfs_fs_get_tree+0x15/0x20 [ 114.496859] vfs_get_tree+0x4c/0x130 [ 114.497099] path_mount+0x654/0xfe0 [ 114.497507] ? putname+0x80/0xa0 [ 114.497933] ? finish_automount+0x2e0/0x2e0 [ 114.498362] ? putname+0x80/0xa0 [ 114.498571] ? kmem_cache_free+0x1c4/0x440 [ 114.498819] ? putname+0x80/0xa0 [ 114.499069] do_mount+0xd6/0xf0 [ 114.499343] ? path_mount+0xfe0/0xfe0 [ 114.499683] ? __kasan_check_write+0x14/0x20 [ 114.500133] __x64_sys_mount+0xca/0x110 [ 114.500592] do_syscall_64+0x3b/0x90 [ 114.500930] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 114.501294] RIP: 0033:0x7fdc898e948a [ 114.501542] Code: 48 8b 0d 11 fa 2a 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 008 [ 114.502716] RSP: 002b:00007ffd793e58f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 [ 114.503175] RAX: ffffffffffffffda RBX: 0000564b2228f060 RCX: 00007fdc898e948a [ 114.503588] RDX: 0000564b2228f260 RSI: 0000564b2228f2e0 RDI: 0000564b22297ce0 [ 114.504925] RBP: 0000000000000000 R08: 0000564b2228f280 R09: 0000000000000020 [ 114.505484] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 0000564b22297ce0 [ 114.505823] R13: 0000564b2228f260 R14: 0000000000000000 R15: 00000000ffffffff [ 114.506562] </TASK> [ 114.506887] Modules linked in: [ 114.507648] CR2: 0000000000000008 [ 114.508884] ---[ end trace 0000000000000000 ]--- [ 114.509675] RIP: 0010:d_flags_for_inode+0xe0/0x110 [ 114.510140] Code: 24 f7 ff 49 83 3e 00 74 41 41 83 cd 02 66 44 89 6b 02 eb 92 48 8d 7b 20 e8 6d 24 f7 ff 4c 8b 73 20 49 8d 7e 08 e8 60 241 [ 114.511762] RSP: 0018:ffff8880065e7aa8 EFLAGS: 00000296 [ 114.512401] RAX: 0000000000000001 RBX: ffff888008ccd750 RCX: ffffffff84af2aea [ 114.513103] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffff87abd020 [ 114.513512] RBP: ffff8880065e7ac8 R08: 0000000000000001 R09: fffffbfff0f57a05 [ 114.513831] R10: ffffffff87abd027 R11: fffffbfff0f57a04 R12: 0000000000000000 [ 114.514757] R13: 0000000000000008 R14: 0000000000000000 R15: ffff888008ccd750 [ 114.515411] FS: 00007fdc8a627e40(0000) GS:ffff888058200000(0000) knlGS:0000000000000000 [ 114.515794] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 114.516208] CR2: 0000000000000008 CR3: 00000000013ba000 CR4: 00000000000006f0 Signed-off-by: Edward Lo <edward.lo@ambergroup.io> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Fix junction point resolutionDaniel Pinto2022-09-301-2/+101
| | | | | | | | | | | | | | | | | The ntfs3 file system driver does not convert the target path of junction points to a proper Linux path. As junction points targets are always absolute paths (they start with a drive letter), all junctions will result in broken links. Translate the targets of junction points to relative paths so they point to directories inside the mounted volume. Note that Windows allows junction points to reference directories in another drive. However, as there is no way to know which drive the junctions refer to, we assume they always target the same file system they are in. Link: https://bugzilla.kernel.org/show_bug.cgi?id=214833 Signed-off-by: Daniel Pinto <danielpinto52@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Avoid UBSAN error on true_sectors_per_clst()Shigeru Yoshida2022-09-301-1/+1
| | | | | | | | | | | | | | | | | syzbot reported UBSAN error as below: [ 76.901829][ T6677] ================================================================================ [ 76.903908][ T6677] UBSAN: shift-out-of-bounds in fs/ntfs3/super.c:675:13 [ 76.905363][ T6677] shift exponent -247 is negative This patch avoid this error. Link: https://syzkaller.appspot.com/bug?id=b0299c09a14aababf0f1c862dd4ebc8ab9eb0179 Fixes: a3b774342fa7 (fs/ntfs3: validate BOOT sectors_per_clusters) Cc: Author: Randy Dunlap <rdunlap@infradead.org> Reported-by: syzbot+35b87c668935bb55e666@syzkaller.appspotmail.com Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Fix memory leak on ntfs_fill_super() error pathShigeru Yoshida2022-09-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | syzbot reported kmemleak as below: BUG: memory leak unreferenced object 0xffff8880122f1540 (size 32): comm "a.out", pid 6664, jiffies 4294939771 (age 25.500s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 ed ff ed ff 00 00 00 00 ................ backtrace: [<ffffffff81b16052>] ntfs_init_fs_context+0x22/0x1c0 [<ffffffff8164aaa7>] alloc_fs_context+0x217/0x430 [<ffffffff81626dd4>] path_mount+0x704/0x1080 [<ffffffff81627e7c>] __x64_sys_mount+0x18c/0x1d0 [<ffffffff84593e14>] do_syscall_64+0x34/0xb0 [<ffffffff84600087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd This patch fixes this issue by freeing mount options on error path of ntfs_fill_super(). Reported-by: syzbot+9d67170b20e8f94351c8@syzkaller.appspotmail.com Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Use kmalloc_array for allocating multiple elementsKenneth Lee2022-09-301-1/+1
| | | | | | | | | Prefer using kmalloc_array(a, b) over kmalloc(a * b) as this improves semantics since kmalloc is intended for allocating an array of memory. Signed-off-by: Kenneth Lee <klee33@uw.edu> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Fix attr_punch_hole() null pointer derenferenceAlon Zahavi2022-09-301-1/+1
| | | | | | | | | | | | | | The bug occours due to a misuse of `attr` variable instead of `attr_b`. `attr` is being initialized as NULL, then being derenfernced as `attr->res.data_size`. This bug causes a crash of the ntfs3 driver itself, If compiled directly to the kernel, it crashes the whole system. Signed-off-by: Alon Zahavi <zahavi.alon@gmail.com> Co-developed-by: Tal Lossos <tallossos@gmail.com> Signed-off-by: Tal Lossos <tallossos@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Fix [df]mask display in /proc/mountsMarc Aurèle La France2022-09-301-2/+2
| | | | | | | | | ntfs3's dmask and fmask mount options are 16-bit quantities but are displayed as 1-extended 32-bit values in /proc/mounts. Fix this by circumventing integer promotion. Signed-off-by: Marc Aurèle La France <tsi@tuyoix.net> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Add null pointer check to attr_load_runs_vcnEdward Lo2022-09-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some metadata files are handled before MFT. This adds a null pointer check for some corner cases that could lead to NPD while reading these metadata files for a malformed NTFS image. [ 240.190827] BUG: kernel NULL pointer dereference, address: 0000000000000158 [ 240.191583] #PF: supervisor read access in kernel mode [ 240.191956] #PF: error_code(0x0000) - not-present page [ 240.192391] PGD 0 P4D 0 [ 240.192897] Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 240.193805] CPU: 0 PID: 242 Comm: mount Tainted: G B 5.19.0+ #17 [ 240.194477] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 240.195152] RIP: 0010:ni_find_attr+0xae/0x300 [ 240.195679] Code: c8 48 c7 45 88 c0 4e 5e 86 c7 00 f1 f1 f1 f1 c7 40 04 00 f3 f3 f3 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 e8 e2 d9f [ 240.196642] RSP: 0018:ffff88800812f690 EFLAGS: 00000286 [ 240.197019] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffffffff85ef037a [ 240.197523] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffff88e95f60 [ 240.197877] RBP: ffff88800812f738 R08: 0000000000000001 R09: fffffbfff11d2bed [ 240.198292] R10: ffffffff88e95f67 R11: fffffbfff11d2bec R12: 0000000000000000 [ 240.198647] R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000 [ 240.199410] FS: 00007f233c33be40(0000) GS:ffff888058200000(0000) knlGS:0000000000000000 [ 240.199895] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 240.200314] CR2: 0000000000000158 CR3: 0000000004d32000 CR4: 00000000000006f0 [ 240.200839] Call Trace: [ 240.201104] <TASK> [ 240.201502] ? ni_load_mi+0x80/0x80 [ 240.202297] ? ___slab_alloc+0x465/0x830 [ 240.202614] attr_load_runs_vcn+0x8c/0x1a0 [ 240.202886] ? __kasan_slab_alloc+0x32/0x90 [ 240.203157] ? attr_data_write_resident+0x250/0x250 [ 240.203543] mi_read+0x133/0x2c0 [ 240.203785] mi_get+0x70/0x140 [ 240.204012] ni_load_mi_ex+0xfa/0x190 [ 240.204346] ? ni_std5+0x90/0x90 [ 240.204588] ? __kasan_kmalloc+0x88/0xb0 [ 240.204859] ni_enum_attr_ex+0xf1/0x1c0 [ 240.205107] ? ni_fname_type.part.0+0xd0/0xd0 [ 240.205600] ? ntfs_load_attr_list+0xbe/0x300 [ 240.205864] ? ntfs_cmp_names_cpu+0x125/0x180 [ 240.206157] ntfs_iget5+0x56c/0x1870 [ 240.206510] ? ntfs_get_block_bmap+0x70/0x70 [ 240.206776] ? __kasan_kmalloc+0x88/0xb0 [ 240.207030] ? set_blocksize+0x95/0x150 [ 240.207545] ntfs_fill_super+0xb8f/0x1e20 [ 240.207839] ? put_ntfs+0x1d0/0x1d0 [ 240.208069] ? vsprintf+0x20/0x20 [ 240.208467] ? mutex_unlock+0x81/0xd0 [ 240.208846] ? set_blocksize+0x95/0x150 [ 240.209221] get_tree_bdev+0x232/0x370 [ 240.209804] ? put_ntfs+0x1d0/0x1d0 [ 240.210519] ntfs_fs_get_tree+0x15/0x20 [ 240.210991] vfs_get_tree+0x4c/0x130 [ 240.211455] path_mount+0x645/0xfd0 [ 240.211806] ? putname+0x80/0xa0 [ 240.212112] ? finish_automount+0x2e0/0x2e0 [ 240.212559] ? kmem_cache_free+0x110/0x390 [ 240.212906] ? putname+0x80/0xa0 [ 240.213329] do_mount+0xd6/0xf0 [ 240.213829] ? path_mount+0xfd0/0xfd0 [ 240.214246] ? __kasan_check_write+0x14/0x20 [ 240.214774] __x64_sys_mount+0xca/0x110 [ 240.215080] do_syscall_64+0x3b/0x90 [ 240.215442] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 240.215811] RIP: 0033:0x7f233b4e948a [ 240.216104] Code: 48 8b 0d 11 fa 2a 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 008 [ 240.217615] RSP: 002b:00007fff02211ec8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 [ 240.218718] RAX: ffffffffffffffda RBX: 0000561cdc35b060 RCX: 00007f233b4e948a [ 240.219556] RDX: 0000561cdc35b260 RSI: 0000561cdc35b2e0 RDI: 0000561cdc363af0 [ 240.219975] RBP: 0000000000000000 R08: 0000561cdc35b280 R09: 0000000000000020 [ 240.220403] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 0000561cdc363af0 [ 240.220803] R13: 0000561cdc35b260 R14: 0000000000000000 R15: 00000000ffffffff [ 240.221256] </TASK> [ 240.221567] Modules linked in: [ 240.222028] CR2: 0000000000000158 [ 240.223291] ---[ end trace 0000000000000000 ]--- [ 240.223669] RIP: 0010:ni_find_attr+0xae/0x300 [ 240.224058] Code: c8 48 c7 45 88 c0 4e 5e 86 c7 00 f1 f1 f1 f1 c7 40 04 00 f3 f3 f3 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 e8 e2 d9f [ 240.225033] RSP: 0018:ffff88800812f690 EFLAGS: 00000286 [ 240.225968] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffffffff85ef037a [ 240.226624] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffff88e95f60 [ 240.227307] RBP: ffff88800812f738 R08: 0000000000000001 R09: fffffbfff11d2bed [ 240.227816] R10: ffffffff88e95f67 R11: fffffbfff11d2bec R12: 0000000000000000 [ 240.228330] R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000 [ 240.228729] FS: 00007f233c33be40(0000) GS:ffff888058200000(0000) knlGS:0000000000000000 [ 240.229281] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 240.230298] CR2: 0000000000000158 CR3: 0000000004d32000 CR4: 00000000000006f0 Signed-off-by: Edward Lo <edward.lo@ambergroup.io> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Validate data run offsetEdward Lo2022-09-305-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds sanity checks for data run offset. We should make sure data run offset is legit before trying to unpack them, otherwise we may encounter use-after-free or some unexpected memory access behaviors. [ 82.940342] BUG: KASAN: use-after-free in run_unpack+0x2e3/0x570 [ 82.941180] Read of size 1 at addr ffff888008a8487f by task mount/240 [ 82.941670] [ 82.942069] CPU: 0 PID: 240 Comm: mount Not tainted 5.19.0+ #15 [ 82.942482] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 82.943720] Call Trace: [ 82.944204] <TASK> [ 82.944471] dump_stack_lvl+0x49/0x63 [ 82.944908] print_report.cold+0xf5/0x67b [ 82.945141] ? __wait_on_bit+0x106/0x120 [ 82.945750] ? run_unpack+0x2e3/0x570 [ 82.946626] kasan_report+0xa7/0x120 [ 82.947046] ? run_unpack+0x2e3/0x570 [ 82.947280] __asan_load1+0x51/0x60 [ 82.947483] run_unpack+0x2e3/0x570 [ 82.947709] ? memcpy+0x4e/0x70 [ 82.947927] ? run_pack+0x7a0/0x7a0 [ 82.948158] run_unpack_ex+0xad/0x3f0 [ 82.948399] ? mi_enum_attr+0x14a/0x200 [ 82.948717] ? run_unpack+0x570/0x570 [ 82.949072] ? ni_enum_attr_ex+0x1b2/0x1c0 [ 82.949332] ? ni_fname_type.part.0+0xd0/0xd0 [ 82.949611] ? mi_read+0x262/0x2c0 [ 82.949970] ? ntfs_cmp_names_cpu+0x125/0x180 [ 82.950249] ntfs_iget5+0x632/0x1870 [ 82.950621] ? ntfs_get_block_bmap+0x70/0x70 [ 82.951192] ? evict+0x223/0x280 [ 82.951525] ? iput.part.0+0x286/0x320 [ 82.951969] ntfs_fill_super+0x1321/0x1e20 [ 82.952436] ? put_ntfs+0x1d0/0x1d0 [ 82.952822] ? vsprintf+0x20/0x20 [ 82.953188] ? mutex_unlock+0x81/0xd0 [ 82.953379] ? set_blocksize+0x95/0x150 [ 82.954001] get_tree_bdev+0x232/0x370 [ 82.954438] ? put_ntfs+0x1d0/0x1d0 [ 82.954700] ntfs_fs_get_tree+0x15/0x20 [ 82.955049] vfs_get_tree+0x4c/0x130 [ 82.955292] path_mount+0x645/0xfd0 [ 82.955615] ? putname+0x80/0xa0 [ 82.955955] ? finish_automount+0x2e0/0x2e0 [ 82.956310] ? kmem_cache_free+0x110/0x390 [ 82.956723] ? putname+0x80/0xa0 [ 82.957023] do_mount+0xd6/0xf0 [ 82.957411] ? path_mount+0xfd0/0xfd0 [ 82.957638] ? __kasan_check_write+0x14/0x20 [ 82.957948] __x64_sys_mount+0xca/0x110 [ 82.958310] do_syscall_64+0x3b/0x90 [ 82.958719] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 82.959341] RIP: 0033:0x7fd0d1ce948a [ 82.960193] Code: 48 8b 0d 11 fa 2a 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 008 [ 82.961532] RSP: 002b:00007ffe59ff69a8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 [ 82.962527] RAX: ffffffffffffffda RBX: 0000564dcc107060 RCX: 00007fd0d1ce948a [ 82.963266] RDX: 0000564dcc107260 RSI: 0000564dcc1072e0 RDI: 0000564dcc10fce0 [ 82.963686] RBP: 0000000000000000 R08: 0000564dcc107280 R09: 0000000000000020 [ 82.964272] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 0000564dcc10fce0 [ 82.964785] R13: 0000564dcc107260 R14: 0000000000000000 R15: 00000000ffffffff Signed-off-by: Edward Lo <edward.lo@ambergroup.io> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Add overflow check for attribute sizeedward lo2022-09-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The offset addition could overflow and pass the used size check given an attribute with very large size (e.g., 0xffffff7f) while parsing MFT attributes. This could lead to out-of-bound memory R/W if we try to access the next attribute derived by Add2Ptr(attr, asize) [ 32.963847] BUG: unable to handle page fault for address: ffff956a83c76067 [ 32.964301] #PF: supervisor read access in kernel mode [ 32.964526] #PF: error_code(0x0000) - not-present page [ 32.964893] PGD 4dc01067 P4D 4dc01067 PUD 0 [ 32.965316] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 32.965727] CPU: 0 PID: 243 Comm: mount Not tainted 5.19.0+ #6 [ 32.966050] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 32.966628] RIP: 0010:mi_enum_attr+0x44/0x110 [ 32.967239] Code: 89 f0 48 29 c8 48 89 c1 39 c7 0f 86 94 00 00 00 8b 56 04 83 fa 17 0f 86 88 00 00 00 89 d0 01 ca 48 01 f0 8d 4a 08 39 f9a [ 32.968101] RSP: 0018:ffffba15c06a7c38 EFLAGS: 00000283 [ 32.968364] RAX: ffff956a83c76067 RBX: ffff956983c76050 RCX: 000000000000006f [ 32.968651] RDX: 0000000000000067 RSI: ffff956983c760e8 RDI: 00000000000001c8 [ 32.968963] RBP: ffffba15c06a7c38 R08: 0000000000000064 R09: 00000000ffffff7f [ 32.969249] R10: 0000000000000007 R11: ffff956983c760e8 R12: ffff95698225e000 [ 32.969870] R13: 0000000000000000 R14: ffffba15c06a7cd8 R15: ffff95698225e170 [ 32.970655] FS: 00007fdab8189e40(0000) GS:ffff9569fdc00000(0000) knlGS:0000000000000000 [ 32.971098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 32.971378] CR2: ffff956a83c76067 CR3: 0000000002c58000 CR4: 00000000000006f0 [ 32.972098] Call Trace: [ 32.972842] <TASK> [ 32.973341] ni_enum_attr_ex+0xda/0xf0 [ 32.974087] ntfs_iget5+0x1db/0xde0 [ 32.974386] ? slab_post_alloc_hook+0x53/0x270 [ 32.974778] ? ntfs_fill_super+0x4c7/0x12a0 [ 32.975115] ntfs_fill_super+0x5d6/0x12a0 [ 32.975336] get_tree_bdev+0x175/0x270 [ 32.975709] ? put_ntfs+0x150/0x150 [ 32.975956] ntfs_fs_get_tree+0x15/0x20 [ 32.976191] vfs_get_tree+0x2a/0xc0 [ 32.976374] ? capable+0x19/0x20 [ 32.976572] path_mount+0x484/0xaa0 [ 32.977025] ? putname+0x57/0x70 [ 32.977380] do_mount+0x80/0xa0 [ 32.977555] __x64_sys_mount+0x8b/0xe0 [ 32.978105] do_syscall_64+0x3b/0x90 [ 32.978830] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 32.979311] RIP: 0033:0x7fdab72e948a [ 32.980015] Code: 48 8b 0d 11 fa 2a 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 008 [ 32.981251] RSP: 002b:00007ffd15b87588 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 [ 32.981832] RAX: ffffffffffffffda RBX: 0000557de0aaf060 RCX: 00007fdab72e948a [ 32.982234] RDX: 0000557de0aaf260 RSI: 0000557de0aaf2e0 RDI: 0000557de0ab7ce0 [ 32.982714] RBP: 0000000000000000 R08: 0000557de0aaf280 R09: 0000000000000020 [ 32.983046] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000557de0ab7ce0 [ 32.983494] R13: 0000557de0aaf260 R14: 0000000000000000 R15: 00000000ffffffff [ 32.984094] </TASK> [ 32.984352] Modules linked in: [ 32.984753] CR2: ffff956a83c76067 [ 32.985911] ---[ end trace 0000000000000000 ]--- [ 32.986555] RIP: 0010:mi_enum_attr+0x44/0x110 [ 32.987217] Code: 89 f0 48 29 c8 48 89 c1 39 c7 0f 86 94 00 00 00 8b 56 04 83 fa 17 0f 86 88 00 00 00 89 d0 01 ca 48 01 f0 8d 4a 08 39 f9a [ 32.988232] RSP: 0018:ffffba15c06a7c38 EFLAGS: 00000283 [ 32.988532] RAX: ffff956a83c76067 RBX: ffff956983c76050 RCX: 000000000000006f [ 32.988916] RDX: 0000000000000067 RSI: ffff956983c760e8 RDI: 00000000000001c8 [ 32.989356] RBP: ffffba15c06a7c38 R08: 0000000000000064 R09: 00000000ffffff7f [ 32.989994] R10: 0000000000000007 R11: ffff956983c760e8 R12: ffff95698225e000 [ 32.990415] R13: 0000000000000000 R14: ffffba15c06a7cd8 R15: ffff95698225e170 [ 32.991011] FS: 00007fdab8189e40(0000) GS:ffff9569fdc00000(0000) knlGS:0000000000000000 [ 32.991524] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 32.991936] CR2: ffff956a83c76067 CR3: 0000000002c58000 CR4: 00000000000006f0 This patch adds an overflow check Signed-off-by: edward lo <edward.lo@ambergroup.io> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Validate BOOT record_sizeedward lo2022-09-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the NTFS BOOT record_size field < 0, it represents a shift value. However, there is no sanity check on the shift result and the sbi->record_bits calculation through blksize_bits() assumes the size always > 256, which could lead to NPD while mounting a malformed NTFS image. [ 318.675159] BUG: kernel NULL pointer dereference, address: 0000000000000158 [ 318.675682] #PF: supervisor read access in kernel mode [ 318.675869] #PF: error_code(0x0000) - not-present page [ 318.676246] PGD 0 P4D 0 [ 318.676502] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 318.676934] CPU: 0 PID: 259 Comm: mount Not tainted 5.19.0 #5 [ 318.677289] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 318.678136] RIP: 0010:ni_find_attr+0x2d/0x1c0 [ 318.678656] Code: 89 ca 4d 89 c7 41 56 41 55 41 54 41 89 cc 55 48 89 fd 53 48 89 d3 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 44 24 180 [ 318.679848] RSP: 0018:ffffa6c8c0297bd8 EFLAGS: 00000246 [ 318.680104] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000080 [ 318.680790] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 318.681679] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 318.682577] R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000080 [ 318.683015] R13: ffff8d5582e68400 R14: 0000000000000100 R15: 0000000000000000 [ 318.683618] FS: 00007fd9e1c81e40(0000) GS:ffff8d55fdc00000(0000) knlGS:0000000000000000 [ 318.684280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 318.684651] CR2: 0000000000000158 CR3: 0000000002e1a000 CR4: 00000000000006f0 [ 318.685623] Call Trace: [ 318.686607] <TASK> [ 318.686872] ? ntfs_alloc_inode+0x1a/0x60 [ 318.687235] attr_load_runs_vcn+0x2b/0xa0 [ 318.687468] mi_read+0xbb/0x250 [ 318.687576] ntfs_iget5+0x114/0xd90 [ 318.687750] ntfs_fill_super+0x588/0x11b0 [ 318.687953] ? put_ntfs+0x130/0x130 [ 318.688065] ? snprintf+0x49/0x70 [ 318.688164] ? put_ntfs+0x130/0x130 [ 318.688256] get_tree_bdev+0x16a/0x260 [ 318.688407] vfs_get_tree+0x20/0xb0 [ 318.688519] path_mount+0x2dc/0x9b0 [ 318.688877] do_mount+0x74/0x90 [ 318.689142] __x64_sys_mount+0x89/0xd0 [ 318.689636] do_syscall_64+0x3b/0x90 [ 318.689998] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 318.690318] RIP: 0033:0x7fd9e133c48a [ 318.690687] Code: 48 8b 0d 11 fa 2a 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 008 [ 318.691357] RSP: 002b:00007ffd374406c8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 [ 318.691632] RAX: ffffffffffffffda RBX: 0000564d0b051080 RCX: 00007fd9e133c48a [ 318.691920] RDX: 0000564d0b051280 RSI: 0000564d0b051300 RDI: 0000564d0b0596a0 [ 318.692123] RBP: 0000000000000000 R08: 0000564d0b0512a0 R09: 0000000000000020 [ 318.692349] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 0000564d0b0596a0 [ 318.692673] R13: 0000564d0b051280 R14: 0000000000000000 R15: 00000000ffffffff [ 318.693007] </TASK> [ 318.693271] Modules linked in: [ 318.693614] CR2: 0000000000000158 [ 318.694446] ---[ end trace 0000000000000000 ]--- [ 318.694779] RIP: 0010:ni_find_attr+0x2d/0x1c0 [ 318.694952] Code: 89 ca 4d 89 c7 41 56 41 55 41 54 41 89 cc 55 48 89 fd 53 48 89 d3 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 44 24 180 [ 318.696042] RSP: 0018:ffffa6c8c0297bd8 EFLAGS: 00000246 [ 318.696531] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000080 [ 318.698114] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 318.699286] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 318.699795] R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000080 [ 318.700236] R13: ffff8d5582e68400 R14: 0000000000000100 R15: 0000000000000000 [ 318.700973] FS: 00007fd9e1c81e40(0000) GS:ffff8d55fdc00000(0000) knlGS:0000000000000000 [ 318.701688] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 318.702190] CR2: 0000000000000158 CR3: 0000000002e1a000 CR4: 00000000000006f0 [ 318.726510] mount (259) used greatest stack depth: 13320 bytes left This patch adds a sanity check. Signed-off-by: edward lo <edward.lo@ambergroup.io> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Rename variables and add commentKonstantin Komarov2022-09-302-13/+12
| | | | | | After renaming we don't need to split code in two lines. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Add option "nocase"Konstantin Komarov2022-09-305-1/+162
| | | | | | This commit adds mount option and additional functions. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Change destroy_inode to free_inodeKonstantin Komarov2022-09-301-16/+4
| | | | | | | Many filesystems already use free_inode callback, so we will use it too from now on. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Add hidedotfiles optionKonstantin Komarov2022-09-303-0/+10
| | | | | | | With this option all files with filename[0] == '.' will have FILE_ATTRIBUTE_HIDDEN attribute. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* fs/ntfs3: Add comments about cluster sizeKonstantin Komarov2022-09-303-1/+29
| | | | | | This commit adds additional info about CONFIG_NTFS3_64BIT_CLUSTER Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
* Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2022-09-255-181/+154
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Regression and bug fixes: - Performance regression fix from 5.18 on a Rasberry Pi - Fix extent parsing bug which triggers a BUG_ON when a (corrupted) extent tree has has a non-root node when zero entries. - Fix a livelock where in the right (wrong) circumstances a large number of nfsd threads can try to write to a nearly full file system, and retry for hours(!)" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: limit the number of retries after discarding preallocations blocks ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0 ext4: use buckets for cr 1 block scan instead of rbtree ext4: use locality group preallocation for small closed files ext4: make directory inode spreading reflect flexbg size ext4: avoid unnecessary spreading of allocations among groups ext4: make mballoc try target group first even with mb_optimize_scan
| * ext4: limit the number of retries after discarding preallocations blocksTheodore Ts'o2022-09-221-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch avoids threads live-locking for hours when a large number threads are competing over the last few free extents as they blocks getting added and removed from preallocation pools. From our bug reporter: A reliable way for triggering this has multiple writers continuously write() to files when the filesystem is full, while small amounts of space are freed (e.g. by truncating a large file -1MiB at a time). In the local filesystem, this can be done by simply not checking the return code of write (0) and/or the error (ENOSPACE) that is set. Over NFS with an async mount, even clients with proper error checking will behave this way since the linux NFS client implementation will not propagate the server errors [the write syscalls immediately return success] until the file handle is closed. This leads to a situation where NFS clients send a continuous stream of WRITE rpcs which result in ERRNOSPACE -- but since the client isn't seeing this, the stream of writes continues at maximum network speed. When some space does appear, multiple writers will all attempt to claim it for their current write. For NFS, we may see dozens to hundreds of threads that do this. The real-world scenario of this is database backup tooling (in particular, github.com/mdkent/percona-xtrabackup) which may write large files (>1TiB) to NFS for safe keeping. Some temporary files are written, rewound, and read back -- all before closing the file handle (the temp file is actually unlinked, to trigger automatic deletion on close/crash.) An application like this operating on an async NFS mount will not see an error code until TiB have been written/read. The lockup was observed when running this database backup on large filesystems (64 TiB in this case) with a high number of block groups and no free space. Fragmentation is generally not a factor in this filesystem (~thousands of large files, mostly contiguous except for the parts written while the filesystem is at capacity.) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0Luís Henriques2022-09-221-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When walking through an inode extents, the ext4_ext_binsearch_idx() function assumes that the extent header has been previously validated. However, there are no checks that verify that the number of entries (eh->eh_entries) is non-zero when depth is > 0. And this will lead to problems because the EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this: [ 135.245946] ------------[ cut here ]------------ [ 135.247579] kernel BUG at fs/ext4/extents.c:2258! [ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP [ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4 [ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014 [ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0 [ 135.256475] Code: [ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246 [ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023 [ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c [ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c [ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024 [ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000 [ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 [ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0 [ 135.277952] Call Trace: [ 135.278635] <TASK> [ 135.279247] ? preempt_count_add+0x6d/0xa0 [ 135.280358] ? percpu_counter_add_batch+0x55/0xb0 [ 135.281612] ? _raw_read_unlock+0x18/0x30 [ 135.282704] ext4_map_blocks+0x294/0x5a0 [ 135.283745] ? xa_load+0x6f/0xa0 [ 135.284562] ext4_mpage_readpages+0x3d6/0x770 [ 135.285646] read_pages+0x67/0x1d0 [ 135.286492] ? folio_add_lru+0x51/0x80 [ 135.287441] page_cache_ra_unbounded+0x124/0x170 [ 135.288510] filemap_get_pages+0x23d/0x5a0 [ 135.289457] ? path_openat+0xa72/0xdd0 [ 135.290332] filemap_read+0xbf/0x300 [ 135.291158] ? _raw_spin_lock_irqsave+0x17/0x40 [ 135.292192] new_sync_read+0x103/0x170 [ 135.293014] vfs_read+0x15d/0x180 [ 135.293745] ksys_read+0xa1/0xe0 [ 135.294461] do_syscall_64+0x3c/0x80 [ 135.295284] entry_SYSCALL_64_after_hwframe+0x46/0xb0 This patch simply adds an extra check in __ext4_ext_check(), verifying that eh_entries is not 0 when eh_depth is > 0. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941 Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283 Cc: Baokun Li <libaokun1@huawei.com> Cc: stable@kernel.org Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: use buckets for cr 1 block scan instead of rbtreeJan Kara2022-09-213-149/+111
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using rbtree for sorting groups by average fragment size is relatively expensive (needs rbtree update on every block freeing or allocation) and leads to wide spreading of allocations because selection of block group is very sentitive both to changes in free space and amount of blocks allocated. Furthermore selecting group with the best matching average fragment size is not necessary anyway, even more so because the variability of fragment sizes within a group is likely large so average is not telling much. We just need a group with large enough average fragment size so that we have high probability of finding large enough free extent and we don't want average fragment size to be too big so that we are likely to find free extent only somewhat larger than what we need. So instead of maintaing rbtree of groups sorted by fragment size keep bins (lists) or groups where average fragment size is in the interval [2^i, 2^(i+1)). This structure requires less updates on block allocation / freeing, generally avoids chaotic spreading of allocations into block groups, and still is able to quickly (even faster that the rbtree) provide a block group which is likely to have a suitably sized free space extent. This patch reduces number of block groups used when untarring archive with medium sized files (size somewhat above 64k which is default mballoc limit for avoiding locality group preallocation) to about half and thus improves write speeds for eMMC flash significantly. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: use locality group preallocation for small closed filesJan Kara2022-09-211-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Curently we don't use any preallocation when a file is already closed when allocating blocks (from writeback code when converting delayed allocation). However for small files, using locality group preallocation is actually desirable as that is not specific to a particular file. Rather it is a method to pack small files together to reduce fragmentation and for that the fact the file is closed is actually even stronger hint the file would benefit from packing. So change the logic to allow locality group preallocation in this case. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: make directory inode spreading reflect flexbg sizeJan Kara2022-09-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the Orlov inode allocator searches for free inodes for a directory only in flex block groups with at most inodes_per_group/16 more directory inodes than average per flex block group. However with growing size of flex block group this becomes unnecessarily strict. Scale allowed difference from average directory count per flex block group with flex block group size as we do with other metrics. Tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Cc: stable@kernel.org Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220908092136.11770-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: avoid unnecessary spreading of allocations among groupsJan Kara2022-09-211-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mb_set_largest_free_order() updates lists containing groups with largest chunk of free space of given order. The way it updates it leads to always moving the group to the tail of the list. Thus allocations looking for free space of given order effectively end up cycling through all groups (and due to initialization in last to first order). This spreads allocations among block groups which reduces performance for rotating disks or low-end flash media. Change mb_set_largest_free_order() to only update lists if the order of the largest free chunk in the group changed. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: make mballoc try target group first even with mb_optimize_scanJan Kara2022-09-211-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the side-effects of mb_optimize_scan was that the optimized functions to select next group to try were called even before we tried the goal group. As a result we no longer allocate files close to corresponding inodes as well as we don't try to expand currently allocated extent in the same group. This results in reaim regression with workfile.disk workload of upto 8% with many clients on my test machine: baseline mb_optimize_scan Hmean disk-1 2114.16 ( 0.00%) 2099.37 ( -0.70%) Hmean disk-41 87794.43 ( 0.00%) 83787.47 * -4.56%* Hmean disk-81 148170.73 ( 0.00%) 135527.05 * -8.53%* Hmean disk-121 177506.11 ( 0.00%) 166284.93 * -6.32%* Hmean disk-161 220951.51 ( 0.00%) 207563.39 * -6.06%* Hmean disk-201 208722.74 ( 0.00%) 203235.59 ( -2.63%) Hmean disk-241 222051.60 ( 0.00%) 217705.51 ( -1.96%) Hmean disk-281 252244.17 ( 0.00%) 241132.72 * -4.41%* Hmean disk-321 255844.84 ( 0.00%) 245412.84 * -4.08%* Also this is causing huge regression (time increased by a factor of 5 or so) when untarring archive with lots of small files on some eMMC storage cards. Fix the problem by making sure we try goal group first. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Reported-and-tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/ Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | Merge tag 'dax-and-nvdimm-fixes-v6.0-final' of ↵Linus Torvalds2022-09-251-0/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull NVDIMM and DAX fixes from Dan Williams: "A recently discovered one-line fix for devdax that further addresses a v5.5 regression, and (a bit embarrassing) a small batch of fixes that have been sitting in my fixes tree for weeks. The older fixes have soaked in linux-next during that time and address an fsdax infinite loop and some other minor fixups. - Fix a infinite loop bug in fsdax - Fix memory-type detection for devdax (EINJ regression) - Small cleanups" * tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: devdax: Fix soft-reservation memory description fsdax: Fix infinite loop in dax_iomap_rw() nvdimm/namespace: drop nested variable in create_namespace_pmem() ndtest: Cleanup all of blk namespace specific code pmem: fix a name collision
| * \ Merge branch 'for-6.0/dax' into libnvdimm-fixesDan Williams2022-09-24588-28298/+18925
| |\ \ | | | | | | | | | | | | | | | | Pick up another "Soft Reservation" fix for v6.0-final on top of some straggling nvdimm fixes that missed v5.19.
| * | | fsdax: Fix infinite loop in dax_iomap_rw()Li Jinlin2022-07-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got an infinite loop and a WARNING report when executing a tail command in virtiofs. WARNING: CPU: 10 PID: 964 at fs/iomap/iter.c:34 iomap_iter+0x3a2/0x3d0 Modules linked in: CPU: 10 PID: 964 Comm: tail Not tainted 5.19.0-rc7 Call Trace: <TASK> dax_iomap_rw+0xea/0x620 ? __this_cpu_preempt_check+0x13/0x20 fuse_dax_read_iter+0x47/0x80 fuse_file_read_iter+0xae/0xd0 new_sync_read+0xfe/0x180 ? 0xffffffff81000000 vfs_read+0x14d/0x1a0 ksys_read+0x6d/0xf0 __x64_sys_read+0x1a/0x20 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd The tail command will call read() with a count of 0. In this case, iomap_iter() will report this WARNING, and always return 1 which casuing the infinite loop in dax_iomap_rw(). Fixing by checking count whether is 0 in dax_iomap_rw(). Fixes: ca289e0b95af ("fsdax: switch dax_iomap_rw to use iomap_iter") Signed-off-by: Li Jinlin <lijinlin3@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220725032050.3873372-1-lijinlin3@huawei.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | | | Merge tag 'exfat-for-6.0-rc7' of ↵Linus Torvalds2022-09-211-2/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fix from Namjae Jeon: - fix integer overflow on large partitions * tag 'exfat-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: fix overflow for large capacity partition
| * | | | exfat: fix overflow for large capacity partitionYuezhang Mo2022-09-041-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using int type for sector index, there will be overflow in a large capacity partition. For example, if storage with sector size of 512 bytes and partition capacity is larger than 2TB, there will be overflow. Fixes: 1b6138385499 ("exfat: reduce block requests when zeroing a cluster") Cc: stable@vger.kernel.org # v5.19+ Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Andy Wu <Andy.Wu@sony.com> Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com> Acked-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
* | | | | Merge tag 'for-6.0-rc6-tag' of ↵Linus Torvalds2022-09-202-8/+74
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - two fixes for hangs in the umount sequence where threads depend on each other and the work must be finished in the right order - in zoned mode, wait for flushing all block group metadata IO before finishing the zone * tag 'for-6.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: wait for extent buffer IOs before finishing a zone btrfs: fix hang during unmount when stopping a space reclaim worker btrfs: fix hang during unmount when stopping block group reclaim worker
| * | | | | btrfs: zoned: wait for extent buffer IOs before finishing a zoneNaohiro Aota2022-09-131-2/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before sending REQ_OP_ZONE_FINISH to a zone, we need to ensure that ongoing IOs already finished. Or, we will see a "Zone Is Full" error for the IOs, as the ZONE_FINISH command makes the zone full. We ensure that with btrfs_wait_block_group_reservations() and btrfs_wait_ordered_roots() for a data block group. And, for a metadata block group, the comparison of alloc_offset vs meta_write_pointer mostly ensures IOs for the allocated region already sent. However, there still can be a little time frame where the IOs are sent but not yet completed. Introduce wait_eb_writebacks() to ensure such IOs are completed for a metadata block group. It walks the buffer_radix to find extent buffers in the block group and calls wait_on_extent_buffer_writeback() on them. Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 5.19+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: fix hang during unmount when stopping a space reclaim workerFilipe Manana2022-09-131-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Often when running generic/562 from fstests we can hang during unmount, resulting in a trace like this: Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00 Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds. Sep 07 11:55:32 debian9 kernel: Not tainted 6.0.0-rc2-btrfs-next-122 #1 Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 07 11:55:32 debian9 kernel: task:umount state:D stack: 0 pid:49438 ppid: 25683 flags:0x00004000 Sep 07 11:55:32 debian9 kernel: Call Trace: Sep 07 11:55:32 debian9 kernel: <TASK> Sep 07 11:55:32 debian9 kernel: __schedule+0x3c8/0xec0 Sep 07 11:55:32 debian9 kernel: ? rcu_read_lock_sched_held+0x12/0x70 Sep 07 11:55:32 debian9 kernel: schedule+0x5d/0xf0 Sep 07 11:55:32 debian9 kernel: schedule_timeout+0xf1/0x130 Sep 07 11:55:32 debian9 kernel: ? lock_release+0x224/0x4a0 Sep 07 11:55:32 debian9 kernel: ? lock_acquired+0x1a0/0x420 Sep 07 11:55:32 debian9 kernel: ? trace_hardirqs_on+0x2c/0xd0 Sep 07 11:55:32 debian9 kernel: __wait_for_common+0xac/0x200 Sep 07 11:55:32 debian9 kernel: ? usleep_range_state+0xb0/0xb0 Sep 07 11:55:32 debian9 kernel: __flush_work+0x26d/0x530 Sep 07 11:55:32 debian9 kernel: ? flush_workqueue_prep_pwqs+0x140/0x140 Sep 07 11:55:32 debian9 kernel: ? trace_clock_local+0xc/0x30 Sep 07 11:55:32 debian9 kernel: __cancel_work_timer+0x11f/0x1b0 Sep 07 11:55:32 debian9 kernel: ? close_ctree+0x12b/0x5b3 [btrfs] Sep 07 11:55:32 debian9 kernel: ? __trace_bputs+0x10b/0x170 Sep 07 11:55:32 debian9 kernel: close_ctree+0x152/0x5b3 [btrfs] Sep 07 11:55:32 debian9 kernel: ? evict_inodes+0x166/0x1c0 Sep 07 11:55:32 debian9 kernel: generic_shutdown_super+0x71/0x120 Sep 07 11:55:32 debian9 kernel: kill_anon_super+0x14/0x30 Sep 07 11:55:32 debian9 kernel: btrfs_kill_super+0x12/0x20 [btrfs] Sep 07 11:55:32 debian9 kernel: deactivate_locked_super+0x2e/0xa0 Sep 07 11:55:32 debian9 kernel: cleanup_mnt+0x100/0x160 Sep 07 11:55:32 debian9 kernel: task_work_run+0x59/0xa0 Sep 07 11:55:32 debian9 kernel: exit_to_user_mode_prepare+0x1a6/0x1b0 Sep 07 11:55:32 debian9 kernel: syscall_exit_to_user_mode+0x16/0x40 Sep 07 11:55:32 debian9 kernel: do_syscall_64+0x48/0x90 Sep 07 11:55:32 debian9 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7 Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7 Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0 Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570 Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000 Sep 07 11:55:32 debian9 kernel: </TASK> What happens is the following: 1) The cleaner kthread tries to start a transaction to delete an unused block group, but the metadata reservation can not be satisfied right away, so a reservation ticket is created and it starts the async metadata reclaim task (fs_info->async_reclaim_work); 2) Writeback for all the filler inodes with an i_size of 2K starts (generic/562 creates a lot of 2K files with the goal of filling metadata space). We try to create an inline extent for them, but we fail when trying to insert the inline extent with -ENOSPC (at cow_file_range_inline()) - since this is not critical, we fallback to non-inline mode (back to cow_file_range()), reserve extents, create extent maps and create the ordered extents; 3) An unmount starts, enters close_ctree(); 4) The async reclaim task is flushing stuff, entering the flush states one by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current delayed iputs. After running the delayed iputs and before calling btrfs_wait_on_delayed_iputs(), one or more ordered extents complete, and btrfs_add_delayed_iput() is called for each one through btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results in bumping fs_info->nr_delayed_iputs from 0 to some positive value. So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting for fs_info->nr_delayed_iputs to become 0; 5) The current transaction is committed by the transaction kthread, we then start unpinning extents and end up calling btrfs_try_granting_tickets() through unpin_extent_range(), since we released some space. This results in satisfying the ticket created by the cleaner kthread at step 1, waking up the cleaner kthread; 6) At close_ctree() we ask the cleaner kthread to park; 7) The cleaner kthread starts the transaction, deletes the unused block group, and then calls kthread_should_park(), which returns true, so it parks. And at this point we have the delayed iputs added by the completion of the ordered extents still pending; 8) Then later at close_ctree(), when we call: cancel_work_sync(&fs_info->async_reclaim_work); We hang forever, since the cleaner was parked and no one else can run delayed iputs after that, while the reclaim task is waiting for the remaining delayed iputs to be completed. Fix this by waiting for all ordered extents to complete and running the delayed iputs before attempting to stop the async reclaim tasks. Note that we can not wait for ordered extents with btrfs_wait_ordered_roots() (or other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE flag to be set on an ordered extent, but the delayed iput is added after that, when doing the final btrfs_put_ordered_extent(). So instead wait for the work queues used for executing ordered extent completion to be empty, which works because we do the final put on an ordered extent at btrfs_finish_ordered_io() (while we are in the unmount context). Fixes: d6fd0ae25c6495 ("Btrfs: fix missing delayed iputs on unmount") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: fix hang during unmount when stopping block group reclaim workerFilipe Manana2022-09-131-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During early unmount, at close_ctree(), we try to stop the block group reclaim task with cancel_work_sync(), but that may hang if the block group reclaim task is currently at btrfs_relocate_block_group() waiting for the flag BTRFS_FS_UNFINISHED_DROPS to be cleared from fs_info->flags. During unmount we only clear that flag later, after trying to stop the block group reclaim task. Fix that by clearing BTRFS_FS_UNFINISHED_DROPS before trying to stop the block group reclaim task and after setting BTRFS_FS_CLOSING_START, so that if the reclaim task is waiting on that bit, it will stop immediately after being woken, because it sees the filesystem is closing (with a call to btrfs_fs_closing()), and then returns immediately with -EINTR. Fixes: 31e70e527806c5 ("btrfs: fix hang during unmount when block group reclaim task is running") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | | | Merge tag 'fs.fixes.v6.0-rc7' of ↵Linus Torvalds2022-09-201-0/+2
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs fix from Christian Brauner: "Beginning of the merge window we introduced the vfs{g,u}id_t types in b27c82e12965 ("attr: port attribute changes to new types") and changed various codepaths over including chown_common(). When userspace passes -1 for an ownership change the ownership fields in struct iattr stay uninitialized. Usually this is fine because any code making use of any fields in struct iattr must check the ->ia_valid field whether the value of interest has been initialized. That's true for all struct iattr passing code. However, over the course of the last year with more heavy use of KMSAN we found quite a few places that got this wrong. A recent one I fixed was 3cb6ee991496 ("9p: only copy valid iattrs in 9P2000.L setattr implementation"). But we also have LSM hooks. Actually we have two. The first one is security_inode_setattr() in notify_change() which does the right thing and passes the full struct iattr down to LSMs and thus LSMs can check whether it is initialized. But then we also have security_path_chown() which passes down a path argument and the target ownership as the filesystem would see it. For the latter we now generate the target values based on struct iattr and pass it down. However, when userspace passes -1 then struct iattr isn't initialized. This patch simply initializes ->ia_vfs{g,u}id with INVALID_VFS{G,U}ID so the hook continue to see invalid ownership when -1 is passed from userspace. The only LSM that cares about the actual values is Tomoyo. The vfs codepaths don't look at these fields without ->ia_valid being set so there's no harm in initializing ->ia_vfs{g,u}id. Arguably this is also safer since we can't end up copying valid ownership values when invalid ownership values should be passed. This only affects mainline. No kernel has been released with this and thus no backport is needed. The commit is thus marked with a Fixes: tag but annotated with "# mainline only" (I didn't quite remember what Greg said about how to tell stable autoselect to not bother with fixes for mainline only)" * tag 'fs.fixes.v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: open: always initialize ownership fields
| * | | | | | open: always initialize ownership fieldsTetsuo Handa2022-09-201-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Beginning of the merge window we introduced the vfs{g,u}id_t types in b27c82e12965 ("attr: port attribute changes to new types") and changed various codepaths over including chown_common(). During that change we forgot to account for the case were the passed ownership value is -1. In this case the ownership fields in struct iattr aren't initialized but we rely on them being initialized by the time we generate the ownership to pass down to the LSMs. All the major LSMs don't care about the ownership values at all. Only Tomoyo uses them and so it took a while for syzbot to unearth this issue. Fix this by initializing the ownership fields and do it within the retry_deleg block. While notify_change() doesn't alter the ownership fields currently we shouldn't rely on it. Since no kernel has been released with these changes this does not needed to be backported to any stable kernels. [Christian Brauner (Microsoft) <brauner@kernel.org>] * rewrote commit message * use INVALID_VFS{G,U}ID macros Fixes: b27c82e12965 ("attr: port attribute changes to new types") # mainline only Reported-and-tested-by: syzbot+541e21dcc32c4046cba9@syzkaller.appspotmail.com Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | | | | | | Merge tag 'execve-v6.0-rc7' of ↵Linus Torvalds2022-09-201-7/+0
|\ \ \ \ \ \ \ | |_|_|_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve reverts from Kees Cook: "The recent work to support time namespace unsharing turns out to have some undesirable corner cases, so rather than allowing the API to stay exposed for another release, it'd be best to remove it ASAP, with the replacement getting another cycle of testing. Nothing is known to use this yet, so no userspace breakage is expected. For more details, see: https://lore.kernel.org/lkml/ed418e43ad28b8688cfea2b7c90fce1c@ispras.ru Summary: - Remove the recent 'unshare time namespace on vfork+exec' feature (Andrei Vagin)" * tag 'execve-v6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: Revert "fs/exec: allow to unshare a time namespace on vfork+exec" Revert "selftests/timens: add a test for vfork+exit"
| * | | | | | Revert "fs/exec: allow to unshare a time namespace on vfork+exec"Andrei Vagin2022-09-131-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 133e2d3e81de5d9706cab2dd1d52d231c27382e5. Alexey pointed out a few undesirable side effects of the reverted change. First, it doesn't take into account that CLONE_VFORK can be used with CLONE_THREAD. Second, a child process doesn't enter a target time name-space, if its parent dies before the child calls exec. It happens because the parent clears vfork_done. Eric W. Biederman suggests installing a time namespace as a task gets a new mm. It includes all new processes cloned without CLONE_VM and all tasks that call exec(). This is an user API change, but we think there aren't users that depend on the old behavior. It is too late to make such changes in this release, so let's roll back this patch and introduce the right one in the next release. Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Christian Brauner <brauner@kernel.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220913102551.1121611-3-avagin@google.com
* | | | | | | Merge tag '6.0-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2022-09-164-15/+12
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull cifs fixes from Steve French: "Four smb3 fixes for stable: - important fix to revalidate mapping when doing direct writes - missing spinlock - two fixes to socket handling - trivial change to update internal version number for cifs.ko" * tag '6.0-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module number cifs: add missing spinlock around tcon refcount cifs: always initialize struct msghdr smb_msg completely cifs: don't send down the destination address to sendmsg for a SOCK_STREAM cifs: revalidate mapping when doing direct writes
| * | | | | | | cifs: update internal module numberSteve French2022-09-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To 2.39 Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | | | | cifs: add missing spinlock around tcon refcountPaulo Alcantara2022-09-141-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add missing spinlock to protect updates on tcon refcount in cifs_put_tcon(). Fixes: d7d7a66aacd6 ("cifs: avoid use of global locks for high contention data") Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | | | | cifs: always initialize struct msghdr smb_msg completelyStefan Metzmacher2022-09-132-13/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far we were just lucky because the uninitialized members of struct msghdr are not used by default on a SOCK_STREAM tcp socket. But as new things like msg_ubuf and sg_from_iter where added recently, we should play on the safe side and avoid potention problems in future. Signed-off-by: Stefan Metzmacher <metze@samba.org> Cc: stable@vger.kernel.org Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | | | | cifs: don't send down the destination address to sendmsg for a SOCK_STREAMStefan Metzmacher2022-09-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is ignored anyway by the tcp layer. Signed-off-by: Stefan Metzmacher <metze@samba.org> Cc: stable@vger.kernel.org Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | | | | cifs: revalidate mapping when doing direct writesRonnie Sahlberg2022-09-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel bugzilla: 216301 When doing direct writes we need to also invalidate the mapping in case we have a cached copy of the affected page(s) in memory or else subsequent reads of the data might return the old/stale content before we wrote an update to the server. Cc: stable@vger.kernel.org Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
* | | | | | | | Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2022-09-131-4/+8
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull iov_iter fix from Al Viro: "Fix for a nfsd regression caused by the iov_iter stuff this window" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nfsd_splice_actor(): handle compound pages
| * | | | | | | | nfsd_splice_actor(): handle compound pagesAl Viro2022-09-121-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pipe_buffer might refer to a compound page (and contain more than a PAGE_SIZE worth of data). Theoretically it had been possible since way back, but nfsd_splice_actor() hadn't run into that until copy_page_to_iter() change. Fortunately, the only thing that changes for compound pages is that we need to stuff each relevant subpage in and convert the offset into offset in the first subpage. Acked-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Fixes: f0f6b614f83d "copy_page_to_iter(): don't split high-order page in case of ITER_PIPE" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>