summaryrefslogtreecommitdiffstats
path: root/fs/ext4/ext4.h
Commit message (Collapse)AuthorAgeFilesLines
* ext4: add ext4_sb_block_valid() refactored out of ext4_inode_block_valid()Ritesh Harjani2023-03-171-0/+3
| | | | | | | | | | | | | | commit 6bc6c2bdf1baca6522b8d9ba976257d722423085 upstream. This API will be needed at places where we don't have an inode for e.g. while freeing blocks in ext4_group_add_blocks() Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/dd34a236543ad5ae7123eeebe0cb69e6bdd44f34.1644992610.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: don't set up encryption key during jbd2 transactionEric Biggers2023-01-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | From: Eric Biggers <ebiggers@google.com> commit 4c0d5778385cb3618ff26a561ce41de2b7d9de70 upstream. Commit a80f7fcf1867 ("ext4: fixup ext4_fc_track_* functions' signature") extended the scope of the transaction in ext4_unlink() too far, making it include the call to ext4_find_entry(). However, ext4_find_entry() can deadlock when called from within a transaction because it may need to set up the directory's encryption key. Fix this by restoring the transaction to its original scope. Reported-by: syzbot+1a748d0007eeac3ab079@syzkaller.appspotmail.com Fixes: a80f7fcf1867 ("ext4: fixup ext4_fc_track_* functions' signature") Cc: <stable@vger.kernel.org> # v5.10+ Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20221106224841.279231-3-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: add EXT4_IGET_BAD flag to prevent unexpected bad inodeBaokun Li2023-01-121-1/+2
| | | | | | | | | | | | | | | | | | | | commit 63b1e9bccb71fe7d7e3ddc9877dbdc85e5d2d023 upstream. There are many places that will get unhappy (and crash) when ext4_iget() returns a bad inode. However, if iget the boot loader inode, allows a bad inode to be returned, because the inode may not be initialized. This mechanism can be used to bypass some checks and cause panic. To solve this problem, we add a special iget flag EXT4_IGET_BAD. Only with this flag we'd be returning bad inode from ext4_iget(), otherwise we always return the error code if the inode is bad inode.(suggested by Jan Kara) Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221026042310.3839669-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix undefined behavior in bit shift for ext4_check_flag_valuesGaosheng Cui2023-01-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3bf678a0f9c017c9ba7c581541dbc8453452a7ae upstream. Shifting signed 32-bit value by 31 bits is undefined, so changing significant bit to unsigned. The UBSAN warning calltrace like below: UBSAN: shift-out-of-bounds in fs/ext4/ext4.h:591:2 left shift of 1 by 31 places cannot be represented in type 'int' Call Trace: <TASK> dump_stack_lvl+0x7d/0xa5 dump_stack+0x15/0x1b ubsan_epilogue+0xe/0x4e __ubsan_handle_shift_out_of_bounds+0x1e7/0x20c ext4_init_fs+0x5a/0x277 do_one_initcall+0x76/0x430 kernel_init_freeable+0x3b3/0x422 kernel_init+0x24/0x1e0 ret_from_fork+0x1f/0x30 </TASK> Fixes: 9a4c80194713 ("ext4: ensure Inode flags consistency are checked at build time") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Link: https://lore.kernel.org/r/20221031055833.3966222-1-cuigaosheng1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: only allow test_dummy_encryption when supportedEric Biggers2022-06-091-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5f41fdaea63ddf96d921ab36b2af4a90ccdb5744 upstream. Make the test_dummy_encryption mount option require that the encrypt feature flag be already enabled on the filesystem, rather than automatically enabling it. Practically, this means that "-O encrypt" will need to be included in MKFS_OPTIONS when running xfstests with the test_dummy_encryption mount option. (ext4/053 also needs an update.) Moreover, as long as the preconditions for test_dummy_encryption are being tightened anyway, take the opportunity to start rejecting it when !CONFIG_FS_ENCRYPTION rather than ignoring it. The motivation for requiring the encrypt feature flag is that: - Having the filesystem auto-enable feature flags is problematic, as it bypasses the usual sanity checks. The specific issue which came up recently is that in kernel versions where ext4 supports casefold but not encrypt+casefold (v5.1 through v5.10), the kernel will happily add the encrypt flag to a filesystem that has the casefold flag, making it unmountable -- but only for subsequent mounts, not the initial one. This confused the casefold support detection in xfstests, causing generic/556 to fail rather than be skipped. - The xfstests-bld test runners (kvm-xfstests et al.) already use the required mkfs flag, so they will not be affected by this change. Only users of test_dummy_encryption alone will be affected. But, this option has always been for testing only, so it should be fine to require that the few users of this option update their test scripts. - f2fs already requires it (for its equivalent feature flag). Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Link: https://lore.kernel.org/r/20220519204437.61645-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix use-after-free in ext4_search_dirYe Bin2022-04-271-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c186f0887fe7061a35cebef024550ec33ef8fbd8 upstream. We got issue as follows: EXT4-fs (loop0): mounted filesystem without journal. Opts: ,errors=continue ================================================================== BUG: KASAN: use-after-free in ext4_search_dir fs/ext4/namei.c:1394 [inline] BUG: KASAN: use-after-free in search_dirblock fs/ext4/namei.c:1199 [inline] BUG: KASAN: use-after-free in __ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553 Read of size 1 at addr ffff8881317c3005 by task syz-executor117/2331 CPU: 1 PID: 2331 Comm: syz-executor117 Not tainted 5.10.0+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:83 [inline] dump_stack+0x144/0x187 lib/dump_stack.c:124 print_address_description+0x7d/0x630 mm/kasan/report.c:387 __kasan_report+0x132/0x190 mm/kasan/report.c:547 kasan_report+0x47/0x60 mm/kasan/report.c:564 ext4_search_dir fs/ext4/namei.c:1394 [inline] search_dirblock fs/ext4/namei.c:1199 [inline] __ext4_find_entry+0xdca/0x1210 fs/ext4/namei.c:1553 ext4_lookup_entry fs/ext4/namei.c:1622 [inline] ext4_lookup+0xb8/0x3a0 fs/ext4/namei.c:1690 __lookup_hash+0xc5/0x190 fs/namei.c:1451 do_rmdir+0x19e/0x310 fs/namei.c:3760 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x445e59 Code: 4d c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b c7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fff2277fac8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054 RAX: ffffffffffffffda RBX: 0000000000400280 RCX: 0000000000445e59 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200000c0 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000002 R10: 00007fff2277f990 R11: 0000000000000246 R12: 0000000000000000 R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000 The buggy address belongs to the page: page:0000000048cd3304 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1317c3 flags: 0x200000000000000() raw: 0200000000000000 ffffea0004526588 ffffea0004528088 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8881317c2f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8881317c2f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff8881317c3000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff8881317c3080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff8881317c3100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== ext4_search_dir: ... de = (struct ext4_dir_entry_2 *)search_buf; dlimit = search_buf + buf_size; while ((char *) de < dlimit) { ... if ((char *) de + de->name_len <= dlimit && ext4_match(dir, fname, de)) { ... } ... de_len = ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize); if (de_len <= 0) return -1; offset += de_len; de = (struct ext4_dir_entry_2 *) ((char *) de + de_len); } Assume: de=0xffff8881317c2fff dlimit=0x0xffff8881317c3000 If read 'de->name_len' which address is 0xffff8881317c3005, obviously is out of range, then will trigger use-after-free. To solve this issue, 'dlimit' must reserve 8 bytes, as we will read 'de->name_len' to judge if '(char *) de + de->name_len' out of range. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220324064816.1209985-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix fallocate to use file_modified to update permissions consistentlyDarrick J. Wong2022-04-271-1/+1
| | | | | | | | | | | | | | | | | | | | | commit ad5cd4f4ee4d5fcdb1bfb7a0c073072961e70783 upstream. Since the initial introduction of (posix) fallocate back at the turn of the century, it has been possible to use this syscall to change the user-visible contents of files. This can happen by extending the file size during a preallocation, or through any of the newer modes (punch, zero, collapse, insert range). Because the call can be used to change file contents, we should treat it like we do any other modification to a file -- update the mtime, and drop set[ug]id privileges/capabilities. The VFS function file_modified() does all this for us if pass it a locked inode, so let's make fallocate drop permissions correctly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220308185043.GA117678@magnolia Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fast commit may miss file actionsXin Yin2022-03-081-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit bdc8a53a6f2f0b1cb5f991440f2100732299eb93 ] in the follow scenario: 1. jbd start transaction n 2. task A get new handle for transaction n+1 3. task A do some actions and add inode to FC_Q_MAIN fc_q 4. jbd complete transaction n and clear FC_Q_MAIN fc_q 5. task A call fsync Fast commit will lost the file actions during a full commit. we should also add updates to staging queue during a full commit. and in ext4_fc_cleanup(), when reset a inode's fc track range, check it's i_sync_tid, if it bigger than current transaction tid, do not rest it, or we will lost the track range. And EXT4_MF_FC_COMMITTING is not needed anymore, so drop it. Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Link: https://lore.kernel.org/r/20220117093655.35160-3-yinxin.x@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
* ext4: fast commit may not fallback for ineligible commitXin Yin2022-03-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e85c81ba8859a4c839bcd69c5d83b32954133a5b ] For the follow scenario: 1. jbd start commit transaction n 2. task A get new handle for transaction n+1 3. task A do some ineligible actions and mark FC_INELIGIBLE 4. jbd complete transaction n and clean FC_INELIGIBLE 5. task A call fsync In this case fast commit will not fallback to full commit and transaction n+1 also not handled by jbd. Make ext4_fc_mark_ineligible() also record transaction tid for latest ineligible case, when call ext4_fc_cleanup() check current transaction tid, if small than latest ineligible tid do not clear the EXT4_MF_FC_INELIGIBLE. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Suggested-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Link: https://lore.kernel.org/r/20220117093655.35160-2-yinxin.x@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
* ext4: simplify updating of fast commit statsHarshad Shirwadkar2022-03-081-1/+0
| | | | | | | | | | | | | [ Upstream commit 0915e464cb274648e1ef1663e1356e53ff400983 ] Move fast commit stats updating logic to a separate function from ext4_fc_commit(). This significantly improves readability of ext4_fc_commit(). Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223202140.2061101-4-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ext4: drop ineligible txn start stop APIsHarshad Shirwadkar2022-03-081-4/+2
| | | | | | | | | | | | | | [ Upstream commit 7bbbe241ec7ce0def9f71464c878fdbd2b0dcf37 ] This patch drops ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() APIs. Fast commit ineligible transactions should simply call ext4_fc_mark_ineligible() after starting the trasaction. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211223202140.2061101-3-harshads@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ext4: prevent used blocks from being allocated during fast commit replayXin Yin2022-02-081-0/+3
| | | | | | | | | | | | | | | | | | commit 599ea31d13617c5484c40cdf50d88301dc351cfc upstream. During fast commit replay procedure, we clear inode blocks bitmap in ext4_ext_clear_bb(), this may cause ext4_mb_new_blocks_simple() allocate blocks still in use. Make ext4_fc_record_regions() also record physical disk regions used by inodes during replay procedure. Then ext4_mb_new_blocks_simple() can excludes these blocks in use. Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Link: https://lore.kernel.org/r/20220110035141.1980-2-yinxin.x@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: destroy ext4_fc_dentry_cachep kmemcache on module removalSebastian Andrzej Siewior2022-01-271-0/+1
| | | | | | | | | | | | | | | | | | | | commit ab047d516dea72f011c15c04a929851e4d053109 upstream. The kmemcache for ext4_fc_dentry_cachep remains registered after module removal. Destroy ext4_fc_dentry_cachep kmemcache on module removal. Fixes: aa75f4d3daaeb ("ext4: main fast-commit commit path") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20211110134640.lyku5vklvdndw6uk@linutronix.de Link: https://lore.kernel.org/r/YbiK3JetFFl08bd7@linutronix.de Link: https://lore.kernel.org/r/20211223164436.2628390-1-bigeasy@linutronix.de Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2021-10-031-3/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix a number of ext4 bugs in fast_commit, inline data, and delayed allocation. Also fix error handling code paths in ext4_dx_readdir() and ext4_fill_super(). Finally, avoid a grabbing a journal head in the delayed allocation write in the common cases where we are overwriting a pre-existing block or appending to an inode" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: recheck buffer uptodate bit under buffer lock ext4: fix potential infinite loop in ext4_dx_readdir() ext4: flush s_error_work before journal destroy in ext4_fill_super ext4: fix loff_t overflow in ext4_max_bitmap_size() ext4: fix reserved space counter leakage ext4: limit the number of blocks in one ADD_RANGE TLV ext4: enforce buffer head state assertion in ext4_da_map_blocks ext4: remove extent cache entries when truncating inline data ext4: drop unnecessary journal handle in delalloc write ext4: factor out write end code of inline file ext4: correct the error path of ext4_write_inline_data_end() ext4: check and update i_disksize properly ext4: add error checking to ext4_ext_replay_set_iblocks()
| * Merge branch 'delalloc-buffer-write' into devTheodore Ts'o2021-09-091-3/+0
| |\ | | | | | | | | | | | | | | | | | | | | | | | | Fix a bug in how we update i_disksize, and the error path in inline_data_end. Finally, drop an unnecessary creation of a journal handle which was only needed for inline data, which can give us a large performance gain in delayed allocation writes. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| | * ext4: factor out write end code of inline fileZhang Yi2021-09-041-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that the inline_data file write end procedure are falled into the common write end functions, it is not clear. Factor them out and do some cleanup. This patch also drop ext4_da_write_inline_data_end() and switch to use ext4_write_inline_data_end() instead because we also need to do the same error processing if we failed to write data into inline entry. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com
* | | Merge tag 'ext4_for_linus' of ↵Linus Torvalds2021-09-021-11/+97
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "In addition to some ext4 bug fixes and cleanups, this cycle we add the orphan_file feature, which eliminates bottlenecks when doing a large number of parallel truncates and file deletions, and move the discard operation out of the jbd2 commit thread when using the discard mount option, to better support devices with slow discard operations" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits) ext4: make the updating inode data procedure atomic ext4: remove an unnecessary if statement in __ext4_get_inode_loc() ext4: move inode eio simulation behind io completeion ext4: Improve scalability of ext4 orphan file handling ext4: Orphan file documentation ext4: Speedup ext4 orphan inode handling ext4: Move orphan inode handling into a separate file ext4: Support for checksumming from journal triggers ext4: fix race writing to an inline_data file while its xattrs are changing jbd2: add sparse annotations for add_transaction_credits() ext4: fix sparse warnings ext4: Make sure quota files are not grabbed accidentally ext4: fix e2fsprogs checksum failure for mounted filesystem ext4: if zeroout fails fall back to splitting the extent node ext4: reduce arguments of ext4_fc_add_dentry_tlv ext4: flush background discard kwork when retry allocation ext4: get discard out of jbd2 commit kthread contex ext4: remove the repeated comment of ext4_trim_all_free ext4: add new helper interface ext4_try_to_trim_range() ext4: remove the 'group' parameter of ext4_trim_extent ...
| * | ext4: Improve scalability of ext4 orphan file handlingJan Kara2021-08-301-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even though the length of the critical section when adding / removing orphaned inodes was significantly reduced by using orphan file, the contention of lock protecting orphan file still appears high in profiles for truncate / unlink intensive workloads with high number of threads. This patch makes handling of orphan file completely lockless. Also to reduce conflicts between CPUs different CPUs start searching for empty slot in orphan file in different blocks. Performance comparison of locked orphan file handling, lockless orphan file handling, and completely disabled orphan inode handling from 80 CPU Xeon Server with 526 GB of RAM, filesystem located on SAS SSD disk, average of 5 runs: stress-orphan (microbenchmark truncating files byte-by-byte from N processes in parallel) Threads Time Time Time Orphan locked Orphan lockless No orphan 1 0.945600 0.939400 0.891200 2 1.331800 1.246600 1.174400 4 1.995000 1.780600 1.713200 8 6.424200 4.900000 4.106000 16 14.937600 8.516400 8.138000 32 33.038200 24.565600 24.002200 64 60.823600 39.844600 38.440200 128 122.941400 70.950400 69.315000 So we can see that with lockless orphan file handling, addition / deletion of orphaned inodes got almost completely out of picture even for a microbenchmark stressing it. For reaim creat_clo workload on ramdisk there are also noticeable gains (average of 5 runs): Clients Vanilla (ops/s) Patched (ops/s) creat_clo-1 14705.88 ( 0.00%) 14354.07 * -2.39%* creat_clo-3 27108.43 ( 0.00%) 28301.89 ( 4.40%) creat_clo-5 37406.48 ( 0.00%) 45180.73 * 20.78%* creat_clo-7 41338.58 ( 0.00%) 54687.50 * 32.29%* creat_clo-9 45226.13 ( 0.00%) 62937.07 * 39.16%* creat_clo-11 44000.00 ( 0.00%) 65088.76 * 47.93%* creat_clo-13 36516.85 ( 0.00%) 68661.97 * 88.03%* creat_clo-15 30864.20 ( 0.00%) 69551.78 * 125.35%* creat_clo-17 27478.45 ( 0.00%) 67729.08 * 146.48%* creat_clo-19 25000.00 ( 0.00%) 61621.62 * 146.49%* creat_clo-21 18772.35 ( 0.00%) 63829.79 * 240.02%* creat_clo-23 16698.94 ( 0.00%) 61938.96 * 270.92%* creat_clo-25 14973.05 ( 0.00%) 56947.61 * 280.33%* creat_clo-27 16436.69 ( 0.00%) 65008.03 * 295.51%* creat_clo-29 13949.01 ( 0.00%) 69047.62 * 395.00%* creat_clo-31 14283.52 ( 0.00%) 67982.45 * 375.95%* Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: Speedup ext4 orphan inode handlingJan Kara2021-08-301-7/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ext4 orphan inode handling is a bottleneck for workloads which heavily truncate / unlink small files since it contends on the global s_orphan_mutex lock (and generally it's difficult to improve scalability of the ondisk linked list of orphaned inodes). This patch implements new way of handling orphan inodes. Instead of linking orphaned inode into a linked list, we store it's inode number in a new special file which we call "orphan file". Only if there's no more space in the orphan file (too many inodes are currently orphaned) we fall back to using old style linked list. Currently we protect operations in the orphan file with a spinlock for simplicity but even in this setting we can substantially reduce the length of the critical section and thus speedup some workloads. In the next patch we improve this by making orphan handling lockless. Note that the change is backwards compatible when the filesystem is clean - the existence of the orphan file is a compat feature, we set another ro-compat feature indicating orphan file needs scanning for orphaned inodes when mounting filesystem read-write. This ro-compat feature gets cleared on unmount / remount read-only. Some performance data from 80 CPU Xeon Server with 512 GB of RAM, filesystem located on SSD, average of 5 runs: stress-orphan (microbenchmark truncating files byte-by-byte from N processes in parallel) Threads Time Time Vanilla Patched 1 1.057200 0.945600 2 1.680400 1.331800 4 2.547000 1.995000 8 7.049400 6.424200 16 14.827800 14.937600 32 40.948200 33.038200 64 87.787400 60.823600 128 206.504000 122.941400 So we can see significant wins all over the board. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: Move orphan inode handling into a separate fileJan Kara2021-08-301-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move functions for handling orphan inodes into a new file fs/ext4/orphan.c to have them in one place and somewhat reduce size of other files. No code changes. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: Support for checksumming from journal triggersJan Kara2021-08-301-2/+24
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | JBD2 layer support triggers which are called when journaling layer moves buffer to a certain state. We can use the frozen trigger, which gets called when buffer data is frozen and about to be written out to the journal, to compute block checksums for some buffer types (similarly as does ocfs2). This avoids unnecessary repeated recomputation of the checksum (at the cost of larger window where memory corruption won't be caught by checksumming) and is even necessary when there are unsynchronized updaters of the checksummed data. So add superblock and journal trigger type arguments to ext4_journal_get_write_access() and ext4_journal_get_create_access() so that frozen triggers can be set accordingly. Also add inode argument to ext4_walk_page_buffers() and all the callbacks used with that function for the same purpose. This patch is mostly only a change of prototype of the above mentioned functions and a few small helpers. Real checksumming will come later. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: flush background discard kwork when retry allocationWang Jianchao2021-08-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | The background discard kwork tries to mark blocks used and issue discard. This can make filesystem suffer from NOSPC error, xfstest generic/371 can fail due to it. Fix it by flushing discard kwork in ext4_should_retry_alloc. At the same time, give up discard at the moment. Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Link: https://lore.kernel.org/r/20210830075246.12516-6-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: get discard out of jbd2 commit kthread contexWang Jianchao2021-08-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now, discard is issued and waited to be completed in jbd2 commit kthread context after the logs are committed. When large amount of files are deleted and discard is flooding, jbd2 commit kthread can be blocked for long time. Then all of the metadata operations can be blocked to wait the log space. One case is the page fault path with read mm->mmap_sem held, which wants to update the file time but has to wait for the log space. When other threads in the task wants to do mmap, then write mmap_sem is blocked. Finally all of the following read mmap_sem requirements are blocked, even the ps command which need to read the /proc/pid/ -cmdline. Our monitor service which needs to read /proc/pid/cmdline used to be blocked for 5 mins. This patch frees the blocks back to buddy after commit and then do discard in a async kworker context in fstrim fashion, namely, - mark blocks to be discarded as used if they have not been allocated - do discard - mark them free After this, jbd2 commit kthread won't be blocked any more by discard and we won't get NOSPC even if the discard is slow or throttled. Link: https://marc.info/?l=linux-kernel&m=162143690731901&w=2 Suggested-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Link: https://lore.kernel.org/r/20210830075246.12516-5-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: Convert to use mapping->invalidate_lockJan Kara2021-07-131-10/+0
|/ | | | | | | | | | | | | Convert ext4 to use mapping->invalidate_lock instead of its private EXT4_I(inode)->i_mmap_sem. This is mostly search-and-replace. By this conversion we fix a long standing race between hole punching and read(2) / readahead(2) paths that can lead to stale page cache contents. CC: <linux-ext4@vger.kernel.org> CC: Ted Tso <tytso@mit.edu> Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>
* ext4: notify sysfs on errors_count value changeJonathan Davies2021-06-291-0/+1
| | | | | | | | | | | | | | After s_error_count is incremented, signal the change in the corresponding sysfs attribute via sysfs_notify. This allows userspace to poll() on changes to /sys/fs/ext4/*/errors_count. [ Moved call of ext4_notify_error_sysfs() to flush_stashed_error_work() to avoid BUG's caused by calling sysfs_notify trying to sleep after being called from an invalid context. -- TYT ] Signed-off-by: Jonathan Davies <jonathan.davies@nutanix.com> Link: https://lore.kernel.org/r/20210611140209.28903-1-jonathan.davies@nutanix.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix comment for s_hash_unsignedEric Biggers2021-06-241-1/+1
| | | | | | | | | Fix the comment for s_hash_unsigned to not be the opposite of what it actually is. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20210527235557.2377525-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: add ioctl EXT4_IOC_CHECKPOINTLeah Rumancik2021-06-221-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ioctl EXT4_IOC_CHECKPOINT checkpoints and flushes the journal. This includes forcing all the transactions to the log, checkpointing the transactions, and flushing the log to disk. This ioctl takes u32 "flags" as an argument. Three flags are supported. EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN can be used to verify input to the ioctl. It returns error if there is any invalid input, otherwise it returns success without performing any checkpointing. The other two flags, EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT, can be used to issue requests to discard or zeroout the journal logs blocks, respectively. At this point, EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT is primarily added to enable testing of this codepath on devices that don't support discard. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT cannot both be set. Systems that wish to achieve content deletion SLO can set up a daemon that calls this ioctl at a regular interval such that it matches with the SLO requirement. Thus, with this patch, the ext4_dir_entry2 wipeout patch[1], and the Ext4 "-o discard" mount option set, Ext4 can now guarantee that all file contents, file metatdata, and filenames will not be accessible through the filesystem and will have had discard or zeroout requests issued for corresponding device blocks. The __jbd2_journal_erase function could also be used to discard or zero-fill the journal during journal load after recovery. This would provide a potential solution to a journal replay bug reported earlier this year[2]. After a successful journal recovery, e2fsck can call this ioctl to discard the journal as well. [1] https://lore.kernel.org/linux-ext4/YIHknqxngB1sUdie@mit.edu/ [2] https://lore.kernel.org/linux-ext4/YDZoaacIYStFQT8g@mit.edu/ Link: https://lore.kernel.org/r/20210518151327.130198-2-leah.rumancik@gmail.com Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix memory leak in ext4_fill_superPavel Skripkin2021-06-171-0/+4
| | | | | | | | | | | | | | | | | | | static int kthread(void *_create) will return -ENOMEM or -EINTR in case of internal failure or kthread_stop() call happens before threadfn call. To prevent fancy error checking and make code more straightforward we moved all cleanup code out of kmmpd threadfn. Also, dropped struct mmpd_data at all. Now struct super_block is a threadfn data and struct buffer_head embedded into struct ext4_sb_info. Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Link: https://lore.kernel.org/r/20210430185046.15742-1-paskripkin@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: remove redundant check buffer_uptodate()Joseph Qi2021-06-171-1/+1
| | | | | | | | | | Now set_buffer_uptodate() will test first and then set, so we don't have to check buffer_uptodate() first, remove it to simplify code. Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://lore.kernel.org/r/1619418587-5580-1-git-send-email-joseph.qi@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Merge tag 'ext4_for_linus' of ↵Linus Torvalds2021-04-301-22/+85
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "New features for ext4 this cycle include support for encrypted casefold, ensure that deleted file names are cleared in directory blocks by zeroing directory entries when they are unlinked or moved as part of a hash tree node split. We also improve the block allocator's performance on a freshly mounted file system by prefetching block bitmaps. There are also the usual cleanups and bug fixes, including fixing a page cache invalidation race when there is mixed buffered and direct I/O and the block size is less than page size, and allow the dax flag to be set and cleared on inline directories" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits) ext4: wipe ext4_dir_entry2 upon file deletion ext4: Fix occasional generic/418 failure fs: fix reporting supported extra file attributes for statx() ext4: allow the dax flag to be set and cleared on inline directories ext4: fix debug format string warning ext4: fix trailing whitespace ext4: fix various seppling typos ext4: fix error return code in ext4_fc_perform_commit() ext4: annotate data race in jbd2_journal_dirty_metadata() ext4: annotate data race in start_this_handle() ext4: fix ext4_error_err save negative errno into superblock ext4: fix error code in ext4_commit_super ext4: always panic when errors=panic is specified ext4: delete redundant uptodate check for buffer ext4: do not set SB_ACTIVE in ext4_orphan_cleanup() ext4: make prefetch_block_bitmaps default ext4: add proc files to monitor new structures ext4: improve cr 0 / cr 1 group scanning ext4: add MB_NUM_ORDERS macro ext4: add mballoc stats proc file ...
| * ext4: make prefetch_block_bitmaps defaultHarshad Shirwadkar2021-04-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Block bitmap prefetching is needed for these allocator optimization data structures to get populated and provide better group scanning order. So, turn it on bu default. prefetch_block_bitmaps mount option is now marked as removed and a new option no_prefetch_block_bitmaps is added to disable block bitmap prefetching. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20210401172129.189766-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: add proc files to monitor new structuresHarshad Shirwadkar2021-04-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a new file "mb_structs_summary" which allows us to see the summary of the new allocator structures added in this series. Here's the sample output of file: optimize_scan: 1 max_free_order_lists: list_order_0_groups: 0 list_order_1_groups: 0 list_order_2_groups: 0 list_order_3_groups: 0 list_order_4_groups: 0 list_order_5_groups: 0 list_order_6_groups: 0 list_order_7_groups: 0 list_order_8_groups: 0 list_order_9_groups: 0 list_order_10_groups: 0 list_order_11_groups: 0 list_order_12_groups: 0 list_order_13_groups: 40 fragment_size_tree: tree_min: 16384 tree_max: 32768 tree_nodes: 40 Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20210401172129.189766-7-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: improve cr 0 / cr 1 group scanningHarshad Shirwadkar2021-04-091-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of traversing through groups linearly, scan groups in specific orders at cr 0 and cr 1. At cr 0, we want to find groups that have the largest free order >= the order of the request. So, with this patch, we maintain lists for each possible order and insert each group into a list based on the largest free order in its buddy bitmap. During cr 0 allocation, we traverse these lists in the increasing order of largest free orders. This allows us to find a group with the best available cr 0 match in constant time. If nothing can be found, we fallback to cr 1 immediately. At CR1, the story is slightly different. We want to traverse in the order of increasing average fragment size. For CR1, we maintain a rb tree of groupinfos which is sorted by average fragment size. Instead of traversing linearly, at CR1, we traverse in the order of increasing average fragment size, starting at the most optimal group. This brings down cr 1 search complexity to log(num groups). For cr >= 2, we just perform the linear search as before. Also, in case of lock contention, we intermittently fallback to linear search even in CR 0 and CR 1 cases. This allows us to proceed during the allocation path even in case of high contention. There is an opportunity to do optimization at CR2 too. That's because at CR2 we only consider groups where bb_free counter (number of free blocks) is greater than the request extent size. That's left as future work. All the changes introduced in this patch are protected under a new mount option "mb_optimize_scan". With this patchset, following experiment was performed: Created a highly fragmented disk of size 65TB. The disk had no contiguous 2M regions. Following command was run consecutively for 3 times: time dd if=/dev/urandom of=file bs=2M count=10 Here are the results with and without cr 0/1 optimizations introduced in this patch: |---------+------------------------------+---------------------------| | | Without CR 0/1 Optimizations | With CR 0/1 Optimizations | |---------+------------------------------+---------------------------| | 1st run | 5m1.871s | 2m47.642s | | 2nd run | 2m28.390s | 0m0.611s | | 3rd run | 2m26.530s | 0m1.255s | |---------+------------------------------+---------------------------| Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: add mballoc stats proc fileHarshad Shirwadkar2021-04-091-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new stats for measuring the performance of mballoc. This patch is forked from Artem Blagodarenko's work that can be found here: https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch This patch reorganizes the stats by cr level. This is how the output looks like: mballoc: reqs: 0 success: 0 groups_scanned: 0 cr0_stats: hits: 0 groups_considered: 0 useless_loops: 0 bad_suggestions: 0 cr1_stats: hits: 0 groups_considered: 0 useless_loops: 0 bad_suggestions: 0 cr2_stats: hits: 0 groups_considered: 0 useless_loops: 0 cr3_stats: hits: 0 groups_considered: 0 useless_loops: 0 extents_scanned: 0 goal_hits: 0 2^n_hits: 0 breaks: 0 lost: 0 buddies_generated: 0/40 buddies_time_used: 0 preallocated: 0 discarded: 0 Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20210401172129.189766-4-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: drop s_mb_bal_lock and convert protected fields to atomicHarshad Shirwadkar2021-04-091-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | s_mb_buddies_generated gets used later in this patch series to determine if the cr 0 and cr 1 optimziations should be performed or not. Currently, s_mb_buddies_generated is protected under a spin_lock. In the allocation path, it is better if we don't depend on the lock and instead read the value atomically. In order to do that, we drop s_bal_lock altogether and we convert the only two protected fields by it s_mb_buddies_generated and s_mb_generation_time to atomic type. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20210401172129.189766-2-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: optimize match for casefolded encrypted dirsDaniel Rosenberg2021-04-051-8/+9
| | | | | | | | | | | | | | | | | | | | Matching names with casefolded encrypting directories requires decrypting entries to confirm case since we are case preserving. We can avoid needing to decrypt if our hash values don't match. Signed-off-by: Daniel Rosenberg <drosen@google.com> Link: https://lore.kernel.org/r/20210319073414.1381041-3-drosen@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: handle casefolding with encryptionDaniel Rosenberg2021-04-051-8/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds support for encryption with casefolding. Since the name on disk is case preserving, and also encrypted, we can no longer just recompute the hash on the fly. Additionally, to avoid leaking extra information from the hash of the unencrypted name, we use siphash via an fscrypt v2 policy. The hash is stored at the end of the directory entry for all entries inside of an encrypted and casefolded directory apart from those that deal with '.' and '..'. This way, the change is backwards compatible with existing ext4 filesystems. [ Changed to advertise this feature via the file: /sys/fs/ext4/features/encrypted_casefold -- TYT ] Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20210319073414.1381041-2-drosen@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: convert to fileattrMiklos Szeredi2021-04-121-9/+3
|/ | | | | | | | Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: "Theodore Ts'o" <tytso@mit.edu>
* ext4: fix rename whiteout with fast commitHarshad Shirwadkar2021-03-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds rename whiteout support in fast commits. Note that the whiteout object that gets created is actually char device. Which imples, the function ext4_inode_journal_mode(struct inode *inode) would return "JOURNAL_DATA" for this inode. This has a consequence in fast commit code that it will make creation of the whiteout object a fast-commit ineligible behavior and thus will fall back to full commits. With this patch, this can be observed by running fast commits with rename whiteout and seeing the stats generated by ext4_fc_stats tracepoint as follows: ext4_fc_stats: dev 254:32 fc ineligible reasons: XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0, RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16; num_commits:6, ineligible: 6, numblks: 3 So in short, this patch guarantees that in case of rename whiteout, we fall back to full commits. Amir mentioned that instead of creating a new whiteout object for every rename, we can create a static whiteout object with irrelevant nlink. That will make fast commits to not fall back to full commit. But until this happens, this patch will ensure correctness by falling back to full commits. Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Cc: stable@kernel.org Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: shrink race window in ext4_should_retry_alloc()Eric Whitney2021-03-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it fails at significant rates on the two test scenarios that disable delayed allocation (ext3conv and data_journal) and force actual block allocation for the fallocate and pwrite functions in the test. The failure rate on 5.10 for both ext3conv and data_journal on one test system typically runs about 85%. On 5.11, the failure rate on ext3conv sometimes drops to as low as 1% while the rate on data_journal increases to nearly 100%. The observed failures are largely due to ext4_should_retry_alloc() cutting off block allocation retries when s_mb_free_pending (used to indicate that a transaction in progress will free blocks) is 0. However, free space is usually available when this occurs during runs of generic/371. It appears that a thread attempting to allocate blocks is just missing transaction commits in other threads that increase the free cluster count and reset s_mb_free_pending while the allocating thread isn't running. Explicitly testing for free space availability avoids this race. The current code uses a post-increment operator in the conditional expression that determines whether the retry limit has been exceeded. This means that the conditional expression uses the value of the retry counter before it's increased, resulting in an extra retry cycle. The current code actually retries twice before hitting its retry limit rather than once. Increasing the retry limit to 3 from the current actual maximum retry count of 2 in combination with the change described above reduces the observed failure rate to less that 0.1% on both ext3conv and data_journal with what should be limited impact on users sensitive to the overhead caused by retries. A per filesystem percpu counter exported via sysfs is added to allow users or developers to track the number of times the retry limit is exceeded without resorting to debugging methods. This should provide some insight into worst case retry behavior. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: support idmapped mountsChristian Brauner2021-01-241-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enable idmapped mounts for ext4. All dedicated helpers we need for this exist. So this basically just means we're passing down the user_namespace argument from the VFS methods to the relevant helpers. Let's create simple example where we idmap an ext4 filesystem: root@f2-vm:~# truncate -s 5G ext4.img root@f2-vm:~# mkfs.ext4 ./ext4.img mke2fs 1.45.5 (07-Jan-2020) Discarding device blocks: done Creating filesystem with 1310720 4k blocks and 327680 inodes Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736 Allocating group tables: done Writing inode tables: done Creating journal (16384 blocks): done Writing superblocks and filesystem accounting information: done root@f2-vm:~# losetup -f --show ./ext4.img /dev/loop0 root@f2-vm:~# mount /dev/loop0 /mnt root@f2-vm:~# ls -al /mnt/ total 24 drwxr-xr-x 3 root root 4096 Oct 28 13:34 . drwxr-xr-x 30 root root 4096 Oct 28 13:22 .. drwx------ 2 root root 16384 Oct 28 13:34 lost+found # Let's create an idmapped mount at /idmapped1 where we map uid and gid # 0 to uid and gid 1000 root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/ root@f2-vm:/# ls -al /idmapped1/ total 24 drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 28 13:34 . drwxr-xr-x 30 root root 4096 Oct 28 13:22 .. drwx------ 2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found # Let's create an idmapped mount at /idmapped2 where we map uid and gid # 0 to uid and gid 2000 root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/ root@f2-vm:/# ls -al /idmapped2/ total 24 drwxr-xr-x 3 2000 2000 4096 Oct 28 13:34 . drwxr-xr-x 31 root root 4096 Oct 28 13:39 .. drwx------ 2 2000 2000 16384 Oct 28 13:34 lost+found Let's create another example where we idmap the rootfs filesystem without a mapping for uid 0 and gid 0: # Create an idmapped mount of for a full POSIX range of rootfs under # /mnt but without a mapping for uid 0 to reduce attack surface root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/ # Since we don't have a mapping for uid and gid 0 all files owned by # uid and gid 0 should show up as uid and gid 65534: root@f2-vm:/# ls -al /mnt/ total 664 drwxr-xr-x 31 nobody nogroup 4096 Oct 28 13:39 . drwxr-xr-x 31 root root 4096 Oct 28 13:39 .. lrwxrwxrwx 1 nobody nogroup 7 Aug 25 07:44 bin -> usr/bin drwxr-xr-x 4 nobody nogroup 4096 Oct 28 13:17 boot drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:48 dev drwxr-xr-x 81 nobody nogroup 4096 Oct 28 04:00 etc drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 home lrwxrwxrwx 1 nobody nogroup 7 Aug 25 07:44 lib -> usr/lib lrwxrwxrwx 1 nobody nogroup 9 Aug 25 07:44 lib32 -> usr/lib32 lrwxrwxrwx 1 nobody nogroup 9 Aug 25 07:44 lib64 -> usr/lib64 lrwxrwxrwx 1 nobody nogroup 10 Aug 25 07:44 libx32 -> usr/libx32 drwx------ 2 nobody nogroup 16384 Aug 25 07:47 lost+found drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 media drwxr-xr-x 31 nobody nogroup 4096 Oct 28 13:39 mnt drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 opt drwxr-xr-x 2 nobody nogroup 4096 Apr 15 2020 proc drwx--x--x 6 nobody nogroup 4096 Oct 28 13:34 root drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:46 run lrwxrwxrwx 1 nobody nogroup 8 Aug 25 07:44 sbin -> usr/sbin drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 srv drwxr-xr-x 2 nobody nogroup 4096 Apr 15 2020 sys drwxrwxrwt 10 nobody nogroup 4096 Oct 28 13:19 tmp drwxr-xr-x 14 nobody nogroup 4096 Oct 20 13:00 usr drwxr-xr-x 12 nobody nogroup 4096 Aug 25 07:45 var # Since we do have a mapping for uid and gid 1000 all files owned by # uid and gid 1000 should simply show up as uid and gid 1000: root@f2-vm:/# ls -al /mnt/home/ubuntu/ total 40 drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 28 00:43 . drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 2936 Oct 28 12:26 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* fs: make helpers idmap mount awareChristian Brauner2021-01-241-3/+6
| | | | | | | | | | | | | | | | | | | | | Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* Merge tag 'ext4_for_linus' of ↵Linus Torvalds2020-12-241-19/+58
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Various bug fixes and cleanups for ext4; no new features this cycle" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (29 commits) ext4: remove unnecessary wbc parameter from ext4_bio_write_page ext4: avoid s_mb_prefetch to be zero in individual scenarios ext4: defer saving error info from atomic context ext4: simplify ext4 error translation ext4: move functions in super.c ext4: make ext4_abort() use __ext4_error() ext4: standardize error message in ext4_protect_reserved_inode() ext4: remove redundant sb checksum recomputation ext4: don't remount read-only with errors=continue on reboot ext4: fix deadlock with fs freezing and EA inodes jbd2: add a helper to find out number of fast commit blocks ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.h ext4: fix fall-through warnings for Clang ext4: add docs about fast commit idempotence ext4: remove the unused EXT4_CURRENT_REV macro ext4: fix an IS_ERR() vs NULL check ext4: check for invalid block size early when mounting a file system ext4: fix a memory leak of ext4_free_data ext4: delete nonsensical (commented-out) code inside ext4_xattr_block_set() ext4: update ext4_data_block_valid related comments ...
| * ext4: remove unnecessary wbc parameter from ext4_bio_write_pageLei Chen2020-12-221-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | ext4_bio_write_page does not need wbc parameter, since its parameter io contains the io_wbc field. The io::io_wbc is initialized by ext4_io_submit_init which is called in ext4_writepages and ext4_writepage functions prior to ext4_bio_write_page. Therefor, when ext4_bio_write_page is called, wbc info has already been included in io parameter. Signed-off-by: Lei Chen <lennychen@tencent.com> Link: https://lore.kernel.org/r/1607669664-25656-1-git-send-email-lennychen@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: defer saving error info from atomic contextJan Kara2020-12-221-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When filesystem inconsistency is detected with group locked, we currently try to modify superblock to store error there without blocking. However this can cause superblock checksum failures (or DIF/DIX failure) when the superblock is just being written out. Make error handling code just store error information in ext4_sb_info structure and copy it to on-disk superblock only in ext4_commit_super(). In case of error happening with group locked, we just postpone the superblock flushing to a workqueue. [ Added fixup so that s_first_error_* does not get updated after the file system is remounted. Also added fix for syzbot failure. - Ted ] Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201127113405.26867-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Hillf Danton <hdanton@sina.com> Reported-by: syzbot+9043030c040ce1849a60@syzkaller.appspotmail.com
| * ext4: make ext4_abort() use __ext4_error()Jan Kara2020-12-171-17/+12
| | | | | | | | | | | | | | | | | | | | | | The only difference between __ext4_abort() and __ext4_error() is that the former one ignores errors=continue mount option. Unify the code to reduce duplication. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: remove the unused EXT4_CURRENT_REV macroKaixu Xia2020-12-171-1/+0
| | | | | | | | | | | | | | | | | | There are no callers of the EXT4_CURRENT_REV macro, so remove it. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1605164202-31120-1-git-send-email-kaixuxia@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: use ASSERT() to replace J_ASSERT()Chunguang Xu2020-12-031-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | There are currently multiple forms of assertion, such as J_ASSERT(). J_ASEERT() is provided for the jbd module, which is a public module. Maybe we should use custom ASSERT() like other file systems, such as xfs, which would be better. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1604764698-4269-1-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: add helpers for checking whether quota can be enabled/is journalledRoman Anufriev2020-12-031-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now, there are several places, where we check whether fs is capable of enabling quota or if quota is journalled with quite long and non-self-descriptive condition statements. This patch wraps these statements into helpers for better readability and easier usage. Link: https://lore.kernel.org/r/1603336860-16153-1-git-send-email-dotdot@yandex-team.ru Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Roman Anufriev <dotdot@yandex-team.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | Merge tag 'f2fs-for-5.11-rc1' of ↵Linus Torvalds2020-12-171-4/+0
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've made more work into per-file compression support. For example, F2FS_IOC_GET | SET_COMPRESS_OPTION provides a way to change the algorithm or cluster size per file. F2FS_IOC_COMPRESS | DECOMPRESS_FILE provides a way to compress and decompress the existing normal files manually. There is also a new mount option, compress_mode=fs|user, which can control who compresses the data. Chao also added a checksum feature with a mount option so that we are able to detect any corrupted cluster. In addition, Daniel contributed casefolding with encryption patch, which will be used for Android devices. Summary: Enhancements: - add ioctls and mount option to manage per-file compression feature - support casefolding with encryption - support checksum for compressed cluster - avoid IO starvation by replacing mutex with rwsem - add sysfs, max_io_bytes, to control max bio size Bug fixes: - fix use-after-free issue when compression and fsverity are enabled - fix consistency corruption during fault injection test - fix data offset for lseek - get rid of buffer_head which has 32bits limit in fiemap - fix some bugs in multi-partitions support - fix nat entry count calculation in shrinker - fix some stat information And, we've refactored some logics and fix minor bugs as well" * tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits) f2fs: compress: fix compression chksum f2fs: fix shift-out-of-bounds in sanity_check_raw_super() f2fs: fix race of pending_pages in decompression f2fs: fix to account inline xattr correctly during recovery f2fs: inline: fix wrong inline inode stat f2fs: inline: correct comment in f2fs_recover_inline_data f2fs: don't check PAGE_SIZE again in sanity_check_raw_super() f2fs: convert to F2FS_*_INO macro f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size f2fs: don't allow any writes on readonly mount f2fs: avoid race condition for shrinker count f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE f2fs: add compress_mode mount option f2fs: Remove unnecessary unlikely() f2fs: init dirty_secmap incorrectly f2fs: remove buffer_head which has 32bits limit f2fs: fix wrong block count instead of bytes f2fs: use new conversion functions between blks and bytes f2fs: rename logical_to_blk and blk_to_logical f2fs: fix kbytes written stat for multi-device case ...