summaryrefslogtreecommitdiffstats
path: root/fs/ext4
Commit message (Collapse)AuthorAgeFilesLines
* ext4: fix potential use after free after remounting with noblock_validityzhangyi (F)2019-10-072-52/+147
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7727ae52975d4f4ef7ff69ed8e6e25f6a4168158 ] Remount process will release system zone which was allocated before if "noblock_validity" is specified. If we mount an ext4 file system to two mountpoints with default mount options, and then remount one of them with "noblock_validity", it may trigger a use after free problem when someone accessing the other one. # mount /dev/sda foo # mount /dev/sda bar User access mountpoint "foo" | Remount mountpoint "bar" | ext4_map_blocks() | ext4_remount() check_block_validity() | ext4_setup_system_zone() ext4_data_block_valid() | ext4_release_system_zone() | free system_blks rb nodes access system_blks rb nodes | trigger use after free | This problem can also be reproduced by one mountpint, At the same time, add_system_zone() can get called during remount as well so there can be racing ext4_data_block_valid() reading the rbtree at the same time. This patch add RCU to protect system zone from releasing or building when doing a remount which inverse current "noblock_validity" mount option. It assign the rbtree after the whole tree was complete and do actual freeing after rcu grace period, avoid any intermediate state. Reported-by: syzbot+1e470567330b7ad711d5@syzkaller.appspotmail.com Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ext4: fix punch hole for inline_data file systemsTheodore Ts'o2019-10-051-0/+9
| | | | | | | | | | | | | | | | | | | | | | | commit c1e8220bd316d8ae8e524df39534b8a412a45d5e upstream. If a program attempts to punch a hole on an inline data file, we need to convert it to a normal file first. This was detected using ext4/032 using the adv configuration. Simple reproducer: mke2fs -Fq -t ext4 -O inline_data /dev/vdc mount /vdc echo "" > /vdc/testfile xfs_io -c 'truncate 33554432' /vdc/testfile xfs_io -c 'fpunch 0 1048576' /vdc/testfile umount /vdc e2fsck -fy /dev/vdc Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix warning inside ext4_convert_unwritten_extents_endioRakesh Pandit2019-10-051-2/+2
| | | | | | | | | | | | | | | | commit e3d550c2c4f2f3dba469bc3c4b83d9332b4e99e1 upstream. Really enable warning when CONFIG_EXT4_DEBUG is set and fix missing first argument. This was introduced in commit ff95ec22cd7f ("ext4: add warning to ext4_convert_unwritten_extents_endio") and splitting extents inside endio would trigger it. Fixes: ff95ec22cd7f ("ext4: add warning to ext4_convert_unwritten_extents_endio") Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: allow directory holesTheodore Ts'o2019-07-282-18/+46
| | | | | | | | | | | | | | | | commit 4e19d6b65fb4fc42e352ce9883649e049da14743 upstream. The largedir feature was intended to allow ext4 directories to have unmapped directory blocks (e.g., directory holes). And so the released e2fsprogs no longer enforces this for largedir file systems; however, the corresponding change to the kernel-side code was not made. This commit fixes this oversight. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: use jbd2_inode dirty range scopingRoss Zwisler2019-07-283-10/+18
| | | | | | | | | | | | | | commit 73131fbb003b3691cfcf9656f234b00da497fcd6 upstream. Use the newly introduced jbd2_inode dirty range scoping to prevent us from waiting forever when trying to complete a journal transaction. Signed-off-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: enforce the immutable flag on open filesTheodore Ts'o2019-07-282-0/+15
| | | | | | | | | | | | | | | | | | | | | commit 02b016ca7f99229ae6227e7b2fc950c4e140d74a upstream. According to the chattr man page, "a file with the 'i' attribute cannot be modified..." Historically, this was only enforced when the file was opened, per the rest of the description, "... and the file can not be opened in write mode". There is general agreement that we should standardize all file systems to prevent modifications even for files that were opened at the time the immutable flag is set. Eventually, a change to enforce this at the VFS layer should be landing in mainline. Until then, enforce this at the ext4 level to prevent xfstests generic/553 from failing. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: don't allow any modifications to an immutable fileDarrick J. Wong2019-07-281-1/+45
| | | | | | | | | | | | | | commit 2e53840362771c73eb0a5ff71611507e64e8eecd upstream. Don't allow any modifications to a file that's marked immutable, which means that we have to flush all the writable pages to make the readonly and we have to check the setattr/setflags parameters more closely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2019-05-253-18/+19
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Bug fixes (including a regression fix) for ext4" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix dcache lookup of !casefolded directories ext4: do not delete unlinked inode from orphan list on failed truncate ext4: wait for outstanding dio during truncate in nojournal mode ext4: don't perform block validity checks on the journal inode
| * ext4: fix dcache lookup of !casefolded directoriesGabriel Krisman Bertazi2019-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Found by visual inspection, this wasn't caught by my xfstest, since it's effect is ignoring positive dentries in the cache the fallback just goes to the disk. it was introduced in the last iteration of the case-insensitive patch. d_compare should return 0 when the entries match, so make sure we are correctly comparing the entire string if the encoding feature is set and we are on a case-INsensitive directory. Fixes: b886ee3e778e ("ext4: Support case-insensitive file name lookups") Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: do not delete unlinked inode from orphan list on failed truncateJan Kara2019-05-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible that unlinked inode enters ext4_setattr() (e.g. if somebody calls ftruncate(2) on unlinked but still open file). In such case we should not delete the inode from the orphan list if truncate fails. Note that this is mostly a theoretical concern as filesystem is corrupted if we reach this path anyway but let's be consistent in our orphan handling. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: wait for outstanding dio during truncate in nojournal modeJan Kara2019-05-231-12/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We didn't wait for outstanding direct IO during truncate in nojournal mode (as we skip orphan handling in that case). This can lead to fs corruption or stale data exposure if truncate ends up freeing blocks and these get reallocated before direct IO finishes. Fix the condition determining whether the wait is necessary. CC: stable@vger.kernel.org Fixes: 1c9114f9c0f1 ("ext4: serialize unlocked dio reads with truncate") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: don't perform block validity checks on the journal inodeTheodore Ts'o2019-05-221-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the journal inode is already checked when we added it to the block validity's system zone, if we check it again, we'll just trigger a failure. This was causing failures like this: [ 53.897001] EXT4-fs error (device sda): ext4_find_extent:909: inode #8: comm jbd2/sda-8: pblk 121667583 bad header/extent: invalid extent entries - magic f30a, entries 8, max 340(340), depth 0(0) [ 53.931430] jbd2_journal_bmap: journal block not found at offset 49 on sda-8 [ 53.938480] Aborting journal on device sda-8. ... but only if the system was under enough memory pressure that logical->physical mapping for the journal inode gets pushed out of the extent cache. (This is why it wasn't noticed earlier.) Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Reported-by: Dan Rue <dan.rue@linaro.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
* | treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner2019-05-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2019-05-197-8/+37
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Some bug fixes, and an update to the URL's for the final version of Unicode 12.1.0" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: avoid panic during forced reboot due to aborted journal ext4: fix block validity checks for journal inodes using indirect blocks unicode: update to Unicode 12.1.0 final unicode: add missing check for an error return from utf8lookup() ext4: fix miscellaneous sparse warnings ext4: unsigned int compared against zero ext4: fix use-after-free in dx_release() ext4: fix data corruption caused by overlapping unaligned and aligned IO jbd2: fix potential double free ext4: zero out the unused memory region in the extent tree block
| * ext4: avoid panic during forced reboot due to aborted journalJan Kara2019-05-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Handling of aborted journal is a special code path different from standard ext4_error() one and it can call panic() as well. Commit 1dc1097ff60e ("ext4: avoid panic during forced reboot") forgot to update this path so fix that omission. Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org # 5.1
| * ext4: fix block validity checks for journal inodes using indirect blocksTheodore Ts'o2019-05-151-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") failed to add an exception for the journal inode in ext4_check_blockref(), which is the function used by ext4_get_branch() for indirect blocks. This caused attempts to read from the ext3-style journals to fail with: [ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block Fix this by adding the missing exception check. Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Reported-by: Arthur Marsh <arthur.marsh@internode.on.net> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix miscellaneous sparse warningsTheodore Ts'o2019-05-123-3/+3
| | | | | | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: unsigned int compared against zeroColin Ian King2019-05-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | There are two cases where u32 variables n and err are being checked for less than zero error values, the checks is always false because the variables are not signed. Fix this by making the variables ints. Addresses-Coverity: ("Unsigned compared against 0") Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix use-after-free in dx_release()Sahitya Tummala2019-05-101-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The buffer_head (frames[0].bh) and it's corresping page can be potentially free'd once brelse() is done inside the for loop but before the for loop exits in dx_release(). It can be free'd in another context, when the page cache is flushed via drop_caches_sysctl_handler(). This results into below data abort when accessing info->indirect_levels in dx_release(). Unable to handle kernel paging request at virtual address ffffffc17ac3e01e Call trace: dx_release+0x70/0x90 ext4_htree_fill_tree+0x2d4/0x300 ext4_readdir+0x244/0x6f8 iterate_dir+0xbc/0x160 SyS_getdents64+0x94/0x174 Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org
| * ext4: fix data corruption caused by overlapping unaligned and aligned IOLukas Czerner2019-05-101-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unaligned AIO must be serialized because the zeroing of partial blocks of unaligned AIO can result in data corruption in case it's overlapping another in flight IO. Currently we wait for all unwritten extents before we submit unaligned AIO which protects data in case of unaligned AIO is following overlapping IO. However if a unaligned AIO is followed by overlapping aligned AIO we can still end up corrupting data. To fix this, we must make sure that the unaligned AIO is the only IO in flight by waiting for unwritten extents conversion not just before the IO submission, but right after it as well. This problem can be reproduced by xfstest generic/538 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: zero out the unused memory region in the extent tree blockSriram Rajagopalan2019-05-101-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit zeroes out the unused memory region in the buffer_head corresponding to the extent metablock after writing the extent header and the corresponding extent node entries. This is done to prevent random uninitialized data from getting into the filesystem when the extent block is synced. This fixes CVE-2019-11833. Signed-off-by: Sriram Rajagopalan <sriramr@arista.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* | Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds2019-05-074-40/+101
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull fscrypt updates from Ted Ts'o: "Clean up fscrypt's dcache revalidation support, and other miscellaneous cleanups" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: cache decrypted symlink target in ->i_link vfs: use READ_ONCE() to access ->i_link fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertext fscrypt: only set dentry_operations on ciphertext dentries fs, fscrypt: clear DCACHE_ENCRYPTED_NAME when unaliasing directory fscrypt: fix race allowing rename() and link() of ciphertext dentries fscrypt: clean up and improve dentry revalidation fscrypt: use READ_ONCE() to access ->i_crypt_info fscrypt: remove WARN_ON_ONCE() when decryption fails fscrypt: drop inode argument from fscrypt_get_ctx()
| * | fscrypt: cache decrypted symlink target in ->i_linkEric Biggers2019-04-171-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Path lookups that traverse encrypted symlink(s) are very slow because each encrypted symlink needs to be decrypted each time it's followed. This also involves dropping out of rcu-walk mode. Make encrypted symlinks faster by caching the decrypted symlink target in ->i_link. The first call to fscrypt_get_symlink() sets it. Then, the existing VFS path lookup code uses the non-NULL ->i_link to take the fast path where ->get_link() isn't called, and lookups in rcu-walk mode remain in rcu-walk mode. Also set ->i_link immediately when a new encrypted symlink is created. To safely free the symlink target after an RCU grace period has elapsed, introduce a new function fscrypt_free_inode(), and make the relevant filesystems call it just before actually freeing the inode. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertextEric Biggers2019-04-172-39/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ->lookup() in an encrypted directory begins as follows: 1. fscrypt_prepare_lookup(): a. Try to load the directory's encryption key. b. If the key is unavailable, mark the dentry as a ciphertext name via d_flags. 2. fscrypt_setup_filename(): a. Try to load the directory's encryption key. b. If the key is available, encrypt the name (treated as a plaintext name) to get the on-disk name. Otherwise decode the name (treated as a ciphertext name) to get the on-disk name. But if the key is concurrently added, it may be found at (2a) but not at (1a). In this case, the dentry will be wrongly marked as a ciphertext name even though it was actually treated as plaintext. This will cause the dentry to be wrongly invalidated on the next lookup, potentially causing problems. For example, if the racy ->lookup() was part of sys_mount(), then the new mount will be detached when anything tries to access it. This is despite the mountpoint having a plaintext path, which should remain valid now that the key was added. Of course, this is only possible if there's a userspace race. Still, the additional kernel-side race is confusing and unexpected. Close the kernel-side race by changing fscrypt_prepare_lookup() to also set the on-disk filename (step 2b), consistent with the d_flags update. Fixes: 28b4c263961c ("ext4 crypto: revalidate dentry after adding or removing the key") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | fscrypt: drop inode argument from fscrypt_get_ctx()Eric Biggers2019-04-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only reason the inode is being passed to fscrypt_get_ctx() is to verify that the encryption key is available. However, all callers already ensure this because if we get as far as trying to do I/O to an encrypted file without the key, there's already a bug. Therefore, remove this unnecessary argument. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | Merge tag 'ext4_for_linus' of ↵Linus Torvalds2019-05-0716-58/+432
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Add as a feature case-insensitive directories (the casefold feature) using Unicode 12.1. Also, the usual largish number of cleanups and bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) ext4: export /sys/fs/ext4/feature/casefold if Unicode support is present ext4: fix ext4_show_options for file systems w/o journal unicode: refactor the rule for regenerating utf8data.h docs: ext4.rst: document case-insensitive directories ext4: Support case-insensitive file name lookups ext4: include charset encoding information in the superblock MAINTAINERS: add Unicode subsystem entry unicode: update unicode database unicode version 12.1.0 unicode: introduce test module for normalized utf8 implementation unicode: implement higher level API for string handling unicode: reduce the size of utf8data[] unicode: introduce code for UTF-8 normalization unicode: introduce UTF-8 character database ext4: actually request zeroing of inode table after grow ext4: cond_resched in work-heavy group loops ext4: fix use-after-free race with debug_want_extra_isize ext4: avoid drop reference to iloc.bh twice ext4: ignore e_value_offs for xattrs with value-in-ea-inode ext4: protect journal inode's blocks using block_validity ext4: use BUG() instead of BUG_ON(1) ...
| * | ext4: export /sys/fs/ext4/feature/casefold if Unicode support is presentTheodore Ts'o2019-05-061-0/+6
| | | | | | | | | | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: fix ext4_show_options for file systems w/o journalDebabrata Banerjee2019-04-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of removing EXT4_MOUNT_JOURNAL_CHECKSUM from s_def_mount_opt as I assume was intended, all other options were blown away leading to _ext4_show_options() output being incorrect. Fixes: 1e381f60dad9 ("ext4: do not allow journal_opts for fs w/o journal") Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
| * | ext4: Support case-insensitive file name lookupsGabriel Krisman Bertazi2019-04-259-21/+221
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements the actual support for case-insensitive file name lookups in ext4, based on the feature bit and the encoding stored in the superblock. A filesystem that has the casefold feature set is able to configure directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups to succeed in that directory in a case-insensitive fashion, i.e: match a directory entry even if the name used by userspace is not a byte per byte match with the disk name, but is an equivalent case-insensitive version of the Unicode string. This operation is called a case-insensitive file name lookup. The feature is configured as an inode attribute applied to directories and inherited by its children. This attribute can only be enabled on empty directories for filesystems that support the encoding feature, thus preventing collision of file names that only differ by case. * dcache handling: For a +F directory, Ext4 only stores the first equivalent name dentry used in the dcache. This is done to prevent unintentional duplication of dentries in the dcache, while also allowing the VFS code to quickly find the right entry in the cache despite which equivalent string was used in a previous lookup, without having to resort to ->lookup(). d_hash() of casefolded directories is implemented as the hash of the casefolded string, such that we always have a well-known bucket for all the equivalencies of the same string. d_compare() uses the utf8_strncasecmp() infrastructure, which handles the comparison of equivalent, same case, names as well. For now, negative lookups are not inserted in the dcache, since they would need to be invalidated anyway, because we can't trust missing file dentries. This is bad for performance but requires some leveraging of the vfs layer to fix. We can live without that for now, and so does everyone else. * on-disk data: Despite using a specific version of the name as the internal representation within the dcache, the name stored and fetched from the disk is a byte-per-byte match with what the user requested, making this implementation 'name-preserving'. i.e. no actual information is lost when writing to storage. DX is supported by modifying the hashes used in +F directories to make them case/encoding-aware. The new disk hashes are calculated as the hash of the full casefolded string, instead of the string directly. This allows us to efficiently search for file names in the htree without requiring the user to provide an exact name. * Dealing with invalid sequences: By default, when a invalid UTF-8 sequence is identified, ext4 will treat it as an opaque byte sequence, ignoring the encoding and reverting to the old behavior for that unique file. This means that case-insensitive file name lookup will not work only for that file. An optional bit can be set in the superblock telling the filesystem code and userspace tools to enforce the encoding. When that optional bit is set, any attempt to create a file name using an invalid UTF-8 sequence will fail and return an error to userspace. * Normalization algorithm: The UTF-8 algorithms used to compare strings in ext4 is implemented lives in fs/unicode, and is based on a previous version developed by SGI. It implements the Canonical decomposition (NFD) algorithm described by the Unicode specification 12.1, or higher, combined with the elimination of ignorable code points (NFDi) and full case-folding (CF) as documented in fs/unicode/utf8_norm.c. NFD seems to be the best normalization method for EXT4 because: - It has a lower cost than NFC/NFKC (which requires decomposing to NFD as an intermediary step) - It doesn't eliminate important semantic meaning like compatibility decompositions. Although: - This implementation is not completely linguistic accurate, because different languages have conflicting rules, which would require the specialization of the filesystem to a given locale, which brings all sorts of problems for removable media and for users who use more than one language. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: include charset encoding information in the superblockGabriel Krisman Bertazi2019-04-252-1/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support for encoding is considered an incompatible feature, since it has potential to create collisions of file names in existing filesystems. If the feature flag is not enabled, the entire filesystem will operate on opaque byte sequences, respecting the original behavior. The s_encoding field stores a magic number indicating the encoding format and version used globally by file and directory names in the filesystem. The s_encoding_flags defines policies for using the charset encoding, like how to handle invalid sequences. The magic number is mapped to the exact charset table, but the mapping is specific to ext4. Since we don't have any commitment to support old encodings, the only encoding I am supporting right now is utf8-12.1.0. The current implementation prevents the user from enabling encoding and per-directory encryption on the same filesystem at the same time. The incompatibility between these features lies in how we do efficient directory searches when we cannot be sure the encryption of the user provided fname will match the actual hash stored in the disk without decrypting every directory entry, because of normalization cases. My quickest solution is to simply block the concurrent use of these features for now, and enable it later, once we have a better solution. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: actually request zeroing of inode table after growKirill Tkhai2019-04-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is never possible, that number of block groups decreases, since only online grow is supported. But after a growing occured, we have to zero inode tables for just created new block groups. Fixes: 19c5246d2516 ("ext4: add new online resize interface") Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
| * | ext4: cond_resched in work-heavy group loopsKhazhismel Kumykov2019-04-252-0/+3
| | | | | | | | | | | | | | | | | | Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
| * | ext4: fix use-after-free race with debug_want_extra_isizeBarret Rhoden2019-04-251-24/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When remounting with debug_want_extra_isize, we were not performing the same checks that we do during a normal mount. That allowed us to set a value for s_want_extra_isize that reached outside the s_inode_size. Fixes: e2b911c53584 ("ext4: clean up feature test macros with predicate functions") Reported-by: syzbot+f584efa0ac7213c226b7@syzkaller.appspotmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * | ext4: avoid drop reference to iloc.bh twicePan Bian2019-04-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The reference to iloc.bh has been dropped in ext4_mark_iloc_dirty. However, the reference is dropped again if error occurs during ext4_handle_dirty_metadata, which may result in use-after-free bugs. Fixes: fb265c9cb49e("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases") Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
| * | ext4: ignore e_value_offs for xattrs with value-in-ea-inodeTheodore Ts'o2019-04-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In other places in fs/ext4/xattr.c, if e_value_inum is non-zero, the code ignores the value in e_value_offs. The e_value_offs *should* be zero, but we shouldn't depend upon it, since it might not be true in a corrupted/fuzzed file system. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202897 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202877 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | ext4: protect journal inode's blocks using block_validityTheodore Ts'o2019-04-092-0/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the blocks which belong to the journal inode to block_validity's system zone so attempts to deallocate or overwrite the journal due a corrupted file system where the journal blocks are also claimed by another inode. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202879 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | ext4: use BUG() instead of BUG_ON(1)Arnd Bergmann2019-04-072-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BUG_ON(1) leads to bogus warnings from clang when CONFIG_PROFILE_ANNOTATED_BRANCHES is set: fs/ext4/inode.c:544:4: error: variable 'retval' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] BUG_ON(1); ^~~~~~~~~ include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON' ^~~~~~~~~~~~~~~~~~~ include/linux/compiler.h:48:23: note: expanded from macro 'unlikely' ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/ext4/inode.c:591:6: note: uninitialized use occurs here if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) { ^~~~~~ fs/ext4/inode.c:544:4: note: remove the 'if' if its condition is always true BUG_ON(1); ^ include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON' ^ fs/ext4/inode.c:502:12: note: initialize the variable 'retval' to silence this warning Change it to BUG() so clang can see that this code path can never continue. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Jan Kara <jack@suse.cz>
| * | ext4: fix prefetchw of NULL pageLiu Xiang2019-04-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ext4_mpage_readpages(), if the parameter pages is not NULL, another parameter page is NULL. At the first time prefetchw(&page->flags) works on NULL. From second time, prefetchw(&page->flags) always works on the last consumed page. This might do little improvment for handling current page. So prefetchw() should be called while the page pointer has just been updated. Signed-off-by: Liu Xiang <liu.xiang6@zte.com.cn> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: remove incorrect comment for NEXT_ORPHAN()Eric Biggers2019-04-061-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The comment above NEXT_ORPHAN() was meant for ext4_encrypted_inode(), which was moved by commit a7550b30ab70 ("ext4 crypto: migrate into vfs's crypto engine") but the comment was accidentally left in place. Since ext4_encrypted_inode() has now been removed, just remove the comment. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
| * | ext4: make sanity check in mballoc more strictJan Kara2019-04-061-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | The sanity check in mb_find_extent() only checked that returned extent does not extend past blocksize * 8, however it should not extend past EXT4_CLUSTERS_PER_GROUP(sb). This can happen when clusters_per_group < blocksize * 8 and the tail of the bitmap is not properly filled by 1s which happened e.g. when ancient kernels have grown the filesystem. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* | Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-blockLinus Torvalds2019-05-074-31/+9
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: "Nothing major in this series, just fixes and improvements all over the map. This contains: - Series of fixes for sed-opal (David, Jonas) - Fixes and performance tweaks for BFQ (via Paolo) - Set of fixes for bcache (via Coly) - Set of fixes for md (via Song) - Enabling multi-page for passthrough requests (Ming) - Queue release fix series (Ming) - Device notification improvements (Martin) - Propagate underlying device rotational status in loop (Holger) - Removal of mtip32xx trim support, which has been disabled for years (Christoph) - Improvement and cleanup of nvme command handling (Christoph) - Add block SPDX tags (Christoph) - Cleanup/hardening of bio/bvec iteration (Christoph) - A few NVMe pull requests (Christoph) - Removal of CONFIG_LBDAF (Christoph) - Various little fixes here and there" * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits) block: fix mismerge in bvec_advance block: don't drain in-progress dispatch in blk_cleanup_queue() blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release blk-mq: always free hctx after request queue is freed blk-mq: split blk_mq_alloc_and_init_hctx into two parts blk-mq: free hw queue's resource in hctx's release handler blk-mq: move cancel of requeue_work into blk_mq_release blk-mq: grab .q_usage_counter when queuing request from plug code path block: fix function name in comment nvmet: protect discovery change log event list iteration nvme: mark nvme_core_init and nvme_core_exit static nvme: move command size checks to the core nvme-fabrics: check more command sizes nvme-pci: check more command sizes nvme-pci: remove an unneeded variable initialization nvme-pci: unquiesce admin queue on shutdown nvme-pci: shutdown on timeout during deletion nvme-pci: fix psdt field for single segment sgls nvme-multipath: don't print ANA group state by default nvme-multipath: split bios with the ns_head bio_set before submitting ...
| * | block: remove the i argument to bio_for_each_segment_allChristoph Hellwig2019-04-302-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We only have two callers that need the integer loop iterator, and they can easily maintain it themselves. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: remove CONFIG_LBDAFChristoph Hellwig2019-04-062-27/+7
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has cause various bugs or compiler warnings when actually used. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge branch 'work.icache' of ↵Linus Torvalds2019-05-071-3/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs inode freeing updates from Al Viro: "Introduction of separate method for RCU-delayed part of ->destroy_inode() (if any). Pretty much as posted, except that destroy_inode() stashes ->free_inode into the victim (anon-unioned with ->i_fops) before scheduling i_callback() and the last two patches (sockfs conversion and folding struct socket_wq into struct socket) are excluded - that pair should go through netdev once davem reopens his tree" * 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits) orangefs: make use of ->free_inode() shmem: make use of ->free_inode() hugetlb: make use of ->free_inode() overlayfs: make use of ->free_inode() jfs: switch to ->free_inode() fuse: switch to ->free_inode() ext4: make use of ->free_inode() ecryptfs: make use of ->free_inode() ceph: use ->free_inode() btrfs: use ->free_inode() afs: switch to use of ->free_inode() dax: make use of ->free_inode() ntfs: switch to ->free_inode() securityfs: switch to ->free_inode() apparmor: switch to ->free_inode() rpcpipe: switch to ->free_inode() bpf: switch to ->free_inode() mqueue: switch to ->free_inode() ufs: switch to ->free_inode() coda: switch to ->free_inode() ...
| * | ext4: make use of ->free_inode()Al Viro2019-05-011-3/+2
| |/ | | | | | | | | | | | | | | the rest of this ->destroy_inode() instance could probably be folded into ext4_evict_inode() Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'linus' of ↵Linus Torvalds2019-05-061-1/+0
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "API: - Add support for AEAD in simd - Add fuzz testing to testmgr - Add panic_on_fail module parameter to testmgr - Use per-CPU struct instead multiple variables in scompress - Change verify API for akcipher Algorithms: - Convert x86 AEAD algorithms over to simd - Forbid 2-key 3DES in FIPS mode - Add EC-RDSA (GOST 34.10) algorithm Drivers: - Set output IV with ctr-aes in crypto4xx - Set output IV in rockchip - Fix potential length overflow with hashing in sun4i-ss - Fix computation error with ctr in vmx - Add SM4 protected keys support in ccree - Remove long-broken mxc-scc driver - Add rfc4106(gcm(aes)) cipher support in cavium/nitrox" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (179 commits) crypto: ccree - use a proper le32 type for le32 val crypto: ccree - remove set but not used variable 'du_size' crypto: ccree - Make cc_sec_disable static crypto: ccree - fix spelling mistake "protedcted" -> "protected" crypto: caam/qi2 - generate hash keys in-place crypto: caam/qi2 - fix DMA mapping of stack memory crypto: caam/qi2 - fix zero-length buffer DMA mapping crypto: stm32/cryp - update to return iv_out crypto: stm32/cryp - remove request mutex protection crypto: stm32/cryp - add weak key check for DES crypto: atmel - remove set but not used variable 'alg_name' crypto: picoxcell - Use dev_get_drvdata() crypto: crypto4xx - get rid of redundant using_sd variable crypto: crypto4xx - use sync skcipher for fallback crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues crypto: crypto4xx - fix ctr-aes missing output IV crypto: ecrdsa - select ASN1 and OID_REGISTRY for EC-RDSA crypto: ux500 - use ccflags-y instead of CFLAGS_<basename>.o crypto: ccree - handle tee fips error during power management resume crypto: ccree - add function to handle cryptocell tee fips error ...
| * crypto: shash - remove shash_desc::flagsEric Biggers2019-04-251-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flags field in 'struct shash_desc' never actually does anything. The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP. However, no shash algorithm ever sleeps, making this flag a no-op. With this being the case, inevitably some users who can't sleep wrongly pass MAY_SLEEP. These would all need to be fixed if any shash algorithm actually started sleeping. For example, the shash_ahash_*() functions, which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP from the ahash API to the shash API. However, the shash functions are called under kmap_atomic(), so actually they're assumed to never sleep. Even if it turns out that some users do need preemption points while hashing large buffers, we could easily provide a helper function crypto_shash_update_large() which divides the data into smaller chunks and calls crypto_shash_update() and cond_resched() for each chunk. It's not necessary to have a flag in 'struct shash_desc', nor is it necessary to make individual shash algorithms aware of this at all. Therefore, remove shash_desc::flags, and document that the crypto_shash_*() functions can be called from any context. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2019-03-247-59/+58
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous ext4 bug fixes for 5.1" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: prohibit fstrim in norecovery mode ext4: cleanup bh release code in ext4_ind_remove_space() ext4: brelse all indirect buffer in ext4_ind_remove_space() ext4: report real fs size after failed resize ext4: add missing brelse() in add_new_gdb_meta_bg() ext4: remove useless ext4_pin_inode() ext4: avoid panic during forced reboot ext4: fix data corruption caused by unaligned direct AIO ext4: fix NULL pointer dereference while journal is aborted
| * ext4: prohibit fstrim in norecovery modeDarrick J. Wong2019-03-231-0/+7
| | | | | | | | | | | | | | | | | | | | The ext4 fstrim implementation uses the block bitmaps to find free space that can be discarded. If we haven't replayed the journal, the bitmaps will be stale and we absolutely *cannot* use stale metadata to zap the underlying storage. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: cleanup bh release code in ext4_ind_remove_space()zhangyi (F)2019-03-231-25/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we are releasing the indirect buffer where we are done with it in ext4_ind_remove_space(), so we can see the brelse() and BUFFER_TRACE() everywhere. It seems fragile and hard to read, and we may probably forget to release the buffer some day. This patch cleans up the code by putting of the code which releases the buffers to the end of the function. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>