summaryrefslogtreecommitdiffstats
path: root/fs/ext4
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'ext4_for_linus_stable' of ↵Greg Kroah-Hartman2018-09-178-26/+72
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Ted writes: Various ext4 bug fixes; primarily making ext4 more robust against maliciously crafted file systems, and some DAX fixes. * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4, dax: set ext4_dax_aops for dax files ext4, dax: add ext4_bmap to ext4_dax_aops ext4: don't mark mmp buffer head dirty ext4: show test_dummy_encryption mount option in /proc/mounts ext4: close race between direct IO and ext4_break_layouts() ext4: fix online resizing for bigalloc file systems with a 1k block size ext4: fix online resize's handling of a too-small final block group ext4: recalucate superblock checksum after updating free blocks/inodes ext4: avoid arithemetic overflow that can trigger a BUG ext4: avoid divide by zero fault when deleting corrupted inline directories ext4: check to make sure the rename(2)'s destination is not freed ext4: add nonstring annotations to ext4.h
| * ext4, dax: set ext4_dax_aops for dax filesToshi Kani2018-09-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sync syscall to DAX file needs to flush processor cache, but it currently does not flush to existing DAX files. This is because 'ext4_da_aops' is set to address_space_operations of existing DAX files, instead of 'ext4_dax_aops', since S_DAX flag is set after ext4_set_aops() in the open path. New file -------- lookup_open ext4_create __ext4_new_inode ext4_set_inode_flags // Set S_DAX flag ext4_set_aops // Set aops to ext4_dax_aops Existing file ------------- lookup_open ext4_lookup ext4_iget ext4_set_aops // Set aops to ext4_da_aops ext4_set_inode_flags // Set S_DAX flag Change ext4_iget() to initialize i_flags before ext4_set_aops(). Fixes: 5f0663bb4a64 ("ext4, dax: introduce ext4_dax_aops") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Suggested-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
| * ext4, dax: add ext4_bmap to ext4_dax_aopsToshi Kani2018-09-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ext4 mount path calls .bmap to the journal inode. This currently works for the DAX mount case because ext4_iget() always set 'ext4_da_aops' to any regular files. In preparation to fix ext4_iget() to set 'ext4_dax_aops' for ext4 DAX files, add ext4_bmap() to 'ext4_dax_aops', since bmap works for DAX inodes. Fixes: 5f0663bb4a64 ("ext4, dax: introduce ext4_dax_aops") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Suggested-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
| * ext4: don't mark mmp buffer head dirtyLi Dongyang2018-09-151-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Marking mmp bh dirty before writing it will make writeback pick up mmp block later and submit a write, we don't want the duplicate write as kmmpd thread should have full control of reading and writing the mmp block. Another reason is we will also have random I/O error on the writeback request when blk integrity is enabled, because kmmpd could modify the content of the mmp block(e.g. setting new seq and time) while the mmp block is under I/O requested by writeback. Signed-off-by: Li Dongyang <dongyangli@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@vger.kernel.org
| * ext4: show test_dummy_encryption mount option in /proc/mountsEric Biggers2018-09-151-0/+2
| | | | | | | | | | | | | | | | | | When in effect, add "test_dummy_encryption" to _ext4_show_options() so that it is shown in /proc/mounts and other relevant procfs files. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * ext4: close race between direct IO and ext4_break_layouts()Ross Zwisler2018-09-111-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the refcount of a page is lowered between the time that it is returned by dax_busy_page() and when the refcount is again checked in ext4_break_layouts() => ___wait_var_event(), the waiting function ext4_wait_dax_page() will never be called. This means that ext4_break_layouts() will still have 'retry' set to false, so we'll stop looping and never check the refcount of other pages in this inode. Instead, always continue looping as long as dax_layout_busy_page() gives us a page which it found with an elevated refcount. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * ext4: fix online resizing for bigalloc file systems with a 1k block sizeTheodore Ts'o2018-09-031-1/+2
| | | | | | | | | | | | | | | | | | | | An online resize of a file system with the bigalloc feature enabled and a 1k block size would be refused since ext4_resize_begin() did not understand s_first_data_block is 0 for all bigalloc file systems, even when the block size is 1k. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * ext4: fix online resize's handling of a too-small final block groupTheodore Ts'o2018-09-031-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid growing the file system to an extent so that the last block group is too small to hold all of the metadata that must be stored in the block group. This problem can be triggered with the following reproducer: umount /mnt mke2fs -F -m0 -b 4096 -t ext4 -O resize_inode,^has_journal \ -E resize=1073741824 /tmp/foo.img 128M mount /tmp/foo.img /mnt truncate --size 1708M /tmp/foo.img resize2fs /dev/loop0 295400 umount /mnt e2fsck -fy /tmp/foo.img Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * ext4: recalucate superblock checksum after updating free blocks/inodesTheodore Ts'o2018-09-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When mounting the superblock, ext4_fill_super() calculates the free blocks and free inodes and stores them in the superblock. It's not strictly necessary, since we don't use them any more, but it's nice to keep them roughly aligned to reality. Since it's not critical for file system correctness, the code doesn't call ext4_commit_super(). The problem is that it's in ext4_commit_super() that we recalculate the superblock checksum. So if we're not going to call ext4_commit_super(), we need to call ext4_superblock_csum_set() to make sure the superblock checksum is consistent. Most of the time, this doesn't matter, since we end up calling ext4_commit_super() very soon thereafter, and definitely by the time the file system is unmounted. However, it doesn't work in this sequence: mke2fs -Fq -t ext4 /dev/vdc 128M mount /dev/vdc /vdc cp xfstests/git-versions /vdc godown /vdc umount /vdc mount /dev/vdc tune2fs -l /dev/vdc With this commit, the "tune2fs -l" no longer fails. Reported-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
| * ext4: avoid arithemetic overflow that can trigger a BUGTheodore Ts'o2018-09-012-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | A maliciously crafted file system can cause an overflow when the results of a 64-bit calculation is stored into a 32-bit length parameter. https://bugzilla.kernel.org/show_bug.cgi?id=200623 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Wen Xu <wen.xu@gatech.edu> Cc: stable@vger.kernel.org
| * ext4: avoid divide by zero fault when deleting corrupted inline directoriesTheodore Ts'o2018-08-272-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A specially crafted file system can trick empty_inline_dir() into reading past the last valid entry in a inline directory, and then run into the end of xattr marker. This will trigger a divide by zero fault. Fix this by using the size of the inline directory instead of dir->i_size. Also clean up error reporting in __ext4_check_dir_entry so that the message is clearer and more understandable --- and avoids the division by zero trap if the size passed in is zero. (I'm not sure why we coded it that way in the first place; printing offset % size is actually more confusing and less useful.) https://bugzilla.kernel.org/show_bug.cgi?id=200933 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Wen Xu <wen.xu@gatech.edu> Cc: stable@vger.kernel.org
| * ext4: check to make sure the rename(2)'s destination is not freedTheodore Ts'o2018-08-271-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | If the destination of the rename(2) system call exists, the inode's link count (i_nlinks) must be non-zero. If it is, the inode can end up on the orphan list prematurely, leading to all sorts of hilarity, including a use-after-free. https://bugzilla.kernel.org/show_bug.cgi?id=200931 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Wen Xu <wen.xu@gatech.edu> Cc: stable@vger.kernel.org
| * ext4: add nonstring annotations to ext4.hTheodore Ts'o2018-08-271-3/+14
| | | | | | | | | | | | | | | | | | | | This suppresses some false positives in gcc 8's -Wstringop-truncation Suggested by Miguel Ojeda (hopefully the __nonstring definition will eventually get accepted in the compiler-gcc.h header file). Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
* | ext4: readpages() should submit IO as read-aheadJens Axboe2018-08-173-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | a_ops->readpages() is only ever used for read-ahead. Ensure that we pass this information down to the block layer. Link: http://lkml.kernel.org/r/20180621010725.17813-5-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | dax: remove VM_MIXEDMAP for fsdax and device daxDave Jiang2018-08-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is reworked from an earlier patch that Dan has posted: https://patchwork.kernel.org/patch/10131727/ VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that the memory page it is dealing with is not typical memory from the linear map. The get_user_pages_fast() path, since it does not resolve the vma, is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we use that as a VM_MIXEDMAP replacement in some locations. In the cases where there is no pte to consult we fallback to using vma_is_dax() to detect the VM_MIXEDMAP special case. Now that we have explicit driver pfn_t-flag opt-in/opt-out for get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This also means we no longer need to worry about safely manipulating vm_flags in a future where we support dynamically changing the dax mode of a file. DAX should also now be supported with madvise_behavior(), vma_merge(), and copy_page_range(). This patch has been tested against ndctl unit test. It has also been tested against xfstests commit: 625515d using fake pmem created by memmap and no additional issues have been observed. Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-blockLinus Torvalds2018-08-142-4/+7
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: "First pull request for this merge window, there will also be a followup request with some stragglers. This pull request contains: - Fix for a thundering heard issue in the wbt block code (Anchal Agarwal) - A few NVMe pull requests: * Improved tracepoints (Keith) * Larger inline data support for RDMA (Steve Wise) * RDMA setup/teardown fixes (Sagi) * Effects log suppor for NVMe target (Chaitanya Kulkarni) * Buffered IO suppor for NVMe target (Chaitanya Kulkarni) * TP4004 (ANA) support (Christoph) * Various NVMe fixes - Block io-latency controller support. Much needed support for properly containing block devices. (Josef) - Series improving how we handle sense information on the stack (Kees) - Lightnvm fixes and updates/improvements (Mathias/Javier et al) - Zoned device support for null_blk (Matias) - AIX partition fixes (Mauricio Faria de Oliveira) - DIF checksum code made generic (Max Gurtovoy) - Add support for discard in iostats (Michael Callahan / Tejun) - Set of updates for BFQ (Paolo) - Removal of async write support for bsg (Christoph) - Bio page dirtying and clone fixups (Christoph) - Set of bcache fix/changes (via Coly) - Series improving blk-mq queue setup/teardown speed (Ming) - Series improving merging performance on blk-mq (Ming) - Lots of other fixes and cleanups from a slew of folks" * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits) blkcg: Make blkg_root_lookup() work for queues in bypass mode bcache: fix error setting writeback_rate through sysfs interface null_blk: add lock drop/acquire annotation Blk-throttle: reduce tail io latency when iops limit is enforced block: paride: pd: mark expected switch fall-throughs block: Ensure that a request queue is dissociated from the cgroup controller block: Introduce blk_exit_queue() blkcg: Introduce blkg_root_lookup() block: Remove two superfluous #include directives blk-mq: count the hctx as active before allocating tag block: bvec_nr_vecs() returns value for wrong slab bcache: trivial - remove tailing backslash in macro BTREE_FLAG bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section bcache: set max writeback rate when I/O request is idle bcache: add code comments for bset.c bcache: fix mistaken comments in request.c bcache: fix mistaken code comments in bcache.h bcache: add a comment in super.c bcache: avoid unncessary cache prefetch bch_btree_node_get() bcache: display rate debug parameters to 0 when writeback is not running ...
| * block: Define and use STAT_READ and STAT_WRITEMichael Callahan2018-07-182-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add defines for STAT_READ and STAT_WRITE for indexing the partition stat entries. This clarifies some fs/ code which has hardcoded 1 for STAT_WRITE and will make it easier to extend the stats with additional fields. tj: Refreshed on top of v4.17. Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()zhong jiang2018-08-041-2/+1
| | | | | | | | | | | | | | The err is not used after initalization. So just remove the variable. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: improve code readability in ext4_iget()Liu Song2018-08-021-10/+7
| | | | | | | | | | | | | | | | Merge the duplicated complex conditions to improve code readability. Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
* | ext4: fix spectre gadget in ext4_mb_regular_allocator()Jeremy Cline2018-08-021-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'ac->ac_g_ex.fe_len' is a user-controlled value which is used in the derivation of 'ac->ac_2order'. 'ac->ac_2order', in turn, is used to index arrays which makes it a potential spectre gadget. Fix this by sanitizing the value assigned to 'ac->ac2_order'. This covers the following accesses found with the help of smatch: * fs/ext4/mballoc.c:1896 ext4_mb_simple_scan_group() warn: potential spectre issue 'grp->bb_counters' [w] (local cap) * fs/ext4/mballoc.c:445 mb_find_buddy() warn: potential spectre issue 'EXT4_SB(e4b->bd_sb)->s_mb_offsets' [r] (local cap) * fs/ext4/mballoc.c:446 mb_find_buddy() warn: potential spectre issue 'EXT4_SB(e4b->bd_sb)->s_mb_maxs' [r] (local cap) Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jeremy Cline <jcline@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* | ext4: check for NUL characters in extended attribute's nameTheodore Ts'o2018-08-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extended attribute names are defined to be NUL-terminated, so the name must not contain a NUL character. This is important because there are places when remove extended attribute, the code uses strlen to determine the length of the entry. That should probably be fixed at some point, but code is currently really messy, so the simplest fix for now is to simply validate that the extended attributes are sane. https://bugzilla.kernel.org/show_bug.cgi?id=200401 Reported-by: Wen Xu <wen.xu@gatech.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* | ext4: use ext4_warning() for sb_getblk failureWang Shilong2018-08-012-6/+6
| | | | | | | | | | | | | | | | | | Out of memory should not be considered as critical errors; so replace ext4_error() with ext4_warnig(). Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* | ext4: fix race when setting the bitmap corrupted flagWang Shilong2018-07-291-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever we hit block or inode bitmap corruptions we set bit and then reduce this block group free inode/clusters counter to expose right available space. However some of ext4_mark_group_bitmap_corrupted() is called inside group spinlock, some are not, this could make it happen that we double reduce one block group free counters from system. Always hold group spinlock for it could fix it, but it looks a little heavy, we could use test_and_set_bit() to fix race problems here. Signed-off-by: Wang Shilong <wshilong@ddn.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* | ext4: reset error code in ext4_find_entry in fallbackEric Sandeen2018-07-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ext4_find_entry() falls back to "searching the old fashioned way" due to a corrupt dx dir, it needs to reset the error code to NULL so that the nonstandard ERR_BAD_DX_DIR code isn't returned to userspace. https://bugzilla.kernel.org/show_bug.cgi?id=199947 Reported-by: Anatoly Trosinenko <anatoly.trosinenko@yandex.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* | ext4: handle layout changes to pinned DAX mappingsRoss Zwisler2018-07-294-0/+68
| | | | | | | | | | | | | | | | | | | | | | Follow the lead of xfs_break_dax_layouts() and add synchronization between operations in ext4 which remove blocks from an inode (hole punch, truncate down, etc.) and pages which are pinned due to DAX DMA operations. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com>
* | ext4: use swap macro in mext_page_double_lockGustavo A. R. Silva2018-07-291-3/+1
| | | | | | | | | | | | | | | | | | | | Make use of the swap macro and remove unnecessary variable *tmp*. This makes the code easier to read and maintain. This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: check allocation failure when duplicating "data" in ext4_remount()Chengguang Xu2018-07-291-0/+3
| | | | | | | | | | | | | | | | | | | | There is no check for allocation failure when duplicating "data" in ext4_remount(). Check for failure and return error -ENOMEM in this case. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
* | ext4: fix warning message in ext4_enable_quotas()Junichi Uekawa2018-07-291-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | Output the warning message before we clobber type and be -1 all the time. The error message would now be [ 1.519791] EXT4-fs warning (device vdb): ext4_enable_quotas:5402: Failed to enable quota tracking (type=0, err=-3). Please run e2fsck to fix. Signed-off-by: Junichi Uekawa <uekawa@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
* | ext4: super: extend timestamps to 40 bitsArnd Bergmann2018-07-293-12/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The inode timestamps use 34 bits in ext4, but the various timestamps in the superblock are limited to 32 bits. If every user accesses these as 'unsigned', then this is good until year 2106, but it seems better to extend this a bit further in the process of removing the deprecated get_seconds() function. This adds another byte for each timestamp in the superblock, making them long enough to store timestamps beyond what is in the inodes, which seems good enough here (in ocfs2, they are already 64-bit wide, which is appropriate for a new layout). I did not modify e2fsprogs, which obviously needs the same change to actually interpret future timestamps correctly. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: use timespec64 for all inode timesArnd Bergmann2018-07-292-14/+10
| | | | | | | | | | | | | | | | | | | | This is the last missing piece for the inode times on 32-bit systems: now that VFS interfaces use timespec64, we just need to stop truncating the tv_sec values for y2038 compatibililty. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: use ktime_get_real_seconds for i_dtimeArnd Bergmann2018-07-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | We only care about the low 32-bit for i_dtime as explained in commit b5f515735bea ("ext4: avoid Y2038 overflow in recently_deleted()"), so the use of get_seconds() is correct here, but that function is getting removed in the process of the y2038 fixes, so let's use the modern ktime_get_real_seconds() here. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: use 64-bit timestamps for mmp_timeArnd Bergmann2018-07-291-3/+3
| | | | | | | | | | | | | | | | | | | | | | The mmp_time field is 64 bits wide, which is good, but calling get_seconds() results in a 32-bit value on 32-bit architectures. Using ktime_get_real_seconds() instead returns 64 bits everywhere. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: sysfs: print ext4_super_block fields as little-endianArnd Bergmann2018-07-291-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While working on extended rand for last_error/first_error timestamps, I noticed that the endianess is wrong; we access the little-endian fields in struct ext4_super_block as native-endian when we print them. This adds a special case in ext4_attr_show() and ext4_attr_store() to byteswap the superblock fields if needed. In older kernels, this code was part of super.c, it got moved to sysfs.c in linux-4.4. Cc: stable@vger.kernel.org Fixes: 52c198c6820f ("ext4: add sysfs entry showing whether the fs contains errors") Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: fix check to prevent initializing reserved inodesTheodore Ts'o2018-07-292-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 8844618d8aa7: "ext4: only look at the bg_flags field if it is valid" will complain if block group zero does not have the EXT4_BG_INODE_ZEROED flag set. Unfortunately, this is not correct, since a freshly created file system has this flag cleared. It gets almost immediately after the file system is mounted read-write --- but the following somewhat unlikely sequence will end up triggering a false positive report of a corrupted file system: mkfs.ext4 /dev/vdc mount -o ro /dev/vdc /vdc mount -o remount,rw /dev/vdc Instead, when initializing the inode table for block group zero, test to make sure that itable_unused count is not too large, since that is the case that will result in some or all of the reserved inodes getting cleared. This fixes the failures reported by Eric Whiteney when running generic/230 and generic/231 in the the nojournal test case. Fixes: 8844618d8aa7 ("ext4: only look at the bg_flags field if it is valid") Reported-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | ext4: check for allocation block validity with block group lockedTheodore Ts'o2018-07-122-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit 044e6e3d74a3: "ext4: don't update checksum of new initialized bitmaps" the buffer valid bit will get set without actually setting up the checksum for the allocation bitmap, since the checksum will get calculated once we actually allocate an inode or block. If we are doing this, then we need to (re-)check the verified bit after we take the block group lock. Otherwise, we could race with another process reading and verifying the bitmap, which would then complain about the checksum being invalid. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* | ext4: fix inline data updates with checksums enabledTheodore Ts'o2018-07-102-17/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | The inline data code was updating the raw inode directly; this is problematic since if metadata checksums are enabled, ext4_mark_inode_dirty() must be called to update the inode's checksum. In addition, the jbd2 layer requires that get_write_access() be called before the metadata buffer is modified. Fix both of these problems. https://bugzilla.kernel.org/show_bug.cgi?id=200443 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* | ext4: clear mmp sequence number when remounting read-onlyTheodore Ts'o2018-07-082-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Previously, when an MMP-protected file system is remounted read-only, the kmmpd thread would exit the next time it woke up (a few seconds later), without resetting the MMP sequence number back to EXT4_MMP_SEQ_CLEAN. Fix this by explicitly killing the MMP thread when the file system is remounted read-only. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger@dilger.ca>
* | ext4: fix false negatives *and* false positives in ext4_check_descriptors()Theodore Ts'o2018-07-081-3/+2
|/ | | | | | | | | | | | | | | | | Ext4_check_descriptors() was getting called before s_gdb_count was initialized. So for file systems w/o the meta_bg feature, allocation bitmaps could overlap the block group descriptors and ext4 wouldn't notice. For file systems with the meta_bg feature enabled, there was a fencepost error which would cause the ext4_check_descriptors() to incorrectly believe that the block allocation bitmap overlaps with the block group descriptor blocks, and it would reject the mount. Fix both of these problems. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2018-07-0810-95/+147
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o: "Bug fixes for ext4; most of which relate to vulnerabilities where a maliciously crafted file system image can result in a kernel OOPS or hang. At least one fix addresses an inline data bug could be triggered by userspace without the need of a crafted file system (although it does require that the inline data feature be enabled)" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: check superblock mapped prior to committing ext4: add more mount time checks of the superblock ext4: add more inode number paranoia checks ext4: avoid running out of journal credits when appending to an inline file jbd2: don't mark block as modified if the handle is out of credits ext4: never move the system.data xattr out of the inode body ext4: clear i_data in ext4_inode_info when removing inline data ext4: include the illegal physical block in the bad map ext4_error msg ext4: verify the depth of extent tree in ext4_find_extent() ext4: only look at the bg_flags field if it is valid ext4: make sure bitmaps and the inode table don't overlap with bg descriptors ext4: always check block group bounds in ext4_init_block_bitmap() ext4: always verify the magic number in xattr blocks ext4: add corruption check in ext4_xattr_set_entry() ext4: add warn_on_error mount option
| * ext4: check superblock mapped prior to committingJon Derrick2018-07-021-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch attempts to close a hole leading to a BUG seen with hot removals during writes [1]. A block device (NVME namespace in this test case) is formatted to EXT4 without partitions. It's mounted and write I/O is run to a file, then the device is hot removed from the slot. The superblock attempts to be written to the drive which is no longer present. The typical chain of events leading to the BUG: ext4_commit_super() __sync_dirty_buffer() submit_bh() submit_bh_wbc() BUG_ON(!buffer_mapped(bh)); This fix checks for the superblock's buffer head being mapped prior to syncing. [1] https://www.spinics.net/lists/linux-ext4/msg56527.html Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: add more mount time checks of the superblockTheodore Ts'o2018-06-171-11/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | The kernel's ext4 mount-time checks were more permissive than e2fsprogs's libext2fs checks when opening a file system. The superblock is considered too insane for debugfs or e2fsck to operate on it, the kernel has no business trying to mount it. This will make file system fuzzing tools work harder, but the failure cases that they find will be more useful and be easier to evaluate. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: add more inode number paranoia checksTheodore Ts'o2018-06-173-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If there is a directory entry pointing to a system inode (such as a journal inode), complain and declare the file system to be corrupted. Also, if the superblock's first inode number field is too small, refuse to mount the file system. This addresses CVE-2018-10882. https://bugzilla.kernel.org/show_bug.cgi?id=200069 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: avoid running out of journal credits when appending to an inline fileTheodore Ts'o2018-06-163-57/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Use a separate journal transaction if it turns out that we need to convert an inline file to use an data block. Otherwise we could end up failing due to not having journal credits. This addresses CVE-2018-10883. https://bugzilla.kernel.org/show_bug.cgi?id=200071 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: never move the system.data xattr out of the inode bodyTheodore Ts'o2018-06-161-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When expanding the extra isize space, we must never move the system.data xattr out of the inode body. For performance reasons, it doesn't make any sense, and the inline data implementation assumes that system.data xattr is never in the external xattr block. This addresses CVE-2018-10880 https://bugzilla.kernel.org/show_bug.cgi?id=200005 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: clear i_data in ext4_inode_info when removing inline dataTheodore Ts'o2018-06-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When converting from an inode from storing the data in-line to a data block, ext4_destroy_inline_data_nolock() was only clearing the on-disk copy of the i_blocks[] array. It was not clearing copy of the i_blocks[] in ext4_inode_info, in i_data[], which is the copy actually used by ext4_map_blocks(). This didn't matter much if we are using extents, since the extents header would be invalid and thus the extents could would re-initialize the extents tree. But if we are using indirect blocks, the previous contents of the i_blocks array will be treated as block numbers, with potentially catastrophic results to the file system integrity and/or user data. This gets worse if the file system is using a 1k block size and s_first_data is zero, but even without this, the file system can get quite badly corrupted. This addresses CVE-2018-10881. https://bugzilla.kernel.org/show_bug.cgi?id=200015 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: include the illegal physical block in the bad map ext4_error msgTheodore Ts'o2018-06-151-2/+2
| | | | | | | | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: verify the depth of extent tree in ext4_find_extent()Theodore Ts'o2018-06-142-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | If there is a corupted file system where the claimed depth of the extent tree is -1, this can cause a massive buffer overrun leading to sadness. This addresses CVE-2018-10877. https://bugzilla.kernel.org/show_bug.cgi?id=199417 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: only look at the bg_flags field if it is validTheodore Ts'o2018-06-144-6/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bg_flags field in the block group descripts is only valid if the uninit_bg or metadata_csum feature is enabled. We were not consistently looking at this field; fix this. Also block group #0 must never have uninitialized allocation bitmaps, or need to be zeroed, since that's where the root inode, and other special inodes are set up. Check for these conditions and mark the file system as corrupted if they are detected. This addresses CVE-2018-10876. https://bugzilla.kernel.org/show_bug.cgi?id=199403 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: make sure bitmaps and the inode table don't overlap with bg descriptorsTheodore Ts'o2018-06-131-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | It's really bad when the allocation bitmaps and the inode table overlap with the block group descriptors, since it causes random corruption of the bg descriptors. So we really want to head those off at the pass. https://bugzilla.kernel.org/show_bug.cgi?id=199865 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * ext4: always check block group bounds in ext4_init_block_bitmap()Theodore Ts'o2018-06-131-7/+3
| | | | | | | | | | | | | | | | | | | | | | Regardless of whether the flex_bg feature is set, we should always check to make sure the bits we are setting in the block bitmap are within the block group bounds. https://bugzilla.kernel.org/show_bug.cgi?id=199865 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org