summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'iomap-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2023-07-201-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull iomap fix from Darrick Wong: "Fix partial write regression. It turns out that fstests doesn't have any test coverage for short writes, but LTP does. Fortunately, this was caught right after -rc1 was tagged. Summary: - Fix a bug wherein a failed write could clobber short write status" * tag 'iomap-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: micro optimize the ki_pos assignment in iomap_file_buffered_write iomap: fix a regression for partial write errors
| * iomap: micro optimize the ki_pos assignment in iomap_file_buffered_writeChristoph Hellwig2023-07-171-1/+1
| | | | | | | | | | | | | | | | | | | | We have the new value for ki_pos right at hand in iter.pos, so assign that instead of recalculating it from ret. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.harjani@gmail.com>
| * iomap: fix a regression for partial write errorsChristoph Hellwig2023-07-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When write* wrote some data it should return the amount of written data and not the error code that caused it to stop. Fix a recent regression in iomap_file_buffered_write that caused it to return the errno instead. Fixes: 219580eea1ee ("iomap: update ki_pos in iomap_file_buffered_write") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.harjani@gmail.com>
* | Merge tag 'xfs-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2023-07-203-13/+71
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Darrick Wong: "Flexarray declaration conversions. This probably should've been done with the merge window open, but I was not aware that the UBSAN knob would be getting turned up for 6.5, and the fstests failures due to the kernel warnings are getting in the way of testing. Summary: - Convert all the array[1] declarations into the accepted flex array[] declarations so that UBSAN and friends will not get confused" * tag 'xfs-6.5-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: convert flex-array declarations in xfs attr shortform objects xfs: convert flex-array declarations in xfs attr leaf blocks xfs: convert flex-array declarations in struct xfs_attrlist*
| * | xfs: convert flex-array declarations in xfs attr shortform objectsDarrick J. Wong2023-07-172-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of 6.5-rc1, UBSAN trips over the ondisk extended attribute shortform definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
| * | xfs: convert flex-array declarations in xfs attr leaf blocksDarrick J. Wong2023-07-172-10/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of 6.5-rc1, UBSAN trips over the ondisk extended attribute leaf block definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. ================================================================================ UBSAN: array-index-out-of-bounds in fs/xfs/libxfs/xfs_attr_leaf.c:2535:24 index 2 is out of range for type '__u8 [1]' Call Trace: <TASK> dump_stack_lvl+0x33/0x50 __ubsan_handle_out_of_bounds+0x9c/0xd0 xfs_attr3_leaf_getvalue+0x2ce/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_leaf_get+0x148/0x1c0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_get_ilocked+0xae/0x110 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_get+0xee/0x150 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_xattr_get+0x7d/0xc0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] __vfs_getxattr+0xa3/0x100 vfs_getxattr+0x87/0x1d0 do_getxattr+0x17a/0x220 getxattr+0x89/0xf0 Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
| * | xfs: convert flex-array declarations in struct xfs_attrlist*Darrick J. Wong2023-07-171-2/+2
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As of 6.5-rc1, UBSAN trips over the attrlist ioctl definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. This may cause friction with userspace header declarations, but suck is life. ================================================================================ UBSAN: array-index-out-of-bounds in fs/xfs/xfs_ioctl.c:345:18 index 1 is out of range for type '__s32 [1]' Call Trace: <TASK> dump_stack_lvl+0x33/0x50 __ubsan_handle_out_of_bounds+0x9c/0xd0 xfs_ioc_attr_put_listent+0x413/0x420 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_list_ilocked+0x170/0x850 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_list+0xb7/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_ioc_attr_list+0x13b/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attrlist_by_handle+0xab/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_file_ioctl+0x1ff/0x15e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] vfs_ioctl+0x1f/0x60 The kernel and xfsprogs code that uses these structures will not have problems, but the long tail of external user programs might. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
* | Merge tag 'for-6.5-rc2-tag' of ↵Linus Torvalds2023-07-206-46/+79
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Stable fixes: - fix race between balance and cancel/pause - various iput() fixes - fix use-after-free of new block group that became unused - fix warning when putting transaction with qgroups enabled after abort - fix crash in subpage mode when page could be released between map and map read - when scrubbing raid56 verify the P/Q stripes unconditionally - fix minor memory leak in zoned mode when a block group with an unexpected superblock is found Regression fixes: - fix ordered extent split error handling when submitting direct IO - user irq-safe locking when adding delayed iputs" * tag 'for-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix warning when putting transaction with qgroups enabled after abort btrfs: fix ordered extent split error handling in btrfs_dio_submit_io btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand btrfs: raid56: always verify the P/Q contents for scrub btrfs: use irq safe locking when running and adding delayed iputs btrfs: fix iput() on error pointer after error during orphan cleanup btrfs: fix double iput() on inode after an error during orphan cleanup btrfs: zoned: fix memory leak after finding block group with super blocks btrfs: fix use-after-free of new block group that became unused btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block btrfs: fix race between balance and cancel/pause
| * | btrfs: fix warning when putting transaction with qgroups enabled after abortFilipe Manana2023-07-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have a transaction abort with qgroups enabled we get a warning triggered when doing the final put on the transaction, like this: [552.6789] ------------[ cut here ]------------ [552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs] [552.6817] Modules linked in: btrfs blake2b_generic xor (...) [552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs] [552.6821] Code: bd a0 01 00 (...) [552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286 [552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000 [552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010 [552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20 [552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70 [552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028 [552.6821] FS: 0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000 [552.6821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0 [552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [552.6822] Call Trace: [552.6822] <TASK> [552.6822] ? __warn+0x80/0x130 [552.6822] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6824] ? report_bug+0x1f4/0x200 [552.6824] ? handle_bug+0x42/0x70 [552.6824] ? exc_invalid_op+0x14/0x70 [552.6824] ? asm_exc_invalid_op+0x16/0x20 [552.6824] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6826] btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs] [552.6828] ? _raw_spin_unlock_irqrestore+0x23/0x40 [552.6828] ? try_to_wake_up+0x94/0x5e0 [552.6828] ? __pfx_process_timeout+0x10/0x10 [552.6828] transaction_kthread+0x103/0x1d0 [btrfs] [552.6830] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] [552.6832] kthread+0xee/0x120 [552.6832] ? __pfx_kthread+0x10/0x10 [552.6832] ret_from_fork+0x29/0x50 [552.6832] </TASK> [552.6832] ---[ end trace 0000000000000000 ]--- This corresponds to this line of code: void btrfs_put_transaction(struct btrfs_transaction *transaction) { (...) WARN_ON(!RB_EMPTY_ROOT( &transaction->delayed_refs.dirty_extent_root)); (...) } The warning happens because btrfs_qgroup_destroy_extent_records(), called in the transaction abort path, we free all entries from the rbtree "dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we don't actually empty the rbtree - it's still pointing to nodes that were freed. So set the rbtree's root node to NULL to avoid this warning (assign RB_ROOT). Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix ordered extent split error handling in btrfs_dio_submit_ioChristoph Hellwig2023-07-181-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the call to btrfs_extract_ordered_extent in btrfs_dio_submit_io fails to allocate memory for a new ordered_extent, it calls into the btrfs_dio_end_io for error handling. btrfs_dio_end_io then assumes that bbio->ordered is set because it is supposed to be at this point, except for this error handling corner case. Try to not overload the btrfs_dio_end_io with error handling of a bio in a non-canonical state, and instead call btrfs_finish_ordered_extent and iomap_dio_bio_end_io directly for this error case. Reported-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com> Fixes: b41b6f6937dc ("btrfs: use btrfs_finish_ordered_extent to complete direct writes") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expandJosef Bacik2023-07-181-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While trying to get the subpage blocksize tests running, I hit the following panic on generic/476 assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229 kernel BUG at fs/btrfs/subpage.c:229! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 1 PID: 1453 Comm: fsstress Not tainted 6.4.0-rc7+ #12 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230301gitf80f052277c8-26.fc38 03/01/2023 pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : btrfs_subpage_assert+0xbc/0xf0 lr : btrfs_subpage_assert+0xbc/0xf0 Call trace: btrfs_subpage_assert+0xbc/0xf0 btrfs_subpage_clear_checked+0x38/0xc0 btrfs_page_clear_checked+0x48/0x98 btrfs_truncate_block+0x5d0/0x6a8 btrfs_cont_expand+0x5c/0x528 btrfs_write_check.isra.0+0xf8/0x150 btrfs_buffered_write+0xb4/0x760 btrfs_do_write_iter+0x2f8/0x4b0 btrfs_file_write_iter+0x1c/0x30 do_iter_readv_writev+0xc8/0x158 do_iter_write+0x9c/0x210 vfs_iter_write+0x24/0x40 iter_file_splice_write+0x224/0x390 direct_splice_actor+0x38/0x68 splice_direct_to_actor+0x12c/0x260 do_splice_direct+0x90/0xe8 generic_copy_file_range+0x50/0x90 vfs_copy_file_range+0x29c/0x470 __arm64_sys_copy_file_range+0xcc/0x498 invoke_syscall.constprop.0+0x80/0xd8 do_el0_svc+0x6c/0x168 el0_svc+0x50/0x1b0 el0t_64_sync_handler+0x114/0x120 el0t_64_sync+0x194/0x198 This happens because during btrfs_cont_expand we'll get a page, set it as mapped, and if it's not Uptodate we'll read it. However between the read and re-locking the page we could have called release_folio() on the page, but left the page in the file mapping. release_folio() can clear the page private, and thus further down we blow up when we go to modify the subpage bits. Fix this by putting the set_page_extent_mapped() after the read. This is safe because read_folio() will call set_page_extent_mapped() before it does the read, and then if we clear page private but leave it on the mapping we're completely safe re-setting set_page_extent_mapped(). With this patch I can now run generic/476 without panicing. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: raid56: always verify the P/Q contents for scrubQu Wenruo2023-07-181-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [REGRESSION] Commit 75b470332965 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") changed the behavior of scrub_rbio(). Initially if we have no error reading the raid bio, we will assign @need_check to true, then finish_parity_scrub() would later verify the content of P/Q stripes before writeback. But after that commit we never verify the content of P/Q stripes and just writeback them. This can lead to unrepaired P/Q stripes during scrub, or already corrupted P/Q copied to the dev-replace target. [FIX] The situation is more complex than the regression, in fact the initial behavior is not 100% correct either. If we have the following rare case, it can still lead to the same problem using the old behavior: 0 16K 32K 48K 64K Data 1: |IIIIIII| | Data 2: | | Parity: | |CCCCCCC| | Where "I" means IO error, "C" means corruption. In the above case, we're scrubbing the parity stripe, then read out all the contents of Data 1, Data 2, Parity stripes. But found IO error in Data 1, which leads to rebuild using Data 2 and Parity and got the correct data. In that case, we would not verify if the Parity is correct for range [16K, 32K). So here we have to always verify the content of Parity no matter if we did recovery or not. This patch would remove the @need_check parameter of finish_parity_scrub() completely, and would always do the P/Q verification before writeback. Fixes: 75b470332965 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") CC: stable@vger.kernel.org # 6.2+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: use irq safe locking when running and adding delayed iputsFilipe Manana2023-07-181-10/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running delayed iputs, which never happens in an irq context, needs to lock the spinlock fs_info->delayed_iput_lock. When finishing bios for data writes (irq context, bio.c) we call btrfs_put_ordered_extent() which needs to add a delayed iput and for that it needs to acquire the spinlock fs_info->delayed_iput_lock. Without disabling irqs when running delayed iputs we can therefore deadlock on that spinlock. The same deadlock can also happen when adding an inode to the delayed iputs list, since this can be done outside an irq context as well. Syzbot recently reported this, which results in the following trace: ================================ WARNING: inconsistent lock state 6.4.0-syzkaller-09904-ga507db1d8fdc #0 Not tainted -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. btrfs-cleaner/16079 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline] ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523 {IN-SOFTIRQ-W} state was registered at: lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:350 [inline] btrfs_add_delayed_iput+0x128/0x390 fs/btrfs/inode.c:3490 btrfs_put_ordered_extent fs/btrfs/ordered-data.c:559 [inline] btrfs_put_ordered_extent+0x2f6/0x610 fs/btrfs/ordered-data.c:547 __btrfs_bio_end_io fs/btrfs/bio.c:118 [inline] __btrfs_bio_end_io+0x136/0x180 fs/btrfs/bio.c:112 btrfs_orig_bbio_end_io+0x86/0x2b0 fs/btrfs/bio.c:163 btrfs_simple_end_io+0x105/0x380 fs/btrfs/bio.c:378 bio_endio+0x589/0x690 block/bio.c:1617 req_bio_endio block/blk-mq.c:766 [inline] blk_update_request+0x5c5/0x1620 block/blk-mq.c:911 blk_mq_end_request+0x59/0x680 block/blk-mq.c:1032 lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370 blk_complete_reqs+0xb3/0xf0 block/blk-mq.c:1110 __do_softirq+0x1d4/0x905 kernel/softirq.c:553 run_ksoftirqd kernel/softirq.c:921 [inline] run_ksoftirqd+0x31/0x60 kernel/softirq.c:913 smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164 kthread+0x344/0x440 kernel/kthread.c:389 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 irq event stamp: 39 hardirqs last enabled at (39): [<ffffffff81d5ebc4>] __do_kmem_cache_free mm/slab.c:3558 [inline] hardirqs last enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free mm/slab.c:3582 [inline] hardirqs last enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free+0x244/0x370 mm/slab.c:3575 hardirqs last disabled at (38): [<ffffffff81d5eb5e>] __do_kmem_cache_free mm/slab.c:3553 [inline] hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free mm/slab.c:3582 [inline] hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free+0x1de/0x370 mm/slab.c:3575 softirqs last enabled at (0): [<ffffffff814ac99f>] copy_process+0x227f/0x75c0 kernel/fork.c:2448 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&fs_info->delayed_iput_lock); <Interrupt> lock(&fs_info->delayed_iput_lock); *** DEADLOCK *** 1 lock held by btrfs-cleaner/16079: #0: ffff888107804860 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x103/0x4b0 fs/btrfs/disk-io.c:1463 stack backtrace: CPU: 3 PID: 16079 Comm: btrfs-cleaner Not tainted 6.4.0-syzkaller-09904-ga507db1d8fdc #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_usage_bug kernel/locking/lockdep.c:3978 [inline] valid_state kernel/locking/lockdep.c:4020 [inline] mark_lock_irq kernel/locking/lockdep.c:4223 [inline] mark_lock.part.0+0x1102/0x1960 kernel/locking/lockdep.c:4685 mark_lock kernel/locking/lockdep.c:4649 [inline] mark_usage kernel/locking/lockdep.c:4598 [inline] __lock_acquire+0x8e4/0x5e20 kernel/locking/lockdep.c:5098 lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:350 [inline] btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523 cleaner_kthread+0x2e5/0x4b0 fs/btrfs/disk-io.c:1478 kthread+0x344/0x440 kernel/kthread.c:389 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 </TASK> So fix this by using spin_lock_irq() and spin_unlock_irq() when running delayed iputs, and using spin_lock_irqsave() and spin_unlock_irqrestore() when adding a delayed iput(). Reported-by: syzbot+da501a04be5ff533b102@syzkaller.appspotmail.com Fixes: ec63b84d4611 ("btrfs: add an ordered_extent pointer to struct btrfs_bio") Link: https://lore.kernel.org/linux-btrfs/000000000000d5c89a05ffbd39dd@google.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix iput() on error pointer after error during orphan cleanupFilipe Manana2023-07-181-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At btrfs_orphan_cleanup(), if we can't find an inode (btrfs_iget() returns an -ENOENT error pointer), we proceed with 'ret' set to -ENOENT and the inode pointer set to ERR_PTR(-ENOENT). Later when we proceed to the body of the following if statement: if (ret == -ENOENT || inode->i_nlink) { (...) trans = btrfs_start_transaction(root, 1); if (IS_ERR(trans)) { ret = PTR_ERR(trans); iput(inode); goto out; } (...) ret = btrfs_del_orphan_item(trans, root, found_key.objectid); btrfs_end_transaction(trans); if (ret) { iput(inode); goto out; } continue; } If we get an error from btrfs_start_transaction() or from the call to btrfs_del_orphan_item() we end calling iput() against an inode pointer that has a value of ERR_PTR(-ENOENT), resulting in a crash with the following trace: [876.667] BUG: kernel NULL pointer dereference, address: 0000000000000096 [876.667] #PF: supervisor read access in kernel mode [876.667] #PF: error_code(0x0000) - not-present page [876.667] PGD 0 P4D 0 [876.668] Oops: 0000 [#1] PREEMPT SMP PTI [876.668] CPU: 0 PID: 2356187 Comm: mount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [876.668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [876.668] RIP: 0010:iput+0xa/0x20 [876.668] Code: ff ff ff 66 (...) [876.669] RSP: 0018:ffffafa9c0c9f9d0 EFLAGS: 00010282 [876.669] RAX: ffffffffffffffe4 RBX: 000000000009453b RCX: 0000000000000000 [876.669] RDX: 0000000000000001 RSI: ffffafa9c0c9f930 RDI: fffffffffffffffe [876.669] RBP: ffff95c612f3b800 R08: 0000000000000001 R09: ffffffffffffffe4 [876.670] R10: 00018f2a71010000 R11: 000000000ead96e3 R12: ffff95cb7d6909a0 [876.670] R13: fffffffffffffffe R14: ffff95c60f477000 R15: 00000000ffffffe4 [876.670] FS: 00007f5fbe30a840(0000) GS:ffff95ccdfa00000(0000) knlGS:0000000000000000 [876.670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [876.671] CR2: 0000000000000096 CR3: 000000055e9f6004 CR4: 0000000000370ef0 [876.671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [876.671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [876.672] Call Trace: [876.744] <TASK> [876.744] ? __die_body+0x1b/0x60 [876.744] ? page_fault_oops+0x15d/0x450 [876.745] ? __kmem_cache_alloc_node+0x47/0x410 [876.745] ? do_user_addr_fault+0x65/0x8a0 [876.745] ? exc_page_fault+0x74/0x170 [876.746] ? asm_exc_page_fault+0x22/0x30 [876.746] ? iput+0xa/0x20 [876.746] btrfs_orphan_cleanup+0x221/0x330 [btrfs] [876.746] btrfs_lookup_dentry+0x58f/0x5f0 [btrfs] [876.747] btrfs_lookup+0xe/0x30 [btrfs] [876.747] __lookup_slow+0x82/0x130 [876.785] walk_component+0xe5/0x160 [876.786] path_lookupat.isra.0+0x6e/0x150 [876.786] filename_lookup+0xcf/0x1a0 [876.786] ? mod_objcg_state+0xd2/0x360 [876.786] ? obj_cgroup_charge+0xf5/0x110 [876.787] ? should_failslab+0xa/0x20 [876.787] ? kmem_cache_alloc+0x47/0x450 [876.787] vfs_path_lookup+0x51/0x90 [876.788] mount_subtree+0x8d/0x130 [876.788] btrfs_mount+0x149/0x410 [btrfs] [876.788] ? __kmem_cache_alloc_node+0x47/0x410 [876.788] ? vfs_parse_fs_param+0xc0/0x110 [876.789] legacy_get_tree+0x24/0x50 [876.834] vfs_get_tree+0x22/0xd0 [876.852] path_mount+0x2d8/0x9c0 [876.852] do_mount+0x79/0x90 [876.852] __x64_sys_mount+0x8e/0xd0 [876.853] do_syscall_64+0x38/0x90 [876.899] entry_SYSCALL_64_after_hwframe+0x72/0xdc [876.958] RIP: 0033:0x7f5fbe50b76a [876.959] Code: 48 8b 0d a9 (...) [876.959] RSP: 002b:00007fff01925798 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [876.959] RAX: ffffffffffffffda RBX: 00007f5fbe694264 RCX: 00007f5fbe50b76a [876.960] RDX: 0000561bde6c8720 RSI: 0000561bde6bdec0 RDI: 0000561bde6c31a0 [876.960] RBP: 0000561bde6bdc70 R08: 0000000000000000 R09: 0000000000000001 [876.960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [876.960] R13: 0000561bde6c31a0 R14: 0000561bde6c8720 R15: 0000561bde6bdc70 [876.960] </TASK> So fix this by setting 'inode' to NULL whenever we get an error from btrfs_iget(), and to make the code simpler, stop testing for 'ret' being -ENOENT to check if we have an inode - instead test for 'inode' being NULL or not. Having a NULL 'inode' prevents any iput() call from crashing, as iput() ignores NULL inode pointers. Also, stop testing for a NULL return value from btrfs_iget() with PTR_ERR_OR_ZERO(), because btrfs_iget() never returns NULL - in case an inode is not found, it returns ERR_PTR(-ENOENT), and in case of memory allocation failure, it returns ERR_PTR(-ENOMEM). We also don't need the extra iput() calls on the error branches for the btrfs_start_transaction() and btrfs_del_orphan_item() calls, as we have already called iput() before, so remove them. Fixes: a13bb2c03848 ("btrfs: add missing iputs on orphan cleanup failure") CC: stable@vger.kernel.org # 6.4 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix double iput() on inode after an error during orphan cleanupFilipe Manana2023-07-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At btrfs_orphan_cleanup(), if we were able to find the inode, we do an iput() on the inode, then if btrfs_drop_verity_items() succeeds and then either btrfs_start_transaction() or btrfs_del_orphan_item() fail, we do another iput() in the respective error paths, resulting in an extra iput() on the inode. Fix this by setting inode to NULL after the first iput(), as iput() ignores a NULL inode pointer argument. Fixes: a13bb2c03848 ("btrfs: add missing iputs on orphan cleanup failure") CC: stable@vger.kernel.org # 6.4 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: zoned: fix memory leak after finding block group with super blocksFilipe Manana2023-07-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At exclude_super_stripes(), if we happen to find a block group that has super blocks mapped to it and we are on a zoned filesystem, we error out as this is not supposed to happen, indicating either a bug or maybe some memory corruption for example. However we are exiting the function without freeing the memory allocated for the logical address of the super blocks. Fix this by freeing the logical address. Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix use-after-free of new block group that became unusedFilipe Manana2023-07-112-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a task creates a new block group and that block group becomes unused before we finish its creation, at btrfs_create_pending_block_groups(), then when btrfs_mark_bg_unused() is called against the block group, we assume that the block group is currently in the list of block groups to reclaim, and we move it out of the list of new block groups and into the list of unused block groups. This has two consequences: 1) We move it out of the list of new block groups associated to the current transaction. So the block group creation is not finished and if we attempt to delete the bg because it's unused, we will not find the block group item in the extent tree (or the new block group tree), its device extent items in the device tree etc, resulting in the deletion to fail due to the missing items; 2) We don't increment the reference count on the block group when we move it to the list of unused block groups, because we assumed the block group was on the list of block groups to reclaim, and in that case it already has the correct reference count. However the block group was on the list of new block groups, in which case no extra reference was taken because it's local to the current task. This later results in doing an extra reference count decrement when removing the block group from the unused list, eventually leading the reference count to 0. This second case was caught when running generic/297 from fstests, which produced the following assertion failure and stack trace: [589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299 [589.559] ------------[ cut here ]------------ [589.559] kernel BUG at fs/btrfs/block-group.c:4299! [589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI [589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs] [589.561] Code: 68 62 da c0 (...) [589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246 [589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000 [589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff [589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50 [589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00 [589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100 [589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000 [589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0 [589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [589.564] Call Trace: [589.564] <TASK> [589.565] ? __die_body+0x1b/0x60 [589.565] ? die+0x39/0x60 [589.565] ? do_trap+0xeb/0x110 [589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs] [589.566] ? do_error_trap+0x6a/0x90 [589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs] [589.566] ? exc_invalid_op+0x4e/0x70 [589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs] [589.567] ? asm_exc_invalid_op+0x16/0x20 [589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs] [589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs] [589.567] close_ctree+0x35d/0x560 [btrfs] [589.568] ? fsnotify_sb_delete+0x13e/0x1d0 [589.568] ? dispose_list+0x3a/0x50 [589.568] ? evict_inodes+0x151/0x1a0 [589.568] generic_shutdown_super+0x73/0x1a0 [589.569] kill_anon_super+0x14/0x30 [589.569] btrfs_kill_super+0x12/0x20 [btrfs] [589.569] deactivate_locked_super+0x2e/0x70 [589.569] cleanup_mnt+0x104/0x160 [589.570] task_work_run+0x56/0x90 [589.570] exit_to_user_mode_prepare+0x160/0x170 [589.570] syscall_exit_to_user_mode+0x22/0x50 [589.570] ? __x64_sys_umount+0x12/0x20 [589.571] do_syscall_64+0x48/0x90 [589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc [589.571] RIP: 0033:0x7f497ff0a567 [589.571] Code: af 98 0e (...) [589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567 [589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0 [589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100 [589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0 [589.573] </TASK> [589.574] Modules linked in: dm_snapshot dm_thin_pool (...) [589.576] ---[ end trace 0000000000000000 ]--- Fix this by adding a runtime flag to the block group to tell that the block group is still in the list of new block groups, and therefore it should not be moved to the list of unused block groups, at btrfs_mark_bg_unused(), until the flag is cleared, when we finish the creation of the block group at btrfs_create_pending_block_groups(). Fixes: a9f189716cf1 ("btrfs: move out now unused BG from the reclaim list") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_blockChristoph Hellwig2023-07-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The mirror_num_ret is allowed to be NULL, although it has to be set when smap is set. Unfortunately that is not a well enough specifiable invariant for static type checkers, so add a NULL check to make sure they are fine. Fixes: 03793cbbc80f ("btrfs: add fast path for single device io in __btrfs_map_block") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix race between balance and cancel/pauseJosef Bacik2023-07-111-10/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot reported a panic that looks like this: assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE_PAUSED, in fs/btrfs/ioctl.c:465 ------------[ cut here ]------------ kernel BUG at fs/btrfs/messages.c:259! RIP: 0010:btrfs_assertfail+0x2c/0x30 fs/btrfs/messages.c:259 Call Trace: <TASK> btrfs_exclop_balance fs/btrfs/ioctl.c:465 [inline] btrfs_ioctl_balance fs/btrfs/ioctl.c:3564 [inline] btrfs_ioctl+0x531e/0x5b30 fs/btrfs/ioctl.c:4632 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The reproducer is running a balance and a cancel or pause in parallel. The way balance finishes is a bit wonky, if we were paused we need to save the balance_ctl in the fs_info, but clear it otherwise and cleanup. However we rely on the return values being specific errors, or having a cancel request or no pause request. If balance completes and returns 0, but we have a pause or cancel request we won't do the appropriate cleanup, and then the next time we try to start a balance we'll trip this ASSERT. The error handling is just wrong here, we always want to clean up, unless we got -ECANCELLED and we set the appropriate pause flag in the exclusive op. With this patch the reproducer ran for an hour without tripping, previously it would trip in less than a few minutes. Reported-by: syzbot+c0f3acf145cb465426d5@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | Merge tag 'fuse-update-6.5' of ↵Linus Torvalds2023-07-193-13/+20
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Small but important fixes and a trivial cleanup" * tag 'fuse-update-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: ioctl: translate ENOSYS in outarg fuse: revalidate: don't invalidate if interrupted fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT fuse: remove duplicate check for nodeid fuse: add feature flag for expire-only
| * | fuse: ioctl: translate ENOSYS in outargMiklos Szeredi2023-06-211-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fuse shouldn't return ENOSYS from its ioctl implementation. If userspace responds with ENOSYS it should be translated to ENOTTY. There are two ways to return an error from the IOCTL request: - fuse_out_header.error - fuse_ioctl_out.result Commit 02c0cab8e734 ("fuse: ioctl: translate ENOSYS") already fixed this issue for the first case, but missed the second case. This patch fixes the second case. Reported-by: Jonathan Katz <jkatz@eitmlabs.org> Closes: https://lore.kernel.org/all/CALKgVmcC1VUV_gJVq70n--omMJZUb4HSh_FqvLTHgNBc+HCLFQ@mail.gmail.com/ Fixes: 02c0cab8e734 ("fuse: ioctl: translate ENOSYS") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: revalidate: don't invalidate if interruptedMiklos Szeredi2023-06-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the LOOKUP request triggered from fuse_dentry_revalidate() is interrupted, then the dentry will be invalidated, possibly resulting in submounts being unmounted. Reported-by: Xu Rongbo <xurongbo@baidu.com> Closes: https://lore.kernel.org/all/CAJfpegswN_CJJ6C3RZiaK6rpFmNyWmXfaEpnQUJ42KCwNF5tWw@mail.gmail.com/ Fixes: 9e6268db496a ("[PATCH] FUSE - read-write operations") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: Apply flags2 only when userspace set the FUSE_INIT_EXTBernd Schubert2023-06-071-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is just a safety precaution to avoid checking flags on memory that was initialized on the user space side. libfuse zeroes struct fuse_init_out outarg, but this is not guranteed to be done in all implementations. Better is to act on flags and to only apply flags2 when FUSE_INIT_EXT is set. There is a risk with this change, though - it might break existing user space libraries, which are already using flags2 without setting FUSE_INIT_EXT. The corresponding libfuse patch is here https://github.com/libfuse/libfuse/pull/662 Signed-off-by: Bernd Schubert <bschubert@ddn.com> Fixes: 53db28933e95 ("fuse: extend init flags") Cc: <stable@vger.kernel.org> # v5.17 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: remove duplicate check for nodeidzyfjeff2023-06-071-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | before this check, the nodeid has already been checked once, so the check here doesn't make an sense, so remove the check for nodeid here. if (err || !outarg->nodeid) goto out_put_forget; err = -EIO; >>> if (!outarg->nodeid) goto out_put_forget; Signed-off-by: zyfjeff <zyfjeff@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | fuse: add feature flag for expire-onlyMiklos Szeredi2023-06-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an init flag idicating whether the FUSE_EXPIRE_ONLY flag of FUSE_NOTIFY_INVAL_ENTRY is effective. This is needed for backports of this feature, otherwise the server could just check the protocol version. Fixes: 4f8d37020e1f ("fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY") Cc: <stable@vger.kernel.org> # v6.2 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | | Merge tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2023-07-167-31/+43
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull smb client fixes from Steve French: - Two reconnect fixes: important fix to address inFlight count to leak (which can leak credits), and fix for better handling a deleted share - DFS fix - SMB1 cleanup fix - deferred close fix * tag '6.5-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix mid leak during reconnection after timeout threshold cifs: is_network_name_deleted should return a bool smb: client: fix missed ses refcounting smb: client: Fix -Wstringop-overflow issues cifs: if deferred close is disabled then close files immediately
| * | | cifs: fix mid leak during reconnection after timeout thresholdShyam Prasad N2023-07-141-4/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the number of responses with status of STATUS_IO_TIMEOUT exceeds a specified threshold (NUM_STATUS_IO_TIMEOUT), we reconnect the connection. But we do not return the mid, or the credits returned for the mid, or reduce the number of in-flight requests. This bug could result in the server->in_flight count to go bad, and also cause a leak in the mids. This change moves the check to a few lines below where the response is decrypted, even of the response is read from the transform header. This way, the code for returning the mids can be reused. Also, the cifs_reconnect was reconnecting just the transport connection before. In case of multi-channel, this may not be what we want to do after several timeouts. Changed that to reconnect the session and the tree too. Also renamed NUM_STATUS_IO_TIMEOUT to a more appropriate name MAX_STATUS_IO_TIMEOUT. Fixes: 8e670f77c4a5 ("Handle STATUS_IO_TIMEOUT gracefully") Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | cifs: is_network_name_deleted should return a boolShyam Prasad N2023-07-143-7/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, is_network_name_deleted and it's implementations do not return anything if the network name did get deleted. So the function doesn't fully achieve what it advertizes. Changed the function to return a bool instead. It will now return true if the error returned is STATUS_NETWORK_NAME_DELETED and the share (tree id) was found to be connected. It returns false otherwise. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | smb: client: fix missed ses refcountingPaulo Alcantara2023-07-122-17/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use new cifs_smb_ses_inc_refcount() helper to get an active reference of @ses and @ses->dfs_root_ses (if set). This will prevent @ses->dfs_root_ses of being put in the next call to cifs_put_smb_ses() and thus potentially causing an use-after-free bug. Fixes: 8e3554150d6c ("cifs: fix sharing of DFS connections") Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | smb: client: Fix -Wstringop-overflow issuesGustavo A. R. Silva2023-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pSMB->hdr.Protocol is an array of size 4 bytes, hence when the compiler analyzes this line of code parm_data = ((char *) &pSMB->hdr.Protocol) + offset; it legitimately complains about the fact that offset points outside the bounds of the array. Notice that the compiler gives priority to the object as an array, rather than merely the address of one more byte in a structure to wich offset should be added (which seems to be the actual intention of the original implementation). Fix this by explicitly instructing the compiler to treat the code as a sequence of bytes in struct smb_com_transaction2_spi_req, and not as an array accessed through pointer notation. Notice that ((char *)pSMB) + sizeof(pSMB->hdr.smb_buf_length) points to the same address as ((char *) &pSMB->hdr.Protocol), therefore this results in no differences in binary output. Fixes the following -Wstringop-overflow warnings when built s390 architecture with defconfig (GCC 13): CC [M] fs/smb/client/cifssmb.o In function 'cifs_init_ace', inlined from 'posix_acl_to_cifs' at fs/smb/client/cifssmb.c:3046:3, inlined from 'cifs_do_set_acl' at fs/smb/client/cifssmb.c:3191:15: fs/smb/client/cifssmb.c:2987:31: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=] 2987 | cifs_ace->cifs_e_perm = local_ace->e_perm; | ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~ In file included from fs/smb/client/cifssmb.c:27: fs/smb/client/cifspdu.h: In function 'cifs_do_set_acl': fs/smb/client/cifspdu.h:384:14: note: at offset [7, 11] into destination object 'Protocol' of size 4 384 | __u8 Protocol[4]; | ^~~~~~~~ In function 'cifs_init_ace', inlined from 'posix_acl_to_cifs' at fs/smb/client/cifssmb.c:3046:3, inlined from 'cifs_do_set_acl' at fs/smb/client/cifssmb.c:3191:15: fs/smb/client/cifssmb.c:2988:30: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=] 2988 | cifs_ace->cifs_e_tag = local_ace->e_tag; | ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~ fs/smb/client/cifspdu.h: In function 'cifs_do_set_acl': fs/smb/client/cifspdu.h:384:14: note: at offset [6, 10] into destination object 'Protocol' of size 4 384 | __u8 Protocol[4]; | ^~~~~~~~ This helps with the ongoing efforts to globally enable -Wstringop-overflow. Link: https://github.com/KSPP/linux/issues/310 Fixes: dc1af4c4b472 ("cifs: implement set acl method") Cc: stable@vger.kernel.org Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | cifs: if deferred close is disabled then close files immediatelyBharath SM2023-07-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If defer close timeout value is set to 0, then there is no need to include files in the deferred close list and utilize the delayed worker for closing. Instead, we can close them immediately. Signed-off-by: Bharath SM <bharathsm@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
* | | | erofs: fix fsdax unavailability for chunk-based regular filesXin Yin2023-07-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DAX can be used to share page cache between VMs, reducing guest memory overhead. And chunk based data format is widely used for VM and container image. So enable dax support for it, make erofs better used for VM scenarios. Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files") Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230711062130.7860-1-yinxin.x@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
* | | | erofs: avoid infinite loop in z_erofs_do_read_page() when reading beyond EOFChunhai Guo2023-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | z_erofs_do_read_page() may loop infinitely due to the inappropriate truncation in the below statement. Since the offset is 64 bits and min_t() truncates the result to 32 bits. The solution is to replace unsigned int with a 64-bit type, such as erofs_off_t. cur = end - min_t(unsigned int, offset + end - map->m_la, end); - For example: - offset = 0x400160000 - end = 0x370 - map->m_la = 0x160370 - offset + end - map->m_la = 0x400000000 - offset + end - map->m_la = 0x00000000 (truncated as unsigned int) - Expected result: - cur = 0 - Actual result: - cur = 0x370 Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230710093410.44071-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
* | | | erofs: avoid useless loops in z_erofs_pcluster_readmore() when reading ↵Chunhai Guo2023-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | beyond EOF z_erofs_pcluster_readmore() may take a long time to loop when the page offset is large enough, which is unnecessary should be prevented. For example, when the following case is encountered, it will loop 4691368 times, taking about 27 seconds: - offset = 19217289215 - inode_size = 1442672 Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Fixes: 386292919c25 ("erofs: introduce readmore decompression strategy") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230710042531.28761-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
* | | | erofs: simplify z_erofs_transform_plain()Gao Xiang2023-07-121-11/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use memcpy_to_page() instead of open-coding them. In addition, add a missing flush_dcache_page() even though almost all modern architectures clear `PG_dcache_clean` flag for new file cache pages so that it doesn't change anything in practice. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230627161240.331-2-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
* | | | erofs: get rid of the remaining kmap_atomic()Gao Xiang2023-07-121-9/+9
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's unnecessary to use kmap_atomic() compared with kmap_local_page(). In addition, kmap_atomic() is deprecated now. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230627161240.331-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
* | | Merge tag 'xfs-6.5-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2023-07-091-1/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fix from Darrick Wong: "Nothing exciting here, just getting rid of a gcc warning that I got tired of seeing when I turn on gcov" * tag 'xfs-6.5-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix uninit warning in xfs_growfs_data
| * | | xfs: fix uninit warning in xfs_growfs_dataDarrick J. Wong2023-07-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Quiet down this gcc warning: fs/xfs/xfs_fsops.c: In function ‘xfs_growfs_data’: fs/xfs/xfs_fsops.c:219:21: error: ‘lastag_extended’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 219 | if (lastag_extended) { | ^~~~~~~~~~~~~~~ fs/xfs/xfs_fsops.c:100:33: note: ‘lastag_extended’ was declared here 100 | bool lastag_extended; | ^~~~~~~~~~~~~~~ By setting its value explicitly. From code analysis I don't think this is a real problem, but I have better things to do than analyse this closely. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* | | | Merge tag '6.5-rc-smb3-client-fixes-part2' of ↵Linus Torvalds2023-07-094-5/+72
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.samba.org/sfrench/cifs-2.6 Pull more smb client updates from Steve French: - fix potential use after free in unmount - minor cleanup - add worker to cleanup stale directory leases * tag '6.5-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: cifs: Add a laundromat thread for cached directories smb: client: remove redundant pointer 'server' cifs: fix session state transition to avoid use-after-free issue
| * | | | cifs: Add a laundromat thread for cached directoriesRonnie Sahlberg2023-07-052-0/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and drop cached directories after 30 seconds Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | smb: client: remove redundant pointer 'server'Colin Ian King2023-07-011-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pointer 'server' is assigned but never read, the pointer is redundant and can be removed. Cleans up clang scan build warning: fs/smb/client/dfs.c:217:3: warning: Value stored to 'server' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | | | cifs: fix session state transition to avoid use-after-free issueWinston Wen2023-07-011-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We switch session state to SES_EXITING without cifs_tcp_ses_lock now, it may lead to potential use-after-free issue. Consider the following execution processes: Thread 1: __cifs_put_smb_ses() spin_lock(&cifs_tcp_ses_lock) if (--ses->ses_count > 0) spin_unlock(&cifs_tcp_ses_lock) return spin_unlock(&cifs_tcp_ses_lock) ---> **GAP** spin_lock(&ses->ses_lock) if (ses->ses_status == SES_GOOD) ses->ses_status = SES_EXITING spin_unlock(&ses->ses_lock) Thread 2: cifs_find_smb_ses() spin_lock(&cifs_tcp_ses_lock) list_for_each_entry(ses, ...) spin_lock(&ses->ses_lock) if (ses->ses_status == SES_EXITING) spin_unlock(&ses->ses_lock) continue ... spin_unlock(&ses->ses_lock) if (ret) cifs_smb_ses_inc_refcount(ret) spin_unlock(&cifs_tcp_ses_lock) If thread 1 is preempted in the gap and thread 2 start executing, thread 2 will get the session, and soon thread 1 will switch the session state to SES_EXITING and start releasing it, even though thread 1 had increased the session's refcount and still uses it. So switch session state under cifs_tcp_ses_lock to eliminate this gap. Signed-off-by: Winston Wen <wentao@uniontech.com> Signed-off-by: Steve French <stfrench@microsoft.com>
* | | | | Merge tag 'mm-hotfixes-stable-2023-07-08-10-43' of ↵Linus Torvalds2023-07-082-7/+26
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "16 hotfixes. Six are cc:stable and the remainder address post-6.4 issues" The merge undoes the disabling of the CONFIG_PER_VMA_LOCK feature, since it was all hopefully fixed in mainline. * tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib: dhry: fix sleeping allocations inside non-preemptable section kasan, slub: fix HW_TAGS zeroing with slub_debug kasan: fix type cast in memory_is_poisoned_n mailmap: add entries for Heiko Stuebner mailmap: update manpage link bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page MAINTAINERS: add linux-next info mailmap: add Markus Schneider-Pargmann writeback: account the number of pages written back mm: call arch_swap_restore() from do_swap_page() squashfs: fix cache race with migration mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison docs: update ocfs2-devel mailing list address MAINTAINERS: update ocfs2-devel mailing list address mm: disable CONFIG_PER_VMA_LOCK until its fixed fork: lock VMAs of the parent process when forking
| * | | | | squashfs: fix cache race with migrationVincent Whitchurch2023-07-081-4/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Migration replaces the page in the mapping before copying the contents and the flags over from the old page, so check that the page in the page cache is really up to date before using it. Without this, stressing squashfs reads with parallel compaction sometimes results in squashfs reporting data corruption. Link: https://lkml.kernel.org/r/20230629-squashfs-cache-migration-v1-1-d50ebe55099d@axis.com Fixes: e994f5b677ee ("squashfs: cache partial compressed blocks") Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | docs: update ocfs2-devel mailing list addressAnthony Iliopoulos2023-07-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ocfs2-devel mailing list has been migrated to the kernel.org infrastructure, update all related documentation pointers to reflect the change. Link: https://lkml.kernel.org/r/20230628013437.47030-3-ailiop@suse.com Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | | | | Merge tag 'ceph-for-6.5-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds2023-07-078-40/+154
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph updates from Ilya Dryomov: "A bunch of CephFS fixups from Xiubo, mostly around dropping caps, along with a fix for a regression in the readahead handling code which sneaked in with the switch to netfs helpers" * tag 'ceph-for-6.5-rc1' of https://github.com/ceph/ceph-client: ceph: don't let check_caps skip sending responses for revoke msgs ceph: issue a cap release immediately if no cap exists ceph: trigger to flush the buffer when making snapshot ceph: fix blindly expanding the readahead windows ceph: add a dedicated private data for netfs rreq ceph: voluntarily drop Xx caps for requests those touch parent mtime ceph: try to dump the msgs when decoding fails ceph: only send metrics when the MDS rank is ready
| * | | | | | ceph: don't let check_caps skip sending responses for revoke msgsXiubo Li2023-06-301-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a client sends out a cap update dropping caps with the prior 'seq' just before an incoming cap revoke request, then the client may drop the revoke because it believes it's already released the requested capabilities. This causes the MDS to wait indefinitely for the client to respond to the revoke. It's therefore always a good idea to ack the cap revoke request with the bumped up 'seq'. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/61782 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | | ceph: issue a cap release immediately if no cap existsXiubo Li2023-06-301-11/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case: mds client - Releases cap and put Inode - Increase cap->seq and sends revokes req to the client - Receives release req and - Receives & drops the revoke req skip removing the cap and then eval the CInode and issue or revoke caps again. - Receives & drops the caps update or revoke req - Health warning for client isn't responding to mclientcaps(revoke) All the IMPORT/REVOKE/GRANT cap ops will increase the session seq in MDS side and then the client need to issue a cap release to unblock MDS to remove the corresponding cap to unblock possible waiters. Link: https://tracker.ceph.com/issues/61332 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | | ceph: trigger to flush the buffer when making snapshotXiubo Li2023-06-302-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'i_wr_ref' is used to track the 'Fb' caps, while whenever the 'Fb' caps is took the kclient will always take the 'Fw' caps at the same time. That means it will always be a false check in __ceph_finish_cap_snap(). When writing to buffer the kclient will take both 'Fb|Fw' caps and then write the contents to the buffer pages by increasing the 'i_wrbuffer_ref' and then just release both 'Fb|Fw'. This is different with the user space libcephfs, which will keep the 'Fb' being took and use 'i_wr_ref' instead of 'i_wrbuffer_ref' to track this until the buffer is flushed to Rados. We need to defer flushing the capsnap until the corresponding buffer pages are all flushed to Rados, and at the same time just trigger to flush the buffer pages immediately. Link: https://tracker.ceph.com/issues/48640 Link: https://tracker.ceph.com/issues/59343 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | | ceph: fix blindly expanding the readahead windowsXiubo Li2023-06-301-7/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Blindly expanding the readahead windows will cause unneccessary pagecache thrashing and also will introduce the network workload. We should disable expanding the windows if the readahead is disabled and also shouldn't expand the windows too much. Expanding forward firstly instead of expanding backward for possible sequential reads. Bound `rreq->len` to the actual file size to restore the previous page cache usage. The posix_fadvise may change the maximum size of a file readahead. Cc: stable@vger.kernel.org Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead") Link: https://lore.kernel.org/ceph-devel/20230504082510.247-1-sehuww@mail.scut.edu.cn Link: https://www.spinics.net/lists/ceph-users/msg76183.html Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Hu Weiwen <sehuww@mail.scut.edu.cn> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>