summaryrefslogtreecommitdiffstats
path: root/fs/iomap
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2022-12-142-2/+271
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull XFS updates from Darrick Wong: "The highlight of this is a batch of fixes for the online metadata checking code as we start the loooong march towards merging online repair. I aim to merge that in time for the 2023 LTS. There are also a large number of data corruption and race condition fixes in this patchset. Most notably fixed are write() calls to unwritten extents racing with writeback, which required some late(r than I prefer) code changes to iomap to support the necessary revalidations. I don't really like iomap changes going in past -rc4, but Dave and I have been working on it long enough that I chose to push it for 6.2 anyway. There are also a number of other subtle problems fixed, including the log racing with inode writeback to write inodes with incorrect link count to disk; file data mapping corruptions as a result of incorrect lock cycling when attaching dquots; refcount metadata corruption if one actually manages to share a block 2^32 times; and the log clobbering cow staging extents if they were formerly metadata blocks. Summary: - Fix a race condition w.r.t. percpu inode free counters - Fix a broken error return in xfs_remove - Print FS UUID at mount/unmount time - Numerous fixes to the online fsck code - Fix inode locking inconsistency problems when dealing with realtime metadata files - Actually merge pull requests so that we capture the cover letter contents - Fix a race between rebuilding VFS inode state and the AIL flushing inodes that could cause corrupt inodes to be written to the filesystem - Fix a data corruption problem resulting from a write() to an unwritten extent racing with writeback started on behalf of memory reclaim changing the extent state - Add debugging knobs so that we can test iomap invalidation - Fix the blockdev pagecache contents being stale after unmounting the filesystem, leading to spurious xfs_db errors and corrupt metadumps - Fix a file mapping corruption bug due to ilock cycling when attaching dquots to a file during delalloc reservation - Fix a refcount btree corruption problem due to the refcount adjustment code not handling MAXREFCOUNT correctly, resulting in unnecessary record splits - Fix COW staging extent alloctions not being classified as USERDATA, which results in filestreams being ignored and possible data corruption if the allocation was filled from the AGFL and the block buffer is still being tracked in the AIL - Fix new duplicated includes - Fix a race between the dquot shrinker and dquot freeing that could cause a UAF" * tag 'xfs-6.2-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (50 commits) xfs: dquot shrinker doesn't check for XFS_DQFLAG_FREEING xfs: Remove duplicated include in xfs_iomap.c xfs: invalidate xfs_bufs when allocating cow extents xfs: get rid of assert from xfs_btree_islastblock xfs: estimate post-merge refcounts correctly xfs: hoist refcount record merge predicates xfs: fix super block buf log item UAF during force shutdown xfs: wait iclog complete before tearing down AIL xfs: attach dquots to inode before reading data/cow fork mappings xfs: shut up -Wuninitialized in xfsaild_push xfs: use memcpy, not strncpy, to format the attr prefix during listxattr xfs: invalidate block device page cache during unmount xfs: add debug knob to slow down write for fun xfs: add debug knob to slow down writeback for fun xfs: drop write error injection is unfixable, remove it xfs: use iomap_valid method to detect stale cached iomaps iomap: write iomap validity checks xfs: xfs_bmap_punch_delalloc_range() should take a byte range iomap: buffered write failure should not truncate the page cache xfs,iomap: move delalloc punching to iomap ...
| * iomap: write iomap validity checksDave Chinner2022-11-292-2/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A recent multithreaded write data corruption has been uncovered in the iomap write code. The core of the problem is partial folio writes can be flushed to disk while a new racing write can map it and fill the rest of the page: writeback new write allocate blocks blocks are unwritten submit IO ..... map blocks iomap indicates UNWRITTEN range loop { lock folio copyin data ..... IO completes runs unwritten extent conv blocks are marked written <iomap now stale> get next folio } Now add memory pressure such that memory reclaim evicts the partially written folio that has already been written to disk. When the new write finally gets to the last partial page of the new write, it does not find it in cache, so it instantiates a new page, sees the iomap is unwritten, and zeros the part of the page that it does not have data from. This overwrites the data on disk that was originally written. The full description of the corruption mechanism can be found here: https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/ To solve this problem, we need to check whether the iomap is still valid after we lock each folio during the write. We have to do it after we lock the page so that we don't end up with state changes occurring while we wait for the folio to be locked. Hence we need a mechanism to be able to check that the cached iomap is still valid (similar to what we already do in buffered writeback), and we need a way for ->begin_write to back out and tell the high level iomap iterator that we need to remap the remaining write range. The iomap needs to grow some storage for the validity cookie that the filesystem provides to travel with the iomap. XFS, in particular, also needs to know some more information about what the iomap maps (attribute extents rather than file data extents) to for the validity cookie to cover all the types of iomaps we might need to validate. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| * iomap: buffered write failure should not truncate the page cacheDave Chinner2022-11-291-15/+180
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iomap_file_buffered_write_punch_delalloc() currently invalidates the page cache over the unused range of the delalloc extent that was allocated. While the write allocated the delalloc extent, it does not own it exclusively as the write does not hold any locks that prevent either writeback or mmap page faults from changing the state of either the page cache or the extent state backing this range. Whilst xfs_bmap_punch_delalloc_range() already handles races in extent conversion - it will only punch out delalloc extents and it ignores any other type of extent - the page cache truncate does not discriminate between data written by this write or some other task. As a result, truncating the page cache can result in data corruption if the write races with mmap modifications to the file over the same range. generic/346 exercises this workload, and if we randomly fail writes (as will happen when iomap gets stale iomap detection later in the patchset), it will randomly corrupt the file data because it removes data written by mmap() in the same page as the write() that failed. Hence we do not want to punch out the page cache over the range of the extent we failed to write to - what we actually need to do is detect the ranges that have dirty data in cache over them and *not punch them out*. To do this, we have to walk the page cache over the range of the delalloc extent we want to remove. This is made complex by the fact we have to handle partially up-to-date folios correctly and this can happen even when the FSB size == PAGE_SIZE because we now support multi-page folios in the page cache. Because we are only interested in discovering the edges of data ranges in the page cache (i.e. hole-data boundaries) we can make use of mapping_seek_hole_data() to find those transitions in the page cache. As we hold the invalidate_lock, we know that the boundaries are not going to change while we walk the range. This interface is also byte-based and is sub-page block aware, so we can find the data ranges in the cache based on byte offsets rather than page, folio or fs block sized chunks. This greatly simplifies the logic of finding dirty cached ranges in the page cache. Once we've identified a range that contains cached data, we can then iterate the range folio by folio. This allows us to determine if the data is dirty and hence perform the correct delalloc extent punching operations. The seek interface we use to iterate data ranges will give us sub-folio start/end granularity, so we may end up looking up the same folio multiple times as the seek interface iterates across each discontiguous data region in the folio. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| * xfs,iomap: move delalloc punching to iomapDave Chinner2022-11-231-0/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because that's what Christoph wants for this error handling path only XFS uses. It requires a new iomap export for handling errors over delalloc ranges. This is basically the XFS code as is stands, but even though Christoph wants this as iomap funcitonality, we still have to call it from the filesystem specific ->iomap_end callback, and call into the iomap code with yet another filesystem specific callback to punch the delalloc extent within the defined ranges. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
* | iomap: directly use logical block sizeKeith Busch2022-11-141-2/+1
|/ | | | | | | | | | | | | | Don't transform the logical block size to a bit shift only to shift it back to the original block size. Just use the size. Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
* iomap: add a tracepoint for mappings returned by map_blocksDarrick J. Wong2022-10-022-0/+2
| | | | | | | | Add a new tracepoint so we can see what mapping the filesystem returns to writeback a dirty page. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* iomap: iomap: fix memory corruption when recording errors during writebackDarrick J. Wong2022-10-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every now and then I see this crash on arm64: Unable to handle kernel NULL pointer dereference at virtual address 00000000000000f8 Buffer I/O error on dev dm-0, logical block 8733687, async page read Mem abort info: ESR = 0x0000000096000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x06: level 2 translation fault Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 user pgtable: 64k pages, 42-bit VAs, pgdp=0000000139750000 [00000000000000f8] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000, pmd=0000000000000000 Internal error: Oops: 96000006 [#1] PREEMPT SMP Buffer I/O error on dev dm-0, logical block 8733688, async page read Dumping ftrace buffer: Buffer I/O error on dev dm-0, logical block 8733689, async page read (ftrace buffer empty) XFS (dm-0): log I/O error -5 Modules linked in: dm_thin_pool dm_persistent_data XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ec/0x590 [xfs] (fs/xfs/xfs_trans_buf.c:296). dm_bio_prison XFS (dm-0): Please unmount the filesystem and rectify the problem(s) XFS (dm-0): xfs_imap_lookup: xfs_ialloc_read_agi() returned error -5, agno 0 dm_bufio dm_log_writes xfs nft_chain_nat xt_REDIRECT nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_REJECT potentially unexpected fatal signal 6. nf_reject_ipv6 potentially unexpected fatal signal 6. ipt_REJECT nf_reject_ipv4 CPU: 1 PID: 122166 Comm: fsstress Tainted: G W 6.0.0-rc5-djwa #rc5 3004c9f1de887ebae86015f2677638ce51ee7 rpcsec_gss_krb5 auth_rpcgss xt_tcpudp ip_set_hash_ip ip_set_hash_net xt_set nft_compat ip_set_hash_mac ip_set nf_tables Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021 pstate: 60001000 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) ip_tables pc : 000003fd6d7df200 x_tables lr : 000003fd6d7df1ec overlay nfsv4 CPU: 0 PID: 54031 Comm: u4:3 Tainted: G W 6.0.0-rc5-djwa #rc5 3004c9f1de887ebae86015f2677638ce51ee7405 Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021 Workqueue: writeback wb_workfn sp : 000003ffd9522fd0 (flush-253:0) pstate: 60401005 (nZCv daif +PAN -UAO -TCO -DIT +SSBS BTYPE=--) pc : errseq_set+0x1c/0x100 x29: 000003ffd9522fd0 x28: 0000000000000023 x27: 000002acefeb6780 x26: 0000000000000005 x25: 0000000000000001 x24: 0000000000000000 x23: 00000000ffffffff x22: 0000000000000005 lr : __filemap_set_wb_err+0x24/0xe0 x21: 0000000000000006 sp : fffffe000f80f760 x29: fffffe000f80f760 x28: 0000000000000003 x27: fffffe000f80f9f8 x26: 0000000002523000 x25: 00000000fffffffb x24: fffffe000f80f868 x23: fffffe000f80fbb0 x22: fffffc0180c26a78 x21: 0000000002530000 x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000001 x13: 0000000000470af3 x12: fffffc0058f70000 x11: 0000000000000040 x10: 0000000000001b20 x9 : fffffe000836b288 x8 : fffffc00eb9fd480 x7 : 0000000000f83659 x6 : 0000000000000000 x5 : 0000000000000869 x4 : 0000000000000005 x3 : 00000000000000f8 x20: 000003fd6d740020 x19: 000000000001dd36 x18: 0000000000000001 x17: 000003fd6d78704c x16: 0000000000000001 x15: 000002acfac87668 x2 : 0000000000000ffa x1 : 00000000fffffffb x0 : 00000000000000f8 Call trace: errseq_set+0x1c/0x100 __filemap_set_wb_err+0x24/0xe0 iomap_do_writepage+0x5e4/0xd5c write_cache_pages+0x208/0x674 iomap_writepages+0x34/0x60 xfs_vm_writepages+0x8c/0xcc [xfs 7a861f39c43631f15d3a5884246ba5035d4ca78b] x14: 0000000000000000 x13: 2064656e72757465 x12: 0000000000002180 x11: 000003fd6d8a82d0 x10: 0000000000000000 x9 : 000003fd6d8ae288 x8 : 0000000000000083 x7 : 00000000ffffffff x6 : 00000000ffffffee x5 : 00000000fbad2887 x4 : 000003fd6d9abb58 x3 : 000003fd6d740020 x2 : 0000000000000006 x1 : 000000000001dd36 x0 : 0000000000000000 CPU: 1 PID: 122167 Comm: fsstress Tainted: G W 6.0.0-rc5-djwa #rc5 3004c9f1de887ebae86015f2677638ce51ee7 do_writepages+0x90/0x1c4 __writeback_single_inode+0x4c/0x4ac Hardware name: QEMU KVM Virtual Machine, BIOS 1.5.1 06/16/2021 writeback_sb_inodes+0x214/0x4ac wb_writeback+0xf4/0x3b0 pstate: 60001000 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) wb_workfn+0xfc/0x580 process_one_work+0x1e8/0x480 pc : 000003fd6d7df200 worker_thread+0x78/0x430 This crash is a result of iomap_writepage_map encountering some sort of error during writeback and wanting to set that error code in the file mapping so that fsync will report it. Unfortunately, the code dereferences folio->mapping after unlocking the folio, which means that another thread could have removed the page from the page cache (writeback doesn't hold the invalidation lock) and give it to somebody else. At best we crash the system like above; at worst, we corrupt memory or set an error on some other unsuspecting file while failing to record the problems with *this* file. Regardless, fix the problem by reporting the error to the inode mapping. NOTE: Commit 598ecfbaa742 lifted the XFS writeback code to iomap, so this fix should be backported to XFS in the 4.6-5.4 kernels in addition to iomap in the 5.5-5.19 kernels. Fixes: e735c0079465 ("iomap: Convert iomap_add_to_ioend() to take a folio") # 5.17 onward Fixes: 598ecfbaa742 ("iomap: lift the xfs writeback code to iomap") # 5.5-5.16, needs backporting Fixes: 150d5be09ce4 ("xfs: remove xfs_cancel_ioend") # 4.6-5.4, needs backporting Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* Merge tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2022-08-111-15/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more iomap updates from Darrick Wong: "In the past 10 days or so I've not heard any ZOMG STOP style complaints about removing ->writepage support from gfs2 or zonefs, so here's the pull request removing them (and the underlying fs iomap support) from the kernel: - Remove iomap_writepage and all callers, since the mm apparently never called the zonefs or gfs2 writepage functions" * tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: remove iomap_writepage zonefs: remove ->writepage gfs2: remove ->writepage gfs2: stop using generic_writepages in gfs2_ail1_start_one
| * iomap: remove iomap_writepageChristoph Hellwig2022-07-221-15/+0
| | | | | | | | | | | | | | | | | | | | Unused now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* | new iov_iter flavour - ITER_UBUFAl Viro2022-08-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(), checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC ones. We are going to expose the things like ->write_iter() et.al. to those in subsequent commits. New predicate (user_backed_iter()) that is true for ITER_IOVEC and ITER_UBUF; places like direct-IO handling should use that for checking that pages we modify after getting them from iov_iter_get_pages() would need to be dirtied. DO NOT assume that replacing iter_is_iovec() with user_backed_iter() will solve all problems - there's code that uses iter_is_iovec() to decide how to poke around in iov_iter guts and for that the predicate replacement obviously won't suffice. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'iomap-5.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2022-08-031-7/+8
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull iomap updates from Darrick Wong: "The most notable change in this first batch is that we no longer schedule pages beyond i_size for writeback, preferring instead to let truncate deal with those pages. Next week, there may be a second pull request to remove iomap_writepage from the other two filesystems (gfs2/zonefs) that use iomap for buffered IO. This follows in the same vein as the recent removal of writepage from XFS, since it hasn't been triggered in a few years; it does nothing during direct reclaim; and as far as the people who examined the patchset can tell, it's moving the codebase in the right direction. However, as it was a late addition to for-next, I'm holding off on that section for another week of testing to see if anyone can come up with a solid reason for holding off in the meantime. Summary: - Skip writeback for pages that are completely beyond EOF - Minor code cleanups" * tag 'iomap-5.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: dax: set did_zero to true when zeroing successfully iomap: set did_zero to true when zeroing successfully iomap: skip pages past eof in iomap_do_writepage()
| * iomap: set did_zero to true when zeroing successfullyKaixu Xia2022-06-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | It is unnecessary to check and set did_zero value in while() loop in iomap_zero_iter(), we can set did_zero to true only when zeroing successfully at last. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * iomap: skip pages past eof in iomap_do_writepage()Chris Mason2022-06-301-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iomap_do_writepage() sends pages past i_size through folio_redirty_for_writepage(), which normally isn't a problem because truncate and friends clean them very quickly. When the system has cgroups configured, we can end up in situations where one cgroup has almost no dirty pages at all, and other cgroups consume the entire background dirty limit. This is especially common in our XFS workloads in production because they have cgroups using O_DIRECT for almost all of the IO mixed in with cgroups that do more traditional buffered IO work. We've hit storms where the redirty path hits millions of times in a few seconds, on all a single file that's only ~40 pages long. This leads to long tail latencies for file writes because the pdflush workers are hogging the CPU from some kworkers bound to the same CPU. Reproducing this on 5.18 was tricky because 869ae85dae ("xfs: flush new eof page on truncate...") ends up writing/waiting most of these dirty pages before truncate gets a chance to wait on them. The actual repro looks like this: /* * run me in a cgroup all alone. Start a second cgroup with dd * streaming IO into the block device. */ int main(int ac, char **av) { int fd; int ret; char buf[BUFFER_SIZE]; char *filename = av[1]; memset(buf, 0, BUFFER_SIZE); if (ac != 2) { fprintf(stderr, "usage: looper filename\n"); exit(1); } fd = open(filename, O_WRONLY | O_CREAT, 0600); if (fd < 0) { err(errno, "failed to open"); } fprintf(stderr, "looping on %s\n", filename); while(1) { /* * skip past page 0 so truncate doesn't write and wait * on our extent before changing i_size */ ret = lseek(fd, 8192, SEEK_SET); if (ret < 0) err(errno, "lseek"); ret = write(fd, buf, BUFFER_SIZE); if (ret != BUFFER_SIZE) err(errno, "write failed"); /* start IO so truncate has to wait after i_size is 0 */ ret = sync_file_range(fd, 16384, 4095, SYNC_FILE_RANGE_WRITE); if (ret < 0) err(errno, "sync_file_range"); ret = ftruncate(fd, 0); if (ret < 0) err(errno, "truncate"); usleep(1000); } } And this bpftrace script will show when you've hit a redirty storm: kretprobe:xfs_vm_writepages { delete(@dirty[pid]); } kprobe:xfs_vm_writepages { @dirty[pid] = 1; } kprobe:folio_redirty_for_writepage /@dirty[pid] > 0/ { $inode = ((struct folio *)arg1)->mapping->host->i_ino; @inodes[$inode] = count(); @redirty++; if (@redirty > 90000) { printf("inode %d redirty was %d", $inode, @redirty); exit(); } } This patch has the same number of failures on xfstests as unpatched 5.18: Failures: generic/648 xfs/019 xfs/050 xfs/168 xfs/299 xfs/348 xfs/506 xfs/543 I also ran it through a long stress of multiple fsx processes hammering. (Johannes Weiner did significant tracing and debugging on this as well) Signed-off-by: Chris Mason <clm@fb.com> Co-authored-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Domas Mituzas <domas@fb.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* | Merge tag 'pull-work.iov_iter-base' of ↵Linus Torvalds2022-08-031-9/+10
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs iov_iter updates from Al Viro: "Part 1 - isolated cleanups and optimizations. One of the goals is to reduce the overhead of using ->read_iter() and ->write_iter() instead of ->read()/->write(). new_sync_{read,write}() has a surprising amount of overhead, in particular inside iocb_flags(). That's the explanation for the beginning of the series is in this pile; it's not directly iov_iter-related, but it's a part of the same work..." * tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: first_iovec_segment(): just return address iov_iter: massage calling conventions for first_{iovec,bvec}_segment() iov_iter: first_{iovec,bvec}_segment() - simplify a bit iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT iov_iter_bvec_advance(): don't bother with bvec_iter copy_page_{to,from}_iter(): switch iovec variants to generic keep iocb_flags() result cached in struct file iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC struct file: use anonymous union member for rcuhead and llist btrfs: use IOMAP_DIO_NOSYNC teach iomap_dio_rw() to suppress dsync No need of likely/unlikely on calls of check_copy_size()
| * | iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNCAl Viro2022-06-101-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | New helper to be used instead of direct checks for IOCB_DSYNC: iocb_is_dsync(iocb). Checks converted, which allows to avoid the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines) from iocb_flags() - it's checked in iocb_is_dsync() instead Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | teach iomap_dio_rw() to suppress dsyncAl Viro2022-06-101-9/+11
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | New flag, equivalent to removal of IOCB_DSYNC from iocb flags. This mimics what btrfs is doing (and that's what btrfs will switch to). However, I'm not at all sure that we want to suppress REQ_FUA for those - all btrfs hack really cares about is suppression of generic_write_sync(). For now let's keep the existing behaviour, but I really want to hear more detailed arguments pro or contra. [folded brain fix from willy] Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds2022-08-031-28/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull folio updates from Matthew Wilcox: - Fix an accounting bug that made NR_FILE_DIRTY grow without limit when running xfstests - Convert more of mpage to use folios - Remove add_to_page_cache() and add_to_page_cache_locked() - Convert find_get_pages_range() to filemap_get_folios() - Improvements to the read_cache_page() family of functions - Remove a few unnecessary checks of PageError - Some straightforward filesystem conversions to use folios - Split PageMovable users out from address_space_operations into their own movable_operations - Convert aops->migratepage to aops->migrate_folio - Remove nobh support (Christoph Hellwig) * tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits) fs: remove the NULL get_block case in mpage_writepages fs: don't call ->writepage from __mpage_writepage fs: remove the nobh helpers jfs: stop using the nobh helper ext2: remove nobh support ntfs3: refactor ntfs_writepages mm/folio-compat: Remove migration compatibility functions fs: Remove aops->migratepage() secretmem: Convert to migrate_folio hugetlb: Convert to migrate_folio aio: Convert to migrate_folio f2fs: Convert to filemap_migrate_folio() ubifs: Convert to filemap_migrate_folio() btrfs: Convert btrfs_migratepage to migrate_folio mm/migrate: Add filemap_migrate_folio() mm/migrate: Convert migrate_page() to migrate_folio() nfs: Convert to migrate_folio btrfs: Convert btree_migratepage to migrate_folio mm/migrate: Convert expected_page_refs() to folio_expected_refs() mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio() ...
| * | mm/migrate: Add filemap_migrate_folio()Matthew Wilcox (Oracle)2022-08-021-25/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is nothing iomap-specific about iomap_migratepage(), and it fits a pattern used by several other filesystems, so move it to mm/migrate.c, convert it to be filemap_migrate_folio() and convert the iomap filesystems to use it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
| * | iomap: Remove test for folio errorMatthew Wilcox (Oracle)2022-06-291-3/+0
| |/ | | | | | | | | | | | | | | Just because there has been a read error doesn't mean we should avoid marking this part of the folio as uptodate. Indeed, it may overwrite the error part of the folio and let us mark the entire folio uptodate. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* | Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-blockLinus Torvalds2022-08-021-6/+6
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - Improve the type checking of request flags (Bart) - Ensure queue mapping for a single queues always picks the right queue (Bart) - Sanitize the io priority handling (Jan) - rq-qos race fix (Jinke) - Reserved tags handling improvements (John) - Separate memory alignment from file/disk offset aligment for O_DIRECT (Keith) - Add new ublk driver, userspace block driver using io_uring for communication with the userspace backend (Ming) - Use try_cmpxchg() to cleanup the code in various spots (Uros) - Finally remove bdevname() (Christoph) - Clean up the zoned device handling (Christoph) - Clean up independent access range support (Christoph) - Clean up and improve block sysfs handling (Christoph) - Clean up and improve teardown of block devices. This turns the usual two step process into something that is simpler to implement and handle in block drivers (Christoph) - Clean up chunk size handling (Christoph) - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu, Ming, Sebastian, Yang, Ying) * tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits) ublk_drv: fix double shift bug ublk_drv: make sure that correct flags(features) returned to userspace ublk_drv: fix error handling of ublk_add_dev ublk_drv: fix lockdep warning block: remove __blk_get_queue block: call blk_mq_exit_queue from disk_release for never added disks blk-mq: fix error handling in __blk_mq_alloc_disk ublk: defer disk allocation ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask ublk: fold __ublk_create_dev into ublk_ctrl_add_dev ublk: cleanup ublk_ctrl_uring_cmd ublk: simplify ublk_ch_open and ublk_ch_release ublk: remove the empty open and release block device operations ublk: remove UBLK_IO_F_PREFLUSH ublk: add a MAINTAINERS entry block: don't allow the same type rq_qos add more than once mmc: fix disk/queue leak in case of adding disk failure ublk_drv: fix an IS_ERR() vs NULL check ublk: remove UBLK_IO_F_INTEGRITY ublk_drv: remove unneeded semicolon ...
| * | fs/iomap: Use the new blk_opf_t typeBart Van Assche2022-07-141-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve static type checking by using the enum req_op type for variables that represent a request operation and the new blk_opf_t type for the combination of a request operation and request flags. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-56-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | iomap: add support for dma aligned direct-ioKeith Busch2022-06-271-2/+2
| |/ | | | | | | | | | | | | | | | | | | Use the address alignment requirements from the block_device for direct io instead of requiring addresses be aligned to the block size. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220610195830.3574005-12-kbusch@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | iomap: Return -EAGAIN from iomap_write_iter()Stefan Roesch2022-07-241-0/+4
| | | | | | | | | | | | | | | | | | | | | | If iomap_write_iter() encounters -EAGAIN, return -EAGAIN to the caller. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-7-shr@fb.com Reviewed-by: Christoph Hellwig <hch@lst.de> [axboe: make the suggested ternary edit] Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | iomap: Add async buffered write supportStefan Roesch2022-07-241-5/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds async buffered write support to iomap. This replaces the call to balance_dirty_pages_ratelimited() with the call to balance_dirty_pages_ratelimited_flags. This allows to specify if the write request is async or not. In addition this also moves the above function call to the beginning of the function. If the function call is at the end of the function and the decision is made to throttle writes, then there is no request that io-uring can wait on. By moving it to the beginning of the function, the write request is not issued, but returns -EAGAIN instead. io-uring will punt the request and process it in the io-worker. By moving the function call to the beginning of the function, the write throttling will happen one page later. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-6-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | iomap: Add flags parameter to iomap_page_create()Stefan Roesch2022-07-241-10/+20
|/ | | | | | | | | | | | | | | Add the kiocb flags parameter to the function iomap_page_create(). Depending on the value of the flags parameter it enables different gfp flags. No intended functional changes in this patch. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-5-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds2022-05-242-22/+18
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...
| * iomap: Convert to release_folioMatthew Wilcox (Oracle)2022-05-092-13/+11
| | | | | | | | | | | | | | | | Change all the filesystems which used iomap_releasepage to use the new function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * fs: Convert mpage_readpage to mpage_read_folioMatthew Wilcox (Oracle)2022-05-091-1/+1
| | | | | | | | | | | | | | | | | | mpage_readpage still works in terms of pages, and has not been audited for correctness with large folios, so include an assertion that the filesystem is not passing it large folios. Convert all the filesystems to call mpage_read_folio() instead of mpage_readpage(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
| * fs: Convert block_read_full_page() to block_read_full_folio()Matthew Wilcox (Oracle)2022-05-091-1/+1
| | | | | | | | | | | | | | | | | | This function is NOT converted to handle large folios, so include an assert that the filesystem isn't passing one in. Otherwise, use the folio functions instead of the page functions, where they exist. Convert all filesystems which use block_read_full_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
| * fs: Convert iomap_readpage to iomap_read_folioMatthew Wilcox (Oracle)2022-05-091-7/+5
| | | | | | | | | | | | A straightforward conversion as iomap_readpage already worked in folios. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* | Merge tag 'iomap-5.19-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2022-05-241-3/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull iomap updates from Darrick Wong: "There's a couple of corrections sent in by Andreas for some accounting errors. The biggest change this time around is that writeback errors longer clear pageuptodate nor does XFS invalidate the page cache anymore. This brings XFS (and gfs2/zonefs) behavior in line with every other Linux filesystem driver, and fixes some UAF bugs that only cropped up after willy turned on multipage folios for XFS in 5.18-rc1. Regrettably, it took all the way to the end of the 5.18 cycle to find the source of these bugs and reach a consensus that XFS' writeback failure behavior from 20 years ago is no longer necessary. Summary: - Fix a couple of accounting errors in the buffered io code. - Discontinue the practice of marking folios !uptodate and invalidating them when writeback fails. This fixes some UAF bugs when multipage folios are enabled, and brings the behavior of XFS/gfs/zonefs into alignment with the behavior of all the other Linux filesystems" * tag 'iomap-5.19-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: don't invalidate folios after writeback errors iomap: iomap_write_end cleanup iomap: iomap_write_failed fix
| * | iomap: don't invalidate folios after writeback errorsDarrick J. Wong2022-05-161-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS has the unique behavior (as compared to the other Linux filesystems) that on writeback errors it will completely invalidate the affected folio and force the page cache to reread the contents from disk. All other filesystems leave the page mapped and up to date. This is a rude awakening for user programs, since (in the case where write fails but reread doesn't) file contents will appear to revert to old disk contents with no notification other than an EIO on fsync. This might have been annoying back in the days when iomap dealt with one page at a time, but with multipage folios, we can now throw away *megabytes* worth of data for a single write error. On *most* Linux filesystems, a program can respond to an EIO on write by redirtying the entire file and scheduling it for writeback. This isn't foolproof, since the page that failed writeback is no longer dirty and could be evicted, but programs that want to recover properly *also* have to detect XFS and regenerate every write they've made to the file. When running xfs/314 on arm64, I noticed a UAF when xfs_discard_folio invalidates multipage folios that could be undergoing writeback. If, say, we have a 256K folio caching a mix of written and unwritten extents, it's possible that we could start writeback of the first (say) 64K of the folio and then hit a writeback error on the next 64K. We then free the iop attached to the folio, which is really bad because writeback completion on the first 64k will trip over the "blocks per folio > 1 && !iop" assertion. This can't be fixed by only invalidating the folio if writeback fails at the start of the folio, since the folio is marked !uptodate, which trips other assertions elsewhere. Get rid of the whole behavior entirely. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * | iomap: iomap_write_end cleanupAndreas Gruenbacher2022-05-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In iomap_write_end(), only call iomap_write_failed() on the byte range that has failed. This should improve code readability, but doesn't fix an actual bug because iomap_write_failed() is called after updating the file size here and it only affects the memory beyond the end of the file. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | iomap: iomap_write_failed fixAndreas Gruenbacher2022-05-081-1/+2
| |/ | | | | | | | | | | | | | | | | | | The @lend parameter of truncate_pagecache_range() should be the offset of the last byte of the hole, not the first byte beyond it. Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
* | Merge tag 'for-5.19-tag' of ↵Linus Torvalds2022-05-241-7/+18
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features: - subpage: - support for PAGE_SIZE > 4K (previously only 64K) - make it work with raid56 - repair super block num_devices automatically if it does not match the number of device items - defrag can convert inline extents to regular extents, up to now inline files were skipped but the setting of mount option max_inline could affect the decision logic - zoned: - minimal accepted zone size is explicitly set to 4MiB - make zone reclaim less aggressive and don't reclaim if there are enough free zones - add per-profile sysfs tunable of the reclaim threshold - allow automatic block group reclaim for non-zoned filesystems, with sysfs tunables - tree-checker: new check, compare extent buffer owner against owner rootid Performance: - avoid blocking on space reservation when doing nowait direct io writes (+7% throughput for reads and writes) - NOCOW write throughput improvement due to refined locking (+3%) - send: reduce pressure to page cache by dropping extent pages right after they're processed Core: - convert all radix trees to xarray - add iterators for b-tree node items - support printk message index - user bulk page allocation for extent buffers - switch to bio_alloc API, use on-stack bios where convenient, other bio cleanups - use rw lock for block groups to favor concurrent reads - simplify workques, don't allocate high priority threads for all normal queues as we need only one - refactor scrub, process chunks based on their constraints and similarity - allocate direct io structures on stack and pass around only pointers, avoids allocation and reduces potential error handling Fixes: - fix count of reserved transaction items for various inode operations - fix deadlock between concurrent dio writes when low on free data space - fix a few cases when zones need to be finished VFS, iomap: - add helper to check if sb write has started (usable for assertions) - new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io" * tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits) btrfs: zoned: introduce a minimal zone size 4M and reject mount btrfs: allow defrag to convert inline extents to regular extents btrfs: add "0x" prefix for unsupported optional features btrfs: do not account twice for inode ref when reserving metadata units btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer btrfs: send: avoid trashing the page cache btrfs: send: keep the current inode open while processing it btrfs: allocate the btrfs_dio_private as part of the iomap dio bio btrfs: move struct btrfs_dio_private to inode.c btrfs: remove the disk_bytenr in struct btrfs_dio_private btrfs: allocate dio_data on stack iomap: add per-iomap_iter private data iomap: allow the file system to provide a bio_set for direct I/O btrfs: add a btrfs_dio_rw wrapper btrfs: zoned: zone finish unused block group btrfs: zoned: properly finish block group on metadata write btrfs: zoned: finish block group when there are no more allocatable bytes left btrfs: zoned: consolidate zone finish functions btrfs: zoned: introduce btrfs_zoned_bg_is_full btrfs: improve error reporting in lookup_inline_extent_backref ...
| * | iomap: add per-iomap_iter private dataChristoph Hellwig2022-05-161-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow the file system to keep state for all iterations. For now only wire it up for direct I/O as there is an immediate need for it there. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | iomap: allow the file system to provide a bio_set for direct I/OChristoph Hellwig2022-05-161-4/+13
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow the file system to provide a specific bio_set for allocating direct I/O bios. This will allow file systems that use the ->submit_io hook to stash away additional information for file system use. To make use of this additional space for information in the completion path, the file system needs to override the ->bi_end_io callback and then call back into iomap, so export iomap_dio_bio_end_io for that. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | block: ignore RWF_HIPRI hint for sync dioMing Lei2022-05-021-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far bio is marked as REQ_POLLED if RWF_HIPRI/IOCB_HIPRI is passed from userspace sync io interface, then block layer tries to poll until the bio is completed. But the current implementation calls blk_io_schedule() if bio_poll() returns 0, and this way causes io hang or timeout easily. But looks no one reports this kind of issue, which should have been triggered in normal io poll sanity test or blktests block/007 as observed by Changhui, that means it is very likely that no one uses it or no one cares it. Also after io_uring is invented, io poll for sync dio becomes legacy interface. So ignore RWF_HIPRI hint for sync dio. CC: linux-mm@kvack.org Cc: linux-xfs@vger.kernel.org Reported-by: Changhui Zhong <czhong@redhat.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Changhui Zhong <czhong@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220420143110.2679002-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: add a bdev_fua helperChristoph Hellwig2022-04-171-2/+1
|/ | | | | | | | | | | Add a helper to check the FUA flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* iomap: Simplify is_partially_uptodate a littleMatthew Wilcox (Oracle)2022-04-011-5/+4
| | | | | | | | | | Remove the unnecessary variable 'len' and fix a comment to refer to the folio instead of the page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge tag 'write-page-prefaulting' of ↵Linus Torvalds2022-03-261-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull iomap fixlet from Andreas Gruenbacher: "Fix buffered write page prefaulting. I forgot to send it the previous merge window. I've only improved the patch description since" * tag 'write-page-prefaulting' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: fs/iomap: Fix buffered write page prefaulting
| * fs/iomap: Fix buffered write page prefaultingAndreas Gruenbacher2022-03-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When part of the user buffer passed to generic_perform_write() or iomap_file_buffered_write() cannot be faulted in for reading, the entire write currently fails. The correct behavior would be to write all the data that can be written, up to the point of failure. Commit a6294593e8a1 ("iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable") gave us the information needed, so fix the page prefaulting in generic_perform_write() and iomap_write_iter() to only bail out when no pages could be faulted in. We already factor in that pages that are faulted in may no longer be resident by the time they are accessed. Paging out pages has the same effect as not faulting in those pages in the first place, so the code can already deal with that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* | Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds2022-03-262-3/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
| * | block: remove the per-bio/request write hintChristoph Hellwig2022-03-072-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | Merge branch 'for-5.18/block' into for-5.18/write-streamsJens Axboe2022-03-072-20/+14
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * for-5.18/block: (96 commits) block: remove bio_devname ext4: stop using bio_devname raid5-ppl: stop using bio_devname raid1: stop using bio_devname md-multipath: stop using bio_devname dm-integrity: stop using bio_devname dm-crypt: stop using bio_devname pktcdvd: remove a pointless debug check in pkt_submit_bio block: remove handle_bad_sector block: fix and cleanup bio_check_ro bfq: fix use-after-free in bfq_dispatch_request blk-crypto: show crypto capabilities in sysfs block: don't delete queue kobject before its children block: simplify calling convention of elv_unregister_queue() block: remove redundant semicolon block: default BLOCK_LEGACY_AUTOLOAD to y block: update io_ticks when io hang block, bfq: don't move oom_bfqq block, bfq: avoid moving bfqq to it's parent bfqg block, bfq: cleanup bfq_bfqq_to_bfqg() ...
* | \ \ Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds2022-03-222-29/+19
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
| * | | | iomap: Remove iomap_invalidatepage()Matthew Wilcox (Oracle)2022-03-152-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use iomap_invalidate_folio() in all the iomap-based filesystems and rename the iomap_invalidatepage tracepoint. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
| * | | | fs: Convert is_partially_uptodate to foliosMatthew Wilcox (Oracle)2022-03-141-20/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the uptodate property is maintained on a per-folio basis, the is_partially_uptodate method should also take a folio. Fix the types at the same time so it's clear that it returns true/false and takes the count in bytes, not blocks. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
| * | | | iomap: Fix iomap_invalidatepage tracepointMatthew Wilcox (Oracle)2022-03-141-1/+2
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This tracepoint is defined to take an offset in the file, not an offset in the folio. Fixes: 1ac994525b9d ("iomap: Remove pgoff from tracepoints") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
* | | | Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds2022-03-221-0/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull folio updates from Matthew Wilcox: - Rewrite how munlock works to massively reduce the contention on i_mmap_rwsem (Hugh Dickins): https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/ - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig): https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/ - Convert GUP to use folios and make pincount available for order-1 pages. (Matthew Wilcox) - Convert a few more truncation functions to use folios (Matthew Wilcox) - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox) - Convert rmap_walk to use folios (Matthew Wilcox) - Convert most of shrink_page_list() to use a folio (Matthew Wilcox) - Add support for creating large folios in readahead (Matthew Wilcox) * tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits) mm/damon: minor cleanup for damon_pa_young selftests/vm/transhuge-stress: Support file-backed PMD folios mm/filemap: Support VM_HUGEPAGE for file mappings mm/readahead: Switch to page_cache_ra_order mm/readahead: Align file mappings for non-DAX mm/readahead: Add large folio readahead mm: Support arbitrary THP sizes mm: Make large folios depend on THP mm: Fix READ_ONLY_THP warning mm/filemap: Allow large folios to be added to the page cache mm: Turn can_split_huge_page() into can_split_folio() mm/vmscan: Convert pageout() to take a folio mm/vmscan: Turn page_check_references() into folio_check_references() mm/vmscan: Account large folios correctly mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios mm/vmscan: Free non-shmem folios without splitting them mm/rmap: Constify the rmap_walk_control argument mm/rmap: Convert rmap_walk() to take a folio mm: Turn page_anon_vma() into folio_anon_vma() mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read() ...