summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2021-07-189-37/+174
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Darrick Wong: "A few fixes for issues in the new online shrink code, additional corrections for my recent bug-hunt w.r.t. extent size hints on realtime, and improved input checking of the GROWFSRT ioctl. IOW, the usual 'I somehow got bored during the merge window and resumed auditing the farther reaches of xfs': - Fix shrink eligibility checking when sparse inode clusters enabled - Reset '..' directory entries when unlinking directories to prevent verifier errors if fs is shrinked later - Don't report unusable extent size hints to FSGETXATTR - Don't warn when extent size hints are unusable because the sysadmin configured them that way - Fix insufficient parameter validation in GROWFSRT ioctl - Fix integer overflow when adding rt volumes to filesystem" * tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: detect misaligned rtinherit directory extent size hints xfs: fix an integer overflow error in xfs_growfs_rt xfs: improve FSGROWFSRT precondition checking xfs: don't expose misaligned extszinherit hints to userspace xfs: correct the narrative around misaligned rtinherit/extszinherit dirs xfs: reset child dir '..' entry when unlinking child xfs: check for sparse inode clusters that cross new EOAG when shrinking
| * xfs: detect misaligned rtinherit directory extent size hintsDarrick J. Wong2021-07-151-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | If we encounter a directory that has been configured to pass on an extent size hint to a new realtime file and the hint isn't an integer multiple of the rt extent size, we should flag the hint for administrative review because that is a misconfiguration (that other parts of the kernel will fix automatically). Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * xfs: fix an integer overflow error in xfs_growfs_rtDarrick J. Wong2021-07-151-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | During a realtime grow operation, we run a single transaction for each rt bitmap block added to the filesystem. This means that each step has to be careful to increase sb_rblocks appropriately. Fix the integer overflow error in this calculation that can happen when the extent size is very large. Found by running growfs to add a rt volume to a filesystem formatted with a 1g rt extent size. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * xfs: improve FSGROWFSRT precondition checkingDarrick J. Wong2021-07-151-7/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve the checking at the start of a realtime grow operation so that we avoid accidentally set a new extent size that is too large and avoid adding an rt volume to a filesystem with rmap or reflink because we don't support rt rmap or reflink yet. While we're at it, separate the checks so that we're only testing one aspect at a time. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * xfs: don't expose misaligned extszinherit hints to userspaceDarrick J. Wong2021-07-151-1/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 603f000b15f2 changed xfs_ioctl_setattr_check_extsize to reject an attempt to set an EXTSZINHERIT extent size hint on a directory with RTINHERIT set if the hint isn't a multiple of the realtime extent size. However, I have recently discovered that it is possible to change the realtime extent size when adding a rt device to a filesystem, which means that the existence of directories with misaligned inherited hints is not an accident. As a result, it's possible that someone could have set a valid hint and added an rt volume with a different rt extent size, which invalidates the ondisk hints. After such a sequence, FSGETXATTR will report a misaligned hint, which FSSETXATTR will trip over, causing confusion if the user was doing the usual GET/SET sequence to change some other attribute. Change xfs_fill_fsxattr to omit the hint if it isn't aligned properly. Fixes: 603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * xfs: correct the narrative around misaligned rtinherit/extszinherit dirsDarrick J. Wong2021-07-153-22/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While auditing the realtime growfs code, I realized that the GROWFSRT ioctl (and by extension xfs_growfs) has always allowed sysadmins to change the realtime extent size when adding a realtime section to the filesystem. Since we also have always allowed sysadmins to set RTINHERIT and EXTSZINHERIT on directories even if there is no realtime device, this invalidates the premise laid out in the comments added in commit 603f000b15f2. In other words, this is not a case of inadequate metadata validation. This is a case of nearly forgotten (and apparently untested) but supported functionality. Update the comments to reflect what we've learned, and remove the log message about correcting the misalignment. Fixes: 603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
| * xfs: reset child dir '..' entry when unlinking childDarrick J. Wong2021-07-151-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running xfs/168, I noticed a second source of post-shrink corruption errors causing shutdowns. Let's say that directory B has a low inode number and is a child of directory A, which has a high number. If B is empty but open, and unlinked from A, B's dotdot link continues to point to A. If A is then unlinked and the filesystem shrunk so that A is no longer a valid inode, a subsequent AIL push of B will trip the inode verifiers because the dotdot entry points outside of the filesystem. To avoid this problem, reset B's dotdot entry to the root directory when unlinking directories, since the root directory cannot be removed. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
| * xfs: check for sparse inode clusters that cross new EOAG when shrinkingDarrick J. Wong2021-07-153-0/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While running xfs/168, I noticed occasional write verifier shutdowns involving inodes at the very end of the filesystem. Existing inode btree validation code checks that all inode clusters are fully contained within the filesystem. However, due to inadequate checking in the fs shrink code, it's possible that there could be a sparse inode cluster at the end of the filesystem where the upper inodes of the cluster are marked as holes and the corresponding blocks are free. In this case, the last blocks in the AG are listed in the bnobt. This enables the shrink to proceed but results in a filesystem that trips the inode verifiers. Fix this by disallowing the shrink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
* | Merge tag 'iomap-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2021-07-182-20/+13
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull iomap fixes from Darrick Wong: "A handful of bugfixes for the iomap code. There's nothing especially exciting here, just fixes for UBSAN (not KASAN as I erroneously wrote in the tag message) warnings about undefined behavior in the SEEK_DATA/SEEK_HOLE code, and some reshuffling of per-page block state info to fix some problems with gfs2. - Fix KASAN warnings due to integer overflow in SEEK_DATA/SEEK_HOLE - Fix assertion errors when using inlinedata files on gfs2" * tag 'iomap-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: Don't create iomap_page objects in iomap_page_mkwrite_actor iomap: Don't create iomap_page objects for inline files iomap: Permit pages without an iop to enter writeback iomap: remove the length variable in iomap_seek_hole iomap: remove the length variable in iomap_seek_data
| * | iomap: Don't create iomap_page objects in iomap_page_mkwrite_actorAndreas Gruenbacher2021-07-151-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we create those objects in iomap_writepage_map when needed, there's no need to pre-create them in iomap_page_mkwrite_actor anymore. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | iomap: Don't create iomap_page objects for inline filesAndreas Gruenbacher2021-07-151-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In iomap_readpage_actor, don't create iop objects for inline inodes. Otherwise, iomap_read_inline_data will set PageUptodate without setting iop->uptodate, and iomap_page_release will eventually complain. To prevent this kind of bug from occurring in the future, make sure the page doesn't have private data attached in iomap_read_inline_data. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | iomap: Permit pages without an iop to enter writebackAndreas Gruenbacher2021-07-151-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create an iop in the writeback path if one doesn't exist. This allows us to avoid creating the iop in some cases. We'll initially do that for pages with inline data, but it can be extended to pages which are entirely within an extent. It also allows for an iop to be removed from pages in the future (eg page split). Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | iomap: remove the length variable in iomap_seek_holeChristoph Hellwig2021-07-151-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The length variable is rather pointless given that it can be trivially deduced from offset and size. Also the initial calculation can lead to KASAN warnings. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
| * | iomap: remove the length variable in iomap_seek_dataChristoph Hellwig2021-07-151-10/+6
| |/ | | | | | | | | | | | | | | | | | | | | | | The length variable is rather pointless given that it can be trivially deduced from offset and size. Also the initial calculation can lead to KASAN warnings. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* | Merge tag '5.14-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2021-07-178-20/+124
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull cifs fixes from Steve French: "Eight cifs/smb3 fixes, including three for stable. Three are DFS related fixes, and two to fix problems pointed out by static checkers" * tag '5.14-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: do not share tcp sessions of dfs connections SMB3.1.1: fix mount failure to some servers when compression enabled cifs: added WARN_ON for all the count decrements cifs: fix missing null session check in mount cifs: handle reconnect of tcon when there is no cached dfs referral cifs: fix the out of range assignment to bit fields in parse_server_interfaces cifs: Do not use the original cruid when following DFS links for multiuser mounts cifs: use the expiry output of dns_query to schedule next resolution
| * | cifs: do not share tcp sessions of dfs connectionsPaulo Alcantara2021-07-162-3/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure that we do not share tcp sessions of dfs mounts when mounting regular shares that connect to same server. DFS connections rely on a single instance of tcp in order to do failover properly in cifs_reconnect(). Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | SMB3.1.1: fix mount failure to some servers when compression enabledSteve French2021-07-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | When sending the compression context to some servers, they rejected the SMB3.1.1 negotiate protocol because they expect the compression context to have a data length of a multiple of 8. Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | cifs: added WARN_ON for all the count decrementsShyam Prasad N2021-07-152-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a few ref counters srv_count, ses_count and tc_count which we use for ref counting. Added a WARN_ON during the decrement of each of these counters to make sure that they don't go below their minimum values. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | cifs: fix missing null session check in mountSteve French2021-07-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although it is unlikely to be have ended up with a null session pointer calling cifs_try_adding_channels in cifs_mount. Coverity correctly notes that we are already checking for it earlier (when we return from do_dfs_failover), so at a minimum to clarify the code we should make sure we also check for it when we exit the loop so we don't end up calling cifs_try_adding_channels or mount_setup_tlink with a null ses pointer. Addresses-Coverity: 1505608 ("Derefernce after null check") Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | cifs: handle reconnect of tcon when there is no cached dfs referralPaulo Alcantara2021-07-151-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | When there is no cached DFS referral of tcon->dfs_path, then reconnect to same share. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | cifs: fix the out of range assignment to bit fields in parse_server_interfacesHyunchul Lee2021-07-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Because the out of range assignment to bit fields are compiler-dependant, the fields could have wrong value. Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | cifs: Do not use the original cruid when following DFS links for multiuser ↵Ronnie Sahlberg2021-07-141-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mounts Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=213565 cruid should only be used for the initial mount and after this we should use the current users credentials. Ignore the original cruid mount argument when creating a new context for a multiuser mount following a DFS link. Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Cc: stable@vger.kernel.org # 5.11+ Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
| * | cifs: use the expiry output of dns_query to schedule next resolutionShyam Prasad N2021-07-146-10/+65
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | We recently fixed DNS resolution of the server hostname during reconnect. However, server IP address may change, even when the old one continues to server (although sub-optimally). We should schedule the next DNS resolution based on the TTL of the DNS record used for the last resolution. This way, we resolve the server hostname again when a DNS record expires. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: <stable@vger.kernel.org> # v5.11+ Signed-off-by: Steve French <stfrench@microsoft.com>
* | Merge tag 'io_uring-5.14-2021-07-16' of git://git.kernel.dk/linux-blockLinus Torvalds2021-07-161-3/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fixes from Jens Axboe: "Two small fixes: one fixing the process target of a check, and the other a minor issue with the drain error handling" * tag 'io_uring-5.14-2021-07-16' of git://git.kernel.dk/linux-block: io_uring: fix io_drain_req() io_uring: use right task for exiting checks
| * | io_uring: fix io_drain_req()Pavel Begunkov2021-07-111-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_drain_req() return whether the request has been consumed or not, not an error code. Fix a stupid mistake slipped from optimisation patches. Reported-by: syzbot+ba6fcd859210f4e9e109@syzkaller.appspotmail.com Fixes: 76cc33d79175a ("io_uring: refactor io_req_defer()") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4d3c53c4274ffff307c8ae062fc7fda63b978df2.1626039606.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: use right task for exiting checksPavel Begunkov2021-07-111-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | When we use delayed_work for fallback execution of requests, current will be not of the submitter task, and so checks in io_req_task_submit() may not behave as expected. Currently, it leaves inline completions not flushed, so making io_ring_exit_work() to hang. Use the submitter task for all those checks. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cb413c715bed0bc9c98b169059ea9c8a2c770715.1625881431.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'zonefs-5.14-rc2' of ↵Linus Torvalds2021-07-161-3/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs fix from Damien Le Moal: "A single patch to remove an unnecessary NULL bio check (from Xianting)" * tag 'zonefs-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: remove redundant null bio check
| * | zonefs: remove redundant null bio checkXianting Tian2021-07-161-3/+0
| |/ | | | | | | | | | | | | | | bio_alloc() with __GFP_DIRECT_RECLAIM, which is included in GFP_NOFS, never fails, see comments in bio_alloc_bioset(). Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
* | Merge tag 'configfs-5.13-1' of git://git.infradead.org/users/hch/configfsLinus Torvalds2021-07-151-7/+22
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | Pull configfs fix from Christoph Hellwig: - fix the read and write iterators (Bart Van Assche) * tag 'configfs-5.13-1' of git://git.infradead.org/users/hch/configfs: configfs: fix the read and write iterators
| * | configfs: fix the read and write iteratorsBart Van Assche2021-07-131-7/+22
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7fe1e79b59ba ("configfs: implement the .read_iter and .write_iter methods") changed the simple_read_from_buffer() calls into copy_to_iter() calls and the simple_write_to_buffer() calls into copy_from_iter() calls. The simple*buffer() methods update the file offset (*ppos) but the read and write iterators not yet. Make the read and write iterators update the file offset (iocb->ki_pos). This patch has been tested as follows: # modprobe target_core_user # dd if=/sys/kernel/config/target/dbroot bs=1 /var/target 12+0 records in 12+0 records out 12 bytes copied, 9.5539e-05 s, 126 kB/s # cd /sys/kernel/config/acpi/table # mkdir test # cd test # dmesg -c >/dev/null; printf 'SSDT\x8\0\0\0abcdefghijklmnopqrstuvwxyz' | dd of=aml bs=1; dmesg -c 34+0 records in 34+0 records out 34 bytes copied, 0.010627 s, 3.2 kB/s [ 261.056551] ACPI configfs: invalid table length Reported-by: Yanko Kaneti <yaneti@declera.com> Cc: Yanko Kaneti <yaneti@declera.com> Fixes: 7fe1e79b59ba ("configfs: implement the .read_iter and .write_iter methods") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | Merge tag 'Wimplicit-fallthrough-clang-5.14-rc2' of ↵Linus Torvalds2021-07-152-9/+9
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull fallthrough fixes from Gustavo Silva: "This fixes many fall-through warnings when building with Clang and -Wimplicit-fallthrough, and also enables -Wimplicit-fallthrough for Clang, globally. It's also important to notice that since we have adopted the use of the pseudo-keyword macro fallthrough, we also want to avoid having more /* fall through */ comments being introduced. Contrary to GCC, Clang doesn't recognize any comments as implicit fall-through markings when the -Wimplicit-fallthrough option is enabled. So, in order to avoid having more comments being introduced, we use the option -Wimplicit-fallthrough=5 for GCC, which similar to Clang, will cause a warning in case a code comment is intended to be used as a fall-through marking. The patch for Makefile also enforces this. We had almost 4,000 of these issues for Clang in the beginning, and there might be a couple more out there when building some architectures with certain configurations. However, with the recent fixes I think we are in good shape and it is now possible to enable the warning for Clang" * tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits) Makefile: Enable -Wimplicit-fallthrough for Clang powerpc/smp: Fix fall-through warning for Clang dmaengine: mpc512x: Fix fall-through warning for Clang usb: gadget: fsl_qe_udc: Fix fall-through warning for Clang powerpc/powernv: Fix fall-through warning for Clang MIPS: Fix unreachable code issue MIPS: Fix fall-through warnings for Clang ASoC: Mediatek: MT8183: Fix fall-through warning for Clang power: supply: Fix fall-through warnings for Clang dmaengine: ti: k3-udma: Fix fall-through warning for Clang s390: Fix fall-through warnings for Clang dmaengine: ipu: Fix fall-through warning for Clang iommu/arm-smmu-v3: Fix fall-through warning for Clang mmc: jz4740: Fix fall-through warning for Clang PCI: Fix fall-through warning for Clang scsi: libsas: Fix fall-through warning for Clang video: fbdev: Fix fall-through warning for Clang math-emu: Fix fall-through warning cpufreq: Fix fall-through warning for Clang drm/msm: Fix fall-through warning in msm_gem_new_impl() ...
| * | fcntl: Fix unreachable code in do_fcntl()Gustavo A. R. Silva2021-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following warning: fs/fcntl.c:373:3: warning: fallthrough annotation in unreachable code [-Wimplicit-fallthrough] fallthrough; ^ include/linux/compiler_attributes.h:210:41: note: expanded from macro 'fallthrough' # define fallthrough __attribute__((__fallthrough__)) by placing the fallthrough; statement inside ifdeffery. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
| * | xfs: Fix multiple fall-through warnings for ClangGustavo A. R. Silva2021-07-121-8/+8
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to enable -Wimplicit-fallthrough for Clang, fix the following warnings by replacing /* fallthrough */ comments, and its variants, with the new pseudo-keyword macro fallthrough: fs/xfs/libxfs/xfs_attr.c:487:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_attr.c:500:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_attr.c:532:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_attr.c:594:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_attr.c:607:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_attr.c:1410:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_attr.c:1445:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] fs/xfs/libxfs/xfs_attr.c:1473:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough] Notice that Clang doesn't recognize /* fallthrough */ comments as implicit fall-through markings, so in order to globally enable -Wimplicit-fallthrough for Clang, these comments need to be replaced with fallthrough; in the whole codebase. Link: https://github.com/KSPP/linux/issues/115 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
* | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2021-07-154-11/+45
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc fixes from Andrew Morton: "13 patches. Subsystems affected by this patch series: mm (kasan, pagealloc, rmap, hmm, and hugetlb), and hfs" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/hugetlb: fix refs calculation from unaligned @vaddr hfs: add lock nesting notation to hfs_find_init hfs: fix high memory mapping in hfs_bnode_read hfs: add missing clean-up in hfs_fill_super lib/test_hmm: remove set but unused page variable mm: fix the try_to_unmap prototype for !CONFIG_MMU mm/page_alloc: further fix __alloc_pages_bulk() return value mm/page_alloc: correct return value when failing at preparing mm/page_alloc: avoid page allocator recursion with pagesets.lock held Revert "mm/page_alloc: make should_fail_alloc_page() static" kasan: fix build by including kernel.h kasan: add memzero init for unaligned size at DEBUG mm: move helper to check slub_debug_enabled
| * | hfs: add lock nesting notation to hfs_find_initDesmond Cheong Zhi Xi2021-07-152-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot reports a possible recursive lock in [1]. This happens due to missing lock nesting information. From the logs, we see that a call to hfs_fill_super is made to mount the hfs filesystem. While searching for the root inode, the lock on the catalog btree is grabbed. Then, when the parent of the root isn't found, a call to __hfs_bnode_create is made to create the parent of the root. This eventually leads to a call to hfs_ext_read_extent which grabs a lock on the extents btree. Since the order of locking is catalog btree -> extents btree, this lock hierarchy does not lead to a deadlock. To tell lockdep that this locking is safe, we add nesting notation to distinguish between catalog btrees, extents btrees, and attributes btrees (for HFS+). This has already been done in hfsplus. Link: https://syzkaller.appspot.com/bug?id=f007ef1d7a31a469e3be7aeb0fde0769b18585db [1] Link: https://lkml.kernel.org/r/20210701030756.58760-4-desmondcheongzx@gmail.com Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Reported-by: syzbot+b718ec84a87b7e73ade4@syzkaller.appspotmail.com Tested-by: syzbot+b718ec84a87b7e73ade4@syzkaller.appspotmail.com Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | hfs: fix high memory mapping in hfs_bnode_readDesmond Cheong Zhi Xi2021-07-151-5/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pages that we read in hfs_bnode_read need to be kmapped into kernel address space. However, currently only the 0th page is kmapped. If the given offset + length exceeds this 0th page, then we have an invalid memory access. To fix this, we kmap relevant pages one by one and copy their relevant portions of data. An example of invalid memory access occurring without this fix can be seen in the following crash report: ================================================================== BUG: KASAN: use-after-free in memcpy include/linux/fortify-string.h:191 [inline] BUG: KASAN: use-after-free in hfs_bnode_read+0xc4/0xe0 fs/hfs/bnode.c:26 Read of size 2 at addr ffff888125fdcffe by task syz-executor5/4634 CPU: 0 PID: 4634 Comm: syz-executor5 Not tainted 5.13.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x195/0x1f8 lib/dump_stack.c:120 print_address_description.constprop.0+0x1d/0x110 mm/kasan/report.c:233 __kasan_report mm/kasan/report.c:419 [inline] kasan_report.cold+0x7b/0xd4 mm/kasan/report.c:436 check_region_inline mm/kasan/generic.c:180 [inline] kasan_check_range+0x154/0x1b0 mm/kasan/generic.c:186 memcpy+0x24/0x60 mm/kasan/shadow.c:65 memcpy include/linux/fortify-string.h:191 [inline] hfs_bnode_read+0xc4/0xe0 fs/hfs/bnode.c:26 hfs_bnode_read_u16 fs/hfs/bnode.c:34 [inline] hfs_bnode_find+0x880/0xcc0 fs/hfs/bnode.c:365 hfs_brec_find+0x2d8/0x540 fs/hfs/bfind.c:126 hfs_brec_read+0x27/0x120 fs/hfs/bfind.c:165 hfs_cat_find_brec+0x19a/0x3b0 fs/hfs/catalog.c:194 hfs_fill_super+0xc13/0x1460 fs/hfs/super.c:419 mount_bdev+0x331/0x3f0 fs/super.c:1368 hfs_mount+0x35/0x40 fs/hfs/super.c:457 legacy_get_tree+0x10c/0x220 fs/fs_context.c:592 vfs_get_tree+0x93/0x300 fs/super.c:1498 do_new_mount fs/namespace.c:2905 [inline] path_mount+0x13f5/0x20e0 fs/namespace.c:3235 do_mount fs/namespace.c:3248 [inline] __do_sys_mount fs/namespace.c:3456 [inline] __se_sys_mount fs/namespace.c:3433 [inline] __x64_sys_mount+0x2b8/0x340 fs/namespace.c:3433 do_syscall_64+0x37/0xc0 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x45e63a Code: 48 c7 c2 bc ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb d2 e8 88 04 00 00 0f 1f 84 00 00 00 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f9404d410d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000020000248 RCX: 000000000045e63a RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007f9404d41120 RBP: 00007f9404d41120 R08: 00000000200002c0 R09: 0000000020000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003 R13: 0000000000000003 R14: 00000000004ad5d8 R15: 0000000000000000 The buggy address belongs to the page: page:00000000dadbcf3e refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x125fdc flags: 0x2fffc0000000000(node=0|zone=2|lastcpupid=0x3fff) raw: 02fffc0000000000 ffffea000497f748 ffffea000497f6c8 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888125fdce80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888125fdcf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff888125fdcf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff888125fdd000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888125fdd080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ================================================================== Link: https://lkml.kernel.org/r/20210701030756.58760-3-desmondcheongzx@gmail.com Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | hfs: add missing clean-up in hfs_fill_superDesmond Cheong Zhi Xi2021-07-151-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "hfs: fix various errors", v2. This series ultimately aims to address a lockdep warning in hfs_find_init reported by Syzbot [1]. The work done for this led to the discovery of another bug, and the Syzkaller repro test also reveals an invalid memory access error after clearing the lockdep warning. Hence, this series is broken up into three patches: 1. Add a missing call to hfs_find_exit for an error path in hfs_fill_super 2. Fix memory mapping in hfs_bnode_read by fixing calls to kmap 3. Add lock nesting notation to tell lockdep that the observed locking hierarchy is safe This patch (of 3): Before exiting hfs_fill_super, the struct hfs_find_data used in hfs_find_init should be passed to hfs_find_exit to be cleaned up, and to release the lock held on the btree. The call to hfs_find_exit is missing from an error path. We add it back in by consolidating calls to hfs_find_exit for error paths. Link: https://syzkaller.appspot.com/bug?id=f007ef1d7a31a469e3be7aeb0fde0769b18585db [1] Link: https://lkml.kernel.org/r/20210701030756.58760-1-desmondcheongzx@gmail.com Link: https://lkml.kernel.org/r/20210701030756.58760-2-desmondcheongzx@gmail.com Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | fs: add vfs_parse_fs_param_source() helperChristian Brauner2021-07-141-18/+36
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a simple helper that filesystems can use in their parameter parser to parse the "source" parameter. A few places open-coded this function and that already caused a bug in the cgroup v1 parser that we fixed. Let's make it harder to get this wrong by introducing a helper which performs all necessary checks. Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12 Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'vboxsf-v5.14-1' of ↵Linus Torvalds2021-07-133-38/+116
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux Pull vboxsf fixes from Hans de Goede: "This adds support for the atomic_open directory-inode op to vboxsf. Note this is not just an enhancement this also fixes an actual issue which users are hitting, see the commit message of the "boxsf: Add support for the atomic_open directory-inode" patch" * tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux: vboxsf: Add support for the atomic_open directory-inode op vboxsf: Add vboxsf_[create|release]_sf_handle() helpers vboxsf: Make vboxsf_dir_create() return the handle for the created file vboxsf: Honor excl flag to the dir-inode create op
| * | vboxsf: Add support for the atomic_open directory-inode opHans de Goede2021-06-231-0/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Opening a new file is done in 2 steps on regular filesystems: 1. Call the create inode-op on the parent-dir to create an inode to hold the meta-data related to the file. 2. Call the open file-op to get a handle for the file. vboxsf however does not really use disk-backed inodes because it is based on passing through file-related system-calls through to the hypervisor. So both steps translate to an open(2) call being passed through to the hypervisor. With the handle returned by the first call immediately being closed again. Making 2 open calls for a single open(..., O_CREATE, ...) calls has 2 problems: a) It is not really efficient. b) It actually breaks some apps. An example of b) is doing a git clone inside a vboxsf mount. When git clone tries to create a tempfile to store the pak files which is downloading the following happens: 1. vboxsf_dir_mkfile() gets called with a mode of 0444 and succeeds. 2. vboxsf_file_open() gets called with file->f_flags containing O_RDWR. When the host is a Linux machine this fails because doing a open(..., O_RDWR) on a file which exists and has mode 0444 results in an -EPERM error. Other network-filesystems and fuse avoid the problem of needing to pass 2 open() calls to the other side by using the atomic_open directory-inode op. This commit fixes git clone not working inside a vboxsf mount, by adding support for the atomic_open directory-inode op. As an added bonus this should also make opening new files faster. The atomic_open implementation is modelled after the atomic_open implementations from the 9p and fuse code. Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support") Reported-by: Ludovic Pouzenc <bugreports@pouzenc.fr> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
| * | vboxsf: Add vboxsf_[create|release]_sf_handle() helpersHans de Goede2021-06-232-27/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Factor out the code to create / release a struct vboxsf_handle into 2 new helper functions. This is a preparation patch for adding atomic_open support. Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support") Signed-off-by: Hans de Goede <hdegoede@redhat.com>
| * | vboxsf: Make vboxsf_dir_create() return the handle for the created fileHans de Goede2021-06-231-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | Make vboxsf_dir_create() optionally return the vboxsf-handle for the created file. This is a preparation patch for adding atomic_open support. Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support") Signed-off-by: Hans de Goede <hdegoede@redhat.com>
| * | vboxsf: Honor excl flag to the dir-inode create opHans de Goede2021-06-231-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Honor the excl flag to the dir-inode create op, instead of behaving as if it is always set. Note the old behavior still worked most of the time since a non-exclusive open only calls the create op, if there is a race and the file is created between the dentry lookup and the calling of the create call. While at it change the type of the is_dir parameter to the vboxsf_dir_create() helper from an int to a bool, to be consistent with the use of bool for the excl parameter. Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support") Signed-off-by: Hans de Goede <hdegoede@redhat.com>
* | | Merge tag 'for-5.14-rc1-tag' of ↵Linus Torvalds2021-07-139-286/+687
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs zoned mode fixes from David Sterba: - fix deadlock when allocating system chunk - fix wrong mutex unlock on an error path - fix extent map splitting for append operation - update and fix message reporting unusable chunk space - don't block when background zone reclaim runs with balance in parallel * tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree btrfs: don't block if we can't acquire the reclaim lock btrfs: properly split extent_map for REQ_OP_ZONE_APPEND btrfs: rework chunk allocation to avoid exhaustion of the system chunk array btrfs: fix deadlock with concurrent chunk allocations involving system chunks btrfs: zoned: print unusable percentage when reclaiming block groups btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
| * | btrfs: zoned: fix wrong mutex unlock on failure to allocate log root treeFilipe Manana2021-07-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When syncing the log, if we fail to allocate the root node for the log root tree: 1) We are unlocking fs_info->tree_log_mutex, but at this point we have not yet locked this mutex; 2) We have locked fs_info->tree_root->log_mutex, but we end up not unlocking it; So fix this by unlocking fs_info->tree_root->log_mutex instead of fs_info->tree_log_mutex. Fixes: e75f9fd194090e ("btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex") CC: stable@vger.kernel.org # 5.13+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: don't block if we can't acquire the reclaim lockJohannes Thumshirn2021-07-071-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we can't acquire the reclaim_bgs_lock on block group reclaim, we block until it is free. This can potentially stall for a long time. While reclaim of block groups is necessary for a good user experience on a zoned file system, there still is no need to block as it is best effort only, just like when we're deleting unused block groups. CC: stable@vger.kernel.org # 5.13 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: properly split extent_map for REQ_OP_ZONE_APPENDNaohiro Aota2021-07-071-29/+118
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Damien reported a test failure with btrfs/209. The test itself ran fine, but the fsck ran afterwards reported a corrupted filesystem. The filesystem corruption happens because we're splitting an extent and then writing the extent twice. We have to split the extent though, because we're creating too large extents for a REQ_OP_ZONE_APPEND operation. When dumping the extent tree, we can see two EXTENT_ITEMs at the same start address but different lengths. $ btrfs inspect dump-tree /dev/nullb1 -t extent ... item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53 refs 1 gen 7 flags DATA extent data backref root FS_TREE objectid 257 offset 786432 count 1 item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53 refs 1 gen 7 flags DATA extent data backref root FS_TREE objectid 257 offset 786432 count 1 The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in extract_ordered_extent(). Since extract_ordered_extent() uses create_io_em() to split an existing extent_map, we will have split->orig_start != split->start. Then, it will be logged with non-zero "extent data offset". Finally, the logged entries are replayed into a duplicated EXTENT_ITEM. Introduce and use proper splitting function for extent_map. The function is intended to be simple and specific usage for extract_ordered_extent() e.g. not supporting compression case (we do not allow splitting compressed extent_map anyway). There was a question raised by Qu, in summary why we want to split the extent map (and not the bio): The problem is not the limit on the zone end, which as you mention is the same as the block group end. The problem is that data write use zone append (ZA) operations. ZA BIOs cannot be split so a large extent may need to be processed with multiple ZA BIOs, While that is also true for regular writes, the major difference is that ZA are "nameless" write operation giving back the written sectors on completion. And ZA operations may be reordered by the block layer (not intentionally though). Combine both of these characteristics and you can see that the data for a large extent may end up being shuffled when written resulting in data corruption and the impossibility to map the extent to some start sector. To avoid this problem, zoned btrfs uses the principle "one data extent == one ZA BIO". So large extents need to be split. This is unfortunate, but we can revisit this later and optimize, e.g. merge back together the fragments of an extent once written if they actually were written sequentially in the zone. Reported-by: Damien Le Moal <damien.lemoal@wdc.com> Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent") CC: stable@vger.kernel.org # 5.12+ CC: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: rework chunk allocation to avoid exhaustion of the system chunk arrayFilipe Manana2021-07-077-184/+546
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") fixed a problem that resulted in exhausting the system chunk array in the superblock when there are many tasks allocating chunks in parallel. Basically too many tasks enter the first phase of chunk allocation without previous tasks having finished their second phase of allocation, resulting in too many system chunks being allocated. That was originally observed when running the fallocate tests of stress-ng on a PowerPC machine, using a node size of 64K. However that commit also introduced a deadlock where a task in phase 1 of the chunk allocation waited for another task that had allocated a system chunk to finish its phase 2, but that other task was waiting on an extent buffer lock held by the first task, therefore resulting in both tasks not making any progress. That change was later reverted by a patch with the subject "btrfs: fix deadlock with concurrent chunk allocations involving system chunks", since there is no simple and short solution to address it and the deadlock is relatively easy to trigger on zoned filesystems, while the system chunk array exhaustion is not so common. This change reworks the chunk allocation to avoid the system chunk array exhaustion. It accomplishes that by making the first phase of chunk allocation do the updates of the device items in the chunk btree and the insertion of the new chunk item in the chunk btree. This is done while under the protection of the chunk mutex (fs_info->chunk_mutex), in the same critical section that checks for available system space, allocates a new system chunk if needed and reserves system chunk space. This way we do not have chunk space reserved until the second phase completes. The same logic is applied to chunk removal as well, since it keeps reserved system space long after it is done updating the chunk btree. For direct allocation of system chunks, the previous behaviour remains, because otherwise we would deadlock on extent buffers of the chunk btree. Changes to the chunk btree are by large done by chunk allocation and chunk removal, which first reserve chunk system space and then later do changes to the chunk btree. The other remaining cases are uncommon and correspond to adding a device, removing a device and resizing a device. All these other cases do not pre-reserve system space, they modify the chunk btree right away, so they don't hold reserved space for a long period like chunk allocation and chunk removal do. The diff of this change is huge, but more than half of it is just addition of comments describing both how things work regarding chunk allocation and removal, including both the new behavior and the parts of the old behavior that did not change. CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix deadlock with concurrent chunk allocations involving system chunksFilipe Manana2021-07-073-69/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a task attempting to allocate a new chunk verifies that there is not currently enough free space in the system space_info and there is another task that allocated a new system chunk but it did not finish yet the creation of the respective block group, it waits for that other task to finish creating the block group. This is to avoid exhaustion of the system chunk array in the superblock, which is limited, when we have a thundering herd of tasks allocating new chunks. This problem was described and fixed by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations"). However there are two very similar scenarios where this can lead to a deadlock: 1) Task B allocated a new system chunk and task A is waiting on task B to finish creation of the respective system block group. However before task B ends its transaction handle and finishes the creation of the system block group, it attempts to allocate another chunk (like a data chunk for an fallocate operation for a very large range). Task B will be unable to progress and allocate the new chunk, because task A set space_info->chunk_alloc to 1 and therefore it loops at btrfs_chunk_alloc() waiting for task A to finish its chunk allocation and set space_info->chunk_alloc to 0, but task A is waiting on task B to finish creation of the new system block group, therefore resulting in a deadlock; 2) Task B allocated a new system chunk and task A is waiting on task B to finish creation of the respective system block group. By the time that task B enter the final phase of block group allocation, which happens at btrfs_create_pending_block_groups(), when it modifies the extent tree, the device tree or the chunk tree to insert the items for some new block group, it needs to allocate a new chunk, so it ends up at btrfs_chunk_alloc() and keeps looping there because task A has set space_info->chunk_alloc to 1, but task A is waiting for task B to finish creation of the new system block group and release the reserved system space, therefore resulting in a deadlock. In short, the problem is if a task B needs to allocate a new chunk after it previously allocated a new system chunk and if another task A is currently waiting for task B to complete the allocation of the new system chunk. Unfortunately this deadlock scenario introduced by the previous fix for the system chunk array exhaustion problem does not have a simple and short fix, and requires a big change to rework the chunk allocation code so that chunk btree updates are all made in the first phase of chunk allocation. And since this deadlock regression is being frequently hit on zoned filesystems and the system chunk array exhaustion problem is triggered in more extreme cases (originally observed on PowerPC with a node size of 64K when running the fallocate tests from stress-ng), revert the changes from that commit. The next patch in the series, with a subject of "btrfs: rework chunk allocation to avoid exhaustion of the system chunk array" does the necessary changes to fix the system chunk array exhaustion problem. Reported-by: Naohiro Aota <naohiro.aota@wdc.com> Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/ Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") CC: stable@vger.kernel.org # 5.12+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: zoned: print unusable percentage when reclaiming block groupsJohannes Thumshirn2021-07-071-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we're automatically reclaiming a zone, because its zone_unusable value is above the reclaim threshold, we're only logging how much percent of the zone's capacity are used, but not how much of the capacity is unusable. Also print the percentage of the unusable space in the block group before we're reclaiming it. Example: BTRFS info (device sdg): reclaiming chunk 230686720 with 13% used 86% unusable CC: stable@vger.kernel.org # 5.13 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>