| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull xfs fixes from Darrick Wong:
"A few fixes for issues in the new online shrink code, additional
corrections for my recent bug-hunt w.r.t. extent size hints on
realtime, and improved input checking of the GROWFSRT ioctl.
IOW, the usual 'I somehow got bored during the merge window and
resumed auditing the farther reaches of xfs':
- Fix shrink eligibility checking when sparse inode clusters enabled
- Reset '..' directory entries when unlinking directories to prevent
verifier errors if fs is shrinked later
- Don't report unusable extent size hints to FSGETXATTR
- Don't warn when extent size hints are unusable because the sysadmin
configured them that way
- Fix insufficient parameter validation in GROWFSRT ioctl
- Fix integer overflow when adding rt volumes to filesystem"
* tag 'xfs-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: detect misaligned rtinherit directory extent size hints
xfs: fix an integer overflow error in xfs_growfs_rt
xfs: improve FSGROWFSRT precondition checking
xfs: don't expose misaligned extszinherit hints to userspace
xfs: correct the narrative around misaligned rtinherit/extszinherit dirs
xfs: reset child dir '..' entry when unlinking child
xfs: check for sparse inode clusters that cross new EOAG when shrinking
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If we encounter a directory that has been configured to pass on an
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review because that is a misconfiguration (that other
parts of the kernel will fix automatically).
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
During a realtime grow operation, we run a single transaction for each
rt bitmap block added to the filesystem. This means that each step has
to be careful to increase sb_rblocks appropriately.
Fix the integer overflow error in this calculation that can happen when
the extent size is very large. Found by running growfs to add a rt
volume to a filesystem formatted with a 1g rt extent size.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Improve the checking at the start of a realtime grow operation so that
we avoid accidentally set a new extent size that is too large and avoid
adding an rt volume to a filesystem with rmap or reflink because we
don't support rt rmap or reflink yet.
While we're at it, separate the checks so that we're only testing one
aspect at a time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 603f000b15f2 changed xfs_ioctl_setattr_check_extsize to reject an
attempt to set an EXTSZINHERIT extent size hint on a directory with
RTINHERIT set if the hint isn't a multiple of the realtime extent size.
However, I have recently discovered that it is possible to change the
realtime extent size when adding a rt device to a filesystem, which
means that the existence of directories with misaligned inherited hints
is not an accident.
As a result, it's possible that someone could have set a valid hint and
added an rt volume with a different rt extent size, which invalidates
the ondisk hints. After such a sequence, FSGETXATTR will report a
misaligned hint, which FSSETXATTR will trip over, causing confusion if
the user was doing the usual GET/SET sequence to change some other
attribute. Change xfs_fill_fsxattr to omit the hint if it isn't aligned
properly.
Fixes: 603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
While auditing the realtime growfs code, I realized that the GROWFSRT
ioctl (and by extension xfs_growfs) has always allowed sysadmins to
change the realtime extent size when adding a realtime section to the
filesystem. Since we also have always allowed sysadmins to set
RTINHERIT and EXTSZINHERIT on directories even if there is no realtime
device, this invalidates the premise laid out in the comments added in
commit 603f000b15f2.
In other words, this is not a case of inadequate metadata validation.
This is a case of nearly forgotten (and apparently untested) but
supported functionality. Update the comments to reflect what we've
learned, and remove the log message about correcting the misalignment.
Fixes: 603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
While running xfs/168, I noticed a second source of post-shrink
corruption errors causing shutdowns.
Let's say that directory B has a low inode number and is a child of
directory A, which has a high number. If B is empty but open, and
unlinked from A, B's dotdot link continues to point to A. If A is then
unlinked and the filesystem shrunk so that A is no longer a valid inode,
a subsequent AIL push of B will trip the inode verifiers because the
dotdot entry points outside of the filesystem.
To avoid this problem, reset B's dotdot entry to the root directory when
unlinking directories, since the root directory cannot be removed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
While running xfs/168, I noticed occasional write verifier shutdowns
involving inodes at the very end of the filesystem. Existing inode
btree validation code checks that all inode clusters are fully contained
within the filesystem.
However, due to inadequate checking in the fs shrink code, it's possible
that there could be a sparse inode cluster at the end of the filesystem
where the upper inodes of the cluster are marked as holes and the
corresponding blocks are free. In this case, the last blocks in the AG
are listed in the bnobt. This enables the shrink to proceed but results
in a filesystem that trips the inode verifiers. Fix this by disallowing
the shrink.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull iomap fixes from Darrick Wong:
"A handful of bugfixes for the iomap code.
There's nothing especially exciting here, just fixes for UBSAN (not
KASAN as I erroneously wrote in the tag message) warnings about
undefined behavior in the SEEK_DATA/SEEK_HOLE code, and some
reshuffling of per-page block state info to fix some problems with
gfs2.
- Fix KASAN warnings due to integer overflow in SEEK_DATA/SEEK_HOLE
- Fix assertion errors when using inlinedata files on gfs2"
* tag 'iomap-5.14-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: Don't create iomap_page objects in iomap_page_mkwrite_actor
iomap: Don't create iomap_page objects for inline files
iomap: Permit pages without an iop to enter writeback
iomap: remove the length variable in iomap_seek_hole
iomap: remove the length variable in iomap_seek_data
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Now that we create those objects in iomap_writepage_map when needed,
there's no need to pre-create them in iomap_page_mkwrite_actor anymore.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In iomap_readpage_actor, don't create iop objects for inline inodes.
Otherwise, iomap_read_inline_data will set PageUptodate without setting
iop->uptodate, and iomap_page_release will eventually complain.
To prevent this kind of bug from occurring in the future, make sure the
page doesn't have private data attached in iomap_read_inline_data.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Create an iop in the writeback path if one doesn't exist. This allows us
to avoid creating the iop in some cases. We'll initially do that for pages
with inline data, but it can be extended to pages which are entirely within
an extent. It also allows for an iop to be removed from pages in the
future (eg page split).
Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Leizhen (ThunderTown) <thunder.leizhen@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull cifs fixes from Steve French:
"Eight cifs/smb3 fixes, including three for stable.
Three are DFS related fixes, and two to fix problems pointed out by
static checkers"
* tag '5.14-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: do not share tcp sessions of dfs connections
SMB3.1.1: fix mount failure to some servers when compression enabled
cifs: added WARN_ON for all the count decrements
cifs: fix missing null session check in mount
cifs: handle reconnect of tcon when there is no cached dfs referral
cifs: fix the out of range assignment to bit fields in parse_server_interfaces
cifs: Do not use the original cruid when following DFS links for multiuser mounts
cifs: use the expiry output of dns_query to schedule next resolution
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Make sure that we do not share tcp sessions of dfs mounts when
mounting regular shares that connect to same server. DFS connections
rely on a single instance of tcp in order to do failover properly in
cifs_reconnect().
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When sending the compression context to some servers, they rejected
the SMB3.1.1 negotiate protocol because they expect the compression
context to have a data length of a multiple of 8.
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We have a few ref counters srv_count, ses_count and
tc_count which we use for ref counting. Added a WARN_ON
during the decrement of each of these counters to make
sure that they don't go below their minimum values.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Although it is unlikely to be have ended up with a null
session pointer calling cifs_try_adding_channels in cifs_mount.
Coverity correctly notes that we are already checking for
it earlier (when we return from do_dfs_failover), so at
a minimum to clarify the code we should make sure we also
check for it when we exit the loop so we don't end up calling
cifs_try_adding_channels or mount_setup_tlink with a null
ses pointer.
Addresses-Coverity: 1505608 ("Derefernce after null check")
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When there is no cached DFS referral of tcon->dfs_path, then reconnect
to same share.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Because the out of range assignment to bit fields
are compiler-dependant, the fields could have wrong
value.
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
mounts
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=213565
cruid should only be used for the initial mount and after this we should use the current
users credentials.
Ignore the original cruid mount argument when creating a new context for a multiuser mount
following a DFS link.
Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api")
Cc: stable@vger.kernel.org # 5.11+
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We recently fixed DNS resolution of the server hostname during reconnect.
However, server IP address may change, even when the old one continues
to server (although sub-optimally).
We should schedule the next DNS resolution based on the TTL of
the DNS record used for the last resolution. This way, we resolve the
server hostname again when a DNS record expires.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: <stable@vger.kernel.org> # v5.11+
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull io_uring fixes from Jens Axboe:
"Two small fixes: one fixing the process target of a check, and the
other a minor issue with the drain error handling"
* tag 'io_uring-5.14-2021-07-16' of git://git.kernel.dk/linux-block:
io_uring: fix io_drain_req()
io_uring: use right task for exiting checks
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
io_drain_req() return whether the request has been consumed or not, not
an error code. Fix a stupid mistake slipped from optimisation patches.
Reported-by: syzbot+ba6fcd859210f4e9e109@syzkaller.appspotmail.com
Fixes: 76cc33d79175a ("io_uring: refactor io_req_defer()")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d3c53c4274ffff307c8ae062fc7fda63b978df2.1626039606.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When we use delayed_work for fallback execution of requests, current
will be not of the submitter task, and so checks in io_req_task_submit()
may not behave as expected. Currently, it leaves inline completions not
flushed, so making io_ring_exit_work() to hang. Use the submitter task
for all those checks.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cb413c715bed0bc9c98b169059ea9c8a2c770715.1625881431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fix from Damien Le Moal:
"A single patch to remove an unnecessary NULL bio check (from
Xianting)"
* tag 'zonefs-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: remove redundant null bio check
|
| |/
| |
| |
| |
| |
| |
| |
| | |
bio_alloc() with __GFP_DIRECT_RECLAIM, which is included in
GFP_NOFS, never fails, see comments in bio_alloc_bioset().
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pull configfs fix from Christoph Hellwig:
- fix the read and write iterators (Bart Van Assche)
* tag 'configfs-5.13-1' of git://git.infradead.org/users/hch/configfs:
configfs: fix the read and write iterators
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 7fe1e79b59ba ("configfs: implement the .read_iter and .write_iter
methods") changed the simple_read_from_buffer() calls into copy_to_iter()
calls and the simple_write_to_buffer() calls into copy_from_iter() calls.
The simple*buffer() methods update the file offset (*ppos) but the read
and write iterators not yet. Make the read and write iterators update the
file offset (iocb->ki_pos).
This patch has been tested as follows:
# modprobe target_core_user
# dd if=/sys/kernel/config/target/dbroot bs=1
/var/target
12+0 records in
12+0 records out
12 bytes copied, 9.5539e-05 s, 126 kB/s
# cd /sys/kernel/config/acpi/table
# mkdir test
# cd test
# dmesg -c >/dev/null; printf 'SSDT\x8\0\0\0abcdefghijklmnopqrstuvwxyz' | dd of=aml bs=1; dmesg -c
34+0 records in
34+0 records out
34 bytes copied, 0.010627 s, 3.2 kB/s
[ 261.056551] ACPI configfs: invalid table length
Reported-by: Yanko Kaneti <yaneti@declera.com>
Cc: Yanko Kaneti <yaneti@declera.com>
Fixes: 7fe1e79b59ba ("configfs: implement the .read_iter and .write_iter methods")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull fallthrough fixes from Gustavo Silva:
"This fixes many fall-through warnings when building with Clang and
-Wimplicit-fallthrough, and also enables -Wimplicit-fallthrough for
Clang, globally.
It's also important to notice that since we have adopted the use of
the pseudo-keyword macro fallthrough, we also want to avoid having
more /* fall through */ comments being introduced. Contrary to GCC,
Clang doesn't recognize any comments as implicit fall-through markings
when the -Wimplicit-fallthrough option is enabled.
So, in order to avoid having more comments being introduced, we use
the option -Wimplicit-fallthrough=5 for GCC, which similar to Clang,
will cause a warning in case a code comment is intended to be used as
a fall-through marking. The patch for Makefile also enforces this.
We had almost 4,000 of these issues for Clang in the beginning, and
there might be a couple more out there when building some
architectures with certain configurations. However, with the recent
fixes I think we are in good shape and it is now possible to enable
the warning for Clang"
* tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits)
Makefile: Enable -Wimplicit-fallthrough for Clang
powerpc/smp: Fix fall-through warning for Clang
dmaengine: mpc512x: Fix fall-through warning for Clang
usb: gadget: fsl_qe_udc: Fix fall-through warning for Clang
powerpc/powernv: Fix fall-through warning for Clang
MIPS: Fix unreachable code issue
MIPS: Fix fall-through warnings for Clang
ASoC: Mediatek: MT8183: Fix fall-through warning for Clang
power: supply: Fix fall-through warnings for Clang
dmaengine: ti: k3-udma: Fix fall-through warning for Clang
s390: Fix fall-through warnings for Clang
dmaengine: ipu: Fix fall-through warning for Clang
iommu/arm-smmu-v3: Fix fall-through warning for Clang
mmc: jz4740: Fix fall-through warning for Clang
PCI: Fix fall-through warning for Clang
scsi: libsas: Fix fall-through warning for Clang
video: fbdev: Fix fall-through warning for Clang
math-emu: Fix fall-through warning
cpufreq: Fix fall-through warning for Clang
drm/msm: Fix fall-through warning in msm_gem_new_impl()
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Fix the following warning:
fs/fcntl.c:373:3: warning: fallthrough annotation in unreachable code [-Wimplicit-fallthrough]
fallthrough;
^
include/linux/compiler_attributes.h:210:41: note: expanded from macro 'fallthrough'
# define fallthrough __attribute__((__fallthrough__))
by placing the fallthrough; statement inside ifdeffery.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In preparation to enable -Wimplicit-fallthrough for Clang, fix
the following warnings by replacing /* fallthrough */ comments,
and its variants, with the new pseudo-keyword macro fallthrough:
fs/xfs/libxfs/xfs_attr.c:487:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:500:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:532:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:594:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:607:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:1410:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:1445:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
fs/xfs/libxfs/xfs_attr.c:1473:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
Notice that Clang doesn't recognize /* fallthrough */ comments as
implicit fall-through markings, so in order to globally enable
-Wimplicit-fallthrough for Clang, these comments need to be
replaced with fallthrough; in the whole codebase.
Link: https://github.com/KSPP/linux/issues/115
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Merge misc fixes from Andrew Morton:
"13 patches.
Subsystems affected by this patch series: mm (kasan, pagealloc, rmap,
hmm, and hugetlb), and hfs"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/hugetlb: fix refs calculation from unaligned @vaddr
hfs: add lock nesting notation to hfs_find_init
hfs: fix high memory mapping in hfs_bnode_read
hfs: add missing clean-up in hfs_fill_super
lib/test_hmm: remove set but unused page variable
mm: fix the try_to_unmap prototype for !CONFIG_MMU
mm/page_alloc: further fix __alloc_pages_bulk() return value
mm/page_alloc: correct return value when failing at preparing
mm/page_alloc: avoid page allocator recursion with pagesets.lock held
Revert "mm/page_alloc: make should_fail_alloc_page() static"
kasan: fix build by including kernel.h
kasan: add memzero init for unaligned size at DEBUG
mm: move helper to check slub_debug_enabled
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Syzbot reports a possible recursive lock in [1].
This happens due to missing lock nesting information. From the logs, we
see that a call to hfs_fill_super is made to mount the hfs filesystem.
While searching for the root inode, the lock on the catalog btree is
grabbed. Then, when the parent of the root isn't found, a call to
__hfs_bnode_create is made to create the parent of the root. This
eventually leads to a call to hfs_ext_read_extent which grabs a lock on
the extents btree.
Since the order of locking is catalog btree -> extents btree, this lock
hierarchy does not lead to a deadlock.
To tell lockdep that this locking is safe, we add nesting notation to
distinguish between catalog btrees, extents btrees, and attributes
btrees (for HFS+). This has already been done in hfsplus.
Link: https://syzkaller.appspot.com/bug?id=f007ef1d7a31a469e3be7aeb0fde0769b18585db [1]
Link: https://lkml.kernel.org/r/20210701030756.58760-4-desmondcheongzx@gmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reported-by: syzbot+b718ec84a87b7e73ade4@syzkaller.appspotmail.com
Tested-by: syzbot+b718ec84a87b7e73ade4@syzkaller.appspotmail.com
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pages that we read in hfs_bnode_read need to be kmapped into kernel
address space. However, currently only the 0th page is kmapped. If the
given offset + length exceeds this 0th page, then we have an invalid
memory access.
To fix this, we kmap relevant pages one by one and copy their relevant
portions of data.
An example of invalid memory access occurring without this fix can be seen
in the following crash report:
==================================================================
BUG: KASAN: use-after-free in memcpy include/linux/fortify-string.h:191 [inline]
BUG: KASAN: use-after-free in hfs_bnode_read+0xc4/0xe0 fs/hfs/bnode.c:26
Read of size 2 at addr ffff888125fdcffe by task syz-executor5/4634
CPU: 0 PID: 4634 Comm: syz-executor5 Not tainted 5.13.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x195/0x1f8 lib/dump_stack.c:120
print_address_description.constprop.0+0x1d/0x110 mm/kasan/report.c:233
__kasan_report mm/kasan/report.c:419 [inline]
kasan_report.cold+0x7b/0xd4 mm/kasan/report.c:436
check_region_inline mm/kasan/generic.c:180 [inline]
kasan_check_range+0x154/0x1b0 mm/kasan/generic.c:186
memcpy+0x24/0x60 mm/kasan/shadow.c:65
memcpy include/linux/fortify-string.h:191 [inline]
hfs_bnode_read+0xc4/0xe0 fs/hfs/bnode.c:26
hfs_bnode_read_u16 fs/hfs/bnode.c:34 [inline]
hfs_bnode_find+0x880/0xcc0 fs/hfs/bnode.c:365
hfs_brec_find+0x2d8/0x540 fs/hfs/bfind.c:126
hfs_brec_read+0x27/0x120 fs/hfs/bfind.c:165
hfs_cat_find_brec+0x19a/0x3b0 fs/hfs/catalog.c:194
hfs_fill_super+0xc13/0x1460 fs/hfs/super.c:419
mount_bdev+0x331/0x3f0 fs/super.c:1368
hfs_mount+0x35/0x40 fs/hfs/super.c:457
legacy_get_tree+0x10c/0x220 fs/fs_context.c:592
vfs_get_tree+0x93/0x300 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0x13f5/0x20e0 fs/namespace.c:3235
do_mount fs/namespace.c:3248 [inline]
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount fs/namespace.c:3433 [inline]
__x64_sys_mount+0x2b8/0x340 fs/namespace.c:3433
do_syscall_64+0x37/0xc0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x45e63a
Code: 48 c7 c2 bc ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb d2 e8 88 04 00 00 0f 1f 84 00 00 00 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f9404d410d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000020000248 RCX: 000000000045e63a
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007f9404d41120
RBP: 00007f9404d41120 R08: 00000000200002c0 R09: 0000000020000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
R13: 0000000000000003 R14: 00000000004ad5d8 R15: 0000000000000000
The buggy address belongs to the page:
page:00000000dadbcf3e refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x125fdc
flags: 0x2fffc0000000000(node=0|zone=2|lastcpupid=0x3fff)
raw: 02fffc0000000000 ffffea000497f748 ffffea000497f6c8 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888125fdce80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff888125fdcf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff888125fdcf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff888125fdd000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff888125fdd080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Link: https://lkml.kernel.org/r/20210701030756.58760-3-desmondcheongzx@gmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Patch series "hfs: fix various errors", v2.
This series ultimately aims to address a lockdep warning in
hfs_find_init reported by Syzbot [1].
The work done for this led to the discovery of another bug, and the
Syzkaller repro test also reveals an invalid memory access error after
clearing the lockdep warning. Hence, this series is broken up into
three patches:
1. Add a missing call to hfs_find_exit for an error path in
hfs_fill_super
2. Fix memory mapping in hfs_bnode_read by fixing calls to kmap
3. Add lock nesting notation to tell lockdep that the observed locking
hierarchy is safe
This patch (of 3):
Before exiting hfs_fill_super, the struct hfs_find_data used in
hfs_find_init should be passed to hfs_find_exit to be cleaned up, and to
release the lock held on the btree.
The call to hfs_find_exit is missing from an error path. We add it back
in by consolidating calls to hfs_find_exit for error paths.
Link: https://syzkaller.appspot.com/bug?id=f007ef1d7a31a469e3be7aeb0fde0769b18585db [1]
Link: https://lkml.kernel.org/r/20210701030756.58760-1-desmondcheongzx@gmail.com
Link: https://lkml.kernel.org/r/20210701030756.58760-2-desmondcheongzx@gmail.com
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add a simple helper that filesystems can use in their parameter parser
to parse the "source" parameter. A few places open-coded this function
and that already caused a bug in the cgroup v1 parser that we fixed.
Let's make it harder to get this wrong by introducing a helper which
performs all necessary checks.
Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux
Pull vboxsf fixes from Hans de Goede:
"This adds support for the atomic_open directory-inode op to vboxsf.
Note this is not just an enhancement this also fixes an actual issue
which users are hitting, see the commit message of the "boxsf: Add
support for the atomic_open directory-inode" patch"
* tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux:
vboxsf: Add support for the atomic_open directory-inode op
vboxsf: Add vboxsf_[create|release]_sf_handle() helpers
vboxsf: Make vboxsf_dir_create() return the handle for the created file
vboxsf: Honor excl flag to the dir-inode create op
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Opening a new file is done in 2 steps on regular filesystems:
1. Call the create inode-op on the parent-dir to create an inode
to hold the meta-data related to the file.
2. Call the open file-op to get a handle for the file.
vboxsf however does not really use disk-backed inodes because it
is based on passing through file-related system-calls through to
the hypervisor. So both steps translate to an open(2) call being
passed through to the hypervisor. With the handle returned by
the first call immediately being closed again.
Making 2 open calls for a single open(..., O_CREATE, ...) calls
has 2 problems:
a) It is not really efficient.
b) It actually breaks some apps.
An example of b) is doing a git clone inside a vboxsf mount.
When git clone tries to create a tempfile to store the pak
files which is downloading the following happens:
1. vboxsf_dir_mkfile() gets called with a mode of 0444 and succeeds.
2. vboxsf_file_open() gets called with file->f_flags containing
O_RDWR. When the host is a Linux machine this fails because doing
a open(..., O_RDWR) on a file which exists and has mode 0444 results
in an -EPERM error.
Other network-filesystems and fuse avoid the problem of needing to
pass 2 open() calls to the other side by using the atomic_open
directory-inode op.
This commit fixes git clone not working inside a vboxsf mount,
by adding support for the atomic_open directory-inode op.
As an added bonus this should also make opening new files faster.
The atomic_open implementation is modelled after the atomic_open
implementations from the 9p and fuse code.
Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Reported-by: Ludovic Pouzenc <bugreports@pouzenc.fr>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Factor out the code to create / release a struct vboxsf_handle into
2 new helper functions.
This is a preparation patch for adding atomic_open support.
Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Make vboxsf_dir_create() optionally return the vboxsf-handle for
the created file. This is a preparation patch for adding atomic_open
support.
Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Honor the excl flag to the dir-inode create op, instead of behaving
as if it is always set.
Note the old behavior still worked most of the time since a non-exclusive
open only calls the create op, if there is a race and the file is created
between the dentry lookup and the calling of the create call.
While at it change the type of the is_dir parameter to the
vboxsf_dir_create() helper from an int to a bool, to be consistent with
the use of bool for the excl parameter.
Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support")
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
|
|\ \ \
| |_|/
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs zoned mode fixes from David Sterba:
- fix deadlock when allocating system chunk
- fix wrong mutex unlock on an error path
- fix extent map splitting for append operation
- update and fix message reporting unusable chunk space
- don't block when background zone reclaim runs with balance in
parallel
* tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
btrfs: don't block if we can't acquire the reclaim lock
btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
btrfs: fix deadlock with concurrent chunk allocations involving system chunks
btrfs: zoned: print unusable percentage when reclaiming block groups
btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When syncing the log, if we fail to allocate the root node for the log
root tree:
1) We are unlocking fs_info->tree_log_mutex, but at this point we have
not yet locked this mutex;
2) We have locked fs_info->tree_root->log_mutex, but we end up not
unlocking it;
So fix this by unlocking fs_info->tree_root->log_mutex instead of
fs_info->tree_log_mutex.
Fixes: e75f9fd194090e ("btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex")
CC: stable@vger.kernel.org # 5.13+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If we can't acquire the reclaim_bgs_lock on block group reclaim, we
block until it is free. This can potentially stall for a long time.
While reclaim of block groups is necessary for a good user experience on
a zoned file system, there still is no need to block as it is best
effort only, just like when we're deleting unused block groups.
CC: stable@vger.kernel.org # 5.13
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Damien reported a test failure with btrfs/209. The test itself ran fine,
but the fsck ran afterwards reported a corrupted filesystem.
The filesystem corruption happens because we're splitting an extent and
then writing the extent twice. We have to split the extent though, because
we're creating too large extents for a REQ_OP_ZONE_APPEND operation.
When dumping the extent tree, we can see two EXTENT_ITEMs at the same
start address but different lengths.
$ btrfs inspect dump-tree /dev/nullb1 -t extent
...
item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
refs 1 gen 7 flags DATA
extent data backref root FS_TREE objectid 257 offset 786432 count 1
item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
refs 1 gen 7 flags DATA
extent data backref root FS_TREE objectid 257 offset 786432 count 1
The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
extract_ordered_extent(). Since extract_ordered_extent() uses
create_io_em() to split an existing extent_map, we will have
split->orig_start != split->start. Then, it will be logged with non-zero
"extent data offset". Finally, the logged entries are replayed into
a duplicated EXTENT_ITEM.
Introduce and use proper splitting function for extent_map. The function is
intended to be simple and specific usage for extract_ordered_extent() e.g.
not supporting compression case (we do not allow splitting compressed
extent_map anyway).
There was a question raised by Qu, in summary why we want to split the
extent map (and not the bio):
The problem is not the limit on the zone end, which as you mention is
the same as the block group end. The problem is that data write use zone
append (ZA) operations. ZA BIOs cannot be split so a large extent may
need to be processed with multiple ZA BIOs, While that is also true for
regular writes, the major difference is that ZA are "nameless" write
operation giving back the written sectors on completion. And ZA
operations may be reordered by the block layer (not intentionally
though). Combine both of these characteristics and you can see that the
data for a large extent may end up being shuffled when written resulting
in data corruption and the impossibility to map the extent to some start
sector.
To avoid this problem, zoned btrfs uses the principle "one data extent
== one ZA BIO". So large extents need to be split. This is unfortunate,
but we can revisit this later and optimize, e.g. merge back together the
fragments of an extent once written if they actually were written
sequentially in the zone.
Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.12+
CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.
However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.
This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.
The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.
For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.
The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When a task attempting to allocate a new chunk verifies that there is not
currently enough free space in the system space_info and there is another
task that allocated a new system chunk but it did not finish yet the
creation of the respective block group, it waits for that other task to
finish creating the block group. This is to avoid exhaustion of the system
chunk array in the superblock, which is limited, when we have a thundering
herd of tasks allocating new chunks. This problem was described and fixed
by commit eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations").
However there are two very similar scenarios where this can lead to a
deadlock:
1) Task B allocated a new system chunk and task A is waiting on task B
to finish creation of the respective system block group. However before
task B ends its transaction handle and finishes the creation of the
system block group, it attempts to allocate another chunk (like a data
chunk for an fallocate operation for a very large range). Task B will
be unable to progress and allocate the new chunk, because task A set
space_info->chunk_alloc to 1 and therefore it loops at
btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
and set space_info->chunk_alloc to 0, but task A is waiting on task B
to finish creation of the new system block group, therefore resulting
in a deadlock;
2) Task B allocated a new system chunk and task A is waiting on task B to
finish creation of the respective system block group. By the time that
task B enter the final phase of block group allocation, which happens
at btrfs_create_pending_block_groups(), when it modifies the extent
tree, the device tree or the chunk tree to insert the items for some
new block group, it needs to allocate a new chunk, so it ends up at
btrfs_chunk_alloc() and keeps looping there because task A has set
space_info->chunk_alloc to 1, but task A is waiting for task B to
finish creation of the new system block group and release the reserved
system space, therefore resulting in a deadlock.
In short, the problem is if a task B needs to allocate a new chunk after
it previously allocated a new system chunk and if another task A is
currently waiting for task B to complete the allocation of the new system
chunk.
Unfortunately this deadlock scenario introduced by the previous fix for
the system chunk array exhaustion problem does not have a simple and short
fix, and requires a big change to rework the chunk allocation code so that
chunk btree updates are all made in the first phase of chunk allocation.
And since this deadlock regression is being frequently hit on zoned
filesystems and the system chunk array exhaustion problem is triggered
in more extreme cases (originally observed on PowerPC with a node size
of 64K when running the fallocate tests from stress-ng), revert the
changes from that commit. The next patch in the series, with a subject
of "btrfs: rework chunk allocation to avoid exhaustion of the system
chunk array" does the necessary changes to fix the system chunk array
exhaustion problem.
Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
Fixes: eafa4fd0ad0607 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When we're automatically reclaiming a zone, because its zone_unusable
value is above the reclaim threshold, we're only logging how much
percent of the zone's capacity are used, but not how much of the
capacity is unusable.
Also print the percentage of the unusable space in the block group
before we're reclaiming it.
Example:
BTRFS info (device sdg): reclaiming chunk 230686720 with 13% used 86% unusable
CC: stable@vger.kernel.org # 5.13
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|