summaryrefslogtreecommitdiffstats
path: root/block
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'vfs-6.9-rc1.fixes' of ↵Linus Torvalds2024-03-181-0/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes for this merge window: - Undo the hiding of silly-rename files in afs. If they're hidden they can't be deleted by rm manually anymore causing regressions - Avoid caching the preferred address for an afs server to avoid accidently overriding an explicitly specified preferred server address - Fix bad stat() and rmdir() interaction in afs - Take a passive reference on the superblock when opening a block device so the holder is available to concurrent callers from the block layer - Clear private data pointer in fscache_begin_operation() to avoid it being falsely treated as valid" * tag 'vfs-6.9-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fscache: Fix error handling in fscache_begin_operation() fs,block: get holder during claim afs: Fix occasional rmdir-then-VNOVNODE with generic/011 afs: Don't cache preferred address afs: Revert "afs: Hide silly-rename files from userspace"
| * fs,block: get holder during claimChristian Brauner2024-03-181-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we open block devices as files we need to deal with the realities that closing is a deferred operation. An operation on the block device such as e.g., freeze, thaw, or removal that runs concurrently with umount, tries to acquire a stable reference on the holder. The holder might already be gone though. Make that reliable by grabbing a passive reference to the holder during bdev_open() and releasing it during bdev_release(). Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only Reported-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/r/ZfEQQ9jZZVes0WCZ@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@infradead.org> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reported-by: https://lore.kernel.org/r/CAHj4cs8tbDwKRwfS1=DmooP73ysM__xAb2PQc6XsAmWR+VuYmg@mail.gmail.com Link: https://lore.kernel.org/r/20240315-freibad-annehmbar-ca68c375af91@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
* | block: fix mismatched kerneldoc function nameJiapeng Chong2024-03-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | No functional modification involved. block/blk-settings.c:281: warning: expecting prototype for queue_limits_commit_set(). Prototype was for queue_limits_set() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8539 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240314025615.71269-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Revert "blk-lib: check for kill signal"Christoph Hellwig2024-03-131-39/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 8a08c5fd89b447a7de7eb293a7a274c46b932ba2. It turns out while this is a perfectly valid and long overdue thing to do for user initiated discards / zeroing from the ioctl handler, it actually breaks file system use of the discard helper by interrupting in places the file system doesn't expect, and by leaving the bio chain in a state that the file system callers of (at least) __blkdev_issue_discard do not expect. Revert the change for now, we'll redo it for the next merge window after refactoring the code to better split the file system vs ioctl callers and cleaning up a few other loose ends. Reported-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240314021623.1908895-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Revert "block/mq-deadline: use correct way to throttling write requests"Bart Van Assche2024-03-131-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code "max(1U, 3 * (1U << shift) / 4)" comes from the Kyber I/O scheduler. The Kyber I/O scheduler maintains one internal queue per hwq and hence derives its async_depth from the number of hwq tags. Using this approach for the mq-deadline scheduler is wrong since the mq-deadline scheduler maintains one internal queue for all hwqs combined. Hence this revert. Cc: stable@vger.kernel.org Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Cc: Zhiguo Niu <Zhiguo.Niu@unisoc.com> Fixes: d47f9717e5cf ("block/mq-deadline: use correct way to throttling write requests") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240313214218.1736147-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: limit block time caching to in_task() contextJens Axboe2024-03-131-1/+1
|/ | | | | | | | | | | | | We should not have any callers of this from non-task context, but Jakub ran [1] into one from blk-iocost. Rather than risk running into others, or future ones, just limit blk_time_get_ns() to when it is called from a task. Any other usage is invalid. [1] https://lore.kernel.org/lkml/CAHk-=wiOaBLqarS2uFhM1YdwOvCX4CZaWkeyNDY1zONpbYw2ig@mail.gmail.com/ Fixes: da4c8c3d0975 ("block: cache current nsec time in struct blk_plug") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Revert "dm: use queue_limits_set"Linus Torvalds2024-03-111-1/+1
| | | | | | | | | | | This reverts commit 8e0ef412869430d114158fc3b9b1fb111e247bd3. It's broken, and causes the boot to fail on encrypted volumes. Reported-and-bisected-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/all/20240311235023.GA1205@cmpxchg.org/ Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linuxLinus Torvalds2024-03-1129-361/+687
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...
| * block: partitions: only define function mac_fix_string for CONFIG_PPC_PMACColin Ian King2024-03-091-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | The helper function mac_fix_string is only required with CONFIG_PPC_PMAC, add #if CONFIG_PPC_PMAC and #endif around the function. Cleans up clang scan build warning: block/partitions/mac.c:23:20: warning: unused function 'mac_fix_string' [-Wunused-function] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20240308133921.2058227-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Merge tag 'md-6.9-20240306' of ↵Jens Axboe2024-03-061-24/+0
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block Pull MD atomic queue limits changes from Song. * tag 'md-6.9-20240306' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper
| | * block: remove disk_stack_limitsChristoph Hellwig2024-03-061-24/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | disk_stack_limits is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed--by: Song Liu <song@kernel.org> Tested-by: Song Liu <song@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240303140150.5435-12-hch@lst.de
| * | block: move capacity validation to blkpg_do_ioctl()Li Lingfeng2024-03-062-12/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6d4e80db4ebe ("block: add capacity validation in bdev_add_partition()") add check of partition's start and end sectors to prevent exceeding the size of the disk when adding partitions. However, there is still no check for resizing partitions now. Move the check to blkpg_do_ioctl() to cover resizing partitions. Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240305032132.548958-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: prevent division by zero in blk_rq_stat_sum()Roman Smirnov2024-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The expression dst->nr_samples + src->nr_samples may have zero value on overflow. It is necessary to add a check to avoid division by zero. Found by Linux Verification Center (linuxtesting.org) with Svace. Signed-off-by: Roman Smirnov <r.smirnov@omp.ru> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://lore.kernel.org/r/20240305134509.23108-1-r.smirnov@omp.ru Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | sed-opal: Remove the ret variable from the functionLi kunyu2024-03-061-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | The ret variable in the function has not yet been effective and can be removed. Signed-off-by: Li kunyu <kunyu@nfschina.com> Link: https://lore.kernel.org/r/20240306101444.1244-1-kunyu@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | sed-opal: Remove unnecessary ‘0’ values from retLi kunyu2024-03-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | ret is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li kunyu <kunyu@nfschina.com> Link: https://lore.kernel.org/r/20240306100659.106521-1-kunyu@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | sed-opal: Remove unnecessary ‘0’ values from errLi zeming2024-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | err is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20240306100216.69340-1-zeming@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | sed-opal: Remove unnecessary ‘0’ values from errorLi zeming2024-03-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | error is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li zeming <zeming@nfschina.com> Link: https://lore.kernel.org/r/20240306095608.26839-1-zeming@nfschina.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: make block_class constantRicardo B. Marliere2024-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the block_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240305-class_cleanup-block-v1-1-130bb27b9c72@marliere.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: Fix page refcounts for unaligned buffers in __bio_release_pages()Tony Battersby2024-03-061-3/+4
| |/ | | | | | | | | | | | | | | | | | | | | | | Fix an incorrect number of pages being released for buffers that do not start at the beginning of a page. Fixes: 1b151e2435fc ("block: Remove special-casing of compound pages") Cc: stable@vger.kernel.org Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Tested-by: Greg Edwards <gedwards@ddn.com> Link: https://lore.kernel.org/r/86e592a9-98d4-4cff-a646-0c0084328356@cybernetics.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * dm: use queue_limits_setChristoph Hellwig2024-03-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Use queue_limits_set which validates the limits and takes care of updating the readahead settings instead of directly assigning them to the queue. For that make sure all limits are actually updated before the assignment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240228225653.947152-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add a queue_limits_stack_bdev helperChristoph Hellwig2024-03-011-0/+25
| | | | | | | | | | | | | | | | | | | | | | Add a small wrapper around blk_stack_limits that allows passing a bdev for the bottom device and prints an error in case of misaligned device. The name fits into the new queue limits API and the intent is to eventually replace disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240228225653.947152-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add a queue_limits_set helperChristoph Hellwig2024-03-011-0/+18
| | | | | | | | | | | | | | | | | | | | Add a small wrapper around queue_limits_commit_update for stacking drivers that don't want to update existing limits, but set an entirely new set. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240228225653.947152-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-mq: don't change nr_hw_queues and nr_maps for kdump kernelMing Lei2024-02-281-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For most of ARCHs, 'nr_cpus=1' is passed for kdump kernel, so nr_hw_queues for each mapping is supposed to be 1 already. More importantly, this way may cause trouble for driver, because blk-mq and driver see different queue mapping since driver should setup hardware queue setting before calling into allocating blk-mq tagset. So not overriding nr_hw_queues and nr_maps for kdump kernel. Cc: Wen Xiong <wenxiong@us.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240228040857.306483-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * bdev: remove SLAB_MEM_SPREAD flag usageChengming Zhou2024-02-241-1/+1
| | | | | | | | | | | | | | | | | | The SLAB_MEM_SPREAD flag is already a no-op as of 6.8-rc1, remove its usage so we can delete it from slab. No functional change. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20240224134646.829105-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block/blk-mq: Don't complete locally if capacities are differentQais Yousef2024-02-241-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The logic in blk_mq_complete_need_ipi() assumes SMP systems where all CPUs have equal compute capacities and only LLC cache can make a different on perceived performance. But this assumption falls apart on HMP systems where LLC is shared, but the CPUs have different capacities. Staying local then can have a big performance impact if the IO request was done from a CPU with higher capacity but the interrupt is serviced on a lower capacity CPU. Use the new cpus_equal_capacity() function to check if we need to send an IPI. Without the patch I see the BLOCK softirq always running on little cores (where the hardirq is serviced). With it I can see it running on all cores. This was noticed after the topology change [1] where now on a big.LITTLE we truly get that the LLC is shared between all cores where as in the past it was being misrepresented for historical reasons. The logic exposed a missing dependency on capacities for such systems where there can be a big performance difference between the CPUs. This of course introduced a noticeable change in behavior depending on how the topology is presented. Leading to regressions in some workloads as the performance of the BLOCK softirq on littles can be noticeably worse on some platforms. Worth noting that we could have checked for capacities being greater than or equal instead for equality. This will lead to favouring higher performance always. But opted for equality instead to match the performance of the requester without making an assumption that can lead to power trade-offs which these systems tend to be sensitive about. If the requester would like to run faster, it's better to rely on the scheduler to give the IO requester via some facility to run on a faster core; and then if the interrupt triggered on a CPU with different capacity we'll make sure to match the performance the requester is supposed to run at. [1] https://lpc.events/event/16/contributions/1342/attachments/962/1883/LPC-2022-Android-MC-Phantom-Domains.pdf Signed-off-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240223155749.2958009-3-qyousef@layalina.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-lib: check for kill signalKeith Busch2024-02-241-1/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some of these block operations can access a significant capacity and take longer than the user expected. A user may change their mind about wanting to run that command and attempt to kill the process and do something else with their device. But since the task is uninterruptable, they have to wait for it to finish, which could be many hours. Check for a fatal signal at each iteration so the user doesn't have to wait for their regretted operation to complete naturally. Reported-by: Conrad Meyer <conradmeyer@meta.com> Tested-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240223155910.3622666-5-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: io wait hang check helperKeith Busch2024-02-243-27/+17
| | | | | | | | | | | | | | | | | | | | | | | | This is the same in two places, and another will be added soon. Create a helper for it. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240223155910.3622666-4-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: cleanup __blkdev_issue_write_zeroesKeith Busch2024-02-241-12/+9
| | | | | | | | | | | | | | | | | | | | | | Use min to calculate the next number of sectors like everyone else. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240223155910.3622666-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: blkdev_issue_secure_erase loop styleKeith Busch2024-02-241-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | Use consistent coding style in this file. All the other loops for the same purpose use "while (nr_sects)", so they win. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240223155910.3622666-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: fix deadlock between bd_link_disk_holder and partition scanLi Nan2024-02-231-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'open_mutex' of gendisk is used to protect open/close block devices. But in bd_link_disk_holder(), it is used to protect the creation of symlink between holding disk and slave bdev, which introduces some issues. When bd_link_disk_holder() is called, the driver is usually in the process of initialization/modification and may suspend submitting io. At this time, any io hold 'open_mutex', such as scanning partitions, can cause deadlocks. For example, in raid: T1 T2 bdev_open_by_dev lock open_mutex [1] ... efi_partition ... md_submit_bio md_ioctl mddev_syspend -> suspend all io md_add_new_disk bind_rdev_to_array bd_link_disk_holder try lock open_mutex [2] md_handle_request -> wait mddev_resume T1 scan partition, T2 add a new device to raid. T1 waits for T2 to resume mddev, but T2 waits for open_mutex held by T1. Deadlock occurs. Fix it by introducing a local mutex 'blk_holder_mutex' to replace 'open_mutex'. Fixes: 1b0a2d950ee2 ("md: use new apis to suspend array for ioctls involed array reconfiguration") Reported-by: mgperkow@gmail.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218459 Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240221090122.1281868-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Do not include rbtree.h in blk-zoned.cDamien Le Moal2024-02-221-1/+0
| | | | | | | | | | | | | | | | | | | | The block zone code does not use RB-tree. So remove the include of linux/rbtree.h as it is not needed. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240222131724.1803520-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Clear zone limits for a non-zoned stacked queueDamien Le Moal2024-02-221-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Device mapper may create a non-zoned mapped device out of a zoned device (e.g., the dm-zoned target). In such case, some queue limit such as the max_zone_append_sectors and zone_write_granularity endup being non zero values for a block device that is not zoned. Avoid this by clearing these limits in blk_stack_limits() when the stacked zoned limit is false. Fixes: 3093a479727b ("block: inherit the zoned characteristics in blk_stack_limits") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240222131724.1803520-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: fix virt_boundary handling in blk_validate_limitsChristoph Hellwig2024-02-211-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | Don't set the default max_segment_size value when a virt_boundary is used. Fixes: d690cb8ae14b ("block: add an API to atomically update queue limits") Reported-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20240221125010.3609444-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: pass a queue_limits argument to blk_alloc_diskChristoph Hellwig2024-02-191-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also change blk_alloc_disk to return an ERR_PTR instead of just NULL which can't distinguish errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: pass a queue_limits argument to blk_mq_alloc_diskChristoph Hellwig2024-02-131-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: pass a queue_limits argument to blk_mq_init_queueChristoph Hellwig2024-02-132-14/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass a queue_limits to blk_mq_init_queue and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also rename the function to blk_mq_alloc_queue as that is a much better name for a function that allocates a queue and always pass the queuedata argument instead of having a separate version for the extra argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: pass a queue_limits argument to blk_alloc_queueChristoph Hellwig2024-02-134-14/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | Pass a queue_limits to blk_alloc_queue and apply it after validating and capping the values using blk_validate_limits. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: use queue_limits_commit_update in queue_discard_max_storeChristoph Hellwig2024-02-131-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert queue_discard_max_store to use queue_limits_commit_update to check and update the max_discard_sectors limit and freeze the queue before doing so to ensure we don't have requests in flight while changing the limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add a max_user_discard_sectors queue limitChristoph Hellwig2024-02-132-12/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new max_user_discard_sectors limit that mirrors max_user_sectors and stores the value that the user manually set. This now allows updates of the max_hw_discard_sectors to not worry about the user limit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: use queue_limits_commit_update in queue_max_sectors_storeChristoph Hellwig2024-02-131-25/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert queue_max_sectors_store to use queue_limits_commit_update to check and update the max_sectors limit and freeze the queue before doing so to ensure we don't have requests in flight while changing the limits. Note that this removes the previously held queue_lock that doesn't protect against any other reader or writer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add an API to atomically update queue limitsChristoph Hellwig2024-02-133-37/+194
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new queue_limits_{start,commit}_update pair of functions that allows taking an atomic snapshot of queue limits, update it, and commit it if it passes validity checking. Also use the low-level validation helper to implement blk_set_default_limits instead of duplicating the initialization. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: decouple blk_set_stacking_limits from blk_set_default_limitsChristoph Hellwig2024-02-131-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | blk_set_stacking_limits uses very little from blk_set_default_limits. Open code these initializations in preparation for rewriting blk_set_default_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: refactor disk_update_readaheadChristoph Hellwig2024-02-131-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Factor out a blk_apply_bdi_limits limits helper that can be used with an explicit queue_limits argument, which will be useful later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: support PI at non-zero offset within metadataKanchan Joshi2024-02-123-14/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Block layer integrity processing assumes that protection information (PI) is placed in the first bytes of each metadata block. Remove this limitation and include the metadata before the PI in the calculation of the guard tag. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Chinmay Gameti <c.gameti@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240201130126.211402-3-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: refactor guard helpersKanchan Joshi2024-02-121-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | Allow computation using the existing guard value. This is a prep patch. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240201130126.211402-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: remove gfp_flags from blkdev_zone_mgmtJohannes Thumshirn2024-02-121-11/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can drop the gfp_mask parameter from blkdev_zone_mgmt() as well as blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated(). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-5-ae3b7c8def61@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Simplify the allocation of slab cachesKunwu Chan2024-02-081-2/+1
| | | | | | | | | | | | | | | | | | | | Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240131094323.146659-1-chentao@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: optimise in irq bio put cachingPavel Begunkov2024-02-081-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When enlisting a bio into ->free_list_irq we protect the list by disabling irqs. It's likely they're already disabled and performance of local_irq_{save,restore}() is decent, but it's not zero cost. Let's only use the irq cache when when we're serving a hard irq, which allows to remove local_irq_{save,restore}(), and fall back to bio_free() in all left cases. Profiles indicate that the bio_put() cost is reduced by ~3.5 times (1.76% -> 0.49%), and total throughput of a CPU bound benchmark improve by around 1% (t/io_uring with high QD and several drives). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/36d207540b7046c653cc16e5ff08fe7234b19f81.1707314970.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: extend bio caching to task contextPavel Begunkov2024-02-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | bio_put_percpu_cache() puts all non-iopoll bios into the irq-safe list, which entails disabling irqs. The overhead of that is not that bad when interrupts are already off but getting worse otherwise. We can optimise it when we're in the task context by using ->free_list directly just as the IOPOLL path does. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4774e1a0f905f96c63174b0f3e4f79f0d9b63246.1707314970.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-throttle: Eliminate redundant checks for data directionTang Yizhou2024-02-051-2/+2
| | | | | | | | | | | | | | | | | | | | After calling throtl_peek_queued(), the data direction can be determined so there is no need to call bio_data_dir() to check the direction again. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240123081248.3752878-1-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>