summaryrefslogtreecommitdiffstats
path: root/block
Commit message (Collapse)AuthorAgeFilesLines
* block: Deny writable memory mapping if block is read-onlyLoic Poulain2023-05-191-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | User should not be able to write block device if it is read-only at block level (e.g force_ro attribute). This is ensured in the regular fops write operation (blkdev_write_iter) but not when writing via user mapping (mmap), allowing user to actually write a read-only block device via a PROT_WRITE mapping. Example: This can lead to integrity issue of eMMC boot partition (e.g mmcblk0boot0) which is read-only by default. To fix this issue, simply deny shared writable mapping if the block is readonly. Note: Block remains writable if switch to read-only is performed after the initial mapping, but this is expected behavior according to commit a32e236eb93e ("Partially revert "block: fail op_is_write() requests to read-only partitions"")'. Signed-off-by: Loic Poulain <loic.poulain@linaro.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230510074223.991297-1-loic.poulain@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linuxLinus Torvalds2023-05-066-146/+84
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more block updates from Jens Axboe: - MD pull request via Song: - Improve raid5 sequential IO performance on spinning disks, which fixes a regression since v6.0 (Jan Kara) - Fix bitmap offset types, which fixes an issue introduced in this merge window (Jonathan Derrick) - Cleanup of hweight type used for cgroup writeback (Maxim) - Fix a regression with the "has_submit_bio" changes across partitions (Ming) - Cleanup of QUEUE_FLAG_ADD_RANDOM clearing. We used to set this flag on queues non blk-mq queues, and hence some drivers clear it unconditionally. Since all of these have since been converted to true blk-mq drivers, drop the useless clear as the bit is not set (Chaitanya) - Fix the flags being set in a bio for a flush for drbd (Christoph) - Cleanup and deduplication of the code handling setting block device capacity (Damien) - Fix for ublk handling IO timeouts (Ming) - Fix for a regression in blk-cgroup teardown (Tao) - NBD documentation and code fixes (Eric) - Convert blk-integrity to using device_attributes rather than a second kobject to manage lifetimes (Thomas) * tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux: ublk: add timeout handler drbd: correctly submit flush bio on barrier mailmap: add mailmap entries for Jens Axboe block: Skip destroyed blkg when restart in blkg_destroy_all() writeback: fix call of incorrect macro md: Fix bitmap offset type in sb writer md/raid5: Improve performance for sequential IO docs nbd: userspace NBD now favors github over sourceforge block nbd: use req.cookie instead of req.handle uapi nbd: add cookie alias to handle uapi nbd: improve doc links to userspace spec blk-integrity: register sysfs attributes on struct device blk-integrity: convert to struct device_attribute blk-integrity: use sysfs_emit block/drivers: remove dead clear of random flag block: sync part's ->bd_has_submit_bio with disk's block: Cleanup set_capacity()/bdev_set_nr_sectors()
| * block: Skip destroyed blkg when restart in blkg_destroy_all()Tao Su2023-04-281-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel hang in blkg_destroy_all() when total blkg greater than BLKG_DESTROY_BATCH_SIZE, because of not removing destroyed blkg in blkg_list. So the size of blkg_list is same after destroying a batch of blkg, and the infinite 'restart' occurs. Since blkg should stay on the queue list until blkg_free_workfn(), skip destroyed blkg when restart a new round, which will solve this kernel hang issue and satisfy the previous will to restart. Reported-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Tao Su <tao1.su@linux.intel.com> Fixes: f1c006f1c685 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()") Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230428045149.1310073-1-tao1.su@linux.intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-integrity: register sysfs attributes on struct deviceThomas Weißschuh2023-04-263-69/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "integrity" kobject only acted as a holder for static sysfs entries. It also was embedded into struct gendisk without managing it, violating assumptions of the driver core. Instead register the sysfs entries directly onto the struct device. Also drop the now unused member integrity_kobj from struct gendisk. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-3-ceccb4493c46@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-integrity: convert to struct device_attributeThomas Weißschuh2023-04-261-65/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | An upcoming patch will register the integrity attributes directly with the struct device kobject. For this the attributes have to be implemented in terms of struct device_attribute. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-2-ceccb4493c46@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-integrity: use sysfs_emitThomas Weißschuh2023-04-261-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | The correct way to emit data into sysfs is via sysfs_emit(), use it. Also perform some trivial syntactic cleanups. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230309-kobj_release-gendisk_integrity-v3-1-ceccb4493c46@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: sync part's ->bd_has_submit_bio with disk'sMing Lei2023-04-251-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | submit_bio() always uses bio->bi_bdev->bd_has_submit_bio to decide if disk's ->submit_bio() is called, and bio->bi_bdev could point to one partition device. So we have to sync part bdev's ->bd_has_submit_bio with disk's. Reported-by: Changhui Zhong <czhong@redhat.com> Link: https://lore.kernel.org/linux-block/ZEdItaPqif8fp85H@ovpn-8-24.pek2.redhat.com/T/#t Fixes: 9f4107b07b17 ("block: store bdev->bd_disk->fops->submit_bio state in bdev") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230425034154.110099-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Cleanup set_capacity()/bdev_set_nr_sectors()Damien Le Moal2023-04-244-14/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code for setting a block device capacity (bd_nr_sectors field of struct block_device) is duplicated in set_capacity() and bdev_set_nr_sectors(). Clean this up by making bdev_set_nr_sectors() a block layer internal function defined in block/bdev.c instead of having this function statically defined in block/partitions/core.c. With this change, set_capacity() implementation can be simplified to only calling bdev_set_nr_sectors(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230424131318.79935-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'driver-core-6.4-rc1' of ↵Linus Torvalds2023-04-272-14/+8
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the large set of driver core changes for 6.4-rc1. Once again, a busy development cycle, with lots of changes happening in the driver core in the quest to be able to move "struct bus" and "struct class" into read-only memory, a task now complete with these changes. This will make the future rust interactions with the driver core more "provably correct" as well as providing more obvious lifetime rules for all busses and classes in the kernel. The changes required for this did touch many individual classes and busses as many callbacks were changed to take const * parameters instead. All of these changes have been submitted to the various subsystem maintainers, giving them plenty of time to review, and most of them actually did so. Other than those changes, included in here are a small set of other things: - kobject logging improvements - cacheinfo improvements and updates - obligatory fw_devlink updates and fixes - documentation updates - device property cleanups and const * changes - firwmare loader dependency fixes. All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits) device property: make device_property functions take const device * driver core: update comments in device_rename() driver core: Don't require dynamic_debug for initcall_debug probe timing firmware_loader: rework crypto dependencies firmware_loader: Strip off \n from customized path zram: fix up permission for the hot_add sysfs file cacheinfo: Add use_arch[|_cache]_info field/function arch_topology: Remove early cacheinfo error message if -ENOENT cacheinfo: Check cache properties are present in DT cacheinfo: Check sib_leaf in cache_leaves_are_shared() cacheinfo: Allow early level detection when DT/ACPI info is missing/broken cacheinfo: Add arm64 early level initializer implementation cacheinfo: Add arch specific early level initializer tty: make tty_class a static const structure driver core: class: remove struct class_interface * from callbacks driver core: class: mark the struct class in struct class_interface constant driver core: class: make class_register() take a const * driver core: class: mark class_release() as taking a const * driver core: remove incorrect comment for device_create* MIPS: vpe-cmp: remove module owner pointer from struct class usage. ...
| * \ Merge 6.3-rc5 into driver-core-nextGreg Kroah-Hartman2023-04-036-27/+22
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | We need the fixes in here for testing, as well as the driver core changes for documentation updates to build on. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | driver core: class: remove dev_kobj from struct classGreg Kroah-Hartman2023-03-311-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The dev_kobj field in struct class is now only written to, but never read from, so it can be removed as it is useless. Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20230331093318.82288-5-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | driver core: class: remove module * from class_create()Greg Kroah-Hartman2023-03-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The module pointer in class_create() never actually did anything, and it shouldn't have been requred to be set as a parameter even if it did something. So just remove it and fix up all callers of the function in the kernel tree at the same time. Cc: "Rafael J. Wysocki" <rafael@kernel.org> Acked-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Link: https://lore.kernel.org/r/20230313181843.1207845-4-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | driver core: remove CONFIG_SYSFS_DEPRECATED and CONFIG_SYSFS_DEPRECATED_V2Greg Kroah-Hartman2023-03-061-12/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CONFIG_SYSFS_DEPRECATED was added in commit 88a22c985e35 ("CONFIG_SYSFS_DEPRECATED") in 2006 to allow systems with older versions of some tools (i.e. Fedora 3's version of udev) to boot properly. Four years later, in 2010, the option was attempted to be removed as most of userspace should have been fixed up properly by then, but some kernel developers clung to those old systems and refused to update, so we added CONFIG_SYSFS_DEPRECATED_V2 in commit e52eec13cd6b ("SYSFS: Allow boot time switching between deprecated and modern sysfs layout") to allow them to continue to boot properly, and we allowed a boot time parameter to be used to switch back to the old format if needed. Over time, the logic that was covered under these config options was slowly removed from individual driver subsystems successfully, removed, and the only thing that is now left in the kernel are some changes in the block layer's representation in sysfs where real directories are used instead of symlinks like normal. Because the original changes were done to userspace tools in 2006, and all distros that use those tools are long end-of-life, and older non-udev-based systems do not care about the block layer's sysfs representation, it is time to finally remove this old logic and the config entries from the kernel. Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: linux-block@vger.kernel.org Cc: linux-doc@vger.kernel.org Acked-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20230223073326.2073220-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linuxLinus Torvalds2023-04-2640-1069/+920
|\ \ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - drbd patches, bringing us closer to unifying the out-of-tree version and the in tree one (Andreas, Christoph) - support for auto-quiesce for the s390 dasd driver (Stefan) - MD pull request via Song: - md/bitmap: Optimal last page size (Jon Derrick) - Various raid10 fixes (Yu Kuai, Li Nan) - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk) - NVMe pull request via Christoph: - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - Validate nvmet module parameters (Chaitanya Kulkarni) - Fence TCP socket on receive error (Chris Leech) - Fix async event trace event (Keith Busch) - Minor cleanups (Chaitanya Kulkarni, zhenwei pi) - Fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - Fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - Fix irq locking in nvme-fcloop (Ming Lei) - Remove queue mapping helper for rdma devices (Sagi Grimberg) - use structured request attribute checks for nbd (Jakub) - fix blk-crypto race conditions between keyslot management (Eric) - add sed-opal support for reading read locking range attributes (Ondrej) - make fault injection configurable for null_blk (Akinobu) - clean up the request insertion API (Christoph) - clean up the queue running API (Christoph) - blkg config helper cleanups (Tejun) - lazy init support for blk-iolatency (Tejun) - various fixes and tweaks to ublk (Ming) - remove hybrid polling. It hasn't really been useful since we got async polled IO support, and these days we don't support sync polled IO at all (Keith) - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming, Chaitanya, me) * tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits) nbd: fix incomplete validation of ioctl arg ublk: don't return 0 in case of any failure sed-opal: geometry feature reporting command null_blk: Always check queue mode setting from configfs block: ublk: switch to ioctl command encoding blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush block, bfq: Fix division by zero error on zero wsum fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m block: store bdev->bd_disk->fops->submit_bio state in bdev block: re-arrange the struct block_device fields for better layout md/raid5: remove unused working_disks variable md/raid10: don't call bio_start_io_acct twice for bio which experienced read error md/raid10: fix memleak of md thread md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: don't BUG_ON() in raise_barrier() md: fix soft lockup in status_resync md: add error_handlers for raid0 and linear md: Use optimal I/O size for last bitmap page md: Fix types in sb writer ...
| * | | sed-opal: geometry feature reporting commandOndrej Kozina2023-04-191-1/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Locking range start and locking range length attributes may be require to satisfy restrictions exposed by OPAL2 geometry feature reporting. Geometry reporting feature is described in TCG OPAL SSC, section 3.1.1.4 (ALIGN, LogicalBlockSize, AlignmentGranularity and LowestAlignedLBA). 4.3.5.2.1.1 RangeStart Behavior: [ StartAlignment = (RangeStart modulo AlignmentGranularity) - LowestAlignedLBA ] When processing a Set method or CreateRow method on the Locking table for a non-Global Range row, if: a) the AlignmentRequired (ALIGN above) column in the LockingInfo table is TRUE; b) RangeStart is non-zero; and c) StartAlignment is non-zero, then the method SHALL fail and return an error status code INVALID_PARAMETER. 4.3.5.2.1.2 RangeLength Behavior: If RangeStart is zero, then [ LengthAlignment = (RangeLength modulo AlignmentGranularity) - LowestAlignedLBA ] If RangeStart is non-zero, then [ LengthAlignment = (RangeLength modulo AlignmentGranularity) ] When processing a Set method or CreateRow method on the Locking table for a non-Global Range row, if: a) the AlignmentRequired (ALIGN above) column in the LockingInfo table is TRUE; b) RangeLength is non-zero; and c) LengthAlignment is non-zero, then the method SHALL fail and return an error status code INVALID_PARAMETER In userspace we stuck to logical block size reported by general block device (via sysfs or ioctl), but we can not read 'AlignmentGranularity' or 'LowestAlignedLBA' anywhere else and we need to get those values from sed-opal interface otherwise we will not be able to report or avoid locking range setup INVALID_PARAMETER errors above. Signed-off-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Tested-by: Milan Broz <gmazyland@gmail.com> Link: https://lore.kernel.org/r/20230411090931.9193-2-okozina@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flushChristoph Hellwig2023-04-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b12e5c6c755a accidentally changes blk_kick_flush to do a head insert into the requeue list, fix this up. Fixes: b12e5c6c755a ("blk-mq: pass a flags argument to blk_mq_add_to_requeue_list") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230416073553.966161-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block, bfq: Fix division by zero error on zero wsumColin Ian King2023-04-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the weighted sum is zero the calculation of limit causes a division by zero error. Fix this by continuing to the next level. This was discovered by running as root: stress-ng --ioprio 0 Fixes divison by error oops: [ 521.450556] divide error: 0000 [#1] SMP NOPTI [ 521.450766] CPU: 2 PID: 2684464 Comm: stress-ng-iopri Not tainted 6.2.1-1280.native #1 [ 521.451117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 [ 521.451627] RIP: 0010:bfqq_request_over_limit+0x207/0x400 [ 521.451875] Code: 01 48 8d 0c c8 74 0b 48 8b 82 98 00 00 00 48 8d 0c c8 8b 85 34 ff ff ff 48 89 ca 41 0f af 41 50 48 d1 ea 48 98 48 01 d0 31 d2 <48> f7 f1 41 39 41 48 89 85 34 ff ff ff 0f 8c 7b 01 00 00 49 8b 44 [ 521.452699] RSP: 0018:ffffb1af84eb3948 EFLAGS: 00010046 [ 521.452938] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000 [ 521.453262] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1af84eb3978 [ 521.453584] RBP: ffffb1af84eb3a30 R08: 0000000000000001 R09: ffff8f88ab8a4ba0 [ 521.453905] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8f88ab8a4b18 [ 521.454224] R13: ffff8f8699093000 R14: 0000000000000001 R15: ffffb1af84eb3970 [ 521.454549] FS: 00005640b6b0b580(0000) GS:ffff8f88b3880000(0000) knlGS:0000000000000000 [ 521.454912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 521.455170] CR2: 00007ffcbcae4e38 CR3: 00000002e46de001 CR4: 0000000000770ee0 [ 521.455491] PKRU: 55555554 [ 521.455619] Call Trace: [ 521.455736] <TASK> [ 521.455837] ? bfq_request_merge+0x3a/0xc0 [ 521.456027] ? elv_merge+0x115/0x140 [ 521.456191] bfq_limit_depth+0xc8/0x240 [ 521.456366] __blk_mq_alloc_requests+0x21a/0x2c0 [ 521.456577] blk_mq_submit_bio+0x23c/0x6c0 [ 521.456766] __submit_bio+0xb8/0x140 [ 521.457236] submit_bio_noacct_nocheck+0x212/0x300 [ 521.457748] submit_bio_noacct+0x1a6/0x580 [ 521.458220] submit_bio+0x43/0x80 [ 521.458660] ext4_io_submit+0x23/0x80 [ 521.459116] ext4_do_writepages+0x40a/0xd00 [ 521.459596] ext4_writepages+0x65/0x100 [ 521.460050] do_writepages+0xb7/0x1c0 [ 521.460492] __filemap_fdatawrite_range+0xa6/0x100 [ 521.460979] file_write_and_wait_range+0xbf/0x140 [ 521.461452] ext4_sync_file+0x105/0x340 [ 521.461882] __x64_sys_fsync+0x67/0x100 [ 521.462305] ? syscall_exit_to_user_mode+0x2c/0x1c0 [ 521.462768] do_syscall_64+0x3b/0xc0 [ 521.463165] entry_SYSCALL_64_after_hwframe+0x5a/0xc4 [ 521.463621] RIP: 0033:0x5640b6c56590 [ 521.464006] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 71 70 0e 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20230413133009.1605335-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | block: store bdev->bd_disk->fops->submit_bio state in bdevJens Axboe2023-04-163-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a long chain of memory dereferencing just to whether or not this disk has a special submit_bio helper. As that's not necessarily the common case, add a bd_has_submit_bio state in the bdev to avoid traversing this memory dependency chain if we don't need to. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | Merge tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme into ↵Jens Axboe2023-04-143-50/+0
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for-6.4/block Pull NVMe updates from Christoph: "nvme updates for Linux 6.4 - drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - validate nvmet module parameters (Chaitanya Kulkarni) - fence TCP socket on receive error (Chris Leech) - fix async event trace event (Keith Busch) - minor cleanups (Chaitanya Kulkarni, zhenwei pi) - fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - fix irq locking in nvme-fcloop (Ming Lei) - remove queue mapping helper for rdma devices (Sagi Grimberg)" * tag 'nvme-6.4-2023-04-14' of git://git.infradead.org/nvme: nvme-fcloop: fix "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage" blk-mq-rdma: remove queue mapping helper for rdma devices nvme-rdma: minor cleanup in nvme_rdma_create_cq() nvme: fix double blk_mq_complete_request for timeout request with low probability nvme: fix async event trace event nvme-apple: return directly instead of else nvme-apple: return directly instead of else nvmet-tcp: validate idle poll modparam value nvmet-tcp: validate so_priority modparam value nvme-tcp: fence TCP socket on receive error nvmet: remove nvmet_req_cns_error_complete nvmet: rename nvmet_execute_identify_cns_cs_ns nvmet: fix Identify Identification Descriptor List handling nvmet: cleanup nvmet_execute_identify() nvmet: fix I/O Command Set specific Identify Controller nvmet: fix Identify Active Namespace ID list handling nvmet: fix Identify Controller handling nvmet: fix Identify Namespace handling nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns() nvme-pci: drop redundant pci_enable_pcie_error_reporting()
| | * | | blk-mq-rdma: remove queue mapping helper for rdma devicesSagi Grimberg2023-04-133-50/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No rdma device exposes its irq vectors affinity today. So the only mapping that we have left, is the default blk_mq_map_queues, which we fallback to anyways. Also fixup the only consumer of this helper (nvme-rdma). Remove this now dead code. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | | blk-mq: remove __blk_mq_run_hw_queueChristoph Hellwig2023-04-131-20/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __blk_mq_run_hw_queue just contains a WARN_ON_ONCE for calls from interrupt context and a blk_mq_run_dispatch_ops-protected call to blk_mq_sched_dispatch_requests. Open code the call to blk_mq_sched_dispatch_requests in both callers, and move the WARN_ON_ONCE to blk_mq_run_hw_queue where it can be extended to all !async calls, while the other call is from workqueue context and thus obviously does not need the assert. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: move the !async handling out of __blk_mq_delay_run_hw_queueChristoph Hellwig2023-04-131-27/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only blk_mq_run_hw_queue can call __blk_mq_delay_run_hw_queue with async=false, so move the handling there. With this __blk_mq_delay_run_hw_queue can be merged into blk_mq_delay_run_hw_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: move the blk_mq_hctx_stopped check in __blk_mq_delay_run_hw_queueChristoph Hellwig2023-04-131-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the in-context dispatch, blk_mq_hctx_stopped is alredy checked in blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection. For the async dispatch case having a check before scheduling the work still makes sense to avoid needless workqueue scheduling, so just keep it for that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: remove the blk_mq_hctx_stopped check in blk_mq_run_work_fnChristoph Hellwig2023-04-131-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection, so remove the duplicate check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: cleanup __blk_mq_sched_dispatch_requestsChristoph Hellwig2023-04-131-17/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __blk_mq_sched_dispatch_requests currently has duplicated logic for the cases where requests are on the hctx dispatch list or not. Merge the two with a new need_dispatch variable and remove a few pointless local variables. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: pass a flags argument to blk_mq_add_to_requeue_listChristoph Hellwig2023-04-133-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the boolean at_head argument with the same flags that are already passed to blk_mq_insert_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-21-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: pass a flags argument to elevator_type->insert_requestsChristoph Hellwig2023-04-135-18/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of passing a bool at_head, pass down the full flags from the blk_mq_insert_request interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: pass a flags argument to blk_mq_request_bypass_insertChristoph Hellwig2023-04-133-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the boolean at_head argument with the same flags that are already passed to blk_mq_insert_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: pass a flags argument to blk_mq_insert_requestChristoph Hellwig2023-04-132-13/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the at_head bool with a flags argument that so far only contains a single BLK_MQ_INSERT_AT_HEAD value. This makes it much easier to grep for head insertions into the blk-mq dispatch queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: don't kick the requeue_list in blk_mq_add_to_requeue_listChristoph Hellwig2023-04-133-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_add_to_requeue_list takes a bool parameter to control how to kick the requeue list at the end of the function. Move the call to blk_mq_kick_requeue_list to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: don't run the hw_queue from blk_mq_request_bypass_insertChristoph Hellwig2023-04-133-16/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_request_bypass_insert takes a bool parameter to control how to run the queue at the end of the function. Move the blk_mq_run_hw_queue call to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: don't run the hw_queue from blk_mq_insert_requestChristoph Hellwig2023-04-131-24/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_insert_request takes two bool parameters to control how to run the queue at the end of the function. Move the blk_mq_run_hw_queue call to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: fold __blk_mq_try_issue_directly into its two callersChristoph Hellwig2023-04-131-41/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to the wildly different behavior based on the bypass_insert argument, not a whole lot of code in __blk_mq_try_issue_directly is actually shared between blk_mq_try_issue_directly and blk_mq_request_issue_directly. Remove __blk_mq_try_issue_directly and fold the code into the two callers instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: factor out a blk_mq_get_budget_and_tag helperChristoph Hellwig2023-04-131-10/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Factor out a helper from __blk_mq_try_issue_directly in preparation of folding that function into its two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: refactor the DONTPREP/SOFTBARRIER andling in blk_mq_requeue_workChristoph Hellwig2023-04-131-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split the RQF_DONTPREP and RQF_SOFTBARRIER in separate branches to make the code more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: refactor passthrough vs flush handling in blk_mq_insert_requestChristoph Hellwig2023-04-131-32/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While both passthrough and flush requests call directly into blk_mq_request_bypass_insert, the parameters aren't the same. Split the handling into two separate conditionals and turn the whole function into an if/elif/elif/else flow instead of the gotos. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: remove blk_flush_queue_rqChristoph Hellwig2023-04-131-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just call blk_mq_add_to_requeue_list directly from the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: fold __blk_mq_insert_req_list into blk_mq_insert_requestChristoph Hellwig2023-04-131-18/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove this very small helper and fold it into the only caller. Note that this moves the trace_block_rq_insert out of ctx->lock, matching the other calls to this tracepoint. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: fold __blk_mq_insert_request into blk_mq_insert_requestChristoph Hellwig2023-04-132-14/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no good point in keeping the __blk_mq_insert_request around for two function calls and a singler caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: move blk_mq_sched_insert_request to blk-mq.cChristoph Hellwig2023-04-134-83/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_sched_insert_request is the main request insert helper and not directly I/O scheduler related. Move blk_mq_sched_insert_request to blk-mq.c, rename it to blk_mq_insert_request and mark it static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: fold blk_mq_sched_insert_requests into blk_mq_dispatch_plug_listChristoph Hellwig2023-04-135-34/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_dispatch_plug_list is the only caller of blk_mq_sched_insert_requests, and it makes sense to just fold it there as blk_mq_sched_insert_requests isn't specific to I/O schedulers despite the name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: move more logic into blk_mq_insert_requestsChristoph Hellwig2023-04-133-20/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move all logic related to the direct insert (including the call to blk_mq_run_hw_queue) into blk_mq_insert_requests to streamline the code flow up a bit, and to allow marking blk_mq_try_issue_list_directly static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: include <linux/blk-mq.h> in block/blk-mq.hChristoph Hellwig2023-04-1315-14/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | block/blk-mq.h needs various definitions from <linux/blk-mq.h>, include it there instead of relying on the source files to include both. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: remove blk-mq-tag.hChristoph Hellwig2023-04-1313-85/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk-mq-tag.h is always included by blk-mq.h, and causes recursive inclusion hell with further changes. Just merge it into blk-mq.h instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-mq: don't plug for head insertions in blk_execute_rq_nowaitChristoph Hellwig2023-04-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Plugs never insert at head, so don't plug for head insertions. Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-throttle: only enable blk-stat when BLK_DEV_THROTTLING_LOWChengming Zhou2023-04-131-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_throtl_register() will unconditionally enable blk-stat for gendisk when register, even when we have no BLK_DEV_THROTTLING_LOW config. Since the kernel always has only BLK_DEV_THROTTLING config and the BLK_DEV_THROTTLING_LOW config is still in EXPERIMENTAL state, we can just skip blk-stat when !BLK_DEV_THROTTLING_LOW. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230413062805.2081970-2-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-stat: fix QUEUE_FLAG_STATS clearChengming Zhou2023-04-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to set QUEUE_FLAG_STATS for two cases: 1. blk_stat_enable_accounting() 2. blk_stat_add_callback() So we should clear it only when ((q->stats->accounting == 0) && list_empty(&q->stats->callbacks)). blk_stat_disable_accounting() only check if q->stats->accounting is 0 before clear the flag, this patch fix it. Also add list_empty(&q->stats->callbacks)) check when enable, or the flag is already set. The bug can be reproduced on kernel without BLK_DEV_THROTTLING (since it unconditionally enable accounting, see the next patch). # cat /sys/block/sr0/queue/scheduler none mq-deadline [bfq] # cat /sys/kernel/debug/block/sr0/state SAME_COMP|IO_STAT|INIT_DONE|STATS|REGISTERED|NOWAIT|30 # echo none > /sys/block/sr0/queue/scheduler # cat /sys/kernel/debug/block/sr0/state SAME_COMP|IO_STAT|INIT_DONE|REGISTERED|NOWAIT # cat /sys/block/sr0/queue/wbt_lat_usec 75000 We can see that after changing elevator from "bfq" to "none", "STATS" flag is lost even though WBT callback still need it. Fixes: 68497092bde9 ("block: make queue stat accounting a reference") Cc: <stable@vger.kernel.org> # v5.17+ Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230413062805.2081970-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-iolatency: Make initialization lazyTejun Heo2023-04-133-15/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Other rq_qos policies such as wbt and iocost are lazy-initialized when they are configured for the first time for the device but iolatency is initialized unconditionally from blkcg_init_disk() during gendisk init. Lazy init is beneficial because rq_qos policies add runtime overhead when initialized as every IO has to walk all registered rq_qos callbacks. This patch switches iolatency to lazy initialization too so that it only registered its rq_qos policy when it is first configured. Note that there is a known race condition between blkcg config file writes and del_gendisk() and this patch makes iolatency susceptible to it by exposing the init path to race against the deletion path. However, that problem already exists in iocost and is being worked on. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20230413000649.115785-5-tj@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blk-iolatency: s/blkcg_rq_qos/iolat_rq_qos/Tejun Heo2023-04-132-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The name was too generic given that there are multiple blkcg rq-qos policies. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20230413000649.115785-4-tj@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | | blkcg: Restructure blkg_conf_prep() and friendsTejun Heo2023-04-136-78/+127
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to support lazy init of rq-qos policies so that iolatency is enabled lazily on configuration instead of gendisk initialization. The way blkg config helpers are structured now is a bit awkward for that. Let's restructure: * blkcg_conf_open_bdev() is renamed to blkg_conf_open_bdev(). The blkcg_ prefix was used because the bdev opening step is blkg-independent. However, the distinction is too subtle and confuses more than helps. Let's switch to blkg prefix so that it's consistent with the type and other helper names. * struct blkg_conf_ctx now remembers the original input string and is always initialized by the new blkg_conf_init(). * blkg_conf_open_bdev() is updated to take a pointer to blkg_conf_ctx like blkg_conf_prep() and can be called multiple times safely. Instead of modifying the double pointer to input string directly, blkg_conf_open_bdev() now sets blkg_conf_ctx->body. * blkg_conf_finish() is renamed to blkg_conf_exit() for symmetry and now must be called on all blkg_conf_ctx's which were initialized with blkg_conf_init(). Combined, this allows the users to either open the bdev first or do it altogether with blkg_conf_prep() which will help implementing lazy init of rq-qos policies. blkg_conf_init/exit() will also be used implement synchronization against device removal. This is necessary because iolat / iocost are configured through cgroupfs instead of one of the files under /sys/block/DEVICE. As cgroupfs operations aren't synchronized with block layer, the lazy init and other configuration operations may race against device removal. This patch makes blkg_conf_init/exit() used consistently for all cgroup-orginating configurations making them a good place to implement explicit synchronization. Users are updated accordingly. No behavior change is intended by this patch. v2: bfq wasn't updated in v1 causing a build error. Fixed. v3: Update the description to include future use of blkg_conf_init/exit() as synchronization points. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Yu Kuai <yukuai1@huaweicloud.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230413000649.115785-3-tj@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>