summaryrefslogtreecommitdiffstats
path: root/drivers
Commit message (Collapse)AuthorAgeFilesLines
* bcache: use the default discard granularityChristoph Hellwig2023-12-291-1/+0
| | | | | | | | | The discard granularity now defaults to a single sector, so don't set that value explicitly. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* zram: use the default discard granularityChristoph Hellwig2023-12-291-1/+0
| | | | | | | | | The discard granularity now defaults to a single sector, so don't set that value explicitly. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* null_blk: use the default discard granularityChristoph Hellwig2023-12-291-1/+0
| | | | | | | | | The discard granularity now defaults to a single sector, so don't set that value explicitly. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nbd: use the default discard granularityChristoph Hellwig2023-12-291-5/+1
| | | | | | | | | | The discard granularity now defaults to a single sector, so don't set that value explicitly. Also don't bother clearing it as a discard granularity without discard_sectors doesn't mean anything. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: discard_granularity should not be smaller than a sectorChristoph Hellwig2023-12-291-1/+1
| | | | | | | | | | | Just like all block I/O, discards are in units of sectors. Thus setting a smaller than sector size discard limit in case of > 512 byte sectors in bcache doesn't make sense. Always set the discard granularity to 512 bytes instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: rename and document BLK_DEF_MAX_SECTORSChristoph Hellwig2023-12-271-1/+1
| | | | | | | | Give BLK_DEF_MAX_SECTORS a _CAP postfix and document what it is used for. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231227092305.279567-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* loop: don't abuse BLK_DEF_MAX_SECTORSChristoph Hellwig2023-12-271-1/+2
| | | | | | | | | | BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for the max_sectors limits. Don't use it to initialize max_hw_setors, which is a hardware / driver capacility. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231227092305.279567-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* aoe: don't abuse BLK_DEF_MAX_SECTORSChristoph Hellwig2023-12-271-1/+2
| | | | | | | | | | BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for the max_sectors limits. Don't use it to initialize max_hw_setors, which is a hardware / driver capacility. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231227092305.279567-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORSChristoph Hellwig2023-12-271-10/+2
| | | | | | | | | | | | | | | null_blk has some rather odd capping of the max_hw_sectors value to BLK_DEF_MAX_SECTORS, which doesn't make sense - max_hw_sector is the hardware limit, and BLK_DEF_MAX_SECTORS despite the confusing name is the default cap for the max_sectors field used for normal file system I/O. Remove all the capping, and simply leave it to the block layer or user to take up or not all of that for file system I/O. Fixes: ea17fd354ca8 ("null_blk: Allow controlling max_hw_sectors limit") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231227092305.279567-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* loop: don't update discard limits from loop_set_statusChristoph Hellwig2023-12-271-2/+0
| | | | | | | | | loop_set_status doesn't change anything relevant to the discard and write_zeroes setting, so don't bother calling loop_config_discard. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231227082020.249427-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* drbd: actlog: fix kernel-doc warnings and spellingRandy Dunlap2023-12-221-6/+10
| | | | | | | | | | | | | | | | | | | | | | Fix all kernel-doc warnings in drbd_actlog.c: drbd_actlog.c:963: warning: No description found for return value of 'drbd_rs_begin_io' drbd_actlog.c:1015: warning: Function parameter or member 'peer_device' not described in 'drbd_try_rs_begin_io' drbd_actlog.c:1015: warning: Excess function parameter 'device' description in 'drbd_try_rs_begin_io' drbd_actlog.c:1015: warning: No description found for return value of 'drbd_try_rs_begin_io' drbd_actlog.c:1197: warning: No description found for return value of 'drbd_rs_del_all' Fix one spelling error (s/ore/or/). Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Cc: <drbd-dev@lists.linbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <linux-block@vger.kernel.org> Link: https://lore.kernel.org/r/20231222061909.8791-1-rdunlap@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme into ↵Jens Axboe2023-12-2111-156/+277
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for-6.8/block Pull NVMe updates from Keith: "nvme updates for Linux 6.8 - nvme fabrics spec updates (Guixin, Max) - nvme target udpates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith)" * tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme: nvme-fc: set numa_node after nvme_init_ctrl nvme-fabrics: don't check discovery ioccsz/iorcsz nvmet: configfs: use ctrl->instance to track passthru subsystems nvme: repack struct nvme_ns_head nvme: add csi, ms and nuse to sysfs nvme: rename ns attribute group nvme: refactor ns info setup function nvme: refactor ns info helpers nvme: move ns id info to struct nvme_ns_head nvmet: remove cntlid_min and cntlid_max check in nvmet_alloc_ctrl nvmet: allow identical cntlid_min and cntlid_max settings nvme-fabrics: check ioccsz and iorcsz nvme: introduce nvme_check_ctrl_fabric_info helper
| * nvme-fc: set numa_node after nvme_init_ctrlKeith Busch2023-12-211-4/+2
| | | | | | | | | | | | | | | | | | | | | | nvme_init_ctrl() resets numa_node to NUMA_NO_NODE, so be sure to set the desired value after that function call so it won't be overwritten. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme-fabrics: don't check discovery ioccsz/iorcszMax Gurtovoy2023-12-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | IOCCSZ and IORCSZ are reserved for discovery controllers. Avoid checking their values during identify controller phase. Fixes: 2fcd3ab39826 ("nvme-fabrics: check ioccsz and iorcsz") Reported-by: Daniel Wagner <dwagner@suse.de> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvmet: configfs: use ctrl->instance to track passthru subsystemsEvan Burgess2023-12-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To prevent enabling more than one passthrough subsystem per NVMe controller, passthru.c maintains an xarray indexed by cntlid values. Passthrough for a given nvmet subsystem cannot be enabled by configfs if the subsystem's passthru_ctrl->cntlid value is already accounted for in the xarray. However, according to the NVMe spec (rev 2.0c, p.145), "The Controller ID (CNTLID) value returned in the Identify Controller data structure may be used to uniquely identify a controller within an NVM subsystem," meaning that cntlid values are not guaranteed to be globally unique across multiple subsystems. Instead, the cntlid only uniquely identifies multiple controllers _within_ a subsystem. As a result, multiple unique & valid NVMe targets can be blocked from enabling passthrough at the same time if their controllers share cntlid values, a behavior allowed by the spec. Fix this by indexing the xarray with passthru_ctrl->instance values, which are allocated per controller by IDA and thus should be truly unique. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Evan Burgess <evan.burgess@seagate.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: repack struct nvme_ns_headDaniel Wagner2023-12-191-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ns_id, lba_shift and ms are always accessed for every read/write I/O in nvme_setup_rw. By grouping these variables into one cacheline we can safe some cycles. 4k sequential reads: baseline patched Bandwidth: 1620 1634 IOPs 66345579 66910939 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: add csi, ms and nuse to sysfsDaniel Wagner2023-12-193-1/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | libnvme is using the sysfs for enumarating the nvme resources. Though there are few missing attritbutes in the sysfs. For these libnvme issues commands during discovering. As the kernel already knows all these attributes and we would like to avoid libnvme to issue commands all the time, expose these missing attributes. The nuse value is updated on request because the nuse is a volatile value. Since any user can read the sysfs attribute, a very simple rate limit is added (update once every 5 seconds). A more sophisticated update strategy can be added later if there is actually a need for it. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: rename ns attribute groupDaniel Wagner2023-12-194-10/+10
| | | | | | | | | | | | | | | | | | | | | | Drop the 'id' part of the attribute group name because we want to expose non 'id' related attributes via the ns attribute group. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: refactor ns info setup functionDaniel Wagner2023-12-192-60/+62
| | | | | | | | | | | | | | | | | | | | | | Use nvme_ns_head instead of nvme_ns where possible. This reduces the coupling between the different data structures. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: refactor ns info helpersDaniel Wagner2023-12-194-28/+34
| | | | | | | | | | | | | | | | | | | | | | | | Pass in the nvme_ns_head pointer directly. This reduces the necessity on the caller side have the nvme_ns data structure present. Thus we can refactor the caller side in the next step as well. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: move ns id info to struct nvme_ns_headDaniel Wagner2023-12-195-66/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the namesapce info to struct nvme_ns_head, because it's the same for all associated namespaces. Note: with multipathing enabled the PI information is shared between all paths. If a path is using a different PI configuration it will overwrite the previous settings. This is obviously not correct and such configuration will be rejected in future. For the time being we expect a correctly configured storage. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvmet: remove cntlid_min and cntlid_max check in nvmet_alloc_ctrlGuixin Liu2023-12-131-3/+0
| | | | | | | | | | | | | | | | | | | | The cntlid_min and cntlid_max are checked in configfs, don't check again in nvmet_alloc_ctrl(). Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvmet: allow identical cntlid_min and cntlid_max settingsGuixin Liu2023-12-131-2/+2
| | | | | | | | | | | | | | | | | | | | When the user wants to restrict to only creating one controller, they can set cntlid_min and cntlid_max to the same value. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme-fabrics: check ioccsz and iorcszGuixin Liu2023-12-061-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure that ioccsz and iorcsz returned by target are correct before use it. Per 2.0a base NVMe spec: I/O Queue Command Capsule Supported Size (IOCCSZ): This field defines the maximum I/O command capsule size in 16 byte units. The minimum value that shall be indicated is 4 corresponding to 64 bytes. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
| * nvme: introduce nvme_check_ctrl_fabric_info helperGuixin Liu2023-12-061-18/+24
| | | | | | | | | | | | | | | | | | | | Inroduce nvme_check_ctrl_fabric_info helper to check fabric controller info returned by target. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
* | sd: only call disk_clear_zoned when neededChristoph Hellwig2023-12-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | disk_clear_zoned only needs to be called when a device reported zone managed mode first and we clear it. Add a check so that disk_clear_zoned isn't called on devices that were never zoned. This avoids a fairly expensive queue freezing when revalidating conventional devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: simplify disk_set_zonedChristoph Hellwig2023-12-195-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | Only use disk_set_zoned to actually enable zoned device support. For clearing it, call disk_clear_zoned, which is renamed from disk_clear_zone_settings and now directly clears the zoned flag as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: remove support for the host aware zone modelChristoph Hellwig2023-12-1911-90/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When zones were first added the SCSI and ATA specs, two different models were supported (in addition to the drive managed one that is invisible to the host): - host managed where non-conventional zones there is strict requirement to write at the write pointer, or else an error is returned - host aware where a write point is maintained if writes always happen at it, otherwise it is left in an under-defined state and the sequential write preferred zones behave like conventional zones (probably very badly performing ones, though) Not surprisingly this lukewarm model didn't prove to be very useful and was finally removed from the ZBC and SBC specs (NVMe never implemented it). Due to to the easily disappearing write pointer host software could never rely on the write pointer to actually be useful for say recovery. Fortunately only a few HDD prototypes shipped using this model which never made it to mass production. Drop the support before it is too late. Note that any such host aware prototype HDD can still be used with Linux as we'll now treat it as a conventional HDD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | virtio_blk: remove the broken zone revalidation supportChristoph Hellwig2023-12-191-26/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | virtblk_revalidate_zones is called unconditionally from virtblk_config_changed_work from the virtio config_changed callback. virtblk_revalidate_zones is a bit odd in that it re-clears the zoned state for host aware or non-zoned devices, which isn't needed unless the zoned mode changed - but a zone mode change to a host managed model isn't handled at all, and virtio_blk also doesn't handle any other config change except for a capacity change is handled (and even if it was the upper layers above virtio_blk wouldn't handle it very well). But even the useful case of a size change that would add or remove zones isn't handled properly as blk_revalidate_disk_zones expects the device capacity to cover all zones, but the capacity is only updated after virtblk_revalidate_zones. As this code appears to be entirely untested and is getting in the way remove it for now, but it can be readded in a fixed version with proper test coverage if needed. Fixes: 95bfec41bd3d ("virtio-blk: add support for zoned block devices") Fixes: f1ba4e674feb ("virtio-blk: fix to match virtio spec") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | virtio_blk: cleanup zoned device probingChristoph Hellwig2023-12-191-28/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move reading and checking the zoned model from virtblk_probe_zoned_device into the caller, leaving only the code to perform the actual setup for host managed zoned devices in virtblk_probe_zoned_device. This allows to share the model reading and sharing between builds with and without CONFIG_BLK_DEV_ZONED, and improve it for the !CONFIG_BLK_DEV_ZONED case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'md-next-20231219' of ↵Jens Axboe2023-12-1910-1362/+198
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.8/block Pull MD updates from Song: "1. Remove deprecated flavors, by Song Liu; 2. raid1 read error check support, by Li Nan; 3. Better handle events off-by-1 case, by Alex Lyakas." * tag 'md-next-20231219' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: Remove deprecated CONFIG_MD_FAULTY md: Remove deprecated CONFIG_MD_MULTIPATH md: Remove deprecated CONFIG_MD_LINEAR md/raid1: support read error check md: factor out a helper exceed_read_errors() to check read_errors md: Whenassemble the array, consult the superblock of the freshest device md/raid1: remove unnecessary null checking
| * | md: Remove deprecated CONFIG_MD_FAULTYSong Liu2023-12-193-377/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md-faulty has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-4-song@kernel.org
| * | md: Remove deprecated CONFIG_MD_MULTIPATHSong Liu2023-12-194-609/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md-multipath has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-3-song@kernel.org
| * | md: Remove deprecated CONFIG_MD_LINEARSong Liu2023-12-195-342/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md-linear has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-2-song@kernel.org
| * | md/raid1: support read error checkLi Nan2023-12-151-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit 1e50915fe0bb ("raid: improve MD/raid10 handling of correctable read errors."), rdev will be set to faulty if it reads data error to many times in raid10. Add this mechanism to raid1 now. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231215023852.3478228-3-linan666@huaweicloud.com
| * | md: factor out a helper exceed_read_errors() to check read_errorsLi Nan2023-12-153-46/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move check_decay_read_errors() to raid1-10.c and factor out a helper exceed_read_errors() to check if read_errors exceeds the limit, so that raid1 can also use it. There are no functional changes. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231215023852.3478228-2-linan666@huaweicloud.com
| * | md: Whenassemble the array, consult the superblock of the freshest deviceAlex Lyakas2023-12-151-10/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Upon assembling the array, both kernel and mdadm allow the devices to have event counter difference of 1, and still consider them as up-to-date. However, a device whose event count is behind by 1, may in fact not be up-to-date, and array resync with such a device may cause data corruption. To avoid this, consult the superblock of the freshest device about the status of a device, whose event counter is behind by 1. Signed-off-by: Alex Lyakas <alex.lyakas@zadara.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/1702470271-16073-1-git-send-email-alex.lyakas@zadara.com
| * | md/raid1: remove unnecessary null checkingGou Hao2023-12-151-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | If %__GFP_DIRECT_RECLAIM is set then bio_alloc_bioset will always be able to allocate a bio. See comment of bio_alloc_bioset. Signed-off-by: Gou Hao <gouhao@uniontech.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231214151458.28970-1-gouhao@uniontech.com
* | | block/rnbd-srv: Check for unlikely string overflowKees Cook2023-12-131-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since "dev_search_path" can technically be as large as PATH_MAX, there was a risk of truncation when copying it and a second string into "full_path" since it was also PATH_MAX sized. The W=1 builds were reporting this warning: drivers/block/rnbd/rnbd-srv.c: In function 'process_msg_open.isra': drivers/block/rnbd/rnbd-srv.c:616:51: warning: '%s' directive output may be truncated writing up to 254 bytes into a region of size between 0 and 4095 [-Wformat-truncation=] 616 | snprintf(full_path, PATH_MAX, "%s/%s", | ^~ In function 'rnbd_srv_get_full_path', inlined from 'process_msg_open.isra' at drivers/block/rnbd/rnbd-srv.c:721:14: drivers/block/rnbd/rnbd-srv.c:616:17: note: 'snprintf' output between 2 and 4351 bytes into a destination of size 4096 616 | snprintf(full_path, PATH_MAX, "%s/%s", | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 617 | dev_search_path, dev_name); | ~~~~~~~~~~~~~~~~~~~~~~~~~~ To fix this, unconditionally check for truncation (as was already done for the case where "%SESSNAME%" was present). Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312100355.lHoJPgKy-lkp@intel.com/ Cc: Md. Haris Iqbal <haris.iqbal@ionos.com> Cc: Jack Wang <jinpu.wang@ionos.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <linux-block@vger.kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Acked-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/20231212214738.work.169-kees@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'md-next-20231208' of ↵Jens Axboe2023-12-089-442/+188
|\| | | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.8/block Pull MD updates from Song: "1. Fix/Cleanup RCU usage from conf->disks[i].rdev, by Yu Kuai; 2. Fix raid5 hang issue, by Junxiao Bi; 3. Add Yu Kuai as Reviewer of the md subsystem." * tag 'md-next-20231208' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: synchronize flush io with array reconfiguration MAINTAINERS: SOFTWARE RAID: Add Yu Kuai as Reviewer md/md-multipath: remove rcu protection to access rdev from conf md/raid5: remove rcu protection to access rdev from conf md/raid1: remove rcu protection to access rdev from conf md/raid10: remove rcu protection to access rdev from conf md: remove flag RemoveSynchronized Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d" md: bypass block throttle for superblock update
| * md: synchronize flush io with array reconfigurationYu Kuai2023-12-011-6/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently rcu is used to protect iterating rdev from submit_flushes(): submit_flushes remove_and_add_spares synchronize_rcu pers->hot_remove_disk() rcu_read_lock() rdev_for_each_rcu if (rdev->raid_disk >= 0) rdev->radi_disk = -1; atomic_inc(&rdev->nr_pending) rcu_read_unlock() bi = bio_alloc_bioset() bi->bi_end_io = md_end_flush bi->private = rdev submit_bio // issue io for removed rdev Fix this problem by grabbing 'acive_io' before iterating rdev, make sure that remove_and_add_spares() won't concurrent with submit_flushes(). Fixes: a2826aa92e2e ("md: support barrier requests on all personalities.") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231129020234.1586910-1-yukuai1@huaweicloud.com
| * md/md-multipath: remove rcu protection to access rdev from confYu Kuai2023-11-271-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; And these will cover all the scenarios in md-multipath. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-6-yukuai1@huaweicloud.com
| * md/raid5: remove rcu protection to access rdev from confYu Kuai2023-11-274-144/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares(). And these will cover all the scenarios in raid456. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-5-yukuai1@huaweicloud.com
| * md/raid1: remove rcu protection to access rdev from confYu Kuai2023-11-271-39/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares(). And these will cover all the scenarios in raid1. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-4-yukuai1@huaweicloud.com
| * md/raid10: remove rcu protection to access rdev from confYu Kuai2023-11-271-155/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because it's safe to accees rdev from conf: - If any spinlock is held, because synchronize_rcu() from md_kick_rdev_from_array() will prevent 'rdev' to be freed until spinlock is released; - If 'reconfig_lock' is held, because rdev can't be added or removed from array; - If there is normal IO inflight, because mddev_suspend() will prevent rdev to be added or removed from array; - If there is sync IO inflight, because 'MD_RECOVERY_RUNNING' is checked in remove_and_add_spares(). And these will cover all the scenarios in raid10. This patch also cleanup the code to handle the case that replacement replace rdev while IO is still inflight. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-3-yukuai1@huaweicloud.com
| * md: remove flag RemoveSynchronizedYu Kuai2023-11-276-72/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rcu is not used correctly here, because synchronize_rcu() is called before replacing old value, for example: remove_and_add_spares // other path synchronize_rcu // called before replacing old value set_bit(RemoveSynchronized) rcu_read_lock() rdev = conf->mirros[].rdev pers->hot_remove_disk conf->mirros[].rdev = NULL; if (!test_bit(RemoveSynchronized)) synchronize_rcu /* * won't be called, and won't wait * for concurrent readers to be done. */ // access rdev after remove_and_add_spares() rcu_read_unlock() Fortunately, there is a separate rcu protection to prevent such rdev to be freed: md_kick_rdev_from_array //other path rcu_read_lock() rdev = conf->mirros[].rdev list_del_rcu(&rdev->same_set) rcu_read_unlock() /* * rdev can be removed from conf, but * rdev won't be freed. */ synchronize_rcu() free rdev Hence remove this useless flag and prepare to remove rcu protection to access rdev from 'conf'. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231125081604.3939938-2-yukuai1@huaweicloud.com
| * Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"Junxiao Bi2023-11-271-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 5e2cf333b7bd5d3e62595a44d598a254c697cd74. That commit introduced the following race and can cause system hung. md_write_start: raid5d: // mddev->in_sync == 1 set "MD_SB_CHANGE_PENDING" // running before md_write_start wakeup it waiting "MD_SB_CHANGE_PENDING" cleared >>>>>>>>> hung wakeup mddev->thread ... waiting "MD_SB_CHANGE_PENDING" cleared >>>> hung, raid5d should clear this flag but get hung by same flag. The issue reverted commit fixing is fixed by last patch in a new way. Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") Cc: stable@vger.kernel.org # v5.19+ Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231108182216.73611-2-junxiao.bi@oracle.com
| * md: bypass block throttle for superblock updateJunxiao Bi2023-11-271-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") introduced a hung bug and will be reverted in next patch, since the issue that commit is fixing is due to md superblock write is throttled by wbt, to fix it, we can have superblock write bypass block layer throttle. Fixes: 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d") Cc: stable@vger.kernel.org # v5.19+ Suggested-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231108182216.73611-1-junxiao.bi@oracle.com
* | nvme: use bio_integrity_map_userKeith Busch2023-12-011-168/+29
|/ | | | | | | | | | | | Map user metadata buffers directly. Now that the bio tracks the metadata, nvme doesn't need special metadata handling and tracking with callbacks and additional fields in the pdu. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231130215309.2923568-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block/rnbd: use %pe to print errorsSupriti Singh2023-11-272-13/+13
| | | | | | | | | | | | While printing error, replace %ld by %pe. %pe prints a string whereas %ld would print an error code. Signed-off-by: Supriti Singh <supriti.singh@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com> Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com> Link: https://lore.kernel.org/r/20231124213422.113449-3-haris.iqbal@ionos.com Signed-off-by: Jens Axboe <axboe@kernel.dk>